* [PATCH v3 0/4] btrfs: read_policy types latency, device and round-robin
@ 2021-01-11  9:41 Anand Jain
  2021-01-11  9:41 ` [PATCH v3 1/4] btrfs: add read_policy latency Anand Jain
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Anand Jain @ 2021-01-11  9:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, josef, Anand Jain

v3:
The block layer commit 0d02129e76ed (block: merge struct block_device and
struct hd_struct) changed the first argument of the function
part_stat_read_all() in 5.11-rc1, so trickle its changes down into patch 1/4.

v2:
Fixes as per review comments, as in the individual patches.

v1:
Drop tracing patch
Drop factoring in the inflight commands
  The performance difference below shows that when inflight was used, it
  pushed a few commands to the other device, losing the potential merges.
A few C style fixes.

-----

This patchset adds the read policy types latency, device, and round-robin for
the mirrored raid profiles raid1, raid1c3, raid1c4, and raid10. The default
read policy remains PID for now.

Read policy types:
Latency:

The latency policy routes the read IO based on the historical average
wait time experienced by read IOs on each individual device.
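The selection logic can be sketched as follows (an illustrative Python model,
not the kernel code; the read_wait_ns/read_ios counters stand in for the
values the patch reads via part_stat_read()):

```python
def find_best_stripe(stripes):
    """Pick the stripe whose device has the lowest average read wait.

    stripes: list of dicts with 'read_wait_ns' and 'read_ios' counters
    (hypothetical shapes, standing in for the block layer stats).
    """
    best_index = 0
    best_wait = 0
    for index, stat in enumerate(stripes):
        wait_ns, ios = stat["read_wait_ns"], stat["read_ios"]
        # Average wait per IO; treated as 0 when there is no history yet.
        avg_wait = wait_ns // ios if ios and wait_ns >= ios else 0
        # As in the patch, a device with no stats (avg 0) wins the
        # comparison and is treated as the fastest candidate.
        if best_wait == 0 or best_wait > avg_wait:
            best_wait = avg_wait
            best_index = index
    return best_index
```

A device with a cold stats history therefore attracts reads until it
accumulates its own wait-time average.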

Device:

The device policy, together with the read_preferred flag, lets you set
the device used for reading manually. It is useful for testing mirrors in
a deterministic way and enables advanced system administration.
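The selection can be sketched as (illustrative Python, with a hypothetical
per-stripe read_preferred flag standing in for the device state bit):

```python
def find_read_preferred(stripes, first=0):
    """Return the first stripe whose device is flagged read-preferred.

    stripes: list of dicts with a boolean 'read_preferred' flag
    (an illustrative stand-in for BTRFS_DEV_STATE_READ_PREFERRED).
    """
    for index in range(first, len(stripes)):
        if stripes[index]["read_preferred"]:
            return index
    # No flagged device: fall back to the first stripe.
    return first
```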

Round-robin (RFC patch):

Alternates among the striped devices in a round-robin loop for reading.
To achieve this, we first put the stripes in an array, sort it by devid,
and pick the next device.
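The steps above can be sketched as (illustrative Python, not the kernel
code; a simple counter stands in for the atomic total_reads counter):

```python
import itertools

def make_round_robin(stripes):
    """Build a round-robin picker over the stripes of one chunk.

    stripes: list of (stripe_index, devid) tuples (illustrative shapes).
    """
    # Sort by devid so the rotation order is stable regardless of how
    # free-space allocation ordered the stripe indexes.
    ordered = sorted(stripes, key=lambda s: s[1])
    counter = itertools.count()  # stands in for fs_devices->total_reads

    def next_stripe():
        return ordered[next(counter) % len(ordered)][0]

    return next_stripe
```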

Test scripts:
=============

I have included a few scripts which were useful for testing.

-------------------8<--------------------------------
Set the given read policy on the btrfs mounted at /mnt

Usage example:
  $ readpolicyset /mnt latency

$ cat readpolicyset
#!/bin/bash

: ${1?"arg1 <mnt> missing"}
: ${2?"arg2 <pid|latency|device|roundrobin> missing"}

mnt=$1
policy=$2
[ $policy == "device" ] && { : ${3?"arg3 <devid> missing"}; }
devid=$3

uuid=$(btrfs fi show -m $mnt | grep uuid | awk '{print $4}')
p=/sys/fs/btrfs/$uuid/read_policy
q=/sys/fs/btrfs/$uuid/devinfo

[ $policy == "device" ] && { echo 1 > ${q}/$devid/read_preferred || exit $?; }

echo $policy > $p
exit $?

-------------------8<--------------------------------

Read policy type from the btrfs mounted at /mnt

Usage example:
  $ readpolicy /mnt

$ cat readpolicy
#!/bin/bash

: ${1?"arg1 <mnt> missing"}
mnt=$1

uuid=$(btrfs fi show -m $mnt | grep uuid | awk '{print $4}')
p=/sys/fs/btrfs/$uuid/read_policy
q=/sys/fs/btrfs/$uuid/devinfo

policy=$(cat $p)
echo -n "$policy ( "

for i in $(find $q -type f -name read_preferred | xargs cat)
do
	echo -n "$i"
done
echo ")"
-------------------8<--------------------------------

Show the number of read IOs per device for the given command.

Usage example:
   $ readstat /mnt fioread

$ cat readstat 
#!/bin/bash

: ${1?"arg1 <mnt> is missing"}
: ${2?"arg2 <cmd-to-run> is missing"}

mnt=$1; shift
mountpoint -q $mnt || { echo "ERROR: $mnt is not mounted"; exit 1; }

declare -A devread

for dev in $(btrfs filesystem show -m $mnt | grep devid |awk '{print $8}')
do
	prefix=$(echo $dev | rev | cut -d"/" -f1 | rev)
	sysfs_path=$(find /sys | grep $prefix/stat$)

	devread[$sysfs_path]=$(cat $sysfs_path | awk '{print $1}')
done

"$@" | grep "READ: bw"

echo
echo
for sysfs_path in ${!devread[@]}
do
	dev=$(echo $sysfs_path | rev | cut -d"/" -f2 | rev)
	new=$(cat $sysfs_path | awk '{print $1}')
	old=${devread[$sysfs_path]}
	echo "$dev $((new - old))"
done
-------------------8<--------------------------------

Run an fio read command

Usage example:
    $ touch /mnt/largefile
    $ fioread /mnt/largefile 500m

$ cat fioread 
#!/bin/bash

: ${1?"arg1 </mnt/file> is missing"}
: ${2?"arg2 <1Gi|50Gi> is missing"}

tf=$1
sz=$2
mnt=$(stat -c '%m' $tf)

fio \
--filename=$tf \
--directory=$mnt \
--filesize=$sz \
--size=$sz \
--rw=randread \
--bs=64k \
--ioengine=libaio \
--direct=1 \
--numjobs=32 \
--group_reporting \
--thread \
--name iops-test-job
-------------------8<--------------------------------


Testing on guest VM
~~~~~~~~~~~~~~~~~~~

Below are the test results from my VM with 2 devices of type sata and 2
devices of type virtio. Performance results for raid1c4, raid10, and raid1
are as below.

The workload is fio doing 500m of random reads with 32 threads.

Fio is passed to the script called readstat, which returns the number of
read IOs sent to each device during the fio run.

Supporting fio logs are below, and readstat shows the number of read IOs
to the devices (excluding the merges).

raid1c4
=======

pid
----

$ readpolicyset /btrfs pid && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

[pid] latency device roundrobin ( 0000)
 READ: bw=87.0MiB/s (91.2MB/s), 87.0MiB/s-87.0MiB/s (91.2MB/s-91.2MB/s), io=15.6GiB (16.8GB), run=183884-183884msec

vdb 64060
vdc 64053
sdb 64072
sda 64054


latency
-------

(All devices are non-rotational, but sda and sdb are of type sata and vdb and vdc are of type virtio).

$ readpolicyset /btrfs latency && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid [latency] device roundrobin ( 0000)
 READ: bw=87.1MiB/s (91.3MB/s), 87.1MiB/s-87.1MiB/s (91.3MB/s-91.3MB/s), io=15.6GiB (16.8GB), run=183774-183774msec

vdb 255844
vdc 559
sdb 0
sda 93


roundrobin
----------

$ readpolicyset /btrfs roundrobin && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid latency device [roundrobin] ( 0000)
 READ: bw=51.0MiB/s (54.5MB/s), 51.0MiB/s-51.0MiB/s (54.5MB/s-54.5MB/s), io=15.6GiB (16.8GB), run=307755-307755msec

vdb 866859
vdc 866651
sdb 864139
sda 865533


raid10
======

pid
---

$ readpolicyset /btrfs pid && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

[pid] latency device roundrobin ( 0000)
 READ: bw=85.2MiB/s (89.3MB/s), 85.2MiB/s-85.2MiB/s (89.3MB/s-89.3MB/s), io=15.6GiB (16.8GB), run=187864-187864msec


sdf 64053
sde 64036
sdd 64043
sdc 64038


latency
-------

$ readpolicyset /btrfs latency && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid [latency] device roundrobin ( 0000)
 READ: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s (89.5MB/s-89.5MB/s), io=15.6GiB (16.8GB), run=187370-187370msec


sdf 117494
sde 10748
sdd 125247
sdc 2921

roundrobin
----------

$ readpolicyset /btrfs roundrobin && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid latency device [roundrobin] ( 0000)
 READ: bw=55.4MiB/s (58.1MB/s), 55.4MiB/s-55.4MiB/s (58.1MB/s-58.1MB/s), io=15.6GiB (16.8GB), run=288701-288701msec

sdf 617593
sde 617381
sdd 618486
sdc 618633


raid1
=====

pid
----

$ readpolicyset /btrfs pid && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

[pid] latency device roundrobin ( 00)
 READ: bw=78.8MiB/s (82.6MB/s), 78.8MiB/s-78.8MiB/s (82.6MB/s-82.6MB/s), io=15.6GiB (16.8GB), run=203158-203158msec

sdb 128087
sda 128090


latency
--------

$ readpolicyset /btrfs latency && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid [latency] device roundrobin ( 00)
 READ: bw=86.5MiB/s (90.7MB/s), 86.5MiB/s-86.5MiB/s (90.7MB/s-90.7MB/s), io=15.6GiB (16.8GB), run=185023-185023msec

sdb 567
sda 255942


device
-------

(From the latency test results above we know sda provides the low-latency
read IO, so set sda as the read-preferred device.)

$ readpolicyset /btrfs device 1 && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid latency [device] roundrobin ( 10)
 READ: bw=88.2MiB/s (92.5MB/s), 88.2MiB/s-88.2MiB/s (92.5MB/s-92.5MB/s), io=15.6GiB (16.8GB), run=181374-181374msec

sdb 0
sda 256191


roundrobin
-----------

$ readpolicyset /btrfs roundrobin && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m

pid latency device [roundrobin] ( 00)
 READ: bw=54.1MiB/s (56.7MB/s), 54.1MiB/s-54.1MiB/s (56.7MB/s-56.7MB/s), io=15.6GiB (16.8GB), run=295693-295693msec

sdb 1252584
sda 1254258


Testing on real hardware:
~~~~~~~~~~~~~~~~~~~~~~~~

raid1 Read 500m
-----------------------------------------------------
            |nvme+ssd  nvme+ssd   all-nvme  all-nvme
            |random    sequential random    sequential
------------+------------------------------------------
pid         | 744MiB/s  809MiB/s  2225MiB/s 2155MiB/s
latency     |2072MiB/s 2008MiB/s  1999MiB/s 1961MiB/s
device(nvme)|2187MiB/s 2063MiB/s  2125MiB/s 2080MiB/s
roundrobin  | 527MiB/s  519MiB/s  2137MiB/s 1876MiB/s


raid10 Read 500m
-----------------------------------------------------
            | nvme+ssd  nvme+ssd  all-nvme  all-nvme
            | random    seq       random    seq
------------+-----------------------------------------
pid         | 1282MiB/s 1427MiB/s 2152MiB/s 1969MiB/s
latency     | 2073MiB/s 1871MiB/s 1975MiB/s 1984MiB/s
device(nvme)| 2447MiB/s 1873MiB/s 2184MiB/s 2015MiB/s
roundrobin  | 1117MiB/s 1076MiB/s 2020MiB/s 2030MiB/s


raid1c3 Read 500m
-----------------------------------------------------
            | nvme+ssd  nvme+ssd  all-nvme  all-nvme
            | random    seq       random    seq
------------+-----------------------------------------
pid         |  973MiB/s  955MiB/s 2144MiB/s 1962MiB/s
latency     | 2005MiB/s 1924MiB/s 2083MiB/s 1980MiB/s
device(nvme)| 2021MiB/s 2034MiB/s 1920MiB/s 2132MiB/s
roundrobin  |  707MiB/s  701MiB/s 1760MiB/s 1990MiB/s


raid1c4 Read 500m
-----------------------------------------------------
            | nvme+ssd  nvme+ssd  all-nvme  all-nvme
            | random    seq       random    seq
------------+----------------------------------------
pid         | 1204MiB/s 1221MiB/s 2065MiB/s 1878MiB/s
latency     | 1990MiB/s 1920MiB/s 1945MiB/s 1865MiB/s
device(nvme)| 2109MiB/s 1935MiB/s 2153MiB/s 1991MiB/s
roundrobin  |  887MiB/s  865MiB/s 1948MiB/s 1796MiB/s


Observations:
=============

1.
Chunk allocation is based on each device's available space at allocation
time, so stripe 0 may rotate among the devices. Hence even a
single-threaded process running with a constant PID may balance the read
IO among devices. But this is not guaranteed to work in all cases, and it
might not work well for raid1c3/4. Further, PID provides terrible
performance if the devices are heterogeneous in type, speed, or size.
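A toy model of observation 1 (illustrative Python, not the kernel code)
shows why a constant PID pins a reader to one mirror while varied PIDs
spread roughly evenly:

```python
def pid_stripe(pid, first, num_stripes):
    """Mirror chosen by the PID policy: first + pid % num_stripes."""
    return first + pid % num_stripes

# One thread with a constant PID: every read lands on the same mirror.
single = {pid_stripe(1234, 0, 2) for _ in range(100)}

# Many readers with assorted PIDs: both mirrors of a raid1 get used.
many = {pid_stripe(pid, 0, 2) for pid in range(100, 132)}
```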

2.
Latency provides performance equal to PID if all devices are of the same
type. Latency requires iostat to be enabled and includes the cost of
calculating the average wait time. So if you factor a similar cost of
calculating the average wait time into the PID policy (using the debug
code [2]), then the latency performance is better than PID. This shows
that distributing read IO by latency works, but there is a cost to it.
Moreover, latency works for any type of device.

3.
Round-robin performs worst (unless there is a bug in my patch). The total
number of new IOs issued is almost double that of the PID and latency
read policies, because the constant switching of devices in btrfs leaves
fewer IOs for the block layer to merge.
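A back-of-the-envelope model of observation 3 (illustrative Python): the
block layer can merge adjacent requests only when they reach the same
device back to back, so strict device alternation destroys merge
opportunities:

```python
def count_merges(device_sequence):
    """Count adjacent requests that hit the same device (merge candidates)."""
    return sum(a == b for a, b in zip(device_sequence, device_sequence[1:]))

# One reader under the PID policy: a run of requests on one device.
same_dev = [0] * 8
# Round-robin over two mirrors: constant switching, no adjacency.
alternating = [0, 1] * 4
```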

4.
The device read_policy is useful in testing and provides advanced
sysadmin capabilities. Used knowledgeably, the policy can help avert
performance degradation due to csum/IO errors in production.

Thanks, Anand

------------------
[2] Debug patch to factor the cost of calculating the latency per IO.

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d3023879bdf6..72ec633e9063 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5665,6 +5665,12 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 fs_info->fs_devices->read_policy = BTRFS_READ_POLICY_PID;
 fallthrough;
 case BTRFS_READ_POLICY_PID:
+ /*
+ * Just to factor in the cost of calculating the avg wait using
+ * iostat, call btrfs_find_best_stripe() here for the PID policy
+ * and drop its results on the floor.
+ */
+ btrfs_find_best_stripe(fs_info, map, first, num_stripes, log,
+ logsz);
 preferred_mirror = first + current->pid % num_stripes;
 scnprintf(log, logsz,
 "first %d num_stripe %d %s (%d) preferred %d",

Anand Jain (4):
  btrfs: add read_policy latency
  btrfs: introduce new device-state read_preferred
  btrfs: introduce new read_policy device
  btrfs: introduce new read_policy round-robin

 fs/btrfs/sysfs.c   |  57 ++++++++++++++++++++++-
 fs/btrfs/volumes.c | 110 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |   8 ++++
 3 files changed, 174 insertions(+), 1 deletion(-)

-- 
2.30.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v3 1/4] btrfs: add read_policy latency
  2021-01-11  9:41 [PATCH v3 0/4] btrfs: read_policy types latency, device and round-robin Anand Jain
@ 2021-01-11  9:41 ` Anand Jain
  2021-01-19 19:36   ` Josef Bacik
  2021-01-20 10:27   ` Michal Rostecki
  2021-01-11  9:41 ` [PATCH v3 2/4] btrfs: introduce new device-state read_preferred Anand Jain
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 15+ messages in thread
From: Anand Jain @ 2021-01-11  9:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, josef, Anand Jain

The read policy type latency routes the read IO based on the historical
average wait time experienced by the read IOs on each individual device.
This patch obtains the historical read IO stats from the kernel block
layer and calculates their average.

Example usage:
 echo "latency" > /sys/fs/btrfs/$uuid/read_policy 

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v3: The block layer commit 0d02129e76ed (block: merge struct block_device and
    struct hd_struct) has changed the first argument in the function
    part_stat_read_all() in 5.11-rc1. So the compilation will fail. This patch
    fixes it.
    Commit log updated.

v2: Use btrfs_debug_rl() instead of btrfs_info_rl()
    It is better to keep this debug output until we have tested this on
    at least a few hardware configurations.
    Drop the unrelated changes.
    Update change log.

v1: Drop part_stat_read_all instead use part_stat_read
    Drop inflight


 fs/btrfs/sysfs.c   |  3 ++-
 fs/btrfs/volumes.c | 38 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 19b9fffa2c9c..96ca7bef6357 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -915,7 +915,8 @@ static bool strmatch(const char *buffer, const char *string)
 	return false;
 }
 
-static const char * const btrfs_read_policy_name[] = { "pid" };
+/* Must follow the order as in enum btrfs_read_policy */
+static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
 
 static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 				      struct kobj_attribute *a, char *buf)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f4037a6bd926..f7a0a83d2cd4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -14,6 +14,7 @@
 #include <linux/semaphore.h>
 #include <linux/uuid.h>
 #include <linux/list_sort.h>
+#include <linux/part_stat.h>
 #include "misc.h"
 #include "ctree.h"
 #include "extent_map.h"
@@ -5490,6 +5491,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
+				  struct map_lookup *map, int first,
+				  int num_stripe)
+{
+	u64 est_wait = 0;
+	int best_stripe = 0;
+	int index;
+
+	for (index = first; index < first + num_stripe; index++) {
+		u64 read_wait;
+		u64 avg_wait = 0;
+		unsigned long read_ios;
+		struct btrfs_device *device = map->stripes[index].dev;
+
+		read_wait = part_stat_read(device->bdev, nsecs[READ]);
+		read_ios = part_stat_read(device->bdev, ios[READ]);
+
+		if (read_wait && read_ios && read_wait >= read_ios)
+			avg_wait = div_u64(read_wait, read_ios);
+		else
+			btrfs_debug_rl(device->fs_devices->fs_info,
+			"devid: %llu avg_wait ZERO read_wait %llu read_ios %lu",
+				       device->devid, read_wait, read_ios);
+
+		if (est_wait == 0 || est_wait > avg_wait) {
+			est_wait = avg_wait;
+			best_stripe = index;
+		}
+	}
+
+	return best_stripe;
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct map_lookup *map, int first,
 			    int dev_replace_is_ongoing)
@@ -5519,6 +5553,10 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	case BTRFS_READ_POLICY_PID:
 		preferred_mirror = first + (current->pid % num_stripes);
 		break;
+	case BTRFS_READ_POLICY_LATENCY:
+		preferred_mirror = btrfs_find_best_stripe(fs_info, map, first,
+							  num_stripes);
+		break;
 	}
 
 	if (dev_replace_is_ongoing &&
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1997a4649a66..71ba1f0e93f4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -222,6 +222,8 @@ enum btrfs_chunk_allocation_policy {
 enum btrfs_read_policy {
 	/* Use process PID to choose the stripe */
 	BTRFS_READ_POLICY_PID,
+	/* Find and use device with the lowest latency */
+	BTRFS_READ_POLICY_LATENCY,
 	BTRFS_NR_READ_POLICY,
 };
 
-- 
2.30.0



* [PATCH v3 2/4] btrfs: introduce new device-state read_preferred
  2021-01-11  9:41 [PATCH v3 0/4] btrfs: read_policy types latency, device and round-robin Anand Jain
  2021-01-11  9:41 ` [PATCH v3 1/4] btrfs: add read_policy latency Anand Jain
@ 2021-01-11  9:41 ` Anand Jain
  2021-01-19 19:44   ` Josef Bacik
  2021-01-11  9:41 ` [PATCH v3 3/4] btrfs: introduce new read_policy device Anand Jain
  2021-01-11  9:41 ` [PATCH RFC 4/4] btrfs: introduce new read_policy round-robin Anand Jain
  3 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-01-11  9:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, josef, Anand Jain

This is a preparatory patch that introduces a new device flag,
'read_preferred', readable and writable through the sysfs interface.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v2: C style fixes. Drop space in between '! test_bit' and extra lines
    after it.

 fs/btrfs/sysfs.c   | 53 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  1 +
 2 files changed, 54 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 96ca7bef6357..b5d6499fbad0 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1417,11 +1417,64 @@ static ssize_t btrfs_devinfo_writeable_show(struct kobject *kobj,
 }
 BTRFS_ATTR(devid, writeable, btrfs_devinfo_writeable_show);
 
+static ssize_t btrfs_devinfo_read_pref_show(struct kobject *kobj,
+					    struct kobj_attribute *a, char *buf)
+{
+	int val;
+	struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+						   devid_kobj);
+
+	val = !!test_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state);
+
+	return snprintf(buf, PAGE_SIZE, "%d\n", val);
+}
+
+static ssize_t btrfs_devinfo_read_pref_store(struct kobject *kobj,
+					     struct kobj_attribute *a,
+					     const char *buf, size_t len)
+{
+	int ret;
+	unsigned long val;
+	struct btrfs_device *device;
+
+	ret = kstrtoul(skip_spaces(buf), 0, &val);
+	if (ret)
+		return ret;
+
+	if (val != 0 && val != 1)
+		return -EINVAL;
+
+	/*
+	 * lock is not required, the btrfs_device struct can't be freed while
+	 * its kobject btrfs_device::devid_kobj is still open.
+	 */
+	device = container_of(kobj, struct btrfs_device, devid_kobj);
+
+	if (val &&
+	    !test_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state)) {
+		set_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state);
+		btrfs_info(device->fs_devices->fs_info,
+			   "set read preferred on devid %llu (%d)",
+			   device->devid, task_pid_nr(current));
+	} else if (!val &&
+		   test_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state)) {
+		clear_bit(BTRFS_DEV_STATE_READ_PREFERRED, &device->dev_state);
+		btrfs_info(device->fs_devices->fs_info,
+			   "reset read preferred on devid %llu (%d)",
+			   device->devid, task_pid_nr(current));
+	}
+
+	return len;
+}
+BTRFS_ATTR_RW(devid, read_preferred, btrfs_devinfo_read_pref_show,
+	      btrfs_devinfo_read_pref_store);
+
 static struct attribute *devid_attrs[] = {
 	BTRFS_ATTR_PTR(devid, in_fs_metadata),
 	BTRFS_ATTR_PTR(devid, missing),
 	BTRFS_ATTR_PTR(devid, replace_target),
 	BTRFS_ATTR_PTR(devid, writeable),
+	BTRFS_ATTR_PTR(devid, read_preferred),
 	NULL
 };
 ATTRIBUTE_GROUPS(devid);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 71ba1f0e93f4..ea786864b903 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -51,6 +51,7 @@ struct btrfs_io_geometry {
 #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
 #define BTRFS_DEV_STATE_NO_READA	(5)
+#define BTRFS_DEV_STATE_READ_PREFERRED	(6)
 
 struct btrfs_zoned_device_info;
 
-- 
2.30.0



* [PATCH v3 3/4] btrfs: introduce new read_policy device
  2021-01-11  9:41 [PATCH v3 0/4] btrfs: read_policy types latency, device and round-robin Anand Jain
  2021-01-11  9:41 ` [PATCH v3 1/4] btrfs: add read_policy latency Anand Jain
  2021-01-11  9:41 ` [PATCH v3 2/4] btrfs: introduce new device-state read_preferred Anand Jain
@ 2021-01-11  9:41 ` Anand Jain
  2021-01-19 19:44   ` Josef Bacik
  2021-01-11  9:41 ` [PATCH RFC 4/4] btrfs: introduce new read_policy round-robin Anand Jain
  3 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-01-11  9:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, josef, Anand Jain

Read-policy type 'device' and device flag 'read-preferred':

The read-policy type device picks the device(s) flagged as
read-preferred for reading stripes of type raid1, raid10,
raid1c3 and raid1c4.

A system might contain SSD, NVMe, iSCSI, or SAN LUN devices, all of which
report as non-rotational, so it is not a good idea to set read-preferred
automatically. Instead, the device read-policy along with the
read-preferred flag provides the ability to do it manually. This advanced
tuning is useful in more than one situation, for example:
 - In a heterogeneous-disk volume, it provides the ability to manually
   choose the low-latency disks for reading.
 - It is useful for more accurate testing.
 - It can keep a known problematic device from serving reads of a chunk
   until it is replaced (by marking the other, good devices as
   read-preferred).

Note:

If the read-policy type is set to 'device', but there isn't any device
which is flagged as read-preferred, then stripe 0 is used for reading.

Device replacement won't migrate the read-preferred flag to the new
replacement target device.

As of now, this is an in-memory only feature.

It's pointless to set the read-preferred flag on the missing device, as
IOs aren't submitted to the missing device.

If there is more than one read-preferred device in a chunk, the read IO
goes to the first one found, as of now.

Usage example:

Consider a typical two disks raid1.

Configure devid1 for reading.

$ echo 1 > devinfo/1/read_preferred
$ cat devinfo/1/read_preferred
1
$ cat devinfo/2/read_preferred
0

$ pwd
/sys/fs/btrfs/12345678-1234-1234-1234-123456789abc

$ cat read_policy
[pid] device
$ echo device > ./read_policy
$ cat read_policy
pid [device]

Now read IOs are sent to devid 1 (sdb).

$ echo 3 > /proc/sys/vm/drop_caches
$ md5sum /btrfs/YkZI

$ iostat -zy 1 | egrep 'sdb|sdc' (from another terminal)
sdb              50.00     40048.00         0.00      40048          0

Change the read-preferred device from devid 1 to devid 2 (sdc).

$ echo 0 > ./devinfo/1/read_preferred

[ 3343.918658] BTRFS info (device sdb): reset read preferred on devid 1 (1334)

$ echo 1 > ./devinfo/2/read_preferred

[ 3343.919876] BTRFS info (device sdb): set read preferred on devid 2 (1334)

$ echo 3 > /proc/sys/vm/drop_caches
$ md5sum /btrfs/YkZI

Further read IOs are sent to devid 2 (sdc).

$ iostat -zy 1 | egrep 'sdb|sdc' (from another terminal)
sdc              49.00     40048.00         0.00      40048          0

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v1:-
v2:-
v3: update the change log

 fs/btrfs/sysfs.c   |  3 ++-
 fs/btrfs/volumes.c | 22 ++++++++++++++++++++++
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index b5d6499fbad0..899b66c83db1 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -916,7 +916,8 @@ static bool strmatch(const char *buffer, const char *string)
 }
 
 /* Must follow the order as in enum btrfs_read_policy */
-static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
+static const char * const btrfs_read_policy_name[] = { "pid", "latency",
+						       "device" };
 
 static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 				      struct kobj_attribute *a, char *buf)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f7a0a83d2cd4..50d4d54f7abd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5524,6 +5524,25 @@ static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
 	return best_stripe;
 }
 
+static int btrfs_find_read_preferred(struct map_lookup *map, int first, int num_stripe)
+{
+	int stripe_index;
+	int last = first + num_stripe;
+
+	/*
+	 * If there are more than one read preferred devices, then just pick the
+	 * first found read preferred device as of now.
+	 */
+	for (stripe_index = first; stripe_index < last; stripe_index++) {
+		if (test_bit(BTRFS_DEV_STATE_READ_PREFERRED,
+			     &map->stripes[stripe_index].dev->dev_state))
+			return stripe_index;
+	}
+
+	/* If there is no read preferred device then just use the first stripe */
+	return first;
+}
+
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
 			    struct map_lookup *map, int first,
 			    int dev_replace_is_ongoing)
@@ -5557,6 +5576,9 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 		preferred_mirror = btrfs_find_best_stripe(fs_info, map, first,
 							  num_stripes);
 		break;
+	case BTRFS_READ_POLICY_DEVICE:
+		preferred_mirror = btrfs_find_read_preferred(map, first, num_stripes);
+		break;
 	}
 
 	if (dev_replace_is_ongoing &&
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index ea786864b903..8d5a2cddc0ab 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -225,6 +225,8 @@ enum btrfs_read_policy {
 	BTRFS_READ_POLICY_PID,
 	/* Find and use device with the lowest latency */
 	BTRFS_READ_POLICY_LATENCY,
+	/* Use the device marked with READ_PREFERRED state */
+	BTRFS_READ_POLICY_DEVICE,
 	BTRFS_NR_READ_POLICY,
 };
 
-- 
2.30.0



* [PATCH RFC 4/4] btrfs: introduce new read_policy round-robin
  2021-01-11  9:41 [PATCH v3 0/4] btrfs: read_policy types latency, device and round-robin Anand Jain
                   ` (2 preceding siblings ...)
  2021-01-11  9:41 ` [PATCH v3 3/4] btrfs: introduce new read_policy device Anand Jain
@ 2021-01-11  9:41 ` Anand Jain
  2021-01-19 19:41   ` Josef Bacik
  3 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-01-11  9:41 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, josef, Anand Jain

Add the round-robin read policy to route each read IO to the next device
in round-robin order. Chunk allocation, and thus the stripe index,
follows the order of free space available on the devices. So to make
round-robin effective, it follows the devid order instead of the
stripe-index order.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
--
RFC because: it provides terrible performance with the fio tests.
I am not yet sure whether there is any IO workload or block layer
tuning that would make this policy better. As of now it is just an
experimental patch.

 fs/btrfs/sysfs.c   |  2 +-
 fs/btrfs/volumes.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  3 +++
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 899b66c83db1..d40b0ff054ca 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -917,7 +917,7 @@ static bool strmatch(const char *buffer, const char *string)
 
 /* Must follow the order as in enum btrfs_read_policy */
 static const char * const btrfs_read_policy_name[] = { "pid", "latency",
-						       "device" };
+						       "device", "roundrobin" };
 
 static ssize_t btrfs_read_policy_show(struct kobject *kobj,
 				      struct kobj_attribute *a, char *buf)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 50d4d54f7abd..60370b9121e0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5491,6 +5491,52 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+struct stripe_mirror {
+	u64 devid;
+	int map;
+};
+
+static int btrfs_cmp_devid(const void *a, const void *b)
+{
+	struct stripe_mirror *s1 = (struct stripe_mirror *)a;
+	struct stripe_mirror *s2 = (struct stripe_mirror *)b;
+
+	if (s1->devid < s2->devid)
+		return -1;
+	if (s1->devid > s2->devid)
+		return 1;
+	return 0;
+}
+
+static int btrfs_find_read_round_robin(struct map_lookup *map, int first,
+				       int num_stripe)
+{
+	struct stripe_mirror stripes[4] = {0}; //4: for testing, works for now.
+	struct btrfs_fs_devices *fs_devices;
+	u64 devid;
+	int index, j, cnt;
+	int next_stripe;
+
+	index = 0;
+	for (j = first; j < first + num_stripe; j++) {
+		devid = map->stripes[j].dev->devid;
+
+		stripes[index].devid = devid;
+		stripes[index].map = j;
+
+		index++;
+	}
+
+	sort(stripes, num_stripe, sizeof(struct stripe_mirror),
+	     btrfs_cmp_devid, NULL);
+
+	fs_devices = map->stripes[first].dev->fs_devices;
+	cnt = atomic_inc_return(&fs_devices->total_reads);
+	next_stripe = stripes[cnt % num_stripe].map;
+
+	return next_stripe;
+}
+
 static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
 				  struct map_lookup *map, int first,
 				  int num_stripe)
@@ -5579,6 +5625,10 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	case BTRFS_READ_POLICY_DEVICE:
 		preferred_mirror = btrfs_find_read_preferred(map, first, num_stripes);
 		break;
+	case BTRFS_READ_POLICY_ROUND_ROBIN:
+		preferred_mirror = btrfs_find_read_round_robin(map, first,
+							       num_stripes);
+		break;
 	}
 
 	if (dev_replace_is_ongoing &&
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 8d5a2cddc0ab..ce4490437f53 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -227,6 +227,8 @@ enum btrfs_read_policy {
 	BTRFS_READ_POLICY_LATENCY,
 	/* Use the device marked with READ_PREFERRED state */
 	BTRFS_READ_POLICY_DEVICE,
+	/* Distribute read IO equally across striped devices */
+	BTRFS_READ_POLICY_ROUND_ROBIN,
 	BTRFS_NR_READ_POLICY,
 };
 
@@ -286,6 +288,7 @@ struct btrfs_fs_devices {
 
 	/* Policy used to read the mirrored stripes */
 	enum btrfs_read_policy read_policy;
+	atomic_t total_reads;
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
-- 
2.30.0



* Re: [PATCH v3 1/4] btrfs: add read_policy latency
  2021-01-11  9:41 ` [PATCH v3 1/4] btrfs: add read_policy latency Anand Jain
@ 2021-01-19 19:36   ` Josef Bacik
  2021-01-20  2:43     ` Anand Jain
  2021-01-20 10:27   ` Michal Rostecki
  1 sibling, 1 reply; 15+ messages in thread
From: Josef Bacik @ 2021-01-19 19:36 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs; +Cc: dsterba

On 1/11/21 4:41 AM, Anand Jain wrote:
> The read policy type latency routes the read IO based on the historical
> average wait-time experienced by the read IOs through the individual
> device. This patch obtains the historical read IO stats from the kernel
> block layer and calculates its average.
> 
> Example usage:
>   echo "latency" > /sys/fs/btrfs/$uuid/read_policy
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
> v3: The block layer commit 0d02129e76ed (block: merge struct block_device and
>      struct hd_struct) has changed the first argument in the function
>      part_stat_read_all() in 5.11-rc1. So the compilation will fail. This patch
>      fixes it.
>      Commit log updated.
> 
> v2: Use btrfs_debug_rl() instead of btrfs_info_rl()
>      It is better we have this debug until we test this on at least a few
>      hardware setups.
>      Drop the unrelated changes.
>      Update change log.
> 
> v1: Drop part_stat_read_all instead use part_stat_read
>      Drop inflight
> 
> 
>   fs/btrfs/sysfs.c   |  3 ++-
>   fs/btrfs/volumes.c | 38 ++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/volumes.h |  2 ++
>   3 files changed, 42 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 19b9fffa2c9c..96ca7bef6357 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -915,7 +915,8 @@ static bool strmatch(const char *buffer, const char *string)
>   	return false;
>   }
>   
> -static const char * const btrfs_read_policy_name[] = { "pid" };
> +/* Must follow the order as in enum btrfs_read_policy */
> +static const char * const btrfs_read_policy_name[] = { "pid", "latency" };
>   
>   static ssize_t btrfs_read_policy_show(struct kobject *kobj,
>   				      struct kobj_attribute *a, char *buf)
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index f4037a6bd926..f7a0a83d2cd4 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -14,6 +14,7 @@
>   #include <linux/semaphore.h>
>   #include <linux/uuid.h>
>   #include <linux/list_sort.h>
> +#include <linux/part_stat.h>
>   #include "misc.h"
>   #include "ctree.h"
>   #include "extent_map.h"
> @@ -5490,6 +5491,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>   	return ret;
>   }
>   
> +static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
> +				  struct map_lookup *map, int first,
> +				  int num_stripe)
> +{
> +	u64 est_wait = 0;
> +	int best_stripe = 0;
> +	int index;
> +
> +	for (index = first; index < first + num_stripe; index++) {
> +		u64 read_wait;
> +		u64 avg_wait = 0;
> +		unsigned long read_ios;
> +		struct btrfs_device *device = map->stripes[index].dev;
> +
> +		read_wait = part_stat_read(device->bdev, nsecs[READ]);
> +		read_ios = part_stat_read(device->bdev, ios[READ]);
> +
> +		if (read_wait && read_ios && read_wait >= read_ios)
> +			avg_wait = div_u64(read_wait, read_ios);
> +		else
> +			btrfs_debug_rl(device->fs_devices->fs_info,

You can just use fs_info here, you already have it.  I'm not in love with doing 
this check every time; I'd rather cache the results somewhere.  However, if we're 
read-only I can't think of a mechanism we could piggyback on; if we're RW we could 
just do it every transaction commit.  Fix the fs_info thing and you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
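For reference, the selection logic in the hunk above can be modelled as a
self-contained user-space sketch (hypothetical struct and function names; the
tail of the loop is truncated in the quoted diff, so the lowest-average-wins
comparison below is an assumption based on the cover letter, not the kernel
code itself):

```c
#include <assert.h>
#include <stdint.h>

/* Per-stripe read counters, standing in for part_stat_read() on the
 * stripe's block device (nsecs[READ] and ios[READ]). */
struct stripe_stats {
	uint64_t read_nsecs;	/* total wait time of completed read IOs */
	uint64_t read_ios;	/* number of completed read IOs */
};

/* Model of btrfs_find_best_stripe(): compute each device's historical
 * average read wait and pick the stripe with the smallest one.  A
 * device with no usable stats gets avg_wait == 0 and is preferred. */
static int find_best_stripe(const struct stripe_stats *stats,
			    int first, int num_stripes)
{
	uint64_t est_wait = UINT64_MAX;
	int best_stripe = first;

	for (int index = first; index < first + num_stripes; index++) {
		uint64_t avg_wait = 0;
		const struct stripe_stats *s = &stats[index];

		if (s->read_nsecs && s->read_ios &&
		    s->read_nsecs >= s->read_ios)
			avg_wait = s->read_nsecs / s->read_ios;

		if (avg_wait < est_wait) {
			est_wait = avg_wait;
			best_stripe = index;
		}
	}
	return best_stripe;
}
```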


* Re: [PATCH RFC 4/4] btrfs: introduce new read_policy round-robin
  2021-01-11  9:41 ` [PATCH RFC 4/4] btrfs: introduce new read_policy round-robin Anand Jain
@ 2021-01-19 19:41   ` Josef Bacik
  2021-01-20  2:40     ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: Josef Bacik @ 2021-01-19 19:41 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs; +Cc: dsterba

On 1/11/21 4:41 AM, Anand Jain wrote:
> Add round-robin read policy to route the read IO to the next device in the
> round-robin order. The chunk allocation and thus the stripe-index follows
> the order of free space available on devices. So to make the round-robin
> effective it shall follow the devid order instead of the stripe-index
> order.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> --
> RFC because: Provides terrible performance with the fio tests.
> I am not yet sure if there is any io workload or a block layer
> tuning that shall make this policy better. As of now just an
> experimental patch.
> 

Just drop this one; if we can't find a reason to use it then don't bother
adding the code.  The other options have real-world valuable uses, so stick
with those.  Thanks,

Josef



* Re: [PATCH v3 2/4] btrfs: introduce new device-state read_preferred
  2021-01-11  9:41 ` [PATCH v3 2/4] btrfs: introduce new device-state read_preferred Anand Jain
@ 2021-01-19 19:44   ` Josef Bacik
  0 siblings, 0 replies; 15+ messages in thread
From: Josef Bacik @ 2021-01-19 19:44 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs; +Cc: dsterba

On 1/11/21 4:41 AM, Anand Jain wrote:
> This is a preparatory patch and introduces a new device flag
> 'read_preferred', RW-able using sysfs interface.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef


* Re: [PATCH v3 3/4] btrfs: introduce new read_policy device
  2021-01-11  9:41 ` [PATCH v3 3/4] btrfs: introduce new read_policy device Anand Jain
@ 2021-01-19 19:44   ` Josef Bacik
  0 siblings, 0 replies; 15+ messages in thread
From: Josef Bacik @ 2021-01-19 19:44 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs; +Cc: dsterba

On 1/11/21 4:41 AM, Anand Jain wrote:
> Read-policy type 'device' and device flag 'read-preferred':
> 
> The read-policy type device picks the device(s) flagged as
> read-preferred for reading stripes of type raid1, raid10,
> raid1c3 and raid1c4.
> 
> A system might contain SSD, NVMe, iSCSI, or SAN LUN devices, which are
> all non-rotational, so it is not a good idea to set the read-preferred
> flag automatically. Instead, the device read-policy along with the
> read-preferred flag provides the ability to do it manually. This advanced
> tuning is useful in more than one situation, for example,
>   - In heterogeneous-disk volume, it provides an ability to manually choose
>      the low latency disks for reading.
>   - Useful for more accurate testing.
>   - Avoid known problematic device from reading the chunk until it is
>     replaced (by marking the other good devices as read-preferred).
> 
> Note:
> 
> If the read-policy type is set to 'device', but there isn't any device
> which is flagged as read-preferred, then stripe 0 is used for reading.
> 
> The device replacement won't migrate the read-preferred flag to the
> new replacement target device.
> 
> As of now, this is an in-memory only feature.
> 
> It's pointless to set the read-preferred flag on the missing device, as
> IOs aren't submitted to the missing device.
> 
> If there is more than one read-preferred device in a chunk, the read IO
> shall go to the stripe 0 as of now.
> 
> Usage example:
> 
> Consider a typical two disks raid1.
> 
> Configure devid1 for reading.
> 
> $ echo 1 > devinfo/1/read_preferred
> $ cat devinfo/1/read_preferred
> 1
> $ cat devinfo/2/read_preferred
> 0
> 
> $ pwd
> /sys/fs/btrfs/12345678-1234-1234-1234-123456789abc
> 
> $ cat read_policy
> [pid] device
> $ echo device > ./read_policy
> $ cat read_policy
> pid [device]
> 
> Now read IOs are sent to devid 1 (sdb).
> 
> $ echo 3 > /proc/sys/vm/drop_caches
> $ md5sum /btrfs/YkZI
> 
> $ iostat -zy 1 | egrep 'sdb|sdc' (from another terminal)
> sdb              50.00     40048.00         0.00      40048          0
> 
> Change the read-preferred device from devid 1 to devid 2 (sdc).
> 
> $ echo 0 > ./devinfo/1/read_preferred
> 
> [ 3343.918658] BTRFS info (device sdb): reset read preferred on devid 1 (1334)
> 
> $ echo 1 > ./devinfo/2/read_preferred
> 
> [ 3343.919876] BTRFS info (device sdb): set read preferred on devid 2 (1334)
> 
> $ echo 3 > /proc/sys/vm/drop_caches
> $ md5sum /btrfs/YkZI
> 
> Further read ios are sent to devid 2 (sdc).
> 
> $ iostat -zy 1 | egrep 'sdb|sdc' (from another terminal)
> sdc              49.00     40048.00         0.00      40048          0
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
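The fallback behaviour described in the notes above (first read-preferred
device wins; stripe 0 when none is flagged) can be modelled with a short
user-space sketch (hypothetical names; a simplification of the kernel patch,
not the patch itself):

```c
#include <assert.h>
#include <stdbool.h>

struct stripe_dev {
	int devid;
	bool read_preferred;	/* in-memory flag, set via sysfs */
};

/* Model of the 'device' read policy: return the first stripe whose
 * device is flagged read-preferred; fall back to the first stripe
 * when no device in the chunk carries the flag. */
static int find_read_preferred(const struct stripe_dev *stripes,
			       int first, int num_stripes)
{
	for (int i = first; i < first + num_stripes; i++)
		if (stripes[i].read_preferred)
			return i;
	return first;
}
```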


* Re: [PATCH RFC 4/4] btrfs: introduce new read_policy round-robin
  2021-01-19 19:41   ` Josef Bacik
@ 2021-01-20  2:40     ` Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2021-01-20  2:40 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs; +Cc: dsterba

On 20/1/21 3:41 am, Josef Bacik wrote:
> On 1/11/21 4:41 AM, Anand Jain wrote:
>> Add round-robin read policy to route the read IO to the next device in 
>> the
>> round-robin order. The chunk allocation and thus the stripe-index follows
>> the order of free space available on devices. So to make the round-robin
>> effective it shall follow the devid order instead of the stripe-index
>> order.
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> -- 
>> RFC because: Provides terrible performance with the fio tests.
>> I am not yet sure if there is any io workload or a block layer
>> tuning that shall make this policy better. As of now just an
>> experimental patch.
>>
> 
> Just drop this one, if we can't find a reason to use it then don't 
> bother adding the code.  The other options have real world valuable 
> uses, so stick with those.  Thanks,
> 

  Yep. I will drop this patch in the next iteration.

  The low performance is attributed to the low number of read IO
  merges in the block layer. Consecutive blocks in my test case
  (fio random read) were read from the other disk, so the block layer
  lost the opportunity to merge the IOs.

Thanks, Anand

> Josef
> 
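A minimal user-space model of the dropped policy (hypothetical names; per the
commit message, it sorts the mirrored stripes by devid and then advances one
device per read) makes the merge loss easy to see: consecutive reads alternate
devices, so adjacent blocks rarely queue on the same disk:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct stripe { int devid; };

/* Stand-in for the atomic_t total_reads counter the RFC patch adds
 * to btrfs_fs_devices. */
static atomic_uint total_reads;

static int cmp_devid(const void *a, const void *b)
{
	return ((const struct stripe *)a)->devid -
	       ((const struct stripe *)b)->devid;
}

/* Model of the round-robin policy: order the mirrored stripes by
 * devid (chunk allocation orders them by free space instead) and
 * return the next index in a round-robin loop. */
static int find_read_round_robin(struct stripe *stripes, int num_stripes)
{
	qsort(stripes, num_stripes, sizeof(*stripes), cmp_devid);
	return atomic_fetch_add(&total_reads, 1) % num_stripes;
}
```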




* Re: [PATCH v3 1/4] btrfs: add read_policy latency
  2021-01-19 19:36   ` Josef Bacik
@ 2021-01-20  2:43     ` Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2021-01-20  2:43 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs; +Cc: dsterba

On 20/1/21 3:36 am, Josef Bacik wrote:
> On 1/11/21 4:41 AM, Anand Jain wrote:
>> The read policy type latency routes the read IO based on the historical
>> average wait-time experienced by the read IOs through the individual
>> device. This patch obtains the historical read IO stats from the kernel
>> block layer and calculates its average.
>>
>> Example usage:
>>   echo "latency" > /sys/fs/btrfs/$uuid/read_policy
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>> v3: The block layer commit 0d02129e76ed (block: merge struct 
>> block_device and
>>      struct hd_struct) has changed the first argument in the function
>>      part_stat_read_all() in 5.11-rc1. So the compilation will fail. 
>> This patch
>>      fixes it.
>>      Commit log updated.
>>
>> v2: Use btrfs_debug_rl() instead of btrfs_info_rl()
>>      It is better we have this debug until we test this on at least a few
>>      hardware setups.
>>      Drop the unrelated changes.
>>      Update change log.
>>
>> v1: Drop part_stat_read_all instead use part_stat_read
>>      Drop inflight
>>
>>
>>   fs/btrfs/sysfs.c   |  3 ++-
>>   fs/btrfs/volumes.c | 38 ++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/volumes.h |  2 ++
>>   3 files changed, 42 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
>> index 19b9fffa2c9c..96ca7bef6357 100644
>> --- a/fs/btrfs/sysfs.c
>> +++ b/fs/btrfs/sysfs.c
>> @@ -915,7 +915,8 @@ static bool strmatch(const char *buffer, const 
>> char *string)
>>       return false;
>>   }
>> -static const char * const btrfs_read_policy_name[] = { "pid" };
>> +/* Must follow the order as in enum btrfs_read_policy */
>> +static const char * const btrfs_read_policy_name[] = { "pid", 
>> "latency" };
>>   static ssize_t btrfs_read_policy_show(struct kobject *kobj,
>>                         struct kobj_attribute *a, char *buf)
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index f4037a6bd926..f7a0a83d2cd4 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -14,6 +14,7 @@
>>   #include <linux/semaphore.h>
>>   #include <linux/uuid.h>
>>   #include <linux/list_sort.h>
>> +#include <linux/part_stat.h>
>>   #include "misc.h"
>>   #include "ctree.h"
>>   #include "extent_map.h"
>> @@ -5490,6 +5491,39 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
>> *fs_info, u64 logical, u64 len)
>>       return ret;
>>   }
>> +static int btrfs_find_best_stripe(struct btrfs_fs_info *fs_info,
>> +                  struct map_lookup *map, int first,
>> +                  int num_stripe)
>> +{
>> +    u64 est_wait = 0;
>> +    int best_stripe = 0;
>> +    int index;
>> +
>> +    for (index = first; index < first + num_stripe; index++) {
>> +        u64 read_wait;
>> +        u64 avg_wait = 0;
>> +        unsigned long read_ios;
>> +        struct btrfs_device *device = map->stripes[index].dev;
>> +
>> +        read_wait = part_stat_read(device->bdev, nsecs[READ]);
>> +        read_ios = part_stat_read(device->bdev, ios[READ]);
>> +
>> +        if (read_wait && read_ios && read_wait >= read_ios)
>> +            avg_wait = div_u64(read_wait, read_ios);
>> +        else
>> +            btrfs_debug_rl(device->fs_devices->fs_info,
> 
> You can just use fs_info here, you already have it.  I'm not in love 
> with doing this check every time, I'd rather cache the results 
> somewhere.  However if we're read-only I can't think of a mechanism we 
> could piggy back on, RW we could just do it every transaction commit.  
> Fix the fs_info thing and you can add
> 

  Oh. I will fix it. Thanks for the review here and in other patches.
-Anand

> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> 
> Thanks,
> 
> Josef



* Re: [PATCH v3 1/4] btrfs: add read_policy latency
  2021-01-11  9:41 ` [PATCH v3 1/4] btrfs: add read_policy latency Anand Jain
  2021-01-19 19:36   ` Josef Bacik
@ 2021-01-20 10:27   ` Michal Rostecki
  2021-01-20 12:30     ` Anand Jain
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Rostecki @ 2021-01-20 10:27 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, dsterba, josef

On Mon, Jan 11, 2021 at 05:41:34PM +0800, Anand Jain wrote:
> The read policy type latency routes the read IO based on the historical
> average wait-time experienced by the read IOs through the individual
> device. This patch obtains the historical read IO stats from the kernel
> block layer and calculates its average.
> 
> Example usage:
>  echo "latency" > /sys/fs/btrfs/$uuid/read_policy 
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
> v3: The block layer commit 0d02129e76ed (block: merge struct block_device and
>     struct hd_struct) has changed the first argument in the function
>     part_stat_read_all() in 5.11-rc1. So the compilation will fail. This patch
>     fixes it.
>     Commit log updated.
> 
> v2: Use btrfs_debug_rl() instead of btrfs_info_rl()
>     It is better we have this debug until we test this on at least a few
>     hardware setups.
>     Drop the unrelated changes.
>     Update change log.
> 
> v1: Drop part_stat_read_all instead use part_stat_read
>     Drop inflight
> 

Hi Anand,

I tested this policy with fio and dstat. It performs overall really
well. On my raid1c3 array with two HDDs and one SSD (which is the last
device), I'm getting the following results.

With direct=0:

  Run status group 0 (all jobs):
     READ: bw=3560MiB/s (3733MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s),
     io=3129GiB (3360GB), run=900003-900013msec

With direct=1:

  Run status group 0 (all jobs):
     READ: bw=520MiB/s (545MB/s), 64.9MiB/s-65.0MiB/s (68.1MB/s-68.2MB/s),
     io=457GiB (490GB), run=900001-900001msec

However, I was also running dstat at the same time and I noticed that
the read stop sometimes for ~15-20 seconds. For example:

  ----system---- --dsk/sdb-- --dsk/sdc-- --dsk/sdd--
  20-01 00:37:21|   0     0 :   0     0 : 509M    0
  20-01 00:37:22|   0     0 :   0     0 : 517M    0
  20-01 00:37:23|   0     0 :   0     0 : 507M    0
  20-01 00:37:24|   0     0 :   0     0 : 518M    0
  20-01 00:37:25|   0     0 :   0     0 :  22M    0
  20-01 00:37:26|   0     0 :   0     0 :   0     0
  20-01 00:37:27|   0     0 :   0     0 :   0     0
  20-01 00:37:28|   0     0 :   0     0 :   0     0
  20-01 00:37:29|   0     0 :   0     0 :   0     0
  20-01 00:37:30|   0     0 :   0     0 :   0     0
  20-01 00:37:31|   0     0 :   0     0 :   0     0
  20-01 00:37:32|   0     0 :   0     0 :   0     0
  20-01 00:37:33|   0     0 :   0     0 :   0     0
  20-01 00:37:34|   0     0 :   0     0 :   0     0
  20-01 00:37:35|   0     0 :   0     0 :   0     0
  20-01 00:37:36|   0     0 :   0     0 :   0     0
  20-01 00:37:37|   0     0 :   0     0 :   0     0
  20-01 00:37:38|   0     0 :   0     0 :   0     0
  20-01 00:37:39|   0     0 :   0     0 :   0     0
  20-01 00:37:40|   0     0 :   0     0 :   0     0
  20-01 00:37:41|   0     0 :   0     0 :   0     0
  20-01 00:37:42|   0     0 :   0     0 :   0     0
  20-01 00:37:43|   0     0 :   0     0 :   0     0
  20-01 00:37:44|   0     0 :   0     0 :   0     0
  20-01 00:37:45|   0     0 :   0     0 :   0     0
  20-01 00:37:46|   0     0 :   0     0 :  55M    0
  20-01 00:37:47|   0     0 :   0     0 : 516M    0
  20-01 00:37:48|   0     0 :   0     0 : 515M    0
  20-01 00:37:49|   0     0 :   0     0 : 516M    0
  20-01 00:37:50|   0     0 :   0     0 : 520M    0
  20-01 00:37:51|   0     0 :   0     0 : 520M    0
  20-01 00:37:52|   0     0 :   0     0 : 514M    0

Here is the full log:

https://susepaste.org/16928336

I never noticed that happening with the PID policy. Is that maybe
because of reading the part stats for all CPUs while selecting the
mirror?

Michal


* Re: [PATCH v3 1/4] btrfs: add read_policy latency
  2021-01-20 10:27   ` Michal Rostecki
@ 2021-01-20 12:30     ` Anand Jain
  2021-01-20 13:54       ` Michal Rostecki
  0 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2021-01-20 12:30 UTC (permalink / raw)
  To: Michal Rostecki; +Cc: linux-btrfs, dsterba, josef


> Hi Anand,
> 
> I tested this policy with fio and dstat. It performs overall really
> well. On my raid1c3 array with two HDDs and one SSD (which is the last
> device), I'm getting the following results.
> 

Michal,

  Thank you for verifying. More below...

> With direct=0:
> 
>    Run status group 0 (all jobs):
>       READ: bw=3560MiB/s (3733MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s),
>       io=3129GiB (3360GB), run=900003-900013msec
> 
> With direct=1:
> 
>    Run status group 0 (all jobs):
>       READ: bw=520MiB/s (545MB/s), 64.9MiB/s-65.0MiB/s (68.1MB/s-68.2MB/s),
>       io=457GiB (490GB), run=900001-900001msec
> 
> However, I was also running dstat at the same time and I noticed that
> the read stop sometimes for ~15-20 seconds. For example:
> 
>    ----system---- --dsk/sdb-- --dsk/sdc-- --dsk/sdd--
>    20-01 00:37:21|   0     0 :   0     0 : 509M    0
>    20-01 00:37:22|   0     0 :   0     0 : 517M    0
>    20-01 00:37:23|   0     0 :   0     0 : 507M    0
>    20-01 00:37:24|   0     0 :   0     0 : 518M    0
>    20-01 00:37:25|   0     0 :   0     0 :  22M    0
>    20-01 00:37:26|   0     0 :   0     0 :   0     0
>    20-01 00:37:27|   0     0 :   0     0 :   0     0
>    20-01 00:37:28|   0     0 :   0     0 :   0     0
>    20-01 00:37:29|   0     0 :   0     0 :   0     0
>    20-01 00:37:30|   0     0 :   0     0 :   0     0
>    20-01 00:37:31|   0     0 :   0     0 :   0     0
>    20-01 00:37:32|   0     0 :   0     0 :   0     0
>    20-01 00:37:33|   0     0 :   0     0 :   0     0
>    20-01 00:37:34|   0     0 :   0     0 :   0     0
>    20-01 00:37:35|   0     0 :   0     0 :   0     0
>    20-01 00:37:36|   0     0 :   0     0 :   0     0
>    20-01 00:37:37|   0     0 :   0     0 :   0     0
>    20-01 00:37:38|   0     0 :   0     0 :   0     0
>    20-01 00:37:39|   0     0 :   0     0 :   0     0
>    20-01 00:37:40|   0     0 :   0     0 :   0     0
>    20-01 00:37:41|   0     0 :   0     0 :   0     0
>    20-01 00:37:42|   0     0 :   0     0 :   0     0
>    20-01 00:37:43|   0     0 :   0     0 :   0     0
>    20-01 00:37:44|   0     0 :   0     0 :   0     0
>    20-01 00:37:45|   0     0 :   0     0 :   0     0
>    20-01 00:37:46|   0     0 :   0     0 :  55M    0
>    20-01 00:37:47|   0     0 :   0     0 : 516M    0
>    20-01 00:37:48|   0     0 :   0     0 : 515M    0
>    20-01 00:37:49|   0     0 :   0     0 : 516M    0
>    20-01 00:37:50|   0     0 :   0     0 : 520M    0
>    20-01 00:37:51|   0     0 :   0     0 : 520M    0
>    20-01 00:37:52|   0     0 :   0     0 : 514M    0
> 
> Here is the full log:
> 
> https://susepaste.org/16928336
> 
> I never noticed that happening with the PID policy. Is that maybe
> because of reading the part stats for all CPUs while selecting the
> mirror?
> 

  I ran fio tests again, now with dstat in another window. I didn't
  notice any such stalls; the read numbers were continuous until fio
  finished. Could you please check with the below fio command, and
  could you please share your fio command options?

fio \
--filename=/btrfs/largefile \
--directory=/btrfs \
--filesize=50G \
--size=50G \
--bs=64k \
--ioengine=libaio \
--rw=read \
--direct=1 \
--numjobs=1 \
--group_reporting \
--thread \
--name iops-test-job

  Is it system specific?

Thanks.
Anand


> Michal
> 
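The latency policy's inputs can be eyeballed from userspace while fio runs:
/sys/block/&lt;dev&gt;/stat exposes completed read IOs (field 1) and milliseconds
spent reading (field 4), the millisecond-granularity counterparts of the
counters the patch reads via part_stat_read(). A small helper sketch
(hypothetical; bash; the SYSBLOCK variable is an assumption added to make the
sysfs root overridable for testing) prints each device's average:

```shell
#!/bin/bash
# Print the average read latency per device, computed the same way the
# latency read policy does: total read wait time divided by completed
# read IOs.  Field 1 of /sys/block/<dev>/stat is completed read IOs,
# field 4 is milliseconds spent on reads.
avg_read_latency() {
	local dev stat ios ms
	for dev in "$@"; do
		stat=${SYSBLOCK:-/sys/block}/$dev/stat
		[ -r "$stat" ] || continue
		read -r ios _ _ ms _ < "$stat"
		if [ "$ios" -gt 0 ]; then
			echo "$dev: $((ms / ios)) ms/IO ($ios reads)"
		else
			echo "$dev: no reads yet"
		fi
	done
}

# Example usage (device names are illustrative):
#   avg_read_latency sdb sdc sdd
```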



* Re: [PATCH v3 1/4] btrfs: add read_policy latency
  2021-01-20 12:30     ` Anand Jain
@ 2021-01-20 13:54       ` Michal Rostecki
  2021-01-21 10:45         ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Rostecki @ 2021-01-20 13:54 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, dsterba, josef

On Wed, Jan 20, 2021 at 08:30:56PM +0800, Anand Jain wrote:
>  I ran fio tests again, now with dstat in another window. I didn't
>  notice any such stalls; the read numbers were continuous until fio
>  finished. Could you please check with the below fio command, and
>  could you please share your fio command options?

That's the fio config I used:

https://gitlab.com/vadorovsky/playground/-/blob/master/fio/btrfs-raid1-seqread.fio

The main differences seem to be:
- the number of jobs (I used the number of CPU threads)
- direct vs buffered I/O

> 
> fio \
> --filename=/btrfs/largefile \
> --directory=/btrfs \
> --filesize=50G \
> --size=50G \
> --bs=64k \
> --ioengine=libaio \
> --rw=read \
> --direct=1 \
> --numjobs=1 \
> --group_reporting \
> --thread \
> --name iops-test-job
> 
>  Is it system specific?

With this command, dstat output looks good:

https://paste.opensuse.org/view/simple/93159623

So I think it might be specific to whether we test direct or buffered
I/O, or to the number of jobs (single vs multiple). Since most of the
I/O in production environments is buffered, I think we should test
with direct=0 too.

Cheers,
Michal


* Re: [PATCH v3 1/4] btrfs: add read_policy latency
  2021-01-20 13:54       ` Michal Rostecki
@ 2021-01-21 10:45         ` Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2021-01-21 10:45 UTC (permalink / raw)
  To: Michal Rostecki; +Cc: linux-btrfs, dsterba, josef



On 20/1/21 9:54 pm, Michal Rostecki wrote:
> On Wed, Jan 20, 2021 at 08:30:56PM +0800, Anand Jain wrote:
>>   I ran fio tests again, now with dstat in another window. I didn't
>>   notice any such stalls; the read numbers were continuous until fio
>>   finished. Could you please check with the below fio command, and
>>   could you please share your fio command options?
> 
> That's the fio config I used:
> 
> https://gitlab.com/vadorovsky/playground/-/blob/master/fio/btrfs-raid1-seqread.fio
> 
> The main differences seem to be:
> - the number of jobs (I used the number of CPU threads)
> - direct vs buffered I/O
> 
>>
>> fio \
>> --filename=/btrfs/largefile \
>> --directory=/btrfs \
>> --filesize=50G \
>> --size=50G \
>> --bs=64k \
>> --ioengine=libaio \
>> --rw=read \
>> --direct=1 \
>> --numjobs=1 \
>> --group_reporting \
>> --thread \
>> --name iops-test-job
>>
>>   Is it system specific?
> 
> With this command, dstat output looks good:
> 
> https://paste.opensuse.org/view/simple/93159623
> 
> So I think it might be specific to whether we test direct or buffered
> I/O, or to the number of jobs (single vs multiple). Since most of the
> I/O in production environments is buffered, I think we should test
> with direct=0 too.
> 
  That explains the stall: the reads were likely served from the page
  cache, so there was actually no IO to the device.

  Agreed, it must be tested with the most common type of IO. But I was
  a bit reluctant because the page cache comes into play.

Thanks, Anand


> Cheers,
> Michal
> 


