From: Anand Jain <anand.jain@oracle.com>
To: linux-btrfs@vger.kernel.org
Cc: josef@toxicpanda.com, dsterba@suse.com
Subject: Re: [PATCH v4 0/3, full-cover-letter] btrfs: read_policy types latency, device and round-robin
Date: Fri, 22 Jan 2021 13:52:10 +0800
Message-ID: <35c2e0bc-79e3-082b-e1f2-b55739ef5b70@oracle.com>
In-Reply-To: <20210120123437.OVx7ybGaVfmOdZxtpp43qcB_ORHQQs5OzPSzr3ZUGbo@z>
[Oops. Part of the cover letter went missing again. The cover-letter
file has all of it; I am not sure why this happened.
I am sending the full text below by email.]
v4:
Add rb from Josef in patch 1 and 3.
In patch 1/3, use fs_info instead of device->fs_devices->fs_info.
Drop round-robin policy because my workload (fio random) shows no
performance gains due to fewer merges at the block layer.
v3:
The block layer commit 0d02129e76ed (block: merge struct block_device
and struct hd_struct) changed the first argument of the function
part_stat_read_all() in 5.11-rc1. Propagate that change into
patch 1/4.
v2:
Fixes as per review comments, as in the individual patches.
rfc->v1:
Drop the tracing patch.
Drop the factor associated with the inflight commands (because there
were too many unnecessary switches).
A few C style fixes.
-----
This patchset adds read policy types latency, device, and round-robin,
for the mirrored raid profiles such as raid1, raid1c3, raid1c4, and
raid10. The default read policy remains as PID, as of now.
Read policy types:
Latency:
Latency policy routes the read IO based on the historical average
wait time experienced by the read IOs on the individual device.
Device:
With the device policy along with the read_preferred flag, you can
set the device for reading manually. This is useful for testing mirrors
in a deterministic way and supports advanced system administration.
Round-robin (RFC patch, removed in v4):
Alternates among the stripe devices in a round-robin loop for reading.
To achieve this, we first put the stripes in an array, sort it by devid,
and pick the next device.
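To make the latency and round-robin selection rules above concrete, here
is a minimal user-space sketch in bash. It is not the kernel code from
the patches. The function names pick_latency and pick_roundrobin, the
"devid:avg_wait" argument format, and passing the last-used devid as an
argument are all assumptions made for illustration.

```bash
#!/bin/bash
# Illustration only: a user-space model of two selection rules.

# pick_latency "devid:avg_wait" ...
# Prints the devid with the smallest average read wait.
# On a tie, the first device listed wins (an arbitrary choice here).
pick_latency() {
	local best_dev="" best_wait="" pair dev w
	for pair in "$@"; do
		dev=${pair%%:*}
		w=${pair##*:}
		if [ -z "$best_wait" ] || [ "$w" -lt "$best_wait" ]; then
			best_dev=$dev
			best_wait=$w
		fi
	done
	echo "$best_dev"
}

# pick_roundrobin <last_devid> <devid>...
# Sorts the stripe devids numerically and prints the one that follows
# <last_devid>, wrapping to the lowest devid at the end of the list.
pick_roundrobin() {
	local last=$1 sorted dev
	shift
	sorted=$(printf '%s\n' "$@" | sort -n)
	for dev in $sorted; do
		if [ "$dev" -gt "$last" ]; then
			echo "$dev"
			return
		fi
	done
	# No devid is greater than the last one used: wrap around.
	echo "$sorted" | head -n1
}
```

For example, "pick_latency 1:120 2:45 3:80" selects devid 2, and
"pick_roundrobin 3 3 1 2" wraps around to devid 1. The wait values here
are made up.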
Test scripts:
=============
I have included a few scripts which were useful for testing.
-------------------8<--------------------------------
Set latency policy on the btrfs mounted at /mnt
Usage example:
$ readpolicyset /mnt latency
$ cat readpolicyset
#!/bin/bash
: ${1?"arg1 <mnt> missing"}
: ${2?"arg2 <pid|latency|device|roundrobin> missing"}
mnt=$1
policy=$2
[ $policy == "device" ] && { : ${3?"arg3 <devid> missing"}; }
devid=$3
uuid=$(btrfs fi show -m "$mnt" | grep uuid | awk '{print $4}')
p=/sys/fs/btrfs/$uuid/read_policy
q=/sys/fs/btrfs/$uuid/devinfo
[ $policy == "device" ] && { echo 1 > ${q}/$devid/read_preferred || exit $?; }
echo $policy > $p
exit $?
-------------------8<--------------------------------
Read policy type from the btrfs mounted at /mnt
Usage example:
$ readpolicy /mnt
$ cat readpolicy
#!/bin/bash
: ${1?"arg1 <mnt> missing"}
mnt=$1
uuid=$(btrfs fi show -m "$mnt" | grep uuid | awk '{print $4}')
p=/sys/fs/btrfs/$uuid/read_policy
q=/sys/fs/btrfs/$uuid/devinfo
policy=$(cat $p)
echo -n "$policy ( "
for i in $(find $q -type f -name read_preferred | xargs cat)
do
echo -n "$i"
done
echo ")"
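The read_policy sysfs file lists all policies and marks the active one
with square brackets, as in the "[pid] latency device roundrobin" lines
in the results further down. If only the active policy name is wanted,
a small helper along these lines can extract it. The name active_policy
is hypothetical, and the helper works on a string argument so that it
does not depend on a mounted filesystem.

```bash
#!/bin/bash
# active_policy "<contents of read_policy>"
# Prints the policy name enclosed in square brackets.
# Returns 1 if the string contains no bracketed entry.
active_policy() {
	local s=$1
	case $s in
	*\[*\]*)
		s=${s#*\[}          # drop everything up to and including "["
		echo "${s%%\]*}"    # drop "]" and everything after it
		;;
	*)
		return 1
		;;
	esac
}
```

For example, active_policy "$(cat $p)" could replace the plain
policy=$(cat $p) in the script above when only the name is needed.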
-------------------8<--------------------------------
Show the number of read IOs per device for the given command.
Usage example:
$ readstat /mnt fioread
$ cat readstat
#!/bin/bash
: ${1?"arg1 <mnt> is missing"}
: ${2?"arg2 <cmd-to-run> is missing"}
mnt=$1; shift
mountpoint -q $mnt || { echo "ERROR: $mnt is not mounted"; exit 1; }
declare -A devread
for dev in $(btrfs filesystem show -m $mnt | grep devid |awk '{print $8}')
do
prefix=$(echo $dev | rev | cut -d"/" -f1 | rev)
sysfs_path=$(find /sys | grep $prefix/stat$)
devread[$sysfs_path]=$(cat $sysfs_path | awk '{print $1}')
done
"$@" | grep "READ: bw"
echo
echo
for sysfs_path in ${!devread[@]}
do
dev=$(echo $sysfs_path | rev | cut -d"/" -f2 | rev)
new=$(cat $sysfs_path | awk '{print $1}')
old=${devread[$sysfs_path]}
echo "$dev $((new - old))"
done
-------------------8<--------------------------------
Run fio read command
Usage example:
$ touch /mnt/largefile
$ fioread /mnt/largefile 500m
$ cat fioread
#!/bin/bash
: ${1?"arg1 </mnt/file> is missing"}
: ${2?"arg2 <1Gi|50Gi> is missing"}
tf=$1
sz=$2
mnt=$(stat -c '%m' $tf)
fio \
--filename=$tf \
--directory=$mnt \
--filesize=$sz \
--size=$sz \
--rw=randread \
--bs=64k \
--ioengine=libaio \
--direct=1 \
--numjobs=32 \
--group_reporting \
--thread \
--name iops-test-job
-------------------8<--------------------------------
Testing on guest VM
~~~~~~~~~~~~~~~~~~~
The test results from my VM, with 2 devices of type sata and 2 devices
of type virtio, are below, for raid1c4, raid10, and raid1.
The workload is fio with 32 threads, each doing 500m of random reads.
The fio command is passed to the readstat script, which reports the
number of read IOs sent to each device during the fio run (excluding
the merges). Supporting fio logs are below.
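For reference, readstat samples the first field of
/sys/block/<dev>/stat, which counts completed read IOs; merged reads are
counted separately in the second field. The fourth field is the
cumulative read wait time in milliseconds, so an average wait per read
can be approximated from user space as sketched below. This is only an
approximation for inspection and not the calculation done by the latency
policy in the kernel. The function names and the sample stat line are
assumptions for illustration.

```bash
#!/bin/bash
# read_ios "<stat line>"
# Prints field 1 of a /sys/block/<dev>/stat line: completed read IOs.
read_ios() {
	set -- $1
	echo "$1"
}

# avg_read_wait_ms "<stat line>"
# Prints read ticks (field 4, in ms) divided by read IOs (field 1),
# using integer arithmetic. Prints 0 when no reads have completed.
avg_read_wait_ms() {
	set -- $1
	if [ "$1" -eq 0 ]; then
		echo 0
		return
	fi
	echo $(( $4 / $1 ))
}

# Hypothetical stat line: 1000 reads, 20 merges, 6000 ms of read wait.
sample="1000 20 128000 6000 0 0 0 0 0 500 6000"
```

With this sample line, read_ios "$sample" prints 1000 and
avg_read_wait_ms "$sample" prints 6.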
raid1c4
=======
pid
----
$ readpolicyset /btrfs pid && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m
[pid] latency device roundrobin ( 0000)
READ: bw=87.0MiB/s (91.2MB/s), 87.0MiB/s-87.0MiB/s
(91.2MB/s-91.2MB/s), io=15.6GiB (16.8GB), run=183884-183884msec
vdb 64060
vdc 64053
sdb 64072
sda 64054
latency
-------
(All devices are non-rotational, but sda and sdb are of type sata and
vdb and vdc are of type virtio).
$ readpolicyset /btrfs latency && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m
pid [latency] device roundrobin ( 0000)
READ: bw=87.1MiB/s (91.3MB/s), 87.1MiB/s-87.1MiB/s
(91.3MB/s-91.3MB/s), io=15.6GiB (16.8GB), run=183774-183774msec
vdb 255844
vdc 559
sdb 0
sda 93
roundrobin
----------
$ readpolicyset /btrfs roundrobin && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m
pid latency device [roundrobin] ( 0000)
READ: bw=51.0MiB/s (54.5MB/s), 51.0MiB/s-51.0MiB/s
(54.5MB/s-54.5MB/s), io=15.6GiB (16.8GB), run=307755-307755msec
vdb 866859
vdc 866651
sdb 864139
sda 865533
raid10
======
pid
---
$ readpolicyset /btrfs pid && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m
[pid] latency device roundrobin ( 0000)
READ: bw=85.2MiB/s (89.3MB/s), 85.2MiB/s-85.2MiB/s
(89.3MB/s-89.3MB/s), io=15.6GiB (16.8GB), run=187864-187864msec
sdf 64053
sde 64036
sdd 64043
sdc 64038
latency
-------
$ readpolicyset /btrfs latency && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m
pid [latency] device roundrobin ( 0000)
READ: bw=85.4MiB/s (89.5MB/s), 85.4MiB/s-85.4MiB/s
(89.5MB/s-89.5MB/s), io=15.6GiB (16.8GB), run=187370-187370msec
sdf 117494
sde 10748
sdd 125247
sdc 2921
roundrobin
----------
$ readpolicyset /btrfs roundrobin && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m
pid latency device [roundrobin] ( 0000)
READ: bw=55.4MiB/s (58.1MB/s), 55.4MiB/s-55.4MiB/s
(58.1MB/s-58.1MB/s), io=15.6GiB (16.8GB), run=288701-288701msec
sdf 617593
sde 617381
sdd 618486
sdc 618633
raid1
=====
pid
----
$ readpolicyset /btrfs pid && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m
[pid] latency device roundrobin ( 00)
READ: bw=78.8MiB/s (82.6MB/s), 78.8MiB/s-78.8MiB/s
(82.6MB/s-82.6MB/s), io=15.6GiB (16.8GB), run=203158-203158msec
sdb 128087
sda 128090
latency
--------
$ readpolicyset /btrfs latency && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m
pid [latency] device roundrobin ( 00)
READ: bw=86.5MiB/s (90.7MB/s), 86.5MiB/s-86.5MiB/s
(90.7MB/s-90.7MB/s), io=15.6GiB (16.8GB), run=185023-185023msec
sdb 567
sda 255942
device
-------
(From the latency test results above, we know that sda provides the
lower-latency read IO, so set sda as the read-preferred device.)
$ readpolicyset /btrfs device 1 && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m
pid latency [device] roundrobin ( 10)
READ: bw=88.2MiB/s (92.5MB/s), 88.2MiB/s-88.2MiB/s
(92.5MB/s-92.5MB/s), io=15.6GiB (16.8GB), run=181374-181374msec
sdb 0
sda 256191
roundrobin
-----------
$ readpolicyset /btrfs roundrobin && readpolicy /btrfs && dropcache && readstat /btrfs fioread /btrfs/largefile 500m
pid latency device [roundrobin] ( 00)
READ: bw=54.1MiB/s (56.7MB/s), 54.1MiB/s-54.1MiB/s
(56.7MB/s-56.7MB/s), io=15.6GiB (16.8GB), run=295693-295693msec
sdb 1252584
sda 1254258
Testing on real hardware:
~~~~~~~~~~~~~~~~~~~~~~~~
raid1 Read 500m
-----------------------------------------------------
|nvme+ssd nvme+ssd all-nvme all-nvme
|random sequential random sequential
------------+------------------------------------------
pid | 744MiB/s 809MiB/s 2225MiB/s 2155MiB/s
latency |2072MiB/s 2008MiB/s 1999MiB/s 1961MiB/s
device(nvme)|2187MiB/s 2063MiB/s 2125MiB/s 2080MiB/s
roundrobin | 527MiB/s 519MiB/s 2137MiB/s 1876MiB/s
raid10 Read 500m
-----------------------------------------------------
| nvme+ssd nvme+ssd all-nvme all-nvme
| random seq random seq
------------+-----------------------------------------
pid | 1282MiB/s 1427MiB/s 2152MiB/s 1969MiB/s
latency | 2073MiB/s 1871MiB/s 1975MiB/s 1984MiB/s
device(nvme)| 2447MiB/s 1873MiB/s 2184MiB/s 2015MiB/s
roundrobin | 1117MiB/s 1076MiB/s 2020MiB/s 2030MiB/s
raid1c3 Read 500m
-----------------------------------------------------
| nvme+ssd nvme+ssd all-nvme all-nvme
| random seq random seq
------------+-----------------------------------------
pid | 973MiB/s 955MiB/s 2144MiB/s 1962MiB/s
latency | 2005MiB/s 1924MiB/s 2083MiB/s 1980MiB/s
device(nvme)| 2021MiB/s 2034MiB/s 1920MiB/s 2132MiB/s
roundrobin | 707MiB/s 701MiB/s 1760MiB/s 1990MiB/s
raid1c4 Read 500m
-----------------------------------------------------
| nvme+ssd nvme+ssd all-nvme all-nvme
| random seq random seq
------------+----------------------------------------
pid | 1204MiB/s 1221MiB/s 2065MiB/s 1878MiB/s
latency | 1990MiB/s 1920MiB/s 1945MiB/s 1865MiB/s
device(nvme)| 2109MiB/s 1935MiB/s 2153MiB/s 1991MiB/s
roundrobin | 887MiB/s 865MiB/s 1948MiB/s 1796MiB/s
Observations:
=============
1.
Our chunk allocation is based on each device's available size at that
time, so stripe 0 may circulate among the devices. As a result, a
single-threaded process running with a constant PID may still balance
the read IO among devices. But this is not guaranteed to work in all
cases, and it might not work very well for raid1c3/4. Further, PID
provides terrible performance if the devices are heterogeneous in
type, speed, or size.
2.
Latency provides performance equal to PID if all devices are of the same
type. Latency needs iostat to be enabled and includes the cost of
calculating the avg. wait time. So if you factor a similar cost of
calculating the avg. wait time into the PID policy (using the debug
code [2]), then Latency performs better than PID. This shows that read
IO distribution as per latency is working, but there is a cost to it.
Moreover, latency works for any type of device.
3.
Round robin is the worst (unless there is a bug in my patch). The total
number of new IOs issued is almost double compared with the PID and
Latency read_policy, because there were fewer IO merges in the block
layer due to the constant switching of devices in btrfs.
4.
Device read_policy is useful in testing and provides advanced sysadmin
capabilities. For those who know how to use it, the policy could help
avert performance degradation due to csum/IO errors in production.
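
As an illustration of observation 1, the PID rule visible in the debug
patch [2] (preferred_mirror = first + current->pid % num_stripes) can be
modelled in bash. The sketch below is a user-space model only. The
function names pid_mirror and pid_spread and the example PIDs are
assumptions, and it does not model how stripe 0 moves between devices
from chunk to chunk.

```bash
#!/bin/bash
# pid_mirror <first> <pid> <num_stripes>
# Mirror index chosen for a reader with the given PID.
pid_mirror() {
	echo $(( $1 + $2 % $3 ))
}

# pid_spread <start_pid> <count> <num_stripes>
# For <count> readers with consecutive PIDs, prints how many of them
# land on each mirror index, from index 0 upwards.
pid_spread() {
	local start=$1 count=$2 n=$3 m=0 i hits out=""
	while [ "$m" -lt "$n" ]; do
		hits=0
		i=0
		while [ "$i" -lt "$count" ]; do
			if [ "$(pid_mirror 0 $(( start + i )) "$n")" -eq "$m" ]; then
				hits=$(( hits + 1 ))
			fi
			i=$(( i + 1 ))
		done
		out="$out${out:+ }$hits"
		m=$(( m + 1 ))
	done
	echo "$out"
}
```

A single reader always gets the same index: pid_mirror 0 4321 2 prints 1
on every call. Readers with consecutive PIDs spread evenly over the
indexes: pid_spread 1000 32 4 prints "8 8 8 8". The rule takes no
account of how fast each device is.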
Thanks, Anand
------------------
[2] Debug patch to factor the cost of calculating the latency per IO.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d3023879bdf6..72ec633e9063 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5665,6 +5665,12 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
fs_info->fs_devices->read_policy = BTRFS_READ_POLICY_PID;
fallthrough;
case BTRFS_READ_POLICY_PID:
+ /*
+ * Just to factor in the cost of calculating the avg wait using
+ * iostat, call btrfs_find_best_stripe() here for the PID policy
+ * and drop its results on the floor.
+ */
+ btrfs_find_best_stripe(fs_info, map, first, num_stripes, log,
+ logsz);
preferred_mirror = first + current->pid % num_stripes;
scnprintf(log, logsz,
"first %d num_stripe %d %s (%d) preferred %d",
-------------------------
On 20/1/21 8:34 pm, Anand Jain wrote:
> [Only some parts of the cover-letter went through, tying again.].
>
> v4:
> Add rb from Josef in patch 1 and 3.
> In patch 1/3, use fs_info instead of device->fs_devices->fs_info.
> Drop round-robin policy because my workload (fio random) shows no performance
> gains due to fewer merges at the block layer.
>
> v3:
> The block layer commit 0d02129e76ed (block: merge struct block_device and
> struct hd_struct) has changed the first argument in the function
> part_stat_read_all() in 5.11-rc1. So trickle down its changes in the patch 1/4.
>
> v2:
> Fixes as per review comments, as in the individual patches.
>
> rfc->v1:
> Drop the tracing patch.
> Drop the factor associated with the inflight commands (because there
> were too many unnecessary switches).
> Few C styles fix.
>
> -----
>
> This patchset adds read policy types latency, device, and round-robin, for the
> mirrored raid profiles such as raid1, raid1c3, raid1c4, and raid10. The default
> read policy remains as PID, as of now.
>
> Read policy types:
> Latency:
>
> Latency policy routes the read IO based on the historical average
> wait time experienced by the read IOs on the individual device.
>
> Device:
>
> With the device policy along with the read_preferred flag, you can
> set the device for reading manually. Useful to test mirrors in a
> deterministic way and helps advance system administrations.
>
> Round-robin (RFC patch):
>
> Alternates striped device in a round-robin loop for reading. To achieve
> this first we put the stripes in an array, sort it by devid and pick the
> next device.
>
> Test scripts:
> =============
>
> I have included a few scripts which were useful for testing.
>
> -------------------8<--------------------------------
> Set latency policy on the btrfs mounted at /mnt
>
> Usage example:
> $ readpolicyset /mnt latency
>
> Anand Jain (3):
> btrfs: add read_policy latency
> btrfs: introduce new device-state read_preferred
> btrfs: introduce new read_policy device
>
> fs/btrfs/sysfs.c | 57 ++++++++++++++++++++++++++++++++++++++++++-
> fs/btrfs/volumes.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/volumes.h | 5 ++++
> 3 files changed, 121 insertions(+), 1 deletion(-)
>