Linux-Block Archive on lore.kernel.org
 help / color / Atom feed
From: Paolo Valente <paolo.valente@linaro.org>
To: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	newella@fb.com, clm@fb.com, Josef Bacik <josef@toxicpanda.com>,
	dennisz@fb.com, Li Zefan <lizefan@huawei.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	linux-block <linux-block@vger.kernel.org>,
	kernel-team@fb.com, cgroups@vger.kernel.org, ast@kernel.org,
	daniel@iogearbox.net, kafai@fb.com, songliubraving@fb.com,
	yhs@fb.com, bpf@vger.kernel.org
Subject: Re: [PATCHSET block/for-next] IO cost model based work-conserving porportional controller
Date: Mon, 2 Sep 2019 17:45:50 +0200
Message-ID: <88C7DC68-680E-49BB-9699-509B9B0B12A0@linaro.org> (raw)
In-Reply-To: <20190831065358.GF2263813@devbig004.ftw2.facebook.com>



> Il giorno 31 ago 2019, alle ore 08:53, Tejun Heo <tj@kernel.org> ha scritto:
> 
> Hello, Paolo.
> 

Hi Tejun,

> On Thu, Aug 22, 2019 at 10:58:22AM +0200, Paolo Valente wrote:
>> Ok, I tried with the parameters reported for a SATA SSD:
>> 
>> rpct=95.00 rlat=10000 wpct=95.00 wlat=20000 min=50.00 max=400.00
> 
> Sorry, I should have explained it with a lot more details.
> 
> There are two things - the cost model and qos params.  The default SSD
> cost model parameters are derived by averaging a number of mainstream
> SSD parameters.  As a ballpark, this can be good enough because while
> the overall performance varied quite a bit from one ssd to another,
> the relative cost of different types of IOs wasn't drastically
> different.
> 
> However, this means that the performance baseline can easily be way
> off from 100% depending on the specific device in use.  In the above,
> you're specifying min/max which limits how far the controller is
> allowed to adjust the overall cost estimation.  50% and 400% are
> numbers which may make sense if the cost model parameter is expected
> to fall somewhere around 100% - ie. if the parameters are for that
> specific device.
> 
> In your script, you're using default model params but limiting vrate
> range.  It's likely that your device is significantly slower than what
> the default parameters are expecting.  However, because min vrate is
> limited to 50%, it doesn't throttle below 50% of the estimated cost,
> so if the device is significantly slower than that, nothing gets
> controlled.
> 

Thanks for this extra explanations.  It is a little bit difficult for
me to understand how the min/max teaks for exactly, but you did give
me the general idea.

>> and with a simpler configuration [1]: one target doing random reads
> 
> And without QoS latency targets, the controller is purely going by
> queue depth depletion which works fine for many usual workloads such
> as larger reads and writes but isn't likely to serve low-concurrency
> latency-sensitive IOs well.
> 
>> and only four interferers doing sequential reads, with all the
>> processes (groups) having the same weight.
>> 
>> But there seemed to be little or no control on I/O, because the target
>> got only 1.84 MB/s, against 1.15 MB/s without any control.
>> 
>> So I tried with rlat=1000 and rlat=100.
> 
> And this won't do anything as all rlat/wlat does is regulating how the
> overall vrate should be adjusted and it's being min'd at 50%.
> 
>> Control did improve, with same results for both values of rlat.  The
>> problem is that these results still seem rather bad, both in terms of
>> throughput guaranteed to the target and in terms of total throughput.
>> Here are results compared with BFQ (throughputs measured in MB/s):
>> 
>>                           io.weight            BFQ
>> target's throughput        3.415                6.224        
>> total throughput           159.14               321.375
> 
> So, what should have been configured is something like
> 
> $ echo '8:0 enable=1 rpct=95 rlat=10000 wpct=95 wlat=20000' > /sys/fs/cgroup/io.cost.qos
> 

Unfortunately, io.cost does not seem to control I/O with this
configuration, as it gives the interfered the same bw as no I/O
control (i.e., none as I/O scheduler and no I/O controller or policy
active):

                          none        io.weight            BFQ
target's throughput       0.8            0.7                 4
total throughput          506            506               344

The test case is still the rand reader against 7 seq readers.

> which just says "target 10ms p(95) read latency and 20ms p(95) write
> latency" without putting any restrictions on vrate range.
> 
> With that, I got the following on Micron_1100_MTFDDAV256TBN which is a
> pretty old 256GB SATA drive.
> 
>  Aggregated throughput:
> 	   min         max         avg     std_dev     conf99%
> 	266.73      275.71      271.38     4.05144     45.7635
>  Interfered total throughput:
> 	   min         max         avg     std_dev
> 	 9.608      13.008      10.941    0.664938
> 
> During the run, iocost-monitor.py looked like the following.
> 
>  sda RUN  per=40ms cur_per=2074.351:v1008.844 busy= +0 vrate= 59.85% params=ssd_dfl(CQ)
> 			    active    weight      hweight% inflt% del_ms usages%
>  InterfererGroup0             *   100/  100  22.94/ 20.00   0.00  0*000 023:023:023
>  InterfererGroup1             *   100/  100  22.94/ 20.00   0.00  0*000 023:023:023
>  InterfererGroup2             *   100/  100  22.94/ 20.00   0.00  0*000 025:023:021
>  InterfererGroup3             *   100/  100  22.94/ 20.00   0.00  0*000 023:023:023
>  interfered                   *    36/  100   8.26/ 20.00   0.42  0*000 003:004:004
> 
> Note that interfered is reported to only use 3-4% of the disk capacity
> while configured to consume 20%.  This is because with single
> concurrency 4k randread job, its ability to consume IO capacity is
> limited by the completion latency.
> 
> 10ms is pretty generous (ie. more work-conserving) target for SSDs.
> Let's say we're willing to tighten it to trade off total work for
> tighter latency.
> 
> $ echo '8:0 enable=1 rpct=95 rlat=2500 wpct=95 wlat=5000' > /sys/fs/cgroup/io.cost.qos
> 

Now io.weight does control I/O, but throughputs fluctuate a lot
between runs and during each run.  After extending the duration of
each run to 20 seconds, an average run for io.weight and BFQ gives the
following throughputs (same throughputs as above for none):

                          none        io.weight            BFQ
target's throughput       0.8            2.3               3.6
total throughput          506            321               360

For completeness I tried also with rlat=1000.  But throughputs dropped
dramatically:

                          none        io.weight            BFQ
target's throughput       0.8            0.2               3.6
total throughput          506            17               360

Are these results in line with your expectations?  If they are, then
I'd like to extend benchmarks to more mixes of workloads.  Or should I
try some other QoS configuration first?

Thanks,
Paolo

>  Aggregated throughput:
> 	   min         max         avg     std_dev     conf99%
> 	147.06      172.18     154.608      11.783     133.096
>  Interfered total throughput:
> 	   min         max         avg     std_dev
> 	17.992       19.32      18.698    0.313105
> 
> and the monitoring output
> 
>  sda RUN  per=10ms cur_per=2927.152:v1556.138 busy= -2 vrate= 34.74% params=ssd_dfl(CQ)
> 			    active    weight      hweight% inflt% del_ms usages%
>  InterfererGroup0             *   100/  100  20.00/ 20.00 386.11  0*000 070:020:020
>  InterfererGroup1             *   100/  100  20.00/ 20.00 386.11  0*000 070:020:020
>  InterfererGroup2             *   100/  100  20.00/ 20.00 386.11  0*000 070:020:020
>  InterfererGroup3             *   100/  100  20.00/ 20.00   0.00  0*000 020:020:020
>  interfered                   *   100/  100  20.00/ 20.00   1.21  0*000 010:014:017
> 
> The followings happened.
> 
> * The vrate is now hovering way lower.  The device is now doing less
>  total work to acheive tighter completion latencies.
> 
> * The overall throughput dropped but interfered's utilization is now
>  significantly higher along with its bandwidth from lower completion
>  latencies.
> 
> For reference:
> 
> [Disabled]
> 
>  Aggregated throughput:
> 	   min         max         avg     std_dev     conf99%
> 	493.98      511.37     502.808     9.52773     107.621
>  Interfered total throughput:
> 	   min         max         avg     std_dev
> 	 0.056       0.304       0.107   0.0691052
> 
> [Enabled, no QoS config]
> 
>  Aggregated throughput:
> 	   min         max         avg     std_dev     conf99%
> 	429.07      449.59     437.597     8.64952     97.7015
>  Interfered total throughput:
> 	   min         max         avg     std_dev
> 	 0.456        3.12        1.08    0.774318
> 
> Thanks.
> 
> -- 
> tejun


  parent reply index

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-14  1:56 Tejun Heo
2019-06-14  1:56 ` [PATCH 01/10] blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn() Tejun Heo
2019-06-14  1:56 ` [PATCH 02/10] blkcg: make ->cpd_init_fn() optional Tejun Heo
2019-06-14  1:56 ` [PATCH 03/10] blkcg: separate blkcg_conf_get_disk() out of blkg_conf_prep() Tejun Heo
2019-06-14  1:56 ` [PATCH 04/10] block/rq_qos: add rq_qos_merge() Tejun Heo
2019-06-14  1:56 ` [PATCH 05/10] block/rq_qos: implement rq_qos_ops->queue_depth_changed() Tejun Heo
2019-06-14  1:56 ` [PATCH 06/10] blkcg: s/RQ_QOS_CGROUP/RQ_QOS_LATENCY/ Tejun Heo
2019-06-14  1:56 ` [PATCH 07/10] blk-mq: add optional request->pre_start_time_ns Tejun Heo
2019-06-14  1:56 ` [PATCH 08/10] blkcg: implement blk-ioweight Tejun Heo
2019-06-14 12:17   ` Toke Høiland-Jørgensen
2019-06-14 15:09     ` Tejun Heo
2019-06-14 20:50       ` Toke Høiland-Jørgensen
2019-06-15 15:57         ` Tejun Heo
2019-06-14  1:56 ` [PATCH 09/10] blkcg: add tools/cgroup/monitor_ioweight.py Tejun Heo
2019-06-14  1:56 ` [PATCH 10/10] blkcg: implement BPF_PROG_TYPE_IO_COST Tejun Heo
2019-06-14 11:32   ` Quentin Monnet
2019-06-14 14:52     ` Tejun Heo
2019-06-14 16:35       ` Alexei Starovoitov
2019-06-14 17:09         ` Tejun Heo
2019-06-14 17:56 ` [PATCHSET block/for-next] IO cost model based work-conserving porportional controller Tejun Heo
2019-08-20 10:48   ` Paolo Valente
2019-08-20 15:04     ` Paolo Valente
2019-08-20 15:19       ` Tejun Heo
2019-08-22  8:58         ` Paolo Valente
2019-08-31  6:53           ` Tejun Heo
2019-08-31  7:10             ` Paolo Valente
2019-08-31 11:20               ` Tejun Heo
2019-09-02 15:45             ` Paolo Valente [this message]
2019-09-02 15:56               ` Tejun Heo
2019-09-02 19:43                 ` Paolo Valente
2019-09-05 16:55                   ` Tejun Heo
2019-09-06  9:07                     ` Paolo Valente
2019-09-06 14:58                       ` Tejun Heo

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=88C7DC68-680E-49BB-9699-509B9B0B12A0@linaro.org \
    --to=paolo.valente@linaro.org \
    --cc=ast@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=bpf@vger.kernel.org \
    --cc=cgroups@vger.kernel.org \
    --cc=clm@fb.com \
    --cc=daniel@iogearbox.net \
    --cc=dennisz@fb.com \
    --cc=hannes@cmpxchg.org \
    --cc=josef@toxicpanda.com \
    --cc=kafai@fb.com \
    --cc=kernel-team@fb.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizefan@huawei.com \
    --cc=newella@fb.com \
    --cc=songliubraving@fb.com \
    --cc=tj@kernel.org \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org
	public-inbox-index linux-block

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git