Linux-Block Archive on lore.kernel.org
 help / color / Atom feed
From: Paolo Valente <paolo.valente@linaro.org>
To: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	newella@fb.com, clm@fb.com, Josef Bacik <josef@toxicpanda.com>,
	dennisz@fb.com, Li Zefan <lizefan@huawei.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	linux-block <linux-block@vger.kernel.org>,
	kernel-team@fb.com, cgroups@vger.kernel.org, ast@kernel.org,
	daniel@iogearbox.net, kafai@fb.com, songliubraving@fb.com,
	yhs@fb.com, bpf@vger.kernel.org
Subject: Re: [PATCHSET block/for-next] IO cost model based work-conserving porportional controller
Date: Sat, 31 Aug 2019 09:10:26 +0200
Message-ID: <73CA9E04-DD93-47E4-B0FE-0A12EC991DEF@linaro.org> (raw)
In-Reply-To: <20190831065358.GF2263813@devbig004.ftw2.facebook.com>

Hi Tejun,
thank you very much for this extra information, I'll try the
configuration you suggest.  In this respect, is this still the branch
to use

https://kernel.googlesource.com/pub/scm/linux/kernel/git/tj/cgroup/+/refs/heads/review-iocost-v2

also after the issue spotted two days ago [1]?

Thanks,
Paolo

[1] https://lkml.org/lkml/2019/8/29/910

> Il giorno 31 ago 2019, alle ore 08:53, Tejun Heo <tj@kernel.org> ha scritto:
> 
> Hello, Paolo.
> 
> On Thu, Aug 22, 2019 at 10:58:22AM +0200, Paolo Valente wrote:
>> Ok, I tried with the parameters reported for a SATA SSD:
>> 
>> rpct=95.00 rlat=10000 wpct=95.00 wlat=20000 min=50.00 max=400.00
> 
> Sorry, I should have explained it with a lot more details.
> 
> There are two things - the cost model and qos params.  The default SSD
> cost model parameters are derived by averaging a number of mainstream
> SSD parameters.  As a ballpark, this can be good enough because while
> the overall performance varied quite a bit from one ssd to another,
> the relative cost of different types of IOs wasn't drastically
> different.
> 
> However, this means that the performance baseline can easily be way
> off from 100% depending on the specific device in use.  In the above,
> you're specifying min/max which limits how far the controller is
> allowed to adjust the overall cost estimation.  50% and 400% are
> numbers which may make sense if the cost model parameter is expected
> to fall somewhere around 100% - ie. if the parameters are for that
> specific device.
> 
> In your script, you're using default model params but limiting vrate
> range.  It's likely that your device is significantly slower than what
> the default parameters are expecting.  However, because min vrate is
> limited to 50%, it doesn't throttle below 50% of the estimated cost,
> so if the device is significantly slower than that, nothing gets
> controlled.
> 
>> and with a simpler configuration [1]: one target doing random reads
> 
> And without QoS latency targets, the controller is purely going by
> queue depth depletion which works fine for many usual workloads such
> as larger reads and writes but isn't likely to serve low-concurrency
> latency-sensitive IOs well.
> 
>> and only four interferers doing sequential reads, with all the
>> processes (groups) having the same weight.
>> 
>> But there seemed to be little or no control on I/O, because the target
>> got only 1.84 MB/s, against 1.15 MB/s without any control.
>> 
>> So I tried with rlat=1000 and rlat=100.
> 
> And this won't do anything as all rlat/wlat does is regulating how the
> overall vrate should be adjusted and it's being min'd at 50%.
> 
>> Control did improve, with same results for both values of rlat.  The
>> problem is that these results still seem rather bad, both in terms of
>> throughput guaranteed to the target and in terms of total throughput.
>> Here are results compared with BFQ (throughputs measured in MB/s):
>> 
>>                           io.weight            BFQ
>> target's throughput        3.415                6.224        
>> total throughput           159.14               321.375
> 
> So, what should have been configured is something like
> 
> $ echo '8:0 enable=1 rpct=95 rlat=10000 wpct=95 wlat=20000' > /sys/fs/cgroup/io.cost.qos
> 
> which just says "target 10ms p(95) read latency and 20ms p(95) write
> latency" without putting any restrictions on vrate range.
> 
> With that, I got the following on Micron_1100_MTFDDAV256TBN which is a
> pretty old 256GB SATA drive.
> 
>  Aggregated throughput:
> 	   min         max         avg     std_dev     conf99%
> 	266.73      275.71      271.38     4.05144     45.7635
>  Interfered total throughput:
> 	   min         max         avg     std_dev
> 	 9.608      13.008      10.941    0.664938
> 
> During the run, iocost-monitor.py looked like the following.
> 
>  sda RUN  per=40ms cur_per=2074.351:v1008.844 busy= +0 vrate= 59.85% params=ssd_dfl(CQ)
> 			    active    weight      hweight% inflt% del_ms usages%
>  InterfererGroup0             *   100/  100  22.94/ 20.00   0.00  0*000 023:023:023
>  InterfererGroup1             *   100/  100  22.94/ 20.00   0.00  0*000 023:023:023
>  InterfererGroup2             *   100/  100  22.94/ 20.00   0.00  0*000 025:023:021
>  InterfererGroup3             *   100/  100  22.94/ 20.00   0.00  0*000 023:023:023
>  interfered                   *    36/  100   8.26/ 20.00   0.42  0*000 003:004:004
> 
> Note that interfered is reported to only use 3-4% of the disk capacity
> while configured to consume 20%.  This is because with single
> concurrency 4k randread job, its ability to consume IO capacity is
> limited by the completion latency.
> 
> 10ms is pretty generous (ie. more work-conserving) target for SSDs.
> Let's say we're willing to tighten it to trade off total work for
> tighter latency.
> 
> $ echo '8:0 enable=1 rpct=95 rlat=2500 wpct=95 wlat=5000' > /sys/fs/cgroup/io.cost.qos
> 
>  Aggregated throughput:
> 	   min         max         avg     std_dev     conf99%
> 	147.06      172.18     154.608      11.783     133.096
>  Interfered total throughput:
> 	   min         max         avg     std_dev
> 	17.992       19.32      18.698    0.313105
> 
> and the monitoring output
> 
>  sda RUN  per=10ms cur_per=2927.152:v1556.138 busy= -2 vrate= 34.74% params=ssd_dfl(CQ)
> 			    active    weight      hweight% inflt% del_ms usages%
>  InterfererGroup0             *   100/  100  20.00/ 20.00 386.11  0*000 070:020:020
>  InterfererGroup1             *   100/  100  20.00/ 20.00 386.11  0*000 070:020:020
>  InterfererGroup2             *   100/  100  20.00/ 20.00 386.11  0*000 070:020:020
>  InterfererGroup3             *   100/  100  20.00/ 20.00   0.00  0*000 020:020:020
>  interfered                   *   100/  100  20.00/ 20.00   1.21  0*000 010:014:017
> 
> The followings happened.
> 
> * The vrate is now hovering way lower.  The device is now doing less
>  total work to acheive tighter completion latencies.
> 
> * The overall throughput dropped but interfered's utilization is now
>  significantly higher along with its bandwidth from lower completion
>  latencies.
> 
> For reference:
> 
> [Disabled]
> 
>  Aggregated throughput:
> 	   min         max         avg     std_dev     conf99%
> 	493.98      511.37     502.808     9.52773     107.621
>  Interfered total throughput:
> 	   min         max         avg     std_dev
> 	 0.056       0.304       0.107   0.0691052
> 
> [Enabled, no QoS config]
> 
>  Aggregated throughput:
> 	   min         max         avg     std_dev     conf99%
> 	429.07      449.59     437.597     8.64952     97.7015
>  Interfered total throughput:
> 	   min         max         avg     std_dev
> 	 0.456        3.12        1.08    0.774318
> 
> Thanks.
> 
> -- 
> tejun


  reply index

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-14  1:56 Tejun Heo
2019-06-14  1:56 ` [PATCH 01/10] blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn() Tejun Heo
2019-06-14  1:56 ` [PATCH 02/10] blkcg: make ->cpd_init_fn() optional Tejun Heo
2019-06-14  1:56 ` [PATCH 03/10] blkcg: separate blkcg_conf_get_disk() out of blkg_conf_prep() Tejun Heo
2019-06-14  1:56 ` [PATCH 04/10] block/rq_qos: add rq_qos_merge() Tejun Heo
2019-06-14  1:56 ` [PATCH 05/10] block/rq_qos: implement rq_qos_ops->queue_depth_changed() Tejun Heo
2019-06-14  1:56 ` [PATCH 06/10] blkcg: s/RQ_QOS_CGROUP/RQ_QOS_LATENCY/ Tejun Heo
2019-06-14  1:56 ` [PATCH 07/10] blk-mq: add optional request->pre_start_time_ns Tejun Heo
2019-06-14  1:56 ` [PATCH 08/10] blkcg: implement blk-ioweight Tejun Heo
2019-06-14 12:17   ` Toke Høiland-Jørgensen
2019-06-14 15:09     ` Tejun Heo
2019-06-14 20:50       ` Toke Høiland-Jørgensen
2019-06-15 15:57         ` Tejun Heo
2019-06-14  1:56 ` [PATCH 09/10] blkcg: add tools/cgroup/monitor_ioweight.py Tejun Heo
2019-06-14  1:56 ` [PATCH 10/10] blkcg: implement BPF_PROG_TYPE_IO_COST Tejun Heo
2019-06-14 11:32   ` Quentin Monnet
2019-06-14 14:52     ` Tejun Heo
2019-06-14 16:35       ` Alexei Starovoitov
2019-06-14 17:09         ` Tejun Heo
2019-06-14 17:56 ` [PATCHSET block/for-next] IO cost model based work-conserving porportional controller Tejun Heo
2019-08-20 10:48   ` Paolo Valente
2019-08-20 15:04     ` Paolo Valente
2019-08-20 15:19       ` Tejun Heo
2019-08-22  8:58         ` Paolo Valente
2019-08-31  6:53           ` Tejun Heo
2019-08-31  7:10             ` Paolo Valente [this message]
2019-08-31 11:20               ` Tejun Heo
2019-09-02 15:45             ` Paolo Valente
2019-09-02 15:56               ` Tejun Heo
2019-09-02 19:43                 ` Paolo Valente
2019-09-05 16:55                   ` Tejun Heo
2019-09-06  9:07                     ` Paolo Valente
2019-09-06 14:58                       ` Tejun Heo

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=73CA9E04-DD93-47E4-B0FE-0A12EC991DEF@linaro.org \
    --to=paolo.valente@linaro.org \
    --cc=ast@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=bpf@vger.kernel.org \
    --cc=cgroups@vger.kernel.org \
    --cc=clm@fb.com \
    --cc=daniel@iogearbox.net \
    --cc=dennisz@fb.com \
    --cc=hannes@cmpxchg.org \
    --cc=josef@toxicpanda.com \
    --cc=kafai@fb.com \
    --cc=kernel-team@fb.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizefan@huawei.com \
    --cc=newella@fb.com \
    --cc=songliubraving@fb.com \
    --cc=tj@kernel.org \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org linux-block@archiver.kernel.org
	public-inbox-index linux-block

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/ public-inbox