Linux-Block Archive on lore.kernel.org
From: Paolo Valente <paolo.valente@linaro.org>
To: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	newella@fb.com, clm@fb.com, Josef Bacik <josef@toxicpanda.com>,
	dennisz@fb.com, Li Zefan <lizefan@huawei.com>,
	hannes@cmpxchg.org, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, kernel-team@fb.com,
	cgroups@vger.kernel.org
Subject: Re: [PATCHSET v3 block/for-linus] IO cost model based work-conserving porportional controller
Date: Thu, 29 Aug 2019 17:54:38 +0200
Message-ID: <50384CF8-39F1-461C-9EC0-7314671E5604@linaro.org>
In-Reply-To: <20190828220600.2527417-1-tj@kernel.org>

Hi,
I see an important interface problem.  Userspace has been waiting for
io.weight to eventually become the file name for setting the weight
of the proportional-share policy [1,2].  If you use that name for this
controller, how will we resolve the conflict?

Thanks,
Paolo

[1] https://github.com/systemd/systemd/issues/7057#issuecomment-522747575
[2] https://github.com/systemd/systemd/pull/13335#issuecomment-523035303

> On 29 Aug 2019, at 00:05, Tejun Heo <tj@kernel.org> wrote:
> 
> Changes from v2[2]:
> 
> * Fixed a divide-by-zero bug in current_hweight().
> 
> * pre_start_time and friends were renamed to alloc_time and now have
>  their own CONFIG option, which is selected by IOCOST.
> 
> Changes from v1[1]:
> 
> * Prerequisite patchsets had cosmetic changes and were merged.
>  Refreshed on top.
> 
> * Renamed from ioweight to iocost.  All source code and tools are
>  updated accordingly.  Control knobs io.weight.qos and
>  io.weight.cost_model are renamed to io.cost.qos and io.cost.model
>  respectively.  This is a more fitting name which won't become a
>  misnomer when, for example, cost based io.max is added.
> 
> * Various bug fixes and improvements.  A few bugs were discovered
>  while testing against high-iops nvme device.  Auto parameter
>  selection improved and verified across different classes of SSDs.
> 
> * Dropped bpf iocost support for now.
> 
> * Added coef generation script.
> 
> * Verified on high-iops nvme device.  Result is included below.
> 
> One challenge of controlling IO resources is the lack of a trivially
> observable cost metric.  This is in contrast to CPU and memory,
> where wallclock time and the number of bytes can serve as accurate
> enough approximations.
> 
> Bandwidth and iops are the most commonly used metrics for IO devices,
> but depending on the type and specifics of the device, different IO
> patterns easily lead to variations of multiple orders of magnitude,
> rendering them useless for the purpose of IO capacity distribution.
> While on-device time, with a lot of crutches, could serve as a useful
> approximation for non-queued rotational devices, this is no longer
> viable with modern devices, even the rotational ones.
> 
> While there is no cost metric we can trivially observe, it isn't a
> complete mystery.  For example, on a rotational device, seek cost
> dominates while a contiguous transfer contributes a smaller amount
> proportional to the size.  If we can characterize at least the
> relative costs of these different types of IOs, it should be possible
> to implement a reasonable work-conserving proportional IO resource
> distribution.
> 
> This patchset implements an IO cost model based work-conserving
> proportional controller.  It currently has a simple linear cost model
> built in, where each IO is classified as sequential or random and
> given a base cost accordingly, with an additional size-proportional
> cost added on top.  Each IO is given a cost based on the model, and
> the controller issues IOs for each cgroup according to its
> hierarchical weight.
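
A minimal sketch of the linear model described above, in Python and for
illustration only.  The class name, the coefficient fields and the
sequential-detection threshold are all assumptions made for this
example; none of them are taken from block/blk-iocost.c.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # bytes; assumed unit for the size-proportional term


@dataclass
class LinearCostModel:
    """Toy linear model: base cost by access pattern plus per-page cost."""
    seq_base: float   # hypothetical base cost of a sequential IO
    rand_base: float  # hypothetical base cost of a random IO
    per_page: float   # hypothetical cost per 4 KiB page transferred
    # Hypothetical rule: an IO starting within this many bytes of the end
    # of the previous IO counts as sequential.
    seq_threshold: int = 0

    def cost(self, prev_end: int, offset: int, length: int) -> float:
        is_seq = abs(offset - prev_end) <= self.seq_threshold
        base = self.seq_base if is_seq else self.rand_base
        pages = -(-length // PAGE_SIZE)  # ceiling division
        return base + self.per_page * pages


def total_cost(model, ios):
    """Sum the cost of (offset, length) IOs issued in the given order."""
    prev_end = 0
    total = 0.0
    for offset, length in ios:
        total += model.cost(prev_end, offset, length)
        prev_end = offset + length
    return total
```

With made-up coefficients such as seq_base=1.0, rand_base=10.0 and
per_page=0.5, a 4 KiB IO that continues the previous one costs 1.5
while the same IO at an unrelated offset costs 10.5, which is the kind
of relative difference the controller would charge against a cgroup.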
> 
> By default, the controller adapts its overall IO rate so that it
> doesn't build up buffer bloat in the request_queue layer, which
> guarantees that the controller doesn't lose a significant amount of
> total work.  However, this may not provide sufficient differentiation
> as the underlying device may have a deep queue and not be fair in how
> the queued IOs are serviced.  The controller provides extra QoS
> control knobs which allow tightening the control feedback loop as
> necessary.
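
To make the idea of a latency-driven feedback loop concrete, here is a
deliberately simplified sketch.  It is not the algorithm in the
patchset: the function names, the fixed step size and the rule "slow
down when the latency percentile misses its target, otherwise speed up"
are assumptions chosen only to show how a percentile target and
min/max bounds could steer an overall rate.

```python
import math


def latency_percentile(samples_us, pct):
    """Nearest-rank percentile of a non-empty list of latencies (usec)."""
    ordered = sorted(samples_us)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank - 1, 0)]


def adjust_rate(rate_pct, samples_us, pct, target_us,
                min_pct, max_pct, step=5.0):
    """Return the new rate (percent of nominal) after one period.

    Hypothetical policy: lower the rate when the chosen latency
    percentile exceeds the target, raise it otherwise, and always clamp
    the result to [min_pct, max_pct].
    """
    if not samples_us:
        return rate_pct  # nothing completed this period; leave rate alone
    if latency_percentile(samples_us, pct) > target_us:
        rate_pct -= step
    else:
        rate_pct += step
    return min(max(rate_pct, min_pct), max_pct)
```

A tighter latency target or a narrower min/max window makes such a loop
throttle earlier, which is the trade-off between total work and
differentiation that the paragraph above describes.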
> 
> For more details on the control mechanism, implementation and
> interface, please refer to the comment at the top of
> block/blk-iocost.c and Documentation/admin-guide/cgroup-v2.rst changes
> in the "blkcg: implement blk-iocost" patch.
> 
> Here are some test results.  Each test run goes through the following
> combinations with each combination running for a minute.  All tests
> are performed against regular files on btrfs w/ deadline as the IO
> scheduler.  Random IOs are direct w/ queue depth of 64.  Sequential
> IOs are normal buffered IOs.
> 
>        high priority (weight=500)      low priority (weight=100)
> 
>        Rand read                       None
>        ditto                           Rand read
>        ditto                           Seq  read
>        ditto                           Rand write
>        ditto                           Seq  write
>        Seq  read                       None
>        ditto                           Rand read
>        ditto                           Seq  read
>        ditto                           Rand write
>        ditto                           Seq  write
>        Rand write                      None
>        ditto                           Rand read
>        ditto                           Seq  read
>        ditto                           Rand write
>        ditto                           Seq  write
>        Seq  write                      None
>        ditto                           Rand read
>        ditto                           Seq  read
>        ditto                           Rand write
>        ditto                           Seq  write
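
For reference, the ideal split implied by the weights in the table can
be computed as below.  This is a sketch of work-conserving
hierarchical proportional sharing in general, not code from the
patchset; the function name and the dictionary layout are assumptions.
A cgroup's share is the product, along its path to the root, of its
weight divided by the summed weights of its currently active siblings.

```python
def hierarchical_shares(parents, weights, active_leaves):
    """Return {leaf: fraction of device capacity} for the active leaves.

    parents: {name: parent name, or None for the root}
    weights: {name: weight} for every non-root cgroup
    active_leaves: cgroups that currently have IO to issue
    """
    # A cgroup is active if it, or any of its descendants, is issuing IO.
    active = set()
    for leaf in active_leaves:
        node = leaf
        while node is not None and node not in active:
            active.add(node)
            node = parents[node]

    def share(node):
        parent = parents[node]
        if parent is None:
            return 1.0  # the root owns the whole device
        sibling_sum = sum(weights[n] for n in active if parents[n] == parent)
        return share(parent) * weights[node] / sibling_sum

    return {leaf: share(leaf) for leaf in active_leaves}


if __name__ == "__main__":
    parents = {"root": None, "high": "root", "low": "root"}
    weights = {"high": 500, "low": 100}
    print(hierarchical_shares(parents, weights, ["high", "low"]))
    print(hierarchical_shares(parents, weights, ["high"]))
```

With weights 500 and 100 the targets are 5/6 and 1/6 of the device when
both cgroups are busy, and the whole device for the high priority
cgroup in the "None" rows, since an idle cgroup's share is not
reserved.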
> 
> * 7200RPM SATA hard disk
>  * No IO control
>    https://photos.app.goo.gl/1KBHn7ykpC1LXRkB8
>  * iocost, QoS: None
>    https://photos.app.goo.gl/MLNQGxCtBQ8wAmjm7
>  * iocost, QoS: rpct=95.00 rlat=40000 wpct=95.00 wlat=40000 min=25.00 max=200.00
>    https://photos.app.goo.gl/XqXHm3Mkbm9w6Db46
> * NCQ-blacklisted SATA SSD (QD==1)
>  * No IO control
>    https://photos.app.goo.gl/wCTXeu2uJ6LYL4pk8
>  * iocost, QoS: None
>    https://photos.app.goo.gl/T2HedKD2sywQgj7R9
>  * iocost, QoS: rpct=95.00 rlat=20000 wpct=95.00 wlat=20000 min=50.00 max=200.00
>    https://photos.app.goo.gl/urBTV8XQc1UqPJJw7
> * SATA SSD (QD==32)
>  * No IO control
>    https://photos.app.goo.gl/TjEVykuVudSQcryh6
>  * iocost, QoS: None
>    https://photos.app.goo.gl/iyQBsky7bmM54Xiq7
>  * iocost, QoS: rpct=95.00 rlat=10000 wpct=95.00 wlat=20000 min=50.00 max=400.00
>    https://photos.app.goo.gl/q1a6URLDxPLMrnHy5
> * NVME SSD (ran with 8 concurrent fio jobs to achieve saturation)
>  * No IO control
>    https://photos.app.goo.gl/S6xjEVTJzcfb3w1j7
>  * iocost, QoS: None
>    https://photos.app.goo.gl/SjQUUotJBAGr7vqz7
>  * iocost, QoS: rpct=95.00 rlat=5000 wpct=95.00 wlat=5000 min=1.00 max=10000.00
>    https://photos.app.goo.gl/RsaYBd2muX7CegoN7
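
The QoS settings listed above are written to io.cost.qos as a single
line of key=value pairs per device.  The helper below only builds such
a line.  The "enable" and "ctrl" keys follow my reading of the
cgroup-v2.rst changes in the series, the helper name is invented, and
the device number 8:16 is a placeholder, so check the documentation
before using it.

```python
def format_qos_line(devno, rpct, rlat_us, wpct, wlat_us, min_pct, max_pct):
    """Build one io.cost.qos configuration line for a block device.

    devno is the device's "MAJ:MIN" string; latencies are in usec.
    """
    return (
        f"{devno} enable=1 ctrl=user "
        f"rpct={rpct:.2f} rlat={rlat_us} "
        f"wpct={wpct:.2f} wlat={wlat_us} "
        f"min={min_pct:.2f} max={max_pct:.2f}"
    )


if __name__ == "__main__":
    # Placeholder device number; values from the 7200RPM hard disk run.
    print(format_qos_line("8:16", 95, 40000, 95, 40000, 25, 200))
```

The resulting string would then be written to the io.cost.qos file of
the root cgroup, assuming the cgroup2 hierarchy is mounted and the
controller is built in.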
> 
> Even without explicit QoS configuration, read-heavy scenarios can
> obtain acceptable differentiation.  However, when write-heavy, the
> deep buffering on the device side makes it difficult to maintain
> control.  With QoS parameters set, the differentiation is acceptable
> across all combinations.
> 
> The implementation comes with default cost model parameters, which
> are selected automatically and should provide acceptable behavior
> across most common devices.  The parameters for hdd and consumer-grade
> SSDs seem pretty robust.  The default parameter set and selection
> criteria for high-end SSDs might need further adjustments.
> 
> It is fairly easy to configure the QoS parameters and, if needed, cost
> model coefficients.  We'll follow up with tooling and further
> documentation.  Also, the last RFC patch in the series implements
> support for a bpf-based custom cost function.  Originally we thought
> that we'd need per-device-type cost functions, but the simple linear
> model now seems good enough to cover all common device classes.  In
> case custom cost functions become necessary, we can fully develop the
> bpf-based extension and also easily add different builtin cost models.
> 
> Andy Newell did the heavy lifting of analyzing IO workloads and device
> characteristics, exploring various cost models, and determining the
> default model and parameters to use.
> 
> Josef Bacik implemented a prototype which explored the use of
> different types of cost metrics including on-device time and Andy's
> linear model.
> 
> This patchset is on top of the current block/for-next 53fc55c817c3
> ("Merge branch 'for-5.4/block' into for-next") and contains the
> following 10 patches.
> 
> 0001-blkcg-pass-q-and-blkcg-into-blkcg_pol_alloc_pd_fn.patch
> 0002-blkcg-make-cpd_init_fn-optional.patch
> 0003-blkcg-separate-blkcg_conf_get_disk-out-of-blkg_conf_.patch
> 0004-block-rq_qos-add-rq_qos_merge.patch
> 0005-block-rq_qos-implement-rq_qos_ops-queue_depth_change.patch
> 0006-blkcg-s-RQ_QOS_CGROUP-RQ_QOS_LATENCY.patch
> 0007-blk-mq-add-optional-request-alloc_time_ns.patch
> 0008-blkcg-implement-blk-iocost.patch
> 0009-blkcg-add-tools-cgroup-iocost_monitor.py.patch
> 0010-blkcg-add-tools-cgroup-iocost_coef_gen.py.patch
> 
> 0001-0007 are prep patches.
> 0008 implements blk-iocost.
> 0009 adds monitoring script.
> 0010 adds linear cost model coefficient generation script.
> 
> The patchset is also available in the following git branch.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-iow-v2
> 
> diffstat follows.  Thanks.
> 
> Documentation/admin-guide/cgroup-v2.rst |   97 +
> block/Kconfig                           |   13 
> block/Makefile                          |    1 
> block/bfq-cgroup.c                      |    5 
> block/blk-cgroup.c                      |   71 
> block/blk-core.c                        |    4 
> block/blk-iocost.c                      | 2395 ++++++++++++++++++++++++++++++++
> block/blk-iolatency.c                   |    8 
> block/blk-mq.c                          |   13 
> block/blk-rq-qos.c                      |   18 
> block/blk-rq-qos.h                      |   28 
> block/blk-settings.c                    |    2 
> block/blk-throttle.c                    |    6 
> block/blk-wbt.c                         |   18 
> block/blk-wbt.h                         |    4 
> include/linux/blk-cgroup.h              |    4 
> include/linux/blk_types.h               |    3 
> include/linux/blkdev.h                  |   13 
> include/trace/events/iocost.h           |  174 ++
> tools/cgroup/iocost_coef_gen.py         |  178 ++
> tools/cgroup/iocost_monitor.py          |  270 +++
> 21 files changed, 3272 insertions(+), 53 deletions(-)
> 
> --
> tejun
> 
> [1] http://lkml.kernel.org/r/20190614015620.1587672-1-tj@kernel.org
> [2] http://lkml.kernel.org/r/20190710205128.1316483-1-tj@kernel.org
> 


Thread overview: 29+ messages
2019-08-28 22:05 Tejun Heo
2019-08-28 22:05 ` [PATCH 01/10] blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn() Tejun Heo
2019-08-28 22:05 ` [PATCH 02/10] blkcg: make ->cpd_init_fn() optional Tejun Heo
2019-08-28 22:05 ` [PATCH 03/10] blkcg: separate blkcg_conf_get_disk() out of blkg_conf_prep() Tejun Heo
2019-08-28 22:05 ` [PATCH 04/10] block/rq_qos: add rq_qos_merge() Tejun Heo
2019-08-28 22:05 ` [PATCH 05/10] block/rq_qos: implement rq_qos_ops->queue_depth_changed() Tejun Heo
2019-08-28 22:05 ` [PATCH 06/10] blkcg: s/RQ_QOS_CGROUP/RQ_QOS_LATENCY/ Tejun Heo
2019-08-28 22:05 ` [PATCH 07/10] blk-mq: add optional request->alloc_time_ns Tejun Heo
2019-08-28 22:05 ` [PATCH 08/10] blkcg: implement blk-iocost Tejun Heo
2019-08-29 15:53   ` [PATCH] blkcg: fix missing free on error path of blk_iocost_init() Tejun Heo
2019-09-10 12:55   ` [PATCH 08/10] blkcg: implement blk-iocost Michal Koutný
2019-09-10 16:08     ` Tejun Heo
2019-09-11  8:18       ` Paolo Valente
2019-09-11 14:16         ` Tejun Heo
2019-09-11 15:54           ` Tejun Heo
2019-09-11 16:44           ` Paolo Valente
2019-10-03 14:51       ` Michal Koutný
2019-10-03 16:45         ` Tejun Heo
2019-10-09 15:36           ` Michal Koutný
2019-10-14 15:36             ` Tejun Heo
2019-11-01 16:15               ` Michal Koutný
2019-11-01 16:56                 ` Paolo Valente
2019-08-28 22:05 ` [PATCH 09/10] blkcg: add tools/cgroup/iocost_monitor.py Tejun Heo
2019-08-28 22:06 ` [PATCH 10/10] blkcg: add tools/cgroup/iocost_coef_gen.py Tejun Heo
2019-08-29  3:29 ` [PATCHSET v3 block/for-linus] IO cost model based work-conserving porportional controller Jens Axboe
     [not found] ` <20190829082248.6464-1-hdanton@sina.com>
2019-08-29 15:43   ` [PATCH 07/10] blk-mq: add optional request->alloc_time_ns Tejun Heo
     [not found] ` <20190829133928.16192-1-hdanton@sina.com>
2019-08-29 15:46   ` [PATCH 08/10] blkcg: implement blk-iocost Tejun Heo
2019-08-29 15:54 ` Paolo Valente [this message]
2019-08-29 15:56   ` [PATCHSET v3 block/for-linus] IO cost model based work-conserving porportional controller Tejun Heo
