Linux-Block Archive on
* [PATCHSET block/for-next] IO cost model based work-conserving proportional controller
@ 2019-06-14  1:56 Tejun Heo
From: Tejun Heo @ 2019-06-14  1:56 UTC
  To: axboe, newella, clm, josef, dennisz, lizefan, hannes
  Cc: linux-kernel, linux-block, kernel-team, cgroups, ast, daniel,
	kafai, songliubraving, yhs, bpf

One challenge of controlling IO resources is the lack of a trivially
observable cost metric.  This is in contrast to CPU and memory, where
wallclock time and the number of bytes can serve as accurate enough
approximations.

Bandwidth and iops are the most commonly used metrics for IO devices
but, depending on the type and specifics of the device, different IO
patterns easily lead to variations of multiple orders of magnitude,
rendering them useless for the purpose of IO capacity distribution.
While on-device time, with a lot of crutches, could serve as a useful
approximation for non-queued rotational devices, this is no longer
viable with modern devices, even the rotational ones.

While there is no cost metric we can trivially observe, it isn't a
complete mystery.  For example, on a rotational device, seek cost
dominates while a contiguous transfer contributes a smaller amount
proportional to the size.  If we can characterize at least the
relative costs of these different types of IOs, it should be possible
to implement a reasonable work-conserving proportional IO resource
distribution.

This patchset implements an IO cost model based work-conserving
proportional controller.  It currently has a simple linear cost model
built in, where each IO is classified as sequential or random and
given a base cost accordingly, with an additional size-proportional
cost added on top.  Each IO is given a cost based on the model and the
controller issues IOs for each cgroup according to its hierarchical
weight.
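
The linear model described above can be sketched roughly as follows.
This is only an illustration, not the code in block/blk-ioweight.c:
the class name, the coefficient values and the rule used to decide
whether an IO is sequential are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinearCostModel:
    # All three coefficients are made-up illustration values in
    # arbitrary cost units, not the defaults from the patchset.
    seq_base: float   # base cost of a sequential IO
    rand_base: float  # base cost of a random IO
    per_byte: float   # size-proportional cost added on top

    def cost(self, offset: int, nbytes: int, prev_end: Optional[int]) -> float:
        # Assumption: an IO counts as sequential only when it starts
        # exactly where the previous IO ended.
        sequential = prev_end is not None and offset == prev_end
        base = self.seq_base if sequential else self.rand_base
        return base + self.per_byte * nbytes


model = LinearCostModel(seq_base=1.0, rand_base=10.0, per_byte=0.001)
rand_cost = model.cost(offset=0, nbytes=4096, prev_end=None)
seq_cost = model.cost(offset=4096, nbytes=4096, prev_end=4096)
```

With these made-up coefficients a 4k random IO costs more than a 4k
sequential one only because of the base cost; the size-proportional
part is identical for both.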

By default, the controller adapts its overall IO rate so that it
doesn't build up buffer bloat in the request_queue layer, which
guarantees that the controller doesn't lose a significant amount of
total work.  However, this may not provide sufficient differentiation
as the underlying device may have a deep queue and not be fair in how
the queued IOs are serviced.  The controller provides extra QoS
control knobs which allow tightening the control feedback loop as
necessary.
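
As a rough idea of what a tighter feedback loop could look like, the
sketch below lowers the overall issue rate when too few IOs complete
within a latency target and raises it otherwise, clamped to a
configured range.  The function, its parameters and the fixed step
size are hypothetical; this is one possible shape for such a loop
under those assumptions, not the algorithm implemented in the
patchset.

```python
def adjust_rate(rate_pct, latencies, lat_target, pct_target,
                min_pct, max_pct, step_pct=5.0):
    """Return the next overall IO rate, as a percentage of nominal.

    latencies  -- completion latencies observed in the last period
    lat_target -- latency threshold, same unit as latencies
    pct_target -- percentage of IOs that should meet lat_target
    """
    if not latencies:
        # Nothing completed in this period, so there is no signal.
        return rate_pct
    met = sum(1 for lat in latencies if lat <= lat_target)
    met_pct = 100.0 * met / len(latencies)
    if met_pct < pct_target:
        rate_pct -= step_pct  # device looks saturated, back off
    else:
        rate_pct += step_pct  # target met, allow more IO
    # Never leave the configured [min_pct, max_pct] range.
    return max(min_pct, min(max_pct, rate_pct))
```

For example, starting at 100% with only three out of four completions
under the threshold and a 95% target, the next rate would be 95%.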

For more details on the control mechanism, implementation and
interface, please refer to the comment at the top of
block/blk-ioweight.c and the Documentation/admin-guide/cgroup-v2.rst
changes in the "blkcg: implement blk-ioweight" patch.

Here are some test results.  Each test run goes through the following
combinations, with each combination running for a minute.  All tests
are performed against regular files on btrfs w/ deadline as the IO
scheduler.  Random IOs are direct w/ queue depth of 64.  Sequential
IOs are normal buffered IOs.

	high priority (weight=500)	low priority (weight=100)

	Rand read			None
	ditto				Rand read
	ditto				Seq  read
	ditto				Rand write
	ditto				Seq  write
	Seq  read			None
	ditto				Rand read
	ditto				Seq  read
	ditto				Rand write
	ditto				Seq  write
	Rand write			None
	ditto				Rand read
	ditto				Seq  read
	ditto				Rand write
	ditto				Seq  write
	Seq  write			None
	ditto				Rand read
	ditto				Seq  read
	ditto				Rand write
	ditto				Seq  write
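
The table above is the cross product of four high priority workloads
with five low priority choices.  A small sketch that enumerates it in
the same order, which may be handy when scripting similar runs; the
names used here are made up for the example.

```python
from itertools import product

PATTERNS = ["Rand read", "Seq read", "Rand write", "Seq write"]
# The low priority side additionally has the "None" (idle) case.
LOW_PRIORITY = [None] + PATTERNS
MINUTES_PER_COMBINATION = 1


def build_matrix():
    # Outer loop over the high priority workload, inner loop over the
    # low priority one, matching the row order of the table.
    return list(product(PATTERNS, LOW_PRIORITY))


matrix = build_matrix()
total_minutes = len(matrix) * MINUTES_PER_COMBINATION
```

This gives 20 combinations, so a full run on one configuration takes
about 20 minutes of measurement time.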

* 7200RPM SATA hard disk
  * No IO control
  * ioweight, QoS: None
  * ioweight, QoS: rpct=95.00 rlat=40000 wpct=95.00 wlat=40000 min=25.00 max=200.00
* NCQ-blacklisted SATA SSD (QD==1)
  * No IO control
  * ioweight, QoS: None
  * ioweight, QoS: rpct=95.00 rlat=20000 wpct=95.00 wlat=20000 min=50.00 max=200.00
* SATA SSD (QD==32)
  * No IO control
  * ioweight, QoS: None
  * ioweight, QoS: rpct=95.00 rlat=10000 wpct=95.00 wlat=20000 min=50.00 max=400.00
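
The QoS settings listed above are written as space-separated key=value
pairs.  When comparing configurations in a script, a string like that
can be read into a dictionary as sketched below.  parse_qos is a
hypothetical helper for post-processing, not the kernel's parser, and
it assumes every value is numeric.

```python
def parse_qos(text):
    """Parse 'key=value key=value ...' into a dict of floats."""
    params = {}
    for field in text.split():
        key, _, value = field.partition("=")
        params[key] = float(value)
    return params


hdd_qos = parse_qos(
    "rpct=95.00 rlat=40000 wpct=95.00 wlat=40000 min=25.00 max=200.00"
)
```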

Even without explicit QoS configuration, read-heavy scenarios can
obtain acceptable differentiation.  However, when write-heavy, the
deep buffering on the device side makes it difficult to maintain
control.  With QoS parameters set, the differentiation is acceptable
across all combinations.
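
As a reference point for judging differentiation: if capacity were
split strictly by weight between the two active cgroups, the weights
used in these tests (500 and 100) would correspond to shares of 5/6
and 1/6.  The sketch below only does that arithmetic.  ideal_shares is
a hypothetical helper and treats the cgroups as flat siblings, whereas
the controller works with hierarchical weights.

```python
def ideal_shares(weights):
    """Map each cgroup name to its weight-proportional share."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = ideal_shares({"high": 500, "low": 100})
```

When the low priority side is idle ("None" in the table), the high
priority cgroup is the only active one and a work-conserving
controller would let it use the whole device.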

The implementation comes with default cost model parameters which are
selected automatically and should provide acceptable behavior across
most common devices.  The parameters for hdd and consumer-grade SSDs
seem pretty robust.  The default parameter set and selection criteria
for high-end SSDs might need further adjustments.

It is fairly easy to configure the QoS parameters and, if needed, cost
model coefficients.  We'll follow up with tooling and further
documentation.  Also, the last RFC patch in the series implements
support for a bpf-based custom cost function.  Originally we thought
that we'd need per-device-type cost functions but the simple linear
model now seems good enough to cover all common device classes.  In
case custom cost functions become necessary, we can fully develop the
bpf based extension and also easily add different builtin cost models.

Andy Newell did the heavy lifting of analyzing IO workloads and device
characteristics, exploring various cost models, and determining the
default model and parameters to use.

Josef Bacik implemented a prototype which explored the use of
different types of cost metrics including on-device time and Andy's
linear model.

This patchset is on top of
    git:// for-5.3
+ [PATCHSET block/for-linus] Assorted blkcg fixes
+ [PATCHSET btrfs/for-next] btrfs: fix cgroup writeback support

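This patchset contains the following 10 patches.

 0001 blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()
 0002 blkcg: make ->cpd_init_fn() optional
 0003 blkcg: separate blkcg_conf_get_disk() out of blkg_conf_prep()
 0004 block/rq_qos: add rq_qos_merge()
 0005 block/rq_qos: implement rq_qos_ops->queue_depth_changed()
 0006 blkcg: s/RQ_QOS_CGROUP/RQ_QOS_LATENCY/
 0007 blk-mq: add optional request->pre_start_time_ns
 0008 blkcg: implement blk-ioweight
 0009 blkcg: add tools/cgroup/
 0010 blkcg: implement BPF_PROG_TYPE_IO_COST
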
0001-0007 are prep patches.
0008 implements blk-ioweight.
0009 adds monitoring script.
0010 is the RFC patch for BPF cost function.

The patchset is also available in the following git branch.

 git:// review-iow

diffstat follows.  Thanks.

 Documentation/admin-guide/cgroup-v2.rst                |   93 
 block/Kconfig                                          |   12 
 block/Makefile                                         |    1 
 block/bfq-cgroup.c                                     |    5 
 block/blk-cgroup.c                                     |   71 
 block/blk-core.c                                       |    4 
 block/blk-iolatency.c                                  |    8 
 block/blk-ioweight.c                                   | 2509 +++++++++++++++++
 block/blk-mq.c                                         |   11 
 block/blk-rq-qos.c                                     |   18 
 block/blk-rq-qos.h                                     |   28 
 block/blk-settings.c                                   |    2 
 block/blk-throttle.c                                   |    6 
 block/blk-wbt.c                                        |   18 
 block/blk-wbt.h                                        |    4 
 block/blk.h                                            |    8 
 block/ioctl.c                                          |    4 
 include/linux/blk-cgroup.h                             |    4 
 include/linux/blk_types.h                              |    3 
 include/linux/blkdev.h                                 |    7 
 include/linux/bpf_types.h                              |    3 
 include/trace/events/ioweight.h                        |  174 +
 include/uapi/linux/bpf.h                               |   11 
 include/uapi/linux/fs.h                                |    2 
 tools/bpf/bpftool/feature.c                            |    3 
 tools/bpf/bpftool/main.h                               |    1 
 tools/cgroup/                       |  264 +
 tools/include/uapi/linux/bpf.h                         |   11 
 tools/include/uapi/linux/fs.h                          |    2 
 tools/lib/bpf/libbpf.c                                 |    2 
 tools/lib/bpf/libbpf_probes.c                          |    1 
 tools/testing/selftests/bpf/Makefile                   |    2 
 tools/testing/selftests/bpf/iocost_ctrl.c              |   43 
 tools/testing/selftests/bpf/progs/iocost_linear_prog.c |   52 
 34 files changed, 3333 insertions(+), 54 deletions(-)



Thread overview: 34+ messages
2019-06-14  1:56 [PATCHSET block/for-next] IO cost model based work-conserving proportional controller Tejun Heo
2019-06-14  1:56 ` [PATCH 01/10] blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn() Tejun Heo
2019-06-14  1:56 ` [PATCH 02/10] blkcg: make ->cpd_init_fn() optional Tejun Heo
2019-06-14  1:56 ` [PATCH 03/10] blkcg: separate blkcg_conf_get_disk() out of blkg_conf_prep() Tejun Heo
2019-06-14  1:56 ` [PATCH 04/10] block/rq_qos: add rq_qos_merge() Tejun Heo
2019-06-14  1:56 ` [PATCH 05/10] block/rq_qos: implement rq_qos_ops->queue_depth_changed() Tejun Heo
2019-06-14  1:56 ` [PATCH 06/10] blkcg: s/RQ_QOS_CGROUP/RQ_QOS_LATENCY/ Tejun Heo
2019-06-14  1:56 ` [PATCH 07/10] blk-mq: add optional request->pre_start_time_ns Tejun Heo
2019-06-14  1:56 ` [PATCH 08/10] blkcg: implement blk-ioweight Tejun Heo
2019-06-14 12:17   ` Toke Høiland-Jørgensen
2019-06-14 15:09     ` Tejun Heo
2019-06-14 20:50       ` Toke Høiland-Jørgensen
2019-06-15 15:57         ` Tejun Heo
2019-06-14  1:56 ` [PATCH 09/10] blkcg: add tools/cgroup/ Tejun Heo
2019-06-14  1:56 ` [PATCH 10/10] blkcg: implement BPF_PROG_TYPE_IO_COST Tejun Heo
2019-06-14 11:32   ` Quentin Monnet
2019-06-14 14:52     ` Tejun Heo
2019-06-14 16:35       ` Alexei Starovoitov
2019-06-14 17:09         ` Tejun Heo
2019-06-14 17:56 ` [PATCHSET block/for-next] IO cost model based work-conserving proportional controller Tejun Heo
2019-08-20 10:48   ` Paolo Valente
2019-08-20 15:04     ` Paolo Valente
2019-08-20 15:19       ` Tejun Heo
2019-08-22  8:58         ` Paolo Valente
2019-08-31  6:53           ` Tejun Heo
2019-08-31  7:10             ` Paolo Valente
2019-08-31 11:20               ` Tejun Heo
2019-09-02 15:45             ` Paolo Valente
2019-09-02 15:56               ` Tejun Heo
2019-09-02 19:43                 ` Paolo Valente
2019-09-05 16:55                   ` Tejun Heo
2019-09-06  9:07                     ` Paolo Valente
2019-09-06 14:58                       ` Tejun Heo
2020-02-19 18:34                         ` Paolo Valente
