From: Tejun Heo <tj@kernel.org>
To: Paolo Valente <paolo.valente@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	Fabio Checconi <fchecconi@gmail.com>,
	Arianna Avanzini <avanzini.arianna@gmail.com>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	Ulf Hansson <ulf.hansson@linaro.org>,
	Linus Walleij <linus.walleij@linaro.org>,
	broonie@kernel.org
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler
Date: Wed, 17 Feb 2016 12:02:06 -0500	[thread overview]
Message-ID: <20160217170206.GU3741@mtj.duckdns.org> (raw)
In-Reply-To: <8FDE2B10-9BD2-4741-917F-5A37A74E5B58@linaro.org>

Hello,

On Wed, Feb 17, 2016 at 10:02:00AM +0100, Paolo Valente wrote:
> your first statement “bfq is using bandwidth as the virtual time" is
> not very clear to me. In contrast, after that you raise a clear,

Another way to put that would be "bfq is using bandwidth as the
measure of IO resource provided by a device and to be distributed".

...
> Bandwidth fluctuation is one of the key issues addressed in the very
> definition of BFQ (more precisely in the definition of the engine of
> BFQ, introduced in this patch). In particular, BFQ schedules queues
> so as to approximate an ideal service in which every queue receives,
> over any time interval, its share of the bandwidth, regardless of
> the actual value of the bandwidth, and even if the bandwidth
> fluctuates. IOW, in the ideal service approximated by BFQ, if the

So, this is a problem.  For illustration, let's ignore the penalization
of seeky workloads and assume a hard drive which can do 100MB/s for a
fully sequential workload and 1MB/s for a 4k random one.  In other
words, a sequential workload doing 100MB/s is consuming 100% of the
available IO resource, and so is a 4k random workload doing 1MB/s.

Let's say there are two sequential workloads.  If the two workloads
are interleaved, the total bandwidth achieved would be lower than
100MB/s, with the amount of drop depending on how often the two
workloads are switched.  Given a big enough scheduling stride, the
total won't be too much lower, and, assuming equal weights, each
should be getting a bit less than 50MB/s of bandwidth.

If we do the same for two random workloads, the story is the same.
Each should be getting close to 512KB/s of bandwidth.

Now let's put one sequential workload and one random workload on the
device.  What *should* happen is clear - the sequential workload
should be achieving close to 50MB/s while the random one 512KB/s so
that each is still consuming the equal proportion of the available IO
resource on the device.  For devices with low concurrency such as
disks, this measure of IO resources can be almost directly represented
as how long each workload occupies the device.
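To make the arithmetic above concrete, here is a toy Python sketch
(mine, purely illustrative, not bfq code) of what time-based fairness
yields for the mixes described above:

```python
# Toy model: a disk that sustains 100 MB/s for fully sequential I/O
# and 1 MB/s for 4k random I/O (the assumed rates from the example).
# Under time-based fairness, each workload occupies the device for a
# share of time proportional to its weight, and its achieved
# bandwidth is simply its own peak rate scaled by that time share.

SEQ_RATE_MBS = 100.0   # sequential peak rate, MB/s (assumed)
RND_RATE_MBS = 1.0     # 4k random peak rate, MB/s (assumed)

def time_fair_bandwidth(rates, weights):
    """Bandwidth each workload achieves when device *time* is split
    in proportion to the given weights."""
    total_w = sum(weights)
    return [r * w / total_w for r, w in zip(rates, weights)]

# One sequential and one random workload, equal weights:
seq_bw, rnd_bw = time_fair_bandwidth([SEQ_RATE_MBS, RND_RATE_MBS], [1, 1])
print(seq_bw, rnd_bw)  # 50.0 0.5
```

With equal weights, the sequential workload ends up near 50MB/s and
the random one near 0.5MB/s, i.e. each occupies half of the device's
time, matching the "what *should* happen" case above.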

If we use the proportion of bandwidth a workload is getting as the
measure of IO resource consumed, the picture becomes very different
and, I think, lost.  The amount of bandwidth available is heavily
dependent on what mix of IOs the device is getting, which would then
feed back into how much IO resource each workload consumes, forming a
feedback loop.  More importantly, a random workload would be
consuming far more IO resource than it's entitled to.  I assume that
is why bfq currently needs the special punishment of seeky workloads.
In the end, what that coefficient does is try to translate bandwidth
consumption into IO time consumption in an ad-hoc and inaccurate way.
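A quick back-of-the-envelope sketch (again purely illustrative, not
bfq code) shows where that leads: if the scheduler insists on giving
both workloads equal bandwidth, the device time is split wildly
unevenly in favor of the random one:

```python
# Toy model: if every workload is granted the *same bandwidth*, the
# time each one occupies the device is proportional to the time it
# needs per MB delivered, i.e. the inverse of its peak rate.
# Rates as before: 100 MB/s sequential, 1 MB/s random (assumed).

def time_consumed_for_equal_bandwidth(rates):
    """Fraction of device time each workload occupies when all
    workloads are granted equal bandwidth."""
    shares = [1.0 / r for r in rates]   # device time per MB delivered
    total = sum(shares)
    return [s / total for s in shares]

seq_t, rnd_t = time_consumed_for_equal_bandwidth([100.0, 1.0])
print(round(seq_t, 4), round(rnd_t, 4))  # 0.0099 0.9901
```

Equal bandwidth hands the random workload about 99% of the device's
time, which is the distortion the seeky-workload coefficient then
has to claw back.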

> bandwidth fluctuates during a given time interval, then, in every
> sub-interval during which the bandwidth is constant (possibly an
> infinitesimal time interval), each queue receives a fraction of the
> total bandwidth equal to the ratio between its weight and the sum of
> the weights of the active queues. BFQ is an optimal scheduler in
> that it guarantees the minimum possible deviation, for a
> slice-by-slice service scheme, with respect to the ideal
> service. Finally, BFQ actually provides these fairness guarantees
> only to queues whose I/O pattern yields a reasonably high
> throughput. This fact has to do with the issue you raise.

I don't think it's fair at all.  It distributes an undue amount of IO
resources to seeky workloads and then tries to compensate for it by
punishing them with a special coefficient.  The devices bfq is
targeting primarily are the ones where whether the workload is seeky
or not results in multiple orders of magnitude difference in terms of
available IO bandwidth.  It simply doesn't make sense to use bandwidth
as the measurement for available IO resources.  That'd be worse than
calculating the computation resource consumed as the number of
instructions executed, regardless of whether the instruction is a
simple register-to-register move or a double-precision floating point
division, which we don't do for obvious reasons.

> The virtual time is the key concept used to achieve the above
> optimal fairness guarantees. To show how intuitively, let me restate
> the above guarantees in terms of "amount of service”, i.e., of
> number of sectors read/written: BFQ guarantees that every active
> queue receives, over any time interval, an amount of service equal
> to the total amount of service provided by the system, multiplied by
> the share of the bandwidth for the queue (apart from the
> above-mentioned minimum, unavoidable deviation). Informally, this
> implies that the amount of service received by each active queue,
> during any given time interval, is proportional to the weight of the
> queue. As a consequence, the normalized service received by every
> active queue, i.e., the amount of service divided by the weight of
> the queue, grows at the same rate. In BFQ, every queue is associated
> with a virtual time, equal exactly to this normalized service. The
> goal of BFQ is then to equalize queues’ virtual times (using also a
> further global quantity, called system virtual time). To achieve
> this goal, BFQ does not schedule time slices, but budgets, measured
> in amount of service.

I think I understand what you're saying.  What I'm trying to say is
the unit bfq is currently using for budgeting is almost completely
bogus.  It's the wrong unit.  The "amount of service" a group receives
is fundamentally "the amount of time" it occupies the target device.
If that can be approximated by bandwidth as in network interfaces,
using bandwidth as the unit is fine.  However, that isn't the case at
all for disks.
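Same toy numbers as before (illustrative only): converting bytes of
service into device time makes the gap between the two units obvious:

```python
# Toy model: the same "amount of service" (bytes) costs wildly
# different amounts of device time depending on the I/O pattern.
# Rates as in the earlier example: 100 MB/s sequential, 1 MB/s
# random (assumed).

def ms_per_mb(rate_mbs):
    """Device time, in milliseconds, spent delivering one MB at the
    given sustained rate."""
    return 1000.0 / rate_mbs

print(ms_per_mb(100.0))  # 10.0   -> 1 MB sequential costs 10 ms
print(ms_per_mb(1.0))    # 1000.0 -> 1 MB of 4k random costs a full second
```

A budget of N sectors is two orders of magnitude more device time for
the random workload than for the sequential one, so bytes cannot
stand in for time the way they can on a network interface.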

> Hoping that we are now in sync on virtual times, I will try to
> discuss your comment on why not to schedule time (slices) in the
> first place. The fact that BFQ does not schedule time slices with
> the same scheduling policy as CFQ, but budgets with an optimally
> fair policy, provides the following main practical benefits:
> 
> 1) It allows BFQ to distribute bandwidth as desired among processes
> or groups of processes, regardless of: device parameters, bandwidth
> fluctuation, overall workload composition (BFQ gives up this
> bandwidth-fair policy only if this would cause a throughput drop, as
> discussed in the second part of this email). It is impossible for a
> time-based scheduling policy to provide such a service guarantee on
> a device with a fluctuating bandwidth. In fact, a process achieving,
> during its allotted time slice, a lower bandwidth than other luckier
> processes, will just receive less service when it has the
> opportunity to use the device. As a consequence, an unlucky process
> may easily receive less bandwidth than another process with the same
> weight, or even of a process with a lower weight. On HDDs, this
> trivially happens to processes reading/writing on the inner zones of
> the disk. On SSDs, the variability is due to more complex
> causes. This problem has been one of the main motivations for the
> definition of BFQ.

I think you're confused about what IO resource is.  Let's say there is
one swing and two kids.  When we say the two kids get to share the
swing equally, it means that each kid would get the same amount of
time on the swing, not the same number of strokes or the length of
circumference traveled.  A kid who gets 30% of the swing gets 30% time
share of the swing regardless of who else the kid is sharing the swing
with.  This is exactly the same situation.

> 2) The above bandwidth fairness of a budget-based policy is key for
> providing sound low-latency guarantees to soft real-time
> applications, such as audio or video players. In fact, a time-based
> policy is nasty with an unlucky soft real-time process whose I/O
> happens to yield, in the time slice during which the I/O requests of
> the process are served, a lower throughput than the peak rate. The
> reason is that, first, a lower throughput makes it more difficult
> for the process to reach its target bandwidth. Secondly, the
> scheduling policy does not tend to balance this unlucky situation at
> all: the time slice for the process is basically the same as for any
> other process with the same weight. This is one of the reasons why,
> in our experiments, BFQ constantly guarantees to soft real-time
> applications a much lower latency than CFQ.

I don't think this is because of any fundamental property of
bandwidth being used as the unit, but more that the algorithm favors
low-bandwidth consumers and the "soft real-time" ones don't get
punished as seeky.  It seems more a product of distortion in the
distribution scheme, which, btw, is fine if it serves a practical
purpose, but this should be achievable in a different way too.

> 3) For the same reasons as in the above case, a budget-based policy
> is also key for providing sound high-responsiveness guarantees,
> i.e., for guaranteeing that, even in the presence of heavy
> background workloads, applications are started quickly, and in
> general, that interactive tasks are completed promptly. Again, this
> is one of the reasons why BFQ provides responsiveness guarantees not
> comparable with CFQ.

I probably am missing something but I don't quite get how the above
property is a fundamental property of using bandwidth as the unit.  If
it is, it'd be only because the unit distorts the actual IO resource
distribution in a favorable way.  However, because the unit is
distorted to begin with, bfq has to apply seeky penalty to correct it.
I'd be far happier with opt-in correction (identifying the useful
distortions and implementing them) than the current opt-out scheme
where the fundamental unit is wrong.

> For all the three cases above, and concerning unlucky applications,
> i.e., applications whose I/O patterns yield a lower throughput than

No, that's not an unlucky application.  That's an expensive
application in terms of IO.  It's the slow-swinging kid.

> other applications, I think it is important to stress that the
> unfairness, high-latency or low-responsiveness problems experienced
> by these applications with a time-based policy are not transient: in
> the presence of a background workload, these applications are
> mistreated in the same way *every time* their I/O is replayed.

I think you're getting it backwards.  Let's go back to the swing
example.  If you think allocating per-stroke or
per-circumference-traveled is a valid strategy when different kids can
show multiple orders of magnitude differences on those scales, please
convince me why.

> Unfortunately, there are applications for which BFQ cannot provide
> the above desirable service guarantees: the applications whose I/O
> patterns may kill the throughput if they are guaranteed the service
> fairness in item 1 (they may also cause other applications to suffer
> from a high latency). In this respect, if I have well understood the
> “punish-seeky-queues” case you have highlighted, you refer to the
> case where a process does not finish its budget within a reasonable
> time interval. In this case, the process is de-scheduled, and
> treated as if it has used all its budget. As you have noted, this is
> a switch from a fair budget-based policy to a fair time-based
> policy, to prevent I/O patterns yielding a very low throughput from
> causing throughput and latency problems. This simple switch becomes
> more complex with the refinements introduced by the following
> patches, which take into account also whether the application is
> soft real-time or interactive, and, if so, let the trade-off become
> more favorable to the application, even if its I/O is not
> throughput-friendly.

I hope my point is clear by now.  The above is correcting for the
fundamental flaws in the unit used for distribution.  It's mapping
bandwidth to time in an ad-hoc and mostly arbitrary way.

In the same cgroup, fairness can take a back seat to overall
bandwidth or practical characteristics.  That's completely fine.
What bothers me are:

1. The direction is the other way around.  It starts with a flawed
   fundamental unit and then opts out "really bad ones".  I think the
   better way would be using the correct resource unit as the basis
   and opt in specific distortions to achieve certain practical
   advantages.  The later patches add specific opt-in behaviors
   anyway.

2. For IO control, the practical distortions should be much lower and
   the distribution should be based on the actual IO resource.

Maybe I'm missing out something big.  If so, please hammer it into me.

Thanks.

-- 
tejun

