Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler

From: Paolo Valente <paolo.valente@linaro.org>
To: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	Fabio Checconi <fchecconi@gmail.com>,
	Arianna Avanzini <avanzini.arianna@gmail.com>,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	Ulf Hansson <ulf.hansson@linaro.org>,
	Linus Walleij <linus.walleij@linaro.org>,
	Mark Brown <broonie@kernel.org>
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler
Date: Fri, 15 Apr 2016 16:20:44 +0200	[thread overview]
Message-ID: <427F5DF5-507A-4657-8279-B6A8FD98F6D8@linaro.org> (raw)
In-Reply-To: <20160414162953.GG12583@htj.duckdns.org>

Il giorno 14/apr/2016, alle ore 18:29, Tejun Heo <tj@kernel.org> ha scritto:

> Hello, Paolo.
> 
> On Thu, Apr 14, 2016 at 12:23:14PM +0200, Paolo Valente wrote:
> ...
>>>> 1) Stable(r) and tight bandwidth distribution for mostly-sequential
>>>> reads/writes
>>> 
>>> So, yeah, the above makes toal sense.
>>> 
>>>> 2) Stable(r) and high responsiveness
>>>> 3) Stable(r) and low latency for soft real-time applications
>>>> 4) Faster execution of dev tasks, such as compile and git operations
>>>> (checkout, merge, …), in the presence of background workloads, and
>>>> while guaranteeing a high responsiveness too
>>> 
>>> But can you please enlighten me on why 2-4 are inherently tied to
>>> bandwidth-based scheduling?
>> 
>> Goals 2-4 are obtained by granting a higher share of the throughput
>> to the applications to privilege. The more stably and accurately the
>> underlying scheduling engine is able to enforce the desired bandwidth
>> distribution, the more stably and accurately higher shares can be
>> guaranteed. Then 2-4 follows from 1, i.e., from that BFQ guarantees
>> a stabler and tight(er) bandwidth distribution.
> 
> 4) makes sense as a lot of that workload would be at least
> quasi-sequential but I can't tell why 2) and 3) would depend on
> bandwidth based scheduling.  They're about recognizing workloads which
> can benefit from low latency and treating them accordingly.  Why would
> making the underlying scheduling time based change that?
> 

Because, in BFQ, "treating them accordingly" means raising their
weights to let them receive a higher share of the throughput (other
rigid solutions, such as priority scheduling, would easily lead to
starvation problems). With time-based scheduling, the share of the
throughput guaranteed for a given value of the weight is less stable
than with sector-based scheduling. Then, to provide the same latency
guarantees as with sector-based scheduling, weights have to be raised
more. This would throttle unprivileged processes more. And it would
not however improve stability.

With time-based scheduling, latency guarantees among two privileged
applications may vary more too, even if both applications perform
quasi-sequential I/O. In fact, throughput shares would vary more
depending, e.g., on where the sectors requested by the applications are
located.

>>> To summarize,
>>> 
>>> 1. I still don't understand why bandwidth-based scheduling is better
>>>  (sorry).  The only reason I can think of is that most workloads
>>>  that we care about are at least quasi-sequential and can benefit
>>>  from ignoring randomness to a certain degree.  Is that it?
>>> 
>> 
>> If I have understood correctly, you refer to that maximum ~30%
>> throughput loss that a quasi-sequential workload can incur (because of
>> some randomness or of other unlucky accidents). If so, then I think
>> you fully got the point.
> 
> Alright, I see.
> 
>>> 2. I don't think strict fairness matters is all that important for IO
>>>  scheduling in general.  Whatever gives us the best overall result
>>>  should work, so if bandwidth based scheduling does that great;
>>>  however, fairness does matter across cgroups.  A cgroup configured
>>>  to receive 50% of IO resources should get close to that no matter
>>>  what others are doing, would bfq be able to do that?
>> 
>> BFQ guarantees 50% of the bandwidth of the resource, not 50% of the
>> time. In this respect, with 50% of the time instead of 50% of the
> 
> So, across cgroups, I don't think we can pretend that bandwidth is the
> resource.  There should be a reasonable level of isolation.  Bandwidth
> for a rotating disk is a byproduct which can fluctuate widely.  "You
> have 50% of the total disk bandwidth" doesn't mean anything if that
> bandwidth can easily fluctuate a hundred fold.
> 

I agree that, if a system serves a workload whose characteristics
change significantly every 100ms, or even more frequently, and in an
unpredictable way, then both time-based and sector-based scheduling
provide exactly the same level of bandwidth guarantees. That is,
almost no bandwidth guarantee.

But AFAIK many systems, services and applications do not behave in such
a way. On the contrary, their IO patterns are rather stable.

So, if we choose time-based scheduling in view of the fact that for an
unpredictable system there would be no benefits with sector-based
scheduling, then we just throw away all the benefits that we would
have with sector-based scheduling on more stable systems.

>> bandwidth, a group suffers from the bandwidth fluctuation, higher
>> latency and throughput loss problems that I have tried to highlight.
>> Plus, it is not possible to easily answer to questions like, e.g.: "how
>> long would it take to copy this file"?.
> 
> It's actually a lot more difficult to answer that with bandwidth
> scheduling.  Let's say cgroup A has 50% of disk time.  Sure, there are
> inaccuracies, but it should be able to get close to the ballpark -
> let's be lax and say between 30% and 45% of raw sequential bandwidth.
> It isn't ideal but now imagine bandwidth based scheduling.  Depending
> on what the others are doing, it may get 5% or even lower of the raw
> sequential bandwidth.  It isn't isolating anything.
> 

Definitely. Nevertheless my point is still about the same: we have to
consider one system at a time. If the workload of the system is highly
variable and completely unpredictable, then it is hard to provide any
bandwidth guarantee with any solution.

But if the workload has a minimum of stability, then sector scheduling
either wins or provides the same guarantees as time-based guarantees.
For example, a concrete instance of your low-bandwidth example may be
one where you have one quasi-sequential workload W, competing with
nine random workloads. In this case, if, e.g., all workloads have the
same weight, then BFQ would schedule the resource like a time-based
scheduler: one full budget (which lasts for about one time slice) for
workload W, followed by one time slice for each of the other
workloads. Then there would be no service-guarantee loss with respect
to time-based scheduling.

In contrast, in all the other examples I have mentioned so far
(file-hosting, streaming, video/audio playback against background
workloads, application start-up, ...) sector-based scheduling would be
clearly beneficial even in a hierarchical setting.

In the end, if we give up sector scheduling for cgroups, we can only
lose some benefits. Unless I'm still missing some even more important
problem (sorry about that).

>> In any case, it is of course possible to get time distribution also
>> with BFQ, by 'just' letting it work in the time domain. However,
>> changing BFQ to operate in the time domain, or, probably much better,
>> extending BFQ to operate correctly in both domains, would be a lot of
>> work. I don't know whether it would be worth the effort and the
>> extra complexity.
> 
> As I wrote before, as fairness isn't that important for normal
> scheduling, if empirical data show that bandwidth based scheduling is
> beneficial for most common workloads, that's awesome especially given
> that CFQ has plenty of issues.  I don't think cgroup case is workable
> as currently implemented tho.
> 

I was thinking about some solution to achieve both goals. An option is
probably to let BFQ work in a double mode: sector-based within groups
and time-based among groups. However, I find it a little messy and
confusing.

Other ideas/solutions? I have no better proposal at the moment :(

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun