From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751611AbcDOOU6 (ORCPT ); Fri, 15 Apr 2016 10:20:58 -0400
Received: from mail-wm0-f44.google.com ([74.125.82.44]:37993 "EHLO
	mail-wm0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750924AbcDOOU5 convert rfc822-to-8bit (ORCPT );
	Fri, 15 Apr 2016 10:20:57 -0400
Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler
Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\))
Content-Type: text/plain; charset=windows-1252
From: Paolo Valente
In-Reply-To: <20160414162953.GG12583@htj.duckdns.org>
Date: Fri, 15 Apr 2016 16:20:44 +0200
Cc: Jens Axboe, Fabio Checconi, Arianna Avanzini,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	Ulf Hansson, Linus Walleij, Mark Brown
Content-Transfer-Encoding: 8BIT
Message-Id: <427F5DF5-507A-4657-8279-B6A8FD98F6D8@linaro.org>
References: <1454364778-25179-1-git-send-email-paolo.valente@linaro.org>
	<1454364778-25179-10-git-send-email-paolo.valente@linaro.org>
	<20160211222210.GC3741@mtj.duckdns.org>
	<8FDE2B10-9BD2-4741-917F-5A37A74E5B58@linaro.org>
	<20160217170206.GU3741@mtj.duckdns.org>
	<72E81252-203C-4EB7-8459-B9B7060029C6@linaro.org>
	<20160301184656.GI3965@htj.duckdns.org>
	<20160413204110.GF20142@htj.duckdns.org>
	<2B664E4D-857C-4BBA-BE77-97EA6CC3F270@linaro.org>
	<20160414162953.GG12583@htj.duckdns.org>
To: Tejun Heo
X-Mailer: Apple Mail (2.1878.6)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 14 Apr 2016, at 18:29, Tejun Heo wrote:

> Hello, Paolo.
>
> On Thu, Apr 14, 2016 at 12:23:14PM +0200, Paolo Valente wrote:
> ...
>>>> 1) Stable(r) and tight bandwidth distribution for mostly-sequential
>>>> reads/writes
>>>
>>> So, yeah, the above makes total sense.
>>>
>>>> 2) Stable(r) and high responsiveness
>>>> 3) Stable(r) and low latency for soft real-time applications
>>>> 4) Faster execution of dev tasks, such as compile and git operations
>>>> (checkout, merge, …), in the presence of background workloads, and
>>>> while guaranteeing a high responsiveness too
>>>
>>> But can you please enlighten me on why 2-4 are inherently tied to
>>> bandwidth-based scheduling?
>>
>> Goals 2-4 are obtained by granting a higher share of the throughput
>> to the applications to privilege. The more stably and accurately the
>> underlying scheduling engine is able to enforce the desired bandwidth
>> distribution, the more stably and accurately higher shares can be
>> guaranteed. Then 2-4 follow from 1, i.e., from the fact that BFQ
>> guarantees a more stable and tight(er) bandwidth distribution.
>
> 4) makes sense as a lot of that workload would be at least
> quasi-sequential but I can't tell why 2) and 3) would depend on
> bandwidth based scheduling. They're about recognizing workloads which
> can benefit from low latency and treating them accordingly. Why would
> making the underlying scheduling time based change that?
>

Because, in BFQ, "treating them accordingly" means raising their
weights to let them receive a higher share of the throughput (other,
more rigid solutions, such as priority scheduling, would easily lead
to starvation problems). With time-based scheduling, the share of the
throughput guaranteed for a given value of the weight is less stable
than with sector-based scheduling. Then, to provide the same latency
guarantees as with sector-based scheduling, weights have to be raised
more. This would throttle unprivileged processes more, yet it would
still not improve stability. With time-based scheduling, latency
guarantees between two privileged applications may vary more too, even
if both applications perform quasi-sequential I/O.
In fact, throughput shares would vary more depending, e.g., on where
the sectors requested by the applications are located.

>>> To summarize,
>>>
>>> 1. I still don't understand why bandwidth-based scheduling is better
>>>    (sorry). The only reason I can think of is that most workloads
>>>    that we care about are at least quasi-sequential and can benefit
>>>    from ignoring randomness to a certain degree. Is that it?
>>>
>>
>> If I have understood correctly, you refer to that maximum ~30%
>> throughput loss that a quasi-sequential workload can incur (because of
>> some randomness or of other unlucky accidents). If so, then I think
>> you fully got the point.
>
> Alright, I see.
>
>>> 2. I don't think strict fairness is all that important for IO
>>>    scheduling in general. Whatever gives us the best overall result
>>>    should work, so if bandwidth based scheduling does that, great;
>>>    however, fairness does matter across cgroups. A cgroup configured
>>>    to receive 50% of IO resources should get close to that no matter
>>>    what others are doing, would bfq be able to do that?
>>
>> BFQ guarantees 50% of the bandwidth of the resource, not 50% of the
>> time. In this respect, with 50% of the time instead of 50% of the
>
> So, across cgroups, I don't think we can pretend that bandwidth is the
> resource. There should be a reasonable level of isolation. Bandwidth
> for a rotating disk is a byproduct which can fluctuate widely. "You
> have 50% of the total disk bandwidth" doesn't mean anything if that
> bandwidth can easily fluctuate a hundred fold.
>

I agree that, if a system serves a workload whose characteristics
change significantly every 100 ms, or even more frequently, and in an
unpredictable way, then both time-based and sector-based scheduling
provide exactly the same level of bandwidth guarantees. That is,
almost no bandwidth guarantee. But AFAIK many systems, services and
applications do not behave in such a way.
On the contrary, their IO patterns are rather stable. So, if we choose
time-based scheduling only because sector-based scheduling brings no
benefit on an unpredictable system, then we throw away all the
benefits that sector-based scheduling would give us on more stable
systems.

>> bandwidth, a group suffers from the bandwidth fluctuation, higher
>> latency and throughput loss problems that I have tried to highlight.
>> Plus, it is not possible to easily answer questions such as "how
>> long would it take to copy this file?".
>
> It's actually a lot more difficult to answer that with bandwidth
> scheduling. Let's say cgroup A has 50% of disk time. Sure, there are
> inaccuracies, but it should be able to get close to the ballpark -
> let's be lax and say between 30% and 45% of raw sequential bandwidth.
> It isn't ideal but now imagine bandwidth based scheduling. Depending
> on what the others are doing, it may get 5% or even lower of the raw
> sequential bandwidth. It isn't isolating anything.
>

Definitely. Nevertheless my point is still about the same: we have to
consider one system at a time. If the workload of the system is highly
variable and completely unpredictable, then it is hard to provide any
bandwidth guarantee with any solution. But if the workload has a
minimum of stability, then sector-based scheduling either wins or
provides the same guarantees as time-based scheduling.

For example, a concrete instance of your low-bandwidth example may be
one where you have one quasi-sequential workload W, competing with
nine random workloads. In this case, if, e.g., all workloads have the
same weight, then BFQ would schedule the resource like a time-based
scheduler: one full budget (which lasts for about one time slice) for
workload W, followed by one time slice for each of the other
workloads. Then there would be no service-guarantee loss with respect
to time-based scheduling.
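
To make the W-versus-nine-random-workloads case concrete, here is a
rough back-of-the-envelope model in Python. It is only a sketch, not
BFQ code: the device rates, the slice length and all function names
are assumptions picked for illustration, and the third function models
a hypothetical scheduler that equalizes sectors with no time limit,
which is the situation behind the "5% or even lower" concern.

```python
# Illustrative model only: all numbers below are assumed, not measured.
SEQ_RATE_MBPS = 100.0   # assumed device rate while serving workload W
RAND_RATE_MBPS = 1.0    # assumed device rate while serving a random workload
SLICE_S = 0.1           # assumed time slice, in seconds
N_RANDOM = 9            # nine random workloads competing with W


def w_throughput_time_based():
    """Time-based scheduling: every workload gets one slice per round."""
    round_time = (1 + N_RANDOM) * SLICE_S
    return SEQ_RATE_MBPS * SLICE_S / round_time


def w_throughput_as_described():
    """W gets one full budget sized to last about one slice; each random
    workload is then served for one time slice (equal weights)."""
    budget_mb = SEQ_RATE_MBPS * SLICE_S
    w_service_time = budget_mb / SEQ_RATE_MBPS
    round_time = w_service_time + N_RANDOM * SLICE_S
    return budget_mb / round_time


def w_throughput_pure_sector_fair():
    """Hypothetical scheduler giving every workload the same amount of
    data per round, however long the random workloads take."""
    budget_mb = SEQ_RATE_MBPS * SLICE_S
    round_time = (budget_mb / SEQ_RATE_MBPS
                  + N_RANDOM * budget_mb / RAND_RATE_MBPS)
    return budget_mb / round_time


if __name__ == "__main__":
    print("time-based        :", w_throughput_time_based(), "MB/s")
    print("as described above:", w_throughput_as_described(), "MB/s")
    print("pure sector-fair  :", w_throughput_pure_sector_fair(), "MB/s")
```

Under these assumed numbers, W gets about 10 MB/s in the first two
cases, i.e., the same service in both, while the hypothetical
pure sector-fair scheduler would leave W with well below 1 MB/s.
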
In contrast, in all the other examples I have mentioned so far
(file-hosting, streaming, video/audio playback against background
workloads, application start-up, ...) sector-based scheduling would be
clearly beneficial even in a hierarchical setting.

In the end, if we give up sector scheduling for cgroups, we can only
lose some benefits. Unless I'm still missing some even more important
problem (sorry about that).

>> In any case, it is of course possible to get time distribution also
>> with BFQ, by 'just' letting it work in the time domain. However,
>> changing BFQ to operate in the time domain, or, probably much better,
>> extending BFQ to operate correctly in both domains, would be a lot of
>> work. I don't know whether it would be worth the effort and the
>> extra complexity.
>
> As I wrote before, as fairness isn't that important for normal
> scheduling, if empirical data show that bandwidth based scheduling is
> beneficial for most common workloads, that's awesome especially given
> that CFQ has plenty of issues. I don't think cgroup case is workable
> as currently implemented tho.
>

I was thinking about some solution to achieve both goals. An option is
probably to let BFQ work in a double mode: sector-based within groups
and time-based among groups. However, I find it a little messy and
confusing. Other ideas/solutions? I have no better proposal at the
moment :(

Thanks,
Paolo

> Thanks.
>
> --
> tejun