From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751678AbcCASrC (ORCPT ); Tue, 1 Mar 2016 13:47:02 -0500 Received: from mail-yk0-f195.google.com ([209.85.160.195]:34852 "EHLO mail-yk0-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750877AbcCASq7 (ORCPT ); Tue, 1 Mar 2016 13:46:59 -0500 Date: Tue, 1 Mar 2016 13:46:56 -0500 From: Tejun Heo To: Paolo Valente Cc: Jens Axboe , Fabio Checconi , Arianna Avanzini , linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Ulf Hansson , Linus Walleij , broonie@kernel.org Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler Message-ID: <20160301184656.GI3965@htj.duckdns.org> References: <1454364778-25179-1-git-send-email-paolo.valente@linaro.org> <1454364778-25179-10-git-send-email-paolo.valente@linaro.org> <20160211222210.GC3741@mtj.duckdns.org> <8FDE2B10-9BD2-4741-917F-5A37A74E5B58@linaro.org> <20160217170206.GU3741@mtj.duckdns.org> <72E81252-203C-4EB7-8459-B9B7060029C6@linaro.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <72E81252-203C-4EB7-8459-B9B7060029C6@linaro.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Paolo. Sorry about the delay. On Sat, Feb 20, 2016 at 11:23:43AM +0100, Paolo Valente wrote: > Before replying to your points, I want to stress that I'm not a > champion of budget-based scheduling at all costs. Budget-based > scheduling just seems to provide tight bandwidth and latency > guarantees that are practically impossible to get with time-based > scheduling. I will try to explain this fact better, and to provide > also a numerical example, in my replies to your points. I do like the budget-based scheduling. It just feels that the budget is based on the wrong unit. ... > I think I got your point. In any case, a queue is not punished *after* > it has consumed an undue amount of the resource, because a queue just > cannot get to consume an undue amount of the resource. There is a > timeout: a queue cannot be served for more than a pre-defined maximum > time slice. > > Even if a queue expires before that timeout, BFQ checks anyway, on the > expiration of the queue, whether the queue is so slow to not deserve > accurate service-based guarantees. This is done to achieve additional > robustness. In fact, if service-based guarantees were provided to a > very slow queue that, for some reason, never causes the timeout to > expire, then the queue would happen to be served too often, i.e., to > get the undue amount of IO resource you mention. I see. Once a queue starts timing out its slice, it gets switched to time based scheduling; however, as you mentioned, workloads which generate moderate random IOs would still get preferential treatment over workloads which generate sequential IOs, by definition. ... > Your metaphor is clear, but it does not seem to match what I expect > from my storage device. As a user of my PC, I’m concerned, e.g., about > how long it takes to copy a large file. I’m not concerned about what > percentage of time will be guaranteed to my file copy if other > processes happen to do I/O in parallel. As a similar example, a good The underlying device is fundamentally incapable of giving guarantees like that. The only way to get a (quasi) bandwidth guarantee from a disk device is either ensuring that the IO is almost completely sequential or there's enough buffer in capacity for the expected seekiness of the IO pattern. For use cases where the differences in seekiness across workloads are accidental - e.g. all are trying to stream different files but some files are more fragmented by accident - using bandwidth as the resource unit would be helpful in mitigating the random gaps that the user shouldn't be bothered by, but that'd be focusing on a pretty narrow set of use cases. Workloads are varied and underlying device performs wildly differently depending on the specific IO pattern. Because rotating disks suck so badly, it's true that there's a lot of wiggle room in what the IO scheduler can do. People are accustomed to dealing with random behaviors. That said, it still doesn't feel comfortable to use the obviously wrong unit as the fundamental basis of resource distribution. > file-hosting service must probably guarantee reasonable read/write, > i.e., download/upload, speeds to users (of course, I/O speed matters > only if the bottleneck is not the network). Most likely, a user of > such a service does not care (directly) about how much resource-time > is guaranteed to the I/O related to his/her downloads/uploads. > > With a budget-based service scheme, you can easily provide these > service guarantees accurately. In particular, you can achieve this > goal even if the bandwidth fluctuates, provided that fluctuations > depend mostly on I/O patterns. That is, it must hold true that the > device achieves about the same bandwidth with the same I/O > pattern. This is exactly the case with both rotational and So, yes, I see that bandwidth based control would yield a better result for this specific use case but at the same time this is a very specialized use case and probably the *only* use case where bandwidth based distribution makes sense - equivalent logically sequential workloads where the specifics of IO pattern are completely accidental. We can't really design the whole thing around that single use case. > non-rotational devices. With a time-based scheme, it is impossible to > provide these service guarantees, if bandwidth fluctuates and > requirements are minimally stringent. I will try to show it with a > simple numerical example. Before this numerical example, I would like > to report a last practical example. I agree that bandwidth based distribution would behave better for this use case but think the above paragraph is going a bit too far. Bandwidth based distribution can stick to the line better but that just means that time based scheduling would need a bit more buffer in achieving the same level of behavioral consistency. It's not like bandwidth can actually be guarnateed no matter what we do. ... > To achieve, with a time-based scheduler, the same precise and stable > bandwidth distribution as with a budget-based scheduler, the only > solution is to change weights dynamically, as a function of how the > throughput achieved by B varies in its time slots. Such a solution > would be definitely complex, if ever feasible and stable. Isn't that a use case specifically carved for bandwidth based distribution? Imagine how this would work when e.g. there are mostly sequential IO workloads and fluctuating random workloads. Addition of another sequential workload would behave as expected but addition of random workloads would cripple everyone to the same level if the random workloads play their cards right. One side of the coin is "if I have two parallel file copies, they proceed at the same speed regardless of how they're distributed across disk" and the other is "but if I start an application which intermittently issues random IOs, my copies take 5x longer". Isn't "the two parallel copies mostly keep the same pace but may deviate a bit but addition of random IOs doesn't collapse the whole thing" a better proposition? Hmmm... it could be that I'm mistaken on how trigger happy the switch to time based scheduling is. Maybe it's sensitive enough that bandwidth based scheduling is only applied to workloads which are mostly sequential. I'm sorry if I'm being too dense on this point but can you please give me some examples on what would happen when sequential workloads and random ones mix? Thanks. -- tejun