From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1423799AbcBQRCL (ORCPT ); Wed, 17 Feb 2016 12:02:11 -0500 Received: from mail-yw0-f176.google.com ([209.85.161.176]:35982 "EHLO mail-yw0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756184AbcBQRCI (ORCPT ); Wed, 17 Feb 2016 12:02:08 -0500 Date: Wed, 17 Feb 2016 12:02:06 -0500 From: Tejun Heo To: Paolo Valente Cc: Jens Axboe , Fabio Checconi , Arianna Avanzini , linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Ulf Hansson , Linus Walleij , broonie@kernel.org Subject: Re: [PATCH RFC 09/22] block, cfq: replace CFQ with the BFQ-v0 I/O scheduler Message-ID: <20160217170206.GU3741@mtj.duckdns.org> References: <1454364778-25179-1-git-send-email-paolo.valente@linaro.org> <1454364778-25179-10-git-send-email-paolo.valente@linaro.org> <20160211222210.GC3741@mtj.duckdns.org> <8FDE2B10-9BD2-4741-917F-5A37A74E5B58@linaro.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <8FDE2B10-9BD2-4741-917F-5A37A74E5B58@linaro.org> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, On Wed, Feb 17, 2016 at 10:02:00AM +0100, Paolo Valente wrote: > your first statement “bfq is using bandwidth as the virtual time" is > not very clear to me. In contrast, after that you raise a clear, Another way to put that would be "bfq is using bandwidth as the measure of IO resource provided by a device and to be distributed". ... > Bandwidth fluctuation is one of the key issues addressed in the very > definition of BFQ (more precisely in the definition of the engine of > BFQ, introduced in this patch). In particular, BFQ schedules queues > so as to approximate an ideal service in which every queue receives, > over any time interval, its share of the bandwidth, regardless of > the actual value of the bandwidth, and even if the bandwidth > fluctuates. IOW, in the ideal service approximated by BFQ, if the So, this is a problem. For illustration, let's ignore penalization of seeky workload and assume a hard drive which can do 100MB/s for a fully sequential workload and 1MB/s for 4k random one. In other words, a sequential workload doing 100 MB/s is consuming 100% of available IO resource and so does 4k random workload doing 1MB/s. Let's say there are two sequential workload. If the two workloads are interleaved, the total bandwidth achieved would be lower than 100MB/s with the amount of drop dependent on how often the two workloads are switched. Given big enough scheduling stride, the total won't be too much lower and assuming the same weight each should be getting bandwidth a bit lower than 50MB/s. If we do the same for two random workloads, the story is the same. Each should be getting close to 512KB/s of bandwidth. Now let's put one sequential workload and one random workload on the device. What *should* happen is clear - the sequential workload should be achieving close to 50MB/s while the random one 512KB/s so that each is still consuming the equal proportion of the available IO resource on the device. For devices with low concurrency such as disks, this measure of IO resources can be almost directly represented as how long each workload occupies the device. If we use the proportion of bandwidth a workload is getting as the measure of IO resource consumed, the picture becomes very different and, I think, lost. The amount of bandwidth available is heavily dependent on what mix of IOs the device is getting which would then feed back into how much proportion of IO resource each workload forming a feedback loop. More importantly, random workload would be consuming far more IO resource than it's entitled to. I assume that is why bfq currently would need special punishment of seeky workloads. In the end what that coefficient does is trying to translate bandwidth consumption to IO time consumption in an ad-hoc and inaccruate way. > bandwidth fluctuates during a given time interval, then, in every > sub-interval during which the bandwidth is constant (possibly an > infinitesimal time interval), each queue receives a fraction of the > total bandwidth equal to the ratio between its weight and the sum of > the weights of the active queues. BFQ is an optimal scheduler in > that it guarantees the minimum possible deviation, for a > slice-by-slice service scheme, with respect to the ideal > service. Finally, BFQ actually provides these fairness guarantees > only to queues whose I/O pattern yields a reasonably high > throughput. This fact has to do with the issue you raise. I don't think it's fair at all. It distributes undue amount of IO resources to seeky workloads and then tries to compensate for it by punishing them with a special coefficient. The devices bfq is targeting primarily are the ones where whether the workload is seeky or not results in multiple orders of magnitude difference in terms of available IO bandwidth. It simply doesn't make sense to use bandwidth as the measurement for available IO resources. That'd be worse than calcualting computation resource consumed as the number of instructions executed regardless of whether the instruction is a simple register to register move or double precision floating point division, which we don't do for obvious reasons. > The virtual time is the key concept used to achieve the above > optimal fairness guarantees. To show how intuitively, let me restate > the above guarantees in terms of "amount of service”, i.e., of > number of sectors read/written: BFQ guarantees that every active > queue receives, over any time interval, an amount of service equal > to the total amount of service provided by the system, multiplied by > the share of the bandwidth for the queue (apart from the > above-mentioned minimum, unavoidable deviation). Informally, this > implies that the amount of service received by each active queue, > during any given time interval, is proportional to the weight of the > queue. As a consequence, the normalized service received by every > active queue, i.e., the amount of service divided by the weight of > the queue, grows at the same rate. In BFQ, every queue is associated > with a virtual time, equal exactly to this normalized service. The > goal of BFQ is then to equalize queues’ virtual times (using also a > further global quantity, called system virtual time). To achieve > this goal, BFQ does not schedule time slices, but budgets, measured > in amount of service. I think I understand what you're saying. What I'm trying to say is the unit bfq is currently using for budgeting is almost completely bogus. It's the wrong unit. The "amount of service" a group receives is fundamentally "the amount of time" it occupies the target device. If that can be approximated by bandwidth as in network interfaces, using bandwidth as the unit is fine. However, that isn't the case at all for disks. > Hoping that we are now in sync on virtual times, I will try to > discuss your comment on why not to schedule time (slices) in the > first place. The fact that BFQ does not schedule time slices with > the same scheduling policy as CFQ, but budgets with an optimally > fair policy, provides the following main practical benefits: > > 1) It allows BFQ to distribute bandwidth as desired among processes > or groups of processes, regardless of: device parameters, bandwidth > fluctuation, overall workload composition (BFQ gives up this > bandwidth-fair policy only if this would cause a throughput drop, as > discussed in the second part of this email). It is impossible for a > time-based scheduling policy to provide such a service guarantee on > a device with a fluctuating bandwidth. In fact, a process achieving, > during its allotted time slice, a lower bandwidth than other luckier > processes, will just receive less service when it has the > opportunity to use the device. As a consequence, an unlucky process > may easily receive less bandwidth than another process with the same > weight, or even of a process with a lower weight. On HDDs, this > trivially happens to processes reading/writing on the inner zones of > the disk. On SSDs, the variability is due to more complex > causes. This problem has been one of the main motivations for the > definition of BFQ. I think you're confused about what IO resource is. Let's say there is one swing and two kids. When we say the two kids get to share the swing equally, it means that each kid would get the same amount of time on the swing, not the same number of strokes or the length of circumference traveled. A kid who gets 30% of the swing gets 30% time share of the swing regardless of who else the kid is sharing the swing with. This is exactly the same situation. > 2) The above bandwidth fairness of a budget-based policy is key for > providing sound low-latency guarantees to soft real-time > applications, such as audio or video players. In fact, a time-based > policy is nasty with an unlucky soft real-time process whose I/O > happens to yield, in the time slice during which the I/O requests of > the process are served, a lower throughput than the peak rate. The > reason is that, first, a lower throughput makes it more difficult > for the process to reach its target bandwidth. Secondly, the > scheduling policy does not tend to balance this unlucky situation at > all: the time slice for the process is basically the same as for any > other process with the same weight. This is one of the reasons why, > in our experiments, BFQ constantly guarantees to soft real-time > applications a much lower latency than CFQ. I don't think this is because of any fundamental properties of bandwidth being used as the unit but more that the algorithm favors low bandwidth consumers and the "soft real-time" ones don't get punished as seeky. It seems more a product of distortion in distribution scheme, which btw is fine if it serves a practical purpose, but this should be achievable in a different way too. > 3) For the same reasons as in the above case, a budget-based policy > is also key for providing sound high-responsiveness guarantees, > i.e., for guaranteeing that, even in the presence of heavy > background workloads, applications are started quickly, and in > general, that interactive tasks are completed promptly. Again, this > is one of the reasons why BFQ provides responsiveness guarantees not > comparable with CFQ. I probably am missing something but I don't quite get how the above property is a fundamental property of using bandwidth as the unit. If it is, it'd be only because the unit distorts the actual IO resource distribution in a favorable way. However, because the unit is distorted to begin with, bfq has to apply seeky penalty to correct it. I'd be far happier with opt-in correction (identifying the useful distortions and implementing them) than the current opt-out scheme where the fundamental unit is wrong. > For all the three cases above, and concerning unlucky applications, > i.e., applications whose I/O patterns yield a lower throughput than No, that's not an unlucky application. That's an expensive application in terms of IO. It's the slow swinging kid. > other applications, I think it is important to stress that the > unfairness, high-latency or low-responsiveness problems experienced > by these applications with a time-based policy are not transient: in > the presence of a background workload, these application are > mistreated in the same way *every time* their I/O is replayed. I think you're getting it backwards. Let's go back to the swing example. If you think allocating per-stroke or per-circumference-traveled is a valid strategy when different kids can show multiple orders of magnitude differences on those scales, please convince me why. > Unfortunately, there are applications for which BFQ cannot provide > the above desirable service guarantees: the applications whose I/O > patterns may kill the throughput if they are guaranteed the service > fairness in item 1 (they may also cause other applications to suffer > from a high latency). In this respect, if I have well understood the > “punish-seeky-queues” case you have highlighted, you refer to the > case where a process does not finish its budget within a reasonable > time interval. In this case, the process is de-scheduled, and > treated as if it has used all its budget. As you have noted, this is > a switch from a fair budget-based policy to a fair time-based > policy, to prevent I/O patterns yielding a very low throughput from > causing throughput and latency problems. This simple switch becomes > more complex with the refinements introduced by the following > patches, which take into account also whether the application is > soft real-time or interactive, and, if so, let the trade-off become > more favorable to the application, even if its I/O is not > throughput-friendly. I hope my point is clear by now. The above is correcting for the fundamental flaws in the unit used for distribution. It's mapping bandwidth to time in an ad-hoc and mostly arbitrary way. In the same cgroup, fairness can take the second seat to overall bandwidth or practical characteristics. That's completely fine. What bothers me are 1. The direction is the other way around. It starts with a flawed fundamental unit and then opts out "really bad ones". I think the better way would be using the correct resource unit as the basis and opt in specific distortions to achieve certain practical advantages. The later patches add specific opt-in behaviors anyway. 2. For IO control, the practical distortions should be much lower and the distribution should be based on the actual IO resource. Maybe I'm missing out something big. If so, please hammer it into me. Thanks. -- tejun