* [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
@ 2016-10-26  9:27 Paolo Valente
  2016-10-26  9:27 ` [PATCH 01/14] block, bfq: " Paolo Valente
                   ` (9 more replies)
  0 siblings, 10 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-26  9:27 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	hare, arnd, bart.vanassche, grant.likely, jack, James.Bottomley,
	Paolo Valente

Hi,
this new patch series returns to the initial approach, i.e., it
adds BFQ as an extra scheduler, instead of replacing CFQ with
BFQ. This patch series also contains all the improvements and bug
fixes recommended by Tejun [5], plus the new features of BFQ-v8r5.
Details about old and new features are in the patch descriptions.

The first version of BFQ was submitted a few years ago [1]. It is
denoted as v0 in this patchset, to distinguish it from the version I
am submitting now, v8r5. In particular, the first two patches
introduce BFQ-v0, whereas the remaining patches progressively turn
BFQ-v0 into BFQ-v8r5.

Some patches generate WARNINGS with checkpatch.pl, but these WARNINGS
seem to be either unavoidable for the pieces of code involved (which
the patches just extend), or false positives.

For your convenience, a slightly updated and extended description of
BFQ follows.

On average CPUs, the current version of BFQ can handle devices
performing at most ~30 KIOPS; at most ~50 KIOPS on faster CPUs. These
are about the same limits as CFQ's. There may be room for noticeable
improvements regarding these limits, but, given the overall
limitations of blk itself, these limits did not seem a good reason to
further delay this new submission.
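
As a rough reference for what these IOPS figures mean in terms of
bandwidth, here is a back-of-the-envelope sketch (not code from the
patches; the request sizes and decimal units are taken from the
figures in the documentation added by patch 01):

/*
 * Back-of-the-envelope arithmetic only: converts an IOPS figure into
 * a bandwidth figure for a given request size, as done in the attached
 * documentation (30-50 KIOPS correspond to roughly 8-12 GB/s with
 * 256 KB requests, and to 120-200 MB/s with 4 KB random I/O).
 * Decimal units (1 KB = 1000 B, 1 MB = 1000 KB) are assumed.
 */
#include <stdio.h>

static double iops_to_mb_per_sec(double iops, double req_size_kb)
{
        return iops * req_size_kb / 1000.0;     /* MB/s */
}

int main(void)
{
        printf("30 KIOPS, 256 KB requests: %5.0f MB/s\n",
               iops_to_mb_per_sec(30000, 256));
        printf("50 KIOPS, 256 KB requests: %5.0f MB/s\n",
               iops_to_mb_per_sec(50000, 256));
        printf("30 KIOPS,   4 KB requests: %5.0f MB/s\n",
               iops_to_mb_per_sec(30000, 4));
        printf("50 KIOPS,   4 KB requests: %5.0f MB/s\n",
               iops_to_mb_per_sec(50000, 4));
        return 0;
}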

Here are some nice features of BFQ-v8r5.

Low latency for interactive applications

Regardless of the actual background workload, BFQ guarantees that, for
interactive tasks, the storage device is virtually as responsive as if
it were idle. For example, even if one or more of the following
background workloads are being executed:
- one or more large files are being read, written or copied,
- a tree of source files is being compiled,
- one or more virtual machines are performing I/O,
- a software update is in progress,
- indexing daemons are scanning filesystems and updating their
  databases,
starting an application or loading a file from within an application
takes about the same time as if the storage device were idle. As a
comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
applications experience high latencies, or even become unresponsive
until the background workload terminates (even on SSDs).

Low latency for soft real-time applications

Soft real-time applications too, such as audio and video
players/streamers, enjoy low latency and a low drop rate, regardless
of the background I/O workload. As a consequence, these applications
suffer almost no glitches due to the background workload.

Higher speed for code-development tasks

If some additional workload happens to be executed in parallel, then
BFQ executes the I/O-related components of typical code-development
tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
NOOP or DEADLINE.

High throughput

On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
up to 150% higher throughput than DEADLINE and NOOP, with all the
sequential workloads considered in our tests. With random workloads,
and with all the workloads on flash-based devices, BFQ achieves,
instead, about the same throughput as the other schedulers.

Strong fairness, bandwidth and delay guarantees

BFQ distributes the device throughput, and not just the device time,
among I/O-bound applications in proportion to their weights, with any
workload and regardless of the device parameters. From these bandwidth
guarantees, it is possible to compute tight per-I/O-request delay
guarantees by a simple formula. If not configured for strict service
guarantees, BFQ switches to time-based resource sharing (only) for
applications that would otherwise cause a throughput loss.
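
To make the bandwidth guarantee concrete, here is a minimal sketch
(not code from the patches) of the throughput share that an I/O-bound
application with a given weight is guaranteed when competing with
other I/O-bound applications; the weights and the device rate below
are just hypothetical examples:

/*
 * Minimal sketch: with continuously backlogged, I/O-bound
 * applications, the fraction of the device throughput guaranteed to
 * each application is its weight over the sum of the weights of the
 * competing applications, regardless of the device parameters and of
 * the workload.
 */
#include <stdio.h>

static double bfq_bw_share(int weight, const int *weights, int n)
{
        int i, wsum = 0;

        for (i = 0; i < n; i++)
                wsum += weights[i];
        return (double)weight / wsum;
}

int main(void)
{
        /*
         * Hypothetical weights; 400 MB/s is roughly the aggregate
         * sequential throughput of the SSD in the table at the end of
         * this email.
         */
        int weights[] = { 100, 200, 100 };
        double device_rate = 400.0;     /* MB/s */
        int i;

        for (i = 0; i < 3; i++)
                printf("weight %3d -> %.0f MB/s\n", weights[i],
                       bfq_bw_share(weights[i], weights, 3) * device_rate);
        return 0;
}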


BFQ achieves the above service properties thanks to the combination of
its accurate scheduling engine (patches 1-2), and a set of simple
heuristics and improvements (patches 3-14). Details on how BFQ and
its components work are provided in the descriptions of the
patches. In addition, a comprehensive description of the main BFQ
algorithm and of most of its features can be found in this paper [2].

What BFQ can do in practice is shown, e.g., in this 8-minute demo with
an SSD: [3]. I made this demo with an older version of BFQ (v7r6) and
under Linux 3.17.0, but, for the tests considered in the demo,
performance has remained about the same with more recent BFQ and
kernel versions. More details about this point can be found here [4],
together with graphs showing the performance of BFQ, as compared with
CFQ, DEADLINE and NOOP, and on: a fast and a slow hard disk, a RAID1,
an SSD, a microSDHC Card and an eMMC. As an example, our results on
the SSD are also reported in a table at the end of this email.

Finally, as for testing in everyday use, BFQ is the default I/O
scheduler in, e.g., Mageia, Manjaro, Sabayon, OpenMandriva and Arch
Linux ARM, plus several kernel forks for PCs and smartphones. In
addition, BFQ is optionally available in, e.g., Arch, PCLinuxOS and
Gentoo, and we record several downloads a day from people using other
distributions. The feedback received so far basically confirms the
expected latency drop and throughput boost.

Thanks,
Paolo

Results on a Plextor PX-256M5S SSD

The first two rows of the next table report the aggregate throughput
achieved by BFQ, CFQ, DEADLINE and NOOP while ten parallel processes
each read, either sequentially or randomly, a separate portion of the
device. These processes read directly from the device, and no process
performs writes, to avoid writing large files repeatedly
and wearing out the device during the many tests done. As can be seen,
all schedulers achieve about the same throughput with sequential
readers, whereas, with random readers, the throughput slightly grows
as the complexity, and hence the execution time, of the schedulers
decreases. In fact, with random readers, the number of IOPS is much
higher, and all CPUs spend all of their time either executing
instructions or waiting for I/O (the total idle percentage is
0). Therefore, the processing time of I/O requests influences the
maximum achievable throughput.

The remaining rows report the cold-cache start-up time experienced by
various applications while one of the above two workloads is being
executed in parallel. In particular, "Start-up time 10 seq/rand"
stands for "Start-up time of the application at hand while 10
sequential/random readers are running". A timeout fires, and the test
is aborted, if the application does not start within 60 seconds; so,
in the table, '>60' means that the application did not start before
the timeout fired.

With sequential readers, the performance gap between BFQ and the other
schedulers is remarkable. Background workloads are intentionally very
heavy, to show the performance of the schedulers in somewhat extreme
conditions. Differences are, however, still significant with lighter
workloads too, as shown, e.g., here [4] for slower devices.

-----------------------------------------------------------------------------
|                      SCHEDULER                    |        Test           |
-----------------------------------------------------------------------------
|    BFQ     |    CFQ     |  DEADLINE  |    NOOP    |                       |
-----------------------------------------------------------------------------
|            |            |            |            | Aggregate Throughput  |
|            |            |            |            |       [MB/s]          |
|    399     |    400     |    400     |    400     |  10 raw seq. readers  |
|    191     |    193     |    202     |    203     | 10 raw random readers |
-----------------------------------------------------------------------------
|            |            |            |            | Start-up time 10 seq  |
|            |            |            |            |       [sec]           |
|    0.21    |    >60     |    1.91    |    1.88    |      xterm            |
|    0.93    |    >60     |    10.2    |    10.8    |     oowriter          |
|    0.89    |    >60     |    29.7    |    30.0    |      konsole          |
-----------------------------------------------------------------------------
|            |            |            |            | Start-up time 10 rand |
|            |            |            |            |       [sec]           |
|    0.20    |    0.30    |    0.21    |    0.21    |      xterm            |
|    0.81    |    3.28    |    0.80    |    0.81    |     oowriter          |
|    0.88    |    2.90    |    1.02    |    1.00    |      konsole          |
-----------------------------------------------------------------------------


[1] https://lkml.org/lkml/2008/4/1/234

[2] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
    Scheduler", Proceedings of the First Workshop on Mobile System
    Technologies (MST-2015), May 2015.
    http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf

[3] https://youtu.be/1cjZeaCXIyM

[4] http://algogroup.unimore.it/people/paolo/disk_sched/results.php

[5] https://lkml.org/lkml/2016/2/1/818

Arianna Avanzini (4):
  block, bfq: add full hierarchical scheduling and cgroups support
  block, bfq: add Early Queue Merge (EQM)
  block, bfq: reduce idling only in symmetric scenarios
  block, bfq: handle bursts of queue activations

Paolo Valente (10):
  block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
  block, bfq: improve throughput boosting
  block, bfq: modify the peak-rate estimator
  block, bfq: add more fairness with writes and slow processes
  block, bfq: improve responsiveness
  block, bfq: reduce I/O latency for soft real-time applications
  block, bfq: preserve a low latency also with NCQ-capable drives
  block, bfq: reduce latency during request-pool saturation
  block, bfq: boost the throughput on NCQ-capable flash-based devices
  block, bfq: boost the throughput with random I/O on NCQ-capable HDDs

 Documentation/block/00-INDEX        |    2 +
 Documentation/block/bfq-iosched.txt |  516 +++
 block/Kconfig.iosched               |   27 +
 block/Makefile                      |    1 +
 block/bfq-iosched.c                 | 8195 +++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h              |    2 +-
 6 files changed, 8742 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/block/bfq-iosched.txt
 create mode 100644 block/bfq-iosched.c

-- 
2.10.0


* [PATCH 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26  9:27 [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Paolo Valente
@ 2016-10-26  9:27 ` Paolo Valente
  2016-10-26  9:27 ` [PATCH 02/14] block, bfq: add full hierarchical scheduling and cgroups support Paolo Valente
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-26  9:27 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	hare, arnd, bart.vanassche, grant.likely, jack, James.Bottomley,
	Paolo Valente, Fabio Checconi, Arianna Avanzini

We tag as v0 the version of BFQ containing only BFQ's engine plus
hierarchical support. BFQ's engine is introduced by this commit, while
hierarchical support is added by the next commit. We use the v0 tag to
distinguish this minimal version of BFQ from the versions that also
contain the features and improvements added by the next commits. BFQ-v0
coincides with the version of BFQ submitted a few years ago [1], apart
from the introduction of preemption, described below.

BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.

- Each process doing I/O on a device is associated with a weight and a
  (bfq_)queue.

- BFQ grants exclusive access to the device, for a while, to one queue
  (process) at a time, and implements this service model by
  associating every queue with a budget, measured in number of
  sectors.

  - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

  - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
      holding the device for too long and dramatically reducing
      throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
      sync requests may not be expired immediately when it empties.
      Instead, BFQ may idle the device for a short time interval,
      giving the process the chance to go on being served if it issues
      a new request in time. Device idling typically boosts the
      throughput on rotational devices, if processes do synchronous
      and sequential I/O. In addition, under BFQ, device idling is
      also instrumental in guaranteeing the desired throughput
      fraction to processes issuing sync requests (see [2] for
      details).

      - With respect to idling for service guarantees, if several
        processes are competing for the device at the same time, but
        all processes (and groups, after the following commit) have
        the same weight, then BFQ guarantees the expected throughput
        distribution without ever idling the device. Throughput is
        thus as high as possible in this common scenario.

  - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity.  See [2] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling. However, for a cleaner
    logical breakdown, the code that enables and completes
    hierarchical support is provided in the next commit, which focuses
    exactly on this feature.

  - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

  - The last property, budget independence (although perhaps
    counterintuitive at first), is definitely beneficial, for
    the following reasons:

    - First, with any proportional-share scheduler, the maximum
      deviation with respect to an ideal service is proportional to
      the maximum budget (slice) assigned to queues. As a consequence,
      BFQ can keep this deviation tight not only because of the
      accurate service of B-WF2Q+, but also because BFQ *does not*
      need to assign a larger budget to a queue to let the queue
      receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
      budget that best fits the needs of the process, or best
      leverages the I/O pattern of the process. In particular, BFQ
      updates queue budgets with a simple feedback-loop algorithm that
      allows a high throughput to be achieved, while still providing
      tight latency guarantees to time-sensitive applications. When
      the in-service queue expires, this algorithm computes the next
      budget of the queue so as to:

      - Let large budgets be eventually assigned to the queues
        associated with I/O-bound applications performing sequential
        I/O: in fact, the longer these applications are served once they
        get access to the device, the higher the throughput is.

      - Let small budgets be eventually assigned to the queues
        associated with time-sensitive applications (which typically
        perform sporadic and short I/O), because, the smaller the
        budget assigned to a queue waiting for service is, the sooner
        B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).

- Weights can be assigned to processes only indirectly, through I/O
  priorities, and according to the relation:
  weight = 10 * (IOPRIO_BE_NR - ioprio).
  A small sketch of this mapping is included right after this list.
  The next patch provides, instead, a cgroups interface through which
  weights can be assigned explicitly.

- If several processes are competing for the device at the same time,
  but all processes and groups have the same weight, then BFQ
  guarantees the expected throughput distribution without ever idling
  the device. It uses preemption instead. Throughput is then much
  higher in this common scenario.

- ioprio classes are served in strict priority order, i.e.,
  lower-priority queues are not served as long as there are
  higher-priority queues.  Among queues in the same class, the
  bandwidth is distributed in proportion to the weight of each
  queue. A very thin extra bandwidth is however guaranteed to the Idle
  class, to prevent it from starving.

- If the strict_guarantees parameter is set (default: unset), then BFQ
     - always performs idling when the in-service queue becomes empty;
     - forces the device to serve one I/O request at a time, by
       dispatching a new request only if there is no outstanding
       request.
  In the presence of differentiated weights or I/O-request sizes,
  both the above conditions are needed to guarantee that every
  queue receives its allotted share of the bandwidth (see
  Documentation/block/bfq-iosched.txt for more details). Setting
  strict_guarantees may evidently affect throughput.
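
As a small illustration of the ioprio-to-weight relation mentioned in
the list above (a sketch only: the helper name below is illustrative
and not necessarily the one used in the patch; IOPRIO_BE_NR is 8 in
current kernels):

#include <stdio.h>

#define IOPRIO_BE_NR                    8
#define BFQ_WEIGHT_CONVERSION_COEFF     10

/* weight = 10 * (IOPRIO_BE_NR - ioprio), as stated above */
static int ioprio_to_bfq_weight(int ioprio)
{
        return BFQ_WEIGHT_CONVERSION_COEFF * (IOPRIO_BE_NR - ioprio);
}

int main(void)
{
        int ioprio;

        /* best-effort priorities 0 (highest) .. 7 (lowest) map to 80 .. 10 */
        for (ioprio = 0; ioprio < IOPRIO_BE_NR; ioprio++)
                printf("ioprio %d -> weight %d\n", ioprio,
                       ioprio_to_bfq_weight(ioprio));
        return 0;
}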

[1] https://lkml.org/lkml/2008/4/1/234
    https://lkml.org/lkml/2008/11/11/148

[2] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
							results.pdf

Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 Documentation/block/00-INDEX        |    2 +
 Documentation/block/bfq-iosched.txt |  516 +++++
 block/Kconfig.iosched               |   19 +
 block/Makefile                      |    1 +
 block/bfq-iosched.c                 | 4045 +++++++++++++++++++++++++++++++++++
 5 files changed, 4583 insertions(+)
 create mode 100644 Documentation/block/bfq-iosched.txt
 create mode 100644 block/bfq-iosched.c

diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index e55103a..8d55b4b 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -1,5 +1,7 @@
 00-INDEX
 	- This file
+bfq-iosched.txt
+	- BFQ IO scheduler and its tunables
 biodoc.txt
 	- Notes on the Generic Block Layer Rewrite in Linux 2.5
 biovecs.txt
diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
new file mode 100644
index 0000000..5ba67af
--- /dev/null
+++ b/Documentation/block/bfq-iosched.txt
@@ -0,0 +1,516 @@
+BFQ (Budget Fair Queueing)
+==========================
+
+BFQ is a proportional-share I/O scheduler, with some extra
+low-latency capabilities. In addition to cgroups support (blkio or io
+controllers), BFQ's main features are:
+- BFQ guarantees a high system and application responsiveness, and a
+  low latency for time-sensitive applications, such as audio or video
+  players;
+- BFQ distributes bandwidth, and not just time, among processes or
+  groups (switching back to time distribution when needed to keep
+  throughput high).
+
+On average CPUs, the current version of BFQ can handle devices
+performing at most ~30 KIOPS; at most ~50 KIOPS on faster CPUs. As a
+reference, 30-50 KIOPS correspond to very high bandwidths with
+sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
+to 120-200 MB/s with 4KB random I/O.
+
+The table of contents follows. Impatient readers can jump straight to Section 3.
+
+CONTENTS
+
+1. When may BFQ be useful?
+ 1-1 Personal systems
+ 1-2 Server systems
+2. How does BFQ work?
+3. What are BFQ's tunables?
+4. Group scheduling with BFQ
+ 4-1 Service guarantees provided
+ 4-2 Interface
+
+1. When may BFQ be useful?
+==========================
+
+BFQ provides the following benefits on personal and server systems.
+
+1-1 Personal systems
+--------------------
+
+Low latency for interactive applications
+
+Regardless of the actual background workload, BFQ guarantees that, for
+interactive tasks, the storage device is virtually as responsive as if
+it were idle. For example, even if one or more of the following
+background workloads are being executed:
+- one or more large files are being read, written or copied,
+- a tree of source files is being compiled,
+- one or more virtual machines are performing I/O,
+- a software update is in progress,
+- indexing daemons are scanning filesystems and updating their
+  databases,
+starting an application or loading a file from within an application
+takes about the same time as if the storage device were idle. As a
+comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
+applications experience high latencies, or even become unresponsive
+until the background workload terminates (even on SSDs).
+
+Low latency for soft real-time applications
+
+Soft real-time applications too, such as audio and video
+players/streamers, enjoy low latency and a low drop rate, regardless
+of the background I/O workload. As a consequence, these applications
+suffer almost no glitches due to the background workload.
+
+Higher speed for code-development tasks
+
+If some additional workload happens to be executed in parallel, then
+BFQ executes the I/O-related components of typical code-development
+tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
+NOOP or DEADLINE.
+
+High throughput
+
+On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
+up to 150% higher throughput than DEADLINE and NOOP, with all the
+sequential workloads considered in our tests. With random workloads,
+and with all the workloads on flash-based devices, BFQ achieves,
+instead, about the same throughput as the other schedulers.
+
+Strong fairness, bandwidth and delay guarantees
+
+BFQ distributes the device throughput, and not just the device time,
+among I/O-bound applications in proportion to their weights, with any
+workload and regardless of the device parameters. From these bandwidth
+guarantees, it is possible to compute tight per-I/O-request delay
+guarantees by a simple formula. If not configured for strict service
+guarantees, BFQ switches to time-based resource sharing (only) for
+applications that would otherwise cause a throughput loss.
+
+1-2 Server systems
+------------------
+
+Most benefits for server systems follow from the same service
+properties as above. In particular, regardless of whether additional,
+possibly heavy workloads are being served, BFQ guarantees:
+
+. audio and video-streaming with zero or very low jitter and drop
+  rate;
+
+. fast retrieval of WEB pages and embedded objects;
+
+. real-time recording of data in live-dumping applications (e.g.,
+  packet logging);
+
+. responsiveness in local and remote access to a server.
+
+
+2. How does BFQ work?
+=====================
+
+BFQ is a proportional-share I/O scheduler, whose general structure,
+plus a lot of code, are borrowed from CFQ.
+
+- Each process doing I/O on a device is associated with a weight and a
+  (bfq_)queue.
+
+- BFQ grants exclusive access to the device, for a while, to one queue
+  (process) at a time, and implements this service model by
+  associating every queue with a budget, measured in number of
+  sectors.
+
+  - After a queue is granted access to the device, the budget of the
+    queue is decremented, on each request dispatch, by the size of the
+    request.
+
+  - The in-service queue is expired, i.e., its service is suspended,
+    only if one of the following events occurs: 1) the queue finishes
+    its budget, 2) the queue empties, 3) a "budget timeout" fires.
+
+    - The budget timeout prevents processes doing random I/O from
+      holding the device for too long and dramatically reducing
+      throughput.
+
+    - Actually, as in CFQ, a queue associated with a process issuing
+      sync requests may not be expired immediately when it empties.
+      Instead, BFQ may idle the device for a short time interval,
+      giving the process the chance to go on being served if it issues
+      a new request in time. Device idling typically boosts the
+      throughput on rotational devices, if processes do synchronous
+      and sequential I/O. In addition, under BFQ, device idling is
+      also instrumental in guaranteeing the desired throughput
+      fraction to processes issuing sync requests (see the description
+      of the slice_idle tunable in this document, or [1, 2], for more
+      details).
+
+      - With respect to idling for service guarantees, if several
+	processes are competing for the device at the same time, but
+	all processes (and groups, after the following commit) have
+	the same weight, then BFQ guarantees the expected throughput
+	distribution without ever idling the device. Throughput is
+	thus as high as possible in this common scenario.
+
+  - If low-latency mode is enabled (default configuration), BFQ
+    executes some special heuristics to detect interactive and soft
+    real-time applications (e.g., video or audio players/streamers),
+    and to reduce their latency. The most important action taken to
+    achieve this goal is to give to the queues associated with these
+    applications more than their fair share of the device
+    throughput. For brevity, we call "weight-raising" the whole set
+    of actions taken by BFQ to privilege these queues. In
+    particular, BFQ provides a milder form of weight-raising for
+    interactive applications, and a stronger form for soft real-time
+    applications.
+
+  - BFQ automatically deactivates idling for queues born in a burst of
+    queue creations. In fact, these queues are usually associated with
+    the processes of applications and services that benefit mostly
+    from a high throughput. Examples are systemd during boot, or git
+    grep.
+
+  - Like CFQ, BFQ merges queues performing interleaved I/O, i.e.,
+    performing random I/O that becomes mostly sequential if
+    merged. Unlike CFQ, BFQ achieves this goal with a more
+    reactive mechanism, called Early Queue Merge (EQM). EQM is so
+    responsive in detecting interleaved I/O (cooperating processes)
+    that it enables BFQ to achieve a high throughput, by queue
+    merging, even for queues for which CFQ needs a different
+    mechanism, preemption, to get a high throughput. As such, EQM is a
+    unified mechanism to achieve a high throughput with interleaved
+    I/O.
+
+  - Queues are scheduled according to a variant of WF2Q+, named
+    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
+    O(log N) overall complexity.  See [2] for more details. B-WF2Q+ is
+    also ready for hierarchical scheduling. However, for a cleaner
+    logical breakdown, the code that enables and completes
+    hierarchical support is provided in the next commit, which focuses
+    exactly on this feature.
+
+  - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
+    perfectly fair, and smooth service. In particular, B-WF2Q+
+    guarantees that each queue receives a fraction of the device
+    throughput proportional to its weight, even if the throughput
+    fluctuates, and regardless of: the device parameters, the current
+    workload and the budgets assigned to the queue.
+
+  - The last property, budget independence (although perhaps
+    counterintuitive at first), is definitely beneficial, for
+    the following reasons:
+
+    - First, with any proportional-share scheduler, the maximum
+      deviation with respect to an ideal service is proportional to
+      the maximum budget (slice) assigned to queues. As a consequence,
+      BFQ can keep this deviation tight not only because of the
+      accurate service of B-WF2Q+, but also because BFQ *does not*
+      need to assign a larger budget to a queue to let the queue
+      receive a higher fraction of the device throughput.
+
+    - Second, BFQ is free to choose, for every process (queue), the
+      budget that best fits the needs of the process, or best
+      leverages the I/O pattern of the process. In particular, BFQ
+      updates queue budgets with a simple feedback-loop algorithm that
+      allows a high throughput to be achieved, while still providing
+      tight latency guarantees to time-sensitive applications. When
+      the in-service queue expires, this algorithm computes the next
+      budget of the queue so as to:
+
+      - Let large budgets be eventually assigned to the queues
+	associated with I/O-bound applications performing sequential
+	I/O: in fact, the longer these applications are served once they
+	get access to the device, the higher the throughput is.
+
+      - Let small budgets be eventually assigned to the queues
+	associated with time-sensitive applications (which typically
+	perform sporadic and short I/O), because, the smaller the
+	budget assigned to a queue waiting for service is, the sooner
+	B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
+
+- If several processes are competing for the device at the same time,
+  but all processes and groups have the same weight, then BFQ
+  guarantees the expected throughput distribution without ever idling
+  the device. It uses preemption instead. Throughput is then much
+  higher in this common scenario.
+
+- ioprio classes are served in strict priority order, i.e.,
+  lower-priority queues are not served as long as there are
+  higher-priority queues.  Among queues in the same class, the
+  bandwidth is distributed in proportion to the weight of each
+  queue. A very thin extra bandwidth is however guaranteed to
+  the Idle class, to prevent it from starving.
+
+
+3. What are BFQ's tunables?
+===========================
+
+The tunables back_seek_max, back_seek_penalty, fifo_expire_async and
+fifo_expire_sync below are the same as in CFQ. Their description is
+just copied from that for CFQ. Some considerations in the description
+of slice_idle are copied from CFQ too.
+
+per-process ioprio and weight
+-----------------------------
+
+Unless the cgroups interface is used, weights can be assigned to
+processes only indirectly, through I/O priorities, and according to
+the relation: weight = (IOPRIO_BE_NR - ioprio) * 10.
+
+slice_idle
+----------
+
+This parameter specifies how long BFQ should idle for the next I/O
+request, when certain sync BFQ queues become empty. By default
+slice_idle is a non-zero value. Idling has a double purpose: boosting
+throughput and making sure that the desired throughput distribution is
+respected (see the description of how BFQ works, and, if needed, the
+papers referred to there).
+
+As for throughput, idling can be very helpful on highly seeky media
+like single-spindle SATA/SAS disks, where it can cut down the overall
+number of seeks and improve throughput.
+
+Setting slice_idle to 0 will remove all the idling on queues and one
+should see an overall improved throughput on faster storage devices
+like multiple SATA/SAS disks in hardware RAID configuration.
+
+So, depending on storage and workload, it might be useful to set
+slice_idle=0.  In general, for SATA/SAS disks and software RAID of
+SATA/SAS disks, keeping slice_idle enabled should be useful. For any
+configuration where there are multiple spindles behind a single LUN
+(host-based hardware RAID controller or storage arrays), setting
+slice_idle=0 may result in better throughput and acceptable
+latencies.
+
+Idling is however necessary to have service guarantees enforced in
+case of differentiated weights or differentiated I/O-request lengths.
+To see why, suppose that a given BFQ queue A must get several I/O
+requests served for each request served for another queue B. Idling
+ensures that, if A makes a new I/O request slightly after becoming
+empty, then no request of B is dispatched in the middle, and thus A
+does not lose the possibility to get more than one request dispatched
+before the next request of B is dispatched. Note that idling
+guarantees the desired differentiated treatment of queues only in
+terms of I/O-request dispatches. To guarantee that the actual service
+order then corresponds to the dispatch order, the strict_guarantees
+tunable must be set too.
+
+There is an important flipside for idling: apart from the above cases
+where it is beneficial also for throughput, idling can severely impact
+throughput. One important case is random workload. Because of this
+issue, BFQ tends to avoid idling as much as possible, when it is not
+beneficial also for throughput. As a consequence of this behavior, and
+of further issues described for the strict_guarantees tunable,
+short-term service guarantees may be occasionally violated. And, in
+some cases, these guarantees may be more important than guaranteeing
+maximum throughput. For example, in video playing/streaming, a very
+low drop rate may be more important than maximum throughput. In these
+cases, consider setting the strict_guarantees parameter.
+
+strict_guarantees
+-----------------
+
+If this parameter is set (default: unset), then BFQ
+
+- always performs idling when the in-service queue becomes empty;
+
+- forces the device to serve one I/O request at a time, by dispatching a
+  new request only if there is no outstanding request.
+
+In the presence of differentiated weights or I/O-request sizes, both
+the above conditions are needed to guarantee that every BFQ queue
+receives its allotted share of the bandwidth. The first condition is
+needed for the reasons explained in the description of the slice_idle
+tunable.  The second condition is needed because all modern storage
+devices reorder internally-queued requests, which may trivially break
+the service guarantees enforced by the I/O scheduler.
+
+Setting strict_guarantees may evidently affect throughput.
+
+back_seek_max
+-------------
+
+This specifies, given in Kbytes, the maximum "distance" for backward seeking.
+The distance is the amount of space from the current head location to the
+sectors that are backward in terms of distance.
+
+This parameter allows the scheduler to anticipate requests in the "backward"
+direction and consider them as being the "next" if they are within this
+distance from the current head location.
+
+back_seek_penalty
+-----------------
+
+This parameter is used to compute the cost of backward seeking. If the
+backward distance of a request is just 1/back_seek_penalty from a "front"
+request, then the seek cost of the two requests is considered equivalent.
+
+So the scheduler will not bias toward one or the other request (otherwise it
+would bias toward the front request). The default value of back_seek_penalty is 2.
+
+fifo_expire_async
+-----------------
+
+This parameter is used to set the timeout of asynchronous requests. The
+default value is 248 ms.
+
+fifo_expire_sync
+----------------
+
+This parameter is used to set the timeout of synchronous requests. The
+default value is 124 ms. To favor synchronous requests over asynchronous
+ones, this value should be decreased relative to fifo_expire_async.
+
+low_latency
+-----------
+
+This parameter is used to enable/disable BFQ's low latency mode. By
+default, low latency mode is enabled. If enabled, interactive and soft
+real-time applications are privileged and experience a lower latency,
+as explained in more detail in the description of how BFQ works.
+
+timeout_sync
+------------
+
+Maximum amount of device time that can be given to a task (queue) once
+it has been selected for service. On devices with costly seeks,
+increasing this time usually increases maximum throughput. On the
+opposite end, increasing this time coarsens the granularity of the
+short-term bandwidth and latency guarantees, especially if the
+following parameter is set to zero.
+
+max_budget
+----------
+
+Maximum amount of service, measured in sectors, that can be provided
+to a BFQ queue once it is set in service (of course within the limits
+of the above timeout). As explained in the description of
+the algorithm, larger values increase the throughput in proportion to
+the percentage of sequential I/O requests issued. The price of larger
+values is that they coarsen the granularity of short-term bandwidth
+and latency guarantees.
+
+The default value is 0, which enables auto-tuning: BFQ sets max_budget
+to the maximum number of sectors that can be served during
+timeout_sync, according to the estimated peak rate.
+
+weights
+-------
+
+Read-only parameter, used to show the weights of the currently active
+BFQ queues.
+
+
+wr_ tunables
+------------
+
+BFQ exports a few parameters to control/tune the behavior of
+low-latency heuristics.
+
+wr_coeff
+
+Factor by which the weight of a weight-raised queue is multiplied. If
+the queue is deemed soft real-time, then the weight is further
+multiplied by an additional, constant factor.
+
+wr_max_time
+
+Maximum duration of a weight-raising period for an interactive task
+(ms). If set to zero (default value), then this value is computed
+automatically, as a function of the peak rate of the device. In any
+case, when the value of this parameter is read, it always reports the
+current duration, regardless of whether it has been set manually or
+computed automatically.
+
+wr_max_softrt_rate
+
+Maximum service rate below which a queue is deemed to be associated
+with a soft real-time application, and is then weight-raised
+accordingly (sectors/sec).
+
+wr_min_idle_time
+
+Minimum idle period after which interactive weight-raising may be
+reactivated for a queue (in ms).
+
+wr_rt_max_time
+
+Maximum weight-raising duration for soft real-time queues (in ms). The
+start time from which this duration is considered is automatically
+moved forward if the queue is detected to be still soft real-time
+before the current soft real-time weight-raising period finishes.
+
+wr_min_inter_arr_async
+
+Minimum period between I/O request arrivals after which weight-raising
+may be reactivated for an already busy async queue (in ms).
+
+
+4. Group scheduling with BFQ
+============================
+
+BFQ supports both cgroup-v1 and cgroup-v2 io controllers, namely blkio
+and io. In particular, BFQ supports weight-based proportional
+share.
+
+4-1 Service guarantees provided
+-------------------------------
+
+With BFQ, proportional share means true proportional share of the
+device bandwidth, according to group weights. For example, a group
+with weight 200 gets twice the bandwidth, and not just twice the time,
+of a group with weight 100.
+
+BFQ supports hierarchies (group trees) of any depth. Bandwidth is
+distributed among groups and processes in the expected way: for each
+group, the children of the group share the whole bandwidth of the
+group in proportion to their weights. In particular, this implies
+that, for each leaf group, every process of the group receives the
+same share of the whole group bandwidth, unless the ioprio of the
+process is modified.
+
+The resource-sharing guarantee for a group may partially or totally
+switch from bandwidth to time, if providing bandwidth guarantees to
+the group lowers the throughput too much. This switch occurs on a
+per-process basis: if a process of a leaf group causes throughput loss
+if served in such a way to receive its share of the bandwidth, then
+BFQ switches back to just time-based proportional share for that
+process.
+
+4-2 Interface
+-------------
+
+To get proportional sharing of bandwidth with BFQ for a given device,
+BFQ must of course be the active scheduler for that device.
+
+Within each group directory, the names of the files associated with
+BFQ-specific cgroup parameters and stats begin with the "bfq."
+prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
+BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
+parameter to set the weight of a group with BFQ is blkio.bfq.weight
+or io.bfq.weight.
+
+Parameters to set
+-----------------
+
+For each group, there is only the following parameter to set.
+
+weight (namely blkio.bfq.weight or io.bfq.weight): the weight of the
+group inside its parent. Available values: 1..10000 (default 100). The
+linear mapping between ioprio and weights, described at the beginning
+of the tunable section, is still valid, but all weights higher than
+IOPRIO_BE_NR*10 are mapped to ioprio 0.
+
+
+[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
+    Scheduler", Proceedings of the First Workshop on Mobile System
+    Technologies (MST-2015), May 2015.
+    http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
+
+[2] P. Valente and M. Andreolini, "Improving Application
+    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
+    the 5th Annual International Systems and Storage Conference
+    (SYSTOR '12), June 2012.
+    Slightly extended version:
+    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
+							results.pdf
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 421bef9..48434bc 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -39,6 +39,14 @@ config CFQ_GROUP_IOSCHED
 	---help---
 	  Enable group IO scheduling in CFQ.
 
+config IOSCHED_BFQ
+	tristate "BFQ I/O scheduler"
+	default n
+	---help---
+	  The BFQ I/O scheduler distributes bandwidth among all
+	  processes according to their weights, regardless of the
+	  device parameters and with any workload.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
@@ -52,6 +60,16 @@ choice
 	config DEFAULT_CFQ
 		bool "CFQ" if IOSCHED_CFQ=y
 
+	config DEFAULT_BFQ
+		bool "BFQ" if IOSCHED_BFQ=y
+		help
+		  Selects BFQ as the default I/O scheduler which will be
+		  used by default for all block devices.
+		  The BFQ I/O scheduler aims at distributing the bandwidth
+		  as desired, independently of the disk parameters and with
+		  any workload. It also tries to guarantee low latency to
+		  interactive and soft real-time applications.
+
 	config DEFAULT_NOOP
 		bool "No-op"
 
@@ -61,6 +79,7 @@ config DEFAULT_IOSCHED
 	string
 	default "deadline" if DEFAULT_DEADLINE
 	default "cfq" if DEFAULT_CFQ
+	default "bfq" if DEFAULT_BFQ
 	default "noop" if DEFAULT_NOOP
 
 endmenu
diff --git a/block/Makefile b/block/Makefile
index 36acdd7..736e91a 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
+obj-$(CONFIG_IOSCHED_BFQ)	+= bfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
new file mode 100644
index 0000000..643aeef
--- /dev/null
+++ b/block/bfq-iosched.c
@@ -0,0 +1,4045 @@
+/*
+ * Budget Fair Queueing (BFQ) I/O scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
+ *                    Arianna Avanzini <avanzini@google.com>
+ *
+ * Copyright (C) 2016 Paolo Valente <paolo.valente@linaro.org>
+ *
+ * Licensed under GPL-2.
+ *
+ * BFQ [1] is a proportional-share storage-I/O scheduling algorithm
+ * based, as CFQ, on a slice-by-slice service scheme. Yet, differently
+ * from CFQ, BFQ does not assign a time slice to each process doing
+ * I/O. Instead, BFQ assigns a budget, measured in number of sectors:
+ * once selected for service, a process is granted access to the
+ * device until it exhausts its assigned budget. This change from the
+ * time to the service domain enables BFQ to distribute the device
+ * throughput among processes as desired, without any distortion due
+ * to throughput fluctuations, or to device internal queueing.
+ *
+ * More precisely, BFQ associates an I/O-request queue with each process
+ * doing I/O, and uses an accurate internal scheduler, called B-WF2Q+,
+ * to schedule queues according to process budgets. Each process/queue
+ * is also assigned a user-configurable weight, and B-WF2Q+ guarantees
+ * that each queue receives a fraction of the throughput proportional
+ * to its weight. In addition, B-WF2Q+ enables BFQ to schedule queues
+ * in such a way to boost the throughput and at the same time
+ * guarantee a low latency to non-I/O bound processes (the latter
+ * often belong to time-sensitive applications).
+ *
+ * B-WF2Q+ is based on WF2Q+, which is described in [2], while the
+ * augmented tree used here to implement B-WF2Q+ with O(log N)
+ * complexity derives from the one introduced with EEVDF in [3].
+ *
+ * [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
+ *     Scheduler", Proceedings of the First Workshop on Mobile System
+ *     Technologies (MST-2015), May 2015.
+ *
+ * http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
+ *
+ * [2] Jon C.R. Bennett and H. Zhang, "Hierarchical Packet Fair Queueing
+ *     Algorithms", IEEE/ACM Transactions on Networking, 5(5):675-689,
+ *     Oct 1997.
+ *
+ * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
+ *
+ * [3] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline
+ *     First: A Flexible and Accurate Mechanism for Proportional Share
+ *     Resource Allocation", technical report.
+ *
+ * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
+ */
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/elevator.h>
+#include <linux/ktime.h>
+#include <linux/rbtree.h>
+#include <linux/ioprio.h>
+#include "blk.h"
+#include <linux/blktrace_api.h>
+#include <linux/hrtimer.h>
+#include <linux/ioprio.h>
+#include <linux/blk-cgroup.h>
+
+#define BFQ_IOPRIO_CLASSES	3
+#define BFQ_CL_IDLE_TIMEOUT	(HZ/5)
+
+#define BFQ_MIN_WEIGHT			1
+#define BFQ_MAX_WEIGHT			1000
+#define BFQ_WEIGHT_CONVERSION_COEFF	10
+
+#define BFQ_DEFAULT_QUEUE_IOPRIO	4
+
+#define BFQ_DEFAULT_GRP_WEIGHT	10
+#define BFQ_DEFAULT_GRP_IOPRIO	0
+#define BFQ_DEFAULT_GRP_CLASS	IOPRIO_CLASS_BE
+
+struct bfq_entity;
+
+/**
+ * struct bfq_service_tree - per ioprio_class service tree.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * bfq_service_tree.  All the fields are protected by the queue lock
+ * of the containing bfqd.
+ */
+struct bfq_service_tree {
+	/* tree for active entities (i.e., those backlogged) */
+	struct rb_root active;
+	/* tree for idle entities (i.e., not backlogged, with V <= F_i)*/
+	struct rb_root idle;
+
+	struct bfq_entity *first_idle;	/* idle entity with minimum F_i */
+	struct bfq_entity *last_idle;	/* idle entity with maximum F_i */
+
+	u64 vtime; /* scheduler virtual time */
+	/* scheduler weight sum; active and idle entities contribute to it */
+	unsigned long wsum;
+};
+
+/**
+ * struct bfq_sched_data - multi-class scheduler.
+ *
+ * bfq_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_in_service points to the active entity of the sched_data
+ * service trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among requests of the same
+ * queue requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct bfq_sched_data {
+	struct bfq_entity *in_service_entity;  /* entity in service */
+	/* head-of-the-line entity in the scheduler */
+	struct bfq_entity *next_in_service;
+	/* array of service trees, one per ioprio_class */
+	struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
+};
+
+/**
+ * struct bfq_entity - schedulable entity.
+ *
+ * A bfq_entity is used to represent a bfq_queue (leaf node in the upper
+ * level scheduler). Each entity belongs to the sched_data of the parent
+ * group hierarchy. Non-leaf entities have also their own sched_data,
+ * stored in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would
+ * allow different weights on different devices, but this
+ * functionality is not exported to userspace by now.  Priorities and
+ * weights are updated lazily, first storing the new values into the
+ * new_* fields, then setting the @prio_changed flag.  As soon as
+ * there is a transition in the entity state that allows the priority
+ * update to take place the effective and the requested priority
+ * values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.  When dealing with  ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time to consume their budget
+ * and have true sequential behavior, and when there are no external
+ * factors breaking anticipation) the relative weights at each level
+ * of the hierarchy should be guaranteed.  All the fields are
+ * protected by the queue lock of the containing bfqd.
+ */
+struct bfq_entity {
+	struct rb_node rb_node; /* service_tree member */
+
+	/*
+	 * flag, true if the entity is on a tree (either the active or
+	 * the idle one of its service_tree).
+	 */
+	int on_st;
+
+	u64 finish; /* B-WF2Q+ finish timestamp (aka F_i) */
+	u64 start;  /* B-WF2Q+ start timestamp (aka S_i) */
+
+	/* tree the entity is enqueued into; %NULL if not on a tree */
+	struct rb_root *tree;
+
+	/*
+	 * minimum start time of the (active) subtree rooted at this
+	 * entity; used for O(log N) lookups into active trees
+	 */
+	u64 min_start;
+
+	/* amount of service received during the last service slot */
+	int service;
+
+	/* budget, used also to calculate F_i: F_i = S_i + @budget / @weight */
+	int budget;
+
+	unsigned short weight;	/* weight of the queue */
+	unsigned short new_weight; /* next weight if a change is in progress */
+
+	/* original weight, used to implement weight boosting */
+	unsigned short orig_weight;
+
+	/* parent entity, for hierarchical scheduling */
+	struct bfq_entity *parent;
+
+	/*
+	 * For non-leaf nodes in the hierarchy, the associated
+	 * scheduler queue, %NULL on leaf nodes.
+	 */
+	struct bfq_sched_data *my_sched_data;
+	/* the scheduler queue this entity belongs to */
+	struct bfq_sched_data *sched_data;
+
+	/* flag, set to request a weight, ioprio or ioprio_class change  */
+	int prio_changed;
+};
+
+/**
+ * struct bfq_queue - leaf schedulable entity.
+ *
+ * A bfq_queue is a leaf request queue; it can be associated with an
+ * io_context or more, if it is async.
+ */
+struct bfq_queue {
+	/* reference counter */
+	int ref;
+	/* parent bfq_data */
+	struct bfq_data *bfqd;
+
+	/* current ioprio and ioprio class */
+	unsigned short ioprio, ioprio_class;
+	/* next ioprio and ioprio class if a change is in progress */
+	unsigned short new_ioprio, new_ioprio_class;
+
+	/* sorted list of pending requests */
+	struct rb_root sort_list;
+	/* if fifo isn't expired, next request to serve */
+	struct request *next_rq;
+	/* number of sync and async requests queued */
+	int queued[2];
+	/* number of sync and async requests currently allocated */
+	int allocated[2];
+	/* number of pending metadata requests */
+	int meta_pending;
+	/* fifo list of requests in sort_list */
+	struct list_head fifo;
+
+	/* entity representing this queue in the scheduler */
+	struct bfq_entity entity;
+
+	/* maximum budget allowed from the feedback mechanism */
+	int max_budget;
+	/* budget expiration (in jiffies) */
+	unsigned long budget_timeout;
+
+	/* number of requests on the dispatch list or inside driver */
+	int dispatched;
+
+	unsigned int flags; /* status flags.*/
+
+	/* node for active/idle bfqq list inside parent bfqd */
+	struct list_head bfqq_list;
+
+	/* bit vector: a 1 for each seeky requests in history */
+	u32 seek_history;
+	/* position of the last request enqueued */
+	sector_t last_request_pos;
+
+	/* Number of consecutive pairs of request completion and
+	 * arrival, such that the queue becomes idle after the
+	 * completion, but the next request arrives within an idle
+	 * time slice; used only if the queue's IO_bound flag has been
+	 * cleared.
+	 */
+	unsigned int requests_within_timer;
+
+	/* pid of the process owning the queue, used for logging purposes */
+	pid_t pid;
+};
+
+/**
+ * struct bfq_ttime - per process thinktime stats.
+ */
+struct bfq_ttime {
+	u64 last_end_request; /* completion time of last request */
+
+	u64 ttime_total; /* total process thinktime */
+	unsigned long ttime_samples; /* number of thinktime samples */
+	u64 ttime_mean; /* average process thinktime */
+
+};
+
+/**
+ * struct bfq_io_cq - per (request_queue, io_context) structure.
+ */
+struct bfq_io_cq {
+	/* associated io_cq structure */
+	struct io_cq icq; /* must be the first member */
+	/* array of two process queues, the sync and the async */
+	struct bfq_queue *bfqq[2];
+	/* associated @bfq_ttime struct */
+	struct bfq_ttime ttime;
+	/* per (request_queue, blkcg) ioprio */
+	int ioprio;
+};
+
+enum bfq_device_speed {
+	BFQ_BFQD_FAST,
+	BFQ_BFQD_SLOW,
+};
+
+/**
+ * struct bfq_data - per-device data structure.
+ *
+ * All the fields are protected by the @queue lock.
+ */
+struct bfq_data {
+	/* request queue for the device */
+	struct request_queue *queue;
+
+	/* root @bfq_sched_data for the device */
+	struct bfq_sched_data sched_data;
+
+	/*
+	 * Number of bfq_queues containing requests (including the
+	 * queue in service, even if it is idling).
+	 */
+	int busy_queues;
+	/* number of queued requests */
+	int queued;
+	/* number of requests dispatched and waiting for completion */
+	int rq_in_driver;
+
+	/*
+	 * Maximum number of requests in driver in the last
+	 * @hw_tag_samples completed requests.
+	 */
+	int max_rq_in_driver;
+	/* number of samples used to calculate hw_tag */
+	int hw_tag_samples;
+	/* flag set to one if the driver is showing a queueing behavior */
+	int hw_tag;
+
+	/* number of budgets assigned */
+	int budgets_assigned;
+
+	/*
+	 * Timer set when idling (waiting) for the next request from
+	 * the queue in service.
+	 */
+	struct hrtimer idle_slice_timer;
+	/* delayed work to restart dispatching on the request queue */
+	struct work_struct unplug_work;
+
+	/* bfq_queue in service */
+	struct bfq_queue *in_service_queue;
+	/* bfq_io_cq (bic) associated with the @in_service_queue */
+	struct bfq_io_cq *in_service_bic;
+
+	/* on-disk position of the last served request */
+	sector_t last_position;
+
+	/* beginning of the last budget */
+	ktime_t last_budget_start;
+	/* beginning of the last idle slice */
+	ktime_t last_idling_start;
+	/* number of samples used to calculate @peak_rate */
+	int peak_rate_samples;
+	/* peak transfer rate observed for a budget */
+	u64 peak_rate;
+	/* maximum budget allotted to a bfq_queue before rescheduling */
+	int bfq_max_budget;
+
+	/* list of all the bfq_queues active on the device */
+	struct list_head active_list;
+	/* list of all the bfq_queues idle on the device */
+	struct list_head idle_list;
+
+	/*
+	 * Timeout for async/sync requests; when it fires, requests
+	 * are served in fifo order.
+	 */
+	u64 bfq_fifo_expire[2];
+	/* weight of backward seeks wrt forward ones */
+	unsigned int bfq_back_penalty;
+	/* maximum allowed backward seek */
+	unsigned int bfq_back_max;
+	/* maximum idling time */
+	u32 bfq_slice_idle;
+	/* last time CLASS_IDLE was served */
+	u64 bfq_class_idle_last_service;
+
+	/* user-configured max budget value (0 for auto-tuning) */
+	int bfq_user_max_budget;
+	/*
+	 * Timeout for bfq_queues to consume their budget; used to
+	 * prevent seeky queues from imposing long latencies to
+	 * sequential or quasi-sequential ones (this also implies that
+	 * seeky queues cannot receive guarantees in the service
+	 * domain; after a timeout they are charged for the time they
+	 * have been in service, to preserve fairness among them, but
+	 * without service-domain guarantees).
+	 */
+	unsigned int bfq_timeout;
+
+	/*
+	 * Number of consecutive requests that must be issued within
+	 * the idle time slice to set again idling to a queue which
+	 * was marked as non-I/O-bound (see the definition of the
+	 * IO_bound flag for further details).
+	 */
+	unsigned int bfq_requests_within_timer;
+
+	/*
+	 * Force device idling whenever needed to provide accurate
+	 * service guarantees, without caring about throughput
+	 * issues. CAVEAT: this may even increase latencies, in case
+	 * of useless idling for processes that did stop doing I/O.
+	 */
+	bool strict_guarantees;
+
+	/* fallback dummy bfqq for extreme OOM conditions */
+	struct bfq_queue oom_bfqq;
+};
+
+enum bfqq_state_flags {
+	BFQ_BFQQ_FLAG_busy = 0,		/* has requests or is in service */
+	BFQ_BFQQ_FLAG_wait_request,	/* waiting for a request */
+	BFQ_BFQQ_FLAG_non_blocking_wait_rq, /*
+					     * waiting for a request
+					     * without idling the device
+					     */
+	BFQ_BFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
+	BFQ_BFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
+	BFQ_BFQQ_FLAG_idle_window,	/* slice idling enabled */
+	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */
+	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this budget */
+	BFQ_BFQQ_FLAG_IO_bound,		/*
+					 * bfqq has timed-out at least once
+					 * having consumed at most 2/10 of
+					 * its budget
+					 */
+};
+
+#define BFQ_BFQQ_FNS(name)						\
+static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)		\
+{									\
+	(bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name);			\
+}									\
+static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)		\
+{									\
+	(bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name);			\
+}									\
+static int bfq_bfqq_##name(const struct bfq_queue *bfqq)		\
+{									\
+	return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0;	\
+}
+
+BFQ_BFQQ_FNS(busy);
+BFQ_BFQQ_FNS(wait_request);
+BFQ_BFQQ_FNS(non_blocking_wait_rq);
+BFQ_BFQQ_FNS(must_alloc);
+BFQ_BFQQ_FNS(fifo_expire);
+BFQ_BFQQ_FNS(idle_window);
+BFQ_BFQQ_FNS(sync);
+BFQ_BFQQ_FNS(budget_new);
+BFQ_BFQQ_FNS(IO_bound);
+#undef BFQ_BFQQ_FNS
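+/*
+ * For instance, BFQ_BFQQ_FNS(busy) above expands to bfq_mark_bfqq_busy(),
+ * bfq_clear_bfqq_busy() and bfq_bfqq_busy(), which respectively set, clear
+ * and test the BFQ_BFQQ_FLAG_busy bit in bfqq->flags.
+ */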
+
+/* Logging facilities. */
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
+	blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
+
+#define bfq_log(bfqd, fmt, args...) \
+	blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
+
+/* Expiration reasons. */
+enum bfqq_expiration {
+	BFQ_BFQQ_TOO_IDLE = 0,		/*
+					 * queue has been idling for
+					 * too long
+					 */
+	BFQ_BFQQ_BUDGET_TIMEOUT,	/* budget took too long to be used */
+	BFQ_BFQQ_BUDGET_EXHAUSTED,	/* budget consumed */
+	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
+	BFQ_BFQQ_PREEMPTED		/* preemption in progress */
+};
+
+static struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);
+
+static struct bfq_service_tree *
+bfq_entity_service_tree(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sched_data = entity->sched_data;
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	unsigned int idx = bfqq ? bfqq->ioprio_class - 1 :
+				  BFQ_DEFAULT_GRP_CLASS - 1;
+
+	return sched_data->service_tree + idx;
+}
+
+static struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync)
+{
+	return bic->bfqq[is_sync];
+}
+
+static void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq,
+			 bool is_sync)
+{
+	bic->bfqq[is_sync] = bfqq;
+}
+
+static struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
+{
+	return bic->icq.q->elevator->elevator_data;
+}
+
+static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio);
+static void bfq_put_queue(struct bfq_queue *bfqq);
+static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
+				       struct bio *bio, bool is_sync,
+				       struct bfq_io_cq *bic);
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
+
+/*
+ * Array of async queues for all the processes, one queue
+ * per ioprio value per ioprio_class.
+ */
+struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+/* Async queue for the idle class (ioprio is ignored) */
+struct bfq_queue *async_idle_bfqq;
+
+/* Expiration time of sync (0) and async (1) requests, in ns. */
+static const u64 bfq_fifo_expire[2] = { NSEC_PER_SEC / 4, NSEC_PER_SEC / 8 };
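+/*
+ * With these defaults, sync requests expire after 250 ms and async ones
+ * after 125 ms.
+ */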
+
+/* Maximum backwards seek, in KiB. */
+static const int bfq_back_max = 16 * 1024;
+
+/* Penalty of a backwards seek: multiplier applied to the seek distance. */
+static const int bfq_back_penalty = 2;
+
+/* Idling period duration, in ns. */
+static u64 bfq_slice_idle = NSEC_PER_SEC / 125;
+
+/* Minimum number of assigned budgets for which stats are safe to compute. */
+static const int bfq_stats_min_budgets = 194;
+
+/* Default maximum budget value, in sectors. */
+static const int bfq_default_max_budget = 16 * 1024;
+
+/* Default timeout value, in jiffies, approximating the CFQ default. */
+static const int bfq_timeout = HZ / 8;
+
+struct kmem_cache *bfq_pool;
+
+/* Below this threshold (in ns), we consider thinktime immediate. */
+#define BFQ_MIN_TT		(2 * NSEC_PER_MSEC)
+
+/* hw_tag detection: parallel requests threshold and min samples needed. */
+#define BFQ_HW_QUEUE_THRESHOLD	4
+#define BFQ_HW_QUEUE_SAMPLES	32
+
+#define BFQQ_SEEK_THR		(sector_t)(8 * 100)
+#define BFQQ_SEEKY(bfqq)	(hweight32(bfqq->seek_history) > 32/8)
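+/*
+ * With the two definitions above, a queue is deemed seeky when more than
+ * 32/8 = 4 bits are set in its 32-bit seek_history; each set bit is
+ * assumed to mark a recent request that was farther than BFQQ_SEEK_THR
+ * (800 sectors) from the previous one.
+ */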
+
+/* Budget feedback step. */
+#define BFQ_BUDGET_STEP         128
+
+/* Min samples used for peak rate estimation (for autotuning). */
+#define BFQ_PEAK_RATE_SAMPLES	32
+
+/* Shift used for peak rate fixed precision calculations. */
+#define BFQ_RATE_SHIFT		16
+
+#define BFQ_SERVICE_TREE_INIT	((struct bfq_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+#define RQ_BIC(rq)		((struct bfq_io_cq *) (rq)->elv.priv[0])
+#define RQ_BFQQ(rq)		((rq)->elv.priv[1])
+
+static void bfq_schedule_dispatch(struct bfq_data *bfqd);
+
+/**
+ * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
+ * @icq: the iocontext queue.
+ */
+static struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
+{
+	/* bic->icq is the first member, %NULL will convert to %NULL */
+	return container_of(icq, struct bfq_io_cq, icq);
+}
+
+/**
+ * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
+ * @bfqd: the lookup key.
+ * @ioc: the io_context of the process doing I/O.
+ *
+ * Queue lock must be held.
+ */
+static struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
+					struct io_context *ioc)
+{
+	if (ioc)
+		return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue));
+	return NULL;
+}
+
+#define for_each_entity(entity)	\
+	for (; entity ; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+	for (parent = NULL; entity ; entity = parent)
+
+static int bfq_update_next_in_service(struct bfq_sched_data *sd)
+{
+	return 0;
+}
+
+static void bfq_check_next_in_service(struct bfq_sched_data *sd,
+				      struct bfq_entity *entity)
+{
+}
+
+static void bfq_update_budget(struct bfq_entity *next_in_service)
+{
+}
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time
+ * wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static int bfq_gt(u64 a, u64 b)
+{
+	return (s64)(a - b) > 0;
+}
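+/*
+ * For example, bfq_gt(2, ULLONG_MAX) is true: the signed difference
+ * (s64)(2 - ULLONG_MAX) is 3, so a timestamp just past a wraparound still
+ * compares as greater than one just before it, while a plain unsigned
+ * comparison would give the opposite (wrong) answer.
+ */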
+
+static struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = NULL;
+
+	if (!entity->my_sched_data)
+		bfqq = container_of(entity, struct bfq_queue, entity);
+
+	return bfqq;
+}
+
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor (weight of an entity or weight sum).
+ */
+static u64 bfq_delta(unsigned long service, unsigned long weight)
+{
+	u64 d = (u64)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
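+/*
+ * For instance, bfq_delta(8, 10) maps 8 sectors of service at weight 10
+ * to (8 << 22) / 10, i.e., about 3.35 million virtual-time units;
+ * doubling the weight halves the virtual time consumed for the same
+ * amount of service.
+ */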
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static void bfq_calc_finish(struct bfq_entity *entity, unsigned long service)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	entity->finish = entity->start +
+		bfq_delta(service, entity->weight);
+
+	if (bfqq) {
+		bfq_log_bfqq(bfqq->bfqd, bfqq,
+			"calc_finish: serv %lu, w %d",
+			service, entity->weight);
+		bfq_log_bfqq(bfqq->bfqd, bfqq,
+			"calc_finish: start %llu, finish %llu, delta %llu",
+			entity->start, entity->finish,
+			bfq_delta(service, entity->weight));
+	}
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the corresponding entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static struct bfq_entity *bfq_entity_of(struct rb_node *node)
+{
+	struct bfq_entity *entity = NULL;
+
+	if (node)
+		entity = rb_entry(node, struct bfq_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static void bfq_extract(struct rb_root *root, struct bfq_entity *entity)
+{
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct bfq_service_tree *st,
+			     struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *next;
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = bfq_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = bfq_entity_of(next);
+	}
+
+	bfq_extract(&st->idle, entity);
+
+	if (bfqq)
+		list_del(&bfqq->bfqq_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct bfq_entity *entity)
+{
+	struct bfq_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	while (*node) {
+		parent = *node;
+		entry = rb_entry(parent, struct bfq_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of an entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static void bfq_update_min(struct bfq_entity *entity, struct rb_node *node)
+{
+	struct bfq_entity *child;
+
+	if (node) {
+		child = rb_entry(node, struct bfq_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value.  The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static void bfq_update_active_node(struct rb_node *node)
+{
+	struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (!parent)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its
+ *                     group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * for each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct bfq_service_tree *st,
+			      struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
+	if (node->rb_left)
+		node = node->rb_left;
+	else if (node->rb_right)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+
+	if (bfqq)
+		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static unsigned short bfq_ioprio_to_weight(int ioprio)
+{
+	return (IOPRIO_BE_NR - ioprio) * BFQ_WEIGHT_CONVERSION_COEFF;
+}
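+/*
+ * Assuming BFQ_WEIGHT_CONVERSION_COEFF is 10 and IOPRIO_BE_NR is 8, the
+ * function above maps the highest priority (ioprio 0) to weight 80 and
+ * the lowest (ioprio 7) to weight 10.
+ */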
+
+/**
+ * bfq_weight_to_ioprio - calc an ioprio from a weight.
+ * @weight: the weight value to convert.
+ *
+ * To preserve as much as possible the old only-ioprio user interface,
+ * 0 is used as an escape ioprio value for weights (numerically) equal to
+ * or larger than IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF.
+ */
+static unsigned short bfq_weight_to_ioprio(int weight)
+{
+	return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ?
+		0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight;
+}
+
+static void bfq_get_entity(struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	if (bfqq) {
+		bfqq->ref++;
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
+			     bfqq, bfqq->ref);
+	}
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (!node->rb_right && !node->rb_left)
+		deepest = rb_parent(node);
+	else if (!node->rb_right)
+		deepest = node->rb_left;
+	else if (!node->rb_left)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct bfq_service_tree *st,
+			       struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_extract(&st->active, entity);
+
+	if (node)
+		bfq_update_active_tree(node);
+
+	if (bfqq)
+		list_del(&bfqq->bfqq_list);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct bfq_service_tree *st,
+			    struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct bfq_entity *first_idle = st->first_idle;
+	struct bfq_entity *last_idle = st->last_idle;
+
+	if (!first_idle || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (!last_idle || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+
+	if (bfqq)
+		list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct bfq_service_tree *st,
+			      struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct bfq_sched_data *sd;
+
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	if (bfqq) {
+		sd = entity->sched_data;
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "forget_entity: %p %d",
+			     bfqq, bfqq->ref);
+		bfq_put_queue(bfqq);
+	}
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+static void bfq_put_idle_entity(struct bfq_service_tree *st,
+				struct bfq_entity *entity)
+{
+	bfq_idle_extract(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+static void bfq_forget_idle(struct bfq_service_tree *st)
+{
+	struct bfq_entity *first_idle = st->first_idle;
+	struct bfq_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * Forget the whole idle tree, increasing the vtime past
+		 * the last finish time of idle entities.
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+static struct bfq_service_tree *
+__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
+			 struct bfq_entity *entity)
+{
+	struct bfq_service_tree *new_st = old_st;
+
+	if (entity->prio_changed) {
+		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+		unsigned short prev_weight, new_weight;
+		struct bfq_data *bfqd = NULL;
+
+		if (bfqq)
+			bfqd = bfqq->bfqd;
+
+		old_st->wsum -= entity->weight;
+
+		if (entity->new_weight != entity->orig_weight) {
+			if (entity->new_weight < BFQ_MIN_WEIGHT ||
+			    entity->new_weight > BFQ_MAX_WEIGHT) {
+				pr_crit("update_weight_prio: new_weight %d\n",
+					entity->new_weight);
+				if (entity->new_weight < BFQ_MIN_WEIGHT)
+					entity->new_weight = BFQ_MIN_WEIGHT;
+				else
+					entity->new_weight = BFQ_MAX_WEIGHT;
+			}
+			entity->orig_weight = entity->new_weight;
+			if (bfqq)
+				bfqq->ioprio =
+				  bfq_weight_to_ioprio(entity->orig_weight);
+		}
+
+		if (bfqq)
+			bfqq->ioprio_class = bfqq->new_ioprio_class;
+		entity->prio_changed = 0;
+
+		/*
+		 * NOTE: here we may be changing the weight too early; this
+		 * will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = bfq_entity_service_tree(entity);
+
+		prev_weight = entity->weight;
+		new_weight = entity->orig_weight;
+		entity->weight = new_weight;
+
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * bfq_bfqq_served - update the scheduler status after selection for
+ *                   service.
+ * @bfqq: the queue being served.
+ * @served: bytes to transfer.
+ *
+ * NOTE: this can be optimized, as the timestamps of upper level entities
+ * are synchronized every time a new bfqq is selected for service.  For
+ * now, we keep it this way to better check consistency.
+ */
+static void bfq_bfqq_served(struct bfq_queue *bfqq, int served)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	struct bfq_service_tree *st;
+
+	for_each_entity(entity) {
+		st = bfq_entity_service_tree(entity);
+
+		entity->service += served;
+
+		st->vtime += bfq_delta(served, st->wsum);
+		bfq_forget_idle(st);
+	}
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %d secs", served);
+}
+
+/**
+ * bfq_bfqq_charge_full_budget - set the service to the entity budget.
+ * @bfqq: the queue that needs a service update.
+ *
+ * When it's not possible to be fair in the service domain, because
+ * a queue is not consuming its budget fast enough (the meaning of
+ * fast depends on the timeout parameter), we charge it a full
+ * budget.  In this way we should obtain a sort of time-domain
+ * fairness among all the seeky/slow queues.
+ */
+static void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget");
+
+	bfq_bfqq_served(bfqq, entity->budget - entity->service);
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ * @non_blocking_wait_rq: true if this entity was waiting for a request
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the current budget of the entity (and the
+ * service received if @entity is active) of the queue to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct bfq_entity *entity,
+				  bool non_blocking_wait_rq)
+{
+	struct bfq_sched_data *sd = entity->sched_data;
+	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+	bool backshifted = false;
+
+	if (entity == sd->in_service_entity) {
+		/*
+		 * If we are requeueing the current entity we have
+		 * to take care of not charging to it service it has
+		 * not received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->in_service_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_in_service entity below it.  We reuse the
+		 * old start time.
+		 */
+		bfq_active_extract(st, entity);
+	} else {
+		unsigned long long min_vstart;
+
+		/* See comments on bfq_bfqq_update_budg_for_activation */
+		if (non_blocking_wait_rq && bfq_gt(st->vtime, entity->finish)) {
+			backshifted = true;
+			min_vstart = entity->finish;
+		} else
+			min_vstart = st->vtime;
+
+		if (entity->tree == &st->idle) {
+			/*
+			 * The entity is on the idle tree; extract it
+			 * from there.
+			 */
+			bfq_idle_extract(st, entity);
+			entity->start = bfq_gt(min_vstart, entity->finish) ?
+				min_vstart : entity->finish;
+		} else {
+			/*
+			 * The finish time of the entity may be invalid, and
+			 * it is in the past for sure, otherwise the queue
+			 * would have been on the idle tree.
+			 */
+			entity->start = min_vstart;
+			st->wsum += entity->weight;
+			bfq_get_entity(entity);
+
+			entity->on_st = 1;
+		}
+	}
+
+	st = __bfq_entity_update_weight_prio(st, entity);
+	bfq_calc_finish(entity, entity->budget);
+
+	/*
+	 * If some queues enjoy backshifting for a while, then their
+	 * (virtual) finish timestamps may happen to become lower and
+	 * lower than the system virtual time.	In particular, if
+	 * these queues often happen to be idle for short time
+	 * periods, and during such time periods other queues with
+	 * higher timestamps happen to be busy, then the backshifted
+	 * timestamps of the former queues can become much lower than
+	 * the system virtual time. In fact, to serve the queues with
+	 * higher timestamps while the ones with lower timestamps are
+	 * idle, the system virtual time may be pushed up to much
+	 * higher values than the finish timestamps of the idle
+	 * queues. As a consequence, the finish timestamps of all new
+	 * or newly activated queues may end up being much larger than
+	 * those of lucky queues with backshifted timestamps. The
+	 * latter queues may then monopolize the device for a lot of
+	 * time. This would simply break service guarantees.
+	 *
+	 * To reduce this problem, push up a little bit the
+	 * backshifted timestamps of the queue associated with this
+	 * entity (only a queue can happen to have the backshifted
+	 * flag set): just enough to let the finish timestamp of the
+	 * queue be equal to the current value of the system virtual
+	 * time. This may introduce a little unfairness among queues
+	 * with backshifted timestamps, but it does not break
+	 * worst-case fairness guarantees.
+	 */
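+	/*
+	 * Hypothetical numbers: if st->vtime is 1000 and entity->finish
+	 * is 400, then delta below is 600 and both entity->start and
+	 * entity->finish are advanced by 600, so that the queue's finish
+	 * time lines up with the current virtual time.
+	 */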
+	if (backshifted && bfq_gt(st->vtime, entity->finish)) {
+		unsigned long delta = st->vtime - entity->finish;
+
+		entity->start += delta;
+		entity->finish += delta;
+	}
+
+	bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
+ * @entity: the entity to activate.
+ * @non_blocking_wait_rq: true if this entity was waiting for a request
+ *
+ * Activate @entity and all the entities on the path from it to the root.
+ */
+static void bfq_activate_entity(struct bfq_entity *entity,
+				bool non_blocking_wait_rq)
+{
+	struct bfq_sched_data *sd;
+
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity, non_blocking_wait_rq);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_in_service(sd))
+			/*
+			 * No need to propagate the activation to the
+			 * upper entities, as they will be updated when
+			 * the in-service entity is rescheduled.
+			 */
+			break;
+	}
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently from its previous state.  If the
+ * entity was not on a service tree, just return; otherwise, if it is on
+ * any scheduler tree, extract it from that tree, and then, if necessary
+ * and if the caller specified @requeue, put it on the idle tree.
+ *
+ * Return %1 if the caller should update the entity hierarchy, i.e.,
+ * if the entity was in service or if it was the next_in_service for
+ * its sched_data; return %0 otherwise.
+ */
+static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+{
+	struct bfq_sched_data *sd = entity->sched_data;
+	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+	int was_in_service = entity == sd->in_service_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	if (was_in_service) {
+		bfq_calc_finish(entity, entity->service);
+		sd->in_service_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_extract(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_extract(st, entity);
+
+	if (was_in_service || sd->next_in_service == entity)
+		ret = bfq_update_next_in_service(sd);
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+{
+	struct bfq_sched_data *sd;
+	struct bfq_entity *parent;
+
+	for_each_entity_safe(entity, parent) {
+		sd = entity->sched_data;
+
+		if (!__bfq_deactivate_entity(entity, requeue))
+			/*
+			 * The parent entity is still backlogged, and
+			 * we don't need to update it as it is still
+			 * in service.
+			 */
+			break;
+
+		if (sd->next_in_service)
+			/*
+			 * The parent entity is still backlogged and
+			 * the budgets on the path towards the root
+			 * need to be updated.
+			 */
+			goto update;
+
+		/*
+		 * If we get here, then the parent is no more backlogged and
+		 * we want to propagate the deactivation upwards.
+		 */
+		requeue = 1;
+	}
+
+	return;
+
+update:
+	entity = parent;
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity, false);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_in_service(sd))
+			break;
+	}
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated processes getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
+	entry = rb_entry(node, struct bfq_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active_entity - find the eligible entity with
+ *                           the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches the first schedulable entity, starting from the
+ * root of the tree and going on the left every time on this side there is
+ * a subtree with at least one eligible (start <= vtime) entity. The path on
+ * the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
+	while (node) {
+		entry = rb_entry(node, struct bfq_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		if (node->rb_left) {
+			entry = rb_entry(node->rb_left,
+					 struct bfq_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first)
+			break;
+		node = node->rb_right;
+	}
+
+	return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
+						   bool force)
+{
+	struct bfq_entity *entity, *new_next_in_service = NULL;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+
+	/*
+	 * If the chosen entity does not match the sched_data's
+	 * next_in_service and we are forcibly serving the IDLE priority
+	 * class tree, bubble up budget update.
+	 */
+	if (unlikely(force && entity != entity->sched_data->next_in_service)) {
+		new_next_in_service = entity;
+		for_each_entity(new_next_in_service)
+			bfq_update_budget(new_next_in_service);
+	}
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_in_service entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort by just returning the cached next_in_service value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
+						 int extract,
+						 struct bfq_data *bfqd)
+{
+	struct bfq_service_tree *st = sd->service_tree;
+	struct bfq_entity *entity;
+	int i = 0;
+
+	/*
+	 * Choose from idle class, if needed to guarantee a minimum
+	 * bandwidth to this class. This should also mitigate
+	 * priority-inversion problems in case a low priority task is
+	 * holding file system resources.
+	 */
+	if (bfqd &&
+	    jiffies - bfqd->bfq_class_idle_last_service >
+	    BFQ_CL_IDLE_TIMEOUT) {
+		entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
+						  true);
+		if (entity) {
+			i = BFQ_IOPRIO_CLASSES - 1;
+			bfqd->bfq_class_idle_last_service = jiffies;
+			sd->next_in_service = entity;
+		}
+	}
+	for (; i < BFQ_IOPRIO_CLASSES; i++) {
+		entity = __bfq_lookup_next_entity(st + i, false);
+		if (entity) {
+			if (extract) {
+				bfq_check_next_in_service(sd, entity);
+				bfq_active_extract(st + i, entity);
+				sd->in_service_entity = entity;
+				sd->next_in_service = NULL;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+static bool next_queue_may_preempt(struct bfq_data *bfqd)
+{
+	struct bfq_sched_data *sd = &bfqd->sched_data;
+
+	return sd->next_in_service != sd->in_service_entity;
+}
+
+
+/*
+ * Get next queue for service.
+ */
+static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+{
+	struct bfq_entity *entity = NULL;
+	struct bfq_sched_data *sd;
+	struct bfq_queue *bfqq;
+
+	if (bfqd->busy_queues == 0)
+		return NULL;
+
+	sd = &bfqd->sched_data;
+	for (; sd ; sd = entity->my_sched_data) {
+		entity = bfq_lookup_next_entity(sd, 1, bfqd);
+		entity->service = 0;
+	}
+
+	bfqq = bfq_entity_to_bfqq(entity);
+
+	return bfqq;
+}
+
+static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+{
+	if (bfqd->in_service_bic) {
+		put_io_context(bfqd->in_service_bic->icq.ioc);
+		bfqd->in_service_bic = NULL;
+	}
+
+	bfq_clear_bfqq_wait_request(bfqd->in_service_queue);
+	hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
+	bfqd->in_service_queue = NULL;
+}
+
+static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				int requeue)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_deactivate_entity(entity, requeue);
+}
+
+static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_activate_entity(entity, bfq_bfqq_non_blocking_wait_rq(bfqq));
+	bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
+}
+
+/*
+ * Called when the bfqq no longer has requests pending, remove it from
+ * the service tree.
+ */
+static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			      int requeue)
+{
+	bfq_log_bfqq(bfqd, bfqq, "del from busy");
+
+	bfq_clear_bfqq_busy(bfqq);
+
+	bfqd->busy_queues--;
+
+	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
+}
+
+/*
+ * Called when an inactive queue receives a new request.
+ */
+static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqd, bfqq, "add to busy");
+
+	bfq_activate_bfqq(bfqd, bfqq);
+
+	bfq_mark_bfqq_busy(bfqq);
+	bfqd->busy_queues++;
+}
+
+static void bfq_init_entity(struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	entity->weight = entity->new_weight;
+	entity->orig_weight = entity->new_weight;
+
+	bfqq->ioprio = bfqq->new_ioprio;
+	bfqq->ioprio_class = bfqq->new_ioprio_class;
+
+	entity->sched_data = &bfqq->bfqd->sched_data;
+}
+
+#define bfq_class_idle(bfqq)	((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
+#define bfq_class_rt(bfqq)	((bfqq)->ioprio_class == IOPRIO_CLASS_RT)
+
+#define bfq_sample_valid(samples)	((samples) > 80)
+
+/*
+ * We regard a request as SYNC if it is either a read or has the SYNC bit
+ * set (in which case it could also be a direct WRITE).
+ */
+static bool bfq_bio_sync(struct bio *bio)
+{
+	return bio_data_dir(bio) == READ || (bio->bi_opf & REQ_SYNC);
+}
+
+/*
+ * Schedule a run of the queue if there are pending requests and nothing
+ * in the driver will restart queueing.
+ */
+static void bfq_schedule_dispatch(struct bfq_data *bfqd)
+{
+	if (bfqd->queued != 0) {
+		bfq_log(bfqd, "schedule dispatch");
+		kblockd_schedule_work(&bfqd->unplug_work);
+	}
+}
+
+/*
+ * Lifted from AS - choose which of rq1 and rq2 is best served now.
+ * We choose the request that is closest to the head right now.  Distance
+ * behind the head is penalized and only allowed to a certain extent.
+ */
+static struct request *bfq_choose_req(struct bfq_data *bfqd,
+				      struct request *rq1,
+				      struct request *rq2,
+				      sector_t last)
+{
+	sector_t s1, s2, d1 = 0, d2 = 0;
+	unsigned long back_max;
+#define BFQ_RQ1_WRAP	0x01 /* request 1 wraps */
+#define BFQ_RQ2_WRAP	0x02 /* request 2 wraps */
+	unsigned int wrap = 0; /* bit mask: requests behind the disk head? */
+
+	if (!rq1 || rq1 == rq2)
+		return rq2;
+	if (!rq2)
+		return rq1;
+
+	if (rq_is_sync(rq1) && !rq_is_sync(rq2))
+		return rq1;
+	else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
+		return rq2;
+	if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
+		return rq1;
+	else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
+		return rq2;
+
+	s1 = blk_rq_pos(rq1);
+	s2 = blk_rq_pos(rq2);
+
+	/*
+	 * By definition, 1KiB is 2 sectors.
+	 */
+	back_max = bfqd->bfq_back_max * 2;
+
+	/*
+	 * Strict one way elevator _except_ in the case where we allow
+	 * short backward seeks which are biased as twice the cost of a
+	 * similar forward seek.
+	 */
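+	/*
+	 * Hypothetical numbers with the defaults (bfq_back_max = 16384 KiB,
+	 * i.e., back_max = 32768 sectors, and bfq_back_penalty = 2): a
+	 * request 1000 sectors behind the head gets distance d = 2000, one
+	 * 1000 sectors ahead gets d = 1000, and a request more than
+	 * back_max sectors behind is flagged as wrapping.
+	 */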
+	if (s1 >= last)
+		d1 = s1 - last;
+	else if (s1 + back_max >= last)
+		d1 = (last - s1) * bfqd->bfq_back_penalty;
+	else
+		wrap |= BFQ_RQ1_WRAP;
+
+	if (s2 >= last)
+		d2 = s2 - last;
+	else if (s2 + back_max >= last)
+		d2 = (last - s2) * bfqd->bfq_back_penalty;
+	else
+		wrap |= BFQ_RQ2_WRAP;
+
+	/* Found required data */
+
+	/*
+	 * By doing switch() on the bit mask "wrap" we avoid having to
+	 * check two variables for all permutations: --> faster!
+	 */
+	switch (wrap) {
+	case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
+		if (d1 < d2)
+			return rq1;
+		else if (d2 < d1)
+			return rq2;
+
+		if (s1 >= s2)
+			return rq1;
+		else
+			return rq2;
+
+	case BFQ_RQ2_WRAP:
+		return rq1;
+	case BFQ_RQ1_WRAP:
+		return rq2;
+	case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
+	default:
+		/*
+		 * Since both rqs are wrapped,
+		 * start with the one that's further behind head
+		 * (--> only *one* back seek required),
+		 * since back seek takes more time than forward.
+		 */
+		if (s1 <= s2)
+			return rq1;
+		else
+			return rq2;
+	}
+}
+
+static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
+					struct bfq_queue *bfqq,
+					struct request *last)
+{
+	struct rb_node *rbnext = rb_next(&last->rb_node);
+	struct rb_node *rbprev = rb_prev(&last->rb_node);
+	struct request *next = NULL, *prev = NULL;
+
+	if (rbprev)
+		prev = rb_entry_rq(rbprev);
+
+	if (rbnext)
+		next = rb_entry_rq(rbnext);
+	else {
+		rbnext = rb_first(&bfqq->sort_list);
+		if (rbnext && rbnext != &last->rb_node)
+			next = rb_entry_rq(rbnext);
+	}
+
+	return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
+}
+
+static unsigned long bfq_serv_to_charge(struct request *rq,
+					struct bfq_queue *bfqq)
+{
+	return blk_rq_sectors(rq);
+}
+
+/**
+ * bfq_updated_next_req - update the queue after a new next_rq selection.
+ * @bfqd: the device data the queue belongs to.
+ * @bfqq: the queue to update.
+ *
+ * If the first request of a queue changes we make sure that the queue
+ * has enough budget to serve at least its first request (if the
+ * request has grown).  We do this because, if the queue does not have
+ * enough budget for its first request, it has to go through two dispatch
+ * rounds to actually get it dispatched.
+ */
+static void bfq_updated_next_req(struct bfq_data *bfqd,
+				 struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	struct request *next_rq = bfqq->next_rq;
+	unsigned long new_budget;
+
+	if (!next_rq)
+		return;
+
+	if (bfqq == bfqd->in_service_queue)
+		/*
+		 * In order not to break guarantees, budgets cannot be
+		 * changed after an entity has been selected.
+		 */
+		return;
+
+	new_budget = max_t(unsigned long, bfqq->max_budget,
+			   bfq_serv_to_charge(next_rq, bfqq));
+	if (entity->budget != new_budget) {
+		entity->budget = new_budget;
+		bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
+					 new_budget);
+		bfq_activate_bfqq(bfqd, bfqq);
+	}
+}
+
+static int bfq_bfqq_budget_left(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	return entity->budget - entity->service;
+}
+
+/*
+ * If enough samples have been computed, return the current max budget
+ * stored in bfqd, which is dynamically updated according to the
+ * estimated disk peak rate; otherwise return the default max budget.
+ */
+static int bfq_max_budget(struct bfq_data *bfqd)
+{
+	if (bfqd->budgets_assigned < bfq_stats_min_budgets)
+		return bfq_default_max_budget;
+	else
+		return bfqd->bfq_max_budget;
+}
+
+/*
+ * Return min budget, which is a fraction of the current or default
+ * max budget (currently 1/32).
+ */
+static int bfq_min_budget(struct bfq_data *bfqd)
+{
+	if (bfqd->budgets_assigned < bfq_stats_min_budgets)
+		return bfq_default_max_budget / 32;
+	else
+		return bfqd->bfq_max_budget / 32;
+}
+
+static void bfq_bfqq_expire(struct bfq_data *bfqd,
+			    struct bfq_queue *bfqq,
+			    bool compensate,
+			    enum bfqq_expiration reason);
+
+/*
+ * The next function, invoked after the input queue bfqq switches from
+ * idle to busy, updates the budget of bfqq. The function also tells
+ * whether the in-service queue should be expired, by returning
+ * true. The purpose of expiring the in-service queue is to give bfqq
+ * the chance to possibly preempt the in-service queue, and the reason
+ * for preempting the in-service queue is to achieve the following
+ * goal: guarantee to bfqq its reserved bandwidth even if bfqq has
+ * expired because it has remained idle.
+ *
+ * In particular, bfqq may have expired for one of the following two
+ * reasons:
+ *
+ * - BFQ_BFQQ_NO_MORE_REQUESTS bfqq did not enjoy any device idling
+ *   and did not make it to issue a new request before its last
+ *   request was served;
+ *
+ * - BFQ_BFQQ_TOO_IDLE bfqq did enjoy device idling, but did not issue
+ *   a new request before the expiration of the idling-time.
+ *
+ * Even if bfqq has expired for one of the above reasons, the process
+ * associated with the queue may however be issuing requests greedily,
+ * and thus be sensitive to the bandwidth it receives (bfqq may have
+ * remained idle for other reasons: CPU high load, bfqq not enjoying
+ * idling, I/O throttling somewhere in the path from the process to
+ * the I/O scheduler, ...). But if, after every expiration for one of
+ * the above two reasons, bfqq has to wait for the service of at least
+ * one full budget of another queue before being served again, then
+ * bfqq is likely to get a much lower bandwidth or resource time than
+ * its reserved ones. To address this issue, two countermeasures need
+ * to be taken.
+ *
+ * First, the budget and the timestamps of bfqq need to be updated in
+ * a special way on bfqq reactivation: they need to be updated as if
+ * bfqq did not remain idle and did not expire. In fact, if they are
+ * computed as if bfqq expired and remained idle until reactivation,
+ * then the process associated with bfqq is treated as if, instead of
+ * being greedy, it stopped issuing requests when bfqq remained idle,
+ * and restarts issuing requests only on this reactivation. In other
+ * words, the scheduler does not help the process recover the "service
+ * hole" between bfqq expiration and reactivation. As a consequence,
+ * the process receives a lower bandwidth than its reserved one. In
+ * contrast, to recover this hole, the budget must be updated as if
+ * bfqq was not expired at all before this reactivation, i.e., it must
+ * be set to the value of the remaining budget when bfqq was
+ * expired. Along the same line, timestamps need to be assigned the
+ * value they had the last time bfqq was selected for service, i.e.,
+ * before last expiration. Thus timestamps need to be back-shifted
+ * with respect to their normal computation (see [1] for more details
+ * on this tricky aspect).
+ *
+ * Secondly, to allow the process to recover the hole, the in-service
+ * queue must be expired too, to give bfqq the chance to preempt it
+ * immediately. In fact, if bfqq has to wait for a full budget of the
+ * in-service queue to be completed, then it may become impossible to
+ * let the process recover the hole, even if the back-shifted
+ * timestamps of bfqq are lower than those of the in-service queue. If
+ * this happens for most or all of the holes, then the process may not
+ * receive its reserved bandwidth. In this respect, it is worth noting
+ * that, since the service of outstanding requests is not preemptible,
+ * a small fraction of the holes may be unrecoverable, thereby causing
+ * a small loss of bandwidth.
+ *
+ * The last important point is detecting whether bfqq does need this
+ * bandwidth recovery. In this respect, the next function deems the
+ * process associated with bfqq greedy, and thus allows it to recover
+ * the hole, if: 1) the process is waiting for the arrival of a new
+ * request (which implies that bfqq expired for one of the above two
+ * reasons), and 2) such a request arrives soon enough. The first
+ * condition is controlled through the flag non_blocking_wait_rq,
+ * while the second through the flag arrived_in_time. If both
+ * conditions hold, then the function computes the budget in the
+ * above-described special way, and signals that the in-service queue
+ * should be expired. Timestamp back-shifting is done later in
+ * __bfq_activate_entity.
+ */
+static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
+						struct bfq_queue *bfqq,
+						bool arrived_in_time)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time) {
+		/*
+		 * We do not clear the flag non_blocking_wait_rq here, as
+		 * the latter is used in bfq_activate_bfqq to signal
+		 * that timestamps need to be back-shifted (and is
+		 * cleared right after).
+		 */
+
+		/*
+		 * The next assignment relies on the fact that neither
+		 * entity->service nor entity->budget is updated on
+		 * expiration if bfqq is empty (see
+		 * __bfq_bfqq_recalc_budget). Thus both quantities
+		 * remain unchanged after such an expiration, and the
+		 * following statement therefore assigns to
+		 * entity->budget the remaining budget on such an
+		 * expiration. For clarity, entity->service is not
+		 * updated on expiration in any case, and, in normal
+		 * operation, is reset only when bfqq is selected for
+		 * service (see bfq_get_next_queue).
+		 */
+		entity->budget = min_t(unsigned long,
+				       bfq_bfqq_budget_left(bfqq),
+				       bfqq->max_budget);
+
+		return true;
+	}
+
+	entity->budget = max_t(unsigned long, bfqq->max_budget,
+			       bfq_serv_to_charge(bfqq->next_rq, bfqq));
+	bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
+	return false;
+}
+
+static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
+					     struct bfq_queue *bfqq,
+					     struct request *rq)
+{
+	bool bfqq_wants_to_preempt,
+		/*
+		 * See the comments on
+		 * bfq_bfqq_update_budg_for_activation for
+		 * details on the usage of the next variable.
+		 */
+		arrived_in_time = ktime_get_ns() <=
+			RQ_BIC(rq)->ttime.last_end_request +
+			bfqd->bfq_slice_idle * 3ULL;
+
+	/*
+	 * Update budget and check whether bfqq may want to preempt
+	 * the in-service queue.
+	 */
+	bfqq_wants_to_preempt =
+		bfq_bfqq_update_budg_for_activation(bfqd, bfqq,
+						    arrived_in_time);
+
+	if (!bfq_bfqq_IO_bound(bfqq)) {
+		if (arrived_in_time) {
+			bfqq->requests_within_timer++;
+			if (bfqq->requests_within_timer >=
+			    bfqd->bfq_requests_within_timer)
+				bfq_mark_bfqq_IO_bound(bfqq);
+		} else
+			bfqq->requests_within_timer = 0;
+	}
+
+	bfq_add_bfqq_busy(bfqd, bfqq);
+
+	/*
+	 * Expire in-service queue only if preemption may be needed
+	 * for guarantees. In this respect, the function
+	 * next_queue_may_preempt just checks a simple, necessary
+	 * condition, and not a sufficient condition based on
+	 * timestamps. In fact, for the latter condition to be
+	 * evaluated, timestamps would need first to be updated, and
+	 * this operation is quite costly (see the comments on the
+	 * function bfq_bfqq_update_budg_for_activation).
+	 */
+	if (bfqd->in_service_queue && bfqq_wants_to_preempt &&
+	    next_queue_may_preempt(bfqd))
+		bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
+				false, BFQ_BFQQ_PREEMPTED);
+}
+
+static void bfq_add_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_data *bfqd = bfqq->bfqd;
+	struct request *next_rq, *prev;
+
+	bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
+	bfqq->queued[rq_is_sync(rq)]++;
+	bfqd->queued++;
+
+	elv_rb_add(&bfqq->sort_list, rq);
+
+	/*
+	 * Check if this request is a better next-serve candidate.
+	 */
+	prev = bfqq->next_rq;
+	next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
+	bfqq->next_rq = next_rq;
+
+	if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */
+		bfq_bfqq_handle_idle_busy_switch(bfqd, bfqq, rq);
+	else if (prev != bfqq->next_rq)
+		bfq_updated_next_req(bfqd, bfqq);
+}
+
+static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
+					  struct bio *bio)
+{
+	struct task_struct *tsk = current;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	bic = bfq_bic_lookup(bfqd, tsk->io_context);
+	if (!bic)
+		return NULL;
+
+	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	if (bfqq)
+		return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
+
+	return NULL;
+}
+
+static void bfq_activate_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+
+	bfqd->rq_in_driver++;
+	bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
+	bfq_log(bfqd, "activate_request: new bfqd->last_position %llu",
+		(unsigned long long)bfqd->last_position);
+}
+
+static void bfq_deactivate_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+
+	bfqd->rq_in_driver--;
+}
+
+static void bfq_remove_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_data *bfqd = bfqq->bfqd;
+	const int sync = rq_is_sync(rq);
+
+	if (bfqq->next_rq == rq) {
+		bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
+		bfq_updated_next_req(bfqd, bfqq);
+	}
+
+	if (rq->queuelist.prev != &rq->queuelist)
+		list_del_init(&rq->queuelist);
+	bfqq->queued[sync]--;
+	bfqd->queued--;
+	elv_rb_del(&bfqq->sort_list, rq);
+
+	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue) {
+			bfq_del_bfqq_busy(bfqd, bfqq, 1);
+
+			/* bfqq emptied. In normal operation, when
+			 * bfqq is empty, bfqq->entity.service and
+			 * bfqq->entity.budget must contain,
+			 * respectively, the service received and the
+			 * budget used last time bfqq emptied. These
+			 * facts do not hold in this case, as at least
+			 * this last removal occurred while bfqq is
+			 * not in service. To avoid inconsistencies,
+			 * reset both bfqq->entity.service and
+			 * bfqq->entity.budget.
+			 */
+			bfqq->entity.budget = bfqq->entity.service = 0;
+		}
+	}
+
+	if (rq->cmd_flags & REQ_META)
+		bfqq->meta_pending--;
+}
+
+static int bfq_merge(struct request_queue *q, struct request **req,
+		     struct bio *bio)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct request *__rq;
+
+	__rq = bfq_find_rq_fmerge(bfqd, bio);
+	if (__rq && elv_bio_merge_ok(__rq, bio)) {
+		*req = __rq;
+		return ELEVATOR_FRONT_MERGE;
+	}
+
+	return ELEVATOR_NO_MERGE;
+}
+
+static void bfq_merged_request(struct request_queue *q, struct request *req,
+			       int type)
+{
+	if (type == ELEVATOR_FRONT_MERGE &&
+	    rb_prev(&req->rb_node) &&
+	    blk_rq_pos(req) <
+	    blk_rq_pos(container_of(rb_prev(&req->rb_node),
+				    struct request, rb_node))) {
+		struct bfq_queue *bfqq = RQ_BFQQ(req);
+		struct bfq_data *bfqd = bfqq->bfqd;
+		struct request *prev, *next_rq;
+
+		/* Reposition request in its sort_list */
+		elv_rb_del(&bfqq->sort_list, req);
+		elv_rb_add(&bfqq->sort_list, req);
+		/* Choose next request to be served for bfqq */
+		prev = bfqq->next_rq;
+		next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,
+					 bfqd->last_position);
+		bfqq->next_rq = next_rq;
+		/*
+		 * If next_rq changes, update the queue's budget to fit
+		 * the new request.
+		 */
+		if (prev != bfqq->next_rq)
+			bfq_updated_next_req(bfqd, bfqq);
+	}
+}
+
+static void bfq_merged_requests(struct request_queue *q, struct request *rq,
+				struct request *next)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq), *next_bfqq = RQ_BFQQ(next);
+
+	/*
+	 * If next and rq belong to the same bfq_queue and next is older
+	 * than rq, then reposition rq in the fifo (by substituting next
+	 * with rq). Otherwise, if next and rq belong to different
+	 * bfq_queues, never reposition rq: in fact, we would have to
+	 * reposition it with respect to next's position in its own fifo,
+	 * which would most certainly be too expensive with respect to
+	 * the benefits.
+	 */
+	if (bfqq == next_bfqq &&
+	    !list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
+	    next->fifo_time < rq->fifo_time) {
+		list_del_init(&rq->queuelist);
+		list_replace_init(&next->queuelist, &rq->queuelist);
+		rq->fifo_time = next->fifo_time;
+	}
+
+	if (bfqq->next_rq == next)
+		bfqq->next_rq = rq;
+
+	bfq_remove_request(next);
+}
+
+static int bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
+			       struct bio *bio)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	/*
+	 * Disallow merge of a sync bio into an async request.
+	 */
+	if (bfq_bio_sync(bio) && !rq_is_sync(rq))
+		return false;
+
+	/*
+	 * Lookup the bfqq that this bio will be queued with. Allow
+	 * merge only if rq is queued there.
+	 * Queue lock is held here.
+	 */
+	bic = bfq_bic_lookup(bfqd, current->io_context);
+	if (!bic)
+		return false;
+
+	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+
+	return bfqq == RQ_BFQQ(rq);
+}
+
+static int bfq_allow_rq_merge(struct request_queue *q, struct request *rq,
+			      struct request *next)
+{
+	return RQ_BFQQ(rq) == RQ_BFQQ(next);
+}
+
+static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
+				       struct bfq_queue *bfqq)
+{
+	if (bfqq) {
+		bfq_mark_bfqq_must_alloc(bfqq);
+		bfq_mark_bfqq_budget_new(bfqq);
+		bfq_clear_bfqq_fifo_expire(bfqq);
+
+		bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
+
+		bfq_log_bfqq(bfqd, bfqq,
+			     "set_in_service_queue, cur-budget = %d",
+			     bfqq->entity.budget);
+	}
+
+	bfqd->in_service_queue = bfqq;
+}
+
+/*
+ * Get and set a new queue for service.
+ */
+static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
+
+	__bfq_set_in_service_queue(bfqd, bfqq);
+	return bfqq;
+}
+
+/*
+ * bfq_default_budget - return the default budget for @bfqq on @bfqd.
+ * @bfqd: the device descriptor.
+ * @bfqq: the queue to consider.
+ *
+ * We use 3/4 of the @bfqd maximum budget as the default value
+ * for the max_budget field of the queues.  This lets the feedback
+ * mechanism start from some middle ground; the behavior of the
+ * process then drives the heuristics towards high values, if
+ * it behaves as a greedy sequential reader, or towards small values
+ * if it shows a more intermittent behavior.
+ */
+static unsigned long bfq_default_budget(struct bfq_data *bfqd,
+					struct bfq_queue *bfqq)
+{
+	unsigned long budget;
+
+	/*
+	 * While we still need an estimate of the peak rate, we must avoid
+	 * giving budgets that are too short because of previous measurements.
+	 * So, for the first budget assignments, use a ``safe'' budget value.
+	 */
+	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
+		budget = bfq_default_max_budget;
+	else
+		budget = bfqd->bfq_max_budget;
+
+	return budget - budget / 4;
+}
+
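+/*
+ * Arm the device-idling timer for the in-service queue, to wait for
+ * the arrival of a new request from the process owning the queue.
+ */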
+static void bfq_arm_slice_timer(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfqd->in_service_queue;
+	struct bfq_io_cq *bic;
+	u32 sl;
+
+	/* Processes have exited, don't wait. */
+	bic = bfqd->in_service_bic;
+	if (!bic || atomic_read(&bic->icq.ioc->active_ref) == 0)
+		return;
+
+	bfq_mark_bfqq_wait_request(bfqq);
+
+	/*
+	 * We don't want to idle for seeks, but we do want to allow
+	 * fair distribution of slice time for a process doing back-to-back
+	 * seeks. So allow a little bit of time for it to submit a new rq.
+	 */
+	sl = bfqd->bfq_slice_idle;
+	/*
+	 * Grant only minimum idle time if the queue is seeky.
+	 */
+	if (BFQQ_SEEKY(bfqq))
+		sl = min_t(u64, sl, BFQ_MIN_TT);
+
+	bfqd->last_idling_start = ktime_get();
+	hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
+		      HRTIMER_MODE_REL);
+}
+
+/*
+ * Set the maximum time for the in-service queue to consume its
+ * budget. This prevents seeky processes from lowering the disk
+ * throughput (a protection that a time-slice scheme, as in CFQ,
+ * provides by construction).
+ */
+static void bfq_set_budget_timeout(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfqd->in_service_queue;
+	unsigned int timeout_coeff = bfqq->entity.weight /
+				     bfqq->entity.orig_weight;
+
+	bfqd->last_budget_start = ktime_get();
+
+	bfq_clear_bfqq_budget_new(bfqq);
+	bfqq->budget_timeout = jiffies +
+		bfqd->bfq_timeout * timeout_coeff;
+
+	bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u",
+		jiffies_to_msecs(bfqd->bfq_timeout * timeout_coeff));
+}
+
+/*
+ * Move request from internal lists to the dispatch list of the request queue.
+ */
+static void bfq_dispatch_insert(struct request_queue *q, struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	/*
+	 * For consistency, the next instruction should have been executed
+	 * after removing the request from the queue and dispatching it.
+	 * We execute instead this instruction before bfq_remove_request()
+	 * (and hence introduce a temporary inconsistency), for efficiency.
+	 * In fact, in a forced_dispatch, this keeps two counters related
+	 * to bfqq->dispatched from being uselessly decremented if bfqq
+	 * is not in service, and then incremented again right after
+	 * bfqq->dispatched itself is incremented.
+	 */
+	bfqq->dispatched++;
+
+	bfq_remove_request(rq);
+	elv_dispatch_sort(q, rq);
+}
+
+/*
+ * Return expired entry, or NULL to just start from scratch in rbtree.
+ */
+static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
+{
+	struct request *rq = NULL;
+
+	if (bfq_bfqq_fifo_expire(bfqq))
+		return NULL;
+
+	bfq_mark_bfqq_fifo_expire(bfqq);
+
+	if (list_empty(&bfqq->fifo))
+		return NULL;
+
+	rq = rq_entry_fifo(bfqq->fifo.next);
+
+	if (ktime_get_ns() < rq->fifo_time)
+		return NULL;
+
+	return rq;
+}
+
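+/*
+ * Remove bfqq from service: if the queue has no more queued requests,
+ * it also leaves the set of busy queues, otherwise it is re-activated
+ * in the scheduler.
+ */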
+static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	__bfq_bfqd_reset_in_service(bfqd);
+
+	if (RB_EMPTY_ROOT(&bfqq->sort_list))
+		bfq_del_bfqq_busy(bfqd, bfqq, 1);
+	else
+		bfq_activate_bfqq(bfqd, bfqq);
+}
+
+/**
+ * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
+ * @bfqd: device data.
+ * @bfqq: queue to update.
+ * @reason: reason for expiration.
+ *
+ * Handle the feedback on @bfqq budget at queue expiration.
+ * See the body for detailed comments.
+ */
+static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
+				     struct bfq_queue *bfqq,
+				     enum bfqq_expiration reason)
+{
+	struct request *next_rq;
+	int budget, min_budget;
+
+	budget = bfqq->max_budget;
+	min_budget = bfq_min_budget(bfqd);
+
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %d, budg left %d",
+		bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %d, min budg %d",
+		budget, bfq_min_budget(bfqd));
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
+		bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
+
+	if (bfq_bfqq_sync(bfqq)) {
+		switch (reason) {
+		/*
+		 * Caveat: in all the following cases we trade latency
+		 * for throughput.
+		 */
+		case BFQ_BFQQ_TOO_IDLE:
+			if (budget > min_budget + BFQ_BUDGET_STEP)
+				budget -= BFQ_BUDGET_STEP;
+			else
+				budget = min_budget;
+			break;
+		case BFQ_BFQQ_BUDGET_TIMEOUT:
+			budget = bfq_default_budget(bfqd, bfqq);
+			break;
+		case BFQ_BFQQ_BUDGET_EXHAUSTED:
+			/*
+			 * The process still has backlog, and did not
+			 * let either the budget timeout or the disk
+			 * idling timeout expire. Hence it is not
+			 * seeky, has a short thinktime and may be
+			 * happy with a higher budget too. So
+			 * definitely increase the budget of this good
+			 * candidate to boost the disk throughput.
+			 */
+			budget = min(budget + 8 * BFQ_BUDGET_STEP,
+				     bfqd->bfq_max_budget);
+			break;
+		case BFQ_BFQQ_NO_MORE_REQUESTS:
+			/*
+			 * For queues that expire for this reason, it
+			 * is particularly important to keep the
+			 * budget close to the actual service they
+			 * need. Doing so reduces the timestamp
+			 * misalignment problem described in the
+			 * comments in the body of
+			 * __bfq_activate_entity. In fact, suppose
+			 * that a queue systematically expires for
+			 * BFQ_BFQQ_NO_MORE_REQUESTS and presents a
+			 * new request in time to enjoy timestamp
+			 * back-shifting. The larger the budget of the
+			 * queue is with respect to the service the
+			 * queue actually requests in each service
+			 * slot, the more times the queue can be
+			 * reactivated with the same virtual finish
+			 * time. It follows that, even if this finish
+			 * time is pushed to the system virtual time
+			 * to reduce the consequent timestamp
+			 * misalignment, the queue unjustly enjoys for
+			 * many re-activations a lower finish time
+			 * than all newly activated queues.
+			 *
+			 * The service needed by bfqq is measured
+			 * quite precisely by bfqq->entity.service.
+			 * Since bfqq does not enjoy device idling,
+			 * bfqq->entity.service is equal to the number
+			 * of sectors that the process associated with
+			 * bfqq requested to read/write before waiting
+			 * for request completions, or blocking for
+			 * other reasons.
+			 */
+			budget = max_t(int, bfqq->entity.service, min_budget);
+			break;
+		default:
+			return;
+		}
+	} else
+		/*
+		 * Async queues always get the maximum possible
+		 * budget, as for them we do not care about latency
+		 * (in addition, their ability to dispatch is limited
+		 * by the charging factor).
+		 */
+		budget = bfqd->bfq_max_budget;
+
+	bfqq->max_budget = budget;
+
+	if (bfqd->budgets_assigned >= bfq_stats_min_budgets &&
+	    !bfqd->bfq_user_max_budget)
+		bfqq->max_budget = min(bfqq->max_budget, bfqd->bfq_max_budget);
+
+	/*
+	 * If there is still backlog, then assign a new budget, making
+	 * sure that it is large enough for the next request.  Since
+	 * the finish time of bfqq must be kept in sync with the
+	 * budget, be sure to call __bfq_bfqq_expire() *after* this
+	 * update.
+	 *
+	 * If there is no backlog, then no need to update the budget;
+	 * it will be updated on the arrival of a new request.
+	 */
+	next_rq = bfqq->next_rq;
+	if (next_rq)
+		bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
+					    bfq_serv_to_charge(next_rq, bfqq));
+
+	bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %d",
+			next_rq ? blk_rq_sectors(next_rq) : 0,
+			bfqq->entity.budget);
+}
+
+static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
+{
+	unsigned long max_budget;
+
+	/*
+	 * The max_budget calculated when autotuning is equal to the
+	 * number of sectors that can be transferred, at the estimated
+	 * peak rate, during the budget timeout.
+	 */
+	max_budget = (unsigned long)(peak_rate * 1000 *
+				     timeout >> BFQ_RATE_SHIFT);
+
+	return max_budget;
+}
+
+/*
+ * In addition to updating the peak rate, check whether the process
+ * is "slow", and return true if so. This slow flag is used, in
+ * addition to the budget timeout, to reduce the amount of service
+ * provided to seeky processes, and hence reduce their chances of
+ * lowering the throughput. See the code for more details.
+ */
+static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				 bool compensate)
+{
+	u64 bw, usecs, expected, timeout;
+	ktime_t delta;
+	int update = 0;
+
+	if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
+		return false;
+
+	if (compensate)
+		delta = bfqd->last_idling_start;
+	else
+		delta = ktime_get();
+	delta = ktime_sub(delta, bfqd->last_budget_start);
+	usecs = ktime_to_us(delta);
+
+	/* Don't trust short/unrealistic values. */
+	if (usecs < 100 || usecs >= LONG_MAX)
+		return false;
+
+	/*
+	 * Calculate the bandwidth for the last slice.  We use a 64 bit
+	 * value to store the peak rate, in sectors per usec in fixed
+	 * point math.  We do so to have enough precision in the estimate
+	 * and to avoid overflows.
+	 */
+	bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
+	do_div(bw, (unsigned long)usecs);
+
+	timeout = jiffies_to_msecs(bfqd->bfq_timeout);
+
+	/*
+	 * Use only long (> 20ms) intervals to filter out spikes for
+	 * the peak rate estimation.
+	 */
+	if (usecs > 20000) {
+		if (bw > bfqd->peak_rate) {
+			bfqd->peak_rate = bw;
+			update = 1;
+			bfq_log(bfqd, "new peak_rate=%llu", bw);
+		}
+
+		update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
+
+		if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
+			bfqd->peak_rate_samples++;
+
+		if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
+		    update && bfqd->bfq_user_max_budget == 0) {
+			bfqd->bfq_max_budget =
+				bfq_calc_max_budget(bfqd->peak_rate,
+						    timeout);
+			bfq_log(bfqd, "new max_budget=%d",
+				bfqd->bfq_max_budget);
+		}
+	}
+
+	/*
+	 * A process is considered ``slow'' (i.e., seeky, so that we
+	 * cannot treat it fairly in the service domain, as it would
+	 * slow down the other processes too much) if, when a slice
+	 * ends for whatever reason, it has received service at a
+	 * rate that would not be high enough to complete the budget
+	 * before the budget timeout expiration.
+	 */
+	expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT;
+
+	/*
+	 * Caveat: processes doing IO in the slower disk zones will
+	 * tend to be slow(er) even if not seeky. And the estimated
+	 * peak rate will actually be an average over the disk
+	 * surface. Hence, to not be too harsh with unlucky processes,
+	 * we keep a budget/3 margin of safety before declaring a
+	 * process slow.
+	 */
+	return expected > (4 * bfqq->entity.budget) / 3;
+}
+
+/*
+ * Return the farthest past time instant according to jiffies
+ * macros.
+ */
+static unsigned long bfq_smallest_from_now(void)
+{
+	return jiffies - MAX_JIFFY_OFFSET;
+}
+
+/**
+ * bfq_bfqq_expire - expire a queue.
+ * @bfqd: device owning the queue.
+ * @bfqq: the queue to expire.
+ * @compensate: if true, compensate for the time spent idling.
+ * @reason: the reason causing the expiration.
+ *
+ * If the process associated with the queue is slow (i.e., seeky), or
+ * in case of budget timeout, or, finally, if it is async, we
+ * artificially charge it an entire budget (independently of the
+ * actual service it received). As a consequence, the queue will get
+ * higher timestamps than the correct ones upon reactivation, and
+ * hence it will be rescheduled as if it had received more service
+ * than what it actually received. In the end, this class of processes
+ * will receive less service in proportion to how slowly they consume
+ * their budgets (and hence how seriously they tend to lower the
+ * throughput).
+ *
+ * In contrast, when a queue expires because it has been idling for
+ * too much or because it exhausted its budget, we do not touch the
+ * amount of service it has received. Hence when the queue will be
+ * reactivated and its timestamps updated, the latter will be in sync
+ * with the actual service received by the queue until expiration.
+ *
+ * Charging a full budget to the first type of queues and the exact
+ * service to the others has the effect of using the WF2Q+ policy to
+ * schedule the former on a timeslice basis, without violating the
+ * service domain guarantees of the latter.
+ */
+static void bfq_bfqq_expire(struct bfq_data *bfqd,
+			    struct bfq_queue *bfqq,
+			    bool compensate,
+			    enum bfqq_expiration reason)
+{
+	bool slow;
+
+	/*
+	 * Update device peak rate for autotuning and check whether the
+	 * process is slow (see bfq_update_peak_rate).
+	 */
+	slow = bfq_update_peak_rate(bfqd, bfqq, compensate);
+
+	/*
+	 * As explained above, 'punish' slow (i.e., seeky), timed-out
+	 * and async queues, to favor sequential sync workloads.
+	 */
+	if (slow || reason == BFQ_BFQQ_BUDGET_TIMEOUT)
+		bfq_bfqq_charge_full_budget(bfqq);
+
+	if (reason == BFQ_BFQQ_TOO_IDLE &&
+	    bfqq->entity.service <= 2 * bfqq->entity.budget / 10)
+		bfq_clear_bfqq_IO_bound(bfqq);
+
+	bfq_log_bfqq(bfqd, bfqq,
+		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
+		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
+
+	/*
+	 * Increase, decrease or leave budget unchanged according to
+	 * reason.
+	 */
+	__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
+	__bfq_bfqq_expire(bfqd, bfqq);
+
+	if (!bfq_bfqq_busy(bfqq) &&
+	    reason != BFQ_BFQQ_BUDGET_TIMEOUT &&
+	    reason != BFQ_BFQQ_BUDGET_EXHAUSTED)
+		bfq_mark_bfqq_non_blocking_wait_rq(bfqq);
+}
+
+/*
+ * Budget timeout is not implemented through a dedicated timer, but
+ * just checked on request arrivals and completions, as well as on
+ * idle timer expirations.
+ */
+static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
+{
+	if (bfq_bfqq_budget_new(bfqq) ||
+	    time_before(jiffies, bfqq->budget_timeout))
+		return false;
+	return true;
+}
+
+/*
+ * If we expire a queue that is actively waiting (i.e., with the
+ * device idled) for the arrival of a new request, then we may incur
+ * the timestamp misalignment problem described in the body of the
+ * function __bfq_activate_entity. Hence we return true only if this
+ * condition does not hold, or if the queue is consuming its budget so
+ * slowly that expiring it is the best way to preserve a high
+ * throughput.
+ */
+static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqq->bfqd, bfqq,
+		"may_budget_timeout: wait_request %d left %d timeout %d",
+		bfq_bfqq_wait_request(bfqq),
+			bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3,
+		bfq_bfqq_budget_timeout(bfqq));
+
+	return (!bfq_bfqq_wait_request(bfqq) ||
+		bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3)
+		&&
+		bfq_bfqq_budget_timeout(bfqq);
+}
+
+/*
+ * For a queue that becomes empty, device idling is allowed only if
+ * this function returns true for the queue. And this function returns
+ * true only if idling is beneficial for throughput.
+ */
+static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
+{
+	struct bfq_data *bfqd = bfqq->bfqd;
+	bool idling_boosts_thr;
+
+	if (bfqd->strict_guarantees)
+		return true;
+
+	/*
+	 * The value of the next variable is computed considering that
+	 * idling is usually beneficial for the throughput if:
+	 * (a) the device is not NCQ-capable, or
+	 * (b) regardless of the presence of NCQ, the request pattern
+	 *     for bfqq is I/O-bound (possible throughput losses
+	 *     caused by granting idling to seeky queues are mitigated
+	 *     by the fact that, in all scenarios where boosting
+	 *     throughput is the best thing to do, i.e., in all
+	 *     symmetric scenarios, only a minimal idle time is
+	 *     allowed to seeky queues).
+	 */
+	idling_boosts_thr = !bfqd->hw_tag || bfq_bfqq_IO_bound(bfqq);
+
+	/*
+	 * We have now the components we need to compute the return
+	 * value of the function, which is true only if both the
+	 * following conditions hold:
+	 * 1) bfqq is sync, because idling makes sense only for sync queues;
+	 * 2) idling boosts the throughput.
+	 */
+	return bfq_bfqq_sync(bfqq) && idling_boosts_thr;
+}
+
+/*
+ * If the in-service queue is empty but the function bfq_bfqq_may_idle
+ * returns true, then:
+ * 1) the queue must remain in service and cannot be expired, and
+ * 2) the device must be idled to wait for the possible arrival of a new
+ *    request for the queue.
+ * See the comments on the function bfq_bfqq_may_idle for the reasons
+ * why performing device idling is the best choice to boost the throughput
+ * and preserve service guarantees when bfq_bfqq_may_idle itself
+ * returns true.
+ */
+static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
+{
+	struct bfq_data *bfqd = bfqq->bfqd;
+
+	return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 &&
+	       bfq_bfqq_may_idle(bfqq);
+}
+
+/*
+ * Select a queue for service.  If we have a current queue in service,
+ * check whether to continue servicing it, or retrieve and set a new one.
+ */
+static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq;
+	struct request *next_rq;
+	enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+
+	bfqq = bfqd->in_service_queue;
+	if (!bfqq)
+		goto new_queue;
+
+	bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
+
+	if (bfq_may_expire_for_budg_timeout(bfqq) &&
+	    !hrtimer_active(&bfqd->idle_slice_timer) &&
+	    !bfq_bfqq_must_idle(bfqq))
+		goto expire;
+
+	next_rq = bfqq->next_rq;
+	/*
+	 * If bfqq has requests queued and it has enough budget left to
+	 * serve them, keep the queue, otherwise expire it.
+	 */
+	if (next_rq) {
+		if (bfq_serv_to_charge(next_rq, bfqq) >
+			bfq_bfqq_budget_left(bfqq)) {
+			reason = BFQ_BFQQ_BUDGET_EXHAUSTED;
+			goto expire;
+		} else {
+			/*
+			 * The idle timer may be pending because we may
+			 * not disable disk idling even when a new request
+			 * arrives.
+			 */
+			if (bfq_bfqq_wait_request(bfqq)) {
+				/*
+				 * If we get here: 1) at least one new
+				 * request has arrived but we have not
+				 * disabled the timer because the request
+				 * was too small, and 2) the block layer
+				 * has then unplugged the device, causing
+				 * this dispatch to be invoked.
+				 *
+				 * Since the device is unplugged, now the
+				 * requests are probably large enough to
+				 * provide a reasonable throughput.
+				 * So we disable idling.
+				 */
+				bfq_clear_bfqq_wait_request(bfqq);
+				hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
+			}
+			goto keep_queue;
+		}
+	}
+
+	/*
+	 * No requests pending. However, if the in-service queue is idling
+	 * for a new request, or has requests waiting for a completion and
+	 * may idle after their completion, then keep it anyway.
+	 */
+	if (hrtimer_active(&bfqd->idle_slice_timer) ||
+	    (bfqq->dispatched != 0 && bfq_bfqq_may_idle(bfqq))) {
+		bfqq = NULL;
+		goto keep_queue;
+	}
+
+	reason = BFQ_BFQQ_NO_MORE_REQUESTS;
+expire:
+	bfq_bfqq_expire(bfqd, bfqq, false, reason);
+new_queue:
+	bfqq = bfq_set_in_service_queue(bfqd);
+	bfq_log(bfqd, "select_queue: new queue %d returned",
+		bfqq ? bfqq->pid : 0);
+keep_queue:
+	return bfqq;
+}
+
+/*
+ * Dispatch one request from bfqq, moving it to the request queue
+ * dispatch list.
+ */
+static int bfq_dispatch_request(struct bfq_data *bfqd,
+				struct bfq_queue *bfqq)
+{
+	int dispatched = 0;
+	struct request *rq;
+	unsigned long service_to_charge;
+
+	/* Follow expired path, else get first next available. */
+	rq = bfq_check_fifo(bfqq);
+	if (!rq)
+		rq = bfqq->next_rq;
+	service_to_charge = bfq_serv_to_charge(rq, bfqq);
+
+	if (service_to_charge > bfq_bfqq_budget_left(bfqq)) {
+		/*
+		 * This may happen if the next rq is chosen in fifo order
+		 * instead of sector order. The budget is properly
+		 * dimensioned to be always sufficient to serve the next
+		 * request only if it is chosen in sector order. The reason
+		 * is that it would be quite inefficient, and of little use,
+		 * to always make sure that the budget is large enough to
+		 * serve even the possible next rq in fifo order.
+		 * In fact, requests are seldom served in fifo order.
+		 *
+		 * Expire the queue for budget exhaustion, and make sure
+		 * that the next budget assigned to it is enough to serve
+		 * the next request, even if it comes from the fifo
+		 * expired path.
+		 */
+		bfqq->next_rq = rq;
+		/*
+		 * Since this dispatch failed, make sure that
+		 * a new one will be scheduled.
+		 */
+		if (!bfqd->rq_in_driver)
+			bfq_schedule_dispatch(bfqd);
+		goto expire;
+	}
+
+	/* Finally, insert request into driver dispatch list. */
+	bfq_bfqq_served(bfqq, service_to_charge);
+	bfq_dispatch_insert(bfqd->queue, rq);
+
+	bfq_log_bfqq(bfqd, bfqq,
+			"dispatched %u sec req (%llu), budg left %d",
+			blk_rq_sectors(rq),
+			(unsigned long long)blk_rq_pos(rq),
+			bfq_bfqq_budget_left(bfqq));
+
+	dispatched++;
+
+	if (!bfqd->in_service_bic) {
+		atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount);
+		bfqd->in_service_bic = RQ_BIC(rq);
+	}
+
+	if (bfqd->busy_queues > 1 && bfq_class_idle(bfqq))
+		goto expire;
+
+	return dispatched;
+
+expire:
+	bfq_bfqq_expire(bfqd, bfqq, false, BFQ_BFQQ_BUDGET_EXHAUSTED);
+	return dispatched;
+}
+
+static int __bfq_forced_dispatch_bfqq(struct bfq_queue *bfqq)
+{
+	int dispatched = 0;
+
+	while (bfqq->next_rq) {
+		bfq_dispatch_insert(bfqq->bfqd->queue, bfqq->next_rq);
+		dispatched++;
+	}
+
+	return dispatched;
+}
+
+/*
+ * Drain our current requests.
+ * Used for barriers and when switching io schedulers on-the-fly.
+ */
+static int bfq_forced_dispatch(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq, *n;
+	struct bfq_service_tree *st;
+	int dispatched = 0;
+
+	bfqq = bfqd->in_service_queue;
+	if (bfqq)
+		__bfq_bfqq_expire(bfqd, bfqq);
+
+	/*
+	 * Loop through classes, and be careful to leave the scheduler
+	 * in a consistent state, as feedback mechanisms and vtime
+	 * updates cannot be disabled during the process.
+	 */
+	list_for_each_entry_safe(bfqq, n, &bfqd->active_list, bfqq_list) {
+		st = bfq_entity_service_tree(&bfqq->entity);
+
+		dispatched += __bfq_forced_dispatch_bfqq(bfqq);
+
+		bfqq->max_budget = bfq_max_budget(bfqd);
+
+		bfq_forget_idle(st);
+	}
+
+	return dispatched;
+}
+
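+/*
+ * Dispatch hook: select the queue to serve and try to dispatch one of
+ * its requests, unless a forced (drain) dispatch is requested.
+ */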
+static int bfq_dispatch_requests(struct request_queue *q, int force)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq;
+
+	bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
+
+	if (bfqd->busy_queues == 0)
+		return 0;
+
+	if (unlikely(force))
+		return bfq_forced_dispatch(bfqd);
+
+	/*
+	 * Force device to serve one request at a time if
+	 * strict_guarantees is true. Forcing this service scheme is
+	 * currently the ONLY way to guarantee that the request
+	 * service order enforced by the scheduler is respected by a
+	 * queueing device. Otherwise the device is free even to make
+	 * some unlucky request wait for as long as the device
+	 * wishes.
+	 *
+	 * Of course, serving one request at a time may cause loss of
+	 * throughput.
+	 */
+	if (bfqd->strict_guarantees && bfqd->rq_in_driver > 0)
+		return 0;
+
+	bfqq = bfq_select_queue(bfqd);
+	if (!bfqq)
+		return 0;
+
+	if (!bfq_dispatch_request(bfqd, bfqq))
+		return 0;
+
+	bfq_log_bfqq(bfqd, bfqq, "dispatched %s request",
+			bfq_bfqq_sync(bfqq) ? "sync" : "async");
+
+	return 1;
+}
+
+/*
+ * Task holds one reference to the queue, dropped when task exits.  Each rq
+ * in-flight on this queue also holds a reference, dropped when rq is freed.
+ *
+ * Queue lock must be held here.
+ */
+static void bfq_put_queue(struct bfq_queue *bfqq)
+{
+	bfqq->ref--;
+	if (bfqq->ref)
+		return;
+
+	kmem_cache_free(bfq_pool, bfqq);
+}
+
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	if (bfqq == bfqd->in_service_queue) {
+		__bfq_bfqq_expire(bfqd, bfqq);
+		bfq_schedule_dispatch(bfqd);
+	}
+
+	bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq, bfqq->ref);
+
+	bfq_put_queue(bfqq);
+}
+
+static void bfq_init_icq(struct io_cq *icq)
+{
+	icq_to_bic(icq)->ttime.last_end_request = ktime_get_ns() - (1ULL<<32);
+}
+
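+/*
+ * Release the queues associated with the exiting io_cq.
+ */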
+static void bfq_exit_icq(struct io_cq *icq)
+{
+	struct bfq_io_cq *bic = icq_to_bic(icq);
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+
+	if (bic_to_bfqq(bic, false)) {
+		bfq_exit_bfqq(bfqd, bic_to_bfqq(bic, false));
+		bic_set_bfqq(bic, NULL, false);
+	}
+
+	if (bic_to_bfqq(bic, true)) {
+		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
+		bic->bfqq[BLK_RW_SYNC] = NULL;
+	}
+}
+
+/*
+ * Update the entity prio values; note that the new values will not
+ * be used until the next (re)activation.
+ */
+static void
+bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+{
+	struct task_struct *tsk = current;
+	int ioprio_class;
+
+	ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
+	switch (ioprio_class) {
+	default:
+		dev_err(bfqq->bfqd->queue->backing_dev_info.dev,
+			"bfq: bad prio class %d\n", ioprio_class);
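+		/* fall through */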
+	case IOPRIO_CLASS_NONE:
+		/*
+		 * No prio set, inherit CPU scheduling settings.
+		 */
+		bfqq->new_ioprio = task_nice_ioprio(tsk);
+		bfqq->new_ioprio_class = task_nice_ioclass(tsk);
+		break;
+	case IOPRIO_CLASS_RT:
+		bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
+		break;
+	case IOPRIO_CLASS_BE:
+		bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->new_ioprio_class = IOPRIO_CLASS_BE;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE;
+		bfqq->new_ioprio = 7;
+		bfq_clear_bfqq_idle_window(bfqq);
+		break;
+	}
+
+	if (bfqq->new_ioprio >= IOPRIO_BE_NR) {
+		pr_crit("bfq_set_next_ioprio_data: new_ioprio %d\n",
+			bfqq->new_ioprio);
+		bfqq->new_ioprio = IOPRIO_BE_NR;
+	}
+
+	bfqq->entity.new_weight = bfq_ioprio_to_weight(bfqq->new_ioprio);
+	bfqq->entity.prio_changed = 1;
+}
+
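+/*
+ * If the I/O priority of the process has changed, switch its async
+ * queue to one matching the new priority, and flag its sync queue so
+ * that the new priority data are applied at the next (re)activation.
+ */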
+static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio)
+{
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+	struct bfq_queue *bfqq;
+	int ioprio = bic->icq.ioc->ioprio;
+
+	/*
+	 * This condition may trigger on a newly created bic, be sure to
+	 * drop the lock before returning.
+	 */
+	if (unlikely(!bfqd) || likely(bic->ioprio == ioprio))
+		return;
+
+	bic->ioprio = ioprio;
+
+	bfqq = bic_to_bfqq(bic, false);
+	if (bfqq) {
+		bfq_put_queue(bfqq);
+		bfqq = bfq_get_queue(bfqd, bio, BLK_RW_ASYNC, bic);
+		bic_set_bfqq(bic, bfqq, false);
+	}
+
+	bfqq = bic_to_bfqq(bic, true);
+	if (bfqq)
+		bfq_set_next_ioprio_data(bfqq, bic);
+}
+
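+/*
+ * Initialize a newly allocated bfq_queue: entity, fifo, priority data
+ * (if a bic is provided), sync/idle-window flags and a tentative
+ * budget.
+ */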
+static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			  struct bfq_io_cq *bic, pid_t pid, int is_sync)
+{
+	RB_CLEAR_NODE(&bfqq->entity.rb_node);
+	INIT_LIST_HEAD(&bfqq->fifo);
+
+	bfqq->ref = 0;
+	bfqq->bfqd = bfqd;
+
+	if (bic)
+		bfq_set_next_ioprio_data(bfqq, bic);
+
+	if (is_sync) {
+		if (!bfq_class_idle(bfqq))
+			bfq_mark_bfqq_idle_window(bfqq);
+		bfq_mark_bfqq_sync(bfqq);
+	} else
+		bfq_clear_bfqq_sync(bfqq);
+	bfq_mark_bfqq_IO_bound(bfqq);
+
+	bfqq->pid = pid;
+
+	/* Tentative initial value to trade off between thr and lat */
+	bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
+	bfqq->budget_timeout = bfq_smallest_from_now();
+
+	/* first request is almost certainly seeky */
+	bfqq->seek_history = 1;
+}
+
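+/*
+ * Return the slot holding the shared async queue for the given ioprio
+ * class and value.
+ */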
+static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
+					       int ioprio_class, int ioprio)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		return &async_bfqq[0][ioprio];
+	case IOPRIO_CLASS_NONE:
+		ioprio = IOPRIO_NORM;
+		/* fall through */
+	case IOPRIO_CLASS_BE:
+		return &async_bfqq[1][ioprio];
+	case IOPRIO_CLASS_IDLE:
+		return &async_idle_bfqq;
+	default:
+		return NULL;
+	}
+}
+
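+/*
+ * Return the queue to associate with this bio: for async requests the
+ * shared async queue (allocated on first use), for sync requests a
+ * newly allocated queue. The statically allocated oom_bfqq is used as
+ * a fallback if allocation fails.
+ */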
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
+				       struct bio *bio, bool is_sync,
+				       struct bfq_io_cq *bic)
+{
+	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
+	struct bfq_queue **async_bfqq = NULL;
+	struct bfq_queue *bfqq;
+
+	rcu_read_lock();
+
+	if (!is_sync) {
+		async_bfqq = bfq_async_queue_prio(bfqd, ioprio_class,
+						  ioprio);
+		bfqq = *async_bfqq;
+		if (bfqq)
+			goto out;
+	}
+
+	bfqq = kmem_cache_alloc_node(bfq_pool, GFP_NOWAIT | __GFP_ZERO,
+				     bfqd->queue->node);
+
+	if (bfqq) {
+		bfq_init_bfqq(bfqd, bfqq, bic, current->pid,
+			      is_sync);
+		bfq_init_entity(&bfqq->entity);
+		bfq_log_bfqq(bfqd, bfqq, "allocated");
+	} else {
+		bfqq = &bfqd->oom_bfqq;
+		bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");
+		goto out;
+	}
+
+	/*
+	 * Pin the queue now that it's allocated, scheduler exit will
+	 * prune it.
+	 */
+	if (async_bfqq) {
+		bfqq->ref++;
+		bfq_log_bfqq(bfqd, bfqq,
+			     "get_queue, bfqq not in async: %p, %d",
+			     bfqq, bfqq->ref);
+		*async_bfqq = bfqq;
+	}
+
+out:
+	bfqq->ref++;
+	bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq, bfqq->ref);
+	rcu_read_unlock();
+	return bfqq;
+}
+
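+/*
+ * Update the think-time statistics of bic, i.e., a decaying average of
+ * the time elapsed between the completion of the last request and the
+ * arrival of the next one.
+ */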
+static void bfq_update_io_thinktime(struct bfq_data *bfqd,
+				    struct bfq_io_cq *bic)
+{
+	struct bfq_ttime *ttime = &bic->ttime;
+	u64 elapsed = ktime_get_ns() - bic->ttime.last_end_request;
+
+	elapsed = min_t(u64, elapsed, 2ULL * bfqd->bfq_slice_idle);
+
+	ttime->ttime_samples = (7*bic->ttime.ttime_samples + 256) / 8;
+	ttime->ttime_total = div_u64(7*ttime->ttime_total + 256*elapsed,  8);
+	ttime->ttime_mean = div64_ul(ttime->ttime_total + 128,
+				     ttime->ttime_samples);
+}
+
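+/*
+ * Update the seek history of bfqq: record whether the distance between
+ * the new request and the previous one exceeds the seek threshold.
+ */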
+static void
+bfq_update_io_seektime(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+		       struct request *rq)
+{
+	sector_t sdist = 0;
+
+	if (bfqq->last_request_pos) {
+		if (bfqq->last_request_pos < blk_rq_pos(rq))
+			sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
+		else
+			sdist = bfqq->last_request_pos - blk_rq_pos(rq);
+	}
+
+	bfqq->seek_history <<= 1;
+	bfqq->seek_history |= (sdist > BFQQ_SEEK_THR);
+}
+
+/*
+ * Disable idle window if the process thinks too long or seeks so much that
+ * it doesn't matter.
+ */
+static void bfq_update_idle_window(struct bfq_data *bfqd,
+				   struct bfq_queue *bfqq,
+				   struct bfq_io_cq *bic)
+{
+	int enable_idle;
+
+	/* Don't idle for async or idle io prio class. */
+	if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
+		return;
+
+	enable_idle = bfq_bfqq_idle_window(bfqq);
+
+	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
+	    bfqd->bfq_slice_idle == 0 ||
+		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+		enable_idle = 0;
+	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
+		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+	bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d",
+		enable_idle);
+
+	if (enable_idle)
+		bfq_mark_bfqq_idle_window(bfqq);
+	else
+		bfq_clear_bfqq_idle_window(bfqq);
+}
+
+/*
+ * Called when a new fs request (rq) is added to bfqq.  Check if there's
+ * something we should do about it.
+ */
+static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			    struct request *rq)
+{
+	struct bfq_io_cq *bic = RQ_BIC(rq);
+
+	if (rq->cmd_flags & REQ_META)
+		bfqq->meta_pending++;
+
+	bfq_update_io_thinktime(bfqd, bic);
+	bfq_update_io_seektime(bfqd, bfqq, rq);
+	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
+	    !BFQQ_SEEKY(bfqq))
+		bfq_update_idle_window(bfqd, bfqq, bic);
+
+	bfq_log_bfqq(bfqd, bfqq,
+		     "rq_enqueued: idle_window=%d (seeky %d)",
+		     bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq));
+
+	bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
+
+	if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
+		bool small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&
+				 blk_rq_sectors(rq) < 32;
+		bool budget_timeout = bfq_bfqq_budget_timeout(bfqq);
+
+		/*
+		 * There is just this request queued: if the request
+		 * is small and the queue is not to be expired, then
+		 * just exit.
+		 *
+		 * In this way, if the device is being idled to wait
+		 * for a new request from the in-service queue, we
+		 * avoid unplugging the device and committing the
+		 * device to serve just a small request. We wait
+		 * instead for the block layer to decide
+		 * when to unplug the device: hopefully, new requests
+		 * will be merged to this one quickly, then the device
+		 * will be unplugged and larger requests will be
+		 * dispatched.
+		 */
+		if (small_req && !budget_timeout)
+			return;
+
+		/*
+		 * A large enough request arrived, or the queue is to
+		 * be expired: in both cases disk idling is to be
+		 * stopped, so clear wait_request flag and reset
+		 * timer.
+		 */
+		bfq_clear_bfqq_wait_request(bfqq);
+		hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
+
+		/*
+		 * The queue is not empty, because a new request just
+		 * arrived. Hence we can safely expire the queue, in
+		 * case of budget timeout, without risking that the
+		 * timestamps of the queue are not updated correctly.
+		 * See [1] for more details.
+		 */
+		if (budget_timeout)
+			bfq_bfqq_expire(bfqd, bfqq, false,
+					BFQ_BFQQ_BUDGET_TIMEOUT);
+
+		/*
+		 * Let the request rip immediately, or let a new queue be
+		 * selected if bfqq has just been expired.
+		 */
+		__blk_run_queue(bfqd->queue);
+	}
+}
+
+static void bfq_insert_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	assert_spin_locked(bfqd->queue->queue_lock);
+
+	bfq_add_request(rq);
+
+	rq->fifo_time = ktime_get_ns() + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
+	list_add_tail(&rq->queuelist, &bfqq->fifo);
+
+	bfq_rq_enqueued(bfqd, bfqq, rq);
+}
+
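+/*
+ * Infer whether the drive performs internal command queueing, from the
+ * maximum number of requests observed in flight.
+ */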
+static void bfq_update_hw_tag(struct bfq_data *bfqd)
+{
+	bfqd->max_rq_in_driver = max_t(int, bfqd->max_rq_in_driver,
+				       bfqd->rq_in_driver);
+
+	if (bfqd->hw_tag == 1)
+		return;
+
+	/*
+	 * This sample is valid if the number of outstanding requests
+	 * is large enough to allow a queueing behavior.  Note that the
+	 * sum is not exact, as it's not taking into account deactivated
+	 * requests.
+	 */
+	if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
+		return;
+
+	if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
+		return;
+
+	bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
+	bfqd->max_rq_in_driver = 0;
+	bfqd->hw_tag_samples = 0;
+}
+
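+/*
+ * Handle the completion of rq: update statistics, then decide whether
+ * the in-service queue must idle, be expired, or whether a new
+ * dispatch round must be scheduled.
+ */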
+static void bfq_completed_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_data *bfqd = bfqq->bfqd;
+
+	bfq_update_hw_tag(bfqd);
+
+	bfqd->rq_in_driver--;
+	bfqq->dispatched--;
+
+	RQ_BIC(rq)->ttime.last_end_request = ktime_get_ns();
+
+	/*
+	 * If this is the in-service queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+	if (bfqd->in_service_queue == bfqq) {
+		if (bfq_bfqq_budget_new(bfqq))
+			bfq_set_budget_timeout(bfqd);
+
+		if (bfq_bfqq_must_idle(bfqq)) {
+			bfq_arm_slice_timer(bfqd);
+			goto out;
+		} else if (bfq_may_expire_for_budg_timeout(bfqq))
+			bfq_bfqq_expire(bfqd, bfqq, false,
+					BFQ_BFQQ_BUDGET_TIMEOUT);
+		else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
+			 (bfqq->dispatched == 0 ||
+			  !bfq_bfqq_may_idle(bfqq)))
+			bfq_bfqq_expire(bfqd, bfqq, false,
+					BFQ_BFQQ_NO_MORE_REQUESTS);
+	}
+
+	if (!bfqd->rq_in_driver)
+		bfq_schedule_dispatch(bfqd);
+
+out:
+	return;
+}
+
+static int __bfq_may_queue(struct bfq_queue *bfqq)
+{
+	if (bfq_bfqq_wait_request(bfqq) && bfq_bfqq_must_alloc(bfqq)) {
+		bfq_clear_bfqq_must_alloc(bfqq);
+		return ELV_MQUEUE_MUST;
+	}
+
+	return ELV_MQUEUE_MAY;
+}
+
+static int bfq_may_queue(struct request_queue *q, int op, int op_flags)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct task_struct *tsk = current;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	/*
+	 * Don't force setup of a queue from here, as a call to may_queue
+	 * does not necessarily imply that a request actually will be
+	 * queued. So just lookup a possibly existing queue, or return
+	 * 'may queue' if that fails.
+	 */
+	bic = bfq_bic_lookup(bfqd, tsk->io_context);
+	if (!bic)
+		return ELV_MQUEUE_MAY;
+
+	bfqq = bic_to_bfqq(bic, rw_is_sync(op, op_flags));
+	if (bfqq)
+		return __bfq_may_queue(bfqq);
+
+	return ELV_MQUEUE_MAY;
+}
+
+/*
+ * Queue lock held here.
+ */
+static void bfq_put_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	if (bfqq) {
+		const int rw = rq_data_dir(rq);
+
+		bfqq->allocated[rw]--;
+
+		rq->elv.priv[0] = NULL;
+		rq->elv.priv[1] = NULL;
+
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "put_request %p, %d",
+			     bfqq, bfqq->ref);
+		bfq_put_queue(bfqq);
+	}
+}
+
+/*
+ * Allocate bfq data structures associated with this request.
+ */
+static int bfq_set_request(struct request_queue *q, struct request *rq,
+			   struct bio *bio, gfp_t gfp_mask)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
+	const int rw = rq_data_dir(rq);
+	const int is_sync = rq_is_sync(rq);
+	struct bfq_queue *bfqq;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	if (!bic)
+		goto queue_fail;
+
+	bfq_check_ioprio_change(bic, bio);
+
+	bfqq = bic_to_bfqq(bic, is_sync);
+	if (!bfqq || bfqq == &bfqd->oom_bfqq) {
+		if (bfqq)
+			bfq_put_queue(bfqq);
+		bfqq = bfq_get_queue(bfqd, bio, is_sync, bic);
+		bic_set_bfqq(bic, bfqq, is_sync);
+	}
+
+	bfqq->allocated[rw]++;
+	bfqq->ref++;
+	bfq_log_bfqq(bfqd, bfqq, "set_request: bfqq %p, %d", bfqq, bfqq->ref);
+
+	rq->elv.priv[0] = bic;
+	rq->elv.priv[1] = bfqq;
+
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return 0;
+
+queue_fail:
+	bfq_schedule_dispatch(bfqd);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return 1;
+}
+
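+/*
+ * Run the request queue from a work item, to start a new dispatch
+ * round.
+ */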
+static void bfq_kick_queue(struct work_struct *work)
+{
+	struct bfq_data *bfqd =
+		container_of(work, struct bfq_data, unplug_work);
+	struct request_queue *q = bfqd->queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+/*
+ * Handler of the expiration of the timer running if the in-service queue
+ * is idling inside its time slice.
+ */
+static enum hrtimer_restart bfq_idle_slice_timer(struct hrtimer *timer)
+{
+	struct bfq_data *bfqd = container_of(timer, struct bfq_data,
+					     idle_slice_timer);
+	struct bfq_queue *bfqq;
+	unsigned long flags;
+	enum bfqq_expiration reason;
+
+	spin_lock_irqsave(bfqd->queue->queue_lock, flags);
+
+	bfqq = bfqd->in_service_queue;
+	/*
+	 * Theoretical race here: the in-service queue can be NULL or
+	 * different from the queue that was idling if the timer handler
+	 * spins on the queue_lock and a new request arrives for the
+	 * current queue and there is a full dispatch cycle that changes
+	 * the in-service queue.  This hardly ever happens, but in the worst
+	 * case we just expire a queue too early.
+	 */
+	if (bfqq) {
+		bfq_log_bfqq(bfqd, bfqq, "slice_timer expired");
+		bfq_clear_bfqq_wait_request(bfqq);
+
+		if (bfq_bfqq_budget_timeout(bfqq))
+			/*
+			 * Also here the queue can be safely expired
+			 * for budget timeout without wasting
+			 * guarantees
+			 */
+			reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+		else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0)
+			/*
+			 * The queue may not be empty upon timer expiration,
+			 * because we may not disable the timer when the
+			 * first request of the in-service queue arrives
+			 * during disk idling.
+			 */
+			reason = BFQ_BFQQ_TOO_IDLE;
+		else
+			goto schedule_dispatch;
+
+		bfq_bfqq_expire(bfqd, bfqq, true, reason);
+	}
+
+schedule_dispatch:
+	bfq_schedule_dispatch(bfqd);
+
+	spin_unlock_irqrestore(bfqd->queue->queue_lock, flags);
+	return HRTIMER_NORESTART;
+}
+
+static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
+{
+	hrtimer_cancel(&bfqd->idle_slice_timer);
+	cancel_work_sync(&bfqd->unplug_work);
+}
+
+static void __bfq_put_async_bfqq(struct bfq_data *bfqd,
+					struct bfq_queue **bfqq_ptr)
+{
+	struct bfq_queue *bfqq = *bfqq_ptr;
+
+	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
+	if (bfqq) {
+		bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
+			     bfqq, bfqq->ref);
+		bfq_put_queue(bfqq);
+		*bfqq_ptr = NULL;
+	}
+}
+
+/*
+ * Release the extra reference of the async queues as the device
+ * goes away.
+ */
+static void bfq_put_async_queues(struct bfq_data *bfqd)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			__bfq_put_async_bfqq(bfqd, &async_bfqq[i][j]);
+
+	__bfq_put_async_bfqq(bfqd, &async_idle_bfqq);
+}
+
+static void bfq_exit_queue(struct elevator_queue *e)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	struct request_queue *q = bfqd->queue;
+	struct bfq_queue *bfqq, *n;
+
+	bfq_shutdown_timer_wq(bfqd);
+
+	spin_lock_irq(q->queue_lock);
+
+	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
+		bfq_deactivate_bfqq(bfqd, bfqq, 0);
+
+	bfq_put_async_queues(bfqd);
+	spin_unlock_irq(q->queue_lock);
+
+	bfq_shutdown_timer_wq(bfqd);
+
+	kfree(bfqd);
+}
+
+static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+{
+	struct bfq_data *bfqd;
+	struct elevator_queue *eq;
+	int i;
+
+	eq = elevator_alloc(q, e);
+	if (!eq)
+		return -ENOMEM;
+
+	bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
+	if (!bfqd) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+	eq->elevator_data = bfqd;
+
+	/*
+	 * Our fallback bfqq if bfq_get_queue() runs into OOM issues.
+	 * Grab a permanent reference to it, so that the normal code flow
+	 * will not attempt to free it.
+	 */
+	bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, NULL, 1, 0);
+	bfqd->oom_bfqq.ref++;
+	bfqd->oom_bfqq.new_ioprio = BFQ_DEFAULT_QUEUE_IOPRIO;
+	bfqd->oom_bfqq.new_ioprio_class = IOPRIO_CLASS_BE;
+	bfqd->oom_bfqq.entity.new_weight =
+		bfq_ioprio_to_weight(bfqd->oom_bfqq.new_ioprio);
+	/*
+	 * Trigger weight initialization, according to ioprio, at the
+	 * oom_bfqq's first activation. The oom_bfqq's ioprio and ioprio
+	 * class won't be changed any more.
+	 */
+	bfqd->oom_bfqq.entity.prio_changed = 1;
+
+	bfqd->queue = q;
+
+	spin_lock_irq(q->queue_lock);
+	q->elevator = eq;
+	spin_unlock_irq(q->queue_lock);
+
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+		bfqd->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+
+	hrtimer_init(&bfqd->idle_slice_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL);
+	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
+
+	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
+
+	INIT_LIST_HEAD(&bfqd->active_list);
+	INIT_LIST_HEAD(&bfqd->idle_list);
+
+	bfqd->hw_tag = -1;
+
+	bfqd->bfq_max_budget = bfq_default_max_budget;
+
+	bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0];
+	bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1];
+	bfqd->bfq_back_max = bfq_back_max;
+	bfqd->bfq_back_penalty = bfq_back_penalty;
+	bfqd->bfq_slice_idle = bfq_slice_idle;
+	bfqd->bfq_class_idle_last_service = 0;
+	bfqd->bfq_timeout = bfq_timeout;
+
+	bfqd->bfq_requests_within_timer = 120;
+
+	return 0;
+}
+
+static void bfq_slab_kill(void)
+{
+	kmem_cache_destroy(bfq_pool);
+}
+
+static int __init bfq_slab_setup(void)
+{
+	bfq_pool = KMEM_CACHE(bfq_queue, 0);
+	if (!bfq_pool)
+		return -ENOMEM;
+	return 0;
+}
+
+static ssize_t bfq_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%u\n", var);
+}
+
+static ssize_t bfq_var_store(unsigned long *var, const char *page,
+			     size_t count)
+{
+	unsigned long new_val;
+	int ret = kstrtoul(page, 10, &new_val);
+
+	if (ret == 0)
+		*var = new_val;
+
+	return count;
+}
+
+static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
+{
+	struct bfq_queue *bfqq;
+	struct bfq_data *bfqd = e->elevator_data;
+	ssize_t num_char = 0;
+
+	num_char += sprintf(page + num_char, "Tot reqs queued %d\n\n",
+			    bfqd->queued);
+
+	spin_lock_irq(bfqd->queue->queue_lock);
+
+	num_char += sprintf(page + num_char, "Active:\n");
+	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
+		num_char += sprintf(page + num_char,
+				    "pid%d: weight %hu, nr_queued %d %d\n",
+				    bfqq->pid,
+				    bfqq->entity.weight,
+				    bfqq->queued[0],
+				    bfqq->queued[1]);
+	}
+
+	num_char += sprintf(page + num_char, "Idle:\n");
+	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
+		num_char += sprintf(page + num_char,
+				    "pid%d: weight %hu\n",
+				    bfqq->pid,
+				    bfqq->entity.weight);
+	}
+
+	spin_unlock_irq(bfqd->queue->queue_lock);
+
+	return num_char;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct bfq_data *bfqd = e->elevator_data;			\
+	u64 __data = __VAR;						\
+	if (__CONV == 1)						\
+		__data = jiffies_to_msecs(__data);			\
+	else if (__CONV == 2)						\
+		__data = div_u64(__data, NSEC_PER_MSEC);		\
+	return bfq_var_show(__data, (page));				\
+}
+SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 2);
+SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 2);
+SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0);
+SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0);
+SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 2);
+SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
+SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout, 1);
+SHOW_FUNCTION(bfq_strict_guarantees_show, bfqd->strict_guarantees, 0);
+#undef SHOW_FUNCTION
+
+#define USEC_SHOW_FUNCTION(__FUNC, __VAR)				\
+static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct bfq_data *bfqd = e->elevator_data;			\
+	u64 __data = __VAR;						\
+	__data = div_u64(__data, NSEC_PER_USEC);			\
+	return bfq_var_show(__data, (page));				\
+}
+USEC_SHOW_FUNCTION(bfq_slice_idle_us_show, bfqd->bfq_slice_idle);
+#undef USEC_SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+static ssize_t								\
+__FUNC(struct elevator_queue *e, const char *page, size_t count)	\
+{									\
+	struct bfq_data *bfqd = e->elevator_data;			\
+	unsigned long uninitialized_var(__data);			\
+	int ret = bfq_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV == 1)						\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else if (__CONV == 2)						\
+		*(__PTR) = (u64)__data * NSEC_PER_MSEC;			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,
+		INT_MAX, 2);
+STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1,
+		INT_MAX, 2);
+STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
+STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
+		INT_MAX, 0);
+STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2);
+#undef STORE_FUNCTION
+
+#define USEC_STORE_FUNCTION(__FUNC, __PTR, MIN, MAX)			\
+static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\
+{									\
+	struct bfq_data *bfqd = e->elevator_data;			\
+	unsigned long __data;						\
+	int ret = bfq_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	*(__PTR) = (u64)__data * NSEC_PER_USEC;				\
+	return ret;							\
+}
+USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0,
+		    UINT_MAX);
+#undef USEC_STORE_FUNCTION
+
+/* do nothing for the moment */
+static ssize_t bfq_weights_store(struct elevator_queue *e,
+				    const char *page, size_t count)
+{
+	return count;
+}
+
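+/*
+ * Return the max budget to use when the user has not set one: the
+ * value computed from the estimated peak rate, if enough samples are
+ * available, or the default otherwise.
+ */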
+static unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd)
+{
+	u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout);
+
+	if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES)
+		return bfq_calc_max_budget(bfqd->peak_rate, timeout);
+	else
+		return bfq_default_max_budget;
+}
+
+static ssize_t bfq_max_budget_store(struct elevator_queue *e,
+				    const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data == 0)
+		bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+	else {
+		if (__data > INT_MAX)
+			__data = INT_MAX;
+		bfqd->bfq_max_budget = __data;
+	}
+
+	bfqd->bfq_user_max_budget = __data;
+
+	return ret;
+}
+
+/*
+ * Leaving this name to preserve name compatibility with cfq
+ * parameters, but this timeout is used for both sync and async.
+ */
+static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
+				      const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data < 1)
+		__data = 1;
+	else if (__data > INT_MAX)
+		__data = INT_MAX;
+
+	bfqd->bfq_timeout = msecs_to_jiffies(__data);
+	if (bfqd->bfq_user_max_budget == 0)
+		bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+
+	return ret;
+}
+
+static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e,
+				     const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data > 1)
+		__data = 1;
+	if (!bfqd->strict_guarantees && __data == 1
+	    && bfqd->bfq_slice_idle < 8 * NSEC_PER_MSEC)
+		bfqd->bfq_slice_idle = 8 * NSEC_PER_MSEC;
+
+	bfqd->strict_guarantees = __data;
+
+	return ret;
+}
+
+#define BFQ_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
+
+static struct elv_fs_entry bfq_attrs[] = {
+	BFQ_ATTR(fifo_expire_sync),
+	BFQ_ATTR(fifo_expire_async),
+	BFQ_ATTR(back_seek_max),
+	BFQ_ATTR(back_seek_penalty),
+	BFQ_ATTR(slice_idle),
+	BFQ_ATTR(slice_idle_us),
+	BFQ_ATTR(max_budget),
+	BFQ_ATTR(timeout_sync),
+	BFQ_ATTR(strict_guarantees),
+	BFQ_ATTR(weights),
+	__ATTR_NULL
+};
+
+static struct elevator_type iosched_bfq = {
+	.ops = {
+		.elevator_merge_fn =		bfq_merge,
+		.elevator_merged_fn =		bfq_merged_request,
+		.elevator_merge_req_fn =	bfq_merged_requests,
+		.elevator_allow_bio_merge_fn =	bfq_allow_bio_merge,
+		.elevator_allow_rq_merge_fn =	bfq_allow_rq_merge,
+		.elevator_dispatch_fn =		bfq_dispatch_requests,
+		.elevator_add_req_fn =		bfq_insert_request,
+		.elevator_activate_req_fn =	bfq_activate_request,
+		.elevator_deactivate_req_fn =	bfq_deactivate_request,
+		.elevator_completed_req_fn =	bfq_completed_request,
+		.elevator_former_req_fn =	elv_rb_former_request,
+		.elevator_latter_req_fn =	elv_rb_latter_request,
+		.elevator_init_icq_fn =		bfq_init_icq,
+		.elevator_exit_icq_fn =		bfq_exit_icq,
+		.elevator_set_req_fn =		bfq_set_request,
+		.elevator_put_req_fn =		bfq_put_request,
+		.elevator_may_queue_fn =	bfq_may_queue,
+		.elevator_init_fn =		bfq_init_queue,
+		.elevator_exit_fn =		bfq_exit_queue,
+	},
+	.icq_size =		sizeof(struct bfq_io_cq),
+	.icq_align =		__alignof__(struct bfq_io_cq),
+	.elevator_attrs =	bfq_attrs,
+	.elevator_name =	"bfq",
+	.elevator_owner =	THIS_MODULE,
+};
+
+static int __init bfq_init(void)
+{
+	int ret;
+	char msg[50] = "BFQ I/O-scheduler: v0";
+
+	ret = -ENOMEM;
+	if (bfq_slab_setup())
+		goto err_pol_unreg;
+
+	ret = elv_register(&iosched_bfq);
+	if (ret)
+		goto err_pol_unreg;
+
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	strcat(msg, " (with cgroups support)");
+#endif
+	pr_info("%s\n", msg);
+
+	return 0;
+
+err_pol_unreg:
+	return ret;
+}
+
+static void __exit bfq_exit(void)
+{
+	elv_unregister(&iosched_bfq);
+	bfq_slab_kill();
+}
+
+module_init(bfq_init);
+module_exit(bfq_exit);
+
+MODULE_AUTHOR("Arianna Avanzini, Fabio Checconi, Paolo Valente");
+MODULE_LICENSE("GPL");
-- 
2.10.0

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 02/14] block, bfq: add full hierarchical scheduling and cgroups support
  2016-10-26  9:27 [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Paolo Valente
  2016-10-26  9:27 ` [PATCH 01/14] block, bfq: " Paolo Valente
@ 2016-10-26  9:27 ` Paolo Valente
  2016-10-26  9:27 ` [PATCH 03/14] block, bfq: improve throughput boosting Paolo Valente
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-26  9:27 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	hare, arnd, bart.vanassche, grant.likely, jack, James.Bottomley,
	Arianna Avanzini, Fabio Checconi, Paolo Valente

From: Arianna Avanzini <avanzini.arianna@gmail.com>

Add complete support for full hierarchical scheduling, with a cgroups
interface. Full hierarchical scheduling is implemented through the
'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
associated with processes, and groups are represented in general by
entities. Given the bfq_queues associated with the processes belonging
to a given group, the entities representing these queues are children
of the entity representing the group. At higher levels, if a group, say
G, contains other groups, then the entity representing G is the parent
entity of the entities representing the groups in G.

Hierarchical scheduling is performed as follows: if the timestamps of
a leaf entity (i.e., of a bfq_queue) change, and such a change lets
the entity become the next-to-serve entity for its parent entity, then
the timestamps of the parent entity are recomputed as a function of
the budget of its new next-to-serve leaf entity. If the parent entity
belongs, in its turn, to a group, and its new timestamps let it become
the next-to-serve for its parent entity, then the timestamps of the
latter parent entity are recomputed as well, and so on. When a new
bfq_queue must be set in service, the reverse path is followed: the
next-to-serve highest-level entity is chosen, then its next-to-serve
child entity, and so on, until the next-to-serve leaf entity is
reached, and the bfq_queue that this entity represents is set in
service.

Writeback is accounted for on a per-group basis, i.e., for each group,
the async I/O requests of the processes of the group are enqueued in a
distinct bfq_queue, and the entity associated with this queue is a
child of the entity associated with the group.

Weights can be assigned explicitly to groups and processes through the
cgroups interface, unlike what happens for single processes when the
cgroups interface is not used (as explained in the description of the
previous patch). In particular, since each node has a full scheduler,
each group can be assigned its own weight.

Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/Kconfig.iosched  |    7 +
 block/bfq-iosched.c    | 1789 +++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/blkdev.h |    2 +-
 3 files changed, 1635 insertions(+), 163 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 48434bc..408b619 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -47,6 +47,13 @@ config IOSCHED_BFQ
 	  processes according to their weights, regardless of the
 	  device parameters and with any workload.
 
+config BFQ_GROUP_IOSCHED
+	bool "BFQ hierarchical scheduling support"
+	depends on IOSCHED_BFQ && BLK_CGROUP
+	default n
+	---help---
+	  Enable hierarchical scheduling in BFQ, using the blkio controller.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 643aeef..e33f85e 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -34,9 +34,15 @@
  * guarantee a low latency to non-I/O bound processes (the latter
  * often belong to time-sensitive applications).
  *
- * B-WF2Q+ is based on WF2Q+, which is described in [2], while the
- * augmented tree used here to implement B-WF2Q+ with O(log N)
- * complexity derives from the one introduced with EEVDF in [3].
+ * With respect to the version of BFQ presented in [1], and in the
+ * papers cited therein, this implementation adds a hierarchical
+ * extension based on H-WF2Q+. In this extension, the service of
+ * whole groups of queues is also scheduled using B-WF2Q+.
+ *
+ * B-WF2Q+ is based on WF2Q+, which is described in [2], together with
+ * H-WF2Q+, while the augmented tree used here to implement B-WF2Q+
+ * with O(log N) complexity derives from the one introduced with EEVDF
+ * in [3].
  *
  * [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
  *     Scheduler", Proceedings of the First Workshop on Mobile System
@@ -59,6 +65,7 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/blkdev.h>
+#include <linux/cgroup.h>
 #include <linux/elevator.h>
 #include <linux/ktime.h>
 #include <linux/rbtree.h>
@@ -78,7 +85,7 @@
 
 #define BFQ_DEFAULT_QUEUE_IOPRIO	4
 
-#define BFQ_DEFAULT_GRP_WEIGHT	10
+#define BFQ_WEIGHT_LEGACY_DFL	100
 #define BFQ_DEFAULT_GRP_IOPRIO	0
 #define BFQ_DEFAULT_GRP_CLASS	IOPRIO_CLASS_BE
 
@@ -110,10 +117,11 @@ struct bfq_service_tree {
  * struct bfq_sched_data - multi-class scheduler.
  *
  * bfq_sched_data is the basic scheduler queue.  It supports three
- * ioprio_classes, and can be used either as a toplevel queue or as
- * an intermediate queue on a hierarchical setup.
- * @next_in_service points to the active entity of the sched_data
- * service trees that will be scheduled next.
+ * ioprio_classes, and can be used either as a toplevel queue or as an
+ * intermediate queue on a hierarchical setup.  @next_in_service
+ * points to the active entity of the sched_data service trees that
+ * will be scheduled next. It is used to reduce the number of steps
+ * needed for each hierarchical-schedule update.
  *
  * The supported ioprio_classes are the same as in CFQ, in descending
  * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
@@ -124,7 +132,7 @@ struct bfq_service_tree {
  */
 struct bfq_sched_data {
 	struct bfq_entity *in_service_entity;  /* entity in service */
-	/* head-of-the-line entity in the scheduler */
+	/* head-of-the-line entity in the scheduler (see comments above) */
 	struct bfq_entity *next_in_service;
 	/* array of service trees, one per ioprio_class */
 	struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
@@ -133,10 +141,11 @@ struct bfq_sched_data {
 /**
  * struct bfq_entity - schedulable entity.
  *
- * A bfq_entity is used to represent a bfq_queue (leaf node in the upper
- * level scheduler). Each entity belongs to the sched_data of the parent
- * group hierarchy. Non-leaf entities have also their own sched_data,
- * stored in @my_sched_data.
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
+ * cgroup hierarchy) or a bfq_group into the upper level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities have also their own sched_data, stored
+ * in @my_sched_data.
  *
  * Each entity stores independently its priority values; this would
  * allow different weights on different devices, but this
@@ -147,13 +156,14 @@ struct bfq_sched_data {
  * update to take place the effective and the requested priority
  * values are synchronized.
  *
- * The weight value is calculated from the ioprio to export the same
- * interface as CFQ.  When dealing with  ``well-behaved'' queues (i.e.,
- * queues that do not spend too much time to consume their budget
- * and have true sequential behavior, and when there are no external
- * factors breaking anticipation) the relative weights at each level
- * of the hierarchy should be guaranteed.  All the fields are
- * protected by the queue lock of the containing bfqd.
+ * Unless cgroups are used, the weight value is calculated from the
+ * ioprio to export the same interface as CFQ.  When dealing with
+ * ``well-behaved'' queues (i.e., queues that do not spend too much
+ * time to consume their budget and have true sequential behavior, and
+ * when there are no external factors breaking anticipation) the
+ * relative weights at each level of the cgroups hierarchy should be
+ * guaranteed.  All the fields are protected by the queue lock of the
+ * containing bfqd.
  */
 struct bfq_entity {
 	struct rb_node rb_node; /* service_tree member */
@@ -203,11 +213,17 @@ struct bfq_entity {
 	int prio_changed;
 };
 
+struct bfq_group;
+
 /**
  * struct bfq_queue - leaf schedulable entity.
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async.
+ * io_context or more, if it is async. @cgroup holds a reference to
+ * the cgroup, to be sure that it does not disappear while a bfqq
+ * still references it (mostly to avoid races between request issuing
+ * and task migration followed by cgroup destruction).  All the fields
+ * are protected by the queue lock of the containing bfqd.
  */
 struct bfq_queue {
 	/* reference counter */
@@ -290,6 +306,9 @@ struct bfq_io_cq {
 	struct bfq_ttime ttime;
 	/* per (request_queue, blkcg) ioprio */
 	int ioprio;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	uint64_t blkcg_serial_nr; /* the current blkcg serial */
+#endif
 };
 
 enum bfq_device_speed {
@@ -306,8 +325,8 @@ struct bfq_data {
 	/* request queue for the device */
 	struct request_queue *queue;
 
-	/* root @bfq_sched_data for the device */
-	struct bfq_sched_data sched_data;
+	/* root bfq_group for the device */
+	struct bfq_group *root_group;
 
 	/*
 	 * Number of bfq_queues containing requests (including the
@@ -456,8 +475,35 @@ BFQ_BFQQ_FNS(IO_bound);
 #undef BFQ_BFQQ_FNS
 
 /* Logging facilities. */
-#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
-	blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+static struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
+static struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg);
+
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...)	do {			\
+	char __pbuf[128];						\
+									\
+	blkg_path(bfqg_to_blkg(bfqq_group(bfqq)), __pbuf, sizeof(__pbuf)); \
+	blk_add_trace_msg((bfqd)->queue, "bfq%d%c %s " fmt, (bfqq)->pid, \
+			bfq_bfqq_sync((bfqq)) ? 'S' : 'A',		\
+			  __pbuf, ##args);				\
+} while (0)
+
+#define bfq_log_bfqg(bfqd, bfqg, fmt, args...)	do {			\
+	char __pbuf[128];						\
+									\
+	blkg_path(bfqg_to_blkg(bfqg), __pbuf, sizeof(__pbuf));		\
+	blk_add_trace_msg((bfqd)->queue, "%s " fmt, __pbuf, ##args);	\
+} while (0)
+
+#else /* CONFIG_BFQ_GROUP_IOSCHED */
+
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...)	\
+	blk_add_trace_msg((bfqd)->queue, "bfq%d%c " fmt, (bfqq)->pid,	\
+			bfq_bfqq_sync((bfqq)) ? 'S' : 'A',		\
+				##args)
+#define bfq_log_bfqg(bfqd, bfqg, fmt, args...)		do {} while (0)
+
+#endif /* CONFIG_BFQ_GROUP_IOSCHED */
 
 #define bfq_log(bfqd, fmt, args...) \
 	blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
@@ -474,6 +520,107 @@ enum bfqq_expiration {
 	BFQ_BFQQ_PREEMPTED		/* preemption in progress */
 };
 
+struct bfqg_stats {
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	/* number of ios merged */
+	struct blkg_rwstat		merged;
+	/* total time spent on device in ns, may not be accurate w/ queueing */
+	struct blkg_rwstat		service_time;
+	/* total time spent waiting in scheduler queue in ns */
+	struct blkg_rwstat		wait_time;
+	/* number of IOs queued up */
+	struct blkg_rwstat		queued;
+	/* total disk time and nr sectors dispatched by this group */
+	struct blkg_stat		time;
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+	/* sum of number of ios queued across all samples */
+	struct blkg_stat		avg_queue_size_sum;
+	/* count of samples taken for average */
+	struct blkg_stat		avg_queue_size_samples;
+	/* how many times this group has been removed from service tree */
+	struct blkg_stat		dequeue;
+	/* total time spent waiting for it to be assigned a timeslice. */
+	struct blkg_stat		group_wait_time;
+	/* time spent idling for this blkcg_gq */
+	struct blkg_stat		idle_time;
+	/* total time with empty current active q with other requests queued */
+	struct blkg_stat		empty_time;
+	/* fields after this shouldn't be cleared on stat reset */
+	uint64_t			start_group_wait_time;
+	uint64_t			start_idle_time;
+	uint64_t			start_empty_time;
+	uint16_t			flags;
+#endif	/* CONFIG_DEBUG_BLK_CGROUP */
+#endif	/* CONFIG_BFQ_GROUP_IOSCHED */
+};
+
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+
+/*
+ * struct bfq_group_data - per-blkcg storage for the blkio subsystem.
+ *
+ * @pd: the blkcg_policy_data that this structure embeds; must be first
+ * @weight: weight of the bfq_group
+ */
+struct bfq_group_data {
+	/* must be the first member */
+	struct blkcg_policy_data pd;
+
+	unsigned short weight;
+};
+
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both bfq_queues and bfq_groups).
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/
+ *             migration.
+ * @stats: stats for this bfqg.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @bfqd is protected by the queue lock, RCU is used to access it
+ *      from the readers.
+ *    o All the other fields are protected by the @bfqd queue lock.
+ */
+struct bfq_group {
+	/* must be the first member */
+	struct blkg_policy_data pd;
+
+	struct bfq_entity entity;
+	struct bfq_sched_data sched_data;
+
+	void *bfqd;
+
+	struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+	struct bfq_queue *async_idle_bfqq;
+
+	struct bfq_entity *my_entity;
+
+	struct bfqg_stats stats;
+};
+
+#else
+struct bfq_group {
+	struct bfq_sched_data sched_data;
+
+	struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+	struct bfq_queue *async_idle_bfqq;
+
+	struct rb_root rq_pos_tree;
+};
+#endif
+
 static struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);
 
 static struct bfq_service_tree *
@@ -509,16 +656,9 @@ static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
 static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 				       struct bio *bio, bool is_sync,
 				       struct bfq_io_cq *bic);
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 
-/*
- * Array of async queues for all the processes, one queue
- * per ioprio value per ioprio_class.
- */
-struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
-/* Async queue for the idle class (ioprio is ignored) */
-struct bfq_queue *async_idle_bfqq;
-
 /* Expiration time of sync (0) and async (1) requests, in ns. */
 static const u64 bfq_fifo_expire[2] = { NSEC_PER_SEC / 4, NSEC_PER_SEC / 8 };
 
@@ -594,26 +734,81 @@ static struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
 	return NULL;
 }
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+
 #define for_each_entity(entity)	\
-	for (; entity ; entity = NULL)
+	for (; entity ; entity = entity->parent)
 
 #define for_each_entity_safe(entity, parent) \
-	for (parent = NULL; entity ; entity = parent)
+	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
+						 int extract,
+						 struct bfq_data *bfqd);
+
+static void bfq_update_budget(struct bfq_entity *next_in_service)
+{
+	struct bfq_entity *bfqg_entity;
+	struct bfq_group *bfqg;
+	struct bfq_sched_data *group_sd;
+
+	group_sd = next_in_service->sched_data;
+
+	bfqg = container_of(group_sd, struct bfq_group, sched_data);
+	/*
+	 * bfq_group's my_entity field is not NULL only if the group
+	 * is not the root group. We must not touch the root entity
+	 * as it must never become an in-service entity.
+	 */
+	bfqg_entity = bfqg->my_entity;
+	if (bfqg_entity)
+		bfqg_entity->budget = next_in_service->budget;
+}
 
 static int bfq_update_next_in_service(struct bfq_sched_data *sd)
 {
-	return 0;
+	struct bfq_entity *next_in_service;
+
+	if (sd->in_service_entity)
+		/* will update/requeue at the end of service */
+		return 0;
+
+	/*
+	 * NOTE: this can be improved in many ways, such as returning
+	 * 1 (and thus propagating upwards the update) only when the
+	 * budget changes, or caching the bfqq that will be scheduled
+	 * next from this subtree.  For now we worry more about
+	 * correctness than about performance...
+	 */
+	next_in_service = bfq_lookup_next_entity(sd, 0, NULL);
+	sd->next_in_service = next_in_service;
+
+	if (next_in_service)
+		bfq_update_budget(next_in_service);
+
+	return 1;
 }
 
-static void bfq_check_next_in_service(struct bfq_sched_data *sd,
-				      struct bfq_entity *entity)
+#else /* CONFIG_BFQ_GROUP_IOSCHED */
+
+#define for_each_entity(entity)	\
+	for (; entity ; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+	for (parent = NULL; entity ; entity = parent)
+
+static int bfq_update_next_in_service(struct bfq_sched_data *sd)
 {
+	return 0;
 }
 
 static void bfq_update_budget(struct bfq_entity *next_in_service)
 {
 }
 
+#endif /* CONFIG_BFQ_GROUP_IOSCHED */
+
 /*
  * Shift for timestamp calculations.  This actually limits the maximum
  * service allowed in one timestamp delta (small shift values increase it),
@@ -853,6 +1048,11 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node = &entity->rb_node;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	struct bfq_sched_data *sd = NULL;
+	struct bfq_group *bfqg = NULL;
+	struct bfq_data *bfqd = NULL;
+#endif
 
 	bfq_insert(&st->active, entity);
 
@@ -863,6 +1063,11 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 
 	bfq_update_active_tree(node);
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	sd = entity->sched_data;
+	bfqg = container_of(sd, struct bfq_group, sched_data);
+	bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
 	if (bfqq)
 		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
 }
@@ -941,6 +1146,11 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	struct bfq_sched_data *sd = NULL;
+	struct bfq_group *bfqg = NULL;
+	struct bfq_data *bfqd = NULL;
+#endif
 
 	node = bfq_find_deepest(&entity->rb_node);
 	bfq_extract(&st->active, entity);
@@ -948,6 +1158,11 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 	if (node)
 		bfq_update_active_tree(node);
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	sd = entity->sched_data;
+	bfqg = container_of(sd, struct bfq_group, sched_data);
+	bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
 	if (bfqq)
 		list_del(&bfqq->bfqq_list);
 }
@@ -1039,7 +1254,7 @@ static void bfq_forget_idle(struct bfq_service_tree *st)
 
 static struct bfq_service_tree *
 __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
-			 struct bfq_entity *entity)
+				struct bfq_entity *entity)
 {
 	struct bfq_service_tree *new_st = old_st;
 
@@ -1047,9 +1262,20 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 		unsigned short prev_weight, new_weight;
 		struct bfq_data *bfqd = NULL;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+		struct bfq_sched_data *sd;
+		struct bfq_group *bfqg;
+#endif
 
 		if (bfqq)
 			bfqd = bfqq->bfqd;
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+		else {
+			sd = entity->my_sched_data;
+			bfqg = container_of(sd, struct bfq_group, sched_data);
+			bfqd = (struct bfq_data *)bfqg->bfqd;
+		}
+#endif
 
 		old_st->wsum -= entity->weight;
 
@@ -1095,6 +1321,9 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 	return new_st;
 }
 
+static void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg);
+static struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
+
 /**
  * bfq_bfqq_served - update the scheduler status after selection for
  *                   service.
@@ -1118,6 +1347,7 @@ static void bfq_bfqq_served(struct bfq_queue *bfqq, int served)
 		st->vtime += bfq_delta(served, st->wsum);
 		bfq_forget_idle(st);
 	}
+	bfqg_stats_set_start_empty_time(bfqq_group(bfqq));
 	bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %d secs", served);
 }
 
@@ -1289,13 +1519,16 @@ static void bfq_activate_entity(struct bfq_entity *entity,
 static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
 {
 	struct bfq_sched_data *sd = entity->sched_data;
-	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
-	int was_in_service = entity == sd->in_service_entity;
+	struct bfq_service_tree *st;
+	int was_in_service;
 	int ret = 0;
 
-	if (!entity->on_st)
+	if (sd == NULL || !entity->on_st) /* never activated, or inactive now */
 		return 0;
 
+	st = bfq_entity_service_tree(entity);
+	was_in_service = entity == sd->in_service_entity;
+
 	if (was_in_service) {
 		bfq_calc_finish(entity, entity->service);
 		sd->in_service_entity = NULL;
@@ -1330,17 +1563,18 @@ static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
 
 		if (!__bfq_deactivate_entity(entity, requeue))
 			/*
-			 * The parent entity is still backlogged, and
-			 * we don't need to update it as it is still
-			 * in service.
+			 * next_in_service has not been changed, so
+			 * no upwards update is needed
 			 */
 			break;
 
 		if (sd->next_in_service)
 			/*
-			 * The parent entity is still backlogged and
-			 * the budgets on the path towards the root
-			 * need to be updated.
+			 * The parent entity is still backlogged,
+			 * because next_in_service is not NULL, and
+			 * next_in_service has been updated (see
+			 * comment on the body of the above if):
+			 * upwards update of the schedule is needed.
 			 */
 			goto update;
 
@@ -1434,8 +1668,8 @@ static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
  * Update the virtual time in @st and return the first eligible entity
  * it contains.
  */
-static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
-						   bool force)
+static struct bfq_entity *
+__bfq_lookup_next_entity(struct bfq_service_tree *st, bool force)
 {
 	struct bfq_entity *entity, *new_next_in_service = NULL;
 
@@ -1456,161 +1690,1283 @@ static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
 			bfq_update_budget(new_next_in_service);
 	}
 
-	return entity;
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_in_service entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort just returning the cached next_in_service value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
+						 int extract,
+						 struct bfq_data *bfqd)
+{
+	struct bfq_service_tree *st = sd->service_tree;
+	struct bfq_entity *entity;
+	int i = 0;
+
+	/*
+	 * Choose from idle class, if needed to guarantee a minimum
+	 * bandwidth to this class. This should also mitigate
+	 * priority-inversion problems in case a low priority task is
+	 * holding file system resources.
+	 */
+	if (bfqd &&
+	    jiffies - bfqd->bfq_class_idle_last_service >
+	    BFQ_CL_IDLE_TIMEOUT) {
+		entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
+						  true);
+		if (entity) {
+			i = BFQ_IOPRIO_CLASSES - 1;
+			bfqd->bfq_class_idle_last_service = jiffies;
+			sd->next_in_service = entity;
+		}
+	}
+	for (; i < BFQ_IOPRIO_CLASSES; i++) {
+		entity = __bfq_lookup_next_entity(st + i, false);
+		if (entity) {
+			if (extract) {
+				bfq_active_extract(st + i, entity);
+				sd->in_service_entity = entity;
+				sd->next_in_service = NULL;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+static bool next_queue_may_preempt(struct bfq_data *bfqd)
+{
+	struct bfq_sched_data *sd = &bfqd->root_group->sched_data;
+
+	return sd->next_in_service != sd->in_service_entity;
+}
+
+
+/*
+ * Get next queue for service.
+ */
+static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+{
+	struct bfq_entity *entity = NULL;
+	struct bfq_sched_data *sd;
+	struct bfq_queue *bfqq;
+
+	if (bfqd->busy_queues == 0)
+		return NULL;
+
+	sd = &bfqd->root_group->sched_data;
+	for (; sd ; sd = entity->my_sched_data) {
+		entity = bfq_lookup_next_entity(sd, 1, bfqd);
+		entity->service = 0;
+	}
+
+	bfqq = bfq_entity_to_bfqq(entity);
+
+	return bfqq;
+}
+
+static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+{
+	if (bfqd->in_service_bic) {
+		put_io_context(bfqd->in_service_bic->icq.ioc);
+		bfqd->in_service_bic = NULL;
+	}
+
+	bfq_clear_bfqq_wait_request(bfqd->in_service_queue);
+	hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
+	bfqd->in_service_queue = NULL;
+}
+
+static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				int requeue)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_deactivate_entity(entity, requeue);
+}
+
+static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_activate_entity(entity, bfq_bfqq_non_blocking_wait_rq(bfqq));
+	bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
+}
+
+static void bfqg_stats_update_dequeue(struct bfq_group *bfqg);
+
+/*
+ * Called when the bfqq no longer has requests pending, remove it from
+ * the service tree.
+ */
+static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			      int requeue)
+{
+	bfq_log_bfqq(bfqd, bfqq, "del from busy");
+
+	bfq_clear_bfqq_busy(bfqq);
+
+	bfqd->busy_queues--;
+
+	bfqg_stats_update_dequeue(bfqq_group(bfqq));
+
+	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
+}
+
+/*
+ * Called when an inactive queue receives a new request.
+ */
+static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqd, bfqq, "add to busy");
+
+	bfq_activate_bfqq(bfqd, bfqq);
+
+	bfq_mark_bfqq_busy(bfqq);
+	bfqd->busy_queues++;
+}
+
+#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
+
+/* bfqg stats flags */
+enum bfqg_stats_flags {
+	BFQG_stats_waiting = 0,
+	BFQG_stats_idling,
+	BFQG_stats_empty,
+};
+
+#define BFQG_FLAG_FNS(name)						\
+static void bfqg_stats_mark_##name(struct bfqg_stats *stats)	\
+{									\
+	stats->flags |= (1 << BFQG_stats_##name);			\
+}									\
+static void bfqg_stats_clear_##name(struct bfqg_stats *stats)	\
+{									\
+	stats->flags &= ~(1 << BFQG_stats_##name);			\
+}									\
+static int bfqg_stats_##name(struct bfqg_stats *stats)		\
+{									\
+	return (stats->flags & (1 << BFQG_stats_##name)) != 0;		\
+}									\
+
+BFQG_FLAG_FNS(waiting)
+BFQG_FLAG_FNS(idling)
+BFQG_FLAG_FNS(empty)
+#undef BFQG_FLAG_FNS
+
+/* This should be called with the queue_lock held. */
+static void bfqg_stats_update_group_wait_time(struct bfqg_stats *stats)
+{
+	unsigned long long now;
+
+	if (!bfqg_stats_waiting(stats))
+		return;
+
+	now = sched_clock();
+	if (time_after64(now, stats->start_group_wait_time))
+		blkg_stat_add(&stats->group_wait_time,
+			      now - stats->start_group_wait_time);
+	bfqg_stats_clear_waiting(stats);
+}
+
+/* This should be called with the queue_lock held. */
+static void bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
+						 struct bfq_group *curr_bfqg)
+{
+	struct bfqg_stats *stats = &bfqg->stats;
+
+	if (bfqg_stats_waiting(stats))
+		return;
+	if (bfqg == curr_bfqg)
+		return;
+	stats->start_group_wait_time = sched_clock();
+	bfqg_stats_mark_waiting(stats);
+}
+
+/* This should be called with the queue_lock held. */
+static void bfqg_stats_end_empty_time(struct bfqg_stats *stats)
+{
+	unsigned long long now;
+
+	if (!bfqg_stats_empty(stats))
+		return;
+
+	now = sched_clock();
+	if (time_after64(now, stats->start_empty_time))
+		blkg_stat_add(&stats->empty_time,
+			      now - stats->start_empty_time);
+	bfqg_stats_clear_empty(stats);
+}
+
+static void bfqg_stats_update_dequeue(struct bfq_group *bfqg)
+{
+	blkg_stat_add(&bfqg->stats.dequeue, 1);
+}
+
+static void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg)
+{
+	struct bfqg_stats *stats = &bfqg->stats;
+
+	if (blkg_rwstat_total(&stats->queued))
+		return;
+
+	/*
+	 * group is already marked empty. This can happen if bfqq got new
+	 * request in parent group and moved to this group while being added
+	 * to service tree. Just ignore the event and move on.
+	 */
+	if (bfqg_stats_empty(stats))
+		return;
+
+	stats->start_empty_time = sched_clock();
+	bfqg_stats_mark_empty(stats);
+}
+
+static void bfqg_stats_update_idle_time(struct bfq_group *bfqg)
+{
+	struct bfqg_stats *stats = &bfqg->stats;
+
+	if (bfqg_stats_idling(stats)) {
+		unsigned long long now = sched_clock();
+
+		if (time_after64(now, stats->start_idle_time))
+			blkg_stat_add(&stats->idle_time,
+				      now - stats->start_idle_time);
+		bfqg_stats_clear_idling(stats);
+	}
+}
+
+static void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg)
+{
+	struct bfqg_stats *stats = &bfqg->stats;
+
+	stats->start_idle_time = sched_clock();
+	bfqg_stats_mark_idling(stats);
+}
+
+static void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg)
+{
+	struct bfqg_stats *stats = &bfqg->stats;
+
+	blkg_stat_add(&stats->avg_queue_size_sum,
+		      blkg_rwstat_total(&stats->queued));
+	blkg_stat_add(&stats->avg_queue_size_samples, 1);
+	bfqg_stats_update_group_wait_time(stats);
+}
+
+#else	/* CONFIG_BFQ_GROUP_IOSCHED && CONFIG_DEBUG_BLK_CGROUP */
+
+static inline void
+bfqg_stats_set_start_group_wait_time(struct bfq_group *bfqg,
+				     struct bfq_group *curr_bfqg) { }
+static inline void bfqg_stats_end_empty_time(struct bfqg_stats *stats) { }
+static inline void bfqg_stats_update_dequeue(struct bfq_group *bfqg) { }
+static inline void bfqg_stats_set_start_empty_time(struct bfq_group *bfqg) { }
+static inline void bfqg_stats_update_idle_time(struct bfq_group *bfqg) { }
+static inline void bfqg_stats_set_start_idle_time(struct bfq_group *bfqg) { }
+static inline void bfqg_stats_update_avg_queue_size(struct bfq_group *bfqg) { }
+
+#endif	/* CONFIG_BFQ_GROUP_IOSCHED && CONFIG_DEBUG_BLK_CGROUP */
+
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+
+/*
+ * blk-cgroup policy-related handlers
+ * The following functions help in converting between blk-cgroup
+ * internal structures and BFQ-specific structures.
+ */
+
+static struct bfq_group *pd_to_bfqg(struct blkg_policy_data *pd)
+{
+	return pd ? container_of(pd, struct bfq_group, pd) : NULL;
+}
+
+static struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg)
+{
+	return pd_to_blkg(&bfqg->pd);
+}
+
+static struct blkcg_policy blkcg_policy_bfq;
+
+static struct bfq_group *blkg_to_bfqg(struct blkcg_gq *blkg)
+{
+	return pd_to_bfqg(blkg_to_pd(blkg, &blkcg_policy_bfq));
+}
+
+/*
+ * bfq_group handlers
+ * The following functions help in navigating the bfq_group hierarchy
+ * by allowing to find the parent of a bfq_group or the bfq_group
+ * associated to a bfq_queue.
+ */
+
+static struct bfq_group *bfqg_parent(struct bfq_group *bfqg)
+{
+	struct blkcg_gq *pblkg = bfqg_to_blkg(bfqg)->parent;
+
+	return pblkg ? blkg_to_bfqg(pblkg) : NULL;
+}
+
+static struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *group_entity = bfqq->entity.parent;
+
+	return group_entity ? container_of(group_entity, struct bfq_group,
+					   entity) :
+			      bfqq->bfqd->root_group;
+}
+
+/*
+ * The following two functions handle get and put of a bfq_group by
+ * wrapping the related blk-cgroup hooks.
+ */
+
+static void bfqg_get(struct bfq_group *bfqg)
+{
+	return blkg_get(bfqg_to_blkg(bfqg));
+}
+
+static void bfqg_put(struct bfq_group *bfqg)
+{
+	return blkg_put(bfqg_to_blkg(bfqg));
+}
+
+static void bfqg_stats_update_io_add(struct bfq_group *bfqg,
+				     struct bfq_queue *bfqq,
+				     int op, int op_flags)
+{
+	blkg_rwstat_add(&bfqg->stats.queued, op, op_flags, 1);
+	bfqg_stats_end_empty_time(&bfqg->stats);
+	if (!(bfqq == ((struct bfq_data *)bfqg->bfqd)->in_service_queue))
+		bfqg_stats_set_start_group_wait_time(bfqg, bfqq_group(bfqq));
+}
+
+static void bfqg_stats_update_io_remove(struct bfq_group *bfqg, int op,
+					int op_flags)
+{
+	blkg_rwstat_add(&bfqg->stats.queued, op, op_flags, -1);
+}
+
+static void bfqg_stats_update_io_merged(struct bfq_group *bfqg, int op,
+					int op_flags)
+{
+	blkg_rwstat_add(&bfqg->stats.merged, op, op_flags, 1);
+}
+
+static void bfqg_stats_update_completion(struct bfq_group *bfqg,
+			uint64_t start_time, uint64_t io_start_time, int op,
+			int op_flags)
+{
+	struct bfqg_stats *stats = &bfqg->stats;
+	unsigned long long now = sched_clock();
+
+	if (time_after64(now, io_start_time))
+		blkg_rwstat_add(&stats->service_time, op, op_flags,
+				now - io_start_time);
+	if (time_after64(io_start_time, start_time))
+		blkg_rwstat_add(&stats->wait_time, op, op_flags,
+				io_start_time - start_time);
+}
+
+/* @stats = 0 */
+static void bfqg_stats_reset(struct bfqg_stats *stats)
+{
+	/* queued stats shouldn't be cleared */
+	blkg_rwstat_reset(&stats->merged);
+	blkg_rwstat_reset(&stats->service_time);
+	blkg_rwstat_reset(&stats->wait_time);
+	blkg_stat_reset(&stats->time);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+	blkg_stat_reset(&stats->avg_queue_size_sum);
+	blkg_stat_reset(&stats->avg_queue_size_samples);
+	blkg_stat_reset(&stats->dequeue);
+	blkg_stat_reset(&stats->group_wait_time);
+	blkg_stat_reset(&stats->idle_time);
+	blkg_stat_reset(&stats->empty_time);
+#endif
+}
+
+/* @to += @from */
+static void bfqg_stats_add_aux(struct bfqg_stats *to, struct bfqg_stats *from)
+{
+	if (!to || !from)
+		return;
+
+	/* queued stats shouldn't be cleared */
+	blkg_rwstat_add_aux(&to->merged, &from->merged);
+	blkg_rwstat_add_aux(&to->service_time, &from->service_time);
+	blkg_rwstat_add_aux(&to->wait_time, &from->wait_time);
+	blkg_stat_add_aux(&to->time, &from->time);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+	blkg_stat_add_aux(&to->avg_queue_size_sum, &from->avg_queue_size_sum);
+	blkg_stat_add_aux(&to->avg_queue_size_samples,
+			  &from->avg_queue_size_samples);
+	blkg_stat_add_aux(&to->dequeue, &from->dequeue);
+	blkg_stat_add_aux(&to->group_wait_time, &from->group_wait_time);
+	blkg_stat_add_aux(&to->idle_time, &from->idle_time);
+	blkg_stat_add_aux(&to->empty_time, &from->empty_time);
+#endif
+}
+
+/*
+ * Transfer @bfqg's stats to its parent's aux counts so that the ancestors'
+ * recursive stats can still account for the amount used by this bfqg after
+ * it's gone.
+ */
+static void bfqg_stats_xfer_dead(struct bfq_group *bfqg)
+{
+	struct bfq_group *parent;
+
+	if (!bfqg) /* root_group */
+		return;
+
+	parent = bfqg_parent(bfqg);
+
+	lockdep_assert_held(bfqg_to_blkg(bfqg)->q->queue_lock);
+
+	if (unlikely(!parent))
+		return;
+
+	bfqg_stats_add_aux(&parent->stats, &bfqg->stats);
+	bfqg_stats_reset(&bfqg->stats);
+}
+
+static void bfq_init_entity(struct bfq_entity *entity,
+			    struct bfq_group *bfqg)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	entity->weight = entity->new_weight;
+	entity->orig_weight = entity->new_weight;
+	if (bfqq) {
+		bfqq->ioprio = bfqq->new_ioprio;
+		bfqq->ioprio_class = bfqq->new_ioprio_class;
+		bfqg_get(bfqg);
+	}
+	entity->parent = bfqg->my_entity;
+	entity->sched_data = &bfqg->sched_data;
+}
+
+static void bfqg_stats_exit(struct bfqg_stats *stats)
+{
+	blkg_rwstat_exit(&stats->merged);
+	blkg_rwstat_exit(&stats->service_time);
+	blkg_rwstat_exit(&stats->wait_time);
+	blkg_rwstat_exit(&stats->queued);
+	blkg_stat_exit(&stats->time);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+	blkg_stat_exit(&stats->avg_queue_size_sum);
+	blkg_stat_exit(&stats->avg_queue_size_samples);
+	blkg_stat_exit(&stats->dequeue);
+	blkg_stat_exit(&stats->group_wait_time);
+	blkg_stat_exit(&stats->idle_time);
+	blkg_stat_exit(&stats->empty_time);
+#endif
+}
+
+static int bfqg_stats_init(struct bfqg_stats *stats, gfp_t gfp)
+{
+	if (blkg_rwstat_init(&stats->merged, gfp) ||
+	    blkg_rwstat_init(&stats->service_time, gfp) ||
+	    blkg_rwstat_init(&stats->wait_time, gfp) ||
+	    blkg_rwstat_init(&stats->queued, gfp) ||
+	    blkg_stat_init(&stats->time, gfp))
+		goto err;
+
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+	if (blkg_stat_init(&stats->avg_queue_size_sum, gfp) ||
+	    blkg_stat_init(&stats->avg_queue_size_samples, gfp) ||
+	    blkg_stat_init(&stats->dequeue, gfp) ||
+	    blkg_stat_init(&stats->group_wait_time, gfp) ||
+	    blkg_stat_init(&stats->idle_time, gfp) ||
+	    blkg_stat_init(&stats->empty_time, gfp))
+		goto err;
+#endif
+	return 0;
+err:
+	bfqg_stats_exit(stats);
+	return -ENOMEM;
+}
+
+static struct bfq_group_data *cpd_to_bfqgd(struct blkcg_policy_data *cpd)
+{
+	return cpd ? container_of(cpd, struct bfq_group_data, pd) : NULL;
+}
+
+static struct bfq_group_data *blkcg_to_bfqgd(struct blkcg *blkcg)
+{
+	return cpd_to_bfqgd(blkcg_to_cpd(blkcg, &blkcg_policy_bfq));
+}
+
+static struct blkcg_policy_data *bfq_cpd_alloc(gfp_t gfp)
+{
+	struct bfq_group_data *bgd;
+
+	bgd = kzalloc(sizeof(*bgd), GFP_KERNEL);
+	if (!bgd)
+		return NULL;
+	return &bgd->pd;
+}
+
+static void bfq_cpd_init(struct blkcg_policy_data *cpd)
+{
+	struct bfq_group_data *d = cpd_to_bfqgd(cpd);
+
+	d->weight = cgroup_subsys_on_dfl(io_cgrp_subsys) ?
+		CGROUP_WEIGHT_DFL : BFQ_WEIGHT_LEGACY_DFL;
+}
+
+static void bfq_cpd_free(struct blkcg_policy_data *cpd)
+{
+	kfree(cpd_to_bfqgd(cpd));
+}
+
+static struct blkg_policy_data *bfq_pd_alloc(gfp_t gfp, int node)
+{
+	struct bfq_group *bfqg;
+
+	bfqg = kzalloc_node(sizeof(*bfqg), gfp, node);
+	if (!bfqg)
+		return NULL;
+
+	if (bfqg_stats_init(&bfqg->stats, gfp)) {
+		kfree(bfqg);
+		return NULL;
+	}
+
+	return &bfqg->pd;
+}
+
+static void bfq_pd_init(struct blkg_policy_data *pd)
+{
+	struct blkcg_gq *blkg = pd_to_blkg(pd);
+	struct bfq_group *bfqg = blkg_to_bfqg(blkg);
+	struct bfq_data *bfqd = blkg->q->elevator->elevator_data;
+	struct bfq_entity *entity = &bfqg->entity;
+	struct bfq_group_data *d = blkcg_to_bfqgd(blkg->blkcg);
+
+	entity->orig_weight = entity->weight = entity->new_weight = d->weight;
+	entity->my_sched_data = &bfqg->sched_data;
+	bfqg->my_entity = entity; /*
+				   * the root_group's will be set to NULL
+				   * in bfq_init_queue()
+				   */
+	bfqg->bfqd = bfqd;
+}
+
+static void bfq_pd_free(struct blkg_policy_data *pd)
+{
+	struct bfq_group *bfqg = pd_to_bfqg(pd);
+
+	bfqg_stats_exit(&bfqg->stats);
+	return kfree(bfqg);
+}
+
+static void bfq_pd_reset_stats(struct blkg_policy_data *pd)
+{
+	struct bfq_group *bfqg = pd_to_bfqg(pd);
+
+	bfqg_stats_reset(&bfqg->stats);
+}
+
+static void bfq_group_set_parent(struct bfq_group *bfqg,
+					struct bfq_group *parent)
+{
+	struct bfq_entity *entity;
+
+	entity = &bfqg->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+}
+
+static struct bfq_group *bfq_lookup_bfqg(struct bfq_data *bfqd,
+					 struct blkcg *blkcg)
+{
+	struct blkcg_gq *blkg;
+
+	blkg = blkg_lookup(blkcg, bfqd->queue);
+	if (likely(blkg))
+		return blkg_to_bfqg(blkg);
+	return NULL;
+}
+
+static struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
+					    struct blkcg *blkcg)
+{
+	struct bfq_group *bfqg, *parent;
+	struct bfq_entity *entity;
+
+	assert_spin_locked(bfqd->queue->queue_lock);
+
+	bfqg = bfq_lookup_bfqg(bfqd, blkcg);
+
+	if (unlikely(!bfqg))
+		return NULL;
+
+	/*
+	 * Update chain of bfq_groups as we might be handling a leaf group
+	 * which, along with some of its relatives, has not been hooked yet
+	 * to the private hierarchy of BFQ.
+	 */
+	entity = &bfqg->entity;
+	for_each_entity(entity) {
+		bfqg = container_of(entity, struct bfq_group, entity);
+		if (bfqg != bfqd->root_group) {
+			parent = bfqg_parent(bfqg);
+			if (!parent)
+				parent = bfqd->root_group;
+			bfq_group_set_parent(bfqg, parent);
+		}
+	}
+
+	return bfqg;
+}
+
+static void bfq_bfqq_expire(struct bfq_data *bfqd,
+			    struct bfq_queue *bfqq,
+			    bool compensate,
+			    enum bfqq_expiration reason);
+
+
+/**
+ * bfq_bfqq_move - migrate @bfqq to @bfqg.
+ * @bfqd: queue descriptor.
+ * @bfqq: the queue to move.
+ * @bfqg: the group to move to.
+ *
+ * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
+ * it on the new one.  Avoid putting the entity on the old group idle tree.
+ *
+ * Must be called under the queue lock; the cgroup owning @bfqg must
+ * not disappear (by now this just means that we are called under
+ * rcu_read_lock()).
+ */
+static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			  struct bfq_group *bfqg)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	/* If bfqq is empty, then bfq_bfqq_expire also invokes
+	 * bfq_del_bfqq_busy, thereby removing bfqq and its entity
+	 * from data structures related to current group. Otherwise we
+	 * need to remove bfqq explicitly with bfq_deactivate_bfqq, as
+	 * we do below.
+	 */
+	if (bfqq == bfqd->in_service_queue)
+		bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
+				false, BFQ_BFQQ_PREEMPTED);
+
+	if (bfq_bfqq_busy(bfqq))
+		bfq_deactivate_bfqq(bfqd, bfqq, 0);
+	else if (entity->on_st)
+		bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
+	bfqg_put(bfqq_group(bfqq));
+
+	/*
+	 * Here we use a reference to bfqg.  We don't need a refcounter
+	 * as the cgroup reference will not be dropped, so that its
+	 * destroy() callback will not be invoked.
+	 */
+	entity->parent = bfqg->my_entity;
+	entity->sched_data = &bfqg->sched_data;
+	bfqg_get(bfqg);
+
+	if (bfq_bfqq_busy(bfqq))
+		bfq_activate_bfqq(bfqd, bfqq);
+
+	if (!bfqd->in_service_queue && !bfqd->rq_in_driver)
+		bfq_schedule_dispatch(bfqd);
+}
+
+/**
+ * __bfq_bic_change_cgroup - move @bic to @cgroup.
+ * @bfqd: the queue descriptor.
+ * @bic: the bic to move.
+ * @blkcg: the blk-cgroup to move to.
+ *
+ * Move bic to blkcg, assuming that bfqd->queue is locked; the caller
+ * has to make sure that the reference to cgroup is valid across the call.
+ *
+ * NOTE: an alternative approach might have been to store the current
+ * cgroup in bfqq and get a reference to it, reducing the lookup
+ * time here, at the price of slightly more complex code.
+ */
+static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
+						struct bfq_io_cq *bic,
+						struct blkcg *blkcg)
+{
+	struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0);
+	struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1);
+	struct bfq_group *bfqg;
+	struct bfq_entity *entity;
+
+	lockdep_assert_held(bfqd->queue->queue_lock);
+
+	bfqg = bfq_find_set_group(bfqd, blkcg);
+
+	if (unlikely(!bfqg))
+		bfqg = bfqd->root_group;
+
+	if (async_bfqq) {
+		entity = &async_bfqq->entity;
+
+		if (entity->sched_data != &bfqg->sched_data) {
+			bic_set_bfqq(bic, NULL, 0);
+			bfq_log_bfqq(bfqd, async_bfqq,
+				     "bic_change_group: %p %d",
+				     async_bfqq,
+				     async_bfqq->ref);
+			bfq_put_queue(async_bfqq);
+		}
+	}
+
+	if (sync_bfqq) {
+		entity = &sync_bfqq->entity;
+		if (entity->sched_data != &bfqg->sched_data)
+			bfq_bfqq_move(bfqd, sync_bfqq, bfqg);
+	}
+
+	return bfqg;
+}
+
+static void bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
+{
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+	struct bfq_group *bfqg = NULL;
+	uint64_t serial_nr;
+
+	rcu_read_lock();
+	serial_nr = bio_blkcg(bio)->css.serial_nr;
+
+	/*
+	 * Check whether blkcg has changed.  The condition may trigger
+	 * spuriously on a newly created cic but there's no harm.
+	 */
+	if (unlikely(!bfqd) || likely(bic->blkcg_serial_nr == serial_nr))
+		goto out;
+
+	bfqg = __bfq_bic_change_cgroup(bfqd, bic, bio_blkcg(bio));
+	bic->blkcg_serial_nr = serial_nr;
+out:
+	rcu_read_unlock();
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+static void bfq_flush_idle_tree(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entity = st->first_idle;
+
+	for (; entity ; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
+
+/**
+ * bfq_reparent_leaf_entity - move leaf entity to the root_group.
+ * @bfqd: the device data structure with the root group.
+ * @entity: the entity to move.
+ */
+static void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
+				     struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	bfq_bfqq_move(bfqd, bfqq, bfqd->root_group);
+}
+
+/**
+ * bfq_reparent_active_entities - move to the root group all active
+ *                                entities.
+ * @bfqd: the device data structure with the root group.
+ * @bfqg: the group to move from.
+ * @st: the service tree with the entities.
+ *
+ * Needs queue_lock to be taken and reference to be valid over the call.
+ */
+static void bfq_reparent_active_entities(struct bfq_data *bfqd,
+					 struct bfq_group *bfqg,
+					 struct bfq_service_tree *st)
+{
+	struct rb_root *active = &st->active;
+	struct bfq_entity *entity = NULL;
+
+	if (!RB_EMPTY_ROOT(&st->active))
+		entity = bfq_entity_of(rb_first(active));
+
+	for (; entity ; entity = bfq_entity_of(rb_first(active)))
+		bfq_reparent_leaf_entity(bfqd, entity);
+
+	if (bfqg->sched_data.in_service_entity)
+		bfq_reparent_leaf_entity(bfqd,
+			bfqg->sched_data.in_service_entity);
 }
 
 /**
- * bfq_lookup_next_entity - return the first eligible entity in @sd.
- * @sd: the sched_data.
- * @extract: if true the returned entity will be also extracted from @sd.
+ * bfq_pd_offline - deactivate the entity associated with @pd,
+ *		    and reparent its children entities.
+ * @pd: descriptor of the policy going offline.
  *
- * NOTE: since we cache the next_in_service entity at each level of the
- * hierarchy, the complexity of the lookup can be decreased with
- * absolutely no effort just returning the cached next_in_service value;
- * we prefer to do full lookups to test the consistency of the data
- * structures.
+ * blkio already grabs the queue_lock for us, so no need to use
+ * RCU-based magic
  */
-static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
-						 int extract,
-						 struct bfq_data *bfqd)
+static void bfq_pd_offline(struct blkg_policy_data *pd)
 {
-	struct bfq_service_tree *st = sd->service_tree;
-	struct bfq_entity *entity;
-	int i = 0;
+	struct bfq_service_tree *st;
+	struct bfq_group *bfqg = pd_to_bfqg(pd);
+	struct bfq_data *bfqd = bfqg->bfqd;
+	struct bfq_entity *entity = bfqg->my_entity;
+	int i;
+
+	if (!entity) /* root group */
+		return;
 
 	/*
-	 * Choose from idle class, if needed to guarantee a minimum
-	 * bandwidth to this class. This should also mitigate
-	 * priority-inversion problems in case a low priority task is
-	 * holding file system resources.
+	 * Empty all service_trees belonging to this group before
+	 * deactivating the group itself.
 	 */
-	if (bfqd &&
-	    jiffies - bfqd->bfq_class_idle_last_service >
-	    BFQ_CL_IDLE_TIMEOUT) {
-		entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
-						  true);
-		if (entity) {
-			i = BFQ_IOPRIO_CLASSES - 1;
-			bfqd->bfq_class_idle_last_service = jiffies;
-			sd->next_in_service = entity;
-		}
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
+		st = bfqg->sched_data.service_tree + i;
+
+		/*
+		 * The idle tree may still contain bfq_queues belonging
+		 * to exited tasks because they never migrated to a different
+		 * cgroup from the one being destroyed now.  No one else
+		 * can access them so it's safe to act without any lock.
+		 */
+		bfq_flush_idle_tree(st);
+
+		/*
+		 * It may happen that some queues are still active
+		 * (busy) upon group destruction (if the corresponding
+		 * processes have been forced to terminate). We move
+		 * all the leaf entities corresponding to these queues
+		 * to the root_group.
+		 * Also, it may happen that the group has an entity
+		 * in service, which is disconnected from the active
+		 * tree: it must be moved, too.
+		 * There is no need to put the sync queues, as the
+		 * scheduler has taken no reference.
+		 */
+		bfq_reparent_active_entities(bfqd, bfqg, st);
 	}
-	for (; i < BFQ_IOPRIO_CLASSES; i++) {
-		entity = __bfq_lookup_next_entity(st + i, false);
-		if (entity) {
-			if (extract) {
-				bfq_check_next_in_service(sd, entity);
-				bfq_active_extract(st + i, entity);
-				sd->in_service_entity = entity;
-				sd->next_in_service = NULL;
-			}
-			break;
+
+	__bfq_deactivate_entity(entity, 0);
+	bfq_put_async_queues(bfqd, bfqg);
+
+	/*
+	 * @blkg is going offline and will be ignored by
+	 * blkg_[rw]stat_recursive_sum().  Transfer stats to the parent so
+	 * that they don't get lost.  If IOs complete after this point, the
+	 * stats for them will be lost.  Oh well...
+	 */
+	bfqg_stats_xfer_dead(bfqg);
+}
+
+static int bfq_io_show_weight(struct seq_file *sf, void *v)
+{
+	struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
+	struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
+	unsigned int val = 0;
+
+	if (bfqgd)
+		val = bfqgd->weight;
+
+	seq_printf(sf, "%u\n", val);
+
+	return 0;
+}
+
+static int bfq_io_set_weight_legacy(struct cgroup_subsys_state *css,
+				    struct cftype *cftype,
+				    u64 val)
+{
+	struct blkcg *blkcg = css_to_blkcg(css);
+	struct bfq_group_data *bfqgd = blkcg_to_bfqgd(blkcg);
+	struct blkcg_gq *blkg;
+	int ret = -ERANGE;
+
+	if (val < BFQ_MIN_WEIGHT || val > BFQ_MAX_WEIGHT)
+		return ret;
+
+	ret = 0;
+	spin_lock_irq(&blkcg->lock);
+	bfqgd->weight = (unsigned short)val;
+	hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
+		struct bfq_group *bfqg = blkg_to_bfqg(blkg);
+
+		if (!bfqg)
+			continue;
+		/*
+		 * Setting the prio_changed flag of the entity
+		 * to 1 with new_weight == weight would re-set
+		 * the value of the weight to its ioprio mapping.
+		 * Set the flag only if necessary.
+		 */
+		if ((unsigned short)val != bfqg->entity.new_weight) {
+			bfqg->entity.new_weight = (unsigned short)val;
+			/*
+			 * Make sure that the above new value has been
+			 * stored in bfqg->entity.new_weight before
+			 * setting the prio_changed flag. In fact,
+			 * this flag may be read asynchronously (in
+			 * critical sections protected by a different
+			 * lock than that held here), and finding this
+			 * flag set may cause the execution of the code
+			 * for updating parameters whose value may
+			 * depend also on bfqg->entity.new_weight (in
+			 * __bfq_entity_update_weight_prio).
+			 * This barrier makes sure that the new value
+			 * of bfqg->entity.new_weight is correctly
+			 * seen in that code.
+			 */
+			smp_wmb();
+			bfqg->entity.prio_changed = 1;
 		}
 	}
+	spin_unlock_irq(&blkcg->lock);
 
-	return entity;
+	return ret;
 }
 
-static bool next_queue_may_preempt(struct bfq_data *bfqd)
+static ssize_t bfq_io_set_weight(struct kernfs_open_file *of,
+				 char *buf, size_t nbytes,
+				 loff_t off)
 {
-	struct bfq_sched_data *sd = &bfqd->sched_data;
+	u64 weight;
+	/* First unsigned long found in the file is used */
+	int ret = kstrtoull(strim(buf), 0, &weight);
 
-	return sd->next_in_service != sd->in_service_entity;
+	if (ret)
+		return ret;
+
+	return bfq_io_set_weight_legacy(of_css(of), NULL, weight);
 }
 
+static int bfqg_print_stat(struct seq_file *sf, void *v)
+{
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_stat,
+			  &blkcg_policy_bfq, seq_cft(sf)->private, false);
+	return 0;
+}
 
-/*
- * Get next queue for service.
- */
-static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+static int bfqg_print_rwstat(struct seq_file *sf, void *v)
 {
-	struct bfq_entity *entity = NULL;
-	struct bfq_sched_data *sd;
-	struct bfq_queue *bfqq;
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), blkg_prfill_rwstat,
+			  &blkcg_policy_bfq, seq_cft(sf)->private, true);
+	return 0;
+}
 
-	if (bfqd->busy_queues == 0)
-		return NULL;
+static u64 bfqg_prfill_stat_recursive(struct seq_file *sf,
+				      struct blkg_policy_data *pd, int off)
+{
+	u64 sum = blkg_stat_recursive_sum(pd_to_blkg(pd),
+					  &blkcg_policy_bfq, off);
+	return __blkg_prfill_u64(sf, pd, sum);
+}
 
-	sd = &bfqd->sched_data;
-	for (; sd ; sd = entity->my_sched_data) {
-		entity = bfq_lookup_next_entity(sd, 1, bfqd);
-		entity->service = 0;
-	}
+static u64 bfqg_prfill_rwstat_recursive(struct seq_file *sf,
+					struct blkg_policy_data *pd, int off)
+{
+	struct blkg_rwstat sum = blkg_rwstat_recursive_sum(pd_to_blkg(pd),
+							   &blkcg_policy_bfq,
+							   off);
+	return __blkg_prfill_rwstat(sf, pd, &sum);
+}
 
-	bfqq = bfq_entity_to_bfqq(entity);
+static int bfqg_print_stat_recursive(struct seq_file *sf, void *v)
+{
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+			  bfqg_prfill_stat_recursive, &blkcg_policy_bfq,
+			  seq_cft(sf)->private, false);
+	return 0;
+}
 
-	return bfqq;
+static int bfqg_print_rwstat_recursive(struct seq_file *sf, void *v)
+{
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+			  bfqg_prfill_rwstat_recursive, &blkcg_policy_bfq,
+			  seq_cft(sf)->private, true);
+	return 0;
 }
 
-static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+static u64 bfqg_prfill_sectors(struct seq_file *sf, struct blkg_policy_data *pd,
+			       int off)
 {
-	if (bfqd->in_service_bic) {
-		put_io_context(bfqd->in_service_bic->icq.ioc);
-		bfqd->in_service_bic = NULL;
-	}
+	u64 sum = blkg_rwstat_total(&pd->blkg->stat_bytes);
 
-	bfq_clear_bfqq_wait_request(bfqd->in_service_queue);
-	hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
-	bfqd->in_service_queue = NULL;
+	return __blkg_prfill_u64(sf, pd, sum >> 9);
 }
 
-static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-				int requeue)
+static int bfqg_print_stat_sectors(struct seq_file *sf, void *v)
 {
-	struct bfq_entity *entity = &bfqq->entity;
-
-	bfq_deactivate_entity(entity, requeue);
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+			  bfqg_prfill_sectors, &blkcg_policy_bfq, 0, false);
+	return 0;
 }
 
-static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+static u64 bfqg_prfill_sectors_recursive(struct seq_file *sf,
+					 struct blkg_policy_data *pd, int off)
 {
-	struct bfq_entity *entity = &bfqq->entity;
+	struct blkg_rwstat tmp = blkg_rwstat_recursive_sum(pd->blkg, NULL,
+					offsetof(struct blkcg_gq, stat_bytes));
+	u64 sum = atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_READ]) +
+		atomic64_read(&tmp.aux_cnt[BLKG_RWSTAT_WRITE]);
 
-	bfq_activate_entity(entity, bfq_bfqq_non_blocking_wait_rq(bfqq));
-	bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
+	return __blkg_prfill_u64(sf, pd, sum >> 9);
 }
 
-/*
- * Called when the bfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-			      int requeue)
+static int bfqg_print_stat_sectors_recursive(struct seq_file *sf, void *v)
 {
-	bfq_log_bfqq(bfqd, bfqq, "del from busy");
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+			  bfqg_prfill_sectors_recursive, &blkcg_policy_bfq, 0,
+			  false);
+	return 0;
+}
 
-	bfq_clear_bfqq_busy(bfqq);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+static u64 bfqg_prfill_avg_queue_size(struct seq_file *sf,
+				      struct blkg_policy_data *pd, int off)
+{
+	struct bfq_group *bfqg = pd_to_bfqg(pd);
+	u64 samples = blkg_stat_read(&bfqg->stats.avg_queue_size_samples);
+	u64 v = 0;
 
-	bfqd->busy_queues--;
+	if (samples) {
+		v = blkg_stat_read(&bfqg->stats.avg_queue_size_sum);
+		v = div64_u64(v, samples);
+	}
+	__blkg_prfill_u64(sf, pd, v);
+	return 0;
+}
 
-	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
+/* print avg_queue_size */
+static int bfqg_print_avg_queue_size(struct seq_file *sf, void *v)
+{
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+			  bfqg_prfill_avg_queue_size, &blkcg_policy_bfq,
+			  0, false);
+	return 0;
 }
+#endif /* CONFIG_DEBUG_BLK_CGROUP */
 
-/*
- * Called when an inactive queue receives a new request.
- */
-static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+static struct bfq_group *
+bfq_create_group_hierarchy(struct bfq_data *bfqd, int node)
 {
-	bfq_log_bfqq(bfqd, bfqq, "add to busy");
+	int ret;
 
-	bfq_activate_bfqq(bfqd, bfqq);
+	ret = blkcg_activate_policy(bfqd->queue, &blkcg_policy_bfq);
+	if (ret)
+		return NULL;
 
-	bfq_mark_bfqq_busy(bfqq);
-	bfqd->busy_queues++;
+	return blkg_to_bfqg(bfqd->queue->root_blkg);
 }
 
-static void bfq_init_entity(struct bfq_entity *entity)
+static struct cftype bfq_blkcg_legacy_files[] = {
+	{
+		.name = "bfq.weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = bfq_io_show_weight,
+		.write_u64 = bfq_io_set_weight_legacy,
+	},
+
+	/* statistics, covers only the tasks in the bfqg */
+	{
+		.name = "bfq.time",
+		.private = offsetof(struct bfq_group, stats.time),
+		.seq_show = bfqg_print_stat,
+	},
+	{
+		.name = "bfq.sectors",
+		.seq_show = bfqg_print_stat_sectors,
+	},
+	{
+		.name = "bfq.io_service_bytes",
+		.private = (unsigned long)&blkcg_policy_bfq,
+		.seq_show = blkg_print_stat_bytes,
+	},
+	{
+		.name = "bfq.io_serviced",
+		.private = (unsigned long)&blkcg_policy_bfq,
+		.seq_show = blkg_print_stat_ios,
+	},
+	{
+		.name = "bfq.io_service_time",
+		.private = offsetof(struct bfq_group, stats.service_time),
+		.seq_show = bfqg_print_rwstat,
+	},
+	{
+		.name = "bfq.io_wait_time",
+		.private = offsetof(struct bfq_group, stats.wait_time),
+		.seq_show = bfqg_print_rwstat,
+	},
+	{
+		.name = "bfq.io_merged",
+		.private = offsetof(struct bfq_group, stats.merged),
+		.seq_show = bfqg_print_rwstat,
+	},
+	{
+		.name = "bfq.io_queued",
+		.private = offsetof(struct bfq_group, stats.queued),
+		.seq_show = bfqg_print_rwstat,
+	},
+
+	/* the same statistics which cover the bfqg and its descendants */
+	{
+		.name = "bfq.time_recursive",
+		.private = offsetof(struct bfq_group, stats.time),
+		.seq_show = bfqg_print_stat_recursive,
+	},
+	{
+		.name = "bfq.sectors_recursive",
+		.seq_show = bfqg_print_stat_sectors_recursive,
+	},
+	{
+		.name = "bfq.io_service_bytes_recursive",
+		.private = (unsigned long)&blkcg_policy_bfq,
+		.seq_show = blkg_print_stat_bytes_recursive,
+	},
+	{
+		.name = "bfq.io_serviced_recursive",
+		.private = (unsigned long)&blkcg_policy_bfq,
+		.seq_show = blkg_print_stat_ios_recursive,
+	},
+	{
+		.name = "bfq.io_service_time_recursive",
+		.private = offsetof(struct bfq_group, stats.service_time),
+		.seq_show = bfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "bfq.io_wait_time_recursive",
+		.private = offsetof(struct bfq_group, stats.wait_time),
+		.seq_show = bfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "bfq.io_merged_recursive",
+		.private = offsetof(struct bfq_group, stats.merged),
+		.seq_show = bfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "bfq.io_queued_recursive",
+		.private = offsetof(struct bfq_group, stats.queued),
+		.seq_show = bfqg_print_rwstat_recursive,
+	},
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+	{
+		.name = "bfq.avg_queue_size",
+		.seq_show = bfqg_print_avg_queue_size,
+	},
+	{
+		.name = "bfq.group_wait_time",
+		.private = offsetof(struct bfq_group, stats.group_wait_time),
+		.seq_show = bfqg_print_stat,
+	},
+	{
+		.name = "bfq.idle_time",
+		.private = offsetof(struct bfq_group, stats.idle_time),
+		.seq_show = bfqg_print_stat,
+	},
+	{
+		.name = "bfq.empty_time",
+		.private = offsetof(struct bfq_group, stats.empty_time),
+		.seq_show = bfqg_print_stat,
+	},
+	{
+		.name = "bfq.dequeue",
+		.private = offsetof(struct bfq_group, stats.dequeue),
+		.seq_show = bfqg_print_stat,
+	},
+#endif	/* CONFIG_DEBUG_BLK_CGROUP */
+	{ }	/* terminate */
+};
+
+static struct cftype bfq_blkg_files[] = {
+	{
+		.name = "bfq.weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = bfq_io_show_weight,
+		.write = bfq_io_set_weight,
+	},
+	{} /* terminate */
+};
+
+#else /* CONFIG_BFQ_GROUP_IOSCHED */
+
+static inline void bfqg_stats_update_io_add(struct bfq_group *bfqg,
+			struct bfq_queue *bfqq, int op, int op_flags) { }
+static inline void
+bfqg_stats_update_io_remove(struct bfq_group *bfqg, int op, int op_flags) { }
+static inline void
+bfqg_stats_update_io_merged(struct bfq_group *bfqg, int op, int op_flags) { }
+static inline void bfqg_stats_update_completion(struct bfq_group *bfqg,
+			uint64_t start_time, uint64_t io_start_time, int op,
+			int op_flags) { }
+
+static void bfq_init_entity(struct bfq_entity *entity,
+			    struct bfq_group *bfqg)
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 
 	entity->weight = entity->new_weight;
 	entity->orig_weight = entity->new_weight;
+	if (bfqq) {
+		bfqq->ioprio = bfqq->new_ioprio;
+		bfqq->ioprio_class = bfqq->new_ioprio_class;
+	}
+	entity->sched_data = &bfqg->sched_data;
+}
+
+static struct bfq_group *
+bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
+{
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+
+	return bfqd->root_group;
+}
 
-	bfqq->ioprio = bfqq->new_ioprio;
-	bfqq->ioprio_class = bfqq->new_ioprio_class;
+static void bfq_disconnect_groups(struct bfq_data *bfqd)
+{
+	bfq_put_async_queues(bfqd, bfqd->root_group);
+}
+
+static struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd,
+					    struct blkcg *blkcg)
+{
+	return bfqd->root_group;
+}
+
+static struct bfq_group *bfqq_group(struct bfq_queue *bfqq)
+{
+	return bfqq->bfqd->root_group;
+}
 
-	entity->sched_data = &bfqq->bfqd->sched_data;
+static struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd,
+						    int node)
+{
+	struct bfq_group *bfqg;
+	int i;
+
+	bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
+	if (!bfqg)
+		return NULL;
+
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+		bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+
+	return bfqg;
 }
+#endif /* CONFIG_BFQ_GROUP_IOSCHED */
 
 #define bfq_class_idle(bfqq)	((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
 #define bfq_class_rt(bfqq)	((bfqq)->ioprio_class == IOPRIO_CLASS_RT)
@@ -1965,6 +3321,9 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 			RQ_BIC(rq)->ttime.last_end_request +
 			bfqd->bfq_slice_idle * 3ULL;
 
+	bfqg_stats_update_io_add(bfqq_group(RQ_BFQQ(rq)), bfqq,
+				 req_op(rq), rq->cmd_flags);
+
 	/*
 	 * Update budget and check whether bfqq may want to preempt
 	 * the in-service queue.
@@ -2099,6 +3458,9 @@ static void bfq_remove_request(struct request *rq)
 
 	if (rq->cmd_flags & REQ_META)
 		bfqq->meta_pending--;
+
+	bfqg_stats_update_io_remove(bfqq_group(bfqq), req_op(rq),
+				    rq->cmd_flags);
 }
 
 static int bfq_merge(struct request_queue *q, struct request **req,
@@ -2145,6 +3507,15 @@ static void bfq_merged_request(struct request_queue *q, struct request *req,
 	}
 }
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+static void bfq_bio_merged(struct request_queue *q, struct request *req,
+			   struct bio *bio)
+{
+	bfqg_stats_update_io_merged(bfqq_group(RQ_BFQQ(req)), bio_op(bio),
+				    bio->bi_opf);
+}
+#endif
+
 static void bfq_merged_requests(struct request_queue *q, struct request *rq,
 				struct request *next)
 {
@@ -2171,6 +3542,8 @@ static void bfq_merged_requests(struct request_queue *q, struct request *rq,
 		bfqq->next_rq = rq;
 
 	bfq_remove_request(next);
+	bfqg_stats_update_io_merged(bfqq_group(bfqq), req_op(next),
+				    next->cmd_flags);
 }
 
 static int bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
@@ -2210,6 +3583,7 @@ static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
 				       struct bfq_queue *bfqq)
 {
 	if (bfqq) {
+		bfqg_stats_update_avg_queue_size(bfqq_group(bfqq));
 		bfq_mark_bfqq_must_alloc(bfqq);
 		bfq_mark_bfqq_budget_new(bfqq);
 		bfq_clear_bfqq_fifo_expire(bfqq);
@@ -2293,6 +3667,7 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 	bfqd->last_idling_start = ktime_get();
 	hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
 		      HRTIMER_MODE_REL);
+	bfqg_stats_set_start_idle_time(bfqq_group(bfqq));
 }
 
 /*
@@ -2824,6 +4199,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
 				 */
 				bfq_clear_bfqq_wait_request(bfqq);
 				hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
+				bfqg_stats_update_idle_time(bfqq_group(bfqq));
 			}
 			goto keep_queue;
 		}
@@ -3013,11 +4389,18 @@ static int bfq_dispatch_requests(struct request_queue *q, int force)
  */
 static void bfq_put_queue(struct bfq_queue *bfqq)
 {
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	struct bfq_group *bfqg = bfqq_group(bfqq);
+#endif
+
 	bfqq->ref--;
 	if (bfqq->ref)
 		return;
 
 	kmem_cache_free(bfq_pool, bfqq);
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	bfqg_put(bfqg);
+#endif
 }
 
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
@@ -3159,18 +4542,19 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 }
 
 static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
+					       struct bfq_group *bfqg,
 					       int ioprio_class, int ioprio)
 {
 	switch (ioprio_class) {
 	case IOPRIO_CLASS_RT:
-		return &async_bfqq[0][ioprio];
+		return &bfqg->async_bfqq[0][ioprio];
 	case IOPRIO_CLASS_NONE:
 		ioprio = IOPRIO_NORM;
 		/* fall through */
 	case IOPRIO_CLASS_BE:
-		return &async_bfqq[1][ioprio];
+		return &bfqg->async_bfqq[1][ioprio];
 	case IOPRIO_CLASS_IDLE:
-		return &async_idle_bfqq;
+		return &bfqg->async_idle_bfqq;
 	default:
 		return NULL;
 	}
@@ -3184,11 +4568,18 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
 	struct bfq_queue **async_bfqq = NULL;
 	struct bfq_queue *bfqq;
+	struct bfq_group *bfqg;
 
 	rcu_read_lock();
 
+	bfqg = bfq_find_set_group(bfqd, bio_blkcg(bio));
+	if (!bfqg) {
+		bfqq = &bfqd->oom_bfqq;
+		goto out;
+	}
+
 	if (!is_sync) {
-		async_bfqq = bfq_async_queue_prio(bfqd, ioprio_class,
+		async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
 						  ioprio);
 		bfqq = *async_bfqq;
 		if (bfqq)
@@ -3201,7 +4592,7 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 	if (bfqq) {
 		bfq_init_bfqq(bfqd, bfqq, bic, current->pid,
 			      is_sync);
-		bfq_init_entity(&bfqq->entity);
+		bfq_init_entity(&bfqq->entity, bfqg);
 		bfq_log_bfqq(bfqd, bfqq, "allocated");
 	} else {
 		bfqq = &bfqd->oom_bfqq;
@@ -3349,6 +4740,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 		 */
 		bfq_clear_bfqq_wait_request(bfqq);
 		hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
+		bfqg_stats_update_idle_time(bfqq_group(bfqq));
 
 		/*
 		 * The queue is not empty, because a new request just
@@ -3418,6 +4810,10 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 
 	bfqd->rq_in_driver--;
 	bfqq->dispatched--;
+	bfqg_stats_update_completion(bfqq_group(bfqq),
+				     rq_start_time_ns(rq),
+				     rq_io_start_time_ns(rq), req_op(rq),
+				     rq->cmd_flags);
 
 	RQ_BIC(rq)->ttime.last_end_request = ktime_get_ns();
 
@@ -3524,6 +4920,8 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	if (!bic)
 		goto queue_fail;
 
+	bfq_bic_update_cgroup(bic, bio);
+
 	bfqq = bic_to_bfqq(bic, is_sync);
 	if (!bfqq || bfqq == &bfqd->oom_bfqq) {
 		if (bfqq)
@@ -3629,6 +5027,9 @@ static void __bfq_put_async_bfqq(struct bfq_data *bfqd,
 
 	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
 	if (bfqq) {
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+		bfq_bfqq_move(bfqd, bfqq, bfqd->root_group);
+#endif
 		bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
 			     bfqq, bfqq->ref);
 		bfq_put_queue(bfqq);
@@ -3637,18 +5038,20 @@ static void __bfq_put_async_bfqq(struct bfq_data *bfqd,
 }
 
 /*
- * Release the extra reference of the async queues as the device
- * goes away.
+ * Release all the bfqg references to its async queues.  If we are
+ * deallocating the group these queues may still contain requests, so
+ * we reparent them to the root cgroup (i.e., the only one that will
+ * exist for sure until all the requests on a device are gone).
  */
-static void bfq_put_async_queues(struct bfq_data *bfqd)
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
 {
 	int i, j;
 
 	for (i = 0; i < 2; i++)
 		for (j = 0; j < IOPRIO_BE_NR; j++)
-			__bfq_put_async_bfqq(bfqd, &async_bfqq[i][j]);
+			__bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]);
 
-	__bfq_put_async_bfqq(bfqd, &async_idle_bfqq);
+	__bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq);
 }
 
 static void bfq_exit_queue(struct elevator_queue *e)
@@ -3664,19 +5067,40 @@ static void bfq_exit_queue(struct elevator_queue *e)
 	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
 		bfq_deactivate_bfqq(bfqd, bfqq, 0);
 
-	bfq_put_async_queues(bfqd);
+#ifndef CONFIG_BFQ_GROUP_IOSCHED
+	bfq_disconnect_groups(bfqd);
+#endif
 	spin_unlock_irq(q->queue_lock);
 
 	bfq_shutdown_timer_wq(bfqd);
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	blkcg_deactivate_policy(q, &blkcg_policy_bfq);
+#else
+	kfree(bfqd->root_group);
+#endif
+
 	kfree(bfqd);
 }
 
+static void bfq_init_root_group(struct bfq_group *root_group,
+				struct bfq_data *bfqd)
+{
+	int i;
+
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	root_group->entity.parent = NULL;
+	root_group->my_entity = NULL;
+	root_group->bfqd = bfqd;
+#endif
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+		root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+}
+
 static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 {
 	struct bfq_data *bfqd;
 	struct elevator_queue *eq;
-	int i;
 
 	eq = elevator_alloc(q, e);
 	if (!eq)
@@ -3713,8 +5137,11 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	q->elevator = eq;
 	spin_unlock_irq(q->queue_lock);
 
-	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
-		bfqd->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+	bfqd->root_group = bfq_create_group_hierarchy(bfqd, q->node);
+	if (!bfqd->root_group)
+		goto out_free;
+	bfq_init_root_group(bfqd->root_group, bfqd);
+	bfq_init_entity(&bfqd->oom_bfqq.entity, bfqd->root_group);
 
 	hrtimer_init(&bfqd->idle_slice_timer, CLOCK_MONOTONIC,
 		     HRTIMER_MODE_REL);
@@ -3740,6 +5167,11 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->bfq_requests_within_timer = 120;
 
 	return 0;
+
+out_free:
+	kfree(bfqd);
+	kobject_put(&eq->kobj);
+	return -ENOMEM;
 }
 
 static void bfq_slab_kill(void)
@@ -3986,6 +5418,9 @@ static struct elevator_type iosched_bfq = {
 		.elevator_merge_req_fn =	bfq_merged_requests,
 		.elevator_allow_bio_merge_fn =	bfq_allow_bio_merge,
 		.elevator_allow_rq_merge_fn =	bfq_allow_rq_merge,
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+		.elevator_bio_merged_fn =	bfq_bio_merged,
+#endif
 		.elevator_dispatch_fn =		bfq_dispatch_requests,
 		.elevator_add_req_fn =		bfq_insert_request,
 		.elevator_activate_req_fn =	bfq_activate_request,
@@ -4008,11 +5443,35 @@ static struct elevator_type iosched_bfq = {
 	.elevator_owner =	THIS_MODULE,
 };
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+static struct blkcg_policy blkcg_policy_bfq = {
+	.dfl_cftypes		= bfq_blkg_files,
+	.legacy_cftypes		= bfq_blkcg_legacy_files,
+
+	.cpd_alloc_fn		= bfq_cpd_alloc,
+	.cpd_init_fn		= bfq_cpd_init,
+	.cpd_bind_fn	        = bfq_cpd_init,
+	.cpd_free_fn		= bfq_cpd_free,
+
+	.pd_alloc_fn		= bfq_pd_alloc,
+	.pd_init_fn		= bfq_pd_init,
+	.pd_offline_fn		= bfq_pd_offline,
+	.pd_free_fn		= bfq_pd_free,
+	.pd_reset_stats_fn	= bfq_pd_reset_stats,
+};
+#endif
+
 static int __init bfq_init(void)
 {
 	int ret;
 	char msg[50] = "BFQ I/O-scheduler: v0";
 
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	ret = blkcg_policy_register(&blkcg_policy_bfq);
+	if (ret)
+		return ret;
+#endif
+
 	ret = -ENOMEM;
 	if (bfq_slab_setup())
 		goto err_pol_unreg;
@@ -4029,11 +5488,17 @@ static int __init bfq_init(void)
 	return 0;
 
 err_pol_unreg:
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	blkcg_policy_unregister(&blkcg_policy_bfq);
+#endif
 	return ret;
 }
 
 static void __exit bfq_exit(void)
 {
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+	blkcg_policy_unregister(&blkcg_policy_bfq);
+#endif
 	elv_unregister(&iosched_bfq);
 	bfq_slab_kill();
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c47c358..1047d99 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -45,7 +45,7 @@ struct pr_ops;
  * Maximum number of blkcg policies allowed to be registered concurrently.
  * Defined here to simplify include dependency.
  */
-#define BLKCG_MAX_POLS		2
+#define BLKCG_MAX_POLS		3
 
 typedef void (rq_end_io_fn)(struct request *, int);
 
-- 
2.10.0


* [PATCH 03/14] block, bfq: improve throughput boosting
  2016-10-26  9:27 [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Paolo Valente
  2016-10-26  9:27 ` [PATCH 01/14] block, bfq: " Paolo Valente
  2016-10-26  9:27 ` [PATCH 02/14] block, bfq: add full hierarchical scheduling and cgroups support Paolo Valente
@ 2016-10-26  9:27 ` Paolo Valente
  2016-10-26  9:27 ` [PATCH 04/14] block, bfq: modify the peak-rate estimator Paolo Valente
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-26  9:27 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	hare, arnd, bart.vanassche, grant.likely, jack, James.Bottomley,
	Paolo Valente, Arianna Avanzini

The feedback-loop algorithm used by BFQ to compute queue (process)
budgets is basically a set of three update rules, one for each of the
main reasons why a queue may be expired. If many processes suddenly
switch from sporadic I/O to greedy and sequential I/O, then these
rules are quite slow to assign large budgets to these processes, and
hence to achieve a high throughput. On the opposite side, BFQ assigns
the maximum possible budget B_max to a just-created queue. This allows
a high throughput to be achieved immediately if the associated process
is I/O-bound and performs sequential I/O from the beginning. But it
also increases the worst-case latency experienced by the first
requests issued by the process, because the larger the budget of a
queue waiting for service is, the later the queue will be served by
B-WF2Q+ (Subsec 3.3 in [1]). This is detrimental for an interactive or
soft real-time application.

To tackle these throughput and latency problems, on one hand this
patch lowers the initial budget value (to two thirds of B_max, as set
in the code below). On the other hand,
it re-tunes the three rules, adopting a more aggressive,
multiplicative increase/linear decrease scheme. This scheme trades
latency for throughput more than before, and tends to assign large
budgets quickly to processes that are or become I/O-bound. For two of
the expiration reasons, the new version of the rules also contains
a few further small improvements, briefly described below.

*No more backlog.* In this case, the budget was larger than the number
of sectors actually read/written by the process before it stopped
doing I/O. Hence, to reduce latency for the possible future I/O
requests of the process, the old rule simply set the next budget to
the number of sectors actually consumed by the process. However, if
there are still outstanding requests, then the process may not yet
have issued its next request simply because it is still waiting for
some of those outstanding requests to complete. If this sub-case
holds true, then the new rule, instead of decreasing the budget,
doubles it, proactively, in the hope that: 1) a larger budget will fit
the actual needs of the process, and 2) the process is sequential and
hence a higher throughput will be achieved by serving the process
longer after granting it access to the device.

*Budget timeout*. The original rule set the new budget to the maximum
value B_max, to maximize throughput and let all processes experiencing
budget timeouts receive the same share of the device time. In our
experiments we verified that this sudden jump to B_max did not provide
appreciable benefits; rather, it increased the latency of processes
performing sporadic and short I/O. The new rule only doubles the
budget.
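
For illustration only, the following user-space sketch condenses the
three update rules described above. All names are simplified
stand-ins, not the actual kernel code (which follows in the diff
below):

  #include <stdbool.h>

  /* Illustrative expiration reasons, one per update rule. */
  enum expiration_reason {
          TOO_IDLE,               /* no more backlog */
          BUDGET_TIMEOUT,
          BUDGET_EXHAUSTED,
  };

  static int min_int(int a, int b) { return a < b ? a : b; }

  /* Next budget for a queue, given its old budget, the device-wide
   * maximum and minimum budgets, the expiration reason, and whether
   * the queue still has requests in flight.
   */
  static int next_budget(int budget, int max_budget, int min_budget,
                         enum expiration_reason reason,
                         bool outstanding_reqs)
  {
          switch (reason) {
          case TOO_IDLE:
                  if (outstanding_reqs) /* may just be waiting for completions */
                          return min_int(budget * 2, max_budget);
                  /* linear decrease, with a floor at min_budget */
                  return budget > 5 * min_budget ?
                          budget - 4 * min_budget : min_budget;
          case BUDGET_TIMEOUT:
                  /* double, instead of jumping straight to B_max */
                  return min_int(budget * 2, max_budget);
          case BUDGET_EXHAUSTED:
                  /* aggressive multiplicative increase */
                  return min_int(budget * 4, max_budget);
          }
          return budget;
  }

The only decrease is linear, while every increase is multiplicative,
which is what lets processes that become greedy and sequential ramp up
to large budgets within a few expirations.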

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 83 ++++++++++++++++++++++++++---------------------------
 1 file changed, 41 insertions(+), 42 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index e33f85e..fa1e531 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -692,9 +692,6 @@ struct kmem_cache *bfq_pool;
 #define BFQQ_SEEK_THR		(sector_t)(8 * 100)
 #define BFQQ_SEEKY(bfqq)	(hweight32(bfqq->seek_history) > 32/8)
 
-/* Budget feedback step. */
-#define BFQ_BUDGET_STEP         128
-
 /* Min samples used for peak rate estimation (for autotuning). */
 #define BFQ_PEAK_RATE_SAMPLES	32
 
@@ -3609,36 +3606,6 @@ static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
 	return bfqq;
 }
 
-/*
- * bfq_default_budget - return the default budget for @bfqq on @bfqd.
- * @bfqd: the device descriptor.
- * @bfqq: the queue to consider.
- *
- * We use 3/4 of the @bfqd maximum budget as the default value
- * for the max_budget field of the queues.  This lets the feedback
- * mechanism to start from some middle ground, then the behavior
- * of the process will drive the heuristics towards high values, if
- * it behaves as a greedy sequential reader, or towards small values
- * if it shows a more intermittent behavior.
- */
-static unsigned long bfq_default_budget(struct bfq_data *bfqd,
-					struct bfq_queue *bfqq)
-{
-	unsigned long budget;
-
-	/*
-	 * When we need an estimate of the peak rate we need to avoid
-	 * to give budgets that are too short due to previous measurements.
-	 * So, in the first 10 assignments use a ``safe'' budget value.
-	 */
-	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
-		budget = bfq_default_max_budget;
-	else
-		budget = bfqd->bfq_max_budget;
-
-	return budget - budget / 4;
-}
-
 static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 {
 	struct bfq_queue *bfqq = bfqd->in_service_queue;
@@ -3780,13 +3747,47 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
 		 * for throughput.
 		 */
 		case BFQ_BFQQ_TOO_IDLE:
-			if (budget > min_budget + BFQ_BUDGET_STEP)
-				budget -= BFQ_BUDGET_STEP;
-			else
-				budget = min_budget;
+			/*
+			 * This is the only case where we may reduce
+			 * the budget: if there is no request of the
+			 * process still waiting for completion, then
+			 * we assume (tentatively) that the timer has
+			 * expired because the batch of requests of
+			 * the process could have been served with a
+			 * smaller budget.  Hence, betting that
+			 * process will behave in the same way when it
+			 * becomes backlogged again, we reduce its
+			 * next budget.  As long as we guess right,
+			 * this budget cut reduces the latency
+			 * experienced by the process.
+			 *
+			 * However, if there are still outstanding
+			 * requests, then the process may have not yet
+			 * issued its next request just because it is
+			 * still waiting for the completion of some of
+			 * the still outstanding ones.  So in this
+			 * subcase we do not reduce its budget, on the
+			 * contrary we increase it to possibly boost
+			 * the throughput, as discussed in the
+			 * comments to the BUDGET_TIMEOUT case.
+			 */
+			if (bfqq->dispatched > 0) /* still outstanding reqs */
+				budget = min(budget * 2, bfqd->bfq_max_budget);
+			else {
+				if (budget > 5 * min_budget)
+					budget -= 4 * min_budget;
+				else
+					budget = min_budget;
+			}
 			break;
 		case BFQ_BFQQ_BUDGET_TIMEOUT:
-			budget = bfq_default_budget(bfqd, bfqq);
+			/*
+			 * We double the budget here because it gives
+			 * the chance to boost the throughput if this
+			 * is not a seeky process (and has bumped into
+			 * this timeout because of, e.g., ZBR).
+			 */
+			budget = min(budget * 2, bfqd->bfq_max_budget);
 			break;
 		case BFQ_BFQQ_BUDGET_EXHAUSTED:
 			/*
@@ -3798,8 +3799,7 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
 			 * definitely increase the budget of this good
 			 * candidate to boost the disk throughput.
 			 */
-			budget = min(budget + 8 * BFQ_BUDGET_STEP,
-				     bfqd->bfq_max_budget);
+			budget = min(budget * 4, bfqd->bfq_max_budget);
 			break;
 		case BFQ_BFQQ_NO_MORE_REQUESTS:
 			/*
@@ -4533,9 +4533,8 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	bfqq->pid = pid;
 
 	/* Tentative initial value to trade off between thr and lat */
-	bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
+	bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
 	bfqq->budget_timeout = bfq_smallest_from_now();
-	bfqq->pid = pid;
 
 	/* first request is almost certainly seeky */
 	bfqq->seek_history = 1;
-- 
2.10.0


* [PATCH 04/14] block, bfq: modify the peak-rate estimator
  2016-10-26  9:27 [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Paolo Valente
                   ` (2 preceding siblings ...)
  2016-10-26  9:27 ` [PATCH 03/14] block, bfq: improve throughput boosting Paolo Valente
@ 2016-10-26  9:27 ` Paolo Valente
  2016-10-26  9:27 ` [PATCH 05/14] block, bfq: add more fairness with writes and slow processes Paolo Valente
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-26  9:27 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	hare, arnd, bart.vanassche, grant.likely, jack, James.Bottomley,
	Paolo Valente, Arianna Avanzini

Unless the maximum budget B_max that BFQ can assign to a queue is set
explicitly by the user, BFQ automatically updates B_max. In
particular, BFQ dynamically sets B_max to the number of sectors that
can be read, at the current estimated peak rate, during the maximum
time, T_max, allowed before a budget timeout occurs. In formulas, if
we denote as R_est the estimated peak rate, then B_max = T_max ∗
R_est. Hence, the more R_est overestimates the actual device peak
rate, the higher the probability that processes unjustly incur budget
timeouts. Moreover, an excessively high value of B_max unnecessarily
increases the deviation from an ideal, smooth service.
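
As a purely numerical illustration of the formula above (the figures
are invented for the example, not measured): with an estimated peak
rate of 200 MiB/s, i.e., 409600 512-byte sectors per second, and with
the default T_max of 125 ms (bfq_timeout = HZ / 8),

  B_max = T_max * R_est = 0.125 s * 409600 sectors/s = 51200 sectors,

i.e., 25 MiB worth of service per budget. Overestimating R_est by,
say, a factor of two doubles B_max, making it impossible for a process
served at the real peak rate to consume its budget within T_max, hence
the unjust budget timeouts mentioned above.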

Unfortunately, it is not trivial to estimate the peak rate correctly:
because of the presence of sw and hw queues between the scheduler and
the device components that finally serve I/O requests, it is hard to
say exactly when a given dispatched request is served inside the
device, and for how long. As a consequence, it is hard to know
precisely at what rate a given set of requests is actually served by
the device.

On the opposite end, the dispatch time of any request is trivially
available, and, from this piece of information, the "dispatch rate"
of requests can be immediately computed. So, the idea underlying the
new estimator is to use what is known, namely request dispatch times
(plus, when useful, request completion times), to estimate what is
unknown, namely in-device request service rate.

The main issue is that, because of the above facts, the rate at
which a certain set of requests is dispatched over a certain time
interval can vary greatly with respect to the rate at which the
same requests are then served. But, since the size of any
intermediate queue is limited, and the service scheme is lossless
(no request is silently dropped), the following obvious convergence
property holds: the number of requests dispatched MUST become
closer and closer to the number of requests completed as the
observation interval grows. This is the key property used in
this new version of the peak-rate estimator.
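
The diff below contains the full estimator; purely as a simplified,
user-space sketch of the sampling it performs (names are illustrative
stand-ins for the kernel fields introduced below):

  #include <stdint.h>

  struct rate_sample {
          uint64_t first_dispatch_ns;     /* start of observation interval */
          uint64_t last_dispatch_ns;
          uint64_t last_completion_ns;
          uint64_t tot_sectors_dispatched;
  };

  /* Dispatch rate over the observation interval, in sectors/us.  The
   * interval is extended to the last completion, so that, as the
   * interval grows, this dispatch rate converges to the rate at which
   * the device actually served the requests.
   */
  static uint64_t sampled_rate(const struct rate_sample *s)
  {
          uint64_t end_ns = s->last_completion_ns > s->last_dispatch_ns ?
                  s->last_completion_ns : s->last_dispatch_ns;
          uint64_t delta_us = (end_ns - s->first_dispatch_ns) / 1000;

          return delta_us ? s->tot_sectors_dispatched / delta_us : 0;
  }

The value obtained in this way is then fed, in the code below, to a
low-pass filter that weighs each new sample according to how
sequential the sampled dispatches were and how long the observation
interval lasted.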

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 496 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 376 insertions(+), 120 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index fa1e531..101c591 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -367,14 +367,32 @@ struct bfq_data {
 	/* on-disk position of the last served request */
 	sector_t last_position;
 
+	/* time of last request completion (ns) */
+	u64 last_completion;
+
+	/* time of first rq dispatch in current observation interval (ns) */
+	u64 first_dispatch;
+	/* time of last rq dispatch in current observation interval (ns) */
+	u64 last_dispatch;
+
 	/* beginning of the last budget */
 	ktime_t last_budget_start;
 	/* beginning of the last idle slice */
 	ktime_t last_idling_start;
-	/* number of samples used to calculate @peak_rate */
+
+	/* number of samples in current observation interval */
 	int peak_rate_samples;
-	/* peak transfer rate observed for a budget */
-	u64 peak_rate;
+	/* num of samples of seq dispatches in current observation interval */
+	u32 sequential_samples;
+	/* total num of sectors transferred in current observation interval */
+	u64 tot_sectors_dispatched;
+	/* max rq size seen during current observation interval (sectors) */
+	u32 last_rq_max_size;
+	/* time elapsed from first dispatch in current observ. interval (us) */
+	u64 delta_from_first;
+	/* current estimate of device peak rate */
+	u32 peak_rate;
+
 	/* maximum budget allotted to a bfq_queue before rescheduling */
 	int bfq_max_budget;
 
@@ -682,7 +700,7 @@ static const int bfq_timeout = HZ / 8;
 
 struct kmem_cache *bfq_pool;
 
-/* Below this threshold (in ms), we consider thinktime immediate. */
+/* Below this threshold (in ns), we consider thinktime immediate. */
 #define BFQ_MIN_TT		(2 * NSEC_PER_MSEC)
 
 /* hw_tag detection: parallel requests threshold and min samples needed. */
@@ -692,8 +710,12 @@ struct kmem_cache *bfq_pool;
 #define BFQQ_SEEK_THR		(sector_t)(8 * 100)
 #define BFQQ_SEEKY(bfqq)	(hweight32(bfqq->seek_history) > 32/8)
 
-/* Min samples used for peak rate estimation (for autotuning). */
-#define BFQ_PEAK_RATE_SAMPLES	32
+/* Min number of samples required to perform peak-rate update */
+#define BFQ_RATE_MIN_SAMPLES	32
+/* Min observation time interval required to perform a peak-rate update (ns) */
+#define BFQ_RATE_MIN_INTERVAL	(300*NSEC_PER_MSEC)
+/* Target observation time interval for a peak-rate update (ns) */
+#define BFQ_RATE_REF_INTERVAL	NSEC_PER_SEC
 
 /* Shift used for peak rate fixed precision calculations. */
 #define BFQ_RATE_SHIFT		16
@@ -3400,14 +3422,25 @@ static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
 	return NULL;
 }
 
+static sector_t get_sdist(sector_t last_pos, struct request *rq)
+{
+	sector_t sdist = 0;
+
+	if (last_pos) {
+		if (last_pos < blk_rq_pos(rq))
+			sdist = blk_rq_pos(rq) - last_pos;
+		else
+			sdist = last_pos - blk_rq_pos(rq);
+	}
+
+	return sdist;
+}
+
 static void bfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
 
 	bfqd->rq_in_driver++;
-	bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
-	bfq_log(bfqd, "activate_request: new bfqd->last_position %llu",
-		(unsigned long long)bfqd->last_position);
 }
 
 static void bfq_deactivate_request(struct request_queue *q, struct request *rq)
@@ -3659,6 +3692,226 @@ static void bfq_set_budget_timeout(struct bfq_data *bfqd)
 }
 
 /*
+ * In autotuning mode, max_budget is dynamically recomputed as the
+ * amount of sectors transferred in timeout at the estimated peak
+ * rate. This enables BFQ to utilize a full timeslice with a full
+ * budget, even if the in-service queue is served at peak rate. And
+ * this maximises throughput with sequential workloads.
+ */
+static unsigned long bfq_calc_max_budget(struct bfq_data *bfqd)
+{
+	return (u64)bfqd->peak_rate * USEC_PER_MSEC *
+		jiffies_to_msecs(bfqd->bfq_timeout)>>BFQ_RATE_SHIFT;
+}
+
+void bfq_reset_rate_computation(struct bfq_data *bfqd, struct request *rq)
+{
+	if (rq != NULL) { /* new rq dispatch now, reset accordingly */
+		bfqd->last_dispatch = bfqd->first_dispatch = ktime_get_ns();
+		bfqd->peak_rate_samples = 1;
+		bfqd->sequential_samples = 0;
+		bfqd->tot_sectors_dispatched = bfqd->last_rq_max_size =
+			blk_rq_sectors(rq);
+	} else /* no new rq dispatched, just reset the number of samples */
+		bfqd->peak_rate_samples = 0; /* full re-init on next disp. */
+
+	bfq_log(bfqd,
+		"reset_rate_computation at end, sample %u/%u tot_sects %llu",
+		bfqd->peak_rate_samples, bfqd->sequential_samples,
+		bfqd->tot_sectors_dispatched);
+}
+
+void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)
+{
+	u32 rate, weight, divisor;
+
+	/*
+	 * For the convergence property to hold (see comments on
+	 * bfq_update_peak_rate()) and for the assessment to be
+	 * reliable, a minimum number of samples must be present, and
+	 * a minimum amount of time must have elapsed. If not so, do
+	 * not compute new rate. Just reset parameters, to get ready
+	 * for a new evaluation attempt.
+	 */
+	if (bfqd->peak_rate_samples < BFQ_RATE_MIN_SAMPLES ||
+	    bfqd->delta_from_first < BFQ_RATE_MIN_INTERVAL)
+		goto reset_computation;
+
+	/*
+	 * If a new request completion has occurred after last
+	 * dispatch, then, to approximate the rate at which requests
+	 * have been served by the device, it is more precise to
+	 * extend the observation interval to the last completion.
+	 */
+	bfqd->delta_from_first =
+		max_t(u64, bfqd->delta_from_first,
+		      bfqd->last_completion - bfqd->first_dispatch);
+
+	/*
+	 * Rate computed in sects/usec, and not sects/nsec, for
+	 * precision issues.
+	 */
+	rate = div64_ul(bfqd->tot_sectors_dispatched<<BFQ_RATE_SHIFT,
+			div_u64(bfqd->delta_from_first, NSEC_PER_USEC));
+
+	/*
+	 * Peak rate not updated if:
+	 * - the percentage of sequential dispatches is below 3/4 of the
+	 *   total, and rate is below the current estimated peak rate
+	 * - rate is unreasonably high (> 20M sectors/sec)
+	 */
+	if ((bfqd->sequential_samples < (3 * bfqd->peak_rate_samples)>>2 &&
+	     rate <= bfqd->peak_rate) ||
+		rate > 20<<BFQ_RATE_SHIFT)
+		goto reset_computation;
+
+	/*
+	 * We have to update the peak rate, at last! To this purpose,
+	 * we use a low-pass filter. We compute the smoothing constant
+	 * of the filter as a function of the 'weight' of the new
+	 * measured rate.
+	 *
+	 * As can be seen in next formulas, we define this weight as a
+	 * quantity proportional to how sequential the workload is,
+	 * and to how long the observation time interval is.
+	 *
+	 * The weight runs from 0 to 8. The maximum value of the
+	 * weight, 8, yields the minimum value for the smoothing
+	 * constant. At this minimum value for the smoothing constant,
+	 * the measured rate contributes for half of the next value of
+	 * the estimated peak rate.
+	 *
+	 * So, the first step is to compute the weight as a function
+	 * of how sequential the workload is. Note that the weight
+	 * cannot reach 9, because bfqd->sequential_samples cannot
+	 * become equal to bfqd->peak_rate_samples, which, in its
+	 * turn, holds true because bfqd->sequential_samples is not
+	 * incremented for the first sample.
+	 */
+	weight = (9 * bfqd->sequential_samples) / bfqd->peak_rate_samples;
+
+	/*
+	 * Second step: further refine the weight as a function of the
+	 * duration of the observation interval.
+	 */
+	weight = min_t(u32, 8,
+		       div_u64(weight * bfqd->delta_from_first,
+			       BFQ_RATE_REF_INTERVAL));
+
+	/*
+	 * Divisor ranging from 10, for minimum weight, to 2, for
+	 * maximum weight.
+	 */
+	divisor = 10 - weight;
+
+	/*
+	 * Finally, update peak rate:
+	 *
+	 * peak_rate = peak_rate * (divisor-1) / divisor  +  rate / divisor
+	 */
+	bfqd->peak_rate *= divisor-1;
+	bfqd->peak_rate /= divisor;
+	rate /= divisor; /* smoothing constant alpha = 1/divisor */
+
+	bfqd->peak_rate += rate;
+	if (bfqd->bfq_user_max_budget == 0)
+		bfqd->bfq_max_budget =
+			bfq_calc_max_budget(bfqd);
+
+reset_computation:
+	bfq_reset_rate_computation(bfqd, rq);
+}
+
+/*
+ * Update the read/write peak rate (the main quantity used for
+ * auto-tuning, see update_thr_responsiveness_params()).
+ *
+ * It is not trivial to estimate the peak rate (correctly): because of
+ * the presence of sw and hw queues between the scheduler and the
+ * device components that finally serve I/O requests, it is hard to
+ * say exactly when a given dispatched request is served inside the
+ * device, and for how long. As a consequence, it is hard to know
+ * precisely at what rate a given set of requests is actually served
+ * by the device.
+ *
+ * On the opposite end, the dispatch time of any request is trivially
+ * available, and, from this piece of information, the "dispatch rate"
+ * of requests can be immediately computed. So, the idea in the next
+ * function is to use what is known, namely request dispatch times
+ * (plus, when useful, request completion times), to estimate what is
+ * unknown, namely in-device request service rate.
+ *
+ * The main issue is that, because of the above facts, the rate at
+ * which a certain set of requests is dispatched over a certain time
+ * interval can vary greatly with respect to the rate at which the
+ * same requests are then served. But, since the size of any
+ * intermediate queue is limited, and the service scheme is lossless
+ * (no request is silently dropped), the following obvious convergence
+ * property holds: the number of requests dispatched MUST become
+ * closer and closer to the number of requests completed as the
+ * observation interval grows. This is the key property used in
+ * the next function to estimate the peak service rate as a function
+ * of the observed dispatch rate. The function assumes to be invoked
+ * on every request dispatch.
+ */
+void bfq_update_peak_rate(struct bfq_data *bfqd, struct request *rq)
+{
+	u64 now_ns = ktime_get_ns();
+
+	if (bfqd->peak_rate_samples == 0) { /* first dispatch */
+		bfq_log(bfqd, "update_peak_rate: goto reset, samples %d",
+			bfqd->peak_rate_samples);
+		bfq_reset_rate_computation(bfqd, rq);
+		goto update_last_values; /* will add one sample */
+	}
+
+	/*
+	 * Device idle for very long: the observation interval lasting
+	 * up to this dispatch cannot be a valid observation interval
+	 * for computing a new peak rate (similarly to the late-
+	 * completion event in bfq_completed_request()). Go to
+	 * update_rate_and_reset to have the following three steps
+	 * taken:
+	 * - close the observation interval at the last (previous)
+	 *   request dispatch or completion
+	 * - compute rate, if possible, for that observation interval
+	 * - start a new observation interval with this dispatch
+	 */
+	if (now_ns - bfqd->last_dispatch > 100*NSEC_PER_MSEC &&
+	    bfqd->rq_in_driver == 0)
+		goto update_rate_and_reset;
+
+	/* Update sampling information */
+	bfqd->peak_rate_samples++;
+
+	if ((bfqd->rq_in_driver > 0 ||
+		now_ns - bfqd->last_completion < BFQ_MIN_TT)
+	     && get_sdist(bfqd->last_position, rq) < BFQQ_SEEK_THR)
+		bfqd->sequential_samples++;
+
+	bfqd->tot_sectors_dispatched += blk_rq_sectors(rq);
+
+	/* Reset max observed rq size every 32 dispatches */
+	if (likely(bfqd->peak_rate_samples % 32))
+		bfqd->last_rq_max_size =
+			max_t(u32, blk_rq_sectors(rq), bfqd->last_rq_max_size);
+	else
+		bfqd->last_rq_max_size = blk_rq_sectors(rq);
+
+	bfqd->delta_from_first = now_ns - bfqd->first_dispatch;
+
+	/* Target observation interval not yet reached, go on sampling */
+	if (bfqd->delta_from_first < BFQ_RATE_REF_INTERVAL)
+		goto update_last_values;
+
+update_rate_and_reset:
+	bfq_update_rate_reset(bfqd, rq);
+update_last_values:
+	bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
+	bfqd->last_dispatch = now_ns;
+}
+
+/*
  * Move request from internal lists to the dispatch list of the request queue.
  */
 static void bfq_dispatch_insert(struct request_queue *q, struct request *rq)
@@ -3676,6 +3929,7 @@ static void bfq_dispatch_insert(struct request_queue *q, struct request *rq)
 	 * incrementing bfqq->dispatched.
 	 */
 	bfqq->dispatched++;
+	bfq_update_peak_rate(q->elevator->elevator_data, rq);
 
 	bfq_remove_request(rq);
 	elv_dispatch_sort(q, rq);
@@ -3874,105 +4128,92 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
 			bfqq->entity.budget);
 }
 
-static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
-{
-	unsigned long max_budget;
-
-	/*
-	 * The max_budget calculated when autotuning is equal to the
-	 * amount of sectors transferred in timeout at the
-	 * estimated peak rate.
-	 */
-	max_budget = (unsigned long)(peak_rate * 1000 *
-				     timeout >> BFQ_RATE_SHIFT);
-
-	return max_budget;
-}
-
 /*
- * In addition to updating the peak rate, checks whether the process
- * is "slow", and returns 1 if so. This slow flag is used, in addition
- * to the budget timeout, to reduce the amount of service provided to
- * seeky processes, and hence reduce their chances to lower the
- * throughput. See the code for more details.
+ * Return true if the process associated with bfqq is "slow". The slow
+ * flag is used, in addition to the budget timeout, to reduce the
+ * amount of service provided to seeky processes, and thus reduce
+ * their chances to lower the throughput. More details in the comments
+ * on the function bfq_bfqq_expire().
+ *
+ * An important observation is in order: as discussed in the comments
+ * on the function bfq_update_peak_rate(), with devices with internal
+ * queues, it is hard if ever possible to know when and for how long
+ * an I/O request is processed by the device (apart from the trivial
+ * I/O pattern where a new request is dispatched only after the
+ * previous one has been completed). This makes it hard to evaluate
+ * the real rate at which the I/O requests of each bfq_queue are
+ * served.  In fact, for an I/O scheduler like BFQ, serving a
+ * bfq_queue means just dispatching its requests during its service
+ * slot (i.e., until the budget of the queue is exhausted, or the
+ * queue remains idle, or, finally, a timeout fires). But, during the
+ * service slot of a bfq_queue, around 100 ms at most, the device may
+ * be even still processing requests of bfq_queues served in previous
+ * service slots. On the opposite end, the requests of the in-service
+ * bfq_queue may be completed after the service slot of the queue
+ * finishes.
+ *
+ * Anyway, unless more sophisticated solutions are used
+ * (where possible), the sum of the sizes of the requests dispatched
+ * during the service slot of a bfq_queue is probably the only
+ * approximation available for the service received by the bfq_queue
+ * during its service slot. And this sum is the quantity used in this
+ * function to evaluate the I/O speed of a process.
  */
-static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-				 bool compensate)
+static bool bfq_bfqq_is_slow(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				 bool compensate, enum bfqq_expiration reason,
+				 unsigned long *delta_ms)
 {
-	u64 bw, usecs, expected, timeout;
-	ktime_t delta;
-	int update = 0;
+	ktime_t delta_ktime;
+	u32 delta_usecs;
+	bool slow = BFQQ_SEEKY(bfqq); /* if delta too short, use seekyness */
 
-	if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
+	if (!bfq_bfqq_sync(bfqq))
 		return false;
 
 	if (compensate)
-		delta = bfqd->last_idling_start;
+		delta_ktime = bfqd->last_idling_start;
 	else
-		delta = ktime_get();
-	delta = ktime_sub(delta, bfqd->last_budget_start);
-	usecs = ktime_to_us(delta);
-
-	/* Don't trust short/unrealistic values. */
-	if (usecs < 100 || usecs >= LONG_MAX)
-		return false;
-
-	/*
-	 * Calculate the bandwidth for the last slice.  We use a 64 bit
-	 * value to store the peak rate, in sectors per usec in fixed
-	 * point math.  We do so to have enough precision in the estimate
-	 * and to avoid overflows.
-	 */
-	bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
-	do_div(bw, (unsigned long)usecs);
+		delta_ktime = ktime_get();
+	delta_ktime = ktime_sub(delta_ktime, bfqd->last_budget_start);
+	delta_usecs = ktime_to_us(delta_ktime);
+
+	/* don't trust short/unrealistic values. */
+	if (delta_usecs < 1000 || delta_usecs >= LONG_MAX) {
+		if (blk_queue_nonrot(bfqd->queue))
+			 /*
+			  * give same worst-case guarantees as idling
+			  * for seeky
+			  */
+			*delta_ms = BFQ_MIN_TT / NSEC_PER_MSEC;
+		else /* charge at least one seek */
+			*delta_ms = bfq_slice_idle / NSEC_PER_MSEC;
+
+		return slow;
+	}
 
-	timeout = jiffies_to_msecs(bfqd->bfq_timeout);
+	*delta_ms = delta_usecs / USEC_PER_MSEC;
 
 	/*
-	 * Use only long (> 20ms) intervals to filter out spikes for
-	 * the peak rate estimation.
+	 * Use only long (> 20ms) intervals to filter out excessive
+	 * spikes in service rate estimation.
 	 */
-	if (usecs > 20000) {
-		if (bw > bfqd->peak_rate) {
-			bfqd->peak_rate = bw;
-			update = 1;
-			bfq_log(bfqd, "new peak_rate=%llu", bw);
-		}
-
-		update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
-
-		if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
-			bfqd->peak_rate_samples++;
-
-		if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
-		    update && bfqd->bfq_user_max_budget == 0) {
-			bfqd->bfq_max_budget =
-				bfq_calc_max_budget(bfqd->peak_rate,
-						    timeout);
-			bfq_log(bfqd, "new max_budget=%d",
-				bfqd->bfq_max_budget);
-		}
+	if (delta_usecs > 20000) {
+		/*
+		 * Caveat for rotational devices: processes doing I/O
+		 * in the slower disk zones tend to be slow(er) even
+		 * if not seeky. In this respect, the estimated peak
+		 * rate is likely to be an average over the disk
+		 * surface. Accordingly, to not be too harsh with
+		 * unlucky processes, a process is deemed slow only if
+		 * its rate has been lower than half of the estimated
+		 * peak rate.
+		 */
+		slow = bfqq->entity.service < bfqd->bfq_max_budget / 2;
 	}
 
-	/*
-	 * A process is considered ``slow'' (i.e., seeky, so that we
-	 * cannot treat it fairly in the service domain, as it would
-	 * slow down too much the other processes) if, when a slice
-	 * ends for whatever reason, it has received service at a
-	 * rate that would not be high enough to complete the budget
-	 * before the budget timeout expiration.
-	 */
-	expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT;
+	bfq_log_bfqq(bfqd, bfqq, "bfq_bfqq_is_slow: slow %d", slow);
 
-	/*
-	 * Caveat: processes doing IO in the slower disk zones will
-	 * tend to be slow(er) even if not seeky. And the estimated
-	 * peak rate will actually be an average over the disk
-	 * surface. Hence, to not be too harsh with unlucky processes,
-	 * we keep a budget/3 margin of safety before declaring a
-	 * process slow.
-	 */
-	return expected > (4 * bfqq->entity.budget) / 3;
+	return slow;
 }
 
 /*
@@ -4020,12 +4261,13 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 			    enum bfqq_expiration reason)
 {
 	bool slow;
+	unsigned long delta = 0;
+	struct bfq_entity *entity = &bfqq->entity;
 
 	/*
-	 * Update device peak rate for autotuning and check whether the
-	 * process is slow (see bfq_update_peak_rate).
+	 * Check whether the process is slow (see bfq_bfqq_is_slow).
 	 */
-	slow = bfq_update_peak_rate(bfqd, bfqq, compensate);
+	slow = bfq_bfqq_is_slow(bfqd, bfqq, compensate, reason, &delta);
 
 	/*
 	 * As above explained, 'punish' slow (i.e., seeky), timed-out
@@ -4035,7 +4277,7 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 		bfq_bfqq_charge_full_budget(bfqq);
 
 	if (reason == BFQ_BFQQ_TOO_IDLE &&
-	    bfqq->entity.service <= 2 * bfqq->entity.budget / 10)
+	    entity->service <= 2 * entity->budget / 10)
 		bfq_clear_bfqq_IO_bound(bfqq);
 
 	bfq_log_bfqq(bfqd, bfqq,
@@ -4636,17 +4878,9 @@ static void
 bfq_update_io_seektime(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 		       struct request *rq)
 {
-	sector_t sdist = 0;
-
-	if (bfqq->last_request_pos) {
-		if (bfqq->last_request_pos < blk_rq_pos(rq))
-			sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
-		else
-			sdist = bfqq->last_request_pos - blk_rq_pos(rq);
-	}
-
 	bfqq->seek_history <<= 1;
-	bfqq->seek_history |= (sdist > BFQQ_SEEK_THR);
+	bfqq->seek_history |=
+		get_sdist(bfqq->last_request_pos, rq) > BFQQ_SEEK_THR;
 }
 
 /*
@@ -4804,6 +5038,8 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct bfq_queue *bfqq = RQ_BFQQ(rq);
 	struct bfq_data *bfqd = bfqq->bfqd;
+	u64 now_ns;
+	u32 delta_us;
 
 	bfq_update_hw_tag(bfqd);
 
@@ -4814,7 +5050,37 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 				     rq_io_start_time_ns(rq), req_op(rq),
 				     rq->cmd_flags);
 
-	RQ_BIC(rq)->ttime.last_end_request = ktime_get_ns();
+	now_ns = ktime_get_ns();
+
+	RQ_BIC(rq)->ttime.last_end_request = now_ns;
+
+	/*
+	 * Using us instead of ns, to get a reasonable precision in
+	 * computing rate in next check.
+	 */
+	delta_us = div_u64(now_ns - bfqd->last_completion, NSEC_PER_USEC);
+
+	/*
+	 * If the request took rather long to complete, and, according
+	 * to the maximum request size recorded, this completion latency
+	 * implies that the request was certainly served at a very low
+	 * rate (less than 1M sectors/sec), then the whole observation
+	 * interval that lasts up to this time instant cannot be a
+	 * valid time interval for computing a new peak rate.  Invoke
+	 * bfq_update_rate_reset to have the following three steps
+	 * taken:
+	 * - close the observation interval at the last (previous)
+	 *   request dispatch or completion
+	 * - compute rate, if possible, for that observation interval
+	 * - reset to zero samples, which will trigger a proper
+	 *   re-initialization of the observation interval on next
+	 *   dispatch
+	 */
+	if (delta_us > BFQ_MIN_TT/NSEC_PER_USEC &&
+	   (bfqd->last_rq_max_size<<BFQ_RATE_SHIFT)/delta_us <
+			1UL<<(BFQ_RATE_SHIFT - 10))
+		bfq_update_rate_reset(bfqd, NULL);
+	bfqd->last_completion = now_ns;
 
 	/*
 	 * If this is the in-service queue, check if it needs to be expired,
@@ -5322,16 +5588,6 @@ static ssize_t bfq_weights_store(struct elevator_queue *e,
 	return count;
 }
 
-static unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd)
-{
-	u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout);
-
-	if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES)
-		return bfq_calc_max_budget(bfqd->peak_rate, timeout);
-	else
-		return bfq_default_max_budget;
-}
-
 static ssize_t bfq_max_budget_store(struct elevator_queue *e,
 				    const char *page, size_t count)
 {
@@ -5340,7 +5596,7 @@ static ssize_t bfq_max_budget_store(struct elevator_queue *e,
 	int ret = bfq_var_store(&__data, (page), count);
 
 	if (__data == 0)
-		bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+		bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);
 	else {
 		if (__data > INT_MAX)
 			__data = INT_MAX;
@@ -5370,7 +5626,7 @@ static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
 
 	bfqd->bfq_timeout = msecs_to_jiffies(__data);
 	if (bfqd->bfq_user_max_budget == 0)
-		bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+		bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);
 
 	return ret;
 }
-- 
2.10.0


* [PATCH 05/14] block, bfq: add more fairness with writes and slow processes
  2016-10-26  9:27 [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Paolo Valente
                   ` (3 preceding siblings ...)
  2016-10-26  9:27 ` [PATCH 04/14] block, bfq: modify the peak-rate estimator Paolo Valente
@ 2016-10-26  9:27 ` Paolo Valente
  2016-10-26  9:27 ` [PATCH 06/14] block, bfq: improve responsiveness Paolo Valente
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-26  9:27 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	hare, arnd, bart.vanassche, grant.likely, jack, James.Bottomley,
	Paolo Valente, Arianna Avanzini

This patch deals with two sources of unfairness, which can also cause
high latencies and throughput loss. The first source is related to
write requests. Write requests tend to starve read requests, basically
because, on the one hand, writes are slower than reads, whereas, on the
other hand, storage devices confuse schedulers by deceptively
signaling the completion of write requests immediately after receiving
them. This patch addresses this issue simply by throttling writes. In
particular, after a write request is dispatched for a queue, the
budget of the queue is decremented by the number of sectors to write,
multiplied by an (over)charge coefficient. The value of the
coefficient is the result of our tuning with different devices.
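
In concrete terms, the charging rule amounts to the following sketch
(a simplified, user-space rendition of the bfq_serv_to_charge() change
in the diff below; the factor of 10 is the value introduced by this
patch):

  /* Service charged to a queue for one request, in sectors.  Async
   * queues, which carry the writes being throttled, are overcharged
   * by a constant factor.
   */
  #define ASYNC_CHARGE_FACTOR     10      /* tuning result, see above */

  static unsigned long serv_to_charge(unsigned long nr_sectors, int is_sync)
  {
          return is_sync ? nr_sectors : nr_sectors * ASYNC_CHARGE_FACTOR;
  }

For example, a 1024-sector async write is charged 10240 sectors of
budget, so a write-heavy queue exhausts its budget, and hands the
device back to readers, roughly ten times sooner than it would
otherwise.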

The second source of unfairness has to do with slowness detection:
when the in-service queue is expired, BFQ also checks whether the
queue has been "too slow", i.e., has consumed its last-assigned budget
at such a low rate that it would have been impossible to consume all
of this budget within the maximum time slice T_max (Subsec. 3.5 in
[1]). In this case, the queue is always (over)charged the whole
budget, to reduce its utilization of the device. Both this overcharge
and the slowness-detection criterion may cause unfairness.

First, always charging a full budget to a slow queue is too coarse. It
is much more accurate, and it is what this patch makes BFQ do, to charge
the queue an amount of service 'equivalent' to the time during which the
queue has been in service. As explained in more detail in the comments
on the code, this enables BFQ to provide time fairness among slow
queues.

Secondly, because of ZBR (zoned bit recording), a queue may be deemed
slow when its associated process is performing I/O on the slowest zones
of a disk. However, unless the process is truly too slow, preserving the
disk utilization of the queue is more profitable, in terms of disk
throughput, than reducing it. A similar problem is caused by logical
block mapping on non-rotational devices. For this reason, this patch
lets a queue be charged time, and not budget, only if the queue has
consumed less than 2/3 of its assigned budget. As an additional,
important benefit, this tolerance allows BFQ to preserve enough
elasticity to still perform bandwidth, and not time, distribution with
processes that are only slightly unlucky or quasi-sequential.

Finally, for the same reasons as above, this patch makes slowness
detection itself much less harsh: a queue is deemed slow only if it
has consumed its budget at less than half of the peak rate.
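
To make the combined criterion concrete, here is a condensed,
user-space sketch of the decision taken at expiration and of the
'equivalent service' computed when time is charged (simplified
stand-ins for the kernel code in the diff below):

  #include <stdbool.h>

  /* Amount of service, in sectors, to charge a queue on expiration:
   * either the service it actually received or, for slow queues and
   * for timed-out queues that left at least 1/3 of their budget
   * unused, the service it would have received at peak rate during
   * the time it was in service.
   */
  static int service_to_charge(int service_received, int budget,
                               int max_budget, unsigned int time_ms,
                               unsigned int timeout_ms,
                               bool slow, bool budget_timeout)
  {
          int budget_left = budget - service_received;
          int charge = service_received;

          if (slow || (budget_timeout && budget_left >= budget / 3)) {
                  /* time charging: scale the maximum budget by the
                   * fraction of the timeout spent in service */
                  if (time_ms > 0 && time_ms < timeout_ms)
                          charge = (int)((long long)max_budget * time_ms /
                                         timeout_ms);
                  if (charge < service_received) /* never charge less */
                          charge = service_received;
          }
          return charge;
  }

A queue that occupied the device for half of the timeout is thus
charged half of the maximum budget, however few sectors it actually
transferred, and its next timestamps, and hence its next turn, are
computed accordingly.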

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 120 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 85 insertions(+), 35 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 101c591..3162e02 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -695,6 +695,13 @@ static const int bfq_stats_min_budgets = 194;
 /* Default maximum budget values, in sectors and number of requests. */
 static const int bfq_default_max_budget = 16 * 1024;
 
+/*
+ * Async to sync throughput distribution is controlled as follows:
+ * when an async request is served, the entity is charged the number
+ * of sectors of the request, multiplied by the factor below
+ */
+static const int bfq_async_charge_factor = 10;
+
 /* Default timeout values, in jiffies, approximating CFQ defaults. */
 static const int bfq_timeout = HZ / 8;
 
@@ -1371,22 +1378,52 @@ static void bfq_bfqq_served(struct bfq_queue *bfqq, int served)
 }
 
 /**
- * bfq_bfqq_charge_full_budget - set the service to the entity budget.
+ * bfq_bfqq_charge_time - charge an amount of service equivalent to the length
+ *			  of the time interval during which bfqq has been in
+ *			  service.
+ * @bfqd: the device
  * @bfqq: the queue that needs a service update.
+ * @time_ms: the amount of time during which the queue has received service
  *
- * When it's not possible to be fair in the service domain, because
- * a queue is not consuming its budget fast enough (the meaning of
- * fast depends on the timeout parameter), we charge it a full
- * budget.  In this way we should obtain a sort of time-domain
- * fairness among all the seeky/slow queues.
+ * If a queue does not consume its budget fast enough, then providing
+ * the queue with service fairness may impair throughput, more or less
+ * severely. For this reason, queues that consume their budget slowly
+ * are provided with time fairness instead of service fairness. This
+ * goal is achieved through the BFQ scheduling engine, even if such an
+ * engine works in the service, and not in the time domain. The trick
+ * is charging these queues with an inflated amount of service, equal
+ * to the amount of service that they would have received during their
+ * service slot if they had been fast, i.e., if their requests had
+ * been dispatched at a rate equal to the estimated peak rate.
+ *
+ * It is worth noting that time fairness can cause important
+ * distortions in terms of bandwidth distribution, on devices with
+ * internal queueing. The reason is that I/O requests dispatched
+ * during the service slot of a queue may be served after that service
+ * slot is finished, and may have a total processing time loosely
+ * correlated with the duration of the service slot. This is
+ * especially true for short service slots.
  */
-static void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
+static void bfq_bfqq_charge_time(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				 unsigned long time_ms)
 {
 	struct bfq_entity *entity = &bfqq->entity;
+	int tot_serv_to_charge = entity->service;
+	unsigned int timeout_ms = jiffies_to_msecs(bfq_timeout);
+
+	if (time_ms > 0 && time_ms < timeout_ms)
+		tot_serv_to_charge =
+			(bfqd->bfq_max_budget * time_ms) / timeout_ms;
 
-	bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget");
+	if (tot_serv_to_charge < entity->service)
+		tot_serv_to_charge = entity->service;
 
-	bfq_bfqq_served(bfqq, entity->budget - entity->service);
+	/* Increase budget to avoid inconsistencies */
+	if (tot_serv_to_charge > entity->budget)
+		entity->budget = tot_serv_to_charge;
+
+	bfq_bfqq_served(bfqq,
+			max_t(int, 0, tot_serv_to_charge - entity->service));
 }
 
 /**
@@ -3129,10 +3166,14 @@ static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
 	return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
 }
 
+/* see the definition of bfq_async_charge_factor for details */
 static unsigned long bfq_serv_to_charge(struct request *rq,
 					struct bfq_queue *bfqq)
 {
-	return blk_rq_sectors(rq);
+	if (bfq_bfqq_sync(bfqq))
+		return blk_rq_sectors(rq);
+
+	return blk_rq_sectors(rq) * bfq_async_charge_factor;
 }
 
 /**
@@ -4232,28 +4273,24 @@ static unsigned long bfq_smallest_from_now(void)
  * @compensate: if true, compensate for the time spent idling.
  * @reason: the reason causing the expiration.
  *
+ * If the process associated with bfqq does slow I/O (e.g., because it
+ * issues random requests), we charge bfqq with the time it has been
+ * in service instead of the service it has received (see
+ * bfq_bfqq_charge_time for details on how this goal is achieved). As
+ * a consequence, bfqq will typically get higher timestamps upon
+ * reactivation, and hence it will be rescheduled as if it had
+ * received more service than what it has actually received. In the
+ * end, bfqq receives less service in proportion to how slowly its
+ * associated process consumes its budgets (and hence how seriously it
+ * tends to lower the throughput). In addition, this time-charging
+ * strategy guarantees time fairness among slow processes. In
+ * contrast, if the process associated with bfqq is not slow, we
+ * charge bfqq exactly with the service it has received.
  *
- * If the process associated with the queue is slow (i.e., seeky), or
- * in case of budget timeout, or, finally, if it is async, we
- * artificially charge it an entire budget (independently of the
- * actual service it received). As a consequence, the queue will get
- * higher timestamps than the correct ones upon reactivation, and
- * hence it will be rescheduled as if it had received more service
- * than what it actually received. In the end, this class of processes
- * will receive less service in proportion to how slowly they consume
- * their budgets (and hence how seriously they tend to lower the
- * throughput).
- *
- * In contrast, when a queue expires because it has been idling for
- * too much or because it exhausted its budget, we do not touch the
- * amount of service it has received. Hence when the queue will be
- * reactivated and its timestamps updated, the latter will be in sync
- * with the actual service received by the queue until expiration.
- *
- * Charging a full budget to the first type of queues and the exact
- * service to the others has the effect of using the WF2Q+ policy to
- * schedule the former on a timeslice basis, without violating the
- * service domain guarantees of the latter.
+ * Charging time to the first type of queues and the exact service to
+ * the other has the effect of using the WF2Q+ policy to schedule the
+ * former on a timeslice basis, without violating service domain
+ * guarantees among the latter.
  */
 static void bfq_bfqq_expire(struct bfq_data *bfqd,
 			    struct bfq_queue *bfqq,
@@ -4270,11 +4307,24 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 	slow = bfq_bfqq_is_slow(bfqd, bfqq, compensate, reason, &delta);
 
 	/*
-	 * As above explained, 'punish' slow (i.e., seeky), timed-out
-	 * and async queues, to favor sequential sync workloads.
+	 * As above explained, charge slow (typically seeky) and
+	 * timed-out queues with the time and not the service
+	 * received, to favor sequential workloads.
+	 *
+	 * Processes doing I/O in the slower disk zones will tend to
+	 * be slow(er) even if not seeky. Therefore, since the
+	 * estimated peak rate is actually an average over the disk
+	 * surface, these processes may timeout just for bad luck. To
+	 * avoid punishing them, do not charge time to processes that
+	 * succeeded in consuming at least 2/3 of their budget. This
+	 * allows BFQ to preserve enough elasticity to still perform
+	 * bandwidth, and not time, distribution with slightly unlucky
+	 * or quasi-sequential processes.
 	 */
-	if (slow || reason == BFQ_BFQQ_BUDGET_TIMEOUT)
-		bfq_bfqq_charge_full_budget(bfqq);
+	if (slow ||
+	    (reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
+	     bfq_bfqq_budget_left(bfqq) >=  entity->budget / 3))
+		bfq_bfqq_charge_time(bfqd, bfqq, delta);
 
 	if (reason == BFQ_BFQQ_TOO_IDLE &&
 	    entity->service <= 2 * entity->budget / 10)
-- 
2.10.0

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 06/14] block, bfq: improve responsiveness
  2016-10-26  9:27 [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Paolo Valente
                   ` (4 preceding siblings ...)
  2016-10-26  9:27 ` [PATCH 05/14] block, bfq: add more fairness with writes and slow processes Paolo Valente
@ 2016-10-26  9:27 ` Paolo Valente
  2016-10-26  9:28 ` [PATCH 07/14] block, bfq: reduce I/O latency for soft real-time applications Paolo Valente
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-26  9:27 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	hare, arnd, bart.vanassche, grant.likely, jack, James.Bottomley,
	Paolo Valente, Arianna Avanzini

This patch introduces a simple heuristic to load applications quickly,
and to perform the I/O requested by interactive applications just as
quickly. To this end, both a newly-created queue and a queue
associated with an interactive application (we explain in a moment how
BFQ decides whether the associated application is interactive) receive
the following two special treatments, sketched right after the list:

1) The weight of the queue is raised.

2) The queue unconditionally enjoys device idling when it empties; in
fact, if the requests of a queue are sync, then performing device
idling for the queue is a necessary condition to guarantee that the
queue receives a fraction of the throughput proportional to its weight
(see [1] for details).
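
As a toy illustration only (this is not the patch's code; the struct
and function names below are invented), the two treatments boil down
to multiplying the original weight by a raising coefficient, and to
granting full device idling while that coefficient is above 1:

  /* toy model of the two treatments; not the kernel implementation */
  struct toy_queue {
          unsigned int orig_weight; /* weight derived from the I/O priority */
          unsigned int wr_coeff;    /* 1 when the queue is not weight-raised */
          int seeky;                /* set if the queue's I/O looks random */
  };

  static unsigned int effective_weight(const struct toy_queue *q)
  {
          /* treatment 1: the scheduled weight is boosted by wr_coeff */
          return q->orig_weight * q->wr_coeff;
  }

  static int grant_full_idling(const struct toy_queue *q)
  {
          /*
           * treatment 2: a weight-raised queue always gets full device
           * idling, even if its I/O pattern (seekiness) would normally
           * make idling unattractive for throughput.
           */
          return q->wr_coeff > 1 || !q->seeky;
  }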

For brevity, we simply call weight-raising the combination of these
two preferential treatments. For a newly-created queue,
weight-raising starts immediately and lasts for a time interval that:
1) depends on the device speed and type (rotational or
non-rotational), and 2) is equal to the time needed to load (start up)
a large-size application on that device, with cold caches and with no
additional workload.
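
As an aside, the patch derives this interval from the device's
estimated peak rate and clamps the result between 3 and 13 seconds
(see bfq_wr_duration() in the diff below). The standalone fragment
below is only a simplified model of that computation; units,
reference values and function names are made up for the example:

  #include <stdint.h>
  #include <stdio.h>

  /*
   * Illustrative model: the raising period scales as (R / r) * T,
   * where r is the estimated peak rate of the device at hand and
   * R, T are reference values for its speed class; the result is
   * then clamped to a sane range.
   */
  static unsigned int wr_duration_ms(uint64_t ref_rate, uint64_t peak_rate,
                                     unsigned int ref_time_ms)
  {
          uint64_t dur = ref_rate * ref_time_ms / peak_rate;

          if (dur > 13000)        /* upper bound: 13 s */
                  dur = 13000;
          if (dur < 3000)         /* lower bound: 3 s */
                  dur = 3000;
          return (unsigned int)dur;
  }

  int main(void)
  {
          /* a device half as fast as the reference gets a longer period */
          printf("%u ms\n", wr_duration_ms(100, 50, 4000));  /* 8000 ms */
          /* a much faster device is clamped to the 3 s floor */
          printf("%u ms\n", wr_duration_ms(100, 400, 4000)); /* 3000 ms */
          return 0;
  }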

Finally, as for guaranteeing fast execution of interactive,
I/O-related tasks (such as opening a file), consider that any
interactive application blocks and waits for user input both after
starting up and after executing some task. After a while, the user may
trigger new operations, after which the application stops again, and
so on. Accordingly, the low-latency heuristic weight-raises a queue
again if the queue becomes backlogged after having been idle for a
sufficiently long (configurable) time. The weight-raising then lasts
for the same time as for a just-created queue.
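
For illustration, a minimal, standalone model of this re-raising
condition follows (invented names; the actual check in the diff
below, bfq_bfqq_idle_for_long_time(), works on jiffies and on the
queue's dispatch counter):

  #include <stdbool.h>
  #include <stdint.h>

  /*
   * Simplified model, not the kernel code: a queue that becomes
   * backlogged again deserves a new weight-raising period only if
   * it stayed completely idle -- nothing queued and nothing
   * outstanding on the device -- for at least min_idle_ms.
   */
  struct idle_state {
          uint64_t idle_since_ms;  /* when the queue last went fully idle */
          unsigned int dispatched; /* requests still outstanding on the device */
  };

  static bool deserves_new_wr_period(const struct idle_state *q,
                                     uint64_t now_ms, uint64_t min_idle_ms)
  {
          return q->dispatched == 0 &&
                 now_ms - q->idle_since_ms >= min_idle_ms;
  }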

According to our experiments, the combination of this low-latency
heuristic and of the improvements described in the previous patch
allows BFQ to guarantee a high application responsiveness.

[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
    Scheduler", Proceedings of the First Workshop on Mobile System
    Technologies (MST-2015), May 2015.
    http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/Kconfig.iosched |   3 +-
 block/bfq-iosched.c   | 785 ++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 706 insertions(+), 82 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 408b619..ecc6ca2 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,7 +45,8 @@ config IOSCHED_BFQ
 	---help---
 	  The BFQ I/O scheduler distributes bandwidth among all
 	  processes according to their weights, regardless of the
-	  device parameters and with any workload.
+	  device parameters and with any workload. It also guarantees
+	  a low latency to interactive applications.
 
 config BFQ_GROUP_IOSCHED
 	bool "BFQ hierarchical scheduling support"
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 3162e02..e5a92fa 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -34,6 +34,10 @@
  * guarantee a low latency to non-I/O bound processes (the latter
  * often belong to time-sensitive applications).
  *
+ * Even better for latency, BFQ explicitly privileges the I/O of
+ * interactive applications, thereby providing these applications with
+ * a very low latency.
+ *
  * With respect to the version of BFQ presented in [1], and in the
  * papers cited therein, this implementation adds a hierarchical
  * extension based on H-WF2Q+. In this extension, also the service of
@@ -192,11 +196,11 @@ struct bfq_entity {
 	/* budget, used also to calculate F_i: F_i = S_i + @budget / @weight */
 	int budget;
 
-	unsigned short weight;	/* weight of the queue */
-	unsigned short new_weight; /* next weight if a change is in progress */
+	unsigned int weight;	/* weight of the queue */
+	unsigned int new_weight; /* next weight if a change is in progress */
 
 	/* original weight, used to implement weight boosting */
-	unsigned short orig_weight;
+	unsigned int orig_weight;
 
 	/* parent entity, for hierarchical scheduling */
 	struct bfq_entity *parent;
@@ -280,6 +284,17 @@ struct bfq_queue {
 
 	/* pid of the process owning the queue, used for logging purposes */
 	pid_t pid;
+
+	/* current maximum weight-raising time for this queue */
+	unsigned long wr_cur_max_time;
+	/*
+	 * Start time of the current weight-raising period if
+	 * the @bfq-queue is being weight-raised, otherwise
+	 * finish time of the last weight-raising period.
+	 */
+	unsigned long last_wr_start_finish;
+	/* factor by which the weight of this queue is multiplied */
+	unsigned int wr_coeff;
 };
 
 /**
@@ -444,6 +459,34 @@ struct bfq_data {
 	 */
 	bool strict_guarantees;
 
+	/* if set to true, low-latency heuristics are enabled */
+	bool low_latency;
+	/*
+	 * Maximum factor by which the weight of a weight-raised queue
+	 * is multiplied.
+	 */
+	unsigned int bfq_wr_coeff;
+	/* maximum duration of a weight-raising period (jiffies) */
+	unsigned int bfq_wr_max_time;
+	/*
+	 * Minimum idle period after which weight-raising may be
+	 * reactivated for a queue (in jiffies).
+	 */
+	unsigned int bfq_wr_min_idle_time;
+	/*
+	 * Minimum period between request arrivals after which
+	 * weight-raising may be reactivated for an already busy async
+	 * queue (in jiffies).
+	 */
+	unsigned long bfq_wr_min_inter_arr_async;
+	/*
+	 * Cached value of the product R*T, used for computing the
+	 * maximum duration of weight raising automatically.
+	 */
+	u64 RT_prod;
+	/* device-speed class for the low-latency heuristic */
+	enum bfq_device_speed device_speed;
+
 	/* fallback dummy bfqq for extreme OOM conditions */
 	struct bfq_queue oom_bfqq;
 };
@@ -459,7 +502,6 @@ enum bfqq_state_flags {
 	BFQ_BFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
 	BFQ_BFQQ_FLAG_idle_window,	/* slice idling enabled */
 	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */
-	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this budget */
 	BFQ_BFQQ_FLAG_IO_bound,		/*
 					 * bfqq has timed-out at least once
 					 * having consumed at most 2/10 of
@@ -488,7 +530,6 @@ BFQ_BFQQ_FNS(must_alloc);
 BFQ_BFQQ_FNS(fifo_expire);
 BFQ_BFQQ_FNS(idle_window);
 BFQ_BFQQ_FNS(sync);
-BFQ_BFQQ_FNS(budget_new);
 BFQ_BFQQ_FNS(IO_bound);
 #undef BFQ_BFQQ_FNS
 
@@ -584,7 +625,7 @@ struct bfq_group_data {
 	/* must be the first member */
 	struct blkcg_policy_data pd;
 
-	unsigned short weight;
+	unsigned int weight;
 };
 
 /**
@@ -674,6 +715,8 @@ static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
 static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 				       struct bio *bio, bool is_sync,
 				       struct bfq_io_cq *bic);
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
+				    struct bfq_group *bfqg);
 static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 
@@ -727,6 +770,56 @@ struct kmem_cache *bfq_pool;
 /* Shift used for peak rate fixed precision calculations. */
 #define BFQ_RATE_SHIFT		16
 
+/*
+ * By default, BFQ computes the duration of the weight raising for
+ * interactive applications automatically, using the following formula:
+ * duration = (R / r) * T, where r is the peak rate of the device, and
+ * R and T are two reference parameters.
+ * In particular, R is the peak rate of the reference device (see below),
+ * and T is a reference time: given the systems that are likely to be
+ * installed on the reference device according to its speed class, T is
+ * about the maximum time needed, under BFQ and while reading two files in
+ * parallel, to load typical large applications on these systems.
+ * In practice, the slower/faster the device at hand is, the more/less time
+ * it takes to load applications with respect to the reference device.
+ * Accordingly, BFQ grants weight raising to interactive applications for a
+ * longer/shorter time.
+ *
+ * BFQ uses four different reference pairs (R, T), depending on:
+ * . whether the device is rotational or non-rotational;
+ * . whether the device is slow, such as old or portable HDDs, as well as
+ *   SD cards, or fast, such as newer HDDs and SSDs.
+ *
+ * The device's speed class is dynamically (re)detected in
+ * bfq_update_peak_rate() every time the estimated peak rate is updated.
+ *
+ * In the following definitions, R_slow[0]/R_fast[0] and
+ * T_slow[0]/T_fast[0] are the reference values for a slow/fast
+ * rotational device, whereas R_slow[1]/R_fast[1] and
+ * T_slow[1]/T_fast[1] are the reference values for a slow/fast
+ * non-rotational device. Finally, device_speed_thresh are the
+ * thresholds used to switch between speed classes. The reference
+ * rates are not the actual peak rates of the devices used as a
+ * reference, but slightly lower values. The reason for using these
+ * slightly lower values is that the peak-rate estimator tends to
+ * yield slightly lower values than the actual peak rate (it can yield
+ * the actual peak rate only if there is only one process doing I/O,
+ * and the process does sequential I/O).
+ *
+ * Both the reference peak rates and the thresholds are measured in
+ * sectors/usec, left-shifted by BFQ_RATE_SHIFT.
+ */
+static int R_slow[2] = {1000, 10700};
+static int R_fast[2] = {14000, 33000};
+/*
+ * To improve readability, a conversion function is used to initialize the
+ * following arrays, which entails that they can be initialized only in a
+ * function.
+ */
+static int T_slow[2];
+static int T_fast[2];
+static int device_speed_thresh[2];
+
 #define BFQ_SERVICE_TREE_INIT	((struct bfq_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
@@ -1286,7 +1379,7 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 
 	if (entity->prio_changed) {
 		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
-		unsigned short prev_weight, new_weight;
+		unsigned int prev_weight, new_weight;
 		struct bfq_data *bfqd = NULL;
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 		struct bfq_sched_data *sd;
@@ -1335,7 +1428,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		new_st = bfq_entity_service_tree(entity);
 
 		prev_weight = entity->weight;
-		new_weight = entity->orig_weight;
+		new_weight = entity->orig_weight *
+			     (bfqq ? bfqq->wr_coeff : 1);
 		entity->weight = new_weight;
 
 		new_st->wsum += entity->weight;
@@ -1442,6 +1536,7 @@ static void __bfq_activate_entity(struct bfq_entity *entity,
 {
 	struct bfq_sched_data *sd = entity->sched_data;
 	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	bool backshifted = false;
 
 	if (entity == sd->in_service_entity) {
@@ -1521,10 +1616,19 @@ static void __bfq_activate_entity(struct bfq_entity *entity,
 	 * time. This may introduce a little unfairness among queues
 	 * with backshifted timestamps, but it does not break
 	 * worst-case fairness guarantees.
+	 *
+	 * As a special case, if bfqq is weight-raised, push up
+	 * timestamps much less, to keep very low the probability that
+	 * this push up causes the backshifted finish timestamps of
+	 * weight-raised queues to become higher than the backshifted
+	 * finish timestamps of non weight-raised queues.
 	 */
 	if (backshifted && bfq_gt(st->vtime, entity->finish)) {
 		unsigned long delta = st->vtime - entity->finish;
 
+		if (bfqq)
+			delta /= bfqq->wr_coeff;
+
 		entity->start += delta;
 		entity->finish += delta;
 	}
@@ -2630,6 +2734,18 @@ static void bfq_pd_offline(struct blkg_policy_data *pd)
 	bfqg_stats_xfer_dead(bfqg);
 }
 
+static void bfq_end_wr_async(struct bfq_data *bfqd)
+{
+	struct blkcg_gq *blkg;
+
+	list_for_each_entry(blkg, &bfqd->queue->blkg_list, q_node) {
+		struct bfq_group *bfqg = blkg_to_bfqg(blkg);
+
+		bfq_end_wr_async_queues(bfqd, bfqg);
+	}
+	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+}
+
 static int bfq_io_show_weight(struct seq_file *sf, void *v)
 {
 	struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
@@ -2991,6 +3107,11 @@ bfq_bic_update_cgroup(struct bfq_io_cq *bic, struct bio *bio)
 	return bfqd->root_group;
 }
 
+static void bfq_end_wr_async(struct bfq_data *bfqd)
+{
+	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+}
+
 static void bfq_disconnect_groups(struct bfq_data *bfqd)
 {
 	bfq_put_async_queues(bfqd, bfqd->root_group);
@@ -3170,7 +3291,7 @@ static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
 static unsigned long bfq_serv_to_charge(struct request *rq,
 					struct bfq_queue *bfqq)
 {
-	if (bfq_bfqq_sync(bfqq))
+	if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1)
 		return blk_rq_sectors(rq);
 
 	return blk_rq_sectors(rq) * bfq_async_charge_factor;
@@ -3257,12 +3378,12 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
  * whether the in-service queue should be expired, by returning
  * true. The purpose of expiring the in-service queue is to give bfqq
  * the chance to possibly preempt the in-service queue, and the reason
- * for preempting the in-service queue is to achieve the following
- * goal: guarantee to bfqq its reserved bandwidth even if bfqq has
- * expired because it has remained idle.
+ * for preempting the in-service queue is to achieve one of the two
+ * goals below.
  *
- * In particular, bfqq may have expired for one of the following two
- * reasons:
+ * 1. Guarantee to bfqq its reserved bandwidth even if bfqq has
+ * expired because it has remained idle. In particular, bfqq may have
+ * expired for one of the following two reasons:
  *
  * - BFQ_BFQQ_NO_MORE_REQUESTS bfqq did not enjoy any device idling
  *   and did not make it to issue a new request before its last
@@ -3326,10 +3447,36 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
  * above-described special way, and signals that the in-service queue
  * should be expired. Timestamp back-shifting is done later in
  * __bfq_activate_entity.
+ *
+ * 2. Reduce latency. Even if timestamps are not backshifted to let
+ * the process associated with bfqq recover a service hole, bfqq may
+ * however happen to have, after being (re)activated, a lower finish
+ * timestamp than the in-service queue.	 That is, the next budget of
+ * bfqq may have to be completed before the one of the in-service
+ * queue. If this is the case, then preempting the in-service queue
+ * allows this goal to be achieved, apart from the unpreemptible,
+ * outstanding requests mentioned above.
+ *
+ * Unfortunately, regardless of which of the above two goals one wants
+ * to achieve, service trees need first to be updated to know whether
+ * the in-service queue must be preempted. To have service trees
+ * correctly updated, the in-service queue must be expired and
+ * rescheduled, and bfqq must be scheduled too. This is one of the
+ * most costly operations (in future versions, the scheduling
+ * mechanism may be re-designed in such a way to make it possible to
+ * know whether preemption is needed without needing to update service
+ * trees). In addition, queue preemptions almost always cause random
+ * I/O, and thus loss of throughput. Because of these facts, the next
+ * function adopts the following simple scheme to avoid both costly
+ * operations and too frequent preemptions: it requests the expiration
+ * of the in-service queue (unconditionally) only for queues that need
+ * to recover a hole, or that either are weight-raised or deserve to
+ * be weight-raised.
  */
 static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
 						struct bfq_queue *bfqq,
-						bool arrived_in_time)
+						bool arrived_in_time,
+						bool wr_or_deserves_wr)
 {
 	struct bfq_entity *entity = &bfqq->entity;
 
@@ -3364,14 +3511,85 @@ static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,
 	entity->budget = max_t(unsigned long, bfqq->max_budget,
 			       bfq_serv_to_charge(bfqq->next_rq, bfqq));
 	bfq_clear_bfqq_non_blocking_wait_rq(bfqq);
-	return false;
+	return wr_or_deserves_wr;
+}
+
+static unsigned int bfq_wr_duration(struct bfq_data *bfqd)
+{
+	u64 dur;
+
+	if (bfqd->bfq_wr_max_time > 0)
+		return bfqd->bfq_wr_max_time;
+
+	dur = bfqd->RT_prod;
+	do_div(dur, bfqd->peak_rate);
+
+	/*
+	 * Limit duration between 3 and 13 seconds. Tests show that
+	 * higher values than 13 seconds often yield the opposite of
+	 * the desired result, i.e., worsen responsiveness by letting
+	 * non-interactive and non-soft-real-time applications
+	 * preserve weight raising for a too long time interval.
+	 *
+	 * On the other end, lower values than 3 seconds make it
+	 * difficult for most interactive tasks to complete their jobs
+	 * before weight-raising finishes.
+	 */
+	if (dur > msecs_to_jiffies(13000))
+		dur = msecs_to_jiffies(13000);
+	else if (dur < msecs_to_jiffies(3000))
+		dur = msecs_to_jiffies(3000);
+
+	return dur;
+}
+
+static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
+					     struct bfq_queue *bfqq,
+					     unsigned int old_wr_coeff,
+					     bool wr_or_deserves_wr,
+					     bool interactive)
+{
+	if (old_wr_coeff == 1 && wr_or_deserves_wr) {
+		/* start a weight-raising period */
+		bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+		/* update wr duration */
+		bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+
+		/*
+		 * If needed, further reduce budget to make sure it is
+		 * close to bfqq's backlog, so as to reduce the
+		 * scheduling-error component due to a too large
+		 * budget. Do not care about throughput consequences,
+		 * but only about latency. Finally, do not assign a
+		 * too small budget either, to avoid increasing
+		 * latency by causing too frequent expirations.
+		 */
+		bfqq->entity.budget = min_t(unsigned long,
+					    bfqq->entity.budget,
+					    2 * bfq_min_budget(bfqd));
+	} else if (old_wr_coeff > 1) {
+		/* update wr duration */
+		bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+	}
+}
+
+static bool bfq_bfqq_idle_for_long_time(struct bfq_data *bfqd,
+					struct bfq_queue *bfqq)
+{
+	return bfqq->dispatched == 0 &&
+		time_is_before_jiffies(
+			bfqq->budget_timeout +
+			bfqd->bfq_wr_min_idle_time);
 }
 
 static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 					     struct bfq_queue *bfqq,
-					     struct request *rq)
+					     int old_wr_coeff,
+					     struct request *rq,
+					     bool *interactive)
 {
-	bool bfqq_wants_to_preempt,
+	bool wr_or_deserves_wr,	bfqq_wants_to_preempt,
+		idle_for_long_time = bfq_bfqq_idle_for_long_time(bfqd, bfqq),
 		/*
 		 * See the comments on
 		 * bfq_bfqq_update_budg_for_activation for
@@ -3385,12 +3603,23 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 				 req_op(rq), rq->cmd_flags);
 
 	/*
-	 * Update budget and check whether bfqq may want to preempt
-	 * the in-service queue.
+	 * bfqq deserves to be weight-raised if:
+	 * - it is sync,
+	 * - it has been idle for enough time.
+	 */
+	*interactive = idle_for_long_time;
+	wr_or_deserves_wr = bfqd->low_latency &&
+		(bfqq->wr_coeff > 1 ||
+		 (bfq_bfqq_sync(bfqq) && *interactive));
+
+	/*
+	 * Using the last flag, update budget and check whether bfqq
+	 * may want to preempt the in-service queue.
 	 */
 	bfqq_wants_to_preempt =
 		bfq_bfqq_update_budg_for_activation(bfqd, bfqq,
-						    arrived_in_time);
+						    arrived_in_time,
+						    wr_or_deserves_wr);
 
 	if (!bfq_bfqq_IO_bound(bfqq)) {
 		if (arrived_in_time) {
@@ -3402,6 +3631,16 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 			bfqq->requests_within_timer = 0;
 	}
 
+	if (bfqd->low_latency) {
+		bfq_update_bfqq_wr_on_rq_arrival(bfqd, bfqq,
+						 old_wr_coeff,
+						 wr_or_deserves_wr,
+						 *interactive);
+
+		if (old_wr_coeff != bfqq->wr_coeff)
+			bfqq->entity.prio_changed = 1;
+	}
+
 	bfq_add_bfqq_busy(bfqd, bfqq);
 
 	/*
@@ -3415,6 +3654,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 	 * function bfq_bfqq_update_budg_for_activation).
 	 */
 	if (bfqd->in_service_queue && bfqq_wants_to_preempt &&
+	    bfqd->in_service_queue->wr_coeff == 1 &&
 	    next_queue_may_preempt(bfqd))
 		bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
 				false, BFQ_BFQQ_PREEMPTED);
@@ -3425,6 +3665,8 @@ static void bfq_add_request(struct request *rq)
 	struct bfq_queue *bfqq = RQ_BFQQ(rq);
 	struct bfq_data *bfqd = bfqq->bfqd;
 	struct request *next_rq, *prev;
+	unsigned int old_wr_coeff = bfqq->wr_coeff;
+	bool interactive = false;
 
 	bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
 	bfqq->queued[rq_is_sync(rq)]++;
@@ -3440,9 +3682,45 @@ static void bfq_add_request(struct request *rq)
 	bfqq->next_rq = next_rq;
 
 	if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */
-		bfq_bfqq_handle_idle_busy_switch(bfqd, bfqq, rq);
-	else if (prev != bfqq->next_rq)
-		bfq_updated_next_req(bfqd, bfqq);
+		bfq_bfqq_handle_idle_busy_switch(bfqd, bfqq, old_wr_coeff,
+						 rq, &interactive);
+	else {
+		if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
+		    time_is_before_jiffies(
+				bfqq->last_wr_start_finish +
+				bfqd->bfq_wr_min_inter_arr_async)) {
+			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+
+			bfqq->entity.prio_changed = 1;
+		}
+		if (prev != bfqq->next_rq)
+			bfq_updated_next_req(bfqd, bfqq);
+	}
+
+	/*
+	 * Assign jiffies to last_wr_start_finish in the following
+	 * cases:
+	 *
+	 * . if bfqq is not going to be weight-raised, because, for
+	 *   non weight-raised queues, last_wr_start_finish stores the
+	 *   arrival time of the last request; as of now, this piece
+	 *   of information is used only for deciding whether to
+	 *   weight-raise async queues
+	 *
+	 * . if bfqq is not weight-raised, because, if bfqq is now
+	 *   switching to weight-raised, then last_wr_start_finish
+	 *   stores the time when weight-raising starts
+	 *
+	 * . if bfqq is interactive, because, regardless of whether
+	 *   bfqq is currently weight-raised, the weight-raising
+	 *   period must start or restart (this case is considered
+	 *   separately because it is not detected by the above
+	 *   conditions, if bfqq is already weight-raised)
+	 */
+	if (bfqd->low_latency &&
+		(old_wr_coeff == 1 || bfqq->wr_coeff == 1 || interactive))
+		bfqq->last_wr_start_finish = jiffies;
 }
 
 static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
@@ -3617,6 +3895,46 @@ static void bfq_merged_requests(struct request_queue *q, struct request *rq,
 				    next->cmd_flags);
 }
 
+/* Must be called with bfqq != NULL */
+static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
+{
+	bfqq->wr_coeff = 1;
+	bfqq->wr_cur_max_time = 0;
+	/*
+	 * Trigger a weight change on the next invocation of
+	 * __bfq_entity_update_weight_prio.
+	 */
+	bfqq->entity.prio_changed = 1;
+}
+
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
+				    struct bfq_group *bfqg)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			if (bfqg->async_bfqq[i][j])
+				bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]);
+	if (bfqg->async_idle_bfqq)
+		bfq_bfqq_end_wr(bfqg->async_idle_bfqq);
+}
+
+static void bfq_end_wr(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq;
+
+	spin_lock_irq(bfqd->queue->queue_lock);
+
+	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
+		bfq_bfqq_end_wr(bfqq);
+	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
+		bfq_bfqq_end_wr(bfqq);
+	bfq_end_wr_async(bfqd);
+
+	spin_unlock_irq(bfqd->queue->queue_lock);
+}
+
 static int bfq_allow_bio_merge(struct request_queue *q, struct request *rq,
 			       struct bio *bio)
 {
@@ -3650,17 +3968,33 @@ static int bfq_allow_rq_merge(struct request_queue *q, struct request *rq,
 	return RQ_BFQQ(rq) == RQ_BFQQ(next);
 }
 
+/*
+ * Set the maximum time for the in-service queue to consume its
+ * budget. This prevents seeky processes from lowering the throughput.
+ * In practice, a time-slice service scheme is used with seeky
+ * processes.
+ */
+static void bfq_set_budget_timeout(struct bfq_data *bfqd,
+				   struct bfq_queue *bfqq)
+{
+	bfqd->last_budget_start = ktime_get();
+
+	bfqq->budget_timeout = jiffies +
+		bfqd->bfq_timeout *
+		(bfqq->entity.weight / bfqq->entity.orig_weight);
+}
+
 static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
 				       struct bfq_queue *bfqq)
 {
 	if (bfqq) {
 		bfqg_stats_update_avg_queue_size(bfqq_group(bfqq));
 		bfq_mark_bfqq_must_alloc(bfqq);
-		bfq_mark_bfqq_budget_new(bfqq);
 		bfq_clear_bfqq_fifo_expire(bfqq);
 
 		bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
 
+		bfq_set_budget_timeout(bfqd, bfqq);
 		bfq_log_bfqq(bfqd, bfqq,
 			     "set_in_service_queue, cur-budget = %d",
 			     bfqq->entity.budget);
@@ -3700,9 +4034,13 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 	 */
 	sl = bfqd->bfq_slice_idle;
 	/*
-	 * Grant only minimum idle time if the queue is seeky.
+	 * Unless the queue is being weight-raised, grant only minimum
+	 * idle time if the queue is seeky. Long idling is preserved
+	 * for a weight-raised queue, because it is needed to
+	 * guarantee the queue its reserved share of the
+	 * throughput.
 	 */
-	if (BFQQ_SEEKY(bfqq))
+	if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1)
 		sl = min_t(u64, sl, BFQ_MIN_TT);
 
 	bfqd->last_idling_start = ktime_get();
@@ -3712,27 +4050,6 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 }
 
 /*
- * Set the maximum time for the in-service queue to consume its
- * budget. This prevents seeky processes from lowering the disk
- * throughput (always guaranteed with a time slice scheme as in CFQ).
- */
-static void bfq_set_budget_timeout(struct bfq_data *bfqd)
-{
-	struct bfq_queue *bfqq = bfqd->in_service_queue;
-	unsigned int timeout_coeff = bfqq->entity.weight /
-				     bfqq->entity.orig_weight;
-
-	bfqd->last_budget_start = ktime_get();
-
-	bfq_clear_bfqq_budget_new(bfqq);
-	bfqq->budget_timeout = jiffies +
-		bfqd->bfq_timeout * timeout_coeff;
-
-	bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u",
-		jiffies_to_msecs(bfqd->bfq_timeout * timeout_coeff));
-}
-
-/*
  * In autotuning mode, max_budget is dynamically recomputed as the
  * amount of sectors transferred in timeout at the estimated peak
  * rate. This enables BFQ to utilize a full timeslice with a full
@@ -3745,6 +4062,42 @@ static unsigned long bfq_calc_max_budget(struct bfq_data *bfqd)
 		jiffies_to_msecs(bfqd->bfq_timeout)>>BFQ_RATE_SHIFT;
 }
 
+/*
+ * Update parameters related to throughput and responsiveness, as a
+ * function of the estimated peak rate. See comments on
+ * bfq_calc_max_budget(), and on T_slow and T_fast arrays.
+ */
+void update_thr_responsiveness_params(struct bfq_data *bfqd)
+{
+	int dev_type = blk_queue_nonrot(bfqd->queue);
+
+	if (bfqd->bfq_user_max_budget == 0)
+		bfqd->bfq_max_budget =
+			bfq_calc_max_budget(bfqd);
+
+	if (bfqd->device_speed == BFQ_BFQD_FAST &&
+	    bfqd->peak_rate < device_speed_thresh[dev_type]) {
+		bfqd->device_speed = BFQ_BFQD_SLOW;
+		bfqd->RT_prod = R_slow[dev_type] *
+			T_slow[dev_type];
+	} else if (bfqd->device_speed == BFQ_BFQD_SLOW &&
+		   bfqd->peak_rate > device_speed_thresh[dev_type]) {
+		bfqd->device_speed = BFQ_BFQD_FAST;
+		bfqd->RT_prod = R_fast[dev_type] *
+			T_fast[dev_type];
+	}
+
+	bfq_log(bfqd,
+"dev_type %s dev_speed_class = %s (%llu sects/sec), thresh %llu sects/sec",
+		dev_type == 0 ? "ROT" : "NONROT",
+		bfqd->device_speed == BFQ_BFQD_FAST ? "FAST" : "SLOW",
+		bfqd->device_speed == BFQ_BFQD_FAST ?
+		(USEC_PER_SEC*(u64)R_fast[dev_type])>>BFQ_RATE_SHIFT :
+		(USEC_PER_SEC*(u64)R_slow[dev_type])>>BFQ_RATE_SHIFT,
+		(USEC_PER_SEC*(u64)device_speed_thresh[dev_type])>>
+		BFQ_RATE_SHIFT);
+}
+
 void bfq_reset_rate_computation(struct bfq_data *bfqd, struct request *rq)
 {
 	if (rq != NULL) { /* new rq dispatch now, reset accordingly */
@@ -3855,9 +4208,7 @@ void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)
 	rate /= divisor; /* smoothing constant alpha = 1/divisor */
 
 	bfqd->peak_rate += rate;
-	if (bfqd->bfq_user_max_budget == 0)
-		bfqd->bfq_max_budget =
-			bfq_calc_max_budget(bfqd);
+	update_thr_responsiveness_params(bfqd);
 
 reset_computation:
 	bfq_reset_rate_computation(bfqd, rq);
@@ -4003,9 +4354,18 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	__bfq_bfqd_reset_in_service(bfqd);
 
-	if (RB_EMPTY_ROOT(&bfqq->sort_list))
+	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		if (bfqq->dispatched == 0)
+			/*
+			 * Overloading budget_timeout field to store
+			 * the time at which the queue remains with no
+			 * backlog and no outstanding request; used by
+			 * the weight-raising mechanism.
+			 */
+			bfqq->budget_timeout = jiffies;
+
 		bfq_del_bfqq_busy(bfqd, bfqq, 1);
-	else
+	} else
 		bfq_activate_bfqq(bfqd, bfqq);
 }
 
@@ -4025,9 +4385,18 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
 	struct request *next_rq;
 	int budget, min_budget;
 
-	budget = bfqq->max_budget;
 	min_budget = bfq_min_budget(bfqd);
 
+	if (bfqq->wr_coeff == 1)
+		budget = bfqq->max_budget;
+	else /*
+	      * Use a constant, low budget for weight-raised queues,
+	      * to help achieve a low latency. Keep it slightly higher
+	      * than the minimum possible budget, to cause slightly
+	      * fewer expirations.
+	      */
+		budget = 2 * min_budget;
+
 	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %d, budg left %d",
 		bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
 	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %d, min budg %d",
@@ -4035,7 +4404,7 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
 	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
 		bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
 
-	if (bfq_bfqq_sync(bfqq)) {
+	if (bfq_bfqq_sync(bfqq) && bfqq->wr_coeff == 1) {
 		switch (reason) {
 		/*
 		 * Caveat: in all the following cases we trade latency
@@ -4134,7 +4503,7 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
 		default:
 			return;
 		}
-	} else
+	} else if (!bfq_bfqq_sync(bfqq))
 		/*
 		 * Async queues get always the maximum possible
 		 * budget, as for them we do not care about latency
@@ -4321,15 +4690,19 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 	 * bandwidth, and not time, distribution with little unlucky
 	 * or quasi-sequential processes.
 	 */
-	if (slow ||
-	    (reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
-	     bfq_bfqq_budget_left(bfqq) >=  entity->budget / 3))
+	if (bfqq->wr_coeff == 1 &&
+	    (slow ||
+	     (reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
+	      bfq_bfqq_budget_left(bfqq) >=  entity->budget / 3)))
 		bfq_bfqq_charge_time(bfqd, bfqq, delta);
 
 	if (reason == BFQ_BFQQ_TOO_IDLE &&
 	    entity->service <= 2 * entity->budget / 10)
 		bfq_clear_bfqq_IO_bound(bfqq);
 
+	if (bfqd->low_latency && bfqq->wr_coeff == 1)
+		bfqq->last_wr_start_finish = jiffies;
+
 	bfq_log_bfqq(bfqd, bfqq,
 		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -4354,10 +4727,7 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
  */
 static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
 {
-	if (bfq_bfqq_budget_new(bfqq) ||
-	    time_before(jiffies, bfqq->budget_timeout))
-		return false;
-	return true;
+	return time_is_before_eq_jiffies(bfqq->budget_timeout);
 }
 
 /*
@@ -4384,19 +4754,40 @@ static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
 
 /*
  * For a queue that becomes empty, device idling is allowed only if
- * this function returns true for the queue. And this function returns
- * true only if idling is beneficial for throughput.
+ * this function returns true for the queue. As a consequence, since
+ * device idling plays a critical role in both throughput boosting and
+ * service guarantees, the return value of this function plays a
+ * critical role in both these aspects as well.
+ *
+ * In a nutshell, this function returns true only if idling is
+ * beneficial for throughput or, even if detrimental for throughput,
+ * idling is however necessary to preserve service guarantees (low
+ * latency, desired throughput distribution, ...). In particular, on
+ * NCQ-capable devices, this function tries to return false, so as to
+ * help keep the drives' internal queues full, whenever this helps the
+ * device boost the throughput without causing any service-guarantee
+ * issue.
+ *
+ * In more detail, the return value of this function is obtained by,
+ * first, computing a number of boolean variables that take into
+ * account throughput and service-guarantee issues, and, then,
+ * combining these variables in a logical expression. Most of the
+ * issues taken into account are not trivial. We discuss these issues
+ * individually while introducing the variables.
  */
 static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 {
 	struct bfq_data *bfqd = bfqq->bfqd;
-	bool idling_boosts_thr;
+	bool idling_boosts_thr, asymmetric_scenario;
 
 	if (bfqd->strict_guarantees)
 		return true;
 
 	/*
-	 * The value of the next variable is computed considering that
+	 * The next variable takes into account the cases where idling
+	 * boosts the throughput.
+	 *
+	 * The value of the variable is computed considering that
 	 * idling is usually beneficial for the throughput if:
 	 * (a) the device is not NCQ-capable, or
 	 * (b) regardless of the presence of NCQ, the request pattern
@@ -4410,13 +4801,80 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 	idling_boosts_thr = !bfqd->hw_tag || bfq_bfqq_IO_bound(bfqq);
 
 	/*
-	 * We have now the components we need to compute the return
-	 * value of the function, which is true only if both the
-	 * following conditions hold:
+	 * There is then a case where idling must be performed not for
+	 * throughput concerns, but to preserve service guarantees. To
+	 * introduce it, we can note that allowing the drive to
+	 * enqueue more than one request at a time, and hence
+	 * delegating de facto final scheduling decisions to the
+	 * drive's internal scheduler, causes loss of control over the
+	 * actual request service order. In particular, the critical
+	 * situation is when requests from different processes happen
+	 * to be present, at the same time, in the internal queue(s)
+	 * of the drive. In such a situation, the drive, by deciding
+	 * the service order of the internally-queued requests, does
+	 * determine also the actual throughput distribution among
+	 * these processes. But the drive typically has no notion or
+	 * concern about per-process throughput distribution, and
+	 * makes its decisions only on a per-request basis. Therefore,
+	 * the service distribution enforced by the drive's internal
+	 * scheduler is likely to coincide with the desired
+	 * device-throughput distribution only in a completely
+	 * symmetric scenario where: (i) each of these processes must
+	 * get the same throughput as the others; (ii) all these
+	 * processes have the same I/O pattern (either sequential or
+	 * random).  In fact, in such a scenario, the drive will tend
+	 * to treat the requests of each of these processes in about
+	 * the same way as the requests of the others, and thus to
+	 * provide each of these processes with about the same
+	 * throughput (which is exactly the desired throughput
+	 * distribution). In contrast, in any asymmetric scenario,
+	 * device idling is certainly needed to guarantee that bfqq
+	 * receives its assigned fraction of the device throughput
+	 * (see [1] for details).
+	 *
+	 * As for sub-condition (i), actually we check only whether
+	 * bfqq is being weight-raised. In fact, if bfqq is not being
+	 * weight-raised, we have that:
+	 * - if the process associated with bfqq is not I/O-bound, then
+	 *   it is not either latency- or throughput-critical; therefore
+	 *   idling is not needed for bfqq;
+	 * - if the process associated with bfqq is I/O-bound, then
+	 *   idling is already granted with bfqq (see the comments on
+	 *   idling_boosts_thr).
+	 *
+	 * We do not check sub-condition (ii) at all, i.e., the next
+	 * variable is true if and only if bfqq is being
+	 * weight-raised. We do not need to control sub-condition (ii)
+	 * for the following reason:
+	 * - if bfqq is being weight-raised, then idling is already
+	 *   guaranteed to bfqq by sub-condition (i);
+	 * - if bfqq is not being weight-raised, then idling is
+	 *   already guaranteed to bfqq (only) if it matters, i.e., if
+	 *   bfqq is associated with a currently I/O-bound process (see
+	 *   the above comment on sub-condition (i)).
+	 *
+	 * As a side note, it is worth considering that the above
+	 * device-idling countermeasures may however fail in the
+	 * following unlucky scenario: if idling is (correctly)
+	 * disabled in a time period during which the symmetry
+	 * sub-condition holds, and hence the device is allowed to
+	 * enqueue many requests, but at some later point in time some
+	 * sub-condition ceases to hold, then it may become impossible
+	 * to let requests be served in the desired order until all
+	 * the requests already queued in the device have been served.
+	 */
+	asymmetric_scenario = bfqq->wr_coeff > 1;
+
+	/*
+	 * We now have all the components needed to compute the return
+	 * value of the function, which is true only if both the following
+	 * conditions hold:
 	 * 1) bfqq is sync, because idling make sense only for sync queues;
-	 * 2) idling boosts the throughput.
+	 * 2) idling either boosts the throughput (without issues), or
+	 *    is necessary to preserve service guarantees.
 	 */
-	return bfq_bfqq_sync(bfqq) && idling_boosts_thr;
+	return bfq_bfqq_sync(bfqq) &&
+		(idling_boosts_thr || asymmetric_scenario);
 }
 
 /*
@@ -4519,6 +4977,43 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
 	return bfqq;
 }
 
+static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */
+		bfq_log_bfqq(bfqd, bfqq,
+			"raising period dur %u/%u msec, old coeff %u, w %d(%d)",
+			jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+			jiffies_to_msecs(bfqq->wr_cur_max_time),
+			bfqq->wr_coeff,
+			bfqq->entity.weight, bfqq->entity.orig_weight);
+
+		if (entity->prio_changed)
+			bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
+
+		/*
+		 * If too much time has elapsed from the beginning of
+		 * this weight-raising period, then end weight
+		 * raising.
+		 */
+		if (time_is_before_jiffies(bfqq->last_wr_start_finish +
+					   bfqq->wr_cur_max_time)) {
+			bfqq->last_wr_start_finish = jiffies;
+			bfq_log_bfqq(bfqd, bfqq,
+				     "wrais ending at %lu, rais_max_time %u",
+				     bfqq->last_wr_start_finish,
+				     jiffies_to_msecs(bfqq->wr_cur_max_time));
+			bfq_bfqq_end_wr(bfqq);
+		}
+	}
+	/* Update weight both if it must be raised and if it must be lowered */
+	if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))
+		__bfq_entity_update_weight_prio(
+			bfq_entity_service_tree(entity),
+			entity);
+}
+
 /*
  * Dispatch one request from bfqq, moving it to the request queue
  * dispatch list.
@@ -4565,6 +5060,19 @@ static int bfq_dispatch_request(struct bfq_data *bfqd,
 	bfq_bfqq_served(bfqq, service_to_charge);
 	bfq_dispatch_insert(bfqd->queue, rq);
 
+	/*
+	 * If weight raising has to terminate for bfqq, then the next
+	 * function causes an immediate update of bfqq's weight,
+	 * without waiting for the next activation. As a consequence,
+	 * on expiration, bfqq will be timestamped as if it had never
+	 * been weight-raised during this service slot, even if it has
+	 * received part or even most of the service as a
+	 * weight-raised queue. This inflates bfqq's timestamps, which
+	 * is beneficial, as bfqq is then more willing to release the
+	 * device immediately to other, possibly weight-raised, queues.
+	 */
+	bfq_update_wr_data(bfqd, bfqq);
+
 	bfq_log_bfqq(bfqd, bfqq,
 			"dispatched %u sec req (%llu), budg left %d",
 			blk_rq_sectors(rq),
@@ -4828,6 +5336,9 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
 	bfqq->budget_timeout = bfq_smallest_from_now();
 
+	bfqq->wr_coeff = 1;
+	bfqq->last_wr_start_finish = bfq_smallest_from_now();
+
 	/* first request is almost certainly seeky */
 	bfqq->seek_history = 1;
 }
@@ -4954,7 +5465,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
 		enable_idle = 0;
 	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
-		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle)
+		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
+			bfqq->wr_coeff == 1)
 			enable_idle = 0;
 		else
 			enable_idle = 1;
@@ -5100,6 +5612,16 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 				     rq_io_start_time_ns(rq), req_op(rq),
 				     rq->cmd_flags);
 
+	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
+		/*
+		 * Set budget_timeout (which we overload to store the
+		 * time at which the queue remains with no backlog and
+		 * no outstanding request; used by the weight-raising
+		 * mechanism).
+		 */
+		bfqq->budget_timeout = jiffies;
+	}
+
 	now_ns = ktime_get_ns();
 
 	RQ_BIC(rq)->ttime.last_end_request = now_ns;
@@ -5137,10 +5659,7 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	 * or if we want to idle in case it has no pending requests.
 	 */
 	if (bfqd->in_service_queue == bfqq) {
-		if (bfq_bfqq_budget_new(bfqq))
-			bfq_set_budget_timeout(bfqd);
-
-		if (bfq_bfqq_must_idle(bfqq)) {
+		if (bfqq->dispatched == 0 && bfq_bfqq_must_idle(bfqq)) {
 			bfq_arm_slice_timer(bfqd);
 			goto out;
 		} else if (bfq_may_expire_for_budg_timeout(bfqq))
@@ -5481,6 +6000,26 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 
 	bfqd->bfq_requests_within_timer = 120;
 
+	bfqd->low_latency = true;
+
+	/*
+	 * Trade-off between responsiveness and fairness.
+	 */
+	bfqd->bfq_wr_coeff = 30;
+	bfqd->bfq_wr_max_time = 0;
+	bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
+	bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
+
+	/*
+	 * Begin by assuming, optimistically, that the device is a
+	 * high-speed one, and that its peak rate is equal to 2/3 of
+	 * the highest reference rate.
+	 */
+	bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] *
+			T_fast[blk_queue_nonrot(bfqd->queue)];
+	bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)] * 2 / 3;
+	bfqd->device_speed = BFQ_BFQD_FAST;
+
 	return 0;
 
 out_free:
@@ -5519,6 +6058,15 @@ static ssize_t bfq_var_store(unsigned long *var, const char *page,
 	return count;
 }
 
+static ssize_t bfq_wr_max_time_show(struct elevator_queue *e, char *page)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+
+	return sprintf(page, "%d\n", bfqd->bfq_wr_max_time > 0 ?
+		       jiffies_to_msecs(bfqd->bfq_wr_max_time) :
+		       jiffies_to_msecs(bfq_wr_duration(bfqd)));
+}
+
 static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
 {
 	struct bfq_queue *bfqq;
@@ -5533,19 +6081,29 @@ static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
 	num_char += sprintf(page + num_char, "Active:\n");
 	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
 		num_char += sprintf(page + num_char,
-				    "pid%d: weight %hu, nr_queued %d %d\n",
+				    "pid%d: weight %hu, nr_queued %d %d, ",
 				    bfqq->pid,
 				    bfqq->entity.weight,
 				    bfqq->queued[0],
 				    bfqq->queued[1]);
+		num_char += sprintf(page + num_char,
+				    "dur %d/%u\n",
+				    jiffies_to_msecs(
+					    jiffies -
+					    bfqq->last_wr_start_finish),
+				    jiffies_to_msecs(bfqq->wr_cur_max_time));
 	}
 
 	num_char += sprintf(page + num_char, "Idle:\n");
 	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
 		num_char += sprintf(page + num_char,
-				    "pid%d: weight %hu\n",
+				    "pid%d: weight %hu, dur %d/%u\n",
 				    bfqq->pid,
-				    bfqq->entity.weight);
+				    bfqq->entity.weight,
+				    jiffies_to_msecs(
+					    jiffies -
+					    bfqq->last_wr_start_finish),
+				    jiffies_to_msecs(bfqq->wr_cur_max_time));
 	}
 
 	spin_unlock_irq(bfqd->queue->queue_lock);
@@ -5572,6 +6130,11 @@ SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 2);
 SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
 SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout, 1);
 SHOW_FUNCTION(bfq_strict_guarantees_show, bfqd->strict_guarantees, 0);
+SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
+SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
+SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
+SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
+	1);
 #undef SHOW_FUNCTION
 
 #define USEC_SHOW_FUNCTION(__FUNC, __VAR)				\
@@ -5612,6 +6175,12 @@ STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
 STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
 		INT_MAX, 0);
 STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2);
+STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
+STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
+		&bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
 #undef STORE_FUNCTION
 
 #define USEC_STORE_FUNCTION(__FUNC, __PTR, MIN, MAX)			\
@@ -5699,6 +6268,22 @@ static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e,
 	return ret;
 }
 
+static ssize_t bfq_low_latency_store(struct elevator_queue *e,
+				     const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data > 1)
+		__data = 1;
+	if (__data == 0 && bfqd->low_latency != 0)
+		bfq_end_wr(bfqd);
+	bfqd->low_latency = __data;
+
+	return ret;
+}
+
 #define BFQ_ATTR(name) \
 	__ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
 
@@ -5712,6 +6297,11 @@ static struct elv_fs_entry bfq_attrs[] = {
 	BFQ_ATTR(max_budget),
 	BFQ_ATTR(timeout_sync),
 	BFQ_ATTR(strict_guarantees),
+	BFQ_ATTR(low_latency),
+	BFQ_ATTR(wr_coeff),
+	BFQ_ATTR(wr_max_time),
+	BFQ_ATTR(wr_min_idle_time),
+	BFQ_ATTR(wr_min_inter_arr_async),
 	BFQ_ATTR(weights),
 	__ATTR_NULL
 };
@@ -5769,7 +6359,7 @@ static struct blkcg_policy blkcg_policy_bfq = {
 static int __init bfq_init(void)
 {
 	int ret;
-	char msg[50] = "BFQ I/O-scheduler: v0";
+	char msg[50] = "BFQ I/O-scheduler: v1";
 
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 	ret = blkcg_policy_register(&blkcg_policy_bfq);
@@ -5781,6 +6371,39 @@ static int __init bfq_init(void)
 	if (bfq_slab_setup())
 		goto err_pol_unreg;
 
+	/*
+	 * Times to load large popular applications for the typical
+	 * systems installed on the reference devices (see the
+	 * comments before the definitions of the next two
+	 * arrays). Actually, we use slightly lower values, as the
+	 * estimated peak rate tends to be smaller than the actual
+	 * peak rate.  The reason for this last fact is that estimates
+	 * are computed over much shorter time intervals than the long
+	 * intervals typically used for benchmarking. Why? First, to
+	 * adapt more quickly to variations. Second, because an I/O
+	 * scheduler cannot rely on a peak-rate-evaluation workload to
+	 * be run for a long time.
+	 */
+	T_slow[0] = msecs_to_jiffies(3500); /* actually 4 sec */
+	T_slow[1] = msecs_to_jiffies(1000); /* actually 1.5 sec */
+	T_fast[0] = msecs_to_jiffies(7000); /* actually 8 sec */
+	T_fast[1] = msecs_to_jiffies(2500); /* actually 3 sec */
+
+	/*
+	 * Thresholds that determine the switch between speed classes
+	 * (see the comments before the definition of the array
+	 * device_speed_thresh). These thresholds are biased towards
+	 * transitions to the fast class. This is safer than the
+	 * opposite bias. In fact, a wrong transition to the slow
+	 * class results in short weight-raising periods, because the
+	 * speed of the device then tends to be higher than the
+	 * reference peak rate. On the opposite end, a wrong
+	 * transition to the fast class tends to lengthen
+	 * weight-raising periods, for the opposite reason.
+	 */
+	device_speed_thresh[0] = (4 * R_slow[0]) / 3;
+	device_speed_thresh[1] = (4 * R_slow[1]) / 3;
+
 	ret = elv_register(&iosched_bfq);
 	if (ret)
 		goto err_pol_unreg;
-- 
2.10.0

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 07/14] block, bfq: reduce I/O latency for soft real-time applications
  2016-10-26  9:27 [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Paolo Valente
                   ` (5 preceding siblings ...)
  2016-10-26  9:27 ` [PATCH 06/14] block, bfq: improve responsiveness Paolo Valente
@ 2016-10-26  9:28 ` Paolo Valente
  2016-10-26  9:28 ` [PATCH 08/14] block, bfq: preserve a low latency also with NCQ-capable drives Paolo Valente
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-26  9:28 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	hare, arnd, bart.vanassche, grant.likely, jack, James.Bottomley,
	Paolo Valente, Arianna Avanzini

To guarantee a low latency also to the I/O requests issued by soft
real-time applications, this patch introduces a further heuristic,
which weight-raises (in the sense explained in the previous patch)
also the queues associated with applications deemed soft real-time.

To be deemed soft real-time, an application must meet two
requirements.  First, the application must not require an average
bandwidth higher than the approximate bandwidth required to play back
or record a compressed high-definition video. Second, the request
pattern of the application must be isochronous, i.e., after issuing a
request or a batch of requests, the application must stop issuing new
requests until all its pending requests have been completed. After
that, the application may issue a new batch, and so on.

As for the second requirement, it is critical to also require that,
after all the pending requests of the application have been completed,
an adequate minimum amount of time elapses before the application
starts issuing new requests. This also prevents greedy (i.e.,
I/O-bound) applications from being incorrectly deemed, occasionally,
as soft real-time. In fact, if *any amount of time* is fine, then even
a greedy application may, paradoxically, meet both the above
requirements, if: (1) the application performs random I/O and/or the
device is slow, and (2) the CPU load is high. The reason is the
following.  First, if condition (1) is true, then, during the service
of the application, the throughput may be low enough to let the
application meet the bandwidth requirement.  Second, if condition (2)
is true as well, then the application may occasionally behave in an
apparently isochronous way, because it may simply stop issuing
requests while the CPUs are busy serving other processes.

To address this issue, the heuristic leverages the simple fact that
greedy applications issue *all* their requests as quickly as they can,
whereas soft real-time applications spend some time processing data
after each batch of requests is completed. In particular, the
heuristic works as follows. First, according to the above isochrony
requirement, the heuristic checks whether an application may be soft
real-time, thereby giving the application the opportunity to be
deemed as such, only when both the following conditions hold: 1) the
queue associated with the application has expired and is empty, 2)
the application has no outstanding requests.
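
The following fragment is only a compact model of this trigger, with
invented names; it just restates the two conditions above in code
form:

  #include <stdbool.h>

  /* toy model of the instant t_c at which the heuristic may fire */
  struct srt_probe {
          bool just_expired;       /* the queue has just been expired */
          bool queue_empty;        /* no queued requests are left */
          unsigned int dispatched; /* requests still outstanding on the device */
  };

  static bool may_evaluate_soft_rt(const struct srt_probe *p)
  {
          return p->just_expired && p->queue_empty && p->dispatched == 0;
  }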

Suppose that both conditions hold at time, say, t_c and that the
application issues its next request at time, say, t_i. At time t_c the
heuristic computes the next time instant, called soft_rt_next_start in
the code, such that, only if t_i >= soft_rt_next_start, both the
following conditions will hold when the application issues its next
request: 1) the application will meet the above bandwidth requirement,
2) a given minimum time interval, say Delta, will have elapsed from
time t_c (so as to filter out greedy applications).
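
The standalone fragment below is a simplified model of that
computation (invented names, milliseconds for time, sectors for
service); the patch itself expresses the bandwidth threshold in
sectors/sec (bfq_wr_max_softrt_rate) and adds further refinements, so
treat this only as an illustration of the max-of-two-bounds idea:

  #include <stdint.h>

  /*
   * soft_rt_next_start is the later of two instants: the instant at
   * which the average bandwidth consumed since the queue last became
   * backlogged drops below the soft real-time threshold, and t_c
   * plus the minimum pause Delta that filters out greedy
   * applications.
   */
  static uint64_t soft_rt_next_start_ms(uint64_t t_c_ms,
                                        uint64_t backlogged_since_ms,
                                        uint64_t service_sectors,
                                        uint64_t max_rate_sectors_per_s,
                                        uint64_t delta_ms)
  {
          uint64_t bw_ok_ms = backlogged_since_ms +
                  service_sectors * 1000 / max_rate_sectors_per_s;
          uint64_t min_pause_ms = t_c_ms + delta_ms;

          return bw_ok_ms > min_pause_ms ? bw_ok_ms : min_pause_ms;
  }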

The current value of Delta is slightly higher than the value that
we have found, experimentally, to be adequate on a real,
general-purpose machine. In particular, we had to increase Delta to
make the filter quite precise also in slower, embedded systems, and in
KVM/QEMU virtual machines (details in the comments on the code).

If the application actually issues its next request after time
soft_rt_next_start, then its associated queue will be weight-raised
for a relatively short time interval. If, during this time interval,
the application proves again to meet the bandwidth and isochrony
requirements, then the end of the weight-raising period for the queue
is moved forward, and so on. Note that an application whose associated
queue never happens to be empty when it expires will never have the
opportunity to be deemed as soft real-time.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/Kconfig.iosched |   2 +-
 block/bfq-iosched.c   | 359 ++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 337 insertions(+), 24 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index ecc6ca2..be242aa 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -46,7 +46,7 @@ config IOSCHED_BFQ
 	  The BFQ I/O scheduler distributes bandwidth among all
 	  processes according to their weights, regardless of the
 	  device parameters and with any workload. It also guarantees
-	  a low latency to interactive applications.
+	  a low latency to interactive and soft real-time applications.
 
 config BFQ_GROUP_IOSCHED
 	bool "BFQ hierarchical scheduling support"
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index e5a92fa..d9b0900 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -34,9 +34,10 @@
  * guarantee a low latency to non-I/O bound processes (the latter
  * often belong to time-sensitive applications).
  *
- * Even better for latency, BFQ explicitly privileges the I/O of
- * interactive applications, thereby providing these applications with
- * a very low latency.
+ * Even better for latency, BFQ explicitly privileges the I/O of two
+ * classes of time-sensitive applications: interactive and soft
+ * real-time. This feature enables BFQ to provide applications in
+ * these classes with a very low latency.
  *
  * With respect to the version of BFQ presented in [1], and in the
  * papers cited therein, this implementation adds a hierarchical
@@ -93,6 +94,13 @@
 #define BFQ_DEFAULT_GRP_IOPRIO	0
 #define BFQ_DEFAULT_GRP_CLASS	IOPRIO_CLASS_BE
 
+/*
+ * Soft real-time applications are far more latency-sensitive than
+ * interactive ones. Over-raise the weight of the former to
+ * privilege them over the latter.
+ */
+#define BFQ_SOFTRT_WEIGHT_FACTOR	100
+
 struct bfq_entity;
 
 /**
@@ -288,6 +296,14 @@ struct bfq_queue {
 	/* current maximum weight-raising time for this queue */
 	unsigned long wr_cur_max_time;
 	/*
+	 * Minimum time instant such that, only if a new request is
+	 * enqueued after this time instant in an idle @bfq_queue with
+	 * no outstanding requests, then the task associated with the
+	 * queue is deemed as soft real-time (see the comments on
+	 * the function bfq_bfqq_softrt_next_start())
+	 */
+	unsigned long soft_rt_next_start;
+	/*
 	 * Start time of the current weight-raising period if
 	 * the @bfq-queue is being weight-raised, otherwise
 	 * finish time of the last weight-raising period.
@@ -295,6 +311,20 @@ struct bfq_queue {
 	unsigned long last_wr_start_finish;
 	/* factor by which the weight of this queue is multiplied */
 	unsigned int wr_coeff;
+	/*
+	 * Time of the last transition of the @bfq_queue from idle to
+	 * backlogged.
+	 */
+	unsigned long last_idle_bklogged;
+	/*
+	 * Cumulative service received from the @bfq_queue since the
+	 * last transition from idle to backlogged.
+	 */
+	unsigned long service_from_backlogged;
+	/*
+	 * Value of wr start time when switching to soft rt
+	 */
+	unsigned long wr_start_at_switch_to_srt;
 };
 
 /**
@@ -468,6 +498,9 @@ struct bfq_data {
 	unsigned int bfq_wr_coeff;
 	/* maximum duration of a weight-raising period (jiffies) */
 	unsigned int bfq_wr_max_time;
+
+	/* Maximum weight-raising duration for soft real-time processes */
+	unsigned int bfq_wr_rt_max_time;
 	/*
 	 * Minimum idle period after which weight-raising may be
 	 * reactivated for a queue (in jiffies).
@@ -479,6 +512,9 @@ struct bfq_data {
 	 * queue (in jiffies).
 	 */
 	unsigned long bfq_wr_min_inter_arr_async;
+
+	/* Max service-rate for a soft real-time queue, in sectors/sec */
+	unsigned int bfq_wr_max_softrt_rate;
 	/*
 	 * Cached value of the product R*T, used for computing the
 	 * maximum duration of weight raising automatically.
@@ -507,6 +543,10 @@ enum bfqq_state_flags {
 					 * having consumed at most 2/10 of
 					 * its budget
 					 */
+	BFQ_BFQQ_FLAG_softrt_update,	/*
+					 * may need softrt-next-start
+					 * update
+					 */
 };
 
 #define BFQ_BFQQ_FNS(name)						\
@@ -531,6 +571,7 @@ BFQ_BFQQ_FNS(fifo_expire);
 BFQ_BFQQ_FNS(idle_window);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(IO_bound);
+BFQ_BFQQ_FNS(softrt_update);
 #undef BFQ_BFQQ_FNS
 
 /* Logging facilities. */
@@ -3547,13 +3588,21 @@ static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
 					     struct bfq_queue *bfqq,
 					     unsigned int old_wr_coeff,
 					     bool wr_or_deserves_wr,
-					     bool interactive)
+					     bool interactive,
+					     bool soft_rt)
 {
 	if (old_wr_coeff == 1 && wr_or_deserves_wr) {
 		/* start a weight-raising period */
-		bfqq->wr_coeff = bfqd->bfq_wr_coeff;
-		/* update wr duration */
-		bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+		if (interactive) {
+			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+		} else {
+			bfqq->wr_start_at_switch_to_srt = jiffies;
+			bfqq->wr_coeff = bfqd->bfq_wr_coeff *
+				BFQ_SOFTRT_WEIGHT_FACTOR;
+			bfqq->wr_cur_max_time =
+				bfqd->bfq_wr_rt_max_time;
+		}
 
 		/*
 		 * If needed, further reduce budget to make sure it is
@@ -3568,8 +3617,51 @@ static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,
 					    bfqq->entity.budget,
 					    2 * bfq_min_budget(bfqd));
 	} else if (old_wr_coeff > 1) {
-		/* update wr duration */
-		bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+		if (interactive) { /* update wr coeff and duration */
+			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+		} else if (soft_rt) {
+			/*
+			 * The application is now or still meeting the
+			 * requirements for being deemed soft rt.  We
+			 * can then correctly and safely (re)charge
+			 * the weight-raising duration for the
+			 * application with the weight-raising
+			 * duration for soft rt applications.
+			 *
+			 * In particular, doing this recharge now, i.e.,
+			 * before the weight-raising period for the
+			 * application finishes, reduces the probability
+			 * of the following negative scenario:
+			 * 1) the weight of a soft rt application is
+			 *    raised at startup (as for any newly
+			 *    created application),
+			 * 2) since the application is not interactive,
+			 *    at a certain time weight-raising is
+			 *    stopped for the application,
+			 * 3) at that time the application happens to
+			 *    still have pending requests, and hence
+			 *    is destined to not have a chance to be
+			 *    deemed soft rt before these requests are
+			 *    completed (see the comments to the
+			 *    function bfq_bfqq_softrt_next_start()
+			 *    for details on soft rt detection),
+			 * 4) these pending requests experience a high
+			 *    latency because the application is not
+			 *    weight-raised while they are pending.
+			 */
+			if (bfqq->wr_cur_max_time !=
+				bfqd->bfq_wr_rt_max_time) {
+				bfqq->wr_start_at_switch_to_srt =
+					bfqq->last_wr_start_finish;
+
+				bfqq->wr_cur_max_time =
+					bfqd->bfq_wr_rt_max_time;
+				bfqq->wr_coeff = bfqd->bfq_wr_coeff *
+					BFQ_SOFTRT_WEIGHT_FACTOR;
+			}
+			bfqq->last_wr_start_finish = jiffies;
+		}
 	}
 }
 
@@ -3588,7 +3680,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 					     struct request *rq,
 					     bool *interactive)
 {
-	bool wr_or_deserves_wr,	bfqq_wants_to_preempt,
+	bool soft_rt, wr_or_deserves_wr, bfqq_wants_to_preempt,
 		idle_for_long_time = bfq_bfqq_idle_for_long_time(bfqd, bfqq),
 		/*
 		 * See the comments on
@@ -3605,12 +3697,14 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 	/*
 	 * bfqq deserves to be weight-raised if:
 	 * - it is sync,
-	 * - it has been idle for enough time.
+	 * - it has been idle for enough time or is soft real-time.
 	 */
+	soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+		time_is_before_jiffies(bfqq->soft_rt_next_start);
 	*interactive = idle_for_long_time;
 	wr_or_deserves_wr = bfqd->low_latency &&
 		(bfqq->wr_coeff > 1 ||
-		 (bfq_bfqq_sync(bfqq) && *interactive));
+		 (bfq_bfqq_sync(bfqq) && (*interactive || soft_rt)));
 
 	/*
 	 * Using the last flag, update budget and check whether bfqq
@@ -3635,12 +3729,17 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 		bfq_update_bfqq_wr_on_rq_arrival(bfqd, bfqq,
 						 old_wr_coeff,
 						 wr_or_deserves_wr,
-						 *interactive);
+						 *interactive,
+						 soft_rt);
 
 		if (old_wr_coeff != bfqq->wr_coeff)
 			bfqq->entity.prio_changed = 1;
 	}
 
+	bfqq->last_idle_bklogged = jiffies;
+	bfqq->service_from_backlogged = 0;
+	bfq_clear_bfqq_softrt_update(bfqq);
+
 	bfq_add_bfqq_busy(bfqd, bfqq);
 
 	/*
@@ -3654,7 +3753,7 @@ static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,
 	 * function bfq_bfqq_update_budg_for_activation).
 	 */
 	if (bfqd->in_service_queue && bfqq_wants_to_preempt &&
-	    bfqd->in_service_queue->wr_coeff == 1 &&
+	    bfqd->in_service_queue->wr_coeff < bfqq->wr_coeff &&
 	    next_queue_may_preempt(bfqd))
 		bfq_bfqq_expire(bfqd, bfqd->in_service_queue,
 				false, BFQ_BFQQ_PREEMPTED);
@@ -3717,6 +3816,12 @@ static void bfq_add_request(struct request *rq)
 	 *   period must start or restart (this case is considered
 	 *   separately because it is not detected by the above
 	 *   conditions, if bfqq is already weight-raised)
+	 *
+	 * last_wr_start_finish has to be updated also if bfqq is soft
+	 * real-time, because the weight-raising period is constantly
+	 * restarted on idle-to-busy transitions for these queues, but
+	 * this is already done in bfq_bfqq_handle_idle_busy_switch if
+	 * needed.
 	 */
 	if (bfqd->low_latency &&
 		(old_wr_coeff == 1 || bfqq->wr_coeff == 1 || interactive))
@@ -3900,6 +4005,7 @@ static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
 {
 	bfqq->wr_coeff = 1;
 	bfqq->wr_cur_max_time = 0;
+	bfqq->last_wr_start_finish = jiffies;
 	/*
 	 * Trigger a weight change on the next invocation of
 	 * __bfq_entity_update_weight_prio.
@@ -3977,11 +4083,17 @@ static int bfq_allow_rq_merge(struct request_queue *q, struct request *rq,
 static void bfq_set_budget_timeout(struct bfq_data *bfqd,
 				   struct bfq_queue *bfqq)
 {
+	unsigned int timeout_coeff;
+
+	if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
+		timeout_coeff = 1;
+	else
+		timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
+
 	bfqd->last_budget_start = ktime_get();
 
 	bfqq->budget_timeout = jiffies +
-		bfqd->bfq_timeout *
-		(bfqq->entity.weight / bfqq->entity.orig_weight);
+		bfqd->bfq_timeout * timeout_coeff;
 }
 
 static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
@@ -3994,6 +4106,42 @@ static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
 
 		bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
 
+		if (time_is_before_jiffies(bfqq->last_wr_start_finish) &&
+		    bfqq->wr_coeff > 1 &&
+		    bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
+		    time_is_before_jiffies(bfqq->budget_timeout)) {
+			/*
+			 * For soft real-time queues, move the start
+			 * of the weight-raising period forward by the
+			 * time the queue has not received any
+			 * service. Otherwise, a relatively long
+			 * service delay is likely to cause the
+			 * weight-raising period of the queue to end,
+			 * because of the short duration of the
+			 * weight-raising period of a soft real-time
+			 * queue.  It is worth noting that this move
+			 * is not so dangerous for the other queues,
+			 * because soft real-time queues are not
+			 * greedy.
+			 *
+			 * To not add a further variable, we use the
+			 * overloaded field budget_timeout to
+			 * determine for how long the queue has not
+			 * received service, i.e., how much time has
+			 * elapsed since the queue expired. However,
+			 * this is a little imprecise, because
+			 * budget_timeout is set to jiffies if bfqq
+			 * not only expires, but also remains with no
+			 * request.
+			 */
+			if (time_after(bfqq->budget_timeout,
+				       bfqq->last_wr_start_finish))
+				bfqq->last_wr_start_finish +=
+					jiffies - bfqq->budget_timeout;
+			else
+				bfqq->last_wr_start_finish = jiffies;
+		}
+
 		bfq_set_budget_timeout(bfqd, bfqq);
 		bfq_log_bfqq(bfqd, bfqq,
 			     "set_in_service_queue, cur-budget = %d",
@@ -4627,6 +4775,76 @@ static bool bfq_bfqq_is_slow(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 }
 
 /*
+ * To be deemed as soft real-time, an application must meet two
+ * requirements. First, the application must not require an average
+ * bandwidth higher than the approximate bandwidth required to play back or
+ * record a compressed high-definition video.
+ * The next function is invoked on the completion of the last request of a
+ * batch, to compute the next-start time instant, soft_rt_next_start, such
+ * that, if the next request of the application does not arrive before
+ * soft_rt_next_start, then the above requirement on the bandwidth is met.
+ *
+ * The second requirement is that the request pattern of the application is
+ * isochronous, i.e., that, after issuing a request or a batch of requests,
+ * the application stops issuing new requests until all its pending requests
+ * have been completed. After that, the application may issue a new batch,
+ * and so on.
+ * For this reason the next function is invoked to compute
+ * soft_rt_next_start only for applications that meet this requirement,
+ * whereas soft_rt_next_start is set to infinity for applications that do
+ * not.
+ *
+ * Unfortunately, even a greedy application may happen to behave in an
+ * isochronous way if the CPU load is high. In fact, the application may
+ * stop issuing requests while the CPUs are busy serving other processes,
+ * then restart, then stop again for a while, and so on. In addition, if
+ * the disk achieves a low enough throughput with the request pattern
+ * issued by the application (e.g., because the request pattern is random
+ * and/or the device is slow), then the application may meet the above
+ * bandwidth requirement too. To prevent such a greedy application from being
+ * deemed as soft real-time, a further rule is used in the computation of
+ * soft_rt_next_start: soft_rt_next_start must be higher than the current
+ * time plus the maximum time for which the arrival of a request is waited
+ * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle.
+ * This filters out greedy applications, as the latter issue instead their
+ * next request as soon as possible after the last one has been completed
+ * (in contrast, when a batch of requests is completed, a soft real-time
+ * application spends some time processing data).
+ *
+ * Unfortunately, the last filter may easily generate false positives if
+ * only bfqd->bfq_slice_idle is used as a reference time interval and one
+ * or both the following cases occur:
+ * 1) HZ is so low that the duration of a jiffy is comparable to or higher
+ *    than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with
+ *    HZ=100.
+ * 2) jiffies, instead of increasing at a constant rate, may stop increasing
+ *    for a while, then suddenly 'jump' by several units to recover the lost
+ *    increments. This seems to happen, e.g., inside virtual machines.
+ * To address this issue, we do not use as a reference time interval just
+ * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In
+ * particular we add the minimum number of jiffies for which the filter
+ * seems to be quite precise also in embedded systems and KVM/QEMU virtual
+ * machines.
+ */
+static unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
+						struct bfq_queue *bfqq)
+{
+	return max(bfqq->last_idle_bklogged +
+		   HZ * bfqq->service_from_backlogged /
+		   bfqd->bfq_wr_max_softrt_rate,
+		   jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + 4);
+}
+
+/*
+ * Return the farthest future time instant according to jiffies
+ * macros.
+ */
+static unsigned long bfq_greatest_from_now(void)
+{
+	return jiffies + MAX_JIFFY_OFFSET;
+}
+
+/*
  * Return the farthest past time instant according to jiffies
  * macros.
  */
@@ -4676,6 +4894,17 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 	slow = bfq_bfqq_is_slow(bfqd, bfqq, compensate, reason, &delta);
 
 	/*
+	 * Increase service_from_backlogged before next statement,
+	 * because the possible next invocation of
+	 * bfq_bfqq_charge_time would likely inflate
+	 * entity->service. In contrast, service_from_backlogged must
+	 * contain real service, to enable the soft real-time
+	 * heuristic to correctly compute the bandwidth consumed by
+	 * bfqq.
+	 */
+	bfqq->service_from_backlogged += entity->service;
+
+	/*
 	 * As above explained, charge slow (typically seeky) and
 	 * timed-out queues with the time and not the service
 	 * received, to favor sequential workloads.
@@ -4703,6 +4932,48 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 	if (bfqd->low_latency && bfqq->wr_coeff == 1)
 		bfqq->last_wr_start_finish = jiffies;
 
+	if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&
+	    RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		/*
+		 * If we get here, and there are no outstanding
+		 * requests, then the request pattern is isochronous
+		 * (see the comments on the function
+		 * bfq_bfqq_softrt_next_start()). Thus we can compute
+		 * soft_rt_next_start. If, instead, the queue still
+		 * has outstanding requests, then we have to wait for
+		 * the completion of all the outstanding requests to
+		 * discover whether the request pattern is actually
+		 * isochronous.
+		 */
+		if (bfqq->dispatched == 0)
+			bfqq->soft_rt_next_start =
+				bfq_bfqq_softrt_next_start(bfqd, bfqq);
+		else {
+			/*
+			 * The application is still waiting for the
+			 * completion of one or more requests:
+			 * prevent it from possibly being incorrectly
+			 * deemed as soft real-time by setting its
+			 * soft_rt_next_start to infinity. In fact,
+			 * without this assignment, the application
+			 * would be incorrectly deemed as soft
+			 * real-time if:
+			 * 1) it issued a new request before the
+			 *    completion of all its in-flight
+			 *    requests, and
+			 * 2) at that time, its soft_rt_next_start
+			 *    happened to be in the past.
+			 */
+			bfqq->soft_rt_next_start =
+				bfq_greatest_from_now();
+			/*
+			 * Schedule an update of soft_rt_next_start to when
+			 * the task may be discovered to be isochronous.
+			 */
+			bfq_mark_bfqq_softrt_update(bfqq);
+		}
+	}
+
 	bfq_log_bfqq(bfqd, bfqq,
 		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -4999,12 +5270,18 @@ static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 		 */
 		if (time_is_before_jiffies(bfqq->last_wr_start_finish +
 					   bfqq->wr_cur_max_time)) {
-			bfqq->last_wr_start_finish = jiffies;
-			bfq_log_bfqq(bfqd, bfqq,
-				     "wrais ending at %lu, rais_max_time %u",
-				     bfqq->last_wr_start_finish,
-				     jiffies_to_msecs(bfqq->wr_cur_max_time));
-			bfq_bfqq_end_wr(bfqq);
+			if (bfqq->wr_cur_max_time != bfqd->bfq_wr_rt_max_time ||
+			time_is_before_jiffies(bfqq->wr_start_at_switch_to_srt +
+						   bfq_wr_duration(bfqd)))
+				bfq_bfqq_end_wr(bfqq);
+			else {
+				/* switch back to interactive wr */
+				bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+				bfqq->last_wr_start_finish =
+					bfqq->wr_start_at_switch_to_srt;
+				bfqq->entity.prio_changed = 1;
+			}
 		}
 	}
 	/* Update weight both if it must be raised and if it must be lowered */
@@ -5338,6 +5615,13 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfqq->wr_coeff = 1;
 	bfqq->last_wr_start_finish = bfq_smallest_from_now();
+	bfqq->wr_start_at_switch_to_srt = bfq_smallest_from_now();
+
+	/*
+	 * Set to the value for which bfqq will not be deemed as
+	 * soft rt when it becomes backlogged.
+	 */
+	bfqq->soft_rt_next_start = bfq_greatest_from_now();
 
 	/* first request is almost certainly seeky */
 	bfqq->seek_history = 1;
@@ -5655,6 +5939,20 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	bfqd->last_completion = now_ns;
 
 	/*
+	 * If we are waiting to discover whether the request pattern
+	 * of the task associated with the queue is actually
+	 * isochronous, and both requisites for this condition to hold
+	 * are now satisfied, then compute soft_rt_next_start (see the
+	 * comments on the function bfq_bfqq_softrt_next_start()). We
+	 * schedule this delayed check when bfqq expires, if it still
+	 * has in-flight requests.
+	 */
+	if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
+	    RB_EMPTY_ROOT(&bfqq->sort_list))
+		bfqq->soft_rt_next_start =
+			bfq_bfqq_softrt_next_start(bfqd, bfqq);
+
+	/*
 	 * If this is the in-service queue, check if it needs to be expired,
 	 * or if we want to idle in case it has no pending requests.
 	 */
@@ -6006,9 +6304,16 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	 * Trade-off between responsiveness and fairness.
 	 */
 	bfqd->bfq_wr_coeff = 30;
+	bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300);
 	bfqd->bfq_wr_max_time = 0;
 	bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
 	bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
+	bfqd->bfq_wr_max_softrt_rate = 7000; /*
+					      * Approximate rate required
+					      * to playback or record a
+					      * high-definition compressed
+					      * video.
+					      */
 
 	/*
 	 * Begin by assuming, optimistically, that the device is a
@@ -6132,9 +6437,11 @@ SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout, 1);
 SHOW_FUNCTION(bfq_strict_guarantees_show, bfqd->strict_guarantees, 0);
 SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
 SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
+SHOW_FUNCTION(bfq_wr_rt_max_time_show, bfqd->bfq_wr_rt_max_time, 1);
 SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
 SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
 	1);
+SHOW_FUNCTION(bfq_wr_max_softrt_rate_show, bfqd->bfq_wr_max_softrt_rate, 0);
 #undef SHOW_FUNCTION
 
 #define USEC_SHOW_FUNCTION(__FUNC, __VAR)				\
@@ -6177,10 +6484,14 @@ STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
 STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2);
 STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
 STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_rt_max_time_store, &bfqd->bfq_wr_rt_max_time, 0, INT_MAX,
+		1);
 STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
 		INT_MAX, 1);
 STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
 		&bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_max_softrt_rate_store, &bfqd->bfq_wr_max_softrt_rate, 0,
+		INT_MAX, 0);
 #undef STORE_FUNCTION
 
 #define USEC_STORE_FUNCTION(__FUNC, __PTR, MIN, MAX)			\
@@ -6300,8 +6611,10 @@ static struct elv_fs_entry bfq_attrs[] = {
 	BFQ_ATTR(low_latency),
 	BFQ_ATTR(wr_coeff),
 	BFQ_ATTR(wr_max_time),
+	BFQ_ATTR(wr_rt_max_time),
 	BFQ_ATTR(wr_min_idle_time),
 	BFQ_ATTR(wr_min_inter_arr_async),
+	BFQ_ATTR(wr_max_softrt_rate),
 	BFQ_ATTR(weights),
 	__ATTR_NULL
 };
@@ -6359,7 +6672,7 @@ static struct blkcg_policy blkcg_policy_bfq = {
 static int __init bfq_init(void)
 {
 	int ret;
-	char msg[50] = "BFQ I/O-scheduler: v1";
+	char msg[50] = "BFQ I/O-scheduler: v2";
 
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 	ret = blkcg_policy_register(&blkcg_policy_bfq);
-- 
2.10.0

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 08/14] block, bfq: preserve a low latency also with NCQ-capable drives
  2016-10-26  9:27 [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Paolo Valente
                   ` (6 preceding siblings ...)
  2016-10-26  9:28 ` [PATCH 07/14] block, bfq: reduce I/O latency for soft real-time applications Paolo Valente
@ 2016-10-26  9:28 ` Paolo Valente
  2016-10-26  9:28 ` [PATCH 09/14] block, bfq: reduce latency during request-pool saturation Paolo Valente
  2016-10-26 10:19 ` [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Christoph Hellwig
  9 siblings, 0 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-26  9:28 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	hare, arnd, bart.vanassche, grant.likely, jack, James.Bottomley,
	Paolo Valente, Arianna Avanzini

I/O schedulers typically allow NCQ-capable drives to prefetch I/O
requests, as NCQ boosts the throughput exactly by prefetching and
internally reordering requests.

Unfortunately, as discussed in detail and shown experimentally in [1],
this may cause fairness and latency guarantees to be violated. The
main problem is that the internal scheduler of an NCQ-capable drive
may postpone the service of some unlucky (prefetched) requests as long
as it deems serving other requests more appropriate to boost the
throughput.

This patch addresses this issue by not disabling device idling for
weight-raised queues, even if the device supports NCQ. This allows BFQ
to start serving a new queue, and therefore allows the drive to
prefetch new requests, only after the idling timeout expires. At that
time, all the outstanding requests of the expired queue have almost
certainly been served.

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
							results.pdf
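
As a rough illustration only (the actual change is the one-hunk diff
below), the resulting idle-window decision can be sketched as the
following standalone snippet; all names and values here are stand-ins
for the fields tested in bfq_update_idle_window(), and the further
refinement based on think time is omitted:

#include <stdbool.h>
#include <stdio.h>

/*
 * On an NCQ-capable drive (hw_tag set), a seeky queue now loses the
 * idle window only if it is not weight-raised (wr_coeff == 1), so
 * idling, and hence service protection, is preserved for
 * weight-raised queues.
 */
static bool keep_idle_window(bool has_active_ref, unsigned long slice_idle,
			     bool hw_tag, bool seeky, unsigned int wr_coeff)
{
	if (!has_active_ref || slice_idle == 0 ||
	    (hw_tag && seeky && wr_coeff == 1))
		return false;
	return true;
}

int main(void)
{
	/* seeky, weight-raised queue on an NCQ drive: idling is kept */
	printf("%d\n", keep_idle_window(true, 8000, true, true, 30));
	/* same queue, but not weight-raised: idling is disabled */
	printf("%d\n", keep_idle_window(true, 8000, true, true, 1));
	return 0;
}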

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index d9b0900..3b11772 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5746,7 +5746,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 
 	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
 	    bfqd->bfq_slice_idle == 0 ||
-		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+		(bfqd->hw_tag && BFQQ_SEEKY(bfqq) &&
+			bfqq->wr_coeff == 1))
 		enable_idle = 0;
 	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
 		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
-- 
2.10.0

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 09/14] block, bfq: reduce latency during request-pool saturation
  2016-10-26  9:27 [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Paolo Valente
                   ` (7 preceding siblings ...)
  2016-10-26  9:28 ` [PATCH 08/14] block, bfq: preserve a low latency also with NCQ-capable drives Paolo Valente
@ 2016-10-26  9:28 ` Paolo Valente
  2016-10-26 10:19 ` [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Christoph Hellwig
  9 siblings, 0 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-26  9:28 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: linux-block, linux-kernel, ulf.hansson, linus.walleij, broonie,
	hare, arnd, bart.vanassche, grant.likely, jack, James.Bottomley,
	Paolo Valente, Arianna Avanzini

This patch introduces a heuristic that reduces latency when the
I/O-request pool is saturated. This goal is achieved by disabling
device idling, for non-weight-raised queues, when there are weight-
raised queues with pending or in-flight requests. In fact, as
explained in more detail in the comment on the function
bfq_bfqq_may_idle(), this reduces the rate at which processes
associated with non-weight-raised queues grab requests from the pool,
thereby increasing the probability that processes associated with
weight-raised queues get a request immediately (or at least soon) when
they need one. Along the same line, if there are weight-raised queues,
then this patch halves the service rate of async (write) requests for
non-weight-raised queues.
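
For clarity, the two effects can be condensed into the standalone
sketch below, which is meant to mirror the bfq_serv_to_charge() and
bfq_bfqq_may_idle() hunks that follow; the charge factor and the
values in main() are illustrative assumptions, not the kernel's:

#include <stdbool.h>
#include <stdio.h>

/*
 * When at least one weight-raised queue is busy, async requests of
 * non-weight-raised queues are charged twice the usual async amount,
 * which halves their service rate.
 */
static unsigned long serv_to_charge(unsigned long sectors, bool sync,
				    unsigned int wr_coeff, int wr_busy_queues,
				    unsigned int async_charge_factor)
{
	if (sync || wr_coeff > 1)
		return sectors;
	if (wr_busy_queues == 0)
		return sectors * async_charge_factor;
	return sectors * 2 * async_charge_factor;
}

/*
 * Idling for throughput reasons is suppressed whenever weight-raised
 * queues are busy, so that non-weight-raised queues drain the request
 * pool more slowly.
 */
static bool idling_boosts_thr_without_issues(bool idling_boosts_thr,
					     int wr_busy_queues)
{
	return idling_boosts_thr && wr_busy_queues == 0;
}

int main(void)
{
	/* async request of a plain queue while a weight-raised queue is busy */
	printf("charge = %lu sectors\n", serv_to_charge(8, false, 1, 1, 10));
	printf("idle for throughput = %d\n",
	       idling_boosts_thr_without_issues(true, 1));
	return 0;
}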

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 63 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 3b11772..46d6df3 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -378,6 +378,8 @@ struct bfq_data {
 	 * queue in service, even if it is idling).
 	 */
 	int busy_queues;
+	/* number of weight-raised busy @bfq_queues */
+	int wr_busy_queues;
 	/* number of queued requests */
 	int queued;
 	/* number of requests dispatched and waiting for completion */
@@ -2019,6 +2021,9 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfqd->busy_queues--;
 
+	if (bfqq->wr_coeff > 1)
+		bfqd->wr_busy_queues--;
+
 	bfqg_stats_update_dequeue(bfqq_group(bfqq));
 
 	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
@@ -2035,6 +2040,9 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues++;
+
+	if (bfqq->wr_coeff > 1)
+		bfqd->wr_busy_queues++;
 }
 
 #if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
@@ -3335,7 +3343,16 @@ static unsigned long bfq_serv_to_charge(struct request *rq,
 	if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1)
 		return blk_rq_sectors(rq);
 
-	return blk_rq_sectors(rq) * bfq_async_charge_factor;
+	/*
+	 * If there are no weight-raised queues, then amplify service
+	 * by just the async charge factor; otherwise amplify service
+	 * by twice the async charge factor, to further reduce latency
+	 * for weight-raised queues.
+	 */
+	if (bfqq->bfqd->wr_busy_queues == 0)
+		return blk_rq_sectors(rq) * bfq_async_charge_factor;
+
+	return blk_rq_sectors(rq) * 2 * bfq_async_charge_factor;
 }
 
 /**
@@ -3791,6 +3808,7 @@ static void bfq_add_request(struct request *rq)
 			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
 			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
 
+			bfqd->wr_busy_queues++;
 			bfqq->entity.prio_changed = 1;
 		}
 		if (prev != bfqq->next_rq)
@@ -4003,6 +4021,8 @@ static void bfq_merged_requests(struct request_queue *q, struct request *rq,
 /* Must be called with bfqq != NULL */
 static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
 {
+	if (bfq_bfqq_busy(bfqq))
+		bfqq->bfqd->wr_busy_queues--;
 	bfqq->wr_coeff = 1;
 	bfqq->wr_cur_max_time = 0;
 	bfqq->last_wr_start_finish = jiffies;
@@ -5049,7 +5069,8 @@ static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
 static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 {
 	struct bfq_data *bfqd = bfqq->bfqd;
-	bool idling_boosts_thr, asymmetric_scenario;
+	bool idling_boosts_thr, idling_boosts_thr_without_issues,
+		asymmetric_scenario;
 
 	if (bfqd->strict_guarantees)
 		return true;
@@ -5072,6 +5093,44 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 	idling_boosts_thr = !bfqd->hw_tag || bfq_bfqq_IO_bound(bfqq);
 
 	/*
+	 * The value of the next variable,
+	 * idling_boosts_thr_without_issues, is equal to that of
+	 * idling_boosts_thr, unless a special case holds. In this
+	 * special case, described below, idling may cause problems to
+	 * weight-raised queues.
+	 *
+	 * When the request pool is saturated (e.g., in the presence
+	 * of write hogs), if the processes associated with
+	 * non-weight-raised queues ask for requests at a lower rate,
+	 * then processes associated with weight-raised queues have a
+	 * higher probability to get a request from the pool
+	 * immediately (or at least soon) when they need one. Thus
+	 * they have a higher probability to actually get a fraction
+	 * of the device throughput proportional to their high
+	 * weight. This is especially true with NCQ-capable drives,
+	 * which enqueue several requests in advance, and further
+	 * reorder internally-queued requests.
+	 *
+	 * For this reason, we force to false the value of
+	 * idling_boosts_thr_without_issues if there are weight-raised
+	 * busy queues. In this case, and if bfqq is not weight-raised,
+	 * this guarantees that the device is not idled for bfqq (if,
+	 * instead, bfqq is weight-raised, then idling will be
+	 * guaranteed by another variable, see below). Combined with
+	 * the timestamping rules of BFQ (see [1] for details), this
+	 * behavior causes bfqq, and hence any sync non-weight-raised
+	 * queue, to get a lower number of requests served, and thus
+	 * to ask for a lower number of requests from the request
+	 * pool, before the busy weight-raised queues get served
+	 * again. This often mitigates starvation problems in the
+	 * presence of heavy write workloads and NCQ, thereby
+	 * guaranteeing a higher application and system responsiveness
+	 * in these hostile scenarios.
+	 */
+	idling_boosts_thr_without_issues = idling_boosts_thr &&
+		bfqd->wr_busy_queues == 0;
+
+	/*
 	 * There is then a case where idling must be performed not for
 	 * throughput concerns, but to preserve service guarantees. To
 	 * introduce it, we can note that allowing the drive to
@@ -5145,7 +5204,7 @@ static bool bfq_bfqq_may_idle(struct bfq_queue *bfqq)
 	 *    is necessary to preserve service guarantees.
 	 */
 	return bfq_bfqq_sync(bfqq) &&
-		(idling_boosts_thr || asymmetric_scenario);
+		(idling_boosts_thr_without_issues || asymmetric_scenario);
 }
 
 /*
@@ -6315,6 +6374,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 					      * high-definition compressed
 					      * video.
 					      */
+	bfqd->wr_busy_queues = 0;
 
 	/*
 	 * Begin by assuming, optimistically, that the device is a
-- 
2.10.0

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26  9:27 [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Paolo Valente
                   ` (8 preceding siblings ...)
  2016-10-26  9:28 ` [PATCH 09/14] block, bfq: reduce latency during request-pool saturation Paolo Valente
@ 2016-10-26 10:19 ` Christoph Hellwig
  2016-10-26 11:34   ` Jan Kara
  2016-10-26 12:37   ` Paolo Valente
  9 siblings, 2 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-10-26 10:19 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Tejun Heo, linux-block, linux-kernel, ulf.hansson,
	linus.walleij, broonie, hare, arnd, bart.vanassche, grant.likely,
	jack, James.Bottomley

Just as last time:

big NAK for introducing giant new infrastructure like a new I/O scheduler
for the legacy request structure.

Please direct your energy towards blk-mq instead.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26 10:19 ` [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Christoph Hellwig
@ 2016-10-26 11:34   ` Jan Kara
  2016-10-26 15:05     ` Bart Van Assche
  2016-10-26 12:37   ` Paolo Valente
  1 sibling, 1 reply; 57+ messages in thread
From: Jan Kara @ 2016-10-26 11:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Paolo Valente, Jens Axboe, Tejun Heo, linux-block, linux-kernel,
	ulf.hansson, linus.walleij, broonie, hare, arnd, bart.vanassche,
	grant.likely, jack, James.Bottomley

On Wed 26-10-16 03:19:03, Christoph Hellwig wrote:
> Just as last time:
> 
> big NAK for introducing giant new infrastructure like a new I/O scheduler
> for the legacy request structure.
> 
> Please direct your energy towards blk-mq instead.

Christoph, we will probably talk about this next week but IMO rotating
disks and SATA based SSDs are going to stay with us for another 15 years,
likely more. For them blk-mq is no win, relatively complex IO scheduling
like CFQ or BFQ does is a big win for them in some cases. So I think IO
scheduling (and thus place for something like BFQ) is going to stay with us
for quite a long time still. So are we going to add hooks in blk-mq to
support full-blown IO scheduling at least for single queue devices? Or how
else do we want to support that HW?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26 10:19 ` [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Christoph Hellwig
  2016-10-26 11:34   ` Jan Kara
@ 2016-10-26 12:37   ` Paolo Valente
  1 sibling, 0 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-26 12:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Tejun Heo, linux-block, linux-kernel, ulf.hansson,
	linus.walleij, broonie, hare, arnd, bart.vanassche, grant.likely,
	jack, James.Bottomley


> Il giorno 26 ott 2016, alle ore 12:19, Christoph Hellwig <hch@infradead.org> ha scritto:
> 
> Just as last time:
> 
> big NAK for introducing giant new infrastructure like a new I/O scheduler
> for the legacy request structure.
> 

I would fully agree, if there weren't important problems involved.
But there are.

Linux has been suffering for years from responsiveness and latency
problems, related to I/O (and I/O bandwidth fairness is still just not
available). Users are not happy about that.

BFQ apparently solves these problems in most scenarios.  Adding BFQ
would not be disruptive for any use case.  People could just try it if
they want, and check whether things get better.

IMO these problems are more important than the clear code-
maintenance issue you raise.

> Please direct your energy towards blk-mq instead.

Definitely.  I would really like to help.  To this purpose, I
have already tried to stimulate discussion, as well as to offer and
ask for help [1].

And I think that addressing these latency problems (and not only) is
even more important with blk-mq.  In fact, with blk-mq, they get
worse, as no I/O scheduler is available yet.

Thanks,
Paolo

[1] http://www.spinics.net/lists/linux-block/msg04555.html

> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26 11:34   ` Jan Kara
@ 2016-10-26 15:05     ` Bart Van Assche
  2016-10-26 15:13       ` Arnd Bergmann
  0 siblings, 1 reply; 57+ messages in thread
From: Bart Van Assche @ 2016-10-26 15:05 UTC (permalink / raw)
  To: Jan Kara, Christoph Hellwig
  Cc: Paolo Valente, Jens Axboe, Tejun Heo, linux-block, linux-kernel,
	ulf.hansson, linus.walleij, broonie, hare, arnd, bart.vanassche,
	grant.likely, James.Bottomley

On 10/26/2016 04:34 AM, Jan Kara wrote:
> On Wed 26-10-16 03:19:03, Christoph Hellwig wrote:
>> Just as last time:
>>
>> big NAK for introducing giant new infrastructure like a new I/O scheduler
>> for the legacy request structure.
>>
>> Please direct your energy towards blk-mq instead.
>
> Christoph, we will probably talk about this next week but IMO rotating
> disks and SATA based SSDs are going to stay with us for another 15 years,
> likely more. For them blk-mq is no win, relatively complex IO scheduling
> like CFQ or BFQ does is a big win for them in some cases. So I think IO
> scheduling (and thus place for something like BFQ) is going to stay with us
> for quite a long time still. So are we going to add hooks in blk-mq to
> support full-blown IO scheduling at least for single queue devices? Or how
> else do we want to support that HW?

Hello Jan,

Having two versions (one for non-blk-mq, one for blk-mq) of every I/O 
scheduler would be a maintenance nightmare. Has anyone already analyzed 
whether it would be possible to come up with an API for I/O schedulers 
that makes it possible to use the same I/O scheduler for both blk-mq and 
the traditional block layer?

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26 15:05     ` Bart Van Assche
@ 2016-10-26 15:13       ` Arnd Bergmann
  2016-10-26 15:29         ` Christoph Hellwig
  0 siblings, 1 reply; 57+ messages in thread
From: Arnd Bergmann @ 2016-10-26 15:13 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jan Kara, Christoph Hellwig, Paolo Valente, Jens Axboe,
	Tejun Heo, linux-block, linux-kernel, ulf.hansson, linus.walleij,
	broonie, hare, grant.likely, James.Bottomley

On Wednesday, October 26, 2016 8:05:11 AM CEST Bart Van Assche wrote:
> On 10/26/2016 04:34 AM, Jan Kara wrote:
> > On Wed 26-10-16 03:19:03, Christoph Hellwig wrote:
> >> Just as last time:
> >>
> >> big NAK for introducing giant new infrastructure like a new I/O scheduler
> >> for the legacy request structure.
> >>
> >> Please direct your energy towards blk-mq instead.
> >
> > Christoph, we will probably talk about this next week but IMO rotating
> > disks and SATA based SSDs are going to stay with us for another 15 years,
> > likely more. For them blk-mq is no win, relatively complex IO scheduling
> > like CFQ or BFQ does is a big win for them in some cases. So I think IO
> > scheduling (and thus place for something like BFQ) is going to stay with us
> > for quite a long time still. So are we going to add hooks in blk-mq to
> > support full-blown IO scheduling at least for single queue devices? Or how
> > else do we want to support that HW?
> 
> Hello Jan,
> 
> Having two versions (one for non-blk-mq, one for blk-mq) of every I/O 
> scheduler would be a maintenance nightmare. Has anyone already analyzed 
> whether it would be possible to come up with an API for I/O schedulers 
> that makes it possible to use the same I/O scheduler for both blk-mq and 
> the traditional block layer?

The question to ask first is whether to actually have pluggable
schedulers on blk-mq at all, or just have one that is meant to
do the right thing in every case (and possibly can be bypassed
completely).

	Arnd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26 15:13       ` Arnd Bergmann
@ 2016-10-26 15:29         ` Christoph Hellwig
  2016-10-26 15:32           ` Jens Axboe
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2016-10-26 15:29 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Bart Van Assche, Jan Kara, Christoph Hellwig, Paolo Valente,
	Jens Axboe, Tejun Heo, linux-block, linux-kernel, ulf.hansson,
	linus.walleij, broonie, hare, grant.likely, James.Bottomley

On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
> The question to ask first is whether to actually have pluggable
> schedulers on blk-mq at all, or just have one that is meant to
> do the right thing in every case (and possibly can be bypassed
> completely).

That would be my preference.  Have a BFQ-variant for blk-mq as an
option (default to off unless opted in by the driver or user), and
not other scheduler for blk-mq.  Don't bother with bfq for non
blk-mq.  It's not like there is any advantage in the legacy-request
device even for slow devices, except for the option of having I/O
scheduling.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26 15:29         ` Christoph Hellwig
@ 2016-10-26 15:32           ` Jens Axboe
  2016-10-26 16:04             ` Paolo Valente
  0 siblings, 1 reply; 57+ messages in thread
From: Jens Axboe @ 2016-10-26 15:32 UTC (permalink / raw)
  To: Christoph Hellwig, Arnd Bergmann
  Cc: Bart Van Assche, Jan Kara, Paolo Valente, Tejun Heo, linux-block,
	linux-kernel, ulf.hansson, linus.walleij, broonie, hare,
	grant.likely, James.Bottomley

On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>> The question to ask first is whether to actually have pluggable
>> schedulers on blk-mq at all, or just have one that is meant to
>> do the right thing in every case (and possibly can be bypassed
>> completely).
>
> That would be my preference.  Have a BFQ-variant for blk-mq as an
> option (default to off unless opted in by the driver or user), and
> not other scheduler for blk-mq.  Don't bother with bfq for non
> blk-mq.  It's not like there is any advantage in the legacy-request
> device even for slow devices, except for the option of having I/O
> scheduling.

It's the only right way forward. blk-mq might not offer any substantial
advantages to rotating storage, but with scheduling, it won't offer a
downside either. And it'll take us towards the real goal, which is to
have just one IO path. Adding a new scheduler for the legacy IO path
makes no sense. Adding one for blk-mq and phasing out the old path is
what we need to do.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26 15:32           ` Jens Axboe
@ 2016-10-26 16:04             ` Paolo Valente
  2016-10-26 16:12               ` Jens Axboe
  0 siblings, 1 reply; 57+ messages in thread
From: Paolo Valente @ 2016-10-26 16:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Arnd Bergmann, Bart Van Assche, Jan Kara,
	Tejun Heo, linux-block, Linux-Kernal, Ulf Hansson, Linus Walleij,
	Mark Brown, Hannes Reinecke, grant.likely, James.Bottomley


> Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe <axboe@kernel.dk> ha scritto:
> 
> On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>>> The question to ask first is whether to actually have pluggable
>>> schedulers on blk-mq at all, or just have one that is meant to
>>> do the right thing in every case (and possibly can be bypassed
>>> completely).
>> 
>> That would be my preference.  Have a BFQ-variant for blk-mq as an
>> option (default to off unless opted in by the driver or user), and
>> not other scheduler for blk-mq.  Don't bother with bfq for non
>> blk-mq.  It's not like there is any advantage in the legacy-request
>> device even for slow devices, except for the option of having I/O
>> scheduling.
> 
> It's the only right way forward. blk-mq might not offer any substantial
> advantages to rotating storage, but with scheduling, it won't offer a
> downside either. And it'll take us towards the real goal, which is to
> have just one IO path.

ok

> Adding a new scheduler for the legacy IO path
> makes no sense.

I would fully agree if effective and stable I/O scheduling would be
available in blk-mq in one or two months.  But I guess that it will
take at least one year optimistically, given the current status of the
needed infrastructure, and given the great difficulties of doing
effective scheduling at the high parallelism and extreme target speeds
of blk-mq.  Of course, this holds true unless little clever scheduling
is performed.

So, what's the point in forcing a lot of users to wait another year or
more, for a solution that has yet to be even defined, while they could
enjoy a much better system, and then switch to an even better system when
scheduling is ready in blk-mq too?

For example, currently, opening a terminal on a 50KIOPS SSD with cfq,
deadline or noop, if there happens to be some background workload,
takes *more* time than opening the same terminal on a 1KIOPS HDD with
the same background workload, but with BFQ.  Does this make any sense?

> Adding one for blk-mq and phasing out the old path is
> what we need to do.
> 

Yes, as the ultimate goal.

Thanks,
Paolo

> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26 16:04             ` Paolo Valente
@ 2016-10-26 16:12               ` Jens Axboe
  2016-10-27  9:26                 ` Jan Kara
                                   ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Jens Axboe @ 2016-10-26 16:12 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Christoph Hellwig, Arnd Bergmann, Bart Van Assche, Jan Kara,
	Tejun Heo, linux-block, Linux-Kernal, Ulf Hansson, Linus Walleij,
	Mark Brown, Hannes Reinecke, grant.likely, James.Bottomley

On 10/26/2016 10:04 AM, Paolo Valente wrote:
>
>> Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe <axboe@kernel.dk> ha scritto:
>>
>> On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>>> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>>>> The question to ask first is whether to actually have pluggable
>>>> schedulers on blk-mq at all, or just have one that is meant to
>>>> do the right thing in every case (and possibly can be bypassed
>>>> completely).
>>>
>>> That would be my preference.  Have a BFQ-variant for blk-mq as an
>>> option (default to off unless opted in by the driver or user), and
>>> not other scheduler for blk-mq.  Don't bother with bfq for non
>>> blk-mq.  It's not like there is any advantage in the legacy-request
>>> device even for slow devices, except for the option of having I/O
>>> scheduling.
>>
>> It's the only right way forward. blk-mq might not offer any substantial
>> advantages to rotating storage, but with scheduling, it won't offer a
>> downside either. And it'll take us towards the real goal, which is to
>> have just one IO path.
>
> ok
>
>> Adding a new scheduler for the legacy IO path
>> makes no sense.
>
> I would fully agree if effective and stable I/O scheduling would be
> available in blk-mq in one or two months.  But I guess that it will
> take at least one year optimistically, given the current status of the
> needed infrastructure, and given the great difficulties of doing
> effective scheduling at the high parallelism and extreme target speeds
> of blk-mq.  Of course, this holds true unless little clever scheduling
> is performed.
>
> So, what's the point in forcing a lot of users to wait another year or
> more, for a solution that has yet to be even defined, while they could
> enjoy a much better system, and then switch to an even better system when
> scheduling is ready in blk-mq too?

That same argument could have been made 2 years ago. Saying no to a new
scheduler for the legacy framework goes back roughly that long. We could
have had BFQ for mq NOW, if we didn't keep coming back to this very
point.

I'm hesitant to add a new scheduler because it's very easy to add, very
difficult to get rid of. If we do add BFQ as a legacy scheduler now,
it'll take us years and years to get rid of it again. We should be
moving towards LESS moving parts in the legacy path, not more.

We can keep having this discussion every few years, but I think we'd
both prefer to make some actual progress here. It's perfectly fine to
add an interface for a single queue interface for an IO scheduler for
blk-mq, since we don't care too much about scalability there. And that
won't take years, that should be a few weeks. Retrofitting BFQ on top of
that should not be hard either. That can co-exist with a real multiqueue
scheduler as well, something that's geared towards some fairness for
faster devices.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26 16:12               ` Jens Axboe
@ 2016-10-27  9:26                 ` Jan Kara
  2016-10-27 14:34                   ` Grozdan
  2016-10-27 16:26                   ` Jens Axboe
  2016-10-27 17:32                 ` Ulf Hansson
  2016-10-29  5:38                 ` Paolo Valente
  2 siblings, 2 replies; 57+ messages in thread
From: Jan Kara @ 2016-10-27  9:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche,
	Jan Kara, Tejun Heo, linux-block, Linux-Kernal, Ulf Hansson,
	Linus Walleij, Mark Brown, Hannes Reinecke, grant.likely,
	James.Bottomley

On Wed 26-10-16 10:12:38, Jens Axboe wrote:
> On 10/26/2016 10:04 AM, Paolo Valente wrote:
> >
> >>Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe <axboe@kernel.dk> ha scritto:
> >>
> >>On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
> >>>On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
> >>>>The question to ask first is whether to actually have pluggable
> >>>>schedulers on blk-mq at all, or just have one that is meant to
> >>>>do the right thing in every case (and possibly can be bypassed
> >>>>completely).
> >>>
> >>>That would be my preference.  Have a BFQ-variant for blk-mq as an
> >>>option (default to off unless opted in by the driver or user), and
> >>>not other scheduler for blk-mq.  Don't bother with bfq for non
> >>>blk-mq.  It's not like there is any advantage in the legacy-request
> >>>device even for slow devices, except for the option of having I/O
> >>>scheduling.
> >>
> >>It's the only right way forward. blk-mq might not offer any substantial
> >>advantages to rotating storage, but with scheduling, it won't offer a
> >>downside either. And it'll take us towards the real goal, which is to
> >>have just one IO path.
> >
> >ok
> >
> >>Adding a new scheduler for the legacy IO path
> >>makes no sense.
> >
> >I would fully agree if effective and stable I/O scheduling would be
> >available in blk-mq in one or two months.  But I guess that it will
> >take at least one year optimistically, given the current status of the
> >needed infrastructure, and given the great difficulties of doing
> >effective scheduling at the high parallelism and extreme target speeds
> >of blk-mq.  Of course, this holds true unless little clever scheduling
> >is performed.
> >
> >So, what's the point in forcing a lot of users to wait another year or
> >more, for a solution that has yet to be even defined, while they could
> >enjoy a much better system, and then switch to an even better system when
> >scheduling is ready in blk-mq too?
> 
> That same argument could have been made 2 years ago. Saying no to a new
> scheduler for the legacy framework goes back roughly that long. We could
> have had BFQ for mq NOW, if we didn't keep coming back to this very
> point.
> 
> I'm hesitant to add a new scheduler because it's very easy to add, very
> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
> it'll take us years and years to get rid of it again. We should be
> moving towards LESS moving parts in the legacy path, not more.
> 
> We can keep having this discussion every few years, but I think we'd
> both prefer to make some actual progress here. It's perfectly fine to
> add an interface for a single queue interface for an IO scheduler for
> blk-mq, since we don't care too much about scalability there. And that
> won't take years, that should be a few weeks. Retrofitting BFQ on top of
> that should not be hard either. That can co-exist with a real multiqueue
> scheduler as well, something that's geared towards some fairness for
> faster devices.

OK, so some solution like having a variant of blk_sq_make_request() that
will consume requests, do IO scheduling decisions on them, and feed them
into the HW queue as it sees fit would be acceptable? That will provide the
IO scheduler a global view that it needs for complex scheduling decisions
so it should indeed be relatively easy to port BFQ to work like that.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27  9:26                 ` Jan Kara
@ 2016-10-27 14:34                   ` Grozdan
  2016-10-27 15:55                     ` Heinz Diehl
  2016-10-27 16:28                     ` Jens Axboe
  2016-10-27 16:26                   ` Jens Axboe
  1 sibling, 2 replies; 57+ messages in thread
From: Grozdan @ 2016-10-27 14:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Paolo Valente, Christoph Hellwig, Arnd Bergmann,
	Bart Van Assche, Tejun Heo, linux-block, Linux-Kernal,
	Ulf Hansson, Linus Walleij, Mark Brown, Hannes Reinecke,
	grant.likely, James.Bottomley

On Thu, Oct 27, 2016 at 11:26 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 26-10-16 10:12:38, Jens Axboe wrote:
>> On 10/26/2016 10:04 AM, Paolo Valente wrote:
>> >
>> >>Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe <axboe@kernel.dk> ha scritto:
>> >>
>> >>On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>> >>>On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>> >>>>The question to ask first is whether to actually have pluggable
>> >>>>schedulers on blk-mq at all, or just have one that is meant to
>> >>>>do the right thing in every case (and possibly can be bypassed
>> >>>>completely).
>> >>>
>> >>>That would be my preference.  Have a BFQ-variant for blk-mq as an
>> >>>option (default to off unless opted in by the driver or user), and
>> >>>not other scheduler for blk-mq.  Don't bother with bfq for non
>> >>>blk-mq.  It's not like there is any advantage in the legacy-request
>> >>>device even for slow devices, except for the option of having I/O
>> >>>scheduling.
>> >>
>> >>It's the only right way forward. blk-mq might not offer any substantial
>> >>advantages to rotating storage, but with scheduling, it won't offer a
>> >>downside either. And it'll take us towards the real goal, which is to
>> >>have just one IO path.
>> >
>> >ok
>> >
>> >>Adding a new scheduler for the legacy IO path
>> >>makes no sense.
>> >
>> >I would fully agree if effective and stable I/O scheduling would be
>> >available in blk-mq in one or two months.  But I guess that it will
>> >take at least one year optimistically, given the current status of the
>> >needed infrastructure, and given the great difficulties of doing
>> >effective scheduling at the high parallelism and extreme target speeds
>> >of blk-mq.  Of course, this holds true unless little clever scheduling
>> >is performed.
>> >
>> >So, what's the point in forcing a lot of users to wait another year or
>> >more, for a solution that has yet to be even defined, while they could
>> >enjoy a much better system, and then switch to an even better system when
>> >scheduling is ready in blk-mq too?
>>
>> That same argument could have been made 2 years ago. Saying no to a new
>> scheduler for the legacy framework goes back roughly that long. We could
>> have had BFQ for mq NOW, if we didn't keep coming back to this very
>> point.
>>
>> I'm hesitant to add a new scheduler because it's very easy to add, very
>> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
>> it'll take us years and years to get rid of it again. We should be
>> moving towards LESS moving parts in the legacy path, not more.
>>
>> We can keep having this discussion every few years, but I think we'd
>> both prefer to make some actual progress here. It's perfectly fine to
>> add an interface for a single queue interface for an IO scheduler for
>> blk-mq, since we don't care too much about scalability there. And that
>> won't take years, that should be a few weeks. Retrofitting BFQ on top of
>> that should not be hard either. That can co-exist with a real multiqueue
>> scheduler as well, something that's geared towards some fairness for
>> faster devices.
>
> OK, so some solution like having a variant of blk_sq_make_request() that
> will consume requests, do IO scheduling decisions on them, and feed them
> into the HW queue is it sees fit would be acceptable? That will provide the
> IO scheduler a global view that it needs for complex scheduling decisions
> so it should indeed be relatively easy to port BFQ to work like that.
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


Hello,

Let me first say that I'm in no way associated with Paolo Valente or
any other BFQ developer. I'm a mere user who has had a great experience
using BFQ.

My workload is one that takes my disks to their limits. I often work with
large files like raw Blu-ray streams, which I then remux to MKVs while
at the same time streaming at least 2 movies to various devices in the
house and using my system as usual while the remuxing process is going
on. At times, I'm also pushing video files to my NAS at close to Gbps
speed while all of the above is in progress.

My experience with BFQ is that it has never resulted in the video
streams being interrupted due to disk thrashing. I've extensively used
all the other Linux disk schedulers in the past, and what I've observed
is that whenever I start the remuxing (and copying) process, the
streams begin to hiccup and stutter, and often multi-second "waits"
occur. It gets even worse: under this kind of workload, the whole
system comes almost to a halt and interactivity goes out the window. It
becomes impossible to start an app in a reasonable amount of time, and
loading an already-visited website makes Chrome hang while trying to
get the contents from its cache, etc.

BFQ has greatly helped to keep the system responsive during such
operations and, as I said, I have never experienced any interruption of
the video streams. Do I think BFQ is the best thing since sliced
bread? No; with BFQ too there are occasional corner cases where it
takes too long to start a program. But if I were on one of the other
disk schedulers, most of the time that program wouldn't start at all
until the disk got some "relief".

So in the end, I'm here to support the inclusion of BFQ. Paolo has put
so much energy, time, and sleepless nights into this so that people like
me can have a working, responsive system during heavy disk operations.
From a normal user's perspective, I do not want BFQ to be dismissed
and all that effort/time/etc thrown out the window. From my
perspective, Paolo deserves more support from the guys in charge of
the block layer in Linux.

Thanks :)



-- 
Yours truly

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 14:34                   ` Grozdan
@ 2016-10-27 15:55                     ` Heinz Diehl
  2016-10-27 16:28                     ` Jens Axboe
  1 sibling, 0 replies; 57+ messages in thread
From: Heinz Diehl @ 2016-10-27 15:55 UTC (permalink / raw)
  To: linux-kernel

On 27.10.2016, Grozdan wrote: 

> So in the end, I'm here to support the inclusion of BFQ. Paolo has put
> too much energy, time, and sleepless nights into this so people like
> me can have a working, responsive system during heavy disk operations.
> From a normal user's perspective, I do not want BFQ to be dismissed
> and all the effort/time/etc thrown out the window. From my
> perspective, Paolo deserves more support from the guys in charge of
> the block layer in Linux.

I really want to second that!

Just take a bog-standard desktop PC with an SSD and a reasonably fast
CPU (an 8-core Xeon in my case) and do the following:

1. dd if=/dev/zero of=deleteme bs=1M count=50000
2. Start oowriter (Libreoffice Writer)

With cfq, deadline or noop, oowriter does not load until dd'ing
the 50 gigs is finished. With bfq, oowriter loads nearly immediately.

Not to mention that cfq, deadline and noop are all a nightmare on
Android in terms of latency.

I'm (obviously) neither a kernel nor a bfq developer, but I really
want you to reconsider, with the overall greatness of bfq in mind,
whether it really is totally impossible to include it at least as a
scheduler option, alongside the other three already existing ones.

Thanks,
 Heinz
 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27  9:26                 ` Jan Kara
  2016-10-27 14:34                   ` Grozdan
@ 2016-10-27 16:26                   ` Jens Axboe
  2016-10-28  7:59                     ` Jan Kara
  1 sibling, 1 reply; 57+ messages in thread
From: Jens Axboe @ 2016-10-27 16:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche,
	Tejun Heo, linux-block, Linux-Kernal, Ulf Hansson, Linus Walleij,
	Mark Brown, Hannes Reinecke, grant.likely, James.Bottomley

On 10/27/2016 03:26 AM, Jan Kara wrote:
> On Wed 26-10-16 10:12:38, Jens Axboe wrote:
>> On 10/26/2016 10:04 AM, Paolo Valente wrote:
>>>
>>>> Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe <axboe@kernel.dk> ha scritto:
>>>>
>>>> On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>>>>> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>>>>>> The question to ask first is whether to actually have pluggable
>>>>>> schedulers on blk-mq at all, or just have one that is meant to
>>>>>> do the right thing in every case (and possibly can be bypassed
>>>>>> completely).
>>>>>
>>>>> That would be my preference.  Have a BFQ-variant for blk-mq as an
>>>>> option (default to off unless opted in by the driver or user), and
>>>>> not other scheduler for blk-mq.  Don't bother with bfq for non
>>>>> blk-mq.  It's not like there is any advantage in the legacy-request
>>>>> device even for slow devices, except for the option of having I/O
>>>>> scheduling.
>>>>
>>>> It's the only right way forward. blk-mq might not offer any substantial
>>>> advantages to rotating storage, but with scheduling, it won't offer a
>>>> downside either. And it'll take us towards the real goal, which is to
>>>> have just one IO path.
>>>
>>> ok
>>>
>>>> Adding a new scheduler for the legacy IO path
>>>> makes no sense.
>>>
>>> I would fully agree if effective and stable I/O scheduling would be
>>> available in blk-mq in one or two months.  But I guess that it will
>>> take at least one year optimistically, given the current status of the
>>> needed infrastructure, and given the great difficulties of doing
>>> effective scheduling at the high parallelism and extreme target speeds
>>> of blk-mq.  Of course, this holds true unless little clever scheduling
>>> is performed.
>>>
>>> So, what's the point in forcing a lot of users wait another year or
>>> more, for a solution that has yet to be even defined, while they could
>>> enjoy a much better system, and then switch an even better system when
>>> scheduling is ready in blk-mq too?
>>
>> That same argument could have been made 2 years ago. Saying no to a new
>> scheduler for the legacy framework goes back roughly that long. We could
>> have had BFQ for mq NOW, if we didn't keep coming back to this very
>> point.
>>
>> I'm hesistant to add a new scheduler because it's very easy to add, very
>> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
>> it'll take us years and years to get rid of it again. We should be
>> moving towards LESS moving parts in the legacy path, not more.
>>
>> We can keep having this discussion every few years, but I think we'd
>> both prefer to make some actual progress here. It's perfectly fine to
>> add an interface for a single queue interface for an IO scheduler for
>> blk-mq, since we don't care too much about scalability there. And that
>> won't take years, that should be a few weeks. Retrofitting BFQ on top of
>> that should not be hard either. That can co-exist with a real multiqueue
>> scheduler as well, something that's geared towards some fairness for
>> faster devices.
>
> OK, so some solution like having a variant of blk_sq_make_request() that
> will consume requests, do IO scheduling decisions on them, and feed them
> into the HW queue is it sees fit would be acceptable? That will provide the
> IO scheduler a global view that it needs for complex scheduling decisions
> so it should indeed be relatively easy to port BFQ to work like that.

I'd probably start off from Omar's base [1], which switches the software
queues to store bios instead of requests, since that lifts the 1:1
mapping between what we can queue up and what we can dispatch. Without
that, the IO scheduler won't have much to work with. And with that
in place, it'll be a "bio in, request out" type of setup, which is
similar to what we have in the legacy path.

I'd keep the software queues, but as a starting point, mandate 1
hardware queue to keep that as the per-device view of the state. The IO
scheduler would be responsible for moving one or more bios from the
software queues to the hardware queue, when they are ready to dispatch.

[1] 
https://github.com/osandov/linux/commit/8ef3508628b6cf7c4712cd3d8084ee11ef5d2530
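
For illustration only, here is a toy model of that "bio in, request out"
layout: per-CPU software queues hold bios, one hardware queue holds
requests, and a scheduler step decides when bios are converted and moved
across. All structures and names are made up; nothing here is the actual
blk-mq API.

#include <stdio.h>

#define NR_CPUS   2
#define QUEUE_LEN 8

struct toy_bio { int sector; };
struct toy_req { struct toy_bio bio; };

static struct toy_bio sw_queue[NR_CPUS][QUEUE_LEN];	/* software queues: bios */
static int sw_len[NR_CPUS];

static struct toy_req hw_queue[QUEUE_LEN];		/* one hardware queue: requests */
static int hw_len;

/* submission side: each CPU drops bios into its own software queue */
static void toy_submit_bio(int cpu, int sector)
{
	if (sw_len[cpu] < QUEUE_LEN)
		sw_queue[cpu][sw_len[cpu]++] = (struct toy_bio){ .sector = sector };
}

/* scheduler side: turn queued bios into requests on the hardware queue */
static void toy_scheduler_dispatch(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		for (int i = 0; i < sw_len[cpu] && hw_len < QUEUE_LEN; i++) {
			hw_queue[hw_len++] = (struct toy_req){ .bio = sw_queue[cpu][i] };
			printf("bio for sector %d -> request on HW queue\n",
			       sw_queue[cpu][i].sector);
		}
		sw_len[cpu] = 0;
	}
}

int main(void)
{
	toy_submit_bio(0, 100);
	toy_submit_bio(1, 42);
	toy_scheduler_dispatch();
	return 0;
}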

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 14:34                   ` Grozdan
  2016-10-27 15:55                     ` Heinz Diehl
@ 2016-10-27 16:28                     ` Jens Axboe
  1 sibling, 0 replies; 57+ messages in thread
From: Jens Axboe @ 2016-10-27 16:28 UTC (permalink / raw)
  To: Grozdan, Jan Kara
  Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche,
	Tejun Heo, linux-block, Linux-Kernal, Ulf Hansson, Linus Walleij,
	Mark Brown, Hannes Reinecke, grant.likely, James.Bottomley

On 10/27/2016 08:34 AM, Grozdan wrote:
> On Thu, Oct 27, 2016 at 11:26 AM, Jan Kara <jack@suse.cz> wrote:
>> On Wed 26-10-16 10:12:38, Jens Axboe wrote:
>>> On 10/26/2016 10:04 AM, Paolo Valente wrote:
>>>>
>>>>> Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe <axboe@kernel.dk> ha scritto:
>>>>>
>>>>> On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>>>>>> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>>>>>>> The question to ask first is whether to actually have pluggable
>>>>>>> schedulers on blk-mq at all, or just have one that is meant to
>>>>>>> do the right thing in every case (and possibly can be bypassed
>>>>>>> completely).
>>>>>>
>>>>>> That would be my preference.  Have a BFQ-variant for blk-mq as an
>>>>>> option (default to off unless opted in by the driver or user), and
>>>>>> not other scheduler for blk-mq.  Don't bother with bfq for non
>>>>>> blk-mq.  It's not like there is any advantage in the legacy-request
>>>>>> device even for slow devices, except for the option of having I/O
>>>>>> scheduling.
>>>>>
>>>>> It's the only right way forward. blk-mq might not offer any substantial
>>>>> advantages to rotating storage, but with scheduling, it won't offer a
>>>>> downside either. And it'll take us towards the real goal, which is to
>>>>> have just one IO path.
>>>>
>>>> ok
>>>>
>>>>> Adding a new scheduler for the legacy IO path
>>>>> makes no sense.
>>>>
>>>> I would fully agree if effective and stable I/O scheduling would be
>>>> available in blk-mq in one or two months.  But I guess that it will
>>>> take at least one year optimistically, given the current status of the
>>>> needed infrastructure, and given the great difficulties of doing
>>>> effective scheduling at the high parallelism and extreme target speeds
>>>> of blk-mq.  Of course, this holds true unless little clever scheduling
>>>> is performed.
>>>>
>>>> So, what's the point in forcing a lot of users wait another year or
>>>> more, for a solution that has yet to be even defined, while they could
>>>> enjoy a much better system, and then switch an even better system when
>>>> scheduling is ready in blk-mq too?
>>>
>>> That same argument could have been made 2 years ago. Saying no to a new
>>> scheduler for the legacy framework goes back roughly that long. We could
>>> have had BFQ for mq NOW, if we didn't keep coming back to this very
>>> point.
>>>
>>> I'm hesistant to add a new scheduler because it's very easy to add, very
>>> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
>>> it'll take us years and years to get rid of it again. We should be
>>> moving towards LESS moving parts in the legacy path, not more.
>>>
>>> We can keep having this discussion every few years, but I think we'd
>>> both prefer to make some actual progress here. It's perfectly fine to
>>> add an interface for a single queue interface for an IO scheduler for
>>> blk-mq, since we don't care too much about scalability there. And that
>>> won't take years, that should be a few weeks. Retrofitting BFQ on top of
>>> that should not be hard either. That can co-exist with a real multiqueue
>>> scheduler as well, something that's geared towards some fairness for
>>> faster devices.
>>
>> OK, so some solution like having a variant of blk_sq_make_request() that
>> will consume requests, do IO scheduling decisions on them, and feed them
>> into the HW queue is it sees fit would be acceptable? That will provide the
>> IO scheduler a global view that it needs for complex scheduling decisions
>> so it should indeed be relatively easy to port BFQ to work like that.
>>
>>                                                                 Honza
>> --
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
>
>
> Hello,
>
> Let me first say that I'm in no way associated with Paolo Valente or
> any other BFQ developer. I'm a mere user who has had great experience
> using BFQ
>
> My workload is one that takes my disks to their limits. I often use
> large files like raw Blu-ray streams which then I remux to mkv's while
> at the same time streaming at least 2 movies to various devices in
> house and using my system as I do while the remuxing process is going
> on. At times, I'm also pushing video files to my NAS at close to Gbps
> speed while the stuff I mentioned is in progress
>
> My experience with BFQ is that it has never resulted in the video
> streams being interrupted due to disk trashing. I've extensively used
> all the other Linux disk schedulers in the past and what I've observed
> is that whenever I start the remuxing (and copying) process, the
> streams will begin to hiccup, stutter and often multi-seconds long
> "waits" will occur. It gets even worse, when I do this kind of
> workload, the whole system will come to almost a halt and
> interactivity goes out the window. Impossible to start an app in a
> reasonable amount of time. Loading a visited website makes Chrome hang
> while trying to get the contents from its cache, etc
>
> BFQ has greatly helped to have a responsive system during such
> operations and as I said, I have never experience any interruption of
> the video streams. Do I think BFQ is the best thing since sliced
> bread? No, as with BFQ too there are sometimes corner cases where it
> takes too long to start a program. But if I was on one of the other
> disk schedulers, most of the time that program won't start at all
> until the disk gets some "relief"
>
> So in the end, I'm here to support the inclusion of BFQ. Paolo has put
> too much energy, time, and sleepless nights into this so people like
> me can have a working, responsive system during heavy disk operations.
> From a normal user's perspective, I do not want BFQ to be dismissed
> and all the effort/time/etc thrown out the window. From my
> perspective, Paolo deserves more support from the guys in charge of
> the block layer in Linux.

Nobody is against BFQ as a project, the recommendation (for ages) has
been that it be reworked to fit in with where the block layer is
currently going. It's for the good of the BFQ project, since making it
work with blk-mq is the best way to future proof it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26 16:12               ` Jens Axboe
  2016-10-27  9:26                 ` Jan Kara
@ 2016-10-27 17:32                 ` Ulf Hansson
  2016-10-27 17:43                   ` Jens Axboe
  2016-10-29  5:38                 ` Paolo Valente
  2 siblings, 1 reply; 57+ messages in thread
From: Ulf Hansson @ 2016-10-27 17:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche,
	Jan Kara, Tejun Heo, linux-block, Linux-Kernal, Linus Walleij,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley

[...]

>
> I'm hesistant to add a new scheduler because it's very easy to add, very
> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
> it'll take us years and years to get rid of it again. We should be
> moving towards LESS moving parts in the legacy path, not more.

Jens, I think you are wrong here and let me try to elaborate on why.

1)
We already have legacy schedulers like CFQ, DEADLINE, etc - and most
block device drivers are still using the legacy blk interface.

To be able to remove the legacy blk layer, all block device drivers
must be converted to blkmq - of course.

So to reach that goal, we will not only need to evolve blkmq to allow
scheduling (at least for single queue devices), but we also need to
convert *all* block device drivers to blkmq. For sure this will take
*years* and not months.

More important, when the transition to blkmq has been completed, then
there is absolutely no difference (from effort point of view) in
removing the legacy blk layer - no matter if we have BFQ in there or
not.

I do understand if you have concerns from a maintenance point of view, as
I assume you would rather focus on evolving blkmq than care about
legacy blk code. So, would it help if Paolo volunteered to maintain the
BFQ code in the meantime?

2)
While we work on evolving blkmq and converting block device drivers to
it, BFQ could, as a separate legacy scheduler, help *lots* of Linux
users get a significantly improved experience. Should we really
prevent them from that? I think you block maintainer guys really need
to consider this.

3)
While we work on scheduling in blkmq (at least for single queue
devices), it's of course important that we set high goals. Having BFQ
(and the other schedulers) in legacy blk provides a good
reference for what we could aim for.

>
> We can keep having this discussion every few years, but I think we'd
> both prefer to make some actual progress here. It's perfectly fine to
> add an interface for a single queue interface for an IO scheduler for
> blk-mq, since we don't care too much about scalability there. And that
> won't take years, that should be a few weeks. Retrofitting BFQ on top of
> that should not be hard either. That can co-exist with a real multiqueue
> scheduler as well, something that's geared towards some fairness for
> faster devices.

That's really great news!

I hope we get a possibility to meet and discuss the plans for this at
Kernel summit/Linux Plumbers the next week!

>
> --
> Jens Axboe

Kind regards
Ulf Hansson

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 17:32                 ` Ulf Hansson
@ 2016-10-27 17:43                   ` Jens Axboe
  2016-10-27 18:13                     ` Ulf Hansson
  0 siblings, 1 reply; 57+ messages in thread
From: Jens Axboe @ 2016-10-27 17:43 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche,
	Jan Kara, Tejun Heo, linux-block, Linux-Kernal, Linus Walleij,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley

On 10/27/2016 11:32 AM, Ulf Hansson wrote:
> [...]
>
>>
>> I'm hesistant to add a new scheduler because it's very easy to add, very
>> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
>> it'll take us years and years to get rid of it again. We should be
>> moving towards LESS moving parts in the legacy path, not more.
>
> Jens, I think you are wrong here and let me try to elaborate on why.
>
> 1)
> We already have legacy schedulers like CFQ, DEADLINE, etc - and most
> block device drivers are still using the legacy blk interface.

I don't think that's an accurate statement. In terms of coverage, most
drivers do support blk-mq. Anything SCSI, nvme, virtio-blk, SATA runs on
(or can run on) top of blk-mq.

> To be able to remove the legacy blk layer, all block device drivers
> must be converted to blkmq - of course.

That's a given.

> So to reach that goal, we will not only need to evolve blkmq to allow
> scheduling (at least for single queue devices), but we also need to
> convert *all* block device drivers to blkmq. For sure this will take
> *years* and not months.

Correct.

> More important, when the transition to blkmq has been completed, then
> there is absolutely no difference (from effort point of view) in
> removing the legacy blk layer - no matter if we have BFQ in there or
> not.
>
> I do understand if you have concern from maintenance point of view, as
> I assume you would rather focus on evolving blkmq, than care about
> legacy blk code. So, would it help if Paolo volunteers to maintain the
> BFQ code in the meantime?

We're obviously still maintaining the legacy IO path. But we don't want
to actively develop it, and we haven't, for a long time.

And Paolo maintaining it is a strict requirement for inclusion, legacy
or blk-mq aside. That would go for both. I'd never accept a major
feature from an individual or company if they weren't willing and
capable of maintaining it. Throwing submissions over the wall is not
viable.

> 2)
> While we work on evolving blkmq and convert block device drivers to
> it, BFQ could as a separate legacy scheduler, help *lots* of Linux
> users to get a significant improved experience. Should we really
> prevent them from that? I think you block maintainer guys, really need
> to consider this fact.

You still seem to be basing that assumption on the notion that we have
to convert tons of drivers for BFQ to make sense under the blk-mq
umbrella. That's not the case.

> 3)
> While we work on scheduling in blkmq (at least for single queue
> devices), it's of course important that we set high goals. Having BFQ
> (and the other schedulers) in the legacy blk, provides a good
> reference for what we could aim for.

Sure, but you don't need BFQ to be included in the kernel for that.

>> We can keep having this discussion every few years, but I think we'd
>> both prefer to make some actual progress here. It's perfectly fine to
>> add an interface for a single queue interface for an IO scheduler for
>> blk-mq, since we don't care too much about scalability there. And that
>> won't take years, that should be a few weeks. Retrofitting BFQ on top of
>> that should not be hard either. That can co-exist with a real multiqueue
>> scheduler as well, something that's geared towards some fairness for
>> faster devices.
>
> That's really great news!
>
> I hope we get a possibility to meet and discuss the plans for this at
> Kernel summit/Linux Plumbers the next week!

I'll be there.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 17:43                   ` Jens Axboe
@ 2016-10-27 18:13                     ` Ulf Hansson
  2016-10-27 18:21                       ` Jens Axboe
  2016-10-28 12:07                       ` Arnd Bergmann
  0 siblings, 2 replies; 57+ messages in thread
From: Ulf Hansson @ 2016-10-27 18:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche,
	Jan Kara, Tejun Heo, linux-block, Linux-Kernal, Linus Walleij,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley

On 27 October 2016 at 19:43, Jens Axboe <axboe@kernel.dk> wrote:
> On 10/27/2016 11:32 AM, Ulf Hansson wrote:
>>
>> [...]
>>
>>>
>>> I'm hesistant to add a new scheduler because it's very easy to add, very
>>> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
>>> it'll take us years and years to get rid of it again. We should be
>>> moving towards LESS moving parts in the legacy path, not more.
>>
>>
>> Jens, I think you are wrong here and let me try to elaborate on why.
>>
>> 1)
>> We already have legacy schedulers like CFQ, DEADLINE, etc - and most
>> block device drivers are still using the legacy blk interface.
>
>
> I don't think that's an accurate statement. In terms of coverage, most
> drivers do support blk-mq. Anything SCSI, nvme, virtio-blk, SATA runs on
> (or can run on) top of blk-mq.

Well, I just used "git grep" and found that many drivers didn't use
blkmq. Apologies if I gave the wrong impression.

>
>> To be able to remove the legacy blk layer, all block device drivers
>> must be converted to blkmq - of course.
>
>
> That's a given.
>
>> So to reach that goal, we will not only need to evolve blkmq to allow
>> scheduling (at least for single queue devices), but we also need to
>> convert *all* block device drivers to blkmq. For sure this will take
>> *years* and not months.
>
>
> Correct.
>
>> More important, when the transition to blkmq has been completed, then
>> there is absolutely no difference (from effort point of view) in
>> removing the legacy blk layer - no matter if we have BFQ in there or
>> not.
>>
>> I do understand if you have concern from maintenance point of view, as
>> I assume you would rather focus on evolving blkmq, than care about
>> legacy blk code. So, would it help if Paolo volunteers to maintain the
>> BFQ code in the meantime?
>
>
> We're obviously still maintaining the legacy IO path. But we don't want
> to actively develop it, and we haven't, for a long time.
>
> And Paolo maintaining it is a strict requirement for inclusion, legacy
> or blk-mq aside. That would go for both. I'd never accept a major
> feature from an individual or company if they weren't willing and
> capable of maintaining it. Throwing submissions over the wall is not
> viable.

That seems very reasonable!

>
>> 2)
>> While we work on evolving blkmq and convert block device drivers to
>> it, BFQ could as a separate legacy scheduler, help *lots* of Linux
>> users to get a significant improved experience. Should we really
>> prevent them from that? I think you block maintainer guys, really need
>> to consider this fact.
>
>
> You still seem to be basing that assumption on the notion that we have
> to convert tons of drivers for BFQ to make sense under the blk-mq
> umbrella. That's not the case.

Well, let's not argue about how many. It's pretty easy to check that.

Instead, what I can tell you is that we have been looking into converting
mmc (which I maintain), and that is indeed a significant amount of work.
We will need to rip out all of the mmc request management, and most
likely we will also need to extend the blkmq interface, so as to be able
to re-implement all the current request optimizations. We are looking
into this, but it just takes time.

I can imagine that it's not always a straightforward "convert to
blk-mq" patch for every block device driver.

>
>> 3)
>> While we work on scheduling in blkmq (at least for single queue
>> devices), it's of course important that we set high goals. Having BFQ
>> (and the other schedulers) in the legacy blk, provides a good
>> reference for what we could aim for.
>
>
> Sure, but you don't need BFQ to be included in the kernel for that.

Perhaps not.

But does that mean you expect Paolo to maintain an up-to-date BFQ tree for you?

>
>>> We can keep having this discussion every few years, but I think we'd
>>> both prefer to make some actual progress here. It's perfectly fine to
>>> add an interface for a single queue interface for an IO scheduler for
>>> blk-mq, since we don't care too much about scalability there. And that
>>> won't take years, that should be a few weeks. Retrofitting BFQ on top of
>>> that should not be hard either. That can co-exist with a real multiqueue
>>> scheduler as well, something that's geared towards some fairness for
>>> faster devices.
>>
>>
>> That's really great news!
>>
>> I hope we get a possibility to meet and discuss the plans for this at
>> Kernel summit/Linux Plumbers the next week!
>
>
> I'll be there.

Great!

Kind regards
Ulf Hansson

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 18:13                     ` Ulf Hansson
@ 2016-10-27 18:21                       ` Jens Axboe
  2016-10-27 19:34                         ` Ulf Hansson
  2016-10-27 19:41                         ` Mark Brown
  2016-10-28 12:07                       ` Arnd Bergmann
  1 sibling, 2 replies; 57+ messages in thread
From: Jens Axboe @ 2016-10-27 18:21 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche,
	Jan Kara, Tejun Heo, linux-block, Linux-Kernal, Linus Walleij,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley

On 10/27/2016 12:13 PM, Ulf Hansson wrote:
>>> 2)
>>> While we work on evolving blkmq and convert block device drivers to
>>> it, BFQ could as a separate legacy scheduler, help *lots* of Linux
>>> users to get a significant improved experience. Should we really
>>> prevent them from that? I think you block maintainer guys, really need
>>> to consider this fact.
>>
>>
>> You still seem to be basing that assumption on the notion that we have
>> to convert tons of drivers for BFQ to make sense under the blk-mq
>> umbrella. That's not the case.
>
> Well, let's not argue about how many. It's pretty easy to check that.

I wasn't arguing - you made a false or misleading statement, I had to
correct that.

Most of the drivers that haven't been converted yet are themselves for
legacy hardware. Some are not, though, and it'd be great to get those
converted. But coverage wise, we're in pretty good shape.

> Instead, what I can tell, as we have been looking into converting mmc
> (which I maintains) and that is indeed a significant amount of work.
> We will need to rip out all of the mmc request management, and most
> likely we also need to extend the blkmq interface - as to be able to
> do re-implement all the current request optimizations. We are looking
> into this, but it just takes time.

It's usually as much work as you make it into; for most cases it's
pretty straightforward and usually removes more code than it adds.
Hence the end result is better for it as well - less code in a driver is
better.

> I can imagine, that it's not always a straight forward "convert to blk
> mq" patch for every block device driver.

Well, I've actually done a few conversions, and it's not difficult at
all. The grunt of the work is usually in converting the driver to use
blk-mq features for the parts it had implemented privately, like
timeout handling, etc.

I'm always happy to help people with converting drivers.

>>> 3)
>>> While we work on scheduling in blkmq (at least for single queue
>>> devices), it's of course important that we set high goals. Having BFQ
>>> (and the other schedulers) in the legacy blk, provides a good
>>> reference for what we could aim for.
>>
>>
>> Sure, but you don't need BFQ to be included in the kernel for that.
>
> Perhaps not.
>
> But does that mean, you expect Paolo to maintain an up to date BFQ
> tree for you?

I don't expect anything. If Paolo or others want to compare with BFQ on
the legacy IO path, then they can do that however way they want. If you
(and others) want to have that reference point, it's up to you how to
accomplish that.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 18:21                       ` Jens Axboe
@ 2016-10-27 19:34                         ` Ulf Hansson
  2016-10-27 21:08                           ` Jens Axboe
  2016-10-27 19:41                         ` Mark Brown
  1 sibling, 1 reply; 57+ messages in thread
From: Ulf Hansson @ 2016-10-27 19:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche,
	Jan Kara, Tejun Heo, linux-block, Linux-Kernal, Linus Walleij,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley

[...]

>> Instead, what I can tell, as we have been looking into converting mmc
>> (which I maintains) and that is indeed a significant amount of work.
>> We will need to rip out all of the mmc request management, and most
>> likely we also need to extend the blkmq interface - as to be able to
>> do re-implement all the current request optimizations. We are looking
>> into this, but it just takes time.
>
>
> It's usually as much work as you make it into, for most cases it's
> pretty straight forward and usually removes more code than it adds.
> Hence the end result is better for it as well - less code in a driver is
> better.

From a scalability and maintenance point of view, converting to blkmq
makes perfect sense.

Although, personally I don't want to sacrifice performance (or at most
very little), just for the sake of gaining
scalability/maintainability.

I would rather strive to adapt the blkmq framework to also suit my
needs. That simply takes more time.

For example, in the mmc case we have implemented an asynchronous
request path, which greatly improves performance on some systems.

>
>> I can imagine, that it's not always a straight forward "convert to blk
>> mq" patch for every block device driver.
>
>
> Well, I've actually done a few conversions, and it's not difficult at
> all. The grunt of the work is usually around converting to using some of
> the blk-mq features for parts of the driver that it had implemented
> privately, like timeout handling, etc.
>
> I'm always happy to help people with converting drivers.

Great, we'll ping you if we need some help! Thanks!

>
>>>> 3)
>>>> While we work on scheduling in blkmq (at least for single queue
>>>> devices), it's of course important that we set high goals. Having BFQ
>>>> (and the other schedulers) in the legacy blk, provides a good
>>>> reference for what we could aim for.
>>>
>>>
>>>
>>> Sure, but you don't need BFQ to be included in the kernel for that.
>>
>>
>> Perhaps not.
>>
>> But does that mean, you expect Paolo to maintain an up to date BFQ
>> tree for you?
>
>
> I don't expect anything. If Paolo or others want to compare with BFQ on
> the legacy IO path, then they can do that however way they want. If you
> (and others) want to have that reference point, it's up to you how to
> accomplish that.

Do I get this right? You personally don't care about using BFQ as a
reference when evolving blkmq for single queue devices?

Paolo and lots of other Linux users certainly do care about this.

Moreover, I am still trying to understand what the big deal is in
saying no to BFQ as a legacy scheduler. Ideally it shouldn't cause
you any maintenance burden, and it doesn't make the removal of the
legacy blk layer any more difficult, right?

Kind regards
Ulf Hansson

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 18:21                       ` Jens Axboe
  2016-10-27 19:34                         ` Ulf Hansson
@ 2016-10-27 19:41                         ` Mark Brown
  2016-10-27 19:45                           ` Christoph Hellwig
  1 sibling, 1 reply; 57+ messages in thread
From: Mark Brown @ 2016-10-27 19:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ulf Hansson, Paolo Valente, Christoph Hellwig, Arnd Bergmann,
	Bart Van Assche, Jan Kara, Tejun Heo, linux-block, Linux-Kernal,
	Linus Walleij, Hannes Reinecke, Grant Likely, James Bottomley

On Thu, Oct 27, 2016 at 12:21:06PM -0600, Jens Axboe wrote:
> On 10/27/2016 12:13 PM, Ulf Hansson wrote:

> > I can imagine, that it's not always a straight forward "convert to blk
> > mq" patch for every block device driver.

> Well, I've actually done a few conversions, and it's not difficult at
> all. The grunt of the work is usually around converting to using some of
> the blk-mq features for parts of the driver that it had implemented
> privately, like timeout handling, etc.

Plus the benchmarking to verify that it works well of course, especially
initially where it'll also be a new queue infrastructure as well as the
blk-mq conversion itself.  It does feel like something that's going to
take at least a couple of kernel releases to get through.

> > > > 3)
> > > > While we work on scheduling in blkmq (at least for single queue
> > > > devices), it's of course important that we set high goals. Having BFQ
> > > > (and the other schedulers) in the legacy blk, provides a good
> > > > reference for what we could aim for.

> > > Sure, but you don't need BFQ to be included in the kernel for that.

> > Perhaps not.

> > But does that mean, you expect Paolo to maintain an up to date BFQ
> > tree for you?

> I don't expect anything. If Paolo or others want to compare with BFQ on
> the legacy IO path, then they can do that however way they want. If you
> (and others) want to have that reference point, it's up to you how to
> accomplish that.

I think there's also value in having improvements there for people who
benefit from them while queue infrastructure for blk-mq is being worked
on.  


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 19:41                         ` Mark Brown
@ 2016-10-27 19:45                           ` Christoph Hellwig
  2016-10-27 22:01                             ` Mark Brown
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2016-10-27 19:45 UTC (permalink / raw)
  To: Mark Brown
  Cc: Jens Axboe, Ulf Hansson, Paolo Valente, Christoph Hellwig,
	Arnd Bergmann, Bart Van Assche, Jan Kara, Tejun Heo, linux-block,
	Linux-Kernal, Linus Walleij, Hannes Reinecke, Grant Likely,
	James Bottomley

On Thu, Oct 27, 2016 at 08:41:27PM +0100, Mark Brown wrote:
> Plus the benchmarking to verify that it works well of course, especially
> initially where it'll also be a new queue infrastructure as well as the
> blk-mq conversion itself.  It does feel like something that's going to
> take at least a couple of kernel releases to get through.

Or to put it the other way around: it could have been done long ago
if people had started it the first time it was suggested. Instead you
guys keep arguing and nothing gets done. Get started now; waiting won't
make anything go faster.

> I think there's also value in having improvements there for people who
> benefit from them while queue infrastructure for blk-mq is being worked
> on.  

Well, apply it to your vendor tree then and maintain it yourself if you
disagree with our direction.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 19:34                         ` Ulf Hansson
@ 2016-10-27 21:08                           ` Jens Axboe
  2016-10-27 22:27                             ` Linus Walleij
  2016-10-28  6:36                             ` Ulf Hansson
  0 siblings, 2 replies; 57+ messages in thread
From: Jens Axboe @ 2016-10-27 21:08 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche,
	Jan Kara, Tejun Heo, linux-block, Linux-Kernal, Linus Walleij,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley

On 10/27/2016 01:34 PM, Ulf Hansson wrote:
> [...]
>
>>> Instead, what I can tell, as we have been looking into converting mmc
>>> (which I maintains) and that is indeed a significant amount of work.
>>> We will need to rip out all of the mmc request management, and most
>>> likely we also need to extend the blkmq interface - as to be able to
>>> do re-implement all the current request optimizations. We are looking
>>> into this, but it just takes time.
>>
>>
>> It's usually as much work as you make it into, for most cases it's
>> pretty straight forward and usually removes more code than it adds.
>> Hence the end result is better for it as well - less code in a driver is
>> better.
>
> From a scalability and maintenance point of view, converting to blkmq
> makes perfect sense.
>
> Although, me personally don't want to sacrifice on performance (at
> least very little), just for the sake of gaining in
> scalability/maintainability.

Nobody has said anything about sacrificing performance. And whether you
like it or not, maintainability is always the most important aspect.
Even performance takes a backseat to maintainability.

> I would rather strive to adopt the blkmq framework to also suit my
> needs. Then it simply do takes more time.
>
> For example, in the mmc case we have implemented an asynchronous
> request path, which greatly improves performance on some systems.

blk-mq has evolved to support a variety of devices, there's nothing
special about mmc that can't work well within that framework.

>>>>> 3)
>>>>> While we work on scheduling in blkmq (at least for single queue
>>>>> devices), it's of course important that we set high goals. Having BFQ
>>>>> (and the other schedulers) in the legacy blk, provides a good
>>>>> reference for what we could aim for.
>>>>
>>>>
>>>>
>>>> Sure, but you don't need BFQ to be included in the kernel for that.
>>>
>>>
>>> Perhaps not.
>>>
>>> But does that mean, you expect Paolo to maintain an up to date BFQ
>>> tree for you?
>>
>>
>> I don't expect anything. If Paolo or others want to compare with BFQ on
>> the legacy IO path, then they can do that however way they want. If you
>> (and others) want to have that reference point, it's up to you how to
>> accomplish that.
>
> Do I get this right? You personally don't care about using BFQ as
> reference when evolving blkmq for single queue devices?
>
> Paolo and lots of other Linux users certainly do care about this.

I'm getting a little tired of this putting words in my mouth... That is
not what I'm saying at all. What I'm saying is that the people working
on BFQ can do what they need to do to have a reference implementation to
compare against. You don't need BFQ in the kernel for that. I said it's
up to YOU, with the you here meaning the people that want to work on it,
how that goes down.

> Moreover, I am still trying to understand what's the big deal to why
> you say no to BFQ as a legacy scheduler. Ideally it shouldn't cause
> you any maintenance burden and it doesn't make the removal of the
> legacy blk layer any more difficult, right?

Not sure I can state it much clearer. It's a new scheduler, and a
complicated one at that. It WILL carry a maintenance burden. And I'm
really not that interested in adding such a burden for something that
will be defunct as soon as the single queue blk-mq version is done.
Additionally, if we put BFQ in right now, the motivation to do the real
work will be gone.

The path forward is clear. It'd be a lot better to put some work behind
that, rather than continue this email thread.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 19:45                           ` Christoph Hellwig
@ 2016-10-27 22:01                             ` Mark Brown
  0 siblings, 0 replies; 57+ messages in thread
From: Mark Brown @ 2016-10-27 22:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Ulf Hansson, Paolo Valente, Arnd Bergmann,
	Bart Van Assche, Jan Kara, Tejun Heo, linux-block, Linux-Kernal,
	Linus Walleij, Hannes Reinecke, Grant Likely, James Bottomley

On Thu, Oct 27, 2016 at 12:45:48PM -0700, Christoph Hellwig wrote:
> On Thu, Oct 27, 2016 at 08:41:27PM +0100, Mark Brown wrote:

> > Plus the benchmarking to verify that it works well of course, especially
> > initially where it'll also be a new queue infrastructure as well as the
> > blk-mq conversion itself.  It does feel like something that's going to
> > take at least a couple of kernel releases to get through.

> Or to put it the other way around:  it could have been long done
> if people had started it the first it was suggestead.  Instead you guys
> keep arguing and nothing gets done.  Get started now, waiting won't
> make anything go faster.

There are things going on already, like the effort to convert MMC to
blk-mq, and there have been some initial emails with Omar about how best
to collaborate on his existing work (which was pointed out as the way
forward) so that things are useful and we avoid duplication of effort.
In any case the situation is what it is; we can't change the past.

> > I think there's also value in having improvements there for people who
> > benefit from them while queue infrastructure for blk-mq is being worked
> > on.  

> Well, apply it to you vendor tree then and maintain it yourself if you
> disagree with our direction.

I don't think there's any substantial disagreement about where we want
to end up; it's a much more tactical discussion about what we do while
we're on the way there. Just saying "put the changes in your vendor
tree" isn't ideal: it's not like there's some singular vendor tree out
there that everyone uses, and doing things in vendor trees is what we
mostly encourage people to avoid doing.

If it were something that was actively disruptive for other users, or
it made the blk-mq code harder to work with, then it'd be clear that
having it upstream would cause problems, but that doesn't seem to be
the case here. Similarly, if blk-mq were already at the point where it
could replace blk, then it'd be clear that drivers should just be
converted. Instead we're somewhere in the middle: it wouldn't be
entirely free to put something in, but on the other hand it helps
solve people's problems, and where it does have costs, those costs
also provide a hook that helps pull people into working with the
community more.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 21:08                           ` Jens Axboe
@ 2016-10-27 22:27                             ` Linus Walleij
  2016-10-28  9:32                               ` Linus Walleij
  2016-10-28 14:07                               ` Jens Axboe
  2016-10-28  6:36                             ` Ulf Hansson
  1 sibling, 2 replies; 57+ messages in thread
From: Linus Walleij @ 2016-10-27 22:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ulf Hansson, Paolo Valente, Christoph Hellwig, Arnd Bergmann,
	Bart Van Assche, Jan Kara, Tejun Heo, linux-block, Linux-Kernal,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley

On Thu, Oct 27, 2016 at 11:08 PM, Jens Axboe <axboe@kernel.dk> wrote:

> blk-mq has evolved to support a variety of devices, there's nothing
> special about mmc that can't work well within that framework.

There is. Read mmc_queue_thread() in drivers/mmc/card/queue.c

This repeatedly calls req = blk_fetch_request(q);, starting one request
and then getting the next one off the queue, including reading
a few NULL requests off the end of the queue (to satisfy the
semantics of its state machine).

It then preprocesses each request by essentially calling .pre() and .post()
hooks all the way down to the driver, flushing its mapped
sglist from CPU to DMA device memory (not a problem on x86 and
other DMA-coherent archs, but a big win on the incoherent ones).

In the attempt that was posted recently this is achieved by lying
and saying the HW queue is two items deep, and eating requests
off that queue while calling pre/post on them.

But as there actually exist MMC cards with command queueing, this
would become hopeless to handle; the HW queue depth has to reflect
the real depth. What we need is for the block core to call pre/post
hooks on each request.
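
As a rough illustration of that pre/post pipelining (this is not the real
MMC code; all names below are invented), the desired shape is: prepare
request N+1 while request N is still outstanding, then clean up N once it
completes, then issue N+1:

#include <stdio.h>

struct toy_req { int id; };

static void toy_pre(struct toy_req *r)   { printf("pre  (map sglist)  req %d\n", r->id); }
static void toy_issue(struct toy_req *r) { printf("issue to hardware  req %d\n", r->id); }
static void toy_post(struct toy_req *r)  { printf("post (unmap)       req %d\n", r->id); }

static void toy_queue_loop(struct toy_req *reqs, int n)
{
	struct toy_req *cur = NULL, *next;

	for (int i = 0; i <= n; i++) {
		next = (i < n) ? &reqs[i] : NULL;	/* NULL marks the end of the queue */
		if (next)
			toy_pre(next);	/* prepare the next request up front */
		if (cur)
			toy_post(cur);	/* a real driver would first wait for cur to finish */
		if (next)
			toy_issue(next);
		cur = next;
	}
}

int main(void)
{
	struct toy_req reqs[] = { {1}, {2}, {3} };

	toy_queue_loop(reqs, 3);
	return 0;
}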

The "only" thing that doesn't work well after that is that CFQ is no
longer in action, which will have interesting effects on MMC throughput
in any fio-like stress test as it is mostly single-hw-queue.

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 21:08                           ` Jens Axboe
  2016-10-27 22:27                             ` Linus Walleij
@ 2016-10-28  6:36                             ` Ulf Hansson
  2016-10-28 14:17                               ` Jens Axboe
  1 sibling, 1 reply; 57+ messages in thread
From: Ulf Hansson @ 2016-10-28  6:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche,
	Jan Kara, Tejun Heo, linux-block, Linux-Kernal, Linus Walleij,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley

[...]

>
>> Moreover, I am still trying to understand what's the big deal to why
>> you say no to BFQ as a legacy scheduler. Ideally it shouldn't cause
>> you any maintenance burden and it doesn't make the removal of the
>> legacy blk layer any more difficult, right?
>
>
> Not sure I can state it much clearer. It's a new scheduler, and a
> complicated one at that. It WILL carry a maintenance burden. And I'm

Really? Either you maintain the code or you don't. And if Paolo were to
do it, then you are off the hook!

> really not that interested in adding such a burden for something that
> will be defunct as soon as the single queue blk-mq version is done.
> Additionally, if we put BFQ in right now, the motivation to do the real
> work will be gone.

You have been pushing Paolo in different directions throughout the
years with his work on BFQ, wasting lots of his time and effort.

You have not given him any credit for his work on BFQ, and now you
point him in yet another direction.

I understand Paolo is a very persistent, hard-working guy, most likely
because he is really confident about his work on BFQ - and he should be!

But, regarding motivation, if you continue to push him in different
directions without giving him any credit, then at some point you
probably know what will happen.

>
> The path forward is clear. It'd be a lot better to put some work behind
> that, rather than continue this email thread.

Yes, it seems so!

Kind regards
Ulf Hansson

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 16:26                   ` Jens Axboe
@ 2016-10-28  7:59                     ` Jan Kara
  2016-10-28 14:10                       ` Jens Axboe
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Kara @ 2016-10-28  7:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, Paolo Valente, Christoph Hellwig, Arnd Bergmann,
	Bart Van Assche, Tejun Heo, linux-block, Linux-Kernal,
	Ulf Hansson, Linus Walleij, Mark Brown, Hannes Reinecke,
	grant.likely, James.Bottomley

On Thu 27-10-16 10:26:18, Jens Axboe wrote:
> On 10/27/2016 03:26 AM, Jan Kara wrote:
> >On Wed 26-10-16 10:12:38, Jens Axboe wrote:
> >>On 10/26/2016 10:04 AM, Paolo Valente wrote:
> >>>
> >>>>Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe <axboe@kernel.dk> ha scritto:
> >>>>
> >>>>On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
> >>>>>On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
> >>>>>>The question to ask first is whether to actually have pluggable
> >>>>>>schedulers on blk-mq at all, or just have one that is meant to
> >>>>>>do the right thing in every case (and possibly can be bypassed
> >>>>>>completely).
> >>>>>
> >>>>>That would be my preference.  Have a BFQ-variant for blk-mq as an
> >>>>>option (default to off unless opted in by the driver or user), and
> >>>>>not other scheduler for blk-mq.  Don't bother with bfq for non
> >>>>>blk-mq.  It's not like there is any advantage in the legacy-request
> >>>>>device even for slow devices, except for the option of having I/O
> >>>>>scheduling.
> >>>>
> >>>>It's the only right way forward. blk-mq might not offer any substantial
> >>>>advantages to rotating storage, but with scheduling, it won't offer a
> >>>>downside either. And it'll take us towards the real goal, which is to
> >>>>have just one IO path.
> >>>
> >>>ok
> >>>
> >>>>Adding a new scheduler for the legacy IO path
> >>>>makes no sense.
> >>>
> >>>I would fully agree if effective and stable I/O scheduling would be
> >>>available in blk-mq in one or two months.  But I guess that it will
> >>>take at least one year optimistically, given the current status of the
> >>>needed infrastructure, and given the great difficulties of doing
> >>>effective scheduling at the high parallelism and extreme target speeds
> >>>of blk-mq.  Of course, this holds true unless little clever scheduling
> >>>is performed.
> >>>
> >>>So, what's the point in forcing a lot of users wait another year or
> >>>more, for a solution that has yet to be even defined, while they could
> >>>enjoy a much better system, and then switch an even better system when
> >>>scheduling is ready in blk-mq too?
> >>
> >>That same argument could have been made 2 years ago. Saying no to a new
> >>scheduler for the legacy framework goes back roughly that long. We could
> >>have had BFQ for mq NOW, if we didn't keep coming back to this very
> >>point.
> >>
> >>I'm hesistant to add a new scheduler because it's very easy to add, very
> >>difficult to get rid of. If we do add BFQ as a legacy scheduler now,
> >>it'll take us years and years to get rid of it again. We should be
> >>moving towards LESS moving parts in the legacy path, not more.
> >>
> >>We can keep having this discussion every few years, but I think we'd
> >>both prefer to make some actual progress here. It's perfectly fine to
> >>add an interface for a single queue interface for an IO scheduler for
> >>blk-mq, since we don't care too much about scalability there. And that
> >>won't take years, that should be a few weeks. Retrofitting BFQ on top of
> >>that should not be hard either. That can co-exist with a real multiqueue
> >>scheduler as well, something that's geared towards some fairness for
> >>faster devices.
> >
> >OK, so some solution like having a variant of blk_sq_make_request() that
> >will consume requests, do IO scheduling decisions on them, and feed them
> >into the HW queue is it sees fit would be acceptable? That will provide the
> >IO scheduler a global view that it needs for complex scheduling decisions
> >so it should indeed be relatively easy to port BFQ to work like that.
> 
> I'd probably start off Omar's base [1] that switches the software queues
> to store bios instead of requests, since that lifts the of the 1:1
> mapping between what we can queue up and what we can dispatch. Without
> that, the IO scheduler won't have too much to work with. And with that
> in place, it'll be a "bio in, request out" type of setup, which is
> similar to what we have in the legacy path.
>
> I'd keep the software queues, but as a starting point, mandate 1
> hardware queue to keep that as the per-device view of the state. The IO
> scheduler would be responsible for moving one or more bios from the
> software queues to the hardware queue, when they are ready to dispatch.
> 
> [1] https://github.com/osandov/linux/commit/8ef3508628b6cf7c4712cd3d8084ee11ef5d2530

Yeah, but what would software queues actually be good for with a single
queue device and device-global IO scheduling? The IO scheduler doing
complex decisions will keep all the bios / requests in a single structure
anyway, so there's no scalability to gain from per-cpu software queues...
So you can directly consume bios in your ->make_request handler, place them
in IO scheduler structures and then push requests out to the HW queue in
response to HW tags getting freed (i.e. IO completion). No need
for intermediate software queues. But maybe I'm missing something.
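
A very rough sketch of that shape, just to make the idea concrete (this
is not actual kernel code; iosched_data, iosched_add_bio() and
iosched_dispatch() are made-up names and the locking is only
indicative):

static blk_qc_t iosched_make_request(struct request_queue *q,
                                     struct bio *bio)
{
        struct iosched_data *sd = iosched_data_of(q);  /* hypothetical lookup */

        spin_lock_irq(&sd->lock);
        iosched_add_bio(sd, bio);       /* one device-global structure */
        spin_unlock_irq(&sd->lock);

        iosched_dispatch(q, sd);        /* build requests, feed the HW queue */
        return BLK_QC_T_NONE;
}

/* called from the completion path whenever a HW tag is freed */
static void iosched_tag_freed(struct request_queue *q, struct iosched_data *sd)
{
        iosched_dispatch(q, sd);
}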

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 22:27                             ` Linus Walleij
@ 2016-10-28  9:32                               ` Linus Walleij
  2016-10-28 14:22                                 ` Jens Axboe
                                                   ` (2 more replies)
  2016-10-28 14:07                               ` Jens Axboe
  1 sibling, 3 replies; 57+ messages in thread
From: Linus Walleij @ 2016-10-28  9:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ulf Hansson, Paolo Valente, Christoph Hellwig, Arnd Bergmann,
	Bart Van Assche, Jan Kara, Tejun Heo, linux-block, Linux-Kernal,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley,
	Bartlomiej Zolnierkiewicz

On Fri, Oct 28, 2016 at 12:27 AM, Linus Walleij
<linus.walleij@linaro.org> wrote:
> On Thu, Oct 27, 2016 at 11:08 PM, Jens Axboe <axboe@kernel.dk> wrote:
>
>> blk-mq has evolved to support a variety of devices, there's nothing
>> special about mmc that can't work well within that framework.
>
> There is. Read mmc_queue_thread() in drivers/mmc/card/queue.c

So I'm not just complaining by the way, I'm trying to fix this. Also
Bartlomiej from Samsung has done some stabs at switching MMC/SD
to blk-mq. I just rebased my latest stab at a naïve switch to blk-mq
to v4.9-rc2 with these results.

The patch to enable MQ looks like this:
https://git.kernel.org/cgit/linux/kernel/git/linusw/linux-stericsson.git/commit/?h=mmc-mq&id=8f79b527e2e854071d8da019451da68d4753f71d

I run these tests directly after boot with cold caches. The results
are consistent: I ran the same commands 10 times in a row.


BEFORE switching to BLK-MQ (clean v4.9-rc2):

time dd if=/dev/mmcblk0 of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.0GB) copied, 47.781464 seconds, 21.4MB/s
real    0m 47.79s
user    0m 0.02s
sys     0m 9.35s

mount /dev/mmcblk0p1 /mnt/
cd /mnt/
time find . > /dev/null
real    0m 3.60s
user    0m 0.25s
sys     0m 1.58s

mount /dev/mmcblk0p1 /mnt/
iozone -az -i0 -i1 -i2 -s 20m -I -f /mnt/foo.test
(kBytes/second)
                                                    random    random
    kB  reclen    write  rewrite    read    reread    read     write
 20480       4     2112     2157     6052     6060     6025       40
 20480       8     4820     5074     9163     9121     9125       81
 20480      16     5755     5242    12317    12320    12280      165
 20480      32     6176     6261    14981    14987    14962      336
 20480      64     6547     5875    16826    16828    16810      692
 20480     128     6762     6828    17899    17896    17896     1408
 20480     256     6802     6871    16960    17513    18373     3048
 20480     512     7220     7252    18675    18746    18741     7228
 20480    1024     7222     7304    18436    17858    18246     7322
 20480    2048     7316     7398    18744    18751    18526     7419
 20480    4096     7520     7636    20774    20995    20703     7609
 20480    8192     7519     7704    21850    21489    21467     7663
 20480   16384     7395     7782    22399    22210    22215     7781


AFTER switching to BLK-MQ:

time dd if=/dev/mmcblk0 of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.0GB) copied, 60.551117 seconds, 16.9MB/s
real    1m 0.56s
user    0m 0.02s
sys     0m 9.81s

mount /dev/mmcblk0p1 /mnt/
cd /mnt/
time find . > /dev/null
real    0m 4.42s
user    0m 0.24s
sys     0m 1.81s

mount /dev/mmcblk0p1 /mnt/
iozone -az -i0 -i1 -i2 -s 20m -I -f /mnt/foo.test
(kBytes/second)
                                                    random    random
    kB  reclen    write  rewrite    read    reread    read     write
 20480       4     2086     2201     6024     6036     6006       40
 20480       8     4812     5036     8014     9121     9090       82
 20480      16     5432     5633    12267     9776    12212      168
 20480      32     6180     6233    14870    14891    14852      340
 20480      64     6382     5454    16744    16771    16746      702
 20480     128     6761     6776    17816    17846    17836     1394
 20480     256     6828     6842    17789    17895    17094     3084
 20480     512     7158     7222    17957    17681    17698     7232
 20480    1024     7215     7274    18642    17679    18031     7300
 20480    2048     7229     7269    17943    18642    17732     7358
 20480    4096     7212     7360    18272    18157    18889     7371
 20480    8192     7008     7271    18632    18707    18225     7282
 20480   16384     6889     7211    18243    18429    18018     7246


A simple dd readtest of 1 GB is always consistently 10+
seconds slower with MQ. find in the rootfs is a second slower.
iozone results are consistently lower throughput or the same.

This is without using Bartlomiej's clever hack to pretend we have
2 elements in the HW queue though. His early tests indicate that
it doesn't help much: the performance regression we see is due to
lack of block scheduling.

I try to find a way forward with this, and also massage the MMC/SD
code to be more MQ friendly to begin with (like only pick requests
when we get a request notification and stop pulling NULL requests
off the queue) but it's really a messy piece of code.

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 18:13                     ` Ulf Hansson
  2016-10-27 18:21                       ` Jens Axboe
@ 2016-10-28 12:07                       ` Arnd Bergmann
  2016-10-28 12:17                         ` Richard Weinberger
  1 sibling, 1 reply; 57+ messages in thread
From: Arnd Bergmann @ 2016-10-28 12:07 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Jens Axboe, Paolo Valente, Christoph Hellwig, Bart Van Assche,
	Jan Kara, Tejun Heo, linux-block, Linux-Kernal, Linus Walleij,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley

On Thursday, October 27, 2016 8:13:08 PM CEST Ulf Hansson wrote:
> On 27 October 2016 at 19:43, Jens Axboe <axboe@kernel.dk> wrote:
> > On 10/27/2016 11:32 AM, Ulf Hansson wrote:
> >>
> >> [...]
> >>
> >>>
> >>> I'm hesistant to add a new scheduler because it's very easy to add, very
> >>> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
> >>> it'll take us years and years to get rid of it again. We should be
> >>> moving towards LESS moving parts in the legacy path, not more.
> >>
> >>
> >> Jens, I think you are wrong here and let me try to elaborate on why.
> >>
> >> 1)
> >> We already have legacy schedulers like CFQ, DEADLINE, etc - and most
> >> block device drivers are still using the legacy blk interface.
> >
> >
> > I don't think that's an accurate statement. In terms of coverage, most
> > drivers do support blk-mq. Anything SCSI, nvme, virtio-blk, SATA runs on
> > (or can run on) top of blk-mq.
> 
> Well, I just used "git grep" and found that many drivers didn't use
> blk-mq. Apologies if I gave the wrong impression.

To clarify, this seems to be a complete list:

$ git grep -wl '\(__\|\)blk_\(fetch\|end\|start\)_request' | xargs grep -L blk_mq
Documentation/scsi/scsi_eh.txt
arch/um/drivers/ubd_kern.c
block/blk-tag.c
block/bsg-lib.c
drivers/block/DAC960.c
drivers/block/amiflop.c
drivers/block/aoe/aoeblk.c
drivers/block/aoe/aoecmd.c
drivers/block/aoe/aoedev.c
drivers/block/ataflop.c
drivers/block/cciss.c
drivers/block/floppy.c
drivers/block/hd.c
drivers/block/mg_disk.c
drivers/block/osdblk.c
drivers/block/paride/pcd.c
drivers/block/paride/pd.c
drivers/block/paride/pf.c
drivers/block/ps3disk.c
drivers/block/skd_main.c
drivers/block/sunvdc.c
drivers/block/swim.c
drivers/block/swim3.c
drivers/block/sx8.c
drivers/block/xsysace.c
drivers/block/z2ram.c
drivers/cdrom/gdrom.c
drivers/ide/ide-atapi.c
drivers/ide/ide-io.c
drivers/ide/ide-pm.c
drivers/memstick/core/ms_block.c
drivers/memstick/core/mspro_block.c
drivers/mmc/card/block.c
drivers/mmc/card/queue.c
drivers/mtd/mtd_blkdevs.c
drivers/s390/block/dasd.c
drivers/s390/block/scm_blk.c
drivers/sbus/char/jsflash.c
drivers/scsi/osd/osd_initiator.c
drivers/scsi/scsi_transport_fc.c
drivers/scsi/scsi_transport_sas.c
samples/bpf/tracex3_kern.c

From what I can tell, most of these are hopelessly obsolete, but
there are some notable exceptions: aoe, osdblk, skd, sunvdc, mtdblk,
mmc, dasd and scm. I've never used any of the first four, but the
last four of the list are certainly important (for very different
reasons).

	Arnd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-28 12:07                       ` Arnd Bergmann
@ 2016-10-28 12:17                         ` Richard Weinberger
  0 siblings, 0 replies; 57+ messages in thread
From: Richard Weinberger @ 2016-10-28 12:17 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Ulf Hansson, Jens Axboe, Paolo Valente, Christoph Hellwig,
	Bart Van Assche, Jan Kara, Tejun Heo, linux-block, Linux-Kernal,
	Linus Walleij, Mark Brown, Hannes Reinecke, Grant Likely,
	James Bottomley, Daniel Walter, Anton Ivanov

On Fri, Oct 28, 2016 at 2:07 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>> > I don't think that's an accurate statement. In terms of coverage, most
>> > drivers do support blk-mq. Anything SCSI, nvme, virtio-blk, SATA runs on
>> > (or can run on) top of blk-mq.
>>
>> Well, I just used "git grep" and found that many drivers didn't use
>> blk-mq. Apologies if I gave the wrong impression.
>
> To clarify, this seems to be a complete list:
>
> $ git grep -wl '\(__\|\)blk_\(fetch\|end\|start\)_request' | xargs grep -L blk_mq
> Documentation/scsi/scsi_eh.txt
> arch/um/drivers/ubd_kern.c

AFAICT Daniel looked at the UML block driver and did an initial
conversion some time ago.
Daniel?
Anton is also working on a patch series to speed up the driver.
Maybe it is time to bite the bullet and do the conversion.

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-27 22:27                             ` Linus Walleij
  2016-10-28  9:32                               ` Linus Walleij
@ 2016-10-28 14:07                               ` Jens Axboe
  1 sibling, 0 replies; 57+ messages in thread
From: Jens Axboe @ 2016-10-28 14:07 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Ulf Hansson, Paolo Valente, Christoph Hellwig, Arnd Bergmann,
	Bart Van Assche, Jan Kara, Tejun Heo, linux-block, Linux-Kernal,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley

On 10/27/2016 04:27 PM, Linus Walleij wrote:
> On Thu, Oct 27, 2016 at 11:08 PM, Jens Axboe <axboe@kernel.dk> wrote:
>
>> blk-mq has evolved to support a variety of devices, there's nothing
>> special about mmc that can't work well within that framework.
>
> There is. Read mmc_queue_thread() in drivers/mmc/card/queue.c
>
> This repeatedly calls req = blk_fetch_request(q), starting one request
> and then getting the next one off the queue, including reading
> a few NULL requests off the end of the queue (to satisfy the
> semantics of its state machine).
>
> It then preprocesses each request by essentially calling .pre() and .post()
> hooks all the way down to the driver, flushing its mapped
> sglist from CPU to DMA device memory (not a problem on x86 and
> other DMA-coherent archs, but a big win on the incoherent ones).
>
> In the attempt that was posted recently this is achieved by lying
> and saying the HW queue is two items deep and eating requests
> off that queue calling pre/post on them.
>
> But as there actually exist MMC cards with command queueing, this
> would become hopeless to handle: the hw queue depth has to reflect
> the real depth. What we need is for the block core to call pre/post
> hooks on each request.
>
> The "only" thing that doesn't work well after that is that CFQ is no
> longer in action, which will have interesting effects on MMC throughput
> in any fio-like stress test as it is mostly single-hw-queue.

That will cause you pain with any IO scheduler that has more complex
state, like CFQ and BFQ... I looked at the code but I don't quite get
why it is handling requests like that. Care to expand? Is it a
performance optimization? It looks fairly convoluted for some reason. I
would imagine that latency would be one of the more important aspects
for mmc, yet the driver has a context switch for each sync IO.
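
For reference, the loop in question has roughly the following shape
(heavily condensed and with simplified field names, so treat it as a
paraphrase of drivers/mmc/card/queue.c rather than the real code):

static int mmc_queue_thread(void *d)
{
        struct mmc_queue *mq = d;
        struct request_queue *q = mq->queue;

        do {
                struct request *req;

                spin_lock_irq(q->queue_lock);
                set_current_state(TASK_INTERRUPTIBLE);
                req = blk_fetch_request(q);     /* may well be NULL */
                spin_unlock_irq(q->queue_lock);

                if (req || mq->prev_req_pending) {      /* simplified name */
                        set_current_state(TASK_RUNNING);
                        /* issue_fn() runs the pre/post hooks; it is also
                         * called with req == NULL so that the previous
                         * request can finish its post-processing */
                        mq->issue_fn(mq, req);
                } else {
                        if (kthread_should_stop())
                                break;
                        schedule();     /* wait for the next request */
                }
        } while (1);

        return 0;
}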

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-28  7:59                     ` Jan Kara
@ 2016-10-28 14:10                       ` Jens Axboe
  0 siblings, 0 replies; 57+ messages in thread
From: Jens Axboe @ 2016-10-28 14:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche,
	Tejun Heo, linux-block, Linux-Kernal, Ulf Hansson, Linus Walleij,
	Mark Brown, Hannes Reinecke, grant.likely, James.Bottomley

On 10/28/2016 01:59 AM, Jan Kara wrote:
> On Thu 27-10-16 10:26:18, Jens Axboe wrote:
>> On 10/27/2016 03:26 AM, Jan Kara wrote:
>>> On Wed 26-10-16 10:12:38, Jens Axboe wrote:
>>>> On 10/26/2016 10:04 AM, Paolo Valente wrote:
>>>>>
>>>>>> Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe <axboe@kernel.dk> ha scritto:
>>>>>>
>>>>>> On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>>>>>>> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>>>>>>>> The question to ask first is whether to actually have pluggable
>>>>>>>> schedulers on blk-mq at all, or just have one that is meant to
>>>>>>>> do the right thing in every case (and possibly can be bypassed
>>>>>>>> completely).
>>>>>>>
>>>>>>> That would be my preference.  Have a BFQ-variant for blk-mq as an
>>>>>>> option (default to off unless opted in by the driver or user), and
>>>>>>> not other scheduler for blk-mq.  Don't bother with bfq for non
>>>>>>> blk-mq.  It's not like there is any advantage in the legacy-request
>>>>>>> device even for slow devices, except for the option of having I/O
>>>>>>> scheduling.
>>>>>>
>>>>>> It's the only right way forward. blk-mq might not offer any substantial
>>>>>> advantages to rotating storage, but with scheduling, it won't offer a
>>>>>> downside either. And it'll take us towards the real goal, which is to
>>>>>> have just one IO path.
>>>>>
>>>>> ok
>>>>>
>>>>>> Adding a new scheduler for the legacy IO path
>>>>>> makes no sense.
>>>>>
>>>>> I would fully agree if effective and stable I/O scheduling would be
>>>>> available in blk-mq in one or two months.  But I guess that it will
>>>>> take at least one year optimistically, given the current status of the
>>>>> needed infrastructure, and given the great difficulties of doing
>>>>> effective scheduling at the high parallelism and extreme target speeds
>>>>> of blk-mq.  Of course, this holds true unless little clever scheduling
>>>>> is performed.
>>>>>
>>>>> So, what's the point in forcing a lot of users wait another year or
>>>>> more, for a solution that has yet to be even defined, while they could
>>>>> enjoy a much better system, and then switch an even better system when
>>>>> scheduling is ready in blk-mq too?
>>>>
>>>> That same argument could have been made 2 years ago. Saying no to a new
>>>> scheduler for the legacy framework goes back roughly that long. We could
>>>> have had BFQ for mq NOW, if we didn't keep coming back to this very
>>>> point.
>>>>
>>>> I'm hesitant to add a new scheduler because it's very easy to add, very
>>>> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
>>>> it'll take us years and years to get rid of it again. We should be
>>>> moving towards LESS moving parts in the legacy path, not more.
>>>>
>>>> We can keep having this discussion every few years, but I think we'd
>>>> both prefer to make some actual progress here. It's perfectly fine to
>>>> add an interface for a single queue interface for an IO scheduler for
>>>> blk-mq, since we don't care too much about scalability there. And that
>>>> won't take years, that should be a few weeks. Retrofitting BFQ on top of
>>>> that should not be hard either. That can co-exist with a real multiqueue
>>>> scheduler as well, something that's geared towards some fairness for
>>>> faster devices.
>>>
>>> OK, so some solution like having a variant of blk_sq_make_request() that
>>> will consume requests, do IO scheduling decisions on them, and feed them
>>> into the HW queue as it sees fit would be acceptable? That will provide the
>>> IO scheduler a global view that it needs for complex scheduling decisions
>>> so it should indeed be relatively easy to port BFQ to work like that.
>>
>> I'd probably start off Omar's base [1] that switches the software queues
>> to store bios instead of requests, since that lifts the 1:1
>> mapping between what we can queue up and what we can dispatch. Without
>> that, the IO scheduler won't have too much to work with. And with that
>> in place, it'll be a "bio in, request out" type of setup, which is
>> similar to what we have in the legacy path.
>>
>> I'd keep the software queues, but as a starting point, mandate 1
>> hardware queue to keep that as the per-device view of the state. The IO
>> scheduler would be responsible for moving one or more bios from the
>> software queues to the hardware queue, when they are ready to dispatch.
>>
>> [1] https://github.com/osandov/linux/commit/8ef3508628b6cf7c4712cd3d8084ee11ef5d2530
>
> Yeah, but what would software queues actually be good for with a single
> queue device and device-global IO scheduling? The IO scheduler doing
> complex decisions will keep all the bios / requests in a single structure
> anyway, so there's no scalability to gain from per-cpu software queues...
> So you can directly consume bios in your ->make_request handler, place them
> in IO scheduler structures and then push requests out to the HW queue in
> response to HW tags getting freed (i.e. IO completion). No need
> for intermediate software queues. But maybe I'm missing something.

The software queues tend to relieve lock contention on the
submission side. It's one of the reasons why the single queue blk-mq
still scales a lot better than the old request_fn model.

If you bypass and grab them at make_request time, I'd be worried that we
are now losing the various support functionality we have for block
devices, or having to implement that differently.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-28  6:36                             ` Ulf Hansson
@ 2016-10-28 14:17                               ` Jens Axboe
  2016-10-28 17:12                                 ` Mark Brown
  0 siblings, 1 reply; 57+ messages in thread
From: Jens Axboe @ 2016-10-28 14:17 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Paolo Valente, Christoph Hellwig, Arnd Bergmann, Bart Van Assche,
	Jan Kara, Tejun Heo, linux-block, Linux-Kernal, Linus Walleij,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley

On 10/28/2016 12:36 AM, Ulf Hansson wrote:
> [...]
>
>>
>>> Moreover, I am still trying to understand what's the big deal to why
>>> you say no to BFQ as a legacy scheduler. Ideally it shouldn't cause
>>> you any maintenance burden and it doesn't make the removal of the
>>> legacy blk layer any more difficult, right?
>>
>>
>> Not sure I can state it much clearer. It's a new scheduler, and a
>> complicated one at that. It WILL carry a maintenance burden. And I'm
>
> Really? Either you maintain the code or not. And if Paolo does it,
> then you are off the hook!

Are you trying to be deliberately obtuse? If so, good job. I'd advise 
you to look into how code in the kernel is maintained in general. A 
maintenance burden exists for code A, but it also carries over to the 
subsystem it is under, and the kernel in general. Adding code is never free.

>> really not that interested in adding such a burden for something that
>> will be defunct as soon as the single queue blk-mq version is done.
>> Additionally, if we put BFQ in right now, the motivation to do the real
>> work will be gone.
>
> You have been pushing Paolo in different directions throughout the
> years with his work in BFQ, wasting lots of his time/effort.

I have not. Various entities have advised Paolo to approach it in various 
ways. We've had blk-mq for 3 years now, my position should have been 
pretty clear on that.

> You have not given him any credit for his work on BFQ, and now you
> point him in yet another direction.

I don't even know what that means. But I'm not pointing him in a new 
direction.

Ulf, I'm done discussing with you. I've made my position clear, yet you 
continue to beat on a dead horse. As far as I'm concerned, there's 
nothing further to discuss here. I'll be happy to discuss when there's 
some meat on the bone (ie code). Until then, EOD.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-28  9:32                               ` Linus Walleij
@ 2016-10-28 14:22                                 ` Jens Axboe
  2016-10-28 20:38                                   ` Linus Walleij
  2016-10-28 15:29                                 ` Christoph Hellwig
  2016-10-28 15:30                                 ` Jens Axboe
  2 siblings, 1 reply; 57+ messages in thread
From: Jens Axboe @ 2016-10-28 14:22 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Ulf Hansson, Paolo Valente, Christoph Hellwig, Arnd Bergmann,
	Bart Van Assche, Jan Kara, Tejun Heo, linux-block, Linux-Kernal,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley,
	Bartlomiej Zolnierkiewicz

On 10/28/2016 03:32 AM, Linus Walleij wrote:
> On Fri, Oct 28, 2016 at 12:27 AM, Linus Walleij
> <linus.walleij@linaro.org> wrote:
>> On Thu, Oct 27, 2016 at 11:08 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>>> blk-mq has evolved to support a variety of devices, there's nothing
>>> special about mmc that can't work well within that framework.
>>
>> There is. Read mmc_queue_thread() in drivers/mmc/card/queue.c
>
> So I'm not just complaining by the way, I'm trying to fix this. Also
> Bartlomiej from Samsung has done some stabs at switching MMC/SD
> to blk-mq. I just rebased my latest stab at a naïve switch to blk-mq
> to v4.9-rc2 with these results.
>
> The patch to enable MQ looks like this:
> https://git.kernel.org/cgit/linux/kernel/git/linusw/linux-stericsson.git/commit/?h=mmc-mq&id=8f79b527e2e854071d8da019451da68d4753f71d
>
> I run these tests directly after boot with cold caches. The results
> are consistent: I ran the same commands 10 times in a row.
>
>
> BEFORE switching to BLK-MQ (clean v4.9-rc2):
>
> time dd if=/dev/mmcblk0 of=/dev/null bs=1M count=1024
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes (1.0GB) copied, 47.781464 seconds, 21.4MB/s
> real    0m 47.79s
> user    0m 0.02s
> sys     0m 9.35s
>
> mount /dev/mmcblk0p1 /mnt/
> cd /mnt/
> time find . > /dev/null
> real    0m 3.60s
> user    0m 0.25s
> sys     0m 1.58s
>
> mount /dev/mmcblk0p1 /mnt/
> iozone -az -i0 -i1 -i2 -s 20m -I -f /mnt/foo.test
> (kBytes/second)
>                                                     random    random
>     kB  reclen    write  rewrite    read    reread    read     write
>  20480       4     2112     2157     6052     6060     6025       40
>  20480       8     4820     5074     9163     9121     9125       81
>  20480      16     5755     5242    12317    12320    12280      165
>  20480      32     6176     6261    14981    14987    14962      336
>  20480      64     6547     5875    16826    16828    16810      692
>  20480     128     6762     6828    17899    17896    17896     1408
>  20480     256     6802     6871    16960    17513    18373     3048
>  20480     512     7220     7252    18675    18746    18741     7228
>  20480    1024     7222     7304    18436    17858    18246     7322
>  20480    2048     7316     7398    18744    18751    18526     7419
>  20480    4096     7520     7636    20774    20995    20703     7609
>  20480    8192     7519     7704    21850    21489    21467     7663
>  20480   16384     7395     7782    22399    22210    22215     7781
>
>
> AFTER switching to BLK-MQ:
>
> time dd if=/dev/mmcblk0 of=/dev/null bs=1M count=1024
> 1024+0 records in
> 1024+0 records out
> 1073741824 bytes (1.0GB) copied, 60.551117 seconds, 16.9MB/s
> real    1m 0.56s
> user    0m 0.02s
> sys     0m 9.81s
>
> mount /dev/mmcblk0p1 /mnt/
> cd /mnt/
> time find . > /dev/null
> real    0m 4.42s
> user    0m 0.24s
> sys     0m 1.81s
>
> mount /dev/mmcblk0p1 /mnt/
> iozone -az -i0 -i1 -i2 -s 20m -I -f /mnt/foo.test
> (kBytes/second)
>                                                     random    random
>     kB  reclen    write  rewrite    read    reread    read     write
>  20480       4     2086     2201     6024     6036     6006       40
>  20480       8     4812     5036     8014     9121     9090       82
>  20480      16     5432     5633    12267     9776    12212      168
>  20480      32     6180     6233    14870    14891    14852      340
>  20480      64     6382     5454    16744    16771    16746      702
>  20480     128     6761     6776    17816    17846    17836     1394
>  20480     256     6828     6842    17789    17895    17094     3084
>  20480     512     7158     7222    17957    17681    17698     7232
>  20480    1024     7215     7274    18642    17679    18031     7300
>  20480    2048     7229     7269    17943    18642    17732     7358
>  20480    4096     7212     7360    18272    18157    18889     7371
>  20480    8192     7008     7271    18632    18707    18225     7282
>  20480   16384     6889     7211    18243    18429    18018     7246
>
>
> A simple dd readtest of 1 GB is always consistently 10+
> seconds slower with MQ. find in the rootfs is a second slower.
> iozone results are consistently lower throughput or the same.
>
> This is without using Bartlomiej's clever hack to pretend we have
> 2 elements in the HW queue though. His early tests indicate that
> it doesn't help much: the performance regression we see is due to
> lack of block scheduling.

A simple dd test, I don't see how that can be slower due to lack of
scheduling. There's nothing to schedule there, just issue them in order?
So that would probably be where I would start looking. A blktrace of the
in-kernel code and the blk-mq enabled code would perhaps be
enlightening. I don't think it's worth looking at the more complex test
cases until the dd test case is at least as fast as the non-mq version.
Was that with CFQ, btw, or what scheduler did it run?

It'd be nice to NOT have to rely on that fake QD=2 setup, since it will
mess with the IO scheduling as well.

> I try to find a way forward with this, and also massage the MMC/SD
> code to be more MQ friendly to begin with (like only pick requests
> when we get a request notification and stop pulling NULL requests
> off the queue) but it's really a messy piece of code.

Yeah, it does look pretty messy... I'd be happy to help out with that,
and particularly in figuring out why the direct conversion is slower for
a basic 'dd' test case.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-28  9:32                               ` Linus Walleij
  2016-10-28 14:22                                 ` Jens Axboe
@ 2016-10-28 15:29                                 ` Christoph Hellwig
  2016-10-28 21:09                                   ` Linus Walleij
  2016-10-28 15:30                                 ` Jens Axboe
  2 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2016-10-28 15:29 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Jens Axboe, Ulf Hansson, Paolo Valente, Christoph Hellwig,
	Arnd Bergmann, Bart Van Assche, Jan Kara, Tejun Heo, linux-block,
	Linux-Kernal, Mark Brown, Hannes Reinecke, Grant Likely,
	James Bottomley, Bartlomiej Zolnierkiewicz

On Fri, Oct 28, 2016 at 11:32:21AM +0200, Linus Walleij wrote:
> So I'm not just complaining by the way, I'm trying to fix this. Also
> Bartlomiej from Samsung has done some stabs at switching MMC/SD
> to blk-mq. I just rebased my latest stab at a naïve switch to blk-mq
> to v4.9-rc2 with these results.
> 
> The patch to enable MQ looks like this:
> https://git.kernel.org/cgit/linux/kernel/git/linusw/linux-stericsson.git/commit/?h=mmc-mq&id=8f79b527e2e854071d8da019451da68d4753f71d
> 
> I run these tests directly after boot with cold caches. The results
> are consistent: I ran the same commands 10 times in a row.

A couple comments from a quick look over the patch:

In the changelog you complain:

". Lack of front- and back-end merging in the MQ block layer creating
several small requests instead of a few large ones."

In blk-mq merging is controlled by the BLK_MQ_F_SHOULD_MERGE and
BLK_MQ_F_SG_MERGE flags.  You set the former, but not the latter.
BLK_MQ_F_SG_MERGE controls whether multiple physically contiguous pages get
merged into a single segment.  For a dd after a fresh boot that is
probably very common.  Except for the polarity of the merge flags the
basic merge functionality between the legacy and blk-mq path should be
the same, and if they aren't you've found a bug we need to address.
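
Concretely, that would mean something along these lines when the tag set
is set up (sketch only: the structure and ops names below are
placeholders for whatever the mmc conversion ends up using, the flags
are the current blk-mq ones):

static struct blk_mq_tag_set mmc_mq_tag_set = {         /* hypothetical */
        .ops            = &mmc_mq_ops,                  /* hypothetical ops table */
        .nr_hw_queues   = 1,
        .queue_depth    = 2,
        .numa_node      = NUMA_NO_NODE,
        /* allow bio merging and merging of contiguous pages into segments */
        .flags          = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE,
};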

You also say that you disable the pipelining.  How much of a performance
gain did this feature give when added?  How much does just removing that
on it's own cost you?  While I think that features is rather messy and
should be avoided if possible I don't see how it's impossible to
implement in blk-mq.  If you just increase your queue depth and use
the old scheme you should get it - if you currently can't handle the
second command for some reason (i.e. the special request magic) you
can just return BLK_MQ_RQ_QUEUE_BUSY from the queue_rq function.
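
With the current ->queue_rq() prototype that would look roughly like the
sketch below (mmc_mq_queue_rq() and mmc_can_take_cmd() are made-up
names; the point is only the BUSY return):

static int mmc_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
                           const struct blk_mq_queue_data *bd)
{
        struct request *req = bd->rq;

        /* can't take another command right now (e.g. the special request
         * handling is still in flight): let blk-mq re-queue and retry */
        if (!mmc_can_take_cmd(hctx->queue, req))        /* hypothetical check */
                return BLK_MQ_RQ_QUEUE_BUSY;

        blk_mq_start_request(req);
        /* ... hand the request to the host controller ... */
        return BLK_MQ_RQ_QUEUE_OK;
}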

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-28  9:32                               ` Linus Walleij
  2016-10-28 14:22                                 ` Jens Axboe
  2016-10-28 15:29                                 ` Christoph Hellwig
@ 2016-10-28 15:30                                 ` Jens Axboe
  2016-10-28 15:58                                   ` Bartlomiej Zolnierkiewicz
  2016-10-28 16:05                                   ` Arnd Bergmann
  2 siblings, 2 replies; 57+ messages in thread
From: Jens Axboe @ 2016-10-28 15:30 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Ulf Hansson, Paolo Valente, Christoph Hellwig, Arnd Bergmann,
	Bart Van Assche, Jan Kara, Tejun Heo, linux-block, Linux-Kernal,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley,
	Bartlomiej Zolnierkiewicz

On 10/28/2016 03:32 AM, Linus Walleij wrote:
> The patch to enable MQ looks like this:
> https://git.kernel.org/cgit/linux/kernel/git/linusw/linux-stericsson.git/commit/?h=mmc-mq&id=8f79b527e2e854071d8da019451da68d4753f71d

BTW, another viable "hack" for the depth issue would be to expose more
than one hardware queue. It's meant to map to a distinct submission
region in the hardware, but there's nothing stopping the driver from
using it differently. Might not be cleaner than just increasing the
queue depth on a single queue, though.

That still won't solve the issue of lying about it and causing IO
scheduler confusion, of course.

Also, 4.8 and newer have support for BLK_MQ_F_BLOCKING, if you need to
block in ->queue_rq(). That could eliminate the need to offload to a
kthread manually.
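
Assuming the driver keeps its blk_mq_tag_set in a variable called "set",
the two suggestions above boil down to something like this (the flag and
field names are real, the values are only illustrative):

        set->nr_hw_queues = 2;          /* or bump set->queue_depth on one queue */
        set->flags |= BLK_MQ_F_BLOCKING;        /* ->queue_rq() may now sleep, so the
                                                 * dma_map_sg()/cache maintenance can
                                                 * run inline instead of in a kthread */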

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-28 15:30                                 ` Jens Axboe
@ 2016-10-28 15:58                                   ` Bartlomiej Zolnierkiewicz
  2016-10-28 16:05                                   ` Arnd Bergmann
  1 sibling, 0 replies; 57+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2016-10-28 15:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Walleij, Ulf Hansson, Paolo Valente, Christoph Hellwig,
	Arnd Bergmann, Bart Van Assche, Jan Kara, Tejun Heo, linux-block,
	Linux-Kernal, Mark Brown, Hannes Reinecke, Grant Likely,
	James Bottomley


Hi,

On Friday, October 28, 2016 09:30:07 AM Jens Axboe wrote:
> On 10/28/2016 03:32 AM, Linus Walleij wrote:
> > The patch to enable MQ looks like this:
> > https://git.kernel.org/cgit/linux/kernel/git/linusw/linux-stericsson.git/commit/?h=mmc-mq&id=8f79b527e2e854071d8da019451da68d4753f71d
> 
> BTW, another viable "hack" for the depth issue would be to expose more
> than one hardware queue. It's meant to map to a distinct submission
> region in the hardware, but there's nothing stopping the driver from
> using it differently. Might not be cleaner than just increasing the
> queue depth on a single queue, though.

Yes, I'm already considering this for a rewritten version of my
patch set, as it may also help with performance when compared to
the non-blk-mq case.

A significant amount of time is spent on DMA map/unmap operations
on ARM MMC hosts and I would like to do these DMA (un)mapping-s
in parallel for two (or more) requests to check whether it helps
the performance (hopefully the cache controller doesn't serialize
these operations).

BTW I'm following the discussion and still would like to help with
getting blk-mq work for MMC.  I'm just quite busy with other things
at the moment.

> That still won't solve the issue of lying about it and causing IO
> scheduler confusion, of course.
> 
> Also, 4.8 and newer have support for BLK_MQ_F_BLOCKING, if you need to
> block in ->queue_rq(). That could eliminate the need to offload to a
> kthread manually.

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-28 15:30                                 ` Jens Axboe
  2016-10-28 15:58                                   ` Bartlomiej Zolnierkiewicz
@ 2016-10-28 16:05                                   ` Arnd Bergmann
  2016-10-28 17:17                                     ` Mark Brown
  1 sibling, 1 reply; 57+ messages in thread
From: Arnd Bergmann @ 2016-10-28 16:05 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Walleij, Ulf Hansson, Paolo Valente, Christoph Hellwig,
	Bart Van Assche, Jan Kara, Tejun Heo, linux-block, Linux-Kernal,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley,
	Bartlomiej Zolnierkiewicz

On Friday, October 28, 2016 9:30:07 AM CEST Jens Axboe wrote:
> On 10/28/2016 03:32 AM, Linus Walleij wrote:
> > The patch to enable MQ looks like this:
> > https://git.kernel.org/cgit/linux/kernel/git/linusw/linux-stericsson.git/commit/?h=mmc-mq&id=8f79b527e2e854071d8da019451da68d4753f71d
> 
> BTW, another viable "hack" for the depth issue would be to expose more
> than one hardware queue. It's meant to map to a distinct submission
> region in the hardware, but there's nothing stopping the driver from
> using it differently. Might not be cleaner than just increasing the
> queue depth on a single queue, though.
> 
> That still won't solve the issue of lying about it and causing IO
> scheduler confusion, of course.
> 
> Also, 4.8 and newer have support for BLK_MQ_F_BLOCKING, if you need to
> block in ->queue_rq(). That could eliminate the need to offload to a
> kthread manually.

I think the main reason for the kthread is that on ARM and other
architectures, the dma mapping operations are fairly slow (for
cache flushes or bounce buffering) and we want to minimize the
time between subsequent requests being handled by the hardware.

This is not unique to MMC in any way, MMC just happens to be
common on ARM and it is limited by its lack of hardware
command queuing.
It would be nice to do a similar trick for SCSI disks,
especially USB mass storage, maybe also SATA, which are the
next most common storage devices on non-coherent ARM systems
(SATA nowadays often comes with NCQ, so it's less of an
issue).

It may be reasonable to tie this in with the I/O scheduler:
if you don't have a scheduler, the access to the device is
probably rather direct and you want to avoid any complexity
in the kernel, but if preparing a request is expensive
and the hardware has no queuing, you probably also want to
use a scheduler.

We should probably also try to understand how this could
work out with USB mass storage, if there is a solution at
all, and then do it for MMC in a way that would work on
both. I don't think the USB core can currently split the
dma_map_sg() operation from the USB command submission,
so this may require some deeper surgery there.

	Arnd

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-28 14:17                               ` Jens Axboe
@ 2016-10-28 17:12                                 ` Mark Brown
  0 siblings, 0 replies; 57+ messages in thread
From: Mark Brown @ 2016-10-28 17:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ulf Hansson, Paolo Valente, Christoph Hellwig, Arnd Bergmann,
	Bart Van Assche, Jan Kara, Tejun Heo, linux-block, Linux-Kernal,
	Linus Walleij, Hannes Reinecke, Grant Likely, James Bottomley

[-- Attachment #1: Type: text/plain, Size: 798 bytes --]

On Fri, Oct 28, 2016 at 08:17:01AM -0600, Jens Axboe wrote:
> On 10/28/2016 12:36 AM, Ulf Hansson wrote:

> > You have been pushing Paolo in different directions throughout the
> > years with his work in BFQ, wasting lots of his time/effort.

> I have not. Various entities have advised Paolo to approach it in various ways.
> We've had blk-mq for 3 years now, my position should have been pretty clear
> on that.

Having come to this somewhat late I have to say that that hasn't been
100% clear as a set opinion from everyone - in the time I've been
following things there's been engagement about the meat of the code
which gave the impression the patches were being seriously considered.

But like I said in a previous mail this is all in the past anyway, we
need to focus on the present situation.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-28 16:05                                   ` Arnd Bergmann
@ 2016-10-28 17:17                                     ` Mark Brown
  0 siblings, 0 replies; 57+ messages in thread
From: Mark Brown @ 2016-10-28 17:17 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jens Axboe, Linus Walleij, Ulf Hansson, Paolo Valente,
	Christoph Hellwig, Bart Van Assche, Jan Kara, Tejun Heo,
	linux-block, Linux-Kernal, Hannes Reinecke, Grant Likely,
	James Bottomley, Bartlomiej Zolnierkiewicz

[-- Attachment #1: Type: text/plain, Size: 932 bytes --]

On Fri, Oct 28, 2016 at 06:05:35PM +0200, Arnd Bergmann wrote:
> On Friday, October 28, 2016 9:30:07 AM CEST Jens Axboe wrote:

> > Also, 4.8 and newer have support for BLK_MQ_F_BLOCKING, if you need to
> > block in ->queue_rq(). That could eliminate the need to offload to a
> > kthread manually.

> I think the main reason for the kthread is that on ARM and other
> architectures, the dma mapping operations are fairly slow (for
> cache flushes or bounce buffering) and we want to minimize the
> time between subsequent requests being handled by the hardware.

> This is not unique to MMC in any way, MMC just happens to be
> common on ARM and it is limited by its lack of hardware
> command queuing.

Plus the fact that MMC (and SD) have some *relatively* high performance
implementations which amplify the effects of desaturating the hardware -
the faster the hardware is, the more noticeable the overhead of stalling
it becomes.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-28 14:22                                 ` Jens Axboe
@ 2016-10-28 20:38                                   ` Linus Walleij
  0 siblings, 0 replies; 57+ messages in thread
From: Linus Walleij @ 2016-10-28 20:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ulf Hansson, Paolo Valente, Christoph Hellwig, Arnd Bergmann,
	Bart Van Assche, Jan Kara, Tejun Heo, linux-block, Linux-Kernal,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley,
	Bartlomiej Zolnierkiewicz

On Fri, Oct 28, 2016 at 4:22 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On 10/28/2016 03:32 AM, Linus Walleij wrote:
>>
>> This is without using Bartlomiej's clever hack to pretend we have
>> 2 elements in the HW queue though. His early tests indicate that
>> it doesn't help much: the performance regression we see is due to
>> lack of block scheduling.
>
> A simple dd test, I don't see how that can be slower due to lack of
> scheduling. There's nothing to schedule there, just issue them in order?

Yeah I guess you're right; it could be due in part to not having
activated front- and back-end merges properly, as Christoph pointed
out. I'll look closer at this.

> So that would probably be where I would start looking. A blktrace of the
> in-kernel code and the blk-mq enabled code would perhaps be
> enlightening. I don't think it's worth looking at the more complex test
> cases until the dd test case is at least as fast as the non-mq version.

Yeah.

> Was that with CFQ, btw, or what scheduler did it run?

CFQ, just plain defconfig.

> It'd be nice to NOT have to rely on that fake QD=2 setup, since it will
> mess with the IO scheduling as well.

I agree.

>> I try to find a way forward with this, and also massage the MMC/SD
>> code to be more MQ friendly to begin with (like only pick requests
>> when we get a request notification and stop pulling NULL requests
>> off the queue) but it's really a messy piece of code.
>
> Yeah, it does look pretty messy... I'd be happy to help out with that,
> and particularly in figuring out why the direct conversion is slower for
> a basic 'dd' test case.

I'm looking into it.

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-28 15:29                                 ` Christoph Hellwig
@ 2016-10-28 21:09                                   ` Linus Walleij
  0 siblings, 0 replies; 57+ messages in thread
From: Linus Walleij @ 2016-10-28 21:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Ulf Hansson, Paolo Valente, Arnd Bergmann,
	Bart Van Assche, Jan Kara, Tejun Heo, linux-block, Linux-Kernal,
	Mark Brown, Hannes Reinecke, Grant Likely, James Bottomley,
	Bartlomiej Zolnierkiewicz

On Fri, Oct 28, 2016 at 5:29 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Oct 28, 2016 at 11:32:21AM +0200, Linus Walleij wrote:
>> So I'm not just complaining by the way, I'm trying to fix this. Also
>> Bartlomiej from Samsung has done some stabs at switching MMC/SD
>> to blk-mq. I just rebased my latest stab at a naïve switch to blk-mq
>> to v4.9-rc2 with these results.
>>
>> The patch to enable MQ looks like this:
>> https://git.kernel.org/cgit/linux/kernel/git/linusw/linux-stericsson.git/commit/?h=mmc-mq&id=8f79b527e2e854071d8da019451da68d4753f71d
>>
>> I run these tests directly after boot with cold caches. The results
>> are consistent: I ran the same commands 10 times in a row.
>
> A couple comments from a quick look over the patch:
>
> In the changelog you complain:
>
> ". Lack of front- and back-end merging in the MQ block layer creating
> several small requests instead of a few large ones."
>
> In blk-mq merging is controlled by the BLK_MQ_F_SHOULD_MERGE and
> BLK_MQ_F_SG_MERGE flags.  You set the former, but not the latter.
> BLK_MQ_F_SG_MERGE controls whether multiple physically contiguous pages get
> merged into a single segment.  For a dd after a fresh boot that is
> probably very common.  Except for the polarity of the merge flags the
> basic merge functionality between the legacy and blk-mq path should be
> the same, and if they aren't you've found a bug we need to address.

Aha OK I will make sure to set both flags next time. (I will also stop
guessing about that as a cause since that part probably works.)

> You also say that you disable the pipelining.  How much of a performance
> gain did this feature give when added? How much does just removing that
> on its own cost you?

Interestingly, the original commit doesn't say.
http://marc.info/?l=linaro-dev&m=137645684811479&w=2

How much is gained, however, depends on the cache architecture of the
machine: the heavier the cache flushes, the bigger the gain.

I guess I need to make a patch removing that mechanism to bench
it. It's pretty hard to get rid of because it goes really deep into the
MMC subsystem. It's massaged in like shampoo.

> While I think that feature is rather messy and
> should be avoided if possible I don't see how it's impossible to
> implement in blk-mq.

It's probably possible. What I discussed with Arnd was to let
the blk-mq core call out to these pre-request and post-request
hooks on new requests in parallel with processing a request or
a queue of requests. I.e. add .prep_request() and .unprep_request()
callbacks to struct blk_mq_ops.

I tried to understand if the existing .init_request and .exit_request
callbacks could be used. But as I understand it they are only used
to allocate and prepare the extra per-request-associated memory
and state, and do not have access to the request per se,
so the driver doesn't know anything about the actual request when
.init_request() is called.

So we're looking for something called whenever the contents of
a request are done, right before queueing it, and right after
dequeueing it after being served.
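
Just to illustrate the shape of the proposal (these callbacks do not
exist in struct blk_mq_ops today, the sketch only shows what adding
them might look like):

struct blk_mq_ops {
        /* ... existing members: queue_rq, init_request, exit_request, ... */

        /* proposed: called once a request's contents are final, right
         * before it is queued for dispatch (dma_map_sg(), cache flush) */
        int     (*prep_request)(struct blk_mq_hw_ctx *hctx, struct request *rq);

        /* proposed: called right after the request has been served */
        void    (*unprep_request)(struct blk_mq_hw_ctx *hctx, struct request *rq);
};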

>  If you just increase your queue depth and use
> the old scheme you should get it - if you currently can't handle the
> second command for some reason (i.e. the special request magic) you
> can just return BLK_MQ_RQ_QUEUE_BUSY from the queue_rq function.

Bartlomiej's patch set did that, but I haven't been able to reproduce it.

I will try to make a clean patch in the spirit of his.

Yours,
Linus Walleij

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-26 16:12               ` Jens Axboe
  2016-10-27  9:26                 ` Jan Kara
  2016-10-27 17:32                 ` Ulf Hansson
@ 2016-10-29  5:38                 ` Paolo Valente
  2016-10-29 13:12                   ` Bart Van Assche
  2016-10-29 14:12                   ` Jens Axboe
  2 siblings, 2 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-29  5:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Arnd Bergmann, Bart Van Assche, Jan Kara,
	Tejun Heo, linux-block, Linux-Kernal, Ulf Hansson, Linus Walleij,
	Mark Brown, Hannes Reinecke, grant.likely, James.Bottomley


> Il giorno 26 ott 2016, alle ore 18:12, Jens Axboe <axboe@kernel.dk> ha scritto:
> 
> On 10/26/2016 10:04 AM, Paolo Valente wrote:
>> 
>>> Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe <axboe@kernel.dk> ha scritto:
>>> 
>>> On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>>>> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>>>>> The question to ask first is whether to actually have pluggable
>>>>> schedulers on blk-mq at all, or just have one that is meant to
>>>>> do the right thing in every case (and possibly can be bypassed
>>>>> completely).
>>>> 
>>>> That would be my preference.  Have a BFQ-variant for blk-mq as an
>>>> option (default to off unless opted in by the driver or user), and
>>>> not other scheduler for blk-mq.  Don't bother with bfq for non
>>>> blk-mq.  It's not like there is any advantage in the legacy-request
>>>> device even for slow devices, except for the option of having I/O
>>>> scheduling.
>>> 
>>> It's the only right way forward. blk-mq might not offer any substantial
>>> advantages to rotating storage, but with scheduling, it won't offer a
>>> downside either. And it'll take us towards the real goal, which is to
>>> have just one IO path.
>> 
>> ok
>> 
>>> Adding a new scheduler for the legacy IO path
>>> makes no sense.
>> 
>> I would fully agree if effective and stable I/O scheduling would be
>> available in blk-mq in one or two months.  But I guess that it will
>> take at least one year optimistically, given the current status of the
>> needed infrastructure, and given the great difficulties of doing
>> effective scheduling at the high parallelism and extreme target speeds
>> of blk-mq.  Of course, this holds true unless little clever scheduling
>> is performed.
>> 
>> So, what's the point in forcing a lot of users wait another year or
>> more, for a solution that has yet to be even defined, while they could
>> enjoy a much better system, and then switch an even better system when
>> scheduling is ready in blk-mq too?
> 
> That same argument could have been made 2 years ago. Saying no to a new
> scheduler for the legacy framework goes back roughly that long. We could
> have had BFQ for mq NOW, if we didn't keep coming back to this very
> point.
> 
> I'm hesitant to add a new scheduler because it's very easy to add, very
> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
> it'll take us years and years to get rid of it again. We should be
> moving towards LESS moving parts in the legacy path, not more.
> 
> We can keep having this discussion every few years, but I think we'd
> both prefer to make some actual progress here.

ok Jens, I give up

> It's perfectly fine to
> add an interface for a single queue interface for an IO scheduler for
> blk-mq, since we don't care too much about scalability there. And that
> won't take years, that should be a few weeks. Retrofitting BFQ on top of
> that should not be hard either. That can co-exist with a real multiqueue
> scheduler as well, something that's geared towards some fairness for
> faster devices.
> 

AFAICT this solution is good, for many practical reasons.  I don't
have the expertise to make such an infrastructure well on my own.  At
least not in an acceptable amount of time, because working on this
nice stuff is unfortunately not my job (although Linaro is now
supporting me for BFQ).

Then, assuming that this solution may be of general interest, and that
BFQ benefits convinced you a little bit too, may I get significant
collaboration/help on implementing this infrastructure?  If so, Jens
and all possibly interested parties, could we have a sort of short
kick-off technical meeting during KS/LPC?

Thanks,
Paolo

> -- 
> Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-29  5:38                 ` Paolo Valente
@ 2016-10-29 13:12                   ` Bart Van Assche
  2016-10-29 14:12                   ` Jens Axboe
  1 sibling, 0 replies; 57+ messages in thread
From: Bart Van Assche @ 2016-10-29 13:12 UTC (permalink / raw)
  To: Paolo Valente, Jens Axboe
  Cc: Christoph Hellwig, Arnd Bergmann, Jan Kara, Tejun Heo,
	linux-block, Linux-Kernal, Ulf Hansson, Linus Walleij,
	Mark Brown, Hannes Reinecke, grant.likely, James.Bottomley

On 10/28/16 22:38, Paolo Valente wrote:
> Then, assuming that this solution may be of general interest, and that
> BFQ benefits convinced you a little bit too, may I get significant
> collaboration/help on implementing this infrastructure?  If so, Jens
> and all possibly interested parties, could we have a sort of short
> kick-off technical meeting during KS/LPC?

Hello Paolo and Jens,

Please keep me in the loop for any communication about BFQ / blk-mq 
scheduling. My employer was so kind to allow me to spend some of my time 
to work on this. I plan to attend the KS.

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-29  5:38                 ` Paolo Valente
  2016-10-29 13:12                   ` Bart Van Assche
@ 2016-10-29 14:12                   ` Jens Axboe
  2016-10-30  3:06                     ` Paolo Valente
  1 sibling, 1 reply; 57+ messages in thread
From: Jens Axboe @ 2016-10-29 14:12 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Christoph Hellwig, Arnd Bergmann, Bart Van Assche, Jan Kara,
	Tejun Heo, linux-block, Linux-Kernal, Ulf Hansson, Linus Walleij,
	Mark Brown, Hannes Reinecke, grant.likely, James.Bottomley

On 10/28/2016 11:38 PM, Paolo Valente wrote:
>
>> Il giorno 26 ott 2016, alle ore 18:12, Jens Axboe <axboe@kernel.dk> ha scritto:
>>
>> On 10/26/2016 10:04 AM, Paolo Valente wrote:
>>>
>>>> Il giorno 26 ott 2016, alle ore 17:32, Jens Axboe <axboe@kernel.dk> ha scritto:
>>>>
>>>> On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>>>>> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>>>>>> The question to ask first is whether to actually have pluggable
>>>>>> schedulers on blk-mq at all, or just have one that is meant to
>>>>>> do the right thing in every case (and possibly can be bypassed
>>>>>> completely).
>>>>>
>>>>> That would be my preference.  Have a BFQ-variant for blk-mq as an
>>>>> option (default to off unless opted in by the driver or user), and
>>>>> not other scheduler for blk-mq.  Don't bother with bfq for non
>>>>> blk-mq.  It's not like there is any advantage in the legacy-request
>>>>> device even for slow devices, except for the option of having I/O
>>>>> scheduling.
>>>>
>>>> It's the only right way forward. blk-mq might not offer any substantial
>>>> advantages to rotating storage, but with scheduling, it won't offer a
>>>> downside either. And it'll take us towards the real goal, which is to
>>>> have just one IO path.
>>>
>>> ok
>>>
>>>> Adding a new scheduler for the legacy IO path
>>>> makes no sense.
>>>
>>> I would fully agree if effective and stable I/O scheduling would be
>>> available in blk-mq in one or two months.  But I guess that it will
>>> take at least one year optimistically, given the current status of the
>>> needed infrastructure, and given the great difficulties of doing
>>> effective scheduling at the high parallelism and extreme target speeds
>>> of blk-mq.  Of course, this holds true unless little clever scheduling
>>> is performed.
>>>
>>> So, what's the point in forcing a lot of users wait another year or
>>> more, for a solution that has yet to be even defined, while they could
>>> enjoy a much better system, and then switch an even better system when
>>> scheduling is ready in blk-mq too?
>>
>> That same argument could have been made 2 years ago. Saying no to a new
>> scheduler for the legacy framework goes back roughly that long. We could
>> have had BFQ for mq NOW, if we didn't keep coming back to this very
>> point.
>>
>> I'm hesitant to add a new scheduler because it's very easy to add, very
>> difficult to get rid of. If we do add BFQ as a legacy scheduler now,
>> it'll take us years and years to get rid of it again. We should be
>> moving towards LESS moving parts in the legacy path, not more.
>>
>> We can keep having this discussion every few years, but I think we'd
>> both prefer to make some actual progress here.
>
> ok Jens, I give up
>
>> It's perfectly fine to
>> add an interface for a single queue interface for an IO scheduler for
>> blk-mq, since we don't care too much about scalability there. And that
>> won't take years, that should be a few weeks. Retrofitting BFQ on top of
>> that should not be hard either. That can co-exist with a real multiqueue
>> scheduler as well, something that's geared towards some fairness for
>> faster devices.
>>
>
> AFAICT this solution is a good one, for many practical reasons.  I
> don't have the expertise to build such an infrastructure well on my
> own, at least not in an acceptable amount of time, because working on
> this nice stuff is unfortunately not my job (although Linaro is now
> supporting me for BFQ).
>
> So, assuming that this solution may be of general interest, and that
> the benefits of BFQ have convinced you a little bit too, may I get
> significant collaboration/help in implementing this infrastructure?

Of course, I already offered to help with this.

> If so, Jens
> and all possibly interested parties, could we have a sort of short
> kick-off technical meeting during KS/LPC?

I'm not a huge fan of setting up a BoF to discuss something technical
when there's no code to discuss yet. We need some actual meat on the
bone in the shape of code, and that's much better dealt with over
email. It's pretty late in the game at this point; otherwise I'd offer
to cook something up that we COULD discuss, but I will not have time
to do that before KS.

If you are at LPC, why don't the two of us sit down and talk about it
Wednesday or Thursday? I'd like to try and understand what parts of
blk-mq you aren't up to speed on, and how we can best get a simple
framework going that will allow us to entertain single queue scheduling
within blk-mq.

-- 
Jens Axboe
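
To make the framework discussed above more concrete, here is a minimal
sketch of the kind of single-queue scheduling hook being proposed.  All
names (sq_sched_ops, sq_request, the fifo_* callbacks) are hypothetical
and chosen only for illustration; they are not taken from this thread
or from any existing kernel API.

/*
 * Hypothetical single-queue scheduling hook inside blk-mq.
 * Illustration only: none of these names are real kernel interfaces.
 */
#include <stdbool.h>
#include <stddef.h>

struct sq_request {
	struct sq_request *next;	/* FIFO link for the toy policy below */
	unsigned long long sector;	/* a field a real scheduler would inspect */
};

/* Callbacks a single-queue scheduler would register with the core. */
struct sq_sched_ops {
	void (*insert)(void *sched_data, struct sq_request *rq);
	struct sq_request *(*dispatch)(void *sched_data);
	bool (*has_work)(void *sched_data);
};

/* Trivial FIFO policy; BFQ would replace this with its fair-queueing logic. */
struct fifo_data {
	struct sq_request *head, *tail;
};

static void fifo_insert(void *data, struct sq_request *rq)
{
	struct fifo_data *fd = data;

	rq->next = NULL;
	if (fd->tail)
		fd->tail->next = rq;
	else
		fd->head = rq;
	fd->tail = rq;
}

static struct sq_request *fifo_dispatch(void *data)
{
	struct fifo_data *fd = data;
	struct sq_request *rq = fd->head;

	if (rq) {
		fd->head = rq->next;
		if (!fd->head)
			fd->tail = NULL;
	}
	return rq;
}

static bool fifo_has_work(void *data)
{
	return ((struct fifo_data *)data)->head != NULL;
}

static const struct sq_sched_ops fifo_sched_ops = {
	.insert   = fifo_insert,
	.dispatch = fifo_dispatch,
	.has_work = fifo_has_work,
};

The core would call insert() as requests arrive and dispatch() whenever
the device can accept more work; retrofitting BFQ would mean plugging
its fair-queueing decisions into callbacks of roughly this shape.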


* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
  2016-10-29 14:12                   ` Jens Axboe
@ 2016-10-30  3:06                     ` Paolo Valente
  0 siblings, 0 replies; 57+ messages in thread
From: Paolo Valente @ 2016-10-30  3:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Arnd Bergmann, Bart Van Assche, Jan Kara,
	Tejun Heo, linux-block, Linux-Kernel, Ulf Hansson, Linus Walleij,
	Mark Brown, Hannes Reinecke, grant.likely, James.Bottomley


> On 29 Oct 2016, at 16:12, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 10/28/2016 11:38 PM, Paolo Valente wrote:
>> 
>>> On 26 Oct 2016, at 18:12, Jens Axboe <axboe@kernel.dk> wrote:
>>> 
>>> On 10/26/2016 10:04 AM, Paolo Valente wrote:
>>>> 
>>>>> On 26 Oct 2016, at 17:32, Jens Axboe <axboe@kernel.dk> wrote:
>>>>> 
>>>>> On 10/26/2016 09:29 AM, Christoph Hellwig wrote:
>>>>>> On Wed, Oct 26, 2016 at 05:13:07PM +0200, Arnd Bergmann wrote:
>>>>>>> The question to ask first is whether to actually have pluggable
>>>>>>> schedulers on blk-mq at all, or just have one that is meant to
>>>>>>> do the right thing in every case (and possibly can be bypassed
>>>>>>> completely).
>>>>>> 
>>>>>> That would be my preference.  Have a BFQ variant for blk-mq as an
>>>>>> option (default to off unless opted in by the driver or user), and
>>>>>> no other scheduler for blk-mq.  Don't bother with bfq for
>>>>>> non-blk-mq.  It's not like there is any advantage in the
>>>>>> legacy-request path even for slow devices, except for the option
>>>>>> of having I/O scheduling.
>>>>> 
>>>>> It's the only right way forward. blk-mq might not offer any substantial
>>>>> advantages to rotating storage, but with scheduling, it won't offer a
>>>>> downside either. And it'll take us towards the real goal, which is to
>>>>> have just one IO path.
>>>> 
>>>> ok
>>>> 
>>>>> Adding a new scheduler for the legacy IO path
>>>>> makes no sense.
>>>> 
>>>> I would fully agree if effective and stable I/O scheduling were
>>>> available in blk-mq within one or two months.  But I guess that it
>>>> will take at least a year, optimistically, given the current status
>>>> of the needed infrastructure, and given the great difficulty of
>>>> doing effective scheduling at the high parallelism and extreme
>>>> target speeds of blk-mq.  Of course, this holds true unless only
>>>> very simple scheduling is attempted.
>>>> 
>>>> So, what's the point in forcing a lot of users to wait another year
>>>> or more for a solution that has yet to even be defined, when they
>>>> could enjoy a much better system now, and then switch to an even
>>>> better one once scheduling is ready in blk-mq too?
>>> 
>>> That same argument could have been made 2 years ago. Saying no to a
>>> new scheduler for the legacy framework goes back roughly that long.
>>> We could have had BFQ for mq NOW if we hadn't kept coming back to
>>> this very point.
>>>
>>> I'm hesitant to add a new scheduler because it's very easy to add and
>>> very difficult to get rid of. If we do add BFQ as a legacy scheduler
>>> now, it'll take us years and years to get rid of it again. We should
>>> be moving towards FEWER moving parts in the legacy path, not more.
>>> 
>>> We can keep having this discussion every few years, but I think we'd
>>> both prefer to make some actual progress here.
>> 
>> ok Jens, I give up
>> 
>>> It's perfectly fine to
>>> add a single-queue interface for an IO scheduler within blk-mq, since
>>> we don't care too much about scalability there. And that won't take
>>> years; it should be a few weeks. Retrofitting BFQ on top of that
>>> should not be hard either. It can co-exist with a real multiqueue
>>> scheduler as well, something that's geared towards some fairness for
>>> faster devices.
>>> 
>> 
>> AFAICT this solution is a good one, for many practical reasons.  I
>> don't have the expertise to build such an infrastructure well on my
>> own, at least not in an acceptable amount of time, because working on
>> this nice stuff is unfortunately not my job (although Linaro is now
>> supporting me for BFQ).
>>
>> So, assuming that this solution may be of general interest, and that
>> the benefits of BFQ have convinced you a little bit too, may I get
>> significant collaboration/help in implementing this infrastructure?
> 
> Of course, I already offered to help with this.
> 

Yep, I just did not want to take this important point for granted.

>> If so, Jens
>> and all possibly interested parties, could we have a sort of short
>> kick-off technical meeting during KS/LPC?
> 
> I'm not a huge fan of setting up a BoF to discuss something technical
> when there's no code to discuss yet. We need some actual meat on the
> bone in the shape of code, and that's much better dealt with over
> email. It's pretty late in the game at this point; otherwise I'd offer
> to cook something up that we COULD discuss, but I will not have time
> to do that before KS.
> 

Sorry, I was not thinking of a BoF or anything like that.  With that
rather stuffy phrase I just meant "let's get this started concretely".

> If you are at LPC, why don't the two of us sit down and talk about it
> Wednesday or Thursday?

I'm also at KS.  I'm available from Sunday evening to Wednesday
evening, and I'm leaving on Thursday morning.  If Wednesday is your
preferred day in any case, then let's do it on Wednesday.  At what time?

If I understand correctly, Bart will join us too.

> I'd like to try and understand what parts of
> blk-mq you aren't up to speed on, and how we can best get a simple
> framework going that will allow us to entertain single queue scheduling
> within blk-mq.
> 

That's exactly what I was hoping we would talk about.

Thanks,
Paolo

> -- 
> Jens Axboe


* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
@ 2016-10-30 17:48 Manuel Krause
  0 siblings, 0 replies; 57+ messages in thread
From: Manuel Krause @ 2016-10-30 17:48 UTC (permalink / raw)
  To: linux-kernel

Dear blk-mq maintainers,

For years now I have been using the BFQ disk I/O scheduler by default,
always fetching the newest release.
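
As a side note on how that default is selected: on kernels carrying the
BFQ patches, the active scheduler of a disk can be inspected and changed
through sysfs.  A minimal sketch follows; the device name ("sda") and
the presence of "bfq" in the list are assumptions here, since bfq only
shows up on patched kernels.

/*
 * Read the available/active I/O schedulers for a disk from sysfs and,
 * if running with sufficient privileges, switch to bfq.  The active
 * scheduler is the one shown in square brackets, e.g.
 * "noop deadline [cfq] bfq".
 */
#include <stdio.h>

int main(void)
{
	const char *attr = "/sys/block/sda/queue/scheduler";
	char line[256];
	FILE *f = fopen(attr, "r");

	if (!f) {
		perror("open scheduler attribute");
		return 1;
	}
	if (fgets(line, sizeof(line), f))
		printf("schedulers: %s", line);
	fclose(f);

	/* Switching requires root: write the scheduler name to the same file. */
	f = fopen(attr, "w");
	if (f) {
		fputs("bfq\n", f);
		fclose(f);
	}
	return 0;
}

The usual one-liner equivalent is simply echoing the scheduler name into
that sysfs file from a boot script, which is presumably how such a
default is kept across kernel updates.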

Now a real-life story of mine:
For a clean BUG hunt, I was recently forced to leave out BFQ for a
week. The result was an unusable experience with CFQ: long pauses
when starting desktop applications, and even KDE menus NOT opening,
while CFQ was doing its "fair" queuing, of course at its best. ;-)

Next, my usage pattern makes heavy use of /dev/shm, which is backed
by swap on a second disk. This is made far smoother by BFQ, as the
rest of the system is not blocked. CFQ prefers to stay unresponsive
until it is done: no mouse pointer, etc.

If the CFQ people like it that way... then they haven't understood
Linux's present and future goals, and maybe they lack the usage
experience needed to have a say in this discussion at all.

And until blk-mq is mature, mmmh, o.k., feature-ready and proven,
mmmh, o.k., done with brainstorming... in who knows how many years...

there is a still current (!) wish of Linux users (!), standing for
years now (!), to include BFQ as an add-on I/O scheduler in the
mainline kernel.

A side note against those who claim otherwise: Paolo Valente and his
team have always provided updated patches for newly released kernels.

Best regards,
Manuel Krause


* Re: [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler
@ 2016-10-29 17:08 Manuel Krause
  0 siblings, 0 replies; 57+ messages in thread
From: Manuel Krause @ 2016-10-29 17:08 UTC (permalink / raw)
  To: linux-kernel

Hey, people,
don't you get tired of annoying each other all the time?
The BFQ patches provide a useful alternative to the code you call
"legacy", which you are no longer really maintaining while you are,
once again, about to invent something new. ?!
If blk-mq has no scheduler -> work on it. If you want to develop I/O
scheduler APIs -> work on them. Maybe you even want to collaborate
with someone who already has a working solution, namely Paolo Valente
and his team, with BFQ. Too much for you?

I have seen no progress on your blk-mq scheduling work for years,
while Paolo Valente continuously improves and maintains BFQ.

I need to be a little impolite here: several blk maintainers behave
like masters of the universe, just to uphold their own view or claim.
That's a real shame for all of Linux.

Best regards,
Manuel Krause



Thread overview: 57+ messages
2016-10-26  9:27 [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Paolo Valente
2016-10-26  9:27 ` [PATCH 01/14] block, bfq: " Paolo Valente
2016-10-26  9:27 ` [PATCH 02/14] block, bfq: add full hierarchical scheduling and cgroups support Paolo Valente
2016-10-26  9:27 ` [PATCH 03/14] block, bfq: improve throughput boosting Paolo Valente
2016-10-26  9:27 ` [PATCH 04/14] block, bfq: modify the peak-rate estimator Paolo Valente
2016-10-26  9:27 ` [PATCH 05/14] block, bfq: add more fairness with writes and slow processes Paolo Valente
2016-10-26  9:27 ` [PATCH 06/14] block, bfq: improve responsiveness Paolo Valente
2016-10-26  9:28 ` [PATCH 07/14] block, bfq: reduce I/O latency for soft real-time applications Paolo Valente
2016-10-26  9:28 ` [PATCH 08/14] block, bfq: preserve a low latency also with NCQ-capable drives Paolo Valente
2016-10-26  9:28 ` [PATCH 09/14] block, bfq: reduce latency during request-pool saturation Paolo Valente
2016-10-26 10:19 ` [PATCH 00/14] introduce the BFQ-v0 I/O scheduler as an extra scheduler Christoph Hellwig
2016-10-26 11:34   ` Jan Kara
2016-10-26 15:05     ` Bart Van Assche
2016-10-26 15:13       ` Arnd Bergmann
2016-10-26 15:29         ` Christoph Hellwig
2016-10-26 15:32           ` Jens Axboe
2016-10-26 16:04             ` Paolo Valente
2016-10-26 16:12               ` Jens Axboe
2016-10-27  9:26                 ` Jan Kara
2016-10-27 14:34                   ` Grozdan
2016-10-27 15:55                     ` Heinz Diehl
2016-10-27 16:28                     ` Jens Axboe
2016-10-27 16:26                   ` Jens Axboe
2016-10-28  7:59                     ` Jan Kara
2016-10-28 14:10                       ` Jens Axboe
2016-10-27 17:32                 ` Ulf Hansson
2016-10-27 17:43                   ` Jens Axboe
2016-10-27 18:13                     ` Ulf Hansson
2016-10-27 18:21                       ` Jens Axboe
2016-10-27 19:34                         ` Ulf Hansson
2016-10-27 21:08                           ` Jens Axboe
2016-10-27 22:27                             ` Linus Walleij
2016-10-28  9:32                               ` Linus Walleij
2016-10-28 14:22                                 ` Jens Axboe
2016-10-28 20:38                                   ` Linus Walleij
2016-10-28 15:29                                 ` Christoph Hellwig
2016-10-28 21:09                                   ` Linus Walleij
2016-10-28 15:30                                 ` Jens Axboe
2016-10-28 15:58                                   ` Bartlomiej Zolnierkiewicz
2016-10-28 16:05                                   ` Arnd Bergmann
2016-10-28 17:17                                     ` Mark Brown
2016-10-28 14:07                               ` Jens Axboe
2016-10-28  6:36                             ` Ulf Hansson
2016-10-28 14:17                               ` Jens Axboe
2016-10-28 17:12                                 ` Mark Brown
2016-10-27 19:41                         ` Mark Brown
2016-10-27 19:45                           ` Christoph Hellwig
2016-10-27 22:01                             ` Mark Brown
2016-10-28 12:07                       ` Arnd Bergmann
2016-10-28 12:17                         ` Richard Weinberger
2016-10-29  5:38                 ` Paolo Valente
2016-10-29 13:12                   ` Bart Van Assche
2016-10-29 14:12                   ` Jens Axboe
2016-10-30  3:06                     ` Paolo Valente
2016-10-26 12:37   ` Paolo Valente
2016-10-29 17:08 Manuel Krause
2016-10-30 17:48 Manuel Krause
