* [PATCH RFC RESEND 00/14] New version of the BFQ I/O Scheduler
@ 2014-05-27 12:42 ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Paolo Valente <paolo.valente@unimore.it>

[Re-posting, previous attempt seems to have partially failed]

Hi,
this patchset introduces the latest version of BFQ, a proportional-share
storage-I/O scheduler. BFQ also supports hierarchical scheduling with
a cgroups interface. The first version of BFQ was submitted a few
years ago [1]. It is denoted as v0 in the patches, to distinguish it
from the version I am submitting now, v7r4. In particular, the first
four patches introduce BFQ-v0, whereas the remaining patches turn it
progressively into BFQ-v7r4. Here are some of the main features of this
latest version.

Low latency for interactive applications

According to our results, regardless of the actual background
workload, for interactive tasks the storage device is virtually as
responsive as if it were idle. For example, even if one or more of the
following background workloads are being executed:
- one or more large files are being read or written,
- a tree of source files is being compiled,
- one or more virtual machines are performing I/O,
- a software update is in progress,
- indexing daemons are scanning filesystems and updating their
  databases,
starting an application or loading a file from within an application
takes about the same time as if the storage device were idle. By
comparison, with CFQ, NOOP or DEADLINE, and under the same conditions,
applications experience high latencies, or even become unresponsive
until the background workload terminates (even on SSDs).

Low latency for soft real-time applications

Soft real-time applications, such as audio and video
players/streamers, also enjoy low latency and a low drop rate,
regardless of the background workload on the storage device. As a
consequence, these applications suffer almost no glitches due to the
background workload.

High throughput

On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
up to 150% higher throughput than DEADLINE and NOOP, with half of the
parallel workloads considered in our tests. With the rest of the
workloads, and with all the workloads on flash-based devices, BFQ
instead achieves about the same throughput as the other schedulers.

Strong fairness guarantees (already provided by BFQ-v0)

As for long-term guarantees, BFQ distributes the device throughput
(and not just the device time) as desired to I/O-bound applications,
with any workload and regardless of the device parameters.
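For instance, under this model two I/O-bound applications with weights
2 and 1, sharing a device that sustains, say, 300 MB/s for their
combined workload, would receive about 200 MB/s and 100 MB/s
respectively, i.e., 2/3 and 1/3 of the throughput, and not merely 2/3
and 1/3 of the device time.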

What allows BFQ to provide the above features is its accurate
scheduling engine (patches 1-4), combined with a set of simple
heuristics and improvements (patches 5-14).  A 15-minute demo of the
performance of BFQ is available here [2]. I made this demo with an
older version of BFQ (v3r4) and under Linux 3.4.0. We have further
improved BFQ since then. The performance of the latest version of BFQ
is shown, e.g., through some graphs here [3], under 3.14.0, compared
against CFQ, DEADLINE and NOOP, on: a fast and a slow hard disk, a
RAID1, an SSD, a microSDHC card and an eMMC. As an example, our
results on the SSD are also reported in a table at the end of this
email. Finally, details on how BFQ and its components work are
provided in the descriptions of the patches. A comprehensive
description of the main BFQ algorithm and of most of its features can
be found in this paper [4].

Finally, as for testing in everyday use, BFQ is the default I/O
scheduler in, e.g., Manjaro, Sabayon, OpenMandriva, Arch Linux ARM on
some NAS boxes, and several kernel forks for PCs and smartphones. BFQ
is optionally available in, e.g., Arch, PCLinuxOS and Gentoo. In
addition, we record a few tens of downloads per day from people using
other distributions. The feedback received so far basically confirms
the expected latency drop and throughput boost.

Paolo

Results on a Plextor PX-256M5S SSD

The first two rows of the next table report the aggregate throughput
achieved by BFQ, CFQ, DEADLINE and NOOP while ten parallel processes
each read a separate portion of the device, either sequentially or
randomly. These processes read directly from the device, and no
process performs writes, to avoid writing large files repeatedly and
wearing out the SSD during the many tests done. As can be seen, all
schedulers achieve about the same throughput with sequential readers,
whereas, with random readers, the throughput grows slightly as the
complexity, and hence the execution time, of the scheduler decreases.
In fact, with random readers the number of IOPS is much higher, and
all CPUs spend all of their time either executing instructions or
waiting for I/O (the total idle percentage is 0). Therefore, the
processing time of I/O requests influences the maximum achievable
throughput.
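
As a rough, purely illustrative sketch of one such raw reader process
(the device path, per-reader region size and block size below are
arbitrary assumptions, not the parameters of the actual test suite):

/* raw_reader.c - one direct-I/O reader; illustrative sketch only */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK_SIZE	(512 * 1024)		/* 512 KiB per read */
#define REGION_SZ	(1024ULL * 1024 * 1024)	/* 1 GiB region per reader */

int main(int argc, char **argv)
{
	/* usage: raw_reader <device> <reader-index> <seq|rand> */
	int fd = open(argv[1], O_RDONLY | O_DIRECT);
	unsigned long long base = atoll(argv[2]) * REGION_SZ;
	int seq = (argv[3][0] == 's');
	unsigned long long off = 0;
	void *buf;

	if (fd < 0 || posix_memalign(&buf, 4096, BLK_SIZE))
		return 1;

	for (;;) {
		if (!seq)
			off = (rand() % (REGION_SZ / BLK_SIZE)) * BLK_SIZE;
		if (pread(fd, buf, BLK_SIZE, base + off) <= 0)
			break;			/* EOF or error */
		if (seq && (off += BLK_SIZE) >= REGION_SZ)
			off = 0;		/* wrap within the region */
	}
	return 0;
}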

The remaining rows report the cold-cache start-up time experienced by
various applications while one of the above two workloads is being
executed in parallel. In particular, "Start-up time 10 seq/rand"
stands for "Start-up time of the application at hand while 10
sequential/random readers are running". If the application does not
start within 60 seconds, a timeout fires and the test is aborted; so,
in the table, '>60' means that the application did not start before
the timeout fired.

With sequential readers, the performance gap between BFQ and the other
schedulers is remarkable. The background workloads are intentionally
very heavy, to show the performance of the schedulers under somewhat
extreme conditions. However, the differences remain significant with
lighter workloads as well, as shown, e.g., here [3] for slower devices.

-----------------------------------------------------------------------------
|                      SCHEDULER                    |        Test           |
-----------------------------------------------------------------------------
|    BFQ     |    CFQ     |  DEADLINE  |    NOOP    |                       |
-----------------------------------------------------------------------------
|            |            |            |            | Aggregate Throughput  |
|            |            |            |            |       [MB/s]          |
|    399     |    400     |    400     |    400     |  10 raw seq. readers  |
|    191     |    193     |    202     |    203     | 10 raw random readers |
-----------------------------------------------------------------------------
|            |            |            |            | Start-up time 10 seq  |
|            |            |            |            |       [sec]           |
|    0.21    |    >60     |    1.91    |    1.88    |      xterm            |
|    0.93    |    >60     |    10.2    |    10.8    |     oowriter          |
|    0.89    |    >60     |    29.7    |    30.0    |      konsole          |
-----------------------------------------------------------------------------
|            |            |            |            | Start-up time 10 rand |
|            |            |            |            |       [sec]           |
|    0.20    |    0.30    |    0.21    |    0.21    |      xterm            |
|    0.81    |    3.28    |    0.80    |    0.81    |     oowriter          |
|    0.88    |    2.90    |    1.02    |    1.00    |      konsole          |
-----------------------------------------------------------------------------

[1] https://lkml.org/lkml/2008/4/1/234
    https://lkml.org/lkml/2008/11/11/148

[2] http://youtu.be/J-e7LnJblm8

[3] http://www.algogroup.unimo.it/people/paolo/disk_sched/results.php

[4] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Arianna Avanzini (1):
  block, bfq: add Early Queue Merge (EQM)

Fabio Checconi (4):
  block: kconfig update and build bits for BFQ
  block: introduce the BFQ-v0 I/O scheduler
  block: add hierarchical-support option to kconfig
  block, bfq: add full hierarchical scheduling and cgroups support

Paolo Valente (9):
  block, bfq: improve throughput boosting
  block, bfq: modify the peak-rate estimator
  block, bfq: add more fairness to boost throughput and reduce latency
  block, bfq: improve responsiveness
  block, bfq: reduce I/O latency for soft real-time applications
  block, bfq: preserve a low latency also with NCQ-capable drives
  block, bfq: reduce latency during request-pool saturation
  block, bfq: boost the throughput on NCQ-capable flash-based devices
  block, bfq: boost the throughput with random I/O on NCQ-capable HDDs

 block/Kconfig.iosched         |   32 +
 block/Makefile                |    1 +
 block/bfq-cgroup.c            |  909 ++++++++++
 block/bfq-ioc.c               |   36 +
 block/bfq-iosched.c           | 3802 +++++++++++++++++++++++++++++++++++++++++
 block/bfq-sched.c             | 1104 ++++++++++++
 block/bfq.h                   |  729 ++++++++
 include/linux/cgroup_subsys.h |    4 +
 8 files changed, 6617 insertions(+)
 create mode 100644 block/bfq-cgroup.c
 create mode 100644 block/bfq-ioc.c
 create mode 100644 block/bfq-iosched.c
 create mode 100644 block/bfq-sched.c
 create mode 100644 block/bfq.h

-- 
1.9.2


* [PATCH RFC RESEND 01/14] block: kconfig update and build bits for BFQ
@ 2014-05-27 12:42     ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Fabio Checconi <fchecconi@gmail.com>

Update Kconfig.iosched and make the related Makefile changes to include
kernel-configuration options for BFQ.

Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/Kconfig.iosched | 19 +++++++++++++++++++
 block/Makefile        |  1 +
 2 files changed, 20 insertions(+)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 421bef9..8f98cc7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -39,6 +39,15 @@ config CFQ_GROUP_IOSCHED
 	---help---
 	  Enable group IO scheduling in CFQ.
 
+config IOSCHED_BFQ
+	tristate "BFQ I/O scheduler"
+	default n
+	---help---
+	  The BFQ I/O scheduler tries to distribute bandwidth among all
+	  processes according to their weights.
+	  It aims at distributing the bandwidth as desired, regardless
+	  of the disk parameters and with any workload.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
@@ -52,6 +61,15 @@ choice
 	config DEFAULT_CFQ
 		bool "CFQ" if IOSCHED_CFQ=y
 
+	config DEFAULT_BFQ
+		bool "BFQ" if IOSCHED_BFQ=y
+		help
+		  Selects BFQ as the default I/O scheduler, to be used
+		  for all block devices.
+		  The BFQ I/O scheduler aims at distributing the bandwidth
+		  as desired, regardless of the disk parameters and with
+		  any workload.
+
 	config DEFAULT_NOOP
 		bool "No-op"
 
@@ -61,6 +79,7 @@ config DEFAULT_IOSCHED
 	string
 	default "deadline" if DEFAULT_DEADLINE
 	default "cfq" if DEFAULT_CFQ
+	default "bfq" if DEFAULT_BFQ
 	default "noop" if DEFAULT_NOOP
 
 endmenu
diff --git a/block/Makefile b/block/Makefile
index 20645e8..cbd83fb 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
+obj-$(CONFIG_IOSCHED_BFQ)	+= bfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
-- 
1.9.2



* [PATCH RFC RESEND 02/14] block: introduce the BFQ-v0 I/O scheduler
  2014-05-27 12:42 ` paolo
@ 2014-05-27 12:42     ` paolo
  -1 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente, containers, linux-kernel, Fabio Checconi,
	Arianna Avanzini, cgroups, Paolo Valente

From: Fabio Checconi <fchecconi@gmail.com>

BFQ is a proportional-share I/O scheduler whose general structure, as
well as much of its code, is borrowed from CFQ.

- Each process doing I/O on a device is associated with a weight and a
  (bfq_)queue.

- BFQ grants exclusive access to the device, for a while, to one queue
  (process) at a time, and implements this service model by
  associating every queue with a budget, measured in number of
  sectors.

  - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

  - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
      holding the device for too long and dramatically reducing
      throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
      sync requests may not be expired immediately when it empties.
      Instead, BFQ may idle the device for a short time interval,
      giving the process the chance to go on being served if it issues
      a new request in time. Device idling typically boosts the
      throughput on rotational devices, if processes do synchronous
      and sequential I/O. In addition, under BFQ, device idling is
      also instrumental in guaranteeing the desired throughput
      fraction to processes issuing sync requests (see [1] for
      details).

  - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity.  See [1] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling. However, for a cleaner
    logical breakdown, the code that enables and completes
    hierarchical support is provided in patch 4, which focuses exactly
    on this feature.

  - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

  - The last property, budget independence (although probably
    counterintuitive at first), is definitely beneficial, for the
    following reasons.

    - First, with any proportional-share scheduler, the maximum
      deviation with respect to an ideal service is proportional to
      the maximum budget (slice) assigned to queues. As a consequence,
      BFQ can keep this deviation tight not only because of the
      accurate service of B-WF2Q+, but also because BFQ *does not*
      need to assign a larger budget to a queue to let the queue
      receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
      budget that best fits the needs of the process, or best
      leverages the I/O pattern of the process. In particular, BFQ
      updates queue budgets with a simple feedback-loop algorithm that
      allows a high throughput to be achieved, while still providing
      tight latency guarantees to time-sensitive applications. When
      the in-service queue expires, this algorithm computes the next
      budget of the queue so as to:

      - Let large budgets be eventually assigned to the queues
        associated with I/O-bound applications performing sequential
        I/O: in fact, the longer these applications are served once
        they get access to the device, the higher the throughput is.

      - Let small budgets be eventually assigned to the queues
        associated with time-sensitive applications (which typically
        perform sporadic and short I/O), because the smaller the
        budget assigned to a queue waiting for service, the sooner
        B-WF2Q+ will serve that queue (Subsec. 3.3 in [1]).

- Weights can be assigned to processes only indirectly, through I/O
  priorities, and according to the relation: weight = IOPRIO_BE_NR -
  ioprio (see the sketch after this list). The next two patches
  instead provide a cgroups interface through which weights can be
  assigned explicitly.

- ioprio classes are served in strict priority order, i.e.,
  lower-priority queues are not served as long as there are
  higher-priority queues with requests to serve.  Among queues in the
  same class, the bandwidth is distributed in proportion to the weight
  of each queue. A small amount of extra bandwidth is however
  guaranteed to the Idle class, to prevent it from starving.
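
To make the above weight mapping concrete, here is a minimal,
illustrative user-space sketch of the relation weight = IOPRIO_BE_NR -
ioprio (the helper name and the program are assumptions for
illustration only; they are not code from this patchset):

/* illustrative only: mirrors the relation stated above */
#include <stdio.h>

#define IOPRIO_BE_NR	8	/* best-effort priority levels in Linux */

static int weight_of_ioprio(int ioprio)
{
	/* ioprio 0 (highest) -> weight 8, ..., ioprio 7 -> weight 1 */
	return IOPRIO_BE_NR - ioprio;
}

int main(void)
{
	int ioprio;

	/*
	 * E.g., two processes with ioprio 0 and 4 get weights 8 and 4,
	 * i.e., a 2:1 split of the device throughput within the same
	 * ioprio class.
	 */
	for (ioprio = 0; ioprio < IOPRIO_BE_NR; ioprio++)
		printf("ioprio %d -> weight %d\n",
		       ioprio, weight_of_ioprio(ioprio));
	return 0;
}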

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-ioc.c     |   34 +
 block/bfq-iosched.c | 2297 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block/bfq-sched.c   |  936 +++++++++++++++++++++
 block/bfq.h         |  467 +++++++++++
 4 files changed, 3734 insertions(+)
 create mode 100644 block/bfq-ioc.c
 create mode 100644 block/bfq-iosched.c
 create mode 100644 block/bfq-sched.c
 create mode 100644 block/bfq.h

diff --git a/block/bfq-ioc.c b/block/bfq-ioc.c
new file mode 100644
index 0000000..adfb5a1
--- /dev/null
+++ b/block/bfq-ioc.c
@@ -0,0 +1,34 @@
+/*
+ * BFQ: I/O context handling.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ */
+
+/**
+ * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
+ * @icq: the iocontext queue.
+ */
+static inline struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
+{
+	/* bic->icq is the first member, %NULL will convert to %NULL */
+	return container_of(icq, struct bfq_io_cq, icq);
+}
+
+/**
+ * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
+ * @bfqd: the lookup key.
+ * @ioc: the io_context of the process doing I/O.
+ *
+ * Queue lock must be held.
+ */
+static inline struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
+					       struct io_context *ioc)
+{
+	if (ioc)
+		return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue));
+	return NULL;
+}
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
new file mode 100644
index 0000000..01a98be
--- /dev/null
+++ b/block/bfq-iosched.c
@@ -0,0 +1,2297 @@
+/*
+ * Budget Fair Queueing (BFQ) disk scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ *
+ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
+ * file.
+ *
+ * BFQ is a proportional-share storage-I/O scheduling algorithm based on
+ * the slice-by-slice service scheme of CFQ. But BFQ assigns budgets,
+ * measured in number of sectors, to processes instead of time slices. The
+ * device is not granted to the in-service process for a given time slice,
+ * but until it has exhausted its assigned budget. This change from the time
+ * to the service domain allows BFQ to distribute the device throughput
+ * among processes as desired, without any distortion due to ZBR, workload
+ * fluctuations or other factors. BFQ uses an ad hoc internal scheduler,
+ * called B-WF2Q+, to schedule processes according to their budgets. More
+ * precisely, BFQ schedules queues associated to processes. Thanks to the
+ * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
+ * I/O-bound processes issuing sequential requests (to boost the
+ * throughput), and yet guarantee a relatively low latency to interactive
+ * applications.
+ *
+ * BFQ is described in [1], where a reference to the initial, more
+ * theoretical paper on BFQ can also be found. The interested reader can find
+ * in the latter paper full details on the main algorithm, as well as
+ * formulas of the guarantees and formal proofs of all the properties.
+ * With respect to the version of BFQ presented in these papers, this
+ * implementation adds a hierarchical extension based on H-WF2Q+.
+ *
+ * B-WF2Q+ is based on WF2Q+, which is described in [2], together with
+ * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
+ * complexity derives from the one introduced with EEVDF in [3].
+ *
+ * [1] P. Valente and M. Andreolini, ``Improving Application Responsiveness
+ *     with the BFQ Disk I/O Scheduler'',
+ *     Proceedings of the 5th Annual International Systems and Storage
+ *     Conference (SYSTOR '12), June 2012.
+ *
+ * http://algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
+ *
+ * [2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
+ *     Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
+ *     Oct 1997.
+ *
+ * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
+ *
+ * [3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
+ *     First: A Flexible and Accurate Mechanism for Proportional Share
+ *     Resource Allocation,'' technical report.
+ *
+ * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
+ */
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/cgroup.h>
+#include <linux/elevator.h>
+#include <linux/jiffies.h>
+#include <linux/rbtree.h>
+#include <linux/ioprio.h>
+#include "bfq.h"
+#include "blk.h"
+
+/*
+ * Array of async queues for all the processes, one queue
+ * per ioprio value per ioprio_class.
+ */
+struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+/* Async queue for the idle class (ioprio is ignored) */
+struct bfq_queue *async_idle_bfqq;
+
+/* Max number of dispatches in one round of service. */
+static const int bfq_quantum = 4;
+
+/* Expiration time of sync (0) and async (1) requests, in jiffies. */
+static const int bfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
+
+/* Maximum backwards seek, in KiB. */
+static const int bfq_back_max = 16 * 1024;
+
+/* Penalty of a backwards seek, in number of sectors. */
+static const int bfq_back_penalty = 2;
+
+/* Idling period duration, in jiffies. */
+static int bfq_slice_idle = HZ / 125;
+
+/* Default maximum budget values, in sectors and number of requests. */
+static const int bfq_default_max_budget = 16 * 1024;
+static const int bfq_max_budget_async_rq = 4;
+
+/* Default timeout values, in jiffies, approximating CFQ defaults. */
+static const int bfq_timeout_sync = HZ / 8;
+static int bfq_timeout_async = HZ / 25;
+
+struct kmem_cache *bfq_pool;
+
+/* Below this threshold (in ms), we consider thinktime immediate. */
+#define BFQ_MIN_TT		2
+
+/* hw_tag detection: parallel requests threshold and min samples needed. */
+#define BFQ_HW_QUEUE_THRESHOLD	4
+#define BFQ_HW_QUEUE_SAMPLES	32
+
+#define BFQQ_SEEK_THR	 (sector_t)(8 * 1024)
+#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)
+
+/* Budget feedback step. */
+#define BFQ_BUDGET_STEP         128
+
+/* Min samples used for peak rate estimation (for autotuning). */
+#define BFQ_PEAK_RATE_SAMPLES	32
+
+/* Shift used for peak rate fixed precision calculations. */
+#define BFQ_RATE_SHIFT		16
+
+#define BFQ_SERVICE_TREE_INIT	((struct bfq_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+#define RQ_BIC(rq)		((struct bfq_io_cq *) (rq)->elv.priv[0])
+#define RQ_BFQQ(rq)		((rq)->elv.priv[1])
+
+static inline void bfq_schedule_dispatch(struct bfq_data *bfqd);
+
+#include "bfq-ioc.c"
+#include "bfq-sched.c"
+
+#define bfq_class_idle(bfqq)	((bfqq)->entity.ioprio_class ==\
+				 IOPRIO_CLASS_IDLE)
+#define bfq_class_rt(bfqq)	((bfqq)->entity.ioprio_class ==\
+				 IOPRIO_CLASS_RT)
+
+#define bfq_sample_valid(samples)	((samples) > 80)
+
+/*
+ * We regard a request as SYNC, if either it's a read or has the SYNC bit
+ * set (in which case it could also be a direct WRITE).
+ */
+static inline int bfq_bio_sync(struct bio *bio)
+{
+	if (bio_data_dir(bio) == READ || (bio->bi_rw & REQ_SYNC))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Scheduler run of queue, if there are requests pending and no one in the
+ * driver that will restart queueing.
+ */
+static inline void bfq_schedule_dispatch(struct bfq_data *bfqd)
+{
+	if (bfqd->queued != 0) {
+		bfq_log(bfqd, "schedule dispatch");
+		kblockd_schedule_work(bfqd->queue, &bfqd->unplug_work);
+	}
+}
+
+/*
+ * Lifted from AS - choose which of rq1 and rq2 is best served now.
+ * We choose the request that is closest to the head right now.  Distance
+ * behind the head is penalized and only allowed to a certain extent.
+ */
+static struct request *bfq_choose_req(struct bfq_data *bfqd,
+				      struct request *rq1,
+				      struct request *rq2,
+				      sector_t last)
+{
+	sector_t s1, s2, d1 = 0, d2 = 0;
+	unsigned long back_max;
+#define BFQ_RQ1_WRAP	0x01 /* request 1 wraps */
+#define BFQ_RQ2_WRAP	0x02 /* request 2 wraps */
+	unsigned wrap = 0; /* bit mask: requests behind the disk head? */
+
+	if (rq1 == NULL || rq1 == rq2)
+		return rq2;
+	if (rq2 == NULL)
+		return rq1;
+
+	if (rq_is_sync(rq1) && !rq_is_sync(rq2))
+		return rq1;
+	else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
+		return rq2;
+	if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
+		return rq1;
+	else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
+		return rq2;
+
+	s1 = blk_rq_pos(rq1);
+	s2 = blk_rq_pos(rq2);
+
+	/*
+	 * By definition, 1KiB is 2 sectors.
+	 */
+	back_max = bfqd->bfq_back_max * 2;
+
+	/*
+	 * Strict one way elevator _except_ in the case where we allow
+	 * short backward seeks which are biased as twice the cost of a
+	 * similar forward seek.
+	 */
+	if (s1 >= last)
+		d1 = s1 - last;
+	else if (s1 + back_max >= last)
+		d1 = (last - s1) * bfqd->bfq_back_penalty;
+	else
+		wrap |= BFQ_RQ1_WRAP;
+
+	if (s2 >= last)
+		d2 = s2 - last;
+	else if (s2 + back_max >= last)
+		d2 = (last - s2) * bfqd->bfq_back_penalty;
+	else
+		wrap |= BFQ_RQ2_WRAP;
+
+	/* Found required data */
+
+	/*
+	 * By doing switch() on the bit mask "wrap" we avoid having to
+	 * check two variables for all permutations: --> faster!
+	 */
+	switch (wrap) {
+	case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
+		if (d1 < d2)
+			return rq1;
+		else if (d2 < d1)
+			return rq2;
+		else {
+			if (s1 >= s2)
+				return rq1;
+			else
+				return rq2;
+		}
+
+	case BFQ_RQ2_WRAP:
+		return rq1;
+	case BFQ_RQ1_WRAP:
+		return rq2;
+	case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
+	default:
+		/*
+		 * Since both rqs are wrapped,
+		 * start with the one that's further behind head
+		 * (--> only *one* back seek required),
+		 * since back seek takes more time than forward.
+		 */
+		if (s1 <= s2)
+			return rq1;
+		else
+			return rq2;
+	}
+}
+
+static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
+					struct bfq_queue *bfqq,
+					struct request *last)
+{
+	struct rb_node *rbnext = rb_next(&last->rb_node);
+	struct rb_node *rbprev = rb_prev(&last->rb_node);
+	struct request *next = NULL, *prev = NULL;
+
+	if (rbprev != NULL)
+		prev = rb_entry_rq(rbprev);
+
+	if (rbnext != NULL)
+		next = rb_entry_rq(rbnext);
+	else {
+		rbnext = rb_first(&bfqq->sort_list);
+		if (rbnext && rbnext != &last->rb_node)
+			next = rb_entry_rq(rbnext);
+	}
+
+	return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
+}
+
+static inline unsigned long bfq_serv_to_charge(struct request *rq,
+					       struct bfq_queue *bfqq)
+{
+	return blk_rq_sectors(rq);
+}
+
+/**
+ * bfq_updated_next_req - update the queue after a new next_rq selection.
+ * @bfqd: the device data the queue belongs to.
+ * @bfqq: the queue to update.
+ *
+ * If the first request of a queue changes we make sure that the queue
+ * has enough budget to serve at least its first request (if the
+ * request has grown).  We do this because, if the queue does not have
+ * enough budget for its first request, it has to go through two dispatch
+ * rounds to actually get it dispatched.
+ */
+static void bfq_updated_next_req(struct bfq_data *bfqd,
+				 struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	struct request *next_rq = bfqq->next_rq;
+	unsigned long new_budget;
+
+	if (next_rq == NULL)
+		return;
+
+	if (bfqq == bfqd->in_service_queue)
+		/*
+		 * In order not to break guarantees, budgets cannot be
+		 * changed after an entity has been selected.
+		 */
+		return;
+
+	new_budget = max_t(unsigned long, bfqq->max_budget,
+			   bfq_serv_to_charge(next_rq, bfqq));
+	if (entity->budget != new_budget) {
+		entity->budget = new_budget;
+		bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
+					 new_budget);
+		bfq_activate_bfqq(bfqd, bfqq);
+	}
+}
+
+static void bfq_add_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_entity *entity = &bfqq->entity;
+	struct bfq_data *bfqd = bfqq->bfqd;
+	struct request *next_rq, *prev;
+
+	bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
+	bfqq->queued[rq_is_sync(rq)]++;
+	bfqd->queued++;
+
+	elv_rb_add(&bfqq->sort_list, rq);
+
+	/*
+	 * Check if this request is a better next-serve candidate.
+	 */
+	prev = bfqq->next_rq;
+	next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
+	bfqq->next_rq = next_rq;
+
+	if (!bfq_bfqq_busy(bfqq)) {
+		entity->budget = max_t(unsigned long, bfqq->max_budget,
+				       bfq_serv_to_charge(next_rq, bfqq));
+		bfq_add_bfqq_busy(bfqd, bfqq);
+	} else {
+		if (prev != bfqq->next_rq)
+			bfq_updated_next_req(bfqd, bfqq);
+	}
+}
+
+static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
+					  struct bio *bio)
+{
+	struct task_struct *tsk = current;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	bic = bfq_bic_lookup(bfqd, tsk->io_context);
+	if (bic == NULL)
+		return NULL;
+
+	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	if (bfqq != NULL)
+		return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
+
+	return NULL;
+}
+
+static void bfq_activate_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+
+	bfqd->rq_in_driver++;
+	bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
+	bfq_log(bfqd, "activate_request: new bfqd->last_position %llu",
+		(long long unsigned)bfqd->last_position);
+}
+
+static inline void bfq_deactivate_request(struct request_queue *q,
+					  struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+
+	bfqd->rq_in_driver--;
+}
+
+static void bfq_remove_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_data *bfqd = bfqq->bfqd;
+	const int sync = rq_is_sync(rq);
+
+	if (bfqq->next_rq == rq) {
+		bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
+		bfq_updated_next_req(bfqd, bfqq);
+	}
+
+	list_del_init(&rq->queuelist);
+	bfqq->queued[sync]--;
+	bfqd->queued--;
+	elv_rb_del(&bfqq->sort_list, rq);
+
+	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue)
+			bfq_del_bfqq_busy(bfqd, bfqq, 1);
+	}
+
+	if (rq->cmd_flags & REQ_META)
+		bfqq->meta_pending--;
+}
+
+static int bfq_merge(struct request_queue *q, struct request **req,
+		     struct bio *bio)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct request *__rq;
+
+	__rq = bfq_find_rq_fmerge(bfqd, bio);
+	if (__rq != NULL && elv_rq_merge_ok(__rq, bio)) {
+		*req = __rq;
+		return ELEVATOR_FRONT_MERGE;
+	}
+
+	return ELEVATOR_NO_MERGE;
+}
+
+static void bfq_merged_request(struct request_queue *q, struct request *req,
+			       int type)
+{
+	if (type == ELEVATOR_FRONT_MERGE &&
+	    rb_prev(&req->rb_node) &&
+	    blk_rq_pos(req) <
+	    blk_rq_pos(container_of(rb_prev(&req->rb_node),
+				    struct request, rb_node))) {
+		struct bfq_queue *bfqq = RQ_BFQQ(req);
+		struct bfq_data *bfqd = bfqq->bfqd;
+		struct request *prev, *next_rq;
+
+		/* Reposition request in its sort_list */
+		elv_rb_del(&bfqq->sort_list, req);
+		elv_rb_add(&bfqq->sort_list, req);
+		/* Choose next request to be served for bfqq */
+		prev = bfqq->next_rq;
+		next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,
+					 bfqd->last_position);
+		bfqq->next_rq = next_rq;
+		/*
+		 * If next_rq changes, update the queue's budget to fit
+		 * the new request.
+		 */
+		if (prev != bfqq->next_rq)
+			bfq_updated_next_req(bfqd, bfqq);
+	}
+}
+
+static void bfq_merged_requests(struct request_queue *q, struct request *rq,
+				struct request *next)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	/*
+	 * Reposition in fifo if next is older than rq.
+	 */
+	if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
+	    time_before(next->fifo_time, rq->fifo_time)) {
+		list_move(&rq->queuelist, &next->queuelist);
+		rq->fifo_time = next->fifo_time;
+	}
+
+	if (bfqq->next_rq == next)
+		bfqq->next_rq = rq;
+
+	bfq_remove_request(next);
+}
+
+static int bfq_allow_merge(struct request_queue *q, struct request *rq,
+			   struct bio *bio)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	/*
+	 * Disallow merge of a sync bio into an async request.
+	 */
+	if (bfq_bio_sync(bio) && !rq_is_sync(rq))
+		return 0;
+
+	/*
+	 * Lookup the bfqq that this bio will be queued with. Allow
+	 * merge only if rq is queued there.
+	 * Queue lock is held here.
+	 */
+	bic = bfq_bic_lookup(bfqd, current->io_context);
+	if (bic == NULL)
+		return 0;
+
+	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	return bfqq == RQ_BFQQ(rq);
+}
+
+static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
+				       struct bfq_queue *bfqq)
+{
+	if (bfqq != NULL) {
+		bfq_mark_bfqq_must_alloc(bfqq);
+		bfq_mark_bfqq_budget_new(bfqq);
+		bfq_clear_bfqq_fifo_expire(bfqq);
+
+		bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
+
+		bfq_log_bfqq(bfqd, bfqq,
+			     "set_in_service_queue, cur-budget = %lu",
+			     bfqq->entity.budget);
+	}
+
+	bfqd->in_service_queue = bfqq;
+}
+
+/*
+ * Get and set a new queue for service.
+ */
+static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
+
+	__bfq_set_in_service_queue(bfqd, bfqq);
+	return bfqq;
+}
+
+/*
+ * If enough samples have been computed, return the current max budget
+ * stored in bfqd, which is dynamically updated according to the
+ * estimated disk peak rate; otherwise return the default max budget
+ */
+static inline unsigned long bfq_max_budget(struct bfq_data *bfqd)
+{
+	if (bfqd->budgets_assigned < 194)
+		return bfq_default_max_budget;
+	else
+		return bfqd->bfq_max_budget;
+}
+
+/*
+ * bfq_default_budget - return the default budget for @bfqq on @bfqd.
+ * @bfqd: the device descriptor.
+ * @bfqq: the queue to consider.
+ *
+ * We use 3/4 of the @bfqd maximum budget as the default value
+ * for the max_budget field of the queues.  This lets the feedback
+ * mechanism start from some middle ground, then the behavior
+ * of the process will drive the heuristics towards high values, if
+ * it behaves as a greedy sequential reader, or towards small values
+ * if it shows a more intermittent behavior.
+ */
+static unsigned long bfq_default_budget(struct bfq_data *bfqd,
+					struct bfq_queue *bfqq)
+{
+	unsigned long budget;
+
+	/*
+	 * When we need an estimate of the peak rate we need to avoid
+	 * giving budgets that are too short due to previous measurements.
+	 * So, in the first 10 assignments use a ``safe'' budget value.
+	 */
+	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
+		budget = bfq_default_max_budget;
+	else
+		budget = bfqd->bfq_max_budget;
+
+	return budget - budget / 4;
+}
+
+/*
+ * Return min budget, which is a fraction of the current or default
+ * max budget (trying with 1/32)
+ */
+static inline unsigned long bfq_min_budget(struct bfq_data *bfqd)
+{
+	if (bfqd->budgets_assigned < 194)
+		return bfq_default_max_budget / 32;
+	else
+		return bfqd->bfq_max_budget / 32;
+}
+
+static void bfq_arm_slice_timer(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfqd->in_service_queue;
+	struct bfq_io_cq *bic;
+	unsigned long sl;
+
+	/* Processes have exited, don't wait. */
+	bic = bfqd->in_service_bic;
+	if (bic == NULL || atomic_read(&bic->icq.ioc->active_ref) == 0)
+		return;
+
+	bfq_mark_bfqq_wait_request(bfqq);
+
+	/*
+	 * We don't want to idle for seeks, but we do want to allow
+	 * fair distribution of slice time for a process doing back-to-back
+	 * seeks. So allow a little bit of time for it to submit a new rq.
+	 */
+	sl = bfqd->bfq_slice_idle;
+	/*
+	 * Grant only minimum idle time if the queue has been seeky for long
+	 * enough.
+	 */
+	if (bfq_sample_valid(bfqq->seek_samples) && BFQQ_SEEKY(bfqq))
+		sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
+	bfqd->last_idling_start = ktime_get();
+	mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
+	bfq_log(bfqd, "arm idle: %u/%u ms",
+		jiffies_to_msecs(sl), jiffies_to_msecs(bfqd->bfq_slice_idle));
+}
+
+/*
+ * Set the maximum time for the in-service queue to consume its
+ * budget. This prevents seeky processes from lowering the disk
+ * throughput (always guaranteed with a time slice scheme as in CFQ).
+ */
+static void bfq_set_budget_timeout(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfqd->in_service_queue;
+	unsigned int timeout_coeff = bfqq->entity.weight /
+				     bfqq->entity.orig_weight;
+
+	bfqd->last_budget_start = ktime_get();
+
+	bfq_clear_bfqq_budget_new(bfqq);
+	bfqq->budget_timeout = jiffies +
+		bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] * timeout_coeff;
+
+	bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u",
+		jiffies_to_msecs(bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] *
+		timeout_coeff));
+}
+
+/*
+ * Move request from internal lists to the request queue dispatch list.
+ */
+static void bfq_dispatch_insert(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	/*
+	 * For consistency, the next instruction should have been executed
+	 * after removing the request from the queue and dispatching it.
+	 * We instead execute this instruction before bfq_remove_request()
+	 * (and hence introduce a temporary inconsistency), for efficiency.
+	 * In fact, in a forced_dispatch, this prevents two counters related
+	 * to bfqq->dispatched from being uselessly decremented if bfqq
+	 * is not in service, and then incremented again after
+	 * incrementing bfqq->dispatched.
+	 */
+	bfqq->dispatched++;
+	bfq_remove_request(rq);
+	elv_dispatch_sort(q, rq);
+
+	if (bfq_bfqq_sync(bfqq))
+		bfqd->sync_flight++;
+}
+
+/*
+ * Return expired entry, or NULL to just start from scratch in rbtree.
+ */
+static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
+{
+	struct request *rq = NULL;
+
+	if (bfq_bfqq_fifo_expire(bfqq))
+		return NULL;
+
+	bfq_mark_bfqq_fifo_expire(bfqq);
+
+	if (list_empty(&bfqq->fifo))
+		return NULL;
+
+	rq = rq_entry_fifo(bfqq->fifo.next);
+
+	if (time_before(jiffies, rq->fifo_time))
+		return NULL;
+
+	return rq;
+}
+
+static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	return entity->budget - entity->service;
+}
+
+static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	__bfq_bfqd_reset_in_service(bfqd);
+
+	if (RB_EMPTY_ROOT(&bfqq->sort_list))
+		bfq_del_bfqq_busy(bfqd, bfqq, 1);
+	else
+		bfq_activate_bfqq(bfqd, bfqq);
+}
+
+/**
+ * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
+ * @bfqd: device data.
+ * @bfqq: queue to update.
+ * @reason: reason for expiration.
+ *
+ * Handle the feedback on @bfqq budget.  See the body for detailed
+ * comments.
+ */
+static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
+				     struct bfq_queue *bfqq,
+				     enum bfqq_expiration reason)
+{
+	struct request *next_rq;
+	unsigned long budget, min_budget;
+
+	budget = bfqq->max_budget;
+	min_budget = bfq_min_budget(bfqd);
+
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %lu, budg left %lu",
+		bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %lu, min budg %lu",
+		budget, bfq_min_budget(bfqd));
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
+		bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
+
+	if (bfq_bfqq_sync(bfqq)) {
+		switch (reason) {
+		/*
+		 * Caveat: in all the following cases we trade latency
+		 * for throughput.
+		 */
+		case BFQ_BFQQ_TOO_IDLE:
+			if (budget > min_budget + BFQ_BUDGET_STEP)
+				budget -= BFQ_BUDGET_STEP;
+			else
+				budget = min_budget;
+			break;
+		case BFQ_BFQQ_BUDGET_TIMEOUT:
+			budget = bfq_default_budget(bfqd, bfqq);
+			break;
+		case BFQ_BFQQ_BUDGET_EXHAUSTED:
+			/*
+			 * The process still has backlog, and did not
+			 * let either the budget timeout or the disk
+			 * idling timeout expire. Hence it is not
+			 * seeky, has a short thinktime and may be
+			 * happy with a higher budget too. So
+			 * definitely increase the budget of this good
+			 * candidate to boost the disk throughput.
+			 */
+			budget = min(budget + 8 * BFQ_BUDGET_STEP,
+				     bfqd->bfq_max_budget);
+			break;
+		case BFQ_BFQQ_NO_MORE_REQUESTS:
+		       /*
+			* Leave the budget unchanged.
+			*/
+		default:
+			return;
+		}
+	} else /* async queue */
+	    /* async queues always get the maximum possible budget
+	     * (their ability to dispatch is limited by
+	     * @bfqd->bfq_max_budget_async_rq).
+	     */
+		budget = bfqd->bfq_max_budget;
+
+	bfqq->max_budget = budget;
+
+	if (bfqd->budgets_assigned >= 194 && bfqd->bfq_user_max_budget == 0 &&
+	    bfqq->max_budget > bfqd->bfq_max_budget)
+		bfqq->max_budget = bfqd->bfq_max_budget;
+
+	/*
+	 * Make sure that we have enough budget for the next request.
+	 * Since the finish time of the bfqq must be kept in sync with
+	 * the budget, be sure to call __bfq_bfqq_expire() after the
+	 * update.
+	 */
+	next_rq = bfqq->next_rq;
+	if (next_rq != NULL)
+		bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
+					    bfq_serv_to_charge(next_rq, bfqq));
+	else
+		bfqq->entity.budget = bfqq->max_budget;
+
+	bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %lu",
+			next_rq != NULL ? blk_rq_sectors(next_rq) : 0,
+			bfqq->entity.budget);
+}
+
+static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
+{
+	unsigned long max_budget;
+
+	/*
+	 * The max_budget calculated when autotuning is equal to the
+	 * number of sectors transferred in timeout_sync at the
+	 * estimated peak rate.
+	 */
+	max_budget = (unsigned long)(peak_rate * 1000 *
+				     timeout >> BFQ_RATE_SHIFT);
+
+	return max_budget;
+}
+
+/*
+ * In addition to updating the peak rate, checks whether the process
+ * is "slow", and returns 1 if so. This slow flag is used, in addition
+ * to the budget timeout, to reduce the amount of service provided to
+ * seeky processes, and hence reduce their chances to lower the
+ * throughput. See the code for more details.
+ */
+static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				int compensate)
+{
+	u64 bw, usecs, expected, timeout;
+	ktime_t delta;
+	int update = 0;
+
+	if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
+		return 0;
+
+	if (compensate)
+		delta = bfqd->last_idling_start;
+	else
+		delta = ktime_get();
+	delta = ktime_sub(delta, bfqd->last_budget_start);
+	usecs = ktime_to_us(delta);
+
+	/* Don't trust short/unrealistic values. */
+	if (usecs < 100 || usecs >= LONG_MAX)
+		return 0;
+
+	/*
+	 * Calculate the bandwidth for the last slice.  We use a 64 bit
+	 * value to store the peak rate, in sectors per usec in fixed
+	 * point math.  We do so to have enough precision in the estimate
+	 * and to avoid overflows.
+	 */
+	bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
+	do_div(bw, (unsigned long)usecs);
+
+	timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
+
+	/*
+	 * Use only long (> 20ms) intervals to filter out spikes for
+	 * the peak rate estimation.
+	 */
+	if (usecs > 20000) {
+		if (bw > bfqd->peak_rate) {
+			bfqd->peak_rate = bw;
+			update = 1;
+			bfq_log(bfqd, "new peak_rate=%llu", bw);
+		}
+
+		update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
+
+		if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
+			bfqd->peak_rate_samples++;
+
+		if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
+		    update && bfqd->bfq_user_max_budget == 0) {
+			bfqd->bfq_max_budget =
+				bfq_calc_max_budget(bfqd->peak_rate,
+						    timeout);
+			bfq_log(bfqd, "new max_budget=%lu",
+				bfqd->bfq_max_budget);
+		}
+	}
+
+	/*
+	 * A process is considered ``slow'' (i.e., seeky, so that we
+	 * cannot treat it fairly in the service domain, as it would
+	 * slow down the other processes too much) if, when a slice
+	 * ends for whatever reason, it has received service at a
+	 * rate that would not be high enough to complete the budget
+	 * before the budget timeout expiration.
+	 */
+	expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT;
+
+	/*
+	 * Caveat: processes doing IO in the slower disk zones will
+	 * tend to be slow(er) even if not seeky. And the estimated
+	 * peak rate will actually be an average over the disk
+	 * surface. Hence, to avoid being too harsh with unlucky processes,
+	 * we keep a budget/3 margin of safety before declaring a
+	 * process slow.
+	 */
+	return expected > (4 * bfqq->entity.budget) / 3;
+}
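The rate sample itself is computed in the same fixed-point format: the service received in the slice, in sectors, is shifted left by BFQ_RATE_SHIFT and divided by the slice duration in microseconds, so that rates well below one sector per microsecond do not round to zero. A stand-alone illustration, again with the shift assumed to be 16:

#include <stdio.h>
#include <stdint.h>

#define RATE_SHIFT 16	/* assumption; plays the role of BFQ_RATE_SHIFT */

/* Fixed-point rate of one slice: sectors per usec, shifted left. */
static uint64_t slice_rate(unsigned long service_sectors, uint64_t usecs)
{
	return ((uint64_t)service_sectors << RATE_SHIFT) / usecs;
}

int main(void)
{
	/* 2048 sectors (1 MiB) served in 25 ms */
	uint64_t bw = slice_rate(2048, 25000);

	printf("fixed-point rate %llu (~%.3f sectors/usec)\n",
	       (unsigned long long)bw, bw / (double)(1 << RATE_SHIFT));
	return 0;
}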
+
+/**
+ * bfq_bfqq_expire - expire a queue.
+ * @bfqd: device owning the queue.
+ * @bfqq: the queue to expire.
+ * @compensate: if true, compensate for the time spent idling.
+ * @reason: the reason causing the expiration.
+ *
+ * If the process associated with the queue is slow (i.e., seeky), or in
+ * case of budget timeout, or, finally, if it is async, we
+ * artificially charge it an entire budget (independently of the
+ * actual service it received). As a consequence, the queue will get
+ * higher timestamps than the correct ones upon reactivation, and
+ * hence it will be rescheduled as if it had received more service
+ * than what it actually received. In the end, this class of processes
+ * will receive less service in proportion to how slowly they consume
+ * their budgets (and hence how seriously they tend to lower the
+ * throughput).
+ *
+ * In contrast, when a queue expires because it has been idling for
+ * too long or because it exhausted its budget, we do not touch the
+ * amount of service it has received. Hence, when the queue is
+ * reactivated and its timestamps updated, the latter will be in sync
+ * with the actual service received by the queue until expiration.
+ *
+ * Charging a full budget to the first type of queues and the exact
+ * service to the others has the effect of using the WF2Q+ policy to
+ * schedule the former on a timeslice basis, without violating the
+ * service domain guarantees of the latter.
+ */
+static void bfq_bfqq_expire(struct bfq_data *bfqd,
+			    struct bfq_queue *bfqq,
+			    int compensate,
+			    enum bfqq_expiration reason)
+{
+	int slow;
+
+	/* Update disk peak rate for autotuning and check whether the
+	 * process is slow (see bfq_update_peak_rate).
+	 */
+	slow = bfq_update_peak_rate(bfqd, bfqq, compensate);
+
+	/*
+	 * As explained above, 'punish' slow (i.e., seeky), timed-out
+	 * and async queues, to favor sequential sync workloads.
+	 */
+	if (slow || reason == BFQ_BFQQ_BUDGET_TIMEOUT)
+		bfq_bfqq_charge_full_budget(bfqq);
+
+	bfq_log_bfqq(bfqd, bfqq,
+		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
+		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
+
+	/*
+	 * Increase, decrease or leave budget unchanged according to
+	 * reason.
+	 */
+	__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
+	__bfq_bfqq_expire(bfqd, bfqq);
+}
+
+/*
+ * Budget timeout is not implemented through a dedicated timer, but
+ * just checked on request arrivals and completions, as well as on
+ * idle timer expirations.
+ */
+static int bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
+{
+	if (bfq_bfqq_budget_new(bfqq) ||
+	    time_before(jiffies, bfqq->budget_timeout))
+		return 0;
+	return 1;
+}
+
+/*
+ * If we expire a queue that is waiting for the arrival of a new
+ * request, we may prevent the fictitious timestamp back-shifting that
+ * allows the guarantees of the queue to be preserved (see [1] for
+ * this tricky aspect). Hence we return true only if this condition
+ * does not hold, or if the queue is slow enough that it deserves to be
+ * kicked off anyway, to preserve a high throughput.
+ */
+static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqq->bfqd, bfqq,
+		"may_budget_timeout: wait_request %d left %d timeout %d",
+		bfq_bfqq_wait_request(bfqq),
+			bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3,
+		bfq_bfqq_budget_timeout(bfqq));
+
+	return (!bfq_bfqq_wait_request(bfqq) ||
+		bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3)
+		&&
+		bfq_bfqq_budget_timeout(bfqq);
+}
+
+/*
+ * Device idling is allowed only for sync queues that have a non-null
+ * idle window.
+ */
+static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
+{
+	return bfq_bfqq_sync(bfqq) && bfq_bfqq_idle_window(bfqq);
+}
+
+/*
+ * If the in-service queue is empty, but it is sync and the queue has its
+ * idle window set (in this case, waiting for a new request for the queue
+ * is likely to boost the throughput), then:
+ * 1) the queue must remain in service and cannot be expired, and
+ * 2) the disk must be idled to wait for the possible arrival of a new
+ *    request for the queue.
+ */
+static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
+{
+	struct bfq_data *bfqd = bfqq->bfqd;
+
+	return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 &&
+	       bfq_bfqq_must_not_expire(bfqq);
+}
+
+/*
+ * Select a queue for service.  If we have a current queue in service,
+ * check whether to continue servicing it, or retrieve and set a new one.
+ */
+static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq;
+	struct request *next_rq;
+	enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+
+	bfqq = bfqd->in_service_queue;
+	if (bfqq == NULL)
+		goto new_queue;
+
+	bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
+
+	if (bfq_may_expire_for_budg_timeout(bfqq) &&
+	    !timer_pending(&bfqd->idle_slice_timer) &&
+	    !bfq_bfqq_must_idle(bfqq))
+		goto expire;
+
+	next_rq = bfqq->next_rq;
+	/*
+	 * If bfqq has requests queued and it has enough budget left to
+	 * serve them, keep the queue, otherwise expire it.
+	 */
+	if (next_rq != NULL) {
+		if (bfq_serv_to_charge(next_rq, bfqq) >
+			bfq_bfqq_budget_left(bfqq)) {
+			reason = BFQ_BFQQ_BUDGET_EXHAUSTED;
+			goto expire;
+		} else {
+			/*
+			 * The idle timer may be pending because we may
+			 * not disable disk idling even when a new request
+			 * arrives.
+			 */
+			if (timer_pending(&bfqd->idle_slice_timer)) {
+				/*
+				 * If we get here: 1) at least one new
+				 * request has arrived but we have not
+				 * disabled the timer because the request
+				 * was too small, and 2) the block layer
+				 * has then unplugged the device, causing
+				 * dispatch to be invoked.
+				 *
+				 * Since the device is unplugged, now the
+				 * requests are probably large enough to
+				 * provide a reasonable throughput.
+				 * So we disable idling.
+				 */
+				bfq_clear_bfqq_wait_request(bfqq);
+				del_timer(&bfqd->idle_slice_timer);
+			}
+			goto keep_queue;
+		}
+	}
+
+	/*
+	 * No requests pending.  If the in-service queue still has requests
+	 * in flight (possibly waiting for a completion) or is idling for a
+	 * new request, then keep it.
+	 */
+	if (timer_pending(&bfqd->idle_slice_timer) ||
+	    (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq))) {
+		bfqq = NULL;
+		goto keep_queue;
+	}
+
+	reason = BFQ_BFQQ_NO_MORE_REQUESTS;
+expire:
+	bfq_bfqq_expire(bfqd, bfqq, 0, reason);
+new_queue:
+	bfqq = bfq_set_in_service_queue(bfqd);
+	bfq_log(bfqd, "select_queue: new queue %d returned",
+		bfqq != NULL ? bfqq->pid : 0);
+keep_queue:
+	return bfqq;
+}
+
+/*
+ * Dispatch one request from bfqq, moving it to the request queue
+ * dispatch list.
+ */
+static int bfq_dispatch_request(struct bfq_data *bfqd,
+				struct bfq_queue *bfqq)
+{
+	int dispatched = 0;
+	struct request *rq;
+	unsigned long service_to_charge;
+
+	/* Follow expired path, else get first next available. */
+	rq = bfq_check_fifo(bfqq);
+	if (rq == NULL)
+		rq = bfqq->next_rq;
+	service_to_charge = bfq_serv_to_charge(rq, bfqq);
+
+	if (service_to_charge > bfq_bfqq_budget_left(bfqq)) {
+		/*
+		 * This may happen if the next rq is chosen in fifo order
+		 * instead of sector order. The budget is properly
+		 * dimensioned to be always sufficient to serve the next
+		 * request only if it is chosen in sector order. The reason
+		 * is that it would be quite inefficient and of little use
+		 * to always make sure that the budget is large enough to
+		 * serve even the possible next rq in fifo order.
+		 * In fact, requests are seldom served in fifo order.
+		 *
+		 * Expire the queue for budget exhaustion, and make sure
+		 * that the next act_budget is enough to serve the next
+		 * request, even if it comes from the fifo expired path.
+		 */
+		bfqq->next_rq = rq;
+		/*
+		 * Since this dispatch failed, make sure that
+		 * a new one will be performed.
+		 */
+		if (!bfqd->rq_in_driver)
+			bfq_schedule_dispatch(bfqd);
+		goto expire;
+	}
+
+	/* Finally, insert request into driver dispatch list. */
+	bfq_bfqq_served(bfqq, service_to_charge);
+	bfq_dispatch_insert(bfqd->queue, rq);
+
+	bfq_log_bfqq(bfqd, bfqq,
+			"dispatched %u sec req (%llu), budg left %lu",
+			blk_rq_sectors(rq),
+			(long long unsigned)blk_rq_pos(rq),
+			bfq_bfqq_budget_left(bfqq));
+
+	dispatched++;
+
+	if (bfqd->in_service_bic == NULL) {
+		atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount);
+		bfqd->in_service_bic = RQ_BIC(rq);
+	}
+
+	if (bfqd->busy_queues > 1 && ((!bfq_bfqq_sync(bfqq) &&
+	    dispatched >= bfqd->bfq_max_budget_async_rq) ||
+	    bfq_class_idle(bfqq)))
+		goto expire;
+
+	return dispatched;
+
+expire:
+	bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_EXHAUSTED);
+	return dispatched;
+}
+
+static int __bfq_forced_dispatch_bfqq(struct bfq_queue *bfqq)
+{
+	int dispatched = 0;
+
+	while (bfqq->next_rq != NULL) {
+		bfq_dispatch_insert(bfqq->bfqd->queue, bfqq->next_rq);
+		dispatched++;
+	}
+
+	return dispatched;
+}
+
+/*
+ * Drain our current requests.
+ * Used for barriers and when switching io schedulers on-the-fly.
+ */
+static int bfq_forced_dispatch(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq, *n;
+	struct bfq_service_tree *st;
+	int dispatched = 0;
+
+	bfqq = bfqd->in_service_queue;
+	if (bfqq != NULL)
+		__bfq_bfqq_expire(bfqd, bfqq);
+
+	/*
+	 * Loop through classes, and be careful to leave the scheduler
+	 * in a consistent state, as feedback mechanisms and vtime
+	 * updates cannot be disabled during the process.
+	 */
+	list_for_each_entry_safe(bfqq, n, &bfqd->active_list, bfqq_list) {
+		st = bfq_entity_service_tree(&bfqq->entity);
+
+		dispatched += __bfq_forced_dispatch_bfqq(bfqq);
+		bfqq->max_budget = bfq_max_budget(bfqd);
+
+		bfq_forget_idle(st);
+	}
+
+	return dispatched;
+}
+
+static int bfq_dispatch_requests(struct request_queue *q, int force)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq;
+	int max_dispatch;
+
+	bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
+	if (bfqd->busy_queues == 0)
+		return 0;
+
+	if (unlikely(force))
+		return bfq_forced_dispatch(bfqd);
+
+	bfqq = bfq_select_queue(bfqd);
+	if (bfqq == NULL)
+		return 0;
+
+	max_dispatch = bfqd->bfq_quantum;
+	if (bfq_class_idle(bfqq))
+		max_dispatch = 1;
+
+	if (!bfq_bfqq_sync(bfqq))
+		max_dispatch = bfqd->bfq_max_budget_async_rq;
+
+	if (bfqq->dispatched >= max_dispatch) {
+		if (bfqd->busy_queues > 1)
+			return 0;
+		if (bfqq->dispatched >= 4 * max_dispatch)
+			return 0;
+	}
+
+	if (bfqd->sync_flight != 0 && !bfq_bfqq_sync(bfqq))
+		return 0;
+
+	bfq_clear_bfqq_wait_request(bfqq);
+
+	if (!bfq_dispatch_request(bfqd, bfqq))
+		return 0;
+
+	bfq_log_bfqq(bfqd, bfqq, "dispatched one request of %d (max_disp %d)",
+			bfqq->pid, max_dispatch);
+
+	return 1;
+}
+
+/*
+ * Task holds one reference to the queue, dropped when task exits.  Each rq
+ * in-flight on this queue also holds a reference, dropped when rq is freed.
+ *
+ * Queue lock must be held here.
+ */
+static void bfq_put_queue(struct bfq_queue *bfqq)
+{
+	struct bfq_data *bfqd = bfqq->bfqd;
+
+	bfq_log_bfqq(bfqd, bfqq, "put_queue: %p %d", bfqq,
+		     atomic_read(&bfqq->ref));
+	if (!atomic_dec_and_test(&bfqq->ref))
+		return;
+
+	bfq_log_bfqq(bfqd, bfqq, "put_queue: %p freed", bfqq);
+
+	kmem_cache_free(bfq_pool, bfqq);
+}
+
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	if (bfqq == bfqd->in_service_queue) {
+		__bfq_bfqq_expire(bfqd, bfqq);
+		bfq_schedule_dispatch(bfqd);
+	}
+
+	bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq,
+		     atomic_read(&bfqq->ref));
+
+	bfq_put_queue(bfqq);
+}
+
+static inline void bfq_init_icq(struct io_cq *icq)
+{
+	icq_to_bic(icq)->ttime.last_end_request = jiffies;
+}
+
+static void bfq_exit_icq(struct io_cq *icq)
+{
+	struct bfq_io_cq *bic = icq_to_bic(icq);
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+
+	if (bic->bfqq[BLK_RW_ASYNC]) {
+		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_ASYNC]);
+		bic->bfqq[BLK_RW_ASYNC] = NULL;
+	}
+
+	if (bic->bfqq[BLK_RW_SYNC]) {
+		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
+		bic->bfqq[BLK_RW_SYNC] = NULL;
+	}
+}
+
+/*
+ * Update the entity prio values; note that the new values will not
+ * be used until the next (re)activation.
+ */
+static void bfq_init_prio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+{
+	struct task_struct *tsk = current;
+	int ioprio_class;
+
+	if (!bfq_bfqq_prio_changed(bfqq))
+		return;
+
+	ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
+	switch (ioprio_class) {
+	default:
+		dev_err(bfqq->bfqd->queue->backing_dev_info.dev,
+			"bfq: bad prio %x\n", ioprio_class);
+	case IOPRIO_CLASS_NONE:
+		/*
+		 * No prio set, inherit CPU scheduling settings.
+		 */
+		bfqq->entity.new_ioprio = task_nice_ioprio(tsk);
+		bfqq->entity.new_ioprio_class = task_nice_ioclass(tsk);
+		break;
+	case IOPRIO_CLASS_RT:
+		bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->entity.new_ioprio_class = IOPRIO_CLASS_RT;
+		break;
+	case IOPRIO_CLASS_BE:
+		bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->entity.new_ioprio_class = IOPRIO_CLASS_BE;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		bfqq->entity.new_ioprio_class = IOPRIO_CLASS_IDLE;
+		bfqq->entity.new_ioprio = 7;
+		bfq_clear_bfqq_idle_window(bfqq);
+		break;
+	}
+
+	bfqq->entity.ioprio_changed = 1;
+
+	bfq_clear_bfqq_prio_changed(bfqq);
+}
+
+static void bfq_changed_ioprio(struct bfq_io_cq *bic)
+{
+	struct bfq_data *bfqd;
+	struct bfq_queue *bfqq, *new_bfqq;
+	unsigned long uninitialized_var(flags);
+	int ioprio = bic->icq.ioc->ioprio;
+
+	bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
+				   &flags);
+	/*
+	 * This condition may trigger on a newly created bic; be sure to
+	 * drop the lock before returning.
+	 */
+	if (unlikely(bfqd == NULL) || likely(bic->ioprio == ioprio))
+		goto out;
+
+	bfqq = bic->bfqq[BLK_RW_ASYNC];
+	if (bfqq != NULL) {
+		new_bfqq = bfq_get_queue(bfqd, BLK_RW_ASYNC, bic,
+					 GFP_ATOMIC);
+		if (new_bfqq != NULL) {
+			bic->bfqq[BLK_RW_ASYNC] = new_bfqq;
+			bfq_log_bfqq(bfqd, bfqq,
+				     "changed_ioprio: bfqq %p %d",
+				     bfqq, atomic_read(&bfqq->ref));
+			bfq_put_queue(bfqq);
+		}
+	}
+
+	bfqq = bic->bfqq[BLK_RW_SYNC];
+	if (bfqq != NULL)
+		bfq_mark_bfqq_prio_changed(bfqq);
+
+	bic->ioprio = ioprio;
+
+out:
+	bfq_put_bfqd_unlock(bfqd, &flags);
+}
+
+static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			  pid_t pid, int is_sync)
+{
+	RB_CLEAR_NODE(&bfqq->entity.rb_node);
+	INIT_LIST_HEAD(&bfqq->fifo);
+
+	atomic_set(&bfqq->ref, 0);
+	bfqq->bfqd = bfqd;
+
+	bfq_mark_bfqq_prio_changed(bfqq);
+
+	if (is_sync) {
+		if (!bfq_class_idle(bfqq))
+			bfq_mark_bfqq_idle_window(bfqq);
+		bfq_mark_bfqq_sync(bfqq);
+	}
+
+	/* Tentative initial value, trading off throughput and latency */
+	bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
+	bfqq->pid = pid;
+}
+
+static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
+					      int is_sync,
+					      struct bfq_io_cq *bic,
+					      gfp_t gfp_mask)
+{
+	struct bfq_queue *bfqq, *new_bfqq = NULL;
+
+retry:
+	/* bic always exists here */
+	bfqq = bic_to_bfqq(bic, is_sync);
+
+	/*
+	 * Always try a new alloc if we originally fell back to the OOM
+	 * bfqq, since that should just be a temporary situation.
+	 */
+	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
+		bfqq = NULL;
+		if (new_bfqq != NULL) {
+			bfqq = new_bfqq;
+			new_bfqq = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			spin_unlock_irq(bfqd->queue->queue_lock);
+			new_bfqq = kmem_cache_alloc_node(bfq_pool,
+					gfp_mask | __GFP_ZERO,
+					bfqd->queue->node);
+			spin_lock_irq(bfqd->queue->queue_lock);
+			if (new_bfqq != NULL)
+				goto retry;
+		} else {
+			bfqq = kmem_cache_alloc_node(bfq_pool,
+					gfp_mask | __GFP_ZERO,
+					bfqd->queue->node);
+		}
+
+		if (bfqq != NULL) {
+			bfq_init_bfqq(bfqd, bfqq, current->pid, is_sync);
+			bfq_log_bfqq(bfqd, bfqq, "allocated");
+		} else {
+			bfqq = &bfqd->oom_bfqq;
+			bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");
+		}
+
+		bfq_init_prio_data(bfqq, bic);
+	}
+
+	if (new_bfqq != NULL)
+		kmem_cache_free(bfq_pool, new_bfqq);
+
+	return bfqq;
+}
+
+static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
+					       int ioprio_class, int ioprio)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		return &async_bfqq[0][ioprio];
+	case IOPRIO_CLASS_NONE:
+		ioprio = IOPRIO_NORM;
+		/* fall through */
+	case IOPRIO_CLASS_BE:
+		return &async_bfqq[1][ioprio];
+	case IOPRIO_CLASS_IDLE:
+		return &async_idle_bfqq;
+	default:
+		BUG();
+	}
+}
+
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
+				       int is_sync, struct bfq_io_cq *bic,
+				       gfp_t gfp_mask)
+{
+	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
+	struct bfq_queue **async_bfqq = NULL;
+	struct bfq_queue *bfqq = NULL;
+
+	if (!is_sync) {
+		async_bfqq = bfq_async_queue_prio(bfqd, ioprio_class, ioprio);
+		bfqq = *async_bfqq;
+	}
+
+	if (bfqq == NULL)
+		bfqq = bfq_find_alloc_queue(bfqd, is_sync, bic, gfp_mask);
+
+	/*
+	 * Pin the queue now that it's allocated, scheduler exit will
+	 * prune it.
+	 */
+	if (!is_sync && *async_bfqq == NULL) {
+		atomic_inc(&bfqq->ref);
+		bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		*async_bfqq = bfqq;
+	}
+
+	atomic_inc(&bfqq->ref);
+	bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq,
+		     atomic_read(&bfqq->ref));
+	return bfqq;
+}
+
+static void bfq_update_io_thinktime(struct bfq_data *bfqd,
+				    struct bfq_io_cq *bic)
+{
+	unsigned long elapsed = jiffies - bic->ttime.last_end_request;
+	unsigned long ttime = min(elapsed, 2UL * bfqd->bfq_slice_idle);
+
+	bic->ttime.ttime_samples = (7*bic->ttime.ttime_samples + 256) / 8;
+	bic->ttime.ttime_total = (7*bic->ttime.ttime_total + 256*ttime) / 8;
+	bic->ttime.ttime_mean = (bic->ttime.ttime_total + 128) /
+				bic->ttime.ttime_samples;
+}
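The think time is thus a decaying average kept in fixed point: each new sample carries a weight of 256, ttime_samples converges towards 256, and the accumulated history loses 1/8 of its weight per sample. A user-space sketch of the same arithmetic:

#include <stdio.h>

struct ttime { unsigned long samples, total, mean; };

static void ttime_update(struct ttime *t, unsigned long sample)
{
	t->samples = (7 * t->samples + 256) / 8;	/* converges to 256 */
	t->total = (7 * t->total + 256 * sample) / 8;	/* decaying sum */
	t->mean = (t->total + 128) / t->samples;	/* rounded mean */
}

int main(void)
{
	struct ttime t = { 0, 0, 0 };
	unsigned long s[] = { 4, 4, 4, 40, 4 };

	for (unsigned int i = 0; i < sizeof(s) / sizeof(s[0]); i++) {
		ttime_update(&t, s[i]);
		printf("sample %lu -> mean %lu\n", s[i], t.mean);
	}
	return 0;
}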
+
+static void bfq_update_io_seektime(struct bfq_data *bfqd,
+				   struct bfq_queue *bfqq,
+				   struct request *rq)
+{
+	sector_t sdist;
+	u64 total;
+
+	if (bfqq->last_request_pos < blk_rq_pos(rq))
+		sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
+	else
+		sdist = bfqq->last_request_pos - blk_rq_pos(rq);
+
+	/*
+	 * Don't allow the seek distance to get too large from the
+	 * odd fragment, pagein, etc.
+	 */
+	if (bfqq->seek_samples == 0) /* first request, not really a seek */
+		sdist = 0;
+	else if (bfqq->seek_samples <= 60) /* second & third seek */
+		sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*1024);
+	else
+		sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*64);
+
+	bfqq->seek_samples = (7*bfqq->seek_samples + 256) / 8;
+	bfqq->seek_total = (7*bfqq->seek_total + (u64)256*sdist) / 8;
+	total = bfqq->seek_total + (bfqq->seek_samples/2);
+	do_div(total, bfqq->seek_samples);
+	bfqq->seek_mean = (sector_t)total;
+
+	bfq_log_bfqq(bfqd, bfqq, "dist=%llu mean=%llu", (u64)sdist,
+			(u64)bfqq->seek_mean);
+}
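The seek distance uses the same 7/8 decaying average, but each raw sample is first clamped to roughly four times the current mean plus some slack, so that a single huge jump cannot distort the estimate. A minimal sketch of just that clamping step, in plain C, with the constants treated as illustrative:

#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Clamp a new seek-distance sample against the running mean, so one
 * huge jump cannot distort the estimate (constants are illustrative). */
static unsigned long clamp_seek(unsigned long sdist, unsigned long mean,
				unsigned long samples)
{
	if (samples == 0)		/* first request: not really a seek */
		return 0;
	if (samples <= 60)		/* early samples: generous slack */
		return MIN(sdist, mean * 4 + 2 * 1024 * 1024);
	return MIN(sdist, mean * 4 + 2 * 1024 * 64);
}

int main(void)
{
	/* a 10,000,000-sector jump against a mean of 1000 gets clamped */
	printf("%lu\n", clamp_seek(10000000UL, 1000UL, 100UL));
	return 0;
}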
+
+/*
+ * Disable idle window if the process thinks too long or seeks so much that
+ * it doesn't matter.
+ */
+static void bfq_update_idle_window(struct bfq_data *bfqd,
+				   struct bfq_queue *bfqq,
+				   struct bfq_io_cq *bic)
+{
+	int enable_idle;
+
+	/* Don't idle for async or idle io prio class. */
+	if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
+		return;
+
+	enable_idle = bfq_bfqq_idle_window(bfqq);
+
+	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
+	    bfqd->bfq_slice_idle == 0 ||
+		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+		enable_idle = 0;
+	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
+		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+	bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d",
+		enable_idle);
+
+	if (enable_idle)
+		bfq_mark_bfqq_idle_window(bfqq);
+	else
+		bfq_clear_bfqq_idle_window(bfqq);
+}
+
+/*
+ * Called when a new fs request (rq) is added to bfqq.  Check if there's
+ * something we should do about it.
+ */
+static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			    struct request *rq)
+{
+	struct bfq_io_cq *bic = RQ_BIC(rq);
+
+	if (rq->cmd_flags & REQ_META)
+		bfqq->meta_pending++;
+
+	bfq_update_io_thinktime(bfqd, bic);
+	bfq_update_io_seektime(bfqd, bfqq, rq);
+	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
+	    !BFQQ_SEEKY(bfqq))
+		bfq_update_idle_window(bfqd, bfqq, bic);
+
+	bfq_log_bfqq(bfqd, bfqq,
+		     "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
+		     bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq),
+		     (long long unsigned)bfqq->seek_mean);
+
+	bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
+
+	if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
+		int small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&
+				blk_rq_sectors(rq) < 32;
+		int budget_timeout = bfq_bfqq_budget_timeout(bfqq);
+
+		/*
+		 * There is just this request queued: if the request
+		 * is small and the queue is not to be expired, then
+		 * just exit.
+		 *
+		 * In this way, if the disk is being idled to wait for
+		 * a new request from the in-service queue, we avoid
+		 * unplugging the device and committing the disk to serve
+		 * just a small request. Instead, we wait for
+		 * the block layer to decide when to unplug the device:
+		 * hopefully, new requests will be merged with this one
+		 * quickly, then the device will be unplugged and
+		 * larger requests will be dispatched.
+		 */
+		if (small_req && !budget_timeout)
+			return;
+
+		/*
+		 * A large enough request arrived, or the queue is to
+		 * be expired: in both cases disk idling is to be
+		 * stopped, so clear the wait_request flag and delete
+		 * the timer.
+		 */
+		bfq_clear_bfqq_wait_request(bfqq);
+		del_timer(&bfqd->idle_slice_timer);
+
+		/*
+		 * The queue is not empty, because a new request just
+		 * arrived. Hence we can safely expire the queue, in
+		 * case of budget timeout, without risking that the
+		 * timestamps of the queue are not updated correctly.
+		 * See [1] for more details.
+		 */
+		if (budget_timeout)
+			bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
+
+		/*
+		 * Let the request rip immediately, or let a new queue be
+		 * selected if bfqq has just been expired.
+		 */
+		__blk_run_queue(bfqd->queue);
+	}
+}
+
+static void bfq_insert_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	assert_spin_locked(bfqd->queue->queue_lock);
+
+	bfq_init_prio_data(bfqq, RQ_BIC(rq));
+
+	bfq_add_request(rq);
+
+	rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
+	list_add_tail(&rq->queuelist, &bfqq->fifo);
+
+	bfq_rq_enqueued(bfqd, bfqq, rq);
+}
+
+static void bfq_update_hw_tag(struct bfq_data *bfqd)
+{
+	bfqd->max_rq_in_driver = max(bfqd->max_rq_in_driver,
+				     bfqd->rq_in_driver);
+
+	if (bfqd->hw_tag == 1)
+		return;
+
+	/*
+	 * This sample is valid if the number of outstanding requests
+	 * is large enough to allow queueing behavior.  Note that the
+	 * sum is not exact, as it does not take into account
+	 * deactivated requests.
+	 */
+	if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
+		return;
+
+	if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
+		return;
+
+	bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
+	bfqd->max_rq_in_driver = 0;
+	bfqd->hw_tag_samples = 0;
+}
+
+static void bfq_completed_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_data *bfqd = bfqq->bfqd;
+	bool sync = bfq_bfqq_sync(bfqq);
+
+	bfq_log_bfqq(bfqd, bfqq, "completed one req with %u sects left (%d)",
+		     blk_rq_sectors(rq), sync);
+
+	bfq_update_hw_tag(bfqd);
+
+	bfqd->rq_in_driver--;
+	bfqq->dispatched--;
+
+	if (sync) {
+		bfqd->sync_flight--;
+		RQ_BIC(rq)->ttime.last_end_request = jiffies;
+	}
+
+	/*
+	 * If this is the in-service queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+	if (bfqd->in_service_queue == bfqq) {
+		if (bfq_bfqq_budget_new(bfqq))
+			bfq_set_budget_timeout(bfqd);
+
+		if (bfq_bfqq_must_idle(bfqq)) {
+			bfq_arm_slice_timer(bfqd);
+			goto out;
+		} else if (bfq_may_expire_for_budg_timeout(bfqq))
+			bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
+		else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
+			 (bfqq->dispatched == 0 ||
+			  !bfq_bfqq_must_not_expire(bfqq)))
+			bfq_bfqq_expire(bfqd, bfqq, 0,
+					BFQ_BFQQ_NO_MORE_REQUESTS);
+	}
+
+	if (!bfqd->rq_in_driver)
+		bfq_schedule_dispatch(bfqd);
+
+out:
+	return;
+}
+
+static inline int __bfq_may_queue(struct bfq_queue *bfqq)
+{
+	if (bfq_bfqq_wait_request(bfqq) && bfq_bfqq_must_alloc(bfqq)) {
+		bfq_clear_bfqq_must_alloc(bfqq);
+		return ELV_MQUEUE_MUST;
+	}
+
+	return ELV_MQUEUE_MAY;
+}
+
+static int bfq_may_queue(struct request_queue *q, int rw)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct task_struct *tsk = current;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	/*
+	 * Don't force setup of a queue from here, as a call to may_queue
+	 * does not necessarily imply that a request actually will be
+	 * queued. So just look up a possibly existing queue, or return
+	 * 'may queue' if that fails.
+	 */
+	bic = bfq_bic_lookup(bfqd, tsk->io_context);
+	if (bic == NULL)
+		return ELV_MQUEUE_MAY;
+
+	bfqq = bic_to_bfqq(bic, rw_is_sync(rw));
+	if (bfqq != NULL) {
+		bfq_init_prio_data(bfqq, bic);
+
+		return __bfq_may_queue(bfqq);
+	}
+
+	return ELV_MQUEUE_MAY;
+}
+
+/*
+ * Queue lock held here.
+ */
+static void bfq_put_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	if (bfqq != NULL) {
+		const int rw = rq_data_dir(rq);
+
+		bfqq->allocated[rw]--;
+
+		rq->elv.priv[0] = NULL;
+		rq->elv.priv[1] = NULL;
+
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "put_request %p, %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		bfq_put_queue(bfqq);
+	}
+}
+
+/*
+ * Allocate bfq data structures associated with this request.
+ */
+static int bfq_set_request(struct request_queue *q, struct request *rq,
+			   struct bio *bio, gfp_t gfp_mask)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
+	const int rw = rq_data_dir(rq);
+	const int is_sync = rq_is_sync(rq);
+	struct bfq_queue *bfqq;
+	unsigned long flags;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+
+	bfq_changed_ioprio(bic);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	if (bic == NULL)
+		goto queue_fail;
+
+	bfqq = bic_to_bfqq(bic, is_sync);
+	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
+		bfqq = bfq_get_queue(bfqd, is_sync, bic, gfp_mask);
+		bic_set_bfqq(bic, bfqq, is_sync);
+	}
+
+	bfqq->allocated[rw]++;
+	atomic_inc(&bfqq->ref);
+	bfq_log_bfqq(bfqd, bfqq, "set_request: bfqq %p, %d", bfqq,
+		     atomic_read(&bfqq->ref));
+
+	rq->elv.priv[0] = bic;
+	rq->elv.priv[1] = bfqq;
+
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return 0;
+
+queue_fail:
+	bfq_schedule_dispatch(bfqd);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return 1;
+}
+
+static void bfq_kick_queue(struct work_struct *work)
+{
+	struct bfq_data *bfqd =
+		container_of(work, struct bfq_data, unplug_work);
+	struct request_queue *q = bfqd->queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+/*
+ * Handler of the expiration of the timer running if the in-service queue
+ * is idling inside its time slice.
+ */
+static void bfq_idle_slice_timer(unsigned long data)
+{
+	struct bfq_data *bfqd = (struct bfq_data *)data;
+	struct bfq_queue *bfqq;
+	unsigned long flags;
+	enum bfqq_expiration reason;
+
+	spin_lock_irqsave(bfqd->queue->queue_lock, flags);
+
+	bfqq = bfqd->in_service_queue;
+	/*
+	 * Theoretical race here: the in-service queue can be NULL or
+	 * different from the queue that was idling if the timer handler
+	 * spins on the queue_lock and a new request arrives for the
+	 * current queue and there is a full dispatch cycle that changes
+	 * the in-service queue.  This is unlikely to happen, but in the
+	 * worst case we just expire a queue too early.
+	 */
+	if (bfqq != NULL) {
+		bfq_log_bfqq(bfqd, bfqq, "slice_timer expired");
+		if (bfq_bfqq_budget_timeout(bfqq))
+			/*
+			 * Here too, the queue can be safely expired
+			 * for budget timeout without wasting its
+			 * service guarantees.
+			 */
+			reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+		else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0)
+			/*
+			 * The queue may not be empty upon timer expiration,
+			 * because we may not disable the timer when the
+			 * first request of the in-service queue arrives
+			 * during disk idling.
+			 */
+			reason = BFQ_BFQQ_TOO_IDLE;
+		else
+			goto schedule_dispatch;
+
+		bfq_bfqq_expire(bfqd, bfqq, 1, reason);
+	}
+
+schedule_dispatch:
+	bfq_schedule_dispatch(bfqd);
+
+	spin_unlock_irqrestore(bfqd->queue->queue_lock, flags);
+}
+
+static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
+{
+	del_timer_sync(&bfqd->idle_slice_timer);
+	cancel_work_sync(&bfqd->unplug_work);
+}
+
+static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
+					struct bfq_queue **bfqq_ptr)
+{
+	struct bfq_queue *bfqq = *bfqq_ptr;
+
+	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
+	if (bfqq != NULL) {
+		bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		bfq_put_queue(bfqq);
+		*bfqq_ptr = NULL;
+	}
+}
+
+/*
+ * Release the extra reference of the async queues as the device
+ * goes away.
+ */
+static void bfq_put_async_queues(struct bfq_data *bfqd)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			__bfq_put_async_bfqq(bfqd, &async_bfqq[i][j]);
+
+	__bfq_put_async_bfqq(bfqd, &async_idle_bfqq);
+}
+
+static void bfq_exit_queue(struct elevator_queue *e)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	struct request_queue *q = bfqd->queue;
+	struct bfq_queue *bfqq, *n;
+
+	bfq_shutdown_timer_wq(bfqd);
+
+	spin_lock_irq(q->queue_lock);
+
+	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
+		bfq_deactivate_bfqq(bfqd, bfqq, 0);
+
+	bfq_put_async_queues(bfqd);
+	spin_unlock_irq(q->queue_lock);
+
+	bfq_shutdown_timer_wq(bfqd);
+
+	synchronize_rcu();
+
+	kfree(bfqd);
+}
+
+static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+{
+	struct bfq_data *bfqd;
+	struct elevator_queue *eq;
+
+	eq = elevator_alloc(q, e);
+	if (eq == NULL)
+		return -ENOMEM;
+
+	bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
+	if (bfqd == NULL) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+	eq->elevator_data = bfqd;
+
+	/*
+	 * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
+	 * Grab a permanent reference to it, so that the normal code flow
+	 * will not attempt to free it.
+	 */
+	bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, 1, 0);
+	atomic_inc(&bfqd->oom_bfqq.ref);
+
+	bfqd->queue = q;
+
+	spin_lock_irq(q->queue_lock);
+	q->elevator = eq;
+	spin_unlock_irq(q->queue_lock);
+
+	init_timer(&bfqd->idle_slice_timer);
+	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
+	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
+
+	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
+
+	INIT_LIST_HEAD(&bfqd->active_list);
+	INIT_LIST_HEAD(&bfqd->idle_list);
+
+	bfqd->hw_tag = -1;
+
+	bfqd->bfq_max_budget = bfq_default_max_budget;
+
+	bfqd->bfq_quantum = bfq_quantum;
+	bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0];
+	bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1];
+	bfqd->bfq_back_max = bfq_back_max;
+	bfqd->bfq_back_penalty = bfq_back_penalty;
+	bfqd->bfq_slice_idle = bfq_slice_idle;
+	bfqd->bfq_class_idle_last_service = 0;
+	bfqd->bfq_max_budget_async_rq = bfq_max_budget_async_rq;
+	bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async;
+	bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync;
+
+	return 0;
+}
+
+static void bfq_slab_kill(void)
+{
+	if (bfq_pool != NULL)
+		kmem_cache_destroy(bfq_pool);
+}
+
+static int __init bfq_slab_setup(void)
+{
+	bfq_pool = KMEM_CACHE(bfq_queue, 0);
+	if (bfq_pool == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+static ssize_t bfq_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t bfq_var_store(unsigned long *var, const char *page,
+			     size_t count)
+{
+	unsigned long new_val;
+	int ret = kstrtoul(page, 10, &new_val);
+
+	if (ret == 0)
+		*var = new_val;
+
+	return count;
+}
+
+static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
+{
+	struct bfq_queue *bfqq;
+	struct bfq_data *bfqd = e->elevator_data;
+	ssize_t num_char = 0;
+
+	num_char += sprintf(page + num_char, "Tot reqs queued %d\n\n",
+			    bfqd->queued);
+
+	spin_lock_irq(bfqd->queue->queue_lock);
+
+	num_char += sprintf(page + num_char, "Active:\n");
+	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
+		num_char += sprintf(page + num_char,
+				    "pid%d: weight %hu, nr_queued %d %d\n",
+				    bfqq->pid,
+				    bfqq->entity.weight,
+				    bfqq->queued[0],
+				    bfqq->queued[1]);
+	}
+
+	num_char += sprintf(page + num_char, "Idle:\n");
+	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
+		num_char += sprintf(page + num_char,
+				    "pid%d: weight %hu\n",
+				    bfqq->pid,
+				    bfqq->entity.weight);
+	}
+
+	spin_unlock_irq(bfqd->queue->queue_lock);
+
+	return num_char;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct bfq_data *bfqd = e->elevator_data;			\
+	unsigned int __data = __VAR;					\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return bfq_var_show(__data, (page));				\
+}
+SHOW_FUNCTION(bfq_quantum_show, bfqd->bfq_quantum, 0);
+SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 1);
+SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 1);
+SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0);
+SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0);
+SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 1);
+SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
+SHOW_FUNCTION(bfq_max_budget_async_rq_show,
+	      bfqd->bfq_max_budget_async_rq, 0);
+SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
+SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+static ssize_t								\
+__FUNC(struct elevator_queue *e, const char *page, size_t count)	\
+{									\
+	struct bfq_data *bfqd = e->elevator_data;			\
+	unsigned long uninitialized_var(__data);			\
+	int ret = bfq_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(bfq_quantum_store, &bfqd->bfq_quantum, 1, INT_MAX, 0);
+STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
+STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
+		INT_MAX, 0);
+STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq,
+		1, INT_MAX, 0);
+STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
+		INT_MAX, 1);
+#undef STORE_FUNCTION
+
+/* do nothing for the moment */
+static ssize_t bfq_weights_store(struct elevator_queue *e,
+				    const char *page, size_t count)
+{
+	return count;
+}
+
+static inline unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd)
+{
+	u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
+
+	if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES)
+		return bfq_calc_max_budget(bfqd->peak_rate, timeout);
+	else
+		return bfq_default_max_budget;
+}
+
+static ssize_t bfq_max_budget_store(struct elevator_queue *e,
+				    const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data == 0)
+		bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+	else {
+		if (__data > INT_MAX)
+			__data = INT_MAX;
+		bfqd->bfq_max_budget = __data;
+	}
+
+	bfqd->bfq_user_max_budget = __data;
+
+	return ret;
+}
+
+static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
+				      const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data < 1)
+		__data = 1;
+	else if (__data > INT_MAX)
+		__data = INT_MAX;
+
+	bfqd->bfq_timeout[BLK_RW_SYNC] = msecs_to_jiffies(__data);
+	if (bfqd->bfq_user_max_budget == 0)
+		bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+
+	return ret;
+}
+
+#define BFQ_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
+
+static struct elv_fs_entry bfq_attrs[] = {
+	BFQ_ATTR(quantum),
+	BFQ_ATTR(fifo_expire_sync),
+	BFQ_ATTR(fifo_expire_async),
+	BFQ_ATTR(back_seek_max),
+	BFQ_ATTR(back_seek_penalty),
+	BFQ_ATTR(slice_idle),
+	BFQ_ATTR(max_budget),
+	BFQ_ATTR(max_budget_async_rq),
+	BFQ_ATTR(timeout_sync),
+	BFQ_ATTR(timeout_async),
+	BFQ_ATTR(weights),
+	__ATTR_NULL
+};
+
+static struct elevator_type iosched_bfq = {
+	.ops = {
+		.elevator_merge_fn =		bfq_merge,
+		.elevator_merged_fn =		bfq_merged_request,
+		.elevator_merge_req_fn =	bfq_merged_requests,
+		.elevator_allow_merge_fn =	bfq_allow_merge,
+		.elevator_dispatch_fn =		bfq_dispatch_requests,
+		.elevator_add_req_fn =		bfq_insert_request,
+		.elevator_activate_req_fn =	bfq_activate_request,
+		.elevator_deactivate_req_fn =	bfq_deactivate_request,
+		.elevator_completed_req_fn =	bfq_completed_request,
+		.elevator_former_req_fn =	elv_rb_former_request,
+		.elevator_latter_req_fn =	elv_rb_latter_request,
+		.elevator_init_icq_fn =		bfq_init_icq,
+		.elevator_exit_icq_fn =		bfq_exit_icq,
+		.elevator_set_req_fn =		bfq_set_request,
+		.elevator_put_req_fn =		bfq_put_request,
+		.elevator_may_queue_fn =	bfq_may_queue,
+		.elevator_init_fn =		bfq_init_queue,
+		.elevator_exit_fn =		bfq_exit_queue,
+	},
+	.icq_size =		sizeof(struct bfq_io_cq),
+	.icq_align =		__alignof__(struct bfq_io_cq),
+	.elevator_attrs =	bfq_attrs,
+	.elevator_name =	"bfq",
+	.elevator_owner =	THIS_MODULE,
+};
+
+static int __init bfq_init(void)
+{
+	/*
+	 * Can be 0 on HZ < 1000 setups.
+	 */
+	if (bfq_slice_idle == 0)
+		bfq_slice_idle = 1;
+
+	if (bfq_timeout_async == 0)
+		bfq_timeout_async = 1;
+
+	if (bfq_slab_setup())
+		return -ENOMEM;
+
+	elv_register(&iosched_bfq);
+	pr_info("BFQ I/O-scheduler version: v0");
+
+	return 0;
+}
+
+static void __exit bfq_exit(void)
+{
+	elv_unregister(&iosched_bfq);
+	bfq_slab_kill();
+}
+
+module_init(bfq_init);
+module_exit(bfq_exit);
+
+MODULE_AUTHOR("Fabio Checconi, Paolo Valente");
+MODULE_LICENSE("GPL");
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
new file mode 100644
index 0000000..a9142f5
--- /dev/null
+++ b/block/bfq-sched.c
@@ -0,0 +1,936 @@
+/*
+ * BFQ: Hierarchical B-WF2Q+ scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ */
+
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+	for (parent = NULL; entity != NULL; entity = parent)
+
+static inline int bfq_update_next_in_service(struct bfq_sched_data *sd)
+{
+	return 0;
+}
+
+static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
+					     struct bfq_entity *entity)
+{
+}
+
+static inline void bfq_update_budget(struct bfq_entity *next_in_service)
+{
+}
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time
+ * wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(u64 a, u64 b)
+{
+	return (s64)(a - b) > 0;
+}
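The signed-difference cast makes the comparison robust to 64-bit wraparound: as long as the two timestamps are within 2^63 of each other, the result is correct even when the numerically smaller value is logically later. A stand-alone check of the trick:

#include <stdio.h>
#include <stdint.h>

/* Same trick as bfq_gt(): wrap-safe "a is later than b". */
static int ts_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

int main(void)
{
	uint64_t before_wrap = UINT64_MAX - 5;
	uint64_t after_wrap = 10;	/* logically later, numerically smaller */

	printf("naive %d, wrap-safe %d\n",
	       after_wrap > before_wrap, ts_gt(after_wrap, before_wrap));
	return 0;
}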
+
+static inline struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = NULL;
+
+	if (entity->my_sched_data == NULL)
+		bfqq = container_of(entity, struct bfq_queue, entity);
+
+	return bfqq;
+}
+
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor (weight of an entity or weight sum).
+ */
+static inline u64 bfq_delta(unsigned long service,
+					unsigned long weight)
+{
+	u64 d = (u64)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
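In other words, serving a given amount of service to an entity of weight w advances its virtual time by service/w, kept in fixed point through the 22-bit shift; the same service costs a heavier entity proportionally less virtual time. A user-space rendering of the formula:

#include <stdio.h>
#include <stdint.h>

#define SERVICE_SHIFT 22	/* same role as WFQ_SERVICE_SHIFT */

/* Map an amount of service into virtual time for a given weight. */
static uint64_t vtime_delta(unsigned long service, unsigned long weight)
{
	return ((uint64_t)service << SERVICE_SHIFT) / weight;
}

int main(void)
{
	/* equal service costs a heavier entity proportionally less vtime */
	printf("w=10:  %llu\n", (unsigned long long)vtime_delta(1000, 10));
	printf("w=100: %llu\n", (unsigned long long)vtime_delta(1000, 100));
	return 0;
}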
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct bfq_entity *entity,
+				   unsigned long service)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	entity->finish = entity->start +
+		bfq_delta(service, entity->weight);
+
+	if (bfqq != NULL) {
+		bfq_log_bfqq(bfqq->bfqd, bfqq,
+			"calc_finish: serv %lu, w %d",
+			service, entity->weight);
+		bfq_log_bfqq(bfqq->bfqd, bfqq,
+			"calc_finish: start %llu, finish %llu, delta %llu",
+			entity->start, entity->finish,
+			bfq_delta(service, entity->weight));
+	}
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the corresponding entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct bfq_entity *bfq_entity_of(struct rb_node *node)
+{
+	struct bfq_entity *entity = NULL;
+
+	if (node != NULL)
+		entity = rb_entry(node, struct bfq_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_extract(struct rb_root *root,
+			       struct bfq_entity *entity)
+{
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct bfq_service_tree *st,
+			     struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *next;
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = bfq_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = bfq_entity_of(next);
+	}
+
+	bfq_extract(&st->idle, entity);
+
+	if (bfqq != NULL)
+		list_del(&bfqq->bfqq_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct bfq_entity *entity)
+{
+	struct bfq_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct bfq_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of an entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct bfq_entity *entity,
+				  struct rb_node *node)
+{
+	struct bfq_entity *child;
+
+	if (node != NULL) {
+		child = rb_entry(node, struct bfq_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved;
+ * this function updates its min_start value.  The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+	struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (parent == NULL)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right != NULL)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left != NULL)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its
+ *                     group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * in each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct bfq_service_tree *st,
+			      struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
+	if (node->rb_left != NULL)
+		node = node->rb_left;
+	else if (node->rb_right != NULL)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+
+	if (bfqq != NULL)
+		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static inline unsigned short bfq_ioprio_to_weight(int ioprio)
+{
+	return IOPRIO_BE_NR - ioprio;
+}
+
+/**
+ * bfq_weight_to_ioprio - calc an ioprio from a weight.
+ * @weight: the weight value to convert.
+ *
+ * To preserve as much as possible the old ioprio-only user interface,
+ * 0 is used as an escape ioprio value for weights (numerically) equal to
+ * or larger than IOPRIO_BE_NR.
+ */
+static inline unsigned short bfq_weight_to_ioprio(int weight)
+{
+	return IOPRIO_BE_NR - weight < 0 ? 0 : IOPRIO_BE_NR - weight;
+}
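Taken together, the two helpers map ioprio 0 (the highest priority) to the largest weight and ioprio IOPRIO_BE_NR - 1 to weight 1, with 0 returned for any weight too large to express as an ioprio. A tiny round-trip check, with IOPRIO_BE_NR hard-coded to 8 as an assumption:

#include <stdio.h>

#define BE_NR 8		/* assumption: stands in for IOPRIO_BE_NR */

static unsigned short ioprio_to_weight(int ioprio)
{
	return BE_NR - ioprio;			/* ioprio 0..7 -> weight 8..1 */
}

static unsigned short weight_to_ioprio(int weight)
{
	return BE_NR - weight < 0 ? 0 : BE_NR - weight;	/* 0 escapes big weights */
}

int main(void)
{
	for (int p = 0; p < BE_NR; p++)
		printf("ioprio %d <-> weight %d (back to %d)\n",
		       p, ioprio_to_weight(p),
		       weight_to_ioprio(ioprio_to_weight(p)));
	return 0;
}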
+
+static inline void bfq_get_entity(struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	if (bfqq != NULL) {
+		atomic_inc(&bfqq->ref);
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
+			     bfqq, atomic_read(&bfqq->ref));
+	}
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (node->rb_right == NULL && node->rb_left == NULL)
+		deepest = rb_parent(node);
+	else if (node->rb_right == NULL)
+		deepest = node->rb_left;
+	else if (node->rb_left == NULL)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right != NULL)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct bfq_service_tree *st,
+			       struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_extract(&st->active, entity);
+
+	if (node != NULL)
+		bfq_update_active_tree(node);
+
+	if (bfqq != NULL)
+		list_del(&bfqq->bfqq_list);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct bfq_service_tree *st,
+			    struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct bfq_entity *first_idle = st->first_idle;
+	struct bfq_entity *last_idle = st->last_idle;
+
+	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+
+	if (bfqq != NULL)
+		list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct bfq_service_tree *st,
+			      struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct bfq_sched_data *sd;
+
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	if (bfqq != NULL) {
+		sd = entity->sched_data;
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "forget_entity: %p %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		bfq_put_queue(bfqq);
+	}
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+static void bfq_put_idle_entity(struct bfq_service_tree *st,
+				struct bfq_entity *entity)
+{
+	bfq_idle_extract(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+static void bfq_forget_idle(struct bfq_service_tree *st)
+{
+	struct bfq_entity *first_idle = st->first_idle;
+	struct bfq_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * Forget the whole idle tree, increasing the vtime past
+		 * the last finish time of idle entities.
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+static struct bfq_service_tree *
+__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
+			 struct bfq_entity *entity)
+{
+	struct bfq_service_tree *new_st = old_st;
+
+	if (entity->ioprio_changed) {
+		old_st->wsum -= entity->weight;
+
+		if (entity->new_weight != entity->orig_weight) {
+			entity->orig_weight = entity->new_weight;
+			entity->ioprio =
+				bfq_weight_to_ioprio(entity->orig_weight);
+		} else if (entity->new_ioprio != entity->ioprio) {
+			entity->ioprio = entity->new_ioprio;
+			entity->orig_weight =
+					bfq_ioprio_to_weight(entity->ioprio);
+		} else
+			entity->new_weight = entity->orig_weight =
+				bfq_ioprio_to_weight(entity->ioprio);
+
+		entity->ioprio_class = entity->new_ioprio_class;
+		entity->ioprio_changed = 0;
+
+		/*
+		 * NOTE: here we may be changing the weight too early;
+		 * this will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = bfq_entity_service_tree(entity);
+		entity->weight = entity->orig_weight;
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * bfq_bfqq_served - update the scheduler status after selection for
+ *                   service.
+ * @bfqq: the queue being served.
+ * @served: bytes to transfer.
+ *
+ * NOTE: this can be optimized, as the timestamps of upper level entities
+ * are synchronized every time a new bfqq is selected for service.  For
+ * now, we keep it this way to better check consistency.
+ */
+static void bfq_bfqq_served(struct bfq_queue *bfqq, unsigned long served)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	struct bfq_service_tree *st;
+
+	for_each_entity(entity) {
+		st = bfq_entity_service_tree(entity);
+
+		entity->service += served;
+
+		st->vtime += bfq_delta(served, st->wsum);
+		bfq_forget_idle(st);
+	}
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %lu sects", served);
+}
+
+/**
+ * bfq_bfqq_charge_full_budget - set the service to the entity budget.
+ * @bfqq: the queue that needs a service update.
+ *
+ * When it's not possible to be fair in the service domain, because
+ * a queue is not consuming its budget fast enough (the meaning of
+ * fast depends on the timeout parameter), we charge it a full
+ * budget.  In this way we should obtain a sort of time-domain
+ * fairness among all the seeky/slow queues.
+ */
+static inline void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget");
+
+	bfq_bfqq_served(bfqq, entity->budget - entity->service);
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the current budget of the entity (and the
+ * service received, if @entity is in service) to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sd = entity->sched_data;
+	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+
+	if (entity == sd->in_service_entity) {
+		/*
+		 * If we are requeueing the current entity we have
+		 * to take care of not charging to it service it has
+		 * not received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->in_service_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_in_service entity below it.  We reuse the
+		 * old start time.
+		 */
+		bfq_active_extract(st, entity);
+	} else if (entity->tree == &st->idle) {
+		/*
+		 * Must be on the idle tree, bfq_idle_extract() will
+		 * check for that.
+		 */
+		bfq_idle_extract(st, entity);
+		entity->start = bfq_gt(st->vtime, entity->finish) ?
+				       st->vtime : entity->finish;
+	} else {
+		/*
+		 * The finish time of the entity may be invalid, and
+		 * it is in the past for sure, otherwise the queue
+		 * would have been on the idle tree.
+		 */
+		entity->start = st->vtime;
+		st->wsum += entity->weight;
+		bfq_get_entity(entity);
+
+		entity->on_st = 1;
+	}
+
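+	/*
+	 * Update weight/priority if a change was requested, compute the
+	 * new finish timestamp (F_i = S_i + budget/weight) and (re)insert
+	 * the entity into the active tree.
+	 */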
+	st = __bfq_entity_update_weight_prio(st, entity);
+	bfq_calc_finish(entity, entity->budget);
+	bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
+ * @entity: the entity to activate.
+ *
+ * Activate @entity and all the entities on the path from it to the root.
+ */
+static void bfq_activate_entity(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sd;
+
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_in_service(sd))
+			/*
+			 * No need to propagate the activation to the
+			 * upper entities, as they will be updated when
+			 * the in-service entity is rescheduled.
+			 */
+			break;
+	}
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently of its previous state.  If the
+ * entity was not on a service tree, just return; otherwise, if it is on
+ * any scheduler tree, extract it from that tree and, if necessary
+ * and if the caller specified @requeue, put it on the idle tree.
+ *
+ * Return %1 if the caller should update the entity hierarchy, i.e.,
+ * if the entity was in service or if it was the next_in_service for
+ * its sched_data; return %0 otherwise.
+ */
+static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+{
+	struct bfq_sched_data *sd = entity->sched_data;
+	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+	int was_in_service = entity == sd->in_service_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	if (was_in_service) {
+		bfq_calc_finish(entity, entity->service);
+		sd->in_service_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_extract(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_extract(st, entity);
+
+	if (was_in_service || sd->next_in_service == entity)
+		ret = bfq_update_next_in_service(sd);
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+{
+	struct bfq_sched_data *sd;
+	struct bfq_entity *parent;
+
+	for_each_entity_safe(entity, parent) {
+		sd = entity->sched_data;
+
+		if (!__bfq_deactivate_entity(entity, requeue))
+			/*
+			 * The parent entity is still backlogged, and
+			 * we don't need to update it as it is still
+			 * in service.
+			 */
+			break;
+
+		if (sd->next_in_service != NULL)
+			/*
+			 * The parent entity is still backlogged and
+			 * the budgets on the path towards the root
+			 * need to be updated.
+			 */
+			goto update;
+
+		/*
+		 * If we get here, the parent is no longer backlogged and
+		 * we want to propagate the dequeue upwards.
+		 */
+		requeue = 1;
+	}
+
+	return;
+
+update:
+	entity = parent;
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_in_service(sd))
+			break;
+	}
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated processes getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
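+	/*
+	 * Thanks to the augmented tree, min_start of the root node is
+	 * the minimum start time over all the active entities: jumping
+	 * the vtime to it guarantees that at least one entity becomes
+	 * eligible.
+	 */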
+	entry = rb_entry(node, struct bfq_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active_entity - find the eligible entity with
+ *                           the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches the first schedulable entity, starting from the
+ * root of the tree and going left every time the left subtree contains
+ * at least one eligible (start <= vtime) entity.  The path on the right
+ * is followed only if a) the left subtree contains no eligible entities
+ * and b) no eligible entity has been found yet.
+ */
+static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
+	while (node != NULL) {
+		entry = rb_entry(node, struct bfq_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		if (node->rb_left != NULL) {
+			entry = rb_entry(node->rb_left,
+					 struct bfq_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first != NULL)
+			break;
+		node = node->rb_right;
+	}
+
+	return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ * @force: if true and the chosen entity is not the cached next_in_service,
+ *         propagate a budget update up the hierarchy (used when forcibly
+ *         serving the IDLE class).
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
+						   bool force)
+{
+	struct bfq_entity *entity, *new_next_in_service = NULL;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+
+	/*
+	 * If the chosen entity does not match the sched_data's
+	 * next_in_service and we are forcibly serving the IDLE priority
+	 * class tree, bubble the budget update up the hierarchy.
+	 */
+	if (unlikely(force && entity != entity->sched_data->next_in_service)) {
+		new_next_in_service = entity;
+		for_each_entity(new_next_in_service)
+			bfq_update_budget(new_next_in_service);
+	}
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true, the returned entity will also be extracted from @sd.
+ * @bfqd: the device data; if not NULL, it is used to decide whether the
+ *        IDLE class must be anticipated to avoid starvation.
+ *
+ * NOTE: since we cache the next_in_service entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort by just returning the cached next_in_service value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
+						 int extract,
+						 struct bfq_data *bfqd)
+{
+	struct bfq_service_tree *st = sd->service_tree;
+	struct bfq_entity *entity;
+	int i = 0;
+
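+	/*
+	 * If the IDLE class has not been served for more than
+	 * BFQ_CL_IDLE_TIMEOUT, anticipate it (regardless of the cached
+	 * next_in_service), so that it cannot starve completely;
+	 * otherwise the classes are scanned in strict priority order
+	 * (RT, BE, IDLE).
+	 */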
+	if (bfqd != NULL &&
+	    jiffies - bfqd->bfq_class_idle_last_service > BFQ_CL_IDLE_TIMEOUT) {
+		entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
+						  true);
+		if (entity != NULL) {
+			i = BFQ_IOPRIO_CLASSES - 1;
+			bfqd->bfq_class_idle_last_service = jiffies;
+			sd->next_in_service = entity;
+		}
+	}
+	for (; i < BFQ_IOPRIO_CLASSES; i++) {
+		entity = __bfq_lookup_next_entity(st + i, false);
+		if (entity != NULL) {
+			if (extract) {
+				bfq_check_next_in_service(sd, entity);
+				bfq_active_extract(st + i, entity);
+				sd->in_service_entity = entity;
+				sd->next_in_service = NULL;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+/*
+ * Get next queue for service.
+ */
+static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+{
+	struct bfq_entity *entity = NULL;
+	struct bfq_sched_data *sd;
+	struct bfq_queue *bfqq;
+
+	if (bfqd->busy_queues == 0)
+		return NULL;
+
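+	/*
+	 * Walk down the scheduling hierarchy: at each level pick (and
+	 * extract) the next entity to serve and reset its service
+	 * counter, until a leaf entity, i.e., a bfq_queue, is reached.
+	 */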
+	sd = &bfqd->sched_data;
+	for (; sd != NULL; sd = entity->my_sched_data) {
+		entity = bfq_lookup_next_entity(sd, 1, bfqd);
+		entity->service = 0;
+	}
+
+	bfqq = bfq_entity_to_bfqq(entity);
+
+	return bfqq;
+}
+
+static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+{
+	if (bfqd->in_service_bic != NULL) {
+		put_io_context(bfqd->in_service_bic->icq.ioc);
+		bfqd->in_service_bic = NULL;
+	}
+
+	bfqd->in_service_queue = NULL;
+	del_timer(&bfqd->idle_slice_timer);
+}
+
+static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				int requeue)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	if (bfqq == bfqd->in_service_queue)
+		__bfq_bfqd_reset_in_service(bfqd);
+
+	bfq_deactivate_entity(entity, requeue);
+}
+
+static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_activate_entity(entity);
+}
+
+/*
+ * Called when the bfqq no longer has requests pending; remove it from
+ * the service tree.
+ */
+static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			      int requeue)
+{
+	bfq_log_bfqq(bfqd, bfqq, "del from busy");
+
+	bfq_clear_bfqq_busy(bfqq);
+
+	bfqd->busy_queues--;
+
+	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
+}
+
+/*
+ * Called when an inactive queue receives a new request.
+ */
+static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqd, bfqq, "add to busy");
+
+	bfq_activate_bfqq(bfqd, bfqq);
+
+	bfq_mark_bfqq_busy(bfqq);
+	bfqd->busy_queues++;
+}
diff --git a/block/bfq.h b/block/bfq.h
new file mode 100644
index 0000000..bd146b6
--- /dev/null
+++ b/block/bfq.h
@@ -0,0 +1,467 @@
+/*
+ * BFQ-v0 for 3.15.0: data structures and common function prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#ifndef _BFQ_H
+#define _BFQ_H
+
+#include <linux/blktrace_api.h>
+#include <linux/hrtimer.h>
+#include <linux/ioprio.h>
+#include <linux/rbtree.h>
+
+#define BFQ_IOPRIO_CLASSES	3
+#define BFQ_CL_IDLE_TIMEOUT	(HZ/5)
+
+#define BFQ_MIN_WEIGHT	1
+#define BFQ_MAX_WEIGHT	1000
+
+#define BFQ_DEFAULT_GRP_WEIGHT	10
+#define BFQ_DEFAULT_GRP_IOPRIO	0
+#define BFQ_DEFAULT_GRP_CLASS	IOPRIO_CLASS_BE
+
+struct bfq_entity;
+
+/**
+ * struct bfq_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * bfq_service_tree.  All the fields are protected by the queue lock
+ * of the containing bfqd.
+ */
+struct bfq_service_tree {
+	struct rb_root active;
+	struct rb_root idle;
+
+	struct bfq_entity *first_idle;
+	struct bfq_entity *last_idle;
+
+	u64 vtime;
+	unsigned long wsum;
+};
+
+/**
+ * struct bfq_sched_data - multi-class scheduler.
+ * @in_service_entity: entity in service.
+ * @next_in_service: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * bfq_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_in_service points to the active entity of the sched_data
+ * service trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order: IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher-priority classes are served before all the
+ * requests from lower-priority classes; among queues of the same
+ * class, requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct bfq_sched_data {
+	struct bfq_entity *in_service_entity;
+	struct bfq_entity *next_in_service;
+	struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
+};
+
+/**
+ * struct bfq_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ *         the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ *             this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue.
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the hierarchy, the
+ *                 associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_weight: when a weight change is requested, the new weight value.
+ * @orig_weight: original weight, used to implement weight boosting.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value.
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ *                    ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested a weight, ioprio or
+ *                  ioprio_class change.
+ *
+ * A bfq_entity is used to represent a bfq_queue (leaf node in the upper
+ * level scheduler). Each entity belongs to the sched_data of the parent
+ * group hierarchy. Non-leaf entities also have their own sched_data,
+ * stored in @my_sched_data.
+ *
+ * Each entity independently stores its priority values; this would
+ * allow different weights on different devices, but this
+ * functionality is not exported to userspace for now.  Priorities and
+ * weights are updated lazily, first storing the new values into the
+ * new_* fields, then setting the @ioprio_changed flag.  As soon as
+ * there is a transition in the entity state that allows the priority
+ * update to take place the effective and the requested priority
+ * values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.  When dealing with ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time consuming their budget
+ * and have true sequential behavior, and when there are no external
+ * factors breaking anticipation) the relative weights at each level
+ * of the hierarchy should be guaranteed.  All the fields are
+ * protected by the queue lock of the containing bfqd.
+ */
+struct bfq_entity {
+	struct rb_node rb_node;
+
+	int on_st;
+
+	u64 finish;
+	u64 start;
+
+	struct rb_root *tree;
+
+	u64 min_start;
+
+	unsigned long service, budget;
+	unsigned short weight, new_weight;
+	unsigned short orig_weight;
+
+	struct bfq_entity *parent;
+
+	struct bfq_sched_data *my_sched_data;
+	struct bfq_sched_data *sched_data;
+
+	unsigned short ioprio, new_ioprio;
+	unsigned short ioprio_class, new_ioprio_class;
+
+	int ioprio_changed;
+};
+
+/**
+ * struct bfq_queue - leaf schedulable entity.
+ * @ref: reference counter.
+ * @bfqd: parent bfq_data.
+ * @sort_list: sorted list of pending requests.
+ * @next_rq: if fifo isn't expired, next request to serve.
+ * @queued: nr of requests queued in @sort_list.
+ * @allocated: currently allocated requests.
+ * @meta_pending: pending metadata requests.
+ * @fifo: fifo list of requests in sort_list.
+ * @entity: entity representing this queue in the scheduler.
+ * @max_budget: maximum budget allowed from the feedback mechanism.
+ * @budget_timeout: budget expiration (in jiffies).
+ * @dispatched: number of requests on the dispatch list or inside driver.
+ * @flags: status flags.
+ * @bfqq_list: node for active/idle bfqq list inside our bfqd.
+ * @seek_samples: number of seeks sampled
+ * @seek_total: sum of the distances of the seeks sampled
+ * @seek_mean: mean seek distance
+ * @last_request_pos: position of the last request enqueued
+ * @pid: pid of the process owning the queue, used for logging purposes.
+ *
+ * A bfq_queue is a leaf request queue; it can be associated with one
+ * io_context, or with more than one if it is async.
+ */
+struct bfq_queue {
+	atomic_t ref;
+	struct bfq_data *bfqd;
+
+	struct rb_root sort_list;
+	struct request *next_rq;
+	int queued[2];
+	int allocated[2];
+	int meta_pending;
+	struct list_head fifo;
+
+	struct bfq_entity entity;
+
+	unsigned long max_budget;
+	unsigned long budget_timeout;
+
+	int dispatched;
+
+	unsigned int flags;
+
+	struct list_head bfqq_list;
+
+	unsigned int seek_samples;
+	u64 seek_total;
+	sector_t seek_mean;
+	sector_t last_request_pos;
+
+	pid_t pid;
+};
+
+/**
+ * struct bfq_ttime - per process thinktime stats.
+ * @last_end_request: completion time (in jiffies) of the last request.
+ * @ttime_total: total process thinktime.
+ * @ttime_samples: number of thinktime samples.
+ * @ttime_mean: average process thinktime.
+ */
+struct bfq_ttime {
+	unsigned long last_end_request;
+
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+};
+
+/**
+ * struct bfq_io_cq - per (request_queue, io_context) structure.
+ * @icq: associated io_cq structure
+ * @bfqq: array of two process queues, the sync and the async
+ * @ttime: associated @bfq_ttime struct
+ */
+struct bfq_io_cq {
+	struct io_cq icq; /* must be the first member */
+	struct bfq_queue *bfqq[2];
+	struct bfq_ttime ttime;
+	int ioprio;
+};
+
+enum bfq_device_speed {
+	BFQ_BFQD_FAST,
+	BFQ_BFQD_SLOW,
+};
+
+/**
+ * struct bfq_data - per device data structure.
+ * @queue: request queue for the managed device.
+ * @sched_data: root @bfq_sched_data for the device.
+ * @busy_queues: number of bfq_queues containing requests (including the
+ *		 queue in service, even if it is idling).
+ * @queued: number of queued requests.
+ * @rq_in_driver: number of requests dispatched and waiting for completion.
+ * @sync_flight: number of sync requests in the driver.
+ * @max_rq_in_driver: max number of reqs in driver in the last
+ *                    @hw_tag_samples completed requests.
+ * @hw_tag_samples: nr of samples used to calculate hw_tag.
+ * @hw_tag: flag set to one if the driver is showing a queueing behavior.
+ * @budgets_assigned: number of budgets assigned.
+ * @idle_slice_timer: timer set when idling for the next sequential request
+ *                    from the queue in service.
+ * @unplug_work: delayed work to restart dispatching on the request queue.
+ * @in_service_queue: bfq_queue in service.
+ * @in_service_bic: bfq_io_cq (bic) associated with the @in_service_queue.
+ * @last_position: on-disk position of the last served request.
+ * @last_budget_start: beginning of the last budget.
+ * @last_idling_start: beginning of the last idle slice.
+ * @peak_rate: peak transfer rate observed for a budget.
+ * @peak_rate_samples: number of samples used to calculate @peak_rate.
+ * @bfq_max_budget: maximum budget allotted to a bfq_queue before
+ *                  rescheduling.
+ * @active_list: list of all the bfq_queues active on the device.
+ * @idle_list: list of all the bfq_queues idle on the device.
+ * @bfq_quantum: max number of requests dispatched per dispatch round.
+ * @bfq_fifo_expire: timeout for async/sync requests; when it expires
+ *                   requests are served in fifo order.
+ * @bfq_back_penalty: weight of backward seeks wrt forward ones.
+ * @bfq_back_max: maximum allowed backward seek.
+ * @bfq_slice_idle: maximum idling time.
+ * @bfq_user_max_budget: user-configured max budget value
+ *                       (0 for auto-tuning).
+ * @bfq_max_budget_async_rq: maximum budget (in nr of requests) allotted to
+ *                           async queues.
+ * @bfq_timeout: timeout for bfq_queues to consume their budget; used
+ *               to prevent seeky queues from imposing long latencies on
+ *               well-behaved ones (this also implies that seeky queues
+ *               cannot receive guarantees in the service domain; after a
+ *               timeout they are charged for the whole allocated budget,
+ *               to try to preserve reasonably fair behavior among them,
+ *               but without service-domain guarantees).
+ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions.
+ *
+ * All the fields are protected by the @queue lock.
+ */
+struct bfq_data {
+	struct request_queue *queue;
+
+	struct bfq_sched_data sched_data;
+
+	int busy_queues;
+	int queued;
+	int rq_in_driver;
+	int sync_flight;
+
+	int max_rq_in_driver;
+	int hw_tag_samples;
+	int hw_tag;
+
+	int budgets_assigned;
+
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
+	struct bfq_queue *in_service_queue;
+	struct bfq_io_cq *in_service_bic;
+
+	sector_t last_position;
+
+	ktime_t last_budget_start;
+	ktime_t last_idling_start;
+	int peak_rate_samples;
+	u64 peak_rate;
+	unsigned long bfq_max_budget;
+
+	struct list_head active_list;
+	struct list_head idle_list;
+
+	unsigned int bfq_quantum;
+	unsigned int bfq_fifo_expire[2];
+	unsigned int bfq_back_penalty;
+	unsigned int bfq_back_max;
+	unsigned int bfq_slice_idle;
+	u64 bfq_class_idle_last_service;
+
+	unsigned int bfq_user_max_budget;
+	unsigned int bfq_max_budget_async_rq;
+	unsigned int bfq_timeout[2];
+
+	struct bfq_queue oom_bfqq;
+};
+
+enum bfqq_state_flags {
+	BFQ_BFQQ_FLAG_busy = 0,		/* has requests or is in service */
+	BFQ_BFQQ_FLAG_wait_request,	/* waiting for a request */
+	BFQ_BFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
+	BFQ_BFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
+	BFQ_BFQQ_FLAG_idle_window,	/* slice idling enabled */
+	BFQ_BFQQ_FLAG_prio_changed,	/* task priority has changed */
+	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */
+	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this budget */
+};
+
+#define BFQ_BFQQ_FNS(name)						\
+static inline void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)		\
+{									\
+	(bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name);			\
+}									\
+static inline void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)	\
+{									\
+	(bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name);			\
+}									\
+static inline int bfq_bfqq_##name(const struct bfq_queue *bfqq)		\
+{									\
+	return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0;	\
+}
+
+BFQ_BFQQ_FNS(busy);
+BFQ_BFQQ_FNS(wait_request);
+BFQ_BFQQ_FNS(must_alloc);
+BFQ_BFQQ_FNS(fifo_expire);
+BFQ_BFQQ_FNS(idle_window);
+BFQ_BFQQ_FNS(prio_changed);
+BFQ_BFQQ_FNS(sync);
+BFQ_BFQQ_FNS(budget_new);
+#undef BFQ_BFQQ_FNS
+
+/* Logging facilities. */
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
+	blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
+
+#define bfq_log(bfqd, fmt, args...) \
+	blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
+
+/* Expiration reasons. */
+enum bfqq_expiration {
+	BFQ_BFQQ_TOO_IDLE = 0,		/*
+					 * queue has been idling for
+					 * too long
+					 */
+	BFQ_BFQQ_BUDGET_TIMEOUT,	/* budget took too long to be used */
+	BFQ_BFQQ_BUDGET_EXHAUSTED,	/* budget consumed */
+	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
+};
+
+static inline struct bfq_service_tree *
+bfq_entity_service_tree(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sched_data = entity->sched_data;
+	unsigned int idx = entity->ioprio_class - 1;
+
+	return sched_data->service_tree + idx;
+}
+
+static inline struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic,
+					    int is_sync)
+{
+	return bic->bfqq[!!is_sync];
+}
+
+static inline void bic_set_bfqq(struct bfq_io_cq *bic,
+				struct bfq_queue *bfqq, int is_sync)
+{
+	bic->bfqq[!!is_sync] = bfqq;
+}
+
+static inline struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
+{
+	return bic->icq.q->elevator->elevator_data;
+}
+
+/**
+ * bfq_get_bfqd_locked - get a lock on a bfqd using an RCU-protected pointer.
+ * @ptr: a pointer to a bfqd.
+ * @flags: storage for the flags to be saved.
+ *
+ * This function allows bfqg->bfqd to be protected by the
+ * queue lock of the bfqd they reference; the pointer is dereferenced
+ * under RCU, so the storage for bfqd is assured to be safe as long
+ * as the RCU read side critical section does not end.  After the
+ * bfqd->queue->queue_lock is taken the pointer is rechecked, to be
+ * sure that no other writer accessed it.  If we raced with a writer,
+ * the function returns NULL, with the queue unlocked, otherwise it
+ * returns the dereferenced pointer, with the queue locked.
+ */
+static inline struct bfq_data *bfq_get_bfqd_locked(void **ptr,
+						   unsigned long *flags)
+{
+	struct bfq_data *bfqd;
+
+	rcu_read_lock();
+	bfqd = rcu_dereference(*(struct bfq_data **)ptr);
+
+	if (bfqd != NULL) {
+		spin_lock_irqsave(bfqd->queue->queue_lock, *flags);
+		if (*ptr == bfqd)
+			goto out;
+		spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
+	}
+
+	bfqd = NULL;
+out:
+	rcu_read_unlock();
+	return bfqd;
+}
+
+static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd,
+				       unsigned long *flags)
+{
+	spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
+}
+
+static void bfq_changed_ioprio(struct bfq_io_cq *bic);
+static void bfq_put_queue(struct bfq_queue *bfqq);
+static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, int is_sync,
+				       struct bfq_io_cq *bic, gfp_t gfp_mask);
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
+
+#endif /* _BFQ_H */
-- 
1.9.2


* [PATCH RFC RESEND 02/14] block: introduce the BFQ-v0 I/O scheduler
@ 2014-05-27 12:42     ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Fabio Checconi <fchecconi@gmail.com>

BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.

- Each process doing I/O on a device is associated with a weight and a
  (bfq_)queue.

- BFQ grants exclusive access to the device, for a while, to one queue
  (process) at a time, and implements this service model by
  associating every queue with a budget, measured in number of
  sectors.

  - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

  - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
      holding the device for too long and dramatically reducing
      throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
      sync requests may not be expired immediately when it empties. BFQ
      may instead idle the device for a short time interval,
      giving the process the chance to go on being served if it issues
      a new request in time. Device idling typically boosts the
      throughput on rotational devices, if processes do synchronous
      and sequential I/O. Besides, under BFQ, device idling is also
      instrumental in guaranteeing the desired throughput fraction to
      processes issuing sync requests (see [1] for details).

  - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity.  See [1] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling. However, for a cleaner
    logical breakdown, the code that enables and completes
    hierarchical support is provided in patch 4, which focuses exactly
    on this feature.

  - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

  - The last, budget-independence, property (although probably
      counterintuitive at first) is definitely beneficial, for
    the following reasons.

    - First, with any proportional-share scheduler, the maximum
      deviation with respect to an ideal service is proportional to
      the maximum budget (slice) assigned to queues. As a consequence,
      BFQ can keep this deviation tight not only because of the
      accurate service of B-WF2Q+, but also because BFQ *does not*
      need to assign a larger budget to a queue to let the queue
      receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
      budget that best fits the needs of the process, or best
      leverages the I/O pattern of the process. In particular, BFQ
      updates queue budgets with a simple feedback-loop algorithm that
      allows a high throughput to be achieved, while still providing
      tight latency guarantees to time-sensitive applications. When
      the in-service queue expires, this algorithm computes the next
      budget of the queue so as to:

      - Let large budgets be eventually assigned to the queues
        associated with I/O-bound applications performing sequential
        I/O: in fact, the longer these applications are served once they
        get access to the device, the higher the throughput is.

      - Let small budgets be eventually assigned to the queues
        associated with time-sensitive applications (which typically
        perform sporadic and short I/O), because, the smaller the
        budget assigned to a queue waiting for service is, the sooner
        B-WF2Q+ will serve that queue (Subsec 3.3 in [1]).

- Weights can be assigned to processes only indirectly, through I/O
  priorities, and according to the relation: weight = IOPRIO_BE_NR -
  ioprio (a minimal sketch of this mapping is shown right after this
  list). The next two patches instead provide a cgroups interface
  through which weights can be assigned explicitly.

- ioprio classes are served in strict priority order, i.e.,
  lower-priority queues are not served as long as there are
  higher-priority queues.  Among queues in the same class, the
  bandwidth is distributed in proportion to the weight of each
  queue. A very thin extra bandwidth is however guaranteed to the Idle
  class, to prevent it from starving.
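
To make the weight relation above concrete, here is a minimal,
stand-alone user-space sketch of the ioprio-to-weight mapping
(illustration only, not part of the patches; the kernel-side helpers
in the series are bfq_ioprio_to_weight() and bfq_weight_to_ioprio(),
and IOPRIO_BE_NR is assumed to be 8, as in the kernel's ioprio.h):

  #include <stdio.h>

  #define IOPRIO_BE_NR 8	/* number of best-effort ioprio levels */

  /* weight = IOPRIO_BE_NR - ioprio: ioprio 0 (highest priority) maps
   * to weight 8, ioprio 7 (lowest priority) maps to weight 1. */
  static int ioprio_to_weight(int ioprio)
  {
  	return IOPRIO_BE_NR - ioprio;
  }

  int main(void)
  {
  	int ioprio;

  	for (ioprio = 0; ioprio < IOPRIO_BE_NR; ioprio++)
  		printf("ioprio %d -> weight %d\n",
  		       ioprio, ioprio_to_weight(ioprio));
  	return 0;
  }

With this mapping, two backlogged queues at ioprio 4 and ioprio 6 get
weights 4 and 2, i.e., a 2:1 share of the device throughput.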

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-ioc.c     |   34 +
 block/bfq-iosched.c | 2297 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block/bfq-sched.c   |  936 +++++++++++++++++++++
 block/bfq.h         |  467 +++++++++++
 4 files changed, 3734 insertions(+)
 create mode 100644 block/bfq-ioc.c
 create mode 100644 block/bfq-iosched.c
 create mode 100644 block/bfq-sched.c
 create mode 100644 block/bfq.h

diff --git a/block/bfq-ioc.c b/block/bfq-ioc.c
new file mode 100644
index 0000000..adfb5a1
--- /dev/null
+++ b/block/bfq-ioc.c
@@ -0,0 +1,34 @@
+/*
+ * BFQ: I/O context handling.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
+/**
+ * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
+ * @icq: the iocontext queue.
+ */
+static inline struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
+{
+	/* bic->icq is the first member, %NULL will convert to %NULL */
+	return container_of(icq, struct bfq_io_cq, icq);
+}
+
+/**
+ * bfq_bic_lookup - search @ioc for a bic associated with @bfqd.
+ * @bfqd: the lookup key.
+ * @ioc: the io_context of the process doing I/O.
+ *
+ * Queue lock must be held.
+ */
+static inline struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
+					       struct io_context *ioc)
+{
+	if (ioc)
+		return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue));
+	return NULL;
+}
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
new file mode 100644
index 0000000..01a98be
--- /dev/null
+++ b/block/bfq-iosched.c
@@ -0,0 +1,2297 @@
+/*
+ * Budget Fair Queueing (BFQ) disk scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
+ * file.
+ *
+ * BFQ is a proportional-share storage-I/O scheduling algorithm based on
+ * the slice-by-slice service scheme of CFQ. But BFQ assigns budgets,
+ * measured in number of sectors, to processes instead of time slices. The
+ * device is not granted to the in-service process for a given time slice,
+ * but until it has exhausted its assigned budget. This change from the time
+ * to the service domain allows BFQ to distribute the device throughput
+ * among processes as desired, without any distortion due to ZBR, workload
+ * fluctuations or other factors. BFQ uses an ad hoc internal scheduler,
+ * called B-WF2Q+, to schedule processes according to their budgets. More
+ * precisely, BFQ schedules queues associated with processes. Thanks to the
+ * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
+ * I/O-bound processes issuing sequential requests (to boost the
+ * throughput), and yet guarantee a relatively low latency to interactive
+ * applications.
+ *
+ * BFQ is described in [1], where also a reference to the initial, more
+ * theoretical paper on BFQ can be found. The interested reader can find
+ * in the latter paper full details on the main algorithm, as well as
+ * formulas of the guarantees and formal proofs of all the properties.
+ * With respect to the version of BFQ presented in these papers, this
+ * implementation adds a hierarchical extension based on H-WF2Q+.
+ *
+ * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
+ * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
+ * complexity derives from the one introduced with EEVDF in [3].
+ *
+ * [1] P. Valente and M. Andreolini, ``Improving Application Responsiveness
+ *     with the BFQ Disk I/O Scheduler'',
+ *     Proceedings of the 5th Annual International Systems and Storage
+ *     Conference (SYSTOR '12), June 2012.
+ *
+ * http://algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
+ *
+ * [2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
+ *     Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
+ *     Oct 1997.
+ *
+ * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
+ *
+ * [3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
+ *     First: A Flexible and Accurate Mechanism for Proportional Share
+ *     Resource Allocation,'' technical report.
+ *
+ * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
+ */
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/cgroup.h>
+#include <linux/elevator.h>
+#include <linux/jiffies.h>
+#include <linux/rbtree.h>
+#include <linux/ioprio.h>
+#include "bfq.h"
+#include "blk.h"
+
+/*
+ * Array of async queues for all the processes, one queue
+ * per ioprio value per ioprio_class.
+ */
+struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+/* Async queue for the idle class (ioprio is ignored) */
+struct bfq_queue *async_idle_bfqq;
+
+/* Max number of dispatches in one round of service. */
+static const int bfq_quantum = 4;
+
+/* Expiration time of sync (0) and async (1) requests, in jiffies. */
+static const int bfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
+
+/* Maximum backwards seek, in KiB. */
+static const int bfq_back_max = 16 * 1024;
+
+/* Penalty of a backwards seek, in number of sectors. */
+static const int bfq_back_penalty = 2;
+
+/* Idling period duration, in jiffies. */
+static int bfq_slice_idle = HZ / 125;
+
+/* Default maximum budget values, in sectors and number of requests. */
+static const int bfq_default_max_budget = 16 * 1024;
+static const int bfq_max_budget_async_rq = 4;
+
+/* Default timeout values, in jiffies, approximating CFQ defaults. */
+static const int bfq_timeout_sync = HZ / 8;
+static int bfq_timeout_async = HZ / 25;
+
+struct kmem_cache *bfq_pool;
+
+/* Below this threshold (in ms), we consider thinktime immediate. */
+#define BFQ_MIN_TT		2
+
+/* hw_tag detection: parallel requests threshold and min samples needed. */
+#define BFQ_HW_QUEUE_THRESHOLD	4
+#define BFQ_HW_QUEUE_SAMPLES	32
+
+#define BFQQ_SEEK_THR	 (sector_t)(8 * 1024)
+#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)
+
+/* Budget feedback step. */
+#define BFQ_BUDGET_STEP         128
+
+/* Min samples used for peak rate estimation (for autotuning). */
+#define BFQ_PEAK_RATE_SAMPLES	32
+
+/* Shift used for peak rate fixed precision calculations. */
+#define BFQ_RATE_SHIFT		16
+
+#define BFQ_SERVICE_TREE_INIT	((struct bfq_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+#define RQ_BIC(rq)		((struct bfq_io_cq *) (rq)->elv.priv[0])
+#define RQ_BFQQ(rq)		((rq)->elv.priv[1])
+
+static inline void bfq_schedule_dispatch(struct bfq_data *bfqd);
+
+#include "bfq-ioc.c"
+#include "bfq-sched.c"
+
+#define bfq_class_idle(bfqq)	((bfqq)->entity.ioprio_class ==\
+				 IOPRIO_CLASS_IDLE)
+#define bfq_class_rt(bfqq)	((bfqq)->entity.ioprio_class ==\
+				 IOPRIO_CLASS_RT)
+
+#define bfq_sample_valid(samples)	((samples) > 80)
+
+/*
+ * We regard a request as SYNC if either it is a read or it has the SYNC bit
+ * set (in which case it could also be a direct WRITE).
+ */
+static inline int bfq_bio_sync(struct bio *bio)
+{
+	if (bio_data_dir(bio) == READ || (bio->bi_rw & REQ_SYNC))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Schedule a run of the queue if there are requests pending and no one
+ * in the driver will restart queueing.
+ */
+static inline void bfq_schedule_dispatch(struct bfq_data *bfqd)
+{
+	if (bfqd->queued != 0) {
+		bfq_log(bfqd, "schedule dispatch");
+		kblockd_schedule_work(bfqd->queue, &bfqd->unplug_work);
+	}
+}
+
+/*
+ * Lifted from AS - choose which of rq1 and rq2 is best served now.
+ * We choose the request that is closest to the head right now.  Distance
+ * behind the head is penalized and only allowed to a certain extent.
+ */
+static struct request *bfq_choose_req(struct bfq_data *bfqd,
+				      struct request *rq1,
+				      struct request *rq2,
+				      sector_t last)
+{
+	sector_t s1, s2, d1 = 0, d2 = 0;
+	unsigned long back_max;
+#define BFQ_RQ1_WRAP	0x01 /* request 1 wraps */
+#define BFQ_RQ2_WRAP	0x02 /* request 2 wraps */
+	unsigned wrap = 0; /* bit mask: requests behind the disk head? */
+
+	if (rq1 == NULL || rq1 == rq2)
+		return rq2;
+	if (rq2 == NULL)
+		return rq1;
+
+	if (rq_is_sync(rq1) && !rq_is_sync(rq2))
+		return rq1;
+	else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
+		return rq2;
+	if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
+		return rq1;
+	else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
+		return rq2;
+
+	s1 = blk_rq_pos(rq1);
+	s2 = blk_rq_pos(rq2);
+
+	/*
+	 * By definition, 1KiB is 2 sectors.
+	 */
+	back_max = bfqd->bfq_back_max * 2;
+
+	/*
+	 * Strict one way elevator _except_ in the case where we allow
+	 * short backward seeks which are biased as twice the cost of a
+	 * similar forward seek.
+	 */
+	if (s1 >= last)
+		d1 = s1 - last;
+	else if (s1 + back_max >= last)
+		d1 = (last - s1) * bfqd->bfq_back_penalty;
+	else
+		wrap |= BFQ_RQ1_WRAP;
+
+	if (s2 >= last)
+		d2 = s2 - last;
+	else if (s2 + back_max >= last)
+		d2 = (last - s2) * bfqd->bfq_back_penalty;
+	else
+		wrap |= BFQ_RQ2_WRAP;
+
+	/* Found required data */
+
+	/*
+	 * By doing switch() on the bit mask "wrap" we avoid having to
+	 * check two variables for all permutations: --> faster!
+	 */
+	switch (wrap) {
+	case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
+		if (d1 < d2)
+			return rq1;
+		else if (d2 < d1)
+			return rq2;
+		else {
+			if (s1 >= s2)
+				return rq1;
+			else
+				return rq2;
+		}
+
+	case BFQ_RQ2_WRAP:
+		return rq1;
+	case BFQ_RQ1_WRAP:
+		return rq2;
+	case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
+	default:
+		/*
+		 * Since both rqs are wrapped,
+		 * start with the one that's further behind head
+		 * (--> only *one* back seek required),
+		 * since back seek takes more time than forward.
+		 */
+		if (s1 <= s2)
+			return rq1;
+		else
+			return rq2;
+	}
+}
+
+static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
+					struct bfq_queue *bfqq,
+					struct request *last)
+{
+	struct rb_node *rbnext = rb_next(&last->rb_node);
+	struct rb_node *rbprev = rb_prev(&last->rb_node);
+	struct request *next = NULL, *prev = NULL;
+
+	if (rbprev != NULL)
+		prev = rb_entry_rq(rbprev);
+
+	if (rbnext != NULL)
+		next = rb_entry_rq(rbnext);
+	else {
+		rbnext = rb_first(&bfqq->sort_list);
+		if (rbnext && rbnext != &last->rb_node)
+			next = rb_entry_rq(rbnext);
+	}
+
+	return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
+}
+
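+/*
+ * Amount of service to charge for @rq: in BFQ-v0 each request is
+ * charged exactly its size in sectors, so that budgets are consumed
+ * in the service domain.
+ */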
+static inline unsigned long bfq_serv_to_charge(struct request *rq,
+					       struct bfq_queue *bfqq)
+{
+	return blk_rq_sectors(rq);
+}
+
+/**
+ * bfq_updated_next_req - update the queue after a new next_rq selection.
+ * @bfqd: the device data the queue belongs to.
+ * @bfqq: the queue to update.
+ *
+ * If the first request of a queue changes we make sure that the queue
+ * has enough budget to serve at least its first request (if the
+ * request has grown).  We do this because, if the queue does not have enough
+ * budget for its first request, it has to go through two dispatch
+ * rounds to actually get it dispatched.
+ */
+static void bfq_updated_next_req(struct bfq_data *bfqd,
+				 struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	struct request *next_rq = bfqq->next_rq;
+	unsigned long new_budget;
+
+	if (next_rq == NULL)
+		return;
+
+	if (bfqq == bfqd->in_service_queue)
+		/*
+		 * In order not to break guarantees, budgets cannot be
+		 * changed after an entity has been selected.
+		 */
+		return;
+
+	new_budget = max_t(unsigned long, bfqq->max_budget,
+			   bfq_serv_to_charge(next_rq, bfqq));
+	if (entity->budget != new_budget) {
+		entity->budget = new_budget;
+		bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
+					 new_budget);
+		bfq_activate_bfqq(bfqd, bfqq);
+	}
+}
+
+static void bfq_add_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_entity *entity = &bfqq->entity;
+	struct bfq_data *bfqd = bfqq->bfqd;
+	struct request *next_rq, *prev;
+
+	bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
+	bfqq->queued[rq_is_sync(rq)]++;
+	bfqd->queued++;
+
+	elv_rb_add(&bfqq->sort_list, rq);
+
+	/*
+	 * Check if this request is a better next-serve candidate.
+	 */
+	prev = bfqq->next_rq;
+	next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
+	bfqq->next_rq = next_rq;
+
+	if (!bfq_bfqq_busy(bfqq)) {
+		entity->budget = max_t(unsigned long, bfqq->max_budget,
+				       bfq_serv_to_charge(next_rq, bfqq));
+		bfq_add_bfqq_busy(bfqd, bfqq);
+	} else {
+		if (prev != bfqq->next_rq)
+			bfq_updated_next_req(bfqd, bfqq);
+	}
+}
+
+static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
+					  struct bio *bio)
+{
+	struct task_struct *tsk = current;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	bic = bfq_bic_lookup(bfqd, tsk->io_context);
+	if (bic == NULL)
+		return NULL;
+
+	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	if (bfqq != NULL)
+		return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
+
+	return NULL;
+}
+
+static void bfq_activate_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+
+	bfqd->rq_in_driver++;
+	bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
+	bfq_log(bfqd, "activate_request: new bfqd->last_position %llu",
+		(long long unsigned)bfqd->last_position);
+}
+
+static inline void bfq_deactivate_request(struct request_queue *q,
+					  struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+
+	bfqd->rq_in_driver--;
+}
+
+static void bfq_remove_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_data *bfqd = bfqq->bfqd;
+	const int sync = rq_is_sync(rq);
+
+	if (bfqq->next_rq == rq) {
+		bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
+		bfq_updated_next_req(bfqd, bfqq);
+	}
+
+	list_del_init(&rq->queuelist);
+	bfqq->queued[sync]--;
+	bfqd->queued--;
+	elv_rb_del(&bfqq->sort_list, rq);
+
+	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue)
+			bfq_del_bfqq_busy(bfqd, bfqq, 1);
+	}
+
+	if (rq->cmd_flags & REQ_META)
+		bfqq->meta_pending--;
+}
+
+static int bfq_merge(struct request_queue *q, struct request **req,
+		     struct bio *bio)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct request *__rq;
+
+	__rq = bfq_find_rq_fmerge(bfqd, bio);
+	if (__rq != NULL && elv_rq_merge_ok(__rq, bio)) {
+		*req = __rq;
+		return ELEVATOR_FRONT_MERGE;
+	}
+
+	return ELEVATOR_NO_MERGE;
+}
+
+static void bfq_merged_request(struct request_queue *q, struct request *req,
+			       int type)
+{
+	if (type == ELEVATOR_FRONT_MERGE &&
+	    rb_prev(&req->rb_node) &&
+	    blk_rq_pos(req) <
+	    blk_rq_pos(container_of(rb_prev(&req->rb_node),
+				    struct request, rb_node))) {
+		struct bfq_queue *bfqq = RQ_BFQQ(req);
+		struct bfq_data *bfqd = bfqq->bfqd;
+		struct request *prev, *next_rq;
+
+		/* Reposition request in its sort_list */
+		elv_rb_del(&bfqq->sort_list, req);
+		elv_rb_add(&bfqq->sort_list, req);
+		/* Choose next request to be served for bfqq */
+		prev = bfqq->next_rq;
+		next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,
+					 bfqd->last_position);
+		bfqq->next_rq = next_rq;
+		/*
+		 * If next_rq changes, update the queue's budget to fit
+		 * the new request.
+		 */
+		if (prev != bfqq->next_rq)
+			bfq_updated_next_req(bfqd, bfqq);
+	}
+}
+
+static void bfq_merged_requests(struct request_queue *q, struct request *rq,
+				struct request *next)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	/*
+	 * Reposition in fifo if next is older than rq.
+	 */
+	if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
+	    time_before(next->fifo_time, rq->fifo_time)) {
+		list_move(&rq->queuelist, &next->queuelist);
+		rq->fifo_time = next->fifo_time;
+	}
+
+	if (bfqq->next_rq == next)
+		bfqq->next_rq = rq;
+
+	bfq_remove_request(next);
+}
+
+static int bfq_allow_merge(struct request_queue *q, struct request *rq,
+			   struct bio *bio)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	/*
+	 * Disallow merge of a sync bio into an async request.
+	 */
+	if (bfq_bio_sync(bio) && !rq_is_sync(rq))
+		return 0;
+
+	/*
+	 * Lookup the bfqq that this bio will be queued with. Allow
+	 * merge only if rq is queued there.
+	 * Queue lock is held here.
+	 */
+	bic = bfq_bic_lookup(bfqd, current->io_context);
+	if (bic == NULL)
+		return 0;
+
+	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	return bfqq == RQ_BFQQ(rq);
+}
+
+static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
+				       struct bfq_queue *bfqq)
+{
+	if (bfqq != NULL) {
+		bfq_mark_bfqq_must_alloc(bfqq);
+		bfq_mark_bfqq_budget_new(bfqq);
+		bfq_clear_bfqq_fifo_expire(bfqq);
+
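+		/*
+		 * Low-pass counter of assigned budgets: it converges to
+		 * 256 as budgets are assigned, and is compared against a
+		 * threshold (194, about 3/4 of 256) to decide whether
+		 * enough samples have been collected for autotuning.
+		 */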
+		bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
+
+		bfq_log_bfqq(bfqd, bfqq,
+			     "set_in_service_queue, cur-budget = %lu",
+			     bfqq->entity.budget);
+	}
+
+	bfqd->in_service_queue = bfqq;
+}
+
+/*
+ * Get and set a new queue for service.
+ */
+static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
+
+	__bfq_set_in_service_queue(bfqd, bfqq);
+	return bfqq;
+}
+
+/*
+ * If enough samples have been computed, return the current max budget
+ * stored in bfqd, which is dynamically updated according to the
+ * estimated disk peak rate; otherwise return the default max budget.
+ */
+static inline unsigned long bfq_max_budget(struct bfq_data *bfqd)
+{
+	if (bfqd->budgets_assigned < 194)
+		return bfq_default_max_budget;
+	else
+		return bfqd->bfq_max_budget;
+}
+
+/**
+ * bfq_default_budget - return the default budget for @bfqq on @bfqd.
+ * @bfqd: the device descriptor.
+ * @bfqq: the queue to consider.
+ *
+ * We use 3/4 of the @bfqd maximum budget as the default value
+ * for the max_budget field of the queues.  This lets the feedback
+ * mechanism start from some middle ground; the behavior of the
+ * process will then drive the heuristics towards high values, if
+ * it behaves as a greedy sequential reader, or towards small values
+ * if it shows a more intermittent behavior.
+ */
+static unsigned long bfq_default_budget(struct bfq_data *bfqd,
+					struct bfq_queue *bfqq)
+{
+	unsigned long budget;
+
+	/*
+	 * When we need an estimate of the peak rate, we need to avoid
+	 * giving budgets that are too short because of previous measurements.
+	 * So, in the first 10 assignments use a ``safe'' budget value.
+	 */
+	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
+		budget = bfq_default_max_budget;
+	else
+		budget = bfqd->bfq_max_budget;
+
+	return budget - budget / 4;
+}
+
+/*
+ * Return min budget, which is a fraction of the current or default
+ * max budget (currently 1/32).
+ */
+static inline unsigned long bfq_min_budget(struct bfq_data *bfqd)
+{
+	if (bfqd->budgets_assigned < 194)
+		return bfq_default_max_budget / 32;
+	else
+		return bfqd->bfq_max_budget / 32;
+}
+
+static void bfq_arm_slice_timer(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfqd->in_service_queue;
+	struct bfq_io_cq *bic;
+	unsigned long sl;
+
+	/* Processes have exited, don't wait. */
+	bic = bfqd->in_service_bic;
+	if (bic == NULL || atomic_read(&bic->icq.ioc->active_ref) == 0)
+		return;
+
+	bfq_mark_bfqq_wait_request(bfqq);
+
+	/*
+	 * We don't want to idle for seeks, but we do want to allow
+	 * fair distribution of slice time for a process doing back-to-back
+	 * seeks. So allow a little bit of time for it to submit a new rq.
+	 */
+	sl = bfqd->bfq_slice_idle;
+	/*
+	 * Grant only minimum idle time if the queue has been seeky for long
+	 * enough.
+	 */
+	if (bfq_sample_valid(bfqq->seek_samples) && BFQQ_SEEKY(bfqq))
+		sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
+	bfqd->last_idling_start = ktime_get();
+	mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
+	bfq_log(bfqd, "arm idle: %u/%u ms",
+		jiffies_to_msecs(sl), jiffies_to_msecs(bfqd->bfq_slice_idle));
+}
+
+/*
+ * Set the maximum time for the in-service queue to consume its
+ * budget. This prevents seeky processes from lowering the disk
+ * throughput (always guaranteed with a time slice scheme as in CFQ).
+ */
+static void bfq_set_budget_timeout(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfqd->in_service_queue;
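+	/*
+	 * Scale the budget timeout by the ratio between the current
+	 * weight and the original weight of the queue, so that queues
+	 * with a raised weight are granted proportionally more time to
+	 * consume their budgets.
+	 */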
+	unsigned int timeout_coeff = bfqq->entity.weight /
+				     bfqq->entity.orig_weight;
+
+	bfqd->last_budget_start = ktime_get();
+
+	bfq_clear_bfqq_budget_new(bfqq);
+	bfqq->budget_timeout = jiffies +
+		bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] * timeout_coeff;
+
+	bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u",
+		jiffies_to_msecs(bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] *
+		timeout_coeff));
+}
+
+/*
+ * Move request from internal lists to the request queue dispatch list.
+ */
+static void bfq_dispatch_insert(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	/*
+	 * For consistency, the next instruction should have been executed
+	 * after removing the request from the queue and dispatching it.
+	 * We instead execute this instruction before bfq_remove_request()
+	 * (and hence introduce a temporary inconsistency), for efficiency.
+	 * In fact, in a forced dispatch, this keeps two counters related
+	 * to bfqq->dispatched from being uselessly decremented if bfqq
+	 * is not in service, and then incremented again after
+	 * bfqq->dispatched itself has been incremented.
+	 */
+	bfqq->dispatched++;
+	bfq_remove_request(rq);
+	elv_dispatch_sort(q, rq);
+
+	if (bfq_bfqq_sync(bfqq))
+		bfqd->sync_flight++;
+}
+
+/*
+ * Return expired entry, or NULL to just start from scratch in rbtree.
+ */
+static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
+{
+	struct request *rq = NULL;
+
+	if (bfq_bfqq_fifo_expire(bfqq))
+		return NULL;
+
+	bfq_mark_bfqq_fifo_expire(bfqq);
+
+	if (list_empty(&bfqq->fifo))
+		return NULL;
+
+	rq = rq_entry_fifo(bfqq->fifo.next);
+
+	if (time_before(jiffies, rq->fifo_time))
+		return NULL;
+
+	return rq;
+}
+
+static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	return entity->budget - entity->service;
+}
+
+static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	__bfq_bfqd_reset_in_service(bfqd);
+
+	if (RB_EMPTY_ROOT(&bfqq->sort_list))
+		bfq_del_bfqq_busy(bfqd, bfqq, 1);
+	else
+		bfq_activate_bfqq(bfqd, bfqq);
+}
+
+/**
+ * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
+ * @bfqd: device data.
+ * @bfqq: queue to update.
+ * @reason: reason for expiration.
+ *
+ * Handle the feedback on @bfqq budget.  See the body for detailed
+ * comments.
+ */
+static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
+				     struct bfq_queue *bfqq,
+				     enum bfqq_expiration reason)
+{
+	struct request *next_rq;
+	unsigned long budget, min_budget;
+
+	budget = bfqq->max_budget;
+	min_budget = bfq_min_budget(bfqd);
+
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %lu, budg left %lu",
+		bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %lu, min budg %lu",
+		budget, bfq_min_budget(bfqd));
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
+		bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
+
+	if (bfq_bfqq_sync(bfqq)) {
+		switch (reason) {
+		/*
+		 * Caveat: in all the following cases we trade latency
+		 * for throughput.
+		 */
+		case BFQ_BFQQ_TOO_IDLE:
+			if (budget > min_budget + BFQ_BUDGET_STEP)
+				budget -= BFQ_BUDGET_STEP;
+			else
+				budget = min_budget;
+			break;
+		case BFQ_BFQQ_BUDGET_TIMEOUT:
+			budget = bfq_default_budget(bfqd, bfqq);
+			break;
+		case BFQ_BFQQ_BUDGET_EXHAUSTED:
+			/*
+			 * The process still has backlog, and did not
+			 * let either the budget timeout or the disk
+			 * idling timeout expire. Hence it is not
+			 * seeky, has a short thinktime and may be
+			 * happy with a higher budget too. So
+			 * definitely increase the budget of this good
+			 * candidate to boost the disk throughput.
+			 */
+			budget = min(budget + 8 * BFQ_BUDGET_STEP,
+				     bfqd->bfq_max_budget);
+			break;
+		case BFQ_BFQQ_NO_MORE_REQUESTS:
+			/*
+			 * Leave the budget unchanged.
+			 */
+		default:
+			return;
+		}
+	} else /* async queue */
+	    /* async queues always get the maximum possible budget
+	     * (their ability to dispatch is limited by
+	     * @bfqd->bfq_max_budget_async_rq).
+	     */
+		budget = bfqd->bfq_max_budget;
+
+	bfqq->max_budget = budget;
+
+	if (bfqd->budgets_assigned >= 194 && bfqd->bfq_user_max_budget == 0 &&
+	    bfqq->max_budget > bfqd->bfq_max_budget)
+		bfqq->max_budget = bfqd->bfq_max_budget;
+
+	/*
+	 * Make sure that we have enough budget for the next request.
+	 * Since the finish time of the bfqq must be kept in sync with
+	 * the budget, be sure to call __bfq_bfqq_expire() after the
+	 * update.
+	 */
+	next_rq = bfqq->next_rq;
+	if (next_rq != NULL)
+		bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
+					    bfq_serv_to_charge(next_rq, bfqq));
+	else
+		bfqq->entity.budget = bfqq->max_budget;
+
+	bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %lu",
+			next_rq != NULL ? blk_rq_sectors(next_rq) : 0,
+			bfqq->entity.budget);
+}
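+
+/*
+ * Rough example of the feedback implemented by __bfq_bfqq_recalc_budget()
+ * (the step size is symbolic, its actual value is BFQ_BUDGET_STEP): a sync
+ * queue that keeps expiring with BFQ_BFQQ_TOO_IDLE sees its max_budget
+ * shrink by one step per expiration, down to bfq_min_budget(), while a
+ * queue that keeps exhausting its budget grows by eight steps per
+ * expiration, up to bfqd->bfq_max_budget. A budget timeout resets the
+ * budget to the default value, BFQ_BFQQ_NO_MORE_REQUESTS leaves it
+ * unchanged, and async queues always get the maximum budget.
+ */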
+
+static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
+{
+	unsigned long max_budget;
+
+	/*
+	 * The max_budget calculated when autotuning is equal to the
+	 * number of sectors transferred in timeout_sync at the
+	 * estimated peak rate.
+	 */
+	max_budget = (unsigned long)(peak_rate * 1000 *
+				     timeout >> BFQ_RATE_SHIFT);
+
+	return max_budget;
+}
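+
+/*
+ * Purely illustrative example for bfq_calc_max_budget(): peak_rate is
+ * stored in sectors per usec, left-shifted by BFQ_RATE_SHIFT, so
+ * multiplying it by the timeout in usecs (1000 * timeout) and shifting
+ * back yields the number of sectors served in one timeout at the peak
+ * rate. For instance, a peak rate of about 200 sectors/ms (roughly
+ * 100 MB/s with 512-byte sectors) and a 125 ms sync timeout would give
+ * a max_budget of about 25000 sectors, i.e., ~12 MB.
+ */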
+
+/*
+ * In addition to updating the peak rate, checks whether the process
+ * is "slow", and returns 1 if so. This slow flag is used, in addition
+ * to the budget timeout, to reduce the amount of service provided to
+ * seeky processes, and hence reduce their chances of lowering the
+ * throughput. See the code for more details.
+ */
+static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				int compensate)
+{
+	u64 bw, usecs, expected, timeout;
+	ktime_t delta;
+	int update = 0;
+
+	if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
+		return 0;
+
+	if (compensate)
+		delta = bfqd->last_idling_start;
+	else
+		delta = ktime_get();
+	delta = ktime_sub(delta, bfqd->last_budget_start);
+	usecs = ktime_to_us(delta);
+
+	/* Don't trust short/unrealistic values. */
+	if (usecs < 100 || usecs >= LONG_MAX)
+		return 0;
+
+	/*
+	 * Calculate the bandwidth for the last slice.  We use a 64 bit
+	 * value to store the peak rate, in sectors per usec in fixed
+	 * point math.  We do so to have enough precision in the estimate
+	 * and to avoid overflows.
+	 */
+	bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
+	do_div(bw, (unsigned long)usecs);
+
+	timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
+
+	/*
+	 * Use only long (> 20ms) intervals to filter out spikes for
+	 * the peak rate estimation.
+	 */
+	if (usecs > 20000) {
+		if (bw > bfqd->peak_rate) {
+			bfqd->peak_rate = bw;
+			update = 1;
+			bfq_log(bfqd, "new peak_rate=%llu", bw);
+		}
+
+		update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
+
+		if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
+			bfqd->peak_rate_samples++;
+
+		if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
+		    update && bfqd->bfq_user_max_budget == 0) {
+			bfqd->bfq_max_budget =
+				bfq_calc_max_budget(bfqd->peak_rate,
+						    timeout);
+			bfq_log(bfqd, "new max_budget=%lu",
+				bfqd->bfq_max_budget);
+		}
+	}
+
+	/*
+	 * A process is considered ``slow'' (i.e., seeky, so that we
+	 * cannot treat it fairly in the service domain, as it would
+	 * slow down the other processes too much) if, when a slice
+	 * ends for whatever reason, it has received service at a
+	 * rate that would not be high enough to complete the budget
+	 * before the budget timeout expiration.
+	 */
+	expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT;
+
+	/*
+	 * Caveat: processes doing IO in the slower disk zones will
+	 * tend to be slow(er) even if not seeky. And the estimated
+	 * peak rate will actually be an average over the disk
+	 * surface. Hence, to not be too harsh with unlucky processes,
+	 * we keep a budget/3 margin of safety before declaring a
+	 * process slow.
+	 */
+	return expected > (4 * bfqq->entity.budget) / 3;
+}
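+
+/*
+ * Note on the fixed-point math used in bfq_update_peak_rate(): bw is the
+ * service received in the last slice, in sectors, shifted left by
+ * BFQ_RATE_SHIFT and divided by the slice duration in usecs. For example,
+ * a queue that received 2048 sectors (1 MiB) of service in 10 ms has a
+ * raw rate of about 0.2 sectors/usec, which would be rounded to zero
+ * without the shift. expected is just that rate scaled up to a whole
+ * budget timeout, i.e., the service the queue would receive if it kept
+ * the same pace for the entire timeout.
+ */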
+
+/**
+ * bfq_bfqq_expire - expire a queue.
+ * @bfqd: device owning the queue.
+ * @bfqq: the queue to expire.
+ * @compensate: if true, compensate for the time spent idling.
+ * @reason: the reason causing the expiration.
+ *
+ * If the process associated to the queue is slow (i.e., seeky), or in
+ * case of budget timeout, or, finally, if it is async, we
+ * artificially charge it an entire budget (independently of the
+ * actual service it received). As a consequence, the queue will get
+ * higher timestamps than the correct ones upon reactivation, and
+ * hence it will be rescheduled as if it had received more service
+ * than what it actually received. In the end, this class of processes
+ * will receive less service in proportion to how slowly they consume
+ * their budgets (and hence how seriously they tend to lower the
+ * throughput).
+ *
+ * In contrast, when a queue expires because it has been idling for
+ * too long or because it exhausted its budget, we do not touch the
+ * amount of service it has received. Hence when the queue will be
+ * reactivated and its timestamps updated, the latter will be in sync
+ * with the actual service received by the queue until expiration.
+ *
+ * Charging a full budget to the first type of queues and the exact
+ * service to the others has the effect of using the WF2Q+ policy to
+ * schedule the former on a timeslice basis, without violating the
+ * service domain guarantees of the latter.
+ */
+static void bfq_bfqq_expire(struct bfq_data *bfqd,
+			    struct bfq_queue *bfqq,
+			    int compensate,
+			    enum bfqq_expiration reason)
+{
+	int slow;
+
+	/* Update disk peak rate for autotuning and check whether the
+	 * process is slow (see bfq_update_peak_rate).
+	 */
+	slow = bfq_update_peak_rate(bfqd, bfqq, compensate);
+
+	/*
+	 * As explained above, 'punish' slow (i.e., seeky), timed-out
+	 * and async queues, to favor sequential sync workloads.
+	 */
+	if (slow || reason == BFQ_BFQQ_BUDGET_TIMEOUT)
+		bfq_bfqq_charge_full_budget(bfqq);
+
+	bfq_log_bfqq(bfqd, bfqq,
+		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
+		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
+
+	/*
+	 * Increase, decrease or leave budget unchanged according to
+	 * reason.
+	 */
+	__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
+	__bfq_bfqq_expire(bfqd, bfqq);
+}
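+
+/*
+ * To make the effect of bfq_bfqq_charge_full_budget() in bfq_bfqq_expire()
+ * concrete with an example: a seeky queue that managed to consume only a
+ * fourth of its budget before being expired is charged the whole budget
+ * anyway, so its next finish timestamp is computed as if it had received
+ * four times the service it actually got. In the long run such a queue is
+ * scheduled roughly on a time-slice basis, whereas queues charged their
+ * exact service keep their service-domain guarantees.
+ */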
+
+/*
+ * Budget timeout is not implemented through a dedicated timer, but
+ * just checked on request arrivals and completions, as well as on
+ * idle timer expirations.
+ */
+static int bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
+{
+	if (bfq_bfqq_budget_new(bfqq) ||
+	    time_before(jiffies, bfqq->budget_timeout))
+		return 0;
+	return 1;
+}
+
+/*
+ * If we expire a queue that is waiting for the arrival of a new
+ * request, we may prevent the fictitious timestamp back-shifting that
+ * allows the guarantees of the queue to be preserved (see [1] for
+ * this tricky aspect). Hence we return true only if this condition
+ * does not hold, or if the queue is so slow that it deserves to be
+ * expired anyway, to preserve a high throughput.
+ */
+static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqq->bfqd, bfqq,
+		"may_budget_timeout: wait_request %d left %d timeout %d",
+		bfq_bfqq_wait_request(bfqq),
+			bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3,
+		bfq_bfqq_budget_timeout(bfqq));
+
+	return (!bfq_bfqq_wait_request(bfqq) ||
+		bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3)
+		&&
+		bfq_bfqq_budget_timeout(bfqq);
+}
+
+/*
+ * Device idling is allowed only for sync queues that have a non-null
+ * idle window.
+ */
+static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
+{
+	return bfq_bfqq_sync(bfqq) && bfq_bfqq_idle_window(bfqq);
+}
+
+/*
+ * If the in-service queue is empty, but it is sync and the queue has its
+ * idle window set (in this case, waiting for a new request for the queue
+ * is likely to boost the throughput), then:
+ * 1) the queue must remain in service and cannot be expired, and
+ * 2) the disk must be idled to wait for the possible arrival of a new
+ *    request for the queue.
+ */
+static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
+{
+	struct bfq_data *bfqd = bfqq->bfqd;
+
+	return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 &&
+	       bfq_bfqq_must_not_expire(bfqq);
+}
+
+/*
+ * Select a queue for service.  If we have a current queue in service,
+ * check whether to continue servicing it, or retrieve and set a new one.
+ */
+static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq;
+	struct request *next_rq;
+	enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+
+	bfqq = bfqd->in_service_queue;
+	if (bfqq == NULL)
+		goto new_queue;
+
+	bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
+
+	if (bfq_may_expire_for_budg_timeout(bfqq) &&
+	    !timer_pending(&bfqd->idle_slice_timer) &&
+	    !bfq_bfqq_must_idle(bfqq))
+		goto expire;
+
+	next_rq = bfqq->next_rq;
+	/*
+	 * If bfqq has requests queued and it has enough budget left to
+	 * serve them, keep the queue, otherwise expire it.
+	 */
+	if (next_rq != NULL) {
+		if (bfq_serv_to_charge(next_rq, bfqq) >
+			bfq_bfqq_budget_left(bfqq)) {
+			reason = BFQ_BFQQ_BUDGET_EXHAUSTED;
+			goto expire;
+		} else {
+			/*
+			 * The idle timer may be pending because we may
+			 * not disable disk idling even when a new request
+			 * arrives.
+			 */
+			if (timer_pending(&bfqd->idle_slice_timer)) {
+				/*
+				 * If we get here: 1) at least one new request
+				 * has arrived but we have not disabled the
+				 * timer because the request was too small,
+				 * and 2) the block layer has unplugged
+				 * the device, causing the dispatch to be
+				 * invoked.
+				 *
+				 * Since the device is unplugged, now the
+				 * requests are probably large enough to
+				 * provide a reasonable throughput.
+				 * So we disable idling.
+				 */
+				bfq_clear_bfqq_wait_request(bfqq);
+				del_timer(&bfqd->idle_slice_timer);
+			}
+			goto keep_queue;
+		}
+	}
+
+	/*
+	 * No requests pending.  If the in-service queue still has requests
+	 * in flight (possibly waiting for a completion) or is idling for a
+	 * new request, then keep it.
+	 */
+	if (timer_pending(&bfqd->idle_slice_timer) ||
+	    (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq))) {
+		bfqq = NULL;
+		goto keep_queue;
+	}
+
+	reason = BFQ_BFQQ_NO_MORE_REQUESTS;
+expire:
+	bfq_bfqq_expire(bfqd, bfqq, 0, reason);
+new_queue:
+	bfqq = bfq_set_in_service_queue(bfqd);
+	bfq_log(bfqd, "select_queue: new queue %d returned",
+		bfqq != NULL ? bfqq->pid : 0);
+keep_queue:
+	return bfqq;
+}
+
+/*
+ * Dispatch one request from bfqq, moving it to the request queue
+ * dispatch list.
+ */
+static int bfq_dispatch_request(struct bfq_data *bfqd,
+				struct bfq_queue *bfqq)
+{
+	int dispatched = 0;
+	struct request *rq;
+	unsigned long service_to_charge;
+
+	/* Follow expired path, else get first next available. */
+	rq = bfq_check_fifo(bfqq);
+	if (rq == NULL)
+		rq = bfqq->next_rq;
+	service_to_charge = bfq_serv_to_charge(rq, bfqq);
+
+	if (service_to_charge > bfq_bfqq_budget_left(bfqq)) {
+		/*
+		 * This may happen if the next rq is chosen in fifo order
+		 * instead of sector order. The budget is properly
+		 * dimensioned to be always sufficient to serve the next
+		 * request only if it is chosen in sector order. The reason
+		 * is that it would be quite inefficient and little useful
+		 * is that it would be quite inefficient and of little use
+		 * serve even the possible next rq in fifo order.
+		 * In fact, requests are seldom served in fifo order.
+		 *
+		 * Expire the queue for budget exhaustion, and make sure
+		 * that the next act_budget is enough to serve the next
+		 * request, even if it comes from the fifo expired path.
+		 */
+		bfqq->next_rq = rq;
+		/*
+		 * Since this dispatch failed, make sure that
+		 * a new one will be performed.
+		 */
+		if (!bfqd->rq_in_driver)
+			bfq_schedule_dispatch(bfqd);
+		goto expire;
+	}
+
+	/* Finally, insert request into driver dispatch list. */
+	bfq_bfqq_served(bfqq, service_to_charge);
+	bfq_dispatch_insert(bfqd->queue, rq);
+
+	bfq_log_bfqq(bfqd, bfqq,
+			"dispatched %u sec req (%llu), budg left %lu",
+			blk_rq_sectors(rq),
+			(long long unsigned)blk_rq_pos(rq),
+			bfq_bfqq_budget_left(bfqq));
+
+	dispatched++;
+
+	if (bfqd->in_service_bic == NULL) {
+		atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount);
+		bfqd->in_service_bic = RQ_BIC(rq);
+	}
+
+	if (bfqd->busy_queues > 1 && ((!bfq_bfqq_sync(bfqq) &&
+	    dispatched >= bfqd->bfq_max_budget_async_rq) ||
+	    bfq_class_idle(bfqq)))
+		goto expire;
+
+	return dispatched;
+
+expire:
+	bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_EXHAUSTED);
+	return dispatched;
+}
+
+static int __bfq_forced_dispatch_bfqq(struct bfq_queue *bfqq)
+{
+	int dispatched = 0;
+
+	while (bfqq->next_rq != NULL) {
+		bfq_dispatch_insert(bfqq->bfqd->queue, bfqq->next_rq);
+		dispatched++;
+	}
+
+	return dispatched;
+}
+
+/*
+ * Drain our current requests.
+ * Used for barriers and when switching io schedulers on-the-fly.
+ */
+static int bfq_forced_dispatch(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq, *n;
+	struct bfq_service_tree *st;
+	int dispatched = 0;
+
+	bfqq = bfqd->in_service_queue;
+	if (bfqq != NULL)
+		__bfq_bfqq_expire(bfqd, bfqq);
+
+	/*
+	 * Loop through classes, and be careful to leave the scheduler
+	 * in a consistent state, as feedback mechanisms and vtime
+	 * updates cannot be disabled during the process.
+	 */
+	list_for_each_entry_safe(bfqq, n, &bfqd->active_list, bfqq_list) {
+		st = bfq_entity_service_tree(&bfqq->entity);
+
+		dispatched += __bfq_forced_dispatch_bfqq(bfqq);
+		bfqq->max_budget = bfq_max_budget(bfqd);
+
+		bfq_forget_idle(st);
+	}
+
+	return dispatched;
+}
+
+static int bfq_dispatch_requests(struct request_queue *q, int force)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq;
+	int max_dispatch;
+
+	bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
+	if (bfqd->busy_queues == 0)
+		return 0;
+
+	if (unlikely(force))
+		return bfq_forced_dispatch(bfqd);
+
+	bfqq = bfq_select_queue(bfqd);
+	if (bfqq == NULL)
+		return 0;
+
+	max_dispatch = bfqd->bfq_quantum;
+	if (bfq_class_idle(bfqq))
+		max_dispatch = 1;
+
+	if (!bfq_bfqq_sync(bfqq))
+		max_dispatch = bfqd->bfq_max_budget_async_rq;
+
+	if (bfqq->dispatched >= max_dispatch) {
+		if (bfqd->busy_queues > 1)
+			return 0;
+		if (bfqq->dispatched >= 4 * max_dispatch)
+			return 0;
+	}
+
+	if (bfqd->sync_flight != 0 && !bfq_bfqq_sync(bfqq))
+		return 0;
+
+	bfq_clear_bfqq_wait_request(bfqq);
+
+	if (!bfq_dispatch_request(bfqd, bfqq))
+		return 0;
+
+	bfq_log_bfqq(bfqd, bfqq, "dispatched one request of %d (max_disp %d)",
+			bfqq->pid, max_dispatch);
+
+	return 1;
+}
+
+/*
+ * Task holds one reference to the queue, dropped when task exits.  Each rq
+ * in-flight on this queue also holds a reference, dropped when rq is freed.
+ *
+ * Queue lock must be held here.
+ */
+static void bfq_put_queue(struct bfq_queue *bfqq)
+{
+	struct bfq_data *bfqd = bfqq->bfqd;
+
+	bfq_log_bfqq(bfqd, bfqq, "put_queue: %p %d", bfqq,
+		     atomic_read(&bfqq->ref));
+	if (!atomic_dec_and_test(&bfqq->ref))
+		return;
+
+	bfq_log_bfqq(bfqd, bfqq, "put_queue: %p freed", bfqq);
+
+	kmem_cache_free(bfq_pool, bfqq);
+}
+
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	if (bfqq == bfqd->in_service_queue) {
+		__bfq_bfqq_expire(bfqd, bfqq);
+		bfq_schedule_dispatch(bfqd);
+	}
+
+	bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq,
+		     atomic_read(&bfqq->ref));
+
+	bfq_put_queue(bfqq);
+}
+
+static inline void bfq_init_icq(struct io_cq *icq)
+{
+	icq_to_bic(icq)->ttime.last_end_request = jiffies;
+}
+
+static void bfq_exit_icq(struct io_cq *icq)
+{
+	struct bfq_io_cq *bic = icq_to_bic(icq);
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+
+	if (bic->bfqq[BLK_RW_ASYNC]) {
+		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_ASYNC]);
+		bic->bfqq[BLK_RW_ASYNC] = NULL;
+	}
+
+	if (bic->bfqq[BLK_RW_SYNC]) {
+		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
+		bic->bfqq[BLK_RW_SYNC] = NULL;
+	}
+}
+
+/*
+ * Update the entity prio values; note that the new values will not
+ * be used until the next (re)activation.
+ */
+static void bfq_init_prio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+{
+	struct task_struct *tsk = current;
+	int ioprio_class;
+
+	if (!bfq_bfqq_prio_changed(bfqq))
+		return;
+
+	ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
+	switch (ioprio_class) {
+	default:
+		dev_err(bfqq->bfqd->queue->backing_dev_info.dev,
+			"bfq: bad prio %x\n", ioprio_class);
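+		/* fall through */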
+	case IOPRIO_CLASS_NONE:
+		/*
+		 * No prio set, inherit CPU scheduling settings.
+		 */
+		bfqq->entity.new_ioprio = task_nice_ioprio(tsk);
+		bfqq->entity.new_ioprio_class = task_nice_ioclass(tsk);
+		break;
+	case IOPRIO_CLASS_RT:
+		bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->entity.new_ioprio_class = IOPRIO_CLASS_RT;
+		break;
+	case IOPRIO_CLASS_BE:
+		bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->entity.new_ioprio_class = IOPRIO_CLASS_BE;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		bfqq->entity.new_ioprio_class = IOPRIO_CLASS_IDLE;
+		bfqq->entity.new_ioprio = 7;
+		bfq_clear_bfqq_idle_window(bfqq);
+		break;
+	}
+
+	bfqq->entity.ioprio_changed = 1;
+
+	bfq_clear_bfqq_prio_changed(bfqq);
+}
+
+static void bfq_changed_ioprio(struct bfq_io_cq *bic)
+{
+	struct bfq_data *bfqd;
+	struct bfq_queue *bfqq, *new_bfqq;
+	unsigned long uninitialized_var(flags);
+	int ioprio = bic->icq.ioc->ioprio;
+
+	bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
+				   &flags);
+	/*
+	 * This condition may trigger on a newly created bic; be sure to
+	 * drop the lock before returning.
+	 */
+	if (unlikely(bfqd == NULL) || likely(bic->ioprio == ioprio))
+		goto out;
+
+	bfqq = bic->bfqq[BLK_RW_ASYNC];
+	if (bfqq != NULL) {
+		new_bfqq = bfq_get_queue(bfqd, BLK_RW_ASYNC, bic,
+					 GFP_ATOMIC);
+		if (new_bfqq != NULL) {
+			bic->bfqq[BLK_RW_ASYNC] = new_bfqq;
+			bfq_log_bfqq(bfqd, bfqq,
+				     "changed_ioprio: bfqq %p %d",
+				     bfqq, atomic_read(&bfqq->ref));
+			bfq_put_queue(bfqq);
+		}
+	}
+
+	bfqq = bic->bfqq[BLK_RW_SYNC];
+	if (bfqq != NULL)
+		bfq_mark_bfqq_prio_changed(bfqq);
+
+	bic->ioprio = ioprio;
+
+out:
+	bfq_put_bfqd_unlock(bfqd, &flags);
+}
+
+static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			  pid_t pid, int is_sync)
+{
+	RB_CLEAR_NODE(&bfqq->entity.rb_node);
+	INIT_LIST_HEAD(&bfqq->fifo);
+
+	atomic_set(&bfqq->ref, 0);
+	bfqq->bfqd = bfqd;
+
+	bfq_mark_bfqq_prio_changed(bfqq);
+
+	if (is_sync) {
+		if (!bfq_class_idle(bfqq))
+			bfq_mark_bfqq_idle_window(bfqq);
+		bfq_mark_bfqq_sync(bfqq);
+	}
+
+	/* Tentative initial value to trade off between thr and lat */
+	bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
+	bfqq->pid = pid;
+}
+
+static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
+					      int is_sync,
+					      struct bfq_io_cq *bic,
+					      gfp_t gfp_mask)
+{
+	struct bfq_queue *bfqq, *new_bfqq = NULL;
+
+retry:
+	/* bic always exists here */
+	bfqq = bic_to_bfqq(bic, is_sync);
+
+	/*
+	 * Always try a new alloc if we fall back to the OOM bfqq
+	 * originally, since it should just be a temporary situation.
+	 */
+	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
+		bfqq = NULL;
+		if (new_bfqq != NULL) {
+			bfqq = new_bfqq;
+			new_bfqq = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			spin_unlock_irq(bfqd->queue->queue_lock);
+			new_bfqq = kmem_cache_alloc_node(bfq_pool,
+					gfp_mask | __GFP_ZERO,
+					bfqd->queue->node);
+			spin_lock_irq(bfqd->queue->queue_lock);
+			if (new_bfqq != NULL)
+				goto retry;
+		} else {
+			bfqq = kmem_cache_alloc_node(bfq_pool,
+					gfp_mask | __GFP_ZERO,
+					bfqd->queue->node);
+		}
+
+		if (bfqq != NULL) {
+			bfq_init_bfqq(bfqd, bfqq, current->pid, is_sync);
+			bfq_log_bfqq(bfqd, bfqq, "allocated");
+		} else {
+			bfqq = &bfqd->oom_bfqq;
+			bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");
+		}
+
+		bfq_init_prio_data(bfqq, bic);
+	}
+
+	if (new_bfqq != NULL)
+		kmem_cache_free(bfq_pool, new_bfqq);
+
+	return bfqq;
+}
+
+static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
+					       int ioprio_class, int ioprio)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		return &async_bfqq[0][ioprio];
+	case IOPRIO_CLASS_NONE:
+		ioprio = IOPRIO_NORM;
+		/* fall through */
+	case IOPRIO_CLASS_BE:
+		return &async_bfqq[1][ioprio];
+	case IOPRIO_CLASS_IDLE:
+		return &async_idle_bfqq;
+	default:
+		BUG();
+	}
+}
+
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
+				       int is_sync, struct bfq_io_cq *bic,
+				       gfp_t gfp_mask)
+{
+	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
+	struct bfq_queue **async_bfqq = NULL;
+	struct bfq_queue *bfqq = NULL;
+
+	if (!is_sync) {
+		async_bfqq = bfq_async_queue_prio(bfqd, ioprio_class, ioprio);
+		bfqq = *async_bfqq;
+	}
+
+	if (bfqq == NULL)
+		bfqq = bfq_find_alloc_queue(bfqd, is_sync, bic, gfp_mask);
+
+	/*
+	 * Pin the queue now that it's allocated, scheduler exit will
+	 * prune it.
+	 */
+	if (!is_sync && *async_bfqq == NULL) {
+		atomic_inc(&bfqq->ref);
+		bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		*async_bfqq = bfqq;
+	}
+
+	atomic_inc(&bfqq->ref);
+	bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq,
+		     atomic_read(&bfqq->ref));
+	return bfqq;
+}
+
+static void bfq_update_io_thinktime(struct bfq_data *bfqd,
+				    struct bfq_io_cq *bic)
+{
+	unsigned long elapsed = jiffies - bic->ttime.last_end_request;
+	unsigned long ttime = min(elapsed, 2UL * bfqd->bfq_slice_idle);
+
+	bic->ttime.ttime_samples = (7*bic->ttime.ttime_samples + 256) / 8;
+	bic->ttime.ttime_total = (7*bic->ttime.ttime_total + 256*ttime) / 8;
+	bic->ttime.ttime_mean = (bic->ttime.ttime_total + 128) /
+				bic->ttime.ttime_samples;
+}
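+
+/*
+ * The statistics updated by bfq_update_io_thinktime() form an
+ * exponentially weighted moving average: on each sample both
+ * ttime_samples and ttime_total are decayed by 7/8 and the new sample is
+ * added with a fixed-point scale of 256, so ttime_mean = ttime_total /
+ * ttime_samples tracks the recent think times. For example, a process
+ * that consistently takes N jiffies between the completion of one
+ * request and the submission of the next converges to a ttime_mean of
+ * about N, with single samples clamped to 2 * bfq_slice_idle.
+ */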
+
+static void bfq_update_io_seektime(struct bfq_data *bfqd,
+				   struct bfq_queue *bfqq,
+				   struct request *rq)
+{
+	sector_t sdist;
+	u64 total;
+
+	if (bfqq->last_request_pos < blk_rq_pos(rq))
+		sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
+	else
+		sdist = bfqq->last_request_pos - blk_rq_pos(rq);
+
+	/*
+	 * Don't allow the seek distance to get too large from the
+	 * odd fragment, pagein, etc.
+	 */
+	if (bfqq->seek_samples == 0) /* first request, not really a seek */
+		sdist = 0;
+	else if (bfqq->seek_samples <= 60) /* second & third seek */
+		sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*1024);
+	else
+		sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*64);
+
+	bfqq->seek_samples = (7*bfqq->seek_samples + 256) / 8;
+	bfqq->seek_total = (7*bfqq->seek_total + (u64)256*sdist) / 8;
+	total = bfqq->seek_total + (bfqq->seek_samples/2);
+	do_div(total, bfqq->seek_samples);
+	bfqq->seek_mean = (sector_t)total;
+
+	bfq_log_bfqq(bfqd, bfqq, "dist=%llu mean=%llu", (u64)sdist,
+			(u64)bfqq->seek_mean);
+}
+
+/*
+ * Disable idle window if the process thinks too long or seeks so much that
+ * it doesn't matter.
+ */
+static void bfq_update_idle_window(struct bfq_data *bfqd,
+				   struct bfq_queue *bfqq,
+				   struct bfq_io_cq *bic)
+{
+	int enable_idle;
+
+	/* Don't idle for async or idle io prio class. */
+	if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
+		return;
+
+	enable_idle = bfq_bfqq_idle_window(bfqq);
+
+	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
+	    bfqd->bfq_slice_idle == 0 ||
+		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+		enable_idle = 0;
+	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
+		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+	bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d",
+		enable_idle);
+
+	if (enable_idle)
+		bfq_mark_bfqq_idle_window(bfqq);
+	else
+		bfq_clear_bfqq_idle_window(bfqq);
+}
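+
+/*
+ * In short, after bfq_update_idle_window() the idle window stays enabled
+ * only if the process still holds active io_context references, idling
+ * is not disabled globally (bfq_slice_idle > 0), the device is not doing
+ * internal queueing (hw_tag) while the queue is seeky, and the mean
+ * think time, once enough samples are available, does not exceed
+ * bfq_slice_idle.
+ */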
+
+/*
+ * Called when a new fs request (rq) is added to bfqq.  Check if there's
+ * something we should do about it.
+ */
+static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			    struct request *rq)
+{
+	struct bfq_io_cq *bic = RQ_BIC(rq);
+
+	if (rq->cmd_flags & REQ_META)
+		bfqq->meta_pending++;
+
+	bfq_update_io_thinktime(bfqd, bic);
+	bfq_update_io_seektime(bfqd, bfqq, rq);
+	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
+	    !BFQQ_SEEKY(bfqq))
+		bfq_update_idle_window(bfqd, bfqq, bic);
+
+	bfq_log_bfqq(bfqd, bfqq,
+		     "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
+		     bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq),
+		     (long long unsigned)bfqq->seek_mean);
+
+	bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
+
+	if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
+		int small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&
+				blk_rq_sectors(rq) < 32;
+		int budget_timeout = bfq_bfqq_budget_timeout(bfqq);
+
+		/*
+		 * There is just this request queued: if the request
+		 * is small and the queue is not to be expired, then
+		 * just exit.
+		 *
+		 * In this way, if the disk is being idled to wait for
+		 * a new request from the in-service queue, we avoid
+		 * unplugging the device and committing the disk to serve
+		 * just a small request. Instead, we wait for
+		 * the block layer to decide when to unplug the device:
+		 * hopefully, new requests will be merged to this one
+		 * quickly, then the device will be unplugged and
+		 * larger requests will be dispatched.
+		 */
+		if (small_req && !budget_timeout)
+			return;
+
+		/*
+		 * A large enough request arrived, or the queue is to
+		 * be expired: in both cases disk idling is to be
+		 * stopped, so clear wait_request flag and reset
+		 * timer.
+		 */
+		bfq_clear_bfqq_wait_request(bfqq);
+		del_timer(&bfqd->idle_slice_timer);
+
+		/*
+		 * The queue is not empty, because a new request just
+		 * arrived. Hence we can safely expire the queue, in
+		 * case of budget timeout, without risking that the
+		 * timestamps of the queue are not updated correctly.
+		 * See [1] for more details.
+		 */
+		if (budget_timeout)
+			bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
+
+		/*
+		 * Let the request rip immediately, or let a new queue be
+		 * selected if bfqq has just been expired.
+		 */
+		__blk_run_queue(bfqd->queue);
+	}
+}
+
+static void bfq_insert_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	assert_spin_locked(bfqd->queue->queue_lock);
+
+	bfq_init_prio_data(bfqq, RQ_BIC(rq));
+
+	bfq_add_request(rq);
+
+	rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
+	list_add_tail(&rq->queuelist, &bfqq->fifo);
+
+	bfq_rq_enqueued(bfqd, bfqq, rq);
+}
+
+static void bfq_update_hw_tag(struct bfq_data *bfqd)
+{
+	bfqd->max_rq_in_driver = max(bfqd->max_rq_in_driver,
+				     bfqd->rq_in_driver);
+
+	if (bfqd->hw_tag == 1)
+		return;
+
+	/*
+	 * This sample is valid if the number of outstanding requests
+	 * is large enough to allow a queueing behavior.  Note that the
+	 * sum is not exact, as it does not take into account deactivated
+	 * requests.
+	 */
+	if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
+		return;
+
+	if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
+		return;
+
+	bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
+	bfqd->max_rq_in_driver = 0;
+	bfqd->hw_tag_samples = 0;
+}
+
+static void bfq_completed_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_data *bfqd = bfqq->bfqd;
+	bool sync = bfq_bfqq_sync(bfqq);
+
+	bfq_log_bfqq(bfqd, bfqq, "completed one req with %u sects left (%d)",
+		     blk_rq_sectors(rq), sync);
+
+	bfq_update_hw_tag(bfqd);
+
+	bfqd->rq_in_driver--;
+	bfqq->dispatched--;
+
+	if (sync) {
+		bfqd->sync_flight--;
+		RQ_BIC(rq)->ttime.last_end_request = jiffies;
+	}
+
+	/*
+	 * If this is the in-service queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+	if (bfqd->in_service_queue == bfqq) {
+		if (bfq_bfqq_budget_new(bfqq))
+			bfq_set_budget_timeout(bfqd);
+
+		if (bfq_bfqq_must_idle(bfqq)) {
+			bfq_arm_slice_timer(bfqd);
+			goto out;
+		} else if (bfq_may_expire_for_budg_timeout(bfqq))
+			bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
+		else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
+			 (bfqq->dispatched == 0 ||
+			  !bfq_bfqq_must_not_expire(bfqq)))
+			bfq_bfqq_expire(bfqd, bfqq, 0,
+					BFQ_BFQQ_NO_MORE_REQUESTS);
+	}
+
+	if (!bfqd->rq_in_driver)
+		bfq_schedule_dispatch(bfqd);
+
+out:
+	return;
+}
+
+static inline int __bfq_may_queue(struct bfq_queue *bfqq)
+{
+	if (bfq_bfqq_wait_request(bfqq) && bfq_bfqq_must_alloc(bfqq)) {
+		bfq_clear_bfqq_must_alloc(bfqq);
+		return ELV_MQUEUE_MUST;
+	}
+
+	return ELV_MQUEUE_MAY;
+}
+
+static int bfq_may_queue(struct request_queue *q, int rw)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct task_struct *tsk = current;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	/*
+	 * Don't force setup of a queue from here, as a call to may_queue
+	 * does not necessarily imply that a request actually will be
+	 * queued. So just lookup a possibly existing queue, or return
+	 * 'may queue' if that fails.
+	 */
+	bic = bfq_bic_lookup(bfqd, tsk->io_context);
+	if (bic == NULL)
+		return ELV_MQUEUE_MAY;
+
+	bfqq = bic_to_bfqq(bic, rw_is_sync(rw));
+	if (bfqq != NULL) {
+		bfq_init_prio_data(bfqq, bic);
+
+		return __bfq_may_queue(bfqq);
+	}
+
+	return ELV_MQUEUE_MAY;
+}
+
+/*
+ * Queue lock held here.
+ */
+static void bfq_put_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	if (bfqq != NULL) {
+		const int rw = rq_data_dir(rq);
+
+		bfqq->allocated[rw]--;
+
+		rq->elv.priv[0] = NULL;
+		rq->elv.priv[1] = NULL;
+
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "put_request %p, %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		bfq_put_queue(bfqq);
+	}
+}
+
+/*
+ * Allocate bfq data structures associated with this request.
+ */
+static int bfq_set_request(struct request_queue *q, struct request *rq,
+			   struct bio *bio, gfp_t gfp_mask)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
+	const int rw = rq_data_dir(rq);
+	const int is_sync = rq_is_sync(rq);
+	struct bfq_queue *bfqq;
+	unsigned long flags;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+
+	bfq_changed_ioprio(bic);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	if (bic == NULL)
+		goto queue_fail;
+
+	bfqq = bic_to_bfqq(bic, is_sync);
+	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
+		bfqq = bfq_get_queue(bfqd, is_sync, bic, gfp_mask);
+		bic_set_bfqq(bic, bfqq, is_sync);
+	}
+
+	bfqq->allocated[rw]++;
+	atomic_inc(&bfqq->ref);
+	bfq_log_bfqq(bfqd, bfqq, "set_request: bfqq %p, %d", bfqq,
+		     atomic_read(&bfqq->ref));
+
+	rq->elv.priv[0] = bic;
+	rq->elv.priv[1] = bfqq;
+
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return 0;
+
+queue_fail:
+	bfq_schedule_dispatch(bfqd);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return 1;
+}
+
+static void bfq_kick_queue(struct work_struct *work)
+{
+	struct bfq_data *bfqd =
+		container_of(work, struct bfq_data, unplug_work);
+	struct request_queue *q = bfqd->queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+/*
+ * Handler of the expiration of the timer running if the in-service queue
+ * is idling inside its time slice.
+ */
+static void bfq_idle_slice_timer(unsigned long data)
+{
+	struct bfq_data *bfqd = (struct bfq_data *)data;
+	struct bfq_queue *bfqq;
+	unsigned long flags;
+	enum bfqq_expiration reason;
+
+	spin_lock_irqsave(bfqd->queue->queue_lock, flags);
+
+	bfqq = bfqd->in_service_queue;
+	/*
+	 * Theoretical race here: the in-service queue can be NULL or
+	 * different from the queue that was idling if the timer handler
+	 * spins on the queue_lock and a new request arrives for the
+	 * current queue and there is a full dispatch cycle that changes
+	 * the in-service queue.  This can hardly happen, but in the worst
+	 * case we just expire a queue too early.
+	 */
+	if (bfqq != NULL) {
+		bfq_log_bfqq(bfqd, bfqq, "slice_timer expired");
+		if (bfq_bfqq_budget_timeout(bfqq))
+			/*
+			 * Also here the queue can be safely expired
+			 * for budget timeout without wasting
+			 * guarantees
+			 */
+			reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+		else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0)
+			/*
+			 * The queue may not be empty upon timer expiration,
+			 * because we may not disable the timer when the
+			 * first request of the in-service queue arrives
+			 * during disk idling.
+			 */
+			reason = BFQ_BFQQ_TOO_IDLE;
+		else
+			goto schedule_dispatch;
+
+		bfq_bfqq_expire(bfqd, bfqq, 1, reason);
+	}
+
+schedule_dispatch:
+	bfq_schedule_dispatch(bfqd);
+
+	spin_unlock_irqrestore(bfqd->queue->queue_lock, flags);
+}
+
+static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
+{
+	del_timer_sync(&bfqd->idle_slice_timer);
+	cancel_work_sync(&bfqd->unplug_work);
+}
+
+static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
+					struct bfq_queue **bfqq_ptr)
+{
+	struct bfq_queue *bfqq = *bfqq_ptr;
+
+	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
+	if (bfqq != NULL) {
+		bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		bfq_put_queue(bfqq);
+		*bfqq_ptr = NULL;
+	}
+}
+
+/*
+ * Release the extra reference of the async queues as the device
+ * goes away.
+ */
+static void bfq_put_async_queues(struct bfq_data *bfqd)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			__bfq_put_async_bfqq(bfqd, &async_bfqq[i][j]);
+
+	__bfq_put_async_bfqq(bfqd, &async_idle_bfqq);
+}
+
+static void bfq_exit_queue(struct elevator_queue *e)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	struct request_queue *q = bfqd->queue;
+	struct bfq_queue *bfqq, *n;
+
+	bfq_shutdown_timer_wq(bfqd);
+
+	spin_lock_irq(q->queue_lock);
+
+	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
+		bfq_deactivate_bfqq(bfqd, bfqq, 0);
+
+	bfq_put_async_queues(bfqd);
+	spin_unlock_irq(q->queue_lock);
+
+	bfq_shutdown_timer_wq(bfqd);
+
+	synchronize_rcu();
+
+	kfree(bfqd);
+}
+
+static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+{
+	struct bfq_data *bfqd;
+	struct elevator_queue *eq;
+
+	eq = elevator_alloc(q, e);
+	if (eq == NULL)
+		return -ENOMEM;
+
+	bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
+	if (bfqd == NULL) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+	eq->elevator_data = bfqd;
+
+	/*
+	 * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
+	 * Grab a permanent reference to it, so that the normal code flow
+	 * will not attempt to free it.
+	 */
+	bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, 1, 0);
+	atomic_inc(&bfqd->oom_bfqq.ref);
+
+	bfqd->queue = q;
+
+	spin_lock_irq(q->queue_lock);
+	q->elevator = eq;
+	spin_unlock_irq(q->queue_lock);
+
+	init_timer(&bfqd->idle_slice_timer);
+	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
+	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
+
+	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
+
+	INIT_LIST_HEAD(&bfqd->active_list);
+	INIT_LIST_HEAD(&bfqd->idle_list);
+
+	bfqd->hw_tag = -1;
+
+	bfqd->bfq_max_budget = bfq_default_max_budget;
+
+	bfqd->bfq_quantum = bfq_quantum;
+	bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0];
+	bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1];
+	bfqd->bfq_back_max = bfq_back_max;
+	bfqd->bfq_back_penalty = bfq_back_penalty;
+	bfqd->bfq_slice_idle = bfq_slice_idle;
+	bfqd->bfq_class_idle_last_service = 0;
+	bfqd->bfq_max_budget_async_rq = bfq_max_budget_async_rq;
+	bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async;
+	bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync;
+
+	return 0;
+}
+
+static void bfq_slab_kill(void)
+{
+	if (bfq_pool != NULL)
+		kmem_cache_destroy(bfq_pool);
+}
+
+static int __init bfq_slab_setup(void)
+{
+	bfq_pool = KMEM_CACHE(bfq_queue, 0);
+	if (bfq_pool == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+static ssize_t bfq_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t bfq_var_store(unsigned long *var, const char *page,
+			     size_t count)
+{
+	unsigned long new_val;
+	int ret = kstrtoul(page, 10, &new_val);
+
+	if (ret == 0)
+		*var = new_val;
+
+	return count;
+}
+
+static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
+{
+	struct bfq_queue *bfqq;
+	struct bfq_data *bfqd = e->elevator_data;
+	ssize_t num_char = 0;
+
+	num_char += sprintf(page + num_char, "Tot reqs queued %d\n\n",
+			    bfqd->queued);
+
+	spin_lock_irq(bfqd->queue->queue_lock);
+
+	num_char += sprintf(page + num_char, "Active:\n");
+	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
+		num_char += sprintf(page + num_char,
+				    "pid%d: weight %hu, nr_queued %d %d\n",
+				    bfqq->pid,
+				    bfqq->entity.weight,
+				    bfqq->queued[0],
+				    bfqq->queued[1]);
+	}
+
+	num_char += sprintf(page + num_char, "Idle:\n");
+	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
+		num_char += sprintf(page + num_char,
+				    "pid%d: weight %hu\n",
+				    bfqq->pid,
+				    bfqq->entity.weight);
+	}
+
+	spin_unlock_irq(bfqd->queue->queue_lock);
+
+	return num_char;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct bfq_data *bfqd = e->elevator_data;			\
+	unsigned int __data = __VAR;					\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return bfq_var_show(__data, (page));				\
+}
+SHOW_FUNCTION(bfq_quantum_show, bfqd->bfq_quantum, 0);
+SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 1);
+SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 1);
+SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0);
+SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0);
+SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 1);
+SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
+SHOW_FUNCTION(bfq_max_budget_async_rq_show,
+	      bfqd->bfq_max_budget_async_rq, 0);
+SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
+SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+static ssize_t								\
+__FUNC(struct elevator_queue *e, const char *page, size_t count)	\
+{									\
+	struct bfq_data *bfqd = e->elevator_data;			\
+	unsigned long uninitialized_var(__data);			\
+	int ret = bfq_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(bfq_quantum_store, &bfqd->bfq_quantum, 1, INT_MAX, 0);
+STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
+STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
+		INT_MAX, 0);
+STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq,
+		1, INT_MAX, 0);
+STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
+		INT_MAX, 1);
+#undef STORE_FUNCTION
+
+/* do nothing for the moment */
+static ssize_t bfq_weights_store(struct elevator_queue *e,
+				    const char *page, size_t count)
+{
+	return count;
+}
+
+static inline unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd)
+{
+	u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
+
+	if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES)
+		return bfq_calc_max_budget(bfqd->peak_rate, timeout);
+	else
+		return bfq_default_max_budget;
+}
+
+static ssize_t bfq_max_budget_store(struct elevator_queue *e,
+				    const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data == 0)
+		bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+	else {
+		if (__data > INT_MAX)
+			__data = INT_MAX;
+		bfqd->bfq_max_budget = __data;
+	}
+
+	bfqd->bfq_user_max_budget = __data;
+
+	return ret;
+}
+
+static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
+				      const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data < 1)
+		__data = 1;
+	else if (__data > INT_MAX)
+		__data = INT_MAX;
+
+	bfqd->bfq_timeout[BLK_RW_SYNC] = msecs_to_jiffies(__data);
+	if (bfqd->bfq_user_max_budget == 0)
+		bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+
+	return ret;
+}
+
+#define BFQ_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
+
+static struct elv_fs_entry bfq_attrs[] = {
+	BFQ_ATTR(quantum),
+	BFQ_ATTR(fifo_expire_sync),
+	BFQ_ATTR(fifo_expire_async),
+	BFQ_ATTR(back_seek_max),
+	BFQ_ATTR(back_seek_penalty),
+	BFQ_ATTR(slice_idle),
+	BFQ_ATTR(max_budget),
+	BFQ_ATTR(max_budget_async_rq),
+	BFQ_ATTR(timeout_sync),
+	BFQ_ATTR(timeout_async),
+	BFQ_ATTR(weights),
+	__ATTR_NULL
+};
+
+static struct elevator_type iosched_bfq = {
+	.ops = {
+		.elevator_merge_fn =		bfq_merge,
+		.elevator_merged_fn =		bfq_merged_request,
+		.elevator_merge_req_fn =	bfq_merged_requests,
+		.elevator_allow_merge_fn =	bfq_allow_merge,
+		.elevator_dispatch_fn =		bfq_dispatch_requests,
+		.elevator_add_req_fn =		bfq_insert_request,
+		.elevator_activate_req_fn =	bfq_activate_request,
+		.elevator_deactivate_req_fn =	bfq_deactivate_request,
+		.elevator_completed_req_fn =	bfq_completed_request,
+		.elevator_former_req_fn =	elv_rb_former_request,
+		.elevator_latter_req_fn =	elv_rb_latter_request,
+		.elevator_init_icq_fn =		bfq_init_icq,
+		.elevator_exit_icq_fn =		bfq_exit_icq,
+		.elevator_set_req_fn =		bfq_set_request,
+		.elevator_put_req_fn =		bfq_put_request,
+		.elevator_may_queue_fn =	bfq_may_queue,
+		.elevator_init_fn =		bfq_init_queue,
+		.elevator_exit_fn =		bfq_exit_queue,
+	},
+	.icq_size =		sizeof(struct bfq_io_cq),
+	.icq_align =		__alignof__(struct bfq_io_cq),
+	.elevator_attrs =	bfq_attrs,
+	.elevator_name =	"bfq",
+	.elevator_owner =	THIS_MODULE,
+};
+
+static int __init bfq_init(void)
+{
+	/*
+	 * Can be 0 on HZ < 1000 setups.
+	 */
+	if (bfq_slice_idle == 0)
+		bfq_slice_idle = 1;
+
+	if (bfq_timeout_async == 0)
+		bfq_timeout_async = 1;
+
+	if (bfq_slab_setup())
+		return -ENOMEM;
+
+	elv_register(&iosched_bfq);
+	pr_info("BFQ I/O-scheduler version: v0\n");
+
+	return 0;
+}
+
+static void __exit bfq_exit(void)
+{
+	elv_unregister(&iosched_bfq);
+	bfq_slab_kill();
+}
+
+module_init(bfq_init);
+module_exit(bfq_exit);
+
+MODULE_AUTHOR("Fabio Checconi, Paolo Valente");
+MODULE_LICENSE("GPL");
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
new file mode 100644
index 0000000..a9142f5
--- /dev/null
+++ b/block/bfq-sched.c
@@ -0,0 +1,936 @@
+/*
+ * BFQ: Hierarchical B-WF2Q+ scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+	for (parent = NULL; entity != NULL; entity = parent)
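+
+/*
+ * In this flat version of the scheduler the two macros above visit only
+ * the entity they are given: the loop body runs once and then entity
+ * becomes NULL. They are written as loops so that a hierarchical
+ * version, walking up through the parent group entities, can replace
+ * them without touching their users.
+ */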
+
+static inline int bfq_update_next_in_service(struct bfq_sched_data *sd)
+{
+	return 0;
+}
+
+static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
+					     struct bfq_entity *entity)
+{
+}
+
+static inline void bfq_update_budget(struct bfq_entity *next_in_service)
+{
+}
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time
+ * wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(u64 a, u64 b)
+{
+	return (s64)(a - b) > 0;
+}
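+
+/*
+ * Example of how bfq_gt() survives wraparound: if a has just wrapped to
+ * 10 while b is still at ULLONG_MAX - 10, then (s64)(a - b) equals 21, a
+ * small positive value, so a is correctly reported as the more recent
+ * timestamp even though a < b as plain unsigned values.
+ */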
+
+static inline struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = NULL;
+
+	if (entity->my_sched_data == NULL)
+		bfqq = container_of(entity, struct bfq_queue, entity);
+
+	return bfqq;
+}
+
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor (weight of an entity or weight sum).
+ */
+static inline u64 bfq_delta(unsigned long service,
+					unsigned long weight)
+{
+	u64 d = (u64)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
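+
+/*
+ * For example, bfq_delta(8, 4), i.e., charging 8 sectors of service to a
+ * weight of 4, yields (8 << WFQ_SERVICE_SHIFT) / 4 = 2 << WFQ_SERVICE_SHIFT:
+ * in the virtual time domain each unit of weight "absorbs" one unit of
+ * service, so heavier entities see their timestamps grow more slowly for
+ * the same amount of service.
+ */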
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct bfq_entity *entity,
+				   unsigned long service)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	entity->finish = entity->start +
+		bfq_delta(service, entity->weight);
+
+	if (bfqq != NULL) {
+		bfq_log_bfqq(bfqq->bfqd, bfqq,
+			"calc_finish: serv %lu, w %d",
+			service, entity->weight);
+		bfq_log_bfqq(bfqq->bfqd, bfqq,
+			"calc_finish: start %llu, finish %llu, delta %llu",
+			entity->start, entity->finish,
+			bfq_delta(service, entity->weight));
+	}
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the corresponding entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct bfq_entity *bfq_entity_of(struct rb_node *node)
+{
+	struct bfq_entity *entity = NULL;
+
+	if (node != NULL)
+		entity = rb_entry(node, struct bfq_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_extract(struct rb_root *root,
+			       struct bfq_entity *entity)
+{
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct bfq_service_tree *st,
+			     struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *next;
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = bfq_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = bfq_entity_of(next);
+	}
+
+	bfq_extract(&st->idle, entity);
+
+	if (bfqq != NULL)
+		list_del(&bfqq->bfqq_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct bfq_entity *entity)
+{
+	struct bfq_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct bfq_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of an entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct bfq_entity *entity,
+				  struct rb_node *node)
+{
+	struct bfq_entity *child;
+
+	if (node != NULL) {
+		child = rb_entry(node, struct bfq_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved;
+ * this function updates its min_start value.  The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+	struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (parent == NULL)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right != NULL)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left != NULL)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its
+ *                     group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * in each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct bfq_service_tree *st,
+			      struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
+	if (node->rb_left != NULL)
+		node = node->rb_left;
+	else if (node->rb_right != NULL)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+
+	if (bfqq != NULL)
+		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static inline unsigned short bfq_ioprio_to_weight(int ioprio)
+{
+	return IOPRIO_BE_NR - ioprio;
+}
+
+/**
+ * bfq_weight_to_ioprio - calc an ioprio from a weight.
+ * @weight: the weight value to convert.
+ *
+ * To preserve the old ioprio-only user interface as much as possible,
+ * 0 is used as an escape ioprio value for weights (numerically) equal
+ * to or larger than IOPRIO_BE_NR.
+ */
+static inline unsigned short bfq_weight_to_ioprio(int weight)
+{
+	return IOPRIO_BE_NR - weight < 0 ? 0 : IOPRIO_BE_NR - weight;
+}
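+
+/*
+ * With IOPRIO_BE_NR equal to 8, the two conversions above give: ioprio 0
+ * (the highest priority) -> weight 8, ioprio 7 -> weight 1, and, in the
+ * opposite direction, any weight greater than or equal to 8 is reported
+ * back as ioprio 0.
+ */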
+
+static inline void bfq_get_entity(struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	if (bfqq != NULL) {
+		atomic_inc(&bfqq->ref);
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
+			     bfqq, atomic_read(&bfqq->ref));
+	}
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree, return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (node->rb_right == NULL && node->rb_left == NULL)
+		deepest = rb_parent(node);
+	else if (node->rb_right == NULL)
+		deepest = node->rb_left;
+	else if (node->rb_left == NULL)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right != NULL)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct bfq_service_tree *st,
+			       struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_extract(&st->active, entity);
+
+	if (node != NULL)
+		bfq_update_active_tree(node);
+
+	if (bfqq != NULL)
+		list_del(&bfqq->bfqq_list);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct bfq_service_tree *st,
+			    struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct bfq_entity *first_idle = st->first_idle;
+	struct bfq_entity *last_idle = st->last_idle;
+
+	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+
+	if (bfqq != NULL)
+		list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct bfq_service_tree *st,
+			      struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct bfq_sched_data *sd;
+
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	if (bfqq != NULL) {
+		sd = entity->sched_data;
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "forget_entity: %p %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		bfq_put_queue(bfqq);
+	}
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+static void bfq_put_idle_entity(struct bfq_service_tree *st,
+				struct bfq_entity *entity)
+{
+	bfq_idle_extract(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+static void bfq_forget_idle(struct bfq_service_tree *st)
+{
+	struct bfq_entity *first_idle = st->first_idle;
+	struct bfq_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * Forget the whole idle tree, increasing the vtime past
+		 * the last finish time of idle entities.
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+static struct bfq_service_tree *
+__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
+			 struct bfq_entity *entity)
+{
+	struct bfq_service_tree *new_st = old_st;
+
+	if (entity->ioprio_changed) {
+		old_st->wsum -= entity->weight;
+
+		if (entity->new_weight != entity->orig_weight) {
+			entity->orig_weight = entity->new_weight;
+			entity->ioprio =
+				bfq_weight_to_ioprio(entity->orig_weight);
+		} else if (entity->new_ioprio != entity->ioprio) {
+			entity->ioprio = entity->new_ioprio;
+			entity->orig_weight =
+					bfq_ioprio_to_weight(entity->ioprio);
+		} else
+			entity->new_weight = entity->orig_weight =
+				bfq_ioprio_to_weight(entity->ioprio);
+
+		entity->ioprio_class = entity->new_ioprio_class;
+		entity->ioprio_changed = 0;
+
+		/*
+		 * NOTE: here we may be changing the weight too early,
+		 * this will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = bfq_entity_service_tree(entity);
+		entity->weight = entity->orig_weight;
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * bfq_bfqq_served - update the scheduler status after selection for
+ *                   service.
+ * @bfqq: the queue being served.
+ * @served: bytes to transfer.
+ *
+ * NOTE: this can be optimized, as the timestamps of upper level entities
+ * are synchronized every time a new bfqq is selected for service.  For now,
+ * we keep it like this to better check consistency.
+ */
+static void bfq_bfqq_served(struct bfq_queue *bfqq, unsigned long served)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	struct bfq_service_tree *st;
+
+	for_each_entity(entity) {
+		st = bfq_entity_service_tree(entity);
+
+		entity->service += served;
+
+		st->vtime += bfq_delta(served, st->wsum);
+		bfq_forget_idle(st);
+	}
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %lu secs", served);
+}
+
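The update above advances the virtual time by the service received divided by the aggregate weight of the competing entities. A minimal sketch of that bookkeeping, assuming bfq_delta() is (service << WFQ_SERVICE_SHIFT) / wsum as defined earlier in the series (an assumption of this sketch):

#include <stdio.h>

#define WFQ_SERVICE_SHIFT 22	/* assumed scaling factor */

/* assumed shape of bfq_delta(): weight-scaled service */
static unsigned long long toy_bfq_delta(unsigned long service,
					unsigned long wsum)
{
	return ((unsigned long long)service << WFQ_SERVICE_SHIFT) / wsum;
}

int main(void)
{
	unsigned long long vtime = 0;
	unsigned long wsum = 30;	/* e.g., three queues of weight 10 */

	/* serving 4096 sectors advances vtime by 4096 / wsum, scaled */
	vtime += toy_bfq_delta(4096, wsum);
	printf("vtime after serving 4096 sectors: %llu\n", vtime);

	return 0;
}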
+/**
+ * bfq_bfqq_charge_full_budget - set the service to the entity budget.
+ * @bfqq: the queue that needs a service update.
+ *
+ * When it's not possible to be fair in the service domain, because
+ * a queue is not consuming its budget fast enough (the meaning of
+ * fast depends on the timeout parameter), we charge it a full
+ * budget.  In this way we should obtain a sort of time-domain
+ * fairness among all the seeky/slow queues.
+ */
+static inline void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget");
+
+	bfq_bfqq_served(bfqq, entity->budget - entity->service);
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the current budget of the entity (and the
+ * service already received, if @entity is in service) to compute its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sd = entity->sched_data;
+	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+
+	if (entity == sd->in_service_entity) {
+		/*
+		 * If we are requeueing the current entity we have
+		 * to take care of not charging to it service it has
+		 * not received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->in_service_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_in_service entity below it.  We reuse the
+		 * old start time.
+		 */
+		bfq_active_extract(st, entity);
+	} else if (entity->tree == &st->idle) {
+		/*
+		 * Must be on the idle tree, bfq_idle_extract() will
+		 * check for that.
+		 */
+		bfq_idle_extract(st, entity);
+		entity->start = bfq_gt(st->vtime, entity->finish) ?
+				       st->vtime : entity->finish;
+	} else {
+		/*
+		 * The finish time of the entity may be invalid, and
+		 * it is in the past for sure, otherwise the queue
+		 * would have been on the idle tree.
+		 */
+		entity->start = st->vtime;
+		st->wsum += entity->weight;
+		bfq_get_entity(entity);
+
+		entity->on_st = 1;
+	}
+
+	st = __bfq_entity_update_weight_prio(st, entity);
+	bfq_calc_finish(entity, entity->budget);
+	bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
+ * @entity: the entity to activate.
+ *
+ * Activate @entity and all the entities on the path from it to the root.
+ */
+static void bfq_activate_entity(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sd;
+
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_in_service(sd))
+			/*
+			 * No need to propagate the activation to the
+			 * upper entities, as they will be updated when
+			 * the in-service entity is rescheduled.
+			 */
+			break;
+	}
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently of its previous state.  If the
+ * entity was not on a service tree, just return; otherwise extract it
+ * from the tree it is on and, if the caller specified @requeue and the
+ * entity's finish time is still in the future, put it on the idle tree.
+ *
+ * Return %1 if the caller should update the entity hierarchy, i.e.,
+ * if the entity was in service or if it was the next_in_service for
+ * its sched_data; return %0 otherwise.
+ */
+static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+{
+	struct bfq_sched_data *sd = entity->sched_data;
+	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+	int was_in_service = entity == sd->in_service_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	if (was_in_service) {
+		bfq_calc_finish(entity, entity->service);
+		sd->in_service_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_extract(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_extract(st, entity);
+
+	if (was_in_service || sd->next_in_service == entity)
+		ret = bfq_update_next_in_service(sd);
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+{
+	struct bfq_sched_data *sd;
+	struct bfq_entity *parent;
+
+	for_each_entity_safe(entity, parent) {
+		sd = entity->sched_data;
+
+		if (!__bfq_deactivate_entity(entity, requeue))
+			/*
+			 * The parent entity is still backlogged, and
+			 * we don't need to update it as it is still
+			 * in service.
+			 */
+			break;
+
+		if (sd->next_in_service != NULL)
+			/*
+			 * The parent entity is still backlogged and
+			 * the budgets on the path towards the root
+			 * need to be updated.
+			 */
+			goto update;
+
+		/*
+		 * If we reach this point, the parent is no longer backlogged
+		 * and we want to propagate the dequeue upwards.
+		 */
+		requeue = 1;
+	}
+
+	return;
+
+update:
+	entity = parent;
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_in_service(sd))
+			break;
+	}
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated processes getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
+	entry = rb_entry(node, struct bfq_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active_entity - find the eligible entity with
+ *                           the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches the first schedulable entity, starting from the
+ * root of the tree and going on the left every time on this side there is
+ * a subtree with at least one eligible (start <= vtime) entity. The path on
+ * the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
+	while (node != NULL) {
+		entry = rb_entry(node, struct bfq_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		if (node->rb_left != NULL) {
+			entry = rb_entry(node->rb_left,
+					 struct bfq_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first != NULL)
+			break;
+		node = node->rb_right;
+	}
+
+	return first;
+}
+
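The selection rule above is the core of B-WF2Q+: among the entities whose start time is not after the current vtime (the eligible ones), pick the one with the smallest finish time. A minimal sketch over a plain array, purely illustrative (the function above performs the same search in O(log N) thanks to the min_start annotation of the active tree, and uses the wraparound-safe bfq_gt() comparison, which this sketch ignores):

#include <stdio.h>

struct toy_entity {
	unsigned long long start;	/* S_i */
	unsigned long long finish;	/* F_i */
};

/* eligible (start <= vtime) entity with the smallest finish time */
static const struct toy_entity *pick_next(const struct toy_entity *e,
					  int n, unsigned long long vtime)
{
	const struct toy_entity *best = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (e[i].start > vtime)
			continue;	/* not eligible yet */
		if (best == NULL || e[i].finish < best->finish)
			best = &e[i];
	}
	return best;
}

int main(void)
{
	struct toy_entity e[] = {
		{ .start = 0,  .finish = 40 },
		{ .start = 10, .finish = 25 },
		{ .start = 90, .finish = 95 },	/* not yet eligible */
	};
	const struct toy_entity *next = pick_next(e, 3, 20);

	printf("next finish: %llu\n", next ? next->finish : 0);
	return 0;
}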
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
+						   bool force)
+{
+	struct bfq_entity *entity, *new_next_in_service = NULL;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+
+	/*
+	 * If the chosen entity does not match with the sched_data's
+	 * next_in_service and we are forcedly serving the IDLE priority
+	 * class tree, bubble up budget update.
+	 */
+	if (unlikely(force && entity != entity->sched_data->next_in_service)) {
+		new_next_in_service = entity;
+		for_each_entity(new_next_in_service)
+			bfq_update_budget(new_next_in_service);
+	}
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_in_service entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort by just returning the cached next_in_service
+ * value; we prefer to do full lookups to test the consistency of the
+ * data structures.
+ */
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
+						 int extract,
+						 struct bfq_data *bfqd)
+{
+	struct bfq_service_tree *st = sd->service_tree;
+	struct bfq_entity *entity;
+	int i = 0;
+
+	if (bfqd != NULL &&
+	    jiffies - bfqd->bfq_class_idle_last_service > BFQ_CL_IDLE_TIMEOUT) {
+		entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
+						  true);
+		if (entity != NULL) {
+			i = BFQ_IOPRIO_CLASSES - 1;
+			bfqd->bfq_class_idle_last_service = jiffies;
+			sd->next_in_service = entity;
+		}
+	}
+	for (; i < BFQ_IOPRIO_CLASSES; i++) {
+		entity = __bfq_lookup_next_entity(st + i, false);
+		if (entity != NULL) {
+			if (extract) {
+				bfq_check_next_in_service(sd, entity);
+				bfq_active_extract(st + i, entity);
+				sd->in_service_entity = entity;
+				sd->next_in_service = NULL;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+/*
+ * Get next queue for service.
+ */
+static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+{
+	struct bfq_entity *entity = NULL;
+	struct bfq_sched_data *sd;
+	struct bfq_queue *bfqq;
+
+	if (bfqd->busy_queues == 0)
+		return NULL;
+
+	sd = &bfqd->sched_data;
+	for (; sd != NULL; sd = entity->my_sched_data) {
+		entity = bfq_lookup_next_entity(sd, 1, bfqd);
+		entity->service = 0;
+	}
+
+	bfqq = bfq_entity_to_bfqq(entity);
+
+	return bfqq;
+}
+
+static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+{
+	if (bfqd->in_service_bic != NULL) {
+		put_io_context(bfqd->in_service_bic->icq.ioc);
+		bfqd->in_service_bic = NULL;
+	}
+
+	bfqd->in_service_queue = NULL;
+	del_timer(&bfqd->idle_slice_timer);
+}
+
+static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				int requeue)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	if (bfqq == bfqd->in_service_queue)
+		__bfq_bfqd_reset_in_service(bfqd);
+
+	bfq_deactivate_entity(entity, requeue);
+}
+
+static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_activate_entity(entity);
+}
+
+/*
+ * Called when the bfqq no longer has requests pending, remove it from
+ * the service tree.
+ */
+static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			      int requeue)
+{
+	bfq_log_bfqq(bfqd, bfqq, "del from busy");
+
+	bfq_clear_bfqq_busy(bfqq);
+
+	bfqd->busy_queues--;
+
+	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
+}
+
+/*
+ * Called when an inactive queue receives a new request.
+ */
+static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqd, bfqq, "add to busy");
+
+	bfq_activate_bfqq(bfqd, bfqq);
+
+	bfq_mark_bfqq_busy(bfqq);
+	bfqd->busy_queues++;
+}
diff --git a/block/bfq.h b/block/bfq.h
new file mode 100644
index 0000000..bd146b6
--- /dev/null
+++ b/block/bfq.h
@@ -0,0 +1,467 @@
+/*
+ * BFQ-v0 for 3.15.0: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#ifndef _BFQ_H
+#define _BFQ_H
+
+#include <linux/blktrace_api.h>
+#include <linux/hrtimer.h>
+#include <linux/ioprio.h>
+#include <linux/rbtree.h>
+
+#define BFQ_IOPRIO_CLASSES	3
+#define BFQ_CL_IDLE_TIMEOUT	(HZ/5)
+
+#define BFQ_MIN_WEIGHT	1
+#define BFQ_MAX_WEIGHT	1000
+
+#define BFQ_DEFAULT_GRP_WEIGHT	10
+#define BFQ_DEFAULT_GRP_IOPRIO	0
+#define BFQ_DEFAULT_GRP_CLASS	IOPRIO_CLASS_BE
+
+struct bfq_entity;
+
+/**
+ * struct bfq_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * bfq_service_tree.  All the fields are protected by the queue lock
+ * of the containing bfqd.
+ */
+struct bfq_service_tree {
+	struct rb_root active;
+	struct rb_root idle;
+
+	struct bfq_entity *first_idle;
+	struct bfq_entity *last_idle;
+
+	u64 vtime;
+	unsigned long wsum;
+};
+
+/**
+ * struct bfq_sched_data - multi-class scheduler.
+ * @in_service_entity: entity in service.
+ * @next_in_service: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * bfq_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_in_service points to the active entity of the sched_data
+ * service trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among requests of the same
+ * queue requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct bfq_sched_data {
+	struct bfq_entity *in_service_entity;
+	struct bfq_entity *next_in_service;
+	struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
+};
+
+/**
+ * struct bfq_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ *         the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ *             this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the hierarchy, the
+ *                 associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_weight: when a weight change is requested, the new weight value.
+ * @orig_weight: original weight, used to implement weight boosting
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value.
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ *                    ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested a weight, ioprio or
+ *                  ioprio_class change.
+ *
+ * A bfq_entity is used to represent a bfq_queue (leaf node in the upper
+ * level scheduler). Each entity belongs to the sched_data of the parent
+ * group hierarchy. Non-leaf entities have also their own sched_data,
+ * stored in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would
+ * allow different weights on different devices, but this
+ * functionality is not exported to userspace for now.  Priorities and
+ * weights are updated lazily, first storing the new values into the
+ * new_* fields, then setting the @ioprio_changed flag.  As soon as
+ * there is a transition in the entity state that allows the priority
+ * update to take place the effective and the requested priority
+ * values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.  When dealing with  ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time to consume their budget
+ * and have true sequential behavior, and when there are no external
+ * factors breaking anticipation) the relative weights at each level
+ * of the hierarchy should be guaranteed.  All the fields are
+ * protected by the queue lock of the containing bfqd.
+ */
+struct bfq_entity {
+	struct rb_node rb_node;
+
+	int on_st;
+
+	u64 finish;
+	u64 start;
+
+	struct rb_root *tree;
+
+	u64 min_start;
+
+	unsigned long service, budget;
+	unsigned short weight, new_weight;
+	unsigned short orig_weight;
+
+	struct bfq_entity *parent;
+
+	struct bfq_sched_data *my_sched_data;
+	struct bfq_sched_data *sched_data;
+
+	unsigned short ioprio, new_ioprio;
+	unsigned short ioprio_class, new_ioprio_class;
+
+	int ioprio_changed;
+};
+
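The @budget line above gives the finish-time rule used throughout the scheduler: F_i = S_i + budget / weight, expressed in the same weight-scaled virtual-time units as the vtime. A minimal sketch, assuming the same scaled division used for the vtime (an assumption of this sketch):

#include <stdio.h>

#define WFQ_SERVICE_SHIFT 22	/* assumed scaling factor */

/* F_i = S_i + budget / weight, in scaled virtual-time units */
static unsigned long long toy_calc_finish(unsigned long long start,
					  unsigned long budget,
					  unsigned short weight)
{
	return start + (((unsigned long long)budget << WFQ_SERVICE_SHIFT) /
			weight);
}

int main(void)
{
	/* doubling the weight makes the finish time grow half as fast */
	printf("w=10: F = %llu\n", toy_calc_finish(0, 8192, 10));
	printf("w=20: F = %llu\n", toy_calc_finish(0, 8192, 20));
	return 0;
}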
+/**
+ * struct bfq_queue - leaf schedulable entity.
+ * @ref: reference counter.
+ * @bfqd: parent bfq_data.
+ * @sort_list: sorted list of pending requests.
+ * @next_rq: if fifo isn't expired, next request to serve.
+ * @queued: nr of requests queued in @sort_list.
+ * @allocated: currently allocated requests.
+ * @meta_pending: pending metadata requests.
+ * @fifo: fifo list of requests in sort_list.
+ * @entity: entity representing this queue in the scheduler.
+ * @max_budget: maximum budget allowed from the feedback mechanism.
+ * @budget_timeout: budget expiration (in jiffies).
+ * @dispatched: number of requests on the dispatch list or inside driver.
+ * @flags: status flags.
+ * @bfqq_list: node for active/idle bfqq list inside our bfqd.
+ * @seek_samples: number of seeks sampled
+ * @seek_total: sum of the distances of the seeks sampled
+ * @seek_mean: mean seek distance
+ * @last_request_pos: position of the last request enqueued
+ * @pid: pid of the process owning the queue, used for logging purposes.
+ *
+ * A bfq_queue is a leaf request queue; it can be associated with one or
+ * more io_contexts, if it is async.
+ */
+struct bfq_queue {
+	atomic_t ref;
+	struct bfq_data *bfqd;
+
+	struct rb_root sort_list;
+	struct request *next_rq;
+	int queued[2];
+	int allocated[2];
+	int meta_pending;
+	struct list_head fifo;
+
+	struct bfq_entity entity;
+
+	unsigned long max_budget;
+	unsigned long budget_timeout;
+
+	int dispatched;
+
+	unsigned int flags;
+
+	struct list_head bfqq_list;
+
+	unsigned int seek_samples;
+	u64 seek_total;
+	sector_t seek_mean;
+	sector_t last_request_pos;
+
+	pid_t pid;
+};
+
+/**
+ * struct bfq_ttime - per process thinktime stats.
+ * @ttime_total: total process thinktime
+ * @ttime_samples: number of thinktime samples
+ * @ttime_mean: average process thinktime
+ */
+struct bfq_ttime {
+	unsigned long last_end_request;
+
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+};
+
+/**
+ * struct bfq_io_cq - per (request_queue, io_context) structure.
+ * @icq: associated io_cq structure
+ * @bfqq: array of two process queues, the sync and the async
+ * @ttime: associated @bfq_ttime struct
+ */
+struct bfq_io_cq {
+	struct io_cq icq; /* must be the first member */
+	struct bfq_queue *bfqq[2];
+	struct bfq_ttime ttime;
+	int ioprio;
+};
+
+enum bfq_device_speed {
+	BFQ_BFQD_FAST,
+	BFQ_BFQD_SLOW,
+};
+
+/**
+ * struct bfq_data - per device data structure.
+ * @queue: request queue for the managed device.
+ * @sched_data: root @bfq_sched_data for the device.
+ * @busy_queues: number of bfq_queues containing requests (including the
+ *		 queue in service, even if it is idling).
+ * @queued: number of queued requests.
+ * @rq_in_driver: number of requests dispatched and waiting for completion.
+ * @sync_flight: number of sync requests in the driver.
+ * @max_rq_in_driver: max number of reqs in driver in the last
+ *                    @hw_tag_samples completed requests.
+ * @hw_tag_samples: nr of samples used to calculate hw_tag.
+ * @hw_tag: flag set to one if the driver is showing a queueing behavior.
+ * @budgets_assigned: number of budgets assigned.
+ * @idle_slice_timer: timer set when idling for the next sequential request
+ *                    from the queue in service.
+ * @unplug_work: delayed work to restart dispatching on the request queue.
+ * @in_service_queue: bfq_queue in service.
+ * @in_service_bic: bfq_io_cq (bic) associated with the @in_service_queue.
+ * @last_position: on-disk position of the last served request.
+ * @last_budget_start: beginning of the last budget.
+ * @last_idling_start: beginning of the last idle slice.
+ * @peak_rate: peak transfer rate observed for a budget.
+ * @peak_rate_samples: number of samples used to calculate @peak_rate.
+ * @bfq_max_budget: maximum budget allotted to a bfq_queue before
+ *                  rescheduling.
+ * @active_list: list of all the bfq_queues active on the device.
+ * @idle_list: list of all the bfq_queues idle on the device.
+ * @bfq_quantum: max number of requests dispatched per dispatch round.
+ * @bfq_fifo_expire: timeout for async/sync requests; when it expires
+ *                   requests are served in fifo order.
+ * @bfq_back_penalty: weight of backward seeks wrt forward ones.
+ * @bfq_back_max: maximum allowed backward seek.
+ * @bfq_slice_idle: maximum idling time.
+ * @bfq_user_max_budget: user-configured max budget value
+ *                       (0 for auto-tuning).
+ * @bfq_max_budget_async_rq: maximum budget (in nr of requests) allotted to
+ *                           async queues.
+ * @bfq_timeout: timeout for bfq_queues to consume their budget; used to
+ *               prevent seeky queues from imposing long latencies on
+ *               well-behaved ones (this also implies that seeky queues cannot
+ *               receive guarantees in the service domain; after a timeout
+ *               they are charged for the whole allocated budget, to try
+ *               to preserve a behavior reasonably fair among them, but
+ *               without service-domain guarantees).
+ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
+ *
+ * All the fields are protected by the @queue lock.
+ */
+struct bfq_data {
+	struct request_queue *queue;
+
+	struct bfq_sched_data sched_data;
+
+	int busy_queues;
+	int queued;
+	int rq_in_driver;
+	int sync_flight;
+
+	int max_rq_in_driver;
+	int hw_tag_samples;
+	int hw_tag;
+
+	int budgets_assigned;
+
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
+	struct bfq_queue *in_service_queue;
+	struct bfq_io_cq *in_service_bic;
+
+	sector_t last_position;
+
+	ktime_t last_budget_start;
+	ktime_t last_idling_start;
+	int peak_rate_samples;
+	u64 peak_rate;
+	unsigned long bfq_max_budget;
+
+	struct list_head active_list;
+	struct list_head idle_list;
+
+	unsigned int bfq_quantum;
+	unsigned int bfq_fifo_expire[2];
+	unsigned int bfq_back_penalty;
+	unsigned int bfq_back_max;
+	unsigned int bfq_slice_idle;
+	u64 bfq_class_idle_last_service;
+
+	unsigned int bfq_user_max_budget;
+	unsigned int bfq_max_budget_async_rq;
+	unsigned int bfq_timeout[2];
+
+	struct bfq_queue oom_bfqq;
+};
+
+enum bfqq_state_flags {
+	BFQ_BFQQ_FLAG_busy = 0,		/* has requests or is in service */
+	BFQ_BFQQ_FLAG_wait_request,	/* waiting for a request */
+	BFQ_BFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
+	BFQ_BFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
+	BFQ_BFQQ_FLAG_idle_window,	/* slice idling enabled */
+	BFQ_BFQQ_FLAG_prio_changed,	/* task priority has changed */
+	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */
+	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this budget */
+};
+
+#define BFQ_BFQQ_FNS(name)						\
+static inline void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)		\
+{									\
+	(bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name);			\
+}									\
+static inline void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)	\
+{									\
+	(bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name);			\
+}									\
+static inline int bfq_bfqq_##name(const struct bfq_queue *bfqq)		\
+{									\
+	return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0;	\
+}
+
+BFQ_BFQQ_FNS(busy);
+BFQ_BFQQ_FNS(wait_request);
+BFQ_BFQQ_FNS(must_alloc);
+BFQ_BFQQ_FNS(fifo_expire);
+BFQ_BFQQ_FNS(idle_window);
+BFQ_BFQQ_FNS(prio_changed);
+BFQ_BFQQ_FNS(sync);
+BFQ_BFQQ_FNS(budget_new);
+#undef BFQ_BFQQ_FNS
+
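The macro above stamps out a mark/clear/test helper triple for each flag, e.g. bfq_mark_bfqq_busy(), bfq_clear_bfqq_busy() and bfq_bfqq_busy(). A stand-alone sketch of the same pattern, reduced to a toy structure (illustrative only):

#include <stdio.h>

struct toy_queue {
	unsigned int flags;
};

enum { TOY_FLAG_busy = 0 };

#define TOY_FNS(name)							\
static void toy_mark_##name(struct toy_queue *q)			\
{									\
	q->flags |= (1 << TOY_FLAG_##name);				\
}									\
static void toy_clear_##name(struct toy_queue *q)			\
{									\
	q->flags &= ~(1 << TOY_FLAG_##name);				\
}									\
static int toy_##name(const struct toy_queue *q)			\
{									\
	return (q->flags & (1 << TOY_FLAG_##name)) != 0;		\
}

TOY_FNS(busy)
#undef TOY_FNS

int main(void)
{
	struct toy_queue q = { .flags = 0 };

	toy_mark_busy(&q);
	printf("busy=%d\n", toy_busy(&q));
	toy_clear_busy(&q);
	printf("busy=%d\n", toy_busy(&q));
	return 0;
}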
+/* Logging facilities. */
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
+	blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
+
+#define bfq_log(bfqd, fmt, args...) \
+	blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
+
+/* Expiration reasons. */
+enum bfqq_expiration {
+	BFQ_BFQQ_TOO_IDLE = 0,		/*
+					 * queue has been idling for
+					 * too long
+					 */
+	BFQ_BFQQ_BUDGET_TIMEOUT,	/* budget took too long to be used */
+	BFQ_BFQQ_BUDGET_EXHAUSTED,	/* budget consumed */
+	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
+};
+
+static inline struct bfq_service_tree *
+bfq_entity_service_tree(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sched_data = entity->sched_data;
+	unsigned int idx = entity->ioprio_class - 1;
+
+	return sched_data->service_tree + idx;
+}
+
+static inline struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic,
+					    int is_sync)
+{
+	return bic->bfqq[!!is_sync];
+}
+
+static inline void bic_set_bfqq(struct bfq_io_cq *bic,
+				struct bfq_queue *bfqq, int is_sync)
+{
+	bic->bfqq[!!is_sync] = bfqq;
+}
+
+static inline struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
+{
+	return bic->icq.q->elevator->elevator_data;
+}
+
+/**
+ * bfq_get_bfqd_locked - get a lock to a bfqd using a RCU protected pointer.
+ * @ptr: a pointer to a bfqd.
+ * @flags: storage for the flags to be saved.
+ *
+ * This function allows bfqg->bfqd to be protected by the
+ * queue lock of the bfqd they reference; the pointer is dereferenced
+ * under RCU, so the storage for bfqd is assured to be safe as long
+ * as the RCU read side critical section does not end.  After the
+ * bfqd->queue->queue_lock is taken the pointer is rechecked, to be
+ * sure that no other writer accessed it.  If we raced with a writer,
+ * the function returns NULL, with the queue unlocked, otherwise it
+ * returns the dereferenced pointer, with the queue locked.
+ */
+static inline struct bfq_data *bfq_get_bfqd_locked(void **ptr,
+						   unsigned long *flags)
+{
+	struct bfq_data *bfqd;
+
+	rcu_read_lock();
+	bfqd = rcu_dereference(*(struct bfq_data **)ptr);
+
+	if (bfqd != NULL) {
+		spin_lock_irqsave(bfqd->queue->queue_lock, *flags);
+		if (*ptr == bfqd)
+			goto out;
+		spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
+	}
+
+	bfqd = NULL;
+out:
+	rcu_read_unlock();
+	return bfqd;
+}
+
+static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd,
+				       unsigned long *flags)
+{
+	spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
+}
+
+static void bfq_changed_ioprio(struct bfq_io_cq *bic);
+static void bfq_put_queue(struct bfq_queue *bfqq);
+static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, int is_sync,
+				       struct bfq_io_cq *bic, gfp_t gfp_mask);
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
+
+#endif /* _BFQ_H */
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 03/14] block: add hierarchical-support option to kconfig
@ 2014-05-27 12:42     ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Fabio Checconi <fchecconi@gmail.com>

Add the CGROUP_BFQIO option to Kconfig.iosched. This option allows
full hierarchical support to be enabled in BFQ, and the bfqio
controller to be added to the cgroups interface.

Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/Kconfig.iosched | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8f98cc7..a3675cb 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -46,7 +46,18 @@ config IOSCHED_BFQ
 	  The BFQ I/O scheduler tries to distribute bandwidth among all
 	  processes according to their weights.
 	  It aims at distributing the bandwidth as desired, regardless
-	  of the disk parameters and with any workload.
+	  of the disk parameters and with any workload. If compiled
+	  built-in (saying Y here), BFQ can be configured to support
+	  hierarchical scheduling.
+
+config CGROUP_BFQIO
+	bool "BFQ hierarchical scheduling support"
+	depends on CGROUPS && IOSCHED_BFQ=y
+	default n
+	---help---
+	  Enable hierarchical scheduling in BFQ, using the cgroups
+	  filesystem interface.  The name of the subsystem will be
+	  bfqio.
 
 choice
 	prompt "Default I/O scheduler"
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 04/14] block, bfq: add full hierarchical scheduling and cgroups support
  2014-05-27 12:42 ` paolo
@ 2014-05-27 12:42     ` paolo
  -1 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

From: Fabio Checconi <fchecconi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Complete support for full hierarchical scheduling, with a cgroups
interface. The name of the new subsystem is bfqio.

Weights can be assigned explicitly to groups and processes through the
cgroups interface, differently from what happens, for single
processes, if the cgroups interface is not used (as explained in the
description of patch 2). In particular, since each node has a full
scheduler, each group can be assigned its own weight.

Signed-off-by: Fabio Checconi <fchecconi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/bfq-cgroup.c            | 891 ++++++++++++++++++++++++++++++++++++++++++
 block/bfq-iosched.c           |  66 ++--
 block/bfq-sched.c             |  64 ++-
 block/bfq.h                   | 122 +++++-
 include/linux/cgroup_subsys.h |   4 +
 5 files changed, 1099 insertions(+), 48 deletions(-)
 create mode 100644 block/bfq-cgroup.c

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
new file mode 100644
index 0000000..00a7a1b
--- /dev/null
+++ b/block/bfq-cgroup.c
@@ -0,0 +1,891 @@
+/*
+ * BFQ: CGROUPS support.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ *
+ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
+ * file.
+ */
+
+#ifdef CONFIG_CGROUP_BFQIO
+
+static DEFINE_MUTEX(bfqio_mutex);
+
+static bool bfqio_is_removed(struct bfqio_cgroup *bgrp)
+{
+	return bgrp ? !bgrp->online : false;
+}
+
+static struct bfqio_cgroup bfqio_root_cgroup = {
+	.weight = BFQ_DEFAULT_GRP_WEIGHT,
+	.ioprio = BFQ_DEFAULT_GRP_IOPRIO,
+	.ioprio_class = BFQ_DEFAULT_GRP_CLASS,
+};
+
+static inline void bfq_init_entity(struct bfq_entity *entity,
+				   struct bfq_group *bfqg)
+{
+	entity->weight = entity->new_weight;
+	entity->orig_weight = entity->new_weight;
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->parent = bfqg->my_entity;
+	entity->sched_data = &bfqg->sched_data;
+}
+
+static struct bfqio_cgroup *css_to_bfqio(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct bfqio_cgroup, css) : NULL;
+}
+
+/*
+ * Search for the bfq_group associated with bfqd in the hash table (for
+ * now only a list) of bgrp.  Must be called under rcu_read_lock().
+ */
+static struct bfq_group *bfqio_lookup_group(struct bfqio_cgroup *bgrp,
+					    struct bfq_data *bfqd)
+{
+	struct bfq_group *bfqg;
+	void *key;
+
+	hlist_for_each_entry_rcu(bfqg, &bgrp->group_data, group_node) {
+		key = rcu_dereference(bfqg->bfqd);
+		if (key == bfqd)
+			return bfqg;
+	}
+
+	return NULL;
+}
+
+static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
+					 struct bfq_group *bfqg)
+{
+	struct bfq_entity *entity = &bfqg->entity;
+
+	/*
+	 * If the weight of the entity has never been set via the sysfs
+	 * interface, then bgrp->weight == 0. In this case we initialize
+	 * the weight from the current ioprio value. Otherwise, the group
+	 * weight, if set, has priority over the ioprio value.
+	 */
+	if (bgrp->weight == 0) {
+		entity->new_weight = bfq_ioprio_to_weight(bgrp->ioprio);
+		entity->new_ioprio = bgrp->ioprio;
+	} else {
+		entity->new_weight = bgrp->weight;
+		entity->new_ioprio = bfq_weight_to_ioprio(bgrp->weight);
+	}
+	entity->orig_weight = entity->weight = entity->new_weight;
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
+	entity->my_sched_data = &bfqg->sched_data;
+}
+
+static inline void bfq_group_set_parent(struct bfq_group *bfqg,
+					struct bfq_group *parent)
+{
+	struct bfq_entity *entity;
+
+	entity = &bfqg->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+}
+
+/**
+ * bfq_group_chain_alloc - allocate a chain of groups.
+ * @bfqd: queue descriptor.
+ * @css: the leaf cgroup_subsys_state this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup.  Stop if a cgroup on the chain
+ * to the root already has an allocated group on @bfqd.
+ */
+static struct bfq_group *bfq_group_chain_alloc(struct bfq_data *bfqd,
+					       struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp;
+	struct bfq_group *bfqg, *prev = NULL, *leaf = NULL;
+
+	for (; css != NULL; css = css->parent) {
+		bgrp = css_to_bfqio(css);
+
+		bfqg = bfqio_lookup_group(bgrp, bfqd);
+		if (bfqg != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a bfq_group for bfqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		bfqg = kzalloc(sizeof(*bfqg), GFP_ATOMIC);
+		if (bfqg == NULL)
+			goto cleanup;
+
+		bfq_group_init_entity(bgrp, bfqg);
+		bfqg->my_entity = &bfqg->entity;
+
+		if (leaf == NULL) {
+			leaf = bfqg;
+			prev = leaf;
+		} else {
+			bfq_group_set_parent(prev, bfqg);
+			/*
+			 * Build a list of allocated nodes using the bfqd
+			 * field, which is still unused and will be
+			 * initialized only after the node is
+			 * connected.
+			 */
+			prev->bfqd = bfqg;
+			prev = bfqg;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->bfqd;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+/**
+ * bfq_group_chain_link - link an allocated group chain to a cgroup
+ *                        hierarchy.
+ * @bfqd: the queue descriptor.
+ * @css: the leaf cgroup_subsys_state to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @bfqd, all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the bfqio_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+static void bfq_group_chain_link(struct bfq_data *bfqd,
+				 struct cgroup_subsys_state *css,
+				 struct bfq_group *leaf)
+{
+	struct bfqio_cgroup *bgrp;
+	struct bfq_group *bfqg, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(bfqd->queue->queue_lock);
+
+	for (; css != NULL && leaf != NULL; css = css->parent) {
+		bgrp = css_to_bfqio(css);
+		next = leaf->bfqd;
+
+		bfqg = bfqio_lookup_group(bgrp, bfqd);
+
+		spin_lock_irqsave(&bgrp->lock, flags);
+
+		rcu_assign_pointer(leaf->bfqd, bfqd);
+		hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
+		hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);
+
+		spin_unlock_irqrestore(&bgrp->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	if (css != NULL && prev != NULL) {
+		bgrp = css_to_bfqio(css);
+		bfqg = bfqio_lookup_group(bgrp, bfqd);
+		bfq_group_set_parent(prev, bfqg);
+	}
+}
+
+/**
+ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
+ * @bfqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ *
+ * Return a group associated to @bfqd in @cgroup, allocating one if
+ * necessary.  When a group is returned all the cgroups in the path
+ * to the root have a group associated to @bfqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback.  If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd,
+					      struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+	struct bfq_group *bfqg;
+
+	bfqg = bfqio_lookup_group(bgrp, bfqd);
+	if (bfqg != NULL)
+		return bfqg;
+
+	bfqg = bfq_group_chain_alloc(bfqd, css);
+	if (bfqg != NULL)
+		bfq_group_chain_link(bfqd, css, bfqg);
+	else
+		bfqg = bfqd->root_group;
+
+	return bfqg;
+}
+
+/**
+ * bfq_bfqq_move - migrate @bfqq to @bfqg.
+ * @bfqd: queue descriptor.
+ * @bfqq: the queue to move.
+ * @entity: @bfqq's entity.
+ * @bfqg: the group to move to.
+ *
+ * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
+ * it on the new one.  Avoid putting the entity on the old group idle tree.
+ *
+ * Must be called under the queue lock; the cgroup owning @bfqg must
+ * not disappear (for now this just means that we are called under
+ * rcu_read_lock()).
+ */
+static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			  struct bfq_entity *entity, struct bfq_group *bfqg)
+{
+	int busy, resume;
+
+	busy = bfq_bfqq_busy(bfqq);
+	resume = !RB_EMPTY_ROOT(&bfqq->sort_list);
+
+	if (busy) {
+		if (!resume)
+			bfq_del_bfqq_busy(bfqd, bfqq, 0);
+		else
+			bfq_deactivate_bfqq(bfqd, bfqq, 0);
+	} else if (entity->on_st)
+		bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
+
+	/*
+	 * Here we use a reference to bfqg.  We don't need a refcounter
+	 * as the cgroup reference will not be dropped, so that its
+	 * destroy() callback will not be invoked.
+	 */
+	entity->parent = bfqg->my_entity;
+	entity->sched_data = &bfqg->sched_data;
+
+	if (busy && resume)
+		bfq_activate_bfqq(bfqd, bfqq);
+
+	if (bfqd->in_service_queue == NULL && !bfqd->rq_in_driver)
+		bfq_schedule_dispatch(bfqd);
+}
+
+/**
+ * __bfq_bic_change_cgroup - move @bic to @cgroup.
+ * @bfqd: the queue descriptor.
+ * @bic: the bic to move.
+ * @cgroup: the cgroup to move to.
+ *
+ * Move bic to cgroup, assuming that bfqd->queue is locked; the caller
+ * has to make sure that the reference to cgroup is valid across the call.
+ *
+ * NOTE: an alternative approach might have been to store the current
+ * cgroup in bfqq and get a reference to it, reducing the lookup
+ * time here, at the price of slightly more complex code.
+ */
+static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
+						struct bfq_io_cq *bic,
+						struct cgroup_subsys_state *css)
+{
+	struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0);
+	struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1);
+	struct bfq_entity *entity;
+	struct bfq_group *bfqg;
+	struct bfqio_cgroup *bgrp;
+
+	bgrp = css_to_bfqio(css);
+
+	bfqg = bfq_find_alloc_group(bfqd, css);
+	if (async_bfqq != NULL) {
+		entity = &async_bfqq->entity;
+
+		if (entity->sched_data != &bfqg->sched_data) {
+			bic_set_bfqq(bic, NULL, 0);
+			bfq_log_bfqq(bfqd, async_bfqq,
+				     "bic_change_group: %p %d",
+				     async_bfqq, atomic_read(&async_bfqq->ref));
+			bfq_put_queue(async_bfqq);
+		}
+	}
+
+	if (sync_bfqq != NULL) {
+		entity = &sync_bfqq->entity;
+		if (entity->sched_data != &bfqg->sched_data)
+			bfq_bfqq_move(bfqd, sync_bfqq, entity, bfqg);
+	}
+
+	return bfqg;
+}
+
+/**
+ * bfq_bic_change_cgroup - move @bic to @cgroup.
+ * @bic: the bic being migrated.
+ * @cgroup: the destination cgroup.
+ *
+ * When the task owning @bic is moved to @cgroup, @bic is immediately
+ * moved into its new parent group.
+ */
+static void bfq_bic_change_cgroup(struct bfq_io_cq *bic,
+				  struct cgroup_subsys_state *css)
+{
+	struct bfq_data *bfqd;
+	unsigned long uninitialized_var(flags);
+
+	bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
+				   &flags);
+	if (bfqd != NULL) {
+		__bfq_bic_change_cgroup(bfqd, bic, css);
+		bfq_put_bfqd_unlock(bfqd, &flags);
+	}
+}
+
+/**
+ * bfq_bic_update_cgroup - update the cgroup of @bic.
+ * @bic: the @bic to update.
+ *
+ * Make sure that @bic is enqueued in the cgroup of the current task.
+ * We need this in addition to moving bics during the cgroup attach
+ * phase because the task owning @bic could be at its first disk
+ * access or we may end up in the root cgroup as the result of a
+ * memory allocation failure and here we try to move to the right
+ * group.
+ *
+ * Must be called under the queue lock.  It is safe to use the returned
+ * value even after the rcu_read_unlock() as the migration/destruction
+ * paths act under the queue lock too.  IOW it is impossible to race with
+ * group migration/destruction and end up with an invalid group as:
+ *   a) here cgroup has not yet been destroyed, nor its destroy callback
+ *      has started execution, as current holds a reference to it,
+ *   b) if it is destroyed after rcu_read_unlock() [after current is
+ *      migrated to a different cgroup] its attach() callback will have
+ *      taken care of removing all the references to the old cgroup data.
+ */
+static struct bfq_group *bfq_bic_update_cgroup(struct bfq_io_cq *bic)
+{
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+	struct bfq_group *bfqg;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+	css = task_css(current, bfqio_cgrp_id);
+	bfqg = __bfq_bic_change_cgroup(bfqd, bic, css);
+	rcu_read_unlock();
+
+	return bfqg;
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+static inline void bfq_flush_idle_tree(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entity = st->first_idle;
+
+	for (; entity != NULL; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
+
+/**
+ * bfq_reparent_leaf_entity - move leaf entity to the root_group.
+ * @bfqd: the device data structure with the root group.
+ * @entity: the entity to move.
+ */
+static inline void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
+					    struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	bfq_bfqq_move(bfqd, bfqq, entity, bfqd->root_group);
+	return;
+}
+
+/**
+ * bfq_reparent_active_entities - move to the root group all active
+ *                                entities.
+ * @bfqd: the device data structure with the root group.
+ * @bfqg: the group to move from.
+ * @st: the service tree with the entities.
+ *
+ * Needs queue_lock to be taken and reference to be valid over the call.
+ */
+static inline void bfq_reparent_active_entities(struct bfq_data *bfqd,
+						struct bfq_group *bfqg,
+						struct bfq_service_tree *st)
+{
+	struct rb_root *active = &st->active;
+	struct bfq_entity *entity = NULL;
+
+	if (!RB_EMPTY_ROOT(&st->active))
+		entity = bfq_entity_of(rb_first(active));
+
+	for (; entity != NULL; entity = bfq_entity_of(rb_first(active)))
+		bfq_reparent_leaf_entity(bfqd, entity);
+
+	if (bfqg->sched_data.in_service_entity != NULL)
+		bfq_reparent_leaf_entity(bfqd,
+			bfqg->sched_data.in_service_entity);
+
+	return;
+}
+
+/**
+ * bfq_destroy_group - destroy @bfqg.
+ * @bgrp: the bfqio_cgroup containing @bfqg.
+ * @bfqg: the group being destroyed.
+ *
+ * Destroy @bfqg, making sure that it is not referenced from its parent.
+ */
+static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg)
+{
+	struct bfq_data *bfqd;
+	struct bfq_service_tree *st;
+	struct bfq_entity *entity = bfqg->my_entity;
+	unsigned long uninitialized_var(flags);
+	int i;
+
+	hlist_del(&bfqg->group_node);
+
+	/*
+	 * Empty all service_trees belonging to this group before
+	 * deactivating the group itself.
+	 */
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
+		st = bfqg->sched_data.service_tree + i;
+
+		/*
+		 * The idle tree may still contain bfq_queues belonging
+		 * to exited tasks because they never migrated to a different
+		 * cgroup from the one being destroyed now.  No one else
+		 * can access them so it's safe to act without any lock.
+		 */
+		bfq_flush_idle_tree(st);
+
+		/*
+		 * It may happen that some queues are still active
+		 * (busy) upon group destruction (if the corresponding
+		 * processes have been forced to terminate). We move
+		 * all the leaf entities corresponding to these queues
+		 * to the root_group.
+		 * Also, it may happen that the group has an entity
+		 * in service, which is disconnected from the active
+		 * tree: it must be moved, too.
+		 * There is no need to put the sync queues, as the
+		 * scheduler has taken no reference.
+		 */
+		bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
+		if (bfqd != NULL) {
+			bfq_reparent_active_entities(bfqd, bfqg, st);
+			bfq_put_bfqd_unlock(bfqd, &flags);
+		}
+	}
+
+	/*
+	 * We may race with device destruction, take extra care when
+	 * dereferencing bfqg->bfqd.
+	 */
+	bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
+	if (bfqd != NULL) {
+		hlist_del(&bfqg->bfqd_node);
+		__bfq_deactivate_entity(entity, 0);
+		bfq_put_async_queues(bfqd, bfqg);
+		bfq_put_bfqd_unlock(bfqd, &flags);
+	}
+
+	/*
+	 * No need to defer the kfree() to the end of the RCU grace
+	 * period: we are called from the destroy() callback of our
+	 * cgroup, so we can be sure that no one is a) still using
+	 * this cgroup or b) doing lookups in it.
+	 */
+	kfree(bfqg);
+}
+
+/**
+ * bfq_disconnect_groups - disconnect @bfqd from all its groups.
+ * @bfqd: the device descriptor being exited.
+ *
+ * When the device exits we just make sure that no lookup can return
+ * the now unused group structures.  They will be deallocated on cgroup
+ * destruction.
+ */
+static void bfq_disconnect_groups(struct bfq_data *bfqd)
+{
+	struct hlist_node *tmp;
+	struct bfq_group *bfqg;
+
+	bfq_log(bfqd, "disconnect_groups beginning");
+	hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node) {
+		hlist_del(&bfqg->bfqd_node);
+
+		__bfq_deactivate_entity(bfqg->my_entity, 0);
+
+		/*
+		 * Don't remove from the group hash, just set an
+		 * invalid key.  No lookups can race with the
+		 * assignment as bfqd is being destroyed; this
+		 * implies also that new elements cannot be added
+		 * to the list.
+		 */
+		rcu_assign_pointer(bfqg->bfqd, NULL);
+
+		bfq_log(bfqd, "disconnect_groups: put async for group %p",
+			bfqg);
+		bfq_put_async_queues(bfqd, bfqg);
+	}
+}
+
+static inline void bfq_free_root_group(struct bfq_data *bfqd)
+{
+	struct bfqio_cgroup *bgrp = &bfqio_root_cgroup;
+	struct bfq_group *bfqg = bfqd->root_group;
+
+	bfq_put_async_queues(bfqd, bfqg);
+
+	spin_lock_irq(&bgrp->lock);
+	hlist_del_rcu(&bfqg->group_node);
+	spin_unlock_irq(&bgrp->lock);
+
+	/*
+	 * No need to synchronize_rcu() here: since the device is gone
+	 * there cannot be any read-side access to its root_group.
+	 */
+	kfree(bfqg);
+}
+
+static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
+{
+	struct bfq_group *bfqg;
+	struct bfqio_cgroup *bgrp;
+	int i;
+
+	bfqg = kzalloc_node(sizeof(*bfqg), GFP_KERNEL, node);
+	if (bfqg == NULL)
+		return NULL;
+
+	bfqg->entity.parent = NULL;
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+		bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+
+	bgrp = &bfqio_root_cgroup;
+	spin_lock_irq(&bgrp->lock);
+	rcu_assign_pointer(bfqg->bfqd, bfqd);
+	hlist_add_head_rcu(&bfqg->group_node, &bgrp->group_data);
+	spin_unlock_irq(&bgrp->lock);
+
+	return bfqg;
+}
+
+#define SHOW_FUNCTION(__VAR)						\
+static u64 bfqio_cgroup_##__VAR##_read(struct cgroup_subsys_state *css, \
+				       struct cftype *cftype)		\
+{									\
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);			\
+	u64 ret = -ENODEV;						\
+									\
+	mutex_lock(&bfqio_mutex);					\
+	if (bfqio_is_removed(bgrp))					\
+		goto out_unlock;					\
+									\
+	spin_lock_irq(&bgrp->lock);					\
+	ret = bgrp->__VAR;						\
+	spin_unlock_irq(&bgrp->lock);					\
+									\
+out_unlock:								\
+	mutex_unlock(&bfqio_mutex);					\
+	return ret;							\
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
+static int bfqio_cgroup_##__VAR##_write(struct cgroup_subsys_state *css,\
+					struct cftype *cftype,		\
+					u64 val)			\
+{									\
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);			\
+	struct bfq_group *bfqg;						\
+	int ret = -EINVAL;						\
+									\
+	if (val < (__MIN) || val > (__MAX))				\
+		return ret;						\
+									\
+	ret = -ENODEV;							\
+	mutex_lock(&bfqio_mutex);					\
+	if (bfqio_is_removed(bgrp))					\
+		goto out_unlock;					\
+	ret = 0;							\
+									\
+	spin_lock_irq(&bgrp->lock);					\
+	bgrp->__VAR = (unsigned short)val;				\
+	hlist_for_each_entry(bfqg, &bgrp->group_data, group_node) {	\
+		/*							\
+		 * Setting the ioprio_changed flag of the entity        \
+		 * to 1 with new_##__VAR == ##__VAR would re-set        \
+		 * the value of the weight to its ioprio mapping.       \
+		 * Set the flag only if necessary.			\
+		 */							\
+		if ((unsigned short)val != bfqg->entity.new_##__VAR) {  \
+			bfqg->entity.new_##__VAR = (unsigned short)val; \
+			/*						\
+			 * Make sure that the above new value has been	\
+			 * stored in bfqg->entity.new_##__VAR before	\
+			 * setting the ioprio_changed flag. In fact,	\
+			 * this flag may be read asynchronously (in	\
+			 * critical sections protected by a different	\
+			 * lock than that held here), and finding this	\
+			 * flag set may cause the execution of the code	\
+			 * for updating parameters whose value may	\
+			 * depend also on bfqg->entity.new_##__VAR (in	\
+			 * __bfq_entity_update_weight_prio).		\
+			 * This barrier makes sure that the new value	\
+			 * of bfqg->entity.new_##__VAR is correctly	\
+			 * seen in that code.				\
+			 */						\
+			smp_wmb();                                      \
+			bfqg->entity.ioprio_changed = 1;                \
+		}							\
+	}								\
+	spin_unlock_irq(&bgrp->lock);					\
+									\
+out_unlock:								\
+	mutex_unlock(&bfqio_mutex);					\
+	return ret;							\
+}
+
+STORE_FUNCTION(weight, BFQ_MIN_WEIGHT, BFQ_MAX_WEIGHT);
+STORE_FUNCTION(ioprio, 0, IOPRIO_BE_NR - 1);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+static struct cftype bfqio_files[] = {
+	{
+		.name = "weight",
+		.read_u64 = bfqio_cgroup_weight_read,
+		.write_u64 = bfqio_cgroup_weight_write,
+	},
+	{
+		.name = "ioprio",
+		.read_u64 = bfqio_cgroup_ioprio_read,
+		.write_u64 = bfqio_cgroup_ioprio_write,
+	},
+	{
+		.name = "ioprio_class",
+		.read_u64 = bfqio_cgroup_ioprio_class_read,
+		.write_u64 = bfqio_cgroup_ioprio_class_write,
+	},
+	{ },	/* terminate */
+};
+
+static struct cgroup_subsys_state *bfqio_create(struct cgroup_subsys_state
+						*parent_css)
+{
+	struct bfqio_cgroup *bgrp;
+
+	if (parent_css != NULL) {
+		bgrp = kzalloc(sizeof(*bgrp), GFP_KERNEL);
+		if (bgrp == NULL)
+			return ERR_PTR(-ENOMEM);
+	} else
+		bgrp = &bfqio_root_cgroup;
+
+	spin_lock_init(&bgrp->lock);
+	INIT_HLIST_HEAD(&bgrp->group_data);
+	bgrp->ioprio = BFQ_DEFAULT_GRP_IOPRIO;
+	bgrp->ioprio_class = BFQ_DEFAULT_GRP_CLASS;
+
+	return &bgrp->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main bic/bfqq data structures.  For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+static int bfqio_can_attach(struct cgroup_subsys_state *css,
+			    struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct io_context *ioc;
+	int ret = 0;
+
+	cgroup_taskset_for_each(task, tset) {
+		/*
+		 * task_lock() is needed to avoid races with
+		 * exit_io_context()
+		 */
+		task_lock(task);
+		ioc = task->io_context;
+		if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+			/*
+			 * ioc == NULL means that the task is either too
+			 * young or exiting: if it still has no ioc, the
+			 * ioc can't be shared; if the task is exiting, the
+			 * attach will fail anyway, no matter what we
+			 * return here.
+			 */
+			ret = -EINVAL;
+		task_unlock(task);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static void bfqio_attach(struct cgroup_subsys_state *css,
+			 struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct io_context *ioc;
+	struct io_cq *icq;
+
+	/*
+	 * IMPORTANT NOTE: The move of more than one process at a time to a
+	 * new group has not yet been tested.
+	 */
+	cgroup_taskset_for_each(task, tset) {
+		ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
+		if (ioc) {
+			/*
+			 * Handle cgroup change here.
+			 */
+			rcu_read_lock();
+			hlist_for_each_entry_rcu(icq, &ioc->icq_list, ioc_node)
+				if (!strncmp(
+					icq->q->elevator->type->elevator_name,
+					"bfq", ELV_NAME_MAX))
+					bfq_bic_change_cgroup(icq_to_bic(icq),
+							      css);
+			rcu_read_unlock();
+			put_io_context(ioc);
+		}
+	}
+}
+
+static void bfqio_destroy(struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+	struct hlist_node *tmp;
+	struct bfq_group *bfqg;
+
+	/*
+	 * Since we are destroying the cgroup, there are no more tasks
+	 * referencing it, and all the RCU grace periods that may have
+	 * referenced it are ended (as the destruction of the parent
+	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
+	 * anything else and we don't need any synchronization.
+	 */
+	hlist_for_each_entry_safe(bfqg, tmp, &bgrp->group_data, group_node)
+		bfq_destroy_group(bgrp, bfqg);
+
+	kfree(bgrp);
+}
+
+static int bfqio_css_online(struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+
+	mutex_lock(&bfqio_mutex);
+	bgrp->online = true;
+	mutex_unlock(&bfqio_mutex);
+
+	return 0;
+}
+
+static void bfqio_css_offline(struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+
+	mutex_lock(&bfqio_mutex);
+	bgrp->online = false;
+	mutex_unlock(&bfqio_mutex);
+}
+
+struct cgroup_subsys bfqio_cgrp_subsys = {
+	.css_alloc = bfqio_create,
+	.css_online = bfqio_css_online,
+	.css_offline = bfqio_css_offline,
+	.can_attach = bfqio_can_attach,
+	.attach = bfqio_attach,
+	.css_free = bfqio_destroy,
+	.base_cftypes = bfqio_files,
+};
+#else
+static inline void bfq_init_entity(struct bfq_entity *entity,
+				   struct bfq_group *bfqg)
+{
+	entity->weight = entity->new_weight;
+	entity->orig_weight = entity->new_weight;
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &bfqg->sched_data;
+}
+
+static inline struct bfq_group *
+bfq_bic_update_cgroup(struct bfq_io_cq *bic)
+{
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+	return bfqd->root_group;
+}
+
+static inline void bfq_bfqq_move(struct bfq_data *bfqd,
+				 struct bfq_queue *bfqq,
+				 struct bfq_entity *entity,
+				 struct bfq_group *bfqg)
+{
+}
+
+static inline void bfq_disconnect_groups(struct bfq_data *bfqd)
+{
+	bfq_put_async_queues(bfqd, bfqd->root_group);
+}
+
+static inline void bfq_free_root_group(struct bfq_data *bfqd)
+{
+	kfree(bfqd->root_group);
+}
+
+static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
+{
+	struct bfq_group *bfqg;
+	int i;
+
+	bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
+	if (bfqg == NULL)
+		return NULL;
+
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+		bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+
+	return bfqg;
+}
+#endif
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 01a98be..b2cbfce 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -66,14 +66,6 @@
 #include "bfq.h"
 #include "blk.h"
 
-/*
- * Array of async queues for all the processes, one queue
- * per ioprio value per ioprio_class.
- */
-struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
-/* Async queue for the idle class (ioprio is ignored) */
-struct bfq_queue *async_idle_bfqq;
-
 /* Max number of dispatches in one round of service. */
 static const int bfq_quantum = 4;
 
@@ -128,6 +120,7 @@ static inline void bfq_schedule_dispatch(struct bfq_data *bfqd);
 
 #include "bfq-ioc.c"
 #include "bfq-sched.c"
+#include "bfq-cgroup.c"
 
 #define bfq_class_idle(bfqq)	((bfqq)->entity.ioprio_class ==\
 				 IOPRIO_CLASS_IDLE)
@@ -1359,6 +1352,7 @@ static void bfq_changed_ioprio(struct bfq_io_cq *bic)
 {
 	struct bfq_data *bfqd;
 	struct bfq_queue *bfqq, *new_bfqq;
+	struct bfq_group *bfqg;
 	unsigned long uninitialized_var(flags);
 	int ioprio = bic->icq.ioc->ioprio;
 
@@ -1373,7 +1367,9 @@ static void bfq_changed_ioprio(struct bfq_io_cq *bic)
 
 	bfqq = bic->bfqq[BLK_RW_ASYNC];
 	if (bfqq != NULL) {
-		new_bfqq = bfq_get_queue(bfqd, BLK_RW_ASYNC, bic,
+		bfqg = container_of(bfqq->entity.sched_data, struct bfq_group,
+				    sched_data);
+		new_bfqq = bfq_get_queue(bfqd, bfqg, BLK_RW_ASYNC, bic,
 					 GFP_ATOMIC);
 		if (new_bfqq != NULL) {
 			bic->bfqq[BLK_RW_ASYNC] = new_bfqq;
@@ -1417,6 +1413,7 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 }
 
 static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
+					      struct bfq_group *bfqg,
 					      int is_sync,
 					      struct bfq_io_cq *bic,
 					      gfp_t gfp_mask)
@@ -1459,6 +1456,7 @@ retry:
 		}
 
 		bfq_init_prio_data(bfqq, bic);
+		bfq_init_entity(&bfqq->entity, bfqg);
 	}
 
 	if (new_bfqq != NULL)
@@ -1468,26 +1466,27 @@ retry:
 }
 
 static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
+					       struct bfq_group *bfqg,
 					       int ioprio_class, int ioprio)
 {
 	switch (ioprio_class) {
 	case IOPRIO_CLASS_RT:
-		return &async_bfqq[0][ioprio];
+		return &bfqg->async_bfqq[0][ioprio];
 	case IOPRIO_CLASS_NONE:
 		ioprio = IOPRIO_NORM;
 		/* fall through */
 	case IOPRIO_CLASS_BE:
-		return &async_bfqq[1][ioprio];
+		return &bfqg->async_bfqq[1][ioprio];
 	case IOPRIO_CLASS_IDLE:
-		return &async_idle_bfqq;
+		return &bfqg->async_idle_bfqq;
 	default:
 		BUG();
 	}
 }
 
 static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
-				       int is_sync, struct bfq_io_cq *bic,
-				       gfp_t gfp_mask)
+				       struct bfq_group *bfqg, int is_sync,
+				       struct bfq_io_cq *bic, gfp_t gfp_mask)
 {
 	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
 	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
@@ -1495,12 +1494,13 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 	struct bfq_queue *bfqq = NULL;
 
 	if (!is_sync) {
-		async_bfqq = bfq_async_queue_prio(bfqd, ioprio_class, ioprio);
+		async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
+						  ioprio);
 		bfqq = *async_bfqq;
 	}
 
 	if (bfqq == NULL)
-		bfqq = bfq_find_alloc_queue(bfqd, is_sync, bic, gfp_mask);
+		bfqq = bfq_find_alloc_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
 
 	/*
 	 * Pin the queue now that it's allocated, scheduler exit will
@@ -1830,6 +1830,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	const int rw = rq_data_dir(rq);
 	const int is_sync = rq_is_sync(rq);
 	struct bfq_queue *bfqq;
+	struct bfq_group *bfqg;
 	unsigned long flags;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
@@ -1841,9 +1842,11 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	if (bic == NULL)
 		goto queue_fail;
 
+	bfqg = bfq_bic_update_cgroup(bic);
+
 	bfqq = bic_to_bfqq(bic, is_sync);
 	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
-		bfqq = bfq_get_queue(bfqd, is_sync, bic, gfp_mask);
+		bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
 		bic_set_bfqq(bic, bfqq, is_sync);
 	}
 
@@ -1937,10 +1940,12 @@ static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
 static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
 					struct bfq_queue **bfqq_ptr)
 {
+	struct bfq_group *root_group = bfqd->root_group;
 	struct bfq_queue *bfqq = *bfqq_ptr;
 
 	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
 	if (bfqq != NULL) {
+		bfq_bfqq_move(bfqd, bfqq, &bfqq->entity, root_group);
 		bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
 			     bfqq, atomic_read(&bfqq->ref));
 		bfq_put_queue(bfqq);
@@ -1949,18 +1954,20 @@ static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
 }
 
 /*
- * Release the extra reference of the async queues as the device
- * goes away.
+ * Release all the bfqg references to its async queues.  If we are
+ * deallocating the group these queues may still contain requests, so
+ * we reparent them to the root cgroup (i.e., the only one that will
+ * exist for sure until all the requests on a device are gone).
  */
-static void bfq_put_async_queues(struct bfq_data *bfqd)
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
 {
 	int i, j;
 
 	for (i = 0; i < 2; i++)
 		for (j = 0; j < IOPRIO_BE_NR; j++)
-			__bfq_put_async_bfqq(bfqd, &async_bfqq[i][j]);
+			__bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]);
 
-	__bfq_put_async_bfqq(bfqd, &async_idle_bfqq);
+	__bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq);
 }
 
 static void bfq_exit_queue(struct elevator_queue *e)
@@ -1976,18 +1983,20 @@ static void bfq_exit_queue(struct elevator_queue *e)
 	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
 		bfq_deactivate_bfqq(bfqd, bfqq, 0);
 
-	bfq_put_async_queues(bfqd);
+	bfq_disconnect_groups(bfqd);
 	spin_unlock_irq(q->queue_lock);
 
 	bfq_shutdown_timer_wq(bfqd);
 
 	synchronize_rcu();
 
+	bfq_free_root_group(bfqd);
 	kfree(bfqd);
 }
 
 static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 {
+	struct bfq_group *bfqg;
 	struct bfq_data *bfqd;
 	struct elevator_queue *eq;
 
@@ -2016,6 +2025,15 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	q->elevator = eq;
 	spin_unlock_irq(q->queue_lock);
 
+	bfqg = bfq_alloc_root_group(bfqd, q->node);
+	if (bfqg == NULL) {
+		kfree(bfqd);
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	bfqd->root_group = bfqg;
+
 	init_timer(&bfqd->idle_slice_timer);
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
@@ -2279,7 +2297,7 @@ static int __init bfq_init(void)
 		return -ENOMEM;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v0");
+	pr_info("BFQ I/O-scheduler version: v1");
 
 	return 0;
 }
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index a9142f5..8801b6c 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -8,6 +8,61 @@
  *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
  */
 
+#ifdef CONFIG_CGROUP_BFQIO
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
+						 int extract,
+						 struct bfq_data *bfqd);
+
+static inline void bfq_update_budget(struct bfq_entity *next_in_service)
+{
+	struct bfq_entity *bfqg_entity;
+	struct bfq_group *bfqg;
+	struct bfq_sched_data *group_sd;
+
+	group_sd = next_in_service->sched_data;
+
+	bfqg = container_of(group_sd, struct bfq_group, sched_data);
+	/*
+	 * bfq_group's my_entity field is not NULL only if the group
+	 * is not the root group. We must not touch the root entity
+	 * as it must never become an in-service entity.
+	 */
+	bfqg_entity = bfqg->my_entity;
+	if (bfqg_entity != NULL)
+		bfqg_entity->budget = next_in_service->budget;
+}
+
+static int bfq_update_next_in_service(struct bfq_sched_data *sd)
+{
+	struct bfq_entity *next_in_service;
+
+	if (sd->in_service_entity != NULL)
+		/* will update/requeue at the end of service */
+		return 0;
+
+	/*
+	 * NOTE: this can be improved in many ways, such as returning
+	 * 1 (and thus propagating upwards the update) only when the
+	 * budget changes, or caching the bfqq that will be scheduled
+	 * next from this subtree.  For now we worry more about
+	 * correctness than about performance...
+	 */
+	next_in_service = bfq_lookup_next_entity(sd, 0, NULL);
+	sd->next_in_service = next_in_service;
+
+	if (next_in_service != NULL)
+		bfq_update_budget(next_in_service);
+
+	return 1;
+}
+
+#else
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
 
@@ -19,14 +74,10 @@ static inline int bfq_update_next_in_service(struct bfq_sched_data *sd)
 	return 0;
 }
 
-static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
-					     struct bfq_entity *entity)
-{
-}
-
 static inline void bfq_update_budget(struct bfq_entity *next_in_service)
 {
 }
+#endif
 
 /*
  * Shift for timestamp calculations.  This actually limits the maximum
@@ -842,7 +893,6 @@ static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
 		entity = __bfq_lookup_next_entity(st + i, false);
 		if (entity != NULL) {
 			if (extract) {
-				bfq_check_next_in_service(sd, entity);
 				bfq_active_extract(st + i, entity);
 				sd->in_service_entity = entity;
 				sd->next_in_service = NULL;
@@ -866,7 +916,7 @@ static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
 	if (bfqd->busy_queues == 0)
 		return NULL;
 
-	sd = &bfqd->sched_data;
+	sd = &bfqd->root_group->sched_data;
 	for (; sd != NULL; sd = entity->my_sched_data) {
 		entity = bfq_lookup_next_entity(sd, 1, bfqd);
 		entity->service = 0;
diff --git a/block/bfq.h b/block/bfq.h
index bd146b6..b982567 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v0 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v1 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@@ -92,7 +92,7 @@ struct bfq_sched_data {
  * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
  * @weight: weight of the queue
  * @parent: parent entity, for hierarchical scheduling.
- * @my_sched_data: for non-leaf nodes in the hierarchy, the
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
  *                 associated scheduler queue, %NULL on leaf nodes.
  * @sched_data: the scheduler queue this entity belongs to.
  * @ioprio: the ioprio in use.
@@ -105,10 +105,11 @@ struct bfq_sched_data {
  * @ioprio_changed: flag, true when the user requested a weight, ioprio or
  *                  ioprio_class change.
  *
- * A bfq_entity is used to represent a bfq_queue (leaf node in the upper
- * level scheduler). Each entity belongs to the sched_data of the parent
- * group hierarchy. Non-leaf entities have also their own sched_data,
- * stored in @my_sched_data.
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
+ * cgroup hierarchy) or a bfq_group in the upper level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities also have their own sched_data, stored
+ * in @my_sched_data.
  *
  * Each entity stores independently its priority values; this would
  * allow different weights on different devices, but this
@@ -119,13 +120,14 @@ struct bfq_sched_data {
  * update to take place the effective and the requested priority
  * values are synchronized.
  *
- * The weight value is calculated from the ioprio to export the same
- * interface as CFQ.  When dealing with  ``well-behaved'' queues (i.e.,
- * queues that do not spend too much time to consume their budget
- * and have true sequential behavior, and when there are no external
- * factors breaking anticipation) the relative weights at each level
- * of the hierarchy should be guaranteed.  All the fields are
- * protected by the queue lock of the containing bfqd.
+ * Unless cgroups are used, the weight value is calculated from the
+ * ioprio to export the same interface as CFQ.  When dealing with
+ * ``well-behaved'' queues (i.e., queues that do not spend too much
+ * time to consume their budget and have true sequential behavior, and
+ * when there are no external factors breaking anticipation) the
+ * relative weights at each level of the cgroups hierarchy should be
+ * guaranteed.  All the fields are protected by the queue lock of the
+ * containing bfqd.
  */
 struct bfq_entity {
 	struct rb_node rb_node;
@@ -154,6 +156,8 @@ struct bfq_entity {
 	int ioprio_changed;
 };
 
+struct bfq_group;
+
 /**
  * struct bfq_queue - leaf schedulable entity.
  * @ref: reference counter.
@@ -177,7 +181,11 @@ struct bfq_entity {
  * @pid: pid of the process owning the queue, used for logging purposes.
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async.
+ * io_context or more, if it is async. @cgroup holds a reference to the
+ * cgroup, to be sure that it does not disappear while a bfqq still
+ * references it (mostly to avoid races between request issuing and task
+ * migration followed by cgroup destruction). All the fields are protected
+ * by the queue lock of the containing bfqd.
  */
 struct bfq_queue {
 	atomic_t ref;
@@ -244,7 +252,7 @@ enum bfq_device_speed {
 /**
  * struct bfq_data - per device data structure.
  * @queue: request queue for the managed device.
- * @sched_data: root @bfq_sched_data for the device.
+ * @root_group: root bfq_group for the device.
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @queued: number of queued requests.
@@ -267,6 +275,7 @@ enum bfq_device_speed {
  * @peak_rate_samples: number of samples used to calculate @peak_rate.
  * @bfq_max_budget: maximum budget allotted to a bfq_queue before
  *                  rescheduling.
+ * @group_list: list of all the bfq_groups active on the device.
  * @active_list: list of all the bfq_queues active on the device.
  * @idle_list: list of all the bfq_queues idle on the device.
  * @bfq_quantum: max number of requests dispatched per dispatch round.
@@ -293,7 +302,7 @@ enum bfq_device_speed {
 struct bfq_data {
 	struct request_queue *queue;
 
-	struct bfq_sched_data sched_data;
+	struct bfq_group *root_group;
 
 	int busy_queues;
 	int queued;
@@ -320,6 +329,7 @@ struct bfq_data {
 	u64 peak_rate;
 	unsigned long bfq_max_budget;
 
+	struct hlist_head group_list;
 	struct list_head active_list;
 	struct list_head idle_list;
 
@@ -390,6 +400,82 @@ enum bfqq_expiration {
 	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
 };
 
+#ifdef CONFIG_CGROUP_BFQIO
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both bfq_queues and bfq_groups).
+ * @group_node: node to be inserted into the group_data list of the
+ *              containing cgroup's bfqio_cgroup.
+ * @bfqd_node: node to be inserted into the @bfqd->group_list list
+ *             of the groups active on the same device; used for cleanup.
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/
+ *             migration.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @group_node is protected by the bfqio_cgroup lock, and is accessed
+ *      via RCU from its readers.
+ *    o @bfqd is protected by the queue lock, RCU is used to access it
+ *      from the readers.
+ *    o All the other fields are protected by the @bfqd queue lock.
+ */
+struct bfq_group {
+	struct bfq_entity entity;
+	struct bfq_sched_data sched_data;
+
+	struct hlist_node group_node;
+	struct hlist_node bfqd_node;
+
+	void *bfqd;
+
+	struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+	struct bfq_queue *async_idle_bfqq;
+
+	struct bfq_entity *my_entity;
+};
+
+/**
+ * struct bfqio_cgroup - bfq cgroup data structure.
+ * @css: subsystem state for bfq in the containing cgroup.
+ * @online: flag set while the cgroup is online.
+ * @weight: cgroup weight.
+ * @ioprio: cgroup ioprio.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @ioprio, @ioprio_class and @group_data.
+ * @group_data: list containing the bfq_groups belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates;
+ * @ioprio and @ioprio_class are protected by @lock.
+ */
+struct bfqio_cgroup {
+	struct cgroup_subsys_state css;
+	bool online;
+
+	unsigned short weight, ioprio, ioprio_class;
+
+	spinlock_t lock;
+	struct hlist_head group_data;
+};
+#else
+struct bfq_group {
+	struct bfq_sched_data sched_data;
+
+	struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+	struct bfq_queue *async_idle_bfqq;
+};
+#endif
+
 static inline struct bfq_service_tree *
 bfq_entity_service_tree(struct bfq_entity *entity)
 {
@@ -460,8 +546,10 @@ static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd,
 static void bfq_changed_ioprio(struct bfq_io_cq *bic);
 static void bfq_put_queue(struct bfq_queue *bfqq);
 static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
-static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, int is_sync,
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
+				       struct bfq_group *bfqg, int is_sync,
 				       struct bfq_io_cq *bic, gfp_t gfp_mask);
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 
 #endif /* _BFQ_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 768fe44..cdd2528 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -39,6 +39,10 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
+SUBSYS(bfqio)
+#endif
+
 #if IS_ENABLED(CONFIG_CGROUP_PERF)
 SUBSYS(perf_event)
 #endif
-- 
1.9.2
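
A purely illustrative aside: the barrier comment inside the
STORE_FUNCTION() macro above describes a classic publish pattern:
store the new value first, then make it visible (smp_wmb() in the
patch) before raising ioprio_changed, so that whoever observes the
flag also observes the value. The following self-contained, user-space
sketch expresses the same ordering with C11 release/acquire atomics;
the names are illustrative and this is not BFQ's actual reader code.

/*
 * Writer publishes new_weight, then raises ioprio_changed; the reader
 * consumes new_weight only after observing the flag.
 */
#include <stdatomic.h>
#include <stdio.h>

struct entity {
	unsigned short new_weight;	/* payload, written first            */
	atomic_int ioprio_changed;	/* flag, published after the payload */
};

static void set_new_weight(struct entity *e, unsigned short val)
{
	e->new_weight = val;
	/* release ordering plays the role of smp_wmb() in the patch */
	atomic_store_explicit(&e->ioprio_changed, 1, memory_order_release);
}

static void maybe_update(struct entity *e)
{
	/* clear the flag and, if it was set, act on the published value */
	if (atomic_exchange_explicit(&e->ioprio_changed, 0,
				     memory_order_acquire))
		printf("applying new weight %hu\n", e->new_weight);
}

int main(void)
{
	struct entity e = { .new_weight = 100, .ioprio_changed = 0 };

	set_new_weight(&e, 200);
	maybe_update(&e);
	return 0;
}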


* [PATCH RFC RESEND 04/14] block, bfq: add full hierarchical scheduling and cgroups support
@ 2014-05-27 12:42     ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Fabio Checconi <fchecconi@gmail.com>

Complete support for full hierarchical scheduling, with a cgroups
interface. The name of the new subsystem is bfqio.

Through the cgroups interface, weights can be assigned explicitly to
both groups and processes, unlike what happens to single processes
when the cgroups interface is not used (as explained in the
description of patch 2). In particular, since each node of the
hierarchy contains a full scheduler, each group can be assigned its
own weight.
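
As a purely illustrative aside (not part of the patch): with the
cftypes added by this patch, a cgroup v1 hierarchy should expose the
group attributes as bfqio.weight, bfqio.ioprio and bfqio.ioprio_class.
The sketch below drives them from user space; the mount point
/sys/fs/cgroup/bfqio, the group name "slow" and the weight value 200
are assumptions chosen for the example, not anything mandated by this
series.

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* write a single attribute value to a cgroup control file */
static int write_attr(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (f == NULL)
		return -errno;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	char pid[16];

	/* create a child group and give it an explicit weight */
	if (mkdir("/sys/fs/cgroup/bfqio/slow", 0755) != 0 && errno != EEXIST)
		perror("mkdir");
	write_attr("/sys/fs/cgroup/bfqio/slow/bfqio.weight", "200");

	/* move the current process into the new group */
	snprintf(pid, sizeof(pid), "%d", (int)getpid());
	write_attr("/sys/fs/cgroup/bfqio/slow/tasks", pid);

	return 0;
}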

Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-cgroup.c            | 891 ++++++++++++++++++++++++++++++++++++++++++
 block/bfq-iosched.c           |  66 ++--
 block/bfq-sched.c             |  64 ++-
 block/bfq.h                   | 122 +++++-
 include/linux/cgroup_subsys.h |   4 +
 5 files changed, 1099 insertions(+), 48 deletions(-)
 create mode 100644 block/bfq-cgroup.c

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
new file mode 100644
index 0000000..00a7a1b
--- /dev/null
+++ b/block/bfq-cgroup.c
@@ -0,0 +1,891 @@
+/*
+ * BFQ: CGROUPS support.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
+ * file.
+ */
+
+#ifdef CONFIG_CGROUP_BFQIO
+
+static DEFINE_MUTEX(bfqio_mutex);
+
+static bool bfqio_is_removed(struct bfqio_cgroup *bgrp)
+{
+	return bgrp ? !bgrp->online : false;
+}
+
+static struct bfqio_cgroup bfqio_root_cgroup = {
+	.weight = BFQ_DEFAULT_GRP_WEIGHT,
+	.ioprio = BFQ_DEFAULT_GRP_IOPRIO,
+	.ioprio_class = BFQ_DEFAULT_GRP_CLASS,
+};
+
+static inline void bfq_init_entity(struct bfq_entity *entity,
+				   struct bfq_group *bfqg)
+{
+	entity->weight = entity->new_weight;
+	entity->orig_weight = entity->new_weight;
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->parent = bfqg->my_entity;
+	entity->sched_data = &bfqg->sched_data;
+}
+
+static struct bfqio_cgroup *css_to_bfqio(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct bfqio_cgroup, css) : NULL;
+}
+
+/*
+ * Search the hash table (for now only a list) of bgrp for the bfq_group
+ * associated with bfqd.  Must be called under rcu_read_lock().
+ */
+static struct bfq_group *bfqio_lookup_group(struct bfqio_cgroup *bgrp,
+					    struct bfq_data *bfqd)
+{
+	struct bfq_group *bfqg;
+	void *key;
+
+	hlist_for_each_entry_rcu(bfqg, &bgrp->group_data, group_node) {
+		key = rcu_dereference(bfqg->bfqd);
+		if (key == bfqd)
+			return bfqg;
+	}
+
+	return NULL;
+}
+
+static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
+					 struct bfq_group *bfqg)
+{
+	struct bfq_entity *entity = &bfqg->entity;
+
+	/*
+	 * If the weight of the entity has never been set via the sysfs
+	 * interface, then bgrp->weight == 0. In this case we initialize
+	 * the weight from the current ioprio value. Otherwise, the group
+	 * weight, if set, has priority over the ioprio value.
+	 */
+	if (bgrp->weight == 0) {
+		entity->new_weight = bfq_ioprio_to_weight(bgrp->ioprio);
+		entity->new_ioprio = bgrp->ioprio;
+	} else {
+		entity->new_weight = bgrp->weight;
+		entity->new_ioprio = bfq_weight_to_ioprio(bgrp->weight);
+	}
+	entity->orig_weight = entity->weight = entity->new_weight;
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
+	entity->my_sched_data = &bfqg->sched_data;
+}
+
+static inline void bfq_group_set_parent(struct bfq_group *bfqg,
+					struct bfq_group *parent)
+{
+	struct bfq_entity *entity;
+
+	entity = &bfqg->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+}
+
+/**
+ * bfq_group_chain_alloc - allocate a chain of groups.
+ * @bfqd: queue descriptor.
+ * @css: the leaf cgroup_subsys_state this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @css up to the root cgroup.  Stop if a cgroup on the chain
+ * to the root already has an allocated group on @bfqd.
+ */
+static struct bfq_group *bfq_group_chain_alloc(struct bfq_data *bfqd,
+					       struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp;
+	struct bfq_group *bfqg, *prev = NULL, *leaf = NULL;
+
+	for (; css != NULL; css = css->parent) {
+		bgrp = css_to_bfqio(css);
+
+		bfqg = bfqio_lookup_group(bgrp, bfqd);
+		if (bfqg != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a bfq_group for bfqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		bfqg = kzalloc(sizeof(*bfqg), GFP_ATOMIC);
+		if (bfqg == NULL)
+			goto cleanup;
+
+		bfq_group_init_entity(bgrp, bfqg);
+		bfqg->my_entity = &bfqg->entity;
+
+		if (leaf == NULL) {
+			leaf = bfqg;
+			prev = leaf;
+		} else {
+			bfq_group_set_parent(prev, bfqg);
+			/*
+			 * Build a list of allocated nodes using the bfqd
+			 * field, which is still unused and will be
+			 * initialized only after the node is connected.
+			 */
+			prev->bfqd = bfqg;
+			prev = bfqg;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->bfqd;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+/**
+ * bfq_group_chain_link - link an allocated group chain to a cgroup
+ *                        hierarchy.
+ * @bfqd: the queue descriptor.
+ * @css: the leaf cgroup_subsys_state to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated with @bfqd, all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the bfqio_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+static void bfq_group_chain_link(struct bfq_data *bfqd,
+				 struct cgroup_subsys_state *css,
+				 struct bfq_group *leaf)
+{
+	struct bfqio_cgroup *bgrp;
+	struct bfq_group *bfqg, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(bfqd->queue->queue_lock);
+
+	for (; css != NULL && leaf != NULL; css = css->parent) {
+		bgrp = css_to_bfqio(css);
+		next = leaf->bfqd;
+
+		bfqg = bfqio_lookup_group(bgrp, bfqd);
+
+		spin_lock_irqsave(&bgrp->lock, flags);
+
+		rcu_assign_pointer(leaf->bfqd, bfqd);
+		hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
+		hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);
+
+		spin_unlock_irqrestore(&bgrp->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	if (css != NULL && prev != NULL) {
+		bgrp = css_to_bfqio(css);
+		bfqg = bfqio_lookup_group(bgrp, bfqd);
+		bfq_group_set_parent(prev, bfqg);
+	}
+}
+
+/**
+ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
+ * @bfqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ *
+ * Return a group associated with @bfqd in @cgroup, allocating one if
+ * necessary.  When a group is returned, all the cgroups in the path
+ * to the root have a group associated with @bfqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback.  If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd,
+					      struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+	struct bfq_group *bfqg;
+
+	bfqg = bfqio_lookup_group(bgrp, bfqd);
+	if (bfqg != NULL)
+		return bfqg;
+
+	bfqg = bfq_group_chain_alloc(bfqd, css);
+	if (bfqg != NULL)
+		bfq_group_chain_link(bfqd, css, bfqg);
+	else
+		bfqg = bfqd->root_group;
+
+	return bfqg;
+}
+
+/**
+ * bfq_bfqq_move - migrate @bfqq to @bfqg.
+ * @bfqd: queue descriptor.
+ * @bfqq: the queue to move.
+ * @entity: @bfqq's entity.
+ * @bfqg: the group to move to.
+ *
+ * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
+ * it on the new one.  Avoid putting the entity on the old group idle tree.
+ *
+ * Must be called under the queue lock; the cgroup owning @bfqg must
+ * not disappear (by now this just means that we are called under
+ * rcu_read_lock()).
+ */
+static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			  struct bfq_entity *entity, struct bfq_group *bfqg)
+{
+	int busy, resume;
+
+	busy = bfq_bfqq_busy(bfqq);
+	resume = !RB_EMPTY_ROOT(&bfqq->sort_list);
+
+	if (busy) {
+		if (!resume)
+			bfq_del_bfqq_busy(bfqd, bfqq, 0);
+		else
+			bfq_deactivate_bfqq(bfqd, bfqq, 0);
+	} else if (entity->on_st)
+		bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
+
+	/*
+	 * Here we use a reference to bfqg.  We don't need a refcounter
+	 * as the cgroup reference will not be dropped, so that its
+	 * destroy() callback will not be invoked.
+	 */
+	entity->parent = bfqg->my_entity;
+	entity->sched_data = &bfqg->sched_data;
+
+	if (busy && resume)
+		bfq_activate_bfqq(bfqd, bfqq);
+
+	if (bfqd->in_service_queue == NULL && !bfqd->rq_in_driver)
+		bfq_schedule_dispatch(bfqd);
+}
+
+/**
+ * __bfq_bic_change_cgroup - move @bic to @cgroup.
+ * @bfqd: the queue descriptor.
+ * @bic: the bic to move.
+ * @cgroup: the cgroup to move to.
+ *
+ * Move bic to cgroup, assuming that bfqd->queue is locked; the caller
+ * has to make sure that the reference to cgroup is valid across the call.
+ *
+ * NOTE: an alternative approach might have been to store the current
+ * cgroup in bfqq and getting a reference to it, reducing the lookup
+ * time here, at the price of slightly more complex code.
+ */
+static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
+						struct bfq_io_cq *bic,
+						struct cgroup_subsys_state *css)
+{
+	struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0);
+	struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1);
+	struct bfq_entity *entity;
+	struct bfq_group *bfqg;
+	struct bfqio_cgroup *bgrp;
+
+	bgrp = css_to_bfqio(css);
+
+	bfqg = bfq_find_alloc_group(bfqd, css);
+	if (async_bfqq != NULL) {
+		entity = &async_bfqq->entity;
+
+		if (entity->sched_data != &bfqg->sched_data) {
+			bic_set_bfqq(bic, NULL, 0);
+			bfq_log_bfqq(bfqd, async_bfqq,
+				     "bic_change_group: %p %d",
+				     async_bfqq, atomic_read(&async_bfqq->ref));
+			bfq_put_queue(async_bfqq);
+		}
+	}
+
+	if (sync_bfqq != NULL) {
+		entity = &sync_bfqq->entity;
+		if (entity->sched_data != &bfqg->sched_data)
+			bfq_bfqq_move(bfqd, sync_bfqq, entity, bfqg);
+	}
+
+	return bfqg;
+}
+
+/**
+ * bfq_bic_change_cgroup - move @bic to @cgroup.
+ * @bic: the bic being migrated.
+ * @cgroup: the destination cgroup.
+ *
+ * When the task owning @bic is moved to @cgroup, @bic is immediately
+ * moved into its new parent group.
+ */
+static void bfq_bic_change_cgroup(struct bfq_io_cq *bic,
+				  struct cgroup_subsys_state *css)
+{
+	struct bfq_data *bfqd;
+	unsigned long uninitialized_var(flags);
+
+	bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
+				   &flags);
+	if (bfqd != NULL) {
+		__bfq_bic_change_cgroup(bfqd, bic, css);
+		bfq_put_bfqd_unlock(bfqd, &flags);
+	}
+}
+
+/**
+ * bfq_bic_update_cgroup - update the cgroup of @bic.
+ * @bic: the @bic to update.
+ *
+ * Make sure that @bic is enqueued in the cgroup of the current task.
+ * We need this in addition to moving bics during the cgroup attach
+ * phase because the task owning @bic could be at its first disk
+ * access, or we may have ended up in the root cgroup as the result
+ * of a memory allocation failure; here we try to move to the right
+ * group.
+ *
+ * Must be called under the queue lock.  It is safe to use the returned
+ * value even after the rcu_read_unlock() as the migration/destruction
+ * paths act under the queue lock too.  IOW it is impossible to race with
+ * group migration/destruction and end up with an invalid group as:
+ *   a) here cgroup has not yet been destroyed, nor its destroy callback
+ *      has started execution, as current holds a reference to it,
+ *   b) if it is destroyed after rcu_read_unlock() [after current is
+ *      migrated to a different cgroup] its attach() callback will have
+ *      taken care of removing all the references to the old cgroup data.
+ */
+static struct bfq_group *bfq_bic_update_cgroup(struct bfq_io_cq *bic)
+{
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+	struct bfq_group *bfqg;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+	css = task_css(current, bfqio_cgrp_id);
+	bfqg = __bfq_bic_change_cgroup(bfqd, bic, css);
+	rcu_read_unlock();
+
+	return bfqg;
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+static inline void bfq_flush_idle_tree(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entity = st->first_idle;
+
+	for (; entity != NULL; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
+
+/**
+ * bfq_reparent_leaf_entity - move leaf entity to the root_group.
+ * @bfqd: the device data structure with the root group.
+ * @entity: the entity to move.
+ */
+static inline void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
+					    struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	bfq_bfqq_move(bfqd, bfqq, entity, bfqd->root_group);
+	return;
+}
+
+/**
+ * bfq_reparent_active_entities - move to the root group all active
+ *                                entities.
+ * @bfqd: the device data structure with the root group.
+ * @bfqg: the group to move from.
+ * @st: the service tree with the entities.
+ *
+ * Needs queue_lock to be taken and reference to be valid over the call.
+ */
+static inline void bfq_reparent_active_entities(struct bfq_data *bfqd,
+						struct bfq_group *bfqg,
+						struct bfq_service_tree *st)
+{
+	struct rb_root *active = &st->active;
+	struct bfq_entity *entity = NULL;
+
+	if (!RB_EMPTY_ROOT(&st->active))
+		entity = bfq_entity_of(rb_first(active));
+
+	for (; entity != NULL; entity = bfq_entity_of(rb_first(active)))
+		bfq_reparent_leaf_entity(bfqd, entity);
+
+	if (bfqg->sched_data.in_service_entity != NULL)
+		bfq_reparent_leaf_entity(bfqd,
+			bfqg->sched_data.in_service_entity);
+
+	return;
+}
+
+/**
+ * bfq_destroy_group - destroy @bfqg.
+ * @bgrp: the bfqio_cgroup containing @bfqg.
+ * @bfqg: the group being destroyed.
+ *
+ * Destroy @bfqg, making sure that it is not referenced from its parent.
+ */
+static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg)
+{
+	struct bfq_data *bfqd;
+	struct bfq_service_tree *st;
+	struct bfq_entity *entity = bfqg->my_entity;
+	unsigned long uninitialized_var(flags);
+	int i;
+
+	hlist_del(&bfqg->group_node);
+
+	/*
+	 * Empty all service_trees belonging to this group before
+	 * deactivating the group itself.
+	 */
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
+		st = bfqg->sched_data.service_tree + i;
+
+		/*
+		 * The idle tree may still contain bfq_queues belonging
+		 * to exited tasks because they never migrated to a different
+		 * cgroup from the one being destroyed now.  No one else
+		 * can access them so it's safe to act without any lock.
+		 */
+		bfq_flush_idle_tree(st);
+
+		/*
+		 * It may happen that some queues are still active
+		 * (busy) upon group destruction (if the corresponding
+		 * processes have been forced to terminate). We move
+		 * all the leaf entities corresponding to these queues
+		 * to the root_group.
+		 * Also, it may happen that the group has an entity
+		 * in service, which is disconnected from the active
+		 * tree: it must be moved, too.
+		 * There is no need to put the sync queues, as the
+		 * scheduler has taken no reference.
+		 */
+		bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
+		if (bfqd != NULL) {
+			bfq_reparent_active_entities(bfqd, bfqg, st);
+			bfq_put_bfqd_unlock(bfqd, &flags);
+		}
+	}
+
+	/*
+	 * We may race with device destruction, take extra care when
+	 * dereferencing bfqg->bfqd.
+	 */
+	bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
+	if (bfqd != NULL) {
+		hlist_del(&bfqg->bfqd_node);
+		__bfq_deactivate_entity(entity, 0);
+		bfq_put_async_queues(bfqd, bfqg);
+		bfq_put_bfqd_unlock(bfqd, &flags);
+	}
+
+	/*
+	 * No need to defer the kfree() to the end of the RCU grace
+	 * period: we are called from the destroy() callback of our
+	 * cgroup, so we can be sure that no one is a) still using
+	 * this cgroup or b) doing lookups in it.
+	 */
+	kfree(bfqg);
+}
+
+/**
+ * bfq_disconnect_groups - disconnect @bfqd from all its groups.
+ * @bfqd: the device descriptor being exited.
+ *
+ * When the device exits we just make sure that no lookup can return
+ * the now unused group structures.  They will be deallocated on cgroup
+ * destruction.
+ */
+static void bfq_disconnect_groups(struct bfq_data *bfqd)
+{
+	struct hlist_node *tmp;
+	struct bfq_group *bfqg;
+
+	bfq_log(bfqd, "disconnect_groups beginning");
+	hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node) {
+		hlist_del(&bfqg->bfqd_node);
+
+		__bfq_deactivate_entity(bfqg->my_entity, 0);
+
+		/*
+		 * Don't remove from the group hash, just set an
+		 * invalid key.  No lookups can race with the
+		 * assignment as bfqd is being destroyed; this
+		 * implies also that new elements cannot be added
+		 * to the list.
+		 */
+		rcu_assign_pointer(bfqg->bfqd, NULL);
+
+		bfq_log(bfqd, "disconnect_groups: put async for group %p",
+			bfqg);
+		bfq_put_async_queues(bfqd, bfqg);
+	}
+}
+
+static inline void bfq_free_root_group(struct bfq_data *bfqd)
+{
+	struct bfqio_cgroup *bgrp = &bfqio_root_cgroup;
+	struct bfq_group *bfqg = bfqd->root_group;
+
+	bfq_put_async_queues(bfqd, bfqg);
+
+	spin_lock_irq(&bgrp->lock);
+	hlist_del_rcu(&bfqg->group_node);
+	spin_unlock_irq(&bgrp->lock);
+
+	/*
+	 * No need to synchronize_rcu() here: since the device is gone
+	 * there cannot be any read-side access to its root_group.
+	 */
+	kfree(bfqg);
+}
+
+static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
+{
+	struct bfq_group *bfqg;
+	struct bfqio_cgroup *bgrp;
+	int i;
+
+	bfqg = kzalloc_node(sizeof(*bfqg), GFP_KERNEL, node);
+	if (bfqg == NULL)
+		return NULL;
+
+	bfqg->entity.parent = NULL;
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+		bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+
+	bgrp = &bfqio_root_cgroup;
+	spin_lock_irq(&bgrp->lock);
+	rcu_assign_pointer(bfqg->bfqd, bfqd);
+	hlist_add_head_rcu(&bfqg->group_node, &bgrp->group_data);
+	spin_unlock_irq(&bgrp->lock);
+
+	return bfqg;
+}
+
+#define SHOW_FUNCTION(__VAR)						\
+static u64 bfqio_cgroup_##__VAR##_read(struct cgroup_subsys_state *css, \
+				       struct cftype *cftype)		\
+{									\
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);			\
+	u64 ret = -ENODEV;						\
+									\
+	mutex_lock(&bfqio_mutex);					\
+	if (bfqio_is_removed(bgrp))					\
+		goto out_unlock;					\
+									\
+	spin_lock_irq(&bgrp->lock);					\
+	ret = bgrp->__VAR;						\
+	spin_unlock_irq(&bgrp->lock);					\
+									\
+out_unlock:								\
+	mutex_unlock(&bfqio_mutex);					\
+	return ret;							\
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
+static int bfqio_cgroup_##__VAR##_write(struct cgroup_subsys_state *css,\
+					struct cftype *cftype,		\
+					u64 val)			\
+{									\
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);			\
+	struct bfq_group *bfqg;						\
+	int ret = -EINVAL;						\
+									\
+	if (val < (__MIN) || val > (__MAX))				\
+		return ret;						\
+									\
+	ret = -ENODEV;							\
+	mutex_lock(&bfqio_mutex);					\
+	if (bfqio_is_removed(bgrp))					\
+		goto out_unlock;					\
+	ret = 0;							\
+									\
+	spin_lock_irq(&bgrp->lock);					\
+	bgrp->__VAR = (unsigned short)val;				\
+	hlist_for_each_entry(bfqg, &bgrp->group_data, group_node) {	\
+		/*							\
+		 * Setting the ioprio_changed flag of the entity        \
+		 * to 1 with new_##__VAR == ##__VAR would re-set        \
+		 * the value of the weight to its ioprio mapping.       \
+		 * Set the flag only if necessary.			\
+		 */							\
+		if ((unsigned short)val != bfqg->entity.new_##__VAR) {  \
+			bfqg->entity.new_##__VAR = (unsigned short)val; \
+			/*						\
+			 * Make sure that the above new value has been	\
+			 * stored in bfqg->entity.new_##__VAR before	\
+			 * setting the ioprio_changed flag. In fact,	\
+			 * this flag may be read asynchronously (in	\
+			 * critical sections protected by a different	\
+			 * lock than that held here), and finding this	\
+			 * flag set may cause the execution of the code	\
+			 * for updating parameters whose value may	\
+			 * depend also on bfqg->entity.new_##__VAR (in	\
+			 * __bfq_entity_update_weight_prio).		\
+			 * This barrier makes sure that the new value	\
+			 * of bfqg->entity.new_##__VAR is correctly	\
+			 * seen in that code.				\
+			 */						\
+			smp_wmb();                                      \
+			bfqg->entity.ioprio_changed = 1;                \
+		}							\
+	}								\
+	spin_unlock_irq(&bgrp->lock);					\
+									\
+out_unlock:								\
+	mutex_unlock(&bfqio_mutex);					\
+	return ret;							\
+}
+
+STORE_FUNCTION(weight, BFQ_MIN_WEIGHT, BFQ_MAX_WEIGHT);
+STORE_FUNCTION(ioprio, 0, IOPRIO_BE_NR - 1);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+static struct cftype bfqio_files[] = {
+	{
+		.name = "weight",
+		.read_u64 = bfqio_cgroup_weight_read,
+		.write_u64 = bfqio_cgroup_weight_write,
+	},
+	{
+		.name = "ioprio",
+		.read_u64 = bfqio_cgroup_ioprio_read,
+		.write_u64 = bfqio_cgroup_ioprio_write,
+	},
+	{
+		.name = "ioprio_class",
+		.read_u64 = bfqio_cgroup_ioprio_class_read,
+		.write_u64 = bfqio_cgroup_ioprio_class_write,
+	},
+	{ },	/* terminate */
+};
+
+static struct cgroup_subsys_state *bfqio_create(struct cgroup_subsys_state
+						*parent_css)
+{
+	struct bfqio_cgroup *bgrp;
+
+	if (parent_css != NULL) {
+		bgrp = kzalloc(sizeof(*bgrp), GFP_KERNEL);
+		if (bgrp == NULL)
+			return ERR_PTR(-ENOMEM);
+	} else
+		bgrp = &bfqio_root_cgroup;
+
+	spin_lock_init(&bgrp->lock);
+	INIT_HLIST_HEAD(&bgrp->group_data);
+	bgrp->ioprio = BFQ_DEFAULT_GRP_IOPRIO;
+	bgrp->ioprio_class = BFQ_DEFAULT_GRP_CLASS;
+
+	return &bgrp->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main bic/bfqq data structures.  For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+static int bfqio_can_attach(struct cgroup_subsys_state *css,
+			    struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct io_context *ioc;
+	int ret = 0;
+
+	cgroup_taskset_for_each(task, tset) {
+		/*
+		 * task_lock() is needed to avoid races with
+		 * exit_io_context()
+		 */
+		task_lock(task);
+		ioc = task->io_context;
+		if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+			/*
+			 * ioc == NULL means that the task is either too
+			 * young or exiting: if it still has no ioc, the
+			 * ioc can't be shared; if the task is exiting, the
+			 * attach will fail anyway, no matter what we
+			 * return here.
+			 */
+			ret = -EINVAL;
+		task_unlock(task);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static void bfqio_attach(struct cgroup_subsys_state *css,
+			 struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct io_context *ioc;
+	struct io_cq *icq;
+
+	/*
+	 * IMPORTANT NOTE: The move of more than one process at a time to a
+	 * new group has not yet been tested.
+	 */
+	cgroup_taskset_for_each(task, tset) {
+		ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
+		if (ioc) {
+			/*
+			 * Handle cgroup change here.
+			 */
+			rcu_read_lock();
+			hlist_for_each_entry_rcu(icq, &ioc->icq_list, ioc_node)
+				if (!strncmp(
+					icq->q->elevator->type->elevator_name,
+					"bfq", ELV_NAME_MAX))
+					bfq_bic_change_cgroup(icq_to_bic(icq),
+							      css);
+			rcu_read_unlock();
+			put_io_context(ioc);
+		}
+	}
+}
+
+static void bfqio_destroy(struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+	struct hlist_node *tmp;
+	struct bfq_group *bfqg;
+
+	/*
+	 * Since we are destroying the cgroup, there are no more tasks
+	 * referencing it, and all the RCU grace periods that may have
+	 * referenced it are ended (as the destruction of the parent
+	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
+	 * anything else and we don't need any synchronization.
+	 */
+	hlist_for_each_entry_safe(bfqg, tmp, &bgrp->group_data, group_node)
+		bfq_destroy_group(bgrp, bfqg);
+
+	kfree(bgrp);
+}
+
+static int bfqio_css_online(struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+
+	mutex_lock(&bfqio_mutex);
+	bgrp->online = true;
+	mutex_unlock(&bfqio_mutex);
+
+	return 0;
+}
+
+static void bfqio_css_offline(struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+
+	mutex_lock(&bfqio_mutex);
+	bgrp->online = false;
+	mutex_unlock(&bfqio_mutex);
+}
+
+struct cgroup_subsys bfqio_cgrp_subsys = {
+	.css_alloc = bfqio_create,
+	.css_online = bfqio_css_online,
+	.css_offline = bfqio_css_offline,
+	.can_attach = bfqio_can_attach,
+	.attach = bfqio_attach,
+	.css_free = bfqio_destroy,
+	.base_cftypes = bfqio_files,
+};
+#else
+static inline void bfq_init_entity(struct bfq_entity *entity,
+				   struct bfq_group *bfqg)
+{
+	entity->weight = entity->new_weight;
+	entity->orig_weight = entity->new_weight;
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &bfqg->sched_data;
+}
+
+static inline struct bfq_group *
+bfq_bic_update_cgroup(struct bfq_io_cq *bic)
+{
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+	return bfqd->root_group;
+}
+
+static inline void bfq_bfqq_move(struct bfq_data *bfqd,
+				 struct bfq_queue *bfqq,
+				 struct bfq_entity *entity,
+				 struct bfq_group *bfqg)
+{
+}
+
+static inline void bfq_disconnect_groups(struct bfq_data *bfqd)
+{
+	bfq_put_async_queues(bfqd, bfqd->root_group);
+}
+
+static inline void bfq_free_root_group(struct bfq_data *bfqd)
+{
+	kfree(bfqd->root_group);
+}
+
+static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
+{
+	struct bfq_group *bfqg;
+	int i;
+
+	bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
+	if (bfqg == NULL)
+		return NULL;
+
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+		bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+
+	return bfqg;
+}
+#endif
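
Aside, purely illustrative: the bfq_find_alloc_group() comment above
speaks of compensating a fallback to the root group with the
"equivalent weight", i.e. the product of the weights along the path to
the root. The underlying arithmetic is plain hierarchical proportional
share: roughly, ignoring idling and the heuristics added later in the
series, a continuously backlogged queue receives the product of its
weight fractions at each level of the hierarchy. A small sketch of
that computation (all numbers made up):

#include <stdio.h>

/* fraction of the parent's service that one sibling receives */
static double share(double weight, double sibling_total)
{
	return weight / sibling_total;
}

int main(void)
{
	/* two groups under the root: A (weight 100) and B (weight 200) */
	double group_a = share(100, 100 + 200);		/* 1/3 of the device */
	double group_b = share(200, 100 + 200);		/* 2/3 of the device */

	/* inside B, two backlogged queues with weights 100 and 300 */
	double q1 = group_b * share(100, 100 + 300);	/* 2/3 * 1/4 = 1/6 */
	double q2 = group_b * share(300, 100 + 300);	/* 2/3 * 3/4 = 1/2 */

	printf("A=%.3f B=%.3f q1=%.3f q2=%.3f\n", group_a, group_b, q1, q2);
	return 0;
}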
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 01a98be..b2cbfce 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -66,14 +66,6 @@
 #include "bfq.h"
 #include "blk.h"
 
-/*
- * Array of async queues for all the processes, one queue
- * per ioprio value per ioprio_class.
- */
-struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
-/* Async queue for the idle class (ioprio is ignored) */
-struct bfq_queue *async_idle_bfqq;
-
 /* Max number of dispatches in one round of service. */
 static const int bfq_quantum = 4;
 
@@ -128,6 +120,7 @@ static inline void bfq_schedule_dispatch(struct bfq_data *bfqd);
 
 #include "bfq-ioc.c"
 #include "bfq-sched.c"
+#include "bfq-cgroup.c"
 
 #define bfq_class_idle(bfqq)	((bfqq)->entity.ioprio_class ==\
 				 IOPRIO_CLASS_IDLE)
@@ -1359,6 +1352,7 @@ static void bfq_changed_ioprio(struct bfq_io_cq *bic)
 {
 	struct bfq_data *bfqd;
 	struct bfq_queue *bfqq, *new_bfqq;
+	struct bfq_group *bfqg;
 	unsigned long uninitialized_var(flags);
 	int ioprio = bic->icq.ioc->ioprio;
 
@@ -1373,7 +1367,9 @@ static void bfq_changed_ioprio(struct bfq_io_cq *bic)
 
 	bfqq = bic->bfqq[BLK_RW_ASYNC];
 	if (bfqq != NULL) {
-		new_bfqq = bfq_get_queue(bfqd, BLK_RW_ASYNC, bic,
+		bfqg = container_of(bfqq->entity.sched_data, struct bfq_group,
+				    sched_data);
+		new_bfqq = bfq_get_queue(bfqd, bfqg, BLK_RW_ASYNC, bic,
 					 GFP_ATOMIC);
 		if (new_bfqq != NULL) {
 			bic->bfqq[BLK_RW_ASYNC] = new_bfqq;
@@ -1417,6 +1413,7 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 }
 
 static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
+					      struct bfq_group *bfqg,
 					      int is_sync,
 					      struct bfq_io_cq *bic,
 					      gfp_t gfp_mask)
@@ -1459,6 +1456,7 @@ retry:
 		}
 
 		bfq_init_prio_data(bfqq, bic);
+		bfq_init_entity(&bfqq->entity, bfqg);
 	}
 
 	if (new_bfqq != NULL)
@@ -1468,26 +1466,27 @@ retry:
 }
 
 static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
+					       struct bfq_group *bfqg,
 					       int ioprio_class, int ioprio)
 {
 	switch (ioprio_class) {
 	case IOPRIO_CLASS_RT:
-		return &async_bfqq[0][ioprio];
+		return &bfqg->async_bfqq[0][ioprio];
 	case IOPRIO_CLASS_NONE:
 		ioprio = IOPRIO_NORM;
 		/* fall through */
 	case IOPRIO_CLASS_BE:
-		return &async_bfqq[1][ioprio];
+		return &bfqg->async_bfqq[1][ioprio];
 	case IOPRIO_CLASS_IDLE:
-		return &async_idle_bfqq;
+		return &bfqg->async_idle_bfqq;
 	default:
 		BUG();
 	}
 }
 
 static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
-				       int is_sync, struct bfq_io_cq *bic,
-				       gfp_t gfp_mask)
+				       struct bfq_group *bfqg, int is_sync,
+				       struct bfq_io_cq *bic, gfp_t gfp_mask)
 {
 	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
 	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
@@ -1495,12 +1494,13 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 	struct bfq_queue *bfqq = NULL;
 
 	if (!is_sync) {
-		async_bfqq = bfq_async_queue_prio(bfqd, ioprio_class, ioprio);
+		async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
+						  ioprio);
 		bfqq = *async_bfqq;
 	}
 
 	if (bfqq == NULL)
-		bfqq = bfq_find_alloc_queue(bfqd, is_sync, bic, gfp_mask);
+		bfqq = bfq_find_alloc_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
 
 	/*
 	 * Pin the queue now that it's allocated, scheduler exit will
@@ -1830,6 +1830,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	const int rw = rq_data_dir(rq);
 	const int is_sync = rq_is_sync(rq);
 	struct bfq_queue *bfqq;
+	struct bfq_group *bfqg;
 	unsigned long flags;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
@@ -1841,9 +1842,11 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	if (bic == NULL)
 		goto queue_fail;
 
+	bfqg = bfq_bic_update_cgroup(bic);
+
 	bfqq = bic_to_bfqq(bic, is_sync);
 	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
-		bfqq = bfq_get_queue(bfqd, is_sync, bic, gfp_mask);
+		bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
 		bic_set_bfqq(bic, bfqq, is_sync);
 	}
 
@@ -1937,10 +1940,12 @@ static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
 static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
 					struct bfq_queue **bfqq_ptr)
 {
+	struct bfq_group *root_group = bfqd->root_group;
 	struct bfq_queue *bfqq = *bfqq_ptr;
 
 	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
 	if (bfqq != NULL) {
+		bfq_bfqq_move(bfqd, bfqq, &bfqq->entity, root_group);
 		bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
 			     bfqq, atomic_read(&bfqq->ref));
 		bfq_put_queue(bfqq);
@@ -1949,18 +1954,20 @@ static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
 }
 
 /*
- * Release the extra reference of the async queues as the device
- * goes away.
+ * Release all the bfqg references to its async queues.  If we are
+ * deallocating the group these queues may still contain requests, so
+ * we reparent them to the root cgroup (i.e., the only one that will
+ * exist for sure until all the requests on a device are gone).
  */
-static void bfq_put_async_queues(struct bfq_data *bfqd)
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
 {
 	int i, j;
 
 	for (i = 0; i < 2; i++)
 		for (j = 0; j < IOPRIO_BE_NR; j++)
-			__bfq_put_async_bfqq(bfqd, &async_bfqq[i][j]);
+			__bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]);
 
-	__bfq_put_async_bfqq(bfqd, &async_idle_bfqq);
+	__bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq);
 }
 
 static void bfq_exit_queue(struct elevator_queue *e)
@@ -1976,18 +1983,20 @@ static void bfq_exit_queue(struct elevator_queue *e)
 	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
 		bfq_deactivate_bfqq(bfqd, bfqq, 0);
 
-	bfq_put_async_queues(bfqd);
+	bfq_disconnect_groups(bfqd);
 	spin_unlock_irq(q->queue_lock);
 
 	bfq_shutdown_timer_wq(bfqd);
 
 	synchronize_rcu();
 
+	bfq_free_root_group(bfqd);
 	kfree(bfqd);
 }
 
 static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 {
+	struct bfq_group *bfqg;
 	struct bfq_data *bfqd;
 	struct elevator_queue *eq;
 
@@ -2016,6 +2025,15 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	q->elevator = eq;
 	spin_unlock_irq(q->queue_lock);
 
+	bfqg = bfq_alloc_root_group(bfqd, q->node);
+	if (bfqg == NULL) {
+		kfree(bfqd);
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	bfqd->root_group = bfqg;
+
 	init_timer(&bfqd->idle_slice_timer);
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
@@ -2279,7 +2297,7 @@ static int __init bfq_init(void)
 		return -ENOMEM;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v0");
+	pr_info("BFQ I/O-scheduler version: v1");
 
 	return 0;
 }
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index a9142f5..8801b6c 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -8,6 +8,61 @@
  *		      Paolo Valente <paolo.valente@unimore.it>
  */
 
+#ifdef CONFIG_CGROUP_BFQIO
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
+						 int extract,
+						 struct bfq_data *bfqd);
+
+static inline void bfq_update_budget(struct bfq_entity *next_in_service)
+{
+	struct bfq_entity *bfqg_entity;
+	struct bfq_group *bfqg;
+	struct bfq_sched_data *group_sd;
+
+	group_sd = next_in_service->sched_data;
+
+	bfqg = container_of(group_sd, struct bfq_group, sched_data);
+	/*
+	 * bfq_group's my_entity field is not NULL only if the group
+	 * is not the root group. We must not touch the root entity
+	 * as it must never become an in-service entity.
+	 */
+	bfqg_entity = bfqg->my_entity;
+	if (bfqg_entity != NULL)
+		bfqg_entity->budget = next_in_service->budget;
+}
+
+static int bfq_update_next_in_service(struct bfq_sched_data *sd)
+{
+	struct bfq_entity *next_in_service;
+
+	if (sd->in_service_entity != NULL)
+		/* will update/requeue at the end of service */
+		return 0;
+
+	/*
+	 * NOTE: this can be improved in many ways, such as returning
+	 * 1 (and thus propagating upwards the update) only when the
+	 * budget changes, or caching the bfqq that will be scheduled
+	 * next from this subtree.  By now we worry more about
+	 * correctness than about performance...
+	 */
+	next_in_service = bfq_lookup_next_entity(sd, 0, NULL);
+	sd->next_in_service = next_in_service;
+
+	if (next_in_service != NULL)
+		bfq_update_budget(next_in_service);
+
+	return 1;
+}
+
+#else
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
 
@@ -19,14 +74,10 @@ static inline int bfq_update_next_in_service(struct bfq_sched_data *sd)
 	return 0;
 }
 
-static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
-					     struct bfq_entity *entity)
-{
-}
-
 static inline void bfq_update_budget(struct bfq_entity *next_in_service)
 {
 }
+#endif
 
 /*
  * Shift for timestamp calculations.  This actually limits the maximum
@@ -842,7 +893,6 @@ static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
 		entity = __bfq_lookup_next_entity(st + i, false);
 		if (entity != NULL) {
 			if (extract) {
-				bfq_check_next_in_service(sd, entity);
 				bfq_active_extract(st + i, entity);
 				sd->in_service_entity = entity;
 				sd->next_in_service = NULL;
@@ -866,7 +916,7 @@ static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
 	if (bfqd->busy_queues == 0)
 		return NULL;
 
-	sd = &bfqd->sched_data;
+	sd = &bfqd->root_group->sched_data;
 	for (; sd != NULL; sd = entity->my_sched_data) {
 		entity = bfq_lookup_next_entity(sd, 1, bfqd);
 		entity->service = 0;
diff --git a/block/bfq.h b/block/bfq.h
index bd146b6..b982567 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v0 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v1 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
@@ -92,7 +92,7 @@ struct bfq_sched_data {
  * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
  * @weight: weight of the queue
  * @parent: parent entity, for hierarchical scheduling.
- * @my_sched_data: for non-leaf nodes in the hierarchy, the
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
  *                 associated scheduler queue, %NULL on leaf nodes.
  * @sched_data: the scheduler queue this entity belongs to.
  * @ioprio: the ioprio in use.
@@ -105,10 +105,11 @@ struct bfq_sched_data {
  * @ioprio_changed: flag, true when the user requested a weight, ioprio or
  *                  ioprio_class change.
  *
- * A bfq_entity is used to represent a bfq_queue (leaf node in the upper
- * level scheduler). Each entity belongs to the sched_data of the parent
- * group hierarchy. Non-leaf entities have also their own sched_data,
- * stored in @my_sched_data.
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
+ * cgroup hierarchy) or a bfq_group into the upper level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities have also their own sched_data, stored
+ * in @my_sched_data.
  *
  * Each entity stores independently its priority values; this would
  * allow different weights on different devices, but this
@@ -119,13 +120,14 @@ struct bfq_sched_data {
  * update to take place the effective and the requested priority
  * values are synchronized.
  *
- * The weight value is calculated from the ioprio to export the same
- * interface as CFQ.  When dealing with  ``well-behaved'' queues (i.e.,
- * queues that do not spend too much time to consume their budget
- * and have true sequential behavior, and when there are no external
- * factors breaking anticipation) the relative weights at each level
- * of the hierarchy should be guaranteed.  All the fields are
- * protected by the queue lock of the containing bfqd.
+ * Unless cgroups are used, the weight value is calculated from the
+ * ioprio to export the same interface as CFQ.  When dealing with
+ * ``well-behaved'' queues (i.e., queues that do not spend too much
+ * time to consume their budget and have true sequential behavior, and
+ * when there are no external factors breaking anticipation) the
+ * relative weights at each level of the cgroups hierarchy should be
+ * guaranteed.  All the fields are protected by the queue lock of the
+ * containing bfqd.
  */
 struct bfq_entity {
 	struct rb_node rb_node;
@@ -154,6 +156,8 @@ struct bfq_entity {
 	int ioprio_changed;
 };
 
+struct bfq_group;
+
 /**
  * struct bfq_queue - leaf schedulable entity.
  * @ref: reference counter.
@@ -177,7 +181,11 @@ struct bfq_entity {
  * @pid: pid of the process owning the queue, used for logging purposes.
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async.
+ * io_context or more, if it is async. @cgroup holds a reference to the
+ * cgroup, to be sure that it does not disappear while a bfqq still
+ * references it (mostly to avoid races between request issuing and task
+ * migration followed by cgroup destruction). All the fields are protected
+ * by the queue lock of the containing bfqd.
  */
 struct bfq_queue {
 	atomic_t ref;
@@ -244,7 +252,7 @@ enum bfq_device_speed {
 /**
  * struct bfq_data - per device data structure.
  * @queue: request queue for the managed device.
- * @sched_data: root @bfq_sched_data for the device.
+ * @root_group: root bfq_group for the device.
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @queued: number of queued requests.
@@ -267,6 +275,7 @@ enum bfq_device_speed {
  * @peak_rate_samples: number of samples used to calculate @peak_rate.
  * @bfq_max_budget: maximum budget allotted to a bfq_queue before
  *                  rescheduling.
+ * @group_list: list of all the bfq_groups active on the device.
  * @active_list: list of all the bfq_queues active on the device.
  * @idle_list: list of all the bfq_queues idle on the device.
  * @bfq_quantum: max number of requests dispatched per dispatch round.
@@ -293,7 +302,7 @@ enum bfq_device_speed {
 struct bfq_data {
 	struct request_queue *queue;
 
-	struct bfq_sched_data sched_data;
+	struct bfq_group *root_group;
 
 	int busy_queues;
 	int queued;
@@ -320,6 +329,7 @@ struct bfq_data {
 	u64 peak_rate;
 	unsigned long bfq_max_budget;
 
+	struct hlist_head group_list;
 	struct list_head active_list;
 	struct list_head idle_list;
 
@@ -390,6 +400,82 @@ enum bfqq_expiration {
 	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
 };
 
+#ifdef CONFIG_CGROUP_BFQIO
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both bfq_queues and bfq_groups).
+ * @group_node: node to be inserted into the bfqio_cgroup->group_data
+ *              list of the containing cgroup's bfqio_cgroup.
+ * @bfqd_node: node to be inserted into the @bfqd->group_list list
+ *             of the groups active on the same device; used for cleanup.
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/
+ *             migration.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @group_node is protected by the bfqio_cgroup lock, and is accessed
+ *      via RCU from its readers.
+ *    o @bfqd is protected by the queue lock, RCU is used to access it
+ *      from the readers.
+ *    o All the other fields are protected by the @bfqd queue lock.
+ */
+struct bfq_group {
+	struct bfq_entity entity;
+	struct bfq_sched_data sched_data;
+
+	struct hlist_node group_node;
+	struct hlist_node bfqd_node;
+
+	void *bfqd;
+
+	struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+	struct bfq_queue *async_idle_bfqq;
+
+	struct bfq_entity *my_entity;
+};
+
+/**
+ * struct bfqio_cgroup - bfq cgroup data structure.
+ * @css: subsystem state for bfq in the containing cgroup.
+ * @online: flag marked when the subsystem is inserted.
+ * @weight: cgroup weight.
+ * @ioprio: cgroup ioprio.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @ioprio, @ioprio_class and @group_data.
+ * @group_data: list containing the bfq_group belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates,
+ * @ioprio and @ioprio_class are protected by @lock.
+ */
+struct bfqio_cgroup {
+	struct cgroup_subsys_state css;
+	bool online;
+
+	unsigned short weight, ioprio, ioprio_class;
+
+	spinlock_t lock;
+	struct hlist_head group_data;
+};
+#else
+struct bfq_group {
+	struct bfq_sched_data sched_data;
+
+	struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+	struct bfq_queue *async_idle_bfqq;
+};
+#endif
+
 static inline struct bfq_service_tree *
 bfq_entity_service_tree(struct bfq_entity *entity)
 {
@@ -460,8 +546,10 @@ static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd,
 static void bfq_changed_ioprio(struct bfq_io_cq *bic);
 static void bfq_put_queue(struct bfq_queue *bfqq);
 static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
-static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, int is_sync,
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
+				       struct bfq_group *bfqg, int is_sync,
 				       struct bfq_io_cq *bic, gfp_t gfp_mask);
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 
 #endif /* _BFQ_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 768fe44..cdd2528 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -39,6 +39,10 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
+SUBSYS(bfqio)
+#endif
+
 #if IS_ENABLED(CONFIG_CGROUP_PERF)
 SUBSYS(perf_event)
 #endif
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 05/14] block, bfq: improve throughput boosting
@ 2014-05-27 12:42     ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Paolo Valente <paolo.valente@unimore.it>

The feedback-loop algorithm used by BFQ to compute queue (process)
budgets is basically a set of three update rules, one for each of the
main reasons why a queue may be expired. If many processes suddenly
switch from sporadic I/O to greedy and sequential I/O, then these
rules are quite slow to assign large budgets to these processes, and
hence to achieve a high throughput. On the opposite side, BFQ assigns
the maximum possible budget B_max to a just-created queue. This allows
a high throughput to be achieved immediately if the associated process
is I/O-bound and performs sequential I/O from the beginning. But it
also increases the worst-case latency experienced by the first
requests issued by the process, because the larger the budget of a
queue waiting for service is, the later the queue will be served by
B-WF2Q+ (Subsec 3.3 in [1]). This is detrimental for an interactive or
soft real-time application.

To tackle these throughput and latency problems, on one hand this
patch changes the initial budget value to B_max/2. On the other hand,
it re-tunes the three rules, adopting a more aggressive,
multiplicative increase/linear decrease scheme. This scheme trades
latency for throughput more than before, and tends to assign large
budgets quickly to processes that are or become I/O-bound. For two of
the expiration reasons, the new version of the rules also contains
some further small improvements, briefly described below.

*No more backlog.* In this case, the budget was larger than the number
of sectors actually read/written by the process before it stopped
doing I/O. Hence, to reduce latency for the possible future I/O
requests of the process, the old rule simply set the next budget to
the number of sectors actually consumed by the process. However, if
there are still outstanding requests, then the process may not yet
have issued its next request simply because it is still waiting for
the completion of some of those outstanding ones. If this sub-case
holds true, then the new rule, instead of decreasing the budget,
doubles it, proactively, in the hope that: 1) a larger budget will fit
the actual needs of the process, and 2) the process is sequential and
hence a higher throughput will be achieved by serving the process
longer after granting it access to the device.

*Budget timeout*. The original rule set the new budget to the maximum
value B_max, to maximize throughput and let all processes experiencing
budget timeouts receive the same share of the device time. In our
experiments we verified that this sudden jump to B_max did not provide
appreciable benefits; rather, it increased the latency of processes
performing sporadic and short I/O. The new rule only doubles the
budget.
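
For the record, the three re-tuned rules boil down to the scheme
below. This is only a compact, user-space sketch of the logic that the
diff adds to __bfq_bfqq_recalc_budget(); it is not the actual kernel
code, and the enum just mirrors (part of) bfqq_expiration in bfq.h:

    /* Sketch of the new budget-update rules (not the kernel code). */
    enum bfqq_expiration {
            BFQ_BFQQ_TOO_IDLE,
            BFQ_BFQQ_BUDGET_TIMEOUT,
            BFQ_BFQQ_BUDGET_EXHAUSTED,
    };

    static unsigned long next_budget(enum bfqq_expiration reason,
                                     int reqs_outstanding,
                                     unsigned long budget,
                                     unsigned long min_budget,
                                     unsigned long max_budget)
    {
            switch (reason) {
            case BFQ_BFQQ_TOO_IDLE:
                    if (reqs_outstanding) /* proactively double */
                            return budget * 2 < max_budget ?
                                   budget * 2 : max_budget;
                    /* linear decrease, 4 * min_budget at a time */
                    return budget > 5 * min_budget ?
                           budget - 4 * min_budget : min_budget;
            case BFQ_BFQQ_BUDGET_TIMEOUT: /* double, no jump to B_max */
                    return budget * 2 < max_budget ?
                           budget * 2 : max_budget;
            case BFQ_BFQQ_BUDGET_EXHAUSTED: /* multiplicative increase */
                    return budget * 4 < max_budget ?
                           budget * 4 : max_budget;
            }
            return budget;
    }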

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-cgroup.c  |  2 ++
 block/bfq-ioc.c     |  2 ++
 block/bfq-iosched.c | 88 ++++++++++++++++++++++++++++-------------------------
 block/bfq-sched.c   |  2 ++
 block/bfq.h         |  2 ++
 5 files changed, 55 insertions(+), 41 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 00a7a1b..805fe5e 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -7,6 +7,8 @@
  * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
  *		      Paolo Valente <paolo.valente@unimore.it>
  *
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
+ *
  * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
  * file.
  */
diff --git a/block/bfq-ioc.c b/block/bfq-ioc.c
index adfb5a1..7f6b000 100644
--- a/block/bfq-ioc.c
+++ b/block/bfq-ioc.c
@@ -6,6 +6,8 @@
  *
  * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
  *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
  */
 
 /**
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b2cbfce..49ff1da 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7,6 +7,8 @@
  * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
  *		      Paolo Valente <paolo.valente@unimore.it>
  *
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
+ *
  * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
  * file.
  *
@@ -101,9 +103,6 @@ struct kmem_cache *bfq_pool;
 #define BFQQ_SEEK_THR	 (sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)
 
-/* Budget feedback step. */
-#define BFQ_BUDGET_STEP         128
-
 /* Min samples used for peak rate estimation (for autotuning). */
 #define BFQ_PEAK_RATE_SAMPLES	32
 
@@ -537,36 +536,6 @@ static inline unsigned long bfq_max_budget(struct bfq_data *bfqd)
 		return bfqd->bfq_max_budget;
 }
 
- /*
- * bfq_default_budget - return the default budget for @bfqq on @bfqd.
- * @bfqd: the device descriptor.
- * @bfqq: the queue to consider.
- *
- * We use 3/4 of the @bfqd maximum budget as the default value
- * for the max_budget field of the queues.  This lets the feedback
- * mechanism to start from some middle ground, then the behavior
- * of the process will drive the heuristics towards high values, if
- * it behaves as a greedy sequential reader, or towards small values
- * if it shows a more intermittent behavior.
- */
-static unsigned long bfq_default_budget(struct bfq_data *bfqd,
-					struct bfq_queue *bfqq)
-{
-	unsigned long budget;
-
-	/*
-	 * When we need an estimate of the peak rate we need to avoid
-	 * to give budgets that are too short due to previous measurements.
-	 * So, in the first 10 assignments use a ``safe'' budget value.
-	 */
-	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
-		budget = bfq_default_max_budget;
-	else
-		budget = bfqd->bfq_max_budget;
-
-	return budget - budget / 4;
-}
-
 /*
  * Return min budget, which is a fraction of the current or default
  * max budget (trying with 1/32)
@@ -730,13 +699,51 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
 		 * for throughput.
 		 */
 		case BFQ_BFQQ_TOO_IDLE:
-			if (budget > min_budget + BFQ_BUDGET_STEP)
-				budget -= BFQ_BUDGET_STEP;
-			else
-				budget = min_budget;
+			/*
+			 * This is the only case where we may reduce
+			 * the budget: if there is no request of the
+			 * process still waiting for completion, then
+			 * we assume (tentatively) that the timer has
+			 * expired because the batch of requests of
+			 * the process could have been served with a
+			 * smaller budget.  Hence, betting that
+			 * process will behave in the same way when it
+			 * becomes backlogged again, we reduce its
+			 * next budget.  As long as we guess right,
+			 * this budget cut reduces the latency
+			 * experienced by the process.
+			 *
+			 * However, if there are still outstanding
+			 * requests, then the process may have not yet
+			 * issued its next request just because it is
+			 * still waiting for the completion of some of
+			 * the still outstanding ones.  So in this
+			 * subcase we do not reduce its budget, on the
+			 * contrary we increase it to possibly boost
+			 * the throughput, as discussed in the
+			 * comments to the BUDGET_TIMEOUT case.
+			 */
+			if (bfqq->dispatched > 0) /* still outstanding reqs */
+				budget = min(budget * 2, bfqd->bfq_max_budget);
+			else {
+				if (budget > 5 * min_budget)
+					budget -= 4 * min_budget;
+				else
+					budget = min_budget;
+			}
 			break;
 		case BFQ_BFQQ_BUDGET_TIMEOUT:
-			budget = bfq_default_budget(bfqd, bfqq);
+			/*
+			 * We double the budget here because: 1) it
+			 * gives the chance to boost the throughput if
+			 * this is not a seeky process (which may have
+			 * bumped into this timeout because of, e.g.,
+			 * ZBR), 2) together with charge_full_budget
+			 * it helps give seeky processes higher
+			 * timestamps, and hence be served less
+			 * frequently.
+			 */
+			budget = min(budget * 2, bfqd->bfq_max_budget);
 			break;
 		case BFQ_BFQQ_BUDGET_EXHAUSTED:
 			/*
@@ -748,8 +755,7 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
 			 * definitely increase the budget of this good
 			 * candidate to boost the disk throughput.
 			 */
-			budget = min(budget + 8 * BFQ_BUDGET_STEP,
-				     bfqd->bfq_max_budget);
+			budget = min(budget * 4, bfqd->bfq_max_budget);
 			break;
 		case BFQ_BFQQ_NO_MORE_REQUESTS:
 		       /*
@@ -1408,7 +1414,7 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	}
 
 	/* Tentative initial value to trade off between thr and lat */
-	bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
+	bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
 	bfqq->pid = pid;
 }
 
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 8801b6c..075e472 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -6,6 +6,8 @@
  *
  * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
  *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
  */
 
 #ifdef CONFIG_CGROUP_BFQIO
diff --git a/block/bfq.h b/block/bfq.h
index b982567..a334eb4 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -6,6 +6,8 @@
  *
  * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
  *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
  */
 
 #ifndef _BFQ_H
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 06/14] block, bfq: modify the peak-rate estimator
@ 2014-05-27 12:42     ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Paolo Valente <paolo.valente@unimore.it>

Unless the maximum budget B_max that BFQ can assign to a queue is set
explicitly by the user, BFQ automatically updates B_max. In
particular, BFQ dynamically sets B_max to the number of sectors that
can be read, at the current estimated peak rate, during the maximum
time, T_max, allowed before a budget timeout occurs. In formulas, if
we denote as R_est the estimated peak rate, then B_max = T_max ∗
R_est. Hence, the higher R_est is with respect to the actual disk peak
rate, the higher the probability that processes unjustly incur budget
timeouts. Moreover, an excessively high value of B_max unnecessarily
increases the deviation from an ideal, smooth service.
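
As a rough, hypothetical example of the scale involved: taking T_max
to be the sync budget timeout of HZ/8 jiffies (about 125 ms with
HZ=1000), an estimated peak rate R_est of 100 MB/s gives B_max =
T_max ∗ R_est ≈ 12.5 MB of service per budget; overestimating R_est
by a factor of two inflates B_max by the same factor.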

To filter out spikes, the estimated peak rate is updated only on the
expiration of queues that have been served for a long-enough time.  As
a first step, the estimator computes the device rate, R_meas, during
the service of the queue. After that, if R_est < R_meas, then R_est is
set to R_meas.

Unfortunately, our experiments highlighted the following two
problems. First, because of ZBR, depending on the locality of the
workload, the estimator may easily converge to a value that is
appropriate only for part of a disk. Second, R_est may jump to (and
remain stuck at) a value much higher than the actual device peak
rate, in case of hits in the drive cache, which may let sectors be
transferred at bus rate in practice.

To try to converge to the actual average peak rate over the disk
surface (in case of rotational devices), and to smooth the spikes
caused by the drive cache, this patch changes the estimator as
follows. In the description of the changes, we refer to a queue
containing random requests as 'seeky', according to the terminology
used in the code, and inherited from CFQ.

- First, R_est may now be updated also when the just-expired queue,
  despite not being detected as seeky, has nevertheless been unable to
  consume all of its budget within the maximum time slice T_max. In
  fact, this is an indication that B_max is too large. Since B_max =
  T_max ∗ R_est, R_est is then probably too large, and should be
  reduced.

- Second, to filter the spikes in R_meas, a discrete low-pass filter
  is now used to update R_est instead of just keeping the highest rate
  sampled. The rationale is that the average peak rate of a disk over
  its surface is a relatively stable quantity, hence a low-pass filter
  should converge more or less quickly to the right value.

With the current values of the constants used in the filter, the
latter seems to effectively smooth fluctuations and allow the
estimator to converge to the actual peak rate with all the devices we
tested.
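
In other words, the estimator now applies a discrete low-pass
(exponentially weighted) filter to the measured rate. A minimal
user-space sketch of the integer arithmetic performed in
bfq_update_peak_rate() (the do_div() calls in the actual code reduce
to plain divisions here):

    /* R_est = (7/8) * R_est + (1/8) * R_meas, in integer arithmetic. */
    static int update_peak_rate(unsigned long long *peak_rate,
                                unsigned long long bw)
    {
            bw /= 8;
            if (bw == 0)            /* sample too small to contribute */
                    return 0;
            *peak_rate = *peak_rate * 7 / 8 + bw;
            return 1;               /* estimate was updated */
    }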

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 49ff1da..2a4e03d 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -818,7 +818,7 @@ static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
  * throughput. See the code for more details.
  */
 static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-				int compensate)
+				int compensate, enum bfqq_expiration reason)
 {
 	u64 bw, usecs, expected, timeout;
 	ktime_t delta;
@@ -854,10 +854,23 @@ static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	 * the peak rate estimation.
 	 */
 	if (usecs > 20000) {
-		if (bw > bfqd->peak_rate) {
-			bfqd->peak_rate = bw;
+		if (bw > bfqd->peak_rate ||
+		   (!BFQQ_SEEKY(bfqq) &&
+		    reason == BFQ_BFQQ_BUDGET_TIMEOUT)) {
+			bfq_log(bfqd, "measured bw =%llu", bw);
+			/*
+			 * To smooth oscillations use a low-pass filter with
+			 * alpha=7/8, i.e.,
+			 * new_rate = (7/8) * old_rate + (1/8) * bw
+			 */
+			do_div(bw, 8);
+			if (bw == 0)
+				return 0;
+			bfqd->peak_rate *= 7;
+			do_div(bfqd->peak_rate, 8);
+			bfqd->peak_rate += bw;
 			update = 1;
-			bfq_log(bfqd, "new peak_rate=%llu", bw);
+			bfq_log(bfqd, "new peak_rate=%llu", bfqd->peak_rate);
 		}
 
 		update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
@@ -936,7 +949,7 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 	/* Update disk peak rate for autotuning and check whether the
 	 * process is slow (see bfq_update_peak_rate).
 	 */
-	slow = bfq_update_peak_rate(bfqd, bfqq, compensate);
+	slow = bfq_update_peak_rate(bfqd, bfqq, compensate, reason);
 
 	/*
 	 * As above explained, 'punish' slow (i.e., seeky), timed-out
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 07/14] block, bfq: add more fairness to boost throughput and reduce latency
@ 2014-05-27 12:42     ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Paolo Valente <paolo.valente@unimore.it>

We have found four sources of throughput loss and higher
latencies. First, write requests tend to starve read requests,
basically because, on the one hand, writes are slower than reads and,
on the other hand, storage devices confuse schedulers by deceptively
signaling the completion of write requests immediately after receiving
them. This patch addresses this issue by just throttling writes. In
particular, after a write request is dispatched for a queue, the
budget of the queue is decremented by the number of sectors to write,
multiplied by an (over)charge coefficient.

The current value of this coefficient, as well as the values of the
constants used in the other changes described below, is the result of
our tuning with different devices.
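
Concretely, the overcharge is applied in bfq_serv_to_charge(): an
async request (async queues are where most writes end up) is charged
(1 + bfq_async_charge_factor) times its size in sectors. A minimal
sketch of that computation, with the factor set to the value chosen
in this patch:

    /* Sketch of the async/write overcharge in bfq_serv_to_charge(). */
    static const int bfq_async_charge_factor = 10;

    static unsigned long serv_to_charge(unsigned long sectors, int sync)
    {
            /* sync requests are charged exactly their size */
            return sectors * (1 + (!sync) * bfq_async_charge_factor);
    }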

The second source of problems is that some applications generate
just a few small, yet widely spaced, random requests at the beginning
of a new I/O-bound phase. This causes the average seek distance,
computed using a low-pass filter, to remain high for a non-negligible
amount of time, even if the application then issues only sequential
requests. Hence, for a while, the queue associated with the
application is unavoidably detected as seeky (i.e., containing random
requests), and the device-idling timeout for the queue is set to a
very low value. This often causes a loss of throughput on rotational
devices, as well as increased latency. In contrast, this patch allows
the device-idling timeout for a seeky queue to be set to a very low
value only if the associated process has either already consumed at
least a minimum fraction (1/8) of the maximum budget B_max, or
already proved to generate random requests systematically. In the
latter case the queue is flagged as "constantly seeky".

Finally, the following additional BFQ mechanism causes throughput
loss and increased latencies in two further situations. When the
in-service queue is expired, BFQ also checks whether the queue has
been "too slow", i.e., has consumed its last-assigned budget at such
a low rate that it would have been impossible to consume all of it
within the maximum time slice T_max (Subsec. 3.5 in [1]). In this
case, the queue is always (over)charged the whole budget, to reduce
its utilization of the device, exactly as happens with seeky queues.
The two situations in which this behavior causes problems, and the
solution provided by this patch, are described next.

1. If too little time has elapsed since a process started doing
sequential I/O, then the positive effect of its sequential accesses
on the throughput may not yet have prevailed over the throughput loss
caused by the initial random access needed to reach the first sector
requested by the process. For this reason, a slow queue that is
expired after receiving very little service (at most 1/8 of the
maximum budget) is no longer charged a full budget.

2. Because of zoned bit recording (ZBR), a queue may be deemed slow
when its associated process is performing I/O on the slowest zones of
a disk. However, unless the process is truly too slow, not reducing
the disk utilization of the queue is more profitable, in terms of
disk throughput, than the opposite. For this reason, a queue is now
never charged the whole budget if it has already consumed a
significant fraction (2/3) of it, as the sketch below illustrates.
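
Taken together, the two relaxations amount to the following charging
rule on expiration, a sketch of the condition added to
bfq_bfqq_expire() in the diff below (names simplified):

#include <stdbool.h>

enum expire_reason { BUDGET_TIMEOUT, BUDGET_EXHAUSTED, NO_MORE_REQUESTS };

/*
 * Whether a queue being expired should be charged its whole budget:
 * always if it was too slow ('slow' is already false when the queue
 * received at most 1/8 of the maximum budget, point 1 above), and, on
 * a budget timeout, only if at least one third of the budget is still
 * unused, i.e., if the queue failed to consume 2/3 of it (point 2).
 */
bool charge_full_budget(bool slow, enum expire_reason reason,
			unsigned long budget, unsigned long budget_left)
{
	return slow ||
	       (reason == BUDGET_TIMEOUT && budget_left >= budget / 3);
}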

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 50 +++++++++++++++++++++++++++++++++++++++++++++-----
 block/bfq.h         |  5 +++++
 2 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 2a4e03d..9e607a0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -87,6 +87,13 @@ static int bfq_slice_idle = HZ / 125;
 static const int bfq_default_max_budget = 16 * 1024;
 static const int bfq_max_budget_async_rq = 4;
 
+/*
+ * Async to sync throughput distribution is controlled as follows:
+ * when an async request is served, the entity is charged the number
+ * of sectors of the request, multiplied by the factor below
+ */
+static const int bfq_async_charge_factor = 10;
+
 /* Default timeout values, in jiffies, approximating CFQ defaults. */
 static const int bfq_timeout_sync = HZ / 8;
 static int bfq_timeout_async = HZ / 25;
@@ -269,10 +276,12 @@ static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
 	return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
 }
 
+/* see the definition of bfq_async_charge_factor for details */
 static inline unsigned long bfq_serv_to_charge(struct request *rq,
 					       struct bfq_queue *bfqq)
 {
-	return blk_rq_sectors(rq);
+	return blk_rq_sectors(rq) *
+		(1 + ((!bfq_bfqq_sync(bfqq)) * bfq_async_charge_factor));
 }
 
 /**
@@ -565,13 +574,21 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 	 * We don't want to idle for seeks, but we do want to allow
 	 * fair distribution of slice time for a process doing back-to-back
 	 * seeks. So allow a little bit of time for him to submit a new rq.
+	 *
+	 * To prevent processes with (partly) seeky workloads from
+	 * being too ill-treated, grant them a small fraction of the
+	 * assigned budget before reducing the waiting time to
+	 * BFQ_MIN_TT. This happened to help reduce latency.
 	 */
 	sl = bfqd->bfq_slice_idle;
 	/*
-	 * Grant only minimum idle time if the queue has been seeky for long
-	 * enough.
+	 * Grant only minimum idle time if the queue either has been seeky for
+	 * long enough or has already proved to be constantly seeky.
 	 */
-	if (bfq_sample_valid(bfqq->seek_samples) && BFQQ_SEEKY(bfqq))
+	if (bfq_sample_valid(bfqq->seek_samples) &&
+	    ((BFQQ_SEEKY(bfqq) && bfqq->entity.service >
+				  bfq_max_budget(bfqq->bfqd) / 8) ||
+	      bfq_bfqq_constantly_seeky(bfqq)))
 		sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
 	bfqd->last_idling_start = ktime_get();
 	mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
@@ -889,6 +906,16 @@ static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	}
 
 	/*
+	 * If the process has been served for a too short time
+	 * interval to let its possible sequential accesses prevail on
+	 * the initial seek time needed to move the disk head on the
+	 * first sector it requested, then give the process a chance
+	 * and for the moment return false.
+	 */
+	if (bfqq->entity.budget <= bfq_max_budget(bfqd) / 8)
+		return 0;
+
+	/*
 	 * A process is considered ``slow'' (i.e., seeky, so that we
 	 * cannot treat it fairly in the service domain, as it would
 	 * slow down too much the other processes) if, when a slice
@@ -954,10 +981,21 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 	/*
 	 * As above explained, 'punish' slow (i.e., seeky), timed-out
 	 * and async queues, to favor sequential sync workloads.
+	 *
+	 * Processes doing I/O in the slower disk zones will tend to be
+	 * slow(er) even if not seeky. Hence, since the estimated peak
+	 * rate is actually an average over the disk surface, these
+	 * processes may timeout just for bad luck. To avoid punishing
+	 * them we do not charge a full budget to a process that
+	 * succeeded in consuming at least 2/3 of its budget.
 	 */
-	if (slow || reason == BFQ_BFQQ_BUDGET_TIMEOUT)
+	if (slow || (reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
+		     bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3))
 		bfq_bfqq_charge_full_budget(bfqq);
 
+	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
+		bfq_mark_bfqq_constantly_seeky(bfqq);
+
 	bfq_log_bfqq(bfqd, bfqq,
 		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -1632,6 +1670,8 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_update_io_thinktime(bfqd, bic);
 	bfq_update_io_seektime(bfqd, bfqq, rq);
+	if (!BFQQ_SEEKY(bfqq))
+		bfq_clear_bfqq_constantly_seeky(bfqq);
 	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
 	    !BFQQ_SEEKY(bfqq))
 		bfq_update_idle_window(bfqd, bfqq, bic);
diff --git a/block/bfq.h b/block/bfq.h
index a334eb4..ea5ecca 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -358,6 +358,10 @@ enum bfqq_state_flags {
 	BFQ_BFQQ_FLAG_prio_changed,	/* task priority has changed */
 	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */
 	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this budget */
+	BFQ_BFQQ_FLAG_constantly_seeky,	/*
+					 * bfqq has proved to be slow and
+					 * seeky until budget timeout
+					 */
 };
 
 #define BFQ_BFQQ_FNS(name)						\
@@ -382,6 +386,7 @@ BFQ_BFQQ_FNS(idle_window);
 BFQ_BFQQ_FNS(prio_changed);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(budget_new);
+BFQ_BFQQ_FNS(constantly_seeky);
 #undef BFQ_BFQQ_FNS
 
 /* Logging facilities. */
-- 
1.9.2



* [PATCH RFC RESEND 08/14] block, bfq: improve responsiveness
  2014-05-27 12:42 ` paolo
@ 2014-05-27 12:42     ` paolo
  -1 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente, containers, linux-kernel, Fabio Checconi,
	Arianna Avanzini, cgroups, Paolo Valente

From: Paolo Valente <paolo.valente@unimore.it>

This patch introduces a simple heuristic to load applications quickly,
and to perform the I/O requested by interactive applications just as
quickly. To this end, both a newly-created queue and a queue
associated with an interactive application (we explain in a moment
how BFQ decides whether the associated application is interactive)
receive the following three special treatments (sketched together
after the list):

1) The weight of the queue is raised.

2) The queue unconditionally enjoys device idling when it empties; in
fact, if the requests of a queue are sync, then performing device
idling for the queue is a necessary condition to guarantee that the
queue receives a fraction of the throughput proportional to its weight
(see [1] for details).

3) The device-idling timeout is larger for the queue. This reduces the
probability that the queue is expired because its next request does
not arrive in time.
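
A simplified, stand-alone sketch of the three treatments as they
appear in the diff below; the coefficient 20 and the tripled idling
timeout match the defaults set by this patch, while struct and helper
names are illustrative:

#include <stdbool.h>

struct queue {
	unsigned int orig_weight;	/* weight derived from the I/O priority */
	unsigned int weight;		/* effective scheduling weight */
	unsigned int wr_coeff;		/* 1 = not raised, > 1 = weight-raised */
	bool idle_window;		/* device idling normally granted? */
};

#define WR_COEFF 20	/* default bfq_wr_coeff set by this patch */

/* treatment 1: raise the effective weight of the queue */
void start_weight_raising(struct queue *q)
{
	q->wr_coeff = WR_COEFF;
	q->weight = q->orig_weight * q->wr_coeff;
}

/* treatment 2: idle for the queue even if its idle window is disabled */
bool must_idle(const struct queue *q, bool sync)
{
	return sync && (q->wr_coeff > 1 || q->idle_window);
}

/* treatment 3: grant a longer idling timeout to a weight-raised queue */
unsigned long idling_timeout(const struct queue *q, unsigned long slice_idle)
{
	return q->wr_coeff > 1 ? slice_idle * 3 : slice_idle;
}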

For brevity, we refer to the combination of these three preferential
treatments simply as weight-raising. For a newly-created queue,
weight-raising starts immediately and lasts for a time interval that:
1) depends on the device speed and type (rotational or
non-rotational), and 2) is equal to the time needed to load (start up)
a large application on that device, with cold caches and with no
additional workload.
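
In the patch, this interval is computed as duration = (R / r) * T,
where r is the estimated peak rate of the device at hand and (R, T)
is a reference pair selected according to the device's speed class
(see bfq_wr_duration() in the diff below). A user-space sketch of
that computation, without the fixed-point arithmetic used in the
kernel:

/* reference peak rate R and reference application-load time T */
struct ref_pair {
	unsigned long long R;	/* peak rate of the reference device */
	unsigned long long T;	/* ms needed to load a large application on it */
};

/*
 * Weight-raising duration for a device whose estimated peak rate is
 * 'peak_rate': duration = (R / r) * T, so the slower the device with
 * respect to its reference, the longer the raising period.
 */
unsigned long long wr_duration(const struct ref_pair *ref,
			       unsigned long long peak_rate)
{
	return ref->R * ref->T / peak_rate;
}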

Finally, to also guarantee fast execution of the I/O-bound tasks that
interactive applications trigger (such as opening a file), consider
that any interactive application blocks and waits for user input both
after starting up and after executing some task. After a while, the
user may trigger new operations, after which the application stops
again, and so on. Accordingly, the low-latency heuristic weight-raises
a queue again if it becomes backlogged after being idle for a
sufficiently long (configurable) time. The weight-raising then lasts
for the same time as for a just-created queue.
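
A sketch of the corresponding check performed when a request arrives
for an empty queue; the 2000 ms default for the minimum idle time
matches bfq_wr_min_idle_time as set by this patch, and jiffies
arithmetic is replaced by plain millisecond timestamps:

#include <stdbool.h>

struct queue {
	unsigned int wr_coeff;		/* 1 = not currently weight-raised */
	unsigned long long empty_since;	/* ms timestamp at which the queue emptied */
};

/*
 * When a request arrives for a queue with no backlog, a new
 * weight-raising period is started if low_latency is enabled, the
 * queue is not already being raised, and the queue has had no backlog
 * for at least min_idle ms (2000 ms by default in this patch).
 */
bool should_start_weight_raising(const struct queue *q, bool low_latency,
				 unsigned long long now,
				 unsigned long long min_idle)
{
	return low_latency && q->wr_coeff == 1 &&
	       now - q->empty_since >= min_idle;
}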

According to our experiments, the combination of this low-latency
heuristic with the improvements described in patch 7 allows BFQ to
guarantee high application responsiveness.

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/Kconfig.iosched |   8 +-
 block/bfq-cgroup.c    |  15 +++
 block/bfq-iosched.c   | 355 ++++++++++++++++++++++++++++++++++++++++++++++----
 block/bfq-sched.c     |   5 +-
 block/bfq.h           |  33 +++++
 5 files changed, 386 insertions(+), 30 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a3675cb..3e26f28 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,8 +45,9 @@ config IOSCHED_BFQ
 	---help---
 	  The BFQ I/O scheduler tries to distribute bandwidth among all
 	  processes according to their weights.
-	  It aims at distributing the bandwidth as desired, regardless
-	  of the disk parameters and with any workload. If compiled
+	  It aims at distributing the bandwidth as desired, regardless of
+	  the device parameters and with any workload. It also tries to
+	  guarantee a low latency to interactive applications. If compiled
 	  built-in (saying Y here), BFQ can be configured to support
 	  hierarchical scheduling.
 
@@ -79,7 +80,8 @@ choice
 		  used by default for all block devices.
 		  The BFQ I/O scheduler aims at distributing the bandwidth
 		  as desired, regardless of the disk parameters and with
-		  any workload.
+		  any workload. It also tries to guarantee a low latency to
+		  interactive applications.
 
 	config DEFAULT_NOOP
 		bool "No-op"
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 805fe5e..1cb25aa 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -525,6 +525,16 @@ static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg)
 	kfree(bfqg);
 }
 
+static void bfq_end_wr_async(struct bfq_data *bfqd)
+{
+	struct hlist_node *tmp;
+	struct bfq_group *bfqg;
+
+	hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node)
+		bfq_end_wr_async_queues(bfqd, bfqg);
+	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+}
+
 /**
  * bfq_disconnect_groups - disconnect @bfqd from all its groups.
  * @bfqd: the device descriptor being exited.
@@ -866,6 +876,11 @@ static inline void bfq_bfqq_move(struct bfq_data *bfqd,
 {
 }
 
+static void bfq_end_wr_async(struct bfq_data *bfqd)
+{
+	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+}
+
 static inline void bfq_disconnect_groups(struct bfq_data *bfqd)
 {
 	bfq_put_async_queues(bfqd, bfqd->root_group);
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 9e607a0..ace9aba 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -24,15 +24,15 @@
  * precisely, BFQ schedules queues associated to processes. Thanks to the
  * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
  * I/O-bound processes issuing sequential requests (to boost the
- * throughput), and yet guarantee a relatively low latency to interactive
- * applications.
+ * throughput), and yet guarantee a low latency to interactive applications.
  *
  * BFQ is described in [1], where also a reference to the initial, more
  * theoretical paper on BFQ can be found. The interested reader can find
  * in the latter paper full details on the main algorithm, as well as
  * formulas of the guarantees and formal proofs of all the properties.
  * With respect to the version of BFQ presented in these papers, this
- * implementation adds a hierarchical extension based on H-WF2Q+.
+ * implementation adds a few more heuristics and a hierarchical extension
+ * based on H-WF2Q+.
  *
  * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
  * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
@@ -116,6 +116,48 @@ struct kmem_cache *bfq_pool;
 /* Shift used for peak rate fixed precision calculations. */
 #define BFQ_RATE_SHIFT		16
 
+/*
+ * By default, BFQ computes the duration of the weight raising for
+ * interactive applications automatically, using the following formula:
+ * duration = (R / r) * T, where r is the peak rate of the device, and
+ * R and T are two reference parameters.
+ * In particular, R is the peak rate of the reference device (see below),
+ * and T is a reference time: given the systems that are likely to be
+ * installed on the reference device according to its speed class, T is
+ * about the maximum time needed, under BFQ and while reading two files in
+ * parallel, to load typical large applications on these systems.
+ * In practice, the slower/faster the device at hand is, the more/less it
+ * takes to load applications with respect to the reference device.
+ * Accordingly, the longer/shorter BFQ grants weight raising to interactive
+ * applications.
+ *
+ * BFQ uses four different reference pairs (R, T), depending on:
+ * . whether the device is rotational or non-rotational;
+ * . whether the device is slow, such as old or portable HDDs, as well as
+ *   SD cards, or fast, such as newer HDDs and SSDs.
+ *
+ * The device's speed class is dynamically (re)detected in
+ * bfq_update_peak_rate() every time the estimated peak rate is updated.
+ *
+ * In the following definitions, R_slow[0]/R_fast[0] and T_slow[0]/T_fast[0]
+ * are the reference values for a slow/fast rotational device, whereas
+ * R_slow[1]/R_fast[1] and T_slow[1]/T_fast[1] are the reference values for
+ * a slow/fast non-rotational device. Finally, device_speed_thresh are the
+ * thresholds used to switch between speed classes.
+ * Both the reference peak rates and the thresholds are measured in
+ * sectors/usec, left-shifted by BFQ_RATE_SHIFT.
+ */
+static int R_slow[2] = {1536, 10752};
+static int R_fast[2] = {17415, 34791};
+/*
+ * To improve readability, a conversion function is used to initialize the
+ * following arrays, which entails that they can be initialized only in a
+ * function.
+ */
+static int T_slow[2];
+static int T_fast[2];
+static int device_speed_thresh[2];
+
 #define BFQ_SERVICE_TREE_INIT	((struct bfq_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
@@ -281,7 +323,8 @@ static inline unsigned long bfq_serv_to_charge(struct request *rq,
 					       struct bfq_queue *bfqq)
 {
 	return blk_rq_sectors(rq) *
-		(1 + ((!bfq_bfqq_sync(bfqq)) * bfq_async_charge_factor));
+		(1 + ((!bfq_bfqq_sync(bfqq)) * (bfqq->wr_coeff == 1) *
+		bfq_async_charge_factor));
 }
 
 /**
@@ -322,12 +365,27 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
 	}
 }
 
+static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
+{
+	u64 dur;
+
+	if (bfqd->bfq_wr_max_time > 0)
+		return bfqd->bfq_wr_max_time;
+
+	dur = bfqd->RT_prod;
+	do_div(dur, bfqd->peak_rate);
+
+	return dur;
+}
+
 static void bfq_add_request(struct request *rq)
 {
 	struct bfq_queue *bfqq = RQ_BFQQ(rq);
 	struct bfq_entity *entity = &bfqq->entity;
 	struct bfq_data *bfqd = bfqq->bfqd;
 	struct request *next_rq, *prev;
+	unsigned long old_wr_coeff = bfqq->wr_coeff;
+	int idle_for_long_time = 0;
 
 	bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
 	bfqq->queued[rq_is_sync(rq)]++;
@@ -343,13 +401,64 @@ static void bfq_add_request(struct request *rq)
 	bfqq->next_rq = next_rq;
 
 	if (!bfq_bfqq_busy(bfqq)) {
+		idle_for_long_time = time_is_before_jiffies(
+			bfqq->budget_timeout +
+			bfqd->bfq_wr_min_idle_time);
 		entity->budget = max_t(unsigned long, bfqq->max_budget,
 				       bfq_serv_to_charge(next_rq, bfqq));
+
+		if (!bfqd->low_latency)
+			goto add_bfqq_busy;
+
+		/*
+		 * If the queue is not being boosted and has been idle for
+		 * enough time, start a weight-raising period.
+		 */
+		if (old_wr_coeff == 1 && idle_for_long_time) {
+			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			bfq_log_bfqq(bfqd, bfqq,
+				     "wrais starting at %lu, rais_max_time %u",
+				     jiffies,
+				     jiffies_to_msecs(bfqq->wr_cur_max_time));
+		} else if (old_wr_coeff > 1) {
+			if (idle_for_long_time)
+				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			else {
+				bfqq->wr_coeff = 1;
+				bfq_log_bfqq(bfqd, bfqq,
+					"wrais ending at %lu, rais_max_time %u",
+					jiffies,
+					jiffies_to_msecs(bfqq->
+						wr_cur_max_time));
+			}
+		}
+		if (old_wr_coeff != bfqq->wr_coeff)
+			entity->ioprio_changed = 1;
+add_bfqq_busy:
 		bfq_add_bfqq_busy(bfqd, bfqq);
 	} else {
+		if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
+		    time_is_before_jiffies(
+				bfqq->last_wr_start_finish +
+				bfqd->bfq_wr_min_inter_arr_async)) {
+			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+
+			entity->ioprio_changed = 1;
+			bfq_log_bfqq(bfqd, bfqq,
+			    "non-idle wrais starting at %lu, rais_max_time %u",
+			    jiffies,
+			    jiffies_to_msecs(bfqq->wr_cur_max_time));
+		}
 		if (prev != bfqq->next_rq)
 			bfq_updated_next_req(bfqd, bfqq);
 	}
+
+	if (bfqd->low_latency &&
+		(old_wr_coeff == 1 || bfqq->wr_coeff == 1 ||
+		 idle_for_long_time))
+		bfqq->last_wr_start_finish = jiffies;
 }
 
 static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
@@ -477,6 +586,43 @@ static void bfq_merged_requests(struct request_queue *q, struct request *rq,
 	bfq_remove_request(next);
 }
 
+/* Must be called with bfqq != NULL */
+static inline void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
+{
+	bfqq->wr_coeff = 1;
+	bfqq->wr_cur_max_time = 0;
+	/* Trigger a weight change on the next activation of the queue */
+	bfqq->entity.ioprio_changed = 1;
+}
+
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
+				    struct bfq_group *bfqg)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			if (bfqg->async_bfqq[i][j] != NULL)
+				bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]);
+	if (bfqg->async_idle_bfqq != NULL)
+		bfq_bfqq_end_wr(bfqg->async_idle_bfqq);
+}
+
+static void bfq_end_wr(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq;
+
+	spin_lock_irq(bfqd->queue->queue_lock);
+
+	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
+		bfq_bfqq_end_wr(bfqq);
+	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
+		bfq_bfqq_end_wr(bfqq);
+	bfq_end_wr_async(bfqd);
+
+	spin_unlock_irq(bfqd->queue->queue_lock);
+}
+
 static int bfq_allow_merge(struct request_queue *q, struct request *rq,
 			   struct bio *bio)
 {
@@ -582,14 +728,17 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 	 */
 	sl = bfqd->bfq_slice_idle;
 	/*
-	 * Grant only minimum idle time if the queue either has been seeky for
-	 * long enough or has already proved to be constantly seeky.
+	 * Unless the queue is being weight-raised, grant only minimum idle
+	 * time if the queue either has been seeky for long enough or has
+	 * already proved to be constantly seeky.
 	 */
 	if (bfq_sample_valid(bfqq->seek_samples) &&
 	    ((BFQQ_SEEKY(bfqq) && bfqq->entity.service >
 				  bfq_max_budget(bfqq->bfqd) / 8) ||
-	      bfq_bfqq_constantly_seeky(bfqq)))
+	      bfq_bfqq_constantly_seeky(bfqq)) && bfqq->wr_coeff == 1)
 		sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
+	else if (bfqq->wr_coeff > 1)
+		sl = sl * 3;
 	bfqd->last_idling_start = ktime_get();
 	mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
 	bfq_log(bfqd, "arm idle: %u/%u ms",
@@ -677,9 +826,15 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	__bfq_bfqd_reset_in_service(bfqd);
 
-	if (RB_EMPTY_ROOT(&bfqq->sort_list))
+	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		/*
+		 * Overloading budget_timeout field to store the time
+		 * at which the queue remains with no backlog; used by
+		 * the weight-raising mechanism.
+		 */
+		bfqq->budget_timeout = jiffies;
 		bfq_del_bfqq_busy(bfqd, bfqq, 1);
-	else
+	} else
 		bfq_activate_bfqq(bfqd, bfqq);
 }
 
@@ -896,12 +1051,26 @@ static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 			bfqd->peak_rate_samples++;
 
 		if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
-		    update && bfqd->bfq_user_max_budget == 0) {
-			bfqd->bfq_max_budget =
-				bfq_calc_max_budget(bfqd->peak_rate,
-						    timeout);
-			bfq_log(bfqd, "new max_budget=%lu",
-				bfqd->bfq_max_budget);
+		    update) {
+			int dev_type = blk_queue_nonrot(bfqd->queue);
+			if (bfqd->bfq_user_max_budget == 0) {
+				bfqd->bfq_max_budget =
+					bfq_calc_max_budget(bfqd->peak_rate,
+							    timeout);
+				bfq_log(bfqd, "new max_budget=%lu",
+					bfqd->bfq_max_budget);
+			}
+			if (bfqd->device_speed == BFQ_BFQD_FAST &&
+			    bfqd->peak_rate < device_speed_thresh[dev_type]) {
+				bfqd->device_speed = BFQ_BFQD_SLOW;
+				bfqd->RT_prod = R_slow[dev_type] *
+						T_slow[dev_type];
+			} else if (bfqd->device_speed == BFQ_BFQD_SLOW &&
+			    bfqd->peak_rate > device_speed_thresh[dev_type]) {
+				bfqd->device_speed = BFQ_BFQD_FAST;
+				bfqd->RT_prod = R_fast[dev_type] *
+						T_fast[dev_type];
+			}
 		}
 	}
 
@@ -996,6 +1165,9 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
 		bfq_mark_bfqq_constantly_seeky(bfqq);
 
+	if (bfqd->low_latency && bfqq->wr_coeff == 1)
+		bfqq->last_wr_start_finish = jiffies;
+
 	bfq_log_bfqq(bfqd, bfqq,
 		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -1044,21 +1216,36 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
 }
 
 /*
- * Device idling is allowed only for sync queues that have a non-null
- * idle window.
+ * Device idling is allowed only for the queues for which this function
+ * returns true. For this reason, the return value of this function plays a
+ * critical role for both throughput boosting and service guarantees.
+ *
+ * The return value is computed through a logical expression, which may
+ * be true only if bfqq is sync and at least one of the following two
+ * conditions holds:
+ * - the queue has a non-null idle window;
+ * - the queue is being weight-raised.
+ * In fact, waiting for a new request for the queue, in the first case,
+ * is likely to boost the disk throughput, whereas, in the second case,
+ * is necessary to preserve fairness and latency guarantees
+ * (see [1] for details).
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
-	return bfq_bfqq_sync(bfqq) && bfq_bfqq_idle_window(bfqq);
+	return bfq_bfqq_sync(bfqq) &&
+	       (bfqq->wr_coeff > 1 || bfq_bfqq_idle_window(bfqq));
 }
 
 /*
- * If the in-service queue is empty, but it is sync and the queue has its
- * idle window set (in this case, waiting for a new request for the queue
- * is likely to boost the throughput), then:
+ * If the in-service queue is empty but sync, and the function
+ * bfq_bfqq_must_not_expire returns true, then:
  * 1) the queue must remain in service and cannot be expired, and
  * 2) the disk must be idled to wait for the possible arrival of a new
  *    request for the queue.
+ * See the comments to the function bfq_bfqq_must_not_expire for the reasons
+ * why performing device idling is the best choice to boost the throughput
+ * and preserve service guarantees when bfq_bfqq_must_not_expire itself
+ * returns true.
  */
 static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
 {
@@ -1148,6 +1335,38 @@ keep_queue:
 	return bfqq;
 }
 
+static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */
+		bfq_log_bfqq(bfqd, bfqq,
+			"raising period dur %u/%u msec, old coeff %u, w %d(%d)",
+			jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+			jiffies_to_msecs(bfqq->wr_cur_max_time),
+			bfqq->wr_coeff,
+			bfqq->entity.weight, bfqq->entity.orig_weight);
+
+		/*
+		 * If too much time has elapsed from the beginning
+		 * of this weight-raising period, stop it.
+		 */
+		if (time_is_before_jiffies(bfqq->last_wr_start_finish +
+					   bfqq->wr_cur_max_time)) {
+			bfqq->last_wr_start_finish = jiffies;
+			bfq_log_bfqq(bfqd, bfqq,
+				     "wrais ending at %lu, rais_max_time %u",
+				     bfqq->last_wr_start_finish,
+				     jiffies_to_msecs(bfqq->wr_cur_max_time));
+			bfq_bfqq_end_wr(bfqq);
+		}
+	}
+	/* Update weight both if it must be raised and if it must be lowered */
+	if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))
+		__bfq_entity_update_weight_prio(
+			bfq_entity_service_tree(entity),
+			entity);
+}
+
 /*
  * Dispatch one request from bfqq, moving it to the request queue
  * dispatch list.
@@ -1194,6 +1413,8 @@ static int bfq_dispatch_request(struct bfq_data *bfqd,
 	bfq_bfqq_served(bfqq, service_to_charge);
 	bfq_dispatch_insert(bfqd->queue, rq);
 
+	bfq_update_wr_data(bfqd, bfqq);
+
 	bfq_log_bfqq(bfqd, bfqq,
 			"dispatched %u sec req (%llu), budg left %lu",
 			blk_rq_sectors(rq),
@@ -1467,6 +1688,9 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	/* Tentative initial value to trade off between thr and lat */
 	bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
 	bfqq->pid = pid;
+
+	bfqq->wr_coeff = 1;
+	bfqq->last_wr_start_finish = 0;
 }
 
 static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
@@ -1642,7 +1866,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
 		enable_idle = 0;
 	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
-		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle)
+		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
+			bfqq->wr_coeff == 1)
 			enable_idle = 0;
 		else
 			enable_idle = 1;
@@ -2117,6 +2342,22 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async;
 	bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync;
 
+	bfqd->low_latency = true;
+
+	bfqd->bfq_wr_coeff = 20;
+	bfqd->bfq_wr_max_time = 0;
+	bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
+	bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
+
+	/*
+	 * Begin by assuming, optimistically, that the device peak rate is
+	 * equal to the highest reference rate.
+	 */
+	bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] *
+			T_fast[blk_queue_nonrot(bfqd->queue)];
+	bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)];
+	bfqd->device_speed = BFQ_BFQD_FAST;
+
 	return 0;
 }
 
@@ -2151,6 +2392,14 @@ static ssize_t bfq_var_store(unsigned long *var, const char *page,
 	return count;
 }
 
+static ssize_t bfq_wr_max_time_show(struct elevator_queue *e, char *page)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	return sprintf(page, "%d\n", bfqd->bfq_wr_max_time > 0 ?
+		       jiffies_to_msecs(bfqd->bfq_wr_max_time) :
+		       jiffies_to_msecs(bfq_wr_duration(bfqd)));
+}
+
 static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
 {
 	struct bfq_queue *bfqq;
@@ -2165,19 +2414,24 @@ static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
 	num_char += sprintf(page + num_char, "Active:\n");
 	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
 	  num_char += sprintf(page + num_char,
-			      "pid%d: weight %hu, nr_queued %d %d\n",
+			      "pid%d: weight %hu, nr_queued %d %d, dur %d/%u\n",
 			      bfqq->pid,
 			      bfqq->entity.weight,
 			      bfqq->queued[0],
-			      bfqq->queued[1]);
+			      bfqq->queued[1],
+			jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+			jiffies_to_msecs(bfqq->wr_cur_max_time));
 	}
 
 	num_char += sprintf(page + num_char, "Idle:\n");
 	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
 			num_char += sprintf(page + num_char,
-				"pid%d: weight %hu\n",
+				"pid%d: weight %hu, dur %d/%u\n",
 				bfqq->pid,
-				bfqq->entity.weight);
+				bfqq->entity.weight,
+				jiffies_to_msecs(jiffies -
+					bfqq->last_wr_start_finish),
+				jiffies_to_msecs(bfqq->wr_cur_max_time));
 	}
 
 	spin_unlock_irq(bfqd->queue->queue_lock);
@@ -2205,6 +2459,11 @@ SHOW_FUNCTION(bfq_max_budget_async_rq_show,
 	      bfqd->bfq_max_budget_async_rq, 0);
 SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
 SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
+SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
+SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
+SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
+SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
+	1);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2237,6 +2496,12 @@ STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq,
 		1, INT_MAX, 0);
 STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
 		INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
+STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
+		&bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
 #undef STORE_FUNCTION
 
 /* do nothing for the moment */
@@ -2295,6 +2560,22 @@ static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
 	return ret;
 }
 
+static ssize_t bfq_low_latency_store(struct elevator_queue *e,
+				     const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data > 1)
+		__data = 1;
+	if (__data == 0 && bfqd->low_latency != 0)
+		bfq_end_wr(bfqd);
+	bfqd->low_latency = __data;
+
+	return ret;
+}
+
 #define BFQ_ATTR(name) \
 	__ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
 
@@ -2309,6 +2590,11 @@ static struct elv_fs_entry bfq_attrs[] = {
 	BFQ_ATTR(max_budget_async_rq),
 	BFQ_ATTR(timeout_sync),
 	BFQ_ATTR(timeout_async),
+	BFQ_ATTR(low_latency),
+	BFQ_ATTR(wr_coeff),
+	BFQ_ATTR(wr_max_time),
+	BFQ_ATTR(wr_min_idle_time),
+	BFQ_ATTR(wr_min_inter_arr_async),
 	BFQ_ATTR(weights),
 	__ATTR_NULL
 };
@@ -2355,6 +2641,23 @@ static int __init bfq_init(void)
 	if (bfq_slab_setup())
 		return -ENOMEM;
 
+	/*
+	 * Times to load large popular applications for the typical systems
+	 * installed on the reference devices (see the comments before the
+	 * definitions of the two arrays).
+	 */
+	T_slow[0] = msecs_to_jiffies(2600);
+	T_slow[1] = msecs_to_jiffies(1000);
+	T_fast[0] = msecs_to_jiffies(5500);
+	T_fast[1] = msecs_to_jiffies(2000);
+
+	/*
+	 * Thresholds that determine the switch between speed classes (see
+	 * the comments before the definition of the array).
+	 */
+	device_speed_thresh[0] = (R_fast[0] + R_slow[0]) / 2;
+	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
+
 	elv_register(&iosched_bfq);
 	pr_info("BFQ I/O-scheduler version: v1");
 
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 075e472..f6491d5 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -514,6 +514,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 	struct bfq_service_tree *new_st = old_st;
 
 	if (entity->ioprio_changed) {
+		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
 		old_st->wsum -= entity->weight;
 
 		if (entity->new_weight != entity->orig_weight) {
@@ -539,7 +541,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		 * when entity->finish <= old_st->vtime).
 		 */
 		new_st = bfq_entity_service_tree(entity);
-		entity->weight = entity->orig_weight;
+		entity->weight = entity->orig_weight *
+				 (bfqq != NULL ? bfqq->wr_coeff : 1);
 		new_st->wsum += entity->weight;
 
 		if (new_st != old_st)
diff --git a/block/bfq.h b/block/bfq.h
index ea5ecca..3ce9100 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -181,6 +181,10 @@ struct bfq_group;
  * @seek_mean: mean seek distance
  * @last_request_pos: position of the last request enqueued
  * @pid: pid of the process owning the queue, used for logging purposes.
+ * @last_wr_start_finish: start time of the current weight-raising period if
+ *                        the @bfq-queue is being weight-raised, otherwise
+ *                        finish time of the last weight-raising period
+ * @wr_cur_max_time: current max raising time for this queue
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
  * io_context or more, if it is async. @cgroup holds a reference to the
@@ -217,6 +221,11 @@ struct bfq_queue {
 	sector_t last_request_pos;
 
 	pid_t pid;
+
+	/* weight-raising fields */
+	unsigned long wr_cur_max_time;
+	unsigned long last_wr_start_finish;
+	unsigned int wr_coeff;
 };
 
 /**
@@ -297,6 +306,18 @@ enum bfq_device_speed {
  *               they are charged for the whole allocated budget, to try
  *               to preserve a behavior reasonably fair among them, but
  *               without service-domain guarantees).
+ * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
+ *                queue is multiplied
+ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
+ * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
+ *			  may be reactivated for a queue (in jiffies)
+ * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
+ *				after which weight-raising may be
+ *				reactivated for an already busy queue
+ *				(in jiffies)
+ * @RT_prod: cached value of the product R*T used for computing the maximum
+ *	     duration of the weight raising automatically
+ * @device_speed: device-speed class for the low-latency heuristic
  * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
  *
  * All the fields are protected by the @queue lock.
@@ -346,6 +367,16 @@ struct bfq_data {
 	unsigned int bfq_max_budget_async_rq;
 	unsigned int bfq_timeout[2];
 
+	bool low_latency;
+
+	/* parameters of the low_latency heuristics */
+	unsigned int bfq_wr_coeff;
+	unsigned int bfq_wr_max_time;
+	unsigned int bfq_wr_min_idle_time;
+	unsigned long bfq_wr_min_inter_arr_async;
+	u64 RT_prod;
+	enum bfq_device_speed device_speed;
+
 	struct bfq_queue oom_bfqq;
 };
 
@@ -556,6 +587,8 @@ static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
 static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 				       struct bfq_group *bfqg, int is_sync,
 				       struct bfq_io_cq *bic, gfp_t gfp_mask);
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
+				    struct bfq_group *bfqg);
 static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 
-- 
1.9.2


* [PATCH RFC RESEND 08/14] block, bfq: improve responsiveness
@ 2014-05-27 12:42     ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Paolo Valente <paolo.valente@unimore.it>

This patch introduces a simple heuristic to load applications quickly,
and to perform the I/O requested by interactive applications just as
quickly. To this end, both a newly-created queue and a queue
associated with an interactive application (we explain in a moment
how BFQ decides whether the associated application is interactive)
receive the following three special treatments:

1) The weight of the queue is raised.

2) The queue unconditionally enjoys device idling when it empties; in
fact, if the requests of a queue are sync, then performing device
idling for the queue is a necessary condition to guarantee that the
queue receives a fraction of the throughput proportional to its weight
(see [1] for details).

3) The device-idling timeout is larger for the queue. This reduces the
probability that the queue is expired because its next request does
not arrive in time.

For brevity, we refer to the combination of these three preferential
treatments simply as weight-raising. For a newly-created queue,
weight-raising starts immediately and lasts for a time interval that:
1) depends on the device speed and type (rotational or
non-rotational), and 2) is equal to the time needed to load (start up)
a large application on that device, with cold caches and with no
additional workload.

Finally, to also guarantee fast execution of the I/O-bound tasks that
interactive applications trigger (such as opening a file), consider
that any interactive application blocks and waits for user input both
after starting up and after executing some task. After a while, the
user may trigger new operations, after which the application stops
again, and so on. Accordingly, the low-latency heuristic weight-raises
a queue again if it becomes backlogged after being idle for a
sufficiently long (configurable) time. The weight-raising then lasts
for the same time as for a just-created queue.

According to our experiments, the combination of this low-latency
heuristic with the improvements described in patch 7 allows BFQ to
guarantee high application responsiveness.
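
As a rough, hypothetical example of the automatic tuning of the
weight-raising duration: for a fast non-rotational device this patch
uses the reference pair R_fast[1] = 34791 and T_fast[1] = 2000 ms, so
an SSD whose estimated peak rate settles at half the reference value
would be weight-raised for about (R / r) * T = 2 * 2000 ms = 4000 ms,
whereas a device matching the reference rate would get exactly the
2000 ms reference time.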

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/Kconfig.iosched |   8 +-
 block/bfq-cgroup.c    |  15 +++
 block/bfq-iosched.c   | 355 ++++++++++++++++++++++++++++++++++++++++++++++----
 block/bfq-sched.c     |   5 +-
 block/bfq.h           |  33 +++++
 5 files changed, 386 insertions(+), 30 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a3675cb..3e26f28 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,8 +45,9 @@ config IOSCHED_BFQ
 	---help---
 	  The BFQ I/O scheduler tries to distribute bandwidth among all
 	  processes according to their weights.
-	  It aims at distributing the bandwidth as desired, regardless
-	  of the disk parameters and with any workload. If compiled
+	  It aims at distributing the bandwidth as desired, regardless of
+	  the device parameters and with any workload. It also tries to
+	  guarantee a low latency to interactive applications. If compiled
 	  built-in (saying Y here), BFQ can be configured to support
 	  hierarchical scheduling.
 
@@ -79,7 +80,8 @@ choice
 		  used by default for all block devices.
 		  The BFQ I/O scheduler aims at distributing the bandwidth
 		  as desired, regardless of the disk parameters and with
-		  any workload.
+		  any workload. It also tries to guarantee a low latency to
+		  interactive applications.
 
 	config DEFAULT_NOOP
 		bool "No-op"
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 805fe5e..1cb25aa 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -525,6 +525,16 @@ static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg)
 	kfree(bfqg);
 }
 
+static void bfq_end_wr_async(struct bfq_data *bfqd)
+{
+	struct hlist_node *tmp;
+	struct bfq_group *bfqg;
+
+	hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node)
+		bfq_end_wr_async_queues(bfqd, bfqg);
+	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+}
+
 /**
  * bfq_disconnect_groups - disconnect @bfqd from all its groups.
  * @bfqd: the device descriptor being exited.
@@ -866,6 +876,11 @@ static inline void bfq_bfqq_move(struct bfq_data *bfqd,
 {
 }
 
+static void bfq_end_wr_async(struct bfq_data *bfqd)
+{
+	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+}
+
 static inline void bfq_disconnect_groups(struct bfq_data *bfqd)
 {
 	bfq_put_async_queues(bfqd, bfqd->root_group);
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 9e607a0..ace9aba 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -24,15 +24,15 @@
  * precisely, BFQ schedules queues associated to processes. Thanks to the
  * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
  * I/O-bound processes issuing sequential requests (to boost the
- * throughput), and yet guarantee a relatively low latency to interactive
- * applications.
+ * throughput), and yet guarantee a low latency to interactive applications.
  *
  * BFQ is described in [1], where also a reference to the initial, more
  * theoretical paper on BFQ can be found. The interested reader can find
  * in the latter paper full details on the main algorithm, as well as
  * formulas of the guarantees and formal proofs of all the properties.
  * With respect to the version of BFQ presented in these papers, this
- * implementation adds a hierarchical extension based on H-WF2Q+.
+ * implementation adds a few more heuristics and a hierarchical extension
+ * based on H-WF2Q+.
  *
  * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
  * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
@@ -116,6 +116,48 @@ struct kmem_cache *bfq_pool;
 /* Shift used for peak rate fixed precision calculations. */
 #define BFQ_RATE_SHIFT		16
 
+/*
+ * By default, BFQ computes the duration of the weight raising for
+ * interactive applications automatically, using the following formula:
+ * duration = (R / r) * T, where r is the peak rate of the device, and
+ * R and T are two reference parameters.
+ * In particular, R is the peak rate of the reference device (see below),
+ * and T is a reference time: given the systems that are likely to be
+ * installed on the reference device according to its speed class, T is
+ * about the maximum time needed, under BFQ and while reading two files in
+ * parallel, to load typical large applications on these systems.
+ * In practice, the slower/faster the device at hand is, the more/less it
+ * takes to load applications with respect to the reference device.
+ * Accordingly, the longer/shorter BFQ grants weight raising to interactive
+ * applications.
+ *
+ * BFQ uses four different reference pairs (R, T), depending on:
+ * . whether the device is rotational or non-rotational;
+ * . whether the device is slow, such as old or portable HDDs, as well as
+ *   SD cards, or fast, such as newer HDDs and SSDs.
+ *
+ * The device's speed class is dynamically (re)detected in
+ * bfq_update_peak_rate() every time the estimated peak rate is updated.
+ *
+ * In the following definitions, R_slow[0]/R_fast[0] and T_slow[0]/T_fast[0]
+ * are the reference values for a slow/fast rotational device, whereas
+ * R_slow[1]/R_fast[1] and T_slow[1]/T_fast[1] are the reference values for
+ * a slow/fast non-rotational device. Finally, device_speed_thresh are the
+ * thresholds used to switch between speed classes.
+ * Both the reference peak rates and the thresholds are measured in
+ * sectors/usec, left-shifted by BFQ_RATE_SHIFT.
+ */
+static int R_slow[2] = {1536, 10752};
+static int R_fast[2] = {17415, 34791};
+/*
+ * To improve readability, a conversion function is used to initialize the
+ * following arrays, which entails that they can be initialized only in a
+ * function.
+ */
+static int T_slow[2];
+static int T_fast[2];
+static int device_speed_thresh[2];
+
 #define BFQ_SERVICE_TREE_INIT	((struct bfq_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
@@ -281,7 +323,8 @@ static inline unsigned long bfq_serv_to_charge(struct request *rq,
 					       struct bfq_queue *bfqq)
 {
 	return blk_rq_sectors(rq) *
-		(1 + ((!bfq_bfqq_sync(bfqq)) * bfq_async_charge_factor));
+		(1 + ((!bfq_bfqq_sync(bfqq)) * (bfqq->wr_coeff == 1) *
+		bfq_async_charge_factor));
 }
 
 /**
@@ -322,12 +365,27 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
 	}
 }
 
+static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
+{
+	u64 dur;
+
+	if (bfqd->bfq_wr_max_time > 0)
+		return bfqd->bfq_wr_max_time;
+
+	dur = bfqd->RT_prod;
+	do_div(dur, bfqd->peak_rate);
+
+	return dur;
+}
+
 static void bfq_add_request(struct request *rq)
 {
 	struct bfq_queue *bfqq = RQ_BFQQ(rq);
 	struct bfq_entity *entity = &bfqq->entity;
 	struct bfq_data *bfqd = bfqq->bfqd;
 	struct request *next_rq, *prev;
+	unsigned long old_wr_coeff = bfqq->wr_coeff;
+	int idle_for_long_time = 0;
 
 	bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
 	bfqq->queued[rq_is_sync(rq)]++;
@@ -343,13 +401,64 @@ static void bfq_add_request(struct request *rq)
 	bfqq->next_rq = next_rq;
 
 	if (!bfq_bfqq_busy(bfqq)) {
+		idle_for_long_time = time_is_before_jiffies(
+			bfqq->budget_timeout +
+			bfqd->bfq_wr_min_idle_time);
 		entity->budget = max_t(unsigned long, bfqq->max_budget,
 				       bfq_serv_to_charge(next_rq, bfqq));
+
+		if (!bfqd->low_latency)
+			goto add_bfqq_busy;
+
+		/*
+		 * If the queue is not being boosted and has been idle for
+		 * enough time, start a weight-raising period.
+		 */
+		if (old_wr_coeff == 1 && idle_for_long_time) {
+			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			bfq_log_bfqq(bfqd, bfqq,
+				     "wrais starting at %lu, rais_max_time %u",
+				     jiffies,
+				     jiffies_to_msecs(bfqq->wr_cur_max_time));
+		} else if (old_wr_coeff > 1) {
+			if (idle_for_long_time)
+				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			else {
+				bfqq->wr_coeff = 1;
+				bfq_log_bfqq(bfqd, bfqq,
+					"wrais ending at %lu, rais_max_time %u",
+					jiffies,
+					jiffies_to_msecs(bfqq->
+						wr_cur_max_time));
+			}
+		}
+		if (old_wr_coeff != bfqq->wr_coeff)
+			entity->ioprio_changed = 1;
+add_bfqq_busy:
 		bfq_add_bfqq_busy(bfqd, bfqq);
 	} else {
+		if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
+		    time_is_before_jiffies(
+				bfqq->last_wr_start_finish +
+				bfqd->bfq_wr_min_inter_arr_async)) {
+			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+
+			entity->ioprio_changed = 1;
+			bfq_log_bfqq(bfqd, bfqq,
+			    "non-idle wrais starting at %lu, rais_max_time %u",
+			    jiffies,
+			    jiffies_to_msecs(bfqq->wr_cur_max_time));
+		}
 		if (prev != bfqq->next_rq)
 			bfq_updated_next_req(bfqd, bfqq);
 	}
+
+	if (bfqd->low_latency &&
+		(old_wr_coeff == 1 || bfqq->wr_coeff == 1 ||
+		 idle_for_long_time))
+		bfqq->last_wr_start_finish = jiffies;
 }
 
 static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
@@ -477,6 +586,43 @@ static void bfq_merged_requests(struct request_queue *q, struct request *rq,
 	bfq_remove_request(next);
 }
 
+/* Must be called with bfqq != NULL */
+static inline void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
+{
+	bfqq->wr_coeff = 1;
+	bfqq->wr_cur_max_time = 0;
+	/* Trigger a weight change on the next activation of the queue */
+	bfqq->entity.ioprio_changed = 1;
+}
+
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
+				    struct bfq_group *bfqg)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			if (bfqg->async_bfqq[i][j] != NULL)
+				bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]);
+	if (bfqg->async_idle_bfqq != NULL)
+		bfq_bfqq_end_wr(bfqg->async_idle_bfqq);
+}
+
+static void bfq_end_wr(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq;
+
+	spin_lock_irq(bfqd->queue->queue_lock);
+
+	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
+		bfq_bfqq_end_wr(bfqq);
+	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
+		bfq_bfqq_end_wr(bfqq);
+	bfq_end_wr_async(bfqd);
+
+	spin_unlock_irq(bfqd->queue->queue_lock);
+}
+
 static int bfq_allow_merge(struct request_queue *q, struct request *rq,
 			   struct bio *bio)
 {
@@ -582,14 +728,17 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 	 */
 	sl = bfqd->bfq_slice_idle;
 	/*
-	 * Grant only minimum idle time if the queue either has been seeky for
-	 * long enough or has already proved to be constantly seeky.
+	 * Unless the queue is being weight-raised, grant only minimum idle
+	 * time if the queue either has been seeky for long enough or has
+	 * already proved to be constantly seeky.
 	 */
 	if (bfq_sample_valid(bfqq->seek_samples) &&
 	    ((BFQQ_SEEKY(bfqq) && bfqq->entity.service >
 				  bfq_max_budget(bfqq->bfqd) / 8) ||
-	      bfq_bfqq_constantly_seeky(bfqq)))
+	      bfq_bfqq_constantly_seeky(bfqq)) && bfqq->wr_coeff == 1)
 		sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
+	else if (bfqq->wr_coeff > 1)
+		sl = sl * 3;
 	bfqd->last_idling_start = ktime_get();
 	mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
 	bfq_log(bfqd, "arm idle: %u/%u ms",
@@ -677,9 +826,15 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	__bfq_bfqd_reset_in_service(bfqd);
 
-	if (RB_EMPTY_ROOT(&bfqq->sort_list))
+	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		/*
+		 * Overloading budget_timeout field to store the time
+		 * at which the queue remains with no backlog; used by
+		 * the weight-raising mechanism.
+		 */
+		bfqq->budget_timeout = jiffies;
 		bfq_del_bfqq_busy(bfqd, bfqq, 1);
-	else
+	} else
 		bfq_activate_bfqq(bfqd, bfqq);
 }
 
@@ -896,12 +1051,26 @@ static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 			bfqd->peak_rate_samples++;
 
 		if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
-		    update && bfqd->bfq_user_max_budget == 0) {
-			bfqd->bfq_max_budget =
-				bfq_calc_max_budget(bfqd->peak_rate,
-						    timeout);
-			bfq_log(bfqd, "new max_budget=%lu",
-				bfqd->bfq_max_budget);
+		    update) {
+			int dev_type = blk_queue_nonrot(bfqd->queue);
+			if (bfqd->bfq_user_max_budget == 0) {
+				bfqd->bfq_max_budget =
+					bfq_calc_max_budget(bfqd->peak_rate,
+							    timeout);
+				bfq_log(bfqd, "new max_budget=%lu",
+					bfqd->bfq_max_budget);
+			}
+			if (bfqd->device_speed == BFQ_BFQD_FAST &&
+			    bfqd->peak_rate < device_speed_thresh[dev_type]) {
+				bfqd->device_speed = BFQ_BFQD_SLOW;
+				bfqd->RT_prod = R_slow[dev_type] *
+						T_slow[dev_type];
+			} else if (bfqd->device_speed == BFQ_BFQD_SLOW &&
+			    bfqd->peak_rate > device_speed_thresh[dev_type]) {
+				bfqd->device_speed = BFQ_BFQD_FAST;
+				bfqd->RT_prod = R_fast[dev_type] *
+						T_fast[dev_type];
+			}
 		}
 	}
 
@@ -996,6 +1165,9 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
 		bfq_mark_bfqq_constantly_seeky(bfqq);
 
+	if (bfqd->low_latency && bfqq->wr_coeff == 1)
+		bfqq->last_wr_start_finish = jiffies;
+
 	bfq_log_bfqq(bfqd, bfqq,
 		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -1044,21 +1216,36 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
 }
 
 /*
- * Device idling is allowed only for sync queues that have a non-null
- * idle window.
+ * Device idling is allowed only for the queues for which this function
+ * returns true. For this reason, the return value of this function plays a
+ * critical role for both throughput boosting and service guarantees.
+ *
+ * The return value is computed through a logical expression, which may
+ * be true only if bfqq is sync and at least one of the following two
+ * conditions holds:
+ * - the queue has a non-null idle window;
+ * - the queue is being weight-raised.
+ * In fact, waiting for a new request for the queue, in the first case,
+ * is likely to boost the disk throughput, whereas, in the second case,
+ * is necessary to preserve fairness and latency guarantees
+ * (see [1] for details).
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
-	return bfq_bfqq_sync(bfqq) && bfq_bfqq_idle_window(bfqq);
+	return bfq_bfqq_sync(bfqq) &&
+	       (bfqq->wr_coeff > 1 || bfq_bfqq_idle_window(bfqq));
 }
 
 /*
- * If the in-service queue is empty, but it is sync and the queue has its
- * idle window set (in this case, waiting for a new request for the queue
- * is likely to boost the throughput), then:
+ * If the in-service queue is empty but sync, and the function
+ * bfq_bfqq_must_not_expire returns true, then:
  * 1) the queue must remain in service and cannot be expired, and
  * 2) the disk must be idled to wait for the possible arrival of a new
  *    request for the queue.
+ * See the comments to the function bfq_bfqq_must_not_expire for the reasons
+ * why performing device idling is the best choice to boost the throughput
+ * and preserve service guarantees when bfq_bfqq_must_not_expire itself
+ * returns true.
  */
 static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
 {
@@ -1148,6 +1335,38 @@ keep_queue:
 	return bfqq;
 }
 
+static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */
+		bfq_log_bfqq(bfqd, bfqq,
+			"raising period dur %u/%u msec, old coeff %u, w %d(%d)",
+			jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+			jiffies_to_msecs(bfqq->wr_cur_max_time),
+			bfqq->wr_coeff,
+			bfqq->entity.weight, bfqq->entity.orig_weight);
+
+		/*
+		 * If too much time has elapsed from the beginning
+		 * of this weight-raising period, stop it.
+		 */
+		if (time_is_before_jiffies(bfqq->last_wr_start_finish +
+					   bfqq->wr_cur_max_time)) {
+			bfqq->last_wr_start_finish = jiffies;
+			bfq_log_bfqq(bfqd, bfqq,
+				     "wrais ending at %lu, rais_max_time %u",
+				     bfqq->last_wr_start_finish,
+				     jiffies_to_msecs(bfqq->wr_cur_max_time));
+			bfq_bfqq_end_wr(bfqq);
+		}
+	}
+	/* Update weight both if it must be raised and if it must be lowered */
+	if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))
+		__bfq_entity_update_weight_prio(
+			bfq_entity_service_tree(entity),
+			entity);
+}
+
 /*
  * Dispatch one request from bfqq, moving it to the request queue
  * dispatch list.
@@ -1194,6 +1413,8 @@ static int bfq_dispatch_request(struct bfq_data *bfqd,
 	bfq_bfqq_served(bfqq, service_to_charge);
 	bfq_dispatch_insert(bfqd->queue, rq);
 
+	bfq_update_wr_data(bfqd, bfqq);
+
 	bfq_log_bfqq(bfqd, bfqq,
 			"dispatched %u sec req (%llu), budg left %lu",
 			blk_rq_sectors(rq),
@@ -1467,6 +1688,9 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	/* Tentative initial value to trade off between thr and lat */
 	bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
 	bfqq->pid = pid;
+
+	bfqq->wr_coeff = 1;
+	bfqq->last_wr_start_finish = 0;
 }
 
 static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
@@ -1642,7 +1866,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
 		enable_idle = 0;
 	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
-		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle)
+		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
+			bfqq->wr_coeff == 1)
 			enable_idle = 0;
 		else
 			enable_idle = 1;
@@ -2117,6 +2342,22 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async;
 	bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync;
 
+	bfqd->low_latency = true;
+
+	bfqd->bfq_wr_coeff = 20;
+	bfqd->bfq_wr_max_time = 0;
+	bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
+	bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
+
+	/*
+	 * Begin by assuming, optimistically, that the device peak rate is
+	 * equal to the highest reference rate.
+	 */
+	bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] *
+			T_fast[blk_queue_nonrot(bfqd->queue)];
+	bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)];
+	bfqd->device_speed = BFQ_BFQD_FAST;
+
 	return 0;
 }
 
@@ -2151,6 +2392,14 @@ static ssize_t bfq_var_store(unsigned long *var, const char *page,
 	return count;
 }
 
+static ssize_t bfq_wr_max_time_show(struct elevator_queue *e, char *page)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	return sprintf(page, "%d\n", bfqd->bfq_wr_max_time > 0 ?
+		       jiffies_to_msecs(bfqd->bfq_wr_max_time) :
+		       jiffies_to_msecs(bfq_wr_duration(bfqd)));
+}
+
 static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
 {
 	struct bfq_queue *bfqq;
@@ -2165,19 +2414,24 @@ static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
 	num_char += sprintf(page + num_char, "Active:\n");
 	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
 	  num_char += sprintf(page + num_char,
-			      "pid%d: weight %hu, nr_queued %d %d\n",
+			      "pid%d: weight %hu, nr_queued %d %d, dur %d/%u\n",
 			      bfqq->pid,
 			      bfqq->entity.weight,
 			      bfqq->queued[0],
-			      bfqq->queued[1]);
+			      bfqq->queued[1],
+			jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+			jiffies_to_msecs(bfqq->wr_cur_max_time));
 	}
 
 	num_char += sprintf(page + num_char, "Idle:\n");
 	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
 			num_char += sprintf(page + num_char,
-				"pid%d: weight %hu\n",
+				"pid%d: weight %hu, dur %d/%u\n",
 				bfqq->pid,
-				bfqq->entity.weight);
+				bfqq->entity.weight,
+				jiffies_to_msecs(jiffies -
+					bfqq->last_wr_start_finish),
+				jiffies_to_msecs(bfqq->wr_cur_max_time));
 	}
 
 	spin_unlock_irq(bfqd->queue->queue_lock);
@@ -2205,6 +2459,11 @@ SHOW_FUNCTION(bfq_max_budget_async_rq_show,
 	      bfqd->bfq_max_budget_async_rq, 0);
 SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
 SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
+SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
+SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
+SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
+SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
+	1);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2237,6 +2496,12 @@ STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq,
 		1, INT_MAX, 0);
 STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
 		INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
+STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
+		&bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
 #undef STORE_FUNCTION
 
 /* do nothing for the moment */
@@ -2295,6 +2560,22 @@ static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
 	return ret;
 }
 
+static ssize_t bfq_low_latency_store(struct elevator_queue *e,
+				     const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data > 1)
+		__data = 1;
+	if (__data == 0 && bfqd->low_latency != 0)
+		bfq_end_wr(bfqd);
+	bfqd->low_latency = __data;
+
+	return ret;
+}
+
 #define BFQ_ATTR(name) \
 	__ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
 
@@ -2309,6 +2590,11 @@ static struct elv_fs_entry bfq_attrs[] = {
 	BFQ_ATTR(max_budget_async_rq),
 	BFQ_ATTR(timeout_sync),
 	BFQ_ATTR(timeout_async),
+	BFQ_ATTR(low_latency),
+	BFQ_ATTR(wr_coeff),
+	BFQ_ATTR(wr_max_time),
+	BFQ_ATTR(wr_min_idle_time),
+	BFQ_ATTR(wr_min_inter_arr_async),
 	BFQ_ATTR(weights),
 	__ATTR_NULL
 };
@@ -2355,6 +2641,23 @@ static int __init bfq_init(void)
 	if (bfq_slab_setup())
 		return -ENOMEM;
 
+	/*
+	 * Times to load large popular applications for the typical systems
+	 * installed on the reference devices (see the comments before the
+	 * definitions of the two arrays).
+	 */
+	T_slow[0] = msecs_to_jiffies(2600);
+	T_slow[1] = msecs_to_jiffies(1000);
+	T_fast[0] = msecs_to_jiffies(5500);
+	T_fast[1] = msecs_to_jiffies(2000);
+
+	/*
+	 * Thresholds that determine the switch between speed classes (see
+	 * the comments before the definition of the array).
+	 */
+	device_speed_thresh[0] = (R_fast[0] + R_slow[0]) / 2;
+	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
+
 	elv_register(&iosched_bfq);
 	pr_info("BFQ I/O-scheduler version: v1");
 
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 075e472..f6491d5 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -514,6 +514,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 	struct bfq_service_tree *new_st = old_st;
 
 	if (entity->ioprio_changed) {
+		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
 		old_st->wsum -= entity->weight;
 
 		if (entity->new_weight != entity->orig_weight) {
@@ -539,7 +541,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		 * when entity->finish <= old_st->vtime).
 		 */
 		new_st = bfq_entity_service_tree(entity);
-		entity->weight = entity->orig_weight;
+		entity->weight = entity->orig_weight *
+				 (bfqq != NULL ? bfqq->wr_coeff : 1);
 		new_st->wsum += entity->weight;
 
 		if (new_st != old_st)
diff --git a/block/bfq.h b/block/bfq.h
index ea5ecca..3ce9100 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -181,6 +181,10 @@ struct bfq_group;
  * @seek_mean: mean seek distance
  * @last_request_pos: position of the last request enqueued
  * @pid: pid of the process owning the queue, used for logging purposes.
+ * @last_wr_start_finish: start time of the current weight-raising period if
+ *                        the @bfq-queue is being weight-raised, otherwise
+ *                        finish time of the last weight-raising period
+ * @wr_cur_max_time: current max raising time for this queue
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
  * io_context or more, if it is async. @cgroup holds a reference to the
@@ -217,6 +221,11 @@ struct bfq_queue {
 	sector_t last_request_pos;
 
 	pid_t pid;
+
+	/* weight-raising fields */
+	unsigned long wr_cur_max_time;
+	unsigned long last_wr_start_finish;
+	unsigned int wr_coeff;
 };
 
 /**
@@ -297,6 +306,18 @@ enum bfq_device_speed {
  *               they are charged for the whole allocated budget, to try
  *               to preserve a behavior reasonably fair among them, but
  *               without service-domain guarantees).
+ * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
+ *                queue is multiplied
+ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
+ * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
+ *			  may be reactivated for a queue (in jiffies)
+ * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
+ *				after which weight-raising may be
+ *				reactivated for an already busy queue
+ *				(in jiffies)
+ * @RT_prod: cached value of the product R*T used for computing the maximum
+ *	     duration of the weight raising automatically
+ * @device_speed: device-speed class for the low-latency heuristic
  * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
  *
  * All the fields are protected by the @queue lock.
@@ -346,6 +367,16 @@ struct bfq_data {
 	unsigned int bfq_max_budget_async_rq;
 	unsigned int bfq_timeout[2];
 
+	bool low_latency;
+
+	/* parameters of the low_latency heuristics */
+	unsigned int bfq_wr_coeff;
+	unsigned int bfq_wr_max_time;
+	unsigned int bfq_wr_min_idle_time;
+	unsigned long bfq_wr_min_inter_arr_async;
+	u64 RT_prod;
+	enum bfq_device_speed device_speed;
+
 	struct bfq_queue oom_bfqq;
 };
 
@@ -556,6 +587,8 @@ static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
 static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 				       struct bfq_group *bfqg, int is_sync,
 				       struct bfq_io_cq *bic, gfp_t gfp_mask);
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
+				    struct bfq_group *bfqg);
 static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 
-- 
1.9.2
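
For reference, the tunables added by this patch are exported through the
elevator's sysfs directory, typically /sys/block/<dev>/queue/iosched/. The
snippet below is an illustrative userspace sketch, not part of the patch:
it reads and re-enables low_latency for a hypothetical device sda (the
device name, and the assumption that BFQ is the active scheduler, are ours).

/*
 * Illustrative userspace sketch (not part of the patch): inspect and
 * re-enable the new low_latency tunable. Assumes BFQ is the elevator in
 * use for the hypothetical device sda; adjust the path as needed.
 */
#include <stdio.h>

#define LOW_LAT_PATH "/sys/block/sda/queue/iosched/low_latency"

int main(void)
{
	char buf[16] = "";
	FILE *f = fopen(LOW_LAT_PATH, "r");

	if (!f) {
		perror("open " LOW_LAT_PATH);
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("low_latency is currently: %s", buf);
	fclose(f);

	/* Writing 0 disables weight raising altogether, 1 re-enables it. */
	f = fopen(LOW_LAT_PATH, "w");
	if (!f) {
		perror("reopen " LOW_LAT_PATH " for write (root needed)");
		return 1;
	}
	fputs("1\n", f);
	fclose(f);
	return 0;
}

The other attributes added here (wr_coeff, wr_max_time, wr_min_idle_time,
wr_min_inter_arr_async) live in the same directory and take plain integers;
as the SHOW/STORE conversion flags above suggest, the time-valued ones are
expressed in milliseconds.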



* [PATCH RFC RESEND 09/14] block, bfq: reduce I/O latency for soft real-time applications
  2014-05-27 12:42 ` paolo
@ 2014-05-27 12:42     ` paolo
  -1 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

From: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>

To guarantee low latency also for the I/O requests issued by soft
real-time applications, this patch introduces a further heuristic,
which weight-raises (in the sense explained in patch 8) also the
queues associated with applications deemed as soft real-time.

To be deemed as soft real-time, an application must meet two
requirements.  First, the application must not require an average
bandwidth higher than the approximate bandwidth required to playback
or record a compressed high-definition video. Second, the request
pattern of the application must be isochronous, i.e., after issuing a
request or a batch of requests, the application must stop issuing new
requests until all its pending requests have been completed. After
that, the application may issue a new batch, and so on.

As for the second requirement, it is critical to also require that,
after all the pending requests of the application have been completed,
an adequate minimum amount of time elapses before the application
starts issuing new requests. This also prevents greedy (i.e.,
I/O-bound) applications from being incorrectly deemed, occasionally,
as soft real-time. In fact, if *any amount of time* is fine, then even
a greedy application may, paradoxically, meet both the above
requirements, if: (1) the application performs random I/O and/or the
device is slow, and (2) the CPU load is high. The reason is the
following.  First, if condition (1) is true, then, during the service
of the application, the throughput may be low enough to let the
application meet the bandwidth requirement.  Second, if condition (2)
is true as well, then the application may occasionally behave in an
apparently isochronous way, because it may simply stop issuing
requests while the CPUs are busy serving other processes.

To address this issue, the heuristic leverages the simple fact that
greedy applications issue *all* of their requests as quickly as they
can, whereas soft real-time applications spend some time processing
data after each batch of requests is completed. In particular, the
heuristic works as follows. First, according to the above isochrony
requirement, the heuristic checks whether an application may be soft
real-time (and therefore gives to the application the opportunity to
be deemed as such) only when both the following two conditions happen
to hold: 1) the queue associated with the application has expired and
is empty, 2) there is no outstanding request of the application.

Suppose that both conditions hold at time, say, t_c and that the
application issues instead its next request at time, say, t_i. At time
t_c the heuristic computes the next time instant, called
soft_rt_next_start in the code, such that, only if
t_i >= soft_rt_next_start, then both the next conditions will hold
when the application issues its next request:
1) the application will meet the above bandwidth requirement,
2) a given minimum time interval, say Delta, will have elapsed from
time t_c (so as to filter out greedy applications).

The current value of Delta is a little bit higher than the value that
we have found, experimentally, to be adequate on a real,
general-purpose machine. In particular we had to increase Delta to
make the filter quite precise also in slower, embedded systems, and in
KVM/QEMU virtual machines (details in the comments to the code).
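
To make the computation above concrete, here is a compact userspace
restatement of how the next-start instant is derived; it mirrors the
bfq_bfqq_softrt_next_start() helper introduced in the diff below, the
"Delta" of the description corresponds to the slice_idle-plus-a-few-jiffies
term, and HZ as well as the sample numbers are invented for the example
(this sketch is illustrative only, not part of the patch).

/*
 * Userspace illustration of the soft_rt_next_start computation.
 */
#include <stdio.h>

#define HZ 250UL	/* assumed tick frequency for the example */

static unsigned long softrt_next_start(unsigned long last_idle_bklogged,
				       unsigned long service_from_backlogged,
				       unsigned long max_softrt_rate,
				       unsigned long now,
				       unsigned long slice_idle)
{
	/*
	 * Earliest arrival time that keeps the bandwidth consumed since
	 * the last idle->backlogged transition below max_softrt_rate
	 * (expressed in sectors per second).
	 */
	unsigned long bw_ok = last_idle_bklogged +
			      HZ * service_from_backlogged / max_softrt_rate;
	/* The Delta filter: a bit more than the idling window from now. */
	unsigned long delta_ok = now + slice_idle + 4;

	return bw_ok > delta_ok ? bw_ok : delta_ok;
}

int main(void)
{
	/* 2000 sectors served since jiffy 1000, rate capped at 7000 sect/s */
	unsigned long next = softrt_next_start(1000, 2000, 7000, 1050, 2);

	printf("soft_rt_next_start = %lu (jiffies)\n", next);
	return 0;
}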

If the application actually issues its next request after time
soft_rt_next_start, then its associated queue will be weight-raised
for a relatively short time interval. If, during this time interval,
the application proves again to meet the bandwidth and isochrony
requirements, then the end of the weight-raising period for the queue
is moved forward, and so on. Note that an application whose associated
queue never happens to be empty when it expires will never have the
opportunity to be deemed as soft real-time.
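
The arrival-time decision just described can be condensed into the
following userspace-style sketch, meant only as a reading aid for the
larger bfq_add_request() hunk below; the field names mirror the patch,
while the simplified control flow, the plain parameters replacing
bfq_wr_duration(), and the numbers in main() are ours.

/*
 * Condensed restatement of the weight-raising decision taken when a
 * queue becomes busy again (illustrative only).
 */
#include <stdbool.h>
#include <stdio.h>

struct q {
	unsigned int  wr_coeff;		/* 1 = not weight-raised */
	unsigned long wr_cur_max_time;
	unsigned long last_wr_start_finish;
};

static void update_wr_on_arrival(struct q *q, unsigned long now,
				 bool idle_for_long_time, bool soft_rt,
				 unsigned int wr_coeff,
				 unsigned long interactive_dur,
				 unsigned long rt_max_time)
{
	if (q->wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
		/* start weight raising, picking the appropriate duration */
		q->wr_coeff = wr_coeff;
		q->wr_cur_max_time = idle_for_long_time ?
				     interactive_dur : rt_max_time;
	} else if (q->wr_coeff > 1) {
		if (idle_for_long_time)
			q->wr_cur_max_time = interactive_dur;
		else if (q->wr_cur_max_time == rt_max_time && !soft_rt)
			q->wr_coeff = 1;	/* raising no longer deserved */
		else if (soft_rt &&
			 q->last_wr_start_finish + q->wr_cur_max_time <
			 now + rt_max_time) {
			/* less than one soft-rt period left: recharge it */
			q->last_wr_start_finish = now;
			q->wr_cur_max_time = rt_max_time;
		}
	}
}

int main(void)
{
	struct q q = { .wr_coeff = 1 };

	/* hypothetical: queue idle long enough -> interactive raising */
	update_wr_on_arrival(&q, 1000, true, false, 20, 2500, 75);
	printf("coeff %u, max raising time %lu jiffies\n",
	       q.wr_coeff, q.wr_cur_max_time);
	return 0;
}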

Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/Kconfig.iosched |   8 +-
 block/bfq-iosched.c   | 231 ++++++++++++++++++++++++++++++++++++++++++++++++--
 block/bfq.h           |  24 ++++++
 3 files changed, 251 insertions(+), 12 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3e26f28..1d64eea 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -47,9 +47,9 @@ config IOSCHED_BFQ
 	  processes according to their weights.
 	  It aims at distributing the bandwidth as desired, regardless of
 	  the device parameters and with any workload. It also tries to
-	  guarantee a low latency to interactive applications. If compiled
-	  built-in (saying Y here), BFQ can be configured to support
-	  hierarchical scheduling.
+	  guarantee low latency to interactive and soft real-time
+	  applications. If compiled built-in (saying Y here), BFQ can
+	  be configured to support hierarchical scheduling.
 
 config CGROUP_BFQIO
 	bool "BFQ hierarchical scheduling support"
@@ -81,7 +81,7 @@ choice
 		  The BFQ I/O scheduler aims at distributing the bandwidth
 		  as desired, regardless of the disk parameters and with
 		  any workload. It also tries to guarantee a low latency to
-		  interactive applications.
+		  interactive and soft real-time applications.
 
 	config DEFAULT_NOOP
 		bool "No-op"
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ace9aba..661f948 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -24,15 +24,17 @@
  * precisely, BFQ schedules queues associated to processes. Thanks to the
  * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
  * I/O-bound processes issuing sequential requests (to boost the
- * throughput), and yet guarantee a low latency to interactive applications.
+ * throughput), and yet guarantee a low latency to interactive and soft
+ * real-time applications.
  *
  * BFQ is described in [1], where also a reference to the initial, more
  * theoretical paper on BFQ can be found. The interested reader can find
  * in the latter paper full details on the main algorithm, as well as
  * formulas of the guarantees and formal proofs of all the properties.
  * With respect to the version of BFQ presented in these papers, this
- * implementation adds a few more heuristics and a hierarchical extension
- * based on H-WF2Q+.
+ * implementation adds a few more heuristics, such as the one that
+ * guarantees a low latency to soft real-time applications, and a
+ * hierarchical extension based on H-WF2Q+.
  *
  * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
  * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
@@ -401,6 +403,8 @@ static void bfq_add_request(struct request *rq)
 	bfqq->next_rq = next_rq;
 
 	if (!bfq_bfqq_busy(bfqq)) {
+		int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+			time_is_before_jiffies(bfqq->soft_rt_next_start);
 		idle_for_long_time = time_is_before_jiffies(
 			bfqq->budget_timeout +
 			bfqd->bfq_wr_min_idle_time);
@@ -414,9 +418,13 @@ static void bfq_add_request(struct request *rq)
 		 * If the queue is not being boosted and has been idle for
 		 * enough time, start a weight-raising period.
 		 */
-		if (old_wr_coeff == 1 && idle_for_long_time) {
+		if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
 			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
-			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			if (idle_for_long_time)
+				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			else
+				bfqq->wr_cur_max_time =
+					bfqd->bfq_wr_rt_max_time;
 			bfq_log_bfqq(bfqd, bfqq,
 				     "wrais starting at %lu, rais_max_time %u",
 				     jiffies,
@@ -424,18 +432,76 @@ static void bfq_add_request(struct request *rq)
 		} else if (old_wr_coeff > 1) {
 			if (idle_for_long_time)
 				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
-			else {
+			else if (bfqq->wr_cur_max_time ==
+				 bfqd->bfq_wr_rt_max_time &&
+				 !soft_rt) {
 				bfqq->wr_coeff = 1;
 				bfq_log_bfqq(bfqd, bfqq,
 					"wrais ending at %lu, rais_max_time %u",
 					jiffies,
 					jiffies_to_msecs(bfqq->
 						wr_cur_max_time));
+			} else if (time_before(
+					bfqq->last_wr_start_finish +
+					bfqq->wr_cur_max_time,
+					jiffies +
+					bfqd->bfq_wr_rt_max_time) &&
+				   soft_rt) {
+				/*
+				 *
+				 * The remaining weight-raising time is lower
+				 * than bfqd->bfq_wr_rt_max_time, which means
+				 * that the application is enjoying weight
+				 * raising either because deemed soft-rt in
+				 * the near past, or because deemed interactive
+				 * a long time ago.
+				 * In both cases, resetting now the current
+				 * remaining weight-raising time for the
+				 * application to the weight-raising duration
+				 * for soft rt applications would not cause any
+				 * latency increase for the application (as the
+				 * new duration would be higher than the
+				 * remaining time).
+				 *
+				 * In addition, the application is now meeting
+				 * the requirements for being deemed soft rt.
+				 * In the end we can correctly and safely
+				 * (re)charge the weight-raising duration for
+				 * the application with the weight-raising
+				 * duration for soft rt applications.
+				 *
+				 * In particular, doing this recharge now, i.e.,
+				 * before the weight-raising period for the
+				 * application finishes, reduces the probability
+				 * of the following negative scenario:
+				 * 1) the weight of a soft rt application is
+				 *    raised at startup (as for any newly
+				 *    created application),
+				 * 2) since the application is not interactive,
+				 *    at a certain time weight-raising is
+				 *    stopped for the application,
+				 * 3) at that time the application happens to
+				 *    still have pending requests, and hence
+				 *    is destined to not have a chance to be
+				 *    deemed soft rt before these requests are
+				 *    completed (see the comments to the
+				 *    function bfq_bfqq_softrt_next_start()
+				 *    for details on soft rt detection),
+				 * 4) these pending requests experience a high
+				 *    latency because the application is not
+				 *    weight-raised while they are pending.
+				 */
+				bfqq->last_wr_start_finish = jiffies;
+				bfqq->wr_cur_max_time =
+					bfqd->bfq_wr_rt_max_time;
 			}
 		}
 		if (old_wr_coeff != bfqq->wr_coeff)
 			entity->ioprio_changed = 1;
 add_bfqq_busy:
+		bfqq->last_idle_bklogged = jiffies;
+		bfqq->service_from_backlogged = 0;
+		bfq_clear_bfqq_softrt_update(bfqq);
 		bfq_add_bfqq_busy(bfqd, bfqq);
 	} else {
 		if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
@@ -753,8 +819,11 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 static void bfq_set_budget_timeout(struct bfq_data *bfqd)
 {
 	struct bfq_queue *bfqq = bfqd->in_service_queue;
-	unsigned int timeout_coeff = bfqq->entity.weight /
-				     bfqq->entity.orig_weight;
+	unsigned int timeout_coeff;
+	if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
+		timeout_coeff = 1;
+	else
+		timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
 
 	bfqd->last_budget_start = ktime_get();
 
@@ -1105,6 +1174,77 @@ static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	return expected > (4 * bfqq->entity.budget) / 3;
 }
 
+/*
+ * To be deemed as soft real-time, an application must meet two
+ * requirements. First, the application must not require an average
+ * bandwidth higher than the approximate bandwidth required to playback or
+ * record a compressed high-definition video.
+ * The next function is invoked on the completion of the last request of a
+ * batch, to compute the next-start time instant, soft_rt_next_start, such
+ * that, if the next request of the application does not arrive before
+ * soft_rt_next_start, then the above requirement on the bandwidth is met.
+ *
+ * The second requirement is that the request pattern of the application is
+ * isochronous, i.e., that, after issuing a request or a batch of requests,
+ * the application stops issuing new requests until all its pending requests
+ * have been completed. After that, the application may issue a new batch,
+ * and so on.
+ * For this reason the next function is invoked to compute
+ * soft_rt_next_start only for applications that meet this requirement,
+ * whereas soft_rt_next_start is set to infinity for applications that do
+ * not.
+ *
+ * Unfortunately, even a greedy application may happen to behave in an
+ * isochronous way if the CPU load is high. In fact, the application may
+ * stop issuing requests while the CPUs are busy serving other processes,
+ * then restart, then stop again for a while, and so on. In addition, if
+ * the disk achieves a low enough throughput with the request pattern
+ * issued by the application (e.g., because the request pattern is random
+ * and/or the device is slow), then the application may meet the above
+ * bandwidth requirement too. To prevent such a greedy application from being
+ * deemed as soft real-time, a further rule is used in the computation of
+ * soft_rt_next_start: soft_rt_next_start must be higher than the current
+ * time plus the maximum time for which the arrival of a request is waited
+ * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle.
+ * This filters out greedy applications, as the latter issue instead their
+ * next request as soon as possible after the last one has been completed
+ * (in contrast, when a batch of requests is completed, a soft real-time
+ * application spends some time processing data).
+ *
+ * Unfortunately, the last filter may easily generate false positives if
+ * only bfqd->bfq_slice_idle is used as a reference time interval and one
+ * or both the following cases occur:
+ * 1) HZ is so low that the duration of a jiffy is comparable to or higher
+ *    than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with
+ *    HZ=100.
+ * 2) jiffies, instead of increasing at a constant rate, may stop increasing
+ *    for a while, then suddenly 'jump' by several units to recover the lost
+ *    increments. This seems to happen, e.g., inside virtual machines.
+ * To address this issue, we do not use as a reference time interval just
+ * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In
+ * particular we add the minimum number of jiffies for which the filter
+ * seems to be quite precise also in embedded systems and KVM/QEMU virtual
+ * machines.
+ */
+static inline unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
+						       struct bfq_queue *bfqq)
+{
+	return max(bfqq->last_idle_bklogged +
+		   HZ * bfqq->service_from_backlogged /
+		   bfqd->bfq_wr_max_softrt_rate,
+		   jiffies + bfqq->bfqd->bfq_slice_idle + 4);
+}
+
+/*
+ * Return the largest-possible time instant such that, for as long as possible,
+ * the current time will be lower than this time instant according to the macro
+ * time_is_before_jiffies().
+ */
+static inline unsigned long bfq_infinity_from_now(unsigned long now)
+{
+	return now + ULONG_MAX / 2;
+}
+
 /**
  * bfq_bfqq_expire - expire a queue.
  * @bfqd: device owning the queue.
@@ -1162,12 +1302,55 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 		     bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3))
 		bfq_bfqq_charge_full_budget(bfqq);
 
+	bfqq->service_from_backlogged += bfqq->entity.service;
+
 	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
 		bfq_mark_bfqq_constantly_seeky(bfqq);
 
 	if (bfqd->low_latency && bfqq->wr_coeff == 1)
 		bfqq->last_wr_start_finish = jiffies;
 
+	if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&
+	    RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		/*
+		 * If we get here, and there are no outstanding requests,
+		 * then the request pattern is isochronous (see the comments
+		 * to the function bfq_bfqq_softrt_next_start()). Hence we
+		 * can compute soft_rt_next_start. If, instead, the queue
+		 * still has outstanding requests, then we have to wait
+		 * for the completion of all the outstanding requests to
+		 * discover whether the request pattern is actually
+		 * isochronous.
+		 */
+		if (bfqq->dispatched == 0)
+			bfqq->soft_rt_next_start =
+				bfq_bfqq_softrt_next_start(bfqd, bfqq);
+		else {
+			/*
+			 * The application is still waiting for the
+			 * completion of one or more requests:
+			 * prevent it from possibly being incorrectly
+			 * deemed as soft real-time by setting its
+			 * soft_rt_next_start to infinity. In fact,
+			 * without this assignment, the application
+			 * would be incorrectly deemed as soft
+			 * real-time if:
+			 * 1) it issued a new request before the
+			 *    completion of all its in-flight
+			 *    requests, and
+			 * 2) at that time, its soft_rt_next_start
+			 *    happened to be in the past.
+			 */
+			bfqq->soft_rt_next_start =
+				bfq_infinity_from_now(jiffies);
+			/*
+			 * Schedule an update of soft_rt_next_start to when
+			 * the task may be discovered to be isochronous.
+			 */
+			bfq_mark_bfqq_softrt_update(bfqq);
+		}
+	}
+
 	bfq_log_bfqq(bfqd, bfqq,
 		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -1691,6 +1874,11 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfqq->wr_coeff = 1;
 	bfqq->last_wr_start_finish = 0;
+	/*
+	 * Set to the value for which bfqq will not be deemed as
+	 * soft rt when it becomes backlogged.
+	 */
+	bfqq->soft_rt_next_start = bfq_infinity_from_now(jiffies);
 }
 
 static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
@@ -2019,6 +2207,18 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	}
 
 	/*
+	 * If we are waiting to discover whether the request pattern of the
+	 * task associated with the queue is actually isochronous, and
+	 * both requisites for this condition to hold are satisfied, then
+	 * compute soft_rt_next_start (see the comments to the function
+	 * bfq_bfqq_softrt_next_start()).
+	 */
+	if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
+	    RB_EMPTY_ROOT(&bfqq->sort_list))
+		bfqq->soft_rt_next_start =
+			bfq_bfqq_softrt_next_start(bfqd, bfqq);
+
+	/*
 	 * If this is the in-service queue, check if it needs to be expired,
 	 * or if we want to idle in case it has no pending requests.
 	 */
@@ -2345,9 +2545,16 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->low_latency = true;
 
 	bfqd->bfq_wr_coeff = 20;
+	bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300);
 	bfqd->bfq_wr_max_time = 0;
 	bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
 	bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
+	bfqd->bfq_wr_max_softrt_rate = 7000; /*
+					      * Approximate rate required
+					      * to playback or record a
+					      * high-definition compressed
+					      * video.
+					      */
 
 	/*
 	 * Begin by assuming, optimistically, that the device peak rate is
@@ -2461,9 +2668,11 @@ SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
 SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
 SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
 SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
+SHOW_FUNCTION(bfq_wr_rt_max_time_show, bfqd->bfq_wr_rt_max_time, 1);
 SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
 SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
 	1);
+SHOW_FUNCTION(bfq_wr_max_softrt_rate_show, bfqd->bfq_wr_max_softrt_rate, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2498,10 +2707,14 @@ STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
 		INT_MAX, 1);
 STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
 STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_rt_max_time_store, &bfqd->bfq_wr_rt_max_time, 0, INT_MAX,
+		1);
 STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
 		INT_MAX, 1);
 STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
 		&bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_max_softrt_rate_store, &bfqd->bfq_wr_max_softrt_rate, 0,
+		INT_MAX, 0);
 #undef STORE_FUNCTION
 
 /* do nothing for the moment */
@@ -2593,8 +2806,10 @@ static struct elv_fs_entry bfq_attrs[] = {
 	BFQ_ATTR(low_latency),
 	BFQ_ATTR(wr_coeff),
 	BFQ_ATTR(wr_max_time),
+	BFQ_ATTR(wr_rt_max_time),
 	BFQ_ATTR(wr_min_idle_time),
 	BFQ_ATTR(wr_min_inter_arr_async),
+	BFQ_ATTR(wr_max_softrt_rate),
 	BFQ_ATTR(weights),
 	__ATTR_NULL
 };
diff --git a/block/bfq.h b/block/bfq.h
index 3ce9100..5fa8b34 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -185,6 +185,17 @@ struct bfq_group;
  *                        the @bfq-queue is being weight-raised, otherwise
  *                        finish time of the last weight-raising period
  * @wr_cur_max_time: current max raising time for this queue
+ * @soft_rt_next_start: minimum time instant such that, only if a new
+ *                      request is enqueued after this time instant in an
+ *                      idle @bfq_queue with no outstanding requests, then
+ *                      the task associated with the queue is deemed as
+ *                      soft real-time (see the comments to the function
+ *                      bfq_bfqq_softrt_next_start())
+ * @last_idle_bklogged: time of the last transition of the @bfq_queue from
+ *                      idle to backlogged
+ * @service_from_backlogged: cumulative service received from the @bfq_queue
+ *                           since the last transition from idle to
+ *                           backlogged
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
  * io_context or more, if it is async. @cgroup holds a reference to the
@@ -224,8 +235,11 @@ struct bfq_queue {
 
 	/* weight-raising fields */
 	unsigned long wr_cur_max_time;
+	unsigned long soft_rt_next_start;
 	unsigned long last_wr_start_finish;
 	unsigned int wr_coeff;
+	unsigned long last_idle_bklogged;
+	unsigned long service_from_backlogged;
 };
 
 /**
@@ -309,12 +323,15 @@ enum bfq_device_speed {
  * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
  *                queue is multiplied
  * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
+ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes
  * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
  *			  may be reactivated for a queue (in jiffies)
  * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
  *				after which weight-raising may be
  *				reactivated for an already busy queue
  *				(in jiffies)
+ * @bfq_wr_max_softrt_rate: max service-rate for a soft real-time queue,
+ *			    sectors per second
  * @RT_prod: cached value of the product R*T used for computing the maximum
  *	     duration of the weight raising automatically
  * @device_speed: device-speed class for the low-latency heuristic
@@ -372,8 +389,10 @@ struct bfq_data {
 	/* parameters of the low_latency heuristics */
 	unsigned int bfq_wr_coeff;
 	unsigned int bfq_wr_max_time;
+	unsigned int bfq_wr_rt_max_time;
 	unsigned int bfq_wr_min_idle_time;
 	unsigned long bfq_wr_min_inter_arr_async;
+	unsigned int bfq_wr_max_softrt_rate;
 	u64 RT_prod;
 	enum bfq_device_speed device_speed;
 
@@ -393,6 +412,10 @@ enum bfqq_state_flags {
 					 * bfqq has proved to be slow and
 					 * seeky until budget timeout
 					 */
+	BFQ_BFQQ_FLAG_softrt_update,	/*
+					 * may need softrt-next-start
+					 * update
+					 */
 };
 
 #define BFQ_BFQQ_FNS(name)						\
@@ -418,6 +441,7 @@ BFQ_BFQQ_FNS(prio_changed);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(budget_new);
 BFQ_BFQQ_FNS(constantly_seeky);
+BFQ_BFQQ_FNS(softrt_update);
 #undef BFQ_BFQQ_FNS
 
 /* Logging facilities. */
-- 
1.9.2


* [PATCH RFC RESEND 09/14] block, bfq: reduce I/O latency for soft real-time applications
@ 2014-05-27 12:42     ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Paolo Valente <paolo.valente@unimore.it>

To guarantee a low latency also to the I/O requests issued by soft
real-time applications, this patch introduces a further heuristic,
which weight-raises (in the sense explained in patch 8) also the
queues associated to applications deemed as soft real-time.

To be deemed as soft real-time, an application must meet two
requirements.  First, the application must not require an average
bandwidth higher than the approximate bandwidth required to playback
or record a compressed high-definition video. Second, the request
pattern of the application must be isochronous, i.e., after issuing a
request or a batch of requests, the application must stop issuing new
requests until all its pending requests have been completed. After
that, the application may issue a new batch, and so on.

As for the second requirement, it is critical to require also that,
after all the pending requests of the application have been completed,
an adequate minimum amount of time elapses before the application
starts issuing new requests. This prevents also greedy (i.e.,
I/O-bound) applications from being incorrectly deemed, occasionally,
as soft real-time. In fact, if *any amount of time* is fine, then even
a greedy application may, paradoxically, meet both the above
requirements, if: (1) the application performs random I/O and/or the
device is slow, and (2) the CPU load is high. The reason is the
following.  First, if condition (1) is true, then, during the service
of the application, the throughput may be low enough to let the
application meet the bandwidth requirement.  Second, if condition (2)
is true as well, then the application may occasionally behave in an
apparently isochronous way, because it may simply stop issuing
requests while the CPUs are busy serving other processes.

To address this issue, the heuristic leverages the simple fact that
greedy applications issue *all* of their requests as quickly as they
can, whereas soft real-time applications spend some time processing
data after each batch of requests is completed. In particular, the
heuristic works as follows. First, according to the above isochrony
requirement, the heuristic checks whether an application may be soft
real-time (and therefore gives to the application the opportunity to
be deemed as such) only when both the following two conditions happen
to hold: 1) the queue associated with the application has expired and
is empty, 2) there is no outstanding request of the application.

Suppose that both conditions hold at time, say, t_c and that the
application issues instead its next request at time, say, t_i. At time
t_c the heuristic computes the next time instant, called
soft_rt_next_start in the code, such that, only if
t_i >= soft_rt_next_start, then both the next conditions will hold
when the application issues its next request:
1) the application will meet the above bandwidth requirement,
2) a given minimum time interval, say Delta, will have elapsed from
time t_c (so as to filter out greedy application).

The current value of Delta is a little bit higher than the value that
we have found, experimentally, to be adequate on a real,
general-purpose machine. In particular we had to increase Delta to
make the filter quite precise also in slower, embedded systems, and in
KVM/QEMU virtual machines (details in the comments to the code).

If the application actually issues its next request after time
soft_rt_next_start, then its associated queue will be weight-raised
for a relatively short time interval. If, during this time interval,
the application proves again to meet the bandwidth and isochrony
requirements, then the end of the weight-raising period for the queue
is moved forward, and so on. Note that an application whose associated
queue never happens to be empty when it expires will never have the
opportunity to be deemed as soft real-time.

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/Kconfig.iosched |   8 +-
 block/bfq-iosched.c   | 231 ++++++++++++++++++++++++++++++++++++++++++++++++--
 block/bfq.h           |  24 ++++++
 3 files changed, 251 insertions(+), 12 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3e26f28..1d64eea 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -47,9 +47,9 @@ config IOSCHED_BFQ
 	  processes according to their weights.
 	  It aims at distributing the bandwidth as desired, regardless of
 	  the device parameters and with any workload. It also tries to
-	  guarantee a low latency to interactive applications. If compiled
-	  built-in (saying Y here), BFQ can be configured to support
-	  hierarchical scheduling.
+	  guarantee low latency to interactive and soft real-time
+	  applications. If compiled built-in (saying Y here), BFQ can
+	  be configured to support hierarchical scheduling.
 
 config CGROUP_BFQIO
 	bool "BFQ hierarchical scheduling support"
@@ -81,7 +81,7 @@ choice
 		  The BFQ I/O scheduler aims at distributing the bandwidth
 		  as desired, regardless of the disk parameters and with
 		  any workload. It also tries to guarantee a low latency to
-		  interactive applications.
+		  interactive and soft real-time applications.
 
 	config DEFAULT_NOOP
 		bool "No-op"
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ace9aba..661f948 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -24,15 +24,17 @@
  * precisely, BFQ schedules queues associated to processes. Thanks to the
  * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
  * I/O-bound processes issuing sequential requests (to boost the
- * throughput), and yet guarantee a low latency to interactive applications.
+ * throughput), and yet guarantee a low latency to interactive and soft
+ * real-time applications.
  *
  * BFQ is described in [1], where also a reference to the initial, more
  * theoretical paper on BFQ can be found. The interested reader can find
  * in the latter paper full details on the main algorithm, as well as
  * formulas of the guarantees and formal proofs of all the properties.
  * With respect to the version of BFQ presented in these papers, this
- * implementation adds a few more heuristics and a hierarchical extension
- * based on H-WF2Q+.
+ * implementation adds a few more heuristics, such as the one that
+ * guarantees a low latency to soft real-time applications, and a
+ * hierarchical extension based on H-WF2Q+.
  *
  * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
  * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
@@ -401,6 +403,8 @@ static void bfq_add_request(struct request *rq)
 	bfqq->next_rq = next_rq;
 
 	if (!bfq_bfqq_busy(bfqq)) {
+		int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+			time_is_before_jiffies(bfqq->soft_rt_next_start);
 		idle_for_long_time = time_is_before_jiffies(
 			bfqq->budget_timeout +
 			bfqd->bfq_wr_min_idle_time);
@@ -414,9 +418,13 @@ static void bfq_add_request(struct request *rq)
 		 * If the queue is not being boosted and has been idle for
 		 * enough time, start a weight-raising period.
 		 */
-		if (old_wr_coeff == 1 && idle_for_long_time) {
+		if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
 			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
-			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			if (idle_for_long_time)
+				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			else
+				bfqq->wr_cur_max_time =
+					bfqd->bfq_wr_rt_max_time;
 			bfq_log_bfqq(bfqd, bfqq,
 				     "wrais starting at %lu, rais_max_time %u",
 				     jiffies,
@@ -424,18 +432,76 @@ static void bfq_add_request(struct request *rq)
 		} else if (old_wr_coeff > 1) {
 			if (idle_for_long_time)
 				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
-			else {
+			else if (bfqq->wr_cur_max_time ==
+				 bfqd->bfq_wr_rt_max_time &&
+				 !soft_rt) {
 				bfqq->wr_coeff = 1;
 				bfq_log_bfqq(bfqd, bfqq,
 					"wrais ending at %lu, rais_max_time %u",
 					jiffies,
 					jiffies_to_msecs(bfqq->
 						wr_cur_max_time));
+			} else if (time_before(
+					bfqq->last_wr_start_finish +
+					bfqq->wr_cur_max_time,
+					jiffies +
+					bfqd->bfq_wr_rt_max_time) &&
+				   soft_rt) {
+				/*
+				 *
+				 * The remaining weight-raising time is lower
+				 * than bfqd->bfq_wr_rt_max_time, which means
+				 * that the application is enjoying weight
+				 * raising either because deemed soft-rt in
+				 * the near past, or because deemed interactive
+				 * a long ago.
+				 * In both cases, resetting now the current
+				 * remaining weight-raising time for the
+				 * application to the weight-raising duration
+				 * for soft rt applications would not cause any
+				 * latency increase for the application (as the
+				 * new duration would be higher than the
+				 * remaining time).
+				 *
+				 * In addition, the application is now meeting
+				 * the requirements for being deemed soft rt.
+				 * In the end we can correctly and safely
+				 * (re)charge the weight-raising duration for
+				 * the application with the weight-raising
+				 * duration for soft rt applications.
+				 *
+				 * In particular, doing this recharge now, i.e.,
+				 * before the weight-raising period for the
+				 * application finishes, reduces the probability
+				 * of the following negative scenario:
+				 * 1) the weight of a soft rt application is
+				 *    raised at startup (as for any newly
+				 *    created application),
+				 * 2) since the application is not interactive,
+				 *    at a certain time weight-raising is
+				 *    stopped for the application,
+				 * 3) at that time the application happens to
+				 *    still have pending requests, and hence
+				 *    is destined to not have a chance to be
+				 *    deemed soft rt before these requests are
+				 *    completed (see the comments to the
+				 *    function bfq_bfqq_softrt_next_start()
+				 *    for details on soft rt detection),
+				 * 4) these pending requests experience a high
+				 *    latency because the application is not
+				 *    weight-raised while they are pending.
+				 */
+				bfqq->last_wr_start_finish = jiffies;
+				bfqq->wr_cur_max_time =
+					bfqd->bfq_wr_rt_max_time;
 			}
 		}
 		if (old_wr_coeff != bfqq->wr_coeff)
 			entity->ioprio_changed = 1;
 add_bfqq_busy:
+		bfqq->last_idle_bklogged = jiffies;
+		bfqq->service_from_backlogged = 0;
+		bfq_clear_bfqq_softrt_update(bfqq);
 		bfq_add_bfqq_busy(bfqd, bfqq);
 	} else {
 		if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
@@ -753,8 +819,11 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 static void bfq_set_budget_timeout(struct bfq_data *bfqd)
 {
 	struct bfq_queue *bfqq = bfqd->in_service_queue;
-	unsigned int timeout_coeff = bfqq->entity.weight /
-				     bfqq->entity.orig_weight;
+	unsigned int timeout_coeff;
+	if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
+		timeout_coeff = 1;
+	else
+		timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
 
 	bfqd->last_budget_start = ktime_get();
 
@@ -1105,6 +1174,77 @@ static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	return expected > (4 * bfqq->entity.budget) / 3;
 }
 
+/*
+ * To be deemed as soft real-time, an application must meet two
+ * requirements. First, the application must not require an average
+ * bandwidth higher than the approximate bandwidth required to playback or
+ * record a compressed high-definition video.
+ * The next function is invoked on the completion of the last request of a
+ * batch, to compute the next-start time instant, soft_rt_next_start, such
+ * that, if the next request of the application does not arrive before
+ * soft_rt_next_start, then the above requirement on the bandwidth is met.
+ *
+ * The second requirement is that the request pattern of the application is
+ * isochronous, i.e., that, after issuing a request or a batch of requests,
+ * the application stops issuing new requests until all its pending requests
+ * have been completed. After that, the application may issue a new batch,
+ * and so on.
+ * For this reason the next function is invoked to compute
+ * soft_rt_next_start only for applications that meet this requirement,
+ * whereas soft_rt_next_start is set to infinity for applications that do
+ * not.
+ *
+ * Unfortunately, even a greedy application may happen to behave in an
+ * isochronous way if the CPU load is high. In fact, the application may
+ * stop issuing requests while the CPUs are busy serving other processes,
+ * then restart, then stop again for a while, and so on. In addition, if
+ * the disk achieves a low enough throughput with the request pattern
+ * issued by the application (e.g., because the request pattern is random
+ * and/or the device is slow), then the application may meet the above
+ * bandwidth requirement too. To prevent such a greedy application to be
+ * deemed as soft real-time, a further rule is used in the computation of
+ * soft_rt_next_start: soft_rt_next_start must be higher than the current
+ * time plus the maximum time for which the arrival of a request is waited
+ * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle.
+ * This filters out greedy applications, as the latter issue instead their
+ * next request as soon as possible after the last one has been completed
+ * (in contrast, when a batch of requests is completed, a soft real-time
+ * application spends some time processing data).
+ *
+ * Unfortunately, the last filter may easily generate false positives if
+ * only bfqd->bfq_slice_idle is used as a reference time interval and one
+ * or both the following cases occur:
+ * 1) HZ is so low that the duration of a jiffy is comparable to or higher
+ *    than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with
+ *    HZ=100.
+ * 2) jiffies, instead of increasing at a constant rate, may stop increasing
+ *    for a while, then suddenly 'jump' by several units to recover the lost
+ *    increments. This seems to happen, e.g., inside virtual machines.
+ * To address this issue, we do not use as a reference time interval just
+ * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In
+ * particular we add the minimum number of jiffies for which the filter
+ * seems to be quite precise also in embedded systems and KVM/QEMU virtual
+ * machines.
+ */
+static inline unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
+						       struct bfq_queue *bfqq)
+{
+	return max(bfqq->last_idle_bklogged +
+		   HZ * bfqq->service_from_backlogged /
+		   bfqd->bfq_wr_max_softrt_rate,
+		   jiffies + bfqq->bfqd->bfq_slice_idle + 4);
+}
+
+/*
+ * Return the largest-possible time instant such that, for as long as possible,
+ * the current time will be lower than this time instant according to the macro
+ * time_is_before_jiffies().
+ */
+static inline unsigned long bfq_infinity_from_now(unsigned long now)
+{
+	return now + ULONG_MAX / 2;
+}
+
 /**
  * bfq_bfqq_expire - expire a queue.
  * @bfqd: device owning the queue.
@@ -1162,12 +1302,55 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 		     bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3))
 		bfq_bfqq_charge_full_budget(bfqq);
 
+	bfqq->service_from_backlogged += bfqq->entity.service;
+
 	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
 		bfq_mark_bfqq_constantly_seeky(bfqq);
 
 	if (bfqd->low_latency && bfqq->wr_coeff == 1)
 		bfqq->last_wr_start_finish = jiffies;
 
+	if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&
+	    RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		/*
+		 * If we get here, and there are no outstanding requests,
+		 * then the request pattern is isochronous (see the comments
+		 * to the function bfq_bfqq_softrt_next_start()). Hence we
+		 * can compute soft_rt_next_start. If, instead, the queue
+		 * still has outstanding requests, then we have to wait
+		 * for the completion of all the outstanding requests to
+		 * discover whether the request pattern is actually
+		 * isochronous.
+		 */
+		if (bfqq->dispatched == 0)
+			bfqq->soft_rt_next_start =
+				bfq_bfqq_softrt_next_start(bfqd, bfqq);
+		else {
+			/*
+			 * The application is still waiting for the
+			 * completion of one or more requests:
+			 * prevent it from possibly being incorrectly
+			 * deemed as soft real-time by setting its
+			 * soft_rt_next_start to infinity. In fact,
+			 * without this assignment, the application
+			 * would be incorrectly deemed as soft
+			 * real-time if:
+			 * 1) it issued a new request before the
+			 *    completion of all its in-flight
+			 *    requests, and
+			 * 2) at that time, its soft_rt_next_start
+			 *    happened to be in the past.
+			 */
+			bfqq->soft_rt_next_start =
+				bfq_infinity_from_now(jiffies);
+			/*
+			 * Schedule an update of soft_rt_next_start to when
+			 * the task may be discovered to be isochronous.
+			 */
+			bfq_mark_bfqq_softrt_update(bfqq);
+		}
+	}
+
 	bfq_log_bfqq(bfqd, bfqq,
 		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -1691,6 +1874,11 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfqq->wr_coeff = 1;
 	bfqq->last_wr_start_finish = 0;
+	/*
+	 * Set to the value for which bfqq will not be deemed as
+	 * soft rt when it becomes backlogged.
+	 */
+	bfqq->soft_rt_next_start = bfq_infinity_from_now(jiffies);
 }
 
 static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
@@ -2019,6 +2207,18 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	}
 
 	/*
+	 * If we are waiting to discover whether the request pattern of the
+	 * task associated with the queue is actually isochronous, and
+	 * both requisites for this condition to hold are satisfied, then
+	 * compute soft_rt_next_start (see the comments to the function
+	 * bfq_bfqq_softrt_next_start()).
+	 */
+	if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
+	    RB_EMPTY_ROOT(&bfqq->sort_list))
+		bfqq->soft_rt_next_start =
+			bfq_bfqq_softrt_next_start(bfqd, bfqq);
+
+	/*
 	 * If this is the in-service queue, check if it needs to be expired,
 	 * or if we want to idle in case it has no pending requests.
 	 */
@@ -2345,9 +2545,16 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->low_latency = true;
 
 	bfqd->bfq_wr_coeff = 20;
+	bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300);
 	bfqd->bfq_wr_max_time = 0;
 	bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
 	bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
+	bfqd->bfq_wr_max_softrt_rate = 7000; /*
+					      * Approximate rate required
+					      * to playback or record a
+					      * high-definition compressed
+					      * video.
+					      */
 
 	/*
 	 * Begin by assuming, optimistically, that the device peak rate is
@@ -2461,9 +2668,11 @@ SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
 SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
 SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
 SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
+SHOW_FUNCTION(bfq_wr_rt_max_time_show, bfqd->bfq_wr_rt_max_time, 1);
 SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
 SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
 	1);
+SHOW_FUNCTION(bfq_wr_max_softrt_rate_show, bfqd->bfq_wr_max_softrt_rate, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2498,10 +2707,14 @@ STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
 		INT_MAX, 1);
 STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
 STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_rt_max_time_store, &bfqd->bfq_wr_rt_max_time, 0, INT_MAX,
+		1);
 STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
 		INT_MAX, 1);
 STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
 		&bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_max_softrt_rate_store, &bfqd->bfq_wr_max_softrt_rate, 0,
+		INT_MAX, 0);
 #undef STORE_FUNCTION
 
 /* do nothing for the moment */
@@ -2593,8 +2806,10 @@ static struct elv_fs_entry bfq_attrs[] = {
 	BFQ_ATTR(low_latency),
 	BFQ_ATTR(wr_coeff),
 	BFQ_ATTR(wr_max_time),
+	BFQ_ATTR(wr_rt_max_time),
 	BFQ_ATTR(wr_min_idle_time),
 	BFQ_ATTR(wr_min_inter_arr_async),
+	BFQ_ATTR(wr_max_softrt_rate),
 	BFQ_ATTR(weights),
 	__ATTR_NULL
 };
diff --git a/block/bfq.h b/block/bfq.h
index 3ce9100..5fa8b34 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -185,6 +185,17 @@ struct bfq_group;
  *                        the @bfq-queue is being weight-raised, otherwise
  *                        finish time of the last weight-raising period
  * @wr_cur_max_time: current max raising time for this queue
+ * @soft_rt_next_start: minimum time instant such that, only if a new
+ *                      request is enqueued after this time instant in an
+ *                      idle @bfq_queue with no outstanding requests, then
+ *                      the task associated with the queue it is deemed as
+ *                      soft real-time (see the comments to the function
+ *                      bfq_bfqq_softrt_next_start())
+ * @last_idle_bklogged: time of the last transition of the @bfq_queue from
+ *                      idle to backlogged
+ * @service_from_backlogged: cumulative service received from the @bfq_queue
+ *                           since the last transition from idle to
+ *                           backlogged
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
  * io_context or more, if it is async. @cgroup holds a reference to the
@@ -224,8 +235,11 @@ struct bfq_queue {
 
 	/* weight-raising fields */
 	unsigned long wr_cur_max_time;
+	unsigned long soft_rt_next_start;
 	unsigned long last_wr_start_finish;
 	unsigned int wr_coeff;
+	unsigned long last_idle_bklogged;
+	unsigned long service_from_backlogged;
 };
 
 /**
@@ -309,12 +323,15 @@ enum bfq_device_speed {
  * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
  *                queue is multiplied
  * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
+ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes
  * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
  *			  may be reactivated for a queue (in jiffies)
  * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
  *				after which weight-raising may be
  *				reactivated for an already busy queue
  *				(in jiffies)
+ * @bfq_wr_max_softrt_rate: max service-rate for a soft real-time queue,
+ *			    sectors per second
  * @RT_prod: cached value of the product R*T used for computing the maximum
  *	     duration of the weight raising automatically
  * @device_speed: device-speed class for the low-latency heuristic
@@ -372,8 +389,10 @@ struct bfq_data {
 	/* parameters of the low_latency heuristics */
 	unsigned int bfq_wr_coeff;
 	unsigned int bfq_wr_max_time;
+	unsigned int bfq_wr_rt_max_time;
 	unsigned int bfq_wr_min_idle_time;
 	unsigned long bfq_wr_min_inter_arr_async;
+	unsigned int bfq_wr_max_softrt_rate;
 	u64 RT_prod;
 	enum bfq_device_speed device_speed;
 
@@ -393,6 +412,10 @@ enum bfqq_state_flags {
 					 * bfqq has proved to be slow and
 					 * seeky until budget timeout
 					 */
+	BFQ_BFQQ_FLAG_softrt_update,	/*
+					 * may need softrt-next-start
+					 * update
+					 */
 };
 
 #define BFQ_BFQQ_FNS(name)						\
@@ -418,6 +441,7 @@ BFQ_BFQQ_FNS(prio_changed);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(budget_new);
 BFQ_BFQQ_FNS(constantly_seeky);
+BFQ_BFQQ_FNS(softrt_update);
 #undef BFQ_BFQQ_FNS
 
 /* Logging facilities. */
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 10/14] block, bfq: preserve a low latency also with NCQ-capable drives
  2014-05-27 12:42 ` paolo
@ 2014-05-27 12:42     ` paolo
  -1 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

From: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>

I/O schedulers typically allow NCQ-capable drives to prefetch I/O
requests, as NCQ boosts the throughput exactly by prefetching and
internally reordering requests.

Unfortunately, as discussed in detail and shown experimentally in [1],
this may cause fairness and latency guarantees to be violated. The
main problem is that the internal scheduler of an NCQ-capable drive
may postpone the service of some unlucky (prefetched) requests as long
as it deems serving other requests more appropriate to boost the
throughput.

This patch addresses this issue by not disabling device idling for
weight-raised queues, even if the device supports NCQ. This allows BFQ
to start serving a new queue, and therefore allows the drive to
prefetch new requests, only after the idling timeout expires. By that
time, all the outstanding requests of the expired queue have almost
certainly been served.
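
As a purely illustrative sketch (simplified, standalone C; names and
types are not those of the in-tree code), the idle-window decision
after this patch behaves as follows:

#include <stdbool.h>

static bool keep_idle_window(bool has_active_refs, unsigned int slice_idle,
			     bool hw_tag, bool seeky, unsigned int wr_coeff)
{
	if (!has_active_refs || slice_idle == 0)
		return false;
	/*
	 * Before this patch, any seeky queue on an NCQ-capable (hw_tag)
	 * drive lost its idle window; with the patch, this happens only
	 * if the queue is not being weight-raised (wr_coeff == 1).
	 */
	if (hw_tag && seeky && wr_coeff == 1)
		return false;
	return true;
}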

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/bfq-iosched.c | 5 +++--
 block/bfq.h         | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 661f948..0b24130 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2051,7 +2051,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 
 	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
 	    bfqd->bfq_slice_idle == 0 ||
-		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+		(bfqd->hw_tag && BFQQ_SEEKY(bfqq) &&
+			bfqq->wr_coeff == 1))
 		enable_idle = 0;
 	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
 		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
@@ -2874,7 +2875,7 @@ static int __init bfq_init(void)
 	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v1");
+	pr_info("BFQ I/O-scheduler version: v2");
 
 	return 0;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 5fa8b34..3b5763a7 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v1 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v2 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 10/14] block, bfq: preserve a low latency also with NCQ-capable drives
@ 2014-05-27 12:42     ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Paolo Valente <paolo.valente@unimore.it>

I/O schedulers typically allow NCQ-capable drives to prefetch I/O
requests, as NCQ boosts the throughput exactly by prefetching and
internally reordering requests.

Unfortunately, as discussed in detail and shown experimentally in [1],
this may cause fairness and latency guarantees to be violated. The
main problem is that the internal scheduler of an NCQ-capable drive
may postpone the service of some unlucky (prefetched) requests as long
as it deems serving other requests more appropriate to boost the
throughput.

This patch addresses this issue by not disabling device idling for
weight-raised queues, even if the device supports NCQ. This allows BFQ
to start serving a new queue, and therefore allows the drive to
prefetch new requests, only after the idling timeout expires. By that
time, all the outstanding requests of the expired queue have almost
certainly been served.

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 5 +++--
 block/bfq.h         | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 661f948..0b24130 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2051,7 +2051,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 
 	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
 	    bfqd->bfq_slice_idle == 0 ||
-		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+		(bfqd->hw_tag && BFQQ_SEEKY(bfqq) &&
+			bfqq->wr_coeff == 1))
 		enable_idle = 0;
 	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
 		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
@@ -2874,7 +2875,7 @@ static int __init bfq_init(void)
 	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v1");
+	pr_info("BFQ I/O-scheduler version: v2");
 
 	return 0;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 5fa8b34..3b5763a7 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v1 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v2 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 11/14] block, bfq: reduce latency during request-pool saturation
  2014-05-27 12:42 ` paolo
@ 2014-05-27 12:42     ` paolo
  -1 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

From: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>

This patch introduces a heuristic that reduces latency when the
I/O-request pool is saturated. This goal is achieved by disabling
device idling, for non-weight-raised queues, when there are weight-
raised queues with pending or in-flight requests. In fact, as
explained in more detail in the comment to the function
bfq_bfqq_must_not_expire(), this reduces the rate at which processes
associated with non-weight-raised queues grab requests from the pool,
thereby increasing the probability that processes associated with
weight-raised queues get a request immediately (or at least soon) when
they need one.
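
A minimal sketch of the resulting idling rule (simplified, standalone
C; the real check lives in bfq_bfqq_must_not_expire()):

#include <stdbool.h>

/*
 * A sync queue may keep the device idling only if it is weight-raised,
 * or if it has a non-null idle window and no weight-raised queue is
 * busy on an NCQ-capable (hw_tag) device.
 */
static bool may_idle(bool sync, unsigned int wr_coeff, bool idle_window,
		     bool hw_tag, int wr_busy_queues)
{
	bool expire_non_wr = hw_tag && wr_busy_queues > 0;

	return sync && (wr_coeff > 1 || (idle_window && !expire_non_wr));
}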

Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/bfq-iosched.c | 67 +++++++++++++++++++++++++++++++++++++++++++----------
 block/bfq-sched.c   |  6 +++++
 block/bfq.h         |  2 ++
 3 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0b24130..5988c70 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -511,6 +511,7 @@ add_bfqq_busy:
 			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
 			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
 
+			bfqd->wr_busy_queues++;
 			entity->ioprio_changed = 1;
 			bfq_log_bfqq(bfqd, bfqq,
 			    "non-idle wrais starting at %lu, rais_max_time %u",
@@ -655,6 +656,8 @@ static void bfq_merged_requests(struct request_queue *q, struct request *rq,
 /* Must be called with bfqq != NULL */
 static inline void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
 {
+	if (bfq_bfqq_busy(bfqq))
+		bfqq->bfqd->wr_busy_queues--;
 	bfqq->wr_coeff = 1;
 	bfqq->wr_cur_max_time = 0;
 	/* Trigger a weight change on the next activation of the queue */
@@ -1401,22 +1404,61 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
 /*
  * Device idling is allowed only for the queues for which this function
  * returns true. For this reason, the return value of this function plays a
- * critical role for both throughput boosting and service guarantees.
+ * critical role for both throughput boosting and service guarantees. The
+ * return value is computed through a logical expression. In this rather
+ * long comment, we try to briefly describe all the details and motivations
+ * behind the components of this logical expression.
  *
- * The return value is computed through a logical expression, which may
- * be true only if bfqq is sync and at least one of the following two
- * conditions holds:
- * - the queue has a non-null idle window;
- * - the queue is being weight-raised.
- * In fact, waiting for a new request for the queue, in the first case,
- * is likely to boost the disk throughput, whereas, in the second case,
- * is necessary to preserve fairness and latency guarantees
- * (see [1] for details).
+ * First, the expression may be true only for sync queues. Besides, if
+ * bfqq is also being weight-raised, then the expression always evaluates
+ * to true, as device idling is instrumental for preserving low-latency
+ * guarantees (see [1]). Otherwise, the expression evaluates to true only
+ * if bfqq has a non-null idle window and at least one of the following
+ * two conditions holds. The first condition is that the device is not
+ * performing NCQ, because idling the device most certainly boosts the
+ * throughput if this condition holds and bfqq has been granted a non-null
+ * idle window.
+ *
+ * The second condition is that there is no weight-raised busy queue,
+ * which guarantees that the device is not idled for a sync non-weight-
+ * raised queue when there are busy weight-raised queues. The former is
+ * then expired immediately if empty. Combined with the timestamping rules
+ * of BFQ (see [1] for details), this causes sync non-weight-raised queues
+ * to get a lower number of requests served, and hence to ask for a lower
+ * number of requests from the request pool, before the busy weight-raised
+ * queues get served again.
+ *
+ * This is beneficial for the processes associated with weight-raised
+ * queues, when the request pool is saturated (e.g., in the presence of
+ * write hogs). In fact, if the processes associated with the other queues
+ * ask for requests at a lower rate, then weight-raised processes have a
+ * higher probability to get a request from the pool immediately (or at
+ * least soon) when they need one. Hence they have a higher probability to
+ * actually get a fraction of the disk throughput proportional to their
+ * high weight. This is especially true with NCQ-capable drives, which
+ * enqueue several requests in advance and further reorder internally-
+ * queued requests.
+ *
+ * In the end, mistreating non-weight-raised queues when there are busy
+ * weight-raised queues seems to mitigate starvation problems in the
+ * presence of heavy write workloads and NCQ, and hence to guarantee a
+ * higher application and system responsiveness in these hostile scenarios.
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
-	return bfq_bfqq_sync(bfqq) &&
-	       (bfqq->wr_coeff > 1 || bfq_bfqq_idle_window(bfqq));
+	struct bfq_data *bfqd = bfqq->bfqd;
+/*
+ * Condition for expiring a non-weight-raised queue (and hence not idling
+ * the device).
+ */
+#define cond_for_expiring_non_wr  (bfqd->hw_tag && \
+				   bfqd->wr_busy_queues > 0)
+
+	return bfq_bfqq_sync(bfqq) && (
+		bfqq->wr_coeff > 1 ||
+		(bfq_bfqq_idle_window(bfqq) &&
+		 !cond_for_expiring_non_wr)
+	);
 }
 
 /*
@@ -2556,6 +2598,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 					      * high-definition compressed
 					      * video.
 					      */
+	bfqd->wr_busy_queues = 0;
 
 	/*
 	 * Begin by assuming, optimistically, that the device peak rate is
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index f6491d5..73f453b 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -975,6 +975,9 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	bfqd->busy_queues--;
 
 	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
+
+	if (bfqq->wr_coeff > 1)
+		bfqd->wr_busy_queues--;
 }
 
 /*
@@ -988,4 +991,7 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues++;
+
+	if (bfqq->wr_coeff > 1)
+		bfqd->wr_busy_queues++;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 3b5763a7..7d6e4cb 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -280,6 +280,7 @@ enum bfq_device_speed {
  * @root_group: root bfq_group for the device.
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
+ * @wr_busy_queues: number of weight-raised busy @bfq_queues.
  * @queued: number of queued requests.
  * @rq_in_driver: number of requests dispatched and waiting for completion.
  * @sync_flight: number of sync requests in the driver.
@@ -345,6 +346,7 @@ struct bfq_data {
 	struct bfq_group *root_group;
 
 	int busy_queues;
+	int wr_busy_queues;
 	int queued;
 	int rq_in_driver;
 	int sync_flight;
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 11/14] block, bfq: reduce latency during request-pool saturation
@ 2014-05-27 12:42     ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Paolo Valente <paolo.valente@unimore.it>

This patch introduces a heuristic that reduces latency when the
I/O-request pool is saturated. This goal is achieved by disabling
device idling, for non-weight-raised queues, when there are weight-
raised queues with pending or in-flight requests. In fact, as
explained in more detail in the comment to the function
bfq_bfqq_must_not_expire(), this reduces the rate at which processes
associated with non-weight-raised queues grab requests from the pool,
thereby increasing the probability that processes associated with
weight-raised queues get a request immediately (or at least soon) when
they need one.

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 67 +++++++++++++++++++++++++++++++++++++++++++----------
 block/bfq-sched.c   |  6 +++++
 block/bfq.h         |  2 ++
 3 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0b24130..5988c70 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -511,6 +511,7 @@ add_bfqq_busy:
 			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
 			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
 
+			bfqd->wr_busy_queues++;
 			entity->ioprio_changed = 1;
 			bfq_log_bfqq(bfqd, bfqq,
 			    "non-idle wrais starting at %lu, rais_max_time %u",
@@ -655,6 +656,8 @@ static void bfq_merged_requests(struct request_queue *q, struct request *rq,
 /* Must be called with bfqq != NULL */
 static inline void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
 {
+	if (bfq_bfqq_busy(bfqq))
+		bfqq->bfqd->wr_busy_queues--;
 	bfqq->wr_coeff = 1;
 	bfqq->wr_cur_max_time = 0;
 	/* Trigger a weight change on the next activation of the queue */
@@ -1401,22 +1404,61 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
 /*
  * Device idling is allowed only for the queues for which this function
  * returns true. For this reason, the return value of this function plays a
- * critical role for both throughput boosting and service guarantees.
+ * critical role for both throughput boosting and service guarantees. The
+ * return value is computed through a logical expression. In this rather
+ * long comment, we try to briefly describe all the details and motivations
+ * behind the components of this logical expression.
  *
- * The return value is computed through a logical expression, which may
- * be true only if bfqq is sync and at least one of the following two
- * conditions holds:
- * - the queue has a non-null idle window;
- * - the queue is being weight-raised.
- * In fact, waiting for a new request for the queue, in the first case,
- * is likely to boost the disk throughput, whereas, in the second case,
- * is necessary to preserve fairness and latency guarantees
- * (see [1] for details).
+ * First, the expression may be true only for sync queues. Besides, if
+ * bfqq is also being weight-raised, then the expression always evaluates
+ * to true, as device idling is instrumental for preserving low-latency
+ * guarantees (see [1]). Otherwise, the expression evaluates to true only
+ * if bfqq has a non-null idle window and at least one of the following
+ * two conditions holds. The first condition is that the device is not
+ * performing NCQ, because idling the device most certainly boosts the
+ * throughput if this condition holds and bfqq has been granted a non-null
+ * idle window.
+ *
+ * The second condition is that there is no weight-raised busy queue,
+ * which guarantees that the device is not idled for a sync non-weight-
+ * raised queue when there are busy weight-raised queues. The former is
+ * then expired immediately if empty. Combined with the timestamping rules
+ * of BFQ (see [1] for details), this causes sync non-weight-raised queues
+ * to get a lower number of requests served, and hence to ask for a lower
+ * number of requests from the request pool, before the busy weight-raised
+ * queues get served again.
+ *
+ * This is beneficial for the processes associated with weight-raised
+ * queues, when the request pool is saturated (e.g., in the presence of
+ * write hogs). In fact, if the processes associated with the other queues
+ * ask for requests at a lower rate, then weight-raised processes have a
+ * higher probability to get a request from the pool immediately (or at
+ * least soon) when they need one. Hence they have a higher probability to
+ * actually get a fraction of the disk throughput proportional to their
+ * high weight. This is especially true with NCQ-capable drives, which
+ * enqueue several requests in advance and further reorder internally-
+ * queued requests.
+ *
+ * In the end, mistreating non-weight-raised queues when there are busy
+ * weight-raised queues seems to mitigate starvation problems in the
+ * presence of heavy write workloads and NCQ, and hence to guarantee a
+ * higher application and system responsiveness in these hostile scenarios.
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
-	return bfq_bfqq_sync(bfqq) &&
-	       (bfqq->wr_coeff > 1 || bfq_bfqq_idle_window(bfqq));
+	struct bfq_data *bfqd = bfqq->bfqd;
+/*
+ * Condition for expiring a non-weight-raised queue (and hence not idling
+ * the device).
+ */
+#define cond_for_expiring_non_wr  (bfqd->hw_tag && \
+				   bfqd->wr_busy_queues > 0)
+
+	return bfq_bfqq_sync(bfqq) && (
+		bfqq->wr_coeff > 1 ||
+		(bfq_bfqq_idle_window(bfqq) &&
+		 !cond_for_expiring_non_wr)
+	);
 }
 
 /*
@@ -2556,6 +2598,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 					      * high-definition compressed
 					      * video.
 					      */
+	bfqd->wr_busy_queues = 0;
 
 	/*
 	 * Begin by assuming, optimistically, that the device peak rate is
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index f6491d5..73f453b 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -975,6 +975,9 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	bfqd->busy_queues--;
 
 	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
+
+	if (bfqq->wr_coeff > 1)
+		bfqd->wr_busy_queues--;
 }
 
 /*
@@ -988,4 +991,7 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues++;
+
+	if (bfqq->wr_coeff > 1)
+		bfqd->wr_busy_queues++;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 3b5763a7..7d6e4cb 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -280,6 +280,7 @@ enum bfq_device_speed {
  * @root_group: root bfq_group for the device.
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
+ * @wr_busy_queues: number of weight-raised busy @bfq_queues.
  * @queued: number of queued requests.
  * @rq_in_driver: number of requests dispatched and waiting for completion.
  * @sync_flight: number of sync requests in the driver.
@@ -345,6 +346,7 @@ struct bfq_data {
 	struct bfq_group *root_group;
 
 	int busy_queues;
+	int wr_busy_queues;
 	int queued;
 	int rq_in_driver;
 	int sync_flight;
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 12/14] block, bfq: add Early Queue Merge (EQM)
       [not found] ` <1401194558-5283-1-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
                     ` (10 preceding siblings ...)
  2014-05-27 12:42     ` paolo
@ 2014-05-27 12:42   ` paolo
  2014-05-27 12:42     ` paolo
                     ` (3 subsequent siblings)
  15 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Mauro Andreolini,
	Fabio Checconi, Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Paolo Valente

From: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

A set of processes may happen to perform interleaved reads, i.e.,
read requests whose union would give rise to a sequential read pattern.
There are two typical cases: first, processes reading fixed-size chunks
of data at a fixed distance from each other; second, processes reading
variable-size chunks at variable distances. The latter case occurs for
example with QEMU, which splits the I/O generated by a guest into
multiple chunks, and lets these chunks be served by a pool of I/O
threads, iteratively assigning the next chunk of I/O to the first
available thread. CFQ denotes as 'cooperating' a set of processes that
are doing interleaved I/O, and when it detects cooperating processes,
it merges their queues to obtain a sequential I/O pattern from the union
of their I/O requests, and hence boost the throughput.

Unfortunately, in the following frequent case the mechanism
implemented in CFQ for detecting cooperating processes and merging
their queues is not responsive enough to also handle the fluctuating
I/O pattern of the second type of processes. Suppose that one process
of the second type issues a request close to the next request to serve
of another process of the same type. At that time the two processes
can be considered as cooperating. But, if the request issued by the
first process is to be merged with some other already-queued request,
then, from the moment at which this request arrives, to the moment
when CFQ checks whether the two processes are cooperating, the two
processes are likely to already be doing I/O in distant zones of the
disk surface or device memory.

However, CFQ also uses preemption to get a sequential read pattern out
of the read requests performed by the second type of processes. As a
consequence, CFQ uses two different mechanisms to achieve the same
goal: boosting the throughput with interleaved I/O.

This patch introduces Early Queue Merge (EQM), a unified mechanism to
get a sequential read pattern with both types of processes. The main
idea is to immediately check whether a newly-arrived request lets some
pair of processes become cooperating, both in the case of actual
request insertion and, to be responsive with the second type of
processes, in the case of request merge. Both types of processes are
then handled by just merging their queues.

Finally, EQM also preserves low latency, by properly restoring the
weight-raising state of a queue when it gets back to a non-merged
state.
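
As a rough sketch of the detection step (standalone C with an assumed
threshold, not the BFQQ_SEEK_THR value actually used by BFQ):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative threshold, in sectors; not the value used by BFQ. */
#define CLOSE_THR_SECTORS 1024

/*
 * Two queues are candidate cooperators when the position of the newly
 * arrived bio/request lies within CLOSE_THR_SECTORS of the next request
 * of another queue (looked up in an rbtree sorted by request position).
 */
static bool close_requests(uint64_t new_pos, uint64_t next_rq_pos)
{
	uint64_t dist = new_pos >= next_rq_pos ? new_pos - next_rq_pos :
						 next_rq_pos - new_pos;

	return dist <= CLOSE_THR_SECTORS;
}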

Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mauro Andreolini <mauro.andreolini-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
---
 block/bfq-iosched.c | 658 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/bfq.h         |  47 +++-
 2 files changed, 688 insertions(+), 17 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 5988c70..22d4caa 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -203,6 +203,72 @@ static inline void bfq_schedule_dispatch(struct bfq_data *bfqd)
 	}
 }
 
+static struct bfq_queue *
+bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
+		     sector_t sector, struct rb_node **ret_parent,
+		     struct rb_node ***rb_link)
+{
+	struct rb_node **p, *parent;
+	struct bfq_queue *bfqq = NULL;
+
+	parent = NULL;
+	p = &root->rb_node;
+	while (*p) {
+		struct rb_node **n;
+
+		parent = *p;
+		bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+
+		/*
+		 * Sort strictly based on sector. Smallest to the left,
+		 * largest to the right.
+		 */
+		if (sector > blk_rq_pos(bfqq->next_rq))
+			n = &(*p)->rb_right;
+		else if (sector < blk_rq_pos(bfqq->next_rq))
+			n = &(*p)->rb_left;
+		else
+			break;
+		p = n;
+		bfqq = NULL;
+	}
+
+	*ret_parent = parent;
+	if (rb_link)
+		*rb_link = p;
+
+	bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
+		(long long unsigned)sector,
+		bfqq != NULL ? bfqq->pid : 0);
+
+	return bfqq;
+}
+
+static void bfq_rq_pos_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct rb_node **p, *parent;
+	struct bfq_queue *__bfqq;
+
+	if (bfqq->pos_root != NULL) {
+		rb_erase(&bfqq->pos_node, bfqq->pos_root);
+		bfqq->pos_root = NULL;
+	}
+
+	if (bfq_class_idle(bfqq))
+		return;
+	if (!bfqq->next_rq)
+		return;
+
+	bfqq->pos_root = &bfqd->rq_pos_tree;
+	__bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root,
+			blk_rq_pos(bfqq->next_rq), &parent, &p);
+	if (__bfqq == NULL) {
+		rb_link_node(&bfqq->pos_node, parent, p);
+		rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
+	} else
+		bfqq->pos_root = NULL;
+}
+
 /*
  * Lifted from AS - choose which of rq1 and rq2 that is best served now.
  * We choose the request that is closesr to the head right now.  Distance
@@ -380,6 +446,45 @@ static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
 	return dur;
 }
 
+static inline void
+bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+{
+	if (bic->saved_idle_window)
+		bfq_mark_bfqq_idle_window(bfqq);
+	else
+		bfq_clear_bfqq_idle_window(bfqq);
+	if (bic->wr_time_left && bfqq->bfqd->low_latency) {
+		/*
+		 * Start a weight raising period with the duration given by
+		 * the raising_time_left snapshot.
+		 */
+		if (bfq_bfqq_busy(bfqq))
+			bfqq->bfqd->wr_busy_queues++;
+		bfqq->wr_coeff = bfqq->bfqd->bfq_wr_coeff;
+		bfqq->wr_cur_max_time = bic->wr_time_left;
+		bfqq->last_wr_start_finish = jiffies;
+		bfqq->entity.ioprio_changed = 1;
+	}
+	/*
+	 * Clear wr_time_left to prevent bfq_bfqq_save_state() from
+	 * getting confused about the queue's need of a weight-raising
+	 * period.
+	 */
+	bic->wr_time_left = 0;
+}
+
+/*
+ * Must be called with the queue_lock held.
+ */
+static int bfqq_process_refs(struct bfq_queue *bfqq)
+{
+	int process_refs, io_refs;
+
+	io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
+	process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
+	return process_refs;
+}
+
 static void bfq_add_request(struct request *rq)
 {
 	struct bfq_queue *bfqq = RQ_BFQQ(rq);
@@ -402,6 +507,12 @@ static void bfq_add_request(struct request *rq)
 	next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
 	bfqq->next_rq = next_rq;
 
+	/*
+	 * Adjust priority tree position, if next_rq changes.
+	 */
+	if (prev != bfqq->next_rq)
+		bfq_rq_pos_tree_add(bfqd, bfqq);
+
 	if (!bfq_bfqq_busy(bfqq)) {
 		int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
 			time_is_before_jiffies(bfqq->soft_rt_next_start);
@@ -414,11 +525,20 @@ static void bfq_add_request(struct request *rq)
 		if (!bfqd->low_latency)
 			goto add_bfqq_busy;
 
+		if (bfq_bfqq_just_split(bfqq))
+			goto set_ioprio_changed;
+
 		/*
-		 * If the queue is not being boosted and has been idle for
-		 * enough time, start a weight-raising period.
+		 * If the queue:
+		 * - is not being boosted,
+		 * - has been idle for enough time,
+		 * - is not a sync queue or is linked to a bfq_io_cq (it is
+		 *   shared "for its nature" or it is not shared and its
+		 *   requests have not been redirected to a shared queue)
+		 * start a weight-raising period.
 		 */
-		if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
+		if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt) &&
+		    (!bfq_bfqq_sync(bfqq) || bfqq->bic != NULL)) {
 			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
 			if (idle_for_long_time)
 				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
@@ -496,6 +616,7 @@ static void bfq_add_request(struct request *rq)
 					bfqd->bfq_wr_rt_max_time;
 			}
 		}
+set_ioprio_changed:
 		if (old_wr_coeff != bfqq->wr_coeff)
 			entity->ioprio_changed = 1;
 add_bfqq_busy:
@@ -583,6 +704,13 @@ static void bfq_remove_request(struct request *rq)
 	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
 		if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue)
 			bfq_del_bfqq_busy(bfqd, bfqq, 1);
+		/*
+		 * Remove queue from request-position tree as it is empty.
+		 */
+		if (bfqq->pos_root != NULL) {
+			rb_erase(&bfqq->pos_node, bfqq->pos_root);
+			bfqq->pos_root = NULL;
+		}
 	}
 
 	if (rq->cmd_flags & REQ_META)
@@ -625,11 +753,14 @@ static void bfq_merged_request(struct request_queue *q, struct request *req,
 					 bfqd->last_position);
 		bfqq->next_rq = next_rq;
 		/*
-		 * If next_rq changes, update the queue's budget to fit
-		 * the new request.
+		 * If next_rq changes, update both the queue's budget to
+		 * fit the new request and the queue's position in its
+		 * rq_pos_tree.
 		 */
-		if (prev != bfqq->next_rq)
+		if (prev != bfqq->next_rq) {
 			bfq_updated_next_req(bfqd, bfqq);
+			bfq_rq_pos_tree_add(bfqd, bfqq);
+		}
 	}
 }
 
@@ -692,12 +823,339 @@ static void bfq_end_wr(struct bfq_data *bfqd)
 	spin_unlock_irq(bfqd->queue->queue_lock);
 }
 
+static inline sector_t bfq_io_struct_pos(void *io_struct, bool request)
+{
+	if (request)
+		return blk_rq_pos(io_struct);
+	else
+		return ((struct bio *)io_struct)->bi_iter.bi_sector;
+}
+
+static inline sector_t bfq_dist_from(sector_t pos1,
+				     sector_t pos2)
+{
+	if (pos1 >= pos2)
+		return pos1 - pos2;
+	else
+		return pos2 - pos1;
+}
+
+static inline int bfq_rq_close_to_sector(void *io_struct, bool request,
+					 sector_t sector)
+{
+	return bfq_dist_from(bfq_io_struct_pos(io_struct, request), sector) <=
+	       BFQQ_SEEK_THR;
+}
+
+static struct bfq_queue *bfqq_close(struct bfq_data *bfqd, sector_t sector)
+{
+	struct rb_root *root = &bfqd->rq_pos_tree;
+	struct rb_node *parent, *node;
+	struct bfq_queue *__bfqq;
+
+	if (RB_EMPTY_ROOT(root))
+		return NULL;
+
+	/*
+	 * First, if we find a request starting at the end of the last
+	 * request, choose it.
+	 */
+	__bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL);
+	if (__bfqq != NULL)
+		return __bfqq;
+
+	/*
+	 * If the exact sector wasn't found, the parent of the NULL leaf
+	 * will contain the closest sector (rq_pos_tree sorted by
+	 * next_request position).
+	 */
+	__bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+	if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+		return __bfqq;
+
+	if (blk_rq_pos(__bfqq->next_rq) < sector)
+		node = rb_next(&__bfqq->pos_node);
+	else
+		node = rb_prev(&__bfqq->pos_node);
+	if (node == NULL)
+		return NULL;
+
+	__bfqq = rb_entry(node, struct bfq_queue, pos_node);
+	if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+		return __bfqq;
+
+	return NULL;
+}
+
+/*
+ * bfqd - obvious
+ * cur_bfqq - passed in so that we don't decide that the current queue
+ *            is closely cooperating with itself
+ * sector - used as a reference point to search for a close queue
+ */
+static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+					      struct bfq_queue *cur_bfqq,
+					      sector_t sector)
+{
+	struct bfq_queue *bfqq;
+
+	if (bfq_class_idle(cur_bfqq))
+		return NULL;
+	if (!bfq_bfqq_sync(cur_bfqq))
+		return NULL;
+	if (BFQQ_SEEKY(cur_bfqq))
+		return NULL;
+
+	/* If device has only one backlogged bfq_queue, don't search. */
+	if (bfqd->busy_queues == 1)
+		return NULL;
+
+	/*
+	 * We should notice if some of the queues are cooperating, e.g.
+	 * working closely on the same area of the disk. In that case,
+	 * we can group them together and don't waste time idling.
+	 */
+	bfqq = bfqq_close(bfqd, sector);
+	if (bfqq == NULL || bfqq == cur_bfqq)
+		return NULL;
+
+	/*
+	 * Do not merge queues from different bfq_groups.
+	*/
+	if (bfqq->entity.parent != cur_bfqq->entity.parent)
+		return NULL;
+
+	/*
+	 * It only makes sense to merge sync queues.
+	 */
+	if (!bfq_bfqq_sync(bfqq))
+		return NULL;
+	if (BFQQ_SEEKY(bfqq))
+		return NULL;
+
+	/*
+	 * Do not merge queues of different priority classes.
+	 */
+	if (bfq_class_rt(bfqq) != bfq_class_rt(cur_bfqq))
+		return NULL;
+
+	return bfqq;
+}
+
+static struct bfq_queue *
+bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+{
+	int process_refs, new_process_refs;
+	struct bfq_queue *__bfqq;
+
+	/*
+	 * If there are no process references on the new_bfqq, then it is
+	 * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
+	 * may have dropped their last reference (not just their last process
+	 * reference).
+	 */
+	if (!bfqq_process_refs(new_bfqq))
+		return NULL;
+
+	/* Avoid a circular list and skip interim queue merges. */
+	while ((__bfqq = new_bfqq->new_bfqq)) {
+		if (__bfqq == bfqq)
+			return NULL;
+		new_bfqq = __bfqq;
+	}
+
+	process_refs = bfqq_process_refs(bfqq);
+	new_process_refs = bfqq_process_refs(new_bfqq);
+	/*
+	 * If the process for the bfqq has gone away, there is no
+	 * sense in merging the queues.
+	 */
+	if (process_refs == 0 || new_process_refs == 0)
+		return NULL;
+
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
+		new_bfqq->pid);
+
+	/*
+	 * Merging is just a redirection: the requests of the process
+	 * owning one of the two queues are redirected to the other queue.
+	 * The latter queue, in its turn, is set as shared if this is the
+	 * first time that the requests of some process are redirected to
+	 * it.
+	 *
+	 * We redirect bfqq to new_bfqq and not the opposite, because we
+	 * are in the context of the process owning bfqq, hence we have
+	 * the io_cq of this process. So we can immediately configure this
+	 * io_cq to redirect the requests of the process to new_bfqq.
+	 *
+	 * NOTE, even if new_bfqq coincides with the in-service queue, the
+	 * io_cq of new_bfqq is not available, because, if the in-service
+	 * queue is shared, bfqd->in_service_bic may not point to the
+	 * io_cq of the in-service queue.
+	 * Redirecting the requests of the process owning bfqq to the
+	 * currently in-service queue is in any case the best option, as
+	 * we feed the in-service queue with new requests close to the
+	 * last request served and, by doing so, hopefully increase the
+	 * throughput.
+	 */
+	bfqq->new_bfqq = new_bfqq;
+	atomic_add(process_refs, &new_bfqq->ref);
+	return new_bfqq;
+}
+
+/*
+ * Attempt to schedule a merge of bfqq with the currently in-service queue
+ * or with a close queue among the scheduled queues.
+ * Return NULL if no merge was scheduled, a pointer to the shared bfq_queue
+ * structure otherwise.
+ */
+static struct bfq_queue *
+bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+		     void *io_struct, bool request)
+{
+	struct bfq_queue *in_service_bfqq, *new_bfqq;
+
+	if (bfqq->new_bfqq)
+		return bfqq->new_bfqq;
+
+	if (!io_struct)
+		return NULL;
+
+	in_service_bfqq = bfqd->in_service_queue;
+
+	if (in_service_bfqq == NULL || in_service_bfqq == bfqq ||
+	    !bfqd->in_service_bic)
+		goto check_scheduled;
+
+	if (bfq_class_idle(in_service_bfqq) || bfq_class_idle(bfqq))
+		goto check_scheduled;
+
+	if (bfq_class_rt(in_service_bfqq) != bfq_class_rt(bfqq))
+		goto check_scheduled;
+
+	if (in_service_bfqq->entity.parent != bfqq->entity.parent)
+		goto check_scheduled;
+
+	if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
+	    bfq_bfqq_sync(in_service_bfqq) && bfq_bfqq_sync(bfqq)) {
+		new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
+		if (new_bfqq != NULL)
+			return new_bfqq; /* Merge with in-service queue */
+	}
+
+	/*
+	 * Check whether there is a cooperator among currently scheduled
+	 * queues. The only thing we need is that the bio/request is not
+	 * NULL, as we need it to establish whether a cooperator exists.
+	 */
+check_scheduled:
+	new_bfqq = bfq_close_cooperator(bfqd, bfqq,
+					bfq_io_struct_pos(io_struct, request));
+	if (new_bfqq)
+		return bfq_setup_merge(bfqq, new_bfqq);
+
+	return NULL;
+}
+
+static inline void
+bfq_bfqq_save_state(struct bfq_queue *bfqq)
+{
+	/*
+	 * If bfqq->bic == NULL, the queue is already shared or its requests
+	 * have already been redirected to a shared queue; both idle window
+	 * and weight raising state have already been saved. Do nothing.
+	 */
+	if (bfqq->bic == NULL)
+		return;
+	if (bfqq->bic->wr_time_left)
+		/*
+		 * This is the queue of a just-started process, and would
+		 * deserve weight raising: we set wr_time_left to the full
+		 * weight-raising duration to trigger weight-raising when
+		 * and if the queue is split and the first request of the
+		 * queue is enqueued.
+		 */
+		bfqq->bic->wr_time_left = bfq_wr_duration(bfqq->bfqd);
+	else if (bfqq->wr_coeff > 1) {
+		unsigned long wr_duration =
+			jiffies - bfqq->last_wr_start_finish;
+		/*
+		 * It may happen that a queue's weight raising period lasts
+		 * longer than its wr_cur_max_time, as weight raising is
+		 * handled only when a request is enqueued or dispatched (it
+		 * does not use any timer). If the weight raising period is
+		 * about to end, don't save it.
+		 */
+		if (bfqq->wr_cur_max_time <= wr_duration)
+			bfqq->bic->wr_time_left = 0;
+		else
+			bfqq->bic->wr_time_left =
+				bfqq->wr_cur_max_time - wr_duration;
+		/*
+		 * The bfq_queue is becoming shared or the requests of the
+		 * process owning the queue are being redirected to a shared
+		 * queue. Stop the weight raising period of the queue, as in
+		 * both cases it should not be owned by an interactive or
+		 * soft real-time application.
+		 */
+		bfq_bfqq_end_wr(bfqq);
+	} else
+		bfqq->bic->wr_time_left = 0;
+	bfqq->bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
+}
+
+static inline void
+bfq_get_bic_reference(struct bfq_queue *bfqq)
+{
+	/*
+	 * If bfqq->bic has a non-NULL value, the bic to which it belongs
+	 * is about to begin using a shared bfq_queue.
+	 */
+	if (bfqq->bic)
+		atomic_long_inc(&bfqq->bic->icq.ioc->refcount);
+}
+
+static void
+bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
+		struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+{
+	bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
+		(long unsigned)new_bfqq->pid);
+	/* Save weight raising and idle window of the merged queues */
+	bfq_bfqq_save_state(bfqq);
+	bfq_bfqq_save_state(new_bfqq);
+	/*
+	 * Grab a reference to the bic, to prevent it from being destroyed
+	 * before being possibly touched by a bfq_split_bfqq().
+	 */
+	bfq_get_bic_reference(bfqq);
+	bfq_get_bic_reference(new_bfqq);
+	/*
+	 * Merge queues (that is, let bic redirect its requests to new_bfqq)
+	 */
+	bic_set_bfqq(bic, new_bfqq, 1);
+	bfq_mark_bfqq_coop(new_bfqq);
+	/*
+	 * new_bfqq now belongs to at least two bics (it is a shared queue):
+	 * set new_bfqq->bic to NULL. bfqq either:
+	 * - does not belong to any bic any more, and hence bfqq->bic must
+	 *   be set to NULL, or
+	 * - is a queue whose owning bics have already been redirected to a
+	 *   different queue, hence the queue is destined to not belong to
+	 *   any bic soon and bfqq->bic is already NULL (therefore the next
+	 *   assignment causes no harm).
+	 */
+	new_bfqq->bic = NULL;
+	bfqq->bic = NULL;
+	bfq_put_queue(bfqq);
+}
+
 static int bfq_allow_merge(struct request_queue *q, struct request *rq,
 			   struct bio *bio)
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
 	struct bfq_io_cq *bic;
-	struct bfq_queue *bfqq;
+	struct bfq_queue *bfqq, *new_bfqq;
 
 	/*
 	 * Disallow merge of a sync bio into an async request.
@@ -715,6 +1173,23 @@ static int bfq_allow_merge(struct request_queue *q, struct request *rq,
 		return 0;
 
 	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	/*
+	 * We take advantage of this function to perform an early merge
+	 * of the queues of possible cooperating processes.
+	 */
+	if (bfqq != NULL) {
+		new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false);
+		if (new_bfqq != NULL) {
+			bfq_merge_bfqqs(bfqd, bic, bfqq, new_bfqq);
+			/*
+			 * If we get here, the bio will be queued in the
+			 * shared queue, i.e., new_bfqq, so use new_bfqq
+			 * to decide whether bio and rq can be merged.
+			 */
+			bfqq = new_bfqq;
+		}
+	}
+
 	return bfqq == RQ_BFQQ(rq);
 }
 
@@ -898,6 +1373,15 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	__bfq_bfqd_reset_in_service(bfqd);
 
+	/*
+	 * If this bfqq is shared between multiple processes, check
+	 * to make sure that those processes are still issuing I/Os
+	 * within the mean seek distance. If not, it may be time to
+	 * break the queues apart again.
+	 */
+	if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
+		bfq_mark_bfqq_split_coop(bfqq);
+
 	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
 		/*
 		 * Overloading budget_timeout field to store the time
@@ -906,8 +1390,13 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 		 */
 		bfqq->budget_timeout = jiffies;
 		bfq_del_bfqq_busy(bfqd, bfqq, 1);
-	} else
+	} else {
 		bfq_activate_bfqq(bfqd, bfqq);
+		/*
+		 * Resort priority tree of potential close cooperators.
+		 */
+		bfq_rq_pos_tree_add(bfqd, bfqq);
+	}
 }
 
 /**
@@ -1773,6 +2262,25 @@ static void bfq_put_queue(struct bfq_queue *bfqq)
 	kmem_cache_free(bfq_pool, bfqq);
 }
 
+static void bfq_put_cooperator(struct bfq_queue *bfqq)
+{
+	struct bfq_queue *__bfqq, *next;
+
+	/*
+	 * If this queue was scheduled to merge with another queue, be
+	 * sure to drop the reference taken on that queue (and others in
+	 * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs.
+	 */
+	__bfqq = bfqq->new_bfqq;
+	while (__bfqq) {
+		if (__bfqq == bfqq)
+			break;
+		next = __bfqq->new_bfqq;
+		bfq_put_queue(__bfqq);
+		__bfqq = next;
+	}
+}
+
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	if (bfqq == bfqd->in_service_queue) {
@@ -1783,12 +2291,35 @@ static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq,
 		     atomic_read(&bfqq->ref));
 
+	bfq_put_cooperator(bfqq);
+
 	bfq_put_queue(bfqq);
 }
 
 static inline void bfq_init_icq(struct io_cq *icq)
 {
-	icq_to_bic(icq)->ttime.last_end_request = jiffies;
+	struct bfq_io_cq *bic = icq_to_bic(icq);
+
+	bic->ttime.last_end_request = jiffies;
+	/*
+	 * A newly created bic indicates that the process has just
+	 * started doing I/O, and is probably mapping into memory its
+	 * executable and libraries: it definitely needs weight raising.
+	 * There is however the possibility that the process performs,
+	 * for a while, I/O close to some other process. EQM intercepts
+	 * this behavior and may merge the queue corresponding to the
+	 * process  with some other queue, BEFORE the weight of the queue
+	 * is raised. Merged queues are not weight-raised (they are assumed
+	 * to belong to processes that benefit only from high throughput).
+	 * If the merge is basically the consequence of an accident, then
+	 * the queue will be split soon and will get back its old weight.
+	 * It is then important to write down somewhere that this queue
+	 * does need weight raising, even if it did not make it to get its
+	 * weight raised before being merged. To this purpose, we overload
+	 * the field raising_time_left and assign 1 to it, to mark the queue
+	 * as needing weight raising.
+	 */
+	bic->wr_time_left = 1;
 }
 
 static void bfq_exit_icq(struct io_cq *icq)
@@ -1802,6 +2333,13 @@ static void bfq_exit_icq(struct io_cq *icq)
 	}
 
 	if (bic->bfqq[BLK_RW_SYNC]) {
+		/*
+		 * If the bic is using a shared queue, put the reference
+		 * taken on the io_context when the bic started using a
+		 * shared bfq_queue.
+		 */
+		if (bfq_bfqq_coop(bic->bfqq[BLK_RW_SYNC]))
+			put_io_context(icq->ioc);
 		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
 		bic->bfqq[BLK_RW_SYNC] = NULL;
 	}
@@ -2089,6 +2627,10 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 	if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
 		return;
 
+	/* Idle window just restored, statistics are meaningless. */
+	if (bfq_bfqq_just_split(bfqq))
+		return;
+
 	enable_idle = bfq_bfqq_idle_window(bfqq);
 
 	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
@@ -2131,6 +2673,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
 	    !BFQQ_SEEKY(bfqq))
 		bfq_update_idle_window(bfqd, bfqq, bic);
+	bfq_clear_bfqq_just_split(bfqq);
 
 	bfq_log_bfqq(bfqd, bfqq,
 		     "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
@@ -2191,14 +2734,48 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 static void bfq_insert_request(struct request_queue *q, struct request *rq)
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
-	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_queue *bfqq = RQ_BFQQ(rq), *new_bfqq;
 
 	assert_spin_locked(bfqd->queue->queue_lock);
 
+	/*
+	 * An unplug may trigger a requeue of a request from the device
+	 * driver: make sure we are in process context while trying to
+	 * merge two bfq_queues.
+	 */
+	if (!in_interrupt()) {
+		new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true);
+		if (new_bfqq != NULL) {
+			if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq)
+				new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1);
+			/*
+			 * Release the request's reference to the old bfqq
+			 * and make sure one is taken to the shared queue.
+			 */
+			new_bfqq->allocated[rq_data_dir(rq)]++;
+			bfqq->allocated[rq_data_dir(rq)]--;
+			atomic_inc(&new_bfqq->ref);
+			bfq_put_queue(bfqq);
+			if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq)
+				bfq_merge_bfqqs(bfqd, RQ_BIC(rq),
+						bfqq, new_bfqq);
+			rq->elv.priv[1] = new_bfqq;
+			bfqq = new_bfqq;
+		}
+	}
+
 	bfq_init_prio_data(bfqq, RQ_BIC(rq));
 
 	bfq_add_request(rq);
 
+	/*
+	 * Here a newly-created bfq_queue has already started a weight-raising
+	 * period: clear raising_time_left to prevent bfq_bfqq_save_state()
+	 * from assigning it a full weight-raising period. See the detailed
+	 * comments about this field in bfq_init_icq().
+	 */
+	if (bfqq->bic != NULL)
+		bfqq->bic->wr_time_left = 0;
 	rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
 	list_add_tail(&rq->queuelist, &bfqq->fifo);
 
@@ -2347,6 +2924,32 @@ static void bfq_put_request(struct request *rq)
 }
 
 /*
+ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
+ * was the last process referring to said bfqq.
+ */
+static struct bfq_queue *
+bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
+
+	put_io_context(bic->icq.ioc);
+
+	if (bfqq_process_refs(bfqq) == 1) {
+		bfqq->pid = current->pid;
+		bfq_clear_bfqq_coop(bfqq);
+		bfq_clear_bfqq_split_coop(bfqq);
+		return bfqq;
+	}
+
+	bic_set_bfqq(bic, NULL, 1);
+
+	bfq_put_cooperator(bfqq);
+
+	bfq_put_queue(bfqq);
+	return NULL;
+}
+
+/*
  * Allocate bfq data structures associated with this request.
  */
 static int bfq_set_request(struct request_queue *q, struct request *rq,
@@ -2359,6 +2962,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	struct bfq_queue *bfqq;
 	struct bfq_group *bfqg;
 	unsigned long flags;
+	bool split = false;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
@@ -2371,10 +2975,20 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 
 	bfqg = bfq_bic_update_cgroup(bic);
 
+new_queue:
 	bfqq = bic_to_bfqq(bic, is_sync);
 	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
 		bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
 		bic_set_bfqq(bic, bfqq, is_sync);
+	} else {
+		/* If the queue was seeky for too long, break it apart. */
+		if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
+			bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
+			bfqq = bfq_split_bfqq(bic, bfqq);
+			split = true;
+			if (!bfqq)
+				goto new_queue;
+		}
 	}
 
 	bfqq->allocated[rw]++;
@@ -2385,6 +2999,26 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	rq->elv.priv[0] = bic;
 	rq->elv.priv[1] = bfqq;
 
+	/*
+	 * If a bfq_queue has only one process reference, it is owned
+	 * by only one bfq_io_cq: we can set the bic field of the
+	 * bfq_queue to the address of that structure. Also, if the
+	 * queue has just been split, mark a flag so that the
+	 * information is available to the other scheduler hooks.
+	 */
+	if (bfqq_process_refs(bfqq) == 1) {
+		bfqq->bic = bic;
+		if (split) {
+			bfq_mark_bfqq_just_split(bfqq);
+			/*
+			 * If the queue has just been split from a shared
+			 * queue, restore the idle window and the possible
+			 * weight raising period.
+			 */
+			bfq_bfqq_resume_state(bfqq, bic);
+		}
+	}
+
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	return 0;
@@ -2565,6 +3199,8 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
 
+	bfqd->rq_pos_tree = RB_ROOT;
+
 	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
 
 	INIT_LIST_HEAD(&bfqd->active_list);
@@ -2918,7 +3554,7 @@ static int __init bfq_init(void)
 	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v2");
+	pr_info("BFQ I/O-scheduler version: v6");
 
 	return 0;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 7d6e4cb..bda1ecb3 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v2 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v6 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@@ -164,6 +164,10 @@ struct bfq_group;
  * struct bfq_queue - leaf schedulable entity.
  * @ref: reference counter.
  * @bfqd: parent bfq_data.
+ * @new_bfqq: shared bfq_queue if queue is cooperating with
+ *           one or more other queues.
+ * @pos_node: request-position tree member (see bfq_data's @rq_pos_tree).
+ * @pos_root: request-position tree root (see bfq_data's @rq_pos_tree).
  * @sort_list: sorted list of pending requests.
  * @next_rq: if fifo isn't expired, next request to serve.
  * @queued: nr of requests queued in @sort_list.
@@ -196,18 +200,26 @@ struct bfq_group;
  * @service_from_backlogged: cumulative service received from the @bfq_queue
  *                           since the last transition from idle to
  *                           backlogged
+ * @bic: pointer to the bfq_io_cq owning the bfq_queue, set to %NULL if the
+ *	 queue is shared
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async. @cgroup holds a reference to the
- * cgroup, to be sure that it does not disappear while a bfqq still
- * references it (mostly to avoid races between request issuing and task
- * migration followed by cgroup destruction). All the fields are protected
- * by the queue lock of the containing bfqd.
+ * io_context or more, if it  is  async or shared  between  cooperating
+ * processes. @cgroup holds a reference to the cgroup, to be sure that it
+ * does not disappear while a bfqq still references it (mostly to avoid
+ * races between request issuing and task migration followed by cgroup
+ * destruction).
+ * All the fields are protected by the queue lock of the containing bfqd.
  */
 struct bfq_queue {
 	atomic_t ref;
 	struct bfq_data *bfqd;
 
+	/* fields for cooperating queues handling */
+	struct bfq_queue *new_bfqq;
+	struct rb_node pos_node;
+	struct rb_root *pos_root;
+
 	struct rb_root sort_list;
 	struct request *next_rq;
 	int queued[2];
@@ -232,6 +244,7 @@ struct bfq_queue {
 	sector_t last_request_pos;
 
 	pid_t pid;
+	struct bfq_io_cq *bic;
 
 	/* weight-raising fields */
 	unsigned long wr_cur_max_time;
@@ -261,12 +274,24 @@ struct bfq_ttime {
  * @icq: associated io_cq structure
  * @bfqq: array of two process queues, the sync and the async
  * @ttime: associated @bfq_ttime struct
+ * @wr_time_left: snapshot of the time left before weight raising ends
+ *                for the sync queue associated to this process; this
+ *		  snapshot is taken to remember this value while the weight
+ *		  raising is suspended because the queue is merged with a
+ *		  shared queue, and is used to set @raising_cur_max_time
+ *		  when the queue is split from the shared queue and its
+ *		  weight is raised again
+ * @saved_idle_window: same purpose as the previous field for the idle
+ *                     window
  */
 struct bfq_io_cq {
 	struct io_cq icq; /* must be the first member */
 	struct bfq_queue *bfqq[2];
 	struct bfq_ttime ttime;
 	int ioprio;
+
+	unsigned int wr_time_left;
+	unsigned int saved_idle_window;
 };
 
 enum bfq_device_speed {
@@ -278,6 +303,9 @@ enum bfq_device_speed {
  * struct bfq_data - per device data structure.
  * @queue: request queue for the managed device.
  * @root_group: root bfq_group for the device.
+ * @rq_pos_tree: rbtree sorted by next_request position, used when
+ *               determining if two or more queues have interleaving
+ *               requests (see bfq_close_cooperator()).
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @wr_busy_queues: number of weight-raised busy @bfq_queues.
@@ -344,6 +372,7 @@ struct bfq_data {
 	struct request_queue *queue;
 
 	struct bfq_group *root_group;
+	struct rb_root rq_pos_tree;
 
 	int busy_queues;
 	int wr_busy_queues;
@@ -418,6 +447,9 @@ enum bfqq_state_flags {
 					 * may need softrt-next-start
 					 * update
 					 */
+	BFQ_BFQQ_FLAG_coop,		/* bfqq is shared */
+	BFQ_BFQQ_FLAG_split_coop,	/* shared bfqq will be split */
+	BFQ_BFQQ_FLAG_just_split,	/* queue has just been split */
 };
 
 #define BFQ_BFQQ_FNS(name)						\
@@ -443,6 +475,9 @@ BFQ_BFQQ_FNS(prio_changed);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(budget_new);
 BFQ_BFQQ_FNS(constantly_seeky);
+BFQ_BFQQ_FNS(coop);
+BFQ_BFQQ_FNS(split_coop);
+BFQ_BFQQ_FNS(just_split);
 BFQ_BFQQ_FNS(softrt_update);
 #undef BFQ_BFQQ_FNS
 
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 12/14] block, bfq: add Early Queue Merge (EQM)
  2014-05-27 12:42 ` paolo
  (?)
@ 2014-05-27 12:42 ` paolo
  -1 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Mauro Andreolini, Paolo Valente

From: Arianna Avanzini <avanzini.arianna@gmail.com>

A set of processes may happen to perform interleaved reads, i.e.,
read requests whose union would give rise to a sequential read pattern.
There are two typical cases: first, processes reading fixed-size chunks
of data at a fixed distance from each other; second, processes reading
variable-size chunks at variable distances. The latter case occurs for
example with QEMU, which splits the I/O generated by a guest into
multiple chunks, and lets these chunks be served by a pool of I/O
threads, iteratively assigning the next chunk of I/O to the first
available thread. CFQ denotes as 'cooperating' a set of processes that
are doing interleaved I/O, and when it detects cooperating processes,
it merges their queues to obtain a sequential I/O pattern from the union
of their I/O requests, and hence boost the throughput.

Unfortunately, in the following frequent case the mechanism
implemented in CFQ to detect cooperating processes and merge their
queues is not responsive enough to also handle the fluctuating I/O
pattern of the second type of processes. Suppose that one process of
the second type issues a request close to the next request to serve
of another process of the same type. At that moment the two processes
can be considered to be cooperating. However, if the request issued by
the first process happens to be merged with some other already-queued
request, then, from the moment at which this request arrives to the
moment when CFQ checks whether the two processes are cooperating, the
two processes are likely to be already doing I/O in distant zones of
the disk surface or device memory.

CFQ does, however, use preemption to obtain a sequential read pattern
from the read requests performed by the second type of processes as
well. As a consequence, CFQ ends up using two different mechanisms to
achieve the same goal: boosting the throughput with interleaved I/O.

This patch introduces Early Queue Merge (EQM), a unified mechanism to
get a sequential read pattern with both types of processes. The main
idea is to immediately check whether a newly-arrived request lets some
pair of processes become cooperating, both in the case of actual
request insertion and, to be responsive with the second type of
processes, in the case of request merge. Both types of processes are
then handled by just merging their queues.

Finally, EQM also preserves low latency by properly restoring the
weight-raising state of a queue when it returns to a non-merged
state.
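
As a concrete, if rough, illustration of the closeness test that queue
merging relies on, here is a minimal user-space sketch. It is an
editorial illustration, not the kernel code: the threshold value is
made up (BFQ uses its own BFQQ_SEEK_THR), and the helper names are
invented for the example.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;

/* Illustrative seek threshold, in sectors; not the kernel's value. */
#define SEEK_THR ((sector_t)(8 * 1024))

static sector_t seek_dist(sector_t pos1, sector_t pos2)
{
	return pos1 >= pos2 ? pos1 - pos2 : pos2 - pos1;
}

/*
 * A newly-arrived I/O and the next request of another queue are merge
 * candidates if their sector distance is within the seek threshold.
 */
static bool close_enough(sector_t new_io_sector, sector_t other_next_rq)
{
	return seek_dist(new_io_sector, other_next_rq) <= SEEK_THR;
}

int main(void)
{
	/* Two I/O threads (as in the QEMU case) reading adjacent chunks. */
	sector_t next_rq_of_other_queue = 1000000;
	sector_t newly_arrived_io = 1000256;

	printf("merge candidates: %s\n",
	       close_enough(newly_arrived_io, next_rq_of_other_queue) ?
	       "yes" : "no");
	return 0;
}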

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
---
 block/bfq-iosched.c | 658 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/bfq.h         |  47 +++-
 2 files changed, 688 insertions(+), 17 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 5988c70..22d4caa 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -203,6 +203,72 @@ static inline void bfq_schedule_dispatch(struct bfq_data *bfqd)
 	}
 }
 
+static struct bfq_queue *
+bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
+		     sector_t sector, struct rb_node **ret_parent,
+		     struct rb_node ***rb_link)
+{
+	struct rb_node **p, *parent;
+	struct bfq_queue *bfqq = NULL;
+
+	parent = NULL;
+	p = &root->rb_node;
+	while (*p) {
+		struct rb_node **n;
+
+		parent = *p;
+		bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+
+		/*
+		 * Sort strictly based on sector. Smallest to the left,
+		 * largest to the right.
+		 */
+		if (sector > blk_rq_pos(bfqq->next_rq))
+			n = &(*p)->rb_right;
+		else if (sector < blk_rq_pos(bfqq->next_rq))
+			n = &(*p)->rb_left;
+		else
+			break;
+		p = n;
+		bfqq = NULL;
+	}
+
+	*ret_parent = parent;
+	if (rb_link)
+		*rb_link = p;
+
+	bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
+		(long long unsigned)sector,
+		bfqq != NULL ? bfqq->pid : 0);
+
+	return bfqq;
+}
+
+static void bfq_rq_pos_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct rb_node **p, *parent;
+	struct bfq_queue *__bfqq;
+
+	if (bfqq->pos_root != NULL) {
+		rb_erase(&bfqq->pos_node, bfqq->pos_root);
+		bfqq->pos_root = NULL;
+	}
+
+	if (bfq_class_idle(bfqq))
+		return;
+	if (!bfqq->next_rq)
+		return;
+
+	bfqq->pos_root = &bfqd->rq_pos_tree;
+	__bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root,
+			blk_rq_pos(bfqq->next_rq), &parent, &p);
+	if (__bfqq == NULL) {
+		rb_link_node(&bfqq->pos_node, parent, p);
+		rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
+	} else
+		bfqq->pos_root = NULL;
+}
+
 /*
  * Lifted from AS - choose which of rq1 and rq2 is best served now.
  * We choose the request that is closest to the head right now.  Distance
@@ -380,6 +446,45 @@ static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
 	return dur;
 }
 
+static inline void
+bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+{
+	if (bic->saved_idle_window)
+		bfq_mark_bfqq_idle_window(bfqq);
+	else
+		bfq_clear_bfqq_idle_window(bfqq);
+	if (bic->wr_time_left && bfqq->bfqd->low_latency) {
+		/*
+		 * Start a weight raising period with the duration given by
+		 * the raising_time_left snapshot.
+		 */
+		if (bfq_bfqq_busy(bfqq))
+			bfqq->bfqd->wr_busy_queues++;
+		bfqq->wr_coeff = bfqq->bfqd->bfq_wr_coeff;
+		bfqq->wr_cur_max_time = bic->wr_time_left;
+		bfqq->last_wr_start_finish = jiffies;
+		bfqq->entity.ioprio_changed = 1;
+	}
+	/*
+	 * Clear wr_time_left to prevent bfq_bfqq_save_state() from
+	 * getting confused about the queue's need of a weight-raising
+	 * period.
+	 */
+	bic->wr_time_left = 0;
+}
+
+/*
+ * Must be called with the queue_lock held.
+ */
+static int bfqq_process_refs(struct bfq_queue *bfqq)
+{
+	int process_refs, io_refs;
+
+	io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
+	process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
+	return process_refs;
+}
+
 static void bfq_add_request(struct request *rq)
 {
 	struct bfq_queue *bfqq = RQ_BFQQ(rq);
@@ -402,6 +507,12 @@ static void bfq_add_request(struct request *rq)
 	next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
 	bfqq->next_rq = next_rq;
 
+	/*
+	 * Adjust priority tree position, if next_rq changes.
+	 */
+	if (prev != bfqq->next_rq)
+		bfq_rq_pos_tree_add(bfqd, bfqq);
+
 	if (!bfq_bfqq_busy(bfqq)) {
 		int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
 			time_is_before_jiffies(bfqq->soft_rt_next_start);
@@ -414,11 +525,20 @@ static void bfq_add_request(struct request *rq)
 		if (!bfqd->low_latency)
 			goto add_bfqq_busy;
 
+		if (bfq_bfqq_just_split(bfqq))
+			goto set_ioprio_changed;
+
 		/*
-		 * If the queue is not being boosted and has been idle for
-		 * enough time, start a weight-raising period.
+		 * If the queue:
+		 * - is not being boosted,
+		 * - has been idle for enough time,
+		 * - is not a sync queue or is linked to a bfq_io_cq (it is
+		 *   shared by nature or it is not shared and its
+		 *   requests have not been redirected to a shared queue)
+		 * start a weight-raising period.
 		 */
-		if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
+		if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt) &&
+		    (!bfq_bfqq_sync(bfqq) || bfqq->bic != NULL)) {
 			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
 			if (idle_for_long_time)
 				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
@@ -496,6 +616,7 @@ static void bfq_add_request(struct request *rq)
 					bfqd->bfq_wr_rt_max_time;
 			}
 		}
+set_ioprio_changed:
 		if (old_wr_coeff != bfqq->wr_coeff)
 			entity->ioprio_changed = 1;
 add_bfqq_busy:
@@ -583,6 +704,13 @@ static void bfq_remove_request(struct request *rq)
 	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
 		if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue)
 			bfq_del_bfqq_busy(bfqd, bfqq, 1);
+		/*
+		 * Remove queue from request-position tree as it is empty.
+		 */
+		if (bfqq->pos_root != NULL) {
+			rb_erase(&bfqq->pos_node, bfqq->pos_root);
+			bfqq->pos_root = NULL;
+		}
 	}
 
 	if (rq->cmd_flags & REQ_META)
@@ -625,11 +753,14 @@ static void bfq_merged_request(struct request_queue *q, struct request *req,
 					 bfqd->last_position);
 		bfqq->next_rq = next_rq;
 		/*
-		 * If next_rq changes, update the queue's budget to fit
-		 * the new request.
+		 * If next_rq changes, update both the queue's budget to
+		 * fit the new request and the queue's position in its
+		 * rq_pos_tree.
 		 */
-		if (prev != bfqq->next_rq)
+		if (prev != bfqq->next_rq) {
 			bfq_updated_next_req(bfqd, bfqq);
+			bfq_rq_pos_tree_add(bfqd, bfqq);
+		}
 	}
 }
 
@@ -692,12 +823,339 @@ static void bfq_end_wr(struct bfq_data *bfqd)
 	spin_unlock_irq(bfqd->queue->queue_lock);
 }
 
+static inline sector_t bfq_io_struct_pos(void *io_struct, bool request)
+{
+	if (request)
+		return blk_rq_pos(io_struct);
+	else
+		return ((struct bio *)io_struct)->bi_iter.bi_sector;
+}
+
+static inline sector_t bfq_dist_from(sector_t pos1,
+				     sector_t pos2)
+{
+	if (pos1 >= pos2)
+		return pos1 - pos2;
+	else
+		return pos2 - pos1;
+}
+
+static inline int bfq_rq_close_to_sector(void *io_struct, bool request,
+					 sector_t sector)
+{
+	return bfq_dist_from(bfq_io_struct_pos(io_struct, request), sector) <=
+	       BFQQ_SEEK_THR;
+}
+
+static struct bfq_queue *bfqq_close(struct bfq_data *bfqd, sector_t sector)
+{
+	struct rb_root *root = &bfqd->rq_pos_tree;
+	struct rb_node *parent, *node;
+	struct bfq_queue *__bfqq;
+
+	if (RB_EMPTY_ROOT(root))
+		return NULL;
+
+	/*
+	 * First, if we find a request starting at the end of the last
+	 * request, choose it.
+	 */
+	__bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL);
+	if (__bfqq != NULL)
+		return __bfqq;
+
+	/*
+	 * If the exact sector wasn't found, the parent of the NULL leaf
+	 * will contain the closest sector (rq_pos_tree sorted by
+	 * next_request position).
+	 */
+	__bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+	if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+		return __bfqq;
+
+	if (blk_rq_pos(__bfqq->next_rq) < sector)
+		node = rb_next(&__bfqq->pos_node);
+	else
+		node = rb_prev(&__bfqq->pos_node);
+	if (node == NULL)
+		return NULL;
+
+	__bfqq = rb_entry(node, struct bfq_queue, pos_node);
+	if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+		return __bfqq;
+
+	return NULL;
+}
+
+/*
+ * bfqd - obvious
+ * cur_bfqq - passed in so that we don't decide that the current queue
+ *            is closely cooperating with itself
+ * sector - used as a reference point to search for a close queue
+ */
+static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+					      struct bfq_queue *cur_bfqq,
+					      sector_t sector)
+{
+	struct bfq_queue *bfqq;
+
+	if (bfq_class_idle(cur_bfqq))
+		return NULL;
+	if (!bfq_bfqq_sync(cur_bfqq))
+		return NULL;
+	if (BFQQ_SEEKY(cur_bfqq))
+		return NULL;
+
+	/* If device has only one backlogged bfq_queue, don't search. */
+	if (bfqd->busy_queues == 1)
+		return NULL;
+
+	/*
+	 * We should notice if some of the queues are cooperating, e.g.
+	 * working closely on the same area of the disk. In that case,
+	 * we can group them together and don't waste time idling.
+	 */
+	bfqq = bfqq_close(bfqd, sector);
+	if (bfqq == NULL || bfqq == cur_bfqq)
+		return NULL;
+
+	/*
+	 * Do not merge queues from different bfq_groups.
+	*/
+	if (bfqq->entity.parent != cur_bfqq->entity.parent)
+		return NULL;
+
+	/*
+	 * It only makes sense to merge sync queues.
+	 */
+	if (!bfq_bfqq_sync(bfqq))
+		return NULL;
+	if (BFQQ_SEEKY(bfqq))
+		return NULL;
+
+	/*
+	 * Do not merge queues of different priority classes.
+	 */
+	if (bfq_class_rt(bfqq) != bfq_class_rt(cur_bfqq))
+		return NULL;
+
+	return bfqq;
+}
+
+static struct bfq_queue *
+bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+{
+	int process_refs, new_process_refs;
+	struct bfq_queue *__bfqq;
+
+	/*
+	 * If there are no process references on the new_bfqq, then it is
+	 * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
+	 * may have dropped their last reference (not just their last process
+	 * reference).
+	 */
+	if (!bfqq_process_refs(new_bfqq))
+		return NULL;
+
+	/* Avoid a circular list and skip interim queue merges. */
+	while ((__bfqq = new_bfqq->new_bfqq)) {
+		if (__bfqq == bfqq)
+			return NULL;
+		new_bfqq = __bfqq;
+	}
+
+	process_refs = bfqq_process_refs(bfqq);
+	new_process_refs = bfqq_process_refs(new_bfqq);
+	/*
+	 * If the process for the bfqq has gone away, there is no
+	 * sense in merging the queues.
+	 */
+	if (process_refs == 0 || new_process_refs == 0)
+		return NULL;
+
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
+		new_bfqq->pid);
+
+	/*
+	 * Merging is just a redirection: the requests of the process
+	 * owning one of the two queues are redirected to the other queue.
+	 * The latter queue, in its turn, is set as shared if this is the
+	 * first time that the requests of some process are redirected to
+	 * it.
+	 *
+	 * We redirect bfqq to new_bfqq and not the opposite, because we
+	 * are in the context of the process owning bfqq, hence we have
+	 * the io_cq of this process. So we can immediately configure this
+	 * io_cq to redirect the requests of the process to new_bfqq.
+	 *
+	 * NOTE, even if new_bfqq coincides with the in-service queue, the
+	 * io_cq of new_bfqq is not available, because, if the in-service
+	 * queue is shared, bfqd->in_service_bic may not point to the
+	 * io_cq of the in-service queue.
+	 * Redirecting the requests of the process owning bfqq to the
+	 * currently in-service queue is in any case the best option, as
+	 * we feed the in-service queue with new requests close to the
+	 * last request served and, by doing so, hopefully increase the
+	 * throughput.
+	 */
+	bfqq->new_bfqq = new_bfqq;
+	atomic_add(process_refs, &new_bfqq->ref);
+	return new_bfqq;
+}
+
+/*
+ * Attempt to schedule a merge of bfqq with the currently in-service queue
+ * or with a close queue among the scheduled queues.
+ * Return NULL if no merge was scheduled, a pointer to the shared bfq_queue
+ * structure otherwise.
+ */
+static struct bfq_queue *
+bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+		     void *io_struct, bool request)
+{
+	struct bfq_queue *in_service_bfqq, *new_bfqq;
+
+	if (bfqq->new_bfqq)
+		return bfqq->new_bfqq;
+
+	if (!io_struct)
+		return NULL;
+
+	in_service_bfqq = bfqd->in_service_queue;
+
+	if (in_service_bfqq == NULL || in_service_bfqq == bfqq ||
+	    !bfqd->in_service_bic)
+		goto check_scheduled;
+
+	if (bfq_class_idle(in_service_bfqq) || bfq_class_idle(bfqq))
+		goto check_scheduled;
+
+	if (bfq_class_rt(in_service_bfqq) != bfq_class_rt(bfqq))
+		goto check_scheduled;
+
+	if (in_service_bfqq->entity.parent != bfqq->entity.parent)
+		goto check_scheduled;
+
+	if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
+	    bfq_bfqq_sync(in_service_bfqq) && bfq_bfqq_sync(bfqq)) {
+		new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
+		if (new_bfqq != NULL)
+			return new_bfqq; /* Merge with in-service queue */
+	}
+
+	/*
+	 * Check whether there is a cooperator among currently scheduled
+	 * queues. The only thing we need is that the bio/request is not
+	 * NULL, as we need it to establish whether a cooperator exists.
+	 */
+check_scheduled:
+	new_bfqq = bfq_close_cooperator(bfqd, bfqq,
+					bfq_io_struct_pos(io_struct, request));
+	if (new_bfqq)
+		return bfq_setup_merge(bfqq, new_bfqq);
+
+	return NULL;
+}
+
+static inline void
+bfq_bfqq_save_state(struct bfq_queue *bfqq)
+{
+	/*
+	 * If bfqq->bic == NULL, the queue is already shared or its requests
+	 * have already been redirected to a shared queue; both idle window
+	 * and weight raising state have already been saved. Do nothing.
+	 */
+	if (bfqq->bic == NULL)
+		return;
+	if (bfqq->bic->wr_time_left)
+		/*
+		 * This is the queue of a just-started process, and would
+		 * deserve weight raising: we set wr_time_left to the full
+		 * weight-raising duration to trigger weight-raising when
+		 * and if the queue is split and the first request of the
+		 * queue is enqueued.
+		 */
+		bfqq->bic->wr_time_left = bfq_wr_duration(bfqq->bfqd);
+	else if (bfqq->wr_coeff > 1) {
+		unsigned long wr_duration =
+			jiffies - bfqq->last_wr_start_finish;
+		/*
+		 * It may happen that a queue's weight raising period lasts
+		 * longer than its wr_cur_max_time, as weight raising is
+		 * handled only when a request is enqueued or dispatched (it
+		 * does not use any timer). If the weight raising period is
+		 * about to end, don't save it.
+		 */
+		if (bfqq->wr_cur_max_time <= wr_duration)
+			bfqq->bic->wr_time_left = 0;
+		else
+			bfqq->bic->wr_time_left =
+				bfqq->wr_cur_max_time - wr_duration;
+		/*
+		 * The bfq_queue is becoming shared or the requests of the
+		 * process owning the queue are being redirected to a shared
+		 * queue. Stop the weight raising period of the queue, as in
+		 * both cases it should not be owned by an interactive or
+		 * soft real-time application.
+		 */
+		bfq_bfqq_end_wr(bfqq);
+	} else
+		bfqq->bic->wr_time_left = 0;
+	bfqq->bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
+}
+
+static inline void
+bfq_get_bic_reference(struct bfq_queue *bfqq)
+{
+	/*
+	 * If bfqq->bic has a non-NULL value, the bic to which it belongs
+	 * is about to begin using a shared bfq_queue.
+	 */
+	if (bfqq->bic)
+		atomic_long_inc(&bfqq->bic->icq.ioc->refcount);
+}
+
+static void
+bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
+		struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+{
+	bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
+		(long unsigned)new_bfqq->pid);
+	/* Save weight raising and idle window of the merged queues */
+	bfq_bfqq_save_state(bfqq);
+	bfq_bfqq_save_state(new_bfqq);
+	/*
+	 * Grab a reference to the bic, to prevent it from being destroyed
+	 * before being possibly touched by a bfq_split_bfqq().
+	 */
+	bfq_get_bic_reference(bfqq);
+	bfq_get_bic_reference(new_bfqq);
+	/*
+	 * Merge queues (that is, let bic redirect its requests to new_bfqq)
+	 */
+	bic_set_bfqq(bic, new_bfqq, 1);
+	bfq_mark_bfqq_coop(new_bfqq);
+	/*
+	 * new_bfqq now belongs to at least two bics (it is a shared queue):
+	 * set new_bfqq->bic to NULL. bfqq either:
+	 * - does not belong to any bic any more, and hence bfqq->bic must
+	 *   be set to NULL, or
+	 * - is a queue whose owning bics have already been redirected to a
+	 *   different queue, hence the queue is destined to not belong to
+	 *   any bic soon and bfqq->bic is already NULL (therefore the next
+	 *   assignment causes no harm).
+	 */
+	new_bfqq->bic = NULL;
+	bfqq->bic = NULL;
+	bfq_put_queue(bfqq);
+}
+
 static int bfq_allow_merge(struct request_queue *q, struct request *rq,
 			   struct bio *bio)
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
 	struct bfq_io_cq *bic;
-	struct bfq_queue *bfqq;
+	struct bfq_queue *bfqq, *new_bfqq;
 
 	/*
 	 * Disallow merge of a sync bio into an async request.
@@ -715,6 +1173,23 @@ static int bfq_allow_merge(struct request_queue *q, struct request *rq,
 		return 0;
 
 	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	/*
+	 * We take advantage of this function to perform an early merge
+	 * of the queues of possible cooperating processes.
+	 */
+	if (bfqq != NULL) {
+		new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false);
+		if (new_bfqq != NULL) {
+			bfq_merge_bfqqs(bfqd, bic, bfqq, new_bfqq);
+			/*
+			 * If we get here, the bio will be queued in the
+			 * shared queue, i.e., new_bfqq, so use new_bfqq
+			 * to decide whether bio and rq can be merged.
+			 */
+			bfqq = new_bfqq;
+		}
+	}
+
 	return bfqq == RQ_BFQQ(rq);
 }
 
@@ -898,6 +1373,15 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	__bfq_bfqd_reset_in_service(bfqd);
 
+	/*
+	 * If this bfqq is shared between multiple processes, check
+	 * to make sure that those processes are still issuing I/Os
+	 * within the mean seek distance. If not, it may be time to
+	 * break the queues apart again.
+	 */
+	if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
+		bfq_mark_bfqq_split_coop(bfqq);
+
 	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
 		/*
 		 * Overloading budget_timeout field to store the time
@@ -906,8 +1390,13 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 		 */
 		bfqq->budget_timeout = jiffies;
 		bfq_del_bfqq_busy(bfqd, bfqq, 1);
-	} else
+	} else {
 		bfq_activate_bfqq(bfqd, bfqq);
+		/*
+		 * Resort priority tree of potential close cooperators.
+		 */
+		bfq_rq_pos_tree_add(bfqd, bfqq);
+	}
 }
 
 /**
@@ -1773,6 +2262,25 @@ static void bfq_put_queue(struct bfq_queue *bfqq)
 	kmem_cache_free(bfq_pool, bfqq);
 }
 
+static void bfq_put_cooperator(struct bfq_queue *bfqq)
+{
+	struct bfq_queue *__bfqq, *next;
+
+	/*
+	 * If this queue was scheduled to merge with another queue, be
+	 * sure to drop the reference taken on that queue (and others in
+	 * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs.
+	 */
+	__bfqq = bfqq->new_bfqq;
+	while (__bfqq) {
+		if (__bfqq == bfqq)
+			break;
+		next = __bfqq->new_bfqq;
+		bfq_put_queue(__bfqq);
+		__bfqq = next;
+	}
+}
+
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	if (bfqq == bfqd->in_service_queue) {
@@ -1783,12 +2291,35 @@ static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq,
 		     atomic_read(&bfqq->ref));
 
+	bfq_put_cooperator(bfqq);
+
 	bfq_put_queue(bfqq);
 }
 
 static inline void bfq_init_icq(struct io_cq *icq)
 {
-	icq_to_bic(icq)->ttime.last_end_request = jiffies;
+	struct bfq_io_cq *bic = icq_to_bic(icq);
+
+	bic->ttime.last_end_request = jiffies;
+	/*
+	 * A newly created bic indicates that the process has just
+	 * started doing I/O, and is probably mapping into memory its
+	 * executable and libraries: it definitely needs weight raising.
+	 * There is however the possibility that the process performs,
+	 * for a while, I/O close to some other process. EQM intercepts
+	 * this behavior and may merge the queue corresponding to the
+	 * process  with some other queue, BEFORE the weight of the queue
+	 * is raised. Merged queues are not weight-raised (they are assumed
+	 * to belong to processes that benefit only from high throughput).
+	 * If the merge is basically the consequence of an accident, then
+	 * the queue will be split soon and will get back its old weight.
+	 * It is then important to write down somewhere that this queue
+	 * does need weight raising, even if it did not make it to get its
+	 * weight raised before being merged. For this purpose, we overload
+	 * the field wr_time_left and assign 1 to it, to mark the queue
+	 * as needing weight raising.
+	 */
+	bic->wr_time_left = 1;
 }
 
 static void bfq_exit_icq(struct io_cq *icq)
@@ -1802,6 +2333,13 @@ static void bfq_exit_icq(struct io_cq *icq)
 	}
 
 	if (bic->bfqq[BLK_RW_SYNC]) {
+		/*
+		 * If the bic is using a shared queue, put the reference
+		 * taken on the io_context when the bic started using a
+		 * shared bfq_queue.
+		 */
+		if (bfq_bfqq_coop(bic->bfqq[BLK_RW_SYNC]))
+			put_io_context(icq->ioc);
 		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
 		bic->bfqq[BLK_RW_SYNC] = NULL;
 	}
@@ -2089,6 +2627,10 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 	if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
 		return;
 
+	/* Idle window just restored, statistics are meaningless. */
+	if (bfq_bfqq_just_split(bfqq))
+		return;
+
 	enable_idle = bfq_bfqq_idle_window(bfqq);
 
 	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
@@ -2131,6 +2673,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
 	    !BFQQ_SEEKY(bfqq))
 		bfq_update_idle_window(bfqd, bfqq, bic);
+	bfq_clear_bfqq_just_split(bfqq);
 
 	bfq_log_bfqq(bfqd, bfqq,
 		     "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
@@ -2191,14 +2734,48 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 static void bfq_insert_request(struct request_queue *q, struct request *rq)
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
-	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_queue *bfqq = RQ_BFQQ(rq), *new_bfqq;
 
 	assert_spin_locked(bfqd->queue->queue_lock);
 
+	/*
+	 * An unplug may trigger a requeue of a request from the device
+	 * driver: make sure we are in process context while trying to
+	 * merge two bfq_queues.
+	 */
+	if (!in_interrupt()) {
+		new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true);
+		if (new_bfqq != NULL) {
+			if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq)
+				new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1);
+			/*
+			 * Release the request's reference to the old bfqq
+			 * and make sure one is taken to the shared queue.
+			 */
+			new_bfqq->allocated[rq_data_dir(rq)]++;
+			bfqq->allocated[rq_data_dir(rq)]--;
+			atomic_inc(&new_bfqq->ref);
+			bfq_put_queue(bfqq);
+			if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq)
+				bfq_merge_bfqqs(bfqd, RQ_BIC(rq),
+						bfqq, new_bfqq);
+			rq->elv.priv[1] = new_bfqq;
+			bfqq = new_bfqq;
+		}
+	}
+
 	bfq_init_prio_data(bfqq, RQ_BIC(rq));
 
 	bfq_add_request(rq);
 
+	/*
+	 * Here a newly-created bfq_queue has already started a weight-raising
+	 * period: clear wr_time_left to prevent bfq_bfqq_save_state()
+	 * from assigning it a full weight-raising period. See the detailed
+	 * comments about this field in bfq_init_icq().
+	 */
+	if (bfqq->bic != NULL)
+		bfqq->bic->wr_time_left = 0;
 	rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
 	list_add_tail(&rq->queuelist, &bfqq->fifo);
 
@@ -2347,6 +2924,32 @@ static void bfq_put_request(struct request *rq)
 }
 
 /*
+ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
+ * was the last process referring to said bfqq.
+ */
+static struct bfq_queue *
+bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
+
+	put_io_context(bic->icq.ioc);
+
+	if (bfqq_process_refs(bfqq) == 1) {
+		bfqq->pid = current->pid;
+		bfq_clear_bfqq_coop(bfqq);
+		bfq_clear_bfqq_split_coop(bfqq);
+		return bfqq;
+	}
+
+	bic_set_bfqq(bic, NULL, 1);
+
+	bfq_put_cooperator(bfqq);
+
+	bfq_put_queue(bfqq);
+	return NULL;
+}
+
+/*
  * Allocate bfq data structures associated with this request.
  */
 static int bfq_set_request(struct request_queue *q, struct request *rq,
@@ -2359,6 +2962,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	struct bfq_queue *bfqq;
 	struct bfq_group *bfqg;
 	unsigned long flags;
+	bool split = false;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
@@ -2371,10 +2975,20 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 
 	bfqg = bfq_bic_update_cgroup(bic);
 
+new_queue:
 	bfqq = bic_to_bfqq(bic, is_sync);
 	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
 		bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
 		bic_set_bfqq(bic, bfqq, is_sync);
+	} else {
+		/* If the queue was seeky for too long, break it apart. */
+		if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
+			bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
+			bfqq = bfq_split_bfqq(bic, bfqq);
+			split = true;
+			if (!bfqq)
+				goto new_queue;
+		}
 	}
 
 	bfqq->allocated[rw]++;
@@ -2385,6 +2999,26 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	rq->elv.priv[0] = bic;
 	rq->elv.priv[1] = bfqq;
 
+	/*
+	 * If a bfq_queue has only one process reference, it is owned
+	 * by only one bfq_io_cq: we can set the bic field of the
+	 * bfq_queue to the address of that structure. Also, if the
+	 * queue has just been split, mark a flag so that the
+	 * information is available to the other scheduler hooks.
+	 */
+	if (bfqq_process_refs(bfqq) == 1) {
+		bfqq->bic = bic;
+		if (split) {
+			bfq_mark_bfqq_just_split(bfqq);
+			/*
+			 * If the queue has just been split from a shared
+			 * queue, restore the idle window and the possible
+			 * weight raising period.
+			 */
+			bfq_bfqq_resume_state(bfqq, bic);
+		}
+	}
+
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	return 0;
@@ -2565,6 +3199,8 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
 
+	bfqd->rq_pos_tree = RB_ROOT;
+
 	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
 
 	INIT_LIST_HEAD(&bfqd->active_list);
@@ -2918,7 +3554,7 @@ static int __init bfq_init(void)
 	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v2");
+	pr_info("BFQ I/O-scheduler version: v6");
 
 	return 0;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 7d6e4cb..bda1ecb3 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v2 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v6 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
@@ -164,6 +164,10 @@ struct bfq_group;
  * struct bfq_queue - leaf schedulable entity.
  * @ref: reference counter.
  * @bfqd: parent bfq_data.
+ * @new_bfqq: shared bfq_queue if queue is cooperating with
+ *           one or more other queues.
+ * @pos_node: request-position tree member (see bfq_data's @rq_pos_tree).
+ * @pos_root: request-position tree root (see bfq_data's @rq_pos_tree).
  * @sort_list: sorted list of pending requests.
  * @next_rq: if fifo isn't expired, next request to serve.
  * @queued: nr of requests queued in @sort_list.
@@ -196,18 +200,26 @@ struct bfq_group;
  * @service_from_backlogged: cumulative service received from the @bfq_queue
  *                           since the last transition from idle to
  *                           backlogged
+ * @bic: pointer to the bfq_io_cq owning the bfq_queue, set to %NULL if the
+ *	 queue is shared
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async. @cgroup holds a reference to the
- * cgroup, to be sure that it does not disappear while a bfqq still
- * references it (mostly to avoid races between request issuing and task
- * migration followed by cgroup destruction). All the fields are protected
- * by the queue lock of the containing bfqd.
+ * io_context or more, if it  is  async or shared  between  cooperating
+ * processes. @cgroup holds a reference to the cgroup, to be sure that it
+ * does not disappear while a bfqq still references it (mostly to avoid
+ * races between request issuing and task migration followed by cgroup
+ * destruction).
+ * All the fields are protected by the queue lock of the containing bfqd.
  */
 struct bfq_queue {
 	atomic_t ref;
 	struct bfq_data *bfqd;
 
+	/* fields for cooperating queues handling */
+	struct bfq_queue *new_bfqq;
+	struct rb_node pos_node;
+	struct rb_root *pos_root;
+
 	struct rb_root sort_list;
 	struct request *next_rq;
 	int queued[2];
@@ -232,6 +244,7 @@ struct bfq_queue {
 	sector_t last_request_pos;
 
 	pid_t pid;
+	struct bfq_io_cq *bic;
 
 	/* weight-raising fields */
 	unsigned long wr_cur_max_time;
@@ -261,12 +274,24 @@ struct bfq_ttime {
  * @icq: associated io_cq structure
  * @bfqq: array of two process queues, the sync and the async
  * @ttime: associated @bfq_ttime struct
+ * @wr_time_left: snapshot of the time left before weight raising ends
+ *                for the sync queue associated to this process; this
+ *		  snapshot is taken to remember this value while the weight
+ *		  raising is suspended because the queue is merged with a
+ *		  shared queue, and is used to set @wr_cur_max_time
+ *		  when the queue is split from the shared queue and its
+ *		  weight is raised again
+ * @saved_idle_window: same purpose as the previous field for the idle
+ *                     window
  */
 struct bfq_io_cq {
 	struct io_cq icq; /* must be the first member */
 	struct bfq_queue *bfqq[2];
 	struct bfq_ttime ttime;
 	int ioprio;
+
+	unsigned int wr_time_left;
+	unsigned int saved_idle_window;
 };
 
 enum bfq_device_speed {
@@ -278,6 +303,9 @@ enum bfq_device_speed {
  * struct bfq_data - per device data structure.
  * @queue: request queue for the managed device.
  * @root_group: root bfq_group for the device.
+ * @rq_pos_tree: rbtree sorted by next_request position, used when
+ *               determining if two or more queues have interleaving
+ *               requests (see bfq_close_cooperator()).
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @wr_busy_queues: number of weight-raised busy @bfq_queues.
@@ -344,6 +372,7 @@ struct bfq_data {
 	struct request_queue *queue;
 
 	struct bfq_group *root_group;
+	struct rb_root rq_pos_tree;
 
 	int busy_queues;
 	int wr_busy_queues;
@@ -418,6 +447,9 @@ enum bfqq_state_flags {
 					 * may need softrt-next-start
 					 * update
 					 */
+	BFQ_BFQQ_FLAG_coop,		/* bfqq is shared */
+	BFQ_BFQQ_FLAG_split_coop,	/* shared bfqq will be split */
+	BFQ_BFQQ_FLAG_just_split,	/* queue has just been split */
 };
 
 #define BFQ_BFQQ_FNS(name)						\
@@ -443,6 +475,9 @@ BFQ_BFQQ_FNS(prio_changed);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(budget_new);
 BFQ_BFQQ_FNS(constantly_seeky);
+BFQ_BFQQ_FNS(coop);
+BFQ_BFQQ_FNS(split_coop);
+BFQ_BFQQ_FNS(just_split);
 BFQ_BFQQ_FNS(softrt_update);
 #undef BFQ_BFQQ_FNS
 
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 13/14] block, bfq: boost the throughput on NCQ-capable flash-based devices
  2014-05-27 12:42 ` paolo
@ 2014-05-27 12:42     ` paolo
  -1 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente, containers, linux-kernel, Fabio Checconi,
	Arianna Avanzini, cgroups, Paolo Valente

From: Paolo Valente <paolo.valente@unimore.it>

This patch boosts the throughput on NCQ-capable flash-based devices,
while still preserving latency guarantees for interactive and soft
real-time applications. The throughput is boosted by just not idling
the device when the in-service queue remains empty, even if the queue
is sync and has a non-null idle window. This helps to keep the drive's
internal queue full, which is necessary to achieve maximum
performance. This solution to boost the throughput is a port of
commits a68bbdd and f7d7b7a for CFQ.

As already highlighted in patch 10, allowing the device to prefetch
and internally reorder requests trivially causes loss of control on
the request service order, and hence on service guarantees.
Fortunately, as discussed in detail in the comments to the function
bfq_bfqq_must_not_expire(), if every process has to receive the same
fraction of the throughput, then the service order enforced by the
internal scheduler of a flash-based device is relatively close to that
enforced by BFQ. In particular, it is close enough to let service
guarantees be substantially preserved.

Things change in an asymmetric scenario, i.e., if not every process
has to receive the same fraction of the throughput. In this case, to
guarantee the desired throughput distribution, the device must be
prevented from prefetching requests. This is exactly what this patch
does in asymmetric scenarios.
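
To make the idling decision sketched above more concrete, here is a
minimal user-space sketch. It is an editorial illustration, not the
kernel logic: it assumes plain per-queue weights only, whereas the
actual condition also accounts for weight raising and for the cgroup
hierarchy, and all names below are invented for the example.

#include <stdbool.h>
#include <stdio.h>

/* The scenario is symmetric if all active queues have the same weight. */
static bool symmetric_scenario(const int *weights, int n)
{
	int i;

	for (i = 1; i < n; i++)
		if (weights[i] != weights[0])
			return false;	/* differentiated weights */
	return true;
}

static bool keep_idling(const int *weights, int n, bool ncq, bool flash)
{
	/* Asymmetric weights: idle anyway, to preserve service guarantees. */
	if (!symmetric_scenario(weights, n))
		return true;
	/* Symmetric scenario on NCQ flash: skip idling, keep the drive busy. */
	return !(ncq && flash);
}

int main(void)
{
	int same[3] = { 100, 100, 100 };
	int mixed[3] = { 100, 300, 100 };

	printf("equal weights -> keep idling? %d\n",
	       keep_idling(same, 3, true, true));
	printf("mixed weights -> keep idling? %d\n",
	       keep_idling(mixed, 3, true, true));
	return 0;
}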

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-cgroup.c  |   1 +
 block/bfq-iosched.c | 205 +++++++++++++++++++++++++++++++++++++++++++++++++---
 block/bfq-sched.c   |  98 ++++++++++++++++++++++++-
 block/bfq.h         |  46 ++++++++++++
 4 files changed, 338 insertions(+), 12 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 1cb25aa..d338a54 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -85,6 +85,7 @@ static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
 	entity->ioprio = entity->new_ioprio;
 	entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
 	entity->my_sched_data = &bfqg->sched_data;
+	bfqg->active_entities = 0;
 }
 
 static inline void bfq_group_set_parent(struct bfq_group *bfqg,
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 22d4caa..49856e1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -364,6 +364,120 @@ static struct request *bfq_choose_req(struct bfq_data *bfqd,
 	}
 }
 
+/*
+ * Tell whether there are active queues or groups with differentiated weights.
+ */
+static inline bool bfq_differentiated_weights(struct bfq_data *bfqd)
+{
+	/*
+	 * For weights to differ, at least one of the trees must contain
+	 * at least two nodes.
+	 */
+	return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
+		(bfqd->queue_weights_tree.rb_node->rb_left ||
+		 bfqd->queue_weights_tree.rb_node->rb_right)
+#ifdef CONFIG_CGROUP_BFQIO
+	       ) ||
+	       (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
+		(bfqd->group_weights_tree.rb_node->rb_left ||
+		 bfqd->group_weights_tree.rb_node->rb_right)
+#endif
+	       );
+}
+
+/*
+ * If the weight-counter tree passed as input contains no counter for
+ * the weight of the input entity, then add that counter; otherwise just
+ * increment the existing counter.
+ *
+ * Note that weight-counter trees contain few nodes in mostly symmetric
+ * scenarios. For example, if all queues have the same weight, then the
+ * weight-counter tree for the queues may contain at most one node.
+ * This holds even if low_latency is on, because weight-raised queues
+ * are not inserted in the tree.
+ * In most scenarios, the rate at which nodes are created/destroyed
+ * should be low too.
+ */
+static void bfq_weights_tree_add(struct bfq_data *bfqd,
+				 struct bfq_entity *entity,
+				 struct rb_root *root)
+{
+	struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+	/*
+	 * Do not insert if:
+	 * - the device does not support queueing;
+	 * - the entity is already associated with a counter, which happens if:
+	 *   1) the entity is associated with a queue, 2) a request arrival
+	 *   has caused the queue to become both non-weight-raised, and hence
+	 *   change its weight, and backlogged; in this respect, each
+	 *   of the two events causes an invocation of this function,
+	 *   3) this is the invocation of this function caused by the second
+	 *   event. This second invocation is actually useless, and we handle
+	 *   this fact by exiting immediately. More efficient or clearer
+	 *   solutions might possibly be adopted.
+	 */
+	if (!bfqd->hw_tag || entity->weight_counter)
+		return;
+
+	while (*new) {
+		struct bfq_weight_counter *__counter = container_of(*new,
+						struct bfq_weight_counter,
+						weights_node);
+		parent = *new;
+
+		if (entity->weight == __counter->weight) {
+			entity->weight_counter = __counter;
+			goto inc_counter;
+		}
+		if (entity->weight < __counter->weight)
+			new = &((*new)->rb_left);
+		else
+			new = &((*new)->rb_right);
+	}
+
+	entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
+					 GFP_ATOMIC);
+	entity->weight_counter->weight = entity->weight;
+	rb_link_node(&entity->weight_counter->weights_node, parent, new);
+	rb_insert_color(&entity->weight_counter->weights_node, root);
+
+inc_counter:
+	entity->weight_counter->num_active++;
+}
+
+/*
+ * Decrement the weight counter associated with the entity, and, if the
+ * counter reaches 0, remove the counter from the tree.
+ * See the comments to the function bfq_weights_tree_add() for considerations
+ * about overhead.
+ */
+static void bfq_weights_tree_remove(struct bfq_data *bfqd,
+				    struct bfq_entity *entity,
+				    struct rb_root *root)
+{
+	/*
+	 * Check whether the entity is actually associated with a counter.
+	 * In fact, the device may not be considered NCQ-capable for a while,
+	 * which implies that no insertion in the weight trees is performed,
+	 * after which the device may start to be deemed NCQ-capable, and hence
+	 * this function may start to be invoked. This may cause the function
+	 * to be invoked for entities that are not associated with any counter.
+	 */
+	if (!entity->weight_counter)
+		return;
+
+	entity->weight_counter->num_active--;
+	if (entity->weight_counter->num_active > 0)
+		goto reset_entity_pointer;
+
+	rb_erase(&entity->weight_counter->weights_node, root);
+	kfree(entity->weight_counter);
+
+reset_entity_pointer:
+	entity->weight_counter = NULL;
+}
+
 static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
 					struct bfq_queue *bfqq,
 					struct request *last)
@@ -1906,16 +2020,17 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * two conditions holds. The first condition is that the device is not
  * performing NCQ, because idling the device most certainly boosts the
  * throughput if this condition holds and bfqq has been granted a non-null
- * idle window.
+ * idle window. The second compound condition is made of the logical AND of
+ * two components.
  *
- * The second condition is that there is no weight-raised busy queue,
- * which guarantees that the device is not idled for a sync non-weight-
- * raised queue when there are busy weight-raised queues. The former is
- * then expired immediately if empty. Combined with the timestamping rules
- * of BFQ (see [1] for details), this causes sync non-weight-raised queues
- * to get a lower number of requests served, and hence to ask for a lower
- * number of requests from the request pool, before the busy weight-raised
- * queues get served again.
+ * The first component is true only if there is no weight-raised busy
+ * queue. This guarantees that the device is not idled for a sync non-
+ * weight-raised queue when there are busy weight-raised queues. The former
+ * is then expired immediately if empty. Combined with the timestamping
+ * rules of BFQ (see [1] for details), this causes sync non-weight-raised
+ * queues to get a lower number of requests served, and hence to ask for a
+ * lower number of requests from the request pool, before the busy weight-
+ * raised queues get served again.
  *
  * This is beneficial for the processes associated with weight-raised
  * queues, when the request pool is saturated (e.g., in the presence of
@@ -1932,16 +2047,76 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * weight-raised queues seems to mitigate starvation problems in the
  * presence of heavy write workloads and NCQ, and hence to guarantee a
  * higher application and system responsiveness in these hostile scenarios.
+ *
+ * If the first component of the compound condition is instead true, i.e.,
+ * there is no weight-raised busy queue, then the second component of the
+ * compound condition takes into account service-guarantee and throughput
+ * issues related to NCQ (recall that the compound condition is evaluated
+ * only if the device is detected as supporting NCQ).
+ *
+ * As for service guarantees, allowing the drive to enqueue more than one
+ * request at a time, and hence delegating de facto final scheduling
+ * decisions to the drive's internal scheduler, causes loss of control on
+ * the actual request service order. In this respect, when the drive is
+ * allowed to enqueue more than one request at a time, the service
+ * distribution enforced by the drive's internal scheduler is likely to
+ * coincide with the desired device-throughput distribution only in the
+ * following, perfectly symmetric, scenario:
+ * 1) all active queues have the same weight,
+ * 2) all active groups at the same level in the groups tree have the same
+ *    weight,
+ * 3) all active groups at the same level in the groups tree have the same
+ *    number of children.
+ *
+ * Even in such a scenario, sequential I/O may still receive a preferential
+ * treatment, but this is not likely to be a big issue with flash-based
+ * devices, because of their non-dramatic loss of throughput with random
+ * I/O.
+ *
+ * Unfortunately, keeping the necessary state for evaluating exactly the
+ * above symmetry conditions would be quite complex and time-consuming.
+ * Therefore BFQ evaluates instead the following stronger sub-conditions,
+ * for which it is much easier to maintain the needed state:
+ * 1) all active queues have the same weight,
+ * 2) all active groups have the same weight,
+ * 3) all active groups have at most one active child each.
+ * In particular, the last two conditions are always true if hierarchical
+ * support and the cgroups interface are not enabled, hence no state needs
+ * to be maintained in this case.
+ *
+ * According to the above considerations, the second component of the
+ * compound condition evaluates to true if any of the above symmetry
+ * sub-condition does not hold, or the device is not flash-based. Therefore,
+ * if also the first component is true, then idling is allowed for a sync
+ * queue. In contrast, if all the required symmetry sub-conditions hold and
+ * the device is flash-based, then the second component, and hence the
+ * whole compound condition, evaluates to false, and no idling is performed.
+ * This helps to keep the drives' internal queues full on NCQ-capable
+ * devices, and hence to boost the throughput, without causing 'almost' any
+ * loss of service guarantees. The 'almost' follows from the fact that, if
+ * the internal queue of one such device is filled while all the
+ * sub-conditions hold, but at some point in time some sub-condition stops
+ * to hold, then it may become impossible to let requests be served in the
+ * new desired order until all the requests already queued in the device
+ * have been served.
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
 	struct bfq_data *bfqd = bfqq->bfqd;
+#ifdef CONFIG_CGROUP_BFQIO
+#define symmetric_scenario	  (!bfqd->active_numerous_groups && \
+				   !bfq_differentiated_weights(bfqd))
+#else
+#define symmetric_scenario	  (!bfq_differentiated_weights(bfqd))
+#endif
 /*
  * Condition for expiring a non-weight-raised queue (and hence not idling
  * the device).
  */
 #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
-				   bfqd->wr_busy_queues > 0)
+				   (bfqd->wr_busy_queues > 0 || \
+				    (symmetric_scenario && \
+				     blk_queue_nonrot(bfqd->queue))))
 
 	return bfq_bfqq_sync(bfqq) && (
 		bfqq->wr_coeff > 1 ||
@@ -2821,6 +2996,10 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	bfqd->rq_in_driver--;
 	bfqq->dispatched--;
 
+	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq))
+		bfq_weights_tree_remove(bfqd, &bfqq->entity,
+					&bfqd->queue_weights_tree);
+
 	if (sync) {
 		bfqd->sync_flight--;
 		RQ_BIC(rq)->ttime.last_end_request = jiffies;
@@ -3195,11 +3374,17 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 
 	bfqd->root_group = bfqg;
 
+#ifdef CONFIG_CGROUP_BFQIO
+	bfqd->active_numerous_groups = 0;
+#endif
+
 	init_timer(&bfqd->idle_slice_timer);
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
 
 	bfqd->rq_pos_tree = RB_ROOT;
+	bfqd->queue_weights_tree = RB_ROOT;
+	bfqd->group_weights_tree = RB_ROOT;
 
 	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
 
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 73f453b..473b36a 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -308,6 +308,15 @@ up:
 	goto up;
 }
 
+static void bfq_weights_tree_add(struct bfq_data *bfqd,
+				 struct bfq_entity *entity,
+				 struct rb_root *root);
+
+static void bfq_weights_tree_remove(struct bfq_data *bfqd,
+				    struct bfq_entity *entity,
+				    struct rb_root *root);
+
+
 /**
  * bfq_active_insert - insert an entity in the active tree of its
  *                     group/device.
@@ -324,6 +333,11 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node = &entity->rb_node;
+#ifdef CONFIG_CGROUP_BFQIO
+	struct bfq_sched_data *sd = NULL;
+	struct bfq_group *bfqg = NULL;
+	struct bfq_data *bfqd = NULL;
+#endif
 
 	bfq_insert(&st->active, entity);
 
@@ -334,8 +348,22 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 
 	bfq_update_active_tree(node);
 
+#ifdef CONFIG_CGROUP_BFQIO
+	sd = entity->sched_data;
+	bfqg = container_of(sd, struct bfq_group, sched_data);
+	bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
 	if (bfqq != NULL)
 		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+#ifdef CONFIG_CGROUP_BFQIO
+	else /* bfq_group */
+		bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree);
+	if (bfqg != bfqd->root_group) {
+		bfqg->active_entities++;
+		if (bfqg->active_entities == 2)
+			bfqd->active_numerous_groups++;
+	}
+#endif
 }
 
 /**
@@ -411,6 +439,11 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node;
+#ifdef CONFIG_CGROUP_BFQIO
+	struct bfq_sched_data *sd = NULL;
+	struct bfq_group *bfqg = NULL;
+	struct bfq_data *bfqd = NULL;
+#endif
 
 	node = bfq_find_deepest(&entity->rb_node);
 	bfq_extract(&st->active, entity);
@@ -418,8 +451,23 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 	if (node != NULL)
 		bfq_update_active_tree(node);
 
+#ifdef CONFIG_CGROUP_BFQIO
+	sd = entity->sched_data;
+	bfqg = container_of(sd, struct bfq_group, sched_data);
+	bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
 	if (bfqq != NULL)
 		list_del(&bfqq->bfqq_list);
+#ifdef CONFIG_CGROUP_BFQIO
+	else /* bfq_group */
+		bfq_weights_tree_remove(bfqd, entity,
+					&bfqd->group_weights_tree);
+	if (bfqg != bfqd->root_group) {
+		bfqg->active_entities--;
+		if (bfqg->active_entities == 1)
+			bfqd->active_numerous_groups--;
+	}
+#endif
 }
 
 /**
@@ -515,6 +563,23 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 
 	if (entity->ioprio_changed) {
 		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+		unsigned short prev_weight, new_weight;
+		struct bfq_data *bfqd = NULL;
+		struct rb_root *root;
+#ifdef CONFIG_CGROUP_BFQIO
+		struct bfq_sched_data *sd;
+		struct bfq_group *bfqg;
+#endif
+
+		if (bfqq != NULL)
+			bfqd = bfqq->bfqd;
+#ifdef CONFIG_CGROUP_BFQIO
+		else {
+			sd = entity->my_sched_data;
+			bfqg = container_of(sd, struct bfq_group, sched_data);
+			bfqd = (struct bfq_data *)bfqg->bfqd;
+		}
+#endif
 
 		old_st->wsum -= entity->weight;
 
@@ -541,8 +606,31 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		 * when entity->finish <= old_st->vtime).
 		 */
 		new_st = bfq_entity_service_tree(entity);
-		entity->weight = entity->orig_weight *
-				 (bfqq != NULL ? bfqq->wr_coeff : 1);
+
+		prev_weight = entity->weight;
+		new_weight = entity->orig_weight *
+			     (bfqq != NULL ? bfqq->wr_coeff : 1);
+		/*
+		 * If the weight of the entity changes, remove the entity
+		 * from its old weight counter (if there is a counter
+		 * associated with the entity), and add it to the counter
+		 * associated with its new weight.
+		 */
+		if (prev_weight != new_weight) {
+			root = bfqq ? &bfqd->queue_weights_tree :
+				      &bfqd->group_weights_tree;
+			bfq_weights_tree_remove(bfqd, entity, root);
+		}
+		entity->weight = new_weight;
+		/*
+		 * Add the entity to its weights tree only if it is
+		 * not associated with a weight-raised queue.
+		 */
+		if (prev_weight != new_weight &&
+		    (bfqq ? bfqq->wr_coeff == 1 : 1))
+			/* If we get here, root has been initialized. */
+			bfq_weights_tree_add(bfqd, entity, root);
+
 		new_st->wsum += entity->weight;
 
 		if (new_st != old_st)
@@ -976,6 +1064,9 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
 
+	if (!bfqq->dispatched)
+		bfq_weights_tree_remove(bfqd, &bfqq->entity,
+					&bfqd->queue_weights_tree);
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues--;
 }
@@ -992,6 +1083,9 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues++;
 
+	if (!bfqq->dispatched && bfqq->wr_coeff == 1)
+		bfq_weights_tree_add(bfqd, &bfqq->entity,
+				     &bfqd->queue_weights_tree);
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues++;
 }
diff --git a/block/bfq.h b/block/bfq.h
index bda1ecb3..83c828d 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -81,8 +81,23 @@ struct bfq_sched_data {
 };
 
 /**
+ * struct bfq_weight_counter - counter of the number of all active entities
+ *                             with a given weight.
+ * @weight: weight of the entities that this counter refers to.
+ * @num_active: number of active entities with this weight.
+ * @weights_node: weights tree member (see bfq_data's @queue_weights_tree
+ *                and @group_weights_tree).
+ */
+struct bfq_weight_counter {
+	short int weight;
+	unsigned int num_active;
+	struct rb_node weights_node;
+};
+
+/**
  * struct bfq_entity - schedulable entity.
  * @rb_node: service_tree member.
+ * @weight_counter: pointer to the weight counter associated with this entity.
  * @on_st: flag, true if the entity is on a tree (either the active or
  *         the idle one of its service_tree).
  * @finish: B-WF2Q+ finish timestamp (aka F_i).
@@ -133,6 +148,7 @@ struct bfq_sched_data {
  */
 struct bfq_entity {
 	struct rb_node rb_node;
+	struct bfq_weight_counter *weight_counter;
 
 	int on_st;
 
@@ -306,6 +322,22 @@ enum bfq_device_speed {
  * @rq_pos_tree: rbtree sorted by next_request position, used when
  *               determining if two or more queues have interleaving
  *               requests (see bfq_close_cooperator()).
+ * @active_numerous_groups: number of bfq_groups containing more than one
+ *                          active @bfq_entity.
+ * @queue_weights_tree: rbtree of weight counters of @bfq_queues, sorted by
+ *                      weight. Used to keep track of whether all @bfq_queues
+ *                     have the same weight. The tree contains one counter
+ *                     for each distinct weight associated to some active
+ *                     and not weight-raised @bfq_queue (see the comments to
+ *                      the functions bfq_weights_tree_[add|remove] for
+ *                     further details).
+ * @group_weights_tree: rbtree of non-queue @bfq_entity weight counters, sorted
+ *                      by weight. Used to keep track of whether all
+ *                     @bfq_groups have the same weight. The tree contains
+ *                     one counter for each distinct weight associated to
+ *                     some active @bfq_group (see the comments to the
+ *                     functions bfq_weights_tree_[add|remove] for further
+ *                     details).
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @wr_busy_queues: number of weight-raised busy @bfq_queues.
@@ -374,6 +406,13 @@ struct bfq_data {
 	struct bfq_group *root_group;
 	struct rb_root rq_pos_tree;
 
+#ifdef CONFIG_CGROUP_BFQIO
+	int active_numerous_groups;
+#endif
+
+	struct rb_root queue_weights_tree;
+	struct rb_root group_weights_tree;
+
 	int busy_queues;
 	int wr_busy_queues;
 	int queued;
@@ -517,6 +556,11 @@ enum bfqq_expiration {
  * @my_entity: pointer to @entity, %NULL for the toplevel group; used
  *             to avoid too many special cases during group creation/
  *             migration.
+ * @active_entities: number of active entities belonging to the group;
+ *                   unused for the root group. Used to know whether there
+ *                   are groups with more than one active @bfq_entity
+ *                   (see the comments to the function
+ *                   bfq_bfqq_must_not_expire()).
  *
  * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
  * there is a set of bfq_groups, each one collecting the lower-level
@@ -542,6 +586,8 @@ struct bfq_group {
 	struct bfq_queue *async_idle_bfqq;
 
 	struct bfq_entity *my_entity;
+
+	int active_entities;
 };
 
 /**
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 13/14] block, bfq: boost the throughput on NCQ-capable flash-based devices
@ 2014-05-27 12:42     ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Paolo Valente <paolo.valente@unimore.it>

This patch boosts the throughput on NCQ-capable flash-based devices,
while still preserving latency guarantees for interactive and soft
real-time applications. The throughput is boosted simply by not
idling the device when the in-service queue remains empty, even if the
queue is sync and has a non-null idle window. This helps to keep the
drive's internal queue full, which is necessary to achieve maximum
performance. This throughput-boosting solution ports commits a68bbdd
and f7d7b7a from CFQ.
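
For illustration only, here is a minimal user-space sketch of the
decision just described. All struct fields and helper names are
assumptions made for this example, not the scheduler's actual data
structures, and the symmetry refinement discussed in the next
paragraph is left out.

#include <stdbool.h>

struct example_device {
	bool ncq_capable;	/* drive queues several requests internally */
	bool flash_based;	/* non-rotational medium */
};

struct example_queue {
	bool sync;		/* queue issues synchronous requests */
	bool has_idle_window;	/* queue has earned a non-null idle window */
	int queued_requests;	/* requests still waiting in the scheduler */
};

/* Return true if the scheduler should idle, waiting for this queue. */
static bool should_idle_for(const struct example_device *dev,
			    const struct example_queue *q)
{
	if (!q->sync || !q->has_idle_window)
		return false;	/* never worth idling for such a queue */

	/* The point of this patch: do not idle on an emptied queue. */
	if (dev->ncq_capable && dev->flash_based && q->queued_requests == 0)
		return false;

	return true;
}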

As already highlighted in patch 10, allowing the device to prefetch
and internally reorder requests trivially causes loss of control over
the request service order, and hence over service guarantees.
Fortunately, as discussed in detail in the comments to the function
bfq_bfqq_must_not_expire(), if every process has to receive the same
fraction of the throughput, then the service order enforced by the
internal scheduler of a flash-based device is relatively close to that
enforced by BFQ. In particular, it is close enough to let service
guarantees be substantially preserved.

Things change in an asymmetric scenario, i.e., if not every process
has to receive the same fraction of the throughput. In this case, to
guarantee the desired throughput distribution, the device must be
prevented from prefetching requests. This is exactly what this patch
does in such scenarios.
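
The diff below implements the symmetry test with per-weight counters
kept in red-black trees (see bfq_differentiated_weights() and the
bfq_weights_tree_[add|remove] helpers). As a rough sketch of the
condition itself, detached from those data structures, the same
question can be asked of plain arrays of weights; the names here are
invented for the example, and the real code additionally requires that
no group have more than one active child.

#include <stdbool.h>
#include <stddef.h>

static bool all_equal(const int *w, size_t n)
{
	size_t i;

	for (i = 1; i < n; i++)
		if (w[i] != w[0])
			return false;
	return true;
}

/*
 * Sketch: the scenario is symmetric, and the drive can be left free to
 * prefetch, only if every active queue has the same weight and every
 * active group has the same weight.
 */
static bool symmetric_scenario(const int *queue_weights, size_t nr_queues,
			       const int *group_weights, size_t nr_groups)
{
	return all_equal(queue_weights, nr_queues) &&
	       all_equal(group_weights, nr_groups);
}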

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-cgroup.c  |   1 +
 block/bfq-iosched.c | 205 +++++++++++++++++++++++++++++++++++++++++++++++++---
 block/bfq-sched.c   |  98 ++++++++++++++++++++++++-
 block/bfq.h         |  46 ++++++++++++
 4 files changed, 338 insertions(+), 12 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 1cb25aa..d338a54 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -85,6 +85,7 @@ static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
 	entity->ioprio = entity->new_ioprio;
 	entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
 	entity->my_sched_data = &bfqg->sched_data;
+	bfqg->active_entities = 0;
 }
 
 static inline void bfq_group_set_parent(struct bfq_group *bfqg,
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 22d4caa..49856e1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -364,6 +364,120 @@ static struct request *bfq_choose_req(struct bfq_data *bfqd,
 	}
 }
 
+/*
+ * Tell whether there are active queues or groups with differentiated weights.
+ */
+static inline bool bfq_differentiated_weights(struct bfq_data *bfqd)
+{
+	/*
+	 * For weights to differ, at least one of the trees must contain
+	 * at least two nodes.
+	 */
+	return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
+		(bfqd->queue_weights_tree.rb_node->rb_left ||
+		 bfqd->queue_weights_tree.rb_node->rb_right)
+#ifdef CONFIG_CGROUP_BFQIO
+	       ) ||
+	       (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
+		(bfqd->group_weights_tree.rb_node->rb_left ||
+		 bfqd->group_weights_tree.rb_node->rb_right)
+#endif
+	       );
+}
+
+/*
+ * If the weight-counter tree passed as input contains no counter for
+ * the weight of the input entity, then add that counter; otherwise just
+ * increment the existing counter.
+ *
+ * Note that weight-counter trees contain few nodes in mostly symmetric
+ * scenarios. For example, if all queues have the same weight, then the
+ * weight-counter tree for the queues may contain at most one node.
+ * This holds even if low_latency is on, because weight-raised queues
+ * are not inserted in the tree.
+ * In most scenarios, the rate at which nodes are created/destroyed
+ * should be low too.
+ */
+static void bfq_weights_tree_add(struct bfq_data *bfqd,
+				 struct bfq_entity *entity,
+				 struct rb_root *root)
+{
+	struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+	/*
+	 * Do not insert if:
+	 * - the device does not support queueing;
+	 * - the entity is already associated with a counter, which happens if:
+	 *   1) the entity is associated with a queue, 2) a request arrival
+	 *   has caused the queue to become both non-weight-raised, and hence
+	 *   change its weight, and backlogged; in this respect, each
+	 *   of the two events causes an invocation of this function,
+	 *   3) this is the invocation of this function caused by the second
+	 *   event. This second invocation is actually useless, and we handle
+	 *   this fact by exiting immediately. More efficient or clearer
+	 *   solutions might possibly be adopted.
+	 */
+	if (!bfqd->hw_tag || entity->weight_counter)
+		return;
+
+	while (*new) {
+		struct bfq_weight_counter *__counter = container_of(*new,
+						struct bfq_weight_counter,
+						weights_node);
+		parent = *new;
+
+		if (entity->weight == __counter->weight) {
+			entity->weight_counter = __counter;
+			goto inc_counter;
+		}
+		if (entity->weight < __counter->weight)
+			new = &((*new)->rb_left);
+		else
+			new = &((*new)->rb_right);
+	}
+
+	entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
+					 GFP_ATOMIC);
+	entity->weight_counter->weight = entity->weight;
+	rb_link_node(&entity->weight_counter->weights_node, parent, new);
+	rb_insert_color(&entity->weight_counter->weights_node, root);
+
+inc_counter:
+	entity->weight_counter->num_active++;
+}
+
+/*
+ * Decrement the weight counter associated with the entity, and, if the
+ * counter reaches 0, remove the counter from the tree.
+ * See the comments to the function bfq_weights_tree_add() for considerations
+ * about overhead.
+ */
+static void bfq_weights_tree_remove(struct bfq_data *bfqd,
+				    struct bfq_entity *entity,
+				    struct rb_root *root)
+{
+	/*
+	 * Check whether the entity is actually associated with a counter.
+	 * In fact, the device may not be considered NCQ-capable for a while,
+	 * which implies that no insertion in the weight trees is performed,
+	 * after which the device may start to be deemed NCQ-capable, and hence
+	 * this function may start to be invoked. This may cause the function
+	 * to be invoked for entities that are not associated with any counter.
+	 */
+	if (!entity->weight_counter)
+		return;
+
+	entity->weight_counter->num_active--;
+	if (entity->weight_counter->num_active > 0)
+		goto reset_entity_pointer;
+
+	rb_erase(&entity->weight_counter->weights_node, root);
+	kfree(entity->weight_counter);
+
+reset_entity_pointer:
+	entity->weight_counter = NULL;
+}
+
 static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
 					struct bfq_queue *bfqq,
 					struct request *last)
@@ -1906,16 +2020,17 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * two conditions holds. The first condition is that the device is not
  * performing NCQ, because idling the device most certainly boosts the
  * throughput if this condition holds and bfqq has been granted a non-null
- * idle window.
+ * idle window. The second compound condition is made of the logical AND of
+ * two components.
  *
- * The second condition is that there is no weight-raised busy queue,
- * which guarantees that the device is not idled for a sync non-weight-
- * raised queue when there are busy weight-raised queues. The former is
- * then expired immediately if empty. Combined with the timestamping rules
- * of BFQ (see [1] for details), this causes sync non-weight-raised queues
- * to get a lower number of requests served, and hence to ask for a lower
- * number of requests from the request pool, before the busy weight-raised
- * queues get served again.
+ * The first component is true only if there is no weight-raised busy
+ * queue. This guarantees that the device is not idled for a sync non-
+ * weight-raised queue when there are busy weight-raised queues. The former
+ * is then expired immediately if empty. Combined with the timestamping
+ * rules of BFQ (see [1] for details), this causes sync non-weight-raised
+ * queues to get a lower number of requests served, and hence to ask for a
+ * lower number of requests from the request pool, before the busy weight-
+ * raised queues get served again.
  *
  * This is beneficial for the processes associated with weight-raised
  * queues, when the request pool is saturated (e.g., in the presence of
@@ -1932,16 +2047,76 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * weight-raised queues seems to mitigate starvation problems in the
  * presence of heavy write workloads and NCQ, and hence to guarantee a
  * higher application and system responsiveness in these hostile scenarios.
+ *
+ * If the first component of the compound condition is instead true, i.e.,
+ * there is no weight-raised busy queue, then the second component of the
+ * compound condition takes into account service-guarantee and throughput
+ * issues related to NCQ (recall that the compound condition is evaluated
+ * only if the device is detected as supporting NCQ).
+ *
+ * As for service guarantees, allowing the drive to enqueue more than one
+ * request at a time, and hence delegating de facto final scheduling
+ * decisions to the drive's internal scheduler, causes loss of control on
+ * the actual request service order. In this respect, when the drive is
+ * allowed to enqueue more than one request at a time, the service
+ * distribution enforced by the drive's internal scheduler is likely to
+ * coincide with the desired device-throughput distribution only in the
+ * following, perfectly symmetric, scenario:
+ * 1) all active queues have the same weight,
+ * 2) all active groups at the same level in the groups tree have the same
+ *    weight,
+ * 3) all active groups at the same level in the groups tree have the same
+ *    number of children.
+ *
+ * Even in such a scenario, sequential I/O may still receive a preferential
+ * treatment, but this is not likely to be a big issue with flash-based
+ * devices, because of their non-dramatic loss of throughput with random
+ * I/O.
+ *
+ * Unfortunately, keeping the necessary state for evaluating exactly the
+ * above symmetry conditions would be quite complex and time-consuming.
+ * Therefore BFQ evaluates instead the following stronger sub-conditions,
+ * for which it is much easier to maintain the needed state:
+ * 1) all active queues have the same weight,
+ * 2) all active groups have the same weight,
+ * 3) all active groups have at most one active child each.
+ * In particular, the last two conditions are always true if hierarchical
+ * support and the cgroups interface are not enabled, hence no state needs
+ * to be maintained in this case.
+ *
+ * According to the above considerations, the second component of the
+ * compound condition evaluates to true if any of the above symmetry
+ * sub-condition does not hold, or the device is not flash-based. Therefore,
+ * if also the first component is true, then idling is allowed for a sync
+ * queue. In contrast, if all the required symmetry sub-conditions hold and
+ * the device is flash-based, then the second component, and hence the
+ * whole compound condition, evaluates to false, and no idling is performed.
+ * This helps to keep the drives' internal queues full on NCQ-capable
+ * devices, and hence to boost the throughput, without causing 'almost' any
+ * loss of service guarantees. The 'almost' follows from the fact that, if
+ * the internal queue of one such device is filled while all the
+ * sub-conditions hold, but at some point in time some sub-condition stops
+ * to hold, then it may become impossible to let requests be served in the
+ * new desired order until all the requests already queued in the device
+ * have been served.
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
 	struct bfq_data *bfqd = bfqq->bfqd;
+#ifdef CONFIG_CGROUP_BFQIO
+#define symmetric_scenario	  (!bfqd->active_numerous_groups && \
+				   !bfq_differentiated_weights(bfqd))
+#else
+#define symmetric_scenario	  (!bfq_differentiated_weights(bfqd))
+#endif
 /*
  * Condition for expiring a non-weight-raised queue (and hence not idling
  * the device).
  */
 #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
-				   bfqd->wr_busy_queues > 0)
+				   (bfqd->wr_busy_queues > 0 || \
+				    (symmetric_scenario && \
+				     blk_queue_nonrot(bfqd->queue))))
 
 	return bfq_bfqq_sync(bfqq) && (
 		bfqq->wr_coeff > 1 ||
@@ -2821,6 +2996,10 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	bfqd->rq_in_driver--;
 	bfqq->dispatched--;
 
+	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq))
+		bfq_weights_tree_remove(bfqd, &bfqq->entity,
+					&bfqd->queue_weights_tree);
+
 	if (sync) {
 		bfqd->sync_flight--;
 		RQ_BIC(rq)->ttime.last_end_request = jiffies;
@@ -3195,11 +3374,17 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 
 	bfqd->root_group = bfqg;
 
+#ifdef CONFIG_CGROUP_BFQIO
+	bfqd->active_numerous_groups = 0;
+#endif
+
 	init_timer(&bfqd->idle_slice_timer);
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
 
 	bfqd->rq_pos_tree = RB_ROOT;
+	bfqd->queue_weights_tree = RB_ROOT;
+	bfqd->group_weights_tree = RB_ROOT;
 
 	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
 
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 73f453b..473b36a 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -308,6 +308,15 @@ up:
 	goto up;
 }
 
+static void bfq_weights_tree_add(struct bfq_data *bfqd,
+				 struct bfq_entity *entity,
+				 struct rb_root *root);
+
+static void bfq_weights_tree_remove(struct bfq_data *bfqd,
+				    struct bfq_entity *entity,
+				    struct rb_root *root);
+
+
 /**
  * bfq_active_insert - insert an entity in the active tree of its
  *                     group/device.
@@ -324,6 +333,11 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node = &entity->rb_node;
+#ifdef CONFIG_CGROUP_BFQIO
+	struct bfq_sched_data *sd = NULL;
+	struct bfq_group *bfqg = NULL;
+	struct bfq_data *bfqd = NULL;
+#endif
 
 	bfq_insert(&st->active, entity);
 
@@ -334,8 +348,22 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 
 	bfq_update_active_tree(node);
 
+#ifdef CONFIG_CGROUP_BFQIO
+	sd = entity->sched_data;
+	bfqg = container_of(sd, struct bfq_group, sched_data);
+	bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
 	if (bfqq != NULL)
 		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+#ifdef CONFIG_CGROUP_BFQIO
+	else /* bfq_group */
+		bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree);
+	if (bfqg != bfqd->root_group) {
+		bfqg->active_entities++;
+		if (bfqg->active_entities == 2)
+			bfqd->active_numerous_groups++;
+	}
+#endif
 }
 
 /**
@@ -411,6 +439,11 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node;
+#ifdef CONFIG_CGROUP_BFQIO
+	struct bfq_sched_data *sd = NULL;
+	struct bfq_group *bfqg = NULL;
+	struct bfq_data *bfqd = NULL;
+#endif
 
 	node = bfq_find_deepest(&entity->rb_node);
 	bfq_extract(&st->active, entity);
@@ -418,8 +451,23 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 	if (node != NULL)
 		bfq_update_active_tree(node);
 
+#ifdef CONFIG_CGROUP_BFQIO
+	sd = entity->sched_data;
+	bfqg = container_of(sd, struct bfq_group, sched_data);
+	bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
 	if (bfqq != NULL)
 		list_del(&bfqq->bfqq_list);
+#ifdef CONFIG_CGROUP_BFQIO
+	else /* bfq_group */
+		bfq_weights_tree_remove(bfqd, entity,
+					&bfqd->group_weights_tree);
+	if (bfqg != bfqd->root_group) {
+		bfqg->active_entities--;
+		if (bfqg->active_entities == 1)
+			bfqd->active_numerous_groups--;
+	}
+#endif
 }
 
 /**
@@ -515,6 +563,23 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 
 	if (entity->ioprio_changed) {
 		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+		unsigned short prev_weight, new_weight;
+		struct bfq_data *bfqd = NULL;
+		struct rb_root *root;
+#ifdef CONFIG_CGROUP_BFQIO
+		struct bfq_sched_data *sd;
+		struct bfq_group *bfqg;
+#endif
+
+		if (bfqq != NULL)
+			bfqd = bfqq->bfqd;
+#ifdef CONFIG_CGROUP_BFQIO
+		else {
+			sd = entity->my_sched_data;
+			bfqg = container_of(sd, struct bfq_group, sched_data);
+			bfqd = (struct bfq_data *)bfqg->bfqd;
+		}
+#endif
 
 		old_st->wsum -= entity->weight;
 
@@ -541,8 +606,31 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		 * when entity->finish <= old_st->vtime).
 		 */
 		new_st = bfq_entity_service_tree(entity);
-		entity->weight = entity->orig_weight *
-				 (bfqq != NULL ? bfqq->wr_coeff : 1);
+
+		prev_weight = entity->weight;
+		new_weight = entity->orig_weight *
+			     (bfqq != NULL ? bfqq->wr_coeff : 1);
+		/*
+		 * If the weight of the entity changes, remove the entity
+		 * from its old weight counter (if there is a counter
+		 * associated with the entity), and add it to the counter
+		 * associated with its new weight.
+		 */
+		if (prev_weight != new_weight) {
+			root = bfqq ? &bfqd->queue_weights_tree :
+				      &bfqd->group_weights_tree;
+			bfq_weights_tree_remove(bfqd, entity, root);
+		}
+		entity->weight = new_weight;
+		/*
+		 * Add the entity to its weights tree only if it is
+		 * not associated with a weight-raised queue.
+		 */
+		if (prev_weight != new_weight &&
+		    (bfqq ? bfqq->wr_coeff == 1 : 1))
+			/* If we get here, root has been initialized. */
+			bfq_weights_tree_add(bfqd, entity, root);
+
 		new_st->wsum += entity->weight;
 
 		if (new_st != old_st)
@@ -976,6 +1064,9 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
 
+	if (!bfqq->dispatched)
+		bfq_weights_tree_remove(bfqd, &bfqq->entity,
+					&bfqd->queue_weights_tree);
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues--;
 }
@@ -992,6 +1083,9 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues++;
 
+	if (!bfqq->dispatched && bfqq->wr_coeff == 1)
+		bfq_weights_tree_add(bfqd, &bfqq->entity,
+				     &bfqd->queue_weights_tree);
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues++;
 }
diff --git a/block/bfq.h b/block/bfq.h
index bda1ecb3..83c828d 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -81,8 +81,23 @@ struct bfq_sched_data {
 };
 
 /**
+ * struct bfq_weight_counter - counter of the number of all active entities
+ *                             with a given weight.
+ * @weight: weight of the entities that this counter refers to.
+ * @num_active: number of active entities with this weight.
+ * @weights_node: weights tree member (see bfq_data's @queue_weights_tree
+ *                and @group_weights_tree).
+ */
+struct bfq_weight_counter {
+	short int weight;
+	unsigned int num_active;
+	struct rb_node weights_node;
+};
+
+/**
  * struct bfq_entity - schedulable entity.
  * @rb_node: service_tree member.
+ * @weight_counter: pointer to the weight counter associated with this entity.
  * @on_st: flag, true if the entity is on a tree (either the active or
  *         the idle one of its service_tree).
  * @finish: B-WF2Q+ finish timestamp (aka F_i).
@@ -133,6 +148,7 @@ struct bfq_sched_data {
  */
 struct bfq_entity {
 	struct rb_node rb_node;
+	struct bfq_weight_counter *weight_counter;
 
 	int on_st;
 
@@ -306,6 +322,22 @@ enum bfq_device_speed {
  * @rq_pos_tree: rbtree sorted by next_request position, used when
  *               determining if two or more queues have interleaving
  *               requests (see bfq_close_cooperator()).
+ * @active_numerous_groups: number of bfq_groups containing more than one
+ *                          active @bfq_entity.
+ * @queue_weights_tree: rbtree of weight counters of @bfq_queues, sorted by
+ *                      weight. Used to keep track of whether all @bfq_queues
+ *                     have the same weight. The tree contains one counter
+ *                     for each distinct weight associated to some active
+ *                     and not weight-raised @bfq_queue (see the comments to
+ *                      the functions bfq_weights_tree_[add|remove] for
+ *                     further details).
+ * @group_weights_tree: rbtree of non-queue @bfq_entity weight counters, sorted
+ *                      by weight. Used to keep track of whether all
+ *                     @bfq_groups have the same weight. The tree contains
+ *                     one counter for each distinct weight associated to
+ *                     some active @bfq_group (see the comments to the
+ *                     functions bfq_weights_tree_[add|remove] for further
+ *                     details).
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @wr_busy_queues: number of weight-raised busy @bfq_queues.
@@ -374,6 +406,13 @@ struct bfq_data {
 	struct bfq_group *root_group;
 	struct rb_root rq_pos_tree;
 
+#ifdef CONFIG_CGROUP_BFQIO
+	int active_numerous_groups;
+#endif
+
+	struct rb_root queue_weights_tree;
+	struct rb_root group_weights_tree;
+
 	int busy_queues;
 	int wr_busy_queues;
 	int queued;
@@ -517,6 +556,11 @@ enum bfqq_expiration {
  * @my_entity: pointer to @entity, %NULL for the toplevel group; used
  *             to avoid too many special cases during group creation/
  *             migration.
+ * @active_entities: number of active entities belonging to the group;
+ *                   unused for the root group. Used to know whether there
+ *                   are groups with more than one active @bfq_entity
+ *                   (see the comments to the function
+ *                   bfq_bfqq_must_not_expire()).
  *
  * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
  * there is a set of bfq_groups, each one collecting the lower-level
@@ -542,6 +586,8 @@ struct bfq_group {
 	struct bfq_queue *async_idle_bfqq;
 
 	struct bfq_entity *my_entity;
+
+	int active_entities;
 };
 
 /**
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 14/14] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs
       [not found] ` <1401194558-5283-1-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
                     ` (12 preceding siblings ...)
  2014-05-27 12:42     ` paolo
@ 2014-05-27 12:42   ` paolo
  2014-05-30 15:32     ` Vivek Goyal
  2014-05-30 17:31     ` Vivek Goyal
  15 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

From: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>

This patch is basically the counterpart of patch 13 for NCQ-capable
rotational devices. Exactly as patch 13 does on flash-based devices
and for any workload, this patch disables device idling on rotational
devices, but only for random I/O. More precisely, idling is disabled
only for constantly-seeky queues (see patch 7). In fact, only for
these queues does disabling idling boost the throughput on NCQ-capable
rotational devices.

To avoid breaking service guarantees, on NCQ-enabled rotational
devices idling is disabled for constantly-seeky queues only when the
same symmetry conditions as in patch 13 hold, plus an additional one. The
additional condition is related to the fact that this patch disables
idling only for constantly-seeky queues. In fact, should idling be
disabled for a constantly-seeky queue while some other
non-constantly-seeky queue has pending requests, the latter queue
would get more requests served, after being set as in service, than
the former. This differentiated treatment would cause a deviation with
respect to the desired throughput distribution (i.e., with respect to
the throughput distribution corresponding to the weights assigned to
processes and groups of processes).  For this reason, the additional
condition for disabling idling for a constantly-seeky queue is that
all queues with pending or in-flight requests are constantly seeky.
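
For illustration, here is a minimal sketch of this additional
condition. The field names mirror the counters the patch adds to
struct bfq_data (busy_in_flight_queues and
const_seeky_busy_in_flight_queues), but the struct and the helper are
invented for the example and are not the scheduler's code.

#include <stdbool.h>

struct example_counters {
	int busy_in_flight_queues;		/* queues with pending or in-flight I/O */
	int const_seeky_busy_in_flight_queues;	/* of those, the constantly-seeky ones */
};

/*
 * On an NCQ-capable rotational device, idling may be skipped for the
 * in-service queue only if that queue is constantly seeky and every
 * busy queue with pending or in-flight requests is constantly seeky
 * too, i.e., no queue that could benefit from idling is penalized.
 */
static bool may_skip_idling_on_hdd(const struct example_counters *c,
				   bool in_service_constantly_seeky)
{
	return in_service_constantly_seeky &&
	       c->busy_in_flight_queues ==
	       c->const_seeky_busy_in_flight_queues;
}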

Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/bfq-iosched.c | 79 +++++++++++++++++++++++++++++++++++++++++------------
 block/bfq-sched.c   | 21 +++++++++++---
 block/bfq.h         | 29 +++++++++++++++++++-
 3 files changed, 107 insertions(+), 22 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 49856e1..b9aafa5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1910,8 +1910,12 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 
 	bfqq->service_from_backlogged += bfqq->entity.service;
 
-	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
+	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
+	    !bfq_bfqq_constantly_seeky(bfqq)) {
 		bfq_mark_bfqq_constantly_seeky(bfqq);
+		if (!blk_queue_nonrot(bfqd->queue))
+			bfqd->const_seeky_busy_in_flight_queues++;
+	}
 
 	if (bfqd->low_latency && bfqq->wr_coeff == 1)
 		bfqq->last_wr_start_finish = jiffies;
@@ -2071,7 +2075,8 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * Even in such a scenario, sequential I/O may still receive a preferential
  * treatment, but this is not likely to be a big issue with flash-based
  * devices, because of their non-dramatic loss of throughput with random
- * I/O.
+ * I/O. Things do differ with HDDs, for which additional care is taken, as
+ * explained after completing the discussion for flash-based devices.
  *
  * Unfortunately, keeping the necessary state for evaluating exactly the
  * above symmetry conditions would be quite complex and time-consuming.
@@ -2088,17 +2093,42 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * compound condition evaluates to true if any of the above symmetry
  * sub-condition does not hold, or the device is not flash-based. Therefore,
  * if also the first component is true, then idling is allowed for a sync
- * queue. In contrast, if all the required symmetry sub-conditions hold and
- * the device is flash-based, then the second component, and hence the
- * whole compound condition, evaluates to false, and no idling is performed.
- * This helps to keep the drives' internal queues full on NCQ-capable
- * devices, and hence to boost the throughput, without causing 'almost' any
- * loss of service guarantees. The 'almost' follows from the fact that, if
- * the internal queue of one such device is filled while all the
- * sub-conditions hold, but at some point in time some sub-condition stops
- * to hold, then it may become impossible to let requests be served in the
- * new desired order until all the requests already queued in the device
- * have been served.
+ * queue. These are the only sub-conditions considered if the device is
+ * flash-based, as, for such a device, it is sensible to force idling only
+ * for service-guarantee issues. In fact, as for throughput, idling
+ * NCQ-capable flash-based devices would not boost the throughput even
+ * with sequential I/O; rather it would lower the throughput in proportion
+ * to how fast the device is. In the end, (only) if all the three
+ * sub-conditions hold and the device is flash-based, the compound
+ * condition evaluates to false and therefore no idling is performed.
+ *
+ * As already said, things change with a rotational device, where idling
+ * boosts the throughput with sequential I/O (even with NCQ). Hence, for
+ * such a device the second component of the compound condition evaluates
+ * to true also if the following additional sub-condition does not hold:
+ * the queue is constantly seeky. Unfortunately, this different behavior
+ * with respect to flash-based devices causes an additional asymmetry: if
+ * some sync queues enjoy idling and some other sync queues do not, then
+ * the latter get a low share of the device throughput, simply because the
+ * former get many requests served after being set as in service, whereas
+ * the latter do not. As a consequence, to guarantee the desired throughput
+ * distribution, on HDDs the compound expression evaluates to true (and
+ * hence device idling is performed) also if the following last symmetry
+ * condition does not hold: no other queue is benefiting from idling. Also
+ * this last condition is actually replaced with a simpler-to-maintain and
+ * stronger condition: there is no busy queue which is not constantly seeky
+ * (and hence may also benefit from idling).
+ *
+ * To sum up, when all the required symmetry and throughput-boosting
+ * sub-conditions hold, the second component of the compound condition
+ * evaluates to false, and hence no idling is performed. This helps to
+ * keep the drives' internal queues full on NCQ-capable devices, and hence
+ * to boost the throughput, without causing 'almost' any loss of service
+ * guarantees. The 'almost' follows from the fact that, if the internal
+ * queue of one such device is filled while all the sub-conditions hold,
+ * but at some point in time some sub-condition stops to hold, then it may
+ * become impossible to let requests be served in the new desired order
+ * until all the requests already queued in the device have been served.
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
@@ -2109,6 +2139,9 @@ static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 #else
 #define symmetric_scenario	  (!bfq_differentiated_weights(bfqd))
 #endif
+#define cond_for_seeky_on_ncq_hdd (bfq_bfqq_constantly_seeky(bfqq) && \
+				   bfqd->busy_in_flight_queues == \
+				   bfqd->const_seeky_busy_in_flight_queues)
 /*
  * Condition for expiring a non-weight-raised queue (and hence not idling
  * the device).
@@ -2116,7 +2149,8 @@ static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
 				   (bfqd->wr_busy_queues > 0 || \
 				    (symmetric_scenario && \
-				     blk_queue_nonrot(bfqd->queue))))
+				     (blk_queue_nonrot(bfqd->queue) || \
+				      cond_for_seeky_on_ncq_hdd))))
 
 	return bfq_bfqq_sync(bfqq) && (
 		bfqq->wr_coeff > 1 ||
@@ -2843,8 +2877,11 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_update_io_thinktime(bfqd, bic);
 	bfq_update_io_seektime(bfqd, bfqq, rq);
-	if (!BFQQ_SEEKY(bfqq))
+	if (!BFQQ_SEEKY(bfqq) && bfq_bfqq_constantly_seeky(bfqq)) {
 		bfq_clear_bfqq_constantly_seeky(bfqq);
+		if (!blk_queue_nonrot(bfqd->queue))
+			bfqd->const_seeky_busy_in_flight_queues--;
+	}
 	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
 	    !BFQQ_SEEKY(bfqq))
 		bfq_update_idle_window(bfqd, bfqq, bic);
@@ -2996,9 +3033,15 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	bfqd->rq_in_driver--;
 	bfqq->dispatched--;
 
-	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq))
+	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
 		bfq_weights_tree_remove(bfqd, &bfqq->entity,
 					&bfqd->queue_weights_tree);
+		if (!blk_queue_nonrot(bfqd->queue)) {
+			bfqd->busy_in_flight_queues--;
+			if (bfq_bfqq_constantly_seeky(bfqq))
+				bfqd->const_seeky_busy_in_flight_queues--;
+		}
+	}
 
 	if (sync) {
 		bfqd->sync_flight--;
@@ -3420,6 +3463,8 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 					      * video.
 					      */
 	bfqd->wr_busy_queues = 0;
+	bfqd->busy_in_flight_queues = 0;
+	bfqd->const_seeky_busy_in_flight_queues = 0;
 
 	/*
 	 * Begin by assuming, optimistically, that the device peak rate is
@@ -3739,7 +3784,7 @@ static int __init bfq_init(void)
 	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v6");
+	pr_info("BFQ I/O-scheduler version: v7r4");
 
 	return 0;
 }
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 473b36a..afc4c23 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -1064,9 +1064,15 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
 
-	if (!bfqq->dispatched)
+	if (!bfqq->dispatched) {
 		bfq_weights_tree_remove(bfqd, &bfqq->entity,
 					&bfqd->queue_weights_tree);
+		if (!blk_queue_nonrot(bfqd->queue)) {
+			bfqd->busy_in_flight_queues--;
+			if (bfq_bfqq_constantly_seeky(bfqq))
+				bfqd->const_seeky_busy_in_flight_queues--;
+		}
+	}
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues--;
 }
@@ -1083,9 +1089,16 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues++;
 
-	if (!bfqq->dispatched && bfqq->wr_coeff == 1)
-		bfq_weights_tree_add(bfqd, &bfqq->entity,
-				     &bfqd->queue_weights_tree);
+	if (!bfqq->dispatched) {
+		if (bfqq->wr_coeff == 1)
+			bfq_weights_tree_add(bfqd, &bfqq->entity,
+					     &bfqd->queue_weights_tree);
+		if (!blk_queue_nonrot(bfqd->queue)) {
+			bfqd->busy_in_flight_queues++;
+			if (bfq_bfqq_constantly_seeky(bfqq))
+				bfqd->const_seeky_busy_in_flight_queues++;
+		}
+	}
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues++;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 83c828d..f4c702c 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v6 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v7r4 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@@ -340,6 +340,31 @@ enum bfq_device_speed {
  *                     details).
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
+ * @busy_in_flight_queues: number of @bfq_queues containing pending or
+ *                         in-flight requests, plus the @bfq_queue in
+ *                         service, even if idle but waiting for the
+ *                         possible arrival of its next sync request. This
+ *                         field is updated only if the device is rotational,
+ *                         but used only if the device is also NCQ-capable.
+ *                         The reason why the field is updated also for non-
+ *                         NCQ-capable rotational devices is related to the
+ *                         fact that the value of @hw_tag may be set also
+ *                         later than when busy_in_flight_queues may need to
+ *                         be incremented for the first time(s). Taking also
+ *                         this possibility into account, to avoid unbalanced
+ *                         increments/decrements, would imply more overhead
+ *                         than just updating busy_in_flight_queues
+ *                         regardless of the value of @hw_tag.
+ * @const_seeky_busy_in_flight_queues: number of constantly-seeky @bfq_queues
+ *                                     (that is, seeky queues that expired
+ *                                     for budget timeout at least once)
+ *                                     containing pending or in-flight
+ *                                     requests, including the in-service
+ *                                     @bfq_queue if constantly seeky. This
+ *                                     field is updated only if the device
+ *                                     is rotational, but used only if the
+ *                                     device is also NCQ-capable (see the
+ *                                     comments to @busy_in_flight_queues).
  * @wr_busy_queues: number of weight-raised busy @bfq_queues.
  * @queued: number of queued requests.
  * @rq_in_driver: number of requests dispatched and waiting for completion.
@@ -414,6 +439,8 @@ struct bfq_data {
 	struct rb_root group_weights_tree;
 
 	int busy_queues;
+	int busy_in_flight_queues;
+	int const_seeky_busy_in_flight_queues;
 	int wr_busy_queues;
 	int queued;
 	int rq_in_driver;
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC RESEND 14/14] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs
@ 2014-05-27 12:42   ` paolo
  0 siblings, 0 replies; 247+ messages in thread
From: paolo @ 2014-05-27 12:42 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

From: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>

This patch is basically the counterpart of patch 13 for NCQ-capable
rotational devices. Exactly as patch 13 does on flash-based devices
and for any workload, this patch disables device idling on rotational
devices, but only for random I/O. More precisely, idling is disabled
only for constantly-seeky queues (see patch 7). In fact, only for
these queues does disabling idling boost the throughput on NCQ-capable
rotational devices.

To avoid breaking service guarantees, on NCQ-enabled rotational
devices idling is disabled for constantly-seeky queues only when the
same symmetry conditions as in patch 13 hold, plus an additional one. The
additional condition is related to the fact that this patch disables
idling only for constantly-seeky queues. In fact, should idling be
disabled for a constantly-seeky queue while some other
non-constantly-seeky queue has pending requests, the latter queue
would get more requests served, after being set as in service, than
the former. This differentiated treatment would cause a deviation with
respect to the desired throughput distribution (i.e., with respect to
the throughput distribution corresponding to the weights assigned to
processes and groups of processes).  For this reason, the additional
condition for disabling idling for a constantly-seeky queue is that
all queues with pending or in-flight requests are constantly seeky.

Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/bfq-iosched.c | 79 +++++++++++++++++++++++++++++++++++++++++------------
 block/bfq-sched.c   | 21 +++++++++++---
 block/bfq.h         | 29 +++++++++++++++++++-
 3 files changed, 107 insertions(+), 22 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 49856e1..b9aafa5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1910,8 +1910,12 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 
 	bfqq->service_from_backlogged += bfqq->entity.service;
 
-	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
+	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
+	    !bfq_bfqq_constantly_seeky(bfqq)) {
 		bfq_mark_bfqq_constantly_seeky(bfqq);
+		if (!blk_queue_nonrot(bfqd->queue))
+			bfqd->const_seeky_busy_in_flight_queues++;
+	}
 
 	if (bfqd->low_latency && bfqq->wr_coeff == 1)
 		bfqq->last_wr_start_finish = jiffies;
@@ -2071,7 +2075,8 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * Even in such a scenario, sequential I/O may still receive a preferential
  * treatment, but this is not likely to be a big issue with flash-based
  * devices, because of their non-dramatic loss of throughput with random
- * I/O.
+ * I/O. Things do differ with HDDs, for which additional care is taken, as
+ * explained after completing the discussion for flash-based devices.
  *
  * Unfortunately, keeping the necessary state for evaluating exactly the
  * above symmetry conditions would be quite complex and time-consuming.
@@ -2088,17 +2093,42 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * compound condition evaluates to true if any of the above symmetry
  * sub-condition does not hold, or the device is not flash-based. Therefore,
  * if also the first component is true, then idling is allowed for a sync
- * queue. In contrast, if all the required symmetry sub-conditions hold and
- * the device is flash-based, then the second component, and hence the
- * whole compound condition, evaluates to false, and no idling is performed.
- * This helps to keep the drives' internal queues full on NCQ-capable
- * devices, and hence to boost the throughput, without causing 'almost' any
- * loss of service guarantees. The 'almost' follows from the fact that, if
- * the internal queue of one such device is filled while all the
- * sub-conditions hold, but at some point in time some sub-condition stops
- * to hold, then it may become impossible to let requests be served in the
- * new desired order until all the requests already queued in the device
- * have been served.
+ * queue. These are the only sub-conditions considered if the device is
+ * flash-based, as, for such a device, it is sensible to force idling only
+ * for service-guarantee issues. In fact, as for throughput, idling
+ * NCQ-capable flash-based devices would not boost the throughput even
+ * with sequential I/O; rather it would lower the throughput in proportion
+ * to how fast the device is. In the end, (only) if all the three
+ * sub-conditions hold and the device is flash-based, the compound
+ * condition evaluates to false and therefore no idling is performed.
+ *
+ * As already said, things change with a rotational device, where idling
+ * boosts the throughput with sequential I/O (even with NCQ). Hence, for
+ * such a device the second component of the compound condition evaluates
+ * to true also if the following additional sub-condition does not hold:
+ * the queue is constantly seeky. Unfortunately, this different behavior
+ * with respect to flash-based devices causes an additional asymmetry: if
+ * some sync queues enjoy idling and some other sync queues do not, then
+ * the latter get a low share of the device throughput, simply because the
+ * former get many requests served after being set as in service, whereas
+ * the latter do not. As a consequence, to guarantee the desired throughput
+ * distribution, on HDDs the compound expression evaluates to true (and
+ * hence device idling is performed) also if the following last symmetry
+ * condition does not hold: no other queue is benefiting from idling. Also
+ * this last condition is actually replaced with a simpler-to-maintain and
+ * stronger condition: there is no busy queue which is not constantly seeky
+ * (and hence may also benefit from idling).
+ *
+ * To sum up, when all the required symmetry and throughput-boosting
+ * sub-conditions hold, the second component of the compound condition
+ * evaluates to false, and hence no idling is performed. This helps to
+ * keep the drives' internal queues full on NCQ-capable devices, and hence
+ * to boost the throughput, without causing 'almost' any loss of service
+ * guarantees. The 'almost' follows from the fact that, if the internal
+ * queue of one such device is filled while all the sub-conditions hold,
+ * but at some point in time some sub-condition ceases to hold, then it may
+ * become impossible to let requests be served in the new desired order
+ * until all the requests already queued in the device have been served.
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
@@ -2109,6 +2139,9 @@ static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 #else
 #define symmetric_scenario	  (!bfq_differentiated_weights(bfqd))
 #endif
+#define cond_for_seeky_on_ncq_hdd (bfq_bfqq_constantly_seeky(bfqq) && \
+				   bfqd->busy_in_flight_queues == \
+				   bfqd->const_seeky_busy_in_flight_queues)
 /*
  * Condition for expiring a non-weight-raised queue (and hence not idling
  * the device).
@@ -2116,7 +2149,8 @@ static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
 				   (bfqd->wr_busy_queues > 0 || \
 				    (symmetric_scenario && \
-				     blk_queue_nonrot(bfqd->queue))))
+				     (blk_queue_nonrot(bfqd->queue) || \
+				      cond_for_seeky_on_ncq_hdd))))
 
 	return bfq_bfqq_sync(bfqq) && (
 		bfqq->wr_coeff > 1 ||
@@ -2843,8 +2877,11 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_update_io_thinktime(bfqd, bic);
 	bfq_update_io_seektime(bfqd, bfqq, rq);
-	if (!BFQQ_SEEKY(bfqq))
+	if (!BFQQ_SEEKY(bfqq) && bfq_bfqq_constantly_seeky(bfqq)) {
 		bfq_clear_bfqq_constantly_seeky(bfqq);
+		if (!blk_queue_nonrot(bfqd->queue))
+			bfqd->const_seeky_busy_in_flight_queues--;
+	}
 	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
 	    !BFQQ_SEEKY(bfqq))
 		bfq_update_idle_window(bfqd, bfqq, bic);
@@ -2996,9 +3033,15 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	bfqd->rq_in_driver--;
 	bfqq->dispatched--;
 
-	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq))
+	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
 		bfq_weights_tree_remove(bfqd, &bfqq->entity,
 					&bfqd->queue_weights_tree);
+		if (!blk_queue_nonrot(bfqd->queue)) {
+			bfqd->busy_in_flight_queues--;
+			if (bfq_bfqq_constantly_seeky(bfqq))
+				bfqd->const_seeky_busy_in_flight_queues--;
+		}
+	}
 
 	if (sync) {
 		bfqd->sync_flight--;
@@ -3420,6 +3463,8 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 					      * video.
 					      */
 	bfqd->wr_busy_queues = 0;
+	bfqd->busy_in_flight_queues = 0;
+	bfqd->const_seeky_busy_in_flight_queues = 0;
 
 	/*
 	 * Begin by assuming, optimistically, that the device peak rate is
@@ -3739,7 +3784,7 @@ static int __init bfq_init(void)
 	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v6");
+	pr_info("BFQ I/O-scheduler version: v7r4");
 
 	return 0;
 }
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 473b36a..afc4c23 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -1064,9 +1064,15 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
 
-	if (!bfqq->dispatched)
+	if (!bfqq->dispatched) {
 		bfq_weights_tree_remove(bfqd, &bfqq->entity,
 					&bfqd->queue_weights_tree);
+		if (!blk_queue_nonrot(bfqd->queue)) {
+			bfqd->busy_in_flight_queues--;
+			if (bfq_bfqq_constantly_seeky(bfqq))
+				bfqd->const_seeky_busy_in_flight_queues--;
+		}
+	}
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues--;
 }
@@ -1083,9 +1089,16 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues++;
 
-	if (!bfqq->dispatched && bfqq->wr_coeff == 1)
-		bfq_weights_tree_add(bfqd, &bfqq->entity,
-				     &bfqd->queue_weights_tree);
+	if (!bfqq->dispatched) {
+		if (bfqq->wr_coeff == 1)
+			bfq_weights_tree_add(bfqd, &bfqq->entity,
+					     &bfqd->queue_weights_tree);
+		if (!blk_queue_nonrot(bfqd->queue)) {
+			bfqd->busy_in_flight_queues++;
+			if (bfq_bfqq_constantly_seeky(bfqq))
+				bfqd->const_seeky_busy_in_flight_queues++;
+		}
+	}
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues++;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 83c828d..f4c702c 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v6 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v7r4 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@@ -340,6 +340,31 @@ enum bfq_device_speed {
  *                     details).
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
+ * @busy_in_flight_queues: number of @bfq_queues containing pending or
+ *                         in-flight requests, plus the @bfq_queue in
+ *                         service, even if idle but waiting for the
+ *                         possible arrival of its next sync request. This
+ *                         field is updated only if the device is rotational,
+ *                         but used only if the device is also NCQ-capable.
+ *                         The reason why the field is updated also for non-
+ *                         NCQ-capable rotational devices is related to the
+ *                         fact that the value of @hw_tag may be set also
+ *                         later than when busy_in_flight_queues may need to
+ *                         be incremented for the first time(s). Taking also
+ *                         this possibility into account, to avoid unbalanced
+ *                         increments/decrements, would imply more overhead
+ *                         than just updating busy_in_flight_queues
+ *                         regardless of the value of @hw_tag.
+ * @const_seeky_busy_in_flight_queues: number of constantly-seeky @bfq_queues
+ *                                     (that is, seeky queues that expired
+ *                                     for budget timeout at least once)
+ *                                     containing pending or in-flight
+ *                                     requests, including the in-service
+ *                                     @bfq_queue if constantly seeky. This
+ *                                     field is updated only if the device
+ *                                     is rotational, but used only if the
+ *                                     device is also NCQ-capable (see the
+ *                                     comments to @busy_in_flight_queues).
  * @wr_busy_queues: number of weight-raised busy @bfq_queues.
  * @queued: number of queued requests.
  * @rq_in_driver: number of requests dispatched and waiting for completion.
@@ -414,6 +439,8 @@ struct bfq_data {
 	struct rb_root group_weights_tree;
 
 	int busy_queues;
+	int busy_in_flight_queues;
+	int const_seeky_busy_in_flight_queues;
 	int wr_busy_queues;
 	int queued;
 	int rq_in_driver;
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC RESEND 01/14] block: kconfig update and build bits for BFQ
       [not found]     ` <1401194558-5283-2-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-05-28 22:19       ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-28 22:19 UTC (permalink / raw)
  To: paolo
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

On Tue, May 27, 2014 at 02:42:25PM +0200, paolo wrote:
> diff --git a/block/Makefile b/block/Makefile
> index 20645e8..cbd83fb 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -16,6 +16,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
>  obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
>  obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
>  obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
> +obj-$(CONFIG_IOSCHED_BFQ)	+= bfq-iosched.o

Doesn't this break builds where BFQ is enabled including all-y/m?
Please make the config / makefile changes with the actual
implementation.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-05-28 22:19       ` Tejun Heo
@ 2014-05-29  9:05           ` Paolo Valente
  -1 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

This is a rearranged version of the original patchset, according to the
recommendations of Tejun Heo:
https://lkml.org/lkml/2014/5/28/703

------------------------------------------------------------------------

Hi,
this patchset introduces the last version of BFQ, a proportional-share
storage-I/O scheduler. BFQ also supports hierarchical scheduling with
a cgroups interface. The first version of BFQ was submitted a few
years ago [1]. It is denoted as v0 in the patches, to distinguish it
from the version I am submitting now, v7r4. In particular, the first
two patches introduce BFQ-v0, whereas the remaining patches turn it
progressively into BFQ-v7r4. Here are some nice features of this last
version.

Low latency for interactive applications

According to our results, regardless of the actual background
workload, for interactive tasks the storage device is virtually as
responsive as if it was idle. For example, even if one or more of the
following background workloads are being executed:
- one or more large files are being read or written,
- a tree of source files is being compiled,
- one or more virtual machines are performing I/O,
- a software update is in progress,
- indexing daemons are scanning filesystems and updating their
  databases,
starting an application or loading a file from within an application
takes about the same time as if the storage device was idle. As a
comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
applications experience high latencies, or even become unresponsive
until the background workload terminates (also on SSDs).

Low latency for soft real-time applications

Also soft real-time applications, such as audio and video
players/streamers, enjoy low latency and drop rate, regardless of the
storage-device background workload. As a consequence, these
applications do not suffer from almost any glitch due to the
background workload.

High throughput

On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
up to 150% higher throughput than DEADLINE and NOOP, with half of the
parallel workloads considered in our tests. With the rest of the
workloads, and with all the workloads on flash-based devices, BFQ
achieves instead about the same throughput as the other schedulers.

Strong fairness guarantees (already provided by BFQ-v0)

As for long-term guarantees, BFQ distributes the device throughput
(and not just the device time) as desired to I/O-bound applications,
with any workload and regardless of the device parameters.

What allows BFQ to provide the above features is its accurate
scheduling engine (patches 1-2), combined with a set of simple
heuristics and improvements (patches 3-12).  A 15-minute demo of the
performance of BFQ is available here [2]. I made this demo with an
older version of BFQ (v3r4) and under Linux 3.4.0. We have further
improved BFQ since then. The performance of the last version of BFQ is
shown, e.g., through some graphs here [3], under 3.14.0, compared
against CFQ, DEADLINE and NOOP, and on: a fast and a slow hard disk, a
RAID1, an SSD, a microSDHC Card and an eMMC. As an example, our
results on the SSD are reported also in a table at the end of this
email. Finally, details on how BFQ and its components work are
provided in the descriptions of the patches. An organic description of
the main BFQ algorithm and of most of its features can instead be
found in this paper [4].

Finally, as for testing in everyday use, BFQ is the default I/O
scheduler in, e.g., Manjaro, Sabayon, OpenMandriva and Arch Linux ARM
in some NAS boxes, plus several kernel forks for PCs and
smartphones. BFQ is instead optionally available in, e.g., Arch,
PCLinuxOS and Gentoo. In addition, we record a few tens of downloads
per day from people using other distributions. The feedback received
so far basically confirms the expected latency drop and throughput
boost.

Paolo

Results on a Plextor PX-256M5S SSD

The first two rows of the next table report the aggregate throughput
achieved by BFQ, CFQ, DEADLINE and NOOP, while ten parallel processes
read, either sequentially or randomly, a separate portion of the
memory blocks each. These processes read directly from the device, and
no process performs writes, to avoid writing large files repeatedly
and wearing out the SSD during the many tests done. As can be seen,
all schedulers achieve about the same throughput with sequential
readers, whereas, with random readers, the throughput slightly grows
as the complexity, and hence the execution time, of the schedulers
decreases. In fact, with random readers, the number of IOPS is
much higher, and all CPUs spend all their time either executing
instructions or waiting for I/O (the total idle percentage is
0). Therefore, the processing time of I/O requests influences the
maximum throughput achievable.

The remaining rows report the cold-cache start-up time experienced by
various applications while one of the above two workloads is being
executed in parallel. In particular, "Start-up time 10 seq/rand"
stands for "Start-up time of the application at hand while 10
sequential/random readers are running". A timeout fires, and the test
is aborted, if the application does not start within 60 seconds; so,
in the table, '>60' means that the application did not start before
the timeout fired.

With sequential readers, the performance gap between BFQ and the other
schedulers is remarkable. Background workloads are intentionally very
heavy, to show the performance of the schedulers in somewhat extreme
conditions. The differences are, however, still significant with
lighter workloads, as shown, e.g., here [3] for slower devices.

-----------------------------------------------------------------------------
|                      SCHEDULER                    |        Test           |
-----------------------------------------------------------------------------
|    BFQ     |    CFQ     |  DEADLINE  |    NOOP    |                       |
-----------------------------------------------------------------------------
|            |            |            |            | Aggregate Throughput  |
|            |            |            |            |       [MB/s]          |
|    399     |    400     |    400     |    400     |  10 raw seq. readers  |
|    191     |    193     |    202     |    203     | 10 raw random readers |
-----------------------------------------------------------------------------
|            |            |            |            | Start-up time 10 seq  |
|            |            |            |            |       [sec]           |
|    0.21    |    >60     |    1.91    |    1.88    |      xterm            |
|    0.93    |    >60     |    10.2    |    10.8    |     oowriter          |
|    0.89    |    >60     |    29.7    |    30.0    |      konsole          |
-----------------------------------------------------------------------------
|            |            |            |            | Start-up time 10 rand |
|            |            |            |            |       [sec]           |
|    0.20    |    0.30    |    0.21    |    0.21    |      xterm            |
|    0.81    |    3.28    |    0.80    |    0.81    |     oowriter          |
|    0.88    |    2.90    |    1.02    |    1.00    |      konsole          |
-----------------------------------------------------------------------------

[1] https://lkml.org/lkml/2008/4/1/234
    https://lkml.org/lkml/2008/11/11/148

[2] http://youtu.be/J-e7LnJblm8

[3] http://www.algogroup.unimo.it/people/paolo/disk_sched/results.php

[4] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Arianna Avanzini (1):
  block, bfq: add Early Queue Merge (EQM)

Fabio Checconi (2):
  block: introduce the BFQ-v0 I/O scheduler
  block, bfq: add full hierarchical scheduling and cgroups support

Paolo Valente (9):
  block, bfq: improve throughput boosting
  block, bfq: modify the peak-rate estimator
  block, bfq: add more fairness to boost throughput and reduce latency
  block, bfq: improve responsiveness
  block, bfq: reduce I/O latency for soft real-time applications
  block, bfq: preserve a low latency also with NCQ-capable drives
  block, bfq: reduce latency during request-pool saturation
  block, bfq: boost the throughput on NCQ-capable flash-based devices
  block, bfq: boost the throughput with random I/O on NCQ-capable HDDs

 block/Kconfig.iosched         |   32 +
 block/Makefile                |    1 +
 block/bfq-cgroup.c            |  909 ++++++++++
 block/bfq-ioc.c               |   36 +
 block/bfq-iosched.c           | 3802 +++++++++++++++++++++++++++++++++++++++++
 block/bfq-sched.c             | 1104 ++++++++++++
 block/bfq.h                   |  729 ++++++++
 include/linux/cgroup_subsys.h |    4 +
 8 files changed, 6617 insertions(+)
 create mode 100644 block/bfq-cgroup.c
 create mode 100644 block/bfq-ioc.c
 create mode 100644 block/bfq-iosched.c
 create mode 100644 block/bfq-sched.c
 create mode 100644 block/bfq.h

-- 
1.9.2

^ permalink raw reply	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 01/12] block: introduce the BFQ-v0 I/O scheduler
  2014-05-29  9:05           ` Paolo Valente
@ 2014-05-29  9:05               ` Paolo Valente
  -1 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

From: Fabio Checconi <fchecconi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.

- Each process doing I/O on a device is associated with a weight and a
  (bfq_)queue.

- BFQ grants exclusive access to the device, for a while, to one queue
  (process) at a time, and implements this service model by
  associating every queue with a budget, measured in number of
  sectors.

  - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

  - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
      holding the device for too long and dramatically reducing
      throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
      sync requests may not be expired immediately when it empties. In
      contrast, BFQ may idle the device for a short time interval,
      giving the process the chance to go on being served if it issues
      a new request in time. Device idling typically boosts the
      throughput on rotational devices, if processes do synchronous
      and sequential I/O. Besides, under BFQ, device idling is also
      instrumental in guaranteeing the desired throughput fraction to
      processes issuing sync requests (see [1] for details).

  - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity.  See [1] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling. However, for a cleaner
    logical breakdown, the code that enables and completes
    hierarchical support is provided in patch 4, which focuses exactly
    on this feature.

  - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

  - The last, budget-independence, property (although probably
    counterintuitive in the first place) is definitely beneficial, for
    the following reasons.

    - First, with any proportional-share scheduler, the maximum
      deviation with respect to an ideal service is proportional to
      the maximum budget (slice) assigned to queues. As a consequence,
      BFQ can keep this deviation tight not only because of the
      accurate service of B-WF2Q+, but also because BFQ *does not*
      need to assign a larger budget to a queue to let the queue
      receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
      budget that best fits the needs of the process, or best
      leverages the I/O pattern of the process. In particular, BFQ
      updates queue budgets with a simple feedback-loop algorithm that
      allows a high throughput to be achieved, while still providing
      tight latency guarantees to time-sensitive applications. When
      the in-service queue expires, this algorithm computes the next
      budget of the queue so as to:

      - Let large budgets be eventually assigned to the queues
        associated with I/O-bound applications performing sequential
        I/O: in fact, the longer these applications are served once
        got access to the device, the higher the throughput is.

      - Let small budgets be eventually assigned to the queues
        associated with time-sensitive applications (which typically
        perform sporadic and short I/O), because, the smaller the
        budget assigned to a queue waiting for service is, the sooner
        B-WF2Q+ will serve that queue (Subsec 3.3 in [1]).

- Weights can be assigned to processes only indirectly, through I/O
  priorities, and according to the relation: weight = IOPRIO_BE_NR -
  ioprio. The next two patches provide instead a cgroups interface
  through which weights can be assigned explicitly.

- ioprio classes are served in strict priority order, i.e.,
  lower-priority queues are not served as long as there are
  higher-priority queues.  Among queues in the same class, the
  bandwidth is distributed in proportion to the weight of each
  queue. A very thin extra bandwidth is however guaranteed to the Idle
  class, to prevent it from starving.

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
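
As a rough illustration of the budget accounting described above, here
is a minimal user-space sketch; it is not part of the patch, the names
and types are hypothetical stand-ins for the actual bfq structures, and
both the B-WF2Q+ scheduler and the budget feedback loop are omitted:

#include <stdbool.h>

/* Hypothetical, simplified stand-in for a bfq queue. */
struct queue {
	int  weight;    /* IOPRIO_BE_NR - ioprio, e.g. 8 - 4 = 4 for ioprio 4 */
	long budget;    /* sectors the queue may still consume */
	int  nr_queued; /* requests still waiting in this queue */
	bool timed_out; /* the budget timeout has fired */
};

enum expire_reason { NO_EXPIRE, BUDGET_EXHAUSTED, QUEUE_EMPTY, BUDGET_TIMEOUT };

/*
 * Charge one dispatched request to the in-service queue and decide
 * whether the queue must be expired, according to the three expiration
 * events listed above.
 */
static enum expire_reason charge_and_check(struct queue *q, long rq_sectors)
{
	q->budget -= rq_sectors;
	q->nr_queued--;

	if (q->budget <= 0)
		return BUDGET_EXHAUSTED;
	if (q->nr_queued == 0)
		return QUEUE_EMPTY;    /* a sync queue may first be idled briefly */
	if (q->timed_out)
		return BUDGET_TIMEOUT;
	return NO_EXPIRE;
}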

Signed-off-by: Fabio Checconi <fchecconi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/Kconfig.iosched |   19 +
 block/Makefile        |    1 +
 block/bfq-ioc.c       |   34 +
 block/bfq-iosched.c   | 2297 +++++++++++++++++++++++++++++++++++++++++++++++++
 block/bfq-sched.c     |  936 ++++++++++++++++++++
 block/bfq.h           |  467 ++++++++++
 6 files changed, 3754 insertions(+)
 create mode 100644 block/bfq-ioc.c
 create mode 100644 block/bfq-iosched.c
 create mode 100644 block/bfq-sched.c
 create mode 100644 block/bfq.h

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 421bef9..8f98cc7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -39,6 +39,15 @@ config CFQ_GROUP_IOSCHED
 	---help---
 	  Enable group IO scheduling in CFQ.
 
+config IOSCHED_BFQ
+	tristate "BFQ I/O scheduler"
+	default n
+	---help---
+	  The BFQ I/O scheduler tries to distribute bandwidth among all
+	  processes according to their weights.
+	  It aims at distributing the bandwidth as desired, regardless
+	  of the disk parameters and with any workload.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
@@ -52,6 +61,15 @@ choice
 	config DEFAULT_CFQ
 		bool "CFQ" if IOSCHED_CFQ=y
 
+	config DEFAULT_BFQ
+		bool "BFQ" if IOSCHED_BFQ=y
+		help
+		  Selects BFQ as the default I/O scheduler which will be
+		  used by default for all block devices.
+		  The BFQ I/O scheduler aims at distributing the bandwidth
+		  as desired, regardless of the disk parameters and with
+		  any workload.
+
 	config DEFAULT_NOOP
 		bool "No-op"
 
@@ -61,6 +79,7 @@ config DEFAULT_IOSCHED
 	string
 	default "deadline" if DEFAULT_DEADLINE
 	default "cfq" if DEFAULT_CFQ
+	default "bfq" if DEFAULT_BFQ
 	default "noop" if DEFAULT_NOOP
 
 endmenu
diff --git a/block/Makefile b/block/Makefile
index 20645e8..cbd83fb 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
+obj-$(CONFIG_IOSCHED_BFQ)	+= bfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
diff --git a/block/bfq-ioc.c b/block/bfq-ioc.c
new file mode 100644
index 0000000..adfb5a1
--- /dev/null
+++ b/block/bfq-ioc.c
@@ -0,0 +1,34 @@
+/*
+ * BFQ: I/O context handling.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ */
+
+/**
+ * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
+ * @icq: the iocontext queue.
+ */
+static inline struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
+{
+	/* bic->icq is the first member, %NULL will convert to %NULL */
+	return container_of(icq, struct bfq_io_cq, icq);
+}
+
+/**
+ * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
+ * @bfqd: the lookup key.
+ * @ioc: the io_context of the process doing I/O.
+ *
+ * Queue lock must be held.
+ */
+static inline struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
+					       struct io_context *ioc)
+{
+	if (ioc)
+		return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue));
+	return NULL;
+}
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
new file mode 100644
index 0000000..01a98be
--- /dev/null
+++ b/block/bfq-iosched.c
@@ -0,0 +1,2297 @@
+/*
+ * Budget Fair Queueing (BFQ) disk scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ *
+ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
+ * file.
+ *
+ * BFQ is a proportional-share storage-I/O scheduling algorithm based on
+ * the slice-by-slice service scheme of CFQ. But BFQ assigns budgets,
+ * measured in number of sectors, to processes instead of time slices. The
+ * device is not granted to the in-service process for a given time slice,
+ * but until it has exhausted its assigned budget. This change from the time
+ * to the service domain allows BFQ to distribute the device throughput
+ * among processes as desired, without any distortion due to ZBR, workload
+ * fluctuations or other factors. BFQ uses an ad hoc internal scheduler,
+ * called B-WF2Q+, to schedule processes according to their budgets. More
+ * precisely, BFQ schedules queues associated to processes. Thanks to the
+ * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
+ * I/O-bound processes issuing sequential requests (to boost the
+ * throughput), and yet guarantee a relatively low latency to interactive
+ * applications.
+ *
+ * BFQ is described in [1], where also a reference to the initial, more
+ * theoretical paper on BFQ can be found. The interested reader can find
+ * in the latter paper full details on the main algorithm, as well as
+ * formulas of the guarantees and formal proofs of all the properties.
+ * With respect to the version of BFQ presented in these papers, this
+ * implementation adds a hierarchical extension based on H-WF2Q+.
+ *
+ * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
+ * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
+ * complexity derives from the one introduced with EEVDF in [3].
+ *
+ * [1] P. Valente and M. Andreolini, ``Improving Application Responsiveness
+ *     with the BFQ Disk I/O Scheduler'',
+ *     Proceedings of the 5th Annual International Systems and Storage
+ *     Conference (SYSTOR '12), June 2012.
+ *
+ * http://algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
+ *
+ * [2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
+ *     Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
+ *     Oct 1997.
+ *
+ * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
+ *
+ * [3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
+ *     First: A Flexible and Accurate Mechanism for Proportional Share
+ *     Resource Allocation,'' technical report.
+ *
+ * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
+ */
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/cgroup.h>
+#include <linux/elevator.h>
+#include <linux/jiffies.h>
+#include <linux/rbtree.h>
+#include <linux/ioprio.h>
+#include "bfq.h"
+#include "blk.h"
+
+/*
+ * Array of async queues for all the processes, one queue
+ * per ioprio value per ioprio_class.
+ */
+struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+/* Async queue for the idle class (ioprio is ignored) */
+struct bfq_queue *async_idle_bfqq;
+
+/* Max number of dispatches in one round of service. */
+static const int bfq_quantum = 4;
+
+/* Expiration time of sync (0) and async (1) requests, in jiffies. */
+static const int bfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
+
+/* Maximum backwards seek, in KiB. */
+static const int bfq_back_max = 16 * 1024;
+
+/* Penalty of a backwards seek, in number of sectors. */
+static const int bfq_back_penalty = 2;
+
+/* Idling period duration, in jiffies. */
+static int bfq_slice_idle = HZ / 125;
+
+/* Default maximum budget values, in sectors and number of requests. */
+static const int bfq_default_max_budget = 16 * 1024;
+static const int bfq_max_budget_async_rq = 4;
+
+/* Default timeout values, in jiffies, approximating CFQ defaults. */
+static const int bfq_timeout_sync = HZ / 8;
+static int bfq_timeout_async = HZ / 25;
+
+struct kmem_cache *bfq_pool;
+
+/* Below this threshold (in ms), we consider thinktime immediate. */
+#define BFQ_MIN_TT		2
+
+/* hw_tag detection: parallel requests threshold and min samples needed. */
+#define BFQ_HW_QUEUE_THRESHOLD	4
+#define BFQ_HW_QUEUE_SAMPLES	32
+
+#define BFQQ_SEEK_THR	 (sector_t)(8 * 1024)
+#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)
+
+/* Budget feedback step. */
+#define BFQ_BUDGET_STEP         128
+
+/* Min samples used for peak rate estimation (for autotuning). */
+#define BFQ_PEAK_RATE_SAMPLES	32
+
+/* Shift used for peak rate fixed precision calculations. */
+#define BFQ_RATE_SHIFT		16
+
+#define BFQ_SERVICE_TREE_INIT	((struct bfq_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+#define RQ_BIC(rq)		((struct bfq_io_cq *) (rq)->elv.priv[0])
+#define RQ_BFQQ(rq)		((rq)->elv.priv[1])
+
+static inline void bfq_schedule_dispatch(struct bfq_data *bfqd);
+
+#include "bfq-ioc.c"
+#include "bfq-sched.c"
+
+#define bfq_class_idle(bfqq)	((bfqq)->entity.ioprio_class ==\
+				 IOPRIO_CLASS_IDLE)
+#define bfq_class_rt(bfqq)	((bfqq)->entity.ioprio_class ==\
+				 IOPRIO_CLASS_RT)
+
+#define bfq_sample_valid(samples)	((samples) > 80)
+
+/*
+ * We regard a request as SYNC, if either it's a read or has the SYNC bit
+ * set (in which case it could also be a direct WRITE).
+ */
+static inline int bfq_bio_sync(struct bio *bio)
+{
+	if (bio_data_dir(bio) == READ || (bio->bi_rw & REQ_SYNC))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Scheduler run of queue, if there are requests pending and no one in the
+ * driver that will restart queueing.
+ */
+static inline void bfq_schedule_dispatch(struct bfq_data *bfqd)
+{
+	if (bfqd->queued != 0) {
+		bfq_log(bfqd, "schedule dispatch");
+		kblockd_schedule_work(bfqd->queue, &bfqd->unplug_work);
+	}
+}
+
+/*
+ * Lifted from AS - choose which of rq1 and rq2 is best served now.
+ * We choose the request that is closest to the head right now.  Distance
+ * behind the head is penalized and only allowed to a certain extent.
+ */
+static struct request *bfq_choose_req(struct bfq_data *bfqd,
+				      struct request *rq1,
+				      struct request *rq2,
+				      sector_t last)
+{
+	sector_t s1, s2, d1 = 0, d2 = 0;
+	unsigned long back_max;
+#define BFQ_RQ1_WRAP	0x01 /* request 1 wraps */
+#define BFQ_RQ2_WRAP	0x02 /* request 2 wraps */
+	unsigned wrap = 0; /* bit mask: requests behind the disk head? */
+
+	if (rq1 == NULL || rq1 == rq2)
+		return rq2;
+	if (rq2 == NULL)
+		return rq1;
+
+	if (rq_is_sync(rq1) && !rq_is_sync(rq2))
+		return rq1;
+	else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
+		return rq2;
+	if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
+		return rq1;
+	else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
+		return rq2;
+
+	s1 = blk_rq_pos(rq1);
+	s2 = blk_rq_pos(rq2);
+
+	/*
+	 * By definition, 1KiB is 2 sectors.
+	 */
+	back_max = bfqd->bfq_back_max * 2;
+
+	/*
+	 * Strict one way elevator _except_ in the case where we allow
+	 * short backward seeks which are biased as twice the cost of a
+	 * similar forward seek.
+	 */
+	if (s1 >= last)
+		d1 = s1 - last;
+	else if (s1 + back_max >= last)
+		d1 = (last - s1) * bfqd->bfq_back_penalty;
+	else
+		wrap |= BFQ_RQ1_WRAP;
+
+	if (s2 >= last)
+		d2 = s2 - last;
+	else if (s2 + back_max >= last)
+		d2 = (last - s2) * bfqd->bfq_back_penalty;
+	else
+		wrap |= BFQ_RQ2_WRAP;
+
+	/* Found required data */
+
+	/*
+	 * By doing switch() on the bit mask "wrap" we avoid having to
+	 * check two variables for all permutations: --> faster!
+	 */
+	switch (wrap) {
+	case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
+		if (d1 < d2)
+			return rq1;
+		else if (d2 < d1)
+			return rq2;
+		else {
+			if (s1 >= s2)
+				return rq1;
+			else
+				return rq2;
+		}
+
+	case BFQ_RQ2_WRAP:
+		return rq1;
+	case BFQ_RQ1_WRAP:
+		return rq2;
+	case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
+	default:
+		/*
+		 * Since both rqs are wrapped,
+		 * start with the one that's further behind head
+		 * (--> only *one* back seek required),
+		 * since back seek takes more time than forward.
+		 */
+		if (s1 <= s2)
+			return rq1;
+		else
+			return rq2;
+	}
+}
+
+static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
+					struct bfq_queue *bfqq,
+					struct request *last)
+{
+	struct rb_node *rbnext = rb_next(&last->rb_node);
+	struct rb_node *rbprev = rb_prev(&last->rb_node);
+	struct request *next = NULL, *prev = NULL;
+
+	if (rbprev != NULL)
+		prev = rb_entry_rq(rbprev);
+
+	if (rbnext != NULL)
+		next = rb_entry_rq(rbnext);
+	else {
+		rbnext = rb_first(&bfqq->sort_list);
+		if (rbnext && rbnext != &last->rb_node)
+			next = rb_entry_rq(rbnext);
+	}
+
+	return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
+}
+
+static inline unsigned long bfq_serv_to_charge(struct request *rq,
+					       struct bfq_queue *bfqq)
+{
+	return blk_rq_sectors(rq);
+}
+
+/**
+ * bfq_updated_next_req - update the queue after a new next_rq selection.
+ * @bfqd: the device data the queue belongs to.
+ * @bfqq: the queue to update.
+ *
+ * If the first request of a queue changes we make sure that the queue
+ * has enough budget to serve at least its first request (if the
+ * request has grown).  We do this because if the queue has not enough
+ * budget for its first request, it has to go through two dispatch
+ * rounds to actually get it dispatched.
+ */
+static void bfq_updated_next_req(struct bfq_data *bfqd,
+				 struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	struct request *next_rq = bfqq->next_rq;
+	unsigned long new_budget;
+
+	if (next_rq == NULL)
+		return;
+
+	if (bfqq == bfqd->in_service_queue)
+		/*
+		 * In order not to break guarantees, budgets cannot be
+		 * changed after an entity has been selected.
+		 */
+		return;
+
+	new_budget = max_t(unsigned long, bfqq->max_budget,
+			   bfq_serv_to_charge(next_rq, bfqq));
+	if (entity->budget != new_budget) {
+		entity->budget = new_budget;
+		bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
+					 new_budget);
+		bfq_activate_bfqq(bfqd, bfqq);
+	}
+}
+
+static void bfq_add_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_entity *entity = &bfqq->entity;
+	struct bfq_data *bfqd = bfqq->bfqd;
+	struct request *next_rq, *prev;
+
+	bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
+	bfqq->queued[rq_is_sync(rq)]++;
+	bfqd->queued++;
+
+	elv_rb_add(&bfqq->sort_list, rq);
+
+	/*
+	 * Check if this request is a better next-serve candidate.
+	 */
+	prev = bfqq->next_rq;
+	next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
+	bfqq->next_rq = next_rq;
+
+	if (!bfq_bfqq_busy(bfqq)) {
+		entity->budget = max_t(unsigned long, bfqq->max_budget,
+				       bfq_serv_to_charge(next_rq, bfqq));
+		bfq_add_bfqq_busy(bfqd, bfqq);
+	} else {
+		if (prev != bfqq->next_rq)
+			bfq_updated_next_req(bfqd, bfqq);
+	}
+}
+
+static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
+					  struct bio *bio)
+{
+	struct task_struct *tsk = current;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	bic = bfq_bic_lookup(bfqd, tsk->io_context);
+	if (bic == NULL)
+		return NULL;
+
+	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	if (bfqq != NULL)
+		return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
+
+	return NULL;
+}
+
+static void bfq_activate_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+
+	bfqd->rq_in_driver++;
+	bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
+	bfq_log(bfqd, "activate_request: new bfqd->last_position %llu",
+		(long long unsigned)bfqd->last_position);
+}
+
+static inline void bfq_deactivate_request(struct request_queue *q,
+					  struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+
+	bfqd->rq_in_driver--;
+}
+
+static void bfq_remove_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_data *bfqd = bfqq->bfqd;
+	const int sync = rq_is_sync(rq);
+
+	if (bfqq->next_rq == rq) {
+		bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
+		bfq_updated_next_req(bfqd, bfqq);
+	}
+
+	list_del_init(&rq->queuelist);
+	bfqq->queued[sync]--;
+	bfqd->queued--;
+	elv_rb_del(&bfqq->sort_list, rq);
+
+	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue)
+			bfq_del_bfqq_busy(bfqd, bfqq, 1);
+	}
+
+	if (rq->cmd_flags & REQ_META)
+		bfqq->meta_pending--;
+}
+
+static int bfq_merge(struct request_queue *q, struct request **req,
+		     struct bio *bio)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct request *__rq;
+
+	__rq = bfq_find_rq_fmerge(bfqd, bio);
+	if (__rq != NULL && elv_rq_merge_ok(__rq, bio)) {
+		*req = __rq;
+		return ELEVATOR_FRONT_MERGE;
+	}
+
+	return ELEVATOR_NO_MERGE;
+}
+
+static void bfq_merged_request(struct request_queue *q, struct request *req,
+			       int type)
+{
+	if (type == ELEVATOR_FRONT_MERGE &&
+	    rb_prev(&req->rb_node) &&
+	    blk_rq_pos(req) <
+	    blk_rq_pos(container_of(rb_prev(&req->rb_node),
+				    struct request, rb_node))) {
+		struct bfq_queue *bfqq = RQ_BFQQ(req);
+		struct bfq_data *bfqd = bfqq->bfqd;
+		struct request *prev, *next_rq;
+
+		/* Reposition request in its sort_list */
+		elv_rb_del(&bfqq->sort_list, req);
+		elv_rb_add(&bfqq->sort_list, req);
+		/* Choose next request to be served for bfqq */
+		prev = bfqq->next_rq;
+		next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,
+					 bfqd->last_position);
+		bfqq->next_rq = next_rq;
+		/*
+		 * If next_rq changes, update the queue's budget to fit
+		 * the new request.
+		 */
+		if (prev != bfqq->next_rq)
+			bfq_updated_next_req(bfqd, bfqq);
+	}
+}
+
+static void bfq_merged_requests(struct request_queue *q, struct request *rq,
+				struct request *next)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	/*
+	 * Reposition in fifo if next is older than rq.
+	 */
+	if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
+	    time_before(next->fifo_time, rq->fifo_time)) {
+		list_move(&rq->queuelist, &next->queuelist);
+		rq->fifo_time = next->fifo_time;
+	}
+
+	if (bfqq->next_rq == next)
+		bfqq->next_rq = rq;
+
+	bfq_remove_request(next);
+}
+
+static int bfq_allow_merge(struct request_queue *q, struct request *rq,
+			   struct bio *bio)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	/*
+	 * Disallow merge of a sync bio into an async request.
+	 */
+	if (bfq_bio_sync(bio) && !rq_is_sync(rq))
+		return 0;
+
+	/*
+	 * Lookup the bfqq that this bio will be queued with. Allow
+	 * merge only if rq is queued there.
+	 * Queue lock is held here.
+	 */
+	bic = bfq_bic_lookup(bfqd, current->io_context);
+	if (bic == NULL)
+		return 0;
+
+	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	return bfqq == RQ_BFQQ(rq);
+}
+
+static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
+				       struct bfq_queue *bfqq)
+{
+	if (bfqq != NULL) {
+		bfq_mark_bfqq_must_alloc(bfqq);
+		bfq_mark_bfqq_budget_new(bfqq);
+		bfq_clear_bfqq_fifo_expire(bfqq);
+
+		bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
+
+		bfq_log_bfqq(bfqd, bfqq,
+			     "set_in_service_queue, cur-budget = %lu",
+			     bfqq->entity.budget);
+	}
+
+	bfqd->in_service_queue = bfqq;
+}
+
+/*
+ * Get and set a new queue for service.
+ */
+static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
+
+	__bfq_set_in_service_queue(bfqd, bfqq);
+	return bfqq;
+}
+
+/*
+ * If enough samples have been computed, return the current max budget
+ * stored in bfqd, which is dynamically updated according to the
+ * estimated disk peak rate; otherwise return the default max budget.
+ */
+static inline unsigned long bfq_max_budget(struct bfq_data *bfqd)
+{
+	if (bfqd->budgets_assigned < 194)
+		return bfq_default_max_budget;
+	else
+		return bfqd->bfq_max_budget;
+}
+
+/**
+ * bfq_default_budget - return the default budget for @bfqq on @bfqd.
+ * @bfqd: the device descriptor.
+ * @bfqq: the queue to consider.
+ *
+ * We use 3/4 of the @bfqd maximum budget as the default value
+ * for the max_budget field of the queues.  This lets the feedback
+ * mechanism start from some middle ground; the behavior of the
+ * process will then drive the heuristics towards high values, if
+ * it behaves as a greedy sequential reader, or towards small values
+ * if it shows a more intermittent behavior.
+ */
+static unsigned long bfq_default_budget(struct bfq_data *bfqd,
+					struct bfq_queue *bfqq)
+{
+	unsigned long budget;
+
+	/*
+	 * When we need an estimate of the peak rate we need to avoid
+	 * to give budgets that are too short due to previous measurements.
+	 * So, in the first 10 assignments use a ``safe'' budget value.
+	 */
+	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
+		budget = bfq_default_max_budget;
+	else
+		budget = bfqd->bfq_max_budget;
+
+	return budget - budget / 4;
+}
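+
+/*
+ * Worked example for bfq_default_budget() above (illustrative numbers):
+ * with a device max budget of 16384 sectors, a queue gets a default
+ * max_budget of 16384 - 16384 / 4 = 12288 sectors, i.e., 3/4 of the
+ * device-wide maximum, which leaves the feedback in
+ * __bfq_bfqq_recalc_budget() room to move in either direction.
+ */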
+
+/*
+ * Return min budget, which is a fraction of the current or default
+ * max budget (trying with 1/32)
+ */
+static inline unsigned long bfq_min_budget(struct bfq_data *bfqd)
+{
+	if (bfqd->budgets_assigned < 194)
+		return bfq_default_max_budget / 32;
+	else
+		return bfqd->bfq_max_budget / 32;
+}
+
+static void bfq_arm_slice_timer(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfqd->in_service_queue;
+	struct bfq_io_cq *bic;
+	unsigned long sl;
+
+	/* Processes have exited, don't wait. */
+	bic = bfqd->in_service_bic;
+	if (bic == NULL || atomic_read(&bic->icq.ioc->active_ref) == 0)
+		return;
+
+	bfq_mark_bfqq_wait_request(bfqq);
+
+	/*
+	 * We don't want to idle for seeks, but we do want to allow
+	 * fair distribution of slice time for a process doing back-to-back
+	 * seeks. So allow it a little bit of time to submit a new rq.
+	 */
+	sl = bfqd->bfq_slice_idle;
+	/*
+	 * Grant only minimum idle time if the queue has been seeky for long
+	 * enough.
+	 */
+	if (bfq_sample_valid(bfqq->seek_samples) && BFQQ_SEEKY(bfqq))
+		sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
+	bfqd->last_idling_start = ktime_get();
+	mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
+	bfq_log(bfqd, "arm idle: %u/%u ms",
+		jiffies_to_msecs(sl), jiffies_to_msecs(bfqd->bfq_slice_idle));
+}
+
+/*
+ * Set the maximum time for the in-service queue to consume its
+ * budget. This prevents seeky processes from lowering the disk
+ * throughput (always guaranteed with a time slice scheme as in CFQ).
+ */
+static void bfq_set_budget_timeout(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfqd->in_service_queue;
+	unsigned int timeout_coeff = bfqq->entity.weight /
+				     bfqq->entity.orig_weight;
+
+	bfqd->last_budget_start = ktime_get();
+
+	bfq_clear_bfqq_budget_new(bfqq);
+	bfqq->budget_timeout = jiffies +
+		bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] * timeout_coeff;
+
+	bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u",
+		jiffies_to_msecs(bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] *
+		timeout_coeff));
+}
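+
+/*
+ * Note: timeout_coeff above is entity->weight / entity->orig_weight
+ * (integer division).  In the common case the two weights coincide and
+ * the coefficient is 1, so the queue gets exactly bfq_timeout[sync];
+ * a queue whose weight were, say, twice its original weight would be
+ * allowed twice as long to consume its budget.
+ */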
+
+/*
+ * Move request from internal lists to the request queue dispatch list.
+ */
+static void bfq_dispatch_insert(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	/*
+	 * For consistency, the next instruction should have been executed
+	 * after removing the request from the queue and dispatching it.
+	 * We execute it instead before bfq_remove_request() (and hence
+	 * introduce a temporary inconsistency), for efficiency.
+	 * In fact, in a forced dispatch, this prevents two counters related
+	 * to bfqq->dispatched from being uselessly decremented if bfqq
+	 * is not in service, and then incremented again after
+	 * incrementing bfqq->dispatched.
+	 */
+	bfqq->dispatched++;
+	bfq_remove_request(rq);
+	elv_dispatch_sort(q, rq);
+
+	if (bfq_bfqq_sync(bfqq))
+		bfqd->sync_flight++;
+}
+
+/*
+ * Return expired entry, or NULL to just start from scratch in rbtree.
+ */
+static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
+{
+	struct request *rq = NULL;
+
+	if (bfq_bfqq_fifo_expire(bfqq))
+		return NULL;
+
+	bfq_mark_bfqq_fifo_expire(bfqq);
+
+	if (list_empty(&bfqq->fifo))
+		return NULL;
+
+	rq = rq_entry_fifo(bfqq->fifo.next);
+
+	if (time_before(jiffies, rq->fifo_time))
+		return NULL;
+
+	return rq;
+}
+
+static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	return entity->budget - entity->service;
+}
+
+static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	__bfq_bfqd_reset_in_service(bfqd);
+
+	if (RB_EMPTY_ROOT(&bfqq->sort_list))
+		bfq_del_bfqq_busy(bfqd, bfqq, 1);
+	else
+		bfq_activate_bfqq(bfqd, bfqq);
+}
+
+/**
+ * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
+ * @bfqd: device data.
+ * @bfqq: queue to update.
+ * @reason: reason for expiration.
+ *
+ * Handle the feedback on @bfqq budget.  See the body for detailed
+ * comments.
+ */
+static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
+				     struct bfq_queue *bfqq,
+				     enum bfqq_expiration reason)
+{
+	struct request *next_rq;
+	unsigned long budget, min_budget;
+
+	budget = bfqq->max_budget;
+	min_budget = bfq_min_budget(bfqd);
+
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %lu, budg left %lu",
+		bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %lu, min budg %lu",
+		budget, bfq_min_budget(bfqd));
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
+		bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
+
+	if (bfq_bfqq_sync(bfqq)) {
+		switch (reason) {
+		/*
+		 * Caveat: in all the following cases we trade latency
+		 * for throughput.
+		 */
+		case BFQ_BFQQ_TOO_IDLE:
+			if (budget > min_budget + BFQ_BUDGET_STEP)
+				budget -= BFQ_BUDGET_STEP;
+			else
+				budget = min_budget;
+			break;
+		case BFQ_BFQQ_BUDGET_TIMEOUT:
+			budget = bfq_default_budget(bfqd, bfqq);
+			break;
+		case BFQ_BFQQ_BUDGET_EXHAUSTED:
+			/*
+			 * The process still has backlog, and did not
+			 * let either the budget timeout or the disk
+			 * idling timeout expire. Hence it is not
+			 * seeky, has a short thinktime and may be
+			 * happy with a higher budget too. So
+			 * definitely increase the budget of this good
+			 * candidate to boost the disk throughput.
+			 */
+			budget = min(budget + 8 * BFQ_BUDGET_STEP,
+				     bfqd->bfq_max_budget);
+			break;
+		case BFQ_BFQQ_NO_MORE_REQUESTS:
+		       /*
+			* Leave the budget unchanged.
+			*/
+		default:
+			return;
+		}
+	} else /* async queue */
+	    /* async queues always get the maximum possible budget
+	     * (their ability to dispatch is limited by
+	     * @bfqd->bfq_max_budget_async_rq).
+	     */
+		budget = bfqd->bfq_max_budget;
+
+	bfqq->max_budget = budget;
+
+	if (bfqd->budgets_assigned >= 194 && bfqd->bfq_user_max_budget == 0 &&
+	    bfqq->max_budget > bfqd->bfq_max_budget)
+		bfqq->max_budget = bfqd->bfq_max_budget;
+
+	/*
+	 * Make sure that we have enough budget for the next request.
+	 * Since the finish time of the bfqq must be kept in sync with
+	 * the budget, be sure to call __bfq_bfqq_expire() after the
+	 * update.
+	 */
+	next_rq = bfqq->next_rq;
+	if (next_rq != NULL)
+		bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
+					    bfq_serv_to_charge(next_rq, bfqq));
+	else
+		bfqq->entity.budget = bfqq->max_budget;
+
+	bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %lu",
+			next_rq != NULL ? blk_rq_sectors(next_rq) : 0,
+			bfqq->entity.budget);
+}
+
+static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
+{
+	unsigned long max_budget;
+
+	/*
+	 * The max_budget calculated when autotuning is equal to the
+	 * number of sectors transferred in timeout_sync at the
+	 * estimated peak rate.
+	 */
+	max_budget = (unsigned long)(peak_rate * 1000 *
+				     timeout >> BFQ_RATE_SHIFT);
+
+	return max_budget;
+}
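+
+/*
+ * Worked example for bfq_calc_max_budget() above (illustrative numbers):
+ * peak_rate is stored in sectors/usec, left-shifted by BFQ_RATE_SHIFT,
+ * and timeout is in ms, so max_budget is the number of sectors the
+ * device can transfer in one timeout_sync period at the peak rate.
+ * For a device sustaining about 100 MB/s (~0.2 sectors/usec) and a
+ * 125 ms sync timeout, that is roughly 0.2 * 125000 = 25000 sectors,
+ * i.e., about 12 MB per budget.
+ */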
+
+/*
+ * In addition to updating the peak rate, checks whether the process
+ * is "slow", and returns 1 if so. This slow flag is used, in addition
+ * to the budget timeout, to reduce the amount of service provided to
+ * seeky processes, and hence reduce their chances of lowering the
+ * throughput. See the code for more details.
+ */
+static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				int compensate)
+{
+	u64 bw, usecs, expected, timeout;
+	ktime_t delta;
+	int update = 0;
+
+	if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
+		return 0;
+
+	if (compensate)
+		delta = bfqd->last_idling_start;
+	else
+		delta = ktime_get();
+	delta = ktime_sub(delta, bfqd->last_budget_start);
+	usecs = ktime_to_us(delta);
+
+	/* Don't trust short/unrealistic values. */
+	if (usecs < 100 || usecs >= LONG_MAX)
+		return 0;
+
+	/*
+	 * Calculate the bandwidth for the last slice.  We use a 64 bit
+	 * value to store the peak rate, in sectors per usec in fixed
+	 * point math.  We do so to have enough precision in the estimate
+	 * and to avoid overflows.
+	 */
+	bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
+	do_div(bw, (unsigned long)usecs);
+
+	timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
+
+	/*
+	 * Use only long (> 20ms) intervals to filter out spikes for
+	 * the peak rate estimation.
+	 */
+	if (usecs > 20000) {
+		if (bw > bfqd->peak_rate) {
+			bfqd->peak_rate = bw;
+			update = 1;
+			bfq_log(bfqd, "new peak_rate=%llu", bw);
+		}
+
+		update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
+
+		if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
+			bfqd->peak_rate_samples++;
+
+		if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
+		    update && bfqd->bfq_user_max_budget == 0) {
+			bfqd->bfq_max_budget =
+				bfq_calc_max_budget(bfqd->peak_rate,
+						    timeout);
+			bfq_log(bfqd, "new max_budget=%lu",
+				bfqd->bfq_max_budget);
+		}
+	}
+
+	/*
+	 * A process is considered ``slow'' (i.e., seeky, so that we
+	 * cannot treat it fairly in the service domain, as it would
+	 * slow down the other processes too much) if, when a slice
+	 * ends for whatever reason, it has received service at a
+	 * rate that would not be high enough to complete the budget
+	 * before the budget timeout expiration.
+	 */
+	expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT;
+
+	/*
+	 * Caveat: processes doing IO in the slower disk zones will
+	 * tend to be slow(er) even if not seeky. And the estimated
+	 * peak rate will actually be an average over the disk
+	 * surface. Hence, to not be too harsh with unlucky processes,
+	 * we keep a budget/3 margin of safety before declaring a
+	 * process slow.
+	 */
+	return expected > (4 * bfqq->entity.budget) / 3;
+}
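+
+/*
+ * Example of the 'expected' computation above (illustrative numbers):
+ * if the queue received 2048 sectors of service over a 40000 usec
+ * slice, bw corresponds to 2048 / 40000 ~= 0.05 sectors/usec; with a
+ * 125 ms sync timeout this gives expected ~= 0.05 * 125000 = 6400
+ * sectors, which is then compared against 4/3 of the queue's budget.
+ */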
+
+/**
+ * bfq_bfqq_expire - expire a queue.
+ * @bfqd: device owning the queue.
+ * @bfqq: the queue to expire.
+ * @compensate: if true, compensate for the time spent idling.
+ * @reason: the reason causing the expiration.
+ *
+ * If the process associated to the queue is slow (i.e., seeky), or in
+ * case of budget timeout, or, finally, if it is async, we
+ * artificially charge it an entire budget (independently of the
+ * actual service it received). As a consequence, the queue will get
+ * higher timestamps than the correct ones upon reactivation, and
+ * hence it will be rescheduled as if it had received more service
+ * than what it actually received. In the end, this class of processes
+ * will receive less service in proportion to how slowly they consume
+ * their budgets (and hence how seriously they tend to lower the
+ * throughput).
+ *
+ * In contrast, when a queue expires because it has been idling for
+ * too long or because it exhausted its budget, we do not touch the
+ * amount of service it has received. Hence when the queue will be
+ * reactivated and its timestamps updated, the latter will be in sync
+ * with the actual service received by the queue until expiration.
+ *
+ * Charging a full budget to the first type of queues and the exact
+ * service to the others has the effect of using the WF2Q+ policy to
+ * schedule the former on a timeslice basis, without violating the
+ * service domain guarantees of the latter.
+ */
+static void bfq_bfqq_expire(struct bfq_data *bfqd,
+			    struct bfq_queue *bfqq,
+			    int compensate,
+			    enum bfqq_expiration reason)
+{
+	int slow;
+
+	/* Update disk peak rate for autotuning and check whether the
+	 * process is slow (see bfq_update_peak_rate).
+	 */
+	slow = bfq_update_peak_rate(bfqd, bfqq, compensate);
+
+	/*
+	 * As explained above, 'punish' slow (i.e., seeky), timed-out
+	 * and async queues, to favor sequential sync workloads.
+	 */
+	if (slow || reason == BFQ_BFQQ_BUDGET_TIMEOUT)
+		bfq_bfqq_charge_full_budget(bfqq);
+
+	bfq_log_bfqq(bfqd, bfqq,
+		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
+		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
+
+	/*
+	 * Increase, decrease or leave budget unchanged according to
+	 * reason.
+	 */
+	__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
+	__bfq_bfqq_expire(bfqd, bfqq);
+}
+
+/*
+ * Budget timeout is not implemented through a dedicated timer, but
+ * just checked on request arrivals and completions, as well as on
+ * idle timer expirations.
+ */
+static int bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
+{
+	if (bfq_bfqq_budget_new(bfqq) ||
+	    time_before(jiffies, bfqq->budget_timeout))
+		return 0;
+	return 1;
+}
+
+/*
+ * If we expire a queue that is waiting for the arrival of a new
+ * request, we may prevent the fictitious timestamp back-shifting that
+ * allows the guarantees of the queue to be preserved (see [1] for
+ * this tricky aspect). Hence we return true only if this condition
+ * does not hold, or if the queue is slow enough to deserve only to be
+ * kicked off for preserving a high throughput.
+ */
+static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqq->bfqd, bfqq,
+		"may_budget_timeout: wait_request %d left %d timeout %d",
+		bfq_bfqq_wait_request(bfqq),
+			bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3,
+		bfq_bfqq_budget_timeout(bfqq));
+
+	return (!bfq_bfqq_wait_request(bfqq) ||
+		bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3)
+		&&
+		bfq_bfqq_budget_timeout(bfqq);
+}
+
+/*
+ * Device idling is allowed only for sync queues that have a non-null
+ * idle window.
+ */
+static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
+{
+	return bfq_bfqq_sync(bfqq) && bfq_bfqq_idle_window(bfqq);
+}
+
+/*
+ * If the in-service queue is empty, but it is sync and the queue has its
+ * idle window set (in this case, waiting for a new request for the queue
+ * is likely to boost the throughput), then:
+ * 1) the queue must remain in service and cannot be expired, and
+ * 2) the disk must be idled to wait for the possible arrival of a new
+ *    request for the queue.
+ */
+static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
+{
+	struct bfq_data *bfqd = bfqq->bfqd;
+
+	return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 &&
+	       bfq_bfqq_must_not_expire(bfqq);
+}
+
+/*
+ * Select a queue for service.  If we have a current queue in service,
+ * check whether to continue servicing it, or retrieve and set a new one.
+ */
+static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq;
+	struct request *next_rq;
+	enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+
+	bfqq = bfqd->in_service_queue;
+	if (bfqq == NULL)
+		goto new_queue;
+
+	bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
+
+	if (bfq_may_expire_for_budg_timeout(bfqq) &&
+	    !timer_pending(&bfqd->idle_slice_timer) &&
+	    !bfq_bfqq_must_idle(bfqq))
+		goto expire;
+
+	next_rq = bfqq->next_rq;
+	/*
+	 * If bfqq has requests queued and it has enough budget left to
+	 * serve them, keep the queue, otherwise expire it.
+	 */
+	if (next_rq != NULL) {
+		if (bfq_serv_to_charge(next_rq, bfqq) >
+			bfq_bfqq_budget_left(bfqq)) {
+			reason = BFQ_BFQQ_BUDGET_EXHAUSTED;
+			goto expire;
+		} else {
+			/*
+			 * The idle timer may be pending because we may
+			 * not disable disk idling even when a new request
+			 * arrives.
+			 */
+			if (timer_pending(&bfqd->idle_slice_timer)) {
+				/*
+				 * If we get here: 1) at least one new
+				 * request has arrived but we have not
+				 * disabled the timer because the request
+				 * was too small, and 2) the block layer
+				 * has then unplugged the device, causing
+				 * the dispatch to be invoked.
+				 *
+				 * Since the device is unplugged, now the
+				 * requests are probably large enough to
+				 * provide a reasonable throughput.
+				 * So we disable idling.
+				 */
+				bfq_clear_bfqq_wait_request(bfqq);
+				del_timer(&bfqd->idle_slice_timer);
+			}
+			goto keep_queue;
+		}
+	}
+
+	/*
+	 * No requests pending.  If the in-service queue still has requests
+	 * in flight (possibly waiting for a completion) or is idling for a
+	 * new request, then keep it.
+	 */
+	if (timer_pending(&bfqd->idle_slice_timer) ||
+	    (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq))) {
+		bfqq = NULL;
+		goto keep_queue;
+	}
+
+	reason = BFQ_BFQQ_NO_MORE_REQUESTS;
+expire:
+	bfq_bfqq_expire(bfqd, bfqq, 0, reason);
+new_queue:
+	bfqq = bfq_set_in_service_queue(bfqd);
+	bfq_log(bfqd, "select_queue: new queue %d returned",
+		bfqq != NULL ? bfqq->pid : 0);
+keep_queue:
+	return bfqq;
+}
+
+/*
+ * Dispatch one request from bfqq, moving it to the request queue
+ * dispatch list.
+ */
+static int bfq_dispatch_request(struct bfq_data *bfqd,
+				struct bfq_queue *bfqq)
+{
+	int dispatched = 0;
+	struct request *rq;
+	unsigned long service_to_charge;
+
+	/* Follow expired path, else get first next available. */
+	rq = bfq_check_fifo(bfqq);
+	if (rq == NULL)
+		rq = bfqq->next_rq;
+	service_to_charge = bfq_serv_to_charge(rq, bfqq);
+
+	if (service_to_charge > bfq_bfqq_budget_left(bfqq)) {
+		/*
+		 * This may happen if the next rq is chosen in fifo order
+		 * instead of sector order. The budget is properly
+		 * dimensioned to be always sufficient to serve the next
+		 * request only if it is chosen in sector order. The reason
+		 * is that it would be quite inefficient and of little use
+		 * to always make sure that the budget is large enough to
+		 * serve even the possible next rq in fifo order.
+		 * In fact, requests are seldom served in fifo order.
+		 *
+		 * Expire the queue for budget exhaustion, and make sure
+		 * that the next act_budget is enough to serve the next
+		 * request, even if it comes from the fifo expired path.
+		 */
+		bfqq->next_rq = rq;
+		/*
+		 * Since this dispatch failed, make sure that
+		 * a new one will be performed.
+		 */
+		if (!bfqd->rq_in_driver)
+			bfq_schedule_dispatch(bfqd);
+		goto expire;
+	}
+
+	/* Finally, insert request into driver dispatch list. */
+	bfq_bfqq_served(bfqq, service_to_charge);
+	bfq_dispatch_insert(bfqd->queue, rq);
+
+	bfq_log_bfqq(bfqd, bfqq,
+			"dispatched %u sec req (%llu), budg left %lu",
+			blk_rq_sectors(rq),
+			(long long unsigned)blk_rq_pos(rq),
+			bfq_bfqq_budget_left(bfqq));
+
+	dispatched++;
+
+	if (bfqd->in_service_bic == NULL) {
+		atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount);
+		bfqd->in_service_bic = RQ_BIC(rq);
+	}
+
+	if (bfqd->busy_queues > 1 && ((!bfq_bfqq_sync(bfqq) &&
+	    dispatched >= bfqd->bfq_max_budget_async_rq) ||
+	    bfq_class_idle(bfqq)))
+		goto expire;
+
+	return dispatched;
+
+expire:
+	bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_EXHAUSTED);
+	return dispatched;
+}
+
+static int __bfq_forced_dispatch_bfqq(struct bfq_queue *bfqq)
+{
+	int dispatched = 0;
+
+	while (bfqq->next_rq != NULL) {
+		bfq_dispatch_insert(bfqq->bfqd->queue, bfqq->next_rq);
+		dispatched++;
+	}
+
+	return dispatched;
+}
+
+/*
+ * Drain our current requests.
+ * Used for barriers and when switching io schedulers on-the-fly.
+ */
+static int bfq_forced_dispatch(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq, *n;
+	struct bfq_service_tree *st;
+	int dispatched = 0;
+
+	bfqq = bfqd->in_service_queue;
+	if (bfqq != NULL)
+		__bfq_bfqq_expire(bfqd, bfqq);
+
+	/*
+	 * Loop through classes, and be careful to leave the scheduler
+	 * in a consistent state, as feedback mechanisms and vtime
+	 * updates cannot be disabled during the process.
+	 */
+	list_for_each_entry_safe(bfqq, n, &bfqd->active_list, bfqq_list) {
+		st = bfq_entity_service_tree(&bfqq->entity);
+
+		dispatched += __bfq_forced_dispatch_bfqq(bfqq);
+		bfqq->max_budget = bfq_max_budget(bfqd);
+
+		bfq_forget_idle(st);
+	}
+
+	return dispatched;
+}
+
+static int bfq_dispatch_requests(struct request_queue *q, int force)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq;
+	int max_dispatch;
+
+	bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
+	if (bfqd->busy_queues == 0)
+		return 0;
+
+	if (unlikely(force))
+		return bfq_forced_dispatch(bfqd);
+
+	bfqq = bfq_select_queue(bfqd);
+	if (bfqq == NULL)
+		return 0;
+
+	max_dispatch = bfqd->bfq_quantum;
+	if (bfq_class_idle(bfqq))
+		max_dispatch = 1;
+
+	if (!bfq_bfqq_sync(bfqq))
+		max_dispatch = bfqd->bfq_max_budget_async_rq;
+
+	if (bfqq->dispatched >= max_dispatch) {
+		if (bfqd->busy_queues > 1)
+			return 0;
+		if (bfqq->dispatched >= 4 * max_dispatch)
+			return 0;
+	}
+
+	if (bfqd->sync_flight != 0 && !bfq_bfqq_sync(bfqq))
+		return 0;
+
+	bfq_clear_bfqq_wait_request(bfqq);
+
+	if (!bfq_dispatch_request(bfqd, bfqq))
+		return 0;
+
+	bfq_log_bfqq(bfqd, bfqq, "dispatched one request of %d (max_disp %d)",
+			bfqq->pid, max_dispatch);
+
+	return 1;
+}
+
+/*
+ * Task holds one reference to the queue, dropped when task exits.  Each rq
+ * in-flight on this queue also holds a reference, dropped when rq is freed.
+ *
+ * Queue lock must be held here.
+ */
+static void bfq_put_queue(struct bfq_queue *bfqq)
+{
+	struct bfq_data *bfqd = bfqq->bfqd;
+
+	bfq_log_bfqq(bfqd, bfqq, "put_queue: %p %d", bfqq,
+		     atomic_read(&bfqq->ref));
+	if (!atomic_dec_and_test(&bfqq->ref))
+		return;
+
+	bfq_log_bfqq(bfqd, bfqq, "put_queue: %p freed", bfqq);
+
+	kmem_cache_free(bfq_pool, bfqq);
+}
+
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	if (bfqq == bfqd->in_service_queue) {
+		__bfq_bfqq_expire(bfqd, bfqq);
+		bfq_schedule_dispatch(bfqd);
+	}
+
+	bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq,
+		     atomic_read(&bfqq->ref));
+
+	bfq_put_queue(bfqq);
+}
+
+static inline void bfq_init_icq(struct io_cq *icq)
+{
+	icq_to_bic(icq)->ttime.last_end_request = jiffies;
+}
+
+static void bfq_exit_icq(struct io_cq *icq)
+{
+	struct bfq_io_cq *bic = icq_to_bic(icq);
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+
+	if (bic->bfqq[BLK_RW_ASYNC]) {
+		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_ASYNC]);
+		bic->bfqq[BLK_RW_ASYNC] = NULL;
+	}
+
+	if (bic->bfqq[BLK_RW_SYNC]) {
+		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
+		bic->bfqq[BLK_RW_SYNC] = NULL;
+	}
+}
+
+/*
+ * Update the entity prio values; note that the new values will not
+ * be used until the next (re)activation.
+ */
+static void bfq_init_prio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+{
+	struct task_struct *tsk = current;
+	int ioprio_class;
+
+	if (!bfq_bfqq_prio_changed(bfqq))
+		return;
+
+	ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
+	switch (ioprio_class) {
+	default:
+		dev_err(bfqq->bfqd->queue->backing_dev_info.dev,
+			"bfq: bad prio %x\n", ioprio_class);
+	case IOPRIO_CLASS_NONE:
+		/*
+		 * No prio set, inherit CPU scheduling settings.
+		 */
+		bfqq->entity.new_ioprio = task_nice_ioprio(tsk);
+		bfqq->entity.new_ioprio_class = task_nice_ioclass(tsk);
+		break;
+	case IOPRIO_CLASS_RT:
+		bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->entity.new_ioprio_class = IOPRIO_CLASS_RT;
+		break;
+	case IOPRIO_CLASS_BE:
+		bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->entity.new_ioprio_class = IOPRIO_CLASS_BE;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		bfqq->entity.new_ioprio_class = IOPRIO_CLASS_IDLE;
+		bfqq->entity.new_ioprio = 7;
+		bfq_clear_bfqq_idle_window(bfqq);
+		break;
+	}
+
+	bfqq->entity.ioprio_changed = 1;
+
+	bfq_clear_bfqq_prio_changed(bfqq);
+}
+
+static void bfq_changed_ioprio(struct bfq_io_cq *bic)
+{
+	struct bfq_data *bfqd;
+	struct bfq_queue *bfqq, *new_bfqq;
+	unsigned long uninitialized_var(flags);
+	int ioprio = bic->icq.ioc->ioprio;
+
+	bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
+				   &flags);
+	/*
+	 * This condition may trigger on a newly created bic, be sure to
+	 * drop the lock before returning.
+	 */
+	if (unlikely(bfqd == NULL) || likely(bic->ioprio == ioprio))
+		goto out;
+
+	bfqq = bic->bfqq[BLK_RW_ASYNC];
+	if (bfqq != NULL) {
+		new_bfqq = bfq_get_queue(bfqd, BLK_RW_ASYNC, bic,
+					 GFP_ATOMIC);
+		if (new_bfqq != NULL) {
+			bic->bfqq[BLK_RW_ASYNC] = new_bfqq;
+			bfq_log_bfqq(bfqd, bfqq,
+				     "changed_ioprio: bfqq %p %d",
+				     bfqq, atomic_read(&bfqq->ref));
+			bfq_put_queue(bfqq);
+		}
+	}
+
+	bfqq = bic->bfqq[BLK_RW_SYNC];
+	if (bfqq != NULL)
+		bfq_mark_bfqq_prio_changed(bfqq);
+
+	bic->ioprio = ioprio;
+
+out:
+	bfq_put_bfqd_unlock(bfqd, &flags);
+}
+
+static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			  pid_t pid, int is_sync)
+{
+	RB_CLEAR_NODE(&bfqq->entity.rb_node);
+	INIT_LIST_HEAD(&bfqq->fifo);
+
+	atomic_set(&bfqq->ref, 0);
+	bfqq->bfqd = bfqd;
+
+	bfq_mark_bfqq_prio_changed(bfqq);
+
+	if (is_sync) {
+		if (!bfq_class_idle(bfqq))
+			bfq_mark_bfqq_idle_window(bfqq);
+		bfq_mark_bfqq_sync(bfqq);
+	}
+
+	/* Tentative initial value to trade off between thr and lat */
+	bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
+	bfqq->pid = pid;
+}
+
+static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
+					      int is_sync,
+					      struct bfq_io_cq *bic,
+					      gfp_t gfp_mask)
+{
+	struct bfq_queue *bfqq, *new_bfqq = NULL;
+
+retry:
+	/* bic always exists here */
+	bfqq = bic_to_bfqq(bic, is_sync);
+
+	/*
+	 * Always try a new alloc if we fall back to the OOM bfqq
+	 * originally, since it should just be a temporary situation.
+	 */
+	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
+		bfqq = NULL;
+		if (new_bfqq != NULL) {
+			bfqq = new_bfqq;
+			new_bfqq = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			spin_unlock_irq(bfqd->queue->queue_lock);
+			new_bfqq = kmem_cache_alloc_node(bfq_pool,
+					gfp_mask | __GFP_ZERO,
+					bfqd->queue->node);
+			spin_lock_irq(bfqd->queue->queue_lock);
+			if (new_bfqq != NULL)
+				goto retry;
+		} else {
+			bfqq = kmem_cache_alloc_node(bfq_pool,
+					gfp_mask | __GFP_ZERO,
+					bfqd->queue->node);
+		}
+
+		if (bfqq != NULL) {
+			bfq_init_bfqq(bfqd, bfqq, current->pid, is_sync);
+			bfq_log_bfqq(bfqd, bfqq, "allocated");
+		} else {
+			bfqq = &bfqd->oom_bfqq;
+			bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");
+		}
+
+		bfq_init_prio_data(bfqq, bic);
+	}
+
+	if (new_bfqq != NULL)
+		kmem_cache_free(bfq_pool, new_bfqq);
+
+	return bfqq;
+}
+
+static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
+					       int ioprio_class, int ioprio)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		return &async_bfqq[0][ioprio];
+	case IOPRIO_CLASS_NONE:
+		ioprio = IOPRIO_NORM;
+		/* fall through */
+	case IOPRIO_CLASS_BE:
+		return &async_bfqq[1][ioprio];
+	case IOPRIO_CLASS_IDLE:
+		return &async_idle_bfqq;
+	default:
+		BUG();
+	}
+}
+
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
+				       int is_sync, struct bfq_io_cq *bic,
+				       gfp_t gfp_mask)
+{
+	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
+	struct bfq_queue **async_bfqq = NULL;
+	struct bfq_queue *bfqq = NULL;
+
+	if (!is_sync) {
+		async_bfqq = bfq_async_queue_prio(bfqd, ioprio_class, ioprio);
+		bfqq = *async_bfqq;
+	}
+
+	if (bfqq == NULL)
+		bfqq = bfq_find_alloc_queue(bfqd, is_sync, bic, gfp_mask);
+
+	/*
+	 * Pin the queue now that it's allocated, scheduler exit will
+	 * prune it.
+	 */
+	if (!is_sync && *async_bfqq == NULL) {
+		atomic_inc(&bfqq->ref);
+		bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		*async_bfqq = bfqq;
+	}
+
+	atomic_inc(&bfqq->ref);
+	bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq,
+		     atomic_read(&bfqq->ref));
+	return bfqq;
+}
+
+static void bfq_update_io_thinktime(struct bfq_data *bfqd,
+				    struct bfq_io_cq *bic)
+{
+	unsigned long elapsed = jiffies - bic->ttime.last_end_request;
+	unsigned long ttime = min(elapsed, 2UL * bfqd->bfq_slice_idle);
+
+	bic->ttime.ttime_samples = (7*bic->ttime.ttime_samples + 256) / 8;
+	bic->ttime.ttime_total = (7*bic->ttime.ttime_total + 256*ttime) / 8;
+	bic->ttime.ttime_mean = (bic->ttime.ttime_total + 128) /
+				bic->ttime.ttime_samples;
+}
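+
+/*
+ * Numerical note on the think-time filter above: ttime_samples is an
+ * exponentially weighted counter that saturates at 256, and ttime_total
+ * is the matching weighted sum, so ttime_mean is a fixed-point average
+ * in which each new sample weighs 1/8.  For instance, starting from
+ * samples = 256 and total = 512 (mean 2), a new ttime of 10 jiffies
+ * gives total = (7 * 512 + 256 * 10) / 8 = 768 and
+ * mean = (768 + 128) / 256 = 3 with integer division.
+ */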
+
+static void bfq_update_io_seektime(struct bfq_data *bfqd,
+				   struct bfq_queue *bfqq,
+				   struct request *rq)
+{
+	sector_t sdist;
+	u64 total;
+
+	if (bfqq->last_request_pos < blk_rq_pos(rq))
+		sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
+	else
+		sdist = bfqq->last_request_pos - blk_rq_pos(rq);
+
+	/*
+	 * Don't allow the seek distance to get too large from the
+	 * odd fragment, pagein, etc.
+	 */
+	if (bfqq->seek_samples == 0) /* first request, not really a seek */
+		sdist = 0;
+	else if (bfqq->seek_samples <= 60) /* second & third seek */
+		sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*1024);
+	else
+		sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*64);
+
+	bfqq->seek_samples = (7*bfqq->seek_samples + 256) / 8;
+	bfqq->seek_total = (7*bfqq->seek_total + (u64)256*sdist) / 8;
+	total = bfqq->seek_total + (bfqq->seek_samples/2);
+	do_div(total, bfqq->seek_samples);
+	bfqq->seek_mean = (sector_t)total;
+
+	bfq_log_bfqq(bfqd, bfqq, "dist=%llu mean=%llu", (u64)sdist,
+			(u64)bfqq->seek_mean);
+}
+
+/*
+ * Disable idle window if the process thinks too long or seeks so much that
+ * it doesn't matter.
+ */
+static void bfq_update_idle_window(struct bfq_data *bfqd,
+				   struct bfq_queue *bfqq,
+				   struct bfq_io_cq *bic)
+{
+	int enable_idle;
+
+	/* Don't idle for async or idle io prio class. */
+	if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
+		return;
+
+	enable_idle = bfq_bfqq_idle_window(bfqq);
+
+	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
+	    bfqd->bfq_slice_idle == 0 ||
+		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+		enable_idle = 0;
+	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
+		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+	bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d",
+		enable_idle);
+
+	if (enable_idle)
+		bfq_mark_bfqq_idle_window(bfqq);
+	else
+		bfq_clear_bfqq_idle_window(bfqq);
+}
+
+/*
+ * Called when a new fs request (rq) is added to bfqq.  Check if there's
+ * something we should do about it.
+ */
+static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			    struct request *rq)
+{
+	struct bfq_io_cq *bic = RQ_BIC(rq);
+
+	if (rq->cmd_flags & REQ_META)
+		bfqq->meta_pending++;
+
+	bfq_update_io_thinktime(bfqd, bic);
+	bfq_update_io_seektime(bfqd, bfqq, rq);
+	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
+	    !BFQQ_SEEKY(bfqq))
+		bfq_update_idle_window(bfqd, bfqq, bic);
+
+	bfq_log_bfqq(bfqd, bfqq,
+		     "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
+		     bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq),
+		     (long long unsigned)bfqq->seek_mean);
+
+	bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
+
+	if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
+		int small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&
+				blk_rq_sectors(rq) < 32;
+		int budget_timeout = bfq_bfqq_budget_timeout(bfqq);
+
+		/*
+		 * There is just this request queued: if the request
+		 * is small and the queue is not to be expired, then
+		 * just exit.
+		 *
+		 * In this way, if the disk is being idled to wait for
+		 * a new request from the in-service queue, we avoid
+		 * unplugging the device and committing the disk to serve
+		 * just a small request. Instead, we wait for
+		 * the block layer to decide when to unplug the device:
+		 * hopefully, new requests will be merged to this one
+		 * quickly, then the device will be unplugged and
+		 * larger requests will be dispatched.
+		 */
+		if (small_req && !budget_timeout)
+			return;
+
+		/*
+		 * A large enough request arrived, or the queue is to
+		 * be expired: in both cases disk idling is to be
+		 * stopped, so clear wait_request flag and reset
+		 * timer.
+		 */
+		bfq_clear_bfqq_wait_request(bfqq);
+		del_timer(&bfqd->idle_slice_timer);
+
+		/*
+		 * The queue is not empty, because a new request just
+		 * arrived. Hence we can safely expire the queue, in
+		 * case of budget timeout, without risking that the
+		 * timestamps of the queue are not updated correctly.
+		 * See [1] for more details.
+		 */
+		if (budget_timeout)
+			bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
+
+		/*
+		 * Let the request rip immediately, or let a new queue be
+		 * selected if bfqq has just been expired.
+		 */
+		__blk_run_queue(bfqd->queue);
+	}
+}
+
+static void bfq_insert_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	assert_spin_locked(bfqd->queue->queue_lock);
+
+	bfq_init_prio_data(bfqq, RQ_BIC(rq));
+
+	bfq_add_request(rq);
+
+	rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
+	list_add_tail(&rq->queuelist, &bfqq->fifo);
+
+	bfq_rq_enqueued(bfqd, bfqq, rq);
+}
+
+static void bfq_update_hw_tag(struct bfq_data *bfqd)
+{
+	bfqd->max_rq_in_driver = max(bfqd->max_rq_in_driver,
+				     bfqd->rq_in_driver);
+
+	if (bfqd->hw_tag == 1)
+		return;
+
+	/*
+	 * This sample is valid if the number of outstanding requests
+	 * is large enough to allow a queueing behavior.  Note that the
+	 * sum is not exact, as it's not taking into account deactivated
+	 * requests.
+	 */
+	if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
+		return;
+
+	if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
+		return;
+
+	bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
+	bfqd->max_rq_in_driver = 0;
+	bfqd->hw_tag_samples = 0;
+}
+
+static void bfq_completed_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_data *bfqd = bfqq->bfqd;
+	bool sync = bfq_bfqq_sync(bfqq);
+
+	bfq_log_bfqq(bfqd, bfqq, "completed one req with %u sects left (%d)",
+		     blk_rq_sectors(rq), sync);
+
+	bfq_update_hw_tag(bfqd);
+
+	bfqd->rq_in_driver--;
+	bfqq->dispatched--;
+
+	if (sync) {
+		bfqd->sync_flight--;
+		RQ_BIC(rq)->ttime.last_end_request = jiffies;
+	}
+
+	/*
+	 * If this is the in-service queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+	if (bfqd->in_service_queue == bfqq) {
+		if (bfq_bfqq_budget_new(bfqq))
+			bfq_set_budget_timeout(bfqd);
+
+		if (bfq_bfqq_must_idle(bfqq)) {
+			bfq_arm_slice_timer(bfqd);
+			goto out;
+		} else if (bfq_may_expire_for_budg_timeout(bfqq))
+			bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
+		else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
+			 (bfqq->dispatched == 0 ||
+			  !bfq_bfqq_must_not_expire(bfqq)))
+			bfq_bfqq_expire(bfqd, bfqq, 0,
+					BFQ_BFQQ_NO_MORE_REQUESTS);
+	}
+
+	if (!bfqd->rq_in_driver)
+		bfq_schedule_dispatch(bfqd);
+
+out:
+	return;
+}
+
+static inline int __bfq_may_queue(struct bfq_queue *bfqq)
+{
+	if (bfq_bfqq_wait_request(bfqq) && bfq_bfqq_must_alloc(bfqq)) {
+		bfq_clear_bfqq_must_alloc(bfqq);
+		return ELV_MQUEUE_MUST;
+	}
+
+	return ELV_MQUEUE_MAY;
+}
+
+static int bfq_may_queue(struct request_queue *q, int rw)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct task_struct *tsk = current;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	/*
+	 * Don't force setup of a queue from here, as a call to may_queue
+	 * does not necessarily imply that a request actually will be
+	 * queued. So just look up a possibly existing queue, or return
+	 * 'may queue' if that fails.
+	 */
+	bic = bfq_bic_lookup(bfqd, tsk->io_context);
+	if (bic == NULL)
+		return ELV_MQUEUE_MAY;
+
+	bfqq = bic_to_bfqq(bic, rw_is_sync(rw));
+	if (bfqq != NULL) {
+		bfq_init_prio_data(bfqq, bic);
+
+		return __bfq_may_queue(bfqq);
+	}
+
+	return ELV_MQUEUE_MAY;
+}
+
+/*
+ * Queue lock held here.
+ */
+static void bfq_put_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	if (bfqq != NULL) {
+		const int rw = rq_data_dir(rq);
+
+		bfqq->allocated[rw]--;
+
+		rq->elv.priv[0] = NULL;
+		rq->elv.priv[1] = NULL;
+
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "put_request %p, %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		bfq_put_queue(bfqq);
+	}
+}
+
+/*
+ * Allocate bfq data structures associated with this request.
+ */
+static int bfq_set_request(struct request_queue *q, struct request *rq,
+			   struct bio *bio, gfp_t gfp_mask)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
+	const int rw = rq_data_dir(rq);
+	const int is_sync = rq_is_sync(rq);
+	struct bfq_queue *bfqq;
+	unsigned long flags;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+
+	bfq_changed_ioprio(bic);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	if (bic == NULL)
+		goto queue_fail;
+
+	bfqq = bic_to_bfqq(bic, is_sync);
+	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
+		bfqq = bfq_get_queue(bfqd, is_sync, bic, gfp_mask);
+		bic_set_bfqq(bic, bfqq, is_sync);
+	}
+
+	bfqq->allocated[rw]++;
+	atomic_inc(&bfqq->ref);
+	bfq_log_bfqq(bfqd, bfqq, "set_request: bfqq %p, %d", bfqq,
+		     atomic_read(&bfqq->ref));
+
+	rq->elv.priv[0] = bic;
+	rq->elv.priv[1] = bfqq;
+
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return 0;
+
+queue_fail:
+	bfq_schedule_dispatch(bfqd);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return 1;
+}
+
+static void bfq_kick_queue(struct work_struct *work)
+{
+	struct bfq_data *bfqd =
+		container_of(work, struct bfq_data, unplug_work);
+	struct request_queue *q = bfqd->queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+/*
+ * Handler of the expiration of the timer running if the in-service queue
+ * is idling inside its time slice.
+ */
+static void bfq_idle_slice_timer(unsigned long data)
+{
+	struct bfq_data *bfqd = (struct bfq_data *)data;
+	struct bfq_queue *bfqq;
+	unsigned long flags;
+	enum bfqq_expiration reason;
+
+	spin_lock_irqsave(bfqd->queue->queue_lock, flags);
+
+	bfqq = bfqd->in_service_queue;
+	/*
+	 * Theoretical race here: the in-service queue can be NULL or
+	 * Theoretical race here: the in-service queue can be NULL or
+	 * different from the queue that was idling if, while the timer
+	 * handler spins on the queue_lock, a new request arrives for the
+	 * current queue and a full dispatch cycle changes the in-service
+	 * queue.  This can hardly happen, but in the worst case we just
+	 * expire a queue too early.
+	 */
+	if (bfqq != NULL) {
+		bfq_log_bfqq(bfqd, bfqq, "slice_timer expired");
+		if (bfq_bfqq_budget_timeout(bfqq))
+			/*
+			 * Also here the queue can be safely expired
+			 * for budget timeout without wasting
+			 * guarantees
+			 */
+			reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+		else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0)
+			/*
+			 * The queue may not be empty upon timer expiration,
+			 * because we may not disable the timer when the
+			 * first request of the in-service queue arrives
+			 * during disk idling.
+			 */
+			reason = BFQ_BFQQ_TOO_IDLE;
+		else
+			goto schedule_dispatch;
+
+		bfq_bfqq_expire(bfqd, bfqq, 1, reason);
+	}
+
+schedule_dispatch:
+	bfq_schedule_dispatch(bfqd);
+
+	spin_unlock_irqrestore(bfqd->queue->queue_lock, flags);
+}
+
+static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
+{
+	del_timer_sync(&bfqd->idle_slice_timer);
+	cancel_work_sync(&bfqd->unplug_work);
+}
+
+static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
+					struct bfq_queue **bfqq_ptr)
+{
+	struct bfq_queue *bfqq = *bfqq_ptr;
+
+	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
+	if (bfqq != NULL) {
+		bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		bfq_put_queue(bfqq);
+		*bfqq_ptr = NULL;
+	}
+}
+
+/*
+ * Release the extra reference of the async queues as the device
+ * goes away.
+ */
+static void bfq_put_async_queues(struct bfq_data *bfqd)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			__bfq_put_async_bfqq(bfqd, &async_bfqq[i][j]);
+
+	__bfq_put_async_bfqq(bfqd, &async_idle_bfqq);
+}
+
+static void bfq_exit_queue(struct elevator_queue *e)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	struct request_queue *q = bfqd->queue;
+	struct bfq_queue *bfqq, *n;
+
+	bfq_shutdown_timer_wq(bfqd);
+
+	spin_lock_irq(q->queue_lock);
+
+	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
+		bfq_deactivate_bfqq(bfqd, bfqq, 0);
+
+	bfq_put_async_queues(bfqd);
+	spin_unlock_irq(q->queue_lock);
+
+	bfq_shutdown_timer_wq(bfqd);
+
+	synchronize_rcu();
+
+	kfree(bfqd);
+}
+
+static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+{
+	struct bfq_data *bfqd;
+	struct elevator_queue *eq;
+
+	eq = elevator_alloc(q, e);
+	if (eq == NULL)
+		return -ENOMEM;
+
+	bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
+	if (bfqd == NULL) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+	eq->elevator_data = bfqd;
+
+	/*
+	 * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
+	 * Grab a permanent reference to it, so that the normal code flow
+	 * will not attempt to free it.
+	 */
+	bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, 1, 0);
+	atomic_inc(&bfqd->oom_bfqq.ref);
+
+	bfqd->queue = q;
+
+	spin_lock_irq(q->queue_lock);
+	q->elevator = eq;
+	spin_unlock_irq(q->queue_lock);
+
+	init_timer(&bfqd->idle_slice_timer);
+	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
+	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
+
+	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
+
+	INIT_LIST_HEAD(&bfqd->active_list);
+	INIT_LIST_HEAD(&bfqd->idle_list);
+
+	bfqd->hw_tag = -1;
+
+	bfqd->bfq_max_budget = bfq_default_max_budget;
+
+	bfqd->bfq_quantum = bfq_quantum;
+	bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0];
+	bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1];
+	bfqd->bfq_back_max = bfq_back_max;
+	bfqd->bfq_back_penalty = bfq_back_penalty;
+	bfqd->bfq_slice_idle = bfq_slice_idle;
+	bfqd->bfq_class_idle_last_service = 0;
+	bfqd->bfq_max_budget_async_rq = bfq_max_budget_async_rq;
+	bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async;
+	bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync;
+
+	return 0;
+}
+
+static void bfq_slab_kill(void)
+{
+	if (bfq_pool != NULL)
+		kmem_cache_destroy(bfq_pool);
+}
+
+static int __init bfq_slab_setup(void)
+{
+	bfq_pool = KMEM_CACHE(bfq_queue, 0);
+	if (bfq_pool == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+static ssize_t bfq_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t bfq_var_store(unsigned long *var, const char *page,
+			     size_t count)
+{
+	unsigned long new_val;
+	int ret = kstrtoul(page, 10, &new_val);
+
+	if (ret == 0)
+		*var = new_val;
+
+	return count;
+}
+
+static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
+{
+	struct bfq_queue *bfqq;
+	struct bfq_data *bfqd = e->elevator_data;
+	ssize_t num_char = 0;
+
+	num_char += sprintf(page + num_char, "Tot reqs queued %d\n\n",
+			    bfqd->queued);
+
+	spin_lock_irq(bfqd->queue->queue_lock);
+
+	num_char += sprintf(page + num_char, "Active:\n");
+	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
+	  num_char += sprintf(page + num_char,
+			      "pid%d: weight %hu, nr_queued %d %d\n",
+			      bfqq->pid,
+			      bfqq->entity.weight,
+			      bfqq->queued[0],
+			      bfqq->queued[1]);
+	}
+
+	num_char += sprintf(page + num_char, "Idle:\n");
+	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
+			num_char += sprintf(page + num_char,
+				"pid%d: weight %hu\n",
+				bfqq->pid,
+				bfqq->entity.weight);
+	}
+
+	spin_unlock_irq(bfqd->queue->queue_lock);
+
+	return num_char;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct bfq_data *bfqd = e->elevator_data;			\
+	unsigned int __data = __VAR;					\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return bfq_var_show(__data, (page));				\
+}
+SHOW_FUNCTION(bfq_quantum_show, bfqd->bfq_quantum, 0);
+SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 1);
+SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 1);
+SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0);
+SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0);
+SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 1);
+SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
+SHOW_FUNCTION(bfq_max_budget_async_rq_show,
+	      bfqd->bfq_max_budget_async_rq, 0);
+SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
+SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+static ssize_t								\
+__FUNC(struct elevator_queue *e, const char *page, size_t count)	\
+{									\
+	struct bfq_data *bfqd = e->elevator_data;			\
+	unsigned long uninitialized_var(__data);			\
+	int ret = bfq_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(bfq_quantum_store, &bfqd->bfq_quantum, 1, INT_MAX, 0);
+STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
+STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
+		INT_MAX, 0);
+STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq,
+		1, INT_MAX, 0);
+STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
+		INT_MAX, 1);
+#undef STORE_FUNCTION
+
+/* do nothing for the moment */
+static ssize_t bfq_weights_store(struct elevator_queue *e,
+				    const char *page, size_t count)
+{
+	return count;
+}
+
+static inline unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd)
+{
+	u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
+
+	if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES)
+		return bfq_calc_max_budget(bfqd->peak_rate, timeout);
+	else
+		return bfq_default_max_budget;
+}
+
+static ssize_t bfq_max_budget_store(struct elevator_queue *e,
+				    const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data == 0)
+		bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+	else {
+		if (__data > INT_MAX)
+			__data = INT_MAX;
+		bfqd->bfq_max_budget = __data;
+	}
+
+	bfqd->bfq_user_max_budget = __data;
+
+	return ret;
+}
+
+static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
+				      const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data < 1)
+		__data = 1;
+	else if (__data > INT_MAX)
+		__data = INT_MAX;
+
+	bfqd->bfq_timeout[BLK_RW_SYNC] = msecs_to_jiffies(__data);
+	if (bfqd->bfq_user_max_budget == 0)
+		bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+
+	return ret;
+}
+
+#define BFQ_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
+
+static struct elv_fs_entry bfq_attrs[] = {
+	BFQ_ATTR(quantum),
+	BFQ_ATTR(fifo_expire_sync),
+	BFQ_ATTR(fifo_expire_async),
+	BFQ_ATTR(back_seek_max),
+	BFQ_ATTR(back_seek_penalty),
+	BFQ_ATTR(slice_idle),
+	BFQ_ATTR(max_budget),
+	BFQ_ATTR(max_budget_async_rq),
+	BFQ_ATTR(timeout_sync),
+	BFQ_ATTR(timeout_async),
+	BFQ_ATTR(weights),
+	__ATTR_NULL
+};
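+
+/*
+ * These tunables are typically exposed under
+ * /sys/block/<dev>/queue/iosched/ while bfq is the active elevator.
+ * The time-based ones (fifo_expire_*, slice_idle, timeout_*) are read
+ * and written in milliseconds and converted to jiffies internally;
+ * writing 0 to max_budget re-enables budget autotuning, as done in
+ * bfq_max_budget_store() above.
+ */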
+
+static struct elevator_type iosched_bfq = {
+	.ops = {
+		.elevator_merge_fn =		bfq_merge,
+		.elevator_merged_fn =		bfq_merged_request,
+		.elevator_merge_req_fn =	bfq_merged_requests,
+		.elevator_allow_merge_fn =	bfq_allow_merge,
+		.elevator_dispatch_fn =		bfq_dispatch_requests,
+		.elevator_add_req_fn =		bfq_insert_request,
+		.elevator_activate_req_fn =	bfq_activate_request,
+		.elevator_deactivate_req_fn =	bfq_deactivate_request,
+		.elevator_completed_req_fn =	bfq_completed_request,
+		.elevator_former_req_fn =	elv_rb_former_request,
+		.elevator_latter_req_fn =	elv_rb_latter_request,
+		.elevator_init_icq_fn =		bfq_init_icq,
+		.elevator_exit_icq_fn =		bfq_exit_icq,
+		.elevator_set_req_fn =		bfq_set_request,
+		.elevator_put_req_fn =		bfq_put_request,
+		.elevator_may_queue_fn =	bfq_may_queue,
+		.elevator_init_fn =		bfq_init_queue,
+		.elevator_exit_fn =		bfq_exit_queue,
+	},
+	.icq_size =		sizeof(struct bfq_io_cq),
+	.icq_align =		__alignof__(struct bfq_io_cq),
+	.elevator_attrs =	bfq_attrs,
+	.elevator_name =	"bfq",
+	.elevator_owner =	THIS_MODULE,
+};
+
+static int __init bfq_init(void)
+{
+	/*
+	 * Can be 0 on HZ < 1000 setups.
+	 */
+	if (bfq_slice_idle == 0)
+		bfq_slice_idle = 1;
+
+	if (bfq_timeout_async == 0)
+		bfq_timeout_async = 1;
+
+	if (bfq_slab_setup())
+		return -ENOMEM;
+
+	elv_register(&iosched_bfq);
+	pr_info("BFQ I/O-scheduler version: v0\n");
+
+	return 0;
+}
+
+static void __exit bfq_exit(void)
+{
+	elv_unregister(&iosched_bfq);
+	bfq_slab_kill();
+}
+
+module_init(bfq_init);
+module_exit(bfq_exit);
+
+MODULE_AUTHOR("Fabio Checconi, Paolo Valente");
+MODULE_LICENSE("GPL");
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
new file mode 100644
index 0000000..a9142f5
--- /dev/null
+++ b/block/bfq-sched.c
@@ -0,0 +1,936 @@
+/*
+ * BFQ: Hierarchical B-WF2Q+ scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ */
+
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+	for (parent = NULL; entity != NULL; entity = parent)
+
+static inline int bfq_update_next_in_service(struct bfq_sched_data *sd)
+{
+	return 0;
+}
+
+static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
+					     struct bfq_entity *entity)
+{
+}
+
+static inline void bfq_update_budget(struct bfq_entity *next_in_service)
+{
+}
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time
+ * wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(u64 a, u64 b)
+{
+	return (s64)(a - b) > 0;
+}
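+
+/*
+ * Example of the wraparound handling in bfq_gt() above: with a = 10 and
+ * b = (u64)-16 (a timestamp just about to wrap), a - b overflows to 26,
+ * which is positive when reinterpreted as s64, so a is correctly
+ * considered the later timestamp even though it is numerically smaller.
+ */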
+
+static inline struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = NULL;
+
+	if (entity->my_sched_data == NULL)
+		bfqq = container_of(entity, struct bfq_queue, entity);
+
+	return bfqq;
+}
+
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor (weight of an entity or weight sum).
+ */
+static inline u64 bfq_delta(unsigned long service,
+					unsigned long weight)
+{
+	u64 d = (u64)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
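+
+/*
+ * Worked example for bfq_delta() above (illustrative numbers): charging
+ * 1024 sectors of service to an entity of weight 4 advances its virtual
+ * time by (1024 << 22) / 4 = 2^30, while the same service charged to a
+ * weight-1 entity advances it by 2^32, four times as much.  Entities
+ * with higher weight thus consume virtual time more slowly and get a
+ * proportionally larger share of the device throughput.
+ */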
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct bfq_entity *entity,
+				   unsigned long service)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	entity->finish = entity->start +
+		bfq_delta(service, entity->weight);
+
+	if (bfqq != NULL) {
+		bfq_log_bfqq(bfqq->bfqd, bfqq,
+			"calc_finish: serv %lu, w %d",
+			service, entity->weight);
+		bfq_log_bfqq(bfqq->bfqd, bfqq,
+			"calc_finish: start %llu, finish %llu, delta %llu",
+			entity->start, entity->finish,
+			bfq_delta(service, entity->weight));
+	}
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the corresponding entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct bfq_entity *bfq_entity_of(struct rb_node *node)
+{
+	struct bfq_entity *entity = NULL;
+
+	if (node != NULL)
+		entity = rb_entry(node, struct bfq_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_extract(struct rb_root *root,
+			       struct bfq_entity *entity)
+{
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct bfq_service_tree *st,
+			     struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *next;
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = bfq_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = bfq_entity_of(next);
+	}
+
+	bfq_extract(&st->idle, entity);
+
+	if (bfqq != NULL)
+		list_del(&bfqq->bfqq_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct bfq_entity *entity)
+{
+	struct bfq_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct bfq_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of an entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function  assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct bfq_entity *entity,
+				  struct rb_node *node)
+{
+	struct bfq_entity *child;
+
+	if (node != NULL) {
+		child = rb_entry(node, struct bfq_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed its position, or one of its children may have
+ * moved; this function updates its min_start value.  The left and right
+ * subtrees are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+	struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (parent == NULL)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right != NULL)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left != NULL)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its
+ *                     group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * in each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct bfq_service_tree *st,
+			      struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
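+	/*
+	 * Start the min_start update from the deepest node whose value
+	 * may be stale after the insertion and the rb-tree rebalancing.
+	 */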
+	if (node->rb_left != NULL)
+		node = node->rb_left;
+	else if (node->rb_right != NULL)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+
+	if (bfqq != NULL)
+		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static inline unsigned short bfq_ioprio_to_weight(int ioprio)
+{
+	return IOPRIO_BE_NR - ioprio;
+}
+
+/**
+ * bfq_weight_to_ioprio - calc an ioprio from a weight.
+ * @weight: the weight value to convert.
+ *
+ * To preserve as much as possible the old only-ioprio user interface,
+ * 0 is used as an escape ioprio value for weights (numerically) equal to
+ * or larger than IOPRIO_BE_NR.
+ */
+static inline unsigned short bfq_weight_to_ioprio(int weight)
+{
+	return IOPRIO_BE_NR - weight < 0 ? 0 : IOPRIO_BE_NR - weight;
+}
+
+static inline void bfq_get_entity(struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	if (bfqq != NULL) {
+		atomic_inc(&bfqq->ref);
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
+			     bfqq, atomic_read(&bfqq->ref));
+	}
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (node->rb_right == NULL && node->rb_left == NULL)
+		deepest = rb_parent(node);
+	else if (node->rb_right == NULL)
+		deepest = node->rb_left;
+	else if (node->rb_left == NULL)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right != NULL)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct bfq_service_tree *st,
+			       struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_extract(&st->active, entity);
+
+	if (node != NULL)
+		bfq_update_active_tree(node);
+
+	if (bfqq != NULL)
+		list_del(&bfqq->bfqq_list);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct bfq_service_tree *st,
+			    struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct bfq_entity *first_idle = st->first_idle;
+	struct bfq_entity *last_idle = st->last_idle;
+
+	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+
+	if (bfqq != NULL)
+		list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, dropping
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct bfq_service_tree *st,
+			      struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct bfq_sched_data *sd;
+
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	if (bfqq != NULL) {
+		sd = entity->sched_data;
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "forget_entity: %p %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		bfq_put_queue(bfqq);
+	}
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+static void bfq_put_idle_entity(struct bfq_service_tree *st,
+				struct bfq_entity *entity)
+{
+	bfq_idle_extract(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+static void bfq_forget_idle(struct bfq_service_tree *st)
+{
+	struct bfq_entity *first_idle = st->first_idle;
+	struct bfq_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * Forget the whole idle tree, increasing the vtime past
+		 * the last finish time of idle entities.
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+static struct bfq_service_tree *
+__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
+			 struct bfq_entity *entity)
+{
+	struct bfq_service_tree *new_st = old_st;
+
+	if (entity->ioprio_changed) {
+		old_st->wsum -= entity->weight;
+
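+		/*
+		 * A pending weight change takes precedence over a pending
+		 * ioprio change; if neither is pending, simply recompute
+		 * the weight from the current ioprio.
+		 */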
+		if (entity->new_weight != entity->orig_weight) {
+			entity->orig_weight = entity->new_weight;
+			entity->ioprio =
+				bfq_weight_to_ioprio(entity->orig_weight);
+		} else if (entity->new_ioprio != entity->ioprio) {
+			entity->ioprio = entity->new_ioprio;
+			entity->orig_weight =
+					bfq_ioprio_to_weight(entity->ioprio);
+		} else
+			entity->new_weight = entity->orig_weight =
+				bfq_ioprio_to_weight(entity->ioprio);
+
+		entity->ioprio_class = entity->new_ioprio_class;
+		entity->ioprio_changed = 0;
+
+		/*
		 * NOTE: here we may be changing the weight too early;
+		 * this will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = bfq_entity_service_tree(entity);
+		entity->weight = entity->orig_weight;
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * bfq_bfqq_served - update the scheduler status after selection for
+ *                   service.
+ * @bfqq: the queue being served.
+ * @served: bytes to transfer.
+ *
+ * NOTE: this can be optimized, as the timestamps of upper level entities
+ * are synchronized every time a new bfqq is selected for service.  For
+ * now, we keep it this way to better check consistency.
+ */
+static void bfq_bfqq_served(struct bfq_queue *bfqq, unsigned long served)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	struct bfq_service_tree *st;
+
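+	/*
+	 * Propagate the service upwards: each entity on the path to the
+	 * root accounts for the service received, and the vtime of its
+	 * service tree advances by served/wsum.
+	 */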
+	for_each_entity(entity) {
+		st = bfq_entity_service_tree(entity);
+
+		entity->service += served;
+
+		st->vtime += bfq_delta(served, st->wsum);
+		bfq_forget_idle(st);
+	}
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %lu secs", served);
+}
+
+/**
+ * bfq_bfqq_charge_full_budget - set the service to the entity budget.
+ * @bfqq: the queue that needs a service update.
+ *
+ * When it's not possible to be fair in the service domain, because
+ * a queue is not consuming its budget fast enough (the meaning of
+ * fast depends on the timeout parameter), we charge it a full
+ * budget.  In this way we should obtain a sort of time-domain
+ * fairness among all the seeky/slow queues.
+ */
+static inline void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget");
+
+	bfq_bfqq_served(bfqq, entity->budget - entity->service);
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the current budget of the entity (and the
+ * service received, if @entity is in service) to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sd = entity->sched_data;
+	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+
+	if (entity == sd->in_service_entity) {
+		/*
+		 * If we are requeueing the current entity, we must
+		 * take care not to charge it for service it has not
+		 * received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->in_service_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_in_service entity below it.  We reuse the
+		 * old start time.
+		 */
+		bfq_active_extract(st, entity);
+	} else if (entity->tree == &st->idle) {
+		/*
+		 * Must be on the idle tree; bfq_idle_extract() will
+		 * check for that.
+		 */
+		bfq_idle_extract(st, entity);
+		entity->start = bfq_gt(st->vtime, entity->finish) ?
+				       st->vtime : entity->finish;
+	} else {
+		/*
+		 * The finish time of the entity may be invalid, and
+		 * it is in the past for sure, otherwise the queue
+		 * would have been on the idle tree.
+		 */
+		entity->start = st->vtime;
+		st->wsum += entity->weight;
+		bfq_get_entity(entity);
+
+		entity->on_st = 1;
+	}
+
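+	/*
+	 * Apply any pending weight/ioprio change, compute the new finish
+	 * time from the current budget, and insert into the active tree.
+	 */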
+	st = __bfq_entity_update_weight_prio(st, entity);
+	bfq_calc_finish(entity, entity->budget);
+	bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
+ * @entity: the entity to activate.
+ *
+ * Activate @entity and all the entities on the path from it to the root.
+ */
+static void bfq_activate_entity(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sd;
+
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_in_service(sd))
+			/*
+			 * No need to propagate the activation to the
+			 * upper entities, as they will be updated when
+			 * the in-service entity is rescheduled.
+			 */
+			break;
+	}
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently of its previous state.  If the
+ * entity was not on a service tree, just return; otherwise, if it is on
+ * any scheduler tree, extract it from that tree and, if necessary and
+ * if the caller specified @requeue, put it on the idle tree.
+ *
+ * Return %1 if the caller should update the entity hierarchy, i.e.,
+ * if the entity was in service or if it was the next_in_service for
+ * its sched_data; return %0 otherwise.
+ */
+static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+{
+	struct bfq_sched_data *sd = entity->sched_data;
+	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+	int was_in_service = entity == sd->in_service_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	if (was_in_service) {
+		bfq_calc_finish(entity, entity->service);
+		sd->in_service_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_extract(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_extract(st, entity);
+
+	if (was_in_service || sd->next_in_service == entity)
+		ret = bfq_update_next_in_service(sd);
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+{
+	struct bfq_sched_data *sd;
+	struct bfq_entity *parent;
+
+	for_each_entity_safe(entity, parent) {
+		sd = entity->sched_data;
+
+		if (!__bfq_deactivate_entity(entity, requeue))
+			/*
+			 * The parent entity is still backlogged, and
+			 * we don't need to update it as it is still
+			 * in service.
+			 */
+			break;
+
+		if (sd->next_in_service != NULL)
+			/*
+			 * The parent entity is still backlogged and
+			 * the budgets on the path towards the root
+			 * need to be updated.
+			 */
+			goto update;
+
+		/*
+	 * If we get here, the parent is no longer backlogged and
+		 * we want to propagate the dequeue upwards.
+		 */
+		requeue = 1;
+	}
+
+	return;
+
+update:
+	entity = parent;
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_in_service(sd))
+			break;
+	}
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often;
+ * we may end up with reactivated processes getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
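+	/*
+	 * The root's min_start is the minimum start time over the whole
+	 * active tree: if it is ahead of vtime, jump vtime forward to it.
+	 */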
+	entry = rb_entry(node, struct bfq_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active_entity - find the eligible entity with
+ *                           the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches for the first schedulable entity, starting from
+ * the root of the tree and going to the left whenever on that side there
+ * is a subtree with at least one eligible (start <= vtime) entity. The path on
+ * the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
+	while (node != NULL) {
+		entry = rb_entry(node, struct bfq_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		if (node->rb_left != NULL) {
+			entry = rb_entry(node->rb_left,
+					 struct bfq_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first != NULL)
+			break;
+		node = node->rb_right;
+	}
+
+	return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
+						   bool force)
+{
+	struct bfq_entity *entity, *new_next_in_service = NULL;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+
+	/*
+	 * If the chosen entity does not match the sched_data's
+	 * next_in_service and we are forcibly serving the IDLE priority
+	 * class tree, bubble up the budget update.
+	 */
+	if (unlikely(force && entity != entity->sched_data->next_in_service)) {
+		new_next_in_service = entity;
+		for_each_entity(new_next_in_service)
+			bfq_update_budget(new_next_in_service);
+	}
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true, the returned entity will also be extracted from @sd.
+ *
+ * NOTE: since we cache the next_in_service entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort by just returning the cached next_in_service
+ * value; we prefer to do full lookups to test the consistency of the
+ * data structures.
+ */
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
+						 int extract,
+						 struct bfq_data *bfqd)
+{
+	struct bfq_service_tree *st = sd->service_tree;
+	struct bfq_entity *entity;
+	int i = 0;
+
+	if (bfqd != NULL &&
+	    jiffies - bfqd->bfq_class_idle_last_service > BFQ_CL_IDLE_TIMEOUT) {
+		entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
+						  true);
+		if (entity != NULL) {
+			i = BFQ_IOPRIO_CLASSES - 1;
+			bfqd->bfq_class_idle_last_service = jiffies;
+			sd->next_in_service = entity;
+		}
+	}
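+	/*
+	 * Look for the next entity to serve, scanning the service trees
+	 * in priority order (RT, BE, IDLE).
+	 */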
+	for (; i < BFQ_IOPRIO_CLASSES; i++) {
+		entity = __bfq_lookup_next_entity(st + i, false);
+		if (entity != NULL) {
+			if (extract) {
+				bfq_check_next_in_service(sd, entity);
+				bfq_active_extract(st + i, entity);
+				sd->in_service_entity = entity;
+				sd->next_in_service = NULL;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+/*
+ * Get next queue for service.
+ */
+static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+{
+	struct bfq_entity *entity = NULL;
+	struct bfq_sched_data *sd;
+	struct bfq_queue *bfqq;
+
+	if (bfqd->busy_queues == 0)
+		return NULL;
+
+	sd = &bfqd->sched_data;
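+	/*
+	 * Walk down the hierarchy, picking (and extracting) the next
+	 * entity to serve at each level, until a leaf bfq_queue is reached.
+	 */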
+	for (; sd != NULL; sd = entity->my_sched_data) {
+		entity = bfq_lookup_next_entity(sd, 1, bfqd);
+		entity->service = 0;
+	}
+
+	bfqq = bfq_entity_to_bfqq(entity);
+
+	return bfqq;
+}
+
+static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+{
+	if (bfqd->in_service_bic != NULL) {
+		put_io_context(bfqd->in_service_bic->icq.ioc);
+		bfqd->in_service_bic = NULL;
+	}
+
+	bfqd->in_service_queue = NULL;
+	del_timer(&bfqd->idle_slice_timer);
+}
+
+static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				int requeue)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	if (bfqq == bfqd->in_service_queue)
+		__bfq_bfqd_reset_in_service(bfqd);
+
+	bfq_deactivate_entity(entity, requeue);
+}
+
+static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_activate_entity(entity);
+}
+
+/*
+ * Called when the bfqq no longer has requests pending; remove it from
+ * the service tree.
+ */
+static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			      int requeue)
+{
+	bfq_log_bfqq(bfqd, bfqq, "del from busy");
+
+	bfq_clear_bfqq_busy(bfqq);
+
+	bfqd->busy_queues--;
+
+	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
+}
+
+/*
+ * Called when an inactive queue receives a new request.
+ */
+static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqd, bfqq, "add to busy");
+
+	bfq_activate_bfqq(bfqd, bfqq);
+
+	bfq_mark_bfqq_busy(bfqq);
+	bfqd->busy_queues++;
+}
diff --git a/block/bfq.h b/block/bfq.h
new file mode 100644
index 0000000..bd146b6
--- /dev/null
+++ b/block/bfq.h
@@ -0,0 +1,467 @@
+/*
+ * BFQ-v0 for 3.15.0: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ */
+
+#ifndef _BFQ_H
+#define _BFQ_H
+
+#include <linux/blktrace_api.h>
+#include <linux/hrtimer.h>
+#include <linux/ioprio.h>
+#include <linux/rbtree.h>
+
+#define BFQ_IOPRIO_CLASSES	3
+#define BFQ_CL_IDLE_TIMEOUT	(HZ/5)
+
+#define BFQ_MIN_WEIGHT	1
+#define BFQ_MAX_WEIGHT	1000
+
+#define BFQ_DEFAULT_GRP_WEIGHT	10
+#define BFQ_DEFAULT_GRP_IOPRIO	0
+#define BFQ_DEFAULT_GRP_CLASS	IOPRIO_CLASS_BE
+
+struct bfq_entity;
+
+/**
+ * struct bfq_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * bfq_service_tree.  All the fields are protected by the queue lock
+ * of the containing bfqd.
+ */
+struct bfq_service_tree {
+	struct rb_root active;
+	struct rb_root idle;
+
+	struct bfq_entity *first_idle;
+	struct bfq_entity *last_idle;
+
+	u64 vtime;
+	unsigned long wsum;
+};
+
+/**
+ * struct bfq_sched_data - multi-class scheduler.
+ * @in_service_entity: entity in service.
+ * @next_in_service: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * bfq_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue in a hierarchical setup.
+ * @next_in_service points to the active entity of the sched_data
+ * service trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among queues of the same
+ * class, requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct bfq_sched_data {
+	struct bfq_entity *in_service_entity;
+	struct bfq_entity *next_in_service;
+	struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
+};
+
+/**
+ * struct bfq_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ *         the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ *             this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue.
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the hierarchy, the
+ *                 associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_weight: when a weight change is requested, the new weight value.
+ * @orig_weight: original weight, used to implement weight boosting.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value.
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ *                    ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested a weight, ioprio or
+ *                  ioprio_class change.
+ *
+ * A bfq_entity is used to represent a bfq_queue (leaf node in the upper
+ * level scheduler). Each entity belongs to the sched_data of the parent
+ * group hierarchy. Non-leaf entities have also their own sched_data,
+ * stored in @my_sched_data.
+ *
+ * Each entity independently stores its priority values; this would
+ * allow different weights on different devices, but this
+ * functionality is not exported to userspace for now.  Priorities and
+ * weights are updated lazily, first storing the new values into the
+ * new_* fields, then setting the @ioprio_changed flag.  As soon as
+ * there is a transition in the entity state that allows the priority
+ * update to take place, the effective and the requested priority
+ * values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.  When dealing with ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time consuming their budget
+ * and have true sequential behavior, and when there are no external
+ * factors breaking anticipation) the relative weights at each level
+ * of the hierarchy should be guaranteed.  All the fields are
+ * protected by the queue lock of the containing bfqd.
+ */
+struct bfq_entity {
+	struct rb_node rb_node;
+
+	int on_st;
+
+	u64 finish;
+	u64 start;
+
+	struct rb_root *tree;
+
+	u64 min_start;
+
+	unsigned long service, budget;
+	unsigned short weight, new_weight;
+	unsigned short orig_weight;
+
+	struct bfq_entity *parent;
+
+	struct bfq_sched_data *my_sched_data;
+	struct bfq_sched_data *sched_data;
+
+	unsigned short ioprio, new_ioprio;
+	unsigned short ioprio_class, new_ioprio_class;
+
+	int ioprio_changed;
+};
+
+/**
+ * struct bfq_queue - leaf schedulable entity.
+ * @ref: reference counter.
+ * @bfqd: parent bfq_data.
+ * @sort_list: sorted list of pending requests.
+ * @next_rq: if fifo isn't expired, next request to serve.
+ * @queued: nr of requests queued in @sort_list.
+ * @allocated: currently allocated requests.
+ * @meta_pending: pending metadata requests.
+ * @fifo: fifo list of requests in sort_list.
+ * @entity: entity representing this queue in the scheduler.
+ * @max_budget: maximum budget allowed from the feedback mechanism.
+ * @budget_timeout: budget expiration (in jiffies).
+ * @dispatched: number of requests on the dispatch list or inside driver.
+ * @flags: status flags.
+ * @bfqq_list: node for active/idle bfqq list inside our bfqd.
+ * @seek_samples: number of seeks sampled
+ * @seek_total: sum of the distances of the seeks sampled
+ * @seek_mean: mean seek distance
+ * @last_request_pos: position of the last request enqueued
+ * @pid: pid of the process owning the queue, used for logging purposes.
+ *
+ * A bfq_queue is a leaf request queue; it can be associated with one
+ * or more io_contexts (more than one only if it is async).
+ */
+struct bfq_queue {
+	atomic_t ref;
+	struct bfq_data *bfqd;
+
+	struct rb_root sort_list;
+	struct request *next_rq;
+	int queued[2];
+	int allocated[2];
+	int meta_pending;
+	struct list_head fifo;
+
+	struct bfq_entity entity;
+
+	unsigned long max_budget;
+	unsigned long budget_timeout;
+
+	int dispatched;
+
+	unsigned int flags;
+
+	struct list_head bfqq_list;
+
+	unsigned int seek_samples;
+	u64 seek_total;
+	sector_t seek_mean;
+	sector_t last_request_pos;
+
+	pid_t pid;
+};
+
+/**
+ * struct bfq_ttime - per process thinktime stats.
+ * @last_end_request: completion time of the last request.
+ * @ttime_total: total process thinktime.
+ * @ttime_samples: number of thinktime samples.
+ * @ttime_mean: average process thinktime.
+ */
+struct bfq_ttime {
+	unsigned long last_end_request;
+
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+};
+
+/**
+ * struct bfq_io_cq - per (request_queue, io_context) structure.
+ * @icq: associated io_cq structure
+ * @bfqq: array of two process queues, the sync and the async
+ * @ttime: associated @bfq_ttime struct
+ */
+struct bfq_io_cq {
+	struct io_cq icq; /* must be the first member */
+	struct bfq_queue *bfqq[2];
+	struct bfq_ttime ttime;
+	int ioprio;
+};
+
+enum bfq_device_speed {
+	BFQ_BFQD_FAST,
+	BFQ_BFQD_SLOW,
+};
+
+/**
+ * struct bfq_data - per device data structure.
+ * @queue: request queue for the managed device.
+ * @sched_data: root @bfq_sched_data for the device.
+ * @busy_queues: number of bfq_queues containing requests (including the
+ *		 queue in service, even if it is idling).
+ * @queued: number of queued requests.
+ * @rq_in_driver: number of requests dispatched and waiting for completion.
+ * @sync_flight: number of sync requests in the driver.
+ * @max_rq_in_driver: max number of reqs in driver in the last
+ *                    @hw_tag_samples completed requests.
+ * @hw_tag_samples: nr of samples used to calculate hw_tag.
+ * @hw_tag: flag set to one if the driver is showing a queueing behavior.
+ * @budgets_assigned: number of budgets assigned.
+ * @idle_slice_timer: timer set when idling for the next sequential request
+ *                    from the queue in service.
+ * @unplug_work: delayed work to restart dispatching on the request queue.
+ * @in_service_queue: bfq_queue in service.
+ * @in_service_bic: bfq_io_cq (bic) associated with the @in_service_queue.
+ * @last_position: on-disk position of the last served request.
+ * @last_budget_start: beginning of the last budget.
+ * @last_idling_start: beginning of the last idle slice.
+ * @peak_rate: peak transfer rate observed for a budget.
+ * @peak_rate_samples: number of samples used to calculate @peak_rate.
+ * @bfq_max_budget: maximum budget allotted to a bfq_queue before
+ *                  rescheduling.
+ * @active_list: list of all the bfq_queues active on the device.
+ * @idle_list: list of all the bfq_queues idle on the device.
+ * @bfq_quantum: max number of requests dispatched per dispatch round.
+ * @bfq_fifo_expire: timeout for async/sync requests; when it expires
+ *                   requests are served in fifo order.
+ * @bfq_back_penalty: weight of backward seeks wrt forward ones.
+ * @bfq_back_max: maximum allowed backward seek.
+ * @bfq_slice_idle: maximum idling time.
+ * @bfq_user_max_budget: user-configured max budget value
+ *                       (0 for auto-tuning).
+ * @bfq_max_budget_async_rq: maximum budget (in nr of requests) allotted to
+ *                           async queues.
+ * @bfq_timeout: timeout for bfq_queues to consume their budget; used
+ *               to prevent seeky queues from imposing long latencies
+ *               on well-behaved ones (this also implies that seeky
+ *               queues cannot receive guarantees in the service domain;
+ *               after a timeout they are charged for the whole allocated
+ *               budget, to try to preserve a behavior reasonably fair
+ *               among them, but without service-domain guarantees).
+ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
+ *
+ * All the fields are protected by the @queue lock.
+ */
+struct bfq_data {
+	struct request_queue *queue;
+
+	struct bfq_sched_data sched_data;
+
+	int busy_queues;
+	int queued;
+	int rq_in_driver;
+	int sync_flight;
+
+	int max_rq_in_driver;
+	int hw_tag_samples;
+	int hw_tag;
+
+	int budgets_assigned;
+
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
+	struct bfq_queue *in_service_queue;
+	struct bfq_io_cq *in_service_bic;
+
+	sector_t last_position;
+
+	ktime_t last_budget_start;
+	ktime_t last_idling_start;
+	int peak_rate_samples;
+	u64 peak_rate;
+	unsigned long bfq_max_budget;
+
+	struct list_head active_list;
+	struct list_head idle_list;
+
+	unsigned int bfq_quantum;
+	unsigned int bfq_fifo_expire[2];
+	unsigned int bfq_back_penalty;
+	unsigned int bfq_back_max;
+	unsigned int bfq_slice_idle;
+	u64 bfq_class_idle_last_service;
+
+	unsigned int bfq_user_max_budget;
+	unsigned int bfq_max_budget_async_rq;
+	unsigned int bfq_timeout[2];
+
+	struct bfq_queue oom_bfqq;
+};
+
+enum bfqq_state_flags {
+	BFQ_BFQQ_FLAG_busy = 0,		/* has requests or is in service */
+	BFQ_BFQQ_FLAG_wait_request,	/* waiting for a request */
+	BFQ_BFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
+	BFQ_BFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
+	BFQ_BFQQ_FLAG_idle_window,	/* slice idling enabled */
+	BFQ_BFQQ_FLAG_prio_changed,	/* task priority has changed */
+	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */
+	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this budget */
+};
+
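+/* Generate the helpers that set, clear and test the above queue flags. */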
+#define BFQ_BFQQ_FNS(name)						\
+static inline void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)		\
+{									\
+	(bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name);			\
+}									\
+static inline void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)	\
+{									\
+	(bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name);			\
+}									\
+static inline int bfq_bfqq_##name(const struct bfq_queue *bfqq)		\
+{									\
+	return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0;	\
+}
+
+BFQ_BFQQ_FNS(busy);
+BFQ_BFQQ_FNS(wait_request);
+BFQ_BFQQ_FNS(must_alloc);
+BFQ_BFQQ_FNS(fifo_expire);
+BFQ_BFQQ_FNS(idle_window);
+BFQ_BFQQ_FNS(prio_changed);
+BFQ_BFQQ_FNS(sync);
+BFQ_BFQQ_FNS(budget_new);
+#undef BFQ_BFQQ_FNS
+
+/* Logging facilities. */
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
+	blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
+
+#define bfq_log(bfqd, fmt, args...) \
+	blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
+
+/* Expiration reasons. */
+enum bfqq_expiration {
+	BFQ_BFQQ_TOO_IDLE = 0,		/*
+					 * queue has been idling for
+					 * too long
+					 */
+	BFQ_BFQQ_BUDGET_TIMEOUT,	/* budget took too long to be used */
+	BFQ_BFQQ_BUDGET_EXHAUSTED,	/* budget consumed */
+	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
+};
+
+static inline struct bfq_service_tree *
+bfq_entity_service_tree(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sched_data = entity->sched_data;
+	unsigned int idx = entity->ioprio_class - 1;
+
+	return sched_data->service_tree + idx;
+}
+
+static inline struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic,
+					    int is_sync)
+{
+	return bic->bfqq[!!is_sync];
+}
+
+static inline void bic_set_bfqq(struct bfq_io_cq *bic,
+				struct bfq_queue *bfqq, int is_sync)
+{
+	bic->bfqq[!!is_sync] = bfqq;
+}
+
+static inline struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
+{
+	return bic->icq.q->elevator->elevator_data;
+}
+
+/**
+ * bfq_get_bfqd_locked - get a lock to a bfqd using a RCU protected pointer.
+ * @ptr: a pointer to a bfqd.
+ * @flags: storage for the flags to be saved.
+ *
+ * This function allows bfqg->bfqd to be protected by the
+ * queue lock of the bfqd it references; the pointer is dereferenced
+ * under RCU, so the storage for bfqd is assured to be safe as long
+ * as the RCU read side critical section does not end.  After the
+ * bfqd->queue->queue_lock is taken the pointer is rechecked, to be
+ * sure that no other writer accessed it.  If we raced with a writer,
+ * the function returns NULL, with the queue unlocked, otherwise it
+ * returns the dereferenced pointer, with the queue locked.
+ */
+static inline struct bfq_data *bfq_get_bfqd_locked(void **ptr,
+						   unsigned long *flags)
+{
+	struct bfq_data *bfqd;
+
+	rcu_read_lock();
+	bfqd = rcu_dereference(*(struct bfq_data **)ptr);
+
+	if (bfqd != NULL) {
+		spin_lock_irqsave(bfqd->queue->queue_lock, *flags);
+		if (*ptr == bfqd)
+			goto out;
+		spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
+	}
+
+	bfqd = NULL;
+out:
+	rcu_read_unlock();
+	return bfqd;
+}
+
+static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd,
+				       unsigned long *flags)
+{
+	spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
+}
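+
+/*
+ * Usage sketch for the two helpers above (names are illustrative):
+ *
+ *	unsigned long flags;
+ *	struct bfq_data *bfqd = bfq_get_bfqd_locked(&rcu_protected_ptr,
+ *						    &flags);
+ *
+ *	if (bfqd != NULL) {
+ *		... work on bfqd under bfqd->queue->queue_lock ...
+ *		bfq_put_bfqd_unlock(bfqd, &flags);
+ *	}
+ */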
+
+static void bfq_changed_ioprio(struct bfq_io_cq *bic);
+static void bfq_put_queue(struct bfq_queue *bfqq);
+static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, int is_sync,
+				       struct bfq_io_cq *bic, gfp_t gfp_mask);
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
+
+#endif /* _BFQ_H */
-- 
1.9.2


* [PATCH RFC - TAKE TWO - 01/12] block: introduce the BFQ-v0 I/O scheduler
@ 2014-05-29  9:05               ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Fabio Checconi <fchecconi@gmail.com>

BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.

- Each process doing I/O on a device is associated with a weight and a
  (bfq_)queue.

- BFQ grants exclusive access to the device, for a while, to one queue
  (process) at a time, and implements this service model by
  associating every queue with a budget, measured in number of
  sectors.

  - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

  - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
      holding the device for too long and dramatically reducing
      throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
      sync requests may not be expired immediately when it empties.
      Instead, BFQ may idle the device for a short time interval,
      giving the process the chance to go on being served if it issues
      a new request in time. Device idling typically boosts the
      throughput on rotational devices, if processes do synchronous
      and sequential I/O. Besides, under BFQ, device idling is also
      instrumental in guaranteeing the desired throughput fraction to
      processes issuing sync requests (see [1] for details).

  - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity.  See [1] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling. However, for a cleaner
    logical breakdown, the code that enables and completes
    hierarchical support is provided in patch 4, which focuses exactly
    on this feature.

  - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

  - The last, budget-independence, property (although probably
      counterintuitive at first) is definitely beneficial, for
    the following reasons.

    - First, with any proportional-share scheduler, the maximum
      deviation with respect to an ideal service is proportional to
      the maximum budget (slice) assigned to queues. As a consequence,
      BFQ can keep this deviation tight not only because of the
      accurate service of B-WF2Q+, but also because BFQ *does not*
      need to assign a larger budget to a queue to let the queue
      receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
      budget that best fits the needs of the process, or best
      leverages the I/O pattern of the process. In particular, BFQ
      updates queue budgets with a simple feedback-loop algorithm that
      allows a high throughput to be achieved, while still providing
      tight latency guarantees to time-sensitive applications. When
      the in-service queue expires, this algorithm computes the next
      budget of the queue so as to:

      - Let large budgets be eventually assigned to the queues
        associated with I/O-bound applications performing sequential
        I/O: in fact, the longer these applications are served once they
        have been granted access to the device, the higher the throughput
        is.

      - Let small budgets be eventually assigned to the queues
        associated with time-sensitive applications (which typically
        perform sporadic and short I/O), because the smaller the
        budget assigned to a queue waiting for service is, the sooner
        B-WF2Q+ will serve that queue (Subsec 3.3 in [1]).

- Weights can be assigned to processes only indirectly, through I/O
  priorities, and according to the relation: weight = IOPRIO_BE_NR -
  ioprio (see the short sketch after this list). The next two patches
  instead provide a cgroups interface through which weights can be
  assigned explicitly.

- ioprio classes are served in strict priority order, i.e.,
  lower-priority queues are not served as long as there are
  higher-priority queues.  Among queues in the same class, the
  bandwidth is distributed in proportion to the weight of each
  queue. A very thin extra bandwidth is however guaranteed to the Idle
  class, to prevent it from starving.

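As a minimal, stand-alone sketch (not part of the patch) of the
ioprio-to-weight relation above and of the B-WF2Q+ finish-time rule
F_i = S_i + budget_i / weight_i used by bfq_calc_finish() in
bfq-sched.c, with plain integer arithmetic in place of the fixed-point
computation (bfq_delta()) of the actual code:

    #define IOPRIO_BE_NR 8

    /* ioprio 0 (highest) -> weight 8, ioprio 7 (lowest) -> weight 1. */
    static unsigned short ioprio_to_weight(int ioprio)
    {
            return IOPRIO_BE_NR - ioprio;
    }

    /*
     * Finish timestamp of an entity: a higher weight, or a smaller
     * budget, gives an earlier finish time and thus earlier service.
     */
    static unsigned long long finish_time(unsigned long long start,
                                          unsigned long budget,
                                          unsigned short weight)
    {
            return start + budget / weight;
    }

With these relations, a process with ioprio 4 gets weight 4 and thus
receives twice the service fraction of an ioprio-6 (weight-2) process
issuing the same kind of I/O.
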
[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/Kconfig.iosched |   19 +
 block/Makefile        |    1 +
 block/bfq-ioc.c       |   34 +
 block/bfq-iosched.c   | 2297 +++++++++++++++++++++++++++++++++++++++++++++++++
 block/bfq-sched.c     |  936 ++++++++++++++++++++
 block/bfq.h           |  467 ++++++++++
 6 files changed, 3754 insertions(+)
 create mode 100644 block/bfq-ioc.c
 create mode 100644 block/bfq-iosched.c
 create mode 100644 block/bfq-sched.c
 create mode 100644 block/bfq.h

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 421bef9..8f98cc7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -39,6 +39,15 @@ config CFQ_GROUP_IOSCHED
 	---help---
 	  Enable group IO scheduling in CFQ.
 
+config IOSCHED_BFQ
+	tristate "BFQ I/O scheduler"
+	default n
+	---help---
+	  The BFQ I/O scheduler tries to distribute bandwidth among all
+	  processes according to their weights.
+	  It aims at distributing the bandwidth as desired, regardless
+	  of the disk parameters and with any workload.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
@@ -52,6 +61,15 @@ choice
 	config DEFAULT_CFQ
 		bool "CFQ" if IOSCHED_CFQ=y
 
+	config DEFAULT_BFQ
+		bool "BFQ" if IOSCHED_BFQ=y
+		help
+		  Selects BFQ as the I/O scheduler to be used by default
+		  for all block devices.
+		  The BFQ I/O scheduler aims at distributing the bandwidth
+		  as desired, regardless of the disk parameters and with
+		  any workload.
+
 	config DEFAULT_NOOP
 		bool "No-op"
 
@@ -61,6 +79,7 @@ config DEFAULT_IOSCHED
 	string
 	default "deadline" if DEFAULT_DEADLINE
 	default "cfq" if DEFAULT_CFQ
+	default "bfq" if DEFAULT_BFQ
 	default "noop" if DEFAULT_NOOP
 
 endmenu
diff --git a/block/Makefile b/block/Makefile
index 20645e8..cbd83fb 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
 obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
+obj-$(CONFIG_IOSCHED_BFQ)	+= bfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
diff --git a/block/bfq-ioc.c b/block/bfq-ioc.c
new file mode 100644
index 0000000..adfb5a1
--- /dev/null
+++ b/block/bfq-ioc.c
@@ -0,0 +1,34 @@
+/*
+ * BFQ: I/O context handling.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
+/**
+ * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
+ * @icq: the iocontext queue.
+ */
+static inline struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
+{
+	/* bic->icq is the first member, %NULL will convert to %NULL */
+	return container_of(icq, struct bfq_io_cq, icq);
+}
+
+/**
+ * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
+ * @bfqd: the lookup key.
+ * @ioc: the io_context of the process doing I/O.
+ *
+ * Queue lock must be held.
+ */
+static inline struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
+					       struct io_context *ioc)
+{
+	if (ioc)
+		return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue));
+	return NULL;
+}
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
new file mode 100644
index 0000000..01a98be
--- /dev/null
+++ b/block/bfq-iosched.c
@@ -0,0 +1,2297 @@
+/*
+ * Budget Fair Queueing (BFQ) disk scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
+ * file.
+ *
+ * BFQ is a proportional-share storage-I/O scheduling algorithm based on
+ * the slice-by-slice service scheme of CFQ. But BFQ assigns budgets,
+ * measured in number of sectors, to processes instead of time slices. The
+ * device is not granted to the in-service process for a given time slice,
+ * but until it has exhausted its assigned budget. This change from the time
+ * to the service domain allows BFQ to distribute the device throughput
+ * among processes as desired, without any distortion due to ZBR, workload
+ * fluctuations or other factors. BFQ uses an ad hoc internal scheduler,
+ * called B-WF2Q+, to schedule processes according to their budgets. More
+ * precisely, BFQ schedules queues associated to processes. Thanks to the
+ * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
+ * I/O-bound processes issuing sequential requests (to boost the
+ * throughput), and yet guarantee a relatively low latency to interactive
+ * applications.
+ *
+ * BFQ is described in [1], where also a reference to the initial, more
+ * theoretical paper on BFQ can be found. The interested reader can find
+ * in the latter paper full details on the main algorithm, as well as
+ * formulas of the guarantees and formal proofs of all the properties.
+ * With respect to the version of BFQ presented in these papers, this
+ * implementation adds a hierarchical extension based on H-WF2Q+.
+ *
+ * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
+ * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
+ * complexity derives from the one introduced with EEVDF in [3].
+ *
+ * [1] P. Valente and M. Andreolini, ``Improving Application Responsiveness
+ *     with the BFQ Disk I/O Scheduler'',
+ *     Proceedings of the 5th Annual International Systems and Storage
+ *     Conference (SYSTOR '12), June 2012.
+ *
+ * http://algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
+ *
+ * [2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
+ *     Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
+ *     Oct 1997.
+ *
+ * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
+ *
+ * [3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
+ *     First: A Flexible and Accurate Mechanism for Proportional Share
+ *     Resource Allocation,'' technical report.
+ *
+ * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
+ */
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/cgroup.h>
+#include <linux/elevator.h>
+#include <linux/jiffies.h>
+#include <linux/rbtree.h>
+#include <linux/ioprio.h>
+#include "bfq.h"
+#include "blk.h"
+
+/*
+ * Array of async queues for all the processes, one queue
+ * per ioprio value per ioprio_class.
+ */
+struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+/* Async queue for the idle class (ioprio is ignored) */
+struct bfq_queue *async_idle_bfqq;
+
+/* Max number of dispatches in one round of service. */
+static const int bfq_quantum = 4;
+
+/* Expiration time of sync (0) and async (1) requests, in jiffies. */
+static const int bfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
+
+/* Maximum backwards seek, in KiB. */
+static const int bfq_back_max = 16 * 1024;
+
+/* Penalty of a backwards seek, in number of sectors. */
+static const int bfq_back_penalty = 2;
+
+/* Idling period duration, in jiffies. */
+static int bfq_slice_idle = HZ / 125;
+
+/* Default maximum budget values, in sectors and number of requests. */
+static const int bfq_default_max_budget = 16 * 1024;
+static const int bfq_max_budget_async_rq = 4;
+
+/* Default timeout values, in jiffies, approximating CFQ defaults. */
+static const int bfq_timeout_sync = HZ / 8;
+static int bfq_timeout_async = HZ / 25;
+
+struct kmem_cache *bfq_pool;
+
+/* Below this threshold (in ms), we consider thinktime immediate. */
+#define BFQ_MIN_TT		2
+
+/* hw_tag detection: parallel requests threshold and min samples needed. */
+#define BFQ_HW_QUEUE_THRESHOLD	4
+#define BFQ_HW_QUEUE_SAMPLES	32
+
+#define BFQQ_SEEK_THR	 (sector_t)(8 * 1024)
+#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)
+
+/* Budget feedback step. */
+#define BFQ_BUDGET_STEP         128
+
+/* Min samples used for peak rate estimation (for autotuning). */
+#define BFQ_PEAK_RATE_SAMPLES	32
+
+/* Shift used for peak rate fixed precision calculations. */
+#define BFQ_RATE_SHIFT		16
+
+#define BFQ_SERVICE_TREE_INIT	((struct bfq_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+#define RQ_BIC(rq)		((struct bfq_io_cq *) (rq)->elv.priv[0])
+#define RQ_BFQQ(rq)		((rq)->elv.priv[1])
+
+static inline void bfq_schedule_dispatch(struct bfq_data *bfqd);
+
+#include "bfq-ioc.c"
+#include "bfq-sched.c"
+
+#define bfq_class_idle(bfqq)	((bfqq)->entity.ioprio_class ==\
+				 IOPRIO_CLASS_IDLE)
+#define bfq_class_rt(bfqq)	((bfqq)->entity.ioprio_class ==\
+				 IOPRIO_CLASS_RT)
+
+#define bfq_sample_valid(samples)	((samples) > 80)
+
+/*
+ * We regard a request as SYNC if either it's a read or has the SYNC bit
+ * set (in which case it could also be a direct WRITE).
+ */
+static inline int bfq_bio_sync(struct bio *bio)
+{
+	if (bio_data_dir(bio) == READ || (bio->bi_rw & REQ_SYNC))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * Scheduler run of queue, if there are requests pending and no one in the
+ * driver that will restart queueing.
+ */
+static inline void bfq_schedule_dispatch(struct bfq_data *bfqd)
+{
+	if (bfqd->queued != 0) {
+		bfq_log(bfqd, "schedule dispatch");
+		kblockd_schedule_work(bfqd->queue, &bfqd->unplug_work);
+	}
+}
+
+/*
+ * Lifted from AS - choose which of rq1 and rq2 is best served now.
+ * We choose the request that is closest to the head right now.  Distance
+ * behind the head is penalized and only allowed to a certain extent.
+ */
+static struct request *bfq_choose_req(struct bfq_data *bfqd,
+				      struct request *rq1,
+				      struct request *rq2,
+				      sector_t last)
+{
+	sector_t s1, s2, d1 = 0, d2 = 0;
+	unsigned long back_max;
+#define BFQ_RQ1_WRAP	0x01 /* request 1 wraps */
+#define BFQ_RQ2_WRAP	0x02 /* request 2 wraps */
+	unsigned wrap = 0; /* bit mask: requests behind the disk head? */
+
+	if (rq1 == NULL || rq1 == rq2)
+		return rq2;
+	if (rq2 == NULL)
+		return rq1;
+
+	if (rq_is_sync(rq1) && !rq_is_sync(rq2))
+		return rq1;
+	else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
+		return rq2;
+	if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
+		return rq1;
+	else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
+		return rq2;
+
+	s1 = blk_rq_pos(rq1);
+	s2 = blk_rq_pos(rq2);
+
+	/*
+	 * By definition, 1KiB is 2 sectors.
+	 */
+	back_max = bfqd->bfq_back_max * 2;
+
+	/*
+	 * Strict one way elevator _except_ in the case where we allow
+	 * short backward seeks which are biased as twice the cost of a
+	 * similar forward seek.
+	 */
+	if (s1 >= last)
+		d1 = s1 - last;
+	else if (s1 + back_max >= last)
+		d1 = (last - s1) * bfqd->bfq_back_penalty;
+	else
+		wrap |= BFQ_RQ1_WRAP;
+
+	if (s2 >= last)
+		d2 = s2 - last;
+	else if (s2 + back_max >= last)
+		d2 = (last - s2) * bfqd->bfq_back_penalty;
+	else
+		wrap |= BFQ_RQ2_WRAP;
+
+	/* Found required data */
+
+	/*
+	 * By doing switch() on the bit mask "wrap" we avoid having to
+	 * check two variables for all permutations: --> faster!
+	 */
+	switch (wrap) {
+	case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
+		if (d1 < d2)
+			return rq1;
+		else if (d2 < d1)
+			return rq2;
+		else {
+			if (s1 >= s2)
+				return rq1;
+			else
+				return rq2;
+		}
+
+	case BFQ_RQ2_WRAP:
+		return rq1;
+	case BFQ_RQ1_WRAP:
+		return rq2;
+	case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
+	default:
+		/*
+		 * Since both rqs are wrapped,
+		 * start with the one that's further behind head
+		 * (--> only *one* back seek required),
+		 * since back seek takes more time than forward.
+		 */
+		if (s1 <= s2)
+			return rq1;
+		else
+			return rq2;
+	}
+}
+
+static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
+					struct bfq_queue *bfqq,
+					struct request *last)
+{
+	struct rb_node *rbnext = rb_next(&last->rb_node);
+	struct rb_node *rbprev = rb_prev(&last->rb_node);
+	struct request *next = NULL, *prev = NULL;
+
+	if (rbprev != NULL)
+		prev = rb_entry_rq(rbprev);
+
+	if (rbnext != NULL)
+		next = rb_entry_rq(rbnext);
+	else {
+		rbnext = rb_first(&bfqq->sort_list);
+		if (rbnext && rbnext != &last->rb_node)
+			next = rb_entry_rq(rbnext);
+	}
+
+	return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
+}
+
+static inline unsigned long bfq_serv_to_charge(struct request *rq,
+					       struct bfq_queue *bfqq)
+{
+	return blk_rq_sectors(rq);
+}
+
+/**
+ * bfq_updated_next_req - update the queue after a new next_rq selection.
+ * @bfqd: the device data the queue belongs to.
+ * @bfqq: the queue to update.
+ *
+ * If the first request of a queue changes we make sure that the queue
+ * has enough budget to serve at least its first request (if the
+ * request has grown).  We do this because, if the queue does not have
+ * enough budget for its first request, it has to go through two dispatch
+ * rounds to actually get it dispatched.
+ */
+static void bfq_updated_next_req(struct bfq_data *bfqd,
+				 struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	struct request *next_rq = bfqq->next_rq;
+	unsigned long new_budget;
+
+	if (next_rq == NULL)
+		return;
+
+	if (bfqq == bfqd->in_service_queue)
+		/*
+		 * In order not to break guarantees, budgets cannot be
+		 * changed after an entity has been selected.
+		 */
+		return;
+
+	new_budget = max_t(unsigned long, bfqq->max_budget,
+			   bfq_serv_to_charge(next_rq, bfqq));
+	if (entity->budget != new_budget) {
+		entity->budget = new_budget;
+		bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
+					 new_budget);
+		bfq_activate_bfqq(bfqd, bfqq);
+	}
+}
+
+static void bfq_add_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_entity *entity = &bfqq->entity;
+	struct bfq_data *bfqd = bfqq->bfqd;
+	struct request *next_rq, *prev;
+
+	bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
+	bfqq->queued[rq_is_sync(rq)]++;
+	bfqd->queued++;
+
+	elv_rb_add(&bfqq->sort_list, rq);
+
+	/*
+	 * Check if this request is a better next-serve candidate.
+	 */
+	prev = bfqq->next_rq;
+	next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
+	bfqq->next_rq = next_rq;
+
+	if (!bfq_bfqq_busy(bfqq)) {
+		entity->budget = max_t(unsigned long, bfqq->max_budget,
+				       bfq_serv_to_charge(next_rq, bfqq));
+		bfq_add_bfqq_busy(bfqd, bfqq);
+	} else {
+		if (prev != bfqq->next_rq)
+			bfq_updated_next_req(bfqd, bfqq);
+	}
+}
+
+static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
+					  struct bio *bio)
+{
+	struct task_struct *tsk = current;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	bic = bfq_bic_lookup(bfqd, tsk->io_context);
+	if (bic == NULL)
+		return NULL;
+
+	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	if (bfqq != NULL)
+		return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
+
+	return NULL;
+}
+
+static void bfq_activate_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+
+	bfqd->rq_in_driver++;
+	bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
+	bfq_log(bfqd, "activate_request: new bfqd->last_position %llu",
+		(long long unsigned)bfqd->last_position);
+}
+
+static inline void bfq_deactivate_request(struct request_queue *q,
+					  struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+
+	bfqd->rq_in_driver--;
+}
+
+static void bfq_remove_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_data *bfqd = bfqq->bfqd;
+	const int sync = rq_is_sync(rq);
+
+	if (bfqq->next_rq == rq) {
+		bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
+		bfq_updated_next_req(bfqd, bfqq);
+	}
+
+	list_del_init(&rq->queuelist);
+	bfqq->queued[sync]--;
+	bfqd->queued--;
+	elv_rb_del(&bfqq->sort_list, rq);
+
+	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue)
+			bfq_del_bfqq_busy(bfqd, bfqq, 1);
+	}
+
+	if (rq->cmd_flags & REQ_META)
+		bfqq->meta_pending--;
+}
+
+static int bfq_merge(struct request_queue *q, struct request **req,
+		     struct bio *bio)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct request *__rq;
+
+	__rq = bfq_find_rq_fmerge(bfqd, bio);
+	if (__rq != NULL && elv_rq_merge_ok(__rq, bio)) {
+		*req = __rq;
+		return ELEVATOR_FRONT_MERGE;
+	}
+
+	return ELEVATOR_NO_MERGE;
+}
+
+static void bfq_merged_request(struct request_queue *q, struct request *req,
+			       int type)
+{
+	if (type == ELEVATOR_FRONT_MERGE &&
+	    rb_prev(&req->rb_node) &&
+	    blk_rq_pos(req) <
+	    blk_rq_pos(container_of(rb_prev(&req->rb_node),
+				    struct request, rb_node))) {
+		struct bfq_queue *bfqq = RQ_BFQQ(req);
+		struct bfq_data *bfqd = bfqq->bfqd;
+		struct request *prev, *next_rq;
+
+		/* Reposition request in its sort_list */
+		elv_rb_del(&bfqq->sort_list, req);
+		elv_rb_add(&bfqq->sort_list, req);
+		/* Choose next request to be served for bfqq */
+		prev = bfqq->next_rq;
+		next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,
+					 bfqd->last_position);
+		bfqq->next_rq = next_rq;
+		/*
+		 * If next_rq changes, update the queue's budget to fit
+		 * the new request.
+		 */
+		if (prev != bfqq->next_rq)
+			bfq_updated_next_req(bfqd, bfqq);
+	}
+}
+
+static void bfq_merged_requests(struct request_queue *q, struct request *rq,
+				struct request *next)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	/*
+	 * Reposition in fifo if next is older than rq.
+	 */
+	if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
+	    time_before(next->fifo_time, rq->fifo_time)) {
+		list_move(&rq->queuelist, &next->queuelist);
+		rq->fifo_time = next->fifo_time;
+	}
+
+	if (bfqq->next_rq == next)
+		bfqq->next_rq = rq;
+
+	bfq_remove_request(next);
+}
+
+static int bfq_allow_merge(struct request_queue *q, struct request *rq,
+			   struct bio *bio)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	/*
+	 * Disallow merge of a sync bio into an async request.
+	 */
+	if (bfq_bio_sync(bio) && !rq_is_sync(rq))
+		return 0;
+
+	/*
+	 * Lookup the bfqq that this bio will be queued with. Allow
+	 * merge only if rq is queued there.
+	 * Queue lock is held here.
+	 */
+	bic = bfq_bic_lookup(bfqd, current->io_context);
+	if (bic == NULL)
+		return 0;
+
+	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	return bfqq == RQ_BFQQ(rq);
+}
+
+static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
+				       struct bfq_queue *bfqq)
+{
+	if (bfqq != NULL) {
+		bfq_mark_bfqq_must_alloc(bfqq);
+		bfq_mark_bfqq_budget_new(bfqq);
+		bfq_clear_bfqq_fifo_expire(bfqq);
+
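+		/*
+		 * Decayed count of assigned budgets (it converges
+		 * towards 256 while budgets keep being assigned); used
+		 * to decide whether enough samples have been collected
+		 * to trust the auto-tuned max budget.
+		 */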
+		bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
+
+		bfq_log_bfqq(bfqd, bfqq,
+			     "set_in_service_queue, cur-budget = %lu",
+			     bfqq->entity.budget);
+	}
+
+	bfqd->in_service_queue = bfqq;
+}
+
+/*
+ * Get and set a new queue for service.
+ */
+static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
+
+	__bfq_set_in_service_queue(bfqd, bfqq);
+	return bfqq;
+}
+
+/*
+ * If enough samples have been computed, return the current max budget
+ * stored in bfqd, which is dynamically updated according to the
+ * estimated disk peak rate; otherwise return the default max budget.
+ */
+static inline unsigned long bfq_max_budget(struct bfq_data *bfqd)
+{
+	if (bfqd->budgets_assigned < 194)
+		return bfq_default_max_budget;
+	else
+		return bfqd->bfq_max_budget;
+}
+
+/**
+ * bfq_default_budget - return the default budget for @bfqq on @bfqd.
+ * @bfqd: the device descriptor.
+ * @bfqq: the queue to consider.
+ *
+ * We use 3/4 of the @bfqd maximum budget as the default value
+ * for the max_budget field of the queues.  This lets the feedback
+ * mechanism start from some middle ground, after which the behavior
+ * of the process drives the heuristics towards high values, if it
+ * behaves as a greedy sequential reader, or towards small values,
+ * if it shows a more intermittent behavior.
+ */
+static unsigned long bfq_default_budget(struct bfq_data *bfqd,
+					struct bfq_queue *bfqq)
+{
+	unsigned long budget;
+
+	/*
+	 * While we still need an estimate of the peak rate, avoid
+	 * handing out budgets that are too short because of previous
+	 * measurements: until enough budgets have been assigned, use
+	 * a ``safe'' default value.
+	 */
+	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
+		budget = bfq_default_max_budget;
+	else
+		budget = bfqd->bfq_max_budget;
+
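+	/*
+	 * Return 3/4 of the chosen budget: e.g., with a max budget of
+	 * 16384 sectors (an illustrative value), a queue starts with a
+	 * max_budget of 12288 sectors, which the feedback in
+	 * __bfq_bfqq_recalc_budget() then moves up or down.
+	 */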
+	return budget - budget / 4;
+}
+
+/*
+ * Return min budget, which is a fraction of the current or default
+ * max budget (trying with 1/32)
+ */
+static inline unsigned long bfq_min_budget(struct bfq_data *bfqd)
+{
+	if (bfqd->budgets_assigned < 194)
+		return bfq_default_max_budget / 32;
+	else
+		return bfqd->bfq_max_budget / 32;
+}
+
+static void bfq_arm_slice_timer(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfqd->in_service_queue;
+	struct bfq_io_cq *bic;
+	unsigned long sl;
+
+	/* Processes have exited, don't wait. */
+	bic = bfqd->in_service_bic;
+	if (bic == NULL || atomic_read(&bic->icq.ioc->active_ref) == 0)
+		return;
+
+	bfq_mark_bfqq_wait_request(bfqq);
+
+	/*
+	 * We don't want to idle for seeks, but we do want to allow
+	 * fair distribution of slice time for a process doing
+	 * back-to-back seeks. So allow a little bit of time for it to
+	 * submit a new request.
+	 */
+	sl = bfqd->bfq_slice_idle;
+	/*
+	 * Grant only minimum idle time if the queue has been seeky for long
+	 * enough.
+	 */
+	if (bfq_sample_valid(bfqq->seek_samples) && BFQQ_SEEKY(bfqq))
+		sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
+	bfqd->last_idling_start = ktime_get();
+	mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
+	bfq_log(bfqd, "arm idle: %u/%u ms",
+		jiffies_to_msecs(sl), jiffies_to_msecs(bfqd->bfq_slice_idle));
+}
+
+/*
+ * Set the maximum time for the in-service queue to consume its
+ * budget. This prevents seeky processes from lowering the disk
+ * throughput (always guaranteed with a time slice scheme as in CFQ).
+ */
+static void bfq_set_budget_timeout(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq = bfqd->in_service_queue;
+	unsigned int timeout_coeff = bfqq->entity.weight /
+				     bfqq->entity.orig_weight;
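+	/*
+	 * timeout_coeff is the (integer) ratio between the current and
+	 * the original weight of the queue: it equals 1 if the weight
+	 * has not been changed, and scales the budget timeout up for
+	 * queues whose weight has been raised.
+	 */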
+
+	bfqd->last_budget_start = ktime_get();
+
+	bfq_clear_bfqq_budget_new(bfqq);
+	bfqq->budget_timeout = jiffies +
+		bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] * timeout_coeff;
+
+	bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u",
+		jiffies_to_msecs(bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] *
+		timeout_coeff));
+}
+
+/*
+ * Move request from internal lists to the request queue dispatch list.
+ */
+static void bfq_dispatch_insert(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	/*
+	 * For consistency, the next instruction should have been executed
+	 * after removing the request from the queue and dispatching it.
+	 * We execute it before bfq_remove_request() instead (and hence
+	 * introduce a temporary inconsistency), for efficiency.
+	 * In fact, in a forced_dispatch, this prevents two counters
+	 * related to bfqq->dispatched from being uselessly decremented,
+	 * if bfqq is not in service, and then incremented again after
+	 * incrementing bfqq->dispatched.
+	 */
+	bfqq->dispatched++;
+	bfq_remove_request(rq);
+	elv_dispatch_sort(q, rq);
+
+	if (bfq_bfqq_sync(bfqq))
+		bfqd->sync_flight++;
+}
+
+/*
+ * Return expired entry, or NULL to just start from scratch in rbtree.
+ */
+static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
+{
+	struct request *rq = NULL;
+
+	if (bfq_bfqq_fifo_expire(bfqq))
+		return NULL;
+
+	bfq_mark_bfqq_fifo_expire(bfqq);
+
+	if (list_empty(&bfqq->fifo))
+		return NULL;
+
+	rq = rq_entry_fifo(bfqq->fifo.next);
+
+	if (time_before(jiffies, rq->fifo_time))
+		return NULL;
+
+	return rq;
+}
+
+static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	return entity->budget - entity->service;
+}
+
+static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	__bfq_bfqd_reset_in_service(bfqd);
+
+	if (RB_EMPTY_ROOT(&bfqq->sort_list))
+		bfq_del_bfqq_busy(bfqd, bfqq, 1);
+	else
+		bfq_activate_bfqq(bfqd, bfqq);
+}
+
+/**
+ * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
+ * @bfqd: device data.
+ * @bfqq: queue to update.
+ * @reason: reason for expiration.
+ *
+ * Handle the feedback on @bfqq budget.  See the body for detailed
+ * comments.
+ */
+static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
+				     struct bfq_queue *bfqq,
+				     enum bfqq_expiration reason)
+{
+	struct request *next_rq;
+	unsigned long budget, min_budget;
+
+	budget = bfqq->max_budget;
+	min_budget = bfq_min_budget(bfqd);
+
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %lu, budg left %lu",
+		bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %lu, min budg %lu",
+		budget, bfq_min_budget(bfqd));
+	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
+		bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
+
+	if (bfq_bfqq_sync(bfqq)) {
+		switch (reason) {
+		/*
+		 * Caveat: in all the following cases we trade latency
+		 * for throughput.
+		 */
+		case BFQ_BFQQ_TOO_IDLE:
+			if (budget > min_budget + BFQ_BUDGET_STEP)
+				budget -= BFQ_BUDGET_STEP;
+			else
+				budget = min_budget;
+			break;
+		case BFQ_BFQQ_BUDGET_TIMEOUT:
+			budget = bfq_default_budget(bfqd, bfqq);
+			break;
+		case BFQ_BFQQ_BUDGET_EXHAUSTED:
+			/*
+			 * The process still has backlog, and did not
+			 * let either the budget timeout or the disk
+			 * idling timeout expire. Hence it is not
+			 * seeky, has a short thinktime and may be
+			 * happy with a higher budget too. So
+			 * definitely increase the budget of this good
+			 * candidate to boost the disk throughput.
+			 */
+			budget = min(budget + 8 * BFQ_BUDGET_STEP,
+				     bfqd->bfq_max_budget);
+			break;
+		case BFQ_BFQQ_NO_MORE_REQUESTS:
+			/*
+			 * Leave the budget unchanged.
+			 */
+		default:
+			return;
+		}
+	} else /* async queue */
+		/*
+		 * Async queues always get the maximum possible budget
+		 * (their ability to dispatch is limited by
+		 * @bfqd->bfq_max_budget_async_rq).
+		 */
+		budget = bfqd->bfq_max_budget;
+
+	bfqq->max_budget = budget;
+
+	if (bfqd->budgets_assigned >= 194 && bfqd->bfq_user_max_budget == 0 &&
+	    bfqq->max_budget > bfqd->bfq_max_budget)
+		bfqq->max_budget = bfqd->bfq_max_budget;
+
+	/*
+	 * Make sure that we have enough budget for the next request.
+	 * Since the finish time of the bfqq must be kept in sync with
+	 * the budget, be sure to call __bfq_bfqq_expire() after the
+	 * update.
+	 */
+	next_rq = bfqq->next_rq;
+	if (next_rq != NULL)
+		bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
+					    bfq_serv_to_charge(next_rq, bfqq));
+	else
+		bfqq->entity.budget = bfqq->max_budget;
+
+	bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %lu",
+			next_rq != NULL ? blk_rq_sectors(next_rq) : 0,
+			bfqq->entity.budget);
+}
+
+static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
+{
+	unsigned long max_budget;
+
+	/*
+	 * The max_budget calculated when autotuning is equal to the
+	 * number of sectors transferred in timeout_sync at the
+	 * estimated peak rate.
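+	 *
+	 * About units: peak_rate is in sectors/usec, left-shifted by
+	 * BFQ_RATE_SHIFT, and timeout is in ms; multiplying by 1000
+	 * converts ms into usec, while the right shift removes the
+	 * fixed-point scaling, so the result is a number of sectors.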
+	 */
+	max_budget = (unsigned long)(peak_rate * 1000 *
+				     timeout >> BFQ_RATE_SHIFT);
+
+	return max_budget;
+}
+
+/*
+ * In addition to updating the peak rate, this function checks whether
+ * the process is "slow", and returns 1 if so. This slow flag is used,
+ * in addition to the budget timeout, to reduce the amount of service
+ * provided to seeky processes, and hence reduce their chances to lower
+ * the throughput. See the code for more details.
+ */
+static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				int compensate)
+{
+	u64 bw, usecs, expected, timeout;
+	ktime_t delta;
+	int update = 0;
+
+	if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
+		return 0;
+
+	if (compensate)
+		delta = bfqd->last_idling_start;
+	else
+		delta = ktime_get();
+	delta = ktime_sub(delta, bfqd->last_budget_start);
+	usecs = ktime_to_us(delta);
+
+	/* Don't trust short/unrealistic values. */
+	if (usecs < 100 || usecs >= LONG_MAX)
+		return 0;
+
+	/*
+	 * Calculate the bandwidth for the last slice.  We use a 64 bit
+	 * value to store the peak rate, in sectors per usec in fixed
+	 * point math.  We do so to have enough precision in the estimate
+	 * and to avoid overflows.
+	 */
+	bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
+	do_div(bw, (unsigned long)usecs);
+
+	timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
+
+	/*
+	 * Use only long (> 20ms) intervals to filter out spikes for
+	 * the peak rate estimation.
+	 */
+	if (usecs > 20000) {
+		if (bw > bfqd->peak_rate) {
+			bfqd->peak_rate = bw;
+			update = 1;
+			bfq_log(bfqd, "new peak_rate=%llu", bw);
+		}
+
+		update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
+
+		if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
+			bfqd->peak_rate_samples++;
+
+		if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
+		    update && bfqd->bfq_user_max_budget == 0) {
+			bfqd->bfq_max_budget =
+				bfq_calc_max_budget(bfqd->peak_rate,
+						    timeout);
+			bfq_log(bfqd, "new max_budget=%lu",
+				bfqd->bfq_max_budget);
+		}
+	}
+
+	/*
+	 * A process is considered ``slow'' (i.e., seeky, so that we
+	 * cannot treat it fairly in the service domain, as it would
+	 * slow down the other processes too much) if, when a slice
+	 * ends for whatever reason, it has received service at a
+	 * rate that would not be high enough to complete the budget
+	 * before the budget timeout expiration.
+	 */
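+	/*
+	 * expected is the amount of service, in sectors, that the
+	 * process would have received during a full budget timeout at
+	 * the rate computed above.
+	 */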
+	expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT;
+
+	/*
+	 * Caveat: processes doing IO in the slower disk zones will
+	 * tend to be slow(er) even if not seeky. And the estimated
+	 * peak rate will actually be an average over the disk
+	 * surface. Hence, to not be too harsh with unlucky processes,
+	 * we keep a budget/3 margin of safety before declaring a
+	 * process slow.
+	 */
+	return expected > (4 * bfqq->entity.budget) / 3;
+}
+
+/**
+ * bfq_bfqq_expire - expire a queue.
+ * @bfqd: device owning the queue.
+ * @bfqq: the queue to expire.
+ * @compensate: if true, compensate for the time spent idling.
+ * @reason: the reason causing the expiration.
+ *
+ * If the process associated with the queue is slow (i.e., seeky), or
+ * in case of budget timeout, or, finally, if it is async, we
+ * artificially charge it an entire budget (independently of the
+ * actual service it received). As a consequence, the queue will get
+ * higher timestamps than the correct ones upon reactivation, and
+ * hence it will be rescheduled as if it had received more service
+ * than it actually received. In the end, this class of processes
+ * will receive less service in proportion to how slowly they consume
+ * their budgets (and hence how seriously they tend to lower the
+ * throughput).
+ *
+ * In contrast, when a queue expires because it has been idling for
+ * too long or because it exhausted its budget, we do not touch the
+ * amount of service it has received. Hence, when the queue is
+ * reactivated and its timestamps updated, the latter will be in sync
+ * with the actual service received by the queue until expiration.
+ *
+ * Charging a full budget to the first type of queues and the exact
+ * service to the others has the effect of using the WF2Q+ policy to
+ * schedule the former on a timeslice basis, without violating the
+ * service domain guarantees of the latter.
+ */
+static void bfq_bfqq_expire(struct bfq_data *bfqd,
+			    struct bfq_queue *bfqq,
+			    int compensate,
+			    enum bfqq_expiration reason)
+{
+	int slow;
+
+	/* Update disk peak rate for autotuning and check whether the
+	 * process is slow (see bfq_update_peak_rate).
+	 */
+	slow = bfq_update_peak_rate(bfqd, bfqq, compensate);
+
+	/*
+	 * As explained above, 'punish' slow (i.e., seeky), timed-out
+	 * and async queues, to favor sequential sync workloads.
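+	 *
+	 * For instance, a queue that consumed only a quarter of its
+	 * budget before timing out is nevertheless charged the whole
+	 * budget: upon reactivation its timestamps advance as if it
+	 * had received four times the service it actually got, so it
+	 * will receive correspondingly less service afterwards.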
+	 */
+	if (slow || reason == BFQ_BFQQ_BUDGET_TIMEOUT)
+		bfq_bfqq_charge_full_budget(bfqq);
+
+	bfq_log_bfqq(bfqd, bfqq,
+		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
+		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
+
+	/*
+	 * Increase, decrease or leave budget unchanged according to
+	 * reason.
+	 */
+	__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
+	__bfq_bfqq_expire(bfqd, bfqq);
+}
+
+/*
+ * Budget timeout is not implemented through a dedicated timer, but
+ * just checked on request arrivals and completions, as well as on
+ * idle timer expirations.
+ */
+static int bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
+{
+	if (bfq_bfqq_budget_new(bfqq) ||
+	    time_before(jiffies, bfqq->budget_timeout))
+		return 0;
+	return 1;
+}
+
+/*
+ * If we expire a queue that is waiting for the arrival of a new
+ * request, we may prevent the fictitious timestamp back-shifting that
+ * allows the guarantees of the queue to be preserved (see [1] for
+ * this tricky aspect). Hence we return true only if this condition
+ * does not hold, or if the queue is slow enough to deserve only to be
+ * kicked off for preserving a high throughput.
+ */
+static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqq->bfqd, bfqq,
+		"may_budget_timeout: wait_request %d left %d timeout %d",
+		bfq_bfqq_wait_request(bfqq),
+			bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3,
+		bfq_bfqq_budget_timeout(bfqq));
+
+	return (!bfq_bfqq_wait_request(bfqq) ||
+		bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3)
+		&&
+		bfq_bfqq_budget_timeout(bfqq);
+}
+
+/*
+ * Device idling is allowed only for sync queues that have a non-null
+ * idle window.
+ */
+static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
+{
+	return bfq_bfqq_sync(bfqq) && bfq_bfqq_idle_window(bfqq);
+}
+
+/*
+ * If the in-service queue is empty, but it is sync and the queue has its
+ * idle window set (in this case, waiting for a new request for the queue
+ * is likely to boost the throughput), then:
+ * 1) the queue must remain in service and cannot be expired, and
+ * 2) the disk must be idled to wait for the possible arrival of a new
+ *    request for the queue.
+ */
+static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
+{
+	struct bfq_data *bfqd = bfqq->bfqd;
+
+	return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 &&
+	       bfq_bfqq_must_not_expire(bfqq);
+}
+
+/*
+ * Select a queue for service.  If we have a current queue in service,
+ * check whether to continue servicing it, or retrieve and set a new one.
+ */
+static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq;
+	struct request *next_rq;
+	enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+
+	bfqq = bfqd->in_service_queue;
+	if (bfqq == NULL)
+		goto new_queue;
+
+	bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
+
+	if (bfq_may_expire_for_budg_timeout(bfqq) &&
+	    !timer_pending(&bfqd->idle_slice_timer) &&
+	    !bfq_bfqq_must_idle(bfqq))
+		goto expire;
+
+	next_rq = bfqq->next_rq;
+	/*
+	 * If bfqq has requests queued and it has enough budget left to
+	 * serve them, keep the queue, otherwise expire it.
+	 */
+	if (next_rq != NULL) {
+		if (bfq_serv_to_charge(next_rq, bfqq) >
+			bfq_bfqq_budget_left(bfqq)) {
+			reason = BFQ_BFQQ_BUDGET_EXHAUSTED;
+			goto expire;
+		} else {
+			/*
+			 * The idle timer may be pending because we may
+			 * not disable disk idling even when a new request
+			 * arrives.
+			 */
+			if (timer_pending(&bfqd->idle_slice_timer)) {
+				/*
+				 * If we get here, then: 1) at least one
+				 * new request has arrived, but we have
+				 * not disabled the timer because the
+				 * request was too small, and 2) the
+				 * block layer has then unplugged the
+				 * device, causing this dispatch to be
+				 * invoked.
+				 *
+				 * Since the device is now unplugged,
+				 * the requests are probably large
+				 * enough to provide a reasonable
+				 * throughput, so we disable idling.
+				 */
+				bfq_clear_bfqq_wait_request(bfqq);
+				del_timer(&bfqd->idle_slice_timer);
+			}
+			goto keep_queue;
+		}
+	}
+
+	/*
+	 * No requests pending.  If the in-service queue still has requests
+	 * in flight (possibly waiting for a completion) or is idling for a
+	 * new request, then keep it.
+	 */
+	if (timer_pending(&bfqd->idle_slice_timer) ||
+	    (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq))) {
+		bfqq = NULL;
+		goto keep_queue;
+	}
+
+	reason = BFQ_BFQQ_NO_MORE_REQUESTS;
+expire:
+	bfq_bfqq_expire(bfqd, bfqq, 0, reason);
+new_queue:
+	bfqq = bfq_set_in_service_queue(bfqd);
+	bfq_log(bfqd, "select_queue: new queue %d returned",
+		bfqq != NULL ? bfqq->pid : 0);
+keep_queue:
+	return bfqq;
+}
+
+/*
+ * Dispatch one request from bfqq, moving it to the request queue
+ * dispatch list.
+ */
+static int bfq_dispatch_request(struct bfq_data *bfqd,
+				struct bfq_queue *bfqq)
+{
+	int dispatched = 0;
+	struct request *rq;
+	unsigned long service_to_charge;
+
+	/* Follow expired path, else get first next available. */
+	rq = bfq_check_fifo(bfqq);
+	if (rq == NULL)
+		rq = bfqq->next_rq;
+	service_to_charge = bfq_serv_to_charge(rq, bfqq);
+
+	if (service_to_charge > bfq_bfqq_budget_left(bfqq)) {
+		/*
+		 * This may happen if the next rq is chosen in fifo order
+		 * instead of sector order. The budget is properly
+		 * dimensioned to be always sufficient to serve the next
+		 * request only if it is chosen in sector order. The reason
+		 * is that it would be quite inefficient and of little use
+		 * to always make sure that the budget is large enough to
+		 * serve even the possible next rq in fifo order.
+		 * In fact, requests are seldom served in fifo order.
+		 *
+		 * Expire the queue for budget exhaustion, and make sure
+		 * that the next act_budget is enough to serve the next
+		 * request, even if it comes from the fifo expired path.
+		 */
+		bfqq->next_rq = rq;
+		/*
+		 * Since this dispatch failed, make sure that a new
+		 * one will be performed.
+		 */
+		if (!bfqd->rq_in_driver)
+			bfq_schedule_dispatch(bfqd);
+		goto expire;
+	}
+
+	/* Finally, insert request into driver dispatch list. */
+	bfq_bfqq_served(bfqq, service_to_charge);
+	bfq_dispatch_insert(bfqd->queue, rq);
+
+	bfq_log_bfqq(bfqd, bfqq,
+			"dispatched %u sec req (%llu), budg left %lu",
+			blk_rq_sectors(rq),
+			(long long unsigned)blk_rq_pos(rq),
+			bfq_bfqq_budget_left(bfqq));
+
+	dispatched++;
+
+	if (bfqd->in_service_bic == NULL) {
+		atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount);
+		bfqd->in_service_bic = RQ_BIC(rq);
+	}
+
+	if (bfqd->busy_queues > 1 && ((!bfq_bfqq_sync(bfqq) &&
+	    dispatched >= bfqd->bfq_max_budget_async_rq) ||
+	    bfq_class_idle(bfqq)))
+		goto expire;
+
+	return dispatched;
+
+expire:
+	bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_EXHAUSTED);
+	return dispatched;
+}
+
+static int __bfq_forced_dispatch_bfqq(struct bfq_queue *bfqq)
+{
+	int dispatched = 0;
+
+	while (bfqq->next_rq != NULL) {
+		bfq_dispatch_insert(bfqq->bfqd->queue, bfqq->next_rq);
+		dispatched++;
+	}
+
+	return dispatched;
+}
+
+/*
+ * Drain our current requests.
+ * Used for barriers and when switching io schedulers on-the-fly.
+ */
+static int bfq_forced_dispatch(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq, *n;
+	struct bfq_service_tree *st;
+	int dispatched = 0;
+
+	bfqq = bfqd->in_service_queue;
+	if (bfqq != NULL)
+		__bfq_bfqq_expire(bfqd, bfqq);
+
+	/*
+	 * Loop through classes, and be careful to leave the scheduler
+	 * in a consistent state, as feedback mechanisms and vtime
+	 * updates cannot be disabled during the process.
+	 */
+	list_for_each_entry_safe(bfqq, n, &bfqd->active_list, bfqq_list) {
+		st = bfq_entity_service_tree(&bfqq->entity);
+
+		dispatched += __bfq_forced_dispatch_bfqq(bfqq);
+		bfqq->max_budget = bfq_max_budget(bfqd);
+
+		bfq_forget_idle(st);
+	}
+
+	return dispatched;
+}
+
+static int bfq_dispatch_requests(struct request_queue *q, int force)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq;
+	int max_dispatch;
+
+	bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
+	if (bfqd->busy_queues == 0)
+		return 0;
+
+	if (unlikely(force))
+		return bfq_forced_dispatch(bfqd);
+
+	bfqq = bfq_select_queue(bfqd);
+	if (bfqq == NULL)
+		return 0;
+
+	max_dispatch = bfqd->bfq_quantum;
+	if (bfq_class_idle(bfqq))
+		max_dispatch = 1;
+
+	if (!bfq_bfqq_sync(bfqq))
+		max_dispatch = bfqd->bfq_max_budget_async_rq;
+
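+	/*
+	 * Allow the in-service queue to exceed max_dispatch only if it
+	 * is the only busy queue, and in any case never beyond four
+	 * times max_dispatch.
+	 */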
+	if (bfqq->dispatched >= max_dispatch) {
+		if (bfqd->busy_queues > 1)
+			return 0;
+		if (bfqq->dispatched >= 4 * max_dispatch)
+			return 0;
+	}
+
+	if (bfqd->sync_flight != 0 && !bfq_bfqq_sync(bfqq))
+		return 0;
+
+	bfq_clear_bfqq_wait_request(bfqq);
+
+	if (!bfq_dispatch_request(bfqd, bfqq))
+		return 0;
+
+	bfq_log_bfqq(bfqd, bfqq, "dispatched one request of %d (max_disp %d)",
+			bfqq->pid, max_dispatch);
+
+	return 1;
+}
+
+/*
+ * Task holds one reference to the queue, dropped when task exits.  Each rq
+ * in-flight on this queue also holds a reference, dropped when rq is freed.
+ *
+ * Queue lock must be held here.
+ */
+static void bfq_put_queue(struct bfq_queue *bfqq)
+{
+	struct bfq_data *bfqd = bfqq->bfqd;
+
+	bfq_log_bfqq(bfqd, bfqq, "put_queue: %p %d", bfqq,
+		     atomic_read(&bfqq->ref));
+	if (!atomic_dec_and_test(&bfqq->ref))
+		return;
+
+	bfq_log_bfqq(bfqd, bfqq, "put_queue: %p freed", bfqq);
+
+	kmem_cache_free(bfq_pool, bfqq);
+}
+
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	if (bfqq == bfqd->in_service_queue) {
+		__bfq_bfqq_expire(bfqd, bfqq);
+		bfq_schedule_dispatch(bfqd);
+	}
+
+	bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq,
+		     atomic_read(&bfqq->ref));
+
+	bfq_put_queue(bfqq);
+}
+
+static inline void bfq_init_icq(struct io_cq *icq)
+{
+	icq_to_bic(icq)->ttime.last_end_request = jiffies;
+}
+
+static void bfq_exit_icq(struct io_cq *icq)
+{
+	struct bfq_io_cq *bic = icq_to_bic(icq);
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+
+	if (bic->bfqq[BLK_RW_ASYNC]) {
+		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_ASYNC]);
+		bic->bfqq[BLK_RW_ASYNC] = NULL;
+	}
+
+	if (bic->bfqq[BLK_RW_SYNC]) {
+		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
+		bic->bfqq[BLK_RW_SYNC] = NULL;
+	}
+}
+
+/*
+ * Update the entity prio values; note that the new values will not
+ * be used until the next (re)activation.
+ */
+static void bfq_init_prio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+{
+	struct task_struct *tsk = current;
+	int ioprio_class;
+
+	if (!bfq_bfqq_prio_changed(bfqq))
+		return;
+
+	ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
+	switch (ioprio_class) {
+	default:
+		dev_err(bfqq->bfqd->queue->backing_dev_info.dev,
+			"bfq: bad prio %x\n", ioprio_class);
+	case IOPRIO_CLASS_NONE:
+		/*
+		 * No prio set, inherit CPU scheduling settings.
+		 */
+		bfqq->entity.new_ioprio = task_nice_ioprio(tsk);
+		bfqq->entity.new_ioprio_class = task_nice_ioclass(tsk);
+		break;
+	case IOPRIO_CLASS_RT:
+		bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->entity.new_ioprio_class = IOPRIO_CLASS_RT;
+		break;
+	case IOPRIO_CLASS_BE:
+		bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->entity.new_ioprio_class = IOPRIO_CLASS_BE;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		bfqq->entity.new_ioprio_class = IOPRIO_CLASS_IDLE;
+		bfqq->entity.new_ioprio = 7;
+		bfq_clear_bfqq_idle_window(bfqq);
+		break;
+	}
+
+	bfqq->entity.ioprio_changed = 1;
+
+	bfq_clear_bfqq_prio_changed(bfqq);
+}
+
+static void bfq_changed_ioprio(struct bfq_io_cq *bic)
+{
+	struct bfq_data *bfqd;
+	struct bfq_queue *bfqq, *new_bfqq;
+	unsigned long uninitialized_var(flags);
+	int ioprio = bic->icq.ioc->ioprio;
+
+	bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
+				   &flags);
+	/*
+	 * This condition may trigger on a newly created bic; be sure to
+	 * drop the lock before returning.
+	 */
+	if (unlikely(bfqd == NULL) || likely(bic->ioprio == ioprio))
+		goto out;
+
+	bfqq = bic->bfqq[BLK_RW_ASYNC];
+	if (bfqq != NULL) {
+		new_bfqq = bfq_get_queue(bfqd, BLK_RW_ASYNC, bic,
+					 GFP_ATOMIC);
+		if (new_bfqq != NULL) {
+			bic->bfqq[BLK_RW_ASYNC] = new_bfqq;
+			bfq_log_bfqq(bfqd, bfqq,
+				     "changed_ioprio: bfqq %p %d",
+				     bfqq, atomic_read(&bfqq->ref));
+			bfq_put_queue(bfqq);
+		}
+	}
+
+	bfqq = bic->bfqq[BLK_RW_SYNC];
+	if (bfqq != NULL)
+		bfq_mark_bfqq_prio_changed(bfqq);
+
+	bic->ioprio = ioprio;
+
+out:
+	bfq_put_bfqd_unlock(bfqd, &flags);
+}
+
+static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			  pid_t pid, int is_sync)
+{
+	RB_CLEAR_NODE(&bfqq->entity.rb_node);
+	INIT_LIST_HEAD(&bfqq->fifo);
+
+	atomic_set(&bfqq->ref, 0);
+	bfqq->bfqd = bfqd;
+
+	bfq_mark_bfqq_prio_changed(bfqq);
+
+	if (is_sync) {
+		if (!bfq_class_idle(bfqq))
+			bfq_mark_bfqq_idle_window(bfqq);
+		bfq_mark_bfqq_sync(bfqq);
+	}
+
+	/* Tentative initial value to trade off between thr and lat */
+	bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
+	bfqq->pid = pid;
+}
+
+static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
+					      int is_sync,
+					      struct bfq_io_cq *bic,
+					      gfp_t gfp_mask)
+{
+	struct bfq_queue *bfqq, *new_bfqq = NULL;
+
+retry:
+	/* bic always exists here */
+	bfqq = bic_to_bfqq(bic, is_sync);
+
+	/*
+	 * If we previously fell back to the OOM bfqq, always try a new
+	 * allocation, since OOM should be just a temporary situation.
+	 */
+	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
+		bfqq = NULL;
+		if (new_bfqq != NULL) {
+			bfqq = new_bfqq;
+			new_bfqq = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			spin_unlock_irq(bfqd->queue->queue_lock);
+			new_bfqq = kmem_cache_alloc_node(bfq_pool,
+					gfp_mask | __GFP_ZERO,
+					bfqd->queue->node);
+			spin_lock_irq(bfqd->queue->queue_lock);
+			if (new_bfqq != NULL)
+				goto retry;
+		} else {
+			bfqq = kmem_cache_alloc_node(bfq_pool,
+					gfp_mask | __GFP_ZERO,
+					bfqd->queue->node);
+		}
+
+		if (bfqq != NULL) {
+			bfq_init_bfqq(bfqd, bfqq, current->pid, is_sync);
+			bfq_log_bfqq(bfqd, bfqq, "allocated");
+		} else {
+			bfqq = &bfqd->oom_bfqq;
+			bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");
+		}
+
+		bfq_init_prio_data(bfqq, bic);
+	}
+
+	if (new_bfqq != NULL)
+		kmem_cache_free(bfq_pool, new_bfqq);
+
+	return bfqq;
+}
+
+static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
+					       int ioprio_class, int ioprio)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		return &async_bfqq[0][ioprio];
+	case IOPRIO_CLASS_NONE:
+		ioprio = IOPRIO_NORM;
+		/* fall through */
+	case IOPRIO_CLASS_BE:
+		return &async_bfqq[1][ioprio];
+	case IOPRIO_CLASS_IDLE:
+		return &async_idle_bfqq;
+	default:
+		BUG();
+	}
+}
+
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
+				       int is_sync, struct bfq_io_cq *bic,
+				       gfp_t gfp_mask)
+{
+	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
+	struct bfq_queue **async_bfqq = NULL;
+	struct bfq_queue *bfqq = NULL;
+
+	if (!is_sync) {
+		async_bfqq = bfq_async_queue_prio(bfqd, ioprio_class, ioprio);
+		bfqq = *async_bfqq;
+	}
+
+	if (bfqq == NULL)
+		bfqq = bfq_find_alloc_queue(bfqd, is_sync, bic, gfp_mask);
+
+	/*
+	 * Pin the queue now that it's allocated, scheduler exit will
+	 * prune it.
+	 */
+	if (!is_sync && *async_bfqq == NULL) {
+		atomic_inc(&bfqq->ref);
+		bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		*async_bfqq = bfqq;
+	}
+
+	atomic_inc(&bfqq->ref);
+	bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq,
+		     atomic_read(&bfqq->ref));
+	return bfqq;
+}
+
+static void bfq_update_io_thinktime(struct bfq_data *bfqd,
+				    struct bfq_io_cq *bic)
+{
+	unsigned long elapsed = jiffies - bic->ttime.last_end_request;
+	unsigned long ttime = min(elapsed, 2UL * bfqd->bfq_slice_idle);
+
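+	/*
+	 * Exponentially weighted moving average with a 7/8 decay
+	 * factor, in fixed point (scaled by 256): each update replaces
+	 * about 1/8 of the accumulated history, and ttime_mean
+	 * approximates the recent mean think time, in jiffies.
+	 */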
+	bic->ttime.ttime_samples = (7*bic->ttime.ttime_samples + 256) / 8;
+	bic->ttime.ttime_total = (7*bic->ttime.ttime_total + 256*ttime) / 8;
+	bic->ttime.ttime_mean = (bic->ttime.ttime_total + 128) /
+				bic->ttime.ttime_samples;
+}
+
+static void bfq_update_io_seektime(struct bfq_data *bfqd,
+				   struct bfq_queue *bfqq,
+				   struct request *rq)
+{
+	sector_t sdist;
+	u64 total;
+
+	if (bfqq->last_request_pos < blk_rq_pos(rq))
+		sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
+	else
+		sdist = bfqq->last_request_pos - blk_rq_pos(rq);
+
+	/*
+	 * Don't allow the seek distance to get too large from the
+	 * odd fragment, pagein, etc.
+	 */
+	if (bfqq->seek_samples == 0) /* first request, not really a seek */
+		sdist = 0;
+	else if (bfqq->seek_samples <= 60) /* second & third seek */
+		sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*1024);
+	else
+		sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*64);
+
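+	/*
+	 * Same 7/8-decay fixed-point average as for think times:
+	 * seek_mean tracks the recent mean seek distance, in sectors,
+	 * of the requests issued by this queue.
+	 */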
+	bfqq->seek_samples = (7*bfqq->seek_samples + 256) / 8;
+	bfqq->seek_total = (7*bfqq->seek_total + (u64)256*sdist) / 8;
+	total = bfqq->seek_total + (bfqq->seek_samples/2);
+	do_div(total, bfqq->seek_samples);
+	bfqq->seek_mean = (sector_t)total;
+
+	bfq_log_bfqq(bfqd, bfqq, "dist=%llu mean=%llu", (u64)sdist,
+			(u64)bfqq->seek_mean);
+}
+
+/*
+ * Disable idle window if the process thinks too long or seeks so much that
+ * it doesn't matter.
+ */
+static void bfq_update_idle_window(struct bfq_data *bfqd,
+				   struct bfq_queue *bfqq,
+				   struct bfq_io_cq *bic)
+{
+	int enable_idle;
+
+	/* Don't idle for async or idle io prio class. */
+	if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
+		return;
+
+	enable_idle = bfq_bfqq_idle_window(bfqq);
+
+	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
+	    bfqd->bfq_slice_idle == 0 ||
+		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+		enable_idle = 0;
+	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
+		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+	bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d",
+		enable_idle);
+
+	if (enable_idle)
+		bfq_mark_bfqq_idle_window(bfqq);
+	else
+		bfq_clear_bfqq_idle_window(bfqq);
+}
+
+/*
+ * Called when a new fs request (rq) is added to bfqq.  Check if there's
+ * something we should do about it.
+ */
+static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			    struct request *rq)
+{
+	struct bfq_io_cq *bic = RQ_BIC(rq);
+
+	if (rq->cmd_flags & REQ_META)
+		bfqq->meta_pending++;
+
+	bfq_update_io_thinktime(bfqd, bic);
+	bfq_update_io_seektime(bfqd, bfqq, rq);
+	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
+	    !BFQQ_SEEKY(bfqq))
+		bfq_update_idle_window(bfqd, bfqq, bic);
+
+	bfq_log_bfqq(bfqd, bfqq,
+		     "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
+		     bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq),
+		     (long long unsigned)bfqq->seek_mean);
+
+	bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
+
+	if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
+		int small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&
+				blk_rq_sectors(rq) < 32;
+		int budget_timeout = bfq_bfqq_budget_timeout(bfqq);
+
+		/*
+		 * There is just this request queued: if the request
+		 * is small and the queue is not to be expired, then
+		 * just exit.
+		 *
+		 * In this way, if the disk is being idled to wait for
+		 * a new request from the in-service queue, we avoid
+		 * unplugging the device and committing the disk to serve
+		 * just a small request. Instead, we wait for
+		 * the block layer to decide when to unplug the device:
+		 * hopefully, new requests will be merged to this one
+		 * quickly, then the device will be unplugged and
+		 * larger requests will be dispatched.
+		 */
+		if (small_req && !budget_timeout)
+			return;
+
+		/*
+		 * A large enough request arrived, or the queue is to
+		 * be expired: in both cases disk idling is to be
+		 * stopped, so clear wait_request flag and reset
+		 * timer.
+		 */
+		bfq_clear_bfqq_wait_request(bfqq);
+		del_timer(&bfqd->idle_slice_timer);
+
+		/*
+		 * The queue is not empty, because a new request just
+		 * arrived. Hence we can safely expire the queue, in
+		 * case of budget timeout, without risking that the
+		 * timestamps of the queue are not updated correctly.
+		 * See [1] for more details.
+		 */
+		if (budget_timeout)
+			bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
+
+		/*
+		 * Let the request rip immediately, or let a new queue be
+		 * selected if bfqq has just been expired.
+		 */
+		__blk_run_queue(bfqd->queue);
+	}
+}
+
+static void bfq_insert_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	assert_spin_locked(bfqd->queue->queue_lock);
+
+	bfq_init_prio_data(bfqq, RQ_BIC(rq));
+
+	bfq_add_request(rq);
+
+	rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
+	list_add_tail(&rq->queuelist, &bfqq->fifo);
+
+	bfq_rq_enqueued(bfqd, bfqq, rq);
+}
+
+static void bfq_update_hw_tag(struct bfq_data *bfqd)
+{
+	bfqd->max_rq_in_driver = max(bfqd->max_rq_in_driver,
+				     bfqd->rq_in_driver);
+
+	if (bfqd->hw_tag == 1)
+		return;
+
+	/*
+	 * This sample is valid if the number of outstanding requests
+	 * is large enough to allow a queueing behavior.  Note that the
+	 * sum is not exact, as it does not take deactivated requests
+	 * into account.
+	 */
+	if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
+		return;
+
+	if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
+		return;
+
+	bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
+	bfqd->max_rq_in_driver = 0;
+	bfqd->hw_tag_samples = 0;
+}
+
+static void bfq_completed_request(struct request_queue *q, struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_data *bfqd = bfqq->bfqd;
+	bool sync = bfq_bfqq_sync(bfqq);
+
+	bfq_log_bfqq(bfqd, bfqq, "completed one req with %u sects left (%d)",
+		     blk_rq_sectors(rq), sync);
+
+	bfq_update_hw_tag(bfqd);
+
+	bfqd->rq_in_driver--;
+	bfqq->dispatched--;
+
+	if (sync) {
+		bfqd->sync_flight--;
+		RQ_BIC(rq)->ttime.last_end_request = jiffies;
+	}
+
+	/*
+	 * If this is the in-service queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+	if (bfqd->in_service_queue == bfqq) {
+		if (bfq_bfqq_budget_new(bfqq))
+			bfq_set_budget_timeout(bfqd);
+
+		if (bfq_bfqq_must_idle(bfqq)) {
+			bfq_arm_slice_timer(bfqd);
+			goto out;
+		} else if (bfq_may_expire_for_budg_timeout(bfqq))
+			bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
+		else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
+			 (bfqq->dispatched == 0 ||
+			  !bfq_bfqq_must_not_expire(bfqq)))
+			bfq_bfqq_expire(bfqd, bfqq, 0,
+					BFQ_BFQQ_NO_MORE_REQUESTS);
+	}
+
+	if (!bfqd->rq_in_driver)
+		bfq_schedule_dispatch(bfqd);
+
+out:
+	return;
+}
+
+static inline int __bfq_may_queue(struct bfq_queue *bfqq)
+{
+	if (bfq_bfqq_wait_request(bfqq) && bfq_bfqq_must_alloc(bfqq)) {
+		bfq_clear_bfqq_must_alloc(bfqq);
+		return ELV_MQUEUE_MUST;
+	}
+
+	return ELV_MQUEUE_MAY;
+}
+
+static int bfq_may_queue(struct request_queue *q, int rw)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct task_struct *tsk = current;
+	struct bfq_io_cq *bic;
+	struct bfq_queue *bfqq;
+
+	/*
+	 * Don't force setup of a queue from here, as a call to may_queue
+	 * does not necessarily imply that a request actually will be
+	 * queued. So just lookup a possibly existing queue, or return
+	 * 'may queue' if that fails.
+	 */
+	bic = bfq_bic_lookup(bfqd, tsk->io_context);
+	if (bic == NULL)
+		return ELV_MQUEUE_MAY;
+
+	bfqq = bic_to_bfqq(bic, rw_is_sync(rw));
+	if (bfqq != NULL) {
+		bfq_init_prio_data(bfqq, bic);
+
+		return __bfq_may_queue(bfqq);
+	}
+
+	return ELV_MQUEUE_MAY;
+}
+
+/*
+ * Queue lock held here.
+ */
+static void bfq_put_request(struct request *rq)
+{
+	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+
+	if (bfqq != NULL) {
+		const int rw = rq_data_dir(rq);
+
+		bfqq->allocated[rw]--;
+
+		rq->elv.priv[0] = NULL;
+		rq->elv.priv[1] = NULL;
+
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "put_request %p, %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		bfq_put_queue(bfqq);
+	}
+}
+
+/*
+ * Allocate bfq data structures associated with this request.
+ */
+static int bfq_set_request(struct request_queue *q, struct request *rq,
+			   struct bio *bio, gfp_t gfp_mask)
+{
+	struct bfq_data *bfqd = q->elevator->elevator_data;
+	struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
+	const int rw = rq_data_dir(rq);
+	const int is_sync = rq_is_sync(rq);
+	struct bfq_queue *bfqq;
+	unsigned long flags;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+
+	bfq_changed_ioprio(bic);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	if (bic == NULL)
+		goto queue_fail;
+
+	bfqq = bic_to_bfqq(bic, is_sync);
+	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
+		bfqq = bfq_get_queue(bfqd, is_sync, bic, gfp_mask);
+		bic_set_bfqq(bic, bfqq, is_sync);
+	}
+
+	bfqq->allocated[rw]++;
+	atomic_inc(&bfqq->ref);
+	bfq_log_bfqq(bfqd, bfqq, "set_request: bfqq %p, %d", bfqq,
+		     atomic_read(&bfqq->ref));
+
+	rq->elv.priv[0] = bic;
+	rq->elv.priv[1] = bfqq;
+
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return 0;
+
+queue_fail:
+	bfq_schedule_dispatch(bfqd);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return 1;
+}
+
+static void bfq_kick_queue(struct work_struct *work)
+{
+	struct bfq_data *bfqd =
+		container_of(work, struct bfq_data, unplug_work);
+	struct request_queue *q = bfqd->queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+/*
+ * Handler of the expiration of the timer running if the in-service queue
+ * is idling inside its time slice.
+ */
+static void bfq_idle_slice_timer(unsigned long data)
+{
+	struct bfq_data *bfqd = (struct bfq_data *)data;
+	struct bfq_queue *bfqq;
+	unsigned long flags;
+	enum bfqq_expiration reason;
+
+	spin_lock_irqsave(bfqd->queue->queue_lock, flags);
+
+	bfqq = bfqd->in_service_queue;
+	/*
+	 * Theoretical race here: the in-service queue can be NULL or
+	 * different from the queue that was idling if the timer handler
+	 * spins on the queue_lock and a new request arrives for the
+	 * current queue and there is a full dispatch cycle that changes
+	 * the in-service queue.  This is unlikely to happen, but in the
+	 * worst case we just expire a queue too early.
+	 */
+	if (bfqq != NULL) {
+		bfq_log_bfqq(bfqd, bfqq, "slice_timer expired");
+		if (bfq_bfqq_budget_timeout(bfqq))
+			/*
+			 * Also here the queue can be safely expired
+			 * for budget timeout without wasting
+			 * guarantees
+			 */
+			reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+		else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0)
+			/*
+			 * The queue may not be empty upon timer expiration,
+			 * because we may not disable the timer when the
+			 * first request of the in-service queue arrives
+			 * during disk idling.
+			 */
+			reason = BFQ_BFQQ_TOO_IDLE;
+		else
+			goto schedule_dispatch;
+
+		bfq_bfqq_expire(bfqd, bfqq, 1, reason);
+	}
+
+schedule_dispatch:
+	bfq_schedule_dispatch(bfqd);
+
+	spin_unlock_irqrestore(bfqd->queue->queue_lock, flags);
+}
+
+static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
+{
+	del_timer_sync(&bfqd->idle_slice_timer);
+	cancel_work_sync(&bfqd->unplug_work);
+}
+
+static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
+					struct bfq_queue **bfqq_ptr)
+{
+	struct bfq_queue *bfqq = *bfqq_ptr;
+
+	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
+	if (bfqq != NULL) {
+		bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		bfq_put_queue(bfqq);
+		*bfqq_ptr = NULL;
+	}
+}
+
+/*
+ * Release the extra reference of the async queues as the device
+ * goes away.
+ */
+static void bfq_put_async_queues(struct bfq_data *bfqd)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			__bfq_put_async_bfqq(bfqd, &async_bfqq[i][j]);
+
+	__bfq_put_async_bfqq(bfqd, &async_idle_bfqq);
+}
+
+static void bfq_exit_queue(struct elevator_queue *e)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	struct request_queue *q = bfqd->queue;
+	struct bfq_queue *bfqq, *n;
+
+	bfq_shutdown_timer_wq(bfqd);
+
+	spin_lock_irq(q->queue_lock);
+
+	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
+		bfq_deactivate_bfqq(bfqd, bfqq, 0);
+
+	bfq_put_async_queues(bfqd);
+	spin_unlock_irq(q->queue_lock);
+
+	bfq_shutdown_timer_wq(bfqd);
+
+	synchronize_rcu();
+
+	kfree(bfqd);
+}
+
+static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
+{
+	struct bfq_data *bfqd;
+	struct elevator_queue *eq;
+
+	eq = elevator_alloc(q, e);
+	if (eq == NULL)
+		return -ENOMEM;
+
+	bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
+	if (bfqd == NULL) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+	eq->elevator_data = bfqd;
+
+	/*
+	 * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
+	 * Grab a permanent reference to it, so that the normal code flow
+	 * will not attempt to free it.
+	 */
+	bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, 1, 0);
+	atomic_inc(&bfqd->oom_bfqq.ref);
+
+	bfqd->queue = q;
+
+	spin_lock_irq(q->queue_lock);
+	q->elevator = eq;
+	spin_unlock_irq(q->queue_lock);
+
+	init_timer(&bfqd->idle_slice_timer);
+	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
+	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
+
+	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
+
+	INIT_LIST_HEAD(&bfqd->active_list);
+	INIT_LIST_HEAD(&bfqd->idle_list);
+
+	bfqd->hw_tag = -1;
+
+	bfqd->bfq_max_budget = bfq_default_max_budget;
+
+	bfqd->bfq_quantum = bfq_quantum;
+	bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0];
+	bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1];
+	bfqd->bfq_back_max = bfq_back_max;
+	bfqd->bfq_back_penalty = bfq_back_penalty;
+	bfqd->bfq_slice_idle = bfq_slice_idle;
+	bfqd->bfq_class_idle_last_service = 0;
+	bfqd->bfq_max_budget_async_rq = bfq_max_budget_async_rq;
+	bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async;
+	bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync;
+
+	return 0;
+}
+
+static void bfq_slab_kill(void)
+{
+	if (bfq_pool != NULL)
+		kmem_cache_destroy(bfq_pool);
+}
+
+static int __init bfq_slab_setup(void)
+{
+	bfq_pool = KMEM_CACHE(bfq_queue, 0);
+	if (bfq_pool == NULL)
+		return -ENOMEM;
+	return 0;
+}
+
+static ssize_t bfq_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%d\n", var);
+}
+
+static ssize_t bfq_var_store(unsigned long *var, const char *page,
+			     size_t count)
+{
+	unsigned long new_val;
+	int ret = kstrtoul(page, 10, &new_val);
+
+	if (ret == 0)
+		*var = new_val;
+
+	return count;
+}
+
+static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
+{
+	struct bfq_queue *bfqq;
+	struct bfq_data *bfqd = e->elevator_data;
+	ssize_t num_char = 0;
+
+	num_char += sprintf(page + num_char, "Tot reqs queued %d\n\n",
+			    bfqd->queued);
+
+	spin_lock_irq(bfqd->queue->queue_lock);
+
+	num_char += sprintf(page + num_char, "Active:\n");
+	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
+		num_char += sprintf(page + num_char,
+				    "pid%d: weight %hu, nr_queued %d %d\n",
+				    bfqq->pid,
+				    bfqq->entity.weight,
+				    bfqq->queued[0],
+				    bfqq->queued[1]);
+	}
+
+	num_char += sprintf(page + num_char, "Idle:\n");
+	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
+		num_char += sprintf(page + num_char,
+				    "pid%d: weight %hu\n",
+				    bfqq->pid,
+				    bfqq->entity.weight);
+	}
+
+	spin_unlock_irq(bfqd->queue->queue_lock);
+
+	return num_char;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct bfq_data *bfqd = e->elevator_data;			\
+	unsigned int __data = __VAR;					\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return bfq_var_show(__data, (page));				\
+}
+SHOW_FUNCTION(bfq_quantum_show, bfqd->bfq_quantum, 0);
+SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 1);
+SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 1);
+SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0);
+SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0);
+SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 1);
+SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
+SHOW_FUNCTION(bfq_max_budget_async_rq_show,
+	      bfqd->bfq_max_budget_async_rq, 0);
+SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
+SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+static ssize_t								\
+__FUNC(struct elevator_queue *e, const char *page, size_t count)	\
+{									\
+	struct bfq_data *bfqd = e->elevator_data;			\
+	unsigned long uninitialized_var(__data);			\
+	int ret = bfq_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(bfq_quantum_store, &bfqd->bfq_quantum, 1, INT_MAX, 0);
+STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
+STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
+		INT_MAX, 0);
+STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq,
+		1, INT_MAX, 0);
+STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
+		INT_MAX, 1);
+#undef STORE_FUNCTION
+
+/* do nothing for the moment */
+static ssize_t bfq_weights_store(struct elevator_queue *e,
+				    const char *page, size_t count)
+{
+	return count;
+}
+
+static inline unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd)
+{
+	u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
+
+	if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES)
+		return bfq_calc_max_budget(bfqd->peak_rate, timeout);
+	else
+		return bfq_default_max_budget;
+}
+
+static ssize_t bfq_max_budget_store(struct elevator_queue *e,
+				    const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data == 0)
+		bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+	else {
+		if (__data > INT_MAX)
+			__data = INT_MAX;
+		bfqd->bfq_max_budget = __data;
+	}
+
+	bfqd->bfq_user_max_budget = __data;
+
+	return ret;
+}
+
+static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
+				      const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data < 1)
+		__data = 1;
+	else if (__data > INT_MAX)
+		__data = INT_MAX;
+
+	bfqd->bfq_timeout[BLK_RW_SYNC] = msecs_to_jiffies(__data);
+	if (bfqd->bfq_user_max_budget == 0)
+		bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
+
+	return ret;
+}
+
+#define BFQ_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
+
+static struct elv_fs_entry bfq_attrs[] = {
+	BFQ_ATTR(quantum),
+	BFQ_ATTR(fifo_expire_sync),
+	BFQ_ATTR(fifo_expire_async),
+	BFQ_ATTR(back_seek_max),
+	BFQ_ATTR(back_seek_penalty),
+	BFQ_ATTR(slice_idle),
+	BFQ_ATTR(max_budget),
+	BFQ_ATTR(max_budget_async_rq),
+	BFQ_ATTR(timeout_sync),
+	BFQ_ATTR(timeout_async),
+	BFQ_ATTR(weights),
+	__ATTR_NULL
+};
+
+static struct elevator_type iosched_bfq = {
+	.ops = {
+		.elevator_merge_fn =		bfq_merge,
+		.elevator_merged_fn =		bfq_merged_request,
+		.elevator_merge_req_fn =	bfq_merged_requests,
+		.elevator_allow_merge_fn =	bfq_allow_merge,
+		.elevator_dispatch_fn =		bfq_dispatch_requests,
+		.elevator_add_req_fn =		bfq_insert_request,
+		.elevator_activate_req_fn =	bfq_activate_request,
+		.elevator_deactivate_req_fn =	bfq_deactivate_request,
+		.elevator_completed_req_fn =	bfq_completed_request,
+		.elevator_former_req_fn =	elv_rb_former_request,
+		.elevator_latter_req_fn =	elv_rb_latter_request,
+		.elevator_init_icq_fn =		bfq_init_icq,
+		.elevator_exit_icq_fn =		bfq_exit_icq,
+		.elevator_set_req_fn =		bfq_set_request,
+		.elevator_put_req_fn =		bfq_put_request,
+		.elevator_may_queue_fn =	bfq_may_queue,
+		.elevator_init_fn =		bfq_init_queue,
+		.elevator_exit_fn =		bfq_exit_queue,
+	},
+	.icq_size =		sizeof(struct bfq_io_cq),
+	.icq_align =		__alignof__(struct bfq_io_cq),
+	.elevator_attrs =	bfq_attrs,
+	.elevator_name =	"bfq",
+	.elevator_owner =	THIS_MODULE,
+};
+
+static int __init bfq_init(void)
+{
+	/*
+	 * Can be 0 on HZ < 1000 setups.
+	 */
+	if (bfq_slice_idle == 0)
+		bfq_slice_idle = 1;
+
+	if (bfq_timeout_async == 0)
+		bfq_timeout_async = 1;
+
+	if (bfq_slab_setup())
+		return -ENOMEM;
+
+	elv_register(&iosched_bfq);
+	pr_info("BFQ I/O-scheduler version: v0\n");
+
+	return 0;
+}
+
+static void __exit bfq_exit(void)
+{
+	elv_unregister(&iosched_bfq);
+	bfq_slab_kill();
+}
+
+module_init(bfq_init);
+module_exit(bfq_exit);
+
+MODULE_AUTHOR("Fabio Checconi, Paolo Valente");
+MODULE_LICENSE("GPL");
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
new file mode 100644
index 0000000..a9142f5
--- /dev/null
+++ b/block/bfq-sched.c
@@ -0,0 +1,936 @@
+/*
+ * BFQ: Hierarchical B-WF2Q+ scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
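+/*
+ * In this flat (single-level) version of the scheduler an entity has
+ * no parent entity: the iterators below visit only the entity itself,
+ * and the next-in-service helpers reduce to no-ops.
+ */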
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+	for (parent = NULL; entity != NULL; entity = parent)
+
+static inline int bfq_update_next_in_service(struct bfq_sched_data *sd)
+{
+	return 0;
+}
+
+static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
+					     struct bfq_entity *entity)
+{
+}
+
+static inline void bfq_update_budget(struct bfq_entity *next_in_service)
+{
+}
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time
+ * wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(u64 a, u64 b)
+{
+	return (s64)(a - b) > 0;
+}
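+
+/*
+ * Illustrative example for bfq_gt() (values are arbitrary): with
+ * a = 2 and b = ULLONG_MAX - 1, i.e., after @a has wrapped around,
+ * a - b evaluates to 4, so bfq_gt() still reports @a as the later
+ * timestamp although it is numerically smaller than @b.
+ */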
+
+static inline struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = NULL;
+
+	if (entity->my_sched_data == NULL)
+		bfqq = container_of(entity, struct bfq_queue, entity);
+
+	return bfqq;
+}
+
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor (weight of an entity or weight sum).
+ */
+static inline u64 bfq_delta(unsigned long service,
+					unsigned long weight)
+{
+	u64 d = (u64)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
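+
+/*
+ * Illustrative example for bfq_delta() (figures are arbitrary): a
+ * service of 8 sectors charged to an entity of weight 4 produces a
+ * virtual-time delta of (8 << WFQ_SERVICE_SHIFT) / 4, i.e., twice the
+ * delta that the same amount of service would produce for an entity
+ * of weight 8.
+ */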
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct bfq_entity *entity,
+				   unsigned long service)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	entity->finish = entity->start +
+		bfq_delta(service, entity->weight);
+
+	if (bfqq != NULL) {
+		bfq_log_bfqq(bfqq->bfqd, bfqq,
+			"calc_finish: serv %lu, w %d",
+			service, entity->weight);
+		bfq_log_bfqq(bfqq->bfqd, bfqq,
+			"calc_finish: start %llu, finish %llu, delta %llu",
+			entity->start, entity->finish,
+			bfq_delta(service, entity->weight));
+	}
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct bfq_entity *bfq_entity_of(struct rb_node *node)
+{
+	struct bfq_entity *entity = NULL;
+
+	if (node != NULL)
+		entity = rb_entry(node, struct bfq_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_extract(struct rb_root *root,
+			       struct bfq_entity *entity)
+{
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct bfq_service_tree *st,
+			     struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *next;
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = bfq_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = bfq_entity_of(next);
+	}
+
+	bfq_extract(&st->idle, entity);
+
+	if (bfqq != NULL)
+		list_del(&bfqq->bfqq_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct bfq_entity *entity)
+{
+	struct bfq_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct bfq_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of an entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function  assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct bfq_entity *entity,
+				  struct rb_node *node)
+{
+	struct bfq_entity *child;
+
+	if (node != NULL) {
+		child = rb_entry(node, struct bfq_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved;
+ * this function updates its min_start value.  The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+	struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (parent == NULL)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right != NULL)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left != NULL)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its
+ *                     group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * in each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct bfq_service_tree *st,
+			      struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
+	if (node->rb_left != NULL)
+		node = node->rb_left;
+	else if (node->rb_right != NULL)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+
+	if (bfqq != NULL)
+		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+}
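+
+/*
+ * Illustrative example of the augmented tree maintained above (values
+ * are arbitrary): with vtime = 10 and active entities whose
+ * (start, finish) pairs are (5, 12), (8, 20) and (11, 14), the rb tree
+ * is keyed on the finish times, while the min_start kept in each node
+ * lets the lookup in bfq_first_active_entity() skip any subtree whose
+ * min_start exceeds 10, i.e., any subtree with no eligible entity.
+ */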
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static inline unsigned short bfq_ioprio_to_weight(int ioprio)
+{
+	return IOPRIO_BE_NR - ioprio;
+}
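+
+/*
+ * For example, with IOPRIO_BE_NR equal to 8, the highest priority
+ * (ioprio 0) maps to weight 8, the default ioprio 4 maps to weight 4,
+ * and the lowest priority (ioprio 7) maps to weight 1.
+ */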
+
+/**
+ * bfq_weight_to_ioprio - calc an ioprio from a weight.
+ * @weight: the weight value to convert.
+ *
+ * To preserve as much as possible the old ioprio-only user interface,
+ * 0 is used as an escape ioprio value for weights (numerically) equal to
+ * or larger than IOPRIO_BE_NR.
+ */
+static inline unsigned short bfq_weight_to_ioprio(int weight)
+{
+	return IOPRIO_BE_NR - weight < 0 ? 0 : IOPRIO_BE_NR - weight;
+}
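+
+/*
+ * For example, weights 1 to 7 map back to ioprios 7 to 1, while any
+ * weight greater than or equal to IOPRIO_BE_NR (8) yields the escape
+ * ioprio 0 mentioned above.
+ */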
+
+static inline void bfq_get_entity(struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	if (bfqq != NULL) {
+		atomic_inc(&bfqq->ref);
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
+			     bfqq, atomic_read(&bfqq->ref));
+	}
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (node->rb_right == NULL && node->rb_left == NULL)
+		deepest = rb_parent(node);
+	else if (node->rb_right == NULL)
+		deepest = node->rb_left;
+	else if (node->rb_left == NULL)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right != NULL)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct bfq_service_tree *st,
+			       struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_extract(&st->active, entity);
+
+	if (node != NULL)
+		bfq_update_active_tree(node);
+
+	if (bfqq != NULL)
+		list_del(&bfqq->bfqq_list);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct bfq_service_tree *st,
+			    struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct bfq_entity *first_idle = st->first_idle;
+	struct bfq_entity *last_idle = st->last_idle;
+
+	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+
+	if (bfqq != NULL)
+		list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct bfq_service_tree *st,
+			      struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+	struct bfq_sched_data *sd;
+
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	if (bfqq != NULL) {
+		sd = entity->sched_data;
+		bfq_log_bfqq(bfqq->bfqd, bfqq, "forget_entity: %p %d",
+			     bfqq, atomic_read(&bfqq->ref));
+		bfq_put_queue(bfqq);
+	}
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+static void bfq_put_idle_entity(struct bfq_service_tree *st,
+				struct bfq_entity *entity)
+{
+	bfq_idle_extract(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+static void bfq_forget_idle(struct bfq_service_tree *st)
+{
+	struct bfq_entity *first_idle = st->first_idle;
+	struct bfq_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * Forget the whole idle tree, increasing the vtime past
+		 * the last finish time of idle entities.
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+static struct bfq_service_tree *
+__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
+			 struct bfq_entity *entity)
+{
+	struct bfq_service_tree *new_st = old_st;
+
+	if (entity->ioprio_changed) {
+		old_st->wsum -= entity->weight;
+
+		if (entity->new_weight != entity->orig_weight) {
+			entity->orig_weight = entity->new_weight;
+			entity->ioprio =
+				bfq_weight_to_ioprio(entity->orig_weight);
+		} else if (entity->new_ioprio != entity->ioprio) {
+			entity->ioprio = entity->new_ioprio;
+			entity->orig_weight =
+					bfq_ioprio_to_weight(entity->ioprio);
+		} else
+			entity->new_weight = entity->orig_weight =
+				bfq_ioprio_to_weight(entity->ioprio);
+
+		entity->ioprio_class = entity->new_ioprio_class;
+		entity->ioprio_changed = 0;
+
+		/*
+		 * NOTE: here we may be changing the weight too early,
+		 * this will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = bfq_entity_service_tree(entity);
+		entity->weight = entity->orig_weight;
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * bfq_bfqq_served - update the scheduler status after selection for
+ *                   service.
+ * @bfqq: the queue being served.
+ * @served: bytes to transfer.
+ *
+ * NOTE: this can be optimized, as the timestamps of upper level entities
+ * are synchronized every time a new bfqq is selected for service.  For now,
+ * we keep it to better check consistency.
+ */
+static void bfq_bfqq_served(struct bfq_queue *bfqq, unsigned long served)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	struct bfq_service_tree *st;
+
+	for_each_entity(entity) {
+		st = bfq_entity_service_tree(entity);
+
+		entity->service += served;
+
+		st->vtime += bfq_delta(served, st->wsum);
+		bfq_forget_idle(st);
+	}
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %lu secs", served);
+}
+
+/**
+ * bfq_bfqq_charge_full_budget - set the service to the entity budget.
+ * @bfqq: the queue that needs a service update.
+ *
+ * When it's not possible to be fair in the service domain, because
+ * a queue is not consuming its budget fast enough (the meaning of
+ * fast depends on the timeout parameter), we charge it a full
+ * budget.  In this way we should obtain a sort of time-domain
+ * fairness among all the seeky/slow queues.
+ */
+static inline void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget");
+
+	bfq_bfqq_served(bfqq, entity->budget - entity->service);
+}
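+
+/*
+ * Illustrative example for the charge above (figures are arbitrary):
+ * if a queue with a budget of 16384 sectors has received only 1024
+ * sectors of service when its timeout fires, it is charged the
+ * remaining 15360 sectors as well, so that in the virtual-time domain
+ * it progresses as if it had consumed its whole budget.
+ */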
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the current budget of the entity (and the
+ * service received if @entity is active) of the queue to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sd = entity->sched_data;
+	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+
+	if (entity == sd->in_service_entity) {
+		/*
+		 * If we are requeueing the current entity we have
+		 * to take care of not charging to it service it has
+		 * not received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->in_service_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_in_service entity below it.  We reuse the
+		 * old start time.
+		 */
+		bfq_active_extract(st, entity);
+	} else if (entity->tree == &st->idle) {
+		/*
+		 * Must be on the idle tree, bfq_idle_extract() will
+		 * check for that.
+		 */
+		bfq_idle_extract(st, entity);
+		entity->start = bfq_gt(st->vtime, entity->finish) ?
+				       st->vtime : entity->finish;
+	} else {
+		/*
+		 * The finish time of the entity may be invalid, and
+		 * it is in the past for sure, otherwise the queue
+		 * would have been on the idle tree.
+		 */
+		entity->start = st->vtime;
+		st->wsum += entity->weight;
+		bfq_get_entity(entity);
+
+		entity->on_st = 1;
+	}
+
+	st = __bfq_entity_update_weight_prio(st, entity);
+	bfq_calc_finish(entity, entity->budget);
+	bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
+ * @entity: the entity to activate.
+ *
+ * Activate @entity and all the entities on the path from it to the root.
+ */
+static void bfq_activate_entity(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sd;
+
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_in_service(sd))
+			/*
+			 * No need to propagate the activation to the
+			 * upper entities, as they will be updated when
+			 * the in-service entity is rescheduled.
+			 */
+			break;
+	}
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently from its previous state.  If the
+ * entity was not on a service tree just return, otherwise if it is on
+ * any scheduler tree, extract it from that tree, and, if the caller
+ * specified @requeue and the entity still has a finish time in the
+ * future, put it on the idle tree.
+ *
+ * Return %1 if the caller should update the entity hierarchy, i.e.,
+ * if the entity was in service or if it was the next_in_service for
+ * its sched_data; return %0 otherwise.
+ */
+static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+{
+	struct bfq_sched_data *sd = entity->sched_data;
+	struct bfq_service_tree *st = bfq_entity_service_tree(entity);
+	int was_in_service = entity == sd->in_service_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	if (was_in_service) {
+		bfq_calc_finish(entity, entity->service);
+		sd->in_service_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_extract(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_extract(st, entity);
+
+	if (was_in_service || sd->next_in_service == entity)
+		ret = bfq_update_next_in_service(sd);
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
+{
+	struct bfq_sched_data *sd;
+	struct bfq_entity *parent;
+
+	for_each_entity_safe(entity, parent) {
+		sd = entity->sched_data;
+
+		if (!__bfq_deactivate_entity(entity, requeue))
+			/*
+			 * The parent entity is still backlogged, and
+			 * we don't need to update it as it is still
+			 * in service.
+			 */
+			break;
+
+		if (sd->next_in_service != NULL)
+			/*
+			 * The parent entity is still backlogged and
+			 * the budgets on the path towards the root
+			 * need to be updated.
+			 */
+			goto update;
+
+		/*
+	 * If we reach this point, the parent is no longer backlogged and
+		 * we want to propagate the dequeue upwards.
+		 */
+		requeue = 1;
+	}
+
+	return;
+
+update:
+	entity = parent;
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_in_service(sd))
+			break;
+	}
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often, so
+ * we may end up with reactivated processes getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
+	entry = rb_entry(node, struct bfq_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active_entity - find the eligible entity with
+ *                           the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches for the first schedulable entity, starting from
+ * the root of the tree and descending into the left subtree whenever that
+ * subtree contains at least one eligible (start <= vtime) entity. The path
+ * on the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
+	while (node != NULL) {
+		entry = rb_entry(node, struct bfq_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		if (node->rb_left != NULL) {
+			entry = rb_entry(node->rb_left,
+					 struct bfq_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first != NULL)
+			break;
+		node = node->rb_right;
+	}
+
+	return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
+						   bool force)
+{
+	struct bfq_entity *entity, *new_next_in_service = NULL;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+
+	/*
+	 * If the chosen entity does not match with the sched_data's
+	 * next_in_service and we are forcedly serving the IDLE priority
+	 * class tree, bubble up budget update.
+	 */
+	if (unlikely(force && entity != entity->sched_data->next_in_service)) {
+		new_next_in_service = entity;
+		for_each_entity(new_next_in_service)
+			bfq_update_budget(new_next_in_service);
+	}
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_in_service entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort by just returning the cached next_in_service value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
+						 int extract,
+						 struct bfq_data *bfqd)
+{
+	struct bfq_service_tree *st = sd->service_tree;
+	struct bfq_entity *entity;
+	int i = 0;
+
+	if (bfqd != NULL &&
+	    jiffies - bfqd->bfq_class_idle_last_service > BFQ_CL_IDLE_TIMEOUT) {
+		entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
+						  true);
+		if (entity != NULL) {
+			i = BFQ_IOPRIO_CLASSES - 1;
+			bfqd->bfq_class_idle_last_service = jiffies;
+			sd->next_in_service = entity;
+		}
+	}
+	for (; i < BFQ_IOPRIO_CLASSES; i++) {
+		entity = __bfq_lookup_next_entity(st + i, false);
+		if (entity != NULL) {
+			if (extract) {
+				bfq_check_next_in_service(sd, entity);
+				bfq_active_extract(st + i, entity);
+				sd->in_service_entity = entity;
+				sd->next_in_service = NULL;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+/*
+ * Get next queue for service.
+ */
+static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+{
+	struct bfq_entity *entity = NULL;
+	struct bfq_sched_data *sd;
+	struct bfq_queue *bfqq;
+
+	if (bfqd->busy_queues == 0)
+		return NULL;
+
+	sd = &bfqd->sched_data;
+	for (; sd != NULL; sd = entity->my_sched_data) {
+		entity = bfq_lookup_next_entity(sd, 1, bfqd);
+		entity->service = 0;
+	}
+
+	bfqq = bfq_entity_to_bfqq(entity);
+
+	return bfqq;
+}
+
+static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+{
+	if (bfqd->in_service_bic != NULL) {
+		put_io_context(bfqd->in_service_bic->icq.ioc);
+		bfqd->in_service_bic = NULL;
+	}
+
+	bfqd->in_service_queue = NULL;
+	del_timer(&bfqd->idle_slice_timer);
+}
+
+static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+				int requeue)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	if (bfqq == bfqd->in_service_queue)
+		__bfq_bfqd_reset_in_service(bfqd);
+
+	bfq_deactivate_entity(entity, requeue);
+}
+
+static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+
+	bfq_activate_entity(entity);
+}
+
+/*
+ * Called when the bfqq no longer has requests pending; remove it from
+ * the service tree.
+ */
+static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			      int requeue)
+{
+	bfq_log_bfqq(bfqd, bfqq, "del from busy");
+
+	bfq_clear_bfqq_busy(bfqq);
+
+	bfqd->busy_queues--;
+
+	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
+}
+
+/*
+ * Called when an inactive queue receives a new request.
+ */
+static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqd, bfqq, "add to busy");
+
+	bfq_activate_bfqq(bfqd, bfqq);
+
+	bfq_mark_bfqq_busy(bfqq);
+	bfqd->busy_queues++;
+}
diff --git a/block/bfq.h b/block/bfq.h
new file mode 100644
index 0000000..bd146b6
--- /dev/null
+++ b/block/bfq.h
@@ -0,0 +1,467 @@
+/*
+ * BFQ-v0 for 3.15.0: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#ifndef _BFQ_H
+#define _BFQ_H
+
+#include <linux/blktrace_api.h>
+#include <linux/hrtimer.h>
+#include <linux/ioprio.h>
+#include <linux/rbtree.h>
+
+#define BFQ_IOPRIO_CLASSES	3
+#define BFQ_CL_IDLE_TIMEOUT	(HZ/5)
+
+#define BFQ_MIN_WEIGHT	1
+#define BFQ_MAX_WEIGHT	1000
+
+#define BFQ_DEFAULT_GRP_WEIGHT	10
+#define BFQ_DEFAULT_GRP_IOPRIO	0
+#define BFQ_DEFAULT_GRP_CLASS	IOPRIO_CLASS_BE
+
+struct bfq_entity;
+
+/**
+ * struct bfq_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * bfq_service_tree.  All the fields are protected by the queue lock
+ * of the containing bfqd.
+ */
+struct bfq_service_tree {
+	struct rb_root active;
+	struct rb_root idle;
+
+	struct bfq_entity *first_idle;
+	struct bfq_entity *last_idle;
+
+	u64 vtime;
+	unsigned long wsum;
+};
+
+/**
+ * struct bfq_sched_data - multi-class scheduler.
+ * @in_service_entity: entity in service.
+ * @next_in_service: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * bfq_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_in_service points to the active entity of the sched_data
+ * service trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among requests of the same
+ * queue requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct bfq_sched_data {
+	struct bfq_entity *in_service_entity;
+	struct bfq_entity *next_in_service;
+	struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
+};
+
+/**
+ * struct bfq_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ *         the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ *             this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the hierarchy, the
+ *                 associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_weight: when a weight change is requested, the new weight value.
+ * @orig_weight: original weight, used to implement weight boosting
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value.
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ *                    ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested a weight, ioprio or
+ *                  ioprio_class change.
+ *
+ * A bfq_entity is used to represent a bfq_queue (leaf node in the upper
+ * level scheduler). Each entity belongs to the sched_data of the parent
+ * group hierarchy. Non-leaf entities have also their own sched_data,
+ * stored in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would
+ * allow different weights on different devices, but this
+ * functionality is not exported to userspace for now.  Priorities and
+ * weights are updated lazily, first storing the new values into the
+ * new_* fields, then setting the @ioprio_changed flag.  As soon as
+ * there is a transition in the entity state that allows the priority
+ * update to take place the effective and the requested priority
+ * values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.  When dealing with "well-behaved" queues (i.e.,
+ * queues that do not spend too much time consuming their budget
+ * and have true sequential behavior, and when there are no external
+ * factors breaking anticipation), the relative weights at each level
+ * of the hierarchy should be guaranteed.  All the fields are
+ * protected by the queue lock of the containing bfqd.
+ */
+struct bfq_entity {
+	struct rb_node rb_node;
+
+	int on_st;
+
+	u64 finish;
+	u64 start;
+
+	struct rb_root *tree;
+
+	u64 min_start;
+
+	unsigned long service, budget;
+	unsigned short weight, new_weight;
+	unsigned short orig_weight;
+
+	struct bfq_entity *parent;
+
+	struct bfq_sched_data *my_sched_data;
+	struct bfq_sched_data *sched_data;
+
+	unsigned short ioprio, new_ioprio;
+	unsigned short ioprio_class, new_ioprio_class;
+
+	int ioprio_changed;
+};
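+
+/*
+ * Example of the lazy update mentioned above: a weight change requested
+ * while an entity is queued is only recorded in new_weight, together
+ * with the ioprio_changed flag; __bfq_entity_update_weight_prio() then
+ * applies it on the next (re)activation of the entity.
+ */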
+
+/**
+ * struct bfq_queue - leaf schedulable entity.
+ * @ref: reference counter.
+ * @bfqd: parent bfq_data.
+ * @sort_list: sorted list of pending requests.
+ * @next_rq: if fifo isn't expired, next request to serve.
+ * @queued: nr of requests queued in @sort_list.
+ * @allocated: currently allocated requests.
+ * @meta_pending: pending metadata requests.
+ * @fifo: fifo list of requests in sort_list.
+ * @entity: entity representing this queue in the scheduler.
+ * @max_budget: maximum budget allowed from the feedback mechanism.
+ * @budget_timeout: budget expiration (in jiffies).
+ * @dispatched: number of requests on the dispatch list or inside driver.
+ * @flags: status flags.
+ * @bfqq_list: node for active/idle bfqq list inside our bfqd.
+ * @seek_samples: number of seeks sampled
+ * @seek_total: sum of the distances of the seeks sampled
+ * @seek_mean: mean seek distance
+ * @last_request_pos: position of the last request enqueued
+ * @pid: pid of the process owning the queue, used for logging purposes.
+ *
+ * A bfq_queue is a leaf request queue; it can be associated with one
+ * io_context or, if it is async, with more than one.
+ */
+struct bfq_queue {
+	atomic_t ref;
+	struct bfq_data *bfqd;
+
+	struct rb_root sort_list;
+	struct request *next_rq;
+	int queued[2];
+	int allocated[2];
+	int meta_pending;
+	struct list_head fifo;
+
+	struct bfq_entity entity;
+
+	unsigned long max_budget;
+	unsigned long budget_timeout;
+
+	int dispatched;
+
+	unsigned int flags;
+
+	struct list_head bfqq_list;
+
+	unsigned int seek_samples;
+	u64 seek_total;
+	sector_t seek_mean;
+	sector_t last_request_pos;
+
+	pid_t pid;
+};
+
+/**
+ * struct bfq_ttime - per process thinktime stats.
+ * @ttime_total: total process thinktime
+ * @ttime_samples: number of thinktime samples
+ * @ttime_mean: average process thinktime
+ */
+struct bfq_ttime {
+	unsigned long last_end_request;
+
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+};
+
+/**
+ * struct bfq_io_cq - per (request_queue, io_context) structure.
+ * @icq: associated io_cq structure
+ * @bfqq: array of two process queues, the sync and the async
+ * @ttime: associated @bfq_ttime struct
+ */
+struct bfq_io_cq {
+	struct io_cq icq; /* must be the first member */
+	struct bfq_queue *bfqq[2];
+	struct bfq_ttime ttime;
+	int ioprio;
+};
+
+enum bfq_device_speed {
+	BFQ_BFQD_FAST,
+	BFQ_BFQD_SLOW,
+};
+
+/**
+ * struct bfq_data - per device data structure.
+ * @queue: request queue for the managed device.
+ * @sched_data: root @bfq_sched_data for the device.
+ * @busy_queues: number of bfq_queues containing requests (including the
+ *		 queue in service, even if it is idling).
+ * @queued: number of queued requests.
+ * @rq_in_driver: number of requests dispatched and waiting for completion.
+ * @sync_flight: number of sync requests in the driver.
+ * @max_rq_in_driver: max number of reqs in driver in the last
+ *                    @hw_tag_samples completed requests.
+ * @hw_tag_samples: nr of samples used to calculate hw_tag.
+ * @hw_tag: flag set to one if the driver is showing a queueing behavior.
+ * @budgets_assigned: number of budgets assigned.
+ * @idle_slice_timer: timer set when idling for the next sequential request
+ *                    from the queue in service.
+ * @unplug_work: delayed work to restart dispatching on the request queue.
+ * @in_service_queue: bfq_queue in service.
+ * @in_service_bic: bfq_io_cq (bic) associated with the @in_service_queue.
+ * @last_position: on-disk position of the last served request.
+ * @last_budget_start: beginning of the last budget.
+ * @last_idling_start: beginning of the last idle slice.
+ * @peak_rate: peak transfer rate observed for a budget.
+ * @peak_rate_samples: number of samples used to calculate @peak_rate.
+ * @bfq_max_budget: maximum budget allotted to a bfq_queue before
+ *                  rescheduling.
+ * @active_list: list of all the bfq_queues active on the device.
+ * @idle_list: list of all the bfq_queues idle on the device.
+ * @bfq_quantum: max number of requests dispatched per dispatch round.
+ * @bfq_fifo_expire: timeout for async/sync requests; when it expires
+ *                   requests are served in fifo order.
+ * @bfq_back_penalty: weight of backward seeks wrt forward ones.
+ * @bfq_back_max: maximum allowed backward seek.
+ * @bfq_slice_idle: maximum idling time.
+ * @bfq_user_max_budget: user-configured max budget value
+ *                       (0 for auto-tuning).
+ * @bfq_max_budget_async_rq: maximum budget (in nr of requests) allotted to
+ *                           async queues.
+ * @bfq_timeout: timeout for bfq_queues to consume their budget; used
+ *               to prevent seeky queues from imposing long latencies on
+ *               well-behaved ones (this also implies that seeky queues
+ *               cannot receive guarantees in the service domain; after
+ *               a timeout they are charged for the whole allocated
+ *               budget, to try to preserve a behavior reasonably fair
+ *               among them, but without service-domain guarantees).
+ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
+ *
+ * All the fields are protected by the @queue lock.
+ */
+struct bfq_data {
+	struct request_queue *queue;
+
+	struct bfq_sched_data sched_data;
+
+	int busy_queues;
+	int queued;
+	int rq_in_driver;
+	int sync_flight;
+
+	int max_rq_in_driver;
+	int hw_tag_samples;
+	int hw_tag;
+
+	int budgets_assigned;
+
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
+	struct bfq_queue *in_service_queue;
+	struct bfq_io_cq *in_service_bic;
+
+	sector_t last_position;
+
+	ktime_t last_budget_start;
+	ktime_t last_idling_start;
+	int peak_rate_samples;
+	u64 peak_rate;
+	unsigned long bfq_max_budget;
+
+	struct list_head active_list;
+	struct list_head idle_list;
+
+	unsigned int bfq_quantum;
+	unsigned int bfq_fifo_expire[2];
+	unsigned int bfq_back_penalty;
+	unsigned int bfq_back_max;
+	unsigned int bfq_slice_idle;
+	u64 bfq_class_idle_last_service;
+
+	unsigned int bfq_user_max_budget;
+	unsigned int bfq_max_budget_async_rq;
+	unsigned int bfq_timeout[2];
+
+	struct bfq_queue oom_bfqq;
+};
+
+enum bfqq_state_flags {
+	BFQ_BFQQ_FLAG_busy = 0,		/* has requests or is in service */
+	BFQ_BFQQ_FLAG_wait_request,	/* waiting for a request */
+	BFQ_BFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
+	BFQ_BFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
+	BFQ_BFQQ_FLAG_idle_window,	/* slice idling enabled */
+	BFQ_BFQQ_FLAG_prio_changed,	/* task priority has changed */
+	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */
+	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this budget */
+};
+
+#define BFQ_BFQQ_FNS(name)						\
+static inline void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)		\
+{									\
+	(bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name);			\
+}									\
+static inline void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)	\
+{									\
+	(bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name);			\
+}									\
+static inline int bfq_bfqq_##name(const struct bfq_queue *bfqq)		\
+{									\
+	return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0;	\
+}
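+
+/*
+ * For instance, BFQ_BFQQ_FNS(busy) below expands to
+ * bfq_mark_bfqq_busy(), bfq_clear_bfqq_busy() and bfq_bfqq_busy(),
+ * all acting on the BFQ_BFQQ_FLAG_busy bit of bfqq->flags.
+ */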
+
+BFQ_BFQQ_FNS(busy);
+BFQ_BFQQ_FNS(wait_request);
+BFQ_BFQQ_FNS(must_alloc);
+BFQ_BFQQ_FNS(fifo_expire);
+BFQ_BFQQ_FNS(idle_window);
+BFQ_BFQQ_FNS(prio_changed);
+BFQ_BFQQ_FNS(sync);
+BFQ_BFQQ_FNS(budget_new);
+#undef BFQ_BFQQ_FNS
+
+/* Logging facilities. */
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
+	blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
+
+#define bfq_log(bfqd, fmt, args...) \
+	blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
+
+/* Expiration reasons. */
+enum bfqq_expiration {
+	BFQ_BFQQ_TOO_IDLE = 0,		/*
+					 * queue has been idling for
+					 * too long
+					 */
+	BFQ_BFQQ_BUDGET_TIMEOUT,	/* budget took too long to be used */
+	BFQ_BFQQ_BUDGET_EXHAUSTED,	/* budget consumed */
+	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
+};
+
+static inline struct bfq_service_tree *
+bfq_entity_service_tree(struct bfq_entity *entity)
+{
+	struct bfq_sched_data *sched_data = entity->sched_data;
+	unsigned int idx = entity->ioprio_class - 1;
+
+	return sched_data->service_tree + idx;
+}
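+
+/*
+ * For example, with the standard class values IOPRIO_CLASS_RT (1),
+ * IOPRIO_CLASS_BE (2) and IOPRIO_CLASS_IDLE (3), the helper above
+ * returns service_tree[0], service_tree[1] and service_tree[2],
+ * respectively.
+ */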
+
+static inline struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic,
+					    int is_sync)
+{
+	return bic->bfqq[!!is_sync];
+}
+
+static inline void bic_set_bfqq(struct bfq_io_cq *bic,
+				struct bfq_queue *bfqq, int is_sync)
+{
+	bic->bfqq[!!is_sync] = bfqq;
+}
+
+static inline struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
+{
+	return bic->icq.q->elevator->elevator_data;
+}
+
+/**
+ * bfq_get_bfqd_locked - get a lock on a bfqd using an RCU-protected pointer.
+ * @ptr: a pointer to a bfqd.
+ * @flags: storage for the flags to be saved.
+ *
+ * This function allows bfqg->bfqd to be protected by the
+ * queue lock of the bfqd it references; the pointer is dereferenced
+ * under RCU, so the storage for bfqd is guaranteed to be safe as long
+ * as the RCU read side critical section does not end.  After the
+ * bfqd->queue->queue_lock is taken the pointer is rechecked, to be
+ * sure that no other writer accessed it.  If we raced with a writer,
+ * the function returns NULL, with the queue unlocked, otherwise it
+ * returns the dereferenced pointer, with the queue locked.
+ */
+static inline struct bfq_data *bfq_get_bfqd_locked(void **ptr,
+						   unsigned long *flags)
+{
+	struct bfq_data *bfqd;
+
+	rcu_read_lock();
+	bfqd = rcu_dereference(*(struct bfq_data **)ptr);
+
+	if (bfqd != NULL) {
+		spin_lock_irqsave(bfqd->queue->queue_lock, *flags);
+		if (*ptr == bfqd)
+			goto out;
+		spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
+	}
+
+	bfqd = NULL;
+out:
+	rcu_read_unlock();
+	return bfqd;
+}
+
+static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd,
+				       unsigned long *flags)
+{
+	spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
+}
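+
+/*
+ * Typical usage of the two helpers above (see, e.g.,
+ * bfq_bic_change_cgroup() in the cgroups support): dereference an
+ * RCU-protected bfqd pointer with bfq_get_bfqd_locked(), operate on
+ * the bfqd only if the returned value is not NULL, then release the
+ * queue lock with bfq_put_bfqd_unlock().
+ */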
+
+static void bfq_changed_ioprio(struct bfq_io_cq *bic);
+static void bfq_put_queue(struct bfq_queue *bfqq);
+static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, int is_sync,
+				       struct bfq_io_cq *bic, gfp_t gfp_mask);
+static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
+
+#endif /* _BFQ_H */
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 02/12] block, bfq: add full hierarchical scheduling and cgroups support
  2014-05-29  9:05           ` Paolo Valente
@ 2014-05-29  9:05               ` Paolo Valente
  -1 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

From: Fabio Checconi <fchecconi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Complete support for full hierarchical scheduling, with a cgroups
interface. The name of the new subsystem is bfqio.

Weights can be assigned explicitly to groups and processes through the
cgroups interface, unlike what happens for single processes when the
cgroups interface is not used (as explained in the description of
patch 2). In particular, since each node has a full scheduler, each
group can be assigned its own weight.

Signed-off-by: Fabio Checconi <fchecconi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/Kconfig.iosched         |  13 +-
 block/bfq-cgroup.c            | 891 ++++++++++++++++++++++++++++++++++++++++++
 block/bfq-iosched.c           |  66 ++--
 block/bfq-sched.c             |  64 ++-
 block/bfq.h                   | 122 +++++-
 include/linux/cgroup_subsys.h |   4 +
 6 files changed, 1111 insertions(+), 49 deletions(-)
 create mode 100644 block/bfq-cgroup.c

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8f98cc7..a3675cb 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -46,7 +46,18 @@ config IOSCHED_BFQ
 	  The BFQ I/O scheduler tries to distribute bandwidth among all
 	  processes according to their weights.
 	  It aims at distributing the bandwidth as desired, regardless
-	  of the disk parameters and with any workload.
+	  of the disk parameters and with any workload. If compiled
+	  built-in (saying Y here), BFQ can be configured to support
+	  hierarchical scheduling.
+
+config CGROUP_BFQIO
+	bool "BFQ hierarchical scheduling support"
+	depends on CGROUPS && IOSCHED_BFQ=y
+	default n
+	---help---
+	  Enable hierarchical scheduling in BFQ, using the cgroups
+	  filesystem interface.  The name of the subsystem will be
+	  bfqio.
 
 choice
 	prompt "Default I/O scheduler"
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
new file mode 100644
index 0000000..00a7a1b
--- /dev/null
+++ b/block/bfq-cgroup.c
@@ -0,0 +1,891 @@
+/*
+ * BFQ: CGROUPS support.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ *
+ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
+ * file.
+ */
+
+#ifdef CONFIG_CGROUP_BFQIO
+
+static DEFINE_MUTEX(bfqio_mutex);
+
+static bool bfqio_is_removed(struct bfqio_cgroup *bgrp)
+{
+	return bgrp ? !bgrp->online : false;
+}
+
+static struct bfqio_cgroup bfqio_root_cgroup = {
+	.weight = BFQ_DEFAULT_GRP_WEIGHT,
+	.ioprio = BFQ_DEFAULT_GRP_IOPRIO,
+	.ioprio_class = BFQ_DEFAULT_GRP_CLASS,
+};
+
+static inline void bfq_init_entity(struct bfq_entity *entity,
+				   struct bfq_group *bfqg)
+{
+	entity->weight = entity->new_weight;
+	entity->orig_weight = entity->new_weight;
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->parent = bfqg->my_entity;
+	entity->sched_data = &bfqg->sched_data;
+}
+
+static struct bfqio_cgroup *css_to_bfqio(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct bfqio_cgroup, css) : NULL;
+}
+
+/*
+ * Search for the bfq_group associated with bfqd in the hash table (for
+ * now only a list) of bgrp.  Must be called under rcu_read_lock().
+ */
+static struct bfq_group *bfqio_lookup_group(struct bfqio_cgroup *bgrp,
+					    struct bfq_data *bfqd)
+{
+	struct bfq_group *bfqg;
+	void *key;
+
+	hlist_for_each_entry_rcu(bfqg, &bgrp->group_data, group_node) {
+		key = rcu_dereference(bfqg->bfqd);
+		if (key == bfqd)
+			return bfqg;
+	}
+
+	return NULL;
+}
+
+static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
+					 struct bfq_group *bfqg)
+{
+	struct bfq_entity *entity = &bfqg->entity;
+
+	/*
+	 * If the weight of the entity has never been set via the sysfs
+	 * interface, then bgrp->weight == 0. In this case we initialize
+	 * the weight from the current ioprio value. Otherwise, the group
+	 * weight, if set, has priority over the ioprio value.
+	 */
+	if (bgrp->weight == 0) {
+		entity->new_weight = bfq_ioprio_to_weight(bgrp->ioprio);
+		entity->new_ioprio = bgrp->ioprio;
+	} else {
+		entity->new_weight = bgrp->weight;
+		entity->new_ioprio = bfq_weight_to_ioprio(bgrp->weight);
+	}
+	entity->orig_weight = entity->weight = entity->new_weight;
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
+	entity->my_sched_data = &bfqg->sched_data;
+}
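+
+/*
+ * For instance, the root cgroup defined above has
+ * weight = BFQ_DEFAULT_GRP_WEIGHT (10), which is not smaller than
+ * IOPRIO_BE_NR, so bfq_weight_to_ioprio() maps its weight to the
+ * escape ioprio 0, while a group whose weight has never been set
+ * inherits the weight corresponding to its ioprio.
+ */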
+
+static inline void bfq_group_set_parent(struct bfq_group *bfqg,
+					struct bfq_group *parent)
+{
+	struct bfq_entity *entity;
+
+	entity = &bfqg->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+}
+
+/**
+ * bfq_group_chain_alloc - allocate a chain of groups.
+ * @bfqd: queue descriptor.
+ * @css: the leaf cgroup_subsys_state this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @css up to the root cgroup.  Stop if a cgroup on the chain
+ * to the root already has a group allocated on @bfqd.
+ */
+static struct bfq_group *bfq_group_chain_alloc(struct bfq_data *bfqd,
+					       struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp;
+	struct bfq_group *bfqg, *prev = NULL, *leaf = NULL;
+
+	for (; css != NULL; css = css->parent) {
+		bgrp = css_to_bfqio(css);
+
+		bfqg = bfqio_lookup_group(bgrp, bfqd);
+		if (bfqg != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a bfq_group for bfqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		bfqg = kzalloc(sizeof(*bfqg), GFP_ATOMIC);
+		if (bfqg == NULL)
+			goto cleanup;
+
+		bfq_group_init_entity(bgrp, bfqg);
+		bfqg->my_entity = &bfqg->entity;
+
+		if (leaf == NULL) {
+			leaf = bfqg;
+			prev = leaf;
+		} else {
+			bfq_group_set_parent(prev, bfqg);
+			/*
+			 * Build a list of allocated nodes using the bfqd
+			 * field, which is still unused and will be
+			 * initialized only after the node is
+			 * connected.
+			 */
+			prev->bfqd = bfqg;
+			prev = bfqg;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->bfqd;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+/**
+ * bfq_group_chain_link - link an allocated group chain to a cgroup
+ *                        hierarchy.
+ * @bfqd: the queue descriptor.
+ * @css: the leaf cgroup_subsys_state to start from.
+ * @leaf: the leaf group (to be associated to @css).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @bfqd, all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the bfqio_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+static void bfq_group_chain_link(struct bfq_data *bfqd,
+				 struct cgroup_subsys_state *css,
+				 struct bfq_group *leaf)
+{
+	struct bfqio_cgroup *bgrp;
+	struct bfq_group *bfqg, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(bfqd->queue->queue_lock);
+
+	for (; css != NULL && leaf != NULL; css = css->parent) {
+		bgrp = css_to_bfqio(css);
+		next = leaf->bfqd;
+
+		bfqg = bfqio_lookup_group(bgrp, bfqd);
+
+		spin_lock_irqsave(&bgrp->lock, flags);
+
+		rcu_assign_pointer(leaf->bfqd, bfqd);
+		hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
+		hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);
+
+		spin_unlock_irqrestore(&bgrp->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	if (css != NULL && prev != NULL) {
+		bgrp = css_to_bfqio(css);
+		bfqg = bfqio_lookup_group(bgrp, bfqd);
+		bfq_group_set_parent(prev, bfqg);
+	}
+}
+
+/**
+ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
+ * @bfqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ *
+ * Return a group associated to @bfqd in @cgroup, allocating one if
+ * necessary.  When a group is returned all the cgroups in the path
+ * to the root have a group associated to @bfqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback.  If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd,
+					      struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+	struct bfq_group *bfqg;
+
+	bfqg = bfqio_lookup_group(bgrp, bfqd);
+	if (bfqg != NULL)
+		return bfqg;
+
+	bfqg = bfq_group_chain_alloc(bfqd, css);
+	if (bfqg != NULL)
+		bfq_group_chain_link(bfqd, css, bfqg);
+	else
+		bfqg = bfqd->root_group;
+
+	return bfqg;
+}
+
+/**
+ * bfq_bfqq_move - migrate @bfqq to @bfqg.
+ * @bfqd: queue descriptor.
+ * @bfqq: the queue to move.
+ * @entity: @bfqq's entity.
+ * @bfqg: the group to move to.
+ *
+ * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
+ * it on the new one.  Avoid putting the entity on the old group idle tree.
+ *
+ * Must be called under the queue lock; the cgroup owning @bfqg must
+ * not disappear (by now this just means that we are called under
+ * rcu_read_lock()).
+ */
+static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			  struct bfq_entity *entity, struct bfq_group *bfqg)
+{
+	int busy, resume;
+
+	busy = bfq_bfqq_busy(bfqq);
+	resume = !RB_EMPTY_ROOT(&bfqq->sort_list);
+
+	if (busy) {
+		if (!resume)
+			bfq_del_bfqq_busy(bfqd, bfqq, 0);
+		else
+			bfq_deactivate_bfqq(bfqd, bfqq, 0);
+	} else if (entity->on_st)
+		bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
+
+	/*
+	 * Here we use a reference to bfqg.  We don't need a refcounter
+	 * as the cgroup reference will not be dropped, so that its
+	 * destroy() callback will not be invoked.
+	 */
+	entity->parent = bfqg->my_entity;
+	entity->sched_data = &bfqg->sched_data;
+
+	if (busy && resume)
+		bfq_activate_bfqq(bfqd, bfqq);
+
+	if (bfqd->in_service_queue == NULL && !bfqd->rq_in_driver)
+		bfq_schedule_dispatch(bfqd);
+}
+
+/**
+ * __bfq_bic_change_cgroup - move @bic to @cgroup.
+ * @bfqd: the queue descriptor.
+ * @bic: the bic to move.
+ * @cgroup: the cgroup to move to.
+ *
+ * Move bic to cgroup, assuming that bfqd->queue is locked; the caller
+ * has to make sure that the reference to cgroup is valid across the call.
+ *
+ * NOTE: an alternative approach might have been to store the current
+ * cgroup in bfqq and get a reference to it, reducing the lookup
+ * time here, at the price of slightly more complex code.
+ */
+static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
+						struct bfq_io_cq *bic,
+						struct cgroup_subsys_state *css)
+{
+	struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0);
+	struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1);
+	struct bfq_entity *entity;
+	struct bfq_group *bfqg;
+	struct bfqio_cgroup *bgrp;
+
+	bgrp = css_to_bfqio(css);
+
+	bfqg = bfq_find_alloc_group(bfqd, css);
+	if (async_bfqq != NULL) {
+		entity = &async_bfqq->entity;
+
+		if (entity->sched_data != &bfqg->sched_data) {
+			bic_set_bfqq(bic, NULL, 0);
+			bfq_log_bfqq(bfqd, async_bfqq,
+				     "bic_change_group: %p %d",
+				     async_bfqq, atomic_read(&async_bfqq->ref));
+			bfq_put_queue(async_bfqq);
+		}
+	}
+
+	if (sync_bfqq != NULL) {
+		entity = &sync_bfqq->entity;
+		if (entity->sched_data != &bfqg->sched_data)
+			bfq_bfqq_move(bfqd, sync_bfqq, entity, bfqg);
+	}
+
+	return bfqg;
+}
+
+/**
+ * bfq_bic_change_cgroup - move @bic to @cgroup.
+ * @bic: the bic being migrated.
+ * @cgroup: the destination cgroup.
+ *
+ * When the task owning @bic is moved to @cgroup, @bic is immediately
+ * moved into its new parent group.
+ */
+static void bfq_bic_change_cgroup(struct bfq_io_cq *bic,
+				  struct cgroup_subsys_state *css)
+{
+	struct bfq_data *bfqd;
+	unsigned long uninitialized_var(flags);
+
+	bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
+				   &flags);
+	if (bfqd != NULL) {
+		__bfq_bic_change_cgroup(bfqd, bic, css);
+		bfq_put_bfqd_unlock(bfqd, &flags);
+	}
+}
+
+/**
+ * bfq_bic_update_cgroup - update the cgroup of @bic.
+ * @bic: the @bic to update.
+ *
+ * Make sure that @bic is enqueued in the cgroup of the current task.
+ * We need this in addition to moving bics during the cgroup attach
+ * phase because the task owning @bic could be at its first disk
+ * access or we may end up in the root cgroup as the result of a
+ * memory allocation failure, in which case we try here to move to the
+ * right group.
+ *
+ * Must be called under the queue lock.  It is safe to use the returned
+ * value even after the rcu_read_unlock() as the migration/destruction
+ * paths act under the queue lock too.  IOW it is impossible to race with
+ * group migration/destruction and end up with an invalid group as:
+ *   a) here cgroup has not yet been destroyed, nor its destroy callback
+ *      has started execution, as current holds a reference to it,
+ *   b) if it is destroyed after rcu_read_unlock() [after current is
+ *      migrated to a different cgroup] its attach() callback will have
+ *      taken care of removing all the references to the old cgroup data.
+ */
+static struct bfq_group *bfq_bic_update_cgroup(struct bfq_io_cq *bic)
+{
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+	struct bfq_group *bfqg;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+	css = task_css(current, bfqio_cgrp_id);
+	bfqg = __bfq_bic_change_cgroup(bfqd, bic, css);
+	rcu_read_unlock();
+
+	return bfqg;
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+static inline void bfq_flush_idle_tree(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entity = st->first_idle;
+
+	for (; entity != NULL; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
+
+/**
+ * bfq_reparent_leaf_entity - move leaf entity to the root_group.
+ * @bfqd: the device data structure with the root group.
+ * @entity: the entity to move.
+ */
+static inline void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
+					    struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	bfq_bfqq_move(bfqd, bfqq, entity, bfqd->root_group);
+	return;
+}
+
+/**
+ * bfq_reparent_active_entities - move to the root group all active
+ *                                entities.
+ * @bfqd: the device data structure with the root group.
+ * @bfqg: the group to move from.
+ * @st: the service tree with the entities.
+ *
+ * Needs queue_lock to be taken and reference to be valid over the call.
+ */
+static inline void bfq_reparent_active_entities(struct bfq_data *bfqd,
+						struct bfq_group *bfqg,
+						struct bfq_service_tree *st)
+{
+	struct rb_root *active = &st->active;
+	struct bfq_entity *entity = NULL;
+
+	if (!RB_EMPTY_ROOT(&st->active))
+		entity = bfq_entity_of(rb_first(active));
+
+	for (; entity != NULL; entity = bfq_entity_of(rb_first(active)))
+		bfq_reparent_leaf_entity(bfqd, entity);
+
+	if (bfqg->sched_data.in_service_entity != NULL)
+		bfq_reparent_leaf_entity(bfqd,
+			bfqg->sched_data.in_service_entity);
+
+	return;
+}
+
+/**
+ * bfq_destroy_group - destroy @bfqg.
+ * @bgrp: the bfqio_cgroup containing @bfqg.
+ * @bfqg: the group being destroyed.
+ *
+ * Destroy @bfqg, making sure that it is not referenced from its parent.
+ */
+static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg)
+{
+	struct bfq_data *bfqd;
+	struct bfq_service_tree *st;
+	struct bfq_entity *entity = bfqg->my_entity;
+	unsigned long uninitialized_var(flags);
+	int i;
+
+	hlist_del(&bfqg->group_node);
+
+	/*
+	 * Empty all service_trees belonging to this group before
+	 * deactivating the group itself.
+	 */
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
+		st = bfqg->sched_data.service_tree + i;
+
+		/*
+		 * The idle tree may still contain bfq_queues belonging
+		 * to exited tasks because they never migrated to a different
+		 * cgroup from the one being destroyed now.  No one else
+		 * can access them so it's safe to act without any lock.
+		 */
+		bfq_flush_idle_tree(st);
+
+		/*
+		 * It may happen that some queues are still active
+		 * (busy) upon group destruction (if the corresponding
+		 * processes have been forced to terminate). We move
+		 * all the leaf entities corresponding to these queues
+		 * to the root_group.
+		 * Also, it may happen that the group has an entity
+		 * in service, which is disconnected from the active
+		 * tree: it must be moved, too.
+		 * There is no need to put the sync queues, as the
+		 * scheduler has taken no reference.
+		 */
+		bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
+		if (bfqd != NULL) {
+			bfq_reparent_active_entities(bfqd, bfqg, st);
+			bfq_put_bfqd_unlock(bfqd, &flags);
+		}
+	}
+
+	/*
+	 * We may race with device destruction, take extra care when
+	 * dereferencing bfqg->bfqd.
+	 */
+	bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
+	if (bfqd != NULL) {
+		hlist_del(&bfqg->bfqd_node);
+		__bfq_deactivate_entity(entity, 0);
+		bfq_put_async_queues(bfqd, bfqg);
+		bfq_put_bfqd_unlock(bfqd, &flags);
+	}
+
+	/*
+	 * No need to defer the kfree() to the end of the RCU grace
+	 * period: we are called from the destroy() callback of our
+	 * cgroup, so we can be sure that no one is a) still using
+	 * this cgroup or b) doing lookups in it.
+	 */
+	kfree(bfqg);
+}
+
+/**
+ * bfq_disconnect_groups - disconnect @bfqd from all its groups.
+ * @bfqd: the device descriptor being exited.
+ *
+ * When the device exits we just make sure that no lookup can return
+ * the now unused group structures.  They will be deallocated on cgroup
+ * destruction.
+ */
+static void bfq_disconnect_groups(struct bfq_data *bfqd)
+{
+	struct hlist_node *tmp;
+	struct bfq_group *bfqg;
+
+	bfq_log(bfqd, "disconnect_groups beginning");
+	hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node) {
+		hlist_del(&bfqg->bfqd_node);
+
+		__bfq_deactivate_entity(bfqg->my_entity, 0);
+
+		/*
+		 * Don't remove from the group hash, just set an
+		 * invalid key.  No lookups can race with the
+		 * assignment as bfqd is being destroyed; this
+		 * also implies that new elements cannot be added
+		 * to the list.
+		 */
+		rcu_assign_pointer(bfqg->bfqd, NULL);
+
+		bfq_log(bfqd, "disconnect_groups: put async for group %p",
+			bfqg);
+		bfq_put_async_queues(bfqd, bfqg);
+	}
+}
+
+static inline void bfq_free_root_group(struct bfq_data *bfqd)
+{
+	struct bfqio_cgroup *bgrp = &bfqio_root_cgroup;
+	struct bfq_group *bfqg = bfqd->root_group;
+
+	bfq_put_async_queues(bfqd, bfqg);
+
+	spin_lock_irq(&bgrp->lock);
+	hlist_del_rcu(&bfqg->group_node);
+	spin_unlock_irq(&bgrp->lock);
+
+	/*
+	 * No need to synchronize_rcu() here: since the device is gone
+	 * there cannot be any read-side access to its root_group.
+	 */
+	kfree(bfqg);
+}
+
+static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
+{
+	struct bfq_group *bfqg;
+	struct bfqio_cgroup *bgrp;
+	int i;
+
+	bfqg = kzalloc_node(sizeof(*bfqg), GFP_KERNEL, node);
+	if (bfqg == NULL)
+		return NULL;
+
+	bfqg->entity.parent = NULL;
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+		bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+
+	bgrp = &bfqio_root_cgroup;
+	spin_lock_irq(&bgrp->lock);
+	rcu_assign_pointer(bfqg->bfqd, bfqd);
+	hlist_add_head_rcu(&bfqg->group_node, &bgrp->group_data);
+	spin_unlock_irq(&bgrp->lock);
+
+	return bfqg;
+}
+
+#define SHOW_FUNCTION(__VAR)						\
+static u64 bfqio_cgroup_##__VAR##_read(struct cgroup_subsys_state *css, \
+				       struct cftype *cftype)		\
+{									\
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);			\
+	u64 ret = -ENODEV;						\
+									\
+	mutex_lock(&bfqio_mutex);					\
+	if (bfqio_is_removed(bgrp))					\
+		goto out_unlock;					\
+									\
+	spin_lock_irq(&bgrp->lock);					\
+	ret = bgrp->__VAR;						\
+	spin_unlock_irq(&bgrp->lock);					\
+									\
+out_unlock:								\
+	mutex_unlock(&bfqio_mutex);					\
+	return ret;							\
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
+static int bfqio_cgroup_##__VAR##_write(struct cgroup_subsys_state *css,\
+					struct cftype *cftype,		\
+					u64 val)			\
+{									\
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);			\
+	struct bfq_group *bfqg;						\
+	int ret = -EINVAL;						\
+									\
+	if (val < (__MIN) || val > (__MAX))				\
+		return ret;						\
+									\
+	ret = -ENODEV;							\
+	mutex_lock(&bfqio_mutex);					\
+	if (bfqio_is_removed(bgrp))					\
+		goto out_unlock;					\
+	ret = 0;							\
+									\
+	spin_lock_irq(&bgrp->lock);					\
+	bgrp->__VAR = (unsigned short)val;				\
+	hlist_for_each_entry(bfqg, &bgrp->group_data, group_node) {	\
+		/*							\
+		 * Setting the ioprio_changed flag of the entity        \
+		 * to 1 with new_##__VAR == ##__VAR would re-set        \
+		 * the value of the weight to its ioprio mapping.       \
+		 * Set the flag only if necessary.			\
+		 */							\
+		if ((unsigned short)val != bfqg->entity.new_##__VAR) {  \
+			bfqg->entity.new_##__VAR = (unsigned short)val; \
+			/*						\
+			 * Make sure that the above new value has been	\
+			 * stored in bfqg->entity.new_##__VAR before	\
+			 * setting the ioprio_changed flag. In fact,	\
+			 * this flag may be read asynchronously (in	\
+			 * critical sections protected by a different	\
+			 * lock than that held here), and finding this	\
+			 * flag set may cause the execution of the code	\
+			 * for updating parameters whose value may	\
+			 * depend also on bfqg->entity.new_##__VAR (in	\
+			 * __bfq_entity_update_weight_prio).		\
+			 * This barrier makes sure that the new value	\
+			 * of bfqg->entity.new_##__VAR is correctly	\
+			 * seen in that code.				\
+			 */						\
+			smp_wmb();                                      \
+			bfqg->entity.ioprio_changed = 1;                \
+		}							\
+	}								\
+	spin_unlock_irq(&bgrp->lock);					\
+									\
+out_unlock:								\
+	mutex_unlock(&bfqio_mutex);					\
+	return ret;							\
+}
+
+STORE_FUNCTION(weight, BFQ_MIN_WEIGHT, BFQ_MAX_WEIGHT);
+STORE_FUNCTION(ioprio, 0, IOPRIO_BE_NR - 1);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+static struct cftype bfqio_files[] = {
+	{
+		.name = "weight",
+		.read_u64 = bfqio_cgroup_weight_read,
+		.write_u64 = bfqio_cgroup_weight_write,
+	},
+	{
+		.name = "ioprio",
+		.read_u64 = bfqio_cgroup_ioprio_read,
+		.write_u64 = bfqio_cgroup_ioprio_write,
+	},
+	{
+		.name = "ioprio_class",
+		.read_u64 = bfqio_cgroup_ioprio_class_read,
+		.write_u64 = bfqio_cgroup_ioprio_class_write,
+	},
+	{ },	/* terminate */
+};
+
+static struct cgroup_subsys_state *bfqio_create(struct cgroup_subsys_state
+						*parent_css)
+{
+	struct bfqio_cgroup *bgrp;
+
+	if (parent_css != NULL) {
+		bgrp = kzalloc(sizeof(*bgrp), GFP_KERNEL);
+		if (bgrp == NULL)
+			return ERR_PTR(-ENOMEM);
+	} else
+		bgrp = &bfqio_root_cgroup;
+
+	spin_lock_init(&bgrp->lock);
+	INIT_HLIST_HEAD(&bgrp->group_data);
+	bgrp->ioprio = BFQ_DEFAULT_GRP_IOPRIO;
+	bgrp->ioprio_class = BFQ_DEFAULT_GRP_CLASS;
+
+	return &bgrp->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main bic/bfqq data structures.  For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+static int bfqio_can_attach(struct cgroup_subsys_state *css,
+			    struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct io_context *ioc;
+	int ret = 0;
+
+	cgroup_taskset_for_each(task, tset) {
+		/*
+		 * task_lock() is needed to avoid races with
+		 * exit_io_context()
+		 */
+		task_lock(task);
+		ioc = task->io_context;
+		if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+			/*
+			 * ioc == NULL means that the task is either too
+			 * young or exiting: if it still has no ioc, the
+			 * ioc can't be shared; if the task is exiting,
+			 * the attach will fail anyway, no matter what
+			 * we return here.
+			 */
+			ret = -EINVAL;
+		task_unlock(task);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static void bfqio_attach(struct cgroup_subsys_state *css,
+			 struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct io_context *ioc;
+	struct io_cq *icq;
+
+	/*
+	 * IMPORTANT NOTE: The move of more than one process at a time to a
+	 * new group has not yet been tested.
+	 */
+	cgroup_taskset_for_each(task, tset) {
+		ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
+		if (ioc) {
+			/*
+			 * Handle cgroup change here.
+			 */
+			rcu_read_lock();
+			hlist_for_each_entry_rcu(icq, &ioc->icq_list, ioc_node)
+				if (!strncmp(
+					icq->q->elevator->type->elevator_name,
+					"bfq", ELV_NAME_MAX))
+					bfq_bic_change_cgroup(icq_to_bic(icq),
+							      css);
+			rcu_read_unlock();
+			put_io_context(ioc);
+		}
+	}
+}
+
+static void bfqio_destroy(struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+	struct hlist_node *tmp;
+	struct bfq_group *bfqg;
+
+	/*
+	 * Since we are destroying the cgroup, there are no more tasks
+	 * referencing it, and all the RCU grace periods that may have
+	 * referenced it are ended (as the destruction of the parent
+	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
+	 * anything else and we don't need any synchronization.
+	 */
+	hlist_for_each_entry_safe(bfqg, tmp, &bgrp->group_data, group_node)
+		bfq_destroy_group(bgrp, bfqg);
+
+	kfree(bgrp);
+}
+
+static int bfqio_css_online(struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+
+	mutex_lock(&bfqio_mutex);
+	bgrp->online = true;
+	mutex_unlock(&bfqio_mutex);
+
+	return 0;
+}
+
+static void bfqio_css_offline(struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+
+	mutex_lock(&bfqio_mutex);
+	bgrp->online = false;
+	mutex_unlock(&bfqio_mutex);
+}
+
+struct cgroup_subsys bfqio_cgrp_subsys = {
+	.css_alloc = bfqio_create,
+	.css_online = bfqio_css_online,
+	.css_offline = bfqio_css_offline,
+	.can_attach = bfqio_can_attach,
+	.attach = bfqio_attach,
+	.css_free = bfqio_destroy,
+	.base_cftypes = bfqio_files,
+};
+#else
+static inline void bfq_init_entity(struct bfq_entity *entity,
+				   struct bfq_group *bfqg)
+{
+	entity->weight = entity->new_weight;
+	entity->orig_weight = entity->new_weight;
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &bfqg->sched_data;
+}
+
+static inline struct bfq_group *
+bfq_bic_update_cgroup(struct bfq_io_cq *bic)
+{
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+	return bfqd->root_group;
+}
+
+static inline void bfq_bfqq_move(struct bfq_data *bfqd,
+				 struct bfq_queue *bfqq,
+				 struct bfq_entity *entity,
+				 struct bfq_group *bfqg)
+{
+}
+
+static inline void bfq_disconnect_groups(struct bfq_data *bfqd)
+{
+	bfq_put_async_queues(bfqd, bfqd->root_group);
+}
+
+static inline void bfq_free_root_group(struct bfq_data *bfqd)
+{
+	kfree(bfqd->root_group);
+}
+
+static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
+{
+	struct bfq_group *bfqg;
+	int i;
+
+	bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
+	if (bfqg == NULL)
+		return NULL;
+
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+		bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+
+	return bfqg;
+}
+#endif
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 01a98be..b2cbfce 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -66,14 +66,6 @@
 #include "bfq.h"
 #include "blk.h"
 
-/*
- * Array of async queues for all the processes, one queue
- * per ioprio value per ioprio_class.
- */
-struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
-/* Async queue for the idle class (ioprio is ignored) */
-struct bfq_queue *async_idle_bfqq;
-
 /* Max number of dispatches in one round of service. */
 static const int bfq_quantum = 4;
 
@@ -128,6 +120,7 @@ static inline void bfq_schedule_dispatch(struct bfq_data *bfqd);
 
 #include "bfq-ioc.c"
 #include "bfq-sched.c"
+#include "bfq-cgroup.c"
 
 #define bfq_class_idle(bfqq)	((bfqq)->entity.ioprio_class ==\
 				 IOPRIO_CLASS_IDLE)
@@ -1359,6 +1352,7 @@ static void bfq_changed_ioprio(struct bfq_io_cq *bic)
 {
 	struct bfq_data *bfqd;
 	struct bfq_queue *bfqq, *new_bfqq;
+	struct bfq_group *bfqg;
 	unsigned long uninitialized_var(flags);
 	int ioprio = bic->icq.ioc->ioprio;
 
@@ -1373,7 +1367,9 @@ static void bfq_changed_ioprio(struct bfq_io_cq *bic)
 
 	bfqq = bic->bfqq[BLK_RW_ASYNC];
 	if (bfqq != NULL) {
-		new_bfqq = bfq_get_queue(bfqd, BLK_RW_ASYNC, bic,
+		bfqg = container_of(bfqq->entity.sched_data, struct bfq_group,
+				    sched_data);
+		new_bfqq = bfq_get_queue(bfqd, bfqg, BLK_RW_ASYNC, bic,
 					 GFP_ATOMIC);
 		if (new_bfqq != NULL) {
 			bic->bfqq[BLK_RW_ASYNC] = new_bfqq;
@@ -1417,6 +1413,7 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 }
 
 static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
+					      struct bfq_group *bfqg,
 					      int is_sync,
 					      struct bfq_io_cq *bic,
 					      gfp_t gfp_mask)
@@ -1459,6 +1456,7 @@ retry:
 		}
 
 		bfq_init_prio_data(bfqq, bic);
+		bfq_init_entity(&bfqq->entity, bfqg);
 	}
 
 	if (new_bfqq != NULL)
@@ -1468,26 +1466,27 @@ retry:
 }
 
 static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
+					       struct bfq_group *bfqg,
 					       int ioprio_class, int ioprio)
 {
 	switch (ioprio_class) {
 	case IOPRIO_CLASS_RT:
-		return &async_bfqq[0][ioprio];
+		return &bfqg->async_bfqq[0][ioprio];
 	case IOPRIO_CLASS_NONE:
 		ioprio = IOPRIO_NORM;
 		/* fall through */
 	case IOPRIO_CLASS_BE:
-		return &async_bfqq[1][ioprio];
+		return &bfqg->async_bfqq[1][ioprio];
 	case IOPRIO_CLASS_IDLE:
-		return &async_idle_bfqq;
+		return &bfqg->async_idle_bfqq;
 	default:
 		BUG();
 	}
 }
 
 static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
-				       int is_sync, struct bfq_io_cq *bic,
-				       gfp_t gfp_mask)
+				       struct bfq_group *bfqg, int is_sync,
+				       struct bfq_io_cq *bic, gfp_t gfp_mask)
 {
 	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
 	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
@@ -1495,12 +1494,13 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 	struct bfq_queue *bfqq = NULL;
 
 	if (!is_sync) {
-		async_bfqq = bfq_async_queue_prio(bfqd, ioprio_class, ioprio);
+		async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
+						  ioprio);
 		bfqq = *async_bfqq;
 	}
 
 	if (bfqq == NULL)
-		bfqq = bfq_find_alloc_queue(bfqd, is_sync, bic, gfp_mask);
+		bfqq = bfq_find_alloc_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
 
 	/*
 	 * Pin the queue now that it's allocated, scheduler exit will
@@ -1830,6 +1830,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	const int rw = rq_data_dir(rq);
 	const int is_sync = rq_is_sync(rq);
 	struct bfq_queue *bfqq;
+	struct bfq_group *bfqg;
 	unsigned long flags;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
@@ -1841,9 +1842,11 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	if (bic == NULL)
 		goto queue_fail;
 
+	bfqg = bfq_bic_update_cgroup(bic);
+
 	bfqq = bic_to_bfqq(bic, is_sync);
 	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
-		bfqq = bfq_get_queue(bfqd, is_sync, bic, gfp_mask);
+		bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
 		bic_set_bfqq(bic, bfqq, is_sync);
 	}
 
@@ -1937,10 +1940,12 @@ static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
 static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
 					struct bfq_queue **bfqq_ptr)
 {
+	struct bfq_group *root_group = bfqd->root_group;
 	struct bfq_queue *bfqq = *bfqq_ptr;
 
 	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
 	if (bfqq != NULL) {
+		bfq_bfqq_move(bfqd, bfqq, &bfqq->entity, root_group);
 		bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
 			     bfqq, atomic_read(&bfqq->ref));
 		bfq_put_queue(bfqq);
@@ -1949,18 +1954,20 @@ static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
 }
 
 /*
- * Release the extra reference of the async queues as the device
- * goes away.
+ * Release all the bfqg references to its async queues.  If we are
+ * deallocating the group, these queues may still contain requests, so
+ * we reparent them to the root cgroup (i.e., the only one that will
+ * exist for sure until all the requests on a device are gone).
  */
-static void bfq_put_async_queues(struct bfq_data *bfqd)
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
 {
 	int i, j;
 
 	for (i = 0; i < 2; i++)
 		for (j = 0; j < IOPRIO_BE_NR; j++)
-			__bfq_put_async_bfqq(bfqd, &async_bfqq[i][j]);
+			__bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]);
 
-	__bfq_put_async_bfqq(bfqd, &async_idle_bfqq);
+	__bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq);
 }
 
 static void bfq_exit_queue(struct elevator_queue *e)
@@ -1976,18 +1983,20 @@ static void bfq_exit_queue(struct elevator_queue *e)
 	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
 		bfq_deactivate_bfqq(bfqd, bfqq, 0);
 
-	bfq_put_async_queues(bfqd);
+	bfq_disconnect_groups(bfqd);
 	spin_unlock_irq(q->queue_lock);
 
 	bfq_shutdown_timer_wq(bfqd);
 
 	synchronize_rcu();
 
+	bfq_free_root_group(bfqd);
 	kfree(bfqd);
 }
 
 static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 {
+	struct bfq_group *bfqg;
 	struct bfq_data *bfqd;
 	struct elevator_queue *eq;
 
@@ -2016,6 +2025,15 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	q->elevator = eq;
 	spin_unlock_irq(q->queue_lock);
 
+	bfqg = bfq_alloc_root_group(bfqd, q->node);
+	if (bfqg == NULL) {
+		kfree(bfqd);
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	bfqd->root_group = bfqg;
+
 	init_timer(&bfqd->idle_slice_timer);
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
@@ -2279,7 +2297,7 @@ static int __init bfq_init(void)
 		return -ENOMEM;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v0");
+	pr_info("BFQ I/O-scheduler version: v1");
 
 	return 0;
 }
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index a9142f5..8801b6c 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -8,6 +8,61 @@
  *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
  */
 
+#ifdef CONFIG_CGROUP_BFQIO
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
+						 int extract,
+						 struct bfq_data *bfqd);
+
+static inline void bfq_update_budget(struct bfq_entity *next_in_service)
+{
+	struct bfq_entity *bfqg_entity;
+	struct bfq_group *bfqg;
+	struct bfq_sched_data *group_sd;
+
+	group_sd = next_in_service->sched_data;
+
+	bfqg = container_of(group_sd, struct bfq_group, sched_data);
+	/*
+	 * bfq_group's my_entity field is not NULL only if the group
+	 * is not the root group. We must not touch the root entity
+	 * as it must never become an in-service entity.
+	 */
+	bfqg_entity = bfqg->my_entity;
+	if (bfqg_entity != NULL)
+		bfqg_entity->budget = next_in_service->budget;
+}
+
+static int bfq_update_next_in_service(struct bfq_sched_data *sd)
+{
+	struct bfq_entity *next_in_service;
+
+	if (sd->in_service_entity != NULL)
+		/* will update/requeue at the end of service */
+		return 0;
+
+	/*
+	 * NOTE: this can be improved in many ways, such as returning
+	 * 1 (and thus propagating upwards the update) only when the
+	 * budget changes, or caching the bfqq that will be scheduled
+	 * next from this subtree.  For now we care more about
+	 * correctness than about performance...
+	 */
+	next_in_service = bfq_lookup_next_entity(sd, 0, NULL);
+	sd->next_in_service = next_in_service;
+
+	if (next_in_service != NULL)
+		bfq_update_budget(next_in_service);
+
+	return 1;
+}
+
+#else
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
 
@@ -19,14 +74,10 @@ static inline int bfq_update_next_in_service(struct bfq_sched_data *sd)
 	return 0;
 }
 
-static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
-					     struct bfq_entity *entity)
-{
-}
-
 static inline void bfq_update_budget(struct bfq_entity *next_in_service)
 {
 }
+#endif
 
 /*
  * Shift for timestamp calculations.  This actually limits the maximum
@@ -842,7 +893,6 @@ static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
 		entity = __bfq_lookup_next_entity(st + i, false);
 		if (entity != NULL) {
 			if (extract) {
-				bfq_check_next_in_service(sd, entity);
 				bfq_active_extract(st + i, entity);
 				sd->in_service_entity = entity;
 				sd->next_in_service = NULL;
@@ -866,7 +916,7 @@ static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
 	if (bfqd->busy_queues == 0)
 		return NULL;
 
-	sd = &bfqd->sched_data;
+	sd = &bfqd->root_group->sched_data;
 	for (; sd != NULL; sd = entity->my_sched_data) {
 		entity = bfq_lookup_next_entity(sd, 1, bfqd);
 		entity->service = 0;
diff --git a/block/bfq.h b/block/bfq.h
index bd146b6..b982567 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v0 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v1 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@@ -92,7 +92,7 @@ struct bfq_sched_data {
  * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
  * @weight: weight of the queue
  * @parent: parent entity, for hierarchical scheduling.
- * @my_sched_data: for non-leaf nodes in the hierarchy, the
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
  *                 associated scheduler queue, %NULL on leaf nodes.
  * @sched_data: the scheduler queue this entity belongs to.
  * @ioprio: the ioprio in use.
@@ -105,10 +105,11 @@ struct bfq_sched_data {
  * @ioprio_changed: flag, true when the user requested a weight, ioprio or
  *                  ioprio_class change.
  *
- * A bfq_entity is used to represent a bfq_queue (leaf node in the upper
- * level scheduler). Each entity belongs to the sched_data of the parent
- * group hierarchy. Non-leaf entities have also their own sched_data,
- * stored in @my_sched_data.
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
+ * cgroup hierarchy) or a bfq_group in the upper-level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities also have their own sched_data, stored
+ * in @my_sched_data.
  *
  * Each entity stores independently its priority values; this would
  * allow different weights on different devices, but this
@@ -119,13 +120,14 @@ struct bfq_sched_data {
  * update to take place the effective and the requested priority
  * values are synchronized.
  *
- * The weight value is calculated from the ioprio to export the same
- * interface as CFQ.  When dealing with  ``well-behaved'' queues (i.e.,
- * queues that do not spend too much time to consume their budget
- * and have true sequential behavior, and when there are no external
- * factors breaking anticipation) the relative weights at each level
- * of the hierarchy should be guaranteed.  All the fields are
- * protected by the queue lock of the containing bfqd.
+ * Unless cgroups are used, the weight value is calculated from the
+ * ioprio to export the same interface as CFQ.  When dealing with
+ * ``well-behaved'' queues (i.e., queues that do not spend too much
+ * time consuming their budget and have true sequential behavior, and
+ * when there are no external factors breaking anticipation) the
+ * relative weights at each level of the cgroups hierarchy should be
+ * guaranteed.  All the fields are protected by the queue lock of the
+ * containing bfqd.
  */
 struct bfq_entity {
 	struct rb_node rb_node;
@@ -154,6 +156,8 @@ struct bfq_entity {
 	int ioprio_changed;
 };
 
+struct bfq_group;
+
 /**
  * struct bfq_queue - leaf schedulable entity.
  * @ref: reference counter.
@@ -177,7 +181,11 @@ struct bfq_entity {
  * @pid: pid of the process owning the queue, used for logging purposes.
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async.
+ * io_context or more, if it is async. @cgroup holds a reference to the
+ * cgroup, to be sure that it does not disappear while a bfqq still
+ * references it (mostly to avoid races between request issuing and task
+ * migration followed by cgroup destruction). All the fields are protected
+ * by the queue lock of the containing bfqd.
  */
 struct bfq_queue {
 	atomic_t ref;
@@ -244,7 +252,7 @@ enum bfq_device_speed {
 /**
  * struct bfq_data - per device data structure.
  * @queue: request queue for the managed device.
- * @sched_data: root @bfq_sched_data for the device.
+ * @root_group: root bfq_group for the device.
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @queued: number of queued requests.
@@ -267,6 +275,7 @@ enum bfq_device_speed {
  * @peak_rate_samples: number of samples used to calculate @peak_rate.
  * @bfq_max_budget: maximum budget allotted to a bfq_queue before
  *                  rescheduling.
+ * @group_list: list of all the bfq_groups active on the device.
  * @active_list: list of all the bfq_queues active on the device.
  * @idle_list: list of all the bfq_queues idle on the device.
  * @bfq_quantum: max number of requests dispatched per dispatch round.
@@ -293,7 +302,7 @@ enum bfq_device_speed {
 struct bfq_data {
 	struct request_queue *queue;
 
-	struct bfq_sched_data sched_data;
+	struct bfq_group *root_group;
 
 	int busy_queues;
 	int queued;
@@ -320,6 +329,7 @@ struct bfq_data {
 	u64 peak_rate;
 	unsigned long bfq_max_budget;
 
+	struct hlist_head group_list;
 	struct list_head active_list;
 	struct list_head idle_list;
 
@@ -390,6 +400,82 @@ enum bfqq_expiration {
 	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
 };
 
+#ifdef CONFIG_CGROUP_BFQIO
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both bfq_queues and bfq_groups).
+ * @group_node: node to be inserted into the bfqio_cgroup->group_data
+ *              list of the containing cgroup's bfqio_cgroup.
+ * @bfqd_node: node to be inserted into the @bfqd->group_list list
+ *             of the groups active on the same device; used for cleanup.
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/
+ *             migration.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @group_node is protected by the bfqio_cgroup lock, and is accessed
+ *      via RCU from its readers.
+ *    o @bfqd is protected by the queue lock, RCU is used to access it
+ *      from the readers.
+ *    o All the other fields are protected by the @bfqd queue lock.
+ */
+struct bfq_group {
+	struct bfq_entity entity;
+	struct bfq_sched_data sched_data;
+
+	struct hlist_node group_node;
+	struct hlist_node bfqd_node;
+
+	void *bfqd;
+
+	struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+	struct bfq_queue *async_idle_bfqq;
+
+	struct bfq_entity *my_entity;
+};
+
+/**
+ * struct bfqio_cgroup - bfq cgroup data structure.
+ * @css: subsystem state for bfq in the containing cgroup.
+ * @online: flag set while the cgroup is online.
+ * @weight: cgroup weight.
+ * @ioprio: cgroup ioprio.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @ioprio, @ioprio_class and @group_data.
+ * @group_data: list of the bfq_groups belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates;
+ * @ioprio and @ioprio_class are protected by @lock.
+ */
+struct bfqio_cgroup {
+	struct cgroup_subsys_state css;
+	bool online;
+
+	unsigned short weight, ioprio, ioprio_class;
+
+	spinlock_t lock;
+	struct hlist_head group_data;
+};
+#else
+struct bfq_group {
+	struct bfq_sched_data sched_data;
+
+	struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+	struct bfq_queue *async_idle_bfqq;
+};
+#endif
+
 static inline struct bfq_service_tree *
 bfq_entity_service_tree(struct bfq_entity *entity)
 {
@@ -460,8 +546,10 @@ static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd,
 static void bfq_changed_ioprio(struct bfq_io_cq *bic);
 static void bfq_put_queue(struct bfq_queue *bfqq);
 static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
-static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, int is_sync,
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
+				       struct bfq_group *bfqg, int is_sync,
 				       struct bfq_io_cq *bic, gfp_t gfp_mask);
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 
 #endif /* _BFQ_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 768fe44..cdd2528 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -39,6 +39,10 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
+SUBSYS(bfqio)
+#endif
+
 #if IS_ENABLED(CONFIG_CGROUP_PERF)
 SUBSYS(perf_event)
 #endif
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 02/12] block, bfq: add full hierarchical scheduling and cgroups support
@ 2014-05-29  9:05               ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

From: Fabio Checconi <fchecconi@gmail.com>

Complete support for full hierarchical scheduling, with a cgroups
interface. The name of the new subsystem is bfqio.

Weights can be assigned explicitly to groups and processes through the
cgroups interface, unlike what happens for single processes when the
cgroups interface is not used (as explained in the description of
patch 2). In particular, since each node has a full scheduler, each
group can be assigned its own weight.

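As a rough usage sketch (not part of the patch itself): assuming the
bfqio controller is mounted at /sys/fs/cgroup/bfqio and that a group
named "grp1" has been created under it, its weight can be set by
writing to the group's bfqio.weight file.  The mount point, group name
and value below are only examples.

	/*
	 * Hypothetical user-space sketch: set the weight of a bfqio group.
	 * The path depends on where the bfqio controller is mounted and on
	 * the group name chosen by the administrator.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		const char *path = "/sys/fs/cgroup/bfqio/grp1/bfqio.weight";
		const char *val = "500"; /* must lie in [BFQ_MIN_WEIGHT, BFQ_MAX_WEIGHT] */
		int fd = open(path, O_WRONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (write(fd, val, strlen(val)) < 0)
			perror("write");
		close(fd);
		return 0;
	}
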
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/Kconfig.iosched         |  13 +-
 block/bfq-cgroup.c            | 891 ++++++++++++++++++++++++++++++++++++++++++
 block/bfq-iosched.c           |  66 ++--
 block/bfq-sched.c             |  64 ++-
 block/bfq.h                   | 122 +++++-
 include/linux/cgroup_subsys.h |   4 +
 6 files changed, 1111 insertions(+), 49 deletions(-)
 create mode 100644 block/bfq-cgroup.c

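One note on the code below, purely illustrative and not part of the
patch: when group allocation fails, bfq_find_alloc_group() falls back
to the root group, and its comment suggests that the loss could be
compensated by using an "equivalent weight", i.e., the product of the
weights of the groups on the path from the group to the root.  A
minimal sketch of that computation, assuming the patch's struct
bfq_entity and for_each_entity() walk (the helper itself is
hypothetical and is not used anywhere in the patch):

	/*
	 * Hypothetical helper, for illustration only: the "equivalent
	 * weight" of an entity is the product of the weights of the
	 * entities on the path from it up to the root group.
	 */
	static unsigned long bfq_equivalent_weight(struct bfq_entity *entity)
	{
		unsigned long w = 1;

		/* for_each_entity() walks entity->parent up to the root */
		for_each_entity(entity)
			w *= entity->weight;

		return w;
	}
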
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8f98cc7..a3675cb 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -46,7 +46,18 @@ config IOSCHED_BFQ
 	  The BFQ I/O scheduler tries to distribute bandwidth among all
 	  processes according to their weights.
 	  It aims at distributing the bandwidth as desired, regardless
-	  of the disk parameters and with any workload.
+	  of the disk parameters and with any workload. If compiled
+	  built-in (saying Y here), BFQ can be configured to support
+	  hierarchical scheduling.
+
+config CGROUP_BFQIO
+	bool "BFQ hierarchical scheduling support"
+	depends on CGROUPS && IOSCHED_BFQ=y
+	default n
+	---help---
+	  Enable hierarchical scheduling in BFQ, using the cgroups
+	  filesystem interface.  The name of the subsystem will be
+	  bfqio.
 
 choice
 	prompt "Default I/O scheduler"
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
new file mode 100644
index 0000000..00a7a1b
--- /dev/null
+++ b/block/bfq-cgroup.c
@@ -0,0 +1,891 @@
+/*
+ * BFQ: CGROUPS support.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
+ * file.
+ */
+
+#ifdef CONFIG_CGROUP_BFQIO
+
+static DEFINE_MUTEX(bfqio_mutex);
+
+static bool bfqio_is_removed(struct bfqio_cgroup *bgrp)
+{
+	return bgrp ? !bgrp->online : false;
+}
+
+static struct bfqio_cgroup bfqio_root_cgroup = {
+	.weight = BFQ_DEFAULT_GRP_WEIGHT,
+	.ioprio = BFQ_DEFAULT_GRP_IOPRIO,
+	.ioprio_class = BFQ_DEFAULT_GRP_CLASS,
+};
+
+static inline void bfq_init_entity(struct bfq_entity *entity,
+				   struct bfq_group *bfqg)
+{
+	entity->weight = entity->new_weight;
+	entity->orig_weight = entity->new_weight;
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->parent = bfqg->my_entity;
+	entity->sched_data = &bfqg->sched_data;
+}
+
+static struct bfqio_cgroup *css_to_bfqio(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct bfqio_cgroup, css) : NULL;
+}
+
+/*
+ * Search for the bfq_group associated with bfqd in the hash table
+ * (for now just a list) of bgrp.  Must be called under rcu_read_lock().
+ */
+static struct bfq_group *bfqio_lookup_group(struct bfqio_cgroup *bgrp,
+					    struct bfq_data *bfqd)
+{
+	struct bfq_group *bfqg;
+	void *key;
+
+	hlist_for_each_entry_rcu(bfqg, &bgrp->group_data, group_node) {
+		key = rcu_dereference(bfqg->bfqd);
+		if (key == bfqd)
+			return bfqg;
+	}
+
+	return NULL;
+}
+
+static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
+					 struct bfq_group *bfqg)
+{
+	struct bfq_entity *entity = &bfqg->entity;
+
+	/*
+	 * If the weight of the entity has never been set via the cgroups
+	 * interface, then bgrp->weight == 0. In this case we initialize
+	 * the weight from the current ioprio value. Otherwise, the group
+	 * weight, if set, has priority over the ioprio value.
+	 */
+	if (bgrp->weight == 0) {
+		entity->new_weight = bfq_ioprio_to_weight(bgrp->ioprio);
+		entity->new_ioprio = bgrp->ioprio;
+	} else {
+		entity->new_weight = bgrp->weight;
+		entity->new_ioprio = bfq_weight_to_ioprio(bgrp->weight);
+	}
+	entity->orig_weight = entity->weight = entity->new_weight;
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
+	entity->my_sched_data = &bfqg->sched_data;
+}
+
+static inline void bfq_group_set_parent(struct bfq_group *bfqg,
+					struct bfq_group *parent)
+{
+	struct bfq_entity *entity;
+
+	entity = &bfqg->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+}
+
+/**
+ * bfq_group_chain_alloc - allocate a chain of groups.
+ * @bfqd: queue descriptor.
+ * @css: the leaf cgroup_subsys_state this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup.  Stop if a cgroup on the chain
+ * to the root already has an allocated group on @bfqd.
+ */
+static struct bfq_group *bfq_group_chain_alloc(struct bfq_data *bfqd,
+					       struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp;
+	struct bfq_group *bfqg, *prev = NULL, *leaf = NULL;
+
+	for (; css != NULL; css = css->parent) {
+		bgrp = css_to_bfqio(css);
+
+		bfqg = bfqio_lookup_group(bgrp, bfqd);
+		if (bfqg != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a bfq_group for bfqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		bfqg = kzalloc(sizeof(*bfqg), GFP_ATOMIC);
+		if (bfqg == NULL)
+			goto cleanup;
+
+		bfq_group_init_entity(bgrp, bfqg);
+		bfqg->my_entity = &bfqg->entity;
+
+		if (leaf == NULL) {
+			leaf = bfqg;
+			prev = leaf;
+		} else {
+			bfq_group_set_parent(prev, bfqg);
+			/*
+			 * Build a list of allocated nodes using the bfqd
+			 * field, which is still unused and will be
+			 * initialized only after the node is
+			 * connected.
+			 */
+			prev->bfqd = bfqg;
+			prev = bfqg;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->bfqd;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+/**
+ * bfq_group_chain_link - link an allocated group chain to a cgroup
+ *                        hierarchy.
+ * @bfqd: the queue descriptor.
+ * @css: the leaf cgroup_subsys_state to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @bfqd all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the bfqio_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+static void bfq_group_chain_link(struct bfq_data *bfqd,
+				 struct cgroup_subsys_state *css,
+				 struct bfq_group *leaf)
+{
+	struct bfqio_cgroup *bgrp;
+	struct bfq_group *bfqg, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(bfqd->queue->queue_lock);
+
+	for (; css != NULL && leaf != NULL; css = css->parent) {
+		bgrp = css_to_bfqio(css);
+		next = leaf->bfqd;
+
+		bfqg = bfqio_lookup_group(bgrp, bfqd);
+
+		spin_lock_irqsave(&bgrp->lock, flags);
+
+		rcu_assign_pointer(leaf->bfqd, bfqd);
+		hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
+		hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);
+
+		spin_unlock_irqrestore(&bgrp->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	if (css != NULL && prev != NULL) {
+		bgrp = css_to_bfqio(css);
+		bfqg = bfqio_lookup_group(bgrp, bfqd);
+		bfq_group_set_parent(prev, bfqg);
+	}
+}
+
+/**
+ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
+ * @bfqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ *
+ * Return a group associated to @bfqd in @cgroup, allocating one if
+ * necessary.  When a group is returned all the cgroups in the path
+ * to the root have a group associated to @bfqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback.  If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd,
+					      struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+	struct bfq_group *bfqg;
+
+	bfqg = bfqio_lookup_group(bgrp, bfqd);
+	if (bfqg != NULL)
+		return bfqg;
+
+	bfqg = bfq_group_chain_alloc(bfqd, css);
+	if (bfqg != NULL)
+		bfq_group_chain_link(bfqd, css, bfqg);
+	else
+		bfqg = bfqd->root_group;
+
+	return bfqg;
+}
+
+/**
+ * bfq_bfqq_move - migrate @bfqq to @bfqg.
+ * @bfqd: queue descriptor.
+ * @bfqq: the queue to move.
+ * @entity: @bfqq's entity.
+ * @bfqg: the group to move to.
+ *
+ * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
+ * it on the new one.  Avoid putting the entity on the old group idle tree.
+ *
+ * Must be called under the queue lock; the cgroup owning @bfqg must
+ * not disappear (by now this just means that we are called under
+ * rcu_read_lock()).
+ */
+static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+			  struct bfq_entity *entity, struct bfq_group *bfqg)
+{
+	int busy, resume;
+
+	busy = bfq_bfqq_busy(bfqq);
+	resume = !RB_EMPTY_ROOT(&bfqq->sort_list);
+
+	if (busy) {
+		if (!resume)
+			bfq_del_bfqq_busy(bfqd, bfqq, 0);
+		else
+			bfq_deactivate_bfqq(bfqd, bfqq, 0);
+	} else if (entity->on_st)
+		bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
+
+	/*
+	 * Here we use a reference to bfqg.  We don't need a refcounter
+	 * as the cgroup reference will not be dropped, so that its
+	 * destroy() callback will not be invoked.
+	 */
+	entity->parent = bfqg->my_entity;
+	entity->sched_data = &bfqg->sched_data;
+
+	if (busy && resume)
+		bfq_activate_bfqq(bfqd, bfqq);
+
+	if (bfqd->in_service_queue == NULL && !bfqd->rq_in_driver)
+		bfq_schedule_dispatch(bfqd);
+}
+
+/**
+ * __bfq_bic_change_cgroup - move @bic to @cgroup.
+ * @bfqd: the queue descriptor.
+ * @bic: the bic to move.
+ * @cgroup: the cgroup to move to.
+ *
+ * Move bic to cgroup, assuming that bfqd->queue is locked; the caller
+ * has to make sure that the reference to cgroup is valid across the call.
+ *
+ * NOTE: an alternative approach might have been to store the current
+ * cgroup in bfqq and get a reference to it, reducing the lookup
+ * time here, at the price of slightly more complex code.
+ */
+static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
+						struct bfq_io_cq *bic,
+						struct cgroup_subsys_state *css)
+{
+	struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0);
+	struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1);
+	struct bfq_entity *entity;
+	struct bfq_group *bfqg;
+	struct bfqio_cgroup *bgrp;
+
+	bgrp = css_to_bfqio(css);
+
+	bfqg = bfq_find_alloc_group(bfqd, css);
+	if (async_bfqq != NULL) {
+		entity = &async_bfqq->entity;
+
+		if (entity->sched_data != &bfqg->sched_data) {
+			bic_set_bfqq(bic, NULL, 0);
+			bfq_log_bfqq(bfqd, async_bfqq,
+				     "bic_change_group: %p %d",
+				     async_bfqq, atomic_read(&async_bfqq->ref));
+			bfq_put_queue(async_bfqq);
+		}
+	}
+
+	if (sync_bfqq != NULL) {
+		entity = &sync_bfqq->entity;
+		if (entity->sched_data != &bfqg->sched_data)
+			bfq_bfqq_move(bfqd, sync_bfqq, entity, bfqg);
+	}
+
+	return bfqg;
+}
+
+/**
+ * bfq_bic_change_cgroup - move @bic to @cgroup.
+ * @bic: the bic being migrated.
+ * @cgroup: the destination cgroup.
+ *
+ * When the task owning @bic is moved to @cgroup, @bic is immediately
+ * moved into its new parent group.
+ */
+static void bfq_bic_change_cgroup(struct bfq_io_cq *bic,
+				  struct cgroup_subsys_state *css)
+{
+	struct bfq_data *bfqd;
+	unsigned long uninitialized_var(flags);
+
+	bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
+				   &flags);
+	if (bfqd != NULL) {
+		__bfq_bic_change_cgroup(bfqd, bic, css);
+		bfq_put_bfqd_unlock(bfqd, &flags);
+	}
+}
+
+/**
+ * bfq_bic_update_cgroup - update the cgroup of @bic.
+ * @bic: the @bic to update.
+ *
+ * Make sure that @bic is enqueued in the cgroup of the current task.
+ * We need this in addition to moving bics during the cgroup attach
+ * phase because the task owning @bic could be at its first disk
+ * access, or we may have ended up in the root cgroup as the result
+ * of a memory allocation failure; here we try to move to the right
+ * group.
+ *
+ * Must be called under the queue lock.  It is safe to use the returned
+ * value even after the rcu_read_unlock() as the migration/destruction
+ * paths act under the queue lock too.  IOW it is impossible to race with
+ * group migration/destruction and end up with an invalid group as:
+ *   a) here the cgroup has not yet been destroyed, nor has its destroy
+ *      callback started execution, as current holds a reference to it,
+ *   b) if it is destroyed after rcu_read_unlock() [after current is
+ *      migrated to a different cgroup] its attach() callback will have
+ *      taken care of removing all the references to the old cgroup data.
+ */
+static struct bfq_group *bfq_bic_update_cgroup(struct bfq_io_cq *bic)
+{
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+	struct bfq_group *bfqg;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+	css = task_css(current, bfqio_cgrp_id);
+	bfqg = __bfq_bic_change_cgroup(bfqd, bic, css);
+	rcu_read_unlock();
+
+	return bfqg;
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+static inline void bfq_flush_idle_tree(struct bfq_service_tree *st)
+{
+	struct bfq_entity *entity = st->first_idle;
+
+	for (; entity != NULL; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
+
+/**
+ * bfq_reparent_leaf_entity - move leaf entity to the root_group.
+ * @bfqd: the device data structure with the root group.
+ * @entity: the entity to move.
+ */
+static inline void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
+					    struct bfq_entity *entity)
+{
+	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
+	bfq_bfqq_move(bfqd, bfqq, entity, bfqd->root_group);
+	return;
+}
+
+/**
+ * bfq_reparent_active_entities - move to the root group all active
+ *                                entities.
+ * @bfqd: the device data structure with the root group.
+ * @bfqg: the group to move from.
+ * @st: the service tree with the entities.
+ *
+ * Needs queue_lock to be taken and reference to be valid over the call.
+ */
+static inline void bfq_reparent_active_entities(struct bfq_data *bfqd,
+						struct bfq_group *bfqg,
+						struct bfq_service_tree *st)
+{
+	struct rb_root *active = &st->active;
+	struct bfq_entity *entity = NULL;
+
+	if (!RB_EMPTY_ROOT(&st->active))
+		entity = bfq_entity_of(rb_first(active));
+
+	for (; entity != NULL; entity = bfq_entity_of(rb_first(active)))
+		bfq_reparent_leaf_entity(bfqd, entity);
+
+	if (bfqg->sched_data.in_service_entity != NULL)
+		bfq_reparent_leaf_entity(bfqd,
+			bfqg->sched_data.in_service_entity);
+
+	return;
+}
+
+/**
+ * bfq_destroy_group - destroy @bfqg.
+ * @bgrp: the bfqio_cgroup containing @bfqg.
+ * @bfqg: the group being destroyed.
+ *
+ * Destroy @bfqg, making sure that it is not referenced from its parent.
+ */
+static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg)
+{
+	struct bfq_data *bfqd;
+	struct bfq_service_tree *st;
+	struct bfq_entity *entity = bfqg->my_entity;
+	unsigned long uninitialized_var(flags);
+	int i;
+
+	hlist_del(&bfqg->group_node);
+
+	/*
+	 * Empty all service_trees belonging to this group before
+	 * deactivating the group itself.
+	 */
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
+		st = bfqg->sched_data.service_tree + i;
+
+		/*
+		 * The idle tree may still contain bfq_queues belonging
+		 * to exited tasks because they never migrated to a different
+		 * cgroup from the one being destroyed now.  No one else
+		 * can access them so it's safe to act without any lock.
+		 */
+		bfq_flush_idle_tree(st);
+
+		/*
+		 * It may happen that some queues are still active
+		 * (busy) upon group destruction (if the corresponding
+		 * processes have been forced to terminate). We move
+		 * all the leaf entities corresponding to these queues
+		 * to the root_group.
+		 * Also, it may happen that the group has an entity
+		 * in service, which is disconnected from the active
+		 * tree: it must be moved, too.
+		 * There is no need to put the sync queues, as the
+		 * scheduler has taken no reference.
+		 */
+		bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
+		if (bfqd != NULL) {
+			bfq_reparent_active_entities(bfqd, bfqg, st);
+			bfq_put_bfqd_unlock(bfqd, &flags);
+		}
+	}
+
+	/*
+	 * We may race with device destruction, take extra care when
+	 * dereferencing bfqg->bfqd.
+	 */
+	bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
+	if (bfqd != NULL) {
+		hlist_del(&bfqg->bfqd_node);
+		__bfq_deactivate_entity(entity, 0);
+		bfq_put_async_queues(bfqd, bfqg);
+		bfq_put_bfqd_unlock(bfqd, &flags);
+	}
+
+	/*
+	 * No need to defer the kfree() to the end of the RCU grace
+	 * period: we are called from the destroy() callback of our
+	 * cgroup, so we can be sure that no one is a) still using
+	 * this cgroup or b) doing lookups in it.
+	 */
+	kfree(bfqg);
+}
+
+/**
+ * bfq_disconnect_groups - disconnect @bfqd from all its groups.
+ * @bfqd: the device descriptor being exited.
+ *
+ * When the device exits we just make sure that no lookup can return
+ * the now unused group structures.  They will be deallocated on cgroup
+ * destruction.
+ */
+static void bfq_disconnect_groups(struct bfq_data *bfqd)
+{
+	struct hlist_node *tmp;
+	struct bfq_group *bfqg;
+
+	bfq_log(bfqd, "disconnect_groups beginning");
+	hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node) {
+		hlist_del(&bfqg->bfqd_node);
+
+		__bfq_deactivate_entity(bfqg->my_entity, 0);
+
+		/*
+		 * Don't remove from the group hash, just set an
+		 * invalid key.  No lookups can race with the
+		 * assignment as bfqd is being destroyed; this
+		 * also implies that new elements cannot be added
+		 * to the list.
+		 */
+		rcu_assign_pointer(bfqg->bfqd, NULL);
+
+		bfq_log(bfqd, "disconnect_groups: put async for group %p",
+			bfqg);
+		bfq_put_async_queues(bfqd, bfqg);
+	}
+}
+
+static inline void bfq_free_root_group(struct bfq_data *bfqd)
+{
+	struct bfqio_cgroup *bgrp = &bfqio_root_cgroup;
+	struct bfq_group *bfqg = bfqd->root_group;
+
+	bfq_put_async_queues(bfqd, bfqg);
+
+	spin_lock_irq(&bgrp->lock);
+	hlist_del_rcu(&bfqg->group_node);
+	spin_unlock_irq(&bgrp->lock);
+
+	/*
+	 * No need to synchronize_rcu() here: since the device is gone
+	 * there cannot be any read-side access to its root_group.
+	 */
+	kfree(bfqg);
+}
+
+static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
+{
+	struct bfq_group *bfqg;
+	struct bfqio_cgroup *bgrp;
+	int i;
+
+	bfqg = kzalloc_node(sizeof(*bfqg), GFP_KERNEL, node);
+	if (bfqg == NULL)
+		return NULL;
+
+	bfqg->entity.parent = NULL;
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+		bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+
+	bgrp = &bfqio_root_cgroup;
+	spin_lock_irq(&bgrp->lock);
+	rcu_assign_pointer(bfqg->bfqd, bfqd);
+	hlist_add_head_rcu(&bfqg->group_node, &bgrp->group_data);
+	spin_unlock_irq(&bgrp->lock);
+
+	return bfqg;
+}
+
+#define SHOW_FUNCTION(__VAR)						\
+static u64 bfqio_cgroup_##__VAR##_read(struct cgroup_subsys_state *css, \
+				       struct cftype *cftype)		\
+{									\
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);			\
+	u64 ret = -ENODEV;						\
+									\
+	mutex_lock(&bfqio_mutex);					\
+	if (bfqio_is_removed(bgrp))					\
+		goto out_unlock;					\
+									\
+	spin_lock_irq(&bgrp->lock);					\
+	ret = bgrp->__VAR;						\
+	spin_unlock_irq(&bgrp->lock);					\
+									\
+out_unlock:								\
+	mutex_unlock(&bfqio_mutex);					\
+	return ret;							\
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
+static int bfqio_cgroup_##__VAR##_write(struct cgroup_subsys_state *css,\
+					struct cftype *cftype,		\
+					u64 val)			\
+{									\
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);			\
+	struct bfq_group *bfqg;						\
+	int ret = -EINVAL;						\
+									\
+	if (val < (__MIN) || val > (__MAX))				\
+		return ret;						\
+									\
+	ret = -ENODEV;							\
+	mutex_lock(&bfqio_mutex);					\
+	if (bfqio_is_removed(bgrp))					\
+		goto out_unlock;					\
+	ret = 0;							\
+									\
+	spin_lock_irq(&bgrp->lock);					\
+	bgrp->__VAR = (unsigned short)val;				\
+	hlist_for_each_entry(bfqg, &bgrp->group_data, group_node) {	\
+		/*							\
+		 * Setting the ioprio_changed flag of the entity        \
+		 * to 1 with new_##__VAR == ##__VAR would re-set        \
+		 * the value of the weight to its ioprio mapping.       \
+		 * Set the flag only if necessary.			\
+		 */							\
+		if ((unsigned short)val != bfqg->entity.new_##__VAR) {  \
+			bfqg->entity.new_##__VAR = (unsigned short)val; \
+			/*						\
+			 * Make sure that the above new value has been	\
+			 * stored in bfqg->entity.new_##__VAR before	\
+			 * setting the ioprio_changed flag. In fact,	\
+			 * this flag may be read asynchronously (in	\
+			 * critical sections protected by a different	\
+			 * lock than that held here), and finding this	\
+			 * flag set may cause the execution of the code	\
+			 * for updating parameters whose value may	\
+			 * depend also on bfqg->entity.new_##__VAR (in	\
+			 * __bfq_entity_update_weight_prio).		\
+			 * This barrier makes sure that the new value	\
+			 * of bfqg->entity.new_##__VAR is correctly	\
+			 * seen in that code.				\
+			 */						\
+			smp_wmb();                                      \
+			bfqg->entity.ioprio_changed = 1;                \
+		}							\
+	}								\
+	spin_unlock_irq(&bgrp->lock);					\
+									\
+out_unlock:								\
+	mutex_unlock(&bfqio_mutex);					\
+	return ret;							\
+}
+
+STORE_FUNCTION(weight, BFQ_MIN_WEIGHT, BFQ_MAX_WEIGHT);
+STORE_FUNCTION(ioprio, 0, IOPRIO_BE_NR - 1);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+static struct cftype bfqio_files[] = {
+	{
+		.name = "weight",
+		.read_u64 = bfqio_cgroup_weight_read,
+		.write_u64 = bfqio_cgroup_weight_write,
+	},
+	{
+		.name = "ioprio",
+		.read_u64 = bfqio_cgroup_ioprio_read,
+		.write_u64 = bfqio_cgroup_ioprio_write,
+	},
+	{
+		.name = "ioprio_class",
+		.read_u64 = bfqio_cgroup_ioprio_class_read,
+		.write_u64 = bfqio_cgroup_ioprio_class_write,
+	},
+	{ },	/* terminate */
+};
+
+static struct cgroup_subsys_state *bfqio_create(struct cgroup_subsys_state
+						*parent_css)
+{
+	struct bfqio_cgroup *bgrp;
+
+	if (parent_css != NULL) {
+		bgrp = kzalloc(sizeof(*bgrp), GFP_KERNEL);
+		if (bgrp == NULL)
+			return ERR_PTR(-ENOMEM);
+	} else
+		bgrp = &bfqio_root_cgroup;
+
+	spin_lock_init(&bgrp->lock);
+	INIT_HLIST_HEAD(&bgrp->group_data);
+	bgrp->ioprio = BFQ_DEFAULT_GRP_IOPRIO;
+	bgrp->ioprio_class = BFQ_DEFAULT_GRP_CLASS;
+
+	return &bgrp->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main bic/bfqq data structures.  By now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+static int bfqio_can_attach(struct cgroup_subsys_state *css,
+			    struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct io_context *ioc;
+	int ret = 0;
+
+	cgroup_taskset_for_each(task, tset) {
+		/*
+		 * task_lock() is needed to avoid races with
+		 * exit_io_context()
+		 */
+		task_lock(task);
+		ioc = task->io_context;
+		if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+			/*
+			 * ioc == NULL means that the task is either too
+			 * young or exiting: if it has still no ioc the
+			 * ioc can't be shared, if the task is exiting the
+			 * attach will fail anyway, no matter what we
+			 * return here.
+			 */
+			ret = -EINVAL;
+		task_unlock(task);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
+static void bfqio_attach(struct cgroup_subsys_state *css,
+			 struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct io_context *ioc;
+	struct io_cq *icq;
+
+	/*
+	 * IMPORTANT NOTE: The move of more than one process at a time to a
+	 * new group has not yet been tested.
+	 */
+	cgroup_taskset_for_each(task, tset) {
+		ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
+		if (ioc) {
+			/*
+			 * Handle cgroup change here.
+			 */
+			rcu_read_lock();
+			hlist_for_each_entry_rcu(icq, &ioc->icq_list, ioc_node)
+				if (!strncmp(
+					icq->q->elevator->type->elevator_name,
+					"bfq", ELV_NAME_MAX))
+					bfq_bic_change_cgroup(icq_to_bic(icq),
+							      css);
+			rcu_read_unlock();
+			put_io_context(ioc);
+		}
+	}
+}
+
+static void bfqio_destroy(struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+	struct hlist_node *tmp;
+	struct bfq_group *bfqg;
+
+	/*
+	 * Since we are destroying the cgroup, there are no more tasks
+	 * referencing it, and all the RCU grace periods that may have
+	 * referenced it are ended (as the destruction of the parent
+	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
+	 * anything else and we don't need any synchronization.
+	 */
+	hlist_for_each_entry_safe(bfqg, tmp, &bgrp->group_data, group_node)
+		bfq_destroy_group(bgrp, bfqg);
+
+	kfree(bgrp);
+}
+
+static int bfqio_css_online(struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+
+	mutex_lock(&bfqio_mutex);
+	bgrp->online = true;
+	mutex_unlock(&bfqio_mutex);
+
+	return 0;
+}
+
+static void bfqio_css_offline(struct cgroup_subsys_state *css)
+{
+	struct bfqio_cgroup *bgrp = css_to_bfqio(css);
+
+	mutex_lock(&bfqio_mutex);
+	bgrp->online = false;
+	mutex_unlock(&bfqio_mutex);
+}
+
+struct cgroup_subsys bfqio_cgrp_subsys = {
+	.css_alloc = bfqio_create,
+	.css_online = bfqio_css_online,
+	.css_offline = bfqio_css_offline,
+	.can_attach = bfqio_can_attach,
+	.attach = bfqio_attach,
+	.css_free = bfqio_destroy,
+	.base_cftypes = bfqio_files,
+};
+#else
+static inline void bfq_init_entity(struct bfq_entity *entity,
+				   struct bfq_group *bfqg)
+{
+	entity->weight = entity->new_weight;
+	entity->orig_weight = entity->new_weight;
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &bfqg->sched_data;
+}
+
+static inline struct bfq_group *
+bfq_bic_update_cgroup(struct bfq_io_cq *bic)
+{
+	struct bfq_data *bfqd = bic_to_bfqd(bic);
+	return bfqd->root_group;
+}
+
+static inline void bfq_bfqq_move(struct bfq_data *bfqd,
+				 struct bfq_queue *bfqq,
+				 struct bfq_entity *entity,
+				 struct bfq_group *bfqg)
+{
+}
+
+static inline void bfq_disconnect_groups(struct bfq_data *bfqd)
+{
+	bfq_put_async_queues(bfqd, bfqd->root_group);
+}
+
+static inline void bfq_free_root_group(struct bfq_data *bfqd)
+{
+	kfree(bfqd->root_group);
+}
+
+static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
+{
+	struct bfq_group *bfqg;
+	int i;
+
+	bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
+	if (bfqg == NULL)
+		return NULL;
+
+	for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
+		bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
+
+	return bfqg;
+}
+#endif
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 01a98be..b2cbfce 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -66,14 +66,6 @@
 #include "bfq.h"
 #include "blk.h"
 
-/*
- * Array of async queues for all the processes, one queue
- * per ioprio value per ioprio_class.
- */
-struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
-/* Async queue for the idle class (ioprio is ignored) */
-struct bfq_queue *async_idle_bfqq;
-
 /* Max number of dispatches in one round of service. */
 static const int bfq_quantum = 4;
 
@@ -128,6 +120,7 @@ static inline void bfq_schedule_dispatch(struct bfq_data *bfqd);
 
 #include "bfq-ioc.c"
 #include "bfq-sched.c"
+#include "bfq-cgroup.c"
 
 #define bfq_class_idle(bfqq)	((bfqq)->entity.ioprio_class ==\
 				 IOPRIO_CLASS_IDLE)
@@ -1359,6 +1352,7 @@ static void bfq_changed_ioprio(struct bfq_io_cq *bic)
 {
 	struct bfq_data *bfqd;
 	struct bfq_queue *bfqq, *new_bfqq;
+	struct bfq_group *bfqg;
 	unsigned long uninitialized_var(flags);
 	int ioprio = bic->icq.ioc->ioprio;
 
@@ -1373,7 +1367,9 @@ static void bfq_changed_ioprio(struct bfq_io_cq *bic)
 
 	bfqq = bic->bfqq[BLK_RW_ASYNC];
 	if (bfqq != NULL) {
-		new_bfqq = bfq_get_queue(bfqd, BLK_RW_ASYNC, bic,
+		bfqg = container_of(bfqq->entity.sched_data, struct bfq_group,
+				    sched_data);
+		new_bfqq = bfq_get_queue(bfqd, bfqg, BLK_RW_ASYNC, bic,
 					 GFP_ATOMIC);
 		if (new_bfqq != NULL) {
 			bic->bfqq[BLK_RW_ASYNC] = new_bfqq;
@@ -1417,6 +1413,7 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 }
 
 static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
+					      struct bfq_group *bfqg,
 					      int is_sync,
 					      struct bfq_io_cq *bic,
 					      gfp_t gfp_mask)
@@ -1459,6 +1456,7 @@ retry:
 		}
 
 		bfq_init_prio_data(bfqq, bic);
+		bfq_init_entity(&bfqq->entity, bfqg);
 	}
 
 	if (new_bfqq != NULL)
@@ -1468,26 +1466,27 @@ retry:
 }
 
 static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
+					       struct bfq_group *bfqg,
 					       int ioprio_class, int ioprio)
 {
 	switch (ioprio_class) {
 	case IOPRIO_CLASS_RT:
-		return &async_bfqq[0][ioprio];
+		return &bfqg->async_bfqq[0][ioprio];
 	case IOPRIO_CLASS_NONE:
 		ioprio = IOPRIO_NORM;
 		/* fall through */
 	case IOPRIO_CLASS_BE:
-		return &async_bfqq[1][ioprio];
+		return &bfqg->async_bfqq[1][ioprio];
 	case IOPRIO_CLASS_IDLE:
-		return &async_idle_bfqq;
+		return &bfqg->async_idle_bfqq;
 	default:
 		BUG();
 	}
 }
 
 static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
-				       int is_sync, struct bfq_io_cq *bic,
-				       gfp_t gfp_mask)
+				       struct bfq_group *bfqg, int is_sync,
+				       struct bfq_io_cq *bic, gfp_t gfp_mask)
 {
 	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
 	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
@@ -1495,12 +1494,13 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 	struct bfq_queue *bfqq = NULL;
 
 	if (!is_sync) {
-		async_bfqq = bfq_async_queue_prio(bfqd, ioprio_class, ioprio);
+		async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
+						  ioprio);
 		bfqq = *async_bfqq;
 	}
 
 	if (bfqq == NULL)
-		bfqq = bfq_find_alloc_queue(bfqd, is_sync, bic, gfp_mask);
+		bfqq = bfq_find_alloc_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
 
 	/*
 	 * Pin the queue now that it's allocated, scheduler exit will
@@ -1830,6 +1830,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	const int rw = rq_data_dir(rq);
 	const int is_sync = rq_is_sync(rq);
 	struct bfq_queue *bfqq;
+	struct bfq_group *bfqg;
 	unsigned long flags;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
@@ -1841,9 +1842,11 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	if (bic == NULL)
 		goto queue_fail;
 
+	bfqg = bfq_bic_update_cgroup(bic);
+
 	bfqq = bic_to_bfqq(bic, is_sync);
 	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
-		bfqq = bfq_get_queue(bfqd, is_sync, bic, gfp_mask);
+		bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
 		bic_set_bfqq(bic, bfqq, is_sync);
 	}
 
@@ -1937,10 +1940,12 @@ static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
 static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
 					struct bfq_queue **bfqq_ptr)
 {
+	struct bfq_group *root_group = bfqd->root_group;
 	struct bfq_queue *bfqq = *bfqq_ptr;
 
 	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
 	if (bfqq != NULL) {
+		bfq_bfqq_move(bfqd, bfqq, &bfqq->entity, root_group);
 		bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
 			     bfqq, atomic_read(&bfqq->ref));
 		bfq_put_queue(bfqq);
@@ -1949,18 +1954,20 @@ static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
 }
 
 /*
- * Release the extra reference of the async queues as the device
- * goes away.
+ * Release all the bfqg references to its async queues.  If we are
+ * deallocating the group these queues may still contain requests, so
+ * we reparent them to the root cgroup (i.e., the only one that will
+ * exist for sure until all the requests on a device are gone).
  */
-static void bfq_put_async_queues(struct bfq_data *bfqd)
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
 {
 	int i, j;
 
 	for (i = 0; i < 2; i++)
 		for (j = 0; j < IOPRIO_BE_NR; j++)
-			__bfq_put_async_bfqq(bfqd, &async_bfqq[i][j]);
+			__bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]);
 
-	__bfq_put_async_bfqq(bfqd, &async_idle_bfqq);
+	__bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq);
 }
 
 static void bfq_exit_queue(struct elevator_queue *e)
@@ -1976,18 +1983,20 @@ static void bfq_exit_queue(struct elevator_queue *e)
 	list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
 		bfq_deactivate_bfqq(bfqd, bfqq, 0);
 
-	bfq_put_async_queues(bfqd);
+	bfq_disconnect_groups(bfqd);
 	spin_unlock_irq(q->queue_lock);
 
 	bfq_shutdown_timer_wq(bfqd);
 
 	synchronize_rcu();
 
+	bfq_free_root_group(bfqd);
 	kfree(bfqd);
 }
 
 static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 {
+	struct bfq_group *bfqg;
 	struct bfq_data *bfqd;
 	struct elevator_queue *eq;
 
@@ -2016,6 +2025,15 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	q->elevator = eq;
 	spin_unlock_irq(q->queue_lock);
 
+	bfqg = bfq_alloc_root_group(bfqd, q->node);
+	if (bfqg == NULL) {
+		kfree(bfqd);
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	bfqd->root_group = bfqg;
+
 	init_timer(&bfqd->idle_slice_timer);
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
@@ -2279,7 +2297,7 @@ static int __init bfq_init(void)
 		return -ENOMEM;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v0");
+	pr_info("BFQ I/O-scheduler version: v1");
 
 	return 0;
 }
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index a9142f5..8801b6c 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -8,6 +8,61 @@
  *		      Paolo Valente <paolo.valente@unimore.it>
  */
 
+#ifdef CONFIG_CGROUP_BFQIO
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
+						 int extract,
+						 struct bfq_data *bfqd);
+
+static inline void bfq_update_budget(struct bfq_entity *next_in_service)
+{
+	struct bfq_entity *bfqg_entity;
+	struct bfq_group *bfqg;
+	struct bfq_sched_data *group_sd;
+
+	group_sd = next_in_service->sched_data;
+
+	bfqg = container_of(group_sd, struct bfq_group, sched_data);
+	/*
+	 * bfq_group's my_entity field is not NULL only if the group
+	 * is not the root group. We must not touch the root entity
+	 * as it must never become an in-service entity.
+	 */
+	bfqg_entity = bfqg->my_entity;
+	if (bfqg_entity != NULL)
+		bfqg_entity->budget = next_in_service->budget;
+}
+
+static int bfq_update_next_in_service(struct bfq_sched_data *sd)
+{
+	struct bfq_entity *next_in_service;
+
+	if (sd->in_service_entity != NULL)
+		/* will update/requeue at the end of service */
+		return 0;
+
+	/*
+	 * NOTE: this can be improved in many ways, such as returning
+	 * 1 (and thus propagating upwards the update) only when the
+	 * budget changes, or caching the bfqq that will be scheduled
+	 * next from this subtree.  By now we worry more about
+	 * correctness than about performance...
+	 */
+	next_in_service = bfq_lookup_next_entity(sd, 0, NULL);
+	sd->next_in_service = next_in_service;
+
+	if (next_in_service != NULL)
+		bfq_update_budget(next_in_service);
+
+	return 1;
+}
+
+#else
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
 
@@ -19,14 +74,10 @@ static inline int bfq_update_next_in_service(struct bfq_sched_data *sd)
 	return 0;
 }
 
-static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
-					     struct bfq_entity *entity)
-{
-}
-
 static inline void bfq_update_budget(struct bfq_entity *next_in_service)
 {
 }
+#endif
 
 /*
  * Shift for timestamp calculations.  This actually limits the maximum
@@ -842,7 +893,6 @@ static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
 		entity = __bfq_lookup_next_entity(st + i, false);
 		if (entity != NULL) {
 			if (extract) {
-				bfq_check_next_in_service(sd, entity);
 				bfq_active_extract(st + i, entity);
 				sd->in_service_entity = entity;
 				sd->next_in_service = NULL;
@@ -866,7 +916,7 @@ static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
 	if (bfqd->busy_queues == 0)
 		return NULL;
 
-	sd = &bfqd->sched_data;
+	sd = &bfqd->root_group->sched_data;
 	for (; sd != NULL; sd = entity->my_sched_data) {
 		entity = bfq_lookup_next_entity(sd, 1, bfqd);
 		entity->service = 0;
diff --git a/block/bfq.h b/block/bfq.h
index bd146b6..b982567 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v0 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v1 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
@@ -92,7 +92,7 @@ struct bfq_sched_data {
  * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
  * @weight: weight of the queue
  * @parent: parent entity, for hierarchical scheduling.
- * @my_sched_data: for non-leaf nodes in the hierarchy, the
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
  *                 associated scheduler queue, %NULL on leaf nodes.
  * @sched_data: the scheduler queue this entity belongs to.
  * @ioprio: the ioprio in use.
@@ -105,10 +105,11 @@ struct bfq_sched_data {
  * @ioprio_changed: flag, true when the user requested a weight, ioprio or
  *                  ioprio_class change.
  *
- * A bfq_entity is used to represent a bfq_queue (leaf node in the upper
- * level scheduler). Each entity belongs to the sched_data of the parent
- * group hierarchy. Non-leaf entities have also their own sched_data,
- * stored in @my_sched_data.
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
+ * cgroup hierarchy) or a bfq_group into the upper level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities have also their own sched_data, stored
+ * in @my_sched_data.
  *
  * Each entity stores independently its priority values; this would
  * allow different weights on different devices, but this
@@ -119,13 +120,14 @@ struct bfq_sched_data {
  * update to take place the effective and the requested priority
  * values are synchronized.
  *
- * The weight value is calculated from the ioprio to export the same
- * interface as CFQ.  When dealing with  ``well-behaved'' queues (i.e.,
- * queues that do not spend too much time to consume their budget
- * and have true sequential behavior, and when there are no external
- * factors breaking anticipation) the relative weights at each level
- * of the hierarchy should be guaranteed.  All the fields are
- * protected by the queue lock of the containing bfqd.
+ * Unless cgroups are used, the weight value is calculated from the
+ * ioprio to export the same interface as CFQ.  When dealing with
+ * ``well-behaved'' queues (i.e., queues that do not spend too much
+ * time to consume their budget and have true sequential behavior, and
+ * when there are no external factors breaking anticipation) the
+ * relative weights at each level of the cgroups hierarchy should be
+ * guaranteed.  All the fields are protected by the queue lock of the
+ * containing bfqd.
  */
 struct bfq_entity {
 	struct rb_node rb_node;
@@ -154,6 +156,8 @@ struct bfq_entity {
 	int ioprio_changed;
 };
 
+struct bfq_group;
+
 /**
  * struct bfq_queue - leaf schedulable entity.
  * @ref: reference counter.
@@ -177,7 +181,11 @@ struct bfq_entity {
  * @pid: pid of the process owning the queue, used for logging purposes.
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async.
+ * io_context or more, if it is async. @cgroup holds a reference to the
+ * cgroup, to be sure that it does not disappear while a bfqq still
+ * references it (mostly to avoid races between request issuing and task
+ * migration followed by cgroup destruction). All the fields are protected
+ * by the queue lock of the containing bfqd.
  */
 struct bfq_queue {
 	atomic_t ref;
@@ -244,7 +252,7 @@ enum bfq_device_speed {
 /**
  * struct bfq_data - per device data structure.
  * @queue: request queue for the managed device.
- * @sched_data: root @bfq_sched_data for the device.
+ * @root_group: root bfq_group for the device.
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @queued: number of queued requests.
@@ -267,6 +275,7 @@ enum bfq_device_speed {
  * @peak_rate_samples: number of samples used to calculate @peak_rate.
  * @bfq_max_budget: maximum budget allotted to a bfq_queue before
  *                  rescheduling.
+ * @group_list: list of all the bfq_groups active on the device.
  * @active_list: list of all the bfq_queues active on the device.
  * @idle_list: list of all the bfq_queues idle on the device.
  * @bfq_quantum: max number of requests dispatched per dispatch round.
@@ -293,7 +302,7 @@ enum bfq_device_speed {
 struct bfq_data {
 	struct request_queue *queue;
 
-	struct bfq_sched_data sched_data;
+	struct bfq_group *root_group;
 
 	int busy_queues;
 	int queued;
@@ -320,6 +329,7 @@ struct bfq_data {
 	u64 peak_rate;
 	unsigned long bfq_max_budget;
 
+	struct hlist_head group_list;
 	struct list_head active_list;
 	struct list_head idle_list;
 
@@ -390,6 +400,82 @@ enum bfqq_expiration {
 	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
 };
 
+#ifdef CONFIG_CGROUP_BFQIO
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both bfq_queues and bfq_groups).
+ * @group_node: node to be inserted into the bfqio_cgroup->group_data
+ *              list of the containing cgroup's bfqio_cgroup.
+ * @bfqd_node: node to be inserted into the @bfqd->group_list list
+ *             of the groups active on the same device; used for cleanup.
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/
+ *             migration.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @group_node is protected by the bfqio_cgroup lock, and is accessed
+ *      via RCU from its readers.
+ *    o @bfqd is protected by the queue lock, RCU is used to access it
+ *      from the readers.
+ *    o All the other fields are protected by the @bfqd queue lock.
+ */
+struct bfq_group {
+	struct bfq_entity entity;
+	struct bfq_sched_data sched_data;
+
+	struct hlist_node group_node;
+	struct hlist_node bfqd_node;
+
+	void *bfqd;
+
+	struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+	struct bfq_queue *async_idle_bfqq;
+
+	struct bfq_entity *my_entity;
+};
+
+/**
+ * struct bfqio_cgroup - bfq cgroup data structure.
+ * @css: subsystem state for bfq in the containing cgroup.
+ * @online: flag marked when the subsystem is inserted.
+ * @weight: cgroup weight.
+ * @ioprio: cgroup ioprio.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @ioprio, @ioprio_class and @group_data.
+ * @group_data: list containing the bfq_group belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates,
+ * @ioprio and @ioprio_class are protected by @lock.
+ */
+struct bfqio_cgroup {
+	struct cgroup_subsys_state css;
+	bool online;
+
+	unsigned short weight, ioprio, ioprio_class;
+
+	spinlock_t lock;
+	struct hlist_head group_data;
+};
+#else
+struct bfq_group {
+	struct bfq_sched_data sched_data;
+
+	struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
+	struct bfq_queue *async_idle_bfqq;
+};
+#endif
+
 static inline struct bfq_service_tree *
 bfq_entity_service_tree(struct bfq_entity *entity)
 {
@@ -460,8 +546,10 @@ static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd,
 static void bfq_changed_ioprio(struct bfq_io_cq *bic);
 static void bfq_put_queue(struct bfq_queue *bfqq);
 static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
-static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd, int is_sync,
+static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
+				       struct bfq_group *bfqg, int is_sync,
 				       struct bfq_io_cq *bic, gfp_t gfp_mask);
+static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 
 #endif /* _BFQ_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 768fe44..cdd2528 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -39,6 +39,10 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
+SUBSYS(bfqio)
+#endif
+
 #if IS_ENABLED(CONFIG_CGROUP_PERF)
 SUBSYS(perf_event)
 #endif
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 03/12] block, bfq: improve throughput boosting
@ 2014-05-29  9:05               ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

The feedback-loop algorithm used by BFQ to compute queue (process)
budgets is basically a set of three update rules, one for each of the
main reasons why a queue may be expired. If many processes suddenly
switch from sporadic I/O to greedy and sequential I/O, then these
rules are quite slow to assign large budgets to these processes, and
hence to achieve a high throughput. At the opposite extreme, BFQ assigns
the maximum possible budget B_max to a just-created queue. This allows
a high throughput to be achieved immediately if the associated process
is I/O-bound and performs sequential I/O from the beginning. But it
also increases the worst-case latency experienced by the first
requests issued by the process, because the larger the budget of a
queue waiting for service is, the later the queue will be served by
B-WF2Q+ (Subsec 3.3 in [1]). This is detrimental to interactive and
soft real-time applications.

To tackle these throughput and latency problems, on one hand this
patch changes the initial budget value to B_max/2. On the other hand,
it re-tunes the three rules, adopting a more aggressive,
multiplicative increase/linear decrease scheme. This scheme trades
latency for throughput more than before, and tends to assign large
budgets quickly to processes that are or become I/O-bound. For two of
the expiration reasons, the new version of the rules also contains
a few further small improvements, briefly described below.

*No more backlog.* In this case, the budget was larger than the number
of sectors actually read/written by the process before it stopped
doing I/O. Hence, to reduce latency for the possible future I/O
requests of the process, the old rule simply set the next budget to
the number of sectors actually consumed by the process. However, if
there are still outstanding requests, then the process may not yet
have issued its next request simply because it is still waiting for
the completion of some of the outstanding ones. If this sub-case
holds, then the new rule, instead of decreasing the budget,
doubles it proactively, in the hope that: 1) a larger budget will fit
the actual needs of the process, and 2) the process is sequential and
hence a higher throughput will be achieved by serving the process
longer after granting it access to the device.

*Budget timeout.* The original rule set the new budget to the maximum
value B_max, to maximize throughput and let all processes experiencing
budget timeouts receive the same share of the device time. In our
experiments we verified that this sudden jump to B_max did not provide
appreciable benefits; rather, it increased the latency of processes
performing sporadic and short I/O. The new rule only doubles the
budget.
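
For clarity, the re-tuned rules reduce to simple arithmetic on the
current budget. The following standalone C sketch condenses the switch
statement added by the diff below; the function and parameter names are
local to this example and are not kernel symbols.

unsigned long next_budget(unsigned long budget, unsigned long min_budget,
                          unsigned long max_budget, int reason,
                          int reqs_outstanding)
{
        switch (reason) {
        case 0: /* TOO_IDLE: shrink, unless requests are still in flight */
                if (reqs_outstanding)
                        budget *= 2;            /* proactive doubling */
                else if (budget > 5 * min_budget)
                        budget -= 4 * min_budget;
                else
                        budget = min_budget;
                break;
        case 1: /* BUDGET_TIMEOUT: double instead of jumping to B_max */
                budget *= 2;
                break;
        case 2: /* BUDGET_EXHAUSTED: multiplicative increase */
                budget *= 4;
                break;
        }
        return budget < max_budget ? budget : max_budget;
}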

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-cgroup.c  |  2 ++
 block/bfq-ioc.c     |  2 ++
 block/bfq-iosched.c | 88 ++++++++++++++++++++++++++++-------------------------
 block/bfq-sched.c   |  2 ++
 block/bfq.h         |  2 ++
 5 files changed, 55 insertions(+), 41 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 00a7a1b..805fe5e 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -7,6 +7,8 @@
  * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
  *		      Paolo Valente <paolo.valente@unimore.it>
  *
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
+ *
  * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
  * file.
  */
diff --git a/block/bfq-ioc.c b/block/bfq-ioc.c
index adfb5a1..7f6b000 100644
--- a/block/bfq-ioc.c
+++ b/block/bfq-ioc.c
@@ -6,6 +6,8 @@
  *
  * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
  *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
  */
 
 /**
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index b2cbfce..49ff1da 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -7,6 +7,8 @@
  * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
  *		      Paolo Valente <paolo.valente@unimore.it>
  *
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
+ *
  * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
  * file.
  *
@@ -101,9 +103,6 @@ struct kmem_cache *bfq_pool;
 #define BFQQ_SEEK_THR	 (sector_t)(8 * 1024)
 #define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)
 
-/* Budget feedback step. */
-#define BFQ_BUDGET_STEP         128
-
 /* Min samples used for peak rate estimation (for autotuning). */
 #define BFQ_PEAK_RATE_SAMPLES	32
 
@@ -537,36 +536,6 @@ static inline unsigned long bfq_max_budget(struct bfq_data *bfqd)
 		return bfqd->bfq_max_budget;
 }
 
- /*
- * bfq_default_budget - return the default budget for @bfqq on @bfqd.
- * @bfqd: the device descriptor.
- * @bfqq: the queue to consider.
- *
- * We use 3/4 of the @bfqd maximum budget as the default value
- * for the max_budget field of the queues.  This lets the feedback
- * mechanism to start from some middle ground, then the behavior
- * of the process will drive the heuristics towards high values, if
- * it behaves as a greedy sequential reader, or towards small values
- * if it shows a more intermittent behavior.
- */
-static unsigned long bfq_default_budget(struct bfq_data *bfqd,
-					struct bfq_queue *bfqq)
-{
-	unsigned long budget;
-
-	/*
-	 * When we need an estimate of the peak rate we need to avoid
-	 * to give budgets that are too short due to previous measurements.
-	 * So, in the first 10 assignments use a ``safe'' budget value.
-	 */
-	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
-		budget = bfq_default_max_budget;
-	else
-		budget = bfqd->bfq_max_budget;
-
-	return budget - budget / 4;
-}
-
 /*
  * Return min budget, which is a fraction of the current or default
  * max budget (trying with 1/32)
@@ -730,13 +699,51 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
 		 * for throughput.
 		 */
 		case BFQ_BFQQ_TOO_IDLE:
-			if (budget > min_budget + BFQ_BUDGET_STEP)
-				budget -= BFQ_BUDGET_STEP;
-			else
-				budget = min_budget;
+			/*
+			 * This is the only case where we may reduce
+			 * the budget: if there is no request of the
+			 * process still waiting for completion, then
+			 * we assume (tentatively) that the timer has
+			 * expired because the batch of requests of
+			 * the process could have been served with a
+			 * smaller budget.  Hence, betting that
+			 * process will behave in the same way when it
+			 * becomes backlogged again, we reduce its
+			 * next budget.  As long as we guess right,
+			 * this budget cut reduces the latency
+			 * experienced by the process.
+			 *
+			 * However, if there are still outstanding
+			 * requests, then the process may have not yet
+			 * issued its next request just because it is
+			 * still waiting for the completion of some of
+			 * the still outstanding ones.  So in this
+			 * subcase we do not reduce its budget, on the
+			 * contrary we increase it to possibly boost
+			 * the throughput, as discussed in the
+			 * comments to the BUDGET_TIMEOUT case.
+			 */
+			if (bfqq->dispatched > 0) /* still outstanding reqs */
+				budget = min(budget * 2, bfqd->bfq_max_budget);
+			else {
+				if (budget > 5 * min_budget)
+					budget -= 4 * min_budget;
+				else
+					budget = min_budget;
+			}
 			break;
 		case BFQ_BFQQ_BUDGET_TIMEOUT:
-			budget = bfq_default_budget(bfqd, bfqq);
+			/*
+			 * We double the budget here because: 1) it
+			 * gives the chance to boost the throughput if
+			 * this is not a seeky process (which may have
+			 * bumped into this timeout because of, e.g.,
+			 * ZBR), 2) together with charge_full_budget
+			 * it helps give seeky processes higher
+			 * timestamps, and hence be served less
+			 * frequently.
+			 */
+			budget = min(budget * 2, bfqd->bfq_max_budget);
 			break;
 		case BFQ_BFQQ_BUDGET_EXHAUSTED:
 			/*
@@ -748,8 +755,7 @@ static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
 			 * definitely increase the budget of this good
 			 * candidate to boost the disk throughput.
 			 */
-			budget = min(budget + 8 * BFQ_BUDGET_STEP,
-				     bfqd->bfq_max_budget);
+			budget = min(budget * 4, bfqd->bfq_max_budget);
 			break;
 		case BFQ_BFQQ_NO_MORE_REQUESTS:
 		       /*
@@ -1408,7 +1414,7 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	}
 
 	/* Tentative initial value to trade off between thr and lat */
-	bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
+	bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
 	bfqq->pid = pid;
 }
 
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 8801b6c..075e472 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -6,6 +6,8 @@
  *
  * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
  *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
  */
 
 #ifdef CONFIG_CGROUP_BFQIO
diff --git a/block/bfq.h b/block/bfq.h
index b982567..a334eb4 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -6,6 +6,8 @@
  *
  * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
  *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
  */
 
 #ifndef _BFQ_H
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 04/12] block, bfq: modify the peak-rate estimator
@ 2014-05-29  9:05               ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

Unless the maximum budget B_max that BFQ can assign to a queue is set
explicitly by the user, BFQ automatically updates B_max. In
particular, BFQ dynamically sets B_max to the number of sectors that
can be read, at the current estimated peak rate, during the maximum
time, T_max, allowed before a budget timeout occurs. In formulas, if
we denote as R_est the estimated peak rate, then B_max = T_max ∗
R_est. Hence, the more R_est exceeds the actual disk peak rate, the
higher the probability that processes unjustly incur budget timeouts.
Moreover, an excessively high value of B_max unnecessarily
increases the deviation from an ideal, smooth service.
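
As a rough, self-contained illustration (not the kernel helper; names
and units are made up for this example): with B_max = T_max ∗ R_est, a
queue served at the real rate R_real < R_est needs more than T_max to
consume B_max, and therefore incurs a budget timeout.

/* Illustrative only, in abstract units (sectors and milliseconds). */
unsigned long max_budget_sectors(unsigned long r_est, unsigned long t_max)
{
        return r_est * t_max;           /* B_max = T_max * R_est */
}

unsigned long time_to_consume_ms(unsigned long b_max, unsigned long r_real)
{
        /* Exceeds t_max whenever r_est > r_real (ignoring rounding), so
         * an overestimated rate translates directly into budget timeouts. */
        return b_max / r_real;
}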

To filter out spikes, the estimated peak rate is updated only on the
expiration of queues that have been served for a long-enough time.  As
a first step, the estimator computes the device rate, R_meas, during
the service of the queue. After that, if R_est < R_meas, then R_est is
set to R_meas.

Unfortunately, our experiments highlighted the following two
problems. First, because of ZBR (zone bit recording), depending on the
locality of the workload, the estimator may easily converge to a value
that is appropriate only for part of a disk. Second, R_est may jump to,
and remain stuck at, a value much higher than the actual device
peak rate, in case of hits in the drive cache, which may let sectors
be transferred in practice at bus rate.

To try to converge to the actual average peak rate over the disk
surface (in case of rotational devices), and to smooth the spikes
caused by the drive cache, this patch changes the estimator as
follows. In the description of the changes, we refer to a queue
containing random requests as 'seeky', according to the terminology
used in the code, and inherited from CFQ.

- First, R_est may now be updated also when the just-expired queue,
  despite not being detected as seeky, has nevertheless been unable to
  consume all of its budget within the maximum time slice T_max. In
  fact, this is an indication that B_max is too large. Since B_max =
  T_max ∗ R_est, R_est is then probably too large, and should be
  reduced.

- Second, to filter the spikes in R_meas, a discrete low-pass filter
  is now used to update R_est instead of just keeping the highest rate
  sampled. The rationale is that the average peak rate of a disk over
  its surface is a relatively stable quantity, hence a low-pass filter
  should converge more or less quickly to the right value.

With the current values of the constants used in the filter, the
latter seems to effectively smooth fluctuations and allow the
estimator to converge to the actual peak rate with all the devices we
tested.
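
The filter update itself is plain integer arithmetic; the following
standalone sketch mirrors the do_div()-based hunk below (the function
name is just for this example).

/* Discrete low-pass filter with alpha = 7/8:
 * new_rate = (7/8) * old_rate + (1/8) * measured_bw.
 * If the sample contributes nothing after the division by 8, keep the
 * old rate (the patched code bails out in the same situation). */
unsigned long long filter_peak_rate(unsigned long long old_rate,
                                    unsigned long long measured_bw)
{
        unsigned long long contrib = measured_bw / 8;

        if (contrib == 0)
                return old_rate;
        return (old_rate * 7) / 8 + contrib;
}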

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 49ff1da..2a4e03d 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -818,7 +818,7 @@ static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
  * throughput. See the code for more details.
  */
 static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
-				int compensate)
+				int compensate, enum bfqq_expiration reason)
 {
 	u64 bw, usecs, expected, timeout;
 	ktime_t delta;
@@ -854,10 +854,23 @@ static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	 * the peak rate estimation.
 	 */
 	if (usecs > 20000) {
-		if (bw > bfqd->peak_rate) {
-			bfqd->peak_rate = bw;
+		if (bw > bfqd->peak_rate ||
+		   (!BFQQ_SEEKY(bfqq) &&
+		    reason == BFQ_BFQQ_BUDGET_TIMEOUT)) {
+			bfq_log(bfqd, "measured bw =%llu", bw);
+			/*
+			 * To smooth oscillations use a low-pass filter with
+			 * alpha=7/8, i.e.,
+			 * new_rate = (7/8) * old_rate + (1/8) * bw
+			 */
+			do_div(bw, 8);
+			if (bw == 0)
+				return 0;
+			bfqd->peak_rate *= 7;
+			do_div(bfqd->peak_rate, 8);
+			bfqd->peak_rate += bw;
 			update = 1;
-			bfq_log(bfqd, "new peak_rate=%llu", bw);
+			bfq_log(bfqd, "new peak_rate=%llu", bfqd->peak_rate);
 		}
 
 		update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
@@ -936,7 +949,7 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 	/* Update disk peak rate for autotuning and check whether the
 	 * process is slow (see bfq_update_peak_rate).
 	 */
-	slow = bfq_update_peak_rate(bfqd, bfqq, compensate);
+	slow = bfq_update_peak_rate(bfqd, bfqq, compensate, reason);
 
 	/*
 	 * As above explained, 'punish' slow (i.e., seeky), timed-out
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 05/12] block, bfq: add more fairness to boost throughput and reduce latency
@ 2014-05-29  9:05             ` Paolo Valente
  12 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente, containers, linux-kernel, Fabio Checconi,
	Arianna Avanzini, cgroups, Paolo Valente

We have found four sources of throughput loss and higher
latencies. First, write requests tend to starve read requests,
basically because, on the one hand, writes are slower than reads,
while, on the other hand, storage devices confuse schedulers by
deceptively signaling the completion of write requests immediately
after receiving them. This patch addresses this issue simply by
throttling writes. In
particular, after a write request is dispatched for a queue, the
budget of the queue is decremented by the number of sectors to write,
multiplied by an (over)charge coefficient.

The current value of this coefficient, as well as the values of the
constants used in the following other changes, is the result of
our tuning with different devices.
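
A minimal sketch of the resulting charging rule (illustrative only; the
factor of 10 is the bfq_async_charge_factor value introduced by the
diff below, while the helper name is made up for this example):

/* Writes (async requests) are overcharged, so they drain a queue's
 * budget faster than reads and are thus throttled relative to sync I/O. */
#define ASYNC_CHARGE_FACTOR     10

unsigned long request_charge(unsigned long sectors, int is_sync)
{
        return is_sync ? sectors : sectors * ASYNC_CHARGE_FACTOR;
}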

The second source of problems is that some applications issue only a
few small, yet widely scattered, random requests at the beginning
of a new I/O-bound phase. This causes the average seek distance,
computed using a low-pass filter, to remain high for a non-negligible
amount of time, even if then the application issues only sequential
requests. Hence, for a while, the queue associated with the
application is unavoidably detected as seeky (i.e., containing random
requests). The device-idling timeout is then set to a very low value
for the queue. This often caused a loss of throughput on rotational
devices, as well as an increased latency. In contrast, this patch
allows the device-idling timeout for a seeky queue to be set to a very
low value only if the associated process has either already consumed
at least a minimum fraction (1/8) of the maximum budget B_max, or
already proved to generate random requests systematically. In
particular, in the latter case the queue is flagged as "constantly
seeky".

Finally, an additional BFQ mechanism causes throughput loss and
increased latencies in two further situations. When the in-service
queue is expired, BFQ also checks whether the queue has been "too
slow", i.e., whether it has consumed its last-assigned budget at such
a low rate that it would have been impossible to consume all of it
within the maximum time slice T_max (Subsec. 3.5 in [1]). In this
case, the queue is always (over)charged the whole budget, to reduce
its utilization of the device, exactly as happens with seeky queues.
The two situations in which this behavior causes problems, and the
solution provided by this patch, are described below.

1. If too little time has elapsed since a process started doing
sequential I/O, then the positive effect of its sequential accesses
on the throughput may not yet have prevailed over the throughput loss
caused by the initial random access needed to reach the first sector
requested by the process. For this reason, if a slow queue is expired
after receiving very little service (at most 1/8 of the maximum
budget), it is now not charged a full budget.

2. Because of ZBR (zone bit recording), a queue may be deemed slow
when its associated process is performing I/O on the slowest zones of
a disk. However, unless the process is truly too slow, it is more
profitable, in terms of disk throughput, not to reduce the disk
utilization of the queue. For this reason, a queue is now never
charged the whole budget if it has already consumed a significant
part of it (2/3). Both thresholds are summarized in the short sketch
after this list.
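
A rough condensation of when a queue is now (over)charged its whole
budget at expiration, combining the two thresholds above (parameter
names are invented; the real checks are split between
bfq_update_peak_rate() and bfq_bfqq_expire() in the diff below):

static int charge_full_budget_sketch(int consumed_at_low_rate,
				     int budget_timeout,
				     unsigned long served,
				     unsigned long budget,
				     unsigned long max_budget)
{
	/*
	 * Situation 1: with at most 1/8 of the maximum budget of
	 * service, the process gets the benefit of the doubt and is
	 * not classified as slow.
	 */
	int slow = consumed_at_low_rate && served > max_budget / 8;

	/*
	 * Situation 2: on a budget timeout, over-charge only a queue
	 * that has consumed no more than about 2/3 of its budget.
	 */
	return slow || (budget_timeout && budget - served >= budget / 3);
}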

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/bfq-iosched.c | 50 +++++++++++++++++++++++++++++++++++++++++++++-----
 block/bfq.h         |  5 +++++
 2 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 2a4e03d..9e607a0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -87,6 +87,13 @@ static int bfq_slice_idle = HZ / 125;
 static const int bfq_default_max_budget = 16 * 1024;
 static const int bfq_max_budget_async_rq = 4;
 
+/*
+ * Async to sync throughput distribution is controlled as follows:
+ * when an async request is served, the entity is charged the number
+ * of sectors of the request, multiplied by the factor below
+ */
+static const int bfq_async_charge_factor = 10;
+
 /* Default timeout values, in jiffies, approximating CFQ defaults. */
 static const int bfq_timeout_sync = HZ / 8;
 static int bfq_timeout_async = HZ / 25;
@@ -269,10 +276,12 @@ static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
 	return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
 }
 
+/* see the definition of bfq_async_charge_factor for details */
 static inline unsigned long bfq_serv_to_charge(struct request *rq,
 					       struct bfq_queue *bfqq)
 {
-	return blk_rq_sectors(rq);
+	return blk_rq_sectors(rq) *
+		(1 + ((!bfq_bfqq_sync(bfqq)) * bfq_async_charge_factor));
 }
 
 /**
@@ -565,13 +574,21 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 	 * We don't want to idle for seeks, but we do want to allow
 	 * fair distribution of slice time for a process doing back-to-back
 	 * seeks. So allow a little bit of time for him to submit a new rq.
+	 *
+	 * To prevent processes with (partly) seeky workloads from
+	 * being too ill-treated, grant them a small fraction of the
+	 * assigned budget before reducing the waiting time to
+	 * BFQ_MIN_TT. This happened to help reduce latency.
 	 */
 	sl = bfqd->bfq_slice_idle;
 	/*
-	 * Grant only minimum idle time if the queue has been seeky for long
-	 * enough.
+	 * Grant only minimum idle time if the queue either has been seeky for
+	 * long enough or has already proved to be constantly seeky.
 	 */
-	if (bfq_sample_valid(bfqq->seek_samples) && BFQQ_SEEKY(bfqq))
+	if (bfq_sample_valid(bfqq->seek_samples) &&
+	    ((BFQQ_SEEKY(bfqq) && bfqq->entity.service >
+				  bfq_max_budget(bfqq->bfqd) / 8) ||
+	      bfq_bfqq_constantly_seeky(bfqq)))
 		sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
 	bfqd->last_idling_start = ktime_get();
 	mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
@@ -889,6 +906,16 @@ static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	}
 
 	/*
+	 * If the process has been served for a too short time
+	 * interval to let its possible sequential accesses prevail on
+	 * the initial seek time needed to move the disk head on the
+	 * first sector it requested, then give the process a chance
+	 * and for the moment return false.
+	 */
+	if (bfqq->entity.budget <= bfq_max_budget(bfqd) / 8)
+		return 0;
+
+	/*
 	 * A process is considered ``slow'' (i.e., seeky, so that we
 	 * cannot treat it fairly in the service domain, as it would
 	 * slow down too much the other processes) if, when a slice
@@ -954,10 +981,21 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 	/*
 	 * As above explained, 'punish' slow (i.e., seeky), timed-out
 	 * and async queues, to favor sequential sync workloads.
+	 *
+	 * Processes doing I/O in the slower disk zones will tend to be
+	 * slow(er) even if not seeky. Hence, since the estimated peak
+	 * rate is actually an average over the disk surface, these
+	 * processes may timeout just for bad luck. To avoid punishing
+	 * them we do not charge a full budget to a process that
+	 * succeeded in consuming at least 2/3 of its budget.
 	 */
-	if (slow || reason == BFQ_BFQQ_BUDGET_TIMEOUT)
+	if (slow || (reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
+		     bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3))
 		bfq_bfqq_charge_full_budget(bfqq);
 
+	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
+		bfq_mark_bfqq_constantly_seeky(bfqq);
+
 	bfq_log_bfqq(bfqd, bfqq,
 		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -1632,6 +1670,8 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_update_io_thinktime(bfqd, bic);
 	bfq_update_io_seektime(bfqd, bfqq, rq);
+	if (!BFQQ_SEEKY(bfqq))
+		bfq_clear_bfqq_constantly_seeky(bfqq);
 	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
 	    !BFQQ_SEEKY(bfqq))
 		bfq_update_idle_window(bfqd, bfqq, bic);
diff --git a/block/bfq.h b/block/bfq.h
index a334eb4..ea5ecca 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -358,6 +358,10 @@ enum bfqq_state_flags {
 	BFQ_BFQQ_FLAG_prio_changed,	/* task priority has changed */
 	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */
 	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this budget */
+	BFQ_BFQQ_FLAG_constantly_seeky,	/*
+					 * bfqq has proved to be slow and
+					 * seeky until budget timeout
+					 */
 };
 
 #define BFQ_BFQQ_FNS(name)						\
@@ -382,6 +386,7 @@ BFQ_BFQQ_FNS(idle_window);
 BFQ_BFQQ_FNS(prio_changed);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(budget_new);
+BFQ_BFQQ_FNS(constantly_seeky);
 #undef BFQ_BFQQ_FNS
 
 /* Logging facilities. */
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 06/12] block, bfq: improve responsiveness
       [not found]           ` <1401354343-5527-1-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
                               ` (4 preceding siblings ...)
  2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 05/12] block, bfq: add more fairness to boost throughput and reduce latency Paolo Valente
@ 2014-05-29  9:05             ` Paolo Valente
  2014-05-29  9:05               ` Paolo Valente
                               ` (6 subsequent siblings)
  12 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

This patch introduces a simple heuristic to load applications quickly,
and to perform the I/O requested by interactive applications just as
quickly. To this end, both a newly-created queue and a queue
associated with an interactive application (we explain in a moment how
BFQ decides whether the associated application is interactive) receive
the following three special treatments:

1) The weight of the queue is raised.

2) The queue unconditionally enjoys device idling when it empties; in
fact, if the requests of a queue are sync, then performing device
idling for the queue is a necessary condition to guarantee that the
queue receives a fraction of the throughput proportional to its weight
(see [1] for details).

3) The device-idling timeout is larger for the queue. This reduces the
probability that the queue is expired because its next request does
not arrive in time.

For brevity, we refer to the combination of these three preferential
treatments simply as weight-raising. For a newly-created queue,
weight-raising starts immediately and lasts for a time interval that:
1) depends on the device speed and type (rotational or
non-rotational), and 2) is equal to the time needed to load (start up)
a large application on that device, with cold caches and with no
additional workload.
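
As a concrete illustration of this automatic tuning (the formula,
duration = (R/r) * T, is spelled out in the comments of the diff
below): on a fast rotational disk whose measured peak rate r is half
of the reference rate R, the weight-raising period lasts twice the
reference time T, i.e., about 11 seconds with the T_fast[0] = 5500 ms
value set in bfq_init() below. The faster the device, the shorter the
boost.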

Finally, as for guaranteeing fast execution of interactive,
I/O-related tasks (such as opening a file), consider that any
interactive application blocks and waits for user input both after
starting up and after executing some task. After a while, the user may
trigger new operations, after which the application stops again, and
so on. Accordingly, the low-latency heuristic weight-raises a queue
again if it becomes backlogged after being idle for a sufficiently
long (configurable) time. The weight-raising then lasts for the same
time as for a just-created queue.

According to our experiments, the combination of this low-latency
heuristic with the improvements described in patch 7 allows BFQ
to guarantee high application responsiveness.
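
In the service domain, the first of the three treatments boils down
to a single multiplication. A minimal sketch of the effective weight
used by B-WF2Q+ while a queue is weight-raised (illustrative only;
the real change is the __bfq_entity_update_weight_prio() hunk in
bfq-sched.c below, and the default coefficient set by this patch
is 20):

static unsigned int effective_weight(unsigned int orig_weight,
				     unsigned int wr_coeff)
{
	/* wr_coeff > 1 only while the queue is being weight-raised */
	return orig_weight * wr_coeff;
}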

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/Kconfig.iosched |   8 +-
 block/bfq-cgroup.c    |  15 +++
 block/bfq-iosched.c   | 355 ++++++++++++++++++++++++++++++++++++++++++++++----
 block/bfq-sched.c     |   5 +-
 block/bfq.h           |  33 +++++
 5 files changed, 386 insertions(+), 30 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a3675cb..3e26f28 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,8 +45,9 @@ config IOSCHED_BFQ
 	---help---
 	  The BFQ I/O scheduler tries to distribute bandwidth among all
 	  processes according to their weights.
-	  It aims at distributing the bandwidth as desired, regardless
-	  of the disk parameters and with any workload. If compiled
+	  It aims at distributing the bandwidth as desired, regardless of
+	  the device parameters and with any workload. It also tries to
+	  guarantee a low latency to interactive applications. If compiled
 	  built-in (saying Y here), BFQ can be configured to support
 	  hierarchical scheduling.
 
@@ -79,7 +80,8 @@ choice
 		  used by default for all block devices.
 		  The BFQ I/O scheduler aims at distributing the bandwidth
 		  as desired, regardless of the disk parameters and with
-		  any workload.
+		  any workload. It also tries to guarantee a low latency to
+		  interactive applications.
 
 	config DEFAULT_NOOP
 		bool "No-op"
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 805fe5e..1cb25aa 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -525,6 +525,16 @@ static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg)
 	kfree(bfqg);
 }
 
+static void bfq_end_wr_async(struct bfq_data *bfqd)
+{
+	struct hlist_node *tmp;
+	struct bfq_group *bfqg;
+
+	hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node)
+		bfq_end_wr_async_queues(bfqd, bfqg);
+	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+}
+
 /**
  * bfq_disconnect_groups - disconnect @bfqd from all its groups.
  * @bfqd: the device descriptor being exited.
@@ -866,6 +876,11 @@ static inline void bfq_bfqq_move(struct bfq_data *bfqd,
 {
 }
 
+static void bfq_end_wr_async(struct bfq_data *bfqd)
+{
+	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+}
+
 static inline void bfq_disconnect_groups(struct bfq_data *bfqd)
 {
 	bfq_put_async_queues(bfqd, bfqd->root_group);
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 9e607a0..ace9aba 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -24,15 +24,15 @@
  * precisely, BFQ schedules queues associated to processes. Thanks to the
  * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
  * I/O-bound processes issuing sequential requests (to boost the
- * throughput), and yet guarantee a relatively low latency to interactive
- * applications.
+ * throughput), and yet guarantee a low latency to interactive applications.
  *
  * BFQ is described in [1], where also a reference to the initial, more
  * theoretical paper on BFQ can be found. The interested reader can find
  * in the latter paper full details on the main algorithm, as well as
  * formulas of the guarantees and formal proofs of all the properties.
  * With respect to the version of BFQ presented in these papers, this
- * implementation adds a hierarchical extension based on H-WF2Q+.
+ * implementation adds a few more heuristics and a hierarchical extension
+ * based on H-WF2Q+.
  *
  * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
  * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
@@ -116,6 +116,48 @@ struct kmem_cache *bfq_pool;
 /* Shift used for peak rate fixed precision calculations. */
 #define BFQ_RATE_SHIFT		16
 
+/*
+ * By default, BFQ computes the duration of the weight raising for
+ * interactive applications automatically, using the following formula:
+ * duration = (R / r) * T, where r is the peak rate of the device, and
+ * R and T are two reference parameters.
+ * In particular, R is the peak rate of the reference device (see below),
+ * and T is a reference time: given the systems that are likely to be
+ * installed on the reference device according to its speed class, T is
+ * about the maximum time needed, under BFQ and while reading two files in
+ * parallel, to load typical large applications on these systems.
+ * In practice, the slower/faster the device at hand is, the more/less it
+ * takes to load applications with respect to the reference device.
+ * Accordingly, the longer/shorter BFQ grants weight raising to interactive
+ * applications.
+ *
+ * BFQ uses four different reference pairs (R, T), depending on:
+ * . whether the device is rotational or non-rotational;
+ * . whether the device is slow, such as old or portable HDDs, as well as
+ *   SD cards, or fast, such as newer HDDs and SSDs.
+ *
+ * The device's speed class is dynamically (re)detected in
+ * bfq_update_peak_rate() every time the estimated peak rate is updated.
+ *
+ * In the following definitions, R_slow[0]/R_fast[0] and T_slow[0]/T_fast[0]
+ * are the reference values for a slow/fast rotational device, whereas
+ * R_slow[1]/R_fast[1] and T_slow[1]/T_fast[1] are the reference values for
+ * a slow/fast non-rotational device. Finally, device_speed_thresh are the
+ * thresholds used to switch between speed classes.
+ * Both the reference peak rates and the thresholds are measured in
+ * sectors/usec, left-shifted by BFQ_RATE_SHIFT.
+ */
+static int R_slow[2] = {1536, 10752};
+static int R_fast[2] = {17415, 34791};
+/*
+ * To improve readability, a conversion function is used to initialize the
+ * following arrays, which entails that they can be initialized only in a
+ * function.
+ */
+static int T_slow[2];
+static int T_fast[2];
+static int device_speed_thresh[2];
+
 #define BFQ_SERVICE_TREE_INIT	((struct bfq_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
@@ -281,7 +323,8 @@ static inline unsigned long bfq_serv_to_charge(struct request *rq,
 					       struct bfq_queue *bfqq)
 {
 	return blk_rq_sectors(rq) *
-		(1 + ((!bfq_bfqq_sync(bfqq)) * bfq_async_charge_factor));
+		(1 + ((!bfq_bfqq_sync(bfqq)) * (bfqq->wr_coeff == 1) *
+		bfq_async_charge_factor));
 }
 
 /**
@@ -322,12 +365,27 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
 	}
 }
 
+static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
+{
+	u64 dur;
+
+	if (bfqd->bfq_wr_max_time > 0)
+		return bfqd->bfq_wr_max_time;
+
+	dur = bfqd->RT_prod;
+	do_div(dur, bfqd->peak_rate);
+
+	return dur;
+}
+
 static void bfq_add_request(struct request *rq)
 {
 	struct bfq_queue *bfqq = RQ_BFQQ(rq);
 	struct bfq_entity *entity = &bfqq->entity;
 	struct bfq_data *bfqd = bfqq->bfqd;
 	struct request *next_rq, *prev;
+	unsigned long old_wr_coeff = bfqq->wr_coeff;
+	int idle_for_long_time = 0;
 
 	bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
 	bfqq->queued[rq_is_sync(rq)]++;
@@ -343,13 +401,64 @@ static void bfq_add_request(struct request *rq)
 	bfqq->next_rq = next_rq;
 
 	if (!bfq_bfqq_busy(bfqq)) {
+		idle_for_long_time = time_is_before_jiffies(
+			bfqq->budget_timeout +
+			bfqd->bfq_wr_min_idle_time);
 		entity->budget = max_t(unsigned long, bfqq->max_budget,
 				       bfq_serv_to_charge(next_rq, bfqq));
+
+		if (!bfqd->low_latency)
+			goto add_bfqq_busy;
+
+		/*
+		 * If the queue is not being boosted and has been idle for
+		 * enough time, start a weight-raising period.
+		 */
+		if (old_wr_coeff == 1 && idle_for_long_time) {
+			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			bfq_log_bfqq(bfqd, bfqq,
+				     "wrais starting at %lu, rais_max_time %u",
+				     jiffies,
+				     jiffies_to_msecs(bfqq->wr_cur_max_time));
+		} else if (old_wr_coeff > 1) {
+			if (idle_for_long_time)
+				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			else {
+				bfqq->wr_coeff = 1;
+				bfq_log_bfqq(bfqd, bfqq,
+					"wrais ending at %lu, rais_max_time %u",
+					jiffies,
+					jiffies_to_msecs(bfqq->
+						wr_cur_max_time));
+			}
+		}
+		if (old_wr_coeff != bfqq->wr_coeff)
+			entity->ioprio_changed = 1;
+add_bfqq_busy:
 		bfq_add_bfqq_busy(bfqd, bfqq);
 	} else {
+		if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
+		    time_is_before_jiffies(
+				bfqq->last_wr_start_finish +
+				bfqd->bfq_wr_min_inter_arr_async)) {
+			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+
+			entity->ioprio_changed = 1;
+			bfq_log_bfqq(bfqd, bfqq,
+			    "non-idle wrais starting at %lu, rais_max_time %u",
+			    jiffies,
+			    jiffies_to_msecs(bfqq->wr_cur_max_time));
+		}
 		if (prev != bfqq->next_rq)
 			bfq_updated_next_req(bfqd, bfqq);
 	}
+
+	if (bfqd->low_latency &&
+		(old_wr_coeff == 1 || bfqq->wr_coeff == 1 ||
+		 idle_for_long_time))
+		bfqq->last_wr_start_finish = jiffies;
 }
 
 static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
@@ -477,6 +586,43 @@ static void bfq_merged_requests(struct request_queue *q, struct request *rq,
 	bfq_remove_request(next);
 }
 
+/* Must be called with bfqq != NULL */
+static inline void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
+{
+	bfqq->wr_coeff = 1;
+	bfqq->wr_cur_max_time = 0;
+	/* Trigger a weight change on the next activation of the queue */
+	bfqq->entity.ioprio_changed = 1;
+}
+
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
+				    struct bfq_group *bfqg)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			if (bfqg->async_bfqq[i][j] != NULL)
+				bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]);
+	if (bfqg->async_idle_bfqq != NULL)
+		bfq_bfqq_end_wr(bfqg->async_idle_bfqq);
+}
+
+static void bfq_end_wr(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq;
+
+	spin_lock_irq(bfqd->queue->queue_lock);
+
+	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
+		bfq_bfqq_end_wr(bfqq);
+	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
+		bfq_bfqq_end_wr(bfqq);
+	bfq_end_wr_async(bfqd);
+
+	spin_unlock_irq(bfqd->queue->queue_lock);
+}
+
 static int bfq_allow_merge(struct request_queue *q, struct request *rq,
 			   struct bio *bio)
 {
@@ -582,14 +728,17 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 	 */
 	sl = bfqd->bfq_slice_idle;
 	/*
-	 * Grant only minimum idle time if the queue either has been seeky for
-	 * long enough or has already proved to be constantly seeky.
+	 * Unless the queue is being weight-raised, grant only minimum idle
+	 * time if the queue either has been seeky for long enough or has
+	 * already proved to be constantly seeky.
 	 */
 	if (bfq_sample_valid(bfqq->seek_samples) &&
 	    ((BFQQ_SEEKY(bfqq) && bfqq->entity.service >
 				  bfq_max_budget(bfqq->bfqd) / 8) ||
-	      bfq_bfqq_constantly_seeky(bfqq)))
+	      bfq_bfqq_constantly_seeky(bfqq)) && bfqq->wr_coeff == 1)
 		sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
+	else if (bfqq->wr_coeff > 1)
+		sl = sl * 3;
 	bfqd->last_idling_start = ktime_get();
 	mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
 	bfq_log(bfqd, "arm idle: %u/%u ms",
@@ -677,9 +826,15 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	__bfq_bfqd_reset_in_service(bfqd);
 
-	if (RB_EMPTY_ROOT(&bfqq->sort_list))
+	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		/*
+		 * Overloading budget_timeout field to store the time
+		 * at which the queue remains with no backlog; used by
+		 * the weight-raising mechanism.
+		 */
+		bfqq->budget_timeout = jiffies;
 		bfq_del_bfqq_busy(bfqd, bfqq, 1);
-	else
+	} else
 		bfq_activate_bfqq(bfqd, bfqq);
 }
 
@@ -896,12 +1051,26 @@ static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 			bfqd->peak_rate_samples++;
 
 		if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
-		    update && bfqd->bfq_user_max_budget == 0) {
-			bfqd->bfq_max_budget =
-				bfq_calc_max_budget(bfqd->peak_rate,
-						    timeout);
-			bfq_log(bfqd, "new max_budget=%lu",
-				bfqd->bfq_max_budget);
+		    update) {
+			int dev_type = blk_queue_nonrot(bfqd->queue);
+			if (bfqd->bfq_user_max_budget == 0) {
+				bfqd->bfq_max_budget =
+					bfq_calc_max_budget(bfqd->peak_rate,
+							    timeout);
+				bfq_log(bfqd, "new max_budget=%lu",
+					bfqd->bfq_max_budget);
+			}
+			if (bfqd->device_speed == BFQ_BFQD_FAST &&
+			    bfqd->peak_rate < device_speed_thresh[dev_type]) {
+				bfqd->device_speed = BFQ_BFQD_SLOW;
+				bfqd->RT_prod = R_slow[dev_type] *
+						T_slow[dev_type];
+			} else if (bfqd->device_speed == BFQ_BFQD_SLOW &&
+			    bfqd->peak_rate > device_speed_thresh[dev_type]) {
+				bfqd->device_speed = BFQ_BFQD_FAST;
+				bfqd->RT_prod = R_fast[dev_type] *
+						T_fast[dev_type];
+			}
 		}
 	}
 
@@ -996,6 +1165,9 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
 		bfq_mark_bfqq_constantly_seeky(bfqq);
 
+	if (bfqd->low_latency && bfqq->wr_coeff == 1)
+		bfqq->last_wr_start_finish = jiffies;
+
 	bfq_log_bfqq(bfqd, bfqq,
 		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -1044,21 +1216,36 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
 }
 
 /*
- * Device idling is allowed only for sync queues that have a non-null
- * idle window.
+ * Device idling is allowed only for the queues for which this function
+ * returns true. For this reason, the return value of this function plays a
+ * critical role for both throughput boosting and service guarantees.
+ *
+ * The return value is computed through a logical expression, which may
+ * be true only if bfqq is sync and at least one of the following two
+ * conditions holds:
+ * - the queue has a non-null idle window;
+ * - the queue is being weight-raised.
+ * In fact, waiting for a new request for the queue, in the first case,
+ * is likely to boost the disk throughput, whereas, in the second case,
+ * is necessary to preserve fairness and latency guarantees
+ * (see [1] for details).
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
-	return bfq_bfqq_sync(bfqq) && bfq_bfqq_idle_window(bfqq);
+	return bfq_bfqq_sync(bfqq) &&
+	       (bfqq->wr_coeff > 1 || bfq_bfqq_idle_window(bfqq));
 }
 
 /*
- * If the in-service queue is empty, but it is sync and the queue has its
- * idle window set (in this case, waiting for a new request for the queue
- * is likely to boost the throughput), then:
+ * If the in-service queue is empty but sync, and the function
+ * bfq_bfqq_must_not_expire returns true, then:
  * 1) the queue must remain in service and cannot be expired, and
  * 2) the disk must be idled to wait for the possible arrival of a new
  *    request for the queue.
+ * See the comments to the function bfq_bfqq_must_not_expire for the reasons
+ * why performing device idling is the best choice to boost the throughput
+ * and preserve service guarantees when bfq_bfqq_must_not_expire itself
+ * returns true.
  */
 static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
 {
@@ -1148,6 +1335,38 @@ keep_queue:
 	return bfqq;
 }
 
+static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */
+		bfq_log_bfqq(bfqd, bfqq,
+			"raising period dur %u/%u msec, old coeff %u, w %d(%d)",
+			jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+			jiffies_to_msecs(bfqq->wr_cur_max_time),
+			bfqq->wr_coeff,
+			bfqq->entity.weight, bfqq->entity.orig_weight);
+
+		/*
+		 * If too much time has elapsed from the beginning
+		 * of this weight-raising period, stop it.
+		 */
+		if (time_is_before_jiffies(bfqq->last_wr_start_finish +
+					   bfqq->wr_cur_max_time)) {
+			bfqq->last_wr_start_finish = jiffies;
+			bfq_log_bfqq(bfqd, bfqq,
+				     "wrais ending at %lu, rais_max_time %u",
+				     bfqq->last_wr_start_finish,
+				     jiffies_to_msecs(bfqq->wr_cur_max_time));
+			bfq_bfqq_end_wr(bfqq);
+		}
+	}
+	/* Update weight both if it must be raised and if it must be lowered */
+	if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))
+		__bfq_entity_update_weight_prio(
+			bfq_entity_service_tree(entity),
+			entity);
+}
+
 /*
  * Dispatch one request from bfqq, moving it to the request queue
  * dispatch list.
@@ -1194,6 +1413,8 @@ static int bfq_dispatch_request(struct bfq_data *bfqd,
 	bfq_bfqq_served(bfqq, service_to_charge);
 	bfq_dispatch_insert(bfqd->queue, rq);
 
+	bfq_update_wr_data(bfqd, bfqq);
+
 	bfq_log_bfqq(bfqd, bfqq,
 			"dispatched %u sec req (%llu), budg left %lu",
 			blk_rq_sectors(rq),
@@ -1467,6 +1688,9 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	/* Tentative initial value to trade off between thr and lat */
 	bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
 	bfqq->pid = pid;
+
+	bfqq->wr_coeff = 1;
+	bfqq->last_wr_start_finish = 0;
 }
 
 static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
@@ -1642,7 +1866,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
 		enable_idle = 0;
 	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
-		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle)
+		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
+			bfqq->wr_coeff == 1)
 			enable_idle = 0;
 		else
 			enable_idle = 1;
@@ -2117,6 +2342,22 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async;
 	bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync;
 
+	bfqd->low_latency = true;
+
+	bfqd->bfq_wr_coeff = 20;
+	bfqd->bfq_wr_max_time = 0;
+	bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
+	bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
+
+	/*
+	 * Begin by assuming, optimistically, that the device peak rate is
+	 * equal to the highest reference rate.
+	 */
+	bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] *
+			T_fast[blk_queue_nonrot(bfqd->queue)];
+	bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)];
+	bfqd->device_speed = BFQ_BFQD_FAST;
+
 	return 0;
 }
 
@@ -2151,6 +2392,14 @@ static ssize_t bfq_var_store(unsigned long *var, const char *page,
 	return count;
 }
 
+static ssize_t bfq_wr_max_time_show(struct elevator_queue *e, char *page)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	return sprintf(page, "%d\n", bfqd->bfq_wr_max_time > 0 ?
+		       jiffies_to_msecs(bfqd->bfq_wr_max_time) :
+		       jiffies_to_msecs(bfq_wr_duration(bfqd)));
+}
+
 static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
 {
 	struct bfq_queue *bfqq;
@@ -2165,19 +2414,24 @@ static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
 	num_char += sprintf(page + num_char, "Active:\n");
 	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
 	  num_char += sprintf(page + num_char,
-			      "pid%d: weight %hu, nr_queued %d %d\n",
+			      "pid%d: weight %hu, nr_queued %d %d, dur %d/%u\n",
 			      bfqq->pid,
 			      bfqq->entity.weight,
 			      bfqq->queued[0],
-			      bfqq->queued[1]);
+			      bfqq->queued[1],
+			jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+			jiffies_to_msecs(bfqq->wr_cur_max_time));
 	}
 
 	num_char += sprintf(page + num_char, "Idle:\n");
 	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
 			num_char += sprintf(page + num_char,
-				"pid%d: weight %hu\n",
+				"pid%d: weight %hu, dur %d/%u\n",
 				bfqq->pid,
-				bfqq->entity.weight);
+				bfqq->entity.weight,
+				jiffies_to_msecs(jiffies -
+					bfqq->last_wr_start_finish),
+				jiffies_to_msecs(bfqq->wr_cur_max_time));
 	}
 
 	spin_unlock_irq(bfqd->queue->queue_lock);
@@ -2205,6 +2459,11 @@ SHOW_FUNCTION(bfq_max_budget_async_rq_show,
 	      bfqd->bfq_max_budget_async_rq, 0);
 SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
 SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
+SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
+SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
+SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
+SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
+	1);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2237,6 +2496,12 @@ STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq,
 		1, INT_MAX, 0);
 STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
 		INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
+STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
+		&bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
 #undef STORE_FUNCTION
 
 /* do nothing for the moment */
@@ -2295,6 +2560,22 @@ static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
 	return ret;
 }
 
+static ssize_t bfq_low_latency_store(struct elevator_queue *e,
+				     const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data > 1)
+		__data = 1;
+	if (__data == 0 && bfqd->low_latency != 0)
+		bfq_end_wr(bfqd);
+	bfqd->low_latency = __data;
+
+	return ret;
+}
+
 #define BFQ_ATTR(name) \
 	__ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
 
@@ -2309,6 +2590,11 @@ static struct elv_fs_entry bfq_attrs[] = {
 	BFQ_ATTR(max_budget_async_rq),
 	BFQ_ATTR(timeout_sync),
 	BFQ_ATTR(timeout_async),
+	BFQ_ATTR(low_latency),
+	BFQ_ATTR(wr_coeff),
+	BFQ_ATTR(wr_max_time),
+	BFQ_ATTR(wr_min_idle_time),
+	BFQ_ATTR(wr_min_inter_arr_async),
 	BFQ_ATTR(weights),
 	__ATTR_NULL
 };
@@ -2355,6 +2641,23 @@ static int __init bfq_init(void)
 	if (bfq_slab_setup())
 		return -ENOMEM;
 
+	/*
+	 * Times to load large popular applications for the typical systems
+	 * installed on the reference devices (see the comments before the
+	 * definitions of the two arrays).
+	 */
+	T_slow[0] = msecs_to_jiffies(2600);
+	T_slow[1] = msecs_to_jiffies(1000);
+	T_fast[0] = msecs_to_jiffies(5500);
+	T_fast[1] = msecs_to_jiffies(2000);
+
+	/*
+	 * Thresholds that determine the switch between speed classes (see
+	 * the comments before the definition of the array).
+	 */
+	device_speed_thresh[0] = (R_fast[0] + R_slow[0]) / 2;
+	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
+
 	elv_register(&iosched_bfq);
 	pr_info("BFQ I/O-scheduler version: v1");
 
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 075e472..f6491d5 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -514,6 +514,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 	struct bfq_service_tree *new_st = old_st;
 
 	if (entity->ioprio_changed) {
+		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
 		old_st->wsum -= entity->weight;
 
 		if (entity->new_weight != entity->orig_weight) {
@@ -539,7 +541,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		 * when entity->finish <= old_st->vtime).
 		 */
 		new_st = bfq_entity_service_tree(entity);
-		entity->weight = entity->orig_weight;
+		entity->weight = entity->orig_weight *
+				 (bfqq != NULL ? bfqq->wr_coeff : 1);
 		new_st->wsum += entity->weight;
 
 		if (new_st != old_st)
diff --git a/block/bfq.h b/block/bfq.h
index ea5ecca..3ce9100 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -181,6 +181,10 @@ struct bfq_group;
  * @seek_mean: mean seek distance
  * @last_request_pos: position of the last request enqueued
  * @pid: pid of the process owning the queue, used for logging purposes.
+ * @last_wr_start_finish: start time of the current weight-raising period if
+ *                        the @bfq-queue is being weight-raised, otherwise
+ *                        finish time of the last weight-raising period
+ * @wr_cur_max_time: current max raising time for this queue
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
  * io_context or more, if it is async. @cgroup holds a reference to the
@@ -217,6 +221,11 @@ struct bfq_queue {
 	sector_t last_request_pos;
 
 	pid_t pid;
+
+	/* weight-raising fields */
+	unsigned long wr_cur_max_time;
+	unsigned long last_wr_start_finish;
+	unsigned int wr_coeff;
 };
 
 /**
@@ -297,6 +306,18 @@ enum bfq_device_speed {
  *               they are charged for the whole allocated budget, to try
  *               to preserve a behavior reasonably fair among them, but
  *               without service-domain guarantees).
+ * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
+ *                queue is multiplied
+ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
+ * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
+ *			  may be reactivated for a queue (in jiffies)
+ * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
+ *				after which weight-raising may be
+ *				reactivated for an already busy queue
+ *				(in jiffies)
+ * @RT_prod: cached value of the product R*T used for computing the maximum
+ *	     duration of the weight raising automatically
+ * @device_speed: device-speed class for the low-latency heuristic
  * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
  *
  * All the fields are protected by the @queue lock.
@@ -346,6 +367,16 @@ struct bfq_data {
 	unsigned int bfq_max_budget_async_rq;
 	unsigned int bfq_timeout[2];
 
+	bool low_latency;
+
+	/* parameters of the low_latency heuristics */
+	unsigned int bfq_wr_coeff;
+	unsigned int bfq_wr_max_time;
+	unsigned int bfq_wr_min_idle_time;
+	unsigned long bfq_wr_min_inter_arr_async;
+	u64 RT_prod;
+	enum bfq_device_speed device_speed;
+
 	struct bfq_queue oom_bfqq;
 };
 
@@ -556,6 +587,8 @@ static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
 static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 				       struct bfq_group *bfqg, int is_sync,
 				       struct bfq_io_cq *bic, gfp_t gfp_mask);
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
+				    struct bfq_group *bfqg);
 static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 06/12] block, bfq: improve responsiveness
  2014-05-29  9:05           ` Paolo Valente
  (?)
  (?)
@ 2014-05-29  9:05           ` Paolo Valente
       [not found]             ` <1401354343-5527-7-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
  -1 siblings, 1 reply; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

This patch introduces a simple heuristic to load applications quickly,
and to perform the I/O requested by interactive applications just as
quickly. To this end, both a newly-created queue and a queue
associated with an interactive application (we explain in a moment how
BFQ decides whether the associated application is interactive)
receive the following three special treatments:

1) The weight of the queue is raised (see the sketch after this list).

2) The queue unconditionally enjoys device idling when it empties; in
fact, if the requests of a queue are sync, then performing device
idling for the queue is a necessary condition to guarantee that the
queue receives a fraction of the throughput proportional to its weight
(see [1] for details).

3) The device-idling timeout is larger for the queue. This reduces the
probability that the queue is expired because its next request does
not arrive in time.
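
To give a sense of what treatment 1) amounts to, here is a minimal,
hypothetical sketch (not part of the patch): with the default raising
coefficient of 20 set by this patch, and with the raised weight being
orig_weight * wr_coeff as in the bfq-sched.c hunk below, a raised
queue competing against four non-raised queues of the same original
weight gets about 83% of the throughput instead of 20%:

  #include <stdio.h>

  int main(void)
  {
      /* hypothetical weights; 20 is the default bfq_wr_coeff in the patch */
      unsigned int orig_weight = 100, wr_coeff = 20, nr_other_queues = 4;
      unsigned int raised_weight = orig_weight * wr_coeff;
      unsigned int wsum = raised_weight + nr_other_queues * orig_weight;

      printf("throughput share of the raised queue: %.1f%%\n",
             100.0 * raised_weight / wsum);
      return 0;
  }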

For brevity, we refer to the combination of these three preferential
treatments simply as weight-raising. For a newly-created queue,
weight-raising starts immediately and lasts for a time interval that:
1) depends on the device speed and type (rotational or
non-rotational), and 2) is equal to the time needed to load (start up)
a large-size application on that device, with cold caches and with no
additional workload.
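
Concretely, the duration is computed as (R / r) * T, where R and T are
reference values for the device class and r is the estimated peak rate
of the device at hand. The standalone sketch below is only
illustrative (not part of the patch): it uses the patch's reference
values for a fast rotational disk, a hypothetical measured peak rate,
and assumes HZ=100 for the jiffies conversion:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      uint64_t R = 17415;  /* R_fast[0], in sectors/usec << 16 */
      uint64_t T = 550;    /* T_fast[0] = 5500 ms, i.e. 550 jiffies at HZ=100 */
      uint64_t r = 12000;  /* hypothetical measured peak rate, same unit as R */
      uint64_t duration = R * T / r;  /* (R / r) * T */

      printf("weight-raising duration: %llu jiffies (%llu ms at HZ=100)\n",
             (unsigned long long)duration,
             (unsigned long long)(duration * 10));
      return 0;
  }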

Finally, as for guaranteeing fast execution of interactive,
I/O-related tasks (such as opening a file), consider that any
interactive application blocks and waits for user input both after
starting up and after executing some task. After a while, the user may
trigger new operations, after which the application stops again, and
so on. Accordingly, the low-latency heuristic weight-raises a queue
again if the queue becomes backlogged after being idle for a
sufficiently long (configurable) time. The weight-raising then lasts
for the same time as for a just-created queue.
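
The "idle for a sufficiently long time" test reduces to a plain
jiffies comparison. The sketch below is a hedged illustration (plain C
instead of the kernel time macros, hypothetical values); budget_timeout
is overloaded by the patch to record the instant at which the queue
was last left with no backlog:

  #include <stdio.h>

  static int idle_for_long_time(unsigned long now,
                                unsigned long budget_timeout,
                                unsigned long wr_min_idle_time)
  {
      /* true if the queue emptied at least wr_min_idle_time jiffies ago */
      return now - budget_timeout >= wr_min_idle_time;
  }

  int main(void)
  {
      /* default bfq_wr_min_idle_time is 2000 ms, i.e. 500 jiffies at HZ=250 */
      printf("%d\n", idle_for_long_time(10600, 10000, 500)); /* 1: raise again */
      printf("%d\n", idle_for_long_time(10300, 10000, 500)); /* 0: too soon */
      return 0;
  }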

According to our experiments, the combination of this low-latency
heuristic and of the improvements described in patch 7 allows BFQ
to guarantee high application responsiveness.

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/Kconfig.iosched |   8 +-
 block/bfq-cgroup.c    |  15 +++
 block/bfq-iosched.c   | 355 ++++++++++++++++++++++++++++++++++++++++++++++----
 block/bfq-sched.c     |   5 +-
 block/bfq.h           |  33 +++++
 5 files changed, 386 insertions(+), 30 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a3675cb..3e26f28 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,8 +45,9 @@ config IOSCHED_BFQ
 	---help---
 	  The BFQ I/O scheduler tries to distribute bandwidth among all
 	  processes according to their weights.
-	  It aims at distributing the bandwidth as desired, regardless
-	  of the disk parameters and with any workload. If compiled
+	  It aims at distributing the bandwidth as desired, regardless of
+	  the device parameters and with any workload. It also tries to
+	  guarantee a low latency to interactive applications. If compiled
 	  built-in (saying Y here), BFQ can be configured to support
 	  hierarchical scheduling.
 
@@ -79,7 +80,8 @@ choice
 		  used by default for all block devices.
 		  The BFQ I/O scheduler aims at distributing the bandwidth
 		  as desired, regardless of the disk parameters and with
-		  any workload.
+		  any workload. It also tries to guarantee a low latency to
+		  interactive applications.
 
 	config DEFAULT_NOOP
 		bool "No-op"
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 805fe5e..1cb25aa 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -525,6 +525,16 @@ static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg)
 	kfree(bfqg);
 }
 
+static void bfq_end_wr_async(struct bfq_data *bfqd)
+{
+	struct hlist_node *tmp;
+	struct bfq_group *bfqg;
+
+	hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node)
+		bfq_end_wr_async_queues(bfqd, bfqg);
+	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+}
+
 /**
  * bfq_disconnect_groups - disconnect @bfqd from all its groups.
  * @bfqd: the device descriptor being exited.
@@ -866,6 +876,11 @@ static inline void bfq_bfqq_move(struct bfq_data *bfqd,
 {
 }
 
+static void bfq_end_wr_async(struct bfq_data *bfqd)
+{
+	bfq_end_wr_async_queues(bfqd, bfqd->root_group);
+}
+
 static inline void bfq_disconnect_groups(struct bfq_data *bfqd)
 {
 	bfq_put_async_queues(bfqd, bfqd->root_group);
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 9e607a0..ace9aba 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -24,15 +24,15 @@
  * precisely, BFQ schedules queues associated to processes. Thanks to the
  * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
  * I/O-bound processes issuing sequential requests (to boost the
- * throughput), and yet guarantee a relatively low latency to interactive
- * applications.
+ * throughput), and yet guarantee a low latency to interactive applications.
  *
  * BFQ is described in [1], where also a reference to the initial, more
  * theoretical paper on BFQ can be found. The interested reader can find
  * in the latter paper full details on the main algorithm, as well as
  * formulas of the guarantees and formal proofs of all the properties.
  * With respect to the version of BFQ presented in these papers, this
- * implementation adds a hierarchical extension based on H-WF2Q+.
+ * implementation adds a few more heuristics and a hierarchical extension
+ * based on H-WF2Q+.
  *
  * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
  * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
@@ -116,6 +116,48 @@ struct kmem_cache *bfq_pool;
 /* Shift used for peak rate fixed precision calculations. */
 #define BFQ_RATE_SHIFT		16
 
+/*
+ * By default, BFQ computes the duration of the weight raising for
+ * interactive applications automatically, using the following formula:
+ * duration = (R / r) * T, where r is the peak rate of the device, and
+ * R and T are two reference parameters.
+ * In particular, R is the peak rate of the reference device (see below),
+ * and T is a reference time: given the systems that are likely to be
+ * installed on the reference device according to its speed class, T is
+ * about the maximum time needed, under BFQ and while reading two files in
+ * parallel, to load typical large applications on these systems.
+ * In practice, the slower/faster the device at hand is, the more/less it
+ * takes to load applications with respect to the reference device.
+ * Accordingly, the longer/shorter BFQ grants weight raising to interactive
+ * applications.
+ *
+ * BFQ uses four different reference pairs (R, T), depending on:
+ * . whether the device is rotational or non-rotational;
+ * . whether the device is slow, such as old or portable HDDs, as well as
+ *   SD cards, or fast, such as newer HDDs and SSDs.
+ *
+ * The device's speed class is dynamically (re)detected in
+ * bfq_update_peak_rate() every time the estimated peak rate is updated.
+ *
+ * In the following definitions, R_slow[0]/R_fast[0] and T_slow[0]/T_fast[0]
+ * are the reference values for a slow/fast rotational device, whereas
+ * R_slow[1]/R_fast[1] and T_slow[1]/T_fast[1] are the reference values for
+ * a slow/fast non-rotational device. Finally, device_speed_thresh are the
+ * thresholds used to switch between speed classes.
+ * Both the reference peak rates and the thresholds are measured in
+ * sectors/usec, left-shifted by BFQ_RATE_SHIFT.
+ */
+static int R_slow[2] = {1536, 10752};
+static int R_fast[2] = {17415, 34791};
+/*
+ * To improve readability, a conversion function is used to initialize the
+ * following arrays, which entails that they can be initialized only in a
+ * function.
+ */
+static int T_slow[2];
+static int T_fast[2];
+static int device_speed_thresh[2];
+
 #define BFQ_SERVICE_TREE_INIT	((struct bfq_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
@@ -281,7 +323,8 @@ static inline unsigned long bfq_serv_to_charge(struct request *rq,
 					       struct bfq_queue *bfqq)
 {
 	return blk_rq_sectors(rq) *
-		(1 + ((!bfq_bfqq_sync(bfqq)) * bfq_async_charge_factor));
+		(1 + ((!bfq_bfqq_sync(bfqq)) * (bfqq->wr_coeff == 1) *
+		bfq_async_charge_factor));
 }
 
 /**
@@ -322,12 +365,27 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
 	}
 }
 
+static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
+{
+	u64 dur;
+
+	if (bfqd->bfq_wr_max_time > 0)
+		return bfqd->bfq_wr_max_time;
+
+	dur = bfqd->RT_prod;
+	do_div(dur, bfqd->peak_rate);
+
+	return dur;
+}
+
 static void bfq_add_request(struct request *rq)
 {
 	struct bfq_queue *bfqq = RQ_BFQQ(rq);
 	struct bfq_entity *entity = &bfqq->entity;
 	struct bfq_data *bfqd = bfqq->bfqd;
 	struct request *next_rq, *prev;
+	unsigned long old_wr_coeff = bfqq->wr_coeff;
+	int idle_for_long_time = 0;
 
 	bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
 	bfqq->queued[rq_is_sync(rq)]++;
@@ -343,13 +401,64 @@ static void bfq_add_request(struct request *rq)
 	bfqq->next_rq = next_rq;
 
 	if (!bfq_bfqq_busy(bfqq)) {
+		idle_for_long_time = time_is_before_jiffies(
+			bfqq->budget_timeout +
+			bfqd->bfq_wr_min_idle_time);
 		entity->budget = max_t(unsigned long, bfqq->max_budget,
 				       bfq_serv_to_charge(next_rq, bfqq));
+
+		if (!bfqd->low_latency)
+			goto add_bfqq_busy;
+
+		/*
+		 * If the queue is not being boosted and has been idle for
+		 * enough time, start a weight-raising period.
+		 */
+		if (old_wr_coeff == 1 && idle_for_long_time) {
+			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			bfq_log_bfqq(bfqd, bfqq,
+				     "wrais starting at %lu, rais_max_time %u",
+				     jiffies,
+				     jiffies_to_msecs(bfqq->wr_cur_max_time));
+		} else if (old_wr_coeff > 1) {
+			if (idle_for_long_time)
+				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			else {
+				bfqq->wr_coeff = 1;
+				bfq_log_bfqq(bfqd, bfqq,
+					"wrais ending at %lu, rais_max_time %u",
+					jiffies,
+					jiffies_to_msecs(bfqq->
+						wr_cur_max_time));
+			}
+		}
+		if (old_wr_coeff != bfqq->wr_coeff)
+			entity->ioprio_changed = 1;
+add_bfqq_busy:
 		bfq_add_bfqq_busy(bfqd, bfqq);
 	} else {
+		if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
+		    time_is_before_jiffies(
+				bfqq->last_wr_start_finish +
+				bfqd->bfq_wr_min_inter_arr_async)) {
+			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+
+			entity->ioprio_changed = 1;
+			bfq_log_bfqq(bfqd, bfqq,
+			    "non-idle wrais starting at %lu, rais_max_time %u",
+			    jiffies,
+			    jiffies_to_msecs(bfqq->wr_cur_max_time));
+		}
 		if (prev != bfqq->next_rq)
 			bfq_updated_next_req(bfqd, bfqq);
 	}
+
+	if (bfqd->low_latency &&
+		(old_wr_coeff == 1 || bfqq->wr_coeff == 1 ||
+		 idle_for_long_time))
+		bfqq->last_wr_start_finish = jiffies;
 }
 
 static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
@@ -477,6 +586,43 @@ static void bfq_merged_requests(struct request_queue *q, struct request *rq,
 	bfq_remove_request(next);
 }
 
+/* Must be called with bfqq != NULL */
+static inline void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
+{
+	bfqq->wr_coeff = 1;
+	bfqq->wr_cur_max_time = 0;
+	/* Trigger a weight change on the next activation of the queue */
+	bfqq->entity.ioprio_changed = 1;
+}
+
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
+				    struct bfq_group *bfqg)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			if (bfqg->async_bfqq[i][j] != NULL)
+				bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]);
+	if (bfqg->async_idle_bfqq != NULL)
+		bfq_bfqq_end_wr(bfqg->async_idle_bfqq);
+}
+
+static void bfq_end_wr(struct bfq_data *bfqd)
+{
+	struct bfq_queue *bfqq;
+
+	spin_lock_irq(bfqd->queue->queue_lock);
+
+	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
+		bfq_bfqq_end_wr(bfqq);
+	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
+		bfq_bfqq_end_wr(bfqq);
+	bfq_end_wr_async(bfqd);
+
+	spin_unlock_irq(bfqd->queue->queue_lock);
+}
+
 static int bfq_allow_merge(struct request_queue *q, struct request *rq,
 			   struct bio *bio)
 {
@@ -582,14 +728,17 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 	 */
 	sl = bfqd->bfq_slice_idle;
 	/*
-	 * Grant only minimum idle time if the queue either has been seeky for
-	 * long enough or has already proved to be constantly seeky.
+	 * Unless the queue is being weight-raised, grant only minimum idle
+	 * time if the queue either has been seeky for long enough or has
+	 * already proved to be constantly seeky.
 	 */
 	if (bfq_sample_valid(bfqq->seek_samples) &&
 	    ((BFQQ_SEEKY(bfqq) && bfqq->entity.service >
 				  bfq_max_budget(bfqq->bfqd) / 8) ||
-	      bfq_bfqq_constantly_seeky(bfqq)))
+	      bfq_bfqq_constantly_seeky(bfqq)) && bfqq->wr_coeff == 1)
 		sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
+	else if (bfqq->wr_coeff > 1)
+		sl = sl * 3;
 	bfqd->last_idling_start = ktime_get();
 	mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
 	bfq_log(bfqd, "arm idle: %u/%u ms",
@@ -677,9 +826,15 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	__bfq_bfqd_reset_in_service(bfqd);
 
-	if (RB_EMPTY_ROOT(&bfqq->sort_list))
+	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		/*
+		 * Overloading budget_timeout field to store the time
+		 * at which the queue remains with no backlog; used by
+		 * the weight-raising mechanism.
+		 */
+		bfqq->budget_timeout = jiffies;
 		bfq_del_bfqq_busy(bfqd, bfqq, 1);
-	else
+	} else
 		bfq_activate_bfqq(bfqd, bfqq);
 }
 
@@ -896,12 +1051,26 @@ static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 			bfqd->peak_rate_samples++;
 
 		if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
-		    update && bfqd->bfq_user_max_budget == 0) {
-			bfqd->bfq_max_budget =
-				bfq_calc_max_budget(bfqd->peak_rate,
-						    timeout);
-			bfq_log(bfqd, "new max_budget=%lu",
-				bfqd->bfq_max_budget);
+		    update) {
+			int dev_type = blk_queue_nonrot(bfqd->queue);
+			if (bfqd->bfq_user_max_budget == 0) {
+				bfqd->bfq_max_budget =
+					bfq_calc_max_budget(bfqd->peak_rate,
+							    timeout);
+				bfq_log(bfqd, "new max_budget=%lu",
+					bfqd->bfq_max_budget);
+			}
+			if (bfqd->device_speed == BFQ_BFQD_FAST &&
+			    bfqd->peak_rate < device_speed_thresh[dev_type]) {
+				bfqd->device_speed = BFQ_BFQD_SLOW;
+				bfqd->RT_prod = R_slow[dev_type] *
+						T_slow[dev_type];
+			} else if (bfqd->device_speed == BFQ_BFQD_SLOW &&
+			    bfqd->peak_rate > device_speed_thresh[dev_type]) {
+				bfqd->device_speed = BFQ_BFQD_FAST;
+				bfqd->RT_prod = R_fast[dev_type] *
+						T_fast[dev_type];
+			}
 		}
 	}
 
@@ -996,6 +1165,9 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
 		bfq_mark_bfqq_constantly_seeky(bfqq);
 
+	if (bfqd->low_latency && bfqq->wr_coeff == 1)
+		bfqq->last_wr_start_finish = jiffies;
+
 	bfq_log_bfqq(bfqd, bfqq,
 		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -1044,21 +1216,36 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
 }
 
 /*
- * Device idling is allowed only for sync queues that have a non-null
- * idle window.
+ * Device idling is allowed only for the queues for which this function
+ * returns true. For this reason, the return value of this function plays a
+ * critical role for both throughput boosting and service guarantees.
+ *
+ * The return value is computed through a logical expression, which may
+ * be true only if bfqq is sync and at least one of the following two
+ * conditions holds:
+ * - the queue has a non-null idle window;
+ * - the queue is being weight-raised.
+ * In fact, waiting for a new request for the queue, in the first case,
+ * is likely to boost the disk throughput, whereas, in the second case,
+ * is necessary to preserve fairness and latency guarantees
+ * (see [1] for details).
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
-	return bfq_bfqq_sync(bfqq) && bfq_bfqq_idle_window(bfqq);
+	return bfq_bfqq_sync(bfqq) &&
+	       (bfqq->wr_coeff > 1 || bfq_bfqq_idle_window(bfqq));
 }
 
 /*
- * If the in-service queue is empty, but it is sync and the queue has its
- * idle window set (in this case, waiting for a new request for the queue
- * is likely to boost the throughput), then:
+ * If the in-service queue is empty but sync, and the function
+ * bfq_bfqq_must_not_expire returns true, then:
  * 1) the queue must remain in service and cannot be expired, and
  * 2) the disk must be idled to wait for the possible arrival of a new
  *    request for the queue.
+ * See the comments to the function bfq_bfqq_must_not_expire for the reasons
+ * why performing device idling is the best choice to boost the throughput
+ * and preserve service guarantees when bfq_bfqq_must_not_expire itself
+ * returns true.
  */
 static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
 {
@@ -1148,6 +1335,38 @@ keep_queue:
 	return bfqq;
 }
 
+static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct bfq_entity *entity = &bfqq->entity;
+	if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */
+		bfq_log_bfqq(bfqd, bfqq,
+			"raising period dur %u/%u msec, old coeff %u, w %d(%d)",
+			jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+			jiffies_to_msecs(bfqq->wr_cur_max_time),
+			bfqq->wr_coeff,
+			bfqq->entity.weight, bfqq->entity.orig_weight);
+
+		/*
+		 * If too much time has elapsed from the beginning
+		 * of this weight-raising period, stop it.
+		 */
+		if (time_is_before_jiffies(bfqq->last_wr_start_finish +
+					   bfqq->wr_cur_max_time)) {
+			bfqq->last_wr_start_finish = jiffies;
+			bfq_log_bfqq(bfqd, bfqq,
+				     "wrais ending at %lu, rais_max_time %u",
+				     bfqq->last_wr_start_finish,
+				     jiffies_to_msecs(bfqq->wr_cur_max_time));
+			bfq_bfqq_end_wr(bfqq);
+		}
+	}
+	/* Update weight both if it must be raised and if it must be lowered */
+	if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))
+		__bfq_entity_update_weight_prio(
+			bfq_entity_service_tree(entity),
+			entity);
+}
+
 /*
  * Dispatch one request from bfqq, moving it to the request queue
  * dispatch list.
@@ -1194,6 +1413,8 @@ static int bfq_dispatch_request(struct bfq_data *bfqd,
 	bfq_bfqq_served(bfqq, service_to_charge);
 	bfq_dispatch_insert(bfqd->queue, rq);
 
+	bfq_update_wr_data(bfqd, bfqq);
+
 	bfq_log_bfqq(bfqd, bfqq,
 			"dispatched %u sec req (%llu), budg left %lu",
 			blk_rq_sectors(rq),
@@ -1467,6 +1688,9 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	/* Tentative initial value to trade off between thr and lat */
 	bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
 	bfqq->pid = pid;
+
+	bfqq->wr_coeff = 1;
+	bfqq->last_wr_start_finish = 0;
 }
 
 static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
@@ -1642,7 +1866,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
 		enable_idle = 0;
 	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
-		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle)
+		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
+			bfqq->wr_coeff == 1)
 			enable_idle = 0;
 		else
 			enable_idle = 1;
@@ -2117,6 +2342,22 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async;
 	bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync;
 
+	bfqd->low_latency = true;
+
+	bfqd->bfq_wr_coeff = 20;
+	bfqd->bfq_wr_max_time = 0;
+	bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
+	bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
+
+	/*
+	 * Begin by assuming, optimistically, that the device peak rate is
+	 * equal to the highest reference rate.
+	 */
+	bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] *
+			T_fast[blk_queue_nonrot(bfqd->queue)];
+	bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)];
+	bfqd->device_speed = BFQ_BFQD_FAST;
+
 	return 0;
 }
 
@@ -2151,6 +2392,14 @@ static ssize_t bfq_var_store(unsigned long *var, const char *page,
 	return count;
 }
 
+static ssize_t bfq_wr_max_time_show(struct elevator_queue *e, char *page)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	return sprintf(page, "%d\n", bfqd->bfq_wr_max_time > 0 ?
+		       jiffies_to_msecs(bfqd->bfq_wr_max_time) :
+		       jiffies_to_msecs(bfq_wr_duration(bfqd)));
+}
+
 static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
 {
 	struct bfq_queue *bfqq;
@@ -2165,19 +2414,24 @@ static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
 	num_char += sprintf(page + num_char, "Active:\n");
 	list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
 	  num_char += sprintf(page + num_char,
-			      "pid%d: weight %hu, nr_queued %d %d\n",
+			      "pid%d: weight %hu, nr_queued %d %d, dur %d/%u\n",
 			      bfqq->pid,
 			      bfqq->entity.weight,
 			      bfqq->queued[0],
-			      bfqq->queued[1]);
+			      bfqq->queued[1],
+			jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+			jiffies_to_msecs(bfqq->wr_cur_max_time));
 	}
 
 	num_char += sprintf(page + num_char, "Idle:\n");
 	list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
 			num_char += sprintf(page + num_char,
-				"pid%d: weight %hu\n",
+				"pid%d: weight %hu, dur %d/%u\n",
 				bfqq->pid,
-				bfqq->entity.weight);
+				bfqq->entity.weight,
+				jiffies_to_msecs(jiffies -
+					bfqq->last_wr_start_finish),
+				jiffies_to_msecs(bfqq->wr_cur_max_time));
 	}
 
 	spin_unlock_irq(bfqd->queue->queue_lock);
@@ -2205,6 +2459,11 @@ SHOW_FUNCTION(bfq_max_budget_async_rq_show,
 	      bfqd->bfq_max_budget_async_rq, 0);
 SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
 SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
+SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
+SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
+SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
+SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
+	1);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2237,6 +2496,12 @@ STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq,
 		1, INT_MAX, 0);
 STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
 		INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
+STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
+		INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
+		&bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
 #undef STORE_FUNCTION
 
 /* do nothing for the moment */
@@ -2295,6 +2560,22 @@ static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
 	return ret;
 }
 
+static ssize_t bfq_low_latency_store(struct elevator_queue *e,
+				     const char *page, size_t count)
+{
+	struct bfq_data *bfqd = e->elevator_data;
+	unsigned long uninitialized_var(__data);
+	int ret = bfq_var_store(&__data, (page), count);
+
+	if (__data > 1)
+		__data = 1;
+	if (__data == 0 && bfqd->low_latency != 0)
+		bfq_end_wr(bfqd);
+	bfqd->low_latency = __data;
+
+	return ret;
+}
+
 #define BFQ_ATTR(name) \
 	__ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
 
@@ -2309,6 +2590,11 @@ static struct elv_fs_entry bfq_attrs[] = {
 	BFQ_ATTR(max_budget_async_rq),
 	BFQ_ATTR(timeout_sync),
 	BFQ_ATTR(timeout_async),
+	BFQ_ATTR(low_latency),
+	BFQ_ATTR(wr_coeff),
+	BFQ_ATTR(wr_max_time),
+	BFQ_ATTR(wr_min_idle_time),
+	BFQ_ATTR(wr_min_inter_arr_async),
 	BFQ_ATTR(weights),
 	__ATTR_NULL
 };
@@ -2355,6 +2641,23 @@ static int __init bfq_init(void)
 	if (bfq_slab_setup())
 		return -ENOMEM;
 
+	/*
+	 * Times to load large popular applications for the typical systems
+	 * installed on the reference devices (see the comments before the
+	 * definitions of the two arrays).
+	 */
+	T_slow[0] = msecs_to_jiffies(2600);
+	T_slow[1] = msecs_to_jiffies(1000);
+	T_fast[0] = msecs_to_jiffies(5500);
+	T_fast[1] = msecs_to_jiffies(2000);
+
+	/*
+	 * Thresholds that determine the switch between speed classes (see
+	 * the comments before the definition of the array).
+	 */
+	device_speed_thresh[0] = (R_fast[0] + R_slow[0]) / 2;
+	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
+
 	elv_register(&iosched_bfq);
 	pr_info("BFQ I/O-scheduler version: v1");
 
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 075e472..f6491d5 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -514,6 +514,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 	struct bfq_service_tree *new_st = old_st;
 
 	if (entity->ioprio_changed) {
+		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+
 		old_st->wsum -= entity->weight;
 
 		if (entity->new_weight != entity->orig_weight) {
@@ -539,7 +541,8 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		 * when entity->finish <= old_st->vtime).
 		 */
 		new_st = bfq_entity_service_tree(entity);
-		entity->weight = entity->orig_weight;
+		entity->weight = entity->orig_weight *
+				 (bfqq != NULL ? bfqq->wr_coeff : 1);
 		new_st->wsum += entity->weight;
 
 		if (new_st != old_st)
diff --git a/block/bfq.h b/block/bfq.h
index ea5ecca..3ce9100 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -181,6 +181,10 @@ struct bfq_group;
  * @seek_mean: mean seek distance
  * @last_request_pos: position of the last request enqueued
  * @pid: pid of the process owning the queue, used for logging purposes.
+ * @last_wr_start_finish: start time of the current weight-raising period if
+ *                        the @bfq-queue is being weight-raised, otherwise
+ *                        finish time of the last weight-raising period
+ * @wr_cur_max_time: current max raising time for this queue
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
  * io_context or more, if it is async. @cgroup holds a reference to the
@@ -217,6 +221,11 @@ struct bfq_queue {
 	sector_t last_request_pos;
 
 	pid_t pid;
+
+	/* weight-raising fields */
+	unsigned long wr_cur_max_time;
+	unsigned long last_wr_start_finish;
+	unsigned int wr_coeff;
 };
 
 /**
@@ -297,6 +306,18 @@ enum bfq_device_speed {
  *               they are charged for the whole allocated budget, to try
  *               to preserve a behavior reasonably fair among them, but
  *               without service-domain guarantees).
+ * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
+ *                queue is multiplied
+ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
+ * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
+ *			  may be reactivated for a queue (in jiffies)
+ * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
+ *				after which weight-raising may be
+ *				reactivated for an already busy queue
+ *				(in jiffies)
+ * @RT_prod: cached value of the product R*T used for computing the maximum
+ *	     duration of the weight raising automatically
+ * @device_speed: device-speed class for the low-latency heuristic
  * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
  *
  * All the fields are protected by the @queue lock.
@@ -346,6 +367,16 @@ struct bfq_data {
 	unsigned int bfq_max_budget_async_rq;
 	unsigned int bfq_timeout[2];
 
+	bool low_latency;
+
+	/* parameters of the low_latency heuristics */
+	unsigned int bfq_wr_coeff;
+	unsigned int bfq_wr_max_time;
+	unsigned int bfq_wr_min_idle_time;
+	unsigned long bfq_wr_min_inter_arr_async;
+	u64 RT_prod;
+	enum bfq_device_speed device_speed;
+
 	struct bfq_queue oom_bfqq;
 };
 
@@ -556,6 +587,8 @@ static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
 static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 				       struct bfq_group *bfqg, int is_sync,
 				       struct bfq_io_cq *bic, gfp_t gfp_mask);
+static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
+				    struct bfq_group *bfqg);
 static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
 
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 07/12] block, bfq: reduce I/O latency for soft real-time applications
  2014-05-29  9:05           ` Paolo Valente
@ 2014-05-29  9:05               ` Paolo Valente
  -1 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

To guarantee a low latency also to the I/O requests issued by soft
real-time applications, this patch introduces a further heuristic,
which weight-raises (in the sense explained in the previous patch)
also the queues associated with applications deemed as soft real-time.

To be deemed as soft real-time, an application must meet two
requirements.  First, the application must not require an average
bandwidth higher than the approximate bandwidth required to play back
or record a compressed high-definition video. Second, the request
pattern of the application must be isochronous, i.e., after issuing a
request or a batch of requests, the application must stop issuing new
requests until all its pending requests have been completed. After
that, the application may issue a new batch, and so on.
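
As a back-of-the-envelope check (not part of the patch) of what the
first requirement amounts to: with the default maximum soft real-time
rate set later in this patch (7000 sectors/s) and assuming 512-byte
sectors, the bound works out to roughly 3.6 MB/s, i.e. about 29 Mbit/s,
which is indeed in the range of compressed high-definition video:

  #include <stdio.h>

  int main(void)
  {
      unsigned long rate_sectors = 7000;  /* default bfq_wr_max_softrt_rate */
      unsigned long sector_bytes = 512;   /* assumed sector size */
      double mb_per_s = rate_sectors * sector_bytes / 1e6;

      printf("%.1f MB/s, about %.0f Mbit/s\n", mb_per_s, mb_per_s * 8);
      return 0;
  }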

As for the second requirement, it is also critical to require that,
after all the pending requests of the application have been completed,
an adequate minimum amount of time elapses before the application
starts issuing new requests. This also prevents greedy (i.e.,
I/O-bound) applications from being incorrectly deemed, occasionally,
as soft real-time. In fact, if *any amount of time* is fine, then even
a greedy application may, paradoxically, meet both the above
requirements, if: (1) the application performs random I/O and/or the
device is slow, and (2) the CPU load is high. The reason is the
following.  First, if condition (1) is true, then, during the service
of the application, the throughput may be low enough to let the
application meet the bandwidth requirement.  Second, if condition (2)
is true as well, then the application may occasionally behave in an
apparently isochronous way, because it may simply stop issuing
requests while the CPUs are busy serving other processes.

To address this issue, the heuristic leverages the simple fact that
greedy applications issue *all* of their requests as quickly as they
can, whereas soft real-time applications spend some time processing
data after each batch of requests is completed. In particular, the
heuristic works as follows. First, according to the above isochrony
requirement, the heuristic checks whether an application may be soft
real-time (and therefore gives the application the opportunity to be
deemed as such) only when both of the following conditions hold:
1) the queue associated with the application has expired and is empty,
and 2) the application has no outstanding requests.

Suppose that both conditions hold at time, say, t_c, and that the
application then issues its next request at time, say, t_i. At time
t_c the heuristic computes the next time instant, called
soft_rt_next_start in the code, such that, only if
t_i >= soft_rt_next_start, both of the following conditions will hold
when the application issues its next request:
1) the application will meet the above bandwidth requirement,
2) a given minimum time interval, say Delta, will have elapsed from
time t_c (so as to filter out greedy applications).

The current value of Delta is slightly higher than the value that we
found, experimentally, to be adequate on a real, general-purpose
machine. In particular, we had to increase Delta to make the filter
precise enough also on slower, embedded systems and in KVM/QEMU
virtual machines (details in the comments to the code).
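
The following standalone sketch (hypothetical values, kernel time
macros replaced by plain arithmetic) illustrates how soft_rt_next_start
combines the bandwidth bound with the Delta filter, mirroring
bfq_bfqq_softrt_next_start() in the diff below:

  #include <stdio.h>

  int main(void)
  {
      unsigned long HZ = 250;                        /* assumed tick rate */
      unsigned long now = 100000;                    /* current jiffies */
      unsigned long last_idle_bklogged = 99500;      /* idle -> backlogged */
      unsigned long service_from_backlogged = 2800;  /* sectors served since */
      unsigned long max_softrt_rate = 7000;          /* sectors/s bound */
      unsigned long slice_idle = 2;                  /* bfq_slice_idle, jiffies */

      /* earliest instant keeping the average rate within the bound */
      unsigned long bw_bound = last_idle_bklogged +
              HZ * service_from_backlogged / max_softrt_rate;
      /* Delta: filter against greedy applications */
      unsigned long greedy_bound = now + slice_idle + 4;

      unsigned long next_start = bw_bound > greedy_bound ?
              bw_bound : greedy_bound;

      printf("soft_rt_next_start = %lu (bw bound %lu, greedy bound %lu)\n",
             next_start, bw_bound, greedy_bound);
      return 0;
  }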

If the application actually issues its next request after time
soft_rt_next_start, then its associated queue will be weight-raised
for a relatively short time interval. If, during this time interval,
the application proves again to meet the bandwidth and isochrony
requirements, then the end of the weight-raising period for the queue
is moved forward, and so on. Note that an application whose associated
queue never happens to be empty when it expires will never have the
opportunity to be deemed as soft real-time.

Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/Kconfig.iosched |   8 +-
 block/bfq-iosched.c   | 231 ++++++++++++++++++++++++++++++++++++++++++++++++--
 block/bfq.h           |  24 ++++++
 3 files changed, 251 insertions(+), 12 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3e26f28..1d64eea 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -47,9 +47,9 @@ config IOSCHED_BFQ
 	  processes according to their weights.
 	  It aims at distributing the bandwidth as desired, regardless of
 	  the device parameters and with any workload. It also tries to
-	  guarantee a low latency to interactive applications. If compiled
-	  built-in (saying Y here), BFQ can be configured to support
-	  hierarchical scheduling.
+	  guarantee low latency to interactive and soft real-time
+	  applications. If compiled built-in (saying Y here), BFQ can
+	  be configured to support hierarchical scheduling.
 
 config CGROUP_BFQIO
 	bool "BFQ hierarchical scheduling support"
@@ -81,7 +81,7 @@ choice
 		  The BFQ I/O scheduler aims at distributing the bandwidth
 		  as desired, regardless of the disk parameters and with
 		  any workload. It also tries to guarantee a low latency to
-		  interactive applications.
+		  interactive and soft real-time applications.
 
 	config DEFAULT_NOOP
 		bool "No-op"
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ace9aba..661f948 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -24,15 +24,17 @@
  * precisely, BFQ schedules queues associated to processes. Thanks to the
  * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
  * I/O-bound processes issuing sequential requests (to boost the
- * throughput), and yet guarantee a low latency to interactive applications.
+ * throughput), and yet guarantee a low latency to interactive and soft
+ * real-time applications.
  *
  * BFQ is described in [1], where also a reference to the initial, more
  * theoretical paper on BFQ can be found. The interested reader can find
  * in the latter paper full details on the main algorithm, as well as
  * formulas of the guarantees and formal proofs of all the properties.
  * With respect to the version of BFQ presented in these papers, this
- * implementation adds a few more heuristics and a hierarchical extension
- * based on H-WF2Q+.
+ * implementation adds a few more heuristics, such as the one that
+ * guarantees a low latency to soft real-time applications, and a
+ * hierarchical extension based on H-WF2Q+.
  *
  * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
  * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
@@ -401,6 +403,8 @@ static void bfq_add_request(struct request *rq)
 	bfqq->next_rq = next_rq;
 
 	if (!bfq_bfqq_busy(bfqq)) {
+		int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+			time_is_before_jiffies(bfqq->soft_rt_next_start);
 		idle_for_long_time = time_is_before_jiffies(
 			bfqq->budget_timeout +
 			bfqd->bfq_wr_min_idle_time);
@@ -414,9 +418,13 @@ static void bfq_add_request(struct request *rq)
 		 * If the queue is not being boosted and has been idle for
 		 * enough time, start a weight-raising period.
 		 */
-		if (old_wr_coeff == 1 && idle_for_long_time) {
+		if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
 			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
-			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			if (idle_for_long_time)
+				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			else
+				bfqq->wr_cur_max_time =
+					bfqd->bfq_wr_rt_max_time;
 			bfq_log_bfqq(bfqd, bfqq,
 				     "wrais starting at %lu, rais_max_time %u",
 				     jiffies,
@@ -424,18 +432,76 @@ static void bfq_add_request(struct request *rq)
 		} else if (old_wr_coeff > 1) {
 			if (idle_for_long_time)
 				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
-			else {
+			else if (bfqq->wr_cur_max_time ==
+				 bfqd->bfq_wr_rt_max_time &&
+				 !soft_rt) {
 				bfqq->wr_coeff = 1;
 				bfq_log_bfqq(bfqd, bfqq,
 					"wrais ending at %lu, rais_max_time %u",
 					jiffies,
 					jiffies_to_msecs(bfqq->
 						wr_cur_max_time));
+			} else if (time_before(
+					bfqq->last_wr_start_finish +
+					bfqq->wr_cur_max_time,
+					jiffies +
+					bfqd->bfq_wr_rt_max_time) &&
+				   soft_rt) {
+				/*
+				 *
+				 * The remaining weight-raising time is lower
+				 * than bfqd->bfq_wr_rt_max_time, which means
+				 * that the application is enjoying weight
+				 * raising either because deemed soft-rt in
+				 * the near past, or because deemed interactive
+				 * long ago.
+				 * In both cases, resetting now the current
+				 * remaining weight-raising time for the
+				 * application to the weight-raising duration
+				 * for soft rt applications would not cause any
+				 * latency increase for the application (as the
+				 * new duration would be higher than the
+				 * remaining time).
+				 *
+				 * In addition, the application is now meeting
+				 * the requirements for being deemed soft rt.
+				 * In the end we can correctly and safely
+				 * (re)charge the weight-raising duration for
+				 * the application with the weight-raising
+				 * duration for soft rt applications.
+				 *
+				 * In particular, doing this recharge now, i.e.,
+				 * before the weight-raising period for the
+				 * application finishes, reduces the probability
+				 * of the following negative scenario:
+				 * 1) the weight of a soft rt application is
+				 *    raised at startup (as for any newly
+				 *    created application),
+				 * 2) since the application is not interactive,
+				 *    at a certain time weight-raising is
+				 *    stopped for the application,
+				 * 3) at that time the application happens to
+				 *    still have pending requests, and hence
+				 *    is destined to not have a chance to be
+				 *    deemed soft rt before these requests are
+				 *    completed (see the comments to the
+				 *    function bfq_bfqq_softrt_next_start()
+				 *    for details on soft rt detection),
+				 * 4) these pending requests experience a high
+				 *    latency because the application is not
+				 *    weight-raised while they are pending.
+				 */
+				bfqq->last_wr_start_finish = jiffies;
+				bfqq->wr_cur_max_time =
+					bfqd->bfq_wr_rt_max_time;
 			}
 		}
 		if (old_wr_coeff != bfqq->wr_coeff)
 			entity->ioprio_changed = 1;
 add_bfqq_busy:
+		bfqq->last_idle_bklogged = jiffies;
+		bfqq->service_from_backlogged = 0;
+		bfq_clear_bfqq_softrt_update(bfqq);
 		bfq_add_bfqq_busy(bfqd, bfqq);
 	} else {
 		if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
@@ -753,8 +819,11 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 static void bfq_set_budget_timeout(struct bfq_data *bfqd)
 {
 	struct bfq_queue *bfqq = bfqd->in_service_queue;
-	unsigned int timeout_coeff = bfqq->entity.weight /
-				     bfqq->entity.orig_weight;
+	unsigned int timeout_coeff;
+	if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
+		timeout_coeff = 1;
+	else
+		timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
 
 	bfqd->last_budget_start = ktime_get();
 
@@ -1105,6 +1174,77 @@ static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	return expected > (4 * bfqq->entity.budget) / 3;
 }
 
+/*
+ * To be deemed as soft real-time, an application must meet two
+ * requirements. First, the application must not require an average
+ * bandwidth higher than the approximate bandwidth required to playback or
+ * record a compressed high-definition video.
+ * The next function is invoked on the completion of the last request of a
+ * batch, to compute the next-start time instant, soft_rt_next_start, such
+ * that, if the next request of the application does not arrive before
+ * soft_rt_next_start, then the above requirement on the bandwidth is met.
+ *
+ * The second requirement is that the request pattern of the application is
+ * isochronous, i.e., that, after issuing a request or a batch of requests,
+ * the application stops issuing new requests until all its pending requests
+ * have been completed. After that, the application may issue a new batch,
+ * and so on.
+ * For this reason the next function is invoked to compute
+ * soft_rt_next_start only for applications that meet this requirement,
+ * whereas soft_rt_next_start is set to infinity for applications that do
+ * not.
+ *
+ * Unfortunately, even a greedy application may happen to behave in an
+ * isochronous way if the CPU load is high. In fact, the application may
+ * stop issuing requests while the CPUs are busy serving other processes,
+ * then restart, then stop again for a while, and so on. In addition, if
+ * the disk achieves a low enough throughput with the request pattern
+ * issued by the application (e.g., because the request pattern is random
+ * and/or the device is slow), then the application may meet the above
+ * bandwidth requirement too. To prevent such a greedy application to be
+ * deemed as soft real-time, a further rule is used in the computation of
+ * soft_rt_next_start: soft_rt_next_start must be higher than the current
+ * time plus the maximum time for which the arrival of a request is waited
+ * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle.
+ * This filters out greedy applications, as the latter issue instead their
+ * next request as soon as possible after the last one has been completed
+ * (in contrast, when a batch of requests is completed, a soft real-time
+ * application spends some time processing data).
+ *
+ * Unfortunately, the last filter may easily generate false positives if
+ * only bfqd->bfq_slice_idle is used as a reference time interval and one
+ * or both the following cases occur:
+ * 1) HZ is so low that the duration of a jiffy is comparable to or higher
+ *    than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with
+ *    HZ=100.
+ * 2) jiffies, instead of increasing at a constant rate, may stop increasing
+ *    for a while, then suddenly 'jump' by several units to recover the lost
+ *    increments. This seems to happen, e.g., inside virtual machines.
+ * To address this issue, we do not use as a reference time interval just
+ * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In
+ * particular we add the minimum number of jiffies for which the filter
+ * seems to be quite precise also in embedded systems and KVM/QEMU virtual
+ * machines.
+ */
+static inline unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
+						       struct bfq_queue *bfqq)
+{
+	return max(bfqq->last_idle_bklogged +
+		   HZ * bfqq->service_from_backlogged /
+		   bfqd->bfq_wr_max_softrt_rate,
+		   jiffies + bfqq->bfqd->bfq_slice_idle + 4);
+}
+
+/*
+ * Return the largest-possible time instant such that, for as long as possible,
+ * the current time will be lower than this time instant according to the macro
+ * time_is_before_jiffies().
+ */
+static inline unsigned long bfq_infinity_from_now(unsigned long now)
+{
+	return now + ULONG_MAX / 2;
+}
+
 /**
  * bfq_bfqq_expire - expire a queue.
  * @bfqd: device owning the queue.
@@ -1162,12 +1302,55 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 		     bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3))
 		bfq_bfqq_charge_full_budget(bfqq);
 
+	bfqq->service_from_backlogged += bfqq->entity.service;
+
 	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
 		bfq_mark_bfqq_constantly_seeky(bfqq);
 
 	if (bfqd->low_latency && bfqq->wr_coeff == 1)
 		bfqq->last_wr_start_finish = jiffies;
 
+	if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&
+	    RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		/*
+		 * If we get here, and there are no outstanding requests,
+		 * then the request pattern is isochronous (see the comments
+		 * to the function bfq_bfqq_softrt_next_start()). Hence we
+		 * can compute soft_rt_next_start. If, instead, the queue
+		 * still has outstanding requests, then we have to wait
+		 * for the completion of all the outstanding requests to
+		 * discover whether the request pattern is actually
+		 * isochronous.
+		 */
+		if (bfqq->dispatched == 0)
+			bfqq->soft_rt_next_start =
+				bfq_bfqq_softrt_next_start(bfqd, bfqq);
+		else {
+			/*
+			 * The application is still waiting for the
+			 * completion of one or more requests:
+			 * prevent it from possibly being incorrectly
+			 * deemed as soft real-time by setting its
+			 * soft_rt_next_start to infinity. In fact,
+			 * without this assignment, the application
+			 * would be incorrectly deemed as soft
+			 * real-time if:
+			 * 1) it issued a new request before the
+			 *    completion of all its in-flight
+			 *    requests, and
+			 * 2) at that time, its soft_rt_next_start
+			 *    happened to be in the past.
+			 */
+			bfqq->soft_rt_next_start =
+				bfq_infinity_from_now(jiffies);
+			/*
+			 * Schedule an update of soft_rt_next_start to when
+			 * the task may be discovered to be isochronous.
+			 */
+			bfq_mark_bfqq_softrt_update(bfqq);
+		}
+	}
+
 	bfq_log_bfqq(bfqd, bfqq,
 		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -1691,6 +1874,11 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfqq->wr_coeff = 1;
 	bfqq->last_wr_start_finish = 0;
+	/*
+	 * Set to the value for which bfqq will not be deemed as
+	 * soft rt when it becomes backlogged.
+	 */
+	bfqq->soft_rt_next_start = bfq_infinity_from_now(jiffies);
 }
 
 static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
@@ -2019,6 +2207,18 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	}
 
 	/*
+	 * If we are waiting to discover whether the request pattern of the
+	 * task associated with the queue is actually isochronous, and
+	 * both requisites for this condition to hold are satisfied, then
+	 * compute soft_rt_next_start (see the comments to the function
+	 * bfq_bfqq_softrt_next_start()).
+	 */
+	if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
+	    RB_EMPTY_ROOT(&bfqq->sort_list))
+		bfqq->soft_rt_next_start =
+			bfq_bfqq_softrt_next_start(bfqd, bfqq);
+
+	/*
 	 * If this is the in-service queue, check if it needs to be expired,
 	 * or if we want to idle in case it has no pending requests.
 	 */
@@ -2345,9 +2545,16 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->low_latency = true;
 
 	bfqd->bfq_wr_coeff = 20;
+	bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300);
 	bfqd->bfq_wr_max_time = 0;
 	bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
 	bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
+	bfqd->bfq_wr_max_softrt_rate = 7000; /*
+					      * Approximate rate required
+					      * to playback or record a
+					      * high-definition compressed
+					      * video.
+					      */
 
 	/*
 	 * Begin by assuming, optimistically, that the device peak rate is
@@ -2461,9 +2668,11 @@ SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
 SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
 SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
 SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
+SHOW_FUNCTION(bfq_wr_rt_max_time_show, bfqd->bfq_wr_rt_max_time, 1);
 SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
 SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
 	1);
+SHOW_FUNCTION(bfq_wr_max_softrt_rate_show, bfqd->bfq_wr_max_softrt_rate, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2498,10 +2707,14 @@ STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
 		INT_MAX, 1);
 STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
 STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_rt_max_time_store, &bfqd->bfq_wr_rt_max_time, 0, INT_MAX,
+		1);
 STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
 		INT_MAX, 1);
 STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
 		&bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_max_softrt_rate_store, &bfqd->bfq_wr_max_softrt_rate, 0,
+		INT_MAX, 0);
 #undef STORE_FUNCTION
 
 /* do nothing for the moment */
@@ -2593,8 +2806,10 @@ static struct elv_fs_entry bfq_attrs[] = {
 	BFQ_ATTR(low_latency),
 	BFQ_ATTR(wr_coeff),
 	BFQ_ATTR(wr_max_time),
+	BFQ_ATTR(wr_rt_max_time),
 	BFQ_ATTR(wr_min_idle_time),
 	BFQ_ATTR(wr_min_inter_arr_async),
+	BFQ_ATTR(wr_max_softrt_rate),
 	BFQ_ATTR(weights),
 	__ATTR_NULL
 };
diff --git a/block/bfq.h b/block/bfq.h
index 3ce9100..5fa8b34 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -185,6 +185,17 @@ struct bfq_group;
  *                        the @bfq-queue is being weight-raised, otherwise
  *                        finish time of the last weight-raising period
  * @wr_cur_max_time: current max raising time for this queue
+ * @soft_rt_next_start: minimum time instant such that, only if a new
+ *                      request is enqueued after this time instant in an
+ *                      idle @bfq_queue with no outstanding requests, then
+ *                      the task associated with the queue is deemed as
+ *                      soft real-time (see the comments to the function
+ *                      bfq_bfqq_softrt_next_start())
+ * @last_idle_bklogged: time of the last transition of the @bfq_queue from
+ *                      idle to backlogged
+ * @service_from_backlogged: cumulative service received from the @bfq_queue
+ *                           since the last transition from idle to
+ *                           backlogged
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
  * io_context or more, if it is async. @cgroup holds a reference to the
@@ -224,8 +235,11 @@ struct bfq_queue {
 
 	/* weight-raising fields */
 	unsigned long wr_cur_max_time;
+	unsigned long soft_rt_next_start;
 	unsigned long last_wr_start_finish;
 	unsigned int wr_coeff;
+	unsigned long last_idle_bklogged;
+	unsigned long service_from_backlogged;
 };
 
 /**
@@ -309,12 +323,15 @@ enum bfq_device_speed {
  * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
  *                queue is multiplied
  * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
+ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes
  * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
  *			  may be reactivated for a queue (in jiffies)
  * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
  *				after which weight-raising may be
  *				reactivated for an already busy queue
  *				(in jiffies)
+ * @bfq_wr_max_softrt_rate: max service-rate for a soft real-time queue,
+ *			    sectors per second
  * @RT_prod: cached value of the product R*T used for computing the maximum
  *	     duration of the weight raising automatically
  * @device_speed: device-speed class for the low-latency heuristic
@@ -372,8 +389,10 @@ struct bfq_data {
 	/* parameters of the low_latency heuristics */
 	unsigned int bfq_wr_coeff;
 	unsigned int bfq_wr_max_time;
+	unsigned int bfq_wr_rt_max_time;
 	unsigned int bfq_wr_min_idle_time;
 	unsigned long bfq_wr_min_inter_arr_async;
+	unsigned int bfq_wr_max_softrt_rate;
 	u64 RT_prod;
 	enum bfq_device_speed device_speed;
 
@@ -393,6 +412,10 @@ enum bfqq_state_flags {
 					 * bfqq has proved to be slow and
 					 * seeky until budget timeout
 					 */
+	BFQ_BFQQ_FLAG_softrt_update,	/*
+					 * may need softrt-next-start
+					 * update
+					 */
 };
 
 #define BFQ_BFQQ_FNS(name)						\
@@ -418,6 +441,7 @@ BFQ_BFQQ_FNS(prio_changed);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(budget_new);
 BFQ_BFQQ_FNS(constantly_seeky);
+BFQ_BFQQ_FNS(softrt_update);
 #undef BFQ_BFQQ_FNS
 
 /* Logging facilities. */
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 07/12] block, bfq: reduce I/O latency for soft real-time applications
@ 2014-05-29  9:05               ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

To guarantee a low latency also to the I/O requests issued by soft
real-time applications, this patch introduces a further heuristic,
which weight-raises (in the sense explained in the previous patch)
also the queues associated with applications deemed as soft real-time.

To be deemed as soft real-time, an application must meet two
requirements.  First, the application must not require an average
bandwidth higher than the approximate bandwidth required to play back
or record a compressed high-definition video. Second, the request
pattern of the application must be isochronous, i.e., after issuing a
request or a batch of requests, the application must stop issuing new
requests until all its pending requests have been completed. After
that, the application may issue a new batch, and so on.

As for the second requirement, it is also critical to require that,
after all the pending requests of the application have been completed,
an adequate minimum amount of time elapses before the application
starts issuing new requests. This also prevents greedy (i.e.,
I/O-bound) applications from being incorrectly deemed, occasionally,
as soft real-time. In fact, if *any amount of time* is fine, then even
a greedy application may, paradoxically, meet both the above
requirements, if: (1) the application performs random I/O and/or the
device is slow, and (2) the CPU load is high. The reason is the
following.  First, if condition (1) is true, then, during the service
of the application, the throughput may be low enough to let the
application meet the bandwidth requirement.  Second, if condition (2)
is true as well, then the application may occasionally behave in an
apparently isochronous way, because it may simply stop issuing
requests while the CPUs are busy serving other processes.

To address this issue, the heuristic leverages the simple fact that
greedy applications issue *all* of their requests as quickly as they
can, whereas soft real-time applications spend some time processing
data after each batch of requests is completed. In particular, the
heuristic works as follows. First, according to the above isochrony
requirement, the heuristic checks whether an application may be soft
real-time (and therefore gives the application the opportunity to be
deemed as such) only when both the following two conditions happen to
hold: 1) the queue associated with the application has expired and is
empty, 2) the application has no outstanding requests.

Suppose that both conditions hold at time, say, t_c, and that the
application issues its next request at time, say, t_i. At time t_c
the heuristic computes the next time instant, called
soft_rt_next_start in the code, such that, only if
t_i >= soft_rt_next_start, both the following conditions will hold
when the application issues its next request:
1) the application will meet the above bandwidth requirement,
2) a given minimum time interval, say Delta, will have elapsed from
time t_c (so as to filter out greedy applications).

The current value of Delta is slightly higher than the value that we
found, experimentally, to be adequate on a real, general-purpose
machine. In particular, we had to increase Delta to keep the filter
precise also on slower, embedded systems and in KVM/QEMU virtual
machines (details in the comments to the code).
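
For concreteness, here is a minimal, stand-alone user-space sketch of
the computation performed by bfq_bfqq_softrt_next_start(), introduced
below in bfq-iosched.c. All numeric values, including HZ and the queue
state, are only example assumptions; 7000 sectors/s corresponds to
roughly 3.5 MB/s with 512-byte sectors.

#include <stdio.h>

#define HZ 250  /* example jiffy rate, an assumption of this sketch */

int main(void)
{
        /* example queue state, in jiffies and sectors */
        unsigned long jiffies = 100000;               /* "now" */
        unsigned long last_idle_bklogged = 99000;     /* last idle->backlogged switch */
        unsigned long service_from_backlogged = 2800; /* sectors served since then */
        unsigned long max_softrt_rate = 7000;         /* sectors per second */
        unsigned long slice_idle = 2;                 /* assumed idling timeout, jiffies */

        /* earliest next arrival that keeps the average rate within max_softrt_rate */
        unsigned long bw_bound = last_idle_bklogged +
                HZ * service_from_backlogged / max_softrt_rate;

        /* greedy-application filter: now + slice_idle + a few extra jiffies */
        unsigned long greedy_bound = jiffies + slice_idle + 4;

        unsigned long soft_rt_next_start =
                bw_bound > greedy_bound ? bw_bound : greedy_bound;

        printf("soft_rt_next_start = %lu jiffies\n", soft_rt_next_start);
        return 0;
}

With these example numbers the greedy-application filter dominates
(100006 vs. 99100 jiffies): the queue drained its small backlog so
quickly that the pure bandwidth bound would already lie in the past.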

If the application actually issues its next request after time
soft_rt_next_start, then its associated queue will be weight-raised
for a relatively short time interval. If, during this time interval,
the application proves again to meet the bandwidth and isochrony
requirements, then the end of the weight-raising period for the queue
is moved forward, and so on. Note that an application whose associated
queue never happens to be empty when it expires will never have the
opportunity to be deemed as soft real-time.

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/Kconfig.iosched |   8 +-
 block/bfq-iosched.c   | 231 ++++++++++++++++++++++++++++++++++++++++++++++++--
 block/bfq.h           |  24 ++++++
 3 files changed, 251 insertions(+), 12 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3e26f28..1d64eea 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -47,9 +47,9 @@ config IOSCHED_BFQ
 	  processes according to their weights.
 	  It aims at distributing the bandwidth as desired, regardless of
 	  the device parameters and with any workload. It also tries to
-	  guarantee a low latency to interactive applications. If compiled
-	  built-in (saying Y here), BFQ can be configured to support
-	  hierarchical scheduling.
+	  guarantee low latency to interactive and soft real-time
+	  applications. If compiled built-in (saying Y here), BFQ can
+	  be configured to support hierarchical scheduling.
 
 config CGROUP_BFQIO
 	bool "BFQ hierarchical scheduling support"
@@ -81,7 +81,7 @@ choice
 		  The BFQ I/O scheduler aims at distributing the bandwidth
 		  as desired, regardless of the disk parameters and with
 		  any workload. It also tries to guarantee a low latency to
-		  interactive applications.
+		  interactive and soft real-time applications.
 
 	config DEFAULT_NOOP
 		bool "No-op"
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ace9aba..661f948 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -24,15 +24,17 @@
  * precisely, BFQ schedules queues associated to processes. Thanks to the
  * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
  * I/O-bound processes issuing sequential requests (to boost the
- * throughput), and yet guarantee a low latency to interactive applications.
+ * throughput), and yet guarantee a low latency to interactive and soft
+ * real-time applications.
  *
  * BFQ is described in [1], where also a reference to the initial, more
  * theoretical paper on BFQ can be found. The interested reader can find
  * in the latter paper full details on the main algorithm, as well as
  * formulas of the guarantees and formal proofs of all the properties.
  * With respect to the version of BFQ presented in these papers, this
- * implementation adds a few more heuristics and a hierarchical extension
- * based on H-WF2Q+.
+ * implementation adds a few more heuristics, such as the one that
+ * guarantees a low latency to soft real-time applications, and a
+ * hierarchical extension based on H-WF2Q+.
  *
  * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
  * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
@@ -401,6 +403,8 @@ static void bfq_add_request(struct request *rq)
 	bfqq->next_rq = next_rq;
 
 	if (!bfq_bfqq_busy(bfqq)) {
+		int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+			time_is_before_jiffies(bfqq->soft_rt_next_start);
 		idle_for_long_time = time_is_before_jiffies(
 			bfqq->budget_timeout +
 			bfqd->bfq_wr_min_idle_time);
@@ -414,9 +418,13 @@ static void bfq_add_request(struct request *rq)
 		 * If the queue is not being boosted and has been idle for
 		 * enough time, start a weight-raising period.
 		 */
-		if (old_wr_coeff == 1 && idle_for_long_time) {
+		if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
 			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
-			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			if (idle_for_long_time)
+				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+			else
+				bfqq->wr_cur_max_time =
+					bfqd->bfq_wr_rt_max_time;
 			bfq_log_bfqq(bfqd, bfqq,
 				     "wrais starting at %lu, rais_max_time %u",
 				     jiffies,
@@ -424,18 +432,76 @@ static void bfq_add_request(struct request *rq)
 		} else if (old_wr_coeff > 1) {
 			if (idle_for_long_time)
 				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
-			else {
+			else if (bfqq->wr_cur_max_time ==
+				 bfqd->bfq_wr_rt_max_time &&
+				 !soft_rt) {
 				bfqq->wr_coeff = 1;
 				bfq_log_bfqq(bfqd, bfqq,
 					"wrais ending at %lu, rais_max_time %u",
 					jiffies,
 					jiffies_to_msecs(bfqq->
 						wr_cur_max_time));
+			} else if (time_before(
+					bfqq->last_wr_start_finish +
+					bfqq->wr_cur_max_time,
+					jiffies +
+					bfqd->bfq_wr_rt_max_time) &&
+				   soft_rt) {
+				/*
+				 *
+				 * The remaining weight-raising time is lower
+				 * than bfqd->bfq_wr_rt_max_time, which means
+				 * that the application is enjoying weight
+				 * raising either because deemed soft-rt in
+				 * the near past, or because deemed interactive
+				 * a long ago.
+				 * In both cases, resetting now the current
+				 * remaining weight-raising time for the
+				 * application to the weight-raising duration
+				 * for soft rt applications would not cause any
+				 * latency increase for the application (as the
+				 * new duration would be higher than the
+				 * remaining time).
+				 *
+				 * In addition, the application is now meeting
+				 * the requirements for being deemed soft rt.
+				 * In the end we can correctly and safely
+				 * (re)charge the weight-raising duration for
+				 * the application with the weight-raising
+				 * duration for soft rt applications.
+				 *
+				 * In particular, doing this recharge now, i.e.,
+				 * before the weight-raising period for the
+				 * application finishes, reduces the probability
+				 * of the following negative scenario:
+				 * 1) the weight of a soft rt application is
+				 *    raised at startup (as for any newly
+				 *    created application),
+				 * 2) since the application is not interactive,
+				 *    at a certain time weight-raising is
+				 *    stopped for the application,
+				 * 3) at that time the application happens to
+				 *    still have pending requests, and hence
+				 *    is destined to not have a chance to be
+				 *    deemed soft rt before these requests are
+				 *    completed (see the comments to the
+				 *    function bfq_bfqq_softrt_next_start()
+				 *    for details on soft rt detection),
+				 * 4) these pending requests experience a high
+				 *    latency because the application is not
+				 *    weight-raised while they are pending.
+				 */
+				bfqq->last_wr_start_finish = jiffies;
+				bfqq->wr_cur_max_time =
+					bfqd->bfq_wr_rt_max_time;
 			}
 		}
 		if (old_wr_coeff != bfqq->wr_coeff)
 			entity->ioprio_changed = 1;
 add_bfqq_busy:
+		bfqq->last_idle_bklogged = jiffies;
+		bfqq->service_from_backlogged = 0;
+		bfq_clear_bfqq_softrt_update(bfqq);
 		bfq_add_bfqq_busy(bfqd, bfqq);
 	} else {
 		if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
@@ -753,8 +819,11 @@ static void bfq_arm_slice_timer(struct bfq_data *bfqd)
 static void bfq_set_budget_timeout(struct bfq_data *bfqd)
 {
 	struct bfq_queue *bfqq = bfqd->in_service_queue;
-	unsigned int timeout_coeff = bfqq->entity.weight /
-				     bfqq->entity.orig_weight;
+	unsigned int timeout_coeff;
+	if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
+		timeout_coeff = 1;
+	else
+		timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
 
 	bfqd->last_budget_start = ktime_get();
 
@@ -1105,6 +1174,77 @@ static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	return expected > (4 * bfqq->entity.budget) / 3;
 }
 
+/*
+ * To be deemed as soft real-time, an application must meet two
+ * requirements. First, the application must not require an average
+ * bandwidth higher than the approximate bandwidth required to playback or
+ * record a compressed high-definition video.
+ * The next function is invoked on the completion of the last request of a
+ * batch, to compute the next-start time instant, soft_rt_next_start, such
+ * that, if the next request of the application does not arrive before
+ * soft_rt_next_start, then the above requirement on the bandwidth is met.
+ *
+ * The second requirement is that the request pattern of the application is
+ * isochronous, i.e., that, after issuing a request or a batch of requests,
+ * the application stops issuing new requests until all its pending requests
+ * have been completed. After that, the application may issue a new batch,
+ * and so on.
+ * For this reason the next function is invoked to compute
+ * soft_rt_next_start only for applications that meet this requirement,
+ * whereas soft_rt_next_start is set to infinity for applications that do
+ * not.
+ *
+ * Unfortunately, even a greedy application may happen to behave in an
+ * isochronous way if the CPU load is high. In fact, the application may
+ * stop issuing requests while the CPUs are busy serving other processes,
+ * then restart, then stop again for a while, and so on. In addition, if
+ * the disk achieves a low enough throughput with the request pattern
+ * issued by the application (e.g., because the request pattern is random
+ * and/or the device is slow), then the application may meet the above
+ * bandwidth requirement too. To prevent such a greedy application to be
+ * deemed as soft real-time, a further rule is used in the computation of
+ * soft_rt_next_start: soft_rt_next_start must be higher than the current
+ * time plus the maximum time for which the arrival of a request is waited
+ * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle.
+ * This filters out greedy applications, as the latter issue instead their
+ * next request as soon as possible after the last one has been completed
+ * (in contrast, when a batch of requests is completed, a soft real-time
+ * application spends some time processing data).
+ *
+ * Unfortunately, the last filter may easily generate false positives if
+ * only bfqd->bfq_slice_idle is used as a reference time interval and one
+ * or both the following cases occur:
+ * 1) HZ is so low that the duration of a jiffy is comparable to or higher
+ *    than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with
+ *    HZ=100.
+ * 2) jiffies, instead of increasing at a constant rate, may stop increasing
+ *    for a while, then suddenly 'jump' by several units to recover the lost
+ *    increments. This seems to happen, e.g., inside virtual machines.
+ * To address this issue, we do not use as a reference time interval just
+ * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In
+ * particular we add the minimum number of jiffies for which the filter
+ * seems to be quite precise also in embedded systems and KVM/QEMU virtual
+ * machines.
+ */
+static inline unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
+						       struct bfq_queue *bfqq)
+{
+	return max(bfqq->last_idle_bklogged +
+		   HZ * bfqq->service_from_backlogged /
+		   bfqd->bfq_wr_max_softrt_rate,
+		   jiffies + bfqq->bfqd->bfq_slice_idle + 4);
+}
+
+/*
+ * Return the largest-possible time instant such that, for as long as possible,
+ * the current time will be lower than this time instant according to the macro
+ * time_is_before_jiffies().
+ */
+static inline unsigned long bfq_infinity_from_now(unsigned long now)
+{
+	return now + ULONG_MAX / 2;
+}
+
 /**
  * bfq_bfqq_expire - expire a queue.
  * @bfqd: device owning the queue.
@@ -1162,12 +1302,55 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 		     bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3))
 		bfq_bfqq_charge_full_budget(bfqq);
 
+	bfqq->service_from_backlogged += bfqq->entity.service;
+
 	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
 		bfq_mark_bfqq_constantly_seeky(bfqq);
 
 	if (bfqd->low_latency && bfqq->wr_coeff == 1)
 		bfqq->last_wr_start_finish = jiffies;
 
+	if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&
+	    RB_EMPTY_ROOT(&bfqq->sort_list)) {
+		/*
+		 * If we get here, and there are no outstanding requests,
+		 * then the request pattern is isochronous (see the comments
+		 * to the function bfq_bfqq_softrt_next_start()). Hence we
+		 * can compute soft_rt_next_start. If, instead, the queue
+		 * still has outstanding requests, then we have to wait
+		 * for the completion of all the outstanding requests to
+		 * discover whether the request pattern is actually
+		 * isochronous.
+		 */
+		if (bfqq->dispatched == 0)
+			bfqq->soft_rt_next_start =
+				bfq_bfqq_softrt_next_start(bfqd, bfqq);
+		else {
+			/*
+			 * The application is still waiting for the
+			 * completion of one or more requests:
+			 * prevent it from possibly being incorrectly
+			 * deemed as soft real-time by setting its
+			 * soft_rt_next_start to infinity. In fact,
+			 * without this assignment, the application
+			 * would be incorrectly deemed as soft
+			 * real-time if:
+			 * 1) it issued a new request before the
+			 *    completion of all its in-flight
+			 *    requests, and
+			 * 2) at that time, its soft_rt_next_start
+			 *    happened to be in the past.
+			 */
+			bfqq->soft_rt_next_start =
+				bfq_infinity_from_now(jiffies);
+			/*
+			 * Schedule an update of soft_rt_next_start to when
+			 * the task may be discovered to be isochronous.
+			 */
+			bfq_mark_bfqq_softrt_update(bfqq);
+		}
+	}
+
 	bfq_log_bfqq(bfqd, bfqq,
 		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
 		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
@@ -1691,6 +1874,11 @@ static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfqq->wr_coeff = 1;
 	bfqq->last_wr_start_finish = 0;
+	/*
+	 * Set to the value for which bfqq will not be deemed as
+	 * soft rt when it becomes backlogged.
+	 */
+	bfqq->soft_rt_next_start = bfq_infinity_from_now(jiffies);
 }
 
 static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
@@ -2019,6 +2207,18 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	}
 
 	/*
+	 * If we are waiting to discover whether the request pattern of the
+	 * task associated with the queue is actually isochronous, and
+	 * both requisites for this condition to hold are satisfied, then
+	 * compute soft_rt_next_start (see the comments to the function
+	 * bfq_bfqq_softrt_next_start()).
+	 */
+	if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
+	    RB_EMPTY_ROOT(&bfqq->sort_list))
+		bfqq->soft_rt_next_start =
+			bfq_bfqq_softrt_next_start(bfqd, bfqq);
+
+	/*
 	 * If this is the in-service queue, check if it needs to be expired,
 	 * or if we want to idle in case it has no pending requests.
 	 */
@@ -2345,9 +2545,16 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->low_latency = true;
 
 	bfqd->bfq_wr_coeff = 20;
+	bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300);
 	bfqd->bfq_wr_max_time = 0;
 	bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
 	bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
+	bfqd->bfq_wr_max_softrt_rate = 7000; /*
+					      * Approximate rate required
+					      * to playback or record a
+					      * high-definition compressed
+					      * video.
+					      */
 
 	/*
 	 * Begin by assuming, optimistically, that the device peak rate is
@@ -2461,9 +2668,11 @@ SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
 SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
 SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
 SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
+SHOW_FUNCTION(bfq_wr_rt_max_time_show, bfqd->bfq_wr_rt_max_time, 1);
 SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
 SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
 	1);
+SHOW_FUNCTION(bfq_wr_max_softrt_rate_show, bfqd->bfq_wr_max_softrt_rate, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2498,10 +2707,14 @@ STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
 		INT_MAX, 1);
 STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
 STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_rt_max_time_store, &bfqd->bfq_wr_rt_max_time, 0, INT_MAX,
+		1);
 STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
 		INT_MAX, 1);
 STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
 		&bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
+STORE_FUNCTION(bfq_wr_max_softrt_rate_store, &bfqd->bfq_wr_max_softrt_rate, 0,
+		INT_MAX, 0);
 #undef STORE_FUNCTION
 
 /* do nothing for the moment */
@@ -2593,8 +2806,10 @@ static struct elv_fs_entry bfq_attrs[] = {
 	BFQ_ATTR(low_latency),
 	BFQ_ATTR(wr_coeff),
 	BFQ_ATTR(wr_max_time),
+	BFQ_ATTR(wr_rt_max_time),
 	BFQ_ATTR(wr_min_idle_time),
 	BFQ_ATTR(wr_min_inter_arr_async),
+	BFQ_ATTR(wr_max_softrt_rate),
 	BFQ_ATTR(weights),
 	__ATTR_NULL
 };
diff --git a/block/bfq.h b/block/bfq.h
index 3ce9100..5fa8b34 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -185,6 +185,17 @@ struct bfq_group;
  *                        the @bfq-queue is being weight-raised, otherwise
  *                        finish time of the last weight-raising period
  * @wr_cur_max_time: current max raising time for this queue
+ * @soft_rt_next_start: minimum time instant such that, only if a new
+ *                      request is enqueued after this time instant in an
+ *                      idle @bfq_queue with no outstanding requests, then
+ *                      the task associated with the queue is deemed as
+ *                      soft real-time (see the comments to the function
+ *                      bfq_bfqq_softrt_next_start())
+ * @last_idle_bklogged: time of the last transition of the @bfq_queue from
+ *                      idle to backlogged
+ * @service_from_backlogged: cumulative service received from the @bfq_queue
+ *                           since the last transition from idle to
+ *                           backlogged
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
  * io_context or more, if it is async. @cgroup holds a reference to the
@@ -224,8 +235,11 @@ struct bfq_queue {
 
 	/* weight-raising fields */
 	unsigned long wr_cur_max_time;
+	unsigned long soft_rt_next_start;
 	unsigned long last_wr_start_finish;
 	unsigned int wr_coeff;
+	unsigned long last_idle_bklogged;
+	unsigned long service_from_backlogged;
 };
 
 /**
@@ -309,12 +323,15 @@ enum bfq_device_speed {
  * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
  *                queue is multiplied
  * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
+ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes
  * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
  *			  may be reactivated for a queue (in jiffies)
  * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
  *				after which weight-raising may be
  *				reactivated for an already busy queue
  *				(in jiffies)
+ * @bfq_wr_max_softrt_rate: max service-rate for a soft real-time queue,
+ *			    sectors per seconds
  * @RT_prod: cached value of the product R*T used for computing the maximum
  *	     duration of the weight raising automatically
  * @device_speed: device-speed class for the low-latency heuristic
@@ -372,8 +389,10 @@ struct bfq_data {
 	/* parameters of the low_latency heuristics */
 	unsigned int bfq_wr_coeff;
 	unsigned int bfq_wr_max_time;
+	unsigned int bfq_wr_rt_max_time;
 	unsigned int bfq_wr_min_idle_time;
 	unsigned long bfq_wr_min_inter_arr_async;
+	unsigned int bfq_wr_max_softrt_rate;
 	u64 RT_prod;
 	enum bfq_device_speed device_speed;
 
@@ -393,6 +412,10 @@ enum bfqq_state_flags {
 					 * bfqq has proved to be slow and
 					 * seeky until budget timeout
 					 */
+	BFQ_BFQQ_FLAG_softrt_update,	/*
+					 * may need softrt-next-start
+					 * update
+					 */
 };
 
 #define BFQ_BFQQ_FNS(name)						\
@@ -418,6 +441,7 @@ BFQ_BFQQ_FNS(prio_changed);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(budget_new);
 BFQ_BFQQ_FNS(constantly_seeky);
+BFQ_BFQQ_FNS(softrt_update);
 #undef BFQ_BFQQ_FNS
 
 /* Logging facilities. */
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 08/12] block, bfq: preserve a low latency also with NCQ-capable drives
       [not found]           ` <1401354343-5527-1-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-05-29  9:05             ` Paolo Valente
  2014-05-29  9:05               ` Paolo Valente
                               ` (11 subsequent siblings)
  12 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

I/O schedulers typically allow NCQ-capable drives to prefetch I/O
requests, as NCQ boosts the throughput exactly by prefetching and
internally reordering requests.

Unfortunately, as discussed in detail and shown experimentally in [1],
this may cause fairness and latency guarantees to be violated. The
main problem is that the internal scheduler of an NCQ-capable drive
may postpone the service of some unlucky (prefetched) requests as long
as it deems serving other requests more appropriate to boost the
throughput.

This patch addresses this issue by not disabling device idling for
weight-raised queues, even if the device supports NCQ. This allows BFQ
to start serving a new queue, and therefore allows the drive to
prefetch new requests, only after the idling timeout expires. At that
time, all the outstanding requests of the expired queue have most
certainly been served.
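
Purely as an illustration (this is not code from the patch), the
clause that this change modifies can be sketched as the stand-alone
predicate below; the parameter names are simplified stand-ins for the
fields tested in bfq_update_idle_window().

#include <stdbool.h>
#include <stdio.h>

/*
 * On an NCQ-capable drive (hw_tag set), idling may be disabled for a
 * seeky queue only if the queue is not being weight-raised
 * (wr_coeff == 1): weight-raised queues keep idling enabled, so that
 * the drive cannot prefetch requests of other queues in the meantime.
 */
bool may_disable_idling(bool hw_tag, bool seeky, unsigned int wr_coeff)
{
        return hw_tag && seeky && wr_coeff == 1;
}

int main(void)
{
        /* a seeky queue being weight-raised (coeff 20 in BFQ) on an NCQ drive */
        printf("%d\n", may_disable_idling(true, true, 20)); /* prints 0: keep idling */
        return 0;
}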

[1] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
http://www.algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 5 +++--
 block/bfq.h         | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 661f948..0b24130 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2051,7 +2051,8 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 
 	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
 	    bfqd->bfq_slice_idle == 0 ||
-		(bfqd->hw_tag && BFQQ_SEEKY(bfqq)))
+		(bfqd->hw_tag && BFQQ_SEEKY(bfqq) &&
+			bfqq->wr_coeff == 1))
 		enable_idle = 0;
 	else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
 		if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
@@ -2874,7 +2875,7 @@ static int __init bfq_init(void)
 	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v1");
+	pr_info("BFQ I/O-scheduler version: v2");
 
 	return 0;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 5fa8b34..3b5763a7 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v1 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v2 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 09/12] block, bfq: reduce latency during request-pool saturation
  2014-05-29  9:05           ` Paolo Valente
                             ` (3 preceding siblings ...)
  (?)
@ 2014-05-29  9:05           ` Paolo Valente
       [not found]             ` <1401354343-5527-10-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
  -1 siblings, 1 reply; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

This patch introduces a heuristic that reduces latency when the
I/O-request pool is saturated. This goal is achieved by disabling
device idling, for non-weight-raised queues, when there are weight-
raised queues with pending or in-flight requests. In fact, as
explained in more detail in the comment to the function
bfq_bfqq_must_not_expire(), this reduces the rate at which processes
associated with non-weight-raised queues grab requests from the pool,
thereby increasing the probability that processes associated with
weight-raised queues get a request immediately (or at least soon) when
they need one.
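
Again purely as an illustration (not code from the patch), the
resulting rule can be sketched as the stand-alone predicate below; the
parameters are simplified stand-ins for the fields tested in
bfq_bfqq_must_not_expire().

#include <stdbool.h>
#include <stdio.h>

/*
 * A sync queue must not be expired (i.e., the device may keep idling
 * on it) if it is weight-raised, or if it has a non-null idle window
 * and no weight-raised queue is busy on an NCQ-capable (hw_tag) drive.
 */
bool must_not_expire(bool sync, unsigned int wr_coeff, bool idle_window,
                     bool hw_tag, int wr_busy_queues)
{
        bool expire_non_wr = hw_tag && wr_busy_queues > 0;

        return sync && (wr_coeff > 1 || (idle_window && !expire_non_wr));
}

int main(void)
{
        /* sync, non-weight-raised queue while one weight-raised queue is busy */
        printf("%d\n", must_not_expire(true, 1, true, true, 1)); /* prints 0: expire */
        return 0;
}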

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 67 +++++++++++++++++++++++++++++++++++++++++++----------
 block/bfq-sched.c   |  6 +++++
 block/bfq.h         |  2 ++
 3 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 0b24130..5988c70 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -511,6 +511,7 @@ add_bfqq_busy:
 			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
 			bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
 
+			bfqd->wr_busy_queues++;
 			entity->ioprio_changed = 1;
 			bfq_log_bfqq(bfqd, bfqq,
 			    "non-idle wrais starting at %lu, rais_max_time %u",
@@ -655,6 +656,8 @@ static void bfq_merged_requests(struct request_queue *q, struct request *rq,
 /* Must be called with bfqq != NULL */
 static inline void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
 {
+	if (bfq_bfqq_busy(bfqq))
+		bfqq->bfqd->wr_busy_queues--;
 	bfqq->wr_coeff = 1;
 	bfqq->wr_cur_max_time = 0;
 	/* Trigger a weight change on the next activation of the queue */
@@ -1401,22 +1404,61 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
 /*
  * Device idling is allowed only for the queues for which this function
  * returns true. For this reason, the return value of this function plays a
- * critical role for both throughput boosting and service guarantees.
+ * critical role for both throughput boosting and service guarantees. The
+ * return value is computed through a logical expression. In this rather
+ * long comment, we try to briefly describe all the details and motivations
+ * behind the components of this logical expression.
  *
- * The return value is computed through a logical expression, which may
- * be true only if bfqq is sync and at least one of the following two
- * conditions holds:
- * - the queue has a non-null idle window;
- * - the queue is being weight-raised.
- * In fact, waiting for a new request for the queue, in the first case,
- * is likely to boost the disk throughput, whereas, in the second case,
- * is necessary to preserve fairness and latency guarantees
- * (see [1] for details).
+ * First, the expression may be true only for sync queues. Besides, if
+ * bfqq is also being weight-raised, then the expression always evaluates
+ * to true, as device idling is instrumental for preserving low-latency
+ * guarantees (see [1]). Otherwise, the expression evaluates to true only
+ * if bfqq has a non-null idle window and at least one of the following
+ * two conditions holds. The first condition is that the device is not
+ * performing NCQ, because idling the device most certainly boosts the
+ * throughput if this condition holds and bfqq has been granted a non-null
+ * idle window.
+ *
+ * The second condition is that there is no weight-raised busy queue,
+ * which guarantees that the device is not idled for a sync non-weight-
+ * raised queue when there are busy weight-raised queues. The former is
+ * then expired immediately if empty. Combined with the timestamping rules
+ * of BFQ (see [1] for details), this causes sync non-weight-raised queues
+ * to get a lower number of requests served, and hence to ask for a lower
+ * number of requests from the request pool, before the busy weight-raised
+ * queues get served again.
+ *
+ * This is beneficial for the processes associated with weight-raised
+ * queues, when the request pool is saturated (e.g., in the presence of
+ * write hogs). In fact, if the processes associated with the other queues
+ * ask for requests at a lower rate, then weight-raised processes have a
+ * higher probability to get a request from the pool immediately (or at
+ * least soon) when they need one. Hence they have a higher probability to
+ * actually get a fraction of the disk throughput proportional to their
+ * high weight. This is especially true with NCQ-capable drives, which
+ * enqueue several requests in advance and further reorder internally-
+ * queued requests.
+ *
+ * In the end, mistreating non-weight-raised queues when there are busy
+ * weight-raised queues seems to mitigate starvation problems in the
+ * presence of heavy write workloads and NCQ, and hence to guarantee a
+ * higher application and system responsiveness in these hostile scenarios.
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
-	return bfq_bfqq_sync(bfqq) &&
-	       (bfqq->wr_coeff > 1 || bfq_bfqq_idle_window(bfqq));
+	struct bfq_data *bfqd = bfqq->bfqd;
+/*
+ * Condition for expiring a non-weight-raised queue (and hence not idling
+ * the device).
+ */
+#define cond_for_expiring_non_wr  (bfqd->hw_tag && \
+				   bfqd->wr_busy_queues > 0)
+
+	return bfq_bfqq_sync(bfqq) && (
+		bfqq->wr_coeff > 1 ||
+		(bfq_bfqq_idle_window(bfqq) &&
+		 !cond_for_expiring_non_wr)
+	);
 }
 
 /*
@@ -2556,6 +2598,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 					      * high-definition compressed
 					      * video.
 					      */
+	bfqd->wr_busy_queues = 0;
 
 	/*
 	 * Begin by assuming, optimistically, that the device peak rate is
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index f6491d5..73f453b 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -975,6 +975,9 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	bfqd->busy_queues--;
 
 	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
+
+	if (bfqq->wr_coeff > 1)
+		bfqd->wr_busy_queues--;
 }
 
 /*
@@ -988,4 +991,7 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues++;
+
+	if (bfqq->wr_coeff > 1)
+		bfqd->wr_busy_queues++;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 3b5763a7..7d6e4cb 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -280,6 +280,7 @@ enum bfq_device_speed {
  * @root_group: root bfq_group for the device.
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
+ * @wr_busy_queues: number of weight-raised busy @bfq_queues.
  * @queued: number of queued requests.
  * @rq_in_driver: number of requests dispatched and waiting for completion.
  * @sync_flight: number of sync requests in the driver.
@@ -345,6 +346,7 @@ struct bfq_data {
 	struct bfq_group *root_group;
 
 	int busy_queues;
+	int wr_busy_queues;
 	int queued;
 	int rq_in_driver;
 	int sync_flight;
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 10/12] block, bfq: add Early Queue Merge (EQM)
  2014-05-29  9:05           ` Paolo Valente
@ 2014-05-29  9:05               ` Paolo Valente
  -1 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Mauro Andreolini,
	Fabio Checconi, Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Paolo Valente

From: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

A set of processes may happen to perform interleaved reads, i.e.,
read requests whose union would give rise to a sequential read pattern.
There are two typical cases: first, processes reading fixed-size chunks
of data at a fixed distance from each other; second, processes reading
variable-size chunks at variable distances. The latter case occurs for
example with QEMU, which splits the I/O generated by a guest into
multiple chunks, and lets these chunks be served by a pool of I/O
threads, iteratively assigning the next chunk of I/O to the first
available thread. CFQ denotes as 'cooperating' a set of processes that
are doing interleaved I/O, and when it detects cooperating processes,
it merges their queues to obtain a sequential I/O pattern from the union
of their I/O requests, and hence boost the throughput.

Unfortunately, in the following frequent case the mechanism
implemented in CFQ for detecting cooperating processes and merging
their queues is not responsive enough to also handle the fluctuating
I/O pattern of the second type of processes. Suppose that one process
of the second type issues a request close to the next request to serve
of another process of the same type. At that time the two processes
can be considered as cooperating. But, if the request issued by the
first process is to be merged with some other already-queued request,
then, from the moment at which this request arrives, to the moment
when CFQ checks whether the two processes are cooperating, the two
processes are likely to be already doing I/O in distant zones of the
disk surface or device memory.

CFQ, however, uses preemption to get a sequential read pattern out of
the read requests performed by the second type of processes too.  As a
consequence, CFQ uses two different mechanisms to achieve the same
goal: boosting the throughput with interleaved I/O.

This patch introduces Early Queue Merge (EQM), a unified mechanism to
get a sequential read pattern with both types of processes. The main
idea is to immediately check whether a newly-arrived request lets some
pair of processes become cooperating, both in the case of actual
request insertion and, to be responsive with the second type of
processes, in the case of request merge. Both types of processes are
then handled by just merging their queues.

Finally, EQM also preserves low latency by properly restoring the
weight-raising state of a queue when it gets back to a non-merged
state.
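
To make the first kind of interleaved pattern concrete, here is a
small, self-contained example program (unrelated to the patch code):
two readers issue fixed-size chunks at a fixed distance from each
other, and the union of their requests covers a contiguous range of
sectors, which is the kind of pattern EQM is meant to detect and
exploit by merging the two queues.

#include <stdio.h>

int main(void)
{
        /* arbitrary example: 8-sector chunks, two readers interleaved */
        const unsigned int chunk = 8, rounds = 4;

        for (unsigned int i = 0; i < rounds; i++) {
                unsigned int a = 2 * i * chunk;         /* reader A's chunk */
                unsigned int b = (2 * i + 1) * chunk;   /* reader B's chunk */

                printf("A: sectors [%u, %u)   B: sectors [%u, %u)\n",
                       a, a + chunk, b, b + chunk);
        }
        /* together, A and B cover sectors [0, 64) with no holes */
        return 0;
}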

Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mauro Andreolini <mauro.andreolini-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
---
 block/bfq-iosched.c | 658 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/bfq.h         |  47 +++-
 2 files changed, 688 insertions(+), 17 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 5988c70..22d4caa 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -203,6 +203,72 @@ static inline void bfq_schedule_dispatch(struct bfq_data *bfqd)
 	}
 }
 
+static struct bfq_queue *
+bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
+		     sector_t sector, struct rb_node **ret_parent,
+		     struct rb_node ***rb_link)
+{
+	struct rb_node **p, *parent;
+	struct bfq_queue *bfqq = NULL;
+
+	parent = NULL;
+	p = &root->rb_node;
+	while (*p) {
+		struct rb_node **n;
+
+		parent = *p;
+		bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+
+		/*
+		 * Sort strictly based on sector. Smallest to the left,
+		 * largest to the right.
+		 */
+		if (sector > blk_rq_pos(bfqq->next_rq))
+			n = &(*p)->rb_right;
+		else if (sector < blk_rq_pos(bfqq->next_rq))
+			n = &(*p)->rb_left;
+		else
+			break;
+		p = n;
+		bfqq = NULL;
+	}
+
+	*ret_parent = parent;
+	if (rb_link)
+		*rb_link = p;
+
+	bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
+		(long long unsigned)sector,
+		bfqq != NULL ? bfqq->pid : 0);
+
+	return bfqq;
+}
+
+static void bfq_rq_pos_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct rb_node **p, *parent;
+	struct bfq_queue *__bfqq;
+
+	if (bfqq->pos_root != NULL) {
+		rb_erase(&bfqq->pos_node, bfqq->pos_root);
+		bfqq->pos_root = NULL;
+	}
+
+	if (bfq_class_idle(bfqq))
+		return;
+	if (!bfqq->next_rq)
+		return;
+
+	bfqq->pos_root = &bfqd->rq_pos_tree;
+	__bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root,
+			blk_rq_pos(bfqq->next_rq), &parent, &p);
+	if (__bfqq == NULL) {
+		rb_link_node(&bfqq->pos_node, parent, p);
+		rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
+	} else
+		bfqq->pos_root = NULL;
+}
+
 /*
  * Lifted from AS - choose which of rq1 and rq2 that is best served now.
  * We choose the request that is closesr to the head right now.  Distance
@@ -380,6 +446,45 @@ static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
 	return dur;
 }
 
+static inline void
+bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+{
+	if (bic->saved_idle_window)
+		bfq_mark_bfqq_idle_window(bfqq);
+	else
+		bfq_clear_bfqq_idle_window(bfqq);
+	if (bic->wr_time_left && bfqq->bfqd->low_latency) {
+		/*
+		 * Start a weight raising period with the duration given by
+		 * the raising_time_left snapshot.
+		 */
+		if (bfq_bfqq_busy(bfqq))
+			bfqq->bfqd->wr_busy_queues++;
+		bfqq->wr_coeff = bfqq->bfqd->bfq_wr_coeff;
+		bfqq->wr_cur_max_time = bic->wr_time_left;
+		bfqq->last_wr_start_finish = jiffies;
+		bfqq->entity.ioprio_changed = 1;
+	}
+	/*
+	 * Clear wr_time_left to prevent bfq_bfqq_save_state() from
+	 * getting confused about the queue's need of a weight-raising
+	 * period.
+	 */
+	bic->wr_time_left = 0;
+}
+
+/*
+ * Must be called with the queue_lock held.
+ */
+static int bfqq_process_refs(struct bfq_queue *bfqq)
+{
+	int process_refs, io_refs;
+
+	io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
+	process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
+	return process_refs;
+}
+
 static void bfq_add_request(struct request *rq)
 {
 	struct bfq_queue *bfqq = RQ_BFQQ(rq);
@@ -402,6 +507,12 @@ static void bfq_add_request(struct request *rq)
 	next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
 	bfqq->next_rq = next_rq;
 
+	/*
+	 * Adjust priority tree position, if next_rq changes.
+	 */
+	if (prev != bfqq->next_rq)
+		bfq_rq_pos_tree_add(bfqd, bfqq);
+
 	if (!bfq_bfqq_busy(bfqq)) {
 		int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
 			time_is_before_jiffies(bfqq->soft_rt_next_start);
@@ -414,11 +525,20 @@ static void bfq_add_request(struct request *rq)
 		if (!bfqd->low_latency)
 			goto add_bfqq_busy;
 
+		if (bfq_bfqq_just_split(bfqq))
+			goto set_ioprio_changed;
+
 		/*
-		 * If the queue is not being boosted and has been idle for
-		 * enough time, start a weight-raising period.
+		 * If the queue:
+		 * - is not being boosted,
+		 * - has been idle for enough time,
+		 * - is not a sync queue or is linked to a bfq_io_cq (it is
+		 *   shared "for its nature" or it is not shared and its
+		 *   requests have not been redirected to a shared queue)
+		 * start a weight-raising period.
 		 */
-		if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
+		if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt) &&
+		    (!bfq_bfqq_sync(bfqq) || bfqq->bic != NULL)) {
 			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
 			if (idle_for_long_time)
 				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
@@ -496,6 +616,7 @@ static void bfq_add_request(struct request *rq)
 					bfqd->bfq_wr_rt_max_time;
 			}
 		}
+set_ioprio_changed:
 		if (old_wr_coeff != bfqq->wr_coeff)
 			entity->ioprio_changed = 1;
 add_bfqq_busy:
@@ -583,6 +704,13 @@ static void bfq_remove_request(struct request *rq)
 	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
 		if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue)
 			bfq_del_bfqq_busy(bfqd, bfqq, 1);
+		/*
+		 * Remove queue from request-position tree as it is empty.
+		 */
+		if (bfqq->pos_root != NULL) {
+			rb_erase(&bfqq->pos_node, bfqq->pos_root);
+			bfqq->pos_root = NULL;
+		}
 	}
 
 	if (rq->cmd_flags & REQ_META)
@@ -625,11 +753,14 @@ static void bfq_merged_request(struct request_queue *q, struct request *req,
 					 bfqd->last_position);
 		bfqq->next_rq = next_rq;
 		/*
-		 * If next_rq changes, update the queue's budget to fit
-		 * the new request.
+		 * If next_rq changes, update both the queue's budget to
+		 * fit the new request and the queue's position in its
+		 * rq_pos_tree.
 		 */
-		if (prev != bfqq->next_rq)
+		if (prev != bfqq->next_rq) {
 			bfq_updated_next_req(bfqd, bfqq);
+			bfq_rq_pos_tree_add(bfqd, bfqq);
+		}
 	}
 }
 
@@ -692,12 +823,339 @@ static void bfq_end_wr(struct bfq_data *bfqd)
 	spin_unlock_irq(bfqd->queue->queue_lock);
 }
 
+static inline sector_t bfq_io_struct_pos(void *io_struct, bool request)
+{
+	if (request)
+		return blk_rq_pos(io_struct);
+	else
+		return ((struct bio *)io_struct)->bi_iter.bi_sector;
+}
+
+static inline sector_t bfq_dist_from(sector_t pos1,
+				     sector_t pos2)
+{
+	if (pos1 >= pos2)
+		return pos1 - pos2;
+	else
+		return pos2 - pos1;
+}
+
+static inline int bfq_rq_close_to_sector(void *io_struct, bool request,
+					 sector_t sector)
+{
+	return bfq_dist_from(bfq_io_struct_pos(io_struct, request), sector) <=
+	       BFQQ_SEEK_THR;
+}
+
+static struct bfq_queue *bfqq_close(struct bfq_data *bfqd, sector_t sector)
+{
+	struct rb_root *root = &bfqd->rq_pos_tree;
+	struct rb_node *parent, *node;
+	struct bfq_queue *__bfqq;
+
+	if (RB_EMPTY_ROOT(root))
+		return NULL;
+
+	/*
+	 * First, if we find a request starting at the end of the last
+	 * request, choose it.
+	 */
+	__bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL);
+	if (__bfqq != NULL)
+		return __bfqq;
+
+	/*
+	 * If the exact sector wasn't found, the parent of the NULL leaf
+	 * will contain the closest sector (rq_pos_tree sorted by
+	 * next_request position).
+	 */
+	__bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+	if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+		return __bfqq;
+
+	if (blk_rq_pos(__bfqq->next_rq) < sector)
+		node = rb_next(&__bfqq->pos_node);
+	else
+		node = rb_prev(&__bfqq->pos_node);
+	if (node == NULL)
+		return NULL;
+
+	__bfqq = rb_entry(node, struct bfq_queue, pos_node);
+	if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+		return __bfqq;
+
+	return NULL;
+}
+
+/*
+ * bfqd - obvious
+ * cur_bfqq - passed in so that we don't decide that the current queue
+ *            is closely cooperating with itself
+ * sector - used as a reference point to search for a close queue
+ */
+static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+					      struct bfq_queue *cur_bfqq,
+					      sector_t sector)
+{
+	struct bfq_queue *bfqq;
+
+	if (bfq_class_idle(cur_bfqq))
+		return NULL;
+	if (!bfq_bfqq_sync(cur_bfqq))
+		return NULL;
+	if (BFQQ_SEEKY(cur_bfqq))
+		return NULL;
+
+	/* If device has only one backlogged bfq_queue, don't search. */
+	if (bfqd->busy_queues == 1)
+		return NULL;
+
+	/*
+	 * We should notice if some of the queues are cooperating, e.g.
+	 * working closely on the same area of the disk. In that case,
+	 * we can group them together and don't waste time idling.
+	 */
+	bfqq = bfqq_close(bfqd, sector);
+	if (bfqq == NULL || bfqq == cur_bfqq)
+		return NULL;
+
+	/*
+	 * Do not merge queues from different bfq_groups.
+	*/
+	if (bfqq->entity.parent != cur_bfqq->entity.parent)
+		return NULL;
+
+	/*
+	 * It only makes sense to merge sync queues.
+	 */
+	if (!bfq_bfqq_sync(bfqq))
+		return NULL;
+	if (BFQQ_SEEKY(bfqq))
+		return NULL;
+
+	/*
+	 * Do not merge queues of different priority classes.
+	 */
+	if (bfq_class_rt(bfqq) != bfq_class_rt(cur_bfqq))
+		return NULL;
+
+	return bfqq;
+}
+
+static struct bfq_queue *
+bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+{
+	int process_refs, new_process_refs;
+	struct bfq_queue *__bfqq;
+
+	/*
+	 * If there are no process references on the new_bfqq, then it is
+	 * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
+	 * may have dropped their last reference (not just their last process
+	 * reference).
+	 */
+	if (!bfqq_process_refs(new_bfqq))
+		return NULL;
+
+	/* Avoid a circular list and skip interim queue merges. */
+	while ((__bfqq = new_bfqq->new_bfqq)) {
+		if (__bfqq == bfqq)
+			return NULL;
+		new_bfqq = __bfqq;
+	}
+
+	process_refs = bfqq_process_refs(bfqq);
+	new_process_refs = bfqq_process_refs(new_bfqq);
+	/*
+	 * If the process for the bfqq has gone away, there is no
+	 * sense in merging the queues.
+	 */
+	if (process_refs == 0 || new_process_refs == 0)
+		return NULL;
+
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
+		new_bfqq->pid);
+
+	/*
+	 * Merging is just a redirection: the requests of the process
+	 * owning one of the two queues are redirected to the other queue.
+	 * The latter queue, in its turn, is set as shared if this is the
+	 * first time that the requests of some process are redirected to
+	 * it.
+	 *
+	 * We redirect bfqq to new_bfqq and not the opposite, because we
+	 * are in the context of the process owning bfqq, hence we have
+	 * the io_cq of this process. So we can immediately configure this
+	 * io_cq to redirect the requests of the process to new_bfqq.
+	 *
+	 * NOTE, even if new_bfqq coincides with the in-service queue, the
+	 * io_cq of new_bfqq is not available, because, if the in-service
+	 * queue is shared, bfqd->in_service_bic may not point to the
+	 * io_cq of the in-service queue.
+	 * Redirecting the requests of the process owning bfqq to the
+	 * currently in-service queue is in any case the best option, as
+	 * we feed the in-service queue with new requests close to the
+	 * last request served and, by doing so, hopefully increase the
+	 * throughput.
+	 */
+	bfqq->new_bfqq = new_bfqq;
+	atomic_add(process_refs, &new_bfqq->ref);
+	return new_bfqq;
+}
+
+/*
+ * Attempt to schedule a merge of bfqq with the currently in-service queue
+ * or with a close queue among the scheduled queues.
+ * Return NULL if no merge was scheduled, a pointer to the shared bfq_queue
+ * structure otherwise.
+ */
+static struct bfq_queue *
+bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+		     void *io_struct, bool request)
+{
+	struct bfq_queue *in_service_bfqq, *new_bfqq;
+
+	if (bfqq->new_bfqq)
+		return bfqq->new_bfqq;
+
+	if (!io_struct)
+		return NULL;
+
+	in_service_bfqq = bfqd->in_service_queue;
+
+	if (in_service_bfqq == NULL || in_service_bfqq == bfqq ||
+	    !bfqd->in_service_bic)
+		goto check_scheduled;
+
+	if (bfq_class_idle(in_service_bfqq) || bfq_class_idle(bfqq))
+		goto check_scheduled;
+
+	if (bfq_class_rt(in_service_bfqq) != bfq_class_rt(bfqq))
+		goto check_scheduled;
+
+	if (in_service_bfqq->entity.parent != bfqq->entity.parent)
+		goto check_scheduled;
+
+	if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
+	    bfq_bfqq_sync(in_service_bfqq) && bfq_bfqq_sync(bfqq)) {
+		new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
+		if (new_bfqq != NULL)
+			return new_bfqq; /* Merge with in-service queue */
+	}
+
+	/*
+	 * Check whether there is a cooperator among currently scheduled
+	 * queues. The only thing we need is that the bio/request is not
+	 * NULL, as we need it to establish whether a cooperator exists.
+	 */
+check_scheduled:
+	new_bfqq = bfq_close_cooperator(bfqd, bfqq,
+					bfq_io_struct_pos(io_struct, request));
+	if (new_bfqq)
+		return bfq_setup_merge(bfqq, new_bfqq);
+
+	return NULL;
+}
+
+static inline void
+bfq_bfqq_save_state(struct bfq_queue *bfqq)
+{
+	/*
+	 * If bfqq->bic == NULL, the queue is already shared or its requests
+	 * have already been redirected to a shared queue; both idle window
+	 * and weight raising state have already been saved. Do nothing.
+	 */
+	if (bfqq->bic == NULL)
+		return;
+	if (bfqq->bic->wr_time_left)
+		/*
+		 * This is the queue of a just-started process, and would
+		 * deserve weight raising: we set wr_time_left to the full
+		 * weight-raising duration to trigger weight-raising when
+		 * and if the queue is split and the first request of the
+		 * queue is enqueued.
+		 */
+		bfqq->bic->wr_time_left = bfq_wr_duration(bfqq->bfqd);
+	else if (bfqq->wr_coeff > 1) {
+		unsigned long wr_duration =
+			jiffies - bfqq->last_wr_start_finish;
+		/*
+		 * It may happen that a queue's weight raising period lasts
+		 * longer than its wr_cur_max_time, as weight raising is
+		 * handled only when a request is enqueued or dispatched (it
+		 * does not use any timer). If the weight raising period is
+		 * about to end, don't save it.
+		 */
+		if (bfqq->wr_cur_max_time <= wr_duration)
+			bfqq->bic->wr_time_left = 0;
+		else
+			bfqq->bic->wr_time_left =
+				bfqq->wr_cur_max_time - wr_duration;
+		/*
+		 * The bfq_queue is becoming shared or the requests of the
+		 * process owning the queue are being redirected to a shared
+		 * queue. Stop the weight raising period of the queue, as in
+		 * both cases it should not be owned by an interactive or
+		 * soft real-time application.
+		 */
+		bfq_bfqq_end_wr(bfqq);
+	} else
+		bfqq->bic->wr_time_left = 0;
+	bfqq->bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
+}
+
+static inline void
+bfq_get_bic_reference(struct bfq_queue *bfqq)
+{
+	/*
+	 * If bfqq->bic has a non-NULL value, the bic to which it belongs
+	 * is about to begin using a shared bfq_queue.
+	 */
+	if (bfqq->bic)
+		atomic_long_inc(&bfqq->bic->icq.ioc->refcount);
+}
+
+static void
+bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
+		struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+{
+	bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
+		(long unsigned)new_bfqq->pid);
+	/* Save weight raising and idle window of the merged queues */
+	bfq_bfqq_save_state(bfqq);
+	bfq_bfqq_save_state(new_bfqq);
+	/*
+	 * Grab a reference to the bic, to prevent it from being destroyed
+	 * before being possibly touched by a bfq_split_bfqq().
+	 */
+	bfq_get_bic_reference(bfqq);
+	bfq_get_bic_reference(new_bfqq);
+	/*
+	 * Merge queues (that is, let bic redirect its requests to new_bfqq)
+	 */
+	bic_set_bfqq(bic, new_bfqq, 1);
+	bfq_mark_bfqq_coop(new_bfqq);
+	/*
+	 * new_bfqq now belongs to at least two bics (it is a shared queue):
+	 * set new_bfqq->bic to NULL. bfqq either:
+	 * - does not belong to any bic any more, and hence bfqq->bic must
+	 *   be set to NULL, or
+	 * - is a queue whose owning bics have already been redirected to a
+	 *   different queue, hence the queue is destined to not belong to
+	 *   any bic soon and bfqq->bic is already NULL (therefore the next
+	 *   assignment causes no harm).
+	 */
+	new_bfqq->bic = NULL;
+	bfqq->bic = NULL;
+	bfq_put_queue(bfqq);
+}
+
 static int bfq_allow_merge(struct request_queue *q, struct request *rq,
 			   struct bio *bio)
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
 	struct bfq_io_cq *bic;
-	struct bfq_queue *bfqq;
+	struct bfq_queue *bfqq, *new_bfqq;
 
 	/*
 	 * Disallow merge of a sync bio into an async request.
@@ -715,6 +1173,23 @@ static int bfq_allow_merge(struct request_queue *q, struct request *rq,
 		return 0;
 
 	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	/*
+	 * We take advantage of this function to perform an early merge
+	 * of the queues of possible cooperating processes.
+	 */
+	if (bfqq != NULL) {
+		new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false);
+		if (new_bfqq != NULL) {
+			bfq_merge_bfqqs(bfqd, bic, bfqq, new_bfqq);
+			/*
+			 * If we get here, the bio will be queued in the
+			 * shared queue, i.e., new_bfqq, so use new_bfqq
+			 * to decide whether bio and rq can be merged.
+			 */
+			bfqq = new_bfqq;
+		}
+	}
+
 	return bfqq == RQ_BFQQ(rq);
 }
 
@@ -898,6 +1373,15 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	__bfq_bfqd_reset_in_service(bfqd);
 
+	/*
+	 * If this bfqq is shared between multiple processes, check
+	 * to make sure that those processes are still issuing I/Os
+	 * within the mean seek distance. If not, it may be time to
+	 * break the queues apart again.
+	 */
+	if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
+		bfq_mark_bfqq_split_coop(bfqq);
+
 	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
 		/*
 		 * Overloading budget_timeout field to store the time
@@ -906,8 +1390,13 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 		 */
 		bfqq->budget_timeout = jiffies;
 		bfq_del_bfqq_busy(bfqd, bfqq, 1);
-	} else
+	} else {
 		bfq_activate_bfqq(bfqd, bfqq);
+		/*
+		 * Resort priority tree of potential close cooperators.
+		 */
+		bfq_rq_pos_tree_add(bfqd, bfqq);
+	}
 }
 
 /**
@@ -1773,6 +2262,25 @@ static void bfq_put_queue(struct bfq_queue *bfqq)
 	kmem_cache_free(bfq_pool, bfqq);
 }
 
+static void bfq_put_cooperator(struct bfq_queue *bfqq)
+{
+	struct bfq_queue *__bfqq, *next;
+
+	/*
+	 * If this queue was scheduled to merge with another queue, be
+	 * sure to drop the reference taken on that queue (and others in
+	 * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs.
+	 */
+	__bfqq = bfqq->new_bfqq;
+	while (__bfqq) {
+		if (__bfqq == bfqq)
+			break;
+		next = __bfqq->new_bfqq;
+		bfq_put_queue(__bfqq);
+		__bfqq = next;
+	}
+}
+
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	if (bfqq == bfqd->in_service_queue) {
@@ -1783,12 +2291,35 @@ static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq,
 		     atomic_read(&bfqq->ref));
 
+	bfq_put_cooperator(bfqq);
+
 	bfq_put_queue(bfqq);
 }
 
 static inline void bfq_init_icq(struct io_cq *icq)
 {
-	icq_to_bic(icq)->ttime.last_end_request = jiffies;
+	struct bfq_io_cq *bic = icq_to_bic(icq);
+
+	bic->ttime.last_end_request = jiffies;
+	/*
+	 * A newly created bic indicates that the process has just
+	 * started doing I/O, and is probably mapping into memory its
+	 * executable and libraries: it definitely needs weight raising.
+	 * There is however the possibility that the process performs,
+	 * for a while, I/O close to some other process. EQM intercepts
+	 * this behavior and may merge the queue corresponding to the
+	 * process  with some other queue, BEFORE the weight of the queue
+	 * is raised. Merged queues are not weight-raised (they are assumed
+	 * to belong to processes that benefit only from high throughput).
+	 * If the merge is basically the consequence of an accident, then
+	 * the queue will be split soon and will get back its old weight.
+	 * It is then important to write down somewhere that this queue
+	 * does need weight raising, even if it did not make it to get its
+	 * weight raised before being merged. To this purpose, we overload
+	 * the field raising_time_left and assign 1 to it, to mark the queue
+	 * as needing weight raising.
+	 */
+	bic->wr_time_left = 1;
 }
 
 static void bfq_exit_icq(struct io_cq *icq)
@@ -1802,6 +2333,13 @@ static void bfq_exit_icq(struct io_cq *icq)
 	}
 
 	if (bic->bfqq[BLK_RW_SYNC]) {
+		/*
+		 * If the bic is using a shared queue, put the reference
+		 * taken on the io_context when the bic started using a
+		 * shared bfq_queue.
+		 */
+		if (bfq_bfqq_coop(bic->bfqq[BLK_RW_SYNC]))
+			put_io_context(icq->ioc);
 		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
 		bic->bfqq[BLK_RW_SYNC] = NULL;
 	}
@@ -2089,6 +2627,10 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 	if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
 		return;
 
+	/* Idle window just restored, statistics are meaningless. */
+	if (bfq_bfqq_just_split(bfqq))
+		return;
+
 	enable_idle = bfq_bfqq_idle_window(bfqq);
 
 	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
@@ -2131,6 +2673,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
 	    !BFQQ_SEEKY(bfqq))
 		bfq_update_idle_window(bfqd, bfqq, bic);
+	bfq_clear_bfqq_just_split(bfqq);
 
 	bfq_log_bfqq(bfqd, bfqq,
 		     "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
@@ -2191,14 +2734,48 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 static void bfq_insert_request(struct request_queue *q, struct request *rq)
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
-	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_queue *bfqq = RQ_BFQQ(rq), *new_bfqq;
 
 	assert_spin_locked(bfqd->queue->queue_lock);
 
+	/*
+	 * An unplug may trigger a requeue of a request from the device
+	 * driver: make sure we are in process context while trying to
+	 * merge two bfq_queues.
+	 */
+	if (!in_interrupt()) {
+		new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true);
+		if (new_bfqq != NULL) {
+			if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq)
+				new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1);
+			/*
+			 * Release the request's reference to the old bfqq
+			 * and make sure one is taken to the shared queue.
+			 */
+			new_bfqq->allocated[rq_data_dir(rq)]++;
+			bfqq->allocated[rq_data_dir(rq)]--;
+			atomic_inc(&new_bfqq->ref);
+			bfq_put_queue(bfqq);
+			if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq)
+				bfq_merge_bfqqs(bfqd, RQ_BIC(rq),
+						bfqq, new_bfqq);
+			rq->elv.priv[1] = new_bfqq;
+			bfqq = new_bfqq;
+		}
+	}
+
 	bfq_init_prio_data(bfqq, RQ_BIC(rq));
 
 	bfq_add_request(rq);
 
+	/*
+	 * Here a newly-created bfq_queue has already started a weight-raising
+	 * period: clear raising_time_left to prevent bfq_bfqq_save_state()
+	 * from assigning it a full weight-raising period. See the detailed
+	 * comments about this field in bfq_init_icq().
+	 */
+	if (bfqq->bic != NULL)
+		bfqq->bic->wr_time_left = 0;
 	rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
 	list_add_tail(&rq->queuelist, &bfqq->fifo);
 
@@ -2347,6 +2924,32 @@ static void bfq_put_request(struct request *rq)
 }
 
 /*
+ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
+ * was the last process referring to said bfqq.
+ */
+static struct bfq_queue *
+bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
+
+	put_io_context(bic->icq.ioc);
+
+	if (bfqq_process_refs(bfqq) == 1) {
+		bfqq->pid = current->pid;
+		bfq_clear_bfqq_coop(bfqq);
+		bfq_clear_bfqq_split_coop(bfqq);
+		return bfqq;
+	}
+
+	bic_set_bfqq(bic, NULL, 1);
+
+	bfq_put_cooperator(bfqq);
+
+	bfq_put_queue(bfqq);
+	return NULL;
+}
+
+/*
  * Allocate bfq data structures associated with this request.
  */
 static int bfq_set_request(struct request_queue *q, struct request *rq,
@@ -2359,6 +2962,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	struct bfq_queue *bfqq;
 	struct bfq_group *bfqg;
 	unsigned long flags;
+	bool split = false;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
@@ -2371,10 +2975,20 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 
 	bfqg = bfq_bic_update_cgroup(bic);
 
+new_queue:
 	bfqq = bic_to_bfqq(bic, is_sync);
 	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
 		bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
 		bic_set_bfqq(bic, bfqq, is_sync);
+	} else {
+		/* If the queue was seeky for too long, break it apart. */
+		if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
+			bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
+			bfqq = bfq_split_bfqq(bic, bfqq);
+			split = true;
+			if (!bfqq)
+				goto new_queue;
+		}
 	}
 
 	bfqq->allocated[rw]++;
@@ -2385,6 +2999,26 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	rq->elv.priv[0] = bic;
 	rq->elv.priv[1] = bfqq;
 
+	/*
+	 * If a bfq_queue has only one process reference, it is owned
+	 * by only one bfq_io_cq: we can set the bic field of the
+	 * bfq_queue to the address of that structure. Also, if the
+	 * queue has just been split, mark a flag so that the
+	 * information is available to the other scheduler hooks.
+	 */
+	if (bfqq_process_refs(bfqq) == 1) {
+		bfqq->bic = bic;
+		if (split) {
+			bfq_mark_bfqq_just_split(bfqq);
+			/*
+			 * If the queue has just been split from a shared
+			 * queue, restore the idle window and the possible
+			 * weight raising period.
+			 */
+			bfq_bfqq_resume_state(bfqq, bic);
+		}
+	}
+
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	return 0;
@@ -2565,6 +3199,8 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
 
+	bfqd->rq_pos_tree = RB_ROOT;
+
 	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
 
 	INIT_LIST_HEAD(&bfqd->active_list);
@@ -2918,7 +3554,7 @@ static int __init bfq_init(void)
 	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v2");
+	pr_info("BFQ I/O-scheduler version: v6");
 
 	return 0;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 7d6e4cb..bda1ecb3 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v2 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v6 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@@ -164,6 +164,10 @@ struct bfq_group;
  * struct bfq_queue - leaf schedulable entity.
  * @ref: reference counter.
  * @bfqd: parent bfq_data.
+ * @new_bfqq: shared bfq_queue if queue is cooperating with
+ *           one or more other queues.
+ * @pos_node: request-position tree member (see bfq_data's @rq_pos_tree).
+ * @pos_root: request-position tree root (see bfq_data's @rq_pos_tree).
  * @sort_list: sorted list of pending requests.
  * @next_rq: if fifo isn't expired, next request to serve.
  * @queued: nr of requests queued in @sort_list.
@@ -196,18 +200,26 @@ struct bfq_group;
  * @service_from_backlogged: cumulative service received from the @bfq_queue
  *                           since the last transition from idle to
  *                           backlogged
+ * @bic: pointer to the bfq_io_cq owning the bfq_queue, set to %NULL if the
+ *	 queue is shared
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async. @cgroup holds a reference to the
- * cgroup, to be sure that it does not disappear while a bfqq still
- * references it (mostly to avoid races between request issuing and task
- * migration followed by cgroup destruction). All the fields are protected
- * by the queue lock of the containing bfqd.
+ * io_context or more, if it  is  async or shared  between  cooperating
+ * processes. @cgroup holds a reference to the cgroup, to be sure that it
+ * does not disappear while a bfqq still references it (mostly to avoid
+ * races between request issuing and task migration followed by cgroup
+ * destruction).
+ * All the fields are protected by the queue lock of the containing bfqd.
  */
 struct bfq_queue {
 	atomic_t ref;
 	struct bfq_data *bfqd;
 
+	/* fields for cooperating queues handling */
+	struct bfq_queue *new_bfqq;
+	struct rb_node pos_node;
+	struct rb_root *pos_root;
+
 	struct rb_root sort_list;
 	struct request *next_rq;
 	int queued[2];
@@ -232,6 +244,7 @@ struct bfq_queue {
 	sector_t last_request_pos;
 
 	pid_t pid;
+	struct bfq_io_cq *bic;
 
 	/* weight-raising fields */
 	unsigned long wr_cur_max_time;
@@ -261,12 +274,24 @@ struct bfq_ttime {
  * @icq: associated io_cq structure
  * @bfqq: array of two process queues, the sync and the async
  * @ttime: associated @bfq_ttime struct
+ * @wr_time_left: snapshot of the time left before weight raising ends
+ *                for the sync queue associated to this process; this
+ *		  snapshot is taken to remember this value while the weight
+ *		  raising is suspended because the queue is merged with a
+ *		  shared queue, and is used to set @raising_cur_max_time
+ *		  when the queue is split from the shared queue and its
+ *		  weight is raised again
+ * @saved_idle_window: same purpose as the previous field for the idle
+ *                     window
  */
 struct bfq_io_cq {
 	struct io_cq icq; /* must be the first member */
 	struct bfq_queue *bfqq[2];
 	struct bfq_ttime ttime;
 	int ioprio;
+
+	unsigned int wr_time_left;
+	unsigned int saved_idle_window;
 };
 
 enum bfq_device_speed {
@@ -278,6 +303,9 @@ enum bfq_device_speed {
  * struct bfq_data - per device data structure.
  * @queue: request queue for the managed device.
  * @root_group: root bfq_group for the device.
+ * @rq_pos_tree: rbtree sorted by next_request position, used when
+ *               determining if two or more queues have interleaving
+ *               requests (see bfq_close_cooperator()).
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @wr_busy_queues: number of weight-raised busy @bfq_queues.
@@ -344,6 +372,7 @@ struct bfq_data {
 	struct request_queue *queue;
 
 	struct bfq_group *root_group;
+	struct rb_root rq_pos_tree;
 
 	int busy_queues;
 	int wr_busy_queues;
@@ -418,6 +447,9 @@ enum bfqq_state_flags {
 					 * may need softrt-next-start
 					 * update
 					 */
+	BFQ_BFQQ_FLAG_coop,		/* bfqq is shared */
+	BFQ_BFQQ_FLAG_split_coop,	/* shared bfqq will be split */
+	BFQ_BFQQ_FLAG_just_split,	/* queue has just been split */
 };
 
 #define BFQ_BFQQ_FNS(name)						\
@@ -443,6 +475,9 @@ BFQ_BFQQ_FNS(prio_changed);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(budget_new);
 BFQ_BFQQ_FNS(constantly_seeky);
+BFQ_BFQQ_FNS(coop);
+BFQ_BFQQ_FNS(split_coop);
+BFQ_BFQQ_FNS(just_split);
 BFQ_BFQQ_FNS(softrt_update);
 #undef BFQ_BFQQ_FNS
 
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 10/12] block, bfq: add Early Queue Merge (EQM)
@ 2014-05-29  9:05               ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Mauro Andreolini, Paolo Valente

From: Arianna Avanzini <avanzini.arianna@gmail.com>

A set of processes may happen to perform interleaved reads, i.e.,
read requests whose union would give rise to a sequential read pattern.
There are two typical cases: first, processes reading fixed-size chunks
of data at a fixed distance from each other; second, processes reading
variable-size chunks at variable distances. The latter case occurs for
example with QEMU, which splits the I/O generated by a guest into
multiple chunks, and lets these chunks be served by a pool of I/O
threads, iteratively assigning the next chunk of I/O to the first
available thread. CFQ denotes as 'cooperating' a set of processes that
are doing interleaved I/O, and when it detects cooperating processes,
it merges their queues to obtain a sequential I/O pattern from the union
of their I/O requests, and hence boost the throughput.
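
To make the first case concrete, consider two processes that alternately
read consecutive 64 KiB chunks of the same file: each process on its own
jumps 128 KiB between its requests and looks seeky, yet the union of the
two request streams is perfectly sequential. The toy user-space sketch
below is only an illustration, not part of the patch; the chunk size,
sector size and number of requests are arbitrary.

#include <stdio.h>

#define CHUNK_BYTES	(64 * 1024)	/* size of each read */
#define SECTOR_BYTES	512

int main(void)
{
	int i;

	for (i = 0; i < 8; i++) {
		int reader = i % 2;	/* process issuing this chunk */
		long long sector =
			(long long)i * CHUNK_BYTES / SECTOR_BYTES;

		/*
		 * Each reader alone advances by 2 * CHUNK_BYTES between
		 * its own requests, but the merged stream is sequential.
		 */
		printf("reader %d: %d bytes at sector %lld\n",
		       reader, CHUNK_BYTES, sector);
	}
	return 0;
}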

Unfortunately, in the following frequent case the mechanism
implemented in CFQ for detecting cooperating processes and merging
their queues is not responsive enough to also handle the fluctuating
I/O pattern of the second type of processes. Suppose that one process
of the second type issues a request close to the next request to serve
of another process of the same type. At that time the two processes
can be considered to be cooperating. But, if the request issued by the
first process is to be merged with some other already-queued request,
then, from the moment at which this request arrives to the moment when
CFQ checks whether the two processes are cooperating, the two
processes are likely to be already doing I/O in distant zones of the
disk surface or device memory.

However, CFQ does use preemption to get a sequential read pattern out
of the read requests performed by the second type of processes too. As
a consequence, CFQ uses two different mechanisms to achieve the same
goal: boosting the throughput with interleaved I/O.

This patch introduces Early Queue Merge (EQM), a unified mechanism to
get a sequential read pattern with both types of processes. The main
idea is to immediately check whether a newly-arrived request turns
some pair of processes into cooperating processes, both in the case of
actual request insertion and, to be responsive also with the second
type of processes, in the case of request merge. Both types of
processes are then handled by just merging their queues.
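
At its core the early check is just a distance test: the position of
the newly-arrived I/O is compared with the position some other queue is
currently working at, and if the two are within a seek threshold the
queues are merged right away. The stand-alone sketch below is only an
illustration, with hypothetical names and a made-up threshold; in the
patch the comparison is performed by bfq_rq_close_to_sector() against
BFQQ_SEEK_THR, and candidate queues are found through the in-service
queue and the rq_pos_tree rather than a linear scan.

#include <stdbool.h>

typedef unsigned long long sector_t;

/* Hypothetical stand-in for the seek threshold used by the patch. */
#define SEEK_THR	((sector_t)8192)

static bool close_to(sector_t a, sector_t b)
{
	sector_t dist = a > b ? a - b : b - a;

	return dist <= SEEK_THR;
}

/*
 * Return the index of a queue whose next request lies close to the
 * newly-arrived I/O, i.e., a candidate for an early merge, or -1 if
 * there is none. next_pos[] holds each queue's next-request position.
 */
static int find_cooperator(const sector_t *next_pos, int nr_queues,
			   int self, sector_t new_io_pos)
{
	int i;

	for (i = 0; i < nr_queues; i++)
		if (i != self && close_to(next_pos[i], new_io_pos))
			return i;
	return -1;
}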

Finally, EQM also preserves low latency, by properly restoring the
weight-raising state of a queue when it gets back to a non-merged
state.

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
---
 block/bfq-iosched.c | 658 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/bfq.h         |  47 +++-
 2 files changed, 688 insertions(+), 17 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 5988c70..22d4caa 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -203,6 +203,72 @@ static inline void bfq_schedule_dispatch(struct bfq_data *bfqd)
 	}
 }
 
+static struct bfq_queue *
+bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
+		     sector_t sector, struct rb_node **ret_parent,
+		     struct rb_node ***rb_link)
+{
+	struct rb_node **p, *parent;
+	struct bfq_queue *bfqq = NULL;
+
+	parent = NULL;
+	p = &root->rb_node;
+	while (*p) {
+		struct rb_node **n;
+
+		parent = *p;
+		bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+
+		/*
+		 * Sort strictly based on sector. Smallest to the left,
+		 * largest to the right.
+		 */
+		if (sector > blk_rq_pos(bfqq->next_rq))
+			n = &(*p)->rb_right;
+		else if (sector < blk_rq_pos(bfqq->next_rq))
+			n = &(*p)->rb_left;
+		else
+			break;
+		p = n;
+		bfqq = NULL;
+	}
+
+	*ret_parent = parent;
+	if (rb_link)
+		*rb_link = p;
+
+	bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
+		(long long unsigned)sector,
+		bfqq != NULL ? bfqq->pid : 0);
+
+	return bfqq;
+}
+
+static void bfq_rq_pos_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+{
+	struct rb_node **p, *parent;
+	struct bfq_queue *__bfqq;
+
+	if (bfqq->pos_root != NULL) {
+		rb_erase(&bfqq->pos_node, bfqq->pos_root);
+		bfqq->pos_root = NULL;
+	}
+
+	if (bfq_class_idle(bfqq))
+		return;
+	if (!bfqq->next_rq)
+		return;
+
+	bfqq->pos_root = &bfqd->rq_pos_tree;
+	__bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root,
+			blk_rq_pos(bfqq->next_rq), &parent, &p);
+	if (__bfqq == NULL) {
+		rb_link_node(&bfqq->pos_node, parent, p);
+		rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
+	} else
+		bfqq->pos_root = NULL;
+}
+
 /*
  * Lifted from AS - choose which of rq1 and rq2 that is best served now.
  * We choose the request that is closest to the head right now.  Distance
@@ -380,6 +446,45 @@ static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
 	return dur;
 }
 
+static inline void
+bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+{
+	if (bic->saved_idle_window)
+		bfq_mark_bfqq_idle_window(bfqq);
+	else
+		bfq_clear_bfqq_idle_window(bfqq);
+	if (bic->wr_time_left && bfqq->bfqd->low_latency) {
+		/*
+		 * Start a weight raising period with the duration given by
+		 * the raising_time_left snapshot.
+		 */
+		if (bfq_bfqq_busy(bfqq))
+			bfqq->bfqd->wr_busy_queues++;
+		bfqq->wr_coeff = bfqq->bfqd->bfq_wr_coeff;
+		bfqq->wr_cur_max_time = bic->wr_time_left;
+		bfqq->last_wr_start_finish = jiffies;
+		bfqq->entity.ioprio_changed = 1;
+	}
+	/*
+	 * Clear wr_time_left to prevent bfq_bfqq_save_state() from
+	 * getting confused about the queue's need of a weight-raising
+	 * period.
+	 */
+	bic->wr_time_left = 0;
+}
+
+/*
+ * Must be called with the queue_lock held.
+ */
+static int bfqq_process_refs(struct bfq_queue *bfqq)
+{
+	int process_refs, io_refs;
+
+	io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
+	process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
+	return process_refs;
+}
+
 static void bfq_add_request(struct request *rq)
 {
 	struct bfq_queue *bfqq = RQ_BFQQ(rq);
@@ -402,6 +507,12 @@ static void bfq_add_request(struct request *rq)
 	next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
 	bfqq->next_rq = next_rq;
 
+	/*
+	 * Adjust priority tree position, if next_rq changes.
+	 */
+	if (prev != bfqq->next_rq)
+		bfq_rq_pos_tree_add(bfqd, bfqq);
+
 	if (!bfq_bfqq_busy(bfqq)) {
 		int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
 			time_is_before_jiffies(bfqq->soft_rt_next_start);
@@ -414,11 +525,20 @@ static void bfq_add_request(struct request *rq)
 		if (!bfqd->low_latency)
 			goto add_bfqq_busy;
 
+		if (bfq_bfqq_just_split(bfqq))
+			goto set_ioprio_changed;
+
 		/*
-		 * If the queue is not being boosted and has been idle for
-		 * enough time, start a weight-raising period.
+		 * If the queue:
+		 * - is not being boosted,
+		 * - has been idle for enough time,
+		 * - is not a sync queue or is linked to a bfq_io_cq (it is
+		 *   shared "by nature" or it is not shared and its
+		 *   requests have not been redirected to a shared queue)
+		 * start a weight-raising period.
 		 */
-		if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
+		if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt) &&
+		    (!bfq_bfqq_sync(bfqq) || bfqq->bic != NULL)) {
 			bfqq->wr_coeff = bfqd->bfq_wr_coeff;
 			if (idle_for_long_time)
 				bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
@@ -496,6 +616,7 @@ static void bfq_add_request(struct request *rq)
 					bfqd->bfq_wr_rt_max_time;
 			}
 		}
+set_ioprio_changed:
 		if (old_wr_coeff != bfqq->wr_coeff)
 			entity->ioprio_changed = 1;
 add_bfqq_busy:
@@ -583,6 +704,13 @@ static void bfq_remove_request(struct request *rq)
 	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
 		if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue)
 			bfq_del_bfqq_busy(bfqd, bfqq, 1);
+		/*
+		 * Remove queue from request-position tree as it is empty.
+		 */
+		if (bfqq->pos_root != NULL) {
+			rb_erase(&bfqq->pos_node, bfqq->pos_root);
+			bfqq->pos_root = NULL;
+		}
 	}
 
 	if (rq->cmd_flags & REQ_META)
@@ -625,11 +753,14 @@ static void bfq_merged_request(struct request_queue *q, struct request *req,
 					 bfqd->last_position);
 		bfqq->next_rq = next_rq;
 		/*
-		 * If next_rq changes, update the queue's budget to fit
-		 * the new request.
+		 * If next_rq changes, update both the queue's budget to
+		 * fit the new request and the queue's position in its
+		 * rq_pos_tree.
 		 */
-		if (prev != bfqq->next_rq)
+		if (prev != bfqq->next_rq) {
 			bfq_updated_next_req(bfqd, bfqq);
+			bfq_rq_pos_tree_add(bfqd, bfqq);
+		}
 	}
 }
 
@@ -692,12 +823,339 @@ static void bfq_end_wr(struct bfq_data *bfqd)
 	spin_unlock_irq(bfqd->queue->queue_lock);
 }
 
+static inline sector_t bfq_io_struct_pos(void *io_struct, bool request)
+{
+	if (request)
+		return blk_rq_pos(io_struct);
+	else
+		return ((struct bio *)io_struct)->bi_iter.bi_sector;
+}
+
+static inline sector_t bfq_dist_from(sector_t pos1,
+				     sector_t pos2)
+{
+	if (pos1 >= pos2)
+		return pos1 - pos2;
+	else
+		return pos2 - pos1;
+}
+
+static inline int bfq_rq_close_to_sector(void *io_struct, bool request,
+					 sector_t sector)
+{
+	return bfq_dist_from(bfq_io_struct_pos(io_struct, request), sector) <=
+	       BFQQ_SEEK_THR;
+}
+
+static struct bfq_queue *bfqq_close(struct bfq_data *bfqd, sector_t sector)
+{
+	struct rb_root *root = &bfqd->rq_pos_tree;
+	struct rb_node *parent, *node;
+	struct bfq_queue *__bfqq;
+
+	if (RB_EMPTY_ROOT(root))
+		return NULL;
+
+	/*
+	 * First, if we find a request starting at the end of the last
+	 * request, choose it.
+	 */
+	__bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL);
+	if (__bfqq != NULL)
+		return __bfqq;
+
+	/*
+	 * If the exact sector wasn't found, the parent of the NULL leaf
+	 * will contain the closest sector (rq_pos_tree sorted by
+	 * next_request position).
+	 */
+	__bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+	if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+		return __bfqq;
+
+	if (blk_rq_pos(__bfqq->next_rq) < sector)
+		node = rb_next(&__bfqq->pos_node);
+	else
+		node = rb_prev(&__bfqq->pos_node);
+	if (node == NULL)
+		return NULL;
+
+	__bfqq = rb_entry(node, struct bfq_queue, pos_node);
+	if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+		return __bfqq;
+
+	return NULL;
+}
+
+/*
+ * bfqd - obvious
+ * cur_bfqq - passed in so that we don't decide that the current queue
+ *            is closely cooperating with itself
+ * sector - used as a reference point to search for a close queue
+ */
+static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+					      struct bfq_queue *cur_bfqq,
+					      sector_t sector)
+{
+	struct bfq_queue *bfqq;
+
+	if (bfq_class_idle(cur_bfqq))
+		return NULL;
+	if (!bfq_bfqq_sync(cur_bfqq))
+		return NULL;
+	if (BFQQ_SEEKY(cur_bfqq))
+		return NULL;
+
+	/* If device has only one backlogged bfq_queue, don't search. */
+	if (bfqd->busy_queues == 1)
+		return NULL;
+
+	/*
+	 * We should notice if some of the queues are cooperating, e.g.
+	 * working closely on the same area of the disk. In that case,
+	 * we can group them together and don't waste time idling.
+	 */
+	bfqq = bfqq_close(bfqd, sector);
+	if (bfqq == NULL || bfqq == cur_bfqq)
+		return NULL;
+
+	/*
+	 * Do not merge queues from different bfq_groups.
+	*/
+	if (bfqq->entity.parent != cur_bfqq->entity.parent)
+		return NULL;
+
+	/*
+	 * It only makes sense to merge sync queues.
+	 */
+	if (!bfq_bfqq_sync(bfqq))
+		return NULL;
+	if (BFQQ_SEEKY(bfqq))
+		return NULL;
+
+	/*
+	 * Do not merge queues of different priority classes.
+	 */
+	if (bfq_class_rt(bfqq) != bfq_class_rt(cur_bfqq))
+		return NULL;
+
+	return bfqq;
+}
+
+static struct bfq_queue *
+bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+{
+	int process_refs, new_process_refs;
+	struct bfq_queue *__bfqq;
+
+	/*
+	 * If there are no process references on the new_bfqq, then it is
+	 * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
+	 * may have dropped their last reference (not just their last process
+	 * reference).
+	 */
+	if (!bfqq_process_refs(new_bfqq))
+		return NULL;
+
+	/* Avoid a circular list and skip interim queue merges. */
+	while ((__bfqq = new_bfqq->new_bfqq)) {
+		if (__bfqq == bfqq)
+			return NULL;
+		new_bfqq = __bfqq;
+	}
+
+	process_refs = bfqq_process_refs(bfqq);
+	new_process_refs = bfqq_process_refs(new_bfqq);
+	/*
+	 * If the process for the bfqq has gone away, there is no
+	 * sense in merging the queues.
+	 */
+	if (process_refs == 0 || new_process_refs == 0)
+		return NULL;
+
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
+		new_bfqq->pid);
+
+	/*
+	 * Merging is just a redirection: the requests of the process
+	 * owning one of the two queues are redirected to the other queue.
+	 * The latter queue, in its turn, is set as shared if this is the
+	 * first time that the requests of some process are redirected to
+	 * it.
+	 *
+	 * We redirect bfqq to new_bfqq and not the opposite, because we
+	 * are in the context of the process owning bfqq, hence we have
+	 * the io_cq of this process. So we can immediately configure this
+	 * io_cq to redirect the requests of the process to new_bfqq.
+	 *
+	 * NOTE, even if new_bfqq coincides with the in-service queue, the
+	 * io_cq of new_bfqq is not available, because, if the in-service
+	 * queue is shared, bfqd->in_service_bic may not point to the
+	 * io_cq of the in-service queue.
+	 * Redirecting the requests of the process owning bfqq to the
+	 * currently in-service queue is in any case the best option, as
+	 * we feed the in-service queue with new requests close to the
+	 * last request served and, by doing so, hopefully increase the
+	 * throughput.
+	 */
+	bfqq->new_bfqq = new_bfqq;
+	atomic_add(process_refs, &new_bfqq->ref);
+	return new_bfqq;
+}
+
+/*
+ * Attempt to schedule a merge of bfqq with the currently in-service queue
+ * or with a close queue among the scheduled queues.
+ * Return NULL if no merge was scheduled, a pointer to the shared bfq_queue
+ * structure otherwise.
+ */
+static struct bfq_queue *
+bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+		     void *io_struct, bool request)
+{
+	struct bfq_queue *in_service_bfqq, *new_bfqq;
+
+	if (bfqq->new_bfqq)
+		return bfqq->new_bfqq;
+
+	if (!io_struct)
+		return NULL;
+
+	in_service_bfqq = bfqd->in_service_queue;
+
+	if (in_service_bfqq == NULL || in_service_bfqq == bfqq ||
+	    !bfqd->in_service_bic)
+		goto check_scheduled;
+
+	if (bfq_class_idle(in_service_bfqq) || bfq_class_idle(bfqq))
+		goto check_scheduled;
+
+	if (bfq_class_rt(in_service_bfqq) != bfq_class_rt(bfqq))
+		goto check_scheduled;
+
+	if (in_service_bfqq->entity.parent != bfqq->entity.parent)
+		goto check_scheduled;
+
+	if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
+	    bfq_bfqq_sync(in_service_bfqq) && bfq_bfqq_sync(bfqq)) {
+		new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
+		if (new_bfqq != NULL)
+			return new_bfqq; /* Merge with in-service queue */
+	}
+
+	/*
+	 * Check whether there is a cooperator among currently scheduled
+	 * queues. The only thing we need is that the bio/request is not
+	 * NULL, as we need it to establish whether a cooperator exists.
+	 */
+check_scheduled:
+	new_bfqq = bfq_close_cooperator(bfqd, bfqq,
+					bfq_io_struct_pos(io_struct, request));
+	if (new_bfqq)
+		return bfq_setup_merge(bfqq, new_bfqq);
+
+	return NULL;
+}
+
+static inline void
+bfq_bfqq_save_state(struct bfq_queue *bfqq)
+{
+	/*
+	 * If bfqq->bic == NULL, the queue is already shared or its requests
+	 * have already been redirected to a shared queue; both idle window
+	 * and weight raising state have already been saved. Do nothing.
+	 */
+	if (bfqq->bic == NULL)
+		return;
+	if (bfqq->bic->wr_time_left)
+		/*
+		 * This is the queue of a just-started process, and would
+		 * deserve weight raising: we set wr_time_left to the full
+		 * weight-raising duration to trigger weight-raising when
+		 * and if the queue is split and the first request of the
+		 * queue is enqueued.
+		 */
+		bfqq->bic->wr_time_left = bfq_wr_duration(bfqq->bfqd);
+	else if (bfqq->wr_coeff > 1) {
+		unsigned long wr_duration =
+			jiffies - bfqq->last_wr_start_finish;
+		/*
+		 * It may happen that a queue's weight raising period lasts
+		 * longer than its wr_cur_max_time, as weight raising is
+		 * handled only when a request is enqueued or dispatched (it
+		 * does not use any timer). If the weight raising period is
+		 * about to end, don't save it.
+		 */
+		if (bfqq->wr_cur_max_time <= wr_duration)
+			bfqq->bic->wr_time_left = 0;
+		else
+			bfqq->bic->wr_time_left =
+				bfqq->wr_cur_max_time - wr_duration;
+		/*
+		 * The bfq_queue is becoming shared or the requests of the
+		 * process owning the queue are being redirected to a shared
+		 * queue. Stop the weight raising period of the queue, as in
+		 * both cases it should not be owned by an interactive or
+		 * soft real-time application.
+		 */
+		bfq_bfqq_end_wr(bfqq);
+	} else
+		bfqq->bic->wr_time_left = 0;
+	bfqq->bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
+}
+
+static inline void
+bfq_get_bic_reference(struct bfq_queue *bfqq)
+{
+	/*
+	 * If bfqq->bic has a non-NULL value, the bic to which it belongs
+	 * is about to begin using a shared bfq_queue.
+	 */
+	if (bfqq->bic)
+		atomic_long_inc(&bfqq->bic->icq.ioc->refcount);
+}
+
+static void
+bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
+		struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+{
+	bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
+		(long unsigned)new_bfqq->pid);
+	/* Save weight raising and idle window of the merged queues */
+	bfq_bfqq_save_state(bfqq);
+	bfq_bfqq_save_state(new_bfqq);
+	/*
+	 * Grab a reference to the bic, to prevent it from being destroyed
+	 * before being possibly touched by a bfq_split_bfqq().
+	 */
+	bfq_get_bic_reference(bfqq);
+	bfq_get_bic_reference(new_bfqq);
+	/*
+	 * Merge queues (that is, let bic redirect its requests to new_bfqq)
+	 */
+	bic_set_bfqq(bic, new_bfqq, 1);
+	bfq_mark_bfqq_coop(new_bfqq);
+	/*
+	 * new_bfqq now belongs to at least two bics (it is a shared queue):
+	 * set new_bfqq->bic to NULL. bfqq either:
+	 * - does not belong to any bic any more, and hence bfqq->bic must
+	 *   be set to NULL, or
+	 * - is a queue whose owning bics have already been redirected to a
+	 *   different queue, hence the queue is destined to not belong to
+	 *   any bic soon and bfqq->bic is already NULL (therefore the next
+	 *   assignment causes no harm).
+	 */
+	new_bfqq->bic = NULL;
+	bfqq->bic = NULL;
+	bfq_put_queue(bfqq);
+}
+
 static int bfq_allow_merge(struct request_queue *q, struct request *rq,
 			   struct bio *bio)
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
 	struct bfq_io_cq *bic;
-	struct bfq_queue *bfqq;
+	struct bfq_queue *bfqq, *new_bfqq;
 
 	/*
 	 * Disallow merge of a sync bio into an async request.
@@ -715,6 +1173,23 @@ static int bfq_allow_merge(struct request_queue *q, struct request *rq,
 		return 0;
 
 	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+	/*
+	 * We take advantage of this function to perform an early merge
+	 * of the queues of possible cooperating processes.
+	 */
+	if (bfqq != NULL) {
+		new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false);
+		if (new_bfqq != NULL) {
+			bfq_merge_bfqqs(bfqd, bic, bfqq, new_bfqq);
+			/*
+			 * If we get here, the bio will be queued in the
+			 * shared queue, i.e., new_bfqq, so use new_bfqq
+			 * to decide whether bio and rq can be merged.
+			 */
+			bfqq = new_bfqq;
+		}
+	}
+
 	return bfqq == RQ_BFQQ(rq);
 }
 
@@ -898,6 +1373,15 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	__bfq_bfqd_reset_in_service(bfqd);
 
+	/*
+	 * If this bfqq is shared between multiple processes, check
+	 * to make sure that those processes are still issuing I/Os
+	 * within the mean seek distance. If not, it may be time to
+	 * break the queues apart again.
+	 */
+	if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
+		bfq_mark_bfqq_split_coop(bfqq);
+
 	if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
 		/*
 		 * Overloading budget_timeout field to store the time
@@ -906,8 +1390,13 @@ static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 		 */
 		bfqq->budget_timeout = jiffies;
 		bfq_del_bfqq_busy(bfqd, bfqq, 1);
-	} else
+	} else {
 		bfq_activate_bfqq(bfqd, bfqq);
+		/*
+		 * Resort priority tree of potential close cooperators.
+		 */
+		bfq_rq_pos_tree_add(bfqd, bfqq);
+	}
 }
 
 /**
@@ -1773,6 +2262,25 @@ static void bfq_put_queue(struct bfq_queue *bfqq)
 	kmem_cache_free(bfq_pool, bfqq);
 }
 
+static void bfq_put_cooperator(struct bfq_queue *bfqq)
+{
+	struct bfq_queue *__bfqq, *next;
+
+	/*
+	 * If this queue was scheduled to merge with another queue, be
+	 * sure to drop the reference taken on that queue (and others in
+	 * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs.
+	 */
+	__bfqq = bfqq->new_bfqq;
+	while (__bfqq) {
+		if (__bfqq == bfqq)
+			break;
+		next = __bfqq->new_bfqq;
+		bfq_put_queue(__bfqq);
+		__bfqq = next;
+	}
+}
+
 static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 {
 	if (bfqq == bfqd->in_service_queue) {
@@ -1783,12 +2291,35 @@ static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq,
 		     atomic_read(&bfqq->ref));
 
+	bfq_put_cooperator(bfqq);
+
 	bfq_put_queue(bfqq);
 }
 
 static inline void bfq_init_icq(struct io_cq *icq)
 {
-	icq_to_bic(icq)->ttime.last_end_request = jiffies;
+	struct bfq_io_cq *bic = icq_to_bic(icq);
+
+	bic->ttime.last_end_request = jiffies;
+	/*
+	 * A newly created bic indicates that the process has just
+	 * started doing I/O, and is probably mapping into memory its
+	 * executable and libraries: it definitely needs weight raising.
+	 * There is however the possibility that the process performs,
+	 * for a while, I/O close to some other process. EQM intercepts
+	 * this behavior and may merge the queue corresponding to the
+	 * process  with some other queue, BEFORE the weight of the queue
+	 * is raised. Merged queues are not weight-raised (they are assumed
+	 * to belong to processes that benefit only from high throughput).
+	 * If the merge is basically the consequence of an accident, then
+	 * the queue will be split soon and will get back its old weight.
+	 * It is then important to write down somewhere that this queue
+	 * does need weight raising, even if it did not make it to get its
+	 * weight raised before being merged. To this purpose, we overload
+	 * the field raising_time_left and assign 1 to it, to mark the queue
+	 * as needing weight raising.
+	 */
+	bic->wr_time_left = 1;
 }
 
 static void bfq_exit_icq(struct io_cq *icq)
@@ -1802,6 +2333,13 @@ static void bfq_exit_icq(struct io_cq *icq)
 	}
 
 	if (bic->bfqq[BLK_RW_SYNC]) {
+		/*
+		 * If the bic is using a shared queue, put the reference
+		 * taken on the io_context when the bic started using a
+		 * shared bfq_queue.
+		 */
+		if (bfq_bfqq_coop(bic->bfqq[BLK_RW_SYNC]))
+			put_io_context(icq->ioc);
 		bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
 		bic->bfqq[BLK_RW_SYNC] = NULL;
 	}
@@ -2089,6 +2627,10 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
 	if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
 		return;
 
+	/* Idle window just restored, statistics are meaningless. */
+	if (bfq_bfqq_just_split(bfqq))
+		return;
+
 	enable_idle = bfq_bfqq_idle_window(bfqq);
 
 	if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
@@ -2131,6 +2673,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
 	    !BFQQ_SEEKY(bfqq))
 		bfq_update_idle_window(bfqd, bfqq, bic);
+	bfq_clear_bfqq_just_split(bfqq);
 
 	bfq_log_bfqq(bfqd, bfqq,
 		     "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
@@ -2191,14 +2734,48 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 static void bfq_insert_request(struct request_queue *q, struct request *rq)
 {
 	struct bfq_data *bfqd = q->elevator->elevator_data;
-	struct bfq_queue *bfqq = RQ_BFQQ(rq);
+	struct bfq_queue *bfqq = RQ_BFQQ(rq), *new_bfqq;
 
 	assert_spin_locked(bfqd->queue->queue_lock);
 
+	/*
+	 * An unplug may trigger a requeue of a request from the device
+	 * driver: make sure we are in process context while trying to
+	 * merge two bfq_queues.
+	 */
+	if (!in_interrupt()) {
+		new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true);
+		if (new_bfqq != NULL) {
+			if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq)
+				new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1);
+			/*
+			 * Release the request's reference to the old bfqq
+			 * and make sure one is taken to the shared queue.
+			 */
+			new_bfqq->allocated[rq_data_dir(rq)]++;
+			bfqq->allocated[rq_data_dir(rq)]--;
+			atomic_inc(&new_bfqq->ref);
+			bfq_put_queue(bfqq);
+			if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq)
+				bfq_merge_bfqqs(bfqd, RQ_BIC(rq),
+						bfqq, new_bfqq);
+			rq->elv.priv[1] = new_bfqq;
+			bfqq = new_bfqq;
+		}
+	}
+
 	bfq_init_prio_data(bfqq, RQ_BIC(rq));
 
 	bfq_add_request(rq);
 
+	/*
+	 * Here a newly-created bfq_queue has already started a weight-raising
+	 * period: clear raising_time_left to prevent bfq_bfqq_save_state()
+	 * from assigning it a full weight-raising period. See the detailed
+	 * comments about this field in bfq_init_icq().
+	 */
+	if (bfqq->bic != NULL)
+		bfqq->bic->wr_time_left = 0;
 	rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
 	list_add_tail(&rq->queuelist, &bfqq->fifo);
 
@@ -2347,6 +2924,32 @@ static void bfq_put_request(struct request *rq)
 }
 
 /*
+ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
+ * was the last process referring to said bfqq.
+ */
+static struct bfq_queue *
+bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
+{
+	bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
+
+	put_io_context(bic->icq.ioc);
+
+	if (bfqq_process_refs(bfqq) == 1) {
+		bfqq->pid = current->pid;
+		bfq_clear_bfqq_coop(bfqq);
+		bfq_clear_bfqq_split_coop(bfqq);
+		return bfqq;
+	}
+
+	bic_set_bfqq(bic, NULL, 1);
+
+	bfq_put_cooperator(bfqq);
+
+	bfq_put_queue(bfqq);
+	return NULL;
+}
+
+/*
  * Allocate bfq data structures associated with this request.
  */
 static int bfq_set_request(struct request_queue *q, struct request *rq,
@@ -2359,6 +2962,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	struct bfq_queue *bfqq;
 	struct bfq_group *bfqg;
 	unsigned long flags;
+	bool split = false;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
@@ -2371,10 +2975,20 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 
 	bfqg = bfq_bic_update_cgroup(bic);
 
+new_queue:
 	bfqq = bic_to_bfqq(bic, is_sync);
 	if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
 		bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
 		bic_set_bfqq(bic, bfqq, is_sync);
+	} else {
+		/* If the queue was seeky for too long, break it apart. */
+		if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
+			bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
+			bfqq = bfq_split_bfqq(bic, bfqq);
+			split = true;
+			if (!bfqq)
+				goto new_queue;
+		}
 	}
 
 	bfqq->allocated[rw]++;
@@ -2385,6 +2999,26 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
 	rq->elv.priv[0] = bic;
 	rq->elv.priv[1] = bfqq;
 
+	/*
+	 * If a bfq_queue has only one process reference, it is owned
+	 * by only one bfq_io_cq: we can set the bic field of the
+	 * bfq_queue to the address of that structure. Also, if the
+	 * queue has just been split, mark a flag so that the
+	 * information is available to the other scheduler hooks.
+	 */
+	if (bfqq_process_refs(bfqq) == 1) {
+		bfqq->bic = bic;
+		if (split) {
+			bfq_mark_bfqq_just_split(bfqq);
+			/*
+			 * If the queue has just been split from a shared
+			 * queue, restore the idle window and the possible
+			 * weight raising period.
+			 */
+			bfq_bfqq_resume_state(bfqq, bic);
+		}
+	}
+
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	return 0;
@@ -2565,6 +3199,8 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
 
+	bfqd->rq_pos_tree = RB_ROOT;
+
 	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
 
 	INIT_LIST_HEAD(&bfqd->active_list);
@@ -2918,7 +3554,7 @@ static int __init bfq_init(void)
 	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v2");
+	pr_info("BFQ I/O-scheduler version: v6");
 
 	return 0;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 7d6e4cb..bda1ecb3 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v2 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v6 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
@@ -164,6 +164,10 @@ struct bfq_group;
  * struct bfq_queue - leaf schedulable entity.
  * @ref: reference counter.
  * @bfqd: parent bfq_data.
+ * @new_bfqq: shared bfq_queue if queue is cooperating with
+ *           one or more other queues.
+ * @pos_node: request-position tree member (see bfq_data's @rq_pos_tree).
+ * @pos_root: request-position tree root (see bfq_data's @rq_pos_tree).
  * @sort_list: sorted list of pending requests.
  * @next_rq: if fifo isn't expired, next request to serve.
  * @queued: nr of requests queued in @sort_list.
@@ -196,18 +200,26 @@ struct bfq_group;
  * @service_from_backlogged: cumulative service received from the @bfq_queue
  *                           since the last transition from idle to
  *                           backlogged
+ * @bic: pointer to the bfq_io_cq owning the bfq_queue, set to %NULL if the
+ *	 queue is shared
  *
  * A bfq_queue is a leaf request queue; it can be associated with an
- * io_context or more, if it is async. @cgroup holds a reference to the
- * cgroup, to be sure that it does not disappear while a bfqq still
- * references it (mostly to avoid races between request issuing and task
- * migration followed by cgroup destruction). All the fields are protected
- * by the queue lock of the containing bfqd.
+ * io_context or more, if it  is  async or shared  between  cooperating
+ * processes. @cgroup holds a reference to the cgroup, to be sure that it
+ * does not disappear while a bfqq still references it (mostly to avoid
+ * races between request issuing and task migration followed by cgroup
+ * destruction).
+ * All the fields are protected by the queue lock of the containing bfqd.
  */
 struct bfq_queue {
 	atomic_t ref;
 	struct bfq_data *bfqd;
 
+	/* fields for cooperating queues handling */
+	struct bfq_queue *new_bfqq;
+	struct rb_node pos_node;
+	struct rb_root *pos_root;
+
 	struct rb_root sort_list;
 	struct request *next_rq;
 	int queued[2];
@@ -232,6 +244,7 @@ struct bfq_queue {
 	sector_t last_request_pos;
 
 	pid_t pid;
+	struct bfq_io_cq *bic;
 
 	/* weight-raising fields */
 	unsigned long wr_cur_max_time;
@@ -261,12 +274,24 @@ struct bfq_ttime {
  * @icq: associated io_cq structure
  * @bfqq: array of two process queues, the sync and the async
  * @ttime: associated @bfq_ttime struct
+ * @wr_time_left: snapshot of the time left before weight raising ends
+ *                for the sync queue associated to this process; this
+ *                snapshot is taken to remember this value while the weight
+ *                raising is suspended because the queue is merged with a
+ *                shared queue, and is used to set @wr_cur_max_time
+ *                when the queue is split from the shared queue and its
+ *                weight is raised again
+ * @saved_idle_window: same purpose as the previous field for the idle
+ *                     window
  */
 struct bfq_io_cq {
 	struct io_cq icq; /* must be the first member */
 	struct bfq_queue *bfqq[2];
 	struct bfq_ttime ttime;
 	int ioprio;
+
+	unsigned int wr_time_left;
+	unsigned int saved_idle_window;
 };
 
 enum bfq_device_speed {
@@ -278,6 +303,9 @@ enum bfq_device_speed {
  * struct bfq_data - per device data structure.
  * @queue: request queue for the managed device.
  * @root_group: root bfq_group for the device.
+ * @rq_pos_tree: rbtree sorted by next_request position, used when
+ *               determining if two or more queues have interleaving
+ *               requests (see bfq_close_cooperator()).
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @wr_busy_queues: number of weight-raised busy @bfq_queues.
@@ -344,6 +372,7 @@ struct bfq_data {
 	struct request_queue *queue;
 
 	struct bfq_group *root_group;
+	struct rb_root rq_pos_tree;
 
 	int busy_queues;
 	int wr_busy_queues;
@@ -418,6 +447,9 @@ enum bfqq_state_flags {
 					 * may need softrt-next-start
 					 * update
 					 */
+	BFQ_BFQQ_FLAG_coop,		/* bfqq is shared */
+	BFQ_BFQQ_FLAG_split_coop,	/* shared bfqq will be split */
+	BFQ_BFQQ_FLAG_just_split,	/* queue has just been split */
 };
 
 #define BFQ_BFQQ_FNS(name)						\
@@ -443,6 +475,9 @@ BFQ_BFQQ_FNS(prio_changed);
 BFQ_BFQQ_FNS(sync);
 BFQ_BFQQ_FNS(budget_new);
 BFQ_BFQQ_FNS(constantly_seeky);
+BFQ_BFQQ_FNS(coop);
+BFQ_BFQQ_FNS(split_coop);
+BFQ_BFQQ_FNS(just_split);
 BFQ_BFQQ_FNS(softrt_update);
 #undef BFQ_BFQQ_FNS
 
-- 
1.9.2
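
As a side note on the rq_pos_tree added above: the tree is keyed by the
position of each queue's next request, so that a queue issuing requests
close to those of another queue can be found quickly. A minimal, purely
illustrative lookup over such a tree might look as follows (the helper
name and the closeness threshold are hypothetical, and queues in the tree
are assumed to have a valid next_rq; the actual detection is performed by
bfq_close_cooperator() and its helpers):

#define BFQQ_CLOSE_THR	(8 * 1024)	/* hypothetical threshold, in sectors */

static struct bfq_queue *bfq_rq_pos_lookup(struct rb_root *root,
					   sector_t sector)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct bfq_queue *bfqq =
			rb_entry(n, struct bfq_queue, pos_node);
		sector_t pos = blk_rq_pos(bfqq->next_rq);

		/* Close enough to @sector: report a possible cooperator. */
		if (pos + BFQQ_CLOSE_THR >= sector &&
		    pos <= sector + BFQQ_CLOSE_THR)
			return bfqq;

		n = sector < pos ? n->rb_left : n->rb_right;
	}
	return NULL;
}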


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
       [not found]           ` <1401354343-5527-1-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
                               ` (9 preceding siblings ...)
  2014-05-29  9:05               ` Paolo Valente
@ 2014-05-29  9:05             ` Paolo Valente
  2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 12/12] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs Paolo Valente
  2014-05-30 16:07               ` Tejun Heo
  12 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

This patch boosts the throughput on NCQ-capable flash-based devices,
while still preserving latency guarantees for interactive and soft
real-time applications. The throughput is boosted by just not idling
the device when the in-service queue remains empty, even if the queue
is sync and has a non-null idle window. This helps to keep the drive's
internal queue full, which is necessary to achieve maximum
performance. This solution to boost the throughput is a port of
commits a68bbdd and f7d7b7a for CFQ.

As already highlighted in patch 10, allowing the device to prefetch
and internally reorder requests trivially causes loss of control on
the request service order, and hence on service guarantees.
Fortunately, as discussed in detail in the comments to the function
bfq_bfqq_must_not_expire(), if every process has to receive the same
fraction of the throughput, then the service order enforced by the
internal scheduler of a flash-based device is relatively close to that
enforced by BFQ. In particular, it is close enough to let service
guarantees be substantially preserved.

Things change in an asymmetric scenario, i.e., if not every process
has to receive the same fraction of the throughput. In this case, to
guarantee the desired throughput distribution, the device must be
prevented from prefetching requests. This is exactly what this patch
does in asymmetric scenarios.
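
In code terms, the idling policy that results for sync queues can be
sketched as the helper below. This is purely illustrative and the helper
name is made up: the patch expresses the same logic through the
symmetric_scenario and cond_for_expiring_non_wr macros in
bfq_bfqq_must_not_expire(), and the extra cgroup-related term of
symmetric_scenario (based on active_numerous_groups) is omitted here.

static bool bfq_expire_without_idling(struct bfq_data *bfqd)
{
	/* True if all active queues (and groups) have the same weight. */
	bool symmetric = !bfq_differentiated_weights(bfqd);

	/*
	 * A sync, non-weight-raised queue is expired without idling only
	 * if the device does internal queueing (NCQ) and either some
	 * weight-raised queue is busy, or the scenario is symmetric and
	 * the device is flash-based (non-rotational).
	 */
	return bfqd->hw_tag &&
	       (bfqd->wr_busy_queues > 0 ||
		(symmetric && blk_queue_nonrot(bfqd->queue)));
}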

Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/bfq-cgroup.c  |   1 +
 block/bfq-iosched.c | 205 +++++++++++++++++++++++++++++++++++++++++++++++++---
 block/bfq-sched.c   |  98 ++++++++++++++++++++++++-
 block/bfq.h         |  46 ++++++++++++
 4 files changed, 338 insertions(+), 12 deletions(-)
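
The state needed for the symmetry check is kept cheap by the weight-counter
rbtrees introduced below (bfq_weights_tree_add()/bfq_weights_tree_remove()),
which hold one counter node per distinct weight. With that invariant,
"all active, non-weight-raised queues have the same weight" reduces to the
queue tree containing at most one node, as in this illustrative helper
(hypothetical name; the patch checks the complementary condition in
bfq_differentiated_weights()):

static bool bfq_all_queue_weights_equal(struct bfq_data *bfqd)
{
	struct rb_node *node = bfqd->queue_weights_tree.rb_node;

	/* At most one counter node means at most one distinct weight. */
	return !node || (!node->rb_left && !node->rb_right);
}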

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 1cb25aa..d338a54 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -85,6 +85,7 @@ static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
 	entity->ioprio = entity->new_ioprio;
 	entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
 	entity->my_sched_data = &bfqg->sched_data;
+	bfqg->active_entities = 0;
 }
 
 static inline void bfq_group_set_parent(struct bfq_group *bfqg,
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 22d4caa..49856e1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -364,6 +364,120 @@ static struct request *bfq_choose_req(struct bfq_data *bfqd,
 	}
 }
 
+/*
+ * Tell whether there are active queues or groups with differentiated weights.
+ */
+static inline bool bfq_differentiated_weights(struct bfq_data *bfqd)
+{
+	/*
+	 * For weights to differ, at least one of the trees must contain
+	 * at least two nodes.
+	 */
+	return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
+		(bfqd->queue_weights_tree.rb_node->rb_left ||
+		 bfqd->queue_weights_tree.rb_node->rb_right)
+#ifdef CONFIG_CGROUP_BFQIO
+	       ) ||
+	       (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
+		(bfqd->group_weights_tree.rb_node->rb_left ||
+		 bfqd->group_weights_tree.rb_node->rb_right)
+#endif
+	       );
+}
+
+/*
+ * If the weight-counter tree passed as input contains no counter for
+ * the weight of the input entity, then add that counter; otherwise just
+ * increment the existing counter.
+ *
+ * Note that weight-counter trees contain few nodes in mostly symmetric
+ * scenarios. For example, if all queues have the same weight, then the
+ * weight-counter tree for the queues may contain at most one node.
+ * This holds even if low_latency is on, because weight-raised queues
+ * are not inserted in the tree.
+ * In most scenarios, the rate at which nodes are created/destroyed
+ * should be low too.
+ */
+static void bfq_weights_tree_add(struct bfq_data *bfqd,
+				 struct bfq_entity *entity,
+				 struct rb_root *root)
+{
+	struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+	/*
+	 * Do not insert if:
+	 * - the device does not support queueing;
+	 * - the entity is already associated with a counter, which happens if:
+	 *   1) the entity is associated with a queue, 2) a request arrival
+	 *   has caused the queue to become both non-weight-raised, and hence
+	 *   change its weight, and backlogged; in this respect, each
+	 *   of the two events causes an invocation of this function,
+	 *   3) this is the invocation of this function caused by the second
+	 *   event. This second invocation is actually useless, and we handle
+	 *   this fact by exiting immediately. More efficient or clearer
+	 *   solutions might possibly be adopted.
+	 */
+	if (!bfqd->hw_tag || entity->weight_counter)
+		return;
+
+	while (*new) {
+		struct bfq_weight_counter *__counter = container_of(*new,
+						struct bfq_weight_counter,
+						weights_node);
+		parent = *new;
+
+		if (entity->weight == __counter->weight) {
+			entity->weight_counter = __counter;
+			goto inc_counter;
+		}
+		if (entity->weight < __counter->weight)
+			new = &((*new)->rb_left);
+		else
+			new = &((*new)->rb_right);
+	}
+
+	entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
+					 GFP_ATOMIC);
+	entity->weight_counter->weight = entity->weight;
+	rb_link_node(&entity->weight_counter->weights_node, parent, new);
+	rb_insert_color(&entity->weight_counter->weights_node, root);
+
+inc_counter:
+	entity->weight_counter->num_active++;
+}
+
+/*
+ * Decrement the weight counter associated with the entity, and, if the
+ * counter reaches 0, remove the counter from the tree.
+ * See the comments to the function bfq_weights_tree_add() for considerations
+ * about overhead.
+ */
+static void bfq_weights_tree_remove(struct bfq_data *bfqd,
+				    struct bfq_entity *entity,
+				    struct rb_root *root)
+{
+	/*
+	 * Check whether the entity is actually associated with a counter.
+	 * In fact, the device may not be considered NCQ-capable for a while,
+	 * which implies that no insertion in the weight trees is performed,
+	 * after which the device may start to be deemed NCQ-capable, and hence
+	 * this function may start to be invoked. This may cause the function
+	 * to be invoked for entities that are not associated with any counter.
+	 */
+	if (!entity->weight_counter)
+		return;
+
+	entity->weight_counter->num_active--;
+	if (entity->weight_counter->num_active > 0)
+		goto reset_entity_pointer;
+
+	rb_erase(&entity->weight_counter->weights_node, root);
+	kfree(entity->weight_counter);
+
+reset_entity_pointer:
+	entity->weight_counter = NULL;
+}
+
 static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
 					struct bfq_queue *bfqq,
 					struct request *last)
@@ -1906,16 +2020,17 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * two conditions holds. The first condition is that the device is not
  * performing NCQ, because idling the device most certainly boosts the
  * throughput if this condition holds and bfqq has been granted a non-null
- * idle window.
+ * idle window. The second compound condition is made of the logical AND of
+ * two components.
  *
- * The second condition is that there is no weight-raised busy queue,
- * which guarantees that the device is not idled for a sync non-weight-
- * raised queue when there are busy weight-raised queues. The former is
- * then expired immediately if empty. Combined with the timestamping rules
- * of BFQ (see [1] for details), this causes sync non-weight-raised queues
- * to get a lower number of requests served, and hence to ask for a lower
- * number of requests from the request pool, before the busy weight-raised
- * queues get served again.
+ * The first component is true only if there is no weight-raised busy
+ * queue. This guarantees that the device is not idled for a sync non-
+ * weight-raised queue when there are busy weight-raised queues. The former
+ * is then expired immediately if empty. Combined with the timestamping
+ * rules of BFQ (see [1] for details), this causes sync non-weight-raised
+ * queues to get a lower number of requests served, and hence to ask for a
+ * lower number of requests from the request pool, before the busy weight-
+ * raised queues get served again.
  *
  * This is beneficial for the processes associated with weight-raised
  * queues, when the request pool is saturated (e.g., in the presence of
@@ -1932,16 +2047,76 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * weight-raised queues seems to mitigate starvation problems in the
  * presence of heavy write workloads and NCQ, and hence to guarantee a
  * higher application and system responsiveness in these hostile scenarios.
+ *
+ * If the first component of the compound condition is instead true, i.e.,
+ * there is no weight-raised busy queue, then the second component of the
+ * compound condition takes into account service-guarantee and throughput
+ * issues related to NCQ (recall that the compound condition is evaluated
+ * only if the device is detected as supporting NCQ).
+ *
+ * As for service guarantees, allowing the drive to enqueue more than one
+ * request at a time, and hence delegating de facto final scheduling
+ * decisions to the drive's internal scheduler, causes loss of control on
+ * the actual request service order. In this respect, when the drive is
+ * allowed to enqueue more than one request at a time, the service
+ * distribution enforced by the drive's internal scheduler is likely to
+ * coincide with the desired device-throughput distribution only in the
+ * following, perfectly symmetric, scenario:
+ * 1) all active queues have the same weight,
+ * 2) all active groups at the same level in the groups tree have the same
+ *    weight,
+ * 3) all active groups at the same level in the groups tree have the same
+ *    number of children.
+ *
+ * Even in such a scenario, sequential I/O may still receive a preferential
+ * treatment, but this is not likely to be a big issue with flash-based
+ * devices, because of their non-dramatic loss of throughput with random
+ * I/O.
+ *
+ * Unfortunately, keeping the necessary state for evaluating exactly the
+ * above symmetry conditions would be quite complex and time-consuming.
+ * Therefore BFQ evaluates instead the following stronger sub-conditions,
+ * for which it is much easier to maintain the needed state:
+ * 1) all active queues have the same weight,
+ * 2) all active groups have the same weight,
+ * 3) all active groups have at most one active child each.
+ * In particular, the last two conditions are always true if hierarchical
+ * support and the cgroups interface are not enabled, hence no state needs
+ * to be maintained in this case.
+ *
+ * According to the above considerations, the second component of the
+ * compound condition evaluates to true if any of the above symmetry
+ * sub-conditions does not hold, or the device is not flash-based. Therefore,
+ * if also the first component is true, then idling is allowed for a sync
+ * queue. In contrast, if all the required symmetry sub-conditions hold and
+ * the device is flash-based, then the second component, and hence the
+ * whole compound condition, evaluates to false, and no idling is performed.
+ * This helps to keep the drives' internal queues full on NCQ-capable
+ * devices, and hence to boost the throughput, without causing 'almost' any
+ * loss of service guarantees. The 'almost' follows from the fact that, if
+ * the internal queue of one such device is filled while all the
+ * sub-conditions hold, but at some point in time some sub-condition ceases
+ * to hold, then it may become impossible to let requests be served in the
+ * new desired order until all the requests already queued in the device
+ * have been served.
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
 	struct bfq_data *bfqd = bfqq->bfqd;
+#ifdef CONFIG_CGROUP_BFQIO
+#define symmetric_scenario	  (!bfqd->active_numerous_groups && \
+				   !bfq_differentiated_weights(bfqd))
+#else
+#define symmetric_scenario	  (!bfq_differentiated_weights(bfqd))
+#endif
 /*
  * Condition for expiring a non-weight-raised queue (and hence not idling
  * the device).
  */
 #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
-				   bfqd->wr_busy_queues > 0)
+				   (bfqd->wr_busy_queues > 0 || \
+				    (symmetric_scenario && \
+				     blk_queue_nonrot(bfqd->queue))))
 
 	return bfq_bfqq_sync(bfqq) && (
 		bfqq->wr_coeff > 1 ||
@@ -2821,6 +2996,10 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	bfqd->rq_in_driver--;
 	bfqq->dispatched--;
 
+	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq))
+		bfq_weights_tree_remove(bfqd, &bfqq->entity,
+					&bfqd->queue_weights_tree);
+
 	if (sync) {
 		bfqd->sync_flight--;
 		RQ_BIC(rq)->ttime.last_end_request = jiffies;
@@ -3195,11 +3374,17 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 
 	bfqd->root_group = bfqg;
 
+#ifdef CONFIG_CGROUP_BFQIO
+	bfqd->active_numerous_groups = 0;
+#endif
+
 	init_timer(&bfqd->idle_slice_timer);
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
 
 	bfqd->rq_pos_tree = RB_ROOT;
+	bfqd->queue_weights_tree = RB_ROOT;
+	bfqd->group_weights_tree = RB_ROOT;
 
 	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
 
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 73f453b..473b36a 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -308,6 +308,15 @@ up:
 	goto up;
 }
 
+static void bfq_weights_tree_add(struct bfq_data *bfqd,
+				 struct bfq_entity *entity,
+				 struct rb_root *root);
+
+static void bfq_weights_tree_remove(struct bfq_data *bfqd,
+				    struct bfq_entity *entity,
+				    struct rb_root *root);
+
+
 /**
  * bfq_active_insert - insert an entity in the active tree of its
  *                     group/device.
@@ -324,6 +333,11 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node = &entity->rb_node;
+#ifdef CONFIG_CGROUP_BFQIO
+	struct bfq_sched_data *sd = NULL;
+	struct bfq_group *bfqg = NULL;
+	struct bfq_data *bfqd = NULL;
+#endif
 
 	bfq_insert(&st->active, entity);
 
@@ -334,8 +348,22 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 
 	bfq_update_active_tree(node);
 
+#ifdef CONFIG_CGROUP_BFQIO
+	sd = entity->sched_data;
+	bfqg = container_of(sd, struct bfq_group, sched_data);
+	bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
 	if (bfqq != NULL)
 		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+#ifdef CONFIG_CGROUP_BFQIO
+	else /* bfq_group */
+		bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree);
+	if (bfqg != bfqd->root_group) {
+		bfqg->active_entities++;
+		if (bfqg->active_entities == 2)
+			bfqd->active_numerous_groups++;
+	}
+#endif
 }
 
 /**
@@ -411,6 +439,11 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node;
+#ifdef CONFIG_CGROUP_BFQIO
+	struct bfq_sched_data *sd = NULL;
+	struct bfq_group *bfqg = NULL;
+	struct bfq_data *bfqd = NULL;
+#endif
 
 	node = bfq_find_deepest(&entity->rb_node);
 	bfq_extract(&st->active, entity);
@@ -418,8 +451,23 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 	if (node != NULL)
 		bfq_update_active_tree(node);
 
+#ifdef CONFIG_CGROUP_BFQIO
+	sd = entity->sched_data;
+	bfqg = container_of(sd, struct bfq_group, sched_data);
+	bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
 	if (bfqq != NULL)
 		list_del(&bfqq->bfqq_list);
+#ifdef CONFIG_CGROUP_BFQIO
+	else /* bfq_group */
+		bfq_weights_tree_remove(bfqd, entity,
+					&bfqd->group_weights_tree);
+	if (bfqg != bfqd->root_group) {
+		bfqg->active_entities--;
+		if (bfqg->active_entities == 1)
+			bfqd->active_numerous_groups--;
+	}
+#endif
 }
 
 /**
@@ -515,6 +563,23 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 
 	if (entity->ioprio_changed) {
 		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+		unsigned short prev_weight, new_weight;
+		struct bfq_data *bfqd = NULL;
+		struct rb_root *root;
+#ifdef CONFIG_CGROUP_BFQIO
+		struct bfq_sched_data *sd;
+		struct bfq_group *bfqg;
+#endif
+
+		if (bfqq != NULL)
+			bfqd = bfqq->bfqd;
+#ifdef CONFIG_CGROUP_BFQIO
+		else {
+			sd = entity->my_sched_data;
+			bfqg = container_of(sd, struct bfq_group, sched_data);
+			bfqd = (struct bfq_data *)bfqg->bfqd;
+		}
+#endif
 
 		old_st->wsum -= entity->weight;
 
@@ -541,8 +606,31 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		 * when entity->finish <= old_st->vtime).
 		 */
 		new_st = bfq_entity_service_tree(entity);
-		entity->weight = entity->orig_weight *
-				 (bfqq != NULL ? bfqq->wr_coeff : 1);
+
+		prev_weight = entity->weight;
+		new_weight = entity->orig_weight *
+			     (bfqq != NULL ? bfqq->wr_coeff : 1);
+		/*
+		 * If the weight of the entity changes, remove the entity
+		 * from its old weight counter (if there is a counter
+		 * associated with the entity), and add it to the counter
+		 * associated with its new weight.
+		 */
+		if (prev_weight != new_weight) {
+			root = bfqq ? &bfqd->queue_weights_tree :
+				      &bfqd->group_weights_tree;
+			bfq_weights_tree_remove(bfqd, entity, root);
+		}
+		entity->weight = new_weight;
+		/*
+		 * Add the entity to its weights tree only if it is
+		 * not associated with a weight-raised queue.
+		 */
+		if (prev_weight != new_weight &&
+		    (bfqq ? bfqq->wr_coeff == 1 : 1))
+			/* If we get here, root has been initialized. */
+			bfq_weights_tree_add(bfqd, entity, root);
+
 		new_st->wsum += entity->weight;
 
 		if (new_st != old_st)
@@ -976,6 +1064,9 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
 
+	if (!bfqq->dispatched)
+		bfq_weights_tree_remove(bfqd, &bfqq->entity,
+					&bfqd->queue_weights_tree);
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues--;
 }
@@ -992,6 +1083,9 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues++;
 
+	if (!bfqq->dispatched && bfqq->wr_coeff == 1)
+		bfq_weights_tree_add(bfqd, &bfqq->entity,
+				     &bfqd->queue_weights_tree);
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues++;
 }
diff --git a/block/bfq.h b/block/bfq.h
index bda1ecb3..83c828d 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -81,8 +81,23 @@ struct bfq_sched_data {
 };
 
 /**
+ * struct bfq_weight_counter - counter of the number of all active entities
+ *                             with a given weight.
+ * @weight: weight of the entities that this counter refers to.
+ * @num_active: number of active entities with this weight.
+ * @weights_node: weights tree member (see bfq_data's @queue_weights_tree
+ *                and @group_weights_tree).
+ */
+struct bfq_weight_counter {
+	short int weight;
+	unsigned int num_active;
+	struct rb_node weights_node;
+};
+
+/**
  * struct bfq_entity - schedulable entity.
  * @rb_node: service_tree member.
+ * @weight_counter: pointer to the weight counter associated with this entity.
  * @on_st: flag, true if the entity is on a tree (either the active or
  *         the idle one of its service_tree).
  * @finish: B-WF2Q+ finish timestamp (aka F_i).
@@ -133,6 +148,7 @@ struct bfq_sched_data {
  */
 struct bfq_entity {
 	struct rb_node rb_node;
+	struct bfq_weight_counter *weight_counter;
 
 	int on_st;
 
@@ -306,6 +322,22 @@ enum bfq_device_speed {
  * @rq_pos_tree: rbtree sorted by next_request position, used when
  *               determining if two or more queues have interleaving
  *               requests (see bfq_close_cooperator()).
+ * @active_numerous_groups: number of bfq_groups containing more than one
+ *                          active @bfq_entity.
+ * @queue_weights_tree: rbtree of weight counters of @bfq_queues, sorted by
+ *                      weight. Used to keep track of whether all @bfq_queues
+ *                     have the same weight. The tree contains one counter
+ *                     for each distinct weight associated to some active
+ *                     and not weight-raised @bfq_queue (see the comments to
+ *                      the functions bfq_weights_tree_[add|remove] for
+ *                     further details).
+ * @group_weights_tree: rbtree of non-queue @bfq_entity weight counters, sorted
+ *                      by weight. Used to keep track of whether all
+ *                     @bfq_groups have the same weight. The tree contains
+ *                     one counter for each distinct weight associated to
+ *                     some active @bfq_group (see the comments to the
+ *                     functions bfq_weights_tree_[add|remove] for further
+ *                     details).
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @wr_busy_queues: number of weight-raised busy @bfq_queues.
@@ -374,6 +406,13 @@ struct bfq_data {
 	struct bfq_group *root_group;
 	struct rb_root rq_pos_tree;
 
+#ifdef CONFIG_CGROUP_BFQIO
+	int active_numerous_groups;
+#endif
+
+	struct rb_root queue_weights_tree;
+	struct rb_root group_weights_tree;
+
 	int busy_queues;
 	int wr_busy_queues;
 	int queued;
@@ -517,6 +556,11 @@ enum bfqq_expiration {
  * @my_entity: pointer to @entity, %NULL for the toplevel group; used
  *             to avoid too many special cases during group creation/
  *             migration.
+ * @active_entities: number of active entities belonging to the group;
+ *                   unused for the root group. Used to know whether there
+ *                   are groups with more than one active @bfq_entity
+ *                   (see the comments to the function
+ *                   bfq_bfqq_must_not_expire()).
  *
  * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
  * there is a set of bfq_groups, each one collecting the lower-level
@@ -542,6 +586,8 @@ struct bfq_group {
 	struct bfq_queue *async_idle_bfqq;
 
 	struct bfq_entity *my_entity;
+
+	int active_entities;
 };
 
 /**
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
       [not found]           ` <1401354343-5527-1-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-05-29  9:05             ` Paolo Valente
  2014-05-29  9:05               ` Paolo Valente
                               ` (11 subsequent siblings)
  12 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

This patch boosts the throughput on NCQ-capable flash-based devices,
while still preserving latency guarantees for interactive and soft
real-time applications. The throughput is boosted by just not idling
the device when the in-service queue remains empty, even if the queue
is sync and has a non-null idle window. This helps to keep the drive's
internal queue full, which is necessary to achieve maximum
performance. This solution to boost the throughput is a port of
commits a68bbdd and f7d7b7a for CFQ.

As already highlighted in patch 10, allowing the device to prefetch
and internally reorder requests trivially causes loss of control on
the request service order, and hence on service guarantees.
Fortunately, as discussed in detail in the comments to the function
bfq_bfqq_must_not_expire(), if every process has to receive the same
fraction of the throughput, then the service order enforced by the
internal scheduler of a flash-based device is relatively close to that
enforced by BFQ. In particular, it is close enough to let service
guarantees be substantially preserved.

Things change in an asymmetric scenario, i.e., if not every process
has to receive the same fraction of the throughput. In this case, to
guarantee the desired throughput distribution, the device must be
prevented from prefetching requests. This is exactly what this patch
does in asymmetric scenarios.

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-cgroup.c  |   1 +
 block/bfq-iosched.c | 205 +++++++++++++++++++++++++++++++++++++++++++++++++---
 block/bfq-sched.c   |  98 ++++++++++++++++++++++++-
 block/bfq.h         |  46 ++++++++++++
 4 files changed, 338 insertions(+), 12 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 1cb25aa..d338a54 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -85,6 +85,7 @@ static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
 	entity->ioprio = entity->new_ioprio;
 	entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
 	entity->my_sched_data = &bfqg->sched_data;
+	bfqg->active_entities = 0;
 }
 
 static inline void bfq_group_set_parent(struct bfq_group *bfqg,
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 22d4caa..49856e1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -364,6 +364,120 @@ static struct request *bfq_choose_req(struct bfq_data *bfqd,
 	}
 }
 
+/*
+ * Tell whether there are active queues or groups with differentiated weights.
+ */
+static inline bool bfq_differentiated_weights(struct bfq_data *bfqd)
+{
+	/*
+	 * For weights to differ, at least one of the trees must contain
+	 * at least two nodes.
+	 */
+	return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
+		(bfqd->queue_weights_tree.rb_node->rb_left ||
+		 bfqd->queue_weights_tree.rb_node->rb_right)
+#ifdef CONFIG_CGROUP_BFQIO
+	       ) ||
+	       (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
+		(bfqd->group_weights_tree.rb_node->rb_left ||
+		 bfqd->group_weights_tree.rb_node->rb_right)
+#endif
+	       );
+}
+
+/*
+ * If the weight-counter tree passed as input contains no counter for
+ * the weight of the input entity, then add that counter; otherwise just
+ * increment the existing counter.
+ *
+ * Note that weight-counter trees contain few nodes in mostly symmetric
+ * scenarios. For example, if all queues have the same weight, then the
+ * weight-counter tree for the queues may contain at most one node.
+ * This holds even if low_latency is on, because weight-raised queues
+ * are not inserted in the tree.
+ * In most scenarios, the rate at which nodes are created/destroyed
+ * should be low too.
+ */
+static void bfq_weights_tree_add(struct bfq_data *bfqd,
+				 struct bfq_entity *entity,
+				 struct rb_root *root)
+{
+	struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+	/*
+	 * Do not insert if:
+	 * - the device does not support queueing;
+	 * - the entity is already associated with a counter, which happens if:
+	 *   1) the entity is associated with a queue, 2) a request arrival
+	 *   has caused the queue to become both non-weight-raised, and hence
+	 *   change its weight, and backlogged; in this respect, each
+	 *   of the two events causes an invocation of this function,
+	 *   3) this is the invocation of this function caused by the second
+	 *   event. This second invocation is actually useless, and we handle
+	 *   this fact by exiting immediately. More efficient or clearer
+	 *   solutions might possibly be adopted.
+	 */
+	if (!bfqd->hw_tag || entity->weight_counter)
+		return;
+
+	while (*new) {
+		struct bfq_weight_counter *__counter = container_of(*new,
+						struct bfq_weight_counter,
+						weights_node);
+		parent = *new;
+
+		if (entity->weight == __counter->weight) {
+			entity->weight_counter = __counter;
+			goto inc_counter;
+		}
+		if (entity->weight < __counter->weight)
+			new = &((*new)->rb_left);
+		else
+			new = &((*new)->rb_right);
+	}
+
+	entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
+					 GFP_ATOMIC);
+	entity->weight_counter->weight = entity->weight;
+	rb_link_node(&entity->weight_counter->weights_node, parent, new);
+	rb_insert_color(&entity->weight_counter->weights_node, root);
+
+inc_counter:
+	entity->weight_counter->num_active++;
+}
+
+/*
+ * Decrement the weight counter associated with the entity, and, if the
+ * counter reaches 0, remove the counter from the tree.
+ * See the comments to the function bfq_weights_tree_add() for considerations
+ * about overhead.
+ */
+static void bfq_weights_tree_remove(struct bfq_data *bfqd,
+				    struct bfq_entity *entity,
+				    struct rb_root *root)
+{
+	/*
+	 * Check whether the entity is actually associated with a counter.
+	 * In fact, the device may not be considered NCQ-capable for a while,
+	 * which implies that no insertion in the weight trees is performed,
+	 * after which the device may start to be deemed NCQ-capable, and hence
+	 * this function may start to be invoked. This may cause the function
+	 * to be invoked for entities that are not associated with any counter.
+	 */
+	if (!entity->weight_counter)
+		return;
+
+	entity->weight_counter->num_active--;
+	if (entity->weight_counter->num_active > 0)
+		goto reset_entity_pointer;
+
+	rb_erase(&entity->weight_counter->weights_node, root);
+	kfree(entity->weight_counter);
+
+reset_entity_pointer:
+	entity->weight_counter = NULL;
+}
+
 static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
 					struct bfq_queue *bfqq,
 					struct request *last)
@@ -1906,16 +2020,17 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * two conditions holds. The first condition is that the device is not
  * performing NCQ, because idling the device most certainly boosts the
  * throughput if this condition holds and bfqq has been granted a non-null
- * idle window.
+ * idle window. The second compound condition is made of the logical AND of
+ * two components.
  *
- * The second condition is that there is no weight-raised busy queue,
- * which guarantees that the device is not idled for a sync non-weight-
- * raised queue when there are busy weight-raised queues. The former is
- * then expired immediately if empty. Combined with the timestamping rules
- * of BFQ (see [1] for details), this causes sync non-weight-raised queues
- * to get a lower number of requests served, and hence to ask for a lower
- * number of requests from the request pool, before the busy weight-raised
- * queues get served again.
+ * The first component is true only if there is no weight-raised busy
+ * queue. This guarantees that the device is not idled for a sync non-
+ * weight-raised queue when there are busy weight-raised queues. The former
+ * is then expired immediately if empty. Combined with the timestamping
+ * rules of BFQ (see [1] for details), this causes sync non-weight-raised
+ * queues to get a lower number of requests served, and hence to ask for a
+ * lower number of requests from the request pool, before the busy weight-
+ * raised queues get served again.
  *
  * This is beneficial for the processes associated with weight-raised
  * queues, when the request pool is saturated (e.g., in the presence of
@@ -1932,16 +2047,76 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * weight-raised queues seems to mitigate starvation problems in the
  * presence of heavy write workloads and NCQ, and hence to guarantee a
  * higher application and system responsiveness in these hostile scenarios.
+ *
+ * If the first component of the compound condition is instead true, i.e.,
+ * there is no weight-raised busy queue, then the second component of the
+ * compound condition takes into account service-guarantee and throughput
+ * issues related to NCQ (recall that the compound condition is evaluated
+ * only if the device is detected as supporting NCQ).
+ *
+ * As for service guarantees, allowing the drive to enqueue more than one
+ * request at a time, and hence delegating de facto final scheduling
+ * decisions to the drive's internal scheduler, causes loss of control on
+ * the actual request service order. In this respect, when the drive is
+ * allowed to enqueue more than one request at a time, the service
+ * distribution enforced by the drive's internal scheduler is likely to
+ * coincide with the desired device-throughput distribution only in the
+ * following, perfectly symmetric, scenario:
+ * 1) all active queues have the same weight,
+ * 2) all active groups at the same level in the groups tree have the same
+ *    weight,
+ * 3) all active groups at the same level in the groups tree have the same
+ *    number of children.
+ *
+ * Even in such a scenario, sequential I/O may still receive a preferential
+ * treatment, but this is not likely to be a big issue with flash-based
+ * devices, because of their non-dramatic loss of throughput with random
+ * I/O.
+ *
+ * Unfortunately, keeping the necessary state for evaluating exactly the
+ * above symmetry conditions would be quite complex and time-consuming.
+ * Therefore BFQ evaluates instead the following stronger sub-conditions,
+ * for which it is much easier to maintain the needed state:
+ * 1) all active queues have the same weight,
+ * 2) all active groups have the same weight,
+ * 3) all active groups have at most one active child each.
+ * In particular, the last two conditions are always true if hierarchical
+ * support and the cgroups interface are not enabled, hence no state needs
+ * to be maintained in this case.
+ *
+ * According to the above considerations, the second component of the
+ * compound condition evaluates to true if any of the above symmetry
+ * sub-conditions does not hold, or the device is not flash-based. Therefore,
+ * if also the first component is true, then idling is allowed for a sync
+ * queue. In contrast, if all the required symmetry sub-conditions hold and
+ * the device is flash-based, then the second component, and hence the
+ * whole compound condition, evaluates to false, and no idling is performed.
+ * This helps to keep the drives' internal queues full on NCQ-capable
+ * devices, and hence to boost the throughput, without causing 'almost' any
+ * loss of service guarantees. The 'almost' follows from the fact that, if
+ * the internal queue of one such device is filled while all the
+ * sub-conditions hold, but at some point in time some sub-condition ceases
+ * to hold, then it may become impossible to let requests be served in the
+ * new desired order until all the requests already queued in the device
+ * have been served.
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
 	struct bfq_data *bfqd = bfqq->bfqd;
+#ifdef CONFIG_CGROUP_BFQIO
+#define symmetric_scenario	  (!bfqd->active_numerous_groups && \
+				   !bfq_differentiated_weights(bfqd))
+#else
+#define symmetric_scenario	  (!bfq_differentiated_weights(bfqd))
+#endif
 /*
  * Condition for expiring a non-weight-raised queue (and hence not idling
  * the device).
  */
 #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
-				   bfqd->wr_busy_queues > 0)
+				   (bfqd->wr_busy_queues > 0 || \
+				    (symmetric_scenario && \
+				     blk_queue_nonrot(bfqd->queue))))
 
 	return bfq_bfqq_sync(bfqq) && (
 		bfqq->wr_coeff > 1 ||
@@ -2821,6 +2996,10 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	bfqd->rq_in_driver--;
 	bfqq->dispatched--;
 
+	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq))
+		bfq_weights_tree_remove(bfqd, &bfqq->entity,
+					&bfqd->queue_weights_tree);
+
 	if (sync) {
 		bfqd->sync_flight--;
 		RQ_BIC(rq)->ttime.last_end_request = jiffies;
@@ -3195,11 +3374,17 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 
 	bfqd->root_group = bfqg;
 
+#ifdef CONFIG_CGROUP_BFQIO
+	bfqd->active_numerous_groups = 0;
+#endif
+
 	init_timer(&bfqd->idle_slice_timer);
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
 
 	bfqd->rq_pos_tree = RB_ROOT;
+	bfqd->queue_weights_tree = RB_ROOT;
+	bfqd->group_weights_tree = RB_ROOT;
 
 	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
 
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 73f453b..473b36a 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -308,6 +308,15 @@ up:
 	goto up;
 }
 
+static void bfq_weights_tree_add(struct bfq_data *bfqd,
+				 struct bfq_entity *entity,
+				 struct rb_root *root);
+
+static void bfq_weights_tree_remove(struct bfq_data *bfqd,
+				    struct bfq_entity *entity,
+				    struct rb_root *root);
+
+
 /**
  * bfq_active_insert - insert an entity in the active tree of its
  *                     group/device.
@@ -324,6 +333,11 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node = &entity->rb_node;
+#ifdef CONFIG_CGROUP_BFQIO
+	struct bfq_sched_data *sd = NULL;
+	struct bfq_group *bfqg = NULL;
+	struct bfq_data *bfqd = NULL;
+#endif
 
 	bfq_insert(&st->active, entity);
 
@@ -334,8 +348,22 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 
 	bfq_update_active_tree(node);
 
+#ifdef CONFIG_CGROUP_BFQIO
+	sd = entity->sched_data;
+	bfqg = container_of(sd, struct bfq_group, sched_data);
+	bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
 	if (bfqq != NULL)
 		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+#ifdef CONFIG_CGROUP_BFQIO
+	else /* bfq_group */
+		bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree);
+	if (bfqg != bfqd->root_group) {
+		bfqg->active_entities++;
+		if (bfqg->active_entities == 2)
+			bfqd->active_numerous_groups++;
+	}
+#endif
 }
 
 /**
@@ -411,6 +439,11 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node;
+#ifdef CONFIG_CGROUP_BFQIO
+	struct bfq_sched_data *sd = NULL;
+	struct bfq_group *bfqg = NULL;
+	struct bfq_data *bfqd = NULL;
+#endif
 
 	node = bfq_find_deepest(&entity->rb_node);
 	bfq_extract(&st->active, entity);
@@ -418,8 +451,23 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 	if (node != NULL)
 		bfq_update_active_tree(node);
 
+#ifdef CONFIG_CGROUP_BFQIO
+	sd = entity->sched_data;
+	bfqg = container_of(sd, struct bfq_group, sched_data);
+	bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
 	if (bfqq != NULL)
 		list_del(&bfqq->bfqq_list);
+#ifdef CONFIG_CGROUP_BFQIO
+	else /* bfq_group */
+		bfq_weights_tree_remove(bfqd, entity,
+					&bfqd->group_weights_tree);
+	if (bfqg != bfqd->root_group) {
+		bfqg->active_entities--;
+		if (bfqg->active_entities == 1)
+			bfqd->active_numerous_groups--;
+	}
+#endif
 }
 
 /**
@@ -515,6 +563,23 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 
 	if (entity->ioprio_changed) {
 		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+		unsigned short prev_weight, new_weight;
+		struct bfq_data *bfqd = NULL;
+		struct rb_root *root;
+#ifdef CONFIG_CGROUP_BFQIO
+		struct bfq_sched_data *sd;
+		struct bfq_group *bfqg;
+#endif
+
+		if (bfqq != NULL)
+			bfqd = bfqq->bfqd;
+#ifdef CONFIG_CGROUP_BFQIO
+		else {
+			sd = entity->my_sched_data;
+			bfqg = container_of(sd, struct bfq_group, sched_data);
+			bfqd = (struct bfq_data *)bfqg->bfqd;
+		}
+#endif
 
 		old_st->wsum -= entity->weight;
 
@@ -541,8 +606,31 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		 * when entity->finish <= old_st->vtime).
 		 */
 		new_st = bfq_entity_service_tree(entity);
-		entity->weight = entity->orig_weight *
-				 (bfqq != NULL ? bfqq->wr_coeff : 1);
+
+		prev_weight = entity->weight;
+		new_weight = entity->orig_weight *
+			     (bfqq != NULL ? bfqq->wr_coeff : 1);
+		/*
+		 * If the weight of the entity changes, remove the entity
+		 * from its old weight counter (if there is a counter
+		 * associated with the entity), and add it to the counter
+		 * associated with its new weight.
+		 */
+		if (prev_weight != new_weight) {
+			root = bfqq ? &bfqd->queue_weights_tree :
+				      &bfqd->group_weights_tree;
+			bfq_weights_tree_remove(bfqd, entity, root);
+		}
+		entity->weight = new_weight;
+		/*
+		 * Add the entity to its weights tree only if it is
+		 * not associated with a weight-raised queue.
+		 */
+		if (prev_weight != new_weight &&
+		    (bfqq ? bfqq->wr_coeff == 1 : 1))
+			/* If we get here, root has been initialized. */
+			bfq_weights_tree_add(bfqd, entity, root);
+
 		new_st->wsum += entity->weight;
 
 		if (new_st != old_st)
@@ -976,6 +1064,9 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
 
+	if (!bfqq->dispatched)
+		bfq_weights_tree_remove(bfqd, &bfqq->entity,
+					&bfqd->queue_weights_tree);
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues--;
 }
@@ -992,6 +1083,9 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues++;
 
+	if (!bfqq->dispatched && bfqq->wr_coeff == 1)
+		bfq_weights_tree_add(bfqd, &bfqq->entity,
+				     &bfqd->queue_weights_tree);
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues++;
 }
diff --git a/block/bfq.h b/block/bfq.h
index bda1ecb3..83c828d 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -81,8 +81,23 @@ struct bfq_sched_data {
 };
 
 /**
+ * struct bfq_weight_counter - counter of the number of all active entities
+ *                             with a given weight.
+ * @weight: weight of the entities that this counter refers to.
+ * @num_active: number of active entities with this weight.
+ * @weights_node: weights tree member (see bfq_data's @queue_weights_tree
+ *                and @group_weights_tree).
+ */
+struct bfq_weight_counter {
+	short int weight;
+	unsigned int num_active;
+	struct rb_node weights_node;
+};
+
+/**
  * struct bfq_entity - schedulable entity.
  * @rb_node: service_tree member.
+ * @weight_counter: pointer to the weight counter associated with this entity.
  * @on_st: flag, true if the entity is on a tree (either the active or
  *         the idle one of its service_tree).
  * @finish: B-WF2Q+ finish timestamp (aka F_i).
@@ -133,6 +148,7 @@ struct bfq_sched_data {
  */
 struct bfq_entity {
 	struct rb_node rb_node;
+	struct bfq_weight_counter *weight_counter;
 
 	int on_st;
 
@@ -306,6 +322,22 @@ enum bfq_device_speed {
  * @rq_pos_tree: rbtree sorted by next_request position, used when
  *               determining if two or more queues have interleaving
  *               requests (see bfq_close_cooperator()).
+ * @active_numerous_groups: number of bfq_groups containing more than one
+ *                          active @bfq_entity.
+ * @queue_weights_tree: rbtree of weight counters of @bfq_queues, sorted by
+ *                      weight. Used to keep track of whether all @bfq_queues
+ *                     have the same weight. The tree contains one counter
+ *                     for each distinct weight associated to some active
+ *                     and not weight-raised @bfq_queue (see the comments to
+ *                      the functions bfq_weights_tree_[add|remove] for
+ *                     further details).
+ * @group_weights_tree: rbtree of non-queue @bfq_entity weight counters, sorted
+ *                      by weight. Used to keep track of whether all
+ *                     @bfq_groups have the same weight. The tree contains
+ *                     one counter for each distinct weight associated to
+ *                     some active @bfq_group (see the comments to the
+ *                     functions bfq_weights_tree_[add|remove] for further
+ *                     details).
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @wr_busy_queues: number of weight-raised busy @bfq_queues.
@@ -374,6 +406,13 @@ struct bfq_data {
 	struct bfq_group *root_group;
 	struct rb_root rq_pos_tree;
 
+#ifdef CONFIG_CGROUP_BFQIO
+	int active_numerous_groups;
+#endif
+
+	struct rb_root queue_weights_tree;
+	struct rb_root group_weights_tree;
+
 	int busy_queues;
 	int wr_busy_queues;
 	int queued;
@@ -517,6 +556,11 @@ enum bfqq_expiration {
  * @my_entity: pointer to @entity, %NULL for the toplevel group; used
  *             to avoid too many special cases during group creation/
  *             migration.
+ * @active_entities: number of active entities belonging to the group;
+ *                   unused for the root group. Used to know whether there
+ *                   are groups with more than one active @bfq_entity
+ *                   (see the comments to the function
+ *                   bfq_bfqq_must_not_expire()).
  *
  * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
  * there is a set of bfq_groups, each one collecting the lower-level
@@ -542,6 +586,8 @@ struct bfq_group {
 	struct bfq_queue *async_idle_bfqq;
 
 	struct bfq_entity *my_entity;
+
+	int active_entities;
 };
 
 /**
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
@ 2014-05-29  9:05             ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

This patch boosts the throughput on NCQ-capable flash-based devices,
while still preserving latency guarantees for interactive and soft
real-time applications. The throughput is boosted by just not idling
the device when the in-service queue remains empty, even if the queue
is sync and has a non-null idle window. This helps to keep the drive's
internal queue full, which is necessary to achieve maximum
performance. This solution to boost the throughput is a port of
commits a68bbdd and f7d7b7a for CFQ.

As already highlighted in patch 10, allowing the device to prefetch
and internally reorder requests trivially causes loss of control on
the request service order, and hence on service guarantees.
Fortunately, as discussed in detail in the comments to the function
bfq_bfqq_must_not_expire(), if every process has to receive the same
fraction of the throughput, then the service order enforced by the
internal scheduler of a flash-based device is relatively close to that
enforced by BFQ. In particular, it is close enough to let service
guarantees be substantially preserved.

Things change in an asymmetric scenario, i.e., if not every process
has to receive the same fraction of the throughput. In this case, to
guarantee the desired throughput distribution, the device must be
prevented from prefetching requests. This is exactly what this patch
does in asymmetric scenarios.

Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/bfq-cgroup.c  |   1 +
 block/bfq-iosched.c | 205 +++++++++++++++++++++++++++++++++++++++++++++++++---
 block/bfq-sched.c   |  98 ++++++++++++++++++++++++-
 block/bfq.h         |  46 ++++++++++++
 4 files changed, 338 insertions(+), 12 deletions(-)

diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index 1cb25aa..d338a54 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -85,6 +85,7 @@ static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
 	entity->ioprio = entity->new_ioprio;
 	entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
 	entity->my_sched_data = &bfqg->sched_data;
+	bfqg->active_entities = 0;
 }
 
 static inline void bfq_group_set_parent(struct bfq_group *bfqg,
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 22d4caa..49856e1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -364,6 +364,120 @@ static struct request *bfq_choose_req(struct bfq_data *bfqd,
 	}
 }
 
+/*
+ * Tell whether there are active queues or groups with differentiated weights.
+ */
+static inline bool bfq_differentiated_weights(struct bfq_data *bfqd)
+{
+	/*
+	 * For weights to differ, at least one of the trees must contain
+	 * at least two nodes.
+	 */
+	return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
+		(bfqd->queue_weights_tree.rb_node->rb_left ||
+		 bfqd->queue_weights_tree.rb_node->rb_right)
+#ifdef CONFIG_CGROUP_BFQIO
+	       ) ||
+	       (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
+		(bfqd->group_weights_tree.rb_node->rb_left ||
+		 bfqd->group_weights_tree.rb_node->rb_right)
+#endif
+	       );
+}
+
+/*
+ * If the weight-counter tree passed as input contains no counter for
+ * the weight of the input entity, then add that counter; otherwise just
+ * increment the existing counter.
+ *
+ * Note that weight-counter trees contain few nodes in mostly symmetric
+ * scenarios. For example, if all queues have the same weight, then the
+ * weight-counter tree for the queues may contain at most one node.
+ * This holds even if low_latency is on, because weight-raised queues
+ * are not inserted in the tree.
+ * In most scenarios, the rate at which nodes are created/destroyed
+ * should be low too.
+ */
+static void bfq_weights_tree_add(struct bfq_data *bfqd,
+				 struct bfq_entity *entity,
+				 struct rb_root *root)
+{
+	struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+	/*
+	 * Do not insert if:
+	 * - the device does not support queueing;
+	 * - the entity is already associated with a counter, which happens if:
+	 *   1) the entity is associated with a queue, 2) a request arrival
+	 *   has caused the queue to become both non-weight-raised, and hence
+	 *   change its weight, and backlogged; in this respect, each
+	 *   of the two events causes an invocation of this function,
+	 *   3) this is the invocation of this function caused by the second
+	 *   event. This second invocation is actually useless, and we handle
+	 *   this fact by exiting immediately. More efficient or clearer
+	 *   solutions might possibly be adopted.
+	 */
+	if (!bfqd->hw_tag || entity->weight_counter)
+		return;
+
+	while (*new) {
+		struct bfq_weight_counter *__counter = container_of(*new,
+						struct bfq_weight_counter,
+						weights_node);
+		parent = *new;
+
+		if (entity->weight == __counter->weight) {
+			entity->weight_counter = __counter;
+			goto inc_counter;
+		}
+		if (entity->weight < __counter->weight)
+			new = &((*new)->rb_left);
+		else
+			new = &((*new)->rb_right);
+	}
+
+	entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
+					 GFP_ATOMIC);
+	entity->weight_counter->weight = entity->weight;
+	rb_link_node(&entity->weight_counter->weights_node, parent, new);
+	rb_insert_color(&entity->weight_counter->weights_node, root);
+
+inc_counter:
+	entity->weight_counter->num_active++;
+}
+
+/*
+ * Decrement the weight counter associated with the entity, and, if the
+ * counter reaches 0, remove the counter from the tree.
+ * See the comments to the function bfq_weights_tree_add() for considerations
+ * about overhead.
+ */
+static void bfq_weights_tree_remove(struct bfq_data *bfqd,
+				    struct bfq_entity *entity,
+				    struct rb_root *root)
+{
+	/*
+	 * Check whether the entity is actually associated with a counter.
+	 * In fact, the device may not be considered NCQ-capable for a while,
+	 * which implies that no insertion in the weight trees is performed,
+	 * after which the device may start to be deemed NCQ-capable, and hence
+	 * this function may start to be invoked. This may cause the function
+	 * to be invoked for entities that are not associated with any counter.
+	 */
+	if (!entity->weight_counter)
+		return;
+
+	entity->weight_counter->num_active--;
+	if (entity->weight_counter->num_active > 0)
+		goto reset_entity_pointer;
+
+	rb_erase(&entity->weight_counter->weights_node, root);
+	kfree(entity->weight_counter);
+
+reset_entity_pointer:
+	entity->weight_counter = NULL;
+}
+
 static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
 					struct bfq_queue *bfqq,
 					struct request *last)
@@ -1906,16 +2020,17 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * two conditions holds. The first condition is that the device is not
  * performing NCQ, because idling the device most certainly boosts the
  * throughput if this condition holds and bfqq has been granted a non-null
- * idle window.
+ * idle window. The second compound condition is made of the logical AND of
+ * two components.
  *
- * The second condition is that there is no weight-raised busy queue,
- * which guarantees that the device is not idled for a sync non-weight-
- * raised queue when there are busy weight-raised queues. The former is
- * then expired immediately if empty. Combined with the timestamping rules
- * of BFQ (see [1] for details), this causes sync non-weight-raised queues
- * to get a lower number of requests served, and hence to ask for a lower
- * number of requests from the request pool, before the busy weight-raised
- * queues get served again.
+ * The first component is true only if there is no weight-raised busy
+ * queue. This guarantees that the device is not idled for a sync non-
+ * weight-raised queue when there are busy weight-raised queues. The former
+ * is then expired immediately if empty. Combined with the timestamping
+ * rules of BFQ (see [1] for details), this causes sync non-weight-raised
+ * queues to get a lower number of requests served, and hence to ask for a
+ * lower number of requests from the request pool, before the busy weight-
+ * raised queues get served again.
  *
  * This is beneficial for the processes associated with weight-raised
  * queues, when the request pool is saturated (e.g., in the presence of
@@ -1932,16 +2047,76 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * weight-raised queues seems to mitigate starvation problems in the
  * presence of heavy write workloads and NCQ, and hence to guarantee a
  * higher application and system responsiveness in these hostile scenarios.
+ *
+ * If the first component of the compound condition is instead true, i.e.,
+ * there is no weight-raised busy queue, then the second component of the
+ * compound condition takes into account service-guarantee and throughput
+ * issues related to NCQ (recall that the compound condition is evaluated
+ * only if the device is detected as supporting NCQ).
+ *
+ * As for service guarantees, allowing the drive to enqueue more than one
+ * request at a time, and hence delegating de facto final scheduling
+ * decisions to the drive's internal scheduler, causes loss of control on
+ * the actual request service order. In this respect, when the drive is
+ * allowed to enqueue more than one request at a time, the service
+ * distribution enforced by the drive's internal scheduler is likely to
+ * coincide with the desired device-throughput distribution only in the
+ * following, perfectly symmetric, scenario:
+ * 1) all active queues have the same weight,
+ * 2) all active groups at the same level in the groups tree have the same
+ *    weight,
+ * 3) all active groups at the same level in the groups tree have the same
+ *    number of children.
+ *
+ * Even in such a scenario, sequential I/O may still receive a preferential
+ * treatment, but this is not likely to be a big issue with flash-based
+ * devices, because of their non-dramatic loss of throughput with random
+ * I/O.
+ *
+ * Unfortunately, keeping the necessary state for evaluating exactly the
+ * above symmetry conditions would be quite complex and time-consuming.
+ * Therefore BFQ evaluates instead the following stronger sub-conditions,
+ * for which it is much easier to maintain the needed state:
+ * 1) all active queues have the same weight,
+ * 2) all active groups have the same weight,
+ * 3) all active groups have at most one active child each.
+ * In particular, the last two conditions are always true if hierarchical
+ * support and the cgroups interface are not enabled, hence no state needs
+ * to be maintained in this case.
+ *
+ * According to the above considerations, the second component of the
+ * compound condition evaluates to true if any of the above symmetry
+ * sub-condition does not hold, or the device is not flash-based. Therefore,
+ * if also the first component is true, then idling is allowed for a sync
+ * queue. In contrast, if all the required symmetry sub-conditions hold and
+ * the device is flash-based, then the second component, and hence the
+ * whole compound condition, evaluates to false, and no idling is performed.
+ * This helps to keep the drives' internal queues full on NCQ-capable
+ * devices, and hence to boost the throughput, without causing 'almost' any
+ * loss of service guarantees. The 'almost' follows from the fact that, if
+ * the internal queue of one such device is filled while all the
+ * sub-conditions hold, but at some point in time some sub-condition stops
+ * to hold, then it may become impossible to let requests be served in the
+ * new desired order until all the requests already queued in the device
+ * have been served.
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
 	struct bfq_data *bfqd = bfqq->bfqd;
+#ifdef CONFIG_CGROUP_BFQIO
+#define symmetric_scenario	  (!bfqd->active_numerous_groups && \
+				   !bfq_differentiated_weights(bfqd))
+#else
+#define symmetric_scenario	  (!bfq_differentiated_weights(bfqd))
+#endif
 /*
  * Condition for expiring a non-weight-raised queue (and hence not idling
  * the device).
  */
 #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
-				   bfqd->wr_busy_queues > 0)
+				   (bfqd->wr_busy_queues > 0 || \
+				    (symmetric_scenario && \
+				     blk_queue_nonrot(bfqd->queue))))
 
 	return bfq_bfqq_sync(bfqq) && (
 		bfqq->wr_coeff > 1 ||
@@ -2821,6 +2996,10 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	bfqd->rq_in_driver--;
 	bfqq->dispatched--;
 
+	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq))
+		bfq_weights_tree_remove(bfqd, &bfqq->entity,
+					&bfqd->queue_weights_tree);
+
 	if (sync) {
 		bfqd->sync_flight--;
 		RQ_BIC(rq)->ttime.last_end_request = jiffies;
@@ -3195,11 +3374,17 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 
 	bfqd->root_group = bfqg;
 
+#ifdef CONFIG_CGROUP_BFQIO
+	bfqd->active_numerous_groups = 0;
+#endif
+
 	init_timer(&bfqd->idle_slice_timer);
 	bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
 	bfqd->idle_slice_timer.data = (unsigned long)bfqd;
 
 	bfqd->rq_pos_tree = RB_ROOT;
+	bfqd->queue_weights_tree = RB_ROOT;
+	bfqd->group_weights_tree = RB_ROOT;
 
 	INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
 
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 73f453b..473b36a 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -308,6 +308,15 @@ up:
 	goto up;
 }
 
+static void bfq_weights_tree_add(struct bfq_data *bfqd,
+				 struct bfq_entity *entity,
+				 struct rb_root *root);
+
+static void bfq_weights_tree_remove(struct bfq_data *bfqd,
+				    struct bfq_entity *entity,
+				    struct rb_root *root);
+
+
 /**
  * bfq_active_insert - insert an entity in the active tree of its
  *                     group/device.
@@ -324,6 +333,11 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node = &entity->rb_node;
+#ifdef CONFIG_CGROUP_BFQIO
+	struct bfq_sched_data *sd = NULL;
+	struct bfq_group *bfqg = NULL;
+	struct bfq_data *bfqd = NULL;
+#endif
 
 	bfq_insert(&st->active, entity);
 
@@ -334,8 +348,22 @@ static void bfq_active_insert(struct bfq_service_tree *st,
 
 	bfq_update_active_tree(node);
 
+#ifdef CONFIG_CGROUP_BFQIO
+	sd = entity->sched_data;
+	bfqg = container_of(sd, struct bfq_group, sched_data);
+	bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
 	if (bfqq != NULL)
 		list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
+#ifdef CONFIG_CGROUP_BFQIO
+	else /* bfq_group */
+		bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree);
+	if (bfqg != bfqd->root_group) {
+		bfqg->active_entities++;
+		if (bfqg->active_entities == 2)
+			bfqd->active_numerous_groups++;
+	}
+#endif
 }
 
 /**
@@ -411,6 +439,11 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 {
 	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
 	struct rb_node *node;
+#ifdef CONFIG_CGROUP_BFQIO
+	struct bfq_sched_data *sd = NULL;
+	struct bfq_group *bfqg = NULL;
+	struct bfq_data *bfqd = NULL;
+#endif
 
 	node = bfq_find_deepest(&entity->rb_node);
 	bfq_extract(&st->active, entity);
@@ -418,8 +451,23 @@ static void bfq_active_extract(struct bfq_service_tree *st,
 	if (node != NULL)
 		bfq_update_active_tree(node);
 
+#ifdef CONFIG_CGROUP_BFQIO
+	sd = entity->sched_data;
+	bfqg = container_of(sd, struct bfq_group, sched_data);
+	bfqd = (struct bfq_data *)bfqg->bfqd;
+#endif
 	if (bfqq != NULL)
 		list_del(&bfqq->bfqq_list);
+#ifdef CONFIG_CGROUP_BFQIO
+	else /* bfq_group */
+		bfq_weights_tree_remove(bfqd, entity,
+					&bfqd->group_weights_tree);
+	if (bfqg != bfqd->root_group) {
+		bfqg->active_entities--;
+		if (bfqg->active_entities == 1)
+			bfqd->active_numerous_groups--;
+	}
+#endif
 }
 
 /**
@@ -515,6 +563,23 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 
 	if (entity->ioprio_changed) {
 		struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
+		unsigned short prev_weight, new_weight;
+		struct bfq_data *bfqd = NULL;
+		struct rb_root *root;
+#ifdef CONFIG_CGROUP_BFQIO
+		struct bfq_sched_data *sd;
+		struct bfq_group *bfqg;
+#endif
+
+		if (bfqq != NULL)
+			bfqd = bfqq->bfqd;
+#ifdef CONFIG_CGROUP_BFQIO
+		else {
+			sd = entity->my_sched_data;
+			bfqg = container_of(sd, struct bfq_group, sched_data);
+			bfqd = (struct bfq_data *)bfqg->bfqd;
+		}
+#endif
 
 		old_st->wsum -= entity->weight;
 
@@ -541,8 +606,31 @@ __bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
 		 * when entity->finish <= old_st->vtime).
 		 */
 		new_st = bfq_entity_service_tree(entity);
-		entity->weight = entity->orig_weight *
-				 (bfqq != NULL ? bfqq->wr_coeff : 1);
+
+		prev_weight = entity->weight;
+		new_weight = entity->orig_weight *
+			     (bfqq != NULL ? bfqq->wr_coeff : 1);
+		/*
+		 * If the weight of the entity changes, remove the entity
+		 * from its old weight counter (if there is a counter
+		 * associated with the entity), and add it to the counter
+		 * associated with its new weight.
+		 */
+		if (prev_weight != new_weight) {
+			root = bfqq ? &bfqd->queue_weights_tree :
+				      &bfqd->group_weights_tree;
+			bfq_weights_tree_remove(bfqd, entity, root);
+		}
+		entity->weight = new_weight;
+		/*
+		 * Add the entity to its weights tree only if it is
+		 * not associated with a weight-raised queue.
+		 */
+		if (prev_weight != new_weight &&
+		    (bfqq ? bfqq->wr_coeff == 1 : 1))
+			/* If we get here, root has been initialized. */
+			bfq_weights_tree_add(bfqd, entity, root);
+
 		new_st->wsum += entity->weight;
 
 		if (new_st != old_st)
@@ -976,6 +1064,9 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
 
+	if (!bfqq->dispatched)
+		bfq_weights_tree_remove(bfqd, &bfqq->entity,
+					&bfqd->queue_weights_tree);
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues--;
 }
@@ -992,6 +1083,9 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues++;
 
+	if (!bfqq->dispatched && bfqq->wr_coeff == 1)
+		bfq_weights_tree_add(bfqd, &bfqq->entity,
+				     &bfqd->queue_weights_tree);
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues++;
 }
diff --git a/block/bfq.h b/block/bfq.h
index bda1ecb3..83c828d 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -81,8 +81,23 @@ struct bfq_sched_data {
 };
 
 /**
+ * struct bfq_weight_counter - counter of the number of all active entities
+ *                             with a given weight.
+ * @weight: weight of the entities that this counter refers to.
+ * @num_active: number of active entities with this weight.
+ * @weights_node: weights tree member (see bfq_data's @queue_weights_tree
+ *                and @group_weights_tree).
+ */
+struct bfq_weight_counter {
+	short int weight;
+	unsigned int num_active;
+	struct rb_node weights_node;
+};
+
+/**
  * struct bfq_entity - schedulable entity.
  * @rb_node: service_tree member.
+ * @weight_counter: pointer to the weight counter associated with this entity.
  * @on_st: flag, true if the entity is on a tree (either the active or
  *         the idle one of its service_tree).
  * @finish: B-WF2Q+ finish timestamp (aka F_i).
@@ -133,6 +148,7 @@ struct bfq_sched_data {
  */
 struct bfq_entity {
 	struct rb_node rb_node;
+	struct bfq_weight_counter *weight_counter;
 
 	int on_st;
 
@@ -306,6 +322,22 @@ enum bfq_device_speed {
  * @rq_pos_tree: rbtree sorted by next_request position, used when
  *               determining if two or more queues have interleaving
  *               requests (see bfq_close_cooperator()).
+ * @active_numerous_groups: number of bfq_groups containing more than one
+ *                          active @bfq_entity.
+ * @queue_weights_tree: rbtree of weight counters of @bfq_queues, sorted by
+ *                      weight. Used to keep track of whether all @bfq_queues
+ *                     have the same weight. The tree contains one counter
+ *                     for each distinct weight associated to some active
+ *                     and not weight-raised @bfq_queue (see the comments to
+ *                      the functions bfq_weights_tree_[add|remove] for
+ *                     further details).
+ * @group_weights_tree: rbtree of non-queue @bfq_entity weight counters, sorted
+ *                      by weight. Used to keep track of whether all
+ *                     @bfq_groups have the same weight. The tree contains
+ *                     one counter for each distinct weight associated to
+ *                     some active @bfq_group (see the comments to the
+ *                     functions bfq_weights_tree_[add|remove] for further
+ *                     details).
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
  * @wr_busy_queues: number of weight-raised busy @bfq_queues.
@@ -374,6 +406,13 @@ struct bfq_data {
 	struct bfq_group *root_group;
 	struct rb_root rq_pos_tree;
 
+#ifdef CONFIG_CGROUP_BFQIO
+	int active_numerous_groups;
+#endif
+
+	struct rb_root queue_weights_tree;
+	struct rb_root group_weights_tree;
+
 	int busy_queues;
 	int wr_busy_queues;
 	int queued;
@@ -517,6 +556,11 @@ enum bfqq_expiration {
  * @my_entity: pointer to @entity, %NULL for the toplevel group; used
  *             to avoid too many special cases during group creation/
  *             migration.
+ * @active_entities: number of active entities belonging to the group;
+ *                   unused for the root group. Used to know whether there
+ *                   are groups with more than one active @bfq_entity
+ *                   (see the comments to the function
+ *                   bfq_bfqq_must_not_expire()).
  *
  * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
  * there is a set of bfq_groups, each one collecting the lower-level
@@ -542,6 +586,8 @@ struct bfq_group {
 	struct bfq_queue *async_idle_bfqq;
 
 	struct bfq_entity *my_entity;
+
+	int active_entities;
 };
 
 /**
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 247+ messages in thread
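
The weights-tree bookkeeping and the symmetric_scenario check introduced
by the patch above can be illustrated, roughly, with the stand-alone
user-space C sketch below. It is only a model under stated assumptions:
a sorted singly-linked list stands in for the kernel rb_root, the helper
names (weights_add, weights_remove, differentiated_weights) are
hypothetical, and an allocation failure simply turns the add into a
no-op.

  #include <stdio.h>
  #include <stdlib.h>

  struct weight_counter {
  	int weight;			/* weight this counter refers to */
  	unsigned int num_active;	/* active entities with this weight */
  	struct weight_counter *next;
  };

  /* Increment (or create) the counter associated with the given weight. */
  static void weights_add(struct weight_counter **head, int weight)
  {
  	struct weight_counter **pp = head, *wc;

  	while (*pp && (*pp)->weight < weight)
  		pp = &(*pp)->next;
  	if (*pp && (*pp)->weight == weight) {
  		(*pp)->num_active++;
  		return;
  	}
  	wc = calloc(1, sizeof(*wc));
  	if (!wc)		/* bail out on allocation failure */
  		return;
  	wc->weight = weight;
  	wc->num_active = 1;
  	wc->next = *pp;
  	*pp = wc;
  }

  /* Decrement the counter for the given weight; free it when it hits zero. */
  static void weights_remove(struct weight_counter **head, int weight)
  {
  	struct weight_counter **pp = head;

  	while (*pp && (*pp)->weight != weight)
  		pp = &(*pp)->next;
  	if (!*pp)
  		return;
  	if (--(*pp)->num_active == 0) {
  		struct weight_counter *dead = *pp;

  		*pp = dead->next;
  		free(dead);
  	}
  }

  /*
   * The scenario is weight-symmetric iff at most one distinct weight is
   * in use, i.e., the list holds at most one counter.
   */
  static int differentiated_weights(const struct weight_counter *head)
  {
  	return head && head->next;
  }

  int main(void)
  {
  	struct weight_counter *queue_weights = NULL;

  	weights_add(&queue_weights, 100);
  	weights_add(&queue_weights, 100);
  	printf("differentiated: %d\n", differentiated_weights(queue_weights));
  	weights_add(&queue_weights, 200);
  	printf("differentiated: %d\n", differentiated_weights(queue_weights));
  	weights_remove(&queue_weights, 200);
  	printf("differentiated: %d\n", differentiated_weights(queue_weights));
  	return 0;
  }

Compiled and run, the sketch prints 0, 1, 0: the scenario stops being
symmetric as soon as a second distinct weight becomes active, and
becomes symmetric again once the last entity with that weight goes away.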

* [PATCH RFC - TAKE TWO - 12/12] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs
       [not found]           ` <1401354343-5527-1-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-05-29  9:05             ` Paolo Valente
  2014-05-29  9:05               ` Paolo Valente
                               ` (11 subsequent siblings)
  12 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-29  9:05 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Li Zefan
  Cc: Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups, Paolo Valente

This patch is basically the counterpart of patch 13 for NCQ-capable
rotational devices. Exactly as patch 13 does on flash-based devices
and for any workload, this patch disables device idling on rotational
devices, but only for random I/O. More precisely, idling is disabled
only for constantly-seeky queues (see patch 7), because only for these
queues does disabling idling boost the throughput on NCQ-capable
rotational devices.

To avoid breaking service guarantees, idling is disabled for NCQ-enabled
rotational devices and constantly-seeky queues only when the same
symmetry conditions as in patch 13, plus an additional one, hold. The
additional condition is related to the fact that this patch disables
idling only for constantly-seeky queues. In fact, should idling be
disabled for a constantly-seeky queue while some other
non-constantly-seeky queue has pending requests, the latter queue
would get more requests served, after being set as in service, than
the former. This differentiated treatment would cause a deviation with
respect to the desired throughput distribution (i.e., with respect to
the throughput distribution corresponding to the weights assigned to
processes and groups of processes).  For this reason, the additional
condition for disabling idling for a constantly-seeky queue is that
all queues with pending or in-flight requests are constantly seeky.
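
Purely as an illustration (not part of the patch), the decision encoded
by the cond_for_expiring_non_wr and cond_for_seeky_on_ncq_hdd macros
below can be paraphrased as the following self-contained C sketch; the
struct and its field names are hypothetical stand-ins for the
corresponding bfq_data/bfq_queue state:

  #include <stdbool.h>
  #include <stdio.h>

  /* Hypothetical snapshot of the scheduler state the macros look at. */
  struct idling_inputs {
  	bool hw_tag;		/* device detected as NCQ-capable */
  	bool nonrot;		/* non-rotational (flash-based) device */
  	bool symmetric;		/* symmetric_scenario holds */
  	bool constantly_seeky;	/* the queue at hand is constantly seeky */
  	int wr_busy_queues;
  	int busy_in_flight_queues;
  	int const_seeky_busy_in_flight_queues;
  };

  /* True when a non-weight-raised sync queue may be expired (no idling). */
  static bool expire_non_wr(const struct idling_inputs *in)
  {
  	bool seeky_on_ncq_hdd = in->constantly_seeky &&
  		in->busy_in_flight_queues ==
  		in->const_seeky_busy_in_flight_queues;

  	return in->hw_tag &&
  	       (in->wr_busy_queues > 0 ||
  		(in->symmetric && (in->nonrot || seeky_on_ncq_hdd)));
  }

  int main(void)
  {
  	struct idling_inputs hdd = {
  		.hw_tag = true, .nonrot = false, .symmetric = true,
  		.constantly_seeky = true, .wr_busy_queues = 0,
  		.busy_in_flight_queues = 3,
  		.const_seeky_busy_in_flight_queues = 3,
  	};

  	/* Every busy queue on the NCQ HDD is constantly seeky: no idling. */
  	printf("expire without idling: %d\n", expire_non_wr(&hdd));

  	/* A non-seeky busy queue appears: keep idling to protect it. */
  	hdd.const_seeky_busy_in_flight_queues = 2;
  	printf("expire without idling: %d\n", expire_non_wr(&hdd));

  	return 0;
  }

In the patch itself the two counters are maintained incrementally, in
bfq_add_bfqq_busy(), bfq_del_bfqq_busy(), bfq_completed_request() and
bfq_bfqq_expire() (see the hunks below), so evaluating the condition
remains O(1).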

Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
 block/bfq-iosched.c | 79 +++++++++++++++++++++++++++++++++++++++++------------
 block/bfq-sched.c   | 21 +++++++++++---
 block/bfq.h         | 29 +++++++++++++++++++-
 3 files changed, 107 insertions(+), 22 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 49856e1..b9aafa5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1910,8 +1910,12 @@ static void bfq_bfqq_expire(struct bfq_data *bfqd,
 
 	bfqq->service_from_backlogged += bfqq->entity.service;
 
-	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT)
+	if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
+	    !bfq_bfqq_constantly_seeky(bfqq)) {
 		bfq_mark_bfqq_constantly_seeky(bfqq);
+		if (!blk_queue_nonrot(bfqd->queue))
+			bfqd->const_seeky_busy_in_flight_queues++;
+	}
 
 	if (bfqd->low_latency && bfqq->wr_coeff == 1)
 		bfqq->last_wr_start_finish = jiffies;
@@ -2071,7 +2075,8 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * Even in such a scenario, sequential I/O may still receive a preferential
  * treatment, but this is not likely to be a big issue with flash-based
  * devices, because of their non-dramatic loss of throughput with random
- * I/O.
+ * I/O. Things do differ with HDDs, for which additional care is taken, as
+ * explained after completing the discussion for flash-based devices.
  *
  * Unfortunately, keeping the necessary state for evaluating exactly the
  * above symmetry conditions would be quite complex and time-consuming.
@@ -2088,17 +2093,42 @@ static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
  * compound condition evaluates to true if any of the above symmetry
  * sub-condition does not hold, or the device is not flash-based. Therefore,
  * if also the first component is true, then idling is allowed for a sync
- * queue. In contrast, if all the required symmetry sub-conditions hold and
- * the device is flash-based, then the second component, and hence the
- * whole compound condition, evaluates to false, and no idling is performed.
- * This helps to keep the drives' internal queues full on NCQ-capable
- * devices, and hence to boost the throughput, without causing 'almost' any
- * loss of service guarantees. The 'almost' follows from the fact that, if
- * the internal queue of one such device is filled while all the
- * sub-conditions hold, but at some point in time some sub-condition stops
- * to hold, then it may become impossible to let requests be served in the
- * new desired order until all the requests already queued in the device
- * have been served.
+ * queue. These are the only sub-conditions considered if the device is
+ * flash-based, as, for such a device, it is sensible to force idling only
+ * for service-guarantee issues. In fact, as for throughput, idling
+ * NCQ-capable flash-based devices would not boost the throughput even
+ * with sequential I/O; rather it would lower the throughput in proportion
+ * to how fast the device is. In the end, (only) if all the three
+ * sub-conditions hold and the device is flash-based, the compound
+ * condition evaluates to false and therefore no idling is performed.
+ *
+ * As already said, things change with a rotational device, where idling
+ * boosts the throughput with sequential I/O (even with NCQ). Hence, for
+ * such a device the second component of the compound condition evaluates
+ * to true also if the following additional sub-condition does not hold:
+ * the queue is constantly seeky. Unfortunately, this different behavior
+ * with respect to flash-based devices causes an additional asymmetry: if
+ * some sync queues enjoy idling and some other sync queues do not, then
+ * the latter get a low share of the device throughput, simply because the
+ * former get many requests served after being set as in service, whereas
+ * the latter do not. As a consequence, to guarantee the desired throughput
+ * distribution, on HDDs the compound expression evaluates to true (and
+ * hence device idling is performed) also if the following last symmetry
+ * condition does not hold: no other queue is benefiting from idling. Also
+ * this last condition is actually replaced with a simpler-to-maintain and
+ * stronger condition: there is no busy queue which is not constantly seeky
+ * (and hence may also benefit from idling).
+ *
+ * To sum up, when all the required symmetry and throughput-boosting
+ * sub-conditions hold, the second component of the compound condition
+ * evaluates to false, and hence no idling is performed. This helps to
+ * keep the drives' internal queues full on NCQ-capable devices, and hence
+ * to boost the throughput, without causing 'almost' any loss of service
+ * guarantees. The 'almost' follows from the fact that, if the internal
+ * queue of one such device is filled while all the sub-conditions hold,
+ * but at some point in time some sub-condition stops to hold, then it may
+ * become impossible to let requests be served in the new desired order
+ * until all the requests already queued in the device have been served.
  */
 static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 {
@@ -2109,6 +2139,9 @@ static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 #else
 #define symmetric_scenario	  (!bfq_differentiated_weights(bfqd))
 #endif
+#define cond_for_seeky_on_ncq_hdd (bfq_bfqq_constantly_seeky(bfqq) && \
+				   bfqd->busy_in_flight_queues == \
+				   bfqd->const_seeky_busy_in_flight_queues)
 /*
  * Condition for expiring a non-weight-raised queue (and hence not idling
  * the device).
@@ -2116,7 +2149,8 @@ static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
 #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
 				   (bfqd->wr_busy_queues > 0 || \
 				    (symmetric_scenario && \
-				     blk_queue_nonrot(bfqd->queue))))
+				     (blk_queue_nonrot(bfqd->queue) || \
+				      cond_for_seeky_on_ncq_hdd))))
 
 	return bfq_bfqq_sync(bfqq) && (
 		bfqq->wr_coeff > 1 ||
@@ -2843,8 +2877,11 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_update_io_thinktime(bfqd, bic);
 	bfq_update_io_seektime(bfqd, bfqq, rq);
-	if (!BFQQ_SEEKY(bfqq))
+	if (!BFQQ_SEEKY(bfqq) && bfq_bfqq_constantly_seeky(bfqq)) {
 		bfq_clear_bfqq_constantly_seeky(bfqq);
+		if (!blk_queue_nonrot(bfqd->queue))
+			bfqd->const_seeky_busy_in_flight_queues--;
+	}
 	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
 	    !BFQQ_SEEKY(bfqq))
 		bfq_update_idle_window(bfqd, bfqq, bic);
@@ -2996,9 +3033,15 @@ static void bfq_completed_request(struct request_queue *q, struct request *rq)
 	bfqd->rq_in_driver--;
 	bfqq->dispatched--;
 
-	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq))
+	if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
 		bfq_weights_tree_remove(bfqd, &bfqq->entity,
 					&bfqd->queue_weights_tree);
+		if (!blk_queue_nonrot(bfqd->queue)) {
+			bfqd->busy_in_flight_queues--;
+			if (bfq_bfqq_constantly_seeky(bfqq))
+				bfqd->const_seeky_busy_in_flight_queues--;
+		}
+	}
 
 	if (sync) {
 		bfqd->sync_flight--;
@@ -3420,6 +3463,8 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 					      * video.
 					      */
 	bfqd->wr_busy_queues = 0;
+	bfqd->busy_in_flight_queues = 0;
+	bfqd->const_seeky_busy_in_flight_queues = 0;
 
 	/*
 	 * Begin by assuming, optimistically, that the device peak rate is
@@ -3739,7 +3784,7 @@ static int __init bfq_init(void)
 	device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
 
 	elv_register(&iosched_bfq);
-	pr_info("BFQ I/O-scheduler version: v6");
+	pr_info("BFQ I/O-scheduler version: v7r4");
 
 	return 0;
 }
diff --git a/block/bfq-sched.c b/block/bfq-sched.c
index 473b36a..afc4c23 100644
--- a/block/bfq-sched.c
+++ b/block/bfq-sched.c
@@ -1064,9 +1064,15 @@ static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
 
 	bfq_deactivate_bfqq(bfqd, bfqq, requeue);
 
-	if (!bfqq->dispatched)
+	if (!bfqq->dispatched) {
 		bfq_weights_tree_remove(bfqd, &bfqq->entity,
 					&bfqd->queue_weights_tree);
+		if (!blk_queue_nonrot(bfqd->queue)) {
+			bfqd->busy_in_flight_queues--;
+			if (bfq_bfqq_constantly_seeky(bfqq))
+				bfqd->const_seeky_busy_in_flight_queues--;
+		}
+	}
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues--;
 }
@@ -1083,9 +1089,16 @@ static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
 	bfq_mark_bfqq_busy(bfqq);
 	bfqd->busy_queues++;
 
-	if (!bfqq->dispatched && bfqq->wr_coeff == 1)
-		bfq_weights_tree_add(bfqd, &bfqq->entity,
-				     &bfqd->queue_weights_tree);
+	if (!bfqq->dispatched) {
+		if (bfqq->wr_coeff == 1)
+			bfq_weights_tree_add(bfqd, &bfqq->entity,
+					     &bfqd->queue_weights_tree);
+		if (!blk_queue_nonrot(bfqd->queue)) {
+			bfqd->busy_in_flight_queues++;
+			if (bfq_bfqq_constantly_seeky(bfqq))
+				bfqd->const_seeky_busy_in_flight_queues++;
+		}
+	}
 	if (bfqq->wr_coeff > 1)
 		bfqd->wr_busy_queues++;
 }
diff --git a/block/bfq.h b/block/bfq.h
index 83c828d..f4c702c 100644
--- a/block/bfq.h
+++ b/block/bfq.h
@@ -1,5 +1,5 @@
 /*
- * BFQ-v6 for 3.15.0: data structures and common functions prototypes.
+ * BFQ-v7r4 for 3.15.0: data structures and common functions prototypes.
  *
  * Based on ideas and code from CFQ:
  * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
@@ -340,6 +340,31 @@ enum bfq_device_speed {
  *                     details).
  * @busy_queues: number of bfq_queues containing requests (including the
  *		 queue in service, even if it is idling).
+ * @busy_in_flight_queues: number of @bfq_queues containing pending or
+ *                         in-flight requests, plus the @bfq_queue in
+ *                         service, even if idle but waiting for the
+ *                         possible arrival of its next sync request. This
+ *                         field is updated only if the device is rotational,
+ *                         but used only if the device is also NCQ-capable.
+ *                         The reason why the field is updated also for non-
+ *                         NCQ-capable rotational devices is related to the
+ *                         fact that the value of @hw_tag may be set also
+ *                         later than when busy_in_flight_queues may need to
+ *                         be incremented for the first time(s). Taking also
+ *                         this possibility into account, to avoid unbalanced
+ *                         increments/decrements, would imply more overhead
+ *                         than just updating busy_in_flight_queues
+ *                         regardless of the value of @hw_tag.
+ * @const_seeky_busy_in_flight_queues: number of constantly-seeky @bfq_queues
+ *                                     (that is, seeky queues that expired
+ *                                     for budget timeout at least once)
+ *                                     containing pending or in-flight
+ *                                     requests, including the in-service
+ *                                     @bfq_queue if constantly seeky. This
+ *                                     field is updated only if the device
+ *                                     is rotational, but used only if the
+ *                                     device is also NCQ-capable (see the
+ *                                     comments to @busy_in_flight_queues).
  * @wr_busy_queues: number of weight-raised busy @bfq_queues.
  * @queued: number of queued requests.
  * @rq_in_driver: number of requests dispatched and waiting for completion.
@@ -414,6 +439,8 @@ struct bfq_data {
 	struct rb_root group_weights_tree;
 
 	int busy_queues;
+	int busy_in_flight_queues;
+	int const_seeky_busy_in_flight_queues;
 	int wr_busy_queues;
 	int queued;
 	int rq_in_driver;
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC RESEND 00/14] New version of the BFQ I/O Scheduler
  2014-05-27 12:42 ` paolo
@ 2014-05-30 15:32     ` Vivek Goyal
  -1 siblings, 0 replies; 247+ messages in thread
From: Vivek Goyal @ 2014-05-30 15:32 UTC (permalink / raw)
  To: paolo
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Paolo Valente

On Tue, May 27, 2014 at 02:42:24PM +0200, paolo wrote:

[..]
> Strong fairness guarantees (already provided by BFQ-v0)
> 
> As for long-term guarantees, BFQ distributes the device throughput
> (and not just the device time) as desired to I/O-bound applications,
> with any workload and regardless of the device parameters.

I don't think most people care about strong fairness guarantees.
As an algorithm, round robin is not bad for ensuring fairness. CFQ
started with that, but then it stopped focusing on fairness and instead
focused on trying to address various real issues.

And CFQ's problems don't arise from not having a good fairness algorithm,
so I don't think this should be the reason for taking in a new scheduler.

I think that, instead of numbers, what would help is a short description
of the fundamental problem that CFQ has and BFQ does not, and of how
you solved that issue.

One issue you seemed to mention is that writes are a problem. CFQ
suppresses buffered writes very actively in an attempt to improve
read latencies. How did you make that even better with BFQ?

The last time I looked at BFQ, it looked pretty similar to CFQ, except
that the core round-robin algorithm had been replaced by a fairer
algorithm, plus a few other changes such as less preemption.

But personally, I don't think the lack of a more accurate fairness
algorithm is the real problem to begin with, most of the time.

So I fail to understand why we need BFQ.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 01/12] block: introduce the BFQ-v0 I/O scheduler
  2014-05-29  9:05               ` Paolo Valente
@ 2014-05-30 15:36                   ` Tejun Heo
  -1 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 15:36 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

Hello,

On Thu, May 29, 2014 at 11:05:32AM +0200, Paolo Valente wrote:
> diff --git a/block/bfq-ioc.c b/block/bfq-ioc.c
> new file mode 100644
> index 0000000..adfb5a1
> --- /dev/null
> +++ b/block/bfq-ioc.c
> @@ -0,0 +1,34 @@
> +/*
> + * BFQ: I/O context handling.
> + *
> + * Based on ideas and code from CFQ:
> + * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
> + *
> + * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
> + *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
> + */
> +
> +/**
> + * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
> + * @icq: the iocontext queue.
> + */
> +static inline struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
> +{
> +	/* bic->icq is the first member, %NULL will convert to %NULL */
> +	return container_of(icq, struct bfq_io_cq, icq);
> +}
> +
> +/**
> + * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
> + * @bfqd: the lookup key.
> + * @ioc: the io_context of the process doing I/O.
> + *
> + * Queue lock must be held.
> + */
> +static inline struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
> +					       struct io_context *ioc)
> +{
> +	if (ioc)
> +		return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue));
> +	return NULL;
> +}

Ugh.... please don't split files into C files which get included into
other C files.  It is useful sometimes but the usage here seems
arbitrary.  If you wanna split it into multiple files, make them
proper header or source files which get linked together.  That said,
why split it up at all?  Sure, it's a largish file but just putting
things in sensible order shouldn't be any more difficult to follow.
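
If the split is kept, the usual shape would be to declare the shared
helpers in bfq.h and compile bfq-ioc.c as its own object that gets
linked in (a sketch only; note the "static inline" would have to be
dropped so the symbols can be linked across objects):

	/* bfq.h */
	struct bfq_io_cq *icq_to_bic(struct io_cq *icq);
	struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
					 struct io_context *ioc);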

...
> +#define BFQQ_SEEK_THR	 (sector_t)(8 * 1024)
> +#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)

This probably comes from following cfq, but it's only historical that
we have some helpers as macros and others as functions.  Might as
well make them proper functions if you're redoing the whole thing
anyway.  More on this later tho.

> +/*
> + * We regard a request as SYNC, if either it's a read or has the SYNC bit
> + * set (in which case it could also be a direct WRITE).
> + */
> +static inline int bfq_bio_sync(struct bio *bio)

Compiler should be able to decide whether inlining is beneficial or
not.  Applies to other helpers too.

> +static inline unsigned long bfq_serv_to_charge(struct request *rq,
> +					       struct bfq_queue *bfqq)
> +{
> +	return blk_rq_sectors(rq);
> +}

Is this type of simple wrapper actually helpful?  Doesn't it
obfuscate more than help?  It's not like "we're charging by sectors"
is a concept difficult to grasp requiring an extra layer of
interpretation.

> +static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
> +					  struct bio *bio)
> +{
> +	struct task_struct *tsk = current;
> +	struct bfq_io_cq *bic;
> +	struct bfq_queue *bfqq;
> +
> +	bic = bfq_bic_lookup(bfqd, tsk->io_context);
> +	if (bic == NULL)
> +		return NULL;
> +
> +	bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
> +	if (bfqq != NULL)

Can we please avoid "!= NULL"?  This is unnecessary double negation.
Please just do "if (bfqq)".  Likewise, "if (!bic)".

> +/*
> + * If enough samples have been computed, return the current max budget
> + * stored in bfqd, which is dynamically updated according to the
> + * estimated disk peak rate; otherwise return the default max budget
> + */
> +static inline unsigned long bfq_max_budget(struct bfq_data *bfqd)
> +{
> +	if (bfqd->budgets_assigned < 194)
                                     ^^^
This constant seems to be repeated multiple times.  Maybe define it
with a meaningful explanation?

> +		return bfq_default_max_budget;
> +	else
> +		return bfqd->bfq_max_budget;
> +}
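
For the magic 194 flagged above, something along these lines might be
clearer (the constant name below is invented, purely as an illustration):

	/*
	 * Minimum number of budget assignments before the autotuned
	 * max budget is considered reliable.
	 */
	#define BFQ_MAX_BUDGET_MIN_SAMPLES	194

	static inline unsigned long bfq_max_budget(struct bfq_data *bfqd)
	{
		if (bfqd->budgets_assigned < BFQ_MAX_BUDGET_MIN_SAMPLES)
			return bfq_default_max_budget;
		return bfqd->bfq_max_budget;
	}
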
....
> +static unsigned long bfq_default_budget(struct bfq_data *bfqd,
> +					struct bfq_queue *bfqq)
> +{
> +	unsigned long budget;
> +
> +	/*
> +	 * When we need an estimate of the peak rate we need to avoid
> +	 * to give budgets that are too short due to previous measurements.
> +	 * So, in the first 10 assignments use a ``safe'' budget value.
> +	 */
> +	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)

Is it really necessary to let the userland configure max_budget
outside development and debugging?  I'll talk about this more but this
is something completely implementation dependent.  If a tunable needs
to be exposed, it usually is a much better idea to actually
characterize why something needs to be tuned and expose the underlying
semantics rather than just let userland directly diddle with internal
details.

> +		budget = bfq_default_max_budget;
> +	else
> +		budget = bfqd->bfq_max_budget;
> +
> +	return budget - budget / 4;
> +}
> +
> +/*
> + * Return min budget, which is a fraction of the current or default
> + * max budget (trying with 1/32)
> + */
> +static inline unsigned long bfq_min_budget(struct bfq_data *bfqd)
> +{
> +	if (bfqd->budgets_assigned < 194)
> +		return bfq_default_max_budget / 32;
> +	else
> +		return bfqd->bfq_max_budget / 32;
> +}

Does budget need to be unsigned long?  Using ulong for something which
isn't for bitops or a quantity closely related to address space is
usually wrong because it may be either 32 or 64 bit; since the quantity
has nothing to do with the address space, it just ends up being
used as a 32 bit value.  Just use int?

Similar thing with using unsigned values for quantities which don't
really require the extra bit.  This is more debatable but using
unsigned often doesn't really buy anything.  It's not like C has
underflow protection.  You often just end up needing an overly elaborate
underflow check when a < 0 test would do.

> +/*
> + * Move request from internal lists to the request queue dispatch list.
> + */
> +static void bfq_dispatch_insert(struct request_queue *q, struct request *rq)
> +{
> +	struct bfq_data *bfqd = q->elevator->elevator_data;
> +	struct bfq_queue *bfqq = RQ_BFQQ(rq);
> +
> +	/*
> +	 * For consistency, the next instruction should have been executed
> +	 * after removing the request from the queue and dispatching it.
> +	 * We execute instead this instruction before bfq_remove_request()
> +	 * (and hence introduce a temporary inconsistency), for efficiency.
> +	 * In fact, in a forced_dispatch, this prevents two counters related
> +	 * to bfqq->dispatched to risk to be uselessly decremented if bfqq
> +	 * is not in service, and then to be incremented again after
> +	 * incrementing bfqq->dispatched.
> +	 */

I'm having trouble following the above.  How is it "temporarily
inconsistent" when the whole thing is being performed atomically under
queue lock?

> +	bfqq->dispatched++;
> +	bfq_remove_request(rq);
> +	elv_dispatch_sort(q, rq);
> +
> +	if (bfq_bfqq_sync(bfqq))
> +		bfqd->sync_flight++;
> +}
...
> +static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
> +{
> +	struct bfq_entity *entity = &bfqq->entity;
> +	return entity->budget - entity->service;
> +}

Wouldn't it be easier to read if you collect all the trivial helpers
in one place?

> +/**
> + * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
> + * @bfqd: device data.
> + * @bfqq: queue to update.
> + * @reason: reason for expiration.
> + *
> + * Handle the feedback on @bfqq budget.  See the body for detailed
> + * comments.

Would be nicer if it explained when this function is called.

> + */
> +static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
> +				     struct bfq_queue *bfqq,
> +				     enum bfqq_expiration reason)
> +{
> +	struct request *next_rq;
> +	unsigned long budget, min_budget;
> +
> +	budget = bfqq->max_budget;
> +	min_budget = bfq_min_budget(bfqd);
> +
> +	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %lu, budg left %lu",
> +		bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
> +	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %lu, min budg %lu",
> +		budget, bfq_min_budget(bfqd));
> +	bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
> +		bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
> +
> +	if (bfq_bfqq_sync(bfqq)) {
> +		switch (reason) {
> +		/*
> +		 * Caveat: in all the following cases we trade latency
> +		 * for throughput.
> +		 */
> +		case BFQ_BFQQ_TOO_IDLE:
> +			if (budget > min_budget + BFQ_BUDGET_STEP)
> +				budget -= BFQ_BUDGET_STEP;
> +			else
> +				budget = min_budget;

Yeah, exactly, things like this could have been written like the
following if budgets were ints.

			budget = max(budget - BFQ_BUDGET_STEP, min_budget);

...
> +		}
> +	} else /* async queue */
> +	    /* async queues get always the maximum possible budget
> +	     * (their ability to dispatch is limited by
> +	     * @bfqd->bfq_max_budget_async_rq).
> +	     */
> +		budget = bfqd->bfq_max_budget;

	} else {
		/*
		 * Async queues get...
		 */
	}

> +
> +	bfqq->max_budget = budget;
> +
> +	if (bfqd->budgets_assigned >= 194 && bfqd->bfq_user_max_budget == 0 &&
                                      ^^^
                                    the magic number again

> +	    bfqq->max_budget > bfqd->bfq_max_budget)
> +		bfqq->max_budget = bfqd->bfq_max_budget;

The following would be easier to follow.

	if (bfqd->budgets_assigned >= 194 && !bfqd->bfq_user_max_budget)
		bfqq->max_budget = min(bfqq->max_budget, bfqd->bfq_max_budget);

> +
> +	/*
> +	 * Make sure that we have enough budget for the next request.
> +	 * Since the finish time of the bfqq must be kept in sync with
> +	 * the budget, be sure to call __bfq_bfqq_expire() after the
> +	 * update.
> +	 */
> +	next_rq = bfqq->next_rq;
> +	if (next_rq != NULL)
> +		bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
> +					    bfq_serv_to_charge(next_rq, bfqq));
> +	else
> +		bfqq->entity.budget = bfqq->max_budget;
> +
> +	bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %lu",
> +			next_rq != NULL ? blk_rq_sectors(next_rq) : 0,
> +			bfqq->entity.budget);
> +}
> +
> +static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
> +{
> +	unsigned long max_budget;
> +
> +	/*
> +	 * The max_budget calculated when autotuning is equal to the
> +	 * amount of sectors transfered in timeout_sync at the
> +	 * estimated peak rate.
> +	 */
> +	max_budget = (unsigned long)(peak_rate * 1000 *
> +				     timeout >> BFQ_RATE_SHIFT);

What's up with the "* 1000"? Is peak_rate in bytes/us and max_budget
in bytes/ms?  If so, would it be difficult to stick with a single unit
of time, or at least to document it clearly?
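
If I read the code right (my own assumption, not stated in the patch:
peak_rate is stored as (sectors/usec) << BFQ_RATE_SHIFT and timeout is
in ms), the units work out as

	(sectors/usec << BFQ_RATE_SHIFT) * 1000 usec/ms * timeout ms
		>> BFQ_RATE_SHIFT  =  sectors served in one timeout_sync

which is what the comment above the computation claims.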

> +
> +	return max_budget;

Why not just do the following?

	return (unsigned long)(peak_rate * 1000 *...)

Also, what's up with the (unsigned long) conversion after all the
calculation is done?  The compiler is gonna do that implicitly anyway
and the cast won't even cause a warning even if the return type later
gets changed to something else.  The cast achieves nothing.

> +/*
> + * In addition to updating the peak rate, checks whether the process
> + * is "slow", and returns 1 if so. This slow flag is used, in addition
> + * to the budget timeout, to reduce the amount of service provided to
> + * seeky processes, and hence reduce their chances to lower the
> + * throughput. See the code for more details.
> + */
> +static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,

          bool

> +				int compensate)

                                bool

Applies to other boolean values too.

> +{
> +	u64 bw, usecs, expected, timeout;
> +	ktime_t delta;
> +	int update = 0;
> +
> +	if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
> +		return 0;
> +
> +	if (compensate)
> +		delta = bfqd->last_idling_start;
> +	else
> +		delta = ktime_get();
> +	delta = ktime_sub(delta, bfqd->last_budget_start);
> +	usecs = ktime_to_us(delta);
> +
> +	/* Don't trust short/unrealistic values. */
> +	if (usecs < 100 || usecs >= LONG_MAX)

LONG_MAX is the limit?  It's weird.  Why would the limit on time
measure be different on 32 and 64bit machines?  Also how would you
ever get a number that large?

> +		return 0;
> +
> +	/*
> +	 * Calculate the bandwidth for the last slice.  We use a 64 bit
> +	 * value to store the peak rate, in sectors per usec in fixed
> +	 * point math.  We do so to have enough precision in the estimate
> +	 * and to avoid overflows.
> +	 */
> +	bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
> +	do_div(bw, (unsigned long)usecs);
> +
> +	timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
> +
> +	/*
> +	 * Use only long (> 20ms) intervals to filter out spikes for
> +	 * the peak rate estimation.
> +	 */
> +	if (usecs > 20000) {

Would it make sense to define this value as a fraction of max timeout?
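
For example (just a sketch; the helper name and the 1/4 fraction are
invented here):

	static bool bfq_interval_long_enough(struct bfq_data *bfqd, u64 usecs)
	{
		/* only trust intervals longer than 1/4 of the sync timeout */
		return usecs >
			jiffies_to_usecs(bfqd->bfq_timeout[BLK_RW_SYNC]) / 4;
	}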

> +		if (bw > bfqd->peak_rate) {
> +			bfqd->peak_rate = bw;
> +			update = 1;
> +			bfq_log(bfqd, "new peak_rate=%llu", bw);
> +		}
> +
> +		update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
> +
> +		if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
> +			bfqd->peak_rate_samples++;
> +
> +		if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
> +		    update && bfqd->bfq_user_max_budget == 0) {

The code here seems confusing.  Can you please try to reorder them so
that the condition being checked is clearer?

> +static void bfq_bfqq_expire(struct bfq_data *bfqd,
> +			    struct bfq_queue *bfqq,
> +			    int compensate,
> +			    enum bfqq_expiration reason)
> +{
> +	int slow;
> +
> +	/* Update disk peak rate for autotuning and check whether the

/*
 * Update...

> +	 * process is slow (see bfq_update_peak_rate).
> +	 */
> +	slow = bfq_update_peak_rate(bfqd, bfqq, compensate);
> +
> +	/*
> +	 * As above explained, 'punish' slow (i.e., seeky), timed-out
> +	 * and async queues, to favor sequential sync workloads.
> +	 */
> +	if (slow || reason == BFQ_BFQQ_BUDGET_TIMEOUT)
> +		bfq_bfqq_charge_full_budget(bfqq);
> +
> +	bfq_log_bfqq(bfqd, bfqq,
> +		"expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
> +		slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
> +
> +	/*
> +	 * Increase, decrease or leave budget unchanged according to
> +	 * reason.
> +	 */
> +	__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
> +	__bfq_bfqq_expire(bfqd, bfqq);
> +}
> +
> +/*
> + * Budget timeout is not implemented through a dedicated timer, but
> + * just checked on request arrivals and completions, as well as on
> + * idle timer expirations.
> + */
> +static int bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
> +{
> +	if (bfq_bfqq_budget_new(bfqq) ||
> +	    time_before(jiffies, bfqq->budget_timeout))
> +		return 0;
> +	return 1;
> +}

static bool...
{
	return !bfq_bfqq_budget_new(bfqq) &&
	       time_after_eq(jiffies, bfqq->budget_timeout);
}

> +/*
> + * If we expire a queue that is waiting for the arrival of a new
> + * request, we may prevent the fictitious timestamp back-shifting that
> + * allows the guarantees of the queue to be preserved (see [1] for
> + * this tricky aspect). Hence we return true only if this condition
> + * does not hold, or if the queue is slow enough to deserve only to be
> + * kicked off for preserving a high throughput.
> +*/
> +static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)

		bool

I'll stop commenting about int -> bool conversions but please use bool
for booleans.

> +{
> +	bfq_log_bfqq(bfqq->bfqd, bfqq,
> +		"may_budget_timeout: wait_request %d left %d timeout %d",
> +		bfq_bfqq_wait_request(bfqq),
> +			bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3,
> +		bfq_bfqq_budget_timeout(bfqq));
> +
> +	return (!bfq_bfqq_wait_request(bfqq) ||
> +		bfq_bfqq_budget_left(bfqq) >=  bfqq->entity.budget / 3)
> +		&&
> +		bfq_bfqq_budget_timeout(bfqq);
> +}

	return (!bfq_bfqq_wait_request(bfqq) ||
		bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3) &&
	       bfq_bfqq_budget_timeout(bfqq);

The indentations should indicate how the conditions group.

> +/*
> + * Dispatch one request from bfqq, moving it to the request queue
> + * dispatch list.
> + */
> +static int bfq_dispatch_request(struct bfq_data *bfqd,
> +				struct bfq_queue *bfqq)
> +{
...
> +	if (bfqd->busy_queues > 1 && ((!bfq_bfqq_sync(bfqq) &&
> +	    dispatched >= bfqd->bfq_max_budget_async_rq) ||
> +	    bfq_class_idle(bfqq)))
> +		goto expire;

The canonical formatting would be like the following,

	if (bfqd->busy_queues > 1 &&
	    ((!bfq_bfqq_sync(bfqq) &&
	      dispatched >= bfqd->bfq_max_budget_async_rq) ||
	     bfq_class_idle(bfqq)))
		goto expire;

but as it's pretty ugly, why not do something like the following?
It'd be a lot easier to follow.

	if (bfqd->busy_queues > 1) {
		if (!bfq_bfqq_sync(bfqq) &&
		    dispatched >= bfqd->bfq_max_budget_async_rq)
			goto expire;
		if (bfq_class_idle(bfqq))
			goto expire;
	}

> +static int bfq_dispatch_requests(struct request_queue *q, int force)
> +{
...
> +	max_dispatch = bfqd->bfq_quantum;
> +	if (bfq_class_idle(bfqq))
> +		max_dispatch = 1;
> +
> +	if (!bfq_bfqq_sync(bfqq))
> +		max_dispatch = bfqd->bfq_max_budget_async_rq;

	if (!bfq_bfqq_sync(bfqq))
		max_dispatch = bfqd->bfq_max_budget_async_rq;
	else if (bfq_class_idle(bfqq))
		max_dispatch = 1;
	else
		max_dispatch = bfqd->bfq_quantum;

> +
> +	if (bfqq->dispatched >= max_dispatch) {
> +		if (bfqd->busy_queues > 1)
> +			return 0;
> +		if (bfqq->dispatched >= 4 * max_dispatch)
> +			return 0;
> +	}
> +
> +	if (bfqd->sync_flight != 0 && !bfq_bfqq_sync(bfqq))

	if (bfqd->sync_flight && ...)

> +		return 0;
> +
> +	bfq_clear_bfqq_wait_request(bfqq);
> +
> +	if (!bfq_dispatch_request(bfqd, bfqq))
> +		return 0;
> +
> +	bfq_log_bfqq(bfqd, bfqq, "dispatched one request of %d (max_disp %d)",
> +			bfqq->pid, max_dispatch);
> +
> +	return 1;
> +}
...
> +static void bfq_update_io_seektime(struct bfq_data *bfqd,
> +				   struct bfq_queue *bfqq,
> +				   struct request *rq)
> +{
> +	sector_t sdist;
> +	u64 total;
> +
> +	if (bfqq->last_request_pos < blk_rq_pos(rq))
> +		sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
> +	else
> +		sdist = bfqq->last_request_pos - blk_rq_pos(rq);
> +
> +	/*
> +	 * Don't allow the seek distance to get too large from the
> +	 * odd fragment, pagein, etc.
> +	 */
> +	if (bfqq->seek_samples == 0) /* first request, not really a seek */
> +		sdist = 0;
> +	else if (bfqq->seek_samples <= 60) /* second & third seek */
> +		sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*1024);
> +	else
> +		sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*64);
> +
> +	bfqq->seek_samples = (7*bfqq->seek_samples + 256) / 8;
> +	bfqq->seek_total = (7*bfqq->seek_total + (u64)256*sdist) / 8;
> +	total = bfqq->seek_total + (bfqq->seek_samples/2);
> +	do_div(total, bfqq->seek_samples);
> +	bfqq->seek_mean = (sector_t)total;

The above looks pretty magic to me.  Care to explain a bit?
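
For what it's worth, my own reading of it (not taken from the patch or
its comments, so treat it as an assumption): it is a 1/8-weight
exponential moving average kept in fixed point, i.e.

	/*
	 * seek_samples converges to 256 and acts as the effective sample
	 * weight; seek_total tracks 256 * EWMA(sdist), so that roughly
	 *
	 *	new_mean ~= 7/8 * old_mean + 1/8 * sdist
	 *
	 * and the "+ seek_samples/2" term is just rounding before the
	 * division.
	 */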

...
> +static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
> +			    struct request *rq)
> +{
> +	struct bfq_io_cq *bic = RQ_BIC(rq);
> +
> +	if (rq->cmd_flags & REQ_META)
> +		bfqq->meta_pending++;
> +
> +	bfq_update_io_thinktime(bfqd, bic);
> +	bfq_update_io_seektime(bfqd, bfqq, rq);
> +	if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
> +	    !BFQQ_SEEKY(bfqq))
> +		bfq_update_idle_window(bfqd, bfqq, bic);
> +
> +	bfq_log_bfqq(bfqd, bfqq,
> +		     "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
> +		     bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq),
> +		     (long long unsigned)bfqq->seek_mean);
> +
> +	bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
> +
> +	if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
> +		int small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&
> +				blk_rq_sectors(rq) < 32;
> +		int budget_timeout = bfq_bfqq_budget_timeout(bfqq);
> +
> +		/*
> +		 * There is just this request queued: if the request
> +		 * is small and the queue is not to be expired, then
> +		 * just exit.
> +		 *
> +		 * In this way, if the disk is being idled to wait for
> +		 * a new request from the in-service queue, we avoid
> +		 * unplugging the device and committing the disk to serve
> +		 * just a small request. On the contrary, we wait for
> +		 * the block layer to decide when to unplug the device:
> +		 * hopefully, new requests will be merged to this one
> +		 * quickly, then the device will be unplugged and
> +		 * larger requests will be dispatched.
> +		 */

Is the above comment correct?  How does an iosched see a request
before it gets unplugged?  By the time the request reaches an iosched,
it already got plug merged && unplugged.  Is this optimization
meaningful at all?

> +static void bfq_update_hw_tag(struct bfq_data *bfqd)
> +{
> +	bfqd->max_rq_in_driver = max(bfqd->max_rq_in_driver,
> +				     bfqd->rq_in_driver);
> +
> +	if (bfqd->hw_tag == 1)
> +		return;
> +
> +	/*
> +	 * This sample is valid if the number of outstanding requests
> +	 * is large enough to allow a queueing behavior.  Note that the
> +	 * sum is not exact, as it's not taking into account deactivated
> +	 * requests.
> +	 */
> +	if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
> +		return;
> +
> +	if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
> +		return;
> +
> +	bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
> +	bfqd->max_rq_in_driver = 0;
> +	bfqd->hw_tag_samples = 0;
> +}

This is a digression but I wonder whether ioscheds like bfq should
just throttle the number of outstanding requests to one rather than
trying to estimate the queue depth and then adjust its behavior.  It's
rather pointless to do all this work for queued devices anyway.  A
better behavior could be defaulting to [bc]fq for rotational devices
and something simpler (deadline or noop) for !rot ones.  Note that
blk-mq devices entirely bypass ioscheds anyway.

> +static void bfq_completed_request(struct request_queue *q, struct request *rq)
> +{
....
> +	/*
> +	 * If this is the in-service queue, check if it needs to be expired,
> +	 * or if we want to idle in case it has no pending requests.
> +	 */
> +	if (bfqd->in_service_queue == bfqq) {
> +		if (bfq_bfqq_budget_new(bfqq))
> +			bfq_set_budget_timeout(bfqd);
> +
> +		if (bfq_bfqq_must_idle(bfqq)) {
> +			bfq_arm_slice_timer(bfqd);
> +			goto out;
> +		} else if (bfq_may_expire_for_budg_timeout(bfqq))
> +			bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
> +		else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
> +			 (bfqq->dispatched == 0 ||
> +			  !bfq_bfqq_must_not_expire(bfqq)))
> +			bfq_bfqq_expire(bfqd, bfqq, 0,
> +					BFQ_BFQQ_NO_MORE_REQUESTS);

		if {
		} else if {
		} else if {
		}

...
> +static struct elv_fs_entry bfq_attrs[] = {
> +	BFQ_ATTR(quantum),
> +	BFQ_ATTR(fifo_expire_sync),
> +	BFQ_ATTR(fifo_expire_async),
> +	BFQ_ATTR(back_seek_max),
> +	BFQ_ATTR(back_seek_penalty),
> +	BFQ_ATTR(slice_idle),
> +	BFQ_ATTR(max_budget),
> +	BFQ_ATTR(max_budget_async_rq),
> +	BFQ_ATTR(timeout_sync),
> +	BFQ_ATTR(timeout_async),
> +	BFQ_ATTR(weights),

Again, please refrain from exposing knobs which reveal internal
details.  These unnecessarily lock us into a specific implementation, and
it's not like users can make sensible use of these knobs anyway.  If
you want some knobs exposed for debugging / development, keeping them
in a separate private patch or hiding them behind a kernel param would
work a lot better.
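
As a sketch of the kernel-param option (the parameter name and variable
below are invented, only to show the shape of it):

	/* debug-only override for the autotuned max budget; 0 = autotune */
	static int bfq_debug_max_budget;
	module_param_named(debug_max_budget, bfq_debug_max_budget, int, 0644);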

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 02/12] block, bfq: add full hierarchical scheduling and cgroups support
       [not found]               ` <1401354343-5527-3-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-05-30 15:37                 ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 15:37 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

On Thu, May 29, 2014 at 11:05:33AM +0200, Paolo Valente wrote:
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 768fe44..cdd2528 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -39,6 +39,10 @@ SUBSYS(net_cls)
>  SUBSYS(blkio)
>  #endif
>  
> +#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
> +SUBSYS(bfqio)
> +#endif

So, ummm, I don't think this is a good idea.  Why aren't you plugging
into the blkcg infrastructure as cfq does?  Why does it need to be a
separate controller?
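
For reference, a rough sketch of what plugging into blkcg looks like,
modeled on how cfq registers its policy around this kernel version (the
bfq-side names below are invented):

	static struct blkcg_policy blkcg_policy_bfq = {
		.pd_size	= sizeof(struct bfq_group),
		.cftypes	= bfq_blkcg_files,	/* hypothetical */
		.pd_init_fn	= bfq_pd_init,		/* hypothetical */
		.pd_offline_fn	= bfq_pd_offline,	/* hypothetical */
	};

	/* at init time, instead of registering a new cgroup subsystem: */
	ret = blkcg_policy_register(&blkcg_policy_bfq);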

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 02/12] block, bfq: add full hierarchical scheduling and cgroups support
       [not found]                 ` <20140530153718.GB24871-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-05-30 15:39                   ` Tejun Heo
  2014-05-30 21:49                   ` Paolo Valente
  1 sibling, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 15:39 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

On Fri, May 30, 2014 at 11:37:18AM -0400, Tejun Heo wrote:
> On Thu, May 29, 2014 at 11:05:33AM +0200, Paolo Valente wrote:
> > diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> > index 768fe44..cdd2528 100644
> > --- a/include/linux/cgroup_subsys.h
> > +++ b/include/linux/cgroup_subsys.h
> > @@ -39,6 +39,10 @@ SUBSYS(net_cls)
> >  SUBSYS(blkio)
> >  #endif
> >  
> > +#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
> > +SUBSYS(bfqio)
> > +#endif
> 
> So, ummm, I don't think this is a good idea.  Why aren't you plugging
> into the blkcg infrastructure as cfq does?  Why does it need to be a
> separate controller?

If there's something which doesn't work for bfq in blkcg, please let
me know.  I'd be happy to make it work.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 02/12] block, bfq: add full hierarchical scheduling and cgroups support
       [not found]                 ` <20140530153718.GB24871-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-05-30 15:39                   ` Tejun Heo
  2014-05-30 21:49                   ` Paolo Valente
  1 sibling, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 15:39 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

On Fri, May 30, 2014 at 11:37:18AM -0400, Tejun Heo wrote:
> On Thu, May 29, 2014 at 11:05:33AM +0200, Paolo Valente wrote:
> > diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> > index 768fe44..cdd2528 100644
> > --- a/include/linux/cgroup_subsys.h
> > +++ b/include/linux/cgroup_subsys.h
> > @@ -39,6 +39,10 @@ SUBSYS(net_cls)
> >  SUBSYS(blkio)
> >  #endif
> >  
> > +#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
> > +SUBSYS(bfqio)
> > +#endif
> 
> So, ummm, I don't think this is a good idea.  Why aren't you plugging
> into the blkcg infrastructure as cfq does?  Why does it need to be a
> separate controller?

If there's something which doesn't work for bfq in blkcg, please let
me know.  I'd be happy to make it work.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 06/12] block, bfq: improve responsiveness
@ 2014-05-30 15:41                 ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 15:41 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

On Thu, May 29, 2014 at 11:05:37AM +0200, Paolo Valente wrote:
> @@ -281,7 +323,8 @@ static inline unsigned long bfq_serv_to_charge(struct request *rq,
>  					       struct bfq_queue *bfqq)
>  {
>  	return blk_rq_sectors(rq) *
> -		(1 + ((!bfq_bfqq_sync(bfqq)) * bfq_async_charge_factor));
> +		(1 + ((!bfq_bfqq_sync(bfqq)) * (bfqq->wr_coeff == 1) *
> +		bfq_async_charge_factor));

Ah, okay, so you actually use it later.  Please disregard my previous
comment about dropping the wrapper.
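
For reference, the rule encoded in the quoted hunk boils down to roughly
the following sketch (illustrative names and a simplified signature, not
the actual bfq code): sync queues and weight-raised queues (wr_coeff > 1)
are charged the sectors they actually transfer, while plain async queues
are charged an inflated amount:

	/* illustrative only: mirrors the arithmetic of the quoted expression */
	static unsigned long serv_to_charge_sketch(unsigned long sectors,
						   int sync, unsigned int wr_coeff,
						   unsigned int async_charge_factor)
	{
		if (sync || wr_coeff > 1)
			return sectors;
		return sectors * (1 + async_charge_factor);
	}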

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
@ 2014-05-30 15:46                 ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 15:46 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

On Thu, May 29, 2014 at 11:05:42AM +0200, Paolo Valente wrote:
> This patch boosts the throughput on NCQ-capable flash-based devices,
> while still preserving latency guarantees for interactive and soft
> real-time applications. The throughput is boosted by just not idling
> the device when the in-service queue remains empty, even if the queue
> is sync and has a non-null idle window. This helps to keep the drive's
> internal queue full, which is necessary to achieve maximum
> performance. This solution to boost the throughput is a port of
> commits a68bbdd and f7d7b7a for CFQ.
> 
> As already highlighted in patch 10, allowing the device to prefetch
> and internally reorder requests trivially causes loss of control on
> the request service order, and hence on service guarantees.
> Fortunately, as discussed in detail in the comments to the function
> bfq_bfqq_must_not_expire(), if every process has to receive the same
> fraction of the throughput, then the service order enforced by the
> internal scheduler of a flash-based device is relatively close to that
> enforced by BFQ. In particular, it is close enough to let service
> guarantees be substantially preserved.
> 
> Things change in an asymmetric scenario, i.e., if not every process
> has to receive the same fraction of the throughput. In this case, to
> guarantee the desired throughput distribution, the device must be
> prevented from prefetching requests. This is exactly what this patch
> does in asymmetric scenarios.
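
For reference, the rule described above amounts to roughly the following
sketch (assumed names and signature, not the actual patch): when the
in-service queue runs out of requests on an NCQ-capable flash device,
idle only if the scenario is asymmetric:

	static int idle_when_empty_sketch(int ncq_flash_device,
					  int symmetric_scenario)
	{
		if (ncq_flash_device && symmetric_scenario)
			return 0;	/* keep the drive's internal queue full */
		return 1;		/* idle to preserve service guarantees */
	}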

Does it even make sense to use this type of heavy iosched on ssds?
It's highly likely that ssds will soon be served through blk-mq,
bypassing all these.  I don't feel too enthused about adding code to
ioscheds just to support ssds.  A far better approach would be to just
default to deadline for them anyway.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 12/12] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs
@ 2014-05-30 15:51                 ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 15:51 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

On Thu, May 29, 2014 at 11:05:43AM +0200, Paolo Valente wrote:
> This patch is basically the counterpart of patch 13 for NCQ-capable
> rotational devices. Exactly as patch 13 does on flash-based devices
> and for any workload, this patch disables device idling on rotational
> devices, but only for random I/O. More precisely, idling is disabled
> only for constantly-seeky queues (see patch 7). In fact, only with
> these queues disabling idling boosts the throughput on NCQ-capable
> rotational devices.
> 
> To not break service guarantees, idling is disabled for NCQ-enabled
> rotational devices and constantly-seeky queues only when the same
> symmetry conditions as in patch 13, plus an additional one, hold. The
> additional condition is related to the fact that this patch disables
> idling only for constantly-seeky queues. In fact, should idling be

Wouldn't it make more sense to limit queue depth to one unless the
workload can clearly benefit from allowing higher queue depth?  And I
really think it'd bring more clarity if we just concentrate on
rotational devices.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
@ 2014-05-30 16:07               ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 16:07 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

Hello, Paolo.

On Thu, May 29, 2014 at 11:05:31AM +0200, Paolo Valente wrote:
> this patchset introduces the last version of BFQ, a proportional-share
> storage-I/O scheduler. BFQ also supports hierarchical scheduling with
> a cgroups interface. The first version of BFQ was submitted a few
> years ago [1]. It is denoted as v0 in the patches, to distinguish it
> from the version I am submitting now, v7r4. In particular, the first
> two patches introduce BFQ-v0, whereas the remaining patches turn it
> progressively into BFQ-v7r4. Here are some nice features of this last
> version.

So, excellent work.  I haven't actually followed the implementation
of the scheduling logic itself but read all the papers and it seems
great to me; however, the biggest problem that I have is that, while
being proposed as a separate iosched, this basically is an improvement
of cfq.  It shares most of the infrastructure code, aims at the same
set of devices and usage scenarios, and, while a lot more clearly
characterized and in general better performing, even its scheduling
behavior isn't that different from cfq's.

We do have multiple ioscheds, but save for anticipatory, which has
pretty much been superseded by cfq, they serve different purposes, and
I'd really hate the idea of carrying two mostly similar ioscheds in tree.

For some reason, the blkcg implementation seems completely different,
but outside of that, bfq doesn't really seem to have diverged a lot from
cfq, and the most likely and probably only way for it to be merged
would be if you just mutate cfq into bfq.  The whole effort is mostly
about characterizing and refining the scheduling algorithm anyway,
right?  I'd really love to see that happening.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC RESEND 00/14] New version of the BFQ I/O Scheduler
       [not found]     ` <20140530153228.GE16605-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-05-30 16:16       ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 16:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo, Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

Hello, Vivek.

On Fri, May 30, 2014 at 11:32:28AM -0400, Vivek Goyal wrote:
> I don't think most of the people care about strong fairness guarantee.
> As an algorithm round robin is not bad for ensuring fairness. CFQ had
> started with that but then it stopped focussing on fairness and rather
> focussed on trying to address various real issues.

Oh, I wildly disagree.  Have you looked at the test results in the
paper?  Unless the results are completely bogus, which they probably
aren't, this is awesome.  This is a *lot* more clearly characterized
algorithm which shows significantly better behavior especially in use
cases where scheduling matters.  I mean, it even reaches higher
throughput while achieving lower latency.  Why wouldn't we want it?

> And CFQ's problems don't arise from not having a good fairness algorithm.
> So I don't think this should be the reason for taking a new scheduler.

In comparison, cfq's fairness behavior is a lot worse, but even
ignoring that, one of the major problems of cfq is that its behavior
is hardly characterized.  It's really difficult to anticipate what
it'd do and understand why, which makes it very difficult to maintain
and improve.  Even just for the latter point, it'd be worthwhile to
adopt bfq.

> I think instead of numbers, what would help is a short description
> that what's the fundamental problem with CFQ which BFQ does not
> have and how did you solve that issue.

The papers are pretty clear and not too long.  Have you read them?

> One issue you seemed to mention is that write is a problem. CFQ 
> suppresses buffered writes very actively in an attempt to improve
> read latencies. How did you make it even better with BFQ.
> 
> Last time I had looked at BFQ, it looked pretty similar to CFQ except
> that core round robin algorithm had been replaced by a more fair
> algo and more things done like less preemption etc.
> 
> But personally I don't think using a more accurate fairness algorithm
> is the problem to begin with most of the time.
> 
> So I fail to understand that why do we need BFQ.

I violently disagree.  This is awesome.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC RESEND 00/14] New version of the BFQ I/O Scheduler
@ 2014-05-30 17:09           ` Vivek Goyal
  0 siblings, 0 replies; 247+ messages in thread
From: Vivek Goyal @ 2014-05-30 17:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paolo, Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

On Fri, May 30, 2014 at 12:16:50PM -0400, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Fri, May 30, 2014 at 11:32:28AM -0400, Vivek Goyal wrote:
> > I don't think most of the people care about strong fairness guarantee.
> > As an algorithm round robin is not bad for ensuring fairness. CFQ had
> > started with that but then it stopped focussing on fairness and rather
> > focussed on trying to address various real issues.
> 
> Oh, I wildly disagree.  Have you looked at the test results in the
> paper?  Unless the results are completely bogus, which they probably
> aren't, this is awesome.  This is a *lot* more clearly characterized
> algorithm which shows significantly better behavior especially in use
> cases where scheduling matters.  I mean, it even reaches higher
> throughput while achieving lower latency.  Why wouldn't we want it?
> 

Of course everybody wants "higher throughput while achieving lower
latency". Most of the time these are contradictory goals.

Instead of just looking at numbers, I am keen on knowing what's the
fundamental design change which allows this. What is CFQ doing wrong
which BFQ gets right.

Are you referring to the BFQ paper? I had read one in the past and it was
all about how to achieve more accurate fairness. At this point of time
I don't think that a smarter algorithm is the problem, until and unless
somebody can show me how the algorithm is a problem.

Apart from algorithm, one thing BFQ did different in the past was provide
fairness based on amount of IO done *and not in terms of time slice*. I
don't know if that's still the case. If it is the case I am wondering
why CFQ was converted from amount of IO based fairness to time slice
based fairness.

I am all for a better algorithm. I just want somebody to explain to me
what magic they have done to achieve better throughput as well as
better latency.

Do you think that CFQ's problems come from round robin algorithms? No,
they don't. Most of the time we don't even honor the allocated time
slice (except sequential IO) and preempt the allocated slice. That's
why I think implementing a more fair algorithm is probably not the
solution.

CFQ's throughput problems come from idling and driving lower queue depth.
And CFQ's throughput problems arise due to suppressing buffered write
very actively in presence of sync IO. 

I want to know what has BFQ done fundamentally different to take care of
above issues. 

> > And CFQ's problems don't arise from not having a good fairness algorithm.
> > So I don't think this should be the reason for taking a new scheduler.
> 
> In comparison, cfq's fairness behavior is a lot worse, but even
> ignoring that, one of the major problems of cfq is that its behavior
> is hardly characterized.  It's really difficult to anticipate what
> it'd do and understand why, which makes it very difficult to maintain
> and improve.  Even just for the latter point, it'd be worthwhile to
> adopt bfq.

Remember CFQ had started with a simple algorithm: just start allocating
time slices in proportion to ioprio. That simple scheme did not work in
the real world and then it became more complicated.

What's the guarantee that the same thing will not happen to BFQ? There is
no point in getting fairness if overall performance sucks. That's what
happens even with block IO cgroups. Create more than 4 cgroups, put some
workload in that and with CFQ your performance will be so bad that you
will drop the idea of getting fairness.

Why is performance bad? Again, due to idling. In an attempt to provide
isolation between the IO of two cgroups, we idle. If you want to dilute
the isolation, sure, you will get better throughput, and CFQ can do that
too.

> 
> > I think instead of numbers, what would help is a short description
> > that what's the fundamental problem with CFQ which BFQ does not
> > have and how did you solve that issue.
> 
> The papers are pretty clear and not too long.  Have you read them?

Can you please provide me the link to the paper? I had read one a few
years back. I am not sure if it is still the same paper or a new one.
And after experimenting with their implementation and playing with
CFQ, my impression was that the fairness algorithm is not the core problem.

I will be more than happy to be proven wrong. I just need somebody to
not just throw numbers at me but rather explain to me why BFQ
performs better and why CFQ can't do the same thing.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC RESEND 00/14] New version of the BFQ I/O Scheduler
@ 2014-05-30 17:26               ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 17:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo, Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

Hello,

On Fri, May 30, 2014 at 01:09:58PM -0400, Vivek Goyal wrote:
> Are you referring to the BFQ paper? I had read one in the past and it was
> all about how to achieve more accurate fairness. At this point of time
> I don't think that a smarter algorithm is the problem, until and unless
> somebody can show me how the algorithm is a problem.

Because cfq rr's with heuristically guessed slices and bfq calculates
where each one is supposed to end and then schedules the slices
accordingly.  With a lot of slices to serve, cfq loses track of which
one should come first to adhere to desired latency guarantees while
bfq doesn't, which in turn allows more latitude in using longer slices
for bfq allowing for better throughput.  It's all in the paper and
backed by numbers.  What more do you want?
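
To make the contrast concrete, here is a minimal sketch of timestamp-based
scheduling (illustrative names and scaling, not BFQ's actual code): each
queue carries a virtual finish time that advances in inverse proportion to
its weight, and the scheduler always dispatches the queue with the smallest
one, instead of merely remembering whose turn it is:

	struct queue_sketch {
		unsigned long long vfinish;	/* virtual finish time */
		unsigned int weight;		/* desired share of the device */
	};

	/* account 'service' (e.g. sectors) to a queue after serving it */
	static void charge_service(struct queue_sketch *q, unsigned long service)
	{
		q->vfinish += (unsigned long long)service * 1000 / q->weight;
	}

	/* dispatch next from the queue with the smallest virtual finish time */
	static struct queue_sketch *pick_next(struct queue_sketch *qs, int nr)
	{
		struct queue_sketch *best = &qs[0];
		int i;

		for (i = 1; i < nr; i++)
			if (qs[i].vfinish < best->vfinish)
				best = &qs[i];
		return best;
	}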

> Apart from algorithm, one thing BFQ did different in the past was provide
> fairness based on amount of IO done *and not in terms of time slice*. I
> don't know if that's still the case. If it is the case I am wondering
> why CFQ was converted from amount of IO based fairness to time slice
> based fairness.

Rotating rusts being rotating rusts, either measure isn't perfect.
Bandwidth tracking gets silly with seeky workloads while pure time
slice tracking unfairly treats users of inner and outer tracks.  BFQ
uses bw tracking for sequential workload while basically switches to
time slice tracking for seeky workloads.  These were pretty clearly
discussed in the paper.
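
A rough sketch of that hybrid accounting (assumed names, not the actual
implementation): a mostly-sequential queue is charged for the sectors it
actually transferred, while a seeky queue is charged as if it had consumed
its whole budget, which makes the accounting behave like time-slice
accounting for random I/O:

	static unsigned long charge_for_service_sketch(unsigned long sectors_served,
						       unsigned long full_budget,
						       int seeky)
	{
		return seeky ? full_budget : sectors_served;
	}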

> I am all for a better algorithm. I just want somebody to explain to me
> what magic they have done to achieve better throughput as well as
> better latency.

It feels more like you're actively refusing to understand it even when
the algorithm and heuristics involved are as clearly presented as they
could be.  This is one of the best documented pieces of work in this
area of the kernel.  Just read the papers.

> Do you think that CFQ's problems come from round robin algorithms? No,
> they don't. Most of the time we don't even honor the allocated time

Oh yes, why do you think we bother with preemption at all?  bfq
reportedly achieves sufficient fairness and responsiveness without the
need for preemption.  This is a lot clearer model.

> slice (except sequential IO) and preempt the allocated slice. That's
> why I think implementing a more fair algorithm is probably not the
> solution.

If you actually study what has been presented, you'd immediately
recognize that a lot of what bfq does is clarification and
improvements of the buried heuristics.

Look at it this way: RR + preemption becomes the basic timestamp based
scheduling and other heuristics are extracted and clarified.  That's
what it is.

> CFQ's throughput problems come from idling and driving lower queue depth.
> And CFQ's throughput problems arise due to suppressing buffered write
> very actively in presence of sync IO. 
> 
> I want to know what has BFQ done fundamentally different to take care of
> above issues. 

Yeah, read the paper.  They also analyzed why some things behave
certain ways.  A lot of them are pretty spot-on.

> Remember CFQ had started with a simple algorithm: just start allocating
> time slices in proportion to ioprio. That simple scheme did not work in
> the real world and then it became more complicated.

Yes, this is another refinement step.  WTF is that so hard to wrap
your head around?  It doesn't throw away much.  It builds upon cfq
and clarifies and improves what it has been doing.

> What's the guarantee that the same thing will not happen to BFQ? There is
> no point in getting fairness if overall performance sucks. That's what
> happens even with block IO cgroups. Create more than 4 cgroups, put some
> workload in that and with CFQ your performance will be so bad that you
> will drop the idea of getting fairness.

Dude, just go read the paper.

> Can you please provide me the link to the paper? I had read one a few
> years back. I am not sure if it is still the same paper or a new one.
> And after experimenting with their implementation and playing with
> CFQ, my impression was that the fairness algorithm is not the core problem.

The references are in the frigging head message and in the head
comment of the implementation.  Are you even reading stuff?  You're
just wasting other people's time.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC RESEND 00/14] New version of the BFQ I/O Scheduler
@ 2014-05-30 17:31     ` Vivek Goyal
  0 siblings, 0 replies; 247+ messages in thread
From: Vivek Goyal @ 2014-05-30 17:31 UTC (permalink / raw)
  To: paolo
  Cc: Jens Axboe, Tejun Heo, Li Zefan, Fabio Checconi,
	Arianna Avanzini, Paolo Valente, linux-kernel, containers,
	cgroups

On Tue, May 27, 2014 at 02:42:24PM +0200, paolo wrote:
> From: Paolo Valente <paolo.valente@unimore.it>
> 
> [Re-posting, previous attempt seems to have partially failed]
> 
> Hi,
> this patchset introduces the last version of BFQ, a proportional-share
> storage-I/O scheduler. BFQ also supports hierarchical scheduling with
> a cgroups interface. The first version of BFQ was submitted a few
> years ago [1]. It is denoted as v0 in the patches, to distinguish it
> from the version I am submitting now, v7r4. In particular, the first
> four patches introduce BFQ-v0, whereas the remaining patches turn it
> progressively into BFQ-v7r4. Here are some nice features of this last
> version.
> 
> Low latency for interactive applications
> 
> According to our results, regardless of the actual background
> workload, for interactive tasks the storage device is virtually as
> responsive as if it was idle. For example, even if one or more of the
> following background workloads are being executed:
> - one or more large files are being read or written,
> - a tree of source files is being compiled,
> - one or more virtual machines are performing I/O,
> - a software update is in progress,
> - indexing daemons are scanning filesystems and updating their
>   databases,
> starting an application or loading a file from within an application
> takes about the same time as if the storage device was idle. As a
> comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
> applications experience high latencies, or even become unresponsive
> until the background workload terminates (also on SSDs).

So how do you achieve it? IOW, how do you figure out that something is
interactive and just give it priority and almost stop the others? What
happens to the notion of fairness in that case?

And if there is a way to figure out interactive applications, then even
in CFQ, one should be able to easily put these queues at the beginning of
the service tree so that they get served first and achieve a better
experience?

And in general that would be a desirable thing to do. So why not modify
CFQ?

> 
> Low latency for soft real-time applications
> 
> Also soft real-time applications, such as audio and video
> players/streamers, enjoy low latency and drop rate, regardless of the
> storage-device background workload. As a consequence, these
> applications do not suffer from almost any glitch due to the
> background workload.

Again, how do you achieve it?

> 
> High throughput
> 
> On hard disks, BFQ achieves up to 30% higher throughput than CFQ,

For what workload? 

> and
> up to 150% higher throughput than DEADLINE and NOOP,

Is this true with a buffered write workload also? I think these
percentages will be very dependent on the IO pattern.

Again, I would like to know how you achieved such a high throughput
when compared to CFQ, DEADLINE and NOOP. One of the things which drops
throughput on a rotational hard disk is a non-sequential IO pattern, and
CFQ already takes care of that. So the only way to achieve higher
throughput will be to accumulate more sequential IO in the queue and
then let that queue run for longer and stop the IO from other queues.
And that will mean higher latencies for the IO in the other queues.

So on a rotational hard disk, how do you achieve higher throughput as
well as reduced latency?

> with half of the
> parallel workloads considered in our tests. With the rest of the
> workloads, and with all the workloads on flash-based devices, BFQ
> achieves instead about the same throughput as the other schedulers.
> 
> Strong fairness guarantees (already provided by BFQ-v0)
> 
> As for long-term guarantees, BFQ distributes the device throughput
> (and not just the device time) as desired to I/O-bound applications,

I think this is one key difference as compared to CFQ: fairness based
on bandwidth and not fairness based on time slice.

So if one process is doing large sequential IO and another is doing small
random IO, most of the disk time will be given to the process doing small
random IOs. Is that more fair? I think that's one reason that CFQ was
switched to a time-based scheme: provide time slices, and after that
it is up to the application how much it can get out of the disk in that
slice, based on its IO pattern. At least in terms of fairness, that
sounds more fair to me.
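
To put rough, assumed numbers on that trade-off: suppose a sequential
reader streams at 100 MB/s when it has the disk to itself, while a random
reader manages 1 MB/s. Splitting the bandwidth equally gives each about
1 MB/s, lets the random reader occupy roughly 99% of the disk time, and
collapses aggregate throughput to about 2 MB/s. Splitting the time equally
instead gives the sequential reader about 50 MB/s and the random reader
about 0.5 MB/s, for roughly 50 MB/s aggregate.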

I think this is one point which needs to be discussed: is it
a better idea to switch to bandwidth-based fairness? Should we change
CFQ to achieve that, or do we need to introduce a new IO scheduler for that?

> with any workload and regardless of the device parameters.
> 
> What allows BFQ to provide the above features is its accurate
> scheduling engine (patches 1-4), combined with a set of simple
> heuristics and improvements (patches 5-14). 

This is very hard to understand. This puzzle needs to be broken down
into small pieces and explained in simple design terms so that even
5 years down the line I can explain why BFQ was better.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC RESEND 00/14] New version of the BFQ I/O Scheduler
@ 2014-05-30 17:39         ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 17:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo, Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

On Fri, May 30, 2014 at 01:31:46PM -0400, Vivek Goyal wrote:
> > What allows BFQ to provide the above features is its accurate
> > scheduling engine (patches 1-4), combined with a set of simple
> > heuristics and improvements (patches 5-14). 
> 
> > This is very hard to understand. This puzzle needs to be broken down
> into small pieces and explained in simple design terms so that even
> 5 years down the line I can explain why BFQ was better.

Vivek, stop wasting people's time and just go study the paper.  Let's
*please* talk after that.  What do you expect him to do?  Copy & paste
the whole paper here and walk you through each step of it?  He
presented as much information as he could and then provided sufficient
summary in the head message and as comments in the implementation.
It's now *your* turn to study what has been presented.  Sure, if you
have further questions or think that the implementation and patches
need more documentation, that's completely fine but I find it
ridiculous that you're demanding to be spoon-fed information which is
readily available in an easily consumable form.  Seriously, this is
the best documentation we've had in this area *EVER*.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC RESEND 00/14] New version of the BFQ I/O Scheduler
@ 2014-05-30 17:55                   ` Vivek Goyal
  0 siblings, 0 replies; 247+ messages in thread
From: Vivek Goyal @ 2014-05-30 17:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: paolo, Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

On Fri, May 30, 2014 at 01:26:09PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Fri, May 30, 2014 at 01:09:58PM -0400, Vivek Goyal wrote:
> > Are you referring to the BFQ paper? I had read one in the past and it was
> > all about how to achieve more accurate fairness. At this point in time
> > I don't think that a smarter algorithm is the problem, until and unless
> > somebody can show me how the algorithm is a problem.
> 
> Because cfq rr's with heuristically guessed slices and bfq calculates
> where each one is supposed to end and then schedules the slices
> accordingly.  With a lot of slices to serve, cfq loses track of which
> one should come first to adhere to desired latency guarantees while
> bfq doesn't, which in turn allows more latitude in using longer slices
> for bfq allowing for better throughput.  It's all in the paper and
> backed by numbers.  What more do you want?

Now CFQ also dynamically adjusts the slice length based on how many
queues are ready to do IO. One problem with fixed-slice-length round
robin was that if there are a lot of queues doing IO, then after
serving one queue, the same queue will get a time slice again only
after a long time.

Corrado Zoccolo did work in this area in an attempt to improve latency.
Now slice length is calculated dynamically in an attempt to achieve
better latency.

commit 5db5d64277bf390056b1a87d0bb288c8b8553f96
Author: Corrado Zoccolo <czoccolo@gmail.com>
Date:   Mon Oct 26 22:44:04 2009 +0100

    cfq-iosched: adapt slice to number of processes doing I/O
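
The gist of that commit, as a rough stand-alone sketch (the constants
and the function name below are made up for illustration; this is not
the actual cfq-iosched.c code): shrink each queue's slice when many
queues are competing, so that a full round of service roughly fits a
target latency, while never going below a minimum slice.

#include <stdio.h>

#define BASE_SLICE_MS     100   /* default sync slice                  */
#define TARGET_LATENCY_MS 300   /* desired worst-case round duration   */
#define MIN_SLICE_MS       10   /* never shrink a slice below this     */

/* hypothetical helper: slice length as a function of busy queues */
static unsigned int scaled_slice_ms(unsigned int busy_queues)
{
	unsigned int slice = BASE_SLICE_MS;

	if (busy_queues > 1 && BASE_SLICE_MS * busy_queues > TARGET_LATENCY_MS) {
		/* make the whole round roughly fit the latency target */
		slice = TARGET_LATENCY_MS / busy_queues;
		if (slice < MIN_SLICE_MS)
			slice = MIN_SLICE_MS;
	}
	return slice;
}

int main(void)
{
	unsigned int q;

	for (q = 1; q <= 16; q *= 2)
		printf("%2u busy queues -> %3u ms slice\n", q, scaled_slice_ms(q));
	return 0;
}

With one or two busy queues the slice stays at its full length; with
many busy queues it shrinks, so the same queue is served again sooner.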
 
> 
> > Apart from algorithm, one thing BFQ did different in the past was provide
> > fairness based on amount of IO done *and not in terms of time slice*. I
> > don't know if that's still the case. If it is the case I am wondering
> > why CFQ was converted from amount of IO based fairness to time slice
> > based fairness.
> 
> Rotating rusts being rotating rusts, neither measure is perfect.
> Bandwidth tracking gets silly with seeky workloads, while pure time
> slice tracking unfairly treats users of inner and outer tracks.  BFQ
> uses bw tracking for sequential workloads while basically switching to
> time slice tracking for seeky workloads.  These were pretty clearly
> discussed in the paper.

Ok, so you prefer to have another IO scheduler instead of improving
CFQ?

> 
> > I am all for a better algorithm. I just want somebody to explain to
> > me what magic they have done to achieve better throughput as well as
> > better latency.
> 
> It feels more like you're actively refusing to understand it even when
> the algorithm and heuristics involved are as clearly presented as they
> could be.  This is one of the best documented pieces of work in this
> area of the kernel.  Just read the papers.

Tejun, I spent a significant amount of time on BFQ a few years ago, and
that's the reason I have not read it again yet. My understanding was
that there was nothing which could not be done in CFQ (at least the
things which mattered).

Looks like you prefer introducing a new scheduler instead of improving
CFQ. My preference is to improve CFQ. Borrow good ideas from BFQ and
implement them in CFQ.

> 
> > Do you think that CFQ's problems come from round robin algorithms. No,
> > they don't. Most of the time we don't even honor the allocated time
> 
> Oh yes, why do you think we bother with preemption at all?  bfq
> reportedly achieves sufficient fairness and responsiveness without the
> need for preemption.  This is a lot clearer model.

We primarily allow preemption of the buffered write queue. If you allocate
a share for the buffered write queue, that would be good for not starving
buffered writes, but your read latencies would suffer.
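
(A toy sketch of the kind of preemption rule meant here, with invented
names and not the actual cfq_should_preempt() logic, which checks more
conditions: a newly busy sync queue, typically carrying reads, may
preempt an in-service async queue carrying buffered writes, never the
other way around.)

#include <stdbool.h>

struct toy_queue {
	bool sync;   /* synchronous I/O (typically reads) */
};

static bool toy_should_preempt(const struct toy_queue *in_service,
			       const struct toy_queue *newcomer)
{
	/* sync work may preempt an async (buffered write) slice */
	return newcomer->sync && !in_service->sync;
}

int main(void)
{
	struct toy_queue writer = { .sync = false };
	struct toy_queue reader = { .sync = true };

	/* a reader arriving while a writer is in service preempts it */
	return toy_should_preempt(&writer, &reader) ? 0 : 1;
}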

> 
> > slice (except sequential IO) and preempt the allocated slice. That's
> > why I think implementing a more fair algorithm is probably not the
> > solution.
> 
> If you actually study what has been presented, you'd immediately
> recognize that a lot of what bfq does is clarification and
> improvements of the buried heuristics.
> 
> Look at it this way: RR + preemption becomes the basic timestamp based
> scheduling and other heuristics are extracted and clarified.  That's
> what it is.
> 
> > CFQ's throughput problems come from idling and driving lower queue depth.
> > And CFQ's throughput problems arise due to suppressing buffered write
> > very actively in presence of sync IO. 
> > 
> > I want to know what has BFQ done fundamentally different to take care of
> > above issues. 
> 
> Yeah, read the paper.  They also analyzed why some things behave
> certain ways.  A lot of them are pretty spot-on.
> 
> > Remember that CFQ had started with a simple algorithm: just allocate
> > time slices in proportion to ioprio. That simple scheme did not work in
> > the real world, and then it became more complicated.
> 
> Yes, this is another refinement step.  WTF is it so hard to wrap
> your head around it?  It doesn't throw away much.  It builds upon cfq
> and clarifies and improves what it has been doing.

So why not improve CFQ instead of carrying and maintaining another
scheduler, and then have a discussion about what the default scheduler
should be?

If the plan is that BFQ is better than CFQ, and that instead of fixing CFQ
we should introduce BFQ and slowly phase out CFQ, please say so.

There is already a lot of confusion when it comes to explaining to people
which IO scheduler to use, and now there is another one in the mix which
is almost the same as CFQ but supposedly performs better.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC RESEND 00/14] New version of the BFQ I/O Scheduler
       [not found]                   ` <20140530175527.GH16605-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-05-30 17:59                     ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 17:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo, Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

Hey, Vivek.

On Fri, May 30, 2014 at 01:55:27PM -0400, Vivek Goyal wrote:
> Now CFQ also dynamically adjusts the slice length based on how many
> queues are ready to do IO. One problem with fixed-slice-length round
> robin was that if there are a lot of queues doing IO, then after
> serving one queue, the same queue will get a time slice again only
> after a long time.
> 
> Corrado Zoccolo did work in this area in an attempt to improve latency.
> Now slice length is calculated dynamically in an attempt to achieve
> better latency.

Consider it a better version of that.

> Ok, so you prefer to have another IO scheduler instead of improving
> CFQ?

As I wrote in another reply, I think the best approach would be
morphing cfq to accommodate the scheduling algorithm and heuristics of
bfq.

> Tejun, I spent a significant amount of time on BFQ a few years ago, and
> that's the reason I have not read it again yet. My understanding was
> that there was nothing which could not be done in CFQ (at least the
> things which mattered).

THERE IS A NEW PAPER.

> Looks like you prefer introducing a new scheduler instead of improving
> CFQ. My preference is to improve CFQ. Borrow good ideas from BFQ and
> implement them in CFQ.

LOOKS LIKE YOU ARE NOT READING ANYTHING AND JUST WRITING IRRELEVANT
SHIT.

> So why not improve CFQ instead of carrying and maintaining another
> scheduler, and then have a discussion about what the default scheduler
> should be?

For fuck's sake, I'm out.  This is a total waste of time.  You don't
read what others are writing and refuse to study what's right there.
What are you doing?

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 02/12] block, bfq: add full hierarchical scheduling and cgroups support
       [not found]                 ` <20140530153718.GB24871-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-05-30 21:49                   ` Paolo Valente
  2014-05-30 21:49                   ` Paolo Valente
  1 sibling, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-30 21:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups


On 30 May 2014, at 17:37, Tejun Heo <tj@kernel.org> wrote:

> On Thu, May 29, 2014 at 11:05:33AM +0200, Paolo Valente wrote:
>> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
>> index 768fe44..cdd2528 100644
>> --- a/include/linux/cgroup_subsys.h
>> +++ b/include/linux/cgroup_subsys.h
>> @@ -39,6 +39,10 @@ SUBSYS(net_cls)
>> SUBSYS(blkio)
>> #endif
>> 
>> +#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
>> +SUBSYS(bfqio)
>> +#endif
> 
> So, ummm, I don't think this is a good idea.  Why aren't you plugging
> into the blkcg infrastructure as cfq does?  Why does it need to be a
> separate controller?
> 

It does not, actually. It is just that, when we implemented that part, there was no blkcg infrastructure. After that, I went on experimenting with the low-latency heuristics and all the other stuff. Finally, I decided to first propose this new version of bfq, and then deal with blkcg integration as well in case of a positive welcome.

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 02/12] block, bfq: add full hierarchical scheduling and cgroups support
@ 2014-05-30 21:49                       ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-30 21:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups


On 30 May 2014, at 17:39, Tejun Heo <tj@kernel.org> wrote:

> On Fri, May 30, 2014 at 11:37:18AM -0400, Tejun Heo wrote:
>> On Thu, May 29, 2014 at 11:05:33AM +0200, Paolo Valente wrote:
>>> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
>>> index 768fe44..cdd2528 100644
>>> --- a/include/linux/cgroup_subsys.h
>>> +++ b/include/linux/cgroup_subsys.h
>>> @@ -39,6 +39,10 @@ SUBSYS(net_cls)
>>> SUBSYS(blkio)
>>> #endif
>>> 
>>> +#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
>>> +SUBSYS(bfqio)
>>> +#endif
>> 
>> So, ummm, I don't think this is a good idea.  Why aren't you plugging
>> into the blkcg infrastructure as cfq does?  Why does it need to be a
>> separate controller?
> 
> If there's something which doesn't work for bfq in blkcg, please let
> me know.  I'd be happy to make it work.
> 

This will probably be very useful for us.

Thanks a lot,
Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
@ 2014-05-30 22:01                     ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-30 22:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups


On 30 May 2014, at 17:46, Tejun Heo <tj@kernel.org> wrote:

> On Thu, May 29, 2014 at 11:05:42AM +0200, Paolo Valente wrote:
>> This patch boosts the throughput on NCQ-capable flash-based devices,
>> while still preserving latency guarantees for interactive and soft
>> real-time applications. The throughput is boosted by just not idling
>> the device when the in-service queue remains empty, even if the queue
>> is sync and has a non-null idle window. This helps to keep the drive's
>> internal queue full, which is necessary to achieve maximum
>> performance. This solution to boost the throughput is a port of
>> commits a68bbdd and f7d7b7a for CFQ.
>> 
>> As already highlighted in patch 10, allowing the device to prefetch
>> and internally reorder requests trivially causes loss of control on
>> the request service order, and hence on service guarantees.
>> Fortunately, as discussed in detail in the comments to the function
>> bfq_bfqq_must_not_expire(), if every process has to receive the same
>> fraction of the throughput, then the service order enforced by the
>> internal scheduler of a flash-based device is relatively close to that
>> enforced by BFQ. In particular, it is close enough to let service
>> guarantees be substantially preserved.
>> 
>> Things change in an asymmetric scenario, i.e., if not every process
>> has to receive the same fraction of the throughput. In this case, to
>> guarantee the desired throughput distribution, the device must be
>> prevented from prefetching requests. This is exactly what this patch
>> does in asymmetric scenarios.
> 
> Does it even make sense to use this type of heavy iosched on ssds?
> It's highly likely that ssds will soon be served through blk-mq
> bypassing all these.  I don't feel too enthused about adding code to
> support ssds to ioscheds.  A lot better approach would be to just
> default to deadline for them anyway.
> 

This was basically my opinion before I started running tests also with SSDs. As you can see from, e.g., Figure 8 in
http://algogroup.unimore.it/people/paolo/disk_sched/extra_results.php
or Figure 9 in
http://algogroup.unimore.it/people/paolo/disk_sched/results.php
with deadline, as well as with NOOP, and even worse with CFQ, start-up times become unbearably high while some files are being read sequentially. I got even higher latencies on Intel SSDs.

One of the main reasons is that these schedulers allow the drive to queue more than one request. Maybe adding some of the low-latency heuristics of bfq to deadline would help, but this needs to be investigated.
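
To make the trade-off described in the quoted patch more concrete,
here is a rough, self-contained sketch of that kind of decision (the
names and structure are invented for illustration and do not match the
actual bfq code):

#include <stdbool.h>
#include <stdio.h>

struct sketch_queue {
	bool sync;          /* queue issues synchronous I/O             */
	bool idle_window;   /* queue would normally be worth idling for */
	bool has_pending;   /* queue still has requests to dispatch     */
};

struct sketch_device {
	bool ncq_flash;            /* NCQ-capable, flash-based device    */
	bool symmetric_scenario;   /* all active queues have equal claim */
};

static bool sketch_should_idle(const struct sketch_queue *q,
			       const struct sketch_device *d)
{
	if (!q->sync || !q->idle_window)
		return false;   /* this queue never deserves idling       */
	if (q->has_pending)
		return false;   /* still dispatching, nothing to wait for */
	if (d->ncq_flash && d->symmetric_scenario)
		return false;   /* keep the drive's internal queue full   */
	return true;            /* idle to preserve service guarantees    */
}

int main(void)
{
	struct sketch_queue q = { .sync = true, .idle_window = true, .has_pending = false };
	struct sketch_device ssd = { .ncq_flash = true, .symmetric_scenario = true };
	struct sketch_device hdd = { .ncq_flash = false, .symmetric_scenario = true };

	printf("idle on NCQ SSD: %d, on HDD: %d\n",
	       sketch_should_idle(&q, &ssd), sketch_should_idle(&q, &hdd));
	return 0;
}

In the symmetric case the device is left free to prefetch and reorder,
which is what boosts the throughput; in the asymmetric case idling is
kept, which is what preserves the desired throughput distribution.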

Paolo


> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
@ 2014-05-30 22:23                   ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-30 22:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups


On 30 May 2014, at 18:07, Tejun Heo <tj@kernel.org> wrote:

> Hello, Paolo.
> 
> On Thu, May 29, 2014 at 11:05:31AM +0200, Paolo Valente wrote:
>> this patchset introduces the last version of BFQ, a proportional-share
>> storage-I/O scheduler. BFQ also supports hierarchical scheduling with
>> a cgroups interface. The first version of BFQ was submitted a few
>> years ago [1]. It is denoted as v0 in the patches, to distinguish it
>> from the version I am submitting now, v7r4. In particular, the first
>> two patches introduce BFQ-v0, whereas the remaining patches turn it
>> progressively into BFQ-v7r4. Here are some nice features of this last
>> version.
> 
> So, excellent work.  I haven't actually followed the implementation
> of the scheduling logic itself but read all the papers and it seems
> great to me; however, the biggest problem that I have is that, while
> being proposed as a separate iosched, this basically is an improvement
> of cfq.  It shares most of the infrastructure code, aims at the same set
> of devices and usage scenarios, and while a lot more clearly
> characterized and in general better performing, even the scheduling
> behavior isn't that different from cfq.
> 

I am really glad to hear that, thanks.

> We do have multiple ioscheds but, except for anticipatory, which pretty
> much has been superseded by cfq, they serve different purposes, and I'd
> really hate the idea of carrying two mostly similar ioscheds in tree.
> 
> For some reason, blkcg implementation seems completely different but
> outside of that, bfq doesn't really seem to have diverged a lot from
> cfq and the most likely and probably only way for it to be merged
> would be if you just mutate cfq into bfq.  The whole effort is mostly
> about characterizing and refining the scheduling algorithm anyway,
> right?  I'd really love to see that happening.
> 

I do agree that bfq has essentially the same purpose as cfq. I am not sure that it is what you are proposing, but, in my opinion, since both the engine and all the new heuristics of bfq differ from those of cfq, a replacement would be most certainly a much easier solution than any other transformation of cfq into bfq (needless to say, leaving the same name for the scheduler would not be a problem for me). Of course, before that we are willing to improve what has to be improved in bfq.

BTW, we are already working on your recommendations for patch 01.

Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
       [not found]                   ` <464F6CBE-A63E-46EF-A90D-BF8450430444-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-05-30 23:28                     ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-30 23:28 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups

Hello,

On Sat, May 31, 2014 at 12:23:01AM +0200, Paolo Valente wrote:
> I do agree that bfq has essentially the same purpose as cfq. I am
> not sure that it is what you are proposing, but, in my opinion,
> since both the engine and all the new heuristics of bfq differ from
> those of cfq, a replacement would be most certainly a much easier
> solution than any other transformation of cfq into bfq (needless to
> say, leaving the same name for the scheduler would not be a problem
> for me). Of course, before that we are willing to improve what has
> to be improved in bfq.

Well, it's all about how to actually route the changes and in general
whenever avoidable we try to avoid wholesale code replacement
especially when most of the structural code is similar like in this
case.  Gradually evolving cfq to bfq is likely to take more work but
I'm very positive that it'd definitely be a lot easier to merge the
changes that way and people involved, including the developers and
reviewers, would acquire a lot clearer picture of what's going on in
the process.  For example, AFAICS, most of the heuristics added by the
later patches are refined versions of what's already in cfq and at
least some are applicable regardless of the underlying scheduling
algorithm.  It all depends on the details but, for example, steps like
the following would make it a lot easier to get merged.

* Identify the improvements which can be applied to cfq as-is or with
  some adaptation and apply those improvements to cfq.

* Make preparatory changes to make transition to new base scheduling
  algorithm easier.

* Strip out or disable cfq features which get in the way of
  conversion.

* Switch the base algorithm to the timestamp based one.

* Rebuild stripped down features and apply new heuristics,
  optimizations and follow-up changes.

I understand that this might be a non-trivial amount of work but at
the same time it's not something which is inherently difficult.  It's
mostly logistical after all and I'd be happy to help where I can.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC RESEND 00/14] New version of the BFQ I/O Scheduler
@ 2014-05-30 23:33               ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-30 23:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, Jens Axboe, Li Zefan, Fabio Checconi,
	Arianna Avanzini, linux-kernel, containers, cgroups


On 30 May 2014, at 19:09, Vivek Goyal <vgoyal@redhat.com> wrote:

> […]
> Instead of just looking at numbers, I am keen on knowing what's the
> fundamental design change which allows this. What is CFQ doing wrong
> which BFQ gets right.
> 

I think that Tejun has already highlighted the key points and provided many details. To contribute to answering your questions about the reasons why bfq outperforms cfq, here is a summary of the most relevant underlying facts:

1) cfq is based on a round-robin scheme, in which an unlucky queue that should be served immediately may instead end up waiting for all the other busy queues before being served. In this respect, combining round robin with virtual-time-based improvements is likely to lead to not very clear solutions, and probably to sub-optimal results with respect to just using an optimal scheduler with provable deterministic guarantees (such as the internal scheduler of bfq).

2) To provide a queue with a higher fraction of the throughput, a round-robin scheduler serves the queue for a longer time slice. Increasing time slices further increases per-request latencies. The problem may be mitigated by using preemption, but the result is a combination of a basic algorithm and a ‘corrective’ heuristic. This is again a more convoluted, and likely less accurate, solution than directly using an optimal algorithm with provable guarantees.

3) In bfq, budgets play a role similar to that of time slices in cfq, i.e., once a queue has been granted access to the device, the queue is served, in the simplest case, until it finishes its budget. But, under bfq, the fraction of the throughput received by a queue is *independent* of the budget assigned to the queue. I agree that this may seem counterintuitive at first, especially if one is accustomed to thinking a la round-robin. Differently from a round-robin algorithm, the internal scheduler of bfq controls the throughput distribution by controlling the frequency at which queues are served. The resulting degree of freedom with respect to budget sizes has the following two main advantages:
3.a) bfq can choose for each queue the budget that best fits the requirements or characteristics of the queue. For example, queues corresponding to time-sensitive applications are assigned small budgets, which guarantees that they are served quickly. On the opposite side, queues associated to I/O-bound processes performing mostly sequential I/O are assigned large budgets, which helps boost the throughput.
3.b) bfq does not need to assign large budgets to queues to provide them with large fractions of the throughput; hence bfq does not need to deteriorate per-request latencies to guarantee a differentiated throughput distribution.

4) The internal scheduler of bfq guarantees that a queue that needs to be served quickly may wait, unjustly, for the service of at most one queue. More formally, bfq guarantees that each budget is completed with the smallest possible delay, for a budget-by-budget scheduler, with respect to an ideal, perfectly fair scheduler (i.e., an ideal scheduler that serves all busy queues at the same time, providing each with a fraction of the throughput proportional to its weight).

5) Temporarily assigning a large fraction of the throughput is the main mechanism through which bfq provides interactive and soft real-time applications with a low latency. Thanks to fact 3.b, bfq achieves this goal without increasing per-request latencies. As for how applications are deemed interactive or soft real-time, I have tried to describe both detection heuristics in patches 06 and 07.
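
To make the decoupling in point 3 more concrete, here is a toy, self-contained sketch (plain C, not code from the patches; queue names, weights and budget sizes are made up) of a budget-by-budget scheduler driven by virtual finish times:

#include <stdio.h>

struct toy_queue {
	const char *name;
	double weight;	/* relative share the queue is entitled to */
	double budget;	/* sectors served per selection (any value) */
	double vfinish;	/* virtual finish time of the current budget */
	double served;	/* total sectors served so far */
};

int main(void)
{
	struct toy_queue q[] = {
		{ "interactive", 1.0,   64.0, 0.0, 0.0 },
		{ "sequential",  1.0, 4096.0, 0.0, 0.0 },
	};
	int i, n = 2, rounds;
	double total = 0.0;

	/* Finish time of the first budget of each queue. */
	for (i = 0; i < n; i++)
		q[i].vfinish = q[i].budget / q[i].weight;

	/*
	 * Always serve the queue whose current budget finishes earliest
	 * in virtual time: the selection frequency compensates for the
	 * budget size, so the long-term share depends only on the weight.
	 */
	for (rounds = 0; rounds < 1000000; rounds++) {
		int min = 0;

		for (i = 1; i < n; i++)
			if (q[i].vfinish < q[min].vfinish)
				min = i;
		q[min].served += q[min].budget;
		total += q[min].budget;
		q[min].vfinish += q[min].budget / q[min].weight;
	}

	for (i = 0; i < n; i++)
		printf("%-12s budget %6.0f -> share %.3f\n",
		       q[i].name, q[i].budget, q[i].served / total);
	return 0;
}

Despite the 64x difference in budget size, both queues converge to a share of about 0.5: changing only the weights changes the shares, while changing only the budgets does not.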

Finally, as for adding to cfq the heuristics I have added to bfq: I think that this would probably improve application latencies with cfq too. But, because of the above facts, the result would unavoidably be worse than with bfq.

Paolo

--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
@ 2014-05-30 23:54                         ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-05-30 23:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups


Il giorno 31/mag/2014, alle ore 01:28, Tejun Heo <tj@kernel.org> ha scritto:

> Hello,
> 
> On Sat, May 31, 2014 at 12:23:01AM +0200, Paolo Valente wrote:
>> I do agree that bfq has essentially the same purpose as cfq. I am
>> not sure that it is what you are proposing, but, in my opinion,
>> since both the engine and all the new heuristics of bfq differ from
>> those of cfq, a replacement would be most certainly a much easier
>> solution than any other transformation of cfq into bfq (needless to
>> say, leaving the same name for the scheduler would not be a problem
>> for me). Of course, before that we are willing to improve what has
>> to be improved in bfq.
> 
> Well, it's all about how to actually route the changes and in general
> whenever avoidable we try to avoid wholesale code replacement
> especially when most of the structural code is similar like in this
> case.  Gradually evolving cfq to bfq is likely to take more work but
> I'm very positive that it'd definitely be a lot easier to merge the
> changes that way and people involved, including the developers and
> reviewers, would acquire a lot clearer picture of what's going on in
> the process.

I understand, and apologize for proposing an inappropriate shortcut.

>  For example, AFAICS, most of the heuristics added by the
> later patches are refined versions of what's already in cfq and at
> least some are applicable regardless of the underlying scheduling
> algorithm.

Absolutely correct.

>  It all depends on the details but, for example, steps like
> the following would make it a lot easier to get merged.
> 
> * Identify the improvements which can be applied to cfq as-is or with
>  some adaptation and apply those improvements to cfq.
> 
> * Make preparatory changes to make the transition to the new base scheduling
>  algorithm easier.
> 
> * Strip out or disable cfq features which get in the way of
>  conversion.
> 
> * Switch the base algorithm to the timestamp based one.
> 
> * Rebuild stripped down features and apply new heuristics,
>  optimizations and follow-up changes.
> 

OK, we can try.

> I understand that this might be a non-insignificant amount of work

This may be a non-negligible difficulty, as Arianna, who is actively helping me with this project, and I are working on bfq in our (little) spare time. But if you are patient enough, we will be happy to proceed one step at a time.

> but at
> the same time it's not something which is inherently difficult.  It's
> mostly logistical after all and I'd be happy to help where I can.
> 

This is certainly reassuring for us.

Thanks again,
Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
       [not found]               ` <20140530160712.GG24871-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-05-31  0:48                 ` Jens Axboe
  2014-05-31  0:48                 ` Jens Axboe
  1 sibling, 0 replies; 247+ messages in thread
From: Jens Axboe @ 2014-05-31  0:48 UTC (permalink / raw)
  To: Tejun Heo, Paolo Valente
  Cc: Li Zefan, Fabio Checconi, Arianna Avanzini, Paolo Valente,
	linux-kernel, containers, cgroups

On 2014-05-30 10:07, Tejun Heo wrote:
> We do have multiple ioscheds but sans for anticipatory which pretty
> much has been superseded by cfq, they serve different purposes and I'd
> really hate the idea of carrying two mostly similar ioscheds in tree.

AS was removed, and exactly for that reason. So let's make one thing very
clear: we are not going to carry two implementations of CFQ, that differ 
in various ways. That will not happen. We're going to end up with one 
"smart" IO scheduler, not several of them.

> For some reason, blkcg implementation seems completely different but
> outside of that, bfq doesn't really seem to have diverged a lot from
> cfq and the most likely and probably only way for it to be merged
> would be if you just mutate cfq into bfq.  The whole effort is mostly
> about characterizing and refining the scheduling algorithm anyway,
> right?  I'd really love to see that happening.

Patching CFQ would be the right way to go, imho. That would also make it 
very clear what the steps are to get there, leaving us with something 
that can actually be backtracked and debugged. I think the patch series 
already looks pretty good, basically patch #2 "just" needs to be turned 
into a series of patches for CFQ.

What I really like about the implementation is, as Tejun highlights, 
that the algorithm is detailed and characterized. Nobody ever wrote any 
detailed documentation on CFQ - I think the closest is a talk I gave at 
LCA in 2007 or so. That said, the devil is _always_ in the details when 
it comes to nice algorithms. When theory meets practice, then the little 
tweaks and tunings required to not drop 10% there or 20% here is when it 
gets ugly. And that's where CFQ has the history going for it, at least. 
Which is another reason for turning patch #2 into a series of changes 
for CFQ instead. We need to end up with something where we can 
potentially bisect our way down to whatever caused any given regression. 
The worst possible situation is "CFQ works fine for this workload, but 
BFQ does not" or vice versa. Or hangs, or whatever it might be.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
       [not found]                 ` <538926F6.7080409-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2014-05-31  5:16                   ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-31  5:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Paolo Valente, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

Hello, Jens.

On Fri, May 30, 2014 at 06:48:54PM -0600, Jens Axboe wrote:
> What I really like about the implementation is, as Tejun highlights, that
> the algorithm is detailed and characterized. Nobody ever wrote any detailed
> documentation on CFQ - I think the closest is a talk I gave at LCA in 2007
> or so. That said, the devil is _always_ in the details when it comes to nice
> algorithms. When theory meets practice, then the little tweaks and tunings
> required to not drop 10% there or 20% here is when it gets ugly. And that's
> where CFQ has the history going for it, at least. Which is another reason

That's the thing I like about the new paper.  It looks like the
original BFQ was the naive ideal implementation but the new paper
basically takes most, if not all, heuristics implemented in cfq,
properly characterizes them and applies the improved versions.  The
end result, AFAICS, really is an evolution of cfq with the core
round-robin + preemption scheduler replaced by something a lot firmer.
It doesn't really lose much of what cfq has accumulated over time.

> for turning patch #2 into a series of changes for CFQ instead. We need to
> end up with something where we can potentially bisect our way down to
> whatever caused any given regression. The worst possible situation is "CFQ
> works fine for this workload, but BFQ does not" or vice versa. Or hangs, or
> whatever it might be.

It's likely that there will be some workloads out there which may be
affected adversely, which is true for any change really.  But with both
the core scheduling and the heuristics properly characterized, at least
finding a reasonable trade-off should be much less of a crapshoot, and
the expected benefits seem to easily outweigh the risks as long as we
can properly sequence the changes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
@ 2014-05-31 11:52                 ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-31 11:52 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

Hello, Paolo.

So, I've actually looked at the code.  Here are some questions.

On Thu, May 29, 2014 at 11:05:42AM +0200, Paolo Valente wrote:
> + * 1) all active queues have the same weight,
> + * 2) all active groups at the same level in the groups tree have the same
> + *    weight,
> + * 3) all active groups at the same level in the groups tree have the same
> + *    number of children.

3) basically disables it whenever blkcg is used.  Might as well just
skip the whole thing if there are any !root cgroups.  It's only
theoretically interesting.

>  static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
>  {
>  	struct bfq_data *bfqd = bfqq->bfqd;

	bool symmetric_scenario, expire_non_wr;

> +#ifdef CONFIG_CGROUP_BFQIO
> +#define symmetric_scenario	  (!bfqd->active_numerous_groups && \
> +				   !bfq_differentiated_weights(bfqd))

	symmetric_scenario = xxx;

> +#else
> +#define symmetric_scenario	  (!bfq_differentiated_weights(bfqd))

	symmetric_scenario = yyy;

> +#endif
>  /*
>   * Condition for expiring a non-weight-raised queue (and hence not idling
>   * the device).
>   */
>  #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
> -				   bfqd->wr_busy_queues > 0)
> +				   (bfqd->wr_busy_queues > 0 || \
> +				    (symmetric_scenario && \
> +				     blk_queue_nonrot(bfqd->queue))))

	expire_non_wr = zzz;

>  
>  	return bfq_bfqq_sync(bfqq) && (
>  		bfqq->wr_coeff > 1 ||
>  /**
> + * struct bfq_weight_counter - counter of the number of all active entities
> + *                             with a given weight.
> + * @weight: weight of the entities that this counter refers to.
> + * @num_active: number of active entities with this weight.
> + * @weights_node: weights tree member (see bfq_data's @queue_weights_tree
> + *                and @group_weights_tree).
> + */
> +struct bfq_weight_counter {
> +	short int weight;
> +	unsigned int num_active;
> +	struct rb_node weights_node;
> +};

This is way over-engineered.  In most cases, the only time you get the
same weight on all IO issuers would be when everybody is on the
default ioprio, so you might as well simply count the number of
non-default ioprios.  It'd be one integer instead of a tree of counters.
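
Something along these lines would be enough (a compile-only sketch; the
struct and field names are illustrative, not taken from the patches):

#include <stdbool.h>

/* Illustrative only: one counter instead of an rbtree of weight counters. */
struct sketch_bfq_data {
	unsigned int num_nondefault_weight;	/* entities not at the default weight */
};

/* Call whenever an entity's effective weight changes. */
static void sketch_weight_changed(struct sketch_bfq_data *bfqd,
				  int old_weight, int new_weight,
				  int default_weight)
{
	if (old_weight != default_weight)
		bfqd->num_nondefault_weight--;
	if (new_weight != default_weight)
		bfqd->num_nondefault_weight++;
}

static bool sketch_symmetric_scenario(const struct sketch_bfq_data *bfqd)
{
	return bfqd->num_nondefault_weight == 0;
}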

> @@ -306,6 +322,22 @@ enum bfq_device_speed {
>   * @rq_pos_tree: rbtree sorted by next_request position, used when
>   *               determining if two or more queues have interleaving
>   *               requests (see bfq_close_cooperator()).
> + * @active_numerous_groups: number of bfq_groups containing more than one
> + *                          active @bfq_entity.

You can safely assume that on any system which uses blkcg, the above
counter is >1.

This optimization may be theoretically interesting but doesn't seem
practical at all and would make the system behave distinctly
differently depending on something which is extremely subtle and seems
completely unrelated.  Furthermore, on any system which uses blkcg,
ext4, btrfs or has any task which has a non-zero nice value, it won't
make any difference.  Its value is only theoretical.

Another thing to consider is that virtually all remotely modern
devices, rotational or not, are queued.  At this point, it's rather
pointless to design one behavior for !queued and another for queued.
Things should just be designed for queued devices.  I don't know what
the solution is but given that the benefits of NCQ for rotational
devices are extremely limited, sticking with the single-request model in
most cases and maybe allowing queued operation for specific workloads
might be a better approach.  As for ssds, just do something simple.
It's highly likely that most ssds won't travel this code path in the
near future anyway.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 12/12] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs
@ 2014-05-31 13:34                 ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-31 13:34 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

Hello,

Okay, read this one properly too.

On Thu, May 29, 2014 at 11:05:43AM +0200, Paolo Valente wrote:
...
> + * As already said, things change with a rotational device, where idling
> + * boosts the throughput with sequential I/O (even with NCQ). Hence, for
> + * such a device the second component of the compound condition evaluates
> + * to true also if the following additional sub-condition does not hold:
> + * the queue is constantly seeky. Unfortunately, this different behavior
> + * with respect to flash-based devices causes an additional asymmetry: if
> + * some sync queues enjoy idling and some other sync queues do not, then
> + * the latter get a low share of the device throughput, simply because the
> + * former get many requests served after being set as in service, whereas
> + * the latter do not. As a consequence, to guarantee the desired throughput
> + * distribution, on HDDs the compound expression evaluates to true (and
> + * hence device idling is performed) also if the following last symmetry
> + * condition does not hold: no other queue is benefiting from idling. Also

Ummm... doesn't the compound expression evaluating to %true prevent
idling from taking place?  The above sentence is painful to parse.  I
don't really think there's much point in expressing the actual
evaluation of expressions in human language.  It sucks for that.  It'd
be far easier to comprehend if you just state what it actually
achieves.  I.e., just say "for rotational queued devices, idling is
allowed if such and such conditions are met" and then explain
rationales for each.  There's no point in walking through the
evaluation process itself.

>  #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
>  				   (bfqd->wr_busy_queues > 0 || \
>  				    (symmetric_scenario && \
> -				     blk_queue_nonrot(bfqd->queue))))
> +				     (blk_queue_nonrot(bfqd->queue) || \
> +				      cond_for_seeky_on_ncq_hdd))))

So, the addition is that for rotational queued devices, idling is
inhibited if all queues are symmetric and all the busy ones are
constantly seeky, right?  Everybody being seeky is probably the only
use case where allowing queued operation is desirable for rotational
devices.  In these cases, do we even care whether every queue is
symmetric?  If we really want this tagged operation optimization,
wouldn't it be far easier to simply charge everybody by bandwidth if
everybody is seeky with idling disabled whether queues are symmetric
or not?  The whole point of charging full slice to seeky queues is
making sure they don't starve non-seeky ones.  If everybody is seeky,
bandwidth charging should be enough for fairness among them, right?

Also, can we please try to avoid double negations where possible?  The
function could easily have been named should_idle() instead of
must_not_expire().  Combined with the complex logic expressions, it
makes things unnecessarily difficult to follow.  Just say yes for not
not.
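
IOW, something with the shape below would be a lot easier to read.  This
is just a sketch of the structure, not checked for equivalence with the
quoted macros; bfq_symmetric_scenario() and bfq_all_busy_queues_seeky()
are placeholder names standing in for the symmetric_scenario and
cond_for_seeky_on_ncq_hdd conditions.

static bool bfq_bfqq_should_idle(struct bfq_queue *bfqq)
{
	struct bfq_data *bfqd = bfqq->bfqd;

	if (!bfq_bfqq_sync(bfqq))
		return false;
	if (bfqq->wr_coeff > 1)
		return true;	/* weight-raised: always worth idling */
	if (!bfqd->hw_tag)
		return true;	/* non-queueing device: idle by default */
	if (bfqd->wr_busy_queues > 0)
		return false;	/* make room for weight-raised queues */
	if (bfq_symmetric_scenario(bfqd) &&
	    (blk_queue_nonrot(bfqd->queue) ||
	     bfq_all_busy_queues_seeky(bfqd)))
		return false;	/* throughput does not need idling here */
	return true;
}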

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 08/12] block, bfq: preserve a low latency also with NCQ-capable drives
@ 2014-05-31 13:48                 ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-31 13:48 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

On Thu, May 29, 2014 at 11:05:39AM +0200, Paolo Valente wrote:
> This patch addresses this issue by not disabling device idling for

This patch addresses this issue by allowing device idling for...

> weight-raised queues, even if the device supports NCQ. This allows BFQ
> to start serving a new queue, and therefore allows the drive to

This prevents BFQ from starting to serve a new queue, and therefore the
drive from prefetching new requests, until the idling timeout expires.

Prefetch?  Can you elaborate?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 09/12] block, bfq: reduce latency during request-pool saturation
@ 2014-05-31 13:54                 ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-05-31 13:54 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups

On Thu, May 29, 2014 at 11:05:40AM +0200, Paolo Valente wrote:
> This patch introduces an heuristic that reduces latency when the
> I/O-request pool is saturated. This goal is achieved by disabling
> device idling, for non-weight-raised queues, when there are weight-
> raised queues with pending or in-flight requests. In fact, as
> explained in more detail in the comment to the function
> bfq_bfqq_must_not_expire(), this reduces the rate at which processes
> associated with non-weight-raised queues grab requests from the pool,
> thereby increasing the probability that processes associated with
> weight-raised queues get a request immediately (or at least soon) when
> they need one.

Wouldn't it be more straightforward to simply control how many
requests each queue consumes by returning ELV_MQUEUE_NO?  Seeky ones do
benefit from a larger number of requests in the elevator, but only up to
a certain number given the fifo timeout anyway, and controlling that
explicitly would make the behavior a lot easier to anticipate than
playing roulette with random request allocation failures.
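
For instance, an elevator_may_queue_fn-style hook along these lines
(just a sketch; the helper name and the cap value are made up, while
the wr_coeff, wr_busy_queues and allocated[] fields are the ones from
the quoted patches) would put an explicit limit on how many requests a
non-weight-raised queue can hold while weight-raised queues are busy:

static int bfq_may_queue_sketch(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
	int allocated = bfqq->allocated[READ] + bfqq->allocated[WRITE];

	/* Let weight-raised queues allocate freely. */
	if (bfqq->wr_coeff > 1)
		return ELV_MQUEUE_MAY;

	/*
	 * While weight-raised queues are busy, explicitly cap the number
	 * of requests a non-weight-raised queue may hold, instead of
	 * relying on request-pool exhaustion to throttle it.
	 */
	if (bfqd->wr_busy_queues > 0 && allocated >= 8 /* arbitrary cap */)
		return ELV_MQUEUE_NO;

	return ELV_MQUEUE_MAY;
}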

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 10/12] block, bfq: add Early Queue Merge (EQM)
       [not found]               ` <1401354343-5527-11-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-06-01  0:03                 ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-01  0:03 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	Paolo Valente, linux-kernel, containers, cgroups,
	Mauro Andreolini

Hello,

On Thu, May 29, 2014 at 11:05:41AM +0200, Paolo Valente wrote:
> Unfortunately, in the following frequent case the mechanism
> implemented in CFQ for detecting cooperating processes and merging
> their queues is not responsive enough to handle also the fluctuating
> I/O pattern of the second type of processes. Suppose that one process
> of the second type issues a request close to the next request to serve
> of another process of the same type. At that time the two processes
> can be considered as cooperating. But, if the request issued by the
> first process is to be merged with some other already-queued request,
> then, from the moment at which this request arrives, to the moment
> when CFQ controls whether the two processes are cooperating, the two
> processes are likely to be already doing I/O in distant zones of the
> disk surface or device memory.

I don't really follow the last part.  So, the difference is that
cooperating queue setup also takes place during bio merge too, right?
Are you trying to say that it's beneficial to detect cooperating
queues before the issuing queue gets unplugged because the two queues
might deviate while plugged?  If so, the above paragraph is, while
quite wordy, a rather lousy description.

> CFQ uses however preemption to get a sequential read pattern out of
> the read requests performed by the second type of processes too.  As a
> consequence, CFQ uses two different mechanisms to achieve the same
> goal: boosting the throughput with interleaved I/O.

You mean the cfq_rq_close() preemption, right?  Hmmm... interesting.
I'm a bit bothered that we might do this multiple times for the same
bio/request.  E.g., if a bio is back-merged to an existing request, which
would be the most common bio merge scenario anyway, is it really
meaningful to retry it for each merge and then again on submission?
cfq does it once when allocating the request.  That seems a lot more
reasonable to me.  It's doing that once for one start sector.  I mean,
plugging is usually extremely short compared to actual IO service
time.  It's there to mask the latencies between bio issues that the
same CPU is doing.  I can't see how this earliness can actually be
useful.  Do you have results to back this one up?  Or is this just
born out of thin air?

> +/*
> + * Must be called with the queue_lock held.

Use lockdep_assert_held() instead?
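
E.g. (assuming the usual bfqd->queue backpointer is valid at this point):

	lockdep_assert_held(bfqq->bfqd->queue->queue_lock);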

> + */
> +static int bfqq_process_refs(struct bfq_queue *bfqq)
> +{
> +	int process_refs, io_refs;
> +
> +	io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
> +	process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
> +	return process_refs;
> +}
...
> +static inline sector_t bfq_io_struct_pos(void *io_struct, bool request)
> +{
> +	if (request)
> +		return blk_rq_pos(io_struct);
> +	else
> +		return ((struct bio *)io_struct)->bi_iter.bi_sector;
> +}
> +
> +static inline sector_t bfq_dist_from(sector_t pos1,
> +				     sector_t pos2)
> +{
> +	if (pos1 >= pos2)
> +		return pos1 - pos2;
> +	else
> +		return pos2 - pos1;
> +}
> +
> +static inline int bfq_rq_close_to_sector(void *io_struct, bool request,
> +					 sector_t sector)
> +{
> +	return bfq_dist_from(bfq_io_struct_pos(io_struct, request), sector) <=
> +	       BFQQ_SEEK_THR;
> +}

You can simply write the following.

	abs64(sector0 - sector1) <= BFQQ_SEEK_THR;

Note that abs64() works whether the source type is signed or unsigned.
Also, please don't do "void *" + type switch.  If it absolutely has
to take two different types of pointers, just take two different
pointers and BUG_ON() if both are set, but here this doesn't seem to
be the case.  Just pass around the actual sectors.  This applies to
all usages of void *io_struct.
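
IOW, something like this (illustrative only; keeping the <= of the
original code):

static inline bool bfq_rq_close_to_sector(sector_t pos, sector_t sector)
{
	return abs64(pos - sector) <= BFQQ_SEEK_THR;
}

with callers passing blk_rq_pos(rq) or bio->bi_iter.bi_sector directly
instead of the void pointer.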

> +static struct bfq_queue *bfqq_close(struct bfq_data *bfqd, sector_t sector)

bfqq_find_close() or find_close_bfqq()?

> +/*
> + * bfqd - obvious
> + * cur_bfqq - passed in so that we don't decide that the current queue
> + *            is closely cooperating with itself
> + * sector - used as a reference point to search for a close queue
> + */

If you're gonna do the above, please use a proper function comment.
Please take a look at Documentation/kernel-doc-nano-HOWTO.txt.

> +static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
> +					      struct bfq_queue *cur_bfqq,
> +					      sector_t sector)
> +{
> +	struct bfq_queue *bfqq;
> +
> +	if (bfq_class_idle(cur_bfqq))
> +		return NULL;
> +	if (!bfq_bfqq_sync(cur_bfqq))
> +		return NULL;
> +	if (BFQQ_SEEKY(cur_bfqq))
> +		return NULL;

Why are some of these conditions tested twice?  Once here and once in
the caller?  Collect them into one place?

...
> +	if (bfqq->entity.parent != cur_bfqq->entity.parent)
> +		return NULL;

This is the only place this rq position tree is being used, right?
Any reason not to use per-parent tree instead?

> +	if (bfq_class_rt(bfqq) != bfq_class_rt(cur_bfqq))
> +		return NULL;

Test ioprio_class for equality?

> +/*
> + * Attempt to schedule a merge of bfqq with the currently in-service queue
> + * or with a close queue among the scheduled queues.
> + * Return NULL if no merge was scheduled, a pointer to the shared bfq_queue
> + * structure otherwise.
> + */
> +static struct bfq_queue *
> +bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
> +		     void *io_struct, bool request)
> +{
> +	struct bfq_queue *in_service_bfqq, *new_bfqq;
> +
> +	if (bfqq->new_bfqq)
> +		return bfqq->new_bfqq;
> +
> +	if (!io_struct)
> +		return NULL;
> +
> +	in_service_bfqq = bfqd->in_service_queue;
> +
> +	if (in_service_bfqq == NULL || in_service_bfqq == bfqq ||
> +	    !bfqd->in_service_bic)
> +		goto check_scheduled;
> +
> +	if (bfq_class_idle(in_service_bfqq) || bfq_class_idle(bfqq))
> +		goto check_scheduled;
> +
> +	if (bfq_class_rt(in_service_bfqq) != bfq_class_rt(bfqq))
> +		goto check_scheduled;
> +
> +	if (in_service_bfqq->entity.parent != bfqq->entity.parent)
> +		goto check_scheduled;
> +
> +	if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
> +	    bfq_bfqq_sync(in_service_bfqq) && bfq_bfqq_sync(bfqq)) {
> +		new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
> +		if (new_bfqq != NULL)
> +			return new_bfqq; /* Merge with in-service queue */
> +	}
> +
> +	/*
> +	 * Check whether there is a cooperator among currently scheduled
> +	 * queues. The only thing we need is that the bio/request is not
> +	 * NULL, as we need it to establish whether a cooperator exists.
> +	 */
> +check_scheduled:
> +	new_bfqq = bfq_close_cooperator(bfqd, bfqq,
> +					bfq_io_struct_pos(io_struct, request));
> +	if (new_bfqq)
> +		return bfq_setup_merge(bfqq, new_bfqq);

Why don't in_service_queue and scheduled search share the cooperation
conditions?  They should be the same, right?  Shouldn't the function
be structured like the following instead?

	if (bfq_try_close_cooperator(current_one, in_service_one))
		return in_service_one;

	found = bfq_find_close_cooperator();
	if (bfq_try_close_cooperator(current_one, found))
		return found;
	return NULL;

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
@ 2014-06-02  9:26                     ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-02  9:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups


On 31 May 2014, at 13:52, Tejun Heo <tj@kernel.org> wrote:

> Hello, Paolo.
> 
> So, I've actually looked at the code.  Here are some questions.
> 
> On Thu, May 29, 2014 at 11:05:42AM +0200, Paolo Valente wrote:
>> + * 1) all active queues have the same weight,
>> + * 2) all active groups at the same level in the groups tree have the same
>> + *    weight,
>> + * 3) all active groups at the same level in the groups tree have the same
>> + *    number of children.
> 
> 3) basically disables it whenever blkcg is used.  Might as well just
> skip the whole thing if there are any !root cgroups.  It's only
> theoretically interesting.

It is easier for me to reply to this, and the other related comments, cumulatively below.

> 
>> static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
>> {
>> 	struct bfq_data *bfqd = bfqq->bfqd;
> 
> 	bool symmetric_scenario, expire_non_wr;
> 
>> +#ifdef CONFIG_CGROUP_BFQIO
>> +#define symmetric_scenario	  (!bfqd->active_numerous_groups && \
>> +				   !bfq_differentiated_weights(bfqd))
> 
> 	symmetric_scenario = xxx;
> 
>> +#else
>> +#define symmetric_scenario	  (!bfq_differentiated_weights(bfqd))
> 
> 	symmetric_scenario = yyy;
> 
>> +#endif
>> /*
>>  * Condition for expiring a non-weight-raised queue (and hence not idling
>>  * the device).
>>  */
>> #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
>> -				   bfqd->wr_busy_queues > 0)
>> +				   (bfqd->wr_busy_queues > 0 || \
>> +				    (symmetric_scenario && \
>> +				     blk_queue_nonrot(bfqd->queue))))
> 
> 	expire_non_wr = zzz;
> 

The solution you propose is the first that came to my mind. But then I went for a clumsy macro-based solution because: 1) the whole function is essentially the evaluation of one long logical expression, and 2) the macro-based solution lets short-circuit evaluation be exploited fully, so that the number of evaluation steps is minimized. For example, with async queues only one condition is evaluated.

Defining three variables instead means that all of them are computed on every invocation, even though most of the time there is no need to.

Would this gain be negligible (sorry for my ignorance), or would it still be enough to justify these unusual macros?
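
To make the comparison concrete, the non-macro variant would look more or less like the sketch below (names as in the patch, everything else only indicative). Being placed behind the && / || chain, the helper is still evaluated only when the earlier, cheaper conditions do not already decide the outcome:

	static inline bool bfq_symmetric_scenario(struct bfq_data *bfqd)
	{
	#ifdef CONFIG_CGROUP_BFQIO
		/* same expression as the macro, itself short-circuited */
		return !bfqd->active_numerous_groups &&
		       !bfq_differentiated_weights(bfqd);
	#else
		return !bfq_differentiated_weights(bfqd);
	#endif
	}

	/* inside bfq_bfqq_must_not_expire(), replacing cond_for_expiring_non_wr */
	bool expire_non_wr =
		bfqd->hw_tag &&
		(bfqd->wr_busy_queues > 0 ||
		 (bfq_symmetric_scenario(bfqd) &&
		  blk_queue_nonrot(bfqd->queue)));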

>> 
>> 	return bfq_bfqq_sync(bfqq) && (
>> 		bfqq->wr_coeff > 1 ||
>> /**
>> + * struct bfq_weight_counter - counter of the number of all active entities
>> + *                             with a given weight.
>> + * @weight: weight of the entities that this counter refers to.
>> + * @num_active: number of active entities with this weight.
>> + * @weights_node: weights tree member (see bfq_data's @queue_weights_tree
>> + *                and @group_weights_tree).
>> + */
>> +struct bfq_weight_counter {
>> +	short int weight;
>> +	unsigned int num_active;
>> +	struct rb_node weights_node;
>> +};
> 
> This is way over-engineered.  In most cases, the only time you get the
> same weight on all IO issuers would be when everybody is on the
> default ioprio so might as well simply count the number of non-default
> ioprios.  It'd be one integer instead of a tree of counters.
> 

Reply below.

>> @@ -306,6 +322,22 @@ enum bfq_device_speed {
>>  * @rq_pos_tree: rbtree sorted by next_request position, used when
>>  *               determining if two or more queues have interleaving
>>  *               requests (see bfq_close_cooperator()).
>> + * @active_numerous_groups: number of bfq_groups containing more than one
>> + *                          active @bfq_entity.
> 
> You can safely assume that on any system which uses blkcg, the above
> counter is >1.
> 
> This optimization may be theoretically interesting but doesn't seem
> practical at all and would make the system behave distinctively
> differently depending on something which is extremely subtle and seems
> completely unrelated.  Furthermore, on any system which uses blkcg,
> ext4, btrfs or has any task which has non-zero nice value, it won't
> make any difference.  Its value is only theoretical.
> 

Turning on idling unconditionally when blkcg is used is one of the first solutions we considered. But there seem to be practical scenarios where this would cause an unjustified loss of throughput. The main example for us was ulatencyd, which AFAIK creates one group for each process and, by default, assigns the same weight to all processes. But the assigned weight is not the one associated with the default ioprio.

I do not know precisely how widespread a mechanism like ulatencyd is, but, in the symmetric scenario it creates, the throughput on, e.g., an HDD would drop by half if the workload were mostly random and we removed the more complex mechanism we have set up.
Wouldn't this be bad?

> Another thing to consider is that virtually all remotely modern
> devices, rotational or not, are queued. At this point, it's rather
> pointless to design one behavior for !queued and another for queued.
> Things should just be designed for queued devices.

I am sorry for expressing doubts again (mainly because of my ignorance), but a few months ago I had to work with some portable devices for a company specializing in ARM systems. As an HDD, they were using a Toshiba MK6006GAH. If I remember correctly, this device had no NCQ. Had we removed the differentiated behavior of bfq with respect to queued/!queued devices, then, instead of the improvements we obtained by using bfq with this slow device, we would have got just a loss of throughput.

>  I don't know what
> the solution is but given that the benefits of NCQ for rotational
> devices is extremely limited, sticking with single request model in
> most cases and maybe allowing queued operation for specific workloads
> might be a better approach.  As for ssds, just do something simple.
> It's highly likely that most ssds won't travel this code path in the
> near future anyway.

This is the point that worries me most. As I pointed out in one of my previous emails, dispatching requests to an SSD without control causes high latencies, or even complete unresponsiveness (Figure 8 in
http://algogroup.unimore.it/people/paolo/disk_sched/extra_results.php
or Figure 9 in
http://algogroup.unimore.it/people/paolo/disk_sched/results.php).

I am of course aware that efficiency is a critical issue with fast devices, and it is probably destined to become more and more critical in the future. But, as a user, I would definitely be unhappy with a system that can, e.g., update itself in one minute instead of five, but may become unresponsive during that minute. In particular, I would not be pleased to buy a more expensive SSD and get a much less responsive system than the one I had with a cheaper HDD and bfq fully working.

Thanks,
Paolo

> 
> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 10/12] block, bfq: add Early Queue Merge (EQM)
       [not found]                 ` <20140601000331.GA29085-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-06-02  9:46                   ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-02  9:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups, Mauro Andreolini


On 1 June 2014, at 02:03, Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Thu, May 29, 2014 at 11:05:41AM +0200, Paolo Valente wrote:
>> Unfortunately, in the following frequent case the mechanism
>> implemented in CFQ for detecting cooperating processes and merging
>> their queues is not responsive enough to handle also the fluctuating
>> I/O pattern of the second type of processes. Suppose that one process
>> of the second type issues a request close to the next request to serve
>> of another process of the same type. At that time the two processes
>> can be considered as cooperating. But, if the request issued by the
>> first process is to be merged with some other already-queued request,
>> then, from the moment at which this request arrives, to the moment
>> when CFQ controls whether the two processes are cooperating, the two
>> processes are likely to be already doing I/O in distant zones of the
>> disk surface or device memory.
> 
> I don't really follow the last part.  So, the difference is that
> cooperating queue setup also takes place during bio merge too, right?

Not only that: in bfq an actual queue merge is also performed in the bio-merge hook.

> Are you trying to say that it's beneficial to detect cooperating
> queues before the issuing queue gets unplugged because the two queues
> might deviate while plugged?

Yes, to keep throughput high both detection and queue merging must be performed immediately.

>  If so, the above paragraph is, while
> quite wordy, a rather lousy description.

Sorry for the long and badly written paragraph. Hoping that I have fully understood the issue you raised, I will try to describe it more concisely in the next submission.

> 
>> CFQ uses however preemption to get a sequential read pattern out of
>> the read requests performed by the second type of processes too.  As a
>> consequence, CFQ uses two different mechanisms to achieve the same
>> goal: boosting the throughput with interleaved I/O.
> 
> You mean the cfq_rq_close() preemption, right?  Hmmm... interesting.
> I'm a bit bothered that we might do this multiple times for the same
> bio/request.  e.g. if bio is back merged to an existing request which
> would be the most common bio merge scenario anyway, is it really
> meaningful to retry it for each merge

Unfortunately, the only way to make sure that we never miss any queue-merge opportunity seems to be to check every time.

> and then again on submission?

Even if a queue-merge attempt fails in an invocation of the allow_merge_fn hook (which then returns false), the request built from the same bio may still lead to a successful queue merge in the subsequent add_rq_fn.
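
Schematically, this is the call flow I am referring to (a simplification of what the posted code does):

	/*
	 * Rough flow for a newly issued bio (simplified):
	 *
	 *  allow_merge_fn -> bfq_setup_cooperator(bfqd, bfqq, bio, false)
	 *      early attempt: if a cooperator is found, the queues are
	 *      merged right away, at bio-merge time;
	 *
	 *  add_rq_fn      -> bfq_setup_cooperator(bfqd, bfqq, rq, true)
	 *      second attempt, on the request built from that bio, this
	 *      time using the request's start sector.
	 */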

> cfq does it once when allocating the request.  That seems a lot more
> reasonable to me.  It's doing that once for one start sector.  I mean,
> plugging is usually extremely short compared to actual IO service
> time.  It's there to mask the latencies between bio issues that the
> same CPU is doing.  I can't see how this earliness can be actually
> useful.  Do you have results to back this one up?  Or is this just
> born out of thin air?
> 

Arianna added the early-queue-merge part in the allow_merge_fn hook about one year ago, as a consequence of a throughput loss of about 30% with KVM/QEMU workloads. In particular, we ran most of the tests on a WDC WD60000HLHX-0 Velociraptor. That HDD might not be available for testing any more, but we can reproduce our results for you on other HDDs, with and without early queue merge. And, maybe through traces, we can show you that the reason for the throughput loss is exactly the one described (in a wordy way) in this patch, unless of course we have missed something.

>> +/*
>> + * Must be called with the queue_lock held.
> 
> Use lockdep_assert_held() instead?
> 

We agree on this and on all your other suggestions/recommendations/corrections, especially on the idea of using per-parent rq position trees. We will apply these changes in the next submission of this patch.
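
Concretely, the restructuring we have in mind follows your outline, more or less as in the sketch below (only indicative: bfq_find_close_cooperator() stands for the current bfqq_close(), a few checks such as BFQQ_SEEKY() are omitted, and plain sectors are passed around instead of io_struct):

	static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,
						struct bfq_queue *new_bfqq)
	{
		if (!new_bfqq || new_bfqq == bfqq)
			return false;
		if (!bfq_bfqq_sync(bfqq) || !bfq_bfqq_sync(new_bfqq))
			return false;
		if (bfq_class_idle(bfqq) || bfq_class_idle(new_bfqq))
			return false;
		if (bfq_class_rt(bfqq) != bfq_class_rt(new_bfqq))
			return false;
		if (bfqq->entity.parent != new_bfqq->entity.parent)
			return false;
		return true;
	}

	static struct bfq_queue *
	bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
			     sector_t pos)
	{
		struct bfq_queue *in_serv = bfqd->in_service_queue, *new_bfqq;

		if (bfqq->new_bfqq)
			return bfqq->new_bfqq;

		/* first try the in-service queue, as in the current code */
		if (bfqd->in_service_bic &&
		    bfq_may_be_close_cooperator(bfqq, in_serv) &&
		    abs64(pos - bfqd->last_position) <= BFQQ_SEEK_THR) {
			new_bfqq = bfq_setup_merge(bfqq, in_serv);
			if (new_bfqq)
				return new_bfqq;
		}

		/* then look for a close queue among the scheduled ones */
		new_bfqq = bfq_find_close_cooperator(bfqd, bfqq, pos);
		if (bfq_may_be_close_cooperator(bfqq, new_bfqq))
			return bfq_setup_merge(bfqq, new_bfqq);

		return NULL;
	}

This way the same set of conditions is applied both to the in-service queue and to the queues found in the position tree, as you suggested.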

Thanks,
Paolo

>> + */
>> +static int bfqq_process_refs(struct bfq_queue *bfqq)
>> +{
>> +	int process_refs, io_refs;
>> +
>> +	io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
>> +	process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
>> +	return process_refs;
>> +}
> ...
>> +static inline sector_t bfq_io_struct_pos(void *io_struct, bool request)
>> +{
>> +	if (request)
>> +		return blk_rq_pos(io_struct);
>> +	else
>> +		return ((struct bio *)io_struct)->bi_iter.bi_sector;
>> +}
>> +
>> +static inline sector_t bfq_dist_from(sector_t pos1,
>> +				     sector_t pos2)
>> +{
>> +	if (pos1 >= pos2)
>> +		return pos1 - pos2;
>> +	else
>> +		return pos2 - pos1;
>> +}
>> +
>> +static inline int bfq_rq_close_to_sector(void *io_struct, bool request,
>> +					 sector_t sector)
>> +{
>> +	return bfq_dist_from(bfq_io_struct_pos(io_struct, request), sector) <=
>> +	       BFQQ_SEEK_THR;
>> +}
> 
> You can simply write the following.
> 
> 	abs64(sector0 - sector1) < BFQQ_SEEK_THR;
> 
> Note that abs64() works whether the source type is signed or unsigned.

> Also, please don't do "void *" + type switch.  If it absolutely has
> to take two different types of pointers, just take two different
> pointers and BUG_ON() if both are set, but here this doesn't seem to
> be the case.  Just pass around the actual sectors.  This applies to
> all usages of void *io_struct.

>> +static struct bfq_queue *bfqq_close(struct bfq_data *bfqd, sector_t sector)
> 
> bfqq_find_close() or find_close_bfqq()?

> 
>> +/*
>> + * bfqd - obvious
>> + * cur_bfqq - passed in so that we don't decide that the current queue
>> + *            is closely cooperating with itself
>> + * sector - used as a reference point to search for a close queue
>> + */
> 
> If you're gonna do the above, please use proper function comment.
> Please take a look at Documentation/kernel-doc-nano-HOWTO.txt.
> 

>> +static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
>> +					      struct bfq_queue *cur_bfqq,
>> +					      sector_t sector)
>> +{
>> +	struct bfq_queue *bfqq;
>> +
>> +	if (bfq_class_idle(cur_bfqq))
>> +		return NULL;
>> +	if (!bfq_bfqq_sync(cur_bfqq))
>> +		return NULL;
>> +	if (BFQQ_SEEKY(cur_bfqq))
>> +		return NULL;
> 
> Why are some of these conditions tested twice?  Once here and once in
> the caller?  Collect them into one place?
> 

> ...
>> +	if (bfqq->entity.parent != cur_bfqq->entity.parent)
>> +		return NULL;
> 
> This is the only place this rq position tree is being used, right?
> Any reason not to use per-parent tree instead?

The only reason could have been to save some memory, but your proposal seems much more efficient, thanks.

> 
>> +	if (bfq_class_rt(bfqq) != bfq_class_rt(cur_bfqq))
>> +		return NULL;
> 
> Test ioprio_class for equality?

> 
>> +/*
>> + * Attempt to schedule a merge of bfqq with the currently in-service queue
>> + * or with a close queue among the scheduled queues.
>> + * Return NULL if no merge was scheduled, a pointer to the shared bfq_queue
>> + * structure otherwise.
>> + */
>> +static struct bfq_queue *
>> +bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
>> +		     void *io_struct, bool request)
>> +{
>> +	struct bfq_queue *in_service_bfqq, *new_bfqq;
>> +
>> +	if (bfqq->new_bfqq)
>> +		return bfqq->new_bfqq;
>> +
>> +	if (!io_struct)
>> +		return NULL;
>> +
>> +	in_service_bfqq = bfqd->in_service_queue;
>> +
>> +	if (in_service_bfqq == NULL || in_service_bfqq == bfqq ||
>> +	    !bfqd->in_service_bic)
>> +		goto check_scheduled;
>> +
>> +	if (bfq_class_idle(in_service_bfqq) || bfq_class_idle(bfqq))
>> +		goto check_scheduled;
>> +
>> +	if (bfq_class_rt(in_service_bfqq) != bfq_class_rt(bfqq))
>> +		goto check_scheduled;
>> +
>> +	if (in_service_bfqq->entity.parent != bfqq->entity.parent)
>> +		goto check_scheduled;
>> +
>> +	if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
>> +	    bfq_bfqq_sync(in_service_bfqq) && bfq_bfqq_sync(bfqq)) {
>> +		new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
>> +		if (new_bfqq != NULL)
>> +			return new_bfqq; /* Merge with in-service queue */
>> +	}
>> +
>> +	/*
>> +	 * Check whether there is a cooperator among currently scheduled
>> +	 * queues. The only thing we need is that the bio/request is not
>> +	 * NULL, as we need it to establish whether a cooperator exists.
>> +	 */
>> +check_scheduled:
>> +	new_bfqq = bfq_close_cooperator(bfqd, bfqq,
>> +					bfq_io_struct_pos(io_struct, request));
>> +	if (new_bfqq)
>> +		return bfq_setup_merge(bfqq, new_bfqq);
> 
> Why don't in_service_queue and scheduled search share the cooperation
> conditions?  They should be the same, right?  Shouldn't the function
> be structured like the following instead?
> 
> 	if (bfq_try_close_cooperator(current_one, in_service_one))
> 		return in_service_one;
> 
> 	found = bfq_find_close_cooperator();
> 	if (bfq_try_close_cooperator(current_one, found))
> 		return found;
> 	return NULL;
> 
> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 09/12] block, bfq: reduce latency during request-pool saturation
@ 2014-06-02  9:54                   ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-02  9:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA


Il giorno 31/mag/2014, alle ore 15:54, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ha scritto:

> On Thu, May 29, 2014 at 11:05:40AM +0200, Paolo Valente wrote:
>> This patch introduces an heuristic that reduces latency when the
>> I/O-request pool is saturated. This goal is achieved by disabling
>> device idling, for non-weight-raised queues, when there are weight-
>> raised queues with pending or in-flight requests. In fact, as
>> explained in more detail in the comment to the function
>> bfq_bfqq_must_not_expire(), this reduces the rate at which processes
>> associated with non-weight-raised queues grab requests from the pool,
>> thereby increasing the probability that processes associated with
>> weight-raised queues get a request immediately (or at least soon) when
>> they need one.
> 
> Wouldn't it be more straight-forward to simply control how many
> requests each queue consume by returning ELV_MQUEUE_NO?

Arianna proposed me exactly this improvement about two months ago. The problem was that our TO-ADD list already contained several other improvements. So, to avoid waiting several more years before (re)submitting bfq, I decided that it would have been better to finally freeze the code for a while and pack it for submission.

If you agree, investigating and implementing this improvement will be our next step immediately after, and if the current version of bfq gets merged in some form.

Thanks,
Paolo

>  Seeky ones do
> benefit from larger number of requests in elevator but to only certain
> number given the fifo timeout anyway and controlling that explicitly
> would be a lot easier to anticipate the behavior of than playing
> roulette with random request allocation failures.
> 
> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 08/12] block, bfq: preserve a low latency also with NCQ-capable drives
  2014-05-31 13:48                 ` Tejun Heo
@ 2014-06-02  9:58                     ` Paolo Valente
  -1 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-02  9:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA


Il giorno 31/mag/2014, alle ore 15:48, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ha scritto:

> On Thu, May 29, 2014 at 11:05:39AM +0200, Paolo Valente wrote:
>> This patch addresses this issue by not disabling device idling for
> 
> This patch addresses this issue by allowing device idling for...
> 
>> weight-raised queues, even if the device supports NCQ. This allows BFQ
>> to start serving a new queue, and therefore allows the drive to
> 
> This disallows BFQ to start serving a new queue, and therefore the
> drive to prefetch new requests, until the idling timeout expires.
> 
> Prefetch?  Can you elaborate?

By prefetching I meant simply the drive internally queuing more than one request, so that it can start serving the next one as soon as the current one completes. I hope this clarifies things a little bit.

Paolo

> 
> Thanks.
> 
> -- 
> tejun
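
As a rough sketch of the decision being discussed (assumed field names
wr_coeff and hw_tag; not the actual bfq code): keep the idle timer armed for
weight-raised queues even when the drive is NCQ-capable, so that the drive
cannot fetch and start serving requests from other queues in the meantime.

static bool bfq_bfqq_must_idle(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
        /* Weight-raised queues keep idling even on NCQ-capable drives. */
        if (bfqq->wr_coeff > 1)
                return true;

        /* Otherwise, idling on an NCQ-capable drive only costs throughput. */
        return !bfqd->hw_tag;
}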


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
       [not found]                     ` <20140530232804.GA5057-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
  2014-05-30 23:54                         ` Paolo Valente
@ 2014-06-02 11:14                       ` Pavel Machek
  1 sibling, 0 replies; 247+ messages in thread
From: Pavel Machek @ 2014-06-02 11:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri 2014-05-30 19:28:04, Tejun Heo wrote:
> Hello,
> 
> On Sat, May 31, 2014 at 12:23:01AM +0200, Paolo Valente wrote:
> > I do agree that bfq has essentially the same purpose as cfq. I am
> > not sure that it is what you are proposing, but, in my opinion,
> > since both the engine and all the new heuristics of bfq differ from
> > those of cfq, a replacement would be most certainly a much easier
> > solution than any other transformation of cfq into bfq (needless to
> > say, leaving the same name for the scheduler would not be a problem
> > for me). Of course, before that we are willing to improve what has
> > to be improved in bfq.
> 
> Well, it's all about how to actually route the changes and in general
> whenever avoidable we try to avoid whole-sale code replacement
> especially when most of the structural code is similar like in this
> case.  Gradually evolving cfq to bfq is likely to take more work but
> I'm very positive that it'd definitely be a lot easier to merge the
> changes that way and people involved, including the developers and
> reviewers, would acquire a lot clearer picture of what's going on in
> the process.  For example, AFAICS, most of the heuristics added by

Would it make sense to merge bfq first, _then_ turn cfq into bfq, then
remove bfq?

That way

1. Users like me would see the improvements soon.

2. BFQ would get more testing early.

3. If some workload hits a problem, switching between bfq and cfq will
be easier than playing with git/patches.

Now, I see it is more work for the storage maintainers, because there will
be more code to maintain in the interim. But perhaps the user advantages
are worth it?

Thanks,

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-06-02 11:14                       ` Pavel Machek
@ 2014-06-02 13:02                           ` Pavel Machek
  -1 siblings, 0 replies; 247+ messages in thread
From: Pavel Machek @ 2014-06-02 13:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA

Hi!

> > Well, it's all about how to actually route the changes and in general
> > whenever avoidable we try to avoid whole-sale code replacement
> > especially when most of the structural code is similar like in this
> > case.  Gradually evolving cfq to bfq is likely to take more work but
> > I'm very positive that it'd definitely be a lot easier to merge the
> > changes that way and people involved, including the developers and
> > reviewers, would acquire a lot clearer picture of what's going on in
> > the process.  For example, AFAICS, most of the heuristics added by
> 
> Would it make sense to merge bfq first, _then_ turn cfq into bfq, then
> remove bfq?
> 
> That way
> 
> 1. Users like me would see improvements soon 
> 
> 2. BFQ would get more testing early. 

Like this: I applied the patch over today's git...

I only see the last bits of the panic...

Call trace:
__bfq_bfqq_expire
bfq_bfqq_expire
bfq_dispatch_requests
sci_request_fn
...
EIP: T.1839+0x26
Kernel panic - not syncing: Fatal exception in interrupt
Shutting down cpus with NMI

...

Will retry.

Any ideas?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-05-31  5:16                   ` Tejun Heo
@ 2014-06-02 14:29                       ` Jens Axboe
  -1 siblings, 0 replies; 247+ messages in thread
From: Jens Axboe @ 2014-06-02 14:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

On 2014-05-30 23:16, Tejun Heo wrote:
>> for turning patch #2 into a series of changes for CFQ instead. We need to
>> end up with something where we can potentially bisect our way down to
>> whatever caused any given regression. The worst possible situation is "CFQ
>> works fine for this workload, but BFQ does not" or vice versa. Or hangs, or
>> whatever it might be.
>
> It's likely that there will be some workloads out there which may be
> affected adversely, which is true for any change really but with both
> the core scheduling and heuristics properly characterized at least
> finding a reasonable trade-off should be much less of a crapshoot and
> the expected benefits seem to easily outweigh the risks as long as we
> can properly sequence the changes.

Exactly, I think we are pretty much on the same page here. As I said in 
the previous email, the biggest thing I care about is not adding a new 
IO scheduler wholesale. If Paolo can turn the "add BFQ" patch into a 
series of patches against CFQ, then I would have no issue merging it for 
testing (and inclusion, when it's stable enough).

One thing I've neglected to bring up but have been thinking about - 
we're quickly getting to the point where the old request_fn IO path will 
become a legacy thing, mostly in maintenance mode. That isn't a problem 
for morphing bfq and cfq, but it does mean that development efforts in 
this area would be a lot better spent writing an IO scheduler that fits 
into the blk-mq framework instead.

I realize this is a tall order right now, as I haven't included any sort 
of framework for that in blk-mq yet. So what I envision happening is 
that I will write a basic deadline (ish) scheduler for blk-mq, and 
hopefully others can then pitch in and we can get the ball rolling on 
that side as well.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
       [not found]                       ` <538C8A47.1050502-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2014-06-02 17:24                         ` Tejun Heo
  2014-06-17 15:55                           ` Paolo Valente
  1 sibling, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-02 17:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

Hello, Jens.

On Mon, Jun 02, 2014 at 08:29:27AM -0600, Jens Axboe wrote:
> One thing I've neglected to bring up but have been thinking about - we're
> quickly getting to the point where the old request_fn IO path will become a
> legacy thing, mostly in maintenance mode. That isn't a problem for morphing
> bfq and cfq, but it does mean that development efforts in this area would be
> a lot better spent writing an IO scheduler that fits into the blk-mq
> framework instead.

What I'm planning right now is improving blkcg so that it can do both
proportional and hard limits with high cpu scalability, most likely
using percpu charge caches.  It probably would be best to roll all
those into one piece of logic.  I don't think (or at least I hope) that
we'd need multiple modular scheduler / blkcg implementations for blk-mq;
both can be served by built-in scheduling logic.
Regardless of device speed, we'd need some form of fairness
enforcement after all.

Thanks.
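
A very rough sketch of the percpu charge-cache idea (purely illustrative;
the structure, field names and batch size below are assumptions, not
existing blkcg code): each CPU pulls budget from the shared per-group
counter in batches and then charges I/O locally, so the shared atomic is
touched only once per batch.

#define CHARGE_BATCH    (1 << 20)       /* refill granularity (assumed) */

struct iog_budget {
        atomic64_t shared;              /* group-wide remaining budget */
        long __percpu *cached;          /* per-CPU pre-charged budget */
};

static bool iog_try_charge(struct iog_budget *b, long amount)
{
        long *cache = get_cpu_ptr(b->cached);
        bool ok = true;

        if (*cache < amount) {
                long batch = max(amount, (long)CHARGE_BATCH);

                if (atomic64_sub_return(batch, &b->shared) < 0) {
                        atomic64_add(batch, &b->shared);        /* undo, over limit */
                        ok = false;
                } else {
                        *cache += batch;
                }
        }
        if (ok)
                *cache -= amount;

        put_cpu_ptr(b->cached);
        return ok;
}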

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
       [not found]                         ` <20140602172454.GA8912-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-06-02 17:32                           ` Jens Axboe
  0 siblings, 0 replies; 247+ messages in thread
From: Jens Axboe @ 2014-06-02 17:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

On 06/02/2014 11:24 AM, Tejun Heo wrote:
> Hello, Jens.
> 
> On Mon, Jun 02, 2014 at 08:29:27AM -0600, Jens Axboe wrote:
>> One thing I've neglected to bring up but have been thinking about - we're
>> quickly getting to the point where the old request_fn IO path will become a
>> legacy thing, mostly in maintenance mode. That isn't a problem for morphing
>> bfq and cfq, but it does mean that development efforts in this area would be
>> a lot better spent writing an IO scheduler that fits into the blk-mq
>> framework instead.
> 
> What I'm planning right now is improving blkcg so that it can do both
> proportional and hard limits with high cpu scalability, most likely
> using percpu charge caches.  It probably would be best to roll all
> those into one piece of logic.  I don't think, well at least hope,
> that we'd need multiple modular scheduler / blkcg implementations for
> blk-mq and both can be served by built-in scheduling logic.
> Regardless of device speed, we'd need some form of fairness
> enforcement after all.

For things like blkcg, I agree, it should be possible to make it common,
reusable code. But there's a need for scheduling beyond that, for people who
don't use control groups (i.e. most...). And it'd be hard to retrofit cfq
into blk-mq without rewriting it. I don't believe we need anything this
fancy for blk-mq, hopefully. At least having simple deadline scheduling
would be Good Enough for the foreseeable future.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-06-02 11:14                       ` Pavel Machek
@ 2014-06-02 17:33                           ` Tejun Heo
  -1 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-02 17:33 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jens Axboe, Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA

Hello, Pavel.

On Mon, Jun 02, 2014 at 01:14:33PM +0200, Pavel Machek wrote:
> Now.. I see it is more work for storage maintainers, because there'll
> be more code to maintain in the interim. But perhaps user advantages
> are worth it?

I'm quite skeptical about going that route.  Not necessarily because
of the extra amount of work, but more because of the higher probability
of getting into a situation where we can neither push forward nor back
out.  It's difficult to define a clear deadline, and there will likely
be unforeseen challenges in the planned convergence of the two
schedulers; eventually, it isn't too unlikely that we end up in a
situation where we have to admit defeat and just keep both schedulers.
Note that developer overhead isn't the only factor here.  Providing two
slightly different alternatives inevitably makes userland grow
dependencies on the subtleties of both, and there's a lot less pressure
to make judgement calls and take appropriate trade-offs, which has a
fairly high chance of deadlocking progress in any direction.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-06-02 17:32                           ` Jens Axboe
@ 2014-06-02 17:42                               ` Tejun Heo
  -1 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-02 17:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

Hello, Jens.

On Mon, Jun 02, 2014 at 11:32:05AM -0600, Jens Axboe wrote:
> For things like blkcg, I agree, it should be able to be common code and
> reusable. But there's a need for scheduling beyond that, for people that
> don't use control groups (ie most...). And it'd be hard to retrofit cfq
> into blk-mq, without rewriting it. I don't believe we need anything this
> fancy for blk-mq, hopefully. At least having simple deadline scheduling
> would be Good Enough for the foreseeable future.

Heh, looks like we're miscommunicating.  I don't think anything with
the level of complexity of cfq is realistic for high-iops devices.  It
has already become a liability for SATA ssds after all.  My suggestion
is that, as hierarchical scheduling tends to be a logical extension of
flat scheduling, it probably would make sense to implement both
scheduling logics in the same framework as in the cpu scheduler or (to
a lesser extent) cfq.  So, a new blk-mq scheduler which can work in
hierarchical mode if blkcg is in active use.

One part I was wondering about is whether we'd need to continue the
modular multiple implementation mechanism.  For rotating disks, for
various reasons including some historical ones, we ended up with
multiple ioscheds and somewhat uglily layered blkcg implementations.
Given that the expected characteristics of blk-mq devices are more
consistent, it could be reasonable to stick with single iops and/or
bandwidth scheme.

Thanks.
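
One way to picture the "same framework, optionally hierarchical" idea above
is sketched below; the names are purely illustrative (they do not come from
cfq, bfq or the cpu scheduler): the service-tree code schedules generic
entities, and an entity is either a plain queue (flat mode) or a group that
carries its own tree when blkcg is in use.

struct io_sched_entity {
        struct rb_node node;            /* position in the parent's service tree */
        u64 vfinish;                    /* virtual finish time used for ordering */
        unsigned int weight;
        struct io_sched_group *parent;  /* NULL in flat mode */
};

struct io_sched_group {
        struct io_sched_entity entity;  /* a group is itself schedulable */
        struct rb_root service_tree;    /* children: queues or sub-groups */
};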

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-06-02 17:42                               ` Tejun Heo
@ 2014-06-02 17:46                                   ` Jens Axboe
  -1 siblings, 0 replies; 247+ messages in thread
From: Jens Axboe @ 2014-06-02 17:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

On 06/02/2014 11:42 AM, Tejun Heo wrote:
> Hello, Jens.
> 
> On Mon, Jun 02, 2014 at 11:32:05AM -0600, Jens Axboe wrote:
>> For things like blkcg, I agree, it should be able to be common code and
>> reusable. But there's a need for scheduling beyond that, for people that
>> don't use control groups (ie most...). And it'd be hard to retrofit cfq
>> into blk-mq, without rewriting it. I don't believe we need anything this
>> fancy for blk-mq, hopefully. At least having simple deadline scheduling
>> would be Good Enough for the foreseeable future.
> 
> Heh, looks like we're miscommunicating.  I don't think anything with
> the level of complexity of cfq is realistic for high-iops devices.  It
> has already become a liability for SATA ssds after all.  My suggestion
> is that as hierarchical scheduling tends to be logical extension of
> flat scheduling, it probably would make sense to implement both
> scheduling logics in the same framework as in the cpu scheduler or (to
> a lesser extent) cfq.  So, a new blk-mq scheduler which can work in
> hierarchical mode if blkcg is in active use.

But blk-mq will potentially drive anything, so a more expensive
scheduling variant might not be out of the question, if it makes any
sense to do so, of course. At least until there's no more rotating stuff out
there :-). But it's not a priority at all to me yet. As long as we have
coexisting IO paths, it'd be trivial to select the needed one based on
the device characteristics.

> One part I was wondering about is whether we'd need to continue the
> modular multiple implementation mechanism.  For rotating disks, for
> various reasons including some historical ones, we ended up with
> multiple ioscheds and somewhat uglily layered blkcg implementations.
> Given that the expected characteristics of blk-mq devices are more
> consistent, it could be reasonable to stick with single iops and/or
> bandwidth scheme.

I hope not to do that. I just want something sane and simple (like a
deadline scheduler), nothing more.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
       [not found]                                   ` <538CB87C.7030600-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2014-06-02 18:51                                     ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-02 18:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

Hello,

On Mon, Jun 02, 2014 at 11:46:36AM -0600, Jens Axboe wrote:
> But blk-mq will potentially drive anything, so it might not be out of
> the question with a more expensive scheduling variant, if it makes any
> sense to do of course. At least until there's no more rotating stuff out
> there :-). But it's not a priority at all to me yet. As long as we have
> coexisting IO paths, it'd be trivial to select the needed one based on
> the device characteristics.

Hmmm... yeah, moving rotating devices over to blk-mq doesn't really
seem beneficial to me.  I think the behavioral differences between
rotating rust and newer solid-state devices are too fundamental for them
to share a single code path for things like scheduling, and selecting
the appropriate path depending on the actual device sounds like a much
better plan even in the long term.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-06-02 18:51                                     ` Tejun Heo
@ 2014-06-02 20:57                                         ` Jens Axboe
  -1 siblings, 0 replies; 247+ messages in thread
From: Jens Axboe @ 2014-06-02 20:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA, Paolo Valente

On Mon, Jun 02 2014, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jun 02, 2014 at 11:46:36AM -0600, Jens Axboe wrote:
> > But blk-mq will potentially drive anything, so it might not be out of
> > the question with a more expensive scheduling variant, if it makes any
> > sense to do of course. At least until there's no more rotating stuff out
> > there :-). But it's not a priority at all to me yet. As long as we have
> > coexisting IO paths, it'd be trivial to select the needed one based on
> > the device characteristics.
> 
> Hmmm... yeah, moving rotating devices over to blk-mq doesn't really
> seem beneficial to me.  I think there are fundamental behavioral
> differences for rotating rusts and newer solid state devices to share
> single code path for things like scheduling and selecting the
> appropriate path depending on the actual devices sounds like a much
> better plan even in the long term.

It's not so much about it being more beneficial to run in blk-mq, as it
is about not having two code paths. But yes, we're likely going to
maintain that code for a long time, so it's not going anywhere anytime
soon.

And for scsi-mq, it's already opt-in, though on a per-host basis. Doing
finer granularity than that is going to be difficult, unless we let
legacy-block and blk-mq share a tag map (though that would not be too
hard).

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-06-02 17:33                           ` Tejun Heo
@ 2014-06-03  4:12                               ` Mike Galbraith
  -1 siblings, 0 replies; 247+ messages in thread
From: Mike Galbraith @ 2014-06-03  4:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Paolo Valente,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Pavel Machek, Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, 2014-06-02 at 13:33 -0400, Tejun Heo wrote: 
> Hello, Pavel.
> 
> On Mon, Jun 02, 2014 at 01:14:33PM +0200, Pavel Machek wrote:
> > Now.. I see it is more work for storage maintainers, because there'll
> > be more code to maintain in the interim. But perhaps user advantages
> > are worth it?
> 
> I'm quite skeptical about going that route.  Not necessarily because
> of the extra amount of work but more the higher probability of getting
> into situation where we can neither push forward or back out.  It's
> difficult to define clear deadline and there will likely be unforeseen
> challenges in the planned convergence of the two schedulers,
> eventually, it isn't too unlikely to be in a situation where we have
> to admit defeat and just keep both schedulers.  Note that developer
> overhead isn't the only factor here.  Providing two slightly different
> alternatives inevitably makes userland grow dependencies on subtleties
> of both and there's a lot less pressure to make judgement calls and
> take appropriate trade-offs, which have fairly high chance of
> deadlocking progress towards any direction.

But OTOH..

This thing (allegedly) fixes issues that have existed for ages, issues
which have (also allegedly) not been fixed in all that time despite a
number of people having done a lot of this and that over the years.  If
the claims are true, it seems to me that would make BFQ a bit special, and
perhaps worth some extra leeway and effort to ensure that what we are
being offered on a silver platter doesn't molder away out of tree forever.

If it were, say, put in staging, and it were stated right up front that it
isn't ever going to go further (Jens already said that more or less),
and _will_ drop dead if it stagnates, that would surely increase the
test base to shake out problem spots (surely it has some), and allow
users who meet an issue in either IO scheduler to verify it with the
flick of a switch every step of the way to whichever ending, and maybe
even motivate other IO people to help with the merge and/or to compare
their changes at the flick of that same switch.

-Mike

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 10/12] block, bfq: add Early Queue Merge (EQM)
  2014-06-02  9:46                   ` Paolo Valente
@ 2014-06-03 16:28                       ` Tejun Heo
  -1 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-03 16:28 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, containers, linux-kernel, Mauro Andreolini,
	Fabio Checconi, Arianna Avanzini, cgroups

Hello,

On Mon, Jun 02, 2014 at 11:46:45AM +0200, Paolo Valente wrote:
> > I don't really follow the last part.  So, the difference is that
> > cooperating queue setup also takes place during bio merge too, right?
> 
> Not only, in bfq an actual queue merge is performed in the bio-merge hook.

I think I'm a bit confused because it's named "early" queue merge
while it actually moves queue merging later than cfq - set_request()
happens before bio/rq merging.  So, what it tries to do is
compensating for the lack of cfq_rq_close() preemption at request
issue time, right?

> > cfq does it once when allocating the request.  That seems a lot more
> > reasonable to me.  It's doing that once for one start sector.  I mean,
> > plugging is usually extremely short compared to actual IO service
> > time.  It's there to mask the latencies between bio issues that the
> > same CPU is doing.  I can't see how this earliness can be actually
> > useful.  Do you have results to back this one up?  Or is this just
> > born out of thin air?
> 
> Arianna added the early-queue-merge part in the allow_merge_fn hook
> about one year ago, as a consequence of a throughput loss of about
> 30% with KVM/QEMU workloads. In particular, we ran most of the tests
> on a WDC WD60000HLHX-0 Velociraptor. That HDD might not be available
> for testing any more, but we can reproduce our results for you on
> other HDDs, with and without early queue merge. And, maybe through
> traces, we can show you that the reason for the throughput loss is
> exactly that described (in a wordy way) in this patch. Of course
> unless we have missed something.

Oh, as long as it makes a measurable difference, I have no objection;
however, I do think more explanation and comments would be nice.  I
still can't quite understand why retrying on each merge attempt would
make so much difference.  Maybe I just failed to understand what you
wrote in the commit message.  Is it because the cooperating tasks
issue IOs which grow large and close enough after merges but not on
the first bio issuance?  If so, why isn't doing it on rq merge time
enough?  Is the timing sensitive enough for certain workloads that
waiting till unplug time misses the opportunity?  But plugging should
be relatively short compared to the time actual IOs take, so why would
it be that sensitive?  What am I missing here?

Thanks.

-- 
tejun
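
[Editor's note: the sketch below generates the kind of "cooperating queues" workload EQM is aimed at, e.g. the KVM/QEMU case mentioned in the quoted text: two processes each read every other chunk of the same file, so neither stream is sequential on its own while the interleaving of the two is. The file name, chunk size and chunk count are arbitrary; this is a workload generator, not BFQ code.]

#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK   (64 * 1024)
#define NCHUNKS 1024

static void reader(const char *file, int parity)
{
        char *buf = malloc(CHUNK);
        int fd = open(file, O_RDONLY);

        if (fd < 0 || !buf)
                exit(1);

        /* Read only the chunks whose index matches our parity. */
        for (int i = parity; i < NCHUNKS; i += 2)
                if (pread(fd, buf, CHUNK, (off_t)i * CHUNK) < 0)
                        break;

        close(fd);
        free(buf);
        exit(0);
}

int main(int argc, char **argv)
{
        const char *file = argc > 1 ? argv[1] : "testfile";

        if (fork() == 0)
                reader(file, 0);        /* even-numbered chunks */
        if (fork() == 0)
                reader(file, 1);        /* odd-numbered chunks */

        wait(NULL);
        wait(NULL);
        return 0;
}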

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-06-02 13:02                           ` Pavel Machek
@ 2014-06-03 16:54                               ` Paolo Valente
  -1 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-03 16:54 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jens Axboe, containers, linux-kernel, Fabio Checconi,
	Arianna Avanzini, Tejun Heo, cgroups


On 02 Jun 2014, at 15:02, Pavel Machek <pavel@ucw.cz> wrote:

> Hi!
> 
>>> Well, it's all about how to actually route the changes and in general
>>> whenever avoidable we try to avoid whole-sale code replacement
>>> especially when most of the structural code is similar like in this
>>> case.  Gradually evolving cfq to bfq is likely to take more work but
>>> I'm very positive that it'd definitely be a lot easier to merge the
>>> changes that way and people involved, including the developers and
>>> reviewers, would acquire a lot clearer picture of what's going on in
>>> the process.  For example, AFAICS, most of the heuristics added by
>> 
>> Would it make sense to merge bfq first, _then_ turn cfq into bfq, then
>> remove bfq?
>> 
>> That way
>> 
>> 1. Users like me would see improvements soon 
>> 
>> 2. BFQ would get more testing early. 
> 
> Like this: I applied patch over today's git... 
> 
> I only see last bits of panic...
> 
> Call trace:
> __bfq_bfqq_expire
> bfq_bfqq_expire
> bfq_dispatch_requests
> sci_request_fn
> ...
> EIP: T.1839+0x26
> Kernel panic - not syncing: Fatal exception in interrupt
> Shutting down cpus with NMI
> 
> ...
> 
> Will retry.
> 
> Any ideas?
> 			

We have tried to think about ways to trigger this failure, but in vain. Unfortunately, so far no user has reported any failure with this last version of bfq either. Finally, we have gone through a new static analysis, but again to no avail.

So, if you are willing to retry, we have put online a version of the code filled with many BUG_ONs. I hope they can make it easier to track down the bug. The archive is here:
http://algogroup.unimore.it/people/paolo/disk_sched/debugging-patches/3.15.0-rc8-v7rc5.tgz
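
[Editor's note: for readers unfamiliar with the debugging approach just mentioned, BUG_ON() makes the kernel stop loudly as soon as a stated invariant is violated, which narrows down where state first went wrong instead of crashing much later. Below is a standalone imitation; the queue structure and the invariant are invented for illustration and are not the actual checks in the debugging patches.]

#include <stdio.h>
#include <stdlib.h>

#define BUG_ON(cond)                                                    \
        do {                                                            \
                if (cond) {                                             \
                        fprintf(stderr, "BUG at %s:%d: %s\n",           \
                                __FILE__, __LINE__, #cond);             \
                        abort();                                        \
                }                                                       \
        } while (0)

struct toy_queue {
        int queued;             /* requests still queued */
        int in_service;         /* 1 if the queue is currently served */
};

int main(void)
{
        struct toy_queue q = { .queued = 0, .in_service = 0 };

        /* Example invariant: an idle queue must not hold queued requests. */
        BUG_ON(!q.in_service && q.queued != 0);

        puts("invariant holds");
        return 0;
}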

Should this attempt prove useless as well, I will, if you do not mind, ask you for more details about your system and try to reproduce your configuration as closely as I can.

Thanks,
Paolo

> 						Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
       [not found]                     ` <36BFDB73-AEC2-4B87-9FD6-205E9431E722-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-06-03 17:11                       ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-03 17:11 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, containers, linux-kernel, Fabio Checconi,
	Arianna Avanzini, cgroups

Hello,

On Mon, Jun 02, 2014 at 11:26:07AM +0200, Paolo Valente wrote:
> >> #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
> >> -				   bfqd->wr_busy_queues > 0)
> >> +				   (bfqd->wr_busy_queues > 0 || \
> >> +				    (symmetric_scenario && \
> >> +				     blk_queue_nonrot(bfqd->queue))))
> > 
> > 	expire_non_wr = zzz;
> > 
> 
> The solution you propose is the first that came to my mind. But then
> I went for a clumsy macro-based solution because: 1) the whole
> function is all about evaluating a long logical expression, 2) the
> macro-based solution allows the short-circuit to be used at best,
> and the number of steps to be minimized. For example, with async
> queues, only one condition is evaluated.
> 
> Defining three variables entails instead that the value of all the
> variables is computed every time, even if most of the time there is
> no need to.
>
> Would this gain be negligible (sorry for my ignorance), or would it
> nevertheless be enough to justify these unusual macros?

The compiler should be able to optimize those to basically the same
code.  AFAICS, everything the code tests is trivially known to be
without side-effect to the compiler.  Besides, even if the compiler
generates slightly less efficient code, which it shouldn't, it's
highly unlikely that this level of micro CPU cycle optimization would
be measurable for something as heavy as [bc]fq.
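
[Editor's note: a minimal standalone sketch of the two forms being compared follows. The helper functions stand in for bfqd->hw_tag, bfqd->wr_busy_queues > 0, symmetric_scenario and blk_queue_nonrot(); they are assumptions made for illustration. Because they have no side effects, a compiler is free to produce essentially the same code for both forms, which is the point made above.]

#include <stdbool.h>
#include <stdio.h>

static bool hw_tag(void)             { return true;  }
static bool wr_busy(void)            { return false; }
static bool symmetric_scenario(void) { return true;  }
static bool queue_nonrot(void)       { return true;  }

/* Macro form: relies on &&/|| short-circuiting to skip sub-terms. */
#define cond_for_expiring_non_wr                                \
        (hw_tag() && (wr_busy() ||                              \
                      (symmetric_scenario() && queue_nonrot())))

int main(void)
{
        /* Variable form: the condition gets a name and explicit intent. */
        bool expire_non_wr =
                hw_tag() && (wr_busy() ||
                             (symmetric_scenario() && queue_nonrot()));

        printf("macro: %d, variable: %d\n",
               (int)cond_for_expiring_non_wr, (int)expire_non_wr);
        return 0;
}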

> > This optimization may be theoretically interesting but doesn't seem
> > practical at all and would make the system behave distinctly
> > differently depending on something which is extremely subtle and seems
> > completely unrelated.  Furthermore, on any system which uses blkcg,
> > ext4, btrfs or has any task which has non-zero nice value, it won't
> > make any difference.  Its value is only theoretical.
> 
> Turning on idling unconditionally when blkcg is used, is one of the
> first solutions we have considered. But there seem to be practical
> scenarios where this would cause an unjustified loss of
> throughput. The main example for us was ulatencyd, which AFAIK
> creates one group for each process and, by default, assigns to all
> processes the same weight. But the assigned weight is not the one
> associated to the default ioprio.

Isn't the optimization "not idling" when these conditions are met?
Shouldn't the comparison be against the benefit of "not idling
selectively" vs "always idling" when blkcg is in use?

Another problem there is that this not only depends on the number of
processes but the number of threads in it.  cgroup is moving away from
allowing threads of a single process in different cgroups, so this
means that the operation can fluctuate in a very unexpected manner.

I'm not really convinced about the approach.  With rotating disks, we
> know that allowing queue depth > 1 generally lowers both throughput and
responsiveness and brings benefits in quite restricted cases.  It
seems rather backwards to always allow QD > 1 and then try to optimize
in an attempt to recover what's lost.  Wouldn't it make far more sense
to actively maintain QD == 1 by default and allow QD > 1 in specific
cases where it can be determined to be more beneficial than harmful?

> I do not know how widespread a mechanism like ulatencyd is
> precisely, but in the symmetric scenario it creates, the throughput
> on, e.g., an HDD would drop by half if the workload is mostly random
> and we removed the more complex mechanism we set up.  Wouldn't this
> be bad?

It looks like a lot of complexity for optimization for a very
specific, likely unreliable (in terms of its triggering condition),
use case.  The triggering condition is just too specific.

> > Another thing to consider is that virtually all remotely modern
> > devices, rotational or not, are queued. At this point, it's rather
> > pointless to design one behavior for !queued and another for queued.
> > Things should just be designed for queued devices.
> 
> I am sorry for expressing doubts again (mainly because of my
> ignorance), but a few months ago I had to work with some portable
> devices for a company specialized in ARM systems. As an HDD, they
> were using a Toshiba MK6006GAH. If I remember well, this device had
> no NCQ. Instead of the improvements that we obtained by using bfq
> with this slow device, removing the differentiated behavior of bfq
> with respect to queued/!queued devices would have caused just a loss
> of throughput.

Heh, that's a 60GB ATA-100 hard drive.  Had no idea those are still
being produced.  However, my point still is that the design should be
focused on queued devices.  They're predominant in the market and
it'll only continue to become more so.  What bothers me is that the
scheduler essentially loses control and shows sub-optimal behavior on
queued devices by default, and that's how it's gonna perform in the
vast majority of use cases.

> >  I don't know what
> > the solution is but given that the benefits of NCQ for rotational
> > devices are extremely limited, sticking with the single request model in
> > most cases and maybe allowing queued operation for specific workloads
> > might be a better approach.  As for ssds, just do something simple.
> > It's highly likely that most ssds won't travel this code path in the
> > near future anyway.
> 
> This is the point that worries me mostly. As I pointed out in one of my previous emails, dispatching requests to an SSD  without control causes high latencies, or even complete unresponsiveness (Figure 8 in
> http://algogroup.unimore.it/people/paolo/disk_sched/extra_results.php
> or Figure 9 in
> http://algogroup.unimore.it/people/paolo/disk_sched/results.php).
> 
> I am of course aware that efficiency is a critical issue with fast
> devices, and is probably destined to become more and more critical
> in the future. But, as a user, I would be definitely unhappy with a
> system that can, e.g., update itself in one minute instead of five,
> but, during that minute may become unresponsive. In particular, I
> would not be pleased to buy a more expensive SSD and get a much less
> responsive system than that I had with a cheaper HDD and bfq fully
> working.

blk-mq is right around the corner and newer devices won't travel this
path at all.  Hopefully, ahci will be served through blk-mq too
when it's connected to ssds, so its usefulness for high performance
devices will diminish rather quickly over the coming several years.  It
sure would be nice to still be able to carry some optimizations but it
does shift the trade-off balance in terms of how much extra complexity
is justified.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-06-03 16:54                               ` Paolo Valente
@ 2014-06-03 20:40                                   ` Pavel Machek
  -1 siblings, 0 replies; 247+ messages in thread
From: Pavel Machek @ 2014-06-03 20:40 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, containers, linux-kernel, Fabio Checconi,
	Arianna Avanzini, Tejun Heo, cgroups

Hi!

> >>> Well, it's all about how to actually route the changes and in general
> >>> whenever avoidable we try to avoid whole-sale code replacement
> >>> especially when most of the structural code is similar like in this
> >>> case.  Gradually evolving cfq to bfq is likely to take more work but
> >>> I'm very positive that it'd definitely be a lot easier to merge the
> >>> changes that way and people involved, including the developers and
> >>> reviewers, would acquire a lot clearer picture of what's going on in
> >>> the process.  For example, AFAICS, most of the heuristics added by
> >> 
> >> Would it make sense to merge bfq first, _then_ turn cfq into bfq, then
> >> remove bfq?
> >> 
> >> That way
> >> 
> >> 1. Users like me would see improvements soon 
> >> 
> >> 2. BFQ would get more testing early. 
> > 
> > Like this: I applied patch over today's git... 
> > 
> > I only see last bits of panic...
> > 
> > Call trace:
> > __bfq_bfqq_expire
> > bfq_bfqq_expire
> > bfq_dispatch_requests
> > sci_request_fn
> > ...
> > EIP: T.1839+0x26
> > Kernel panic - not syncing: Fatal exception in interrupt
> > Shutting down cpus with NMI
> > 
> > ...
> > 
> > Will retry.
> > 
> > Any ideas?
> > 			

>  We have tried to think about ways to trigger this failure, but in
> vain. Unfortunately, so far no user has reported any failure with
> this last version of bfq either. Finally, we have gone through a new
> static analysis, but again to no avail.

Ok, it is pretty much reproducible here: system just will not finish
booting.

> So, if you are willing to retry, we have put online a version of the code filled with many BUG_ONs. I hope they can make it easier to track down the bug. The archive is here:
> http://algogroup.unimore.it/people/paolo/disk_sched/debugging-patches/3.15.0-rc8-v7rc5.tgz
> 

Ok, let me try.

> Should this attempt prove useless as well, I will, if you do not mind, ask you for more details about your system and try to reproduce your configuration as closely as I can.
> 

It is a Thinkpad X60 notebook, an x86-32 machine with 2GB of RAM.

But I think it died on my x86-32 core duo desktop, too. 

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
  2014-06-03 17:11                       ` Tejun Heo
@ 2014-06-04  7:29                           ` Paolo Valente
  -1 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-04  7:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, containers, linux-kernel, Fabio Checconi,
	Arianna Avanzini, cgroups


On 03 Jun 2014, at 19:11, Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Mon, Jun 02, 2014 at 11:26:07AM +0200, Paolo Valente wrote:
>>>> #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
>>>> -				   bfqd->wr_busy_queues > 0)
>>>> +				   (bfqd->wr_busy_queues > 0 || \
>>>> +				    (symmetric_scenario && \
>>>> +				     blk_queue_nonrot(bfqd->queue))))
>>> 
>>> 	expire_non_wr = zzz;
>>> 
>> 
>> The solution you propose is the first that came to my mind. But then
>> I went for a clumsy macro-based solution because: 1) the whole
>> function is all about evaluating a long logical expression, 2) the
>> macro-based solution allows the short-circuit to be used at best,
>> and the number of steps to be minimized. For example, with async
>> queues, only one condition is evaluated.
>> 
>> Defining three variables entails instead that the value of all the
>> variables is computed every time, even if most of the time there is
>> no need to.
>>
>> Would this gain be negligible (sorry for my ignorance), or would it
>> nevertheless be enough to justify these unusual macros?
> 
> The compiler should be able to optimize those to basically the same
> code.  AFAICS, everything the code tests is trivially known to be
> without side-effect to the compiler.  Besides, even if the compiler
> generates slightly less efficient code, which it shouldn't, it's
> highly unlikely that this level of micro CPU cycle optimization would
> be measurable for something as heavy as [bc]fq.
> 

Thanks a lot for answering this question.

>>> This optimization may be theoretically interesting but doesn't seem
>>> practical at all and would make the system behave distinctly
>>> differently depending on something which is extremely subtle and seems
>>> completely unrelated.  Furthermore, on any system which uses blkcg,
>>> ext4, btrfs or has any task which has non-zero nice value, it won't
>>> make any difference.  Its value is only theoretical.
>> 
>> Turning on idling unconditionally when blkcg is used, is one of the
>> first solutions we have considered. But there seem to be practical
>> scenarios where this would cause an unjustified loss of
>> throughput. The main example for us was ulatencyd, which AFAIK
>> creates one group for each process and, by default, assigns to all
>> processes the same weight. But the assigned weight is not the one
>> associated to the default ioprio.
> 
> Isn't the optimization "not idling" when these conditions are met?

Yes, not idling is the way to go in this case.

> Shouldn't the comparison be against the benefit of "not idling
> selectively" vs "always idling" when blkcg is in use?
> 

Exactly. I’m sorry if I wrote things in a way that did not make this point clear. Maybe this lack of clarity is a further consequence of the annoying “not not” scheme adopted in the code and in the comments.

> Another problem there is that this not only depends on the number of
> processes but the number of threads in it.  cgroup is moving away from
> allowing threads of a single process in different cgroups, so this
> means that the operation can fluctuate in a very unexpected manner.
> 
> I'm not really convinced about the approach.  With rotating disks, we
> know that allowing queue depth > 1 generally lowers both throughput and
> responsiveness and brings benefits in quite restricted cases.  It
> seems rather backwards to always allow QD > 1 and then try to optimize
> in an attempt to recover what's lost.  Wouldn't it make far more sense
> to actively maintain QD == 1 by default and allow QD > 1 in specific
> cases where it can be determined to be more beneficial than harmful?
> 

Although QD == 1 is not denoted explicitly as default, what you suggest is exactly what bfq does. 
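
[Editor's note: a minimal standalone sketch of the kind of decision being discussed follows. It is not the actual BFQ logic; the three inputs are simplified stand-ins for blk_queue_nonrot(), bfqd->hw_tag and the symmetric-scenario notion from this thread, and the policy below only illustrates "QD == 1 by default, QD > 1 in specific cases".]

#include <stdbool.h>
#include <stdio.h>

/* Return true if the scheduler should idle, i.e. keep the effective QD at 1. */
static bool keep_queue_depth_one(bool rotational, bool ncq,
                                 bool symmetric_scenario)
{
        /* Without NCQ the device cannot queue more than one request anyway. */
        if (!ncq)
                return true;

        /*
         * Fast non-rotational device and a symmetric scenario: service
         * guarantees hold even without idling, so let the device queue
         * several requests.
         */
        if (!rotational && symmetric_scenario)
                return false;

        /* Default: idle, i.e. effectively keep QD == 1. */
        return true;
}

int main(void)
{
        printf("HDD, NCQ, asymmetric: keep QD == 1? %d\n",
               keep_queue_depth_one(true, true, false));
        printf("SSD, NCQ, symmetric : keep QD == 1? %d\n",
               keep_queue_depth_one(false, true, true));
        return 0;
}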

>> I do not know how widespread a mechanism like ulatencyd is
>> precisely, but in the symmetric scenario it creates, the throughput
>> on, e.g., an HDD would drop by half if the workload is mostly random
>> and we removed the more complex mechanism we set up.  Wouldn't this
>> be bad?
> 
> It looks like a lot of complexity for optimization for a very
> specific, likely unreliable (in terms of its triggering condition),
> use case.  The triggering condition is just too specific.

Actually, over the last few years we have been asked several times to improve random-I/O performance on HDDs by people who, for the typical tasks performed by their machines, measured much lower throughput than with the other schedulers. Major problems have been reported for server workloads (database, web) and for btrfs. According to the feedback received after introducing this optimization in bfq, those problems finally seem to be gone.

> 
>>> Another thing to consider is that virtually all remotely modern
>>> devices, rotational or not, are queued. At this point, it's rather
>>> pointless to design one behavior for !queued and another for queued.
>>> Things should just be designed for queued devices.
>> 
>> I am sorry for expressing doubts again (mainly because of my
>> ignorance), but a few months ago I had to work with some portable
>> devices for a company specialized in ARM systems. As an HDD, they
>> were using a Toshiba MK6006GAH. If I remember well, this device had
>> no NCQ. Instead of the improvements that we obtained by using bfq
>> with this slow device, removing the differentiated behavior of bfq
>> with respect to queued/!queued devices would have caused just a loss
>> of throughput.
> 
> Heh, that's a 60GB ATA-100 hard drive.  Had no idea those are still
> being produced.  However, my point still is that the design should be
> focused on queued devices.  They're predominant in the market and
> it'll only continue to become more so.  What bothers me is that the
> scheduler essentially loses control and shows sub-optimal behavior on
> queued devices by default, and that's how it's gonna perform in the
> vast majority of use cases.
> 

According to our experiments, this differentiated behavior of bfq is only beneficial: there is apparently no performance loss with respect to any of the other schedulers, in any scenario.

>>> I don't know what
>>> the solution is but given that the benefits of NCQ for rotational
>>> devices are extremely limited, sticking with the single request model in
>>> most cases and maybe allowing queued operation for specific workloads
>>> might be a better approach.  As for ssds, just do something simple.
>>> It's highly likely that most ssds won't travel this code path in the
>>> near future anyway.
>> 
>> This is the point that worries me mostly. As I pointed out in one of my previous emails, dispatching requests to an SSD  without control causes high latencies, or even complete unresponsiveness (Figure 8 in
>> http://algogroup.unimore.it/people/paolo/disk_sched/extra_results.php
>> or Figure 9 in
>> http://algogroup.unimore.it/people/paolo/disk_sched/results.php).
>> 
>> I am of course aware that efficiency is a critical issue with fast
>> devices, and is probably destined to become more and more critical
>> in the future. But, as a user, I would be definitely unhappy with a
>> system that can, e.g., update itself in one minute instead of five,
>> but, during that minute may become unresponsive. In particular, I
>> would not be pleased to buy a more expensive SSD and get a much less
>> responsive system than that I had with a cheaper HDD and bfq fully
>> working.
> 
> blk-mq is right around the corner and newer devices won't travel this
> path at all.  Hopefully, ahci will be served through blk-mq too
> when it's connected to ssds, so its usefulness for high performance
> devices will diminish rather quickly over the coming several years.  It
> sure would be nice to still be able to carry some optimizations but it
> does shift the trade-off balance in terms of how much extra complexity
> is justified.
> 

For users (like me) who do not like losing responsiveness in return for a shorter duration of operations that usually run in the background, the best solution is IMHO to leave them the choice: either preserve responsiveness, or squeeze every MB/s out of the device and/or keep the usage of every core low.

Besides, turning back to bfq, if its low-latency heuristics are disabled for non-rotational devices then, according to our results with slower devices such as SD Cards and eMMCs, latency easily becomes unbearable, with no throughput gain.

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
@ 2014-06-04  7:29                           ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-04  7:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups


Il giorno 03/giu/2014, alle ore 19:11, Tejun Heo <tj@kernel.org> ha scritto:

> Hello,
> 
> On Mon, Jun 02, 2014 at 11:26:07AM +0200, Paolo Valente wrote:
>>>> #define cond_for_expiring_non_wr  (bfqd->hw_tag && \
>>>> -				   bfqd->wr_busy_queues > 0)
>>>> +				   (bfqd->wr_busy_queues > 0 || \
>>>> +				    (symmetric_scenario && \
>>>> +				     blk_queue_nonrot(bfqd->queue))))
>>> 
>>> 	expire_non_wr = zzz;
>>> 
>> 
>> The solution you propose is the first that came to my mind. But then
>> I went for a clumsy macro-based solution because: 1) the whole
>> function is all about evaluating a long logical expression, 2) the
>> macro-based solution allows the short-circuit to be used at best,
>> and the number of steps to be minimized. For example, with async
>> queues, only one condition is evaluated.
>> 
>> Defining three variables entails instead that the value of all the
>> variables is computed every time, even if most of the times there is
>> no need to.
>> 
>> Would this gain be negligible (sorry for my ignorance), or would
>> not it be however enough to justify these unusual macros?
> 
> The compiler should be able to optimize those to basically the same
> code.  AFAICS, everything the code tests is trivially known to be
> without side-effect to the compiler.  Besides, even if the compiler
> generates slightly less efficient code, which it shouldn't, it's
> highly unlikely that this level of micro CPU cycle optimization would
> be measureable for something as heavy as [bc]fq.
> 

Thanks a lot for answering this question.

>>> This optimization may be theoretically interesting but doesn't seem
>>> practical at all and would make the sytem behave distinctively
>>> differently depending on something which is extremely subtle and seems
>>> completely unrelated.  Furthermore, on any system which uses blkcg,
>>> ext4, btrfs or has any task which has non-zero nice value, it won't
>>> make any difference.  Its value is only theoretical.
>> 
>> Turning on idling unconditionally when blkcg is used, is one of the
>> first solutions we have considered. But there seem to be practical
>> scenarios where this would cause an unjustified loss of
>> throughput. The main example for us was ulatencyd, which AFAIK
>> creates one group for each process and, by default, assigns to all
>> processes the same weight. But the assigned weight is not the one
>> associated to the default ioprio.
> 
> Isn't the optimization "not idling" when these conditions are met?

Yes, not idling is the way to go in this case.

> Shouldn't the comparison be against the benefit of "not idling
> selectively" vs "always idling" when blkcg is in use?
> 

Exactly. I’m sorry if I wrote things/sentences that did not let this point be clear. Maybe this lack of clarity is a further consequence of the annoying “not not” scheme adopted in the code and in the comments.

> Another problem there is that this not only depends on the number of
> processes but the number of threads in it.  cgroup is moving away from
> allowing threads of a single process in different cgroups, so this
> means that the operation can fluctuate in a very unexpected manner.
> 
> I'm not really convinced about the approach.  With rotating disks, we
> know that allowing queue depth > 1 generaly lowers both throughput and
> responsiveness and brings benefits in quite restricted cases.  It
> seems rather backwards to always allow QD > 1 and then try to optimize
> in an attempt to recover what's lost.  Wouldn't it make far more sense
> to actively maintain QD == 1 by default and allow QD > 1 in specific
> cases where it can be determined to be more beneficial than harmful?
> 

Although QD == 1 is not denoted explicitly as default, what you suggest is exactly what bfq does. 

>> I do not know how widespread a mechanism like ulatencyd is
>> precisely, but in the symmetric scenario it creates, the throughput
>> on, e.g., an HDD would drop by half if the workload is mostly random
>> and we removed the more complex mechanism we set up.  Wouldn't this
>> be bad?
> 
> It looks like a lot of complexity for optimization for a very
> specific, likely unreliable (in terms of its triggering condition),
> use case.  The triggering condition is just too specific.

Actually we have been asked several times to improve random-I/O performance on HDDs over the last years, by people recording, for the typical tasks performed by their machines, much lower throughput than with the other schedulers. Major problems have been reported for server workloads (database, web), and for btrfs. According to the feedback received after introducing this optimization in bfq, those problems seem to be finally gone.

> 
>>> Another thing to consider is that virtually all remotely modern
>>> devices, rotational or not, are queued. At this point, it's rather
>>> pointless to design one behavior for !queued and another for queued.
>>> Things should just be designed for queued devices.
>> 
>> I am sorry for expressing doubts again (mainly because of my
>> ignorance), but a few months ago I had to work with some portable
>> devices for a company specialized in ARM systems. As an HDD, they
>> were using a Toshiba MK6006GAH. If I remember well, this device had
>> no NCQ. Instead of the improvements that we obtained by using bfq
>> with this slow device, removing the differentiated behavior of bfq
>> with respect to queued/!queued devices would have caused just a loss
>> of throughput.
> 
> Heh, that's 60GB ATA-100 hard drive.  Had no idea those are still
> being produced.  However, my point still is that the design should be
> focused on queued devices.  They're predominant in the market and
> it'll only continue to become more so.  What bothers me is that the
> scheduler essentially loses control and shows sub-optimal behavior on
> queued devices by default and that's how it's gonna perform in vast
> majority of the use cases.
> 

According to our experiments, this differential behavior of bfq is only beneficial, and there is apparently no performance loss with respect to any of the other schedulers and for any scenario.

>>> I don't know what
>>> the solution is but given that the benefits of NCQ for rotational
>>> devices is extremely limited, sticking with single request model in
>>> most cases and maybe allowing queued operation for specific workloads
>>> might be a better approach.  As for ssds, just do something simple.
>>> It's highly likely that most ssds won't travel this code path in the
>>> near future anyway.
>> 
>> This is the point that worries me mostly. As I pointed out in one of my previous emails, dispatching requests to an SSD  without control causes high latencies, or even complete unresponsiveness (Figure 8 in
>> http://algogroup.unimore.it/people/paolo/disk_sched/extra_results.php
>> or Figure 9 in
>> http://algogroup.unimore.it/people/paolo/disk_sched/results.php).
>> 
>> I am of course aware that efficiency is a critical issue with fast
>> devices, and is probably destined to become more and more critical
>> in the future. But, as a user, I would be definitely unhappy with a
>> system that can, e.g., update itself in one minute instead of five,
>> but, during that minute may become unresponsive. In particular, I
>> would not be pleased to buy a more expensive SSD and get a much less
>> responsive system than the one I had with a cheaper HDD and bfq fully
>> working.
> 
> blk-mq is right around the corner and newer devices won't travel this
> path at all.  Hopefully, ahci will be served through blk-mq too
> when it's connected to ssds, so its usefulness for high-performance
> devices will diminish rather quickly over the coming several years.  It
> sure would be nice to still be able to carry some optimizations but it
> does shift the trade-off balance in terms of how much extra complexity
> is justified.
> 

For users (like me) who do not want to lose responsiveness in return for a shorter duration of operations that are usually executed in the background, the best solution is IMHO to leave them the possibility to choose: either preserve responsiveness, or squeeze every MB/s out of the device and/or keep the usage of every core low.

Besides, turning back to bfq, if its low-latency heuristics are disabled for non-rotational devices, then, according to our results with slower devices such as SD Cards and eMMCs, latency easily becomes unbearable, with no throughput gain.
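
As for the possibility to choose mentioned above, here is a minimal sketch of how a user could flip such a trade-off at run time, assuming the scheduler exposes a boolean tunable such as /sys/block/<dev>/queue/iosched/low_latency (the path and the attribute name are my assumption here, not something this patchset guarantees):

/*
 * Minimal sketch: write 1 to a (hypothetical) low_latency tunable to
 * favour responsiveness, or 0 to favour raw throughput / low CPU usage.
 */
#include <stdio.h>

static int set_tunable(const char *dev, const char *attr, int value)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/block/%s/queue/iosched/%s", dev, attr);
        f = fopen(path, "w");
        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%d\n", value);
        return fclose(f);
}

int main(void)
{
        /* favour responsiveness on sda (attribute name is assumed) */
        return set_tunable("sda", "low_latency", 1) ? 1 : 0;
}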

Thanks,
Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
       [not found]                               ` <FCFE0106-A4DD-4DEF-AAAE-040F3823A447-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
  2014-06-03 20:40                                   ` Pavel Machek
@ 2014-06-04  8:39                                 ` Pavel Machek
  2014-06-04  9:08                                   ` Pavel Machek
  2014-06-04 10:03                                 ` BFQ speed tests [was Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler] Pavel Machek
  3 siblings, 0 replies; 247+ messages in thread
From: Pavel Machek @ 2014-06-04  8:39 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

Hi!

> > Like this: I applied patch over today's git... 
> > 
> > I only see last bits of panic...
> > 
> > Call trace:
> > __bfq_bfqq_expire
> > bfq_bfqq_expire
> > bfq_dispatch_requests
> > sci_request_fn
> > ...
> > EIP: T.1839+0x26

> > Any ideas?
> > 			
> 
> We have tried to think of ways to trigger this failure, but in vain. Unfortunately, so far no other user has reported a failure with this last version of bfq. Finally, we have gone through a new round of static analysis, but again to no avail.
> 
> So, if you are willing to retry, we have put online a version of the code filled with many BUG_ONs. I hope they can make it easier to track down the bug. The archive is here:
> http://algogroup.unimore.it/people/paolo/disk_sched/debugging-patches/3.15.0-rc8-v7rc5.tgz
>

BUG: Unable to handle kernel paging request at dee22fa0
IP: bfq_del_bfqq_busy+0x4d
...
Tainted: GW 3.15.0-rc8+
...
Call trace:
__bfq_bfqq_expire
bfq_bfqq_expire
? bfq_bfqq_expire
? bfq_bfqq_expire
bfq_idle_slice_timer
call_timer_fn
...

> Should this attempt be useless as well, I will, if you do not mind, try by asking you more details about your system and reproducing your configuration as much as I can.
>

See the previous email...

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-06-03 16:54                               ` Paolo Valente
@ 2014-06-04  9:08                                   ` Pavel Machek
  -1 siblings, 0 replies; 247+ messages in thread
From: Pavel Machek @ 2014-06-04  9:08 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

Hi!

> Should this attempt be useless as well, I will, if you do not mind,
>try by asking you more details about your system and reproducing your
>configuration as much as I can.

It fails during boot, or shortly afterwards when clicking around in the
gnome2 desktop. I had BFQ as the default scheduler.

Now I set CFQ as the default and it boots (as expected).

root@duo:~# cat /sys/block/sda/queue/scheduler 
noop deadline [cfq] bfq 
root@duo:~# echo bfq > /sys/block/sda/queue/scheduler
root@duo:~# dmesg | grep WARN
WARNING: CPU: 1 PID: 1 at net/wireless/reg.c:479
regulatory_init+0x88/0xf5()
root@duo:~# 

Hmm, and I seem to have a pretty much functional system.

I'll try to do some benchmarks now.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 247+ messages in thread

* BFQ speed tests [was Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler]
       [not found]                               ` <FCFE0106-A4DD-4DEF-AAAE-040F3823A447-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
                                                   ` (2 preceding siblings ...)
  2014-06-04  9:08                                   ` Pavel Machek
@ 2014-06-04 10:03                                 ` Pavel Machek
  3 siblings, 0 replies; 247+ messages in thread
From: Pavel Machek @ 2014-06-04 10:03 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA

Hi!

> Should this attempt be useless as well, I will, if you do not mind, try by asking you more details about your system and reproducing your configuration as much as I can.
> 

Try making BFQ the default scheduler. That seems to break it for me;
when selected at runtime, it looks stable.

Anyway, here are some speed tests. Background load:

root@duo:/data/tmp# echo cfq > /sys/block/sda/queue/scheduler 
root@duo:/data/tmp# echo 3 > /proc/sys/vm/drop_caches
root@duo:/data/tmp# cat /dev/zero > delme; cat /dev/zero > delme;cat
/dev/zero > delme;cat /dev/zero > delme;cat /dev/zero > delme;cat
/dev/zero > delme

(Machine was running out of disk space.)

(I alternate between cfq and bfq).

Benchmark: I chose git describe because it is sometimes part of a
kernel build... and I actually wait for it.

pavel@duo:/data/l/linux-good$ time git describe
warning: refname 'HEAD' is ambiguous.
v3.15-rc8-144-g405dedd

Unfortunately, results are not too good for BFQ. (Can you replicate
the results?)

# BFQ
10.24user 1.62system 467.02 (7m47.028s) elapsed 2.54%CPU
# CFQ
8.55user 1.26system 69.57 (1m9.577s) elapsed 14.11%CPU
# BFQ
11.70user 3.18system 1491.59 (24m51.599s) elapsed 0.99%CPU
# CFQ, no background load
8.51user 0.75system 30.99 (0m30.994s) elapsed 29.91%CPU
# CFQ
8.70user 1.36system 74.72 (1m14.720s) elapsed 13.48%CPU

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: BFQ speed tests [was Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler]
  2014-06-04 10:03                                 ` Pavel Machek
@ 2014-06-04 10:24                                     ` Paolo Valente
  -1 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-04 10:24 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA


Il giorno 04/giu/2014, alle ore 12:03, Pavel Machek <pavel-+ZI9xUNit7I@public.gmane.org> ha scritto:

> Hi!
> 
>> Should this attempt be useless as well, I will, if you do not mind, try by asking you more details about your system and reproducing your configuration as much as I can.
>> 
> 
> Try making BFQ the default scheduler. That seems to break it for me,
> when selected at runtime, it looks stable.
> 
> Anyway, here are some speed tests. Background load:
> 
> root@duo:/data/tmp# echo cfq > /sys/block/sda/queue/scheduler 
> root@duo:/data/tmp# echo 3 > /proc/sys/vm/drop_caches
> root@duo:/data/tmp# cat /dev/zero > delme; cat /dev/zero > delme;cat
> /dev/zero > delme;cat /dev/zero > delme;cat /dev/zero > delme;cat
> /dev/zero > delme
> 
> (Machine was running out of disk space.)
> 
> (I alternate between cfq and bfq).
> 
> Benchmark. I chose git describe because it is part of kernel build
> sometimes .. and I actually wait for that.
> 
> pavel@duo:/data/l/linux-good$ time git describe
> warning: refname 'HEAD' is ambiguous.
> v3.15-rc8-144-g405dedd
> 
> Unfortunately, results are not too good for BFQ. (Can you replicate
> the results?)
> 
> # BFQ
> 10.24user 1.62system 467.02 (7m47.028s) elapsed 2.54%CPU
> # CFQ
> 8.55user 1.26system 69.57 (1m9.577s) elapsed 14.11%CPU
> # BFQ
> 11.70user 3.18system 1491.59 (24m51.599s) elapsed 0.99%CPU
> # CFQ, no background load
> 8.51user 0.75system 30.99 (0m30.994s) elapsed 29.91%CPU
> # CFQ
> 8.70user 1.36system 74.72 (1m14.720s) elapsed 13.48%CPU
> 

Definitely bad, we are about to repeat the test …

Thanks,
Paolo

> 									Pavel
> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 10/12] block, bfq: add Early Queue Merge (EQM)
  2014-06-03 16:28                       ` Tejun Heo
@ 2014-06-04 11:47                           ` Paolo Valente
  -1 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-04 11:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Mauro Andreolini,
	Fabio Checconi, Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA


Il giorno 03/giu/2014, alle ore 18:28, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> ha scritto:

> Hello,
> 
> On Mon, Jun 02, 2014 at 11:46:45AM +0200, Paolo Valente wrote:
>>> I don't really follow the last part.  So, the difference is that
>>> cooperating queue setup also takes place during bio merge too, right?
>> 
>> Not only, in bfq an actual queue merge is performed in the bio-merge hook.
> 
> I think I'm a bit confused because it's named "early" queue merge
> while it actually moves queue merging later than cfq - set_request()
> happens before bio/rq merging.


There is probably something I am missing here, because, as can be seen in blk-core.c,
around line 1495, elv_set_request() is invoked in the context of the get_request() function,
which is in turn called from blk_queue_bio() *after* attempting both a plug merge
and a merge with one of the requests in the block layer's cache. The first
attempt is lockless and doesn't involve the I/O scheduler, but the
second attempt includes invoking the allow_merge_fn hook of the scheduler
(elv_merge() -> elv_rq_merge_ok() -> elv_iosched_allow_merge()).

Furthermore, as far as I know, it is true that CFQ actually merges queues in the
set_request hook, but a cooperator is searched for a queue (and, if it is found,
the two queues are scheduled to merge) only when the queue expires after being
served (see cfq_select_queue() and the two functions cfq_close_cooperator() and
cfq_setup_merge() that it invokes). If a cooperator is found, it is forcibly
served; however, the actual merge of the two queues happens at the next
set_request (cfq_merge_bfqqs()).

In contrast, BFQ both searches for a cooperator and merges the queue with a
newly-found cooperator in the allow_merge hook, which is "earlier" with respect
to CFQ, as it doesn't need to wait for a queue to be served and expire, and for its
associated process to issue new I/O. Hence the name Early Queue Merge.
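
To make the above description more concrete, here is a self-contained toy model of the decision taken in the allow_merge path (the data structures, names and closeness threshold are simplifications of mine, not the actual bfq code): when a new bio arrives, check whether it is close on disk to the head request of the queue currently being served and, if so, merge the two queues immediately, instead of waiting for the in-service queue to expire as CFQ does.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define CLOSE_SECTORS 8192      /* illustrative closeness threshold */

struct io_queue {
        const char *name;
        long long head_sector;  /* sector of the queue's head request */
        struct io_queue *merged_into;
};

static bool close_to(long long a, long long b)
{
        return llabs(a - b) < CLOSE_SECTORS;
}

/* called from the allow_merge path, i.e. before any set_request */
static void early_queue_merge(struct io_queue *q,
                              struct io_queue *in_service,
                              long long bio_sector)
{
        if (q == in_service || q->merged_into)
                return;
        /*
         * The head request of the in-service queue may change between
         * two attempts, so the check is repeated on every incoming bio.
         */
        if (close_to(bio_sector, in_service->head_sector)) {
                q->merged_into = in_service;
                printf("merging %s into %s\n", q->name, in_service->name);
        }
}

int main(void)
{
        struct io_queue qa = { "qA", 1000000, NULL };
        struct io_queue qb = { "qB",  500000, NULL };

        /* first bio of qB is far from qA's head: no merge yet       */
        early_queue_merge(&qb, &qa, 500008);
        /* later bio of qB lands next to qA's current head: merge    */
        early_queue_merge(&qb, &qa, 1000004);
        return 0;
}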


> So, what it tries to do is
> compensating for the lack of cfq_rq_close() preemption at request
> issue time, right?
> 

Yes, thanks to early merging, there is then no need to recover a lost sequential
pattern through preemptions.

>>> cfq does it once when allocating the request.  That seems a lot more
>>> reasonable to me.  It's doing that once for one start sector.  I mean,
>>> plugging is usually extremely short compared to actual IO service
>>> time.  It's there to mask the latencies between bio issues that the
>>> same CPU is doing.  I can't see how this earliness can be actually
>>> useful.  Do you have results to back this one up?  Or is this just
>>> born out of thin air?
>> 
>> Arianna added the early-queue-merge part in the allow_merge_fn hook
>> about one year ago, as a consequence of a throughput loss of about
>> 30% with KVM/QEMU workloads. In particular, we ran most of the tests
>> on a WDC WD60000HLHX-0 Velociraptor. That HDD might not be available
>> for testing any more, but we can reproduce our results for you on
>> other HDDs, with and without early queue merge. And, maybe through
>> traces, we can show you that the reason for the throughput loss is
>> exactly that described (in a wordy way) in this patch. Of course
>> unless we have missed something.
> 
> Oh, as long as it makes measureable difference, I have no objection;
> however, I do think more explanation and comments would be nice.  I
> still can't quite understand why retrying on each merge attempt would
> make so much difference.  Maybe I just failed to understand what you
> wrote in the commit message.

If we remember correctly, one of the problems was exactly that a different request
may become the head request of the in-service queue between two rq merge
attempts. If we do not retry on every attempt, we lose the chance
to merge the queue at hand with the in-service queue. The two queues may
then diverge, and hence have no other opportunity to be merged.

> Is it because the cooperating tasks
> issue IOs which grow large and close enough after merges but not on
> the first bio issuance?  If so, why isn't doing it on rq merge time
> enough?  Is the timing sensitive enough for certain workloads that
> waiting till unplug time misses the opportunity?  But plugging should
> be relatively short compared to the time actual IOs take, so why would
> it be that sensitive?  What am I missing here?

The problem is not the duration of the plugging, but the fact that, if a request merge
succeeds for a bio, then there will be no set_request invocation for that bio.
Therefore, without early merging, there will be no queue merge at all.
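
In other words, the submission path looks roughly like the following user-space model (the function names only loosely mirror the real block-layer ones, and the merge outcomes are faked parameters): a bio that gets merged, either in the plug or through the allow_merge hook, returns before set_request is ever reached.

#include <stdbool.h>
#include <stdio.h>

/* pretend outcomes of the two merge attempts for a given bio */
struct bio_model {
        long long sector;
        bool plug_merges;       /* lockless plug-merge succeeds        */
        bool elv_merges;        /* allow_merge-based rq merge succeeds */
};

static void set_request(const struct bio_model *bio)
{
        printf("set_request for sector %lld\n", bio->sector);
}

static void submit_bio_model(const struct bio_model *bio)
{
        if (bio->plug_merges)
                return;         /* merged without the scheduler        */
        /* here the scheduler's allow_merge hook runs (where EQM lives) */
        if (bio->elv_merges)
                return;         /* merged: set_request is skipped      */
        set_request(bio);       /* only a brand-new request gets here  */
}

int main(void)
{
        struct bio_model merged = { 4096, false, true  };
        struct bio_model fresh  = { 8192, false, false };

        submit_bio_model(&merged);      /* prints nothing              */
        submit_bio_model(&fresh);       /* prints the set_request line */
        return 0;
}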

If my replies are correct and convince you, then I will use them to integrate and
hopefully improve the documentation for this patch.

Paolo

> 
> Thanks.
> 
> -- 
> tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: BFQ speed tests [was Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler]
  2014-06-04 10:24                                     ` Paolo Valente
  (?)
@ 2014-06-04 11:59                                         ` Takashi Iwai
  -1 siblings, 0 replies; 247+ messages in thread
From: Takashi Iwai @ 2014-06-04 11:59 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Pavel Machek, Arianna Avanzini, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

At Wed, 4 Jun 2014 12:24:30 +0200,
Paolo Valente wrote:
> 
> 
> Il giorno 04/giu/2014, alle ore 12:03, Pavel Machek <pavel@ucw.cz> ha scritto:
> 
> > Hi!
> > 
> >> Should this attempt be useless as well, I will, if you do not mind, try by asking you more details about your system and reproducing your configuration as much as I can.
> >> 
> > 
> > Try making BFQ the default scheduler. That seems to break it for me,
> > when selected at runtime, it looks stable.
> > 
> > Anyway, here are some speed tests. Background load:
> > 
> > root@duo:/data/tmp# echo cfq > /sys/block/sda/queue/scheduler 
> > root@duo:/data/tmp# echo 3 > /proc/sys/vm/drop_caches
> > root@duo:/data/tmp# cat /dev/zero > delme; cat /dev/zero > delme;cat
> > /dev/zero > delme;cat /dev/zero > delme;cat /dev/zero > delme;cat
> > /dev/zero > delme
> > 
> > (Machine was running out of disk space.)
> > 
> > (I alternate between cfq and bfq).
> > 
> > Benchmark. I chose git describe because it is part of kernel build
> > sometimes .. and I actually wait for that.
> > 
> > pavel@duo:/data/l/linux-good$ time git describe
> > warning: refname 'HEAD' is ambiguous.
> > v3.15-rc8-144-g405dedd
> > 
> > Unfortunately, results are not too good for BFQ. (Can you replicate
> > the results?)
> > 
> > # BFQ
> > 10.24user 1.62system 467.02 (7m47.028s) elapsed 2.54%CPU
> > # CFQ
> > 8.55user 1.26system 69.57 (1m9.577s) elapsed 14.11%CPU
> > # BFQ
> > 11.70user 3.18system 1491.59 (24m51.599s) elapsed 0.99%CPU
> > # CFQ, no background load
> > 8.51user 0.75system 30.99 (0m30.994s) elapsed 29.91%CPU
> > # CFQ
> > 8.70user 1.36system 74.72 (1m14.720s) elapsed 13.48%CPU
> > 
> 
> Definitely bad, we are about to repeat the test …

I've been using BFQ for a while and have also noticed some obvious
regressions in some operations, notably with git.
For example, git grep regresses badly.

I ran "test git grep foo > /dev/null" on linux kernel repos on both
rotational disk and SSD.

Rotational disk:
  CFQ:
    2.32user 3.48system 1:46.97elapsed 5%CPU
    2.33user 3.41system 1:48.30elapsed 5%CPU
    2.30user 3.54system 1:48.01elapsed 5%CPU

  BFQ:
    2.41user 3.22system 2:51.96elapsed 3%CPU
    2.40user 3.19system 2:50.35elapsed 3%CPU
    2.43user 3.11system 2:46.49elapsed 3%CPU

SSD:
  CFQ:
    2.37user 3.18system 0:04.70elapsed 118%CPU
    2.28user 3.26system 0:04.69elapsed 118%CPU
    2.21user 3.33system 0:04.69elapsed 118%CPU

  BFQ:
    2.35user 2.82system 1:07.85elapsed 7%CPU
    2.32user 2.90system 0:57.57elapsed 9%CPU
    2.39user 2.90system 0:55.03elapsed 9%CPU

It's without background task.

BFQ seems to behave badly when reading many small files.
When I ran "git grep foo HEAD", i.e. grepping the packed
repository, the results of BFQ and CFQ become almost the same, as
expected:

SSD:
  CFQ:
    7.25user 0.47system 0:09.79elapsed 78%CPU
    7.26user 0.43system 0:09.75elapsed 78%CPU
    7.26user 0.43system 0:09.76elapsed 78%CPU

  BFQ:
    7.24user 0.45system 0:09.93elapsed 77%CPU
    7.31user 0.42system 0:09.90elapsed 78%CPU
    7.28user 0.42system 0:09.86elapsed 78%CPU


thanks,

Takashi

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: BFQ speed tests [was Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler]
       [not found]                                         ` <s5hsink3mxk.wl%tiwai-l3A5Bk7waGM@public.gmane.org>
@ 2014-06-04 12:12                                           ` Paolo Valente
  2014-06-11 20:45                                             ` Paolo Valente
  1 sibling, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-04 12:12 UTC (permalink / raw)
  To: Takashi Iwai
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Pavel Machek, Arianna Avanzini, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA


Il giorno 04/giu/2014, alle ore 13:59, Takashi Iwai <tiwai-l3A5Bk7waGM@public.gmane.org> ha scritto:

> At Wed, 4 Jun 2014 12:24:30 +0200,
> Paolo Valente wrote:
>> 
>> 
>> Il giorno 04/giu/2014, alle ore 12:03, Pavel Machek <pavel-+ZI9xUNit7I@public.gmane.org> ha scritto:
>> 
>>> Hi!
>>> 
>>>> Should this attempt be useless as well, I will, if you do not mind, try by asking you more details about your system and reproducing your configuration as much as I can.
>>>> 
>>> 
>>> Try making BFQ the default scheduler. That seems to break it for me,
>>> when selected at runtime, it looks stable.
>>> 
>>> Anyway, here are some speed tests. Background load:
>>> 
>>> root@duo:/data/tmp# echo cfq > /sys/block/sda/queue/scheduler 
>>> root@duo:/data/tmp# echo 3 > /proc/sys/vm/drop_caches
>>> root@duo:/data/tmp# cat /dev/zero > delme; cat /dev/zero > delme;cat
>>> /dev/zero > delme;cat /dev/zero > delme;cat /dev/zero > delme;cat
>>> /dev/zero > delme
>>> 
>>> (Machine was running out of disk space.)
>>> 
>>> (I alternate between cfq and bfq).
>>> 
>>> Benchmark. I chose git describe because it is part of kernel build
>>> sometimes .. and I actually wait for that.
>>> 
>>> pavel@duo:/data/l/linux-good$ time git describe
>>> warning: refname 'HEAD' is ambiguous.
>>> v3.15-rc8-144-g405dedd
>>> 
>>> Unfortunately, results are not too good for BFQ. (Can you replicate
>>> the results?)
>>> 
>>> # BFQ
>>> 10.24user 1.62system 467.02 (7m47.028s) elapsed 2.54%CPU
>>> # CFQ
>>> 8.55user 1.26system 69.57 (1m9.577s) elapsed 14.11%CPU
>>> # BFQ
>>> 11.70user 3.18system 1491.59 (24m51.599s) elapsed 0.99%CPU
>>> # CFQ, no background load
>>> 8.51user 0.75system 30.99 (0m30.994s) elapsed 29.91%CPU
>>> # CFQ
>>> 8.70user 1.36system 74.72 (1m14.720s) elapsed 13.48%CPU
>>> 
>> 
>> Definitely bad, we are about to repeat the test …
> 
> I've been using BFQ for a while and noticed also some obvious
> regression in some operations, notably git, too.
> For example, git grep regresses badly.
> 
> I ran "test git grep foo > /dev/null" on linux kernel repos on both
> rotational disk and SSD.
> 
> Rotational disk:
>  CFQ:
>    2.32user 3.48system 1:46.97elapsed 5%CPU
>    2.33user 3.41system 1:48.30elapsed 5%CPU
>    2.30user 3.54system 1:48.01elapsed 5%CPU
> 
>  BFQ:
>    2.41user 3.22system 2:51.96elapsed 3%CPU
>    2.40user 3.19system 2:50.35elapsed 3%CPU
>    2.43user 3.11system 2:46.49elapsed 3%CPU
> 
> SSD:
>  CFQ:
>    2.37user 3.18system 0:04.70elapsed 118%CPU
>    2.28user 3.26system 0:04.69elapsed 118%CPU
>    2.21user 3.33system 0:04.69elapsed 118%CPU
> 
>  BFQ:
>    2.35user 2.82system 1:07.85elapsed 7%CPU
>    2.32user 2.90system 0:57.57elapsed 9%CPU
>    2.39user 2.90system 0:55.03elapsed 9%CPU
> 
> It's without background task.
> 
> BFQ seems behaving bad when reading many small files.

We ran this type of test (plus checkout, merge and compilation) a long time ago, and the performance was about the same as or better than with CFQ. Unfortunately, we have not repeated these tests since then.

We are already trying to understand what is going wrong.

Thanks,
Paolo

> When I ran "git grep foo HEAD", i.e. performing to the packaged
> repository, the results of both BFQ and CFQ become almost same, as
> expected:
> 
> SSD:
>  CFQ:
>    7.25user 0.47system 0:09.79elapsed 78%CPU
>    7.26user 0.43system 0:09.75elapsed 78%CPU
>    7.26user 0.43system 0:09.76elapsed 78%CPU
> 
>  BFQ:
>    7.24user 0.45system 0:09.93elapsed 77%CPU
>    7.31user 0.42system 0:09.90elapsed 78%CPU
>    7.28user 0.42system 0:09.86elapsed 78%CPU
> 
> 
> thanks,
> 
> Takashi


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: BFQ speed tests [was Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler]
       [not found]                                         ` <s5hsink3mxk.wl%tiwai-l3A5Bk7waGM@public.gmane.org>
@ 2014-06-04 12:12                                           ` Paolo Valente
  2014-06-11 20:45                                             ` Paolo Valente
  1 sibling, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-04 12:12 UTC (permalink / raw)
  To: Takashi Iwai
  Cc: Pavel Machek, Tejun Heo, Jens Axboe, Li Zefan, Fabio Checconi,
	Arianna Avanzini, linux-kernel, containers, cgroups


Il giorno 04/giu/2014, alle ore 13:59, Takashi Iwai <tiwai@suse.de> ha scritto:

> At Wed, 4 Jun 2014 12:24:30 +0200,
> Paolo Valente wrote:
>> 
>> 
>> Il giorno 04/giu/2014, alle ore 12:03, Pavel Machek <pavel@ucw.cz> ha scritto:
>> 
>>> Hi!
>>> 
>>>> Should this attempt be useless as well, I will, if you do not mind, try by asking you more details about your system and reproducing your configuration as much as I can.
>>>> 
>>> 
>>> Try making BFQ the default scheduler. That seems to break it for me,
>>> when selected at runtime, it looks stable.
>>> 
>>> Anyway, here are some speed tests. Background load:
>>> 
>>> root@duo:/data/tmp# echo cfq > /sys/block/sda/queue/scheduler 
>>> root@duo:/data/tmp# echo 3 > /proc/sys/vm/drop_caches
>>> root@duo:/data/tmp# cat /dev/zero > delme; cat /dev/zero > delme;cat
>>> /dev/zero > delme;cat /dev/zero > delme;cat /dev/zero > delme;cat
>>> /dev/zero > delme
>>> 
>>> (Machine was running out of disk space.)
>>> 
>>> (I alternate between cfq and bfq).
>>> 
>>> Benchmark. I chose git describe because it is part of kernel build
>>> sometimes .. and I actually wait for that.
>>> 
>>> pavel@duo:/data/l/linux-good$ time git describe
>>> warning: refname 'HEAD' is ambiguous.
>>> v3.15-rc8-144-g405dedd
>>> 
>>> Unfortunately, results are not too good for BFQ. (Can you replicate
>>> the results?)
>>> 
>>> # BFQ
>>> 10.24user 1.62system 467.02 (7m47.028s) elapsed 2.54%CPU
>>> # CFQ
>>> 8.55user 1.26system 69.57 (1m9.577s) elapsed 14.11%CPU
>>> # BFQ
>>> 11.70user 3.18system 1491.59 (24m51.599s) elapsed 0.99%CPU
>>> # CFQ, no background load
>>> 8.51user 0.75system 30.99 (0m30.994s) elapsed 29.91%CPU
>>> # CFQ
>>> 8.70user 1.36system 74.72 (1m14.720s) elapsed 13.48%CPU
>>> 
>> 
>> Definitely bad, we are about to repeat the test …
> 
> I've been using BFQ for a while and noticed also some obvious
> regression in some operations, notably git, too.
> For example, git grep regresses badly.
> 
> I ran "test git grep foo > /dev/null" on linux kernel repos on both
> rotational disk and SSD.
> 
> Rotational disk:
>  CFQ:
>    2.32user 3.48system 1:46.97elapsed 5%CPU
>    2.33user 3.41system 1:48.30elapsed 5%CPU
>    2.30user 3.54system 1:48.01elapsed 5%CPU
> 
>  BFQ:
>    2.41user 3.22system 2:51.96elapsed 3%CPU
>    2.40user 3.19system 2:50.35elapsed 3%CPU
>    2.43user 3.11system 2:46.49elapsed 3%CPU
> 
> SSD:
>  CFQ:
>    2.37user 3.18system 0:04.70elapsed 118%CPU
>    2.28user 3.26system 0:04.69elapsed 118%CPU
>    2.21user 3.33system 0:04.69elapsed 118%CPU
> 
>  BFQ:
>    2.35user 2.82system 1:07.85elapsed 7%CPU
>    2.32user 2.90system 0:57.57elapsed 9%CPU
>    2.39user 2.90system 0:55.03elapsed 9%CPU
> 
> It's without background task.
> 
> BFQ seems behaving bad when reading many small files.

We ran this type of tests (plus checkout, merge and compilation) a long ago, and the performance was about the same as or better than with CFQ. Unfortunately, we have not repeated also these tests anymore since then.

We are already trying to understand what is going wrong.

Thanks,
Paolo

> When I ran "git grep foo HEAD", i.e. performing to the packaged
> repository, the results of both BFQ and CFQ become almost same, as
> expected:
> 
> SSD:
>  CFQ:
>    7.25user 0.47system 0:09.79elapsed 78%CPU
>    7.26user 0.43system 0:09.75elapsed 78%CPU
>    7.26user 0.43system 0:09.76elapsed 78%CPU
> 
>  BFQ:
>    7.24user 0.45system 0:09.93elapsed 77%CPU
>    7.31user 0.42system 0:09.90elapsed 78%CPU
>    7.28user 0.42system 0:09.86elapsed 78%CPU
> 
> 
> thanks,
> 
> Takashi


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: BFQ speed tests [was Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler]
@ 2014-06-04 12:12                                           ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-04 12:12 UTC (permalink / raw)
  To: Takashi Iwai
  Cc: Pavel Machek, Tejun Heo, Jens Axboe, Li Zefan, Fabio Checconi,
	Arianna Avanzini, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA


Il giorno 04/giu/2014, alle ore 13:59, Takashi Iwai <tiwai-l3A5Bk7waGM@public.gmane.org> ha scritto:

> At Wed, 4 Jun 2014 12:24:30 +0200,
> Paolo Valente wrote:
>> 
>> 
>> Il giorno 04/giu/2014, alle ore 12:03, Pavel Machek <pavel-+ZI9xUNit7I@public.gmane.org> ha scritto:
>> 
>>> Hi!
>>> 
>>>> Should this attempt be useless as well, I will, if you do not mind, try by asking you more details about your system and reproducing your configuration as much as I can.
>>>> 
>>> 
>>> Try making BFQ the default scheduler. That seems to break it for me,
>>> when selected at runtime, it looks stable.
>>> 
>>> Anyway, here are some speed tests. Background load:
>>> 
>>> root@duo:/data/tmp# echo cfq > /sys/block/sda/queue/scheduler 
>>> root@duo:/data/tmp# echo 3 > /proc/sys/vm/drop_caches
>>> root@duo:/data/tmp# cat /dev/zero > delme; cat /dev/zero > delme;cat
>>> /dev/zero > delme;cat /dev/zero > delme;cat /dev/zero > delme;cat
>>> /dev/zero > delme
>>> 
>>> (Machine was running out of disk space.)
>>> 
>>> (I alternate between cfq and bfq).
>>> 
>>> Benchmark. I chose git describe because it is part of kernel build
>>> sometimes .. and I actually wait for that.
>>> 
>>> pavel@duo:/data/l/linux-good$ time git describe
>>> warning: refname 'HEAD' is ambiguous.
>>> v3.15-rc8-144-g405dedd
>>> 
>>> Unfortunately, results are not too good for BFQ. (Can you replicate
>>> the results?)
>>> 
>>> # BFQ
>>> 10.24user 1.62system 467.02 (7m47.028s) elapsed 2.54%CPU
>>> # CFQ
>>> 8.55user 1.26system 69.57 (1m9.577s) elapsed 14.11%CPU
>>> # BFQ
>>> 11.70user 3.18system 1491.59 (24m51.599s) elapsed 0.99%CPU
>>> # CFQ, no background load
>>> 8.51user 0.75system 30.99 (0m30.994s) elapsed 29.91%CPU
>>> # CFQ
>>> 8.70user 1.36system 74.72 (1m14.720s) elapsed 13.48%CPU
>>> 
>> 
>> Definitely bad, we are about to repeat the test …
> 
> I've been using BFQ for a while and noticed also some obvious
> regression in some operations, notably git, too.
> For example, git grep regresses badly.
> 
> I ran "test git grep foo > /dev/null" on linux kernel repos on both
> rotational disk and SSD.
> 
> Rotational disk:
>  CFQ:
>    2.32user 3.48system 1:46.97elapsed 5%CPU
>    2.33user 3.41system 1:48.30elapsed 5%CPU
>    2.30user 3.54system 1:48.01elapsed 5%CPU
> 
>  BFQ:
>    2.41user 3.22system 2:51.96elapsed 3%CPU
>    2.40user 3.19system 2:50.35elapsed 3%CPU
>    2.43user 3.11system 2:46.49elapsed 3%CPU
> 
> SSD:
>  CFQ:
>    2.37user 3.18system 0:04.70elapsed 118%CPU
>    2.28user 3.26system 0:04.69elapsed 118%CPU
>    2.21user 3.33system 0:04.69elapsed 118%CPU
> 
>  BFQ:
>    2.35user 2.82system 1:07.85elapsed 7%CPU
>    2.32user 2.90system 0:57.57elapsed 9%CPU
>    2.39user 2.90system 0:55.03elapsed 9%CPU
> 
> It's without background task.
> 
> BFQ seems to behave badly when reading many small files.

We ran this type of test (plus checkout, merge and compilation) a long time ago, and the performance was about the same as, or better than, with CFQ. Unfortunately, we have not repeated these tests since then.

We are already trying to understand what is going wrong.

Thanks,
Paolo

> When I ran "git grep foo HEAD", i.e. searching the packed
> repository, the results of BFQ and CFQ become almost the same, as
> expected:
> 
> SSD:
>  CFQ:
>    7.25user 0.47system 0:09.79elapsed 78%CPU
>    7.26user 0.43system 0:09.75elapsed 78%CPU
>    7.26user 0.43system 0:09.76elapsed 78%CPU
> 
>  BFQ:
>    7.24user 0.45system 0:09.93elapsed 77%CPU
>    7.31user 0.42system 0:09.90elapsed 78%CPU
>    7.28user 0.42system 0:09.86elapsed 78%CPU
> 
> 
> thanks,
> 
> Takashi


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 10/12] block, bfq: add Early Queue Merge (EQM)
       [not found]                           ` <91383F1F-69C3-4B88-B51E-30204818F1AB-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-06-04 13:04                             ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-04 13:04 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, containers, linux-kernel, Mauro Andreolini,
	Fabio Checconi, Arianna Avanzini, cgroups

Hello,

On Wed, Jun 04, 2014 at 01:47:36PM +0200, Paolo Valente wrote:
> > I think I'm a bit confused because it's named "early" queue merge
> > while it actually moves queue merging later than cfq - set_request()
> > happens before bio/rq merging.
> 
> 
> There is probably something I am missing here, because, as can be seen in blk-core.c,
> around line 1495, elv_set_request() is invoked in the context of the get_request() function,
> which in turn is called from blk_queue_bio() *after* attempting both a plug merge
> and a merge with one of the requests in the block layer's cache. The first
> attempt is lockless and doesn't involve the I/O scheduler, but the
> second attempt includes invoking the allow_merge_fn hook of the scheduler
> (elv_merge() -> elv_rq_merge_ok() -> elv_iosched_allow_merge()).

Ah, you're right, set_request doesn't happen if a bio is merged into
an existing request.
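
For reference, here is a minimal self-contained sketch of the submission path
being described. Every type and helper name below is a hypothetical stand-in
invented for illustration; only the ordering (plug merge -> elevator merge via
allow_merge -> get_request() -> set_request) reflects the blk-core.c flow
quoted above.

#include <stdbool.h>

struct bio;                     /* opaque stand-ins for the kernel structures */
struct request;
struct request_queue;

enum merge_result { NO_MERGE, BACK_MERGE, FRONT_MERGE };

/* Hypothetical helpers standing in for blk_attempt_plug_merge(), elv_merge()
 * (which ends up calling the elevator's allow_merge hook), get_request() and
 * request insertion. */
bool attempt_plug_merge(struct request_queue *q, struct bio *bio);
enum merge_result elevator_try_merge(struct request_queue *q,
                                     struct request **rq, struct bio *bio);
void merge_bio_into_request(struct request_queue *q, struct request *rq,
                            struct bio *bio);
struct request *alloc_request_calling_set_request(struct request_queue *q,
                                                  struct bio *bio);
void add_request(struct request_queue *q, struct request *rq);

void queue_bio_sketch(struct request_queue *q, struct bio *bio)
{
        struct request *rq;

        /* 1) lockless plug-list merge: the I/O scheduler is not involved */
        if (attempt_plug_merge(q, bio))
                return;         /* merged: set_request never runs for this bio */

        /* 2) merge with an already queued request: the elevator's allow_merge
         *    hook runs here, which is where EQM gets its chance to run */
        if (elevator_try_merge(q, &rq, bio) != NO_MERGE) {
                merge_bio_into_request(q, rq, bio);
                return;         /* merged: again, no set_request for this bio */
        }

        /* 3) no merge succeeded: only now is a new request allocated, and only
         *    on this path is the elevator's set_request hook invoked */
        rq = alloc_request_calling_set_request(q, bio);
        add_request(q, rq);
}

With this ordering, a cooperating-queue check hooked only into set_request
never sees the bios that get merged in steps 1 and 2, which is exactly the
gap that the extra attempts from allow_merge are meant to close.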

> > Oh, as long as it makes a measurable difference, I have no objection;
> > however, I do think more explanation and comments would be nice.  I
> > still can't quite understand why retrying on each merge attempt would
> > make so much difference.  Maybe I just failed to understand what you
> > wrote in the commit message.
> 
> If we remember correctly, one of the problems was exactly that a different request
> may become the head request of the in-service queue between two rq merge
> attempts. If we do not retry on every attempt, we lose the chance
> to merge the queue at hand with the in-service queue. The two queues may
> then diverge, and hence have no other opportunity to be merged.
> 
> > Is it because the cooperating tasks
> > issue IOs which grow large and close enough after merges but not on
> > the first bio issuance?  If so, why isn't doing it on rq merge time
> > enough?  Is the timing sensitive enough for certain workloads that
> > waiting till unplug time misses the opportunity?  But plugging should
> > be relatively short compared to the time actual IOs take, so why would
> > it be that sensitive?  What am I missing here?
> 
> The problem is not the duration of the plugging, but the fact that, if a request merge
> succeeds for a bio, then there will be no set_request invocation for that bio.
> Therefore, without early merging, there will be no queue merge at all.
> 
> If my replies are correct and convincing, then I will use them to extend and
> hopefully improve the documentation for this patch.

Ah, okay, so it's about missing the chance to look for cooperating
queues when a merge succeeds.  Yeah, that makes a lot more sense to me.
If that's the case, wouldn't it be better to try finding cooperating
queues after each successful merge rather than on each allow_merge()
invocation?  And let's please document what we're catching with the
extra attempts.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
       [not found]                           ` <03CDD106-DB18-4E8F-B3D6-2AAD45782A06-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-06-04 13:56                             ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-04 13:56 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups

Hello, Paolo.

On Wed, Jun 04, 2014 at 09:29:20AM +0200, Paolo Valente wrote:
> > Shouldn't the comparison be against the benefit of "not idling
> > selectively" vs "always idling" when blkcg is in use?
> > 
> 
> Exactly. I’m sorry if I wrote things/sentences that did not make this
> point clear. Maybe this lack of clarity is a further consequence
> of the annoying “not not” scheme adopted in the code and in the
> comments.

Ah, no, it was just me misreading the message.

> > I'm not really convinced about the approach.  With rotating disks, we
> > know that allowing queue depth > 1 generally lowers both throughput and
> > responsiveness and brings benefits in quite restricted cases.  It
> > seems rather backwards to always allow QD > 1 and then try to optimize
> > in an attempt to recover what's lost.  Wouldn't it make far more sense
> > to actively maintain QD == 1 by default and allow QD > 1 in specific
> > cases where it can be determined to be more beneficial than harmful?
> 
> Although QD == 1 is not explicitly denoted as the default, what you suggest is exactly what bfq does.

I see.

> >> I do not know how widespread a mechanism like ulatencyd is
> >> precisely, but in the symmetric scenario it creates, the throughput
> >> on, e.g., an HDD would drop by half if the workload is mostly random
> >> and we removed the more complex mechanism we set up.  Wouldn't this
> >> be bad?
> > 
> > It looks like a lot of complexity for optimization for a very
> > specific, likely unreliable (in terms of its triggering condition),
> > use case.  The triggering condition is just too specific.
> 
> Actually we have been asked several times to improve random-I/O
> performance on HDDs over the last years, by people recording, for
> the typical tasks performed by their machines, much lower throughput
> than with the other schedulers. Major problems have been reported
> for server workloads (database, web), and for btrfs. According to
> the feedback received after introducing this optimization in bfq,
> those problems seem to be finally gone.

I see.  The equal priority part can probably work in enough cases to
be meaningful given that it just depends on the busy queues having the
same weight instead of everything in the system.  It'd be nice to note
that in the comment tho.

I'm still quite skeptical about the cgroup part tho.  The triggering
condition is too specific and fragile.  If I'm reading the bfq blkcg
implementation correctly, it seems to be applying the scheduling
algorithm recursively walking down the tree one level at a time.  cfq
does it differently.  cfq flattens the hierarchy by calculating the
nested weight of each active leaf queue and scheduling all of them from
the same service tree.  IOW, the scheduling algorithm per se doesn't care
about the hierarchy.  All it sees are differing weights competing
equally regardless of the hierarchical structure.

If the same strategy can be applied to bfq, possibly the check of
whether all the active queues have the same weight can be used
regardless of blkcg?  That'd be simpler and a lot more robust.
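
A minimal sketch of that kind of check, with hypothetical structure and
field names that only illustrate the idea of tracking whether a single
weight class is currently busy (this is not bfq's actual code):

#include <stdbool.h>

/* Hypothetical per-device state, updated when queues become busy or idle. */
struct sched_state {
        unsigned int busy_queues;           /* queues with pending requests       */
        unsigned int distinct_busy_weights; /* number of different busy weights   */
        unsigned int wr_busy_queues;        /* weight-raised (low-latency) queues */
};

/* The scenario is "symmetric" when every busy queue has the same weight and
 * none of them is currently weight-raised. */
static bool scenario_is_symmetric(const struct sched_state *s)
{
        return s->distinct_busy_weights <= 1 && s->wr_busy_queues == 0;
}

/* Idling would then be given up only when doing so cannot skew the service
 * shares, independently of whether the queues belong to different cgroups. */
static bool may_skip_device_idling(const struct sched_state *s,
                                   bool device_handles_queueing_well)
{
        return device_handles_queueing_well && scenario_is_symmetric(s);
}

The check itself is cheap; the real cost is keeping distinct_busy_weights up
to date whenever a queue is added to or removed from the busy set.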

Another thing I'm curious about: the logic that you're using to
disable idling assumes that the disk will serve the queued commands
more or less fairly over time, right?  If so, why does it matter that
the queues have the same weight?  Shouldn't the bandwidth scheduling
eventually make them converge to the specified weights over time?
Isn't the wr_coeff > 1 test enough for maintaining reasonable
responsiveness?

> Besides, turning back to bfq, if its low-latency heuristics are
> disabled for non-rotational devices, then, according to our results
> with slower devices, such as SD Cards and eMMCs, latency easily
> becomes unbearable, with no throughput gain.

Hmmm... looking at the nonrot optimizations again: yeah, assuming
the weight counting is necessary for NCQ HDDs, the part specific to
SSDs isn't that big.  You probably want to sequence it the other way
around, though.  This really should be primarily about disks at this
point.

The thing which still makes me cringe is how it scatters
blk_queue_nonrot() tests across multiple places without a clear
explanation of what's going on.  A bfqq being constantly seeky or not
doesn't have much to do with whether the device is rotational or not.
Its effect does, and I don't think avoiding the overhead of keeping the
counters is meaningful.  Things like this make the code a lot harder
to maintain in the long term, as the code ends up organized according to
seemingly arbitrary optimizations rather than semantic structure.  So,
let's please keep the accounting and optimization separate.
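
As an illustration of that separation, with made-up names rather than the
actual bfq code: the seekiness accounting stays unconditional, and the
device type is consulted only at the single place where the policy decision
is made.

#include <stdbool.h>

struct queue_stats {
        unsigned int seeky_samples;   /* requests classified as seeky */
        unsigned int total_samples;   /* all observed requests        */
};

/* Accounting: always update the counters, regardless of the device type. */
static void account_request(struct queue_stats *st, bool request_is_seeky)
{
        st->total_samples++;
        if (request_is_seeky)
                st->seeky_samples++;
}

/* Classification: derived purely from the accounted statistics. */
static bool queue_is_constantly_seeky(const struct queue_stats *st)
{
        /* arbitrary threshold, for illustration only */
        return st->total_samples > 0 &&
               4 * st->seeky_samples >= 3 * st->total_samples;
}

/* Optimization: whatever the policy is, the device type is checked only at
 * the decision point, so neither the accounting nor the classification
 * needs to know whether the device is rotational. */
static bool apply_seeky_queue_policy(const struct queue_stats *st,
                                     bool device_is_rotational)
{
        return device_is_rotational && queue_is_constantly_seeky(st);
}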

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
       [not found]                                         ` <20140602205713.GB8357-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2014-06-04 14:31                                           ` Christoph Hellwig
  0 siblings, 0 replies; 247+ messages in thread
From: Christoph Hellwig @ 2014-06-04 14:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Paolo Valente, Li Zefan, Fabio Checconi,
	Arianna Avanzini, Paolo Valente, linux-kernel, containers,
	cgroups

On Mon, Jun 02, 2014 at 02:57:30PM -0600, Jens Axboe wrote:
> It's not so much about it being more beneficial to run in blk-mq, as it
> is about not having two code paths. But yes, we're likely going to
> maintain that code for a long time, so it's not going anywhere anytime
> soon.
> 
> And for scsi-mq, it's already opt-in, though on a per-host basis. Doing
> finer granularity than that is going to be difficult, unless we let
> legacy-block and blk-mq share a tag map (though that would not be too
> hard).

I don't really think there's anything inherently counterproductive
about spinning rust (at least somewhat modern spinning rust and
infrastructure) in blk-mq.  I'd really like to get rid of the old
request layer in a reasonable amount of time, and for SCSI I'm very
reluctant to add more integration between the old and new code.  I'm
really planning on not maintaining the old request-based SCSI code
for a long time once we get positive reports in from users of various
kinds of older hardware.


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
@ 2014-06-04 14:50                                               ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-04 14:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Paolo Valente, Li Zefan, Fabio Checconi,
	Arianna Avanzini, Paolo Valente, linux-kernel, containers,
	cgroups

Hey, Christoph.

On Wed, Jun 04, 2014 at 07:31:36AM -0700, Christoph Hellwig wrote:
> I don't really think there's anything inherently counter productive
> to spinning rust (at least for somewhat modern spinning rust and
> infrastructure) in blk-mq.  I'd really like to get rid of the old
> request layer in a reasonable amount of time, and for SCSI I'm very
> reluctant to add more integration between the old and new code.  I'd
> really planning on not maintaining the old request based SCSI code
> for a long time once we get positive reports in from users of various
> kinds of older hardware.

Hmmm... the biggest thing is ioscheds.  They heavily rely on being
strongly synchronized and are pretty important for rotating rust.
Maybe they can be made to work with blk-mq by forcing a single queue or
something, but do we want that?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
@ 2014-06-04 14:53                                                   ` Christoph Hellwig
  0 siblings, 0 replies; 247+ messages in thread
From: Christoph Hellwig @ 2014-06-04 14:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, Jens Axboe, Paolo Valente, Li Zefan,
	Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups

On Wed, Jun 04, 2014 at 10:50:53AM -0400, Tejun Heo wrote:
> Hmmm... the biggest thing is ioscheds.  They heavily rely on being
> > strongly synchronized and are pretty important for rotating rust.
> > Maybe they can be made to work with blk-mq by forcing a single queue or
> > something, but do we want that?

Jens is planning to add an (optional) I/O scheduler to blk-mq, and
that is indeed required for proper disk support.  I don't think there
even is a need to limit it to a single queue technically, although
devices that support multiple queues are unlikely to need I/O
scheduling.


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
@ 2014-06-04 14:58                                                       ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-04 14:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Paolo Valente, Li Zefan, Fabio Checconi,
	Arianna Avanzini, Paolo Valente, linux-kernel, containers,
	cgroups

On Wed, Jun 04, 2014 at 07:53:30AM -0700, Christoph Hellwig wrote:
> On Wed, Jun 04, 2014 at 10:50:53AM -0400, Tejun Heo wrote:
> > Hmmm... the biggest thing is ioscheds.  They heavily rely on being
> > strongly synchronized and are pretty important for rotating rust.
> > Maybe they can be made to work with blk-mq by forcing a single queue or
> > something, but do we want that?
> 
> Jens is planning to add an (optional) I/O scheduler to blk-mq, and
> that is indeed required for proper disk support.  I don't think there
> even is a need to limit it to a single queue technically, although
> devices that support multiple queues are unlikely to need I/O
> scheduling.

I think what Jens is planning is something really minimal.  Things
like [cb]fq heavily depend on the old block infrastructure.  I don't
know.  Maybe they can be merged in time but I'm not quite sure we'd
have enough pressure to actually do that.  Host-granular switching
should be good enough, I guess.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
@ 2014-06-04 17:51                                                           ` Christoph Hellwig
  0 siblings, 0 replies; 247+ messages in thread
From: Christoph Hellwig @ 2014-06-04 17:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, Jens Axboe, Paolo Valente, Li Zefan,
	Fabio Checconi, Arianna Avanzini, Paolo Valente, linux-kernel,
	containers, cgroups

On Wed, Jun 04, 2014 at 10:58:29AM -0400, Tejun Heo wrote:
> I think what Jens is planning is something really minimal.  Things
> like [cb]fq heavily depend on the old block infrastructure.  I don't
> know.  Maybe they can be merged in time but I'm not quite sure we'd
> have enough pressure to actually do that.  Host-granular switching
> should be good enough, I guess.

Jens told me he wanted to do a deadline scheduler, which actually is
the most sensible choice for disks unless you want all the cgroup magic.

Given that people in this thread are interested in more complex
schedulers, I'd suggest they implement BFQ for blk-mq.


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
@ 2014-06-04 22:31                               ` Pavel Machek
  0 siblings, 0 replies; 247+ messages in thread
From: Pavel Machek @ 2014-06-04 22:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paolo Valente, Jens Axboe, Li Zefan, Fabio Checconi,
	Arianna Avanzini, linux-kernel, containers, cgroups

Hi!

On Mon 2014-06-02 13:33:32, Tejun Heo wrote:
> On Mon, Jun 02, 2014 at 01:14:33PM +0200, Pavel Machek wrote:
> > Now.. I see it is more work for storage maintainers, because there'll
> > be more code to maintain in the interim. But perhaps user advantages
> > are worth it?
> 
> I'm quite skeptical about going that route.  Not necessarily because
> of the extra amount of work but more the higher probability of getting
> into a situation where we can neither push forward nor back out.  It's
> difficult to define a clear deadline, and there will likely be unforeseen
> challenges in the planned convergence of the two schedulers;
> eventually, it isn't too unlikely that we end up in a situation where we have
> to admit defeat and just keep both schedulers.  Note that developer

Yes, that might happen. But it appears that the conditions that would
leave us stuck with both CFQ and BFQ are the same conditions that would
leave us stuck with CFQ alone.

And if BFQ is really better for interactivity under load, I'd really
like the option to use it, even if it leads to regressions under
batch loads (or something else)...

> overhead isn't the only factor here.  Providing two slightly different
> alternatives inevitably makes userland grow dependencies on subtleties
> of both and there's a lot less pressure to make judgement calls and

Dunno. It is just the scheduler. It makes stuff slower or faster, but
should not affect userland too badly.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
@ 2014-06-05  2:14                                   ` Jens Axboe
  0 siblings, 0 replies; 247+ messages in thread
From: Jens Axboe @ 2014-06-05  2:14 UTC (permalink / raw)
  To: Pavel Machek, Tejun Heo
  Cc: Paolo Valente, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups

On 2014-06-04 16:31, Pavel Machek wrote:
> Hi!
>
> On Mon 2014-06-02 13:33:32, Tejun Heo wrote:
>> On Mon, Jun 02, 2014 at 01:14:33PM +0200, Pavel Machek wrote:
>>> Now.. I see it is more work for storage maintainers, because there'll
>>> be more code to maintain in the interim. But perhaps user advantages
>>> are worth it?
>>
>> I'm quite skeptical about going that route.  Not necessarily because
>> of the extra amount of work but more the higher probability of getting
>> into a situation where we can neither push forward nor back out.  It's
>> difficult to define a clear deadline, and there will likely be unforeseen
>> challenges in the planned convergence of the two schedulers;
>> eventually, it isn't too unlikely that we end up in a situation where we have
>> to admit defeat and just keep both schedulers.  Note that developer
>
> Yes, that might happen. But it appears that the conditions that would
> leave us stuck with both CFQ and BFQ are the same conditions that would
> leave us stuck with CFQ alone.

We're not merging BFQ as-is. The plan has to be to merge the changes
into CFQ, leaving us with a single scheduler and with a clear path
both backwards and forwards. This was all mentioned earlier in this
thread as well. The latter part of the patch series is already nicely
geared towards this; it's just the first part that has to be done as
well. THAT is the way forward for BFQ.

> And if BFQ is really better for interactivity under load, I'd really
> like the option to use it, even if it leads to regressions under
> batch loads (or something else)...

The benefit is that BFQ has (most) everything nicely characterized, not
that it is necessarily a lot better for any possible workload out there.
As you saw yourself, there can be (and are) bugs lurking that can cause
crashes. Another instance has been reported where there's a huge
performance regression. Especially the latter would be a lot easier to
debug if it could be pinpointed to a specific single change. And I'm
sure there are other issues as well, just as there are undoubtedly
cases where BFQ works better.

>> overhead isn't the only factor here.  Providing two slightly different
>> alternatives inevitably makes userland grow dependencies on subtleties
>> of both and there's a lot less pressure to make judgement calls and
>
> Dunno. It is just the scheduler. It makes stuff slower or faster, but
> should not affect userland too badly.

Until userland starts depending on various sysfs exports to tweak behavior.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: BFQ speed tests [was Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler]
@ 2014-06-11 20:39                                     ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-11 20:39 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Tejun Heo, Jens Axboe, Li Zefan, Fabio Checconi,
	Arianna Avanzini, linux-kernel, containers, cgroups


On 4 Jun 2014, at 12:03, Pavel Machek <pavel@ucw.cz> wrote:

> Hi!
> 
>> Should this attempt prove useless as well, I will, if you do not mind, ask you for more details about your system and try to reproduce your configuration as closely as I can.
>> 
> 
> Try making BFQ the default scheduler. That seems to break it for me;
> when selected at runtime, it looks stable.

As I have already written to you privately, we have fixed the bug. It was a
clerical error, made while turning the original patchset into the series of
patches we then submitted.

The new patchset is available here:
http://algogroup.unimore.it/people/paolo/disk_sched/debugging-patches/3.16.0-rc0-v7rc5.tgz

I’m not submitting this new, fixed patchset by email because, before doing that,
we want to apply all the changes recommended by Tejun and try to turn the
new patchset into a 'transformer' of cfq into bfq (of course, should it turn out
to be better to proceed in a different way for this intermediate version of bfq
as well, we are willing to do so).

> 
> Anyway, here are some speed tests. Background load:
> […]
> root@duo:/data/tmp# echo cfq > /sys/block/sda/queue/scheduler 
> root@duo:/data/tmp# echo 3 > /proc/sys/vm/drop_caches
> root@duo:/data/tmp# cat /dev/zero > delme; cat /dev/zero > delme;cat
> /dev/zero > delme;cat /dev/zero > delme;cat /dev/zero > delme;cat
> /dev/zero > delme
> 
> (Machine was running out of disk space.)
> 
> (I alternate between cfq and bfq).
> 
> Benchmark: I chose git describe because it is sometimes part of a
> kernel build ... and I actually wait for it.
> […]

We have also solved this regression, which was related to both the queue-merge
mechanism and the heuristic that provides low latency to soft real-time
applications. The new patchset also contains this fix. We have repeated
your tests (and other similar tests) with this fixed version of bfq.
These are our results with your tests.

# Test with background writes

[root@bfq-testbed data]# echo cfq > /sys/block/sda/queue/scheduler
[root@bfq-testbed data]# echo 3 > /proc/sys/vm/drop_caches
[root@bfq-testbed data]# cat /dev/zero > delme; cat /dev/zero > delme;cat
/dev/zero > delme;cat /dev/zero > delme;cat /dev/zero > delme;cat
/dev/zero > delme

[root@bfq-testbed linux-lkml]# time git describe
v3.15-rc8-78-gd531c25

# BFQ
0.24user 0.14system 0:07.42elapsed 5%CPU
# CFQ
0.24user 0.16system 0:08.39elapsed 4%CPU
# BFQ
0.25user 0.15system 0:08.45elapsed 4%CPU
# CFQ
0.26user 0.15system 0:09.11elapsed 4%CPU

# Results without background workload

# BFQ
0.23user 0.12system 0:07.23elapsed 4%CPU
# CFQ
0.25user 0.13system 0:07.36elapsed 5%CPU
# BFQ
0.23user 0.14system 0:07.24elapsed 5%CPU
# CFQ
0.22user 0.14system 0:07.36elapsed 5%CPU

Any feedback on these and other tests is more than welcome.

Thanks,
Paolo


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: BFQ speed tests [was Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler]
@ 2014-06-11 20:45                                             ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-11 20:45 UTC (permalink / raw)
  To: Takashi Iwai
  Cc: Pavel Machek, Tejun Heo, Jens Axboe, Li Zefan, Fabio Checconi,
	Arianna Avanzini, linux-kernel, containers, cgroups


On 4 Jun 2014, at 13:59, Takashi Iwai <tiwai@suse.de> wrote:

> […]
> I've been using BFQ for a while and noticed also some obvious
> regression in some operations, notably git, too.
> For example, git grep regresses badly.
> 
> I ran "test git grep foo > /dev/null" on linux kernel repos on both
> rotational disk and SSD.
> […]
> 
> BFQ seems behaving bad when reading many small files.
> 

The fix I described in my last reply to Pavel's speed tests
(https://lkml.org/lkml/2014/6/4/94) apparently solves also this problem.
As I wrote in that reply, the new fixed version of bfq is here:
http://algogroup.unimore.it/people/paolo/disk_sched/debugging-patches/3.16.0-rc0-v7rc5.tgz

These are our results, for your test, with this fixed version of bfq.

time git grep foo > /dev/null

Rotational disk:
 CFQ:
   2.86user 4.87system 0:29.51elapsed 26%CPU
   2.87user 4.87system 0:30.30elapsed 25%CPU
   2.82user 4.90system 0:29.13elapsed 26%CPU

 BFQ:
   2.81user 4.97system 0:25.96elapsed 29%CPU
   2.83user 5.02system 0:24.79elapsed 31%CPU
   2.85user 4.95system 0:24.73elapsed 31%CPU

SSD:
 CFQ:
   2.04user 3.93system 0:03.88elapsed 153%CPU
   2.12user 3.85system 0:03.89elapsed 153%CPU
   2.05user 3.92system 0:03.89elapsed 153%CPU

 BFQ:
   2.10user 3.86system 0:03.89elapsed 153%CPU
   2.05user 3.90system 0:03.88elapsed 153%CPU
   2.01user 3.95system 0:03.89elapsed 153%CPU

time git grep foo HEAD > /dev/null

SSD:
 CFQ:
   5.11user 0.38system 0:06.71elapsed 81%CPU
   5.21user 0.36system 0:06.78elapsed 82%CPU
   5.05user 0.41system 0:06.69elapsed 81%CPU

 BFQ:
   5.17user 0.39system 0:06.77elapsed 82%CPU
   5.13user 0.37system 0:06.73elapsed 81%CPU
   5.17user 0.37system 0:06.78elapsed 81%CPU

Should you be willing to provide further feedback on this and other tests,
we would of course really appreciate it.

Thanks again for your report,
Paolo

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: BFQ speed tests [was Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler]
       [not found]                                             ` <6A4905B2-ACAA-419D-9C83-659BE9A5B20B-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-06-13 16:21                                               ` Takashi Iwai
  0 siblings, 0 replies; 247+ messages in thread
From: Takashi Iwai @ 2014-06-13 16:21 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Pavel Machek, Tejun Heo, Jens Axboe, Li Zefan, Fabio Checconi,
	Arianna Avanzini, linux-kernel, containers, cgroups

At Wed, 11 Jun 2014 22:45:06 +0200,
Paolo Valente wrote:
> 
> 
> On 4 Jun 2014, at 13:59, Takashi Iwai <tiwai@suse.de> wrote:
> 
> > […]
> > I've been using BFQ for a while and noticed also some obvious
> > regression in some operations, notably git, too.
> > For example, git grep regresses badly.
> > 
> > I ran "test git grep foo > /dev/null" on linux kernel repos on both
> > rotational disk and SSD.
> > […]
> > 
> > BFQ seems behaving bad when reading many small files.
> > 
> 
> The fix I described in my last reply to Pavel's speed tests
> (https://lkml.org/lkml/2014/6/4/94) apparently solves also this problem.
> As I wrote in that reply, the new fixed version of bfq is here:
> http://algogroup.unimore.it/people/paolo/disk_sched/debugging-patches/3.16.0-rc0-v7rc5.tgz
> 
> These are our results, for your test, with this fixed version of bfq.
> 
> time git grep foo > /dev/null
> 
> Rotational disk:
>  CFQ:
>    2.86user 4.87system 0:29.51elapsed 26%CPU
>    2.87user 4.87system 0:30.30elapsed 25%CPU
>    2.82user 4.90system 0:29.13elapsed 26%CPU
> 
>  BFQ:
>    2.81user 4.97system 0:25.96elapsed 29%CPU
>    2.83user 5.02system 0:24.79elapsed 31%CPU
>    2.85user 4.95system 0:24.73elapsed 31%CPU
> 
> SSD:
>  CFQ:
>    2.04user 3.93system 0:03.88elapsed 153%CPU
>    2.12user 3.85system 0:03.89elapsed 153%CPU
>    2.05user 3.92system 0:03.89elapsed 153%CPU
> 
>  BFQ:
>    2.10user 3.86system 0:03.89elapsed 153%CPU
>    2.05user 3.90system 0:03.88elapsed 153%CPU
>    2.01user 3.95system 0:03.89elapsed 153%CPU
> 
> time git grep foo HEAD > /dev/null
> 
> SSD:
>  CFQ:
>    5.11user 0.38system 0:06.71elapsed 81%CPU
>    5.21user 0.36system 0:06.78elapsed 82%CPU
>    5.05user 0.41system 0:06.69elapsed 81%CPU
> 
>  BFQ:
>    5.17user 0.39system 0:06.77elapsed 82%CPU
>    5.13user 0.37system 0:06.73elapsed 81%CPU
>    5.17user 0.37system 0:06.78elapsed 81%CPU
> 
> Should you be willing to provide further feedback on this and other tests,
> we would of course really appreciate it.

Thanks.  The new patchset works well now.  The results with the new
patchset + latest Linus git tree are below.

The only significant difference is in the "git grep foo" case on the
SSD.  But I'm not sure whether it's just random variation.  I'll need to get
more samples to average out the noise.


Takashi

===

* time git grep foo > /dev/null

rotational disk:
  CFQ:
    2.34user 4.04system 2:00.12elapsed 5%CPU
    2.49user 3.80system 1:56.20elapsed 5%CPU
    2.42user 3.68system 1:46.81elapsed 5%CPU

  BFQ:
    2.44user 3.57system 1:49.65elapsed 5%CPU
    2.47user 3.67system 1:55.92elapsed 5%CPU
    2.47user 3.63system 1:50.06elapsed 5%CPU

SSD:
  CFQ:
    1.25user 1.54system 0:04.62elapsed 60%CPU
    1.23user 1.67system 0:04.65elapsed 62%CPU
    1.22user 1.60system 0:04.61elapsed 61%CPU

  BFQ:
    1.29user 1.64system 0:06.91elapsed 42%CPU
    1.30user 1.66system 0:06.66elapsed 44%CPU
    1.27user 1.59system 0:04.73elapsed 60%CPU

* time git grep foo HEAD > /dev/null

rotational disk:
  CFQ:
    5.12user 0.43system 0:19.86elapsed 28%CPU
    5.06user 0.45system 0:19.88elapsed 27%CPU
    5.00user 0.41system 0:20.05elapsed 27%CPU

  BFQ:
    4.82user 0.37system 0:19.56elapsed 26%CPU
    5.00user 0.43system 0:19.53elapsed 27%CPU
    4.92user 0.45system 0:19.69elapsed 27%CPU

SSD:
  CFQ:
    4.49user 0.32system 0:07.26elapsed 66%CPU
    4.50user 0.31system 0:07.25elapsed 66%CPU
    4.40user 0.32system 0:07.16elapsed 65%CPU

  BFQ:
    4.09user 0.26system 0:06.93elapsed 62%CPU
    3.76user 0.23system 0:06.54elapsed 61%CPU
    3.65user 0.22system 0:06.40elapsed 60%CPU


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
@ 2014-06-16 10:46                                 ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-16 10:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups


On 4 Jun 2014, at 15:56, Tejun Heo <tj@kernel.org> wrote:

> Hello, Paolo.

Hello, and sorry for the late reply.

> […]
>> 
>> Actually we have been asked several times to improve random-I/O
>> performance on HDDs over the last years, by people recording, for
>> the typical tasks performed by their machines, much lower throughput
>> than with the other schedulers. Major problems have been reported
>> for server workloads (database, web), and for btrfs. According to
>> the feedback received after introducing this optimization in bfq,
>> those problems seem to be finally gone.
> 
> I see.  The equal priority part can probably work in enough cases to
> be meaningful given that it just depends on the busy queues having the
> same weight instead of everything in the system.  It'd nice to note
> that in the comment tho.
> 
> I'm still quite skeptical about the cgroup part tho.  The triggering
> condition is too specific and fragile.  If I'm reading the bfq blkcg
> implementation correctly, it seems to be applying the scheduling
> algorithm recursively walking down the tree one level at a time.

Yes.

>  cfq
> does it differently.  cfq flattens the hierarchy by calculating the
> nested weight of each active leaf queue and schedule all of them from
> the same service tree.  IOW, scheduling algorithm per-se doesn't care
> about the hierarchy.  All it sees are differing weights competing
> equally regardless of the hierarchical structure.
> 

To preserve the desired distribution of the device throughput (or time), this
scheme entails updating weights every time the set of active queues changes.
For example (sorry for the trivial example, but I just want to make sure that I am
not misunderstanding what you are telling me), suppose that two groups,
A and B, are reserved 50% of the throughput each, and that the first group contains
two processes, whereas the second group contains just one process. Apart from the
additional per-queue interventions of cfq to improve latency or throughput, the
queues of the two processes in group A are reserved 50% of the group throughput
each. It follows that, if all three queues are backlogged, then a correct weight
distribution for a flat underlying scheduler is, e.g., 25 and 25 for the two processes
in group A, and 50 for the process in group B.

But, if one of the queues in group A becomes idle, then the correct weights
of the still-active queues become, e.g., 50 and 50.
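
To make the arithmetic above concrete, the following is a minimal, self-contained
user-space sketch of the recomputation that such flattening entails (this is not
cfq or bfq code; all names are invented): every time the set of active queues
changes, each active leaf queue gets, as its flat share, the product of its
weight fractions along the path to the root, renormalized at each level over the
active siblings only.

/*
 * Toy model of hierarchy flattening: recompute the flat share of every
 * active leaf queue each time the set of active queues changes.
 */
#include <stdio.h>

#define MAX_CHILDREN 8

struct toy_entity {
	const char *name;
	int weight;			/* weight among its siblings */
	int active;			/* meaningful for leaves only */
	struct toy_entity *child[MAX_CHILDREN];
	int nr_children;		/* 0 for a leaf (queue) */
};

/* Does this subtree contain at least one active leaf? */
static int subtree_active(struct toy_entity *e)
{
	int i;

	if (!e->nr_children)
		return e->active;
	for (i = 0; i < e->nr_children; i++)
		if (subtree_active(e->child[i]))
			return 1;
	return 0;
}

/* @share: fraction of the device throughput reserved to this subtree. */
static void print_flat_shares(struct toy_entity *e, double share)
{
	int i, wsum = 0;

	if (!e->nr_children) {
		if (e->active)
			printf("%s: %.0f%%\n", e->name, share * 100);
		return;
	}
	for (i = 0; i < e->nr_children; i++)
		if (subtree_active(e->child[i]))
			wsum += e->child[i]->weight;
	for (i = 0; i < e->nr_children; i++)
		if (subtree_active(e->child[i]))
			print_flat_shares(e->child[i],
					  share * e->child[i]->weight / wsum);
}

int main(void)
{
	struct toy_entity q1 = { "A/q1", 1, 1 };
	struct toy_entity q2 = { "A/q2", 1, 1 };
	struct toy_entity q3 = { "B/q3", 1, 1 };
	struct toy_entity A = { "A", 50, 0, { &q1, &q2 }, 2 };
	struct toy_entity B = { "B", 50, 0, { &q3 }, 1 };
	struct toy_entity root = { "root", 1, 0, { &A, &B }, 2 };

	print_flat_shares(&root, 1.0);	/* prints 25%, 25%, 50% */
	q2.active = 0;			/* one queue in group A goes idle */
	print_flat_shares(&root, 1.0);	/* prints 50%, 50% */
	return 0;
}

With all three queues backlogged this prints 25%, 25% and 50%; as soon as one of
the queues in group A goes idle it prints 50% and 50%, which is exactly the kind
of weight update discussed above.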

Changing weights this way has basically no influence on the per-request latency
guarantees of cfq, because cfq is based on a round-robin scheme, and the latter
already suffers from a large deviation with respect to an ideal service. In contrast,
things change dramatically with an accurate scheduler, such as the internal
B-WF2Q+ scheduler of BFQ. In that case, as explained, e.g., here for packet
scheduling (but the problem is exactly the same)

http://www.algogroup.unimore.it/people/paolo/dynWF2Q+/dynWF2Q+.pdf

weight changes would just break service guarantees, and lead to the same
deviation as a round-robin scheduler. As proved in the same
document, to preserve guarantees, weight updates should be delayed/concealed
in a non-trivial (although computationally cheap) way.

> If the same strategy can be applied to bfq, possibly the same strategy
> of checking whether all the active queues have the same weight can be
> used regardless of blkcg?  That'd be simpler and a lot more robust.
> 

Unfortunately, because of the above problems, this scheme would break
service guarantees with an accurate scheduler such as bfq.

The hierarchical scheme used by bfq does not suffer from this problem,
also because it implements the mechanisms described in the
above document. In particular, the hierarchy allows these mechanisms to be
implemented in a quite simple way, whereas things would become
more convoluted after flattening the hierarchy.

If useful, I am willing to provide more details, although these aspects are most
certainly quite theoretical and boring.

> Another thing I'm curious about is that the logic that you're using to
> disable idling assumes that the disk will serve the queued commands
> more or less in fair manner over time, right?  If so, why does queues
> having the same weight matter?  Shouldn't the bandwidth scheduling
> eventually make them converge to the specified weights over time?
> Isn't wr_coeff > 1 test enough for maintaining reasonable
> responsiveness?

Unfortunately, this assumption was one of the main mistakes made in most existing
research I/O schedulers when Fabio and I first designed bfq. More precisely, your
statement is true for async queues, but fails for sync queues. The problem is that
the (only) way in which a process pushes a scheduler into giving it its reserved
throughput is by issuing requests at a higher rate than that at which they are
served. But, with sync queues this just cannot happen. In particular,
postponing the service of a sync request delays the arrival of the next,
not-yet-arrived, sync request of the same process. Intuitively, to the scheduler
it looks as if the process were happy with the throughput it is receiving, because
it happens to issue requests exactly at that rate.

This problem is described in more detail, for example, in Section 5.3 of
http://www.algogroup.unimore.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf

bfq deals with this problem by properly back-shifting request timestamps when
needed. But idling is necessary for this mechanism to work (unless more
complex solutions are adopted).
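
For completeness, here is only a rough sketch of the back-shifting idea, in a
generic WF2Q+-like setting and with invented names and fields (it is not the
actual bfq implementation): when a sync queue becomes backlogged again, its new
request may be timestamped as if it had arrived when the previous one finished,
rather than being charged from the current virtual time, so that the queue does
not lose the share it is entitled to just because its I/O is synchronous.

/*
 * Rough sketch of timestamp back-shifting for a WF2Q+-like scheduler.
 * Invented names and fields; not the actual bfq code.
 */
struct toy_queue {
	unsigned long long start;	/* virtual start time */
	unsigned long long finish;	/* virtual finish time */
	unsigned int weight;
	int deserves_backshift;		/* e.g., the device was idling,
					   waiting for this sync queue */
};

/* A new request of size @service arrives and @q becomes backlogged again. */
static void toy_update_timestamps(struct toy_queue *q,
				  unsigned long long vtime,
				  unsigned long long service)
{
	if (q->finish < vtime && !q->deserves_backshift)
		/*
		 * The queue was idle by its own choice: charge it from
		 * the current virtual time, as plain WF2Q+ does.
		 */
		q->start = vtime;
	else
		/*
		 * Back-shift: timestamp the request as if it had arrived
		 * right when the previous one finished, even if that lies
		 * in the virtual-time past.
		 */
		q->start = q->finish;

	q->finish = q->start + service / q->weight;
}

The actual mechanism in bfq is more involved (see the reference above), but this
is the gist of why back-shifting, and hence idling, matters for sync queues.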

> 
>> Besides, turning back to bfq, if its low-latency heuristics are
>> disabled for non rotational devices, then, according to our results
>> with slower devices, such as SD Cards and eMMCs, latency becomes
>> easily unbearable, with no throughput gain.
> 
> Hmmm... looking at the nonrot optimizations again, so yeah assuming
> the weight counting is necessary for NCQ hdds the part specific to
> ssds isn't that big.  You probably wanna sequence it the other way
> around tho.  This really should be primarily about disks at this
> point.
> 
> The thing which still makes me cringe is how it scatters
> blk_queue_nonrot() tests across multiple places without clear
> explanation on what's going on.  A bfqq being constantly seeky or not
> doesn't have much to do with whether the device is rotational or not.
> Its effect does and I don't think avoiding the overhead of keeping the
> counters is meaningful.  Things like this make the code a lot harder
> to maintain in the long term as code is organized according to
> seemingly arbitrary optimization rather than semantic structure.  So,
> let's please keep the accounting and optimization separate.
> 

Added to the list of changes to make, thanks.

Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 10/12] block, bfq: add Early Queue Merge (EQM)
       [not found]                             ` <20140604130446.GA5004-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2014-06-16 11:23                               ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-16 11:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups, Mauro Andreolini

Hi,

On 4 Jun 2014, at 15:04, Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
>> […]

>> The problem is not the duration of the plugging, but the fact that, if a request merge
>> succeeds for a bio, then there will be no set_request invocation for that bio.
>> Therefore, without early merging, there will be no queue merge at all.
>> 
>> If my replies are correct and convince you, then I will use them to integrate and
>> hopefully improve the documentation for this patch.
> 
> Ah, okay, so it's about missing the chance to look for cooperating
> queues when merge succeeds.  Yeah, that makes a lot more sense to me.
> If that's the case, wouldn't it be better to try finding cooperating
> queues after each merge success rather than each allow_merge()
> invocation?

If I have correctly understood your question, then the answer is unfortunately no.

First, without any queue-merge control in allow_merge, a bio-merge attempt would
fail if the destination queue of the bio differs from the queue containing the target rq,
even if the two queues should be merged. With early queue merge, not only are the two
queues merged, but the request-merge attempt also succeeds.

Second, if the queue-merge check was executed only after a successful request merge,
then, in the case of a request-merge failure, the next chance to merge the destination queue
of the bio with some other queue would be in the next add_request. But, according to
what we already discussed in some previous emails, waiting until add_request is usually
too long a delay with, e.g., the I/O generated by QEMU I/O threads.

In the end, moving queue merging to after successful request merges would most certainly
cause the throughput to drop.
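
Just to make this concrete, below is a sketch of the shape of the allow_merge-time
check being discussed. Everything in it (types, helpers, names) is invented for
illustration; it is not the actual bfq code, and the real implementation has to
deal with several details omitted here.

/*
 * Sketch of an early-queue-merge check performed at allow_merge time.
 * All identifiers are invented; this is not the actual bfq code.
 */
struct toy_data;
struct toy_queue;

/* Assumed helpers, declared only to show the structure of the check. */
struct toy_queue *toy_bio_queue(struct toy_data *td,
				unsigned long long bio_sector);
struct toy_queue *toy_find_close_cooperator(struct toy_data *td,
					    struct toy_queue *q,
					    unsigned long long bio_sector);
struct toy_queue *toy_merge_queues(struct toy_data *td, struct toy_queue *q,
				   struct toy_queue *coop);

/*
 * May a bio ending at @bio_sector be merged into a request owned by
 * @active_q?  Before answering, try to merge the bio's own queue with a
 * cooperating queue issuing close requests: if the bio merge succeeds,
 * there will be no set_request for this bio, hence no later chance to
 * perform the queue merge.
 */
static int toy_allow_merge(struct toy_data *td, struct toy_queue *active_q,
			   unsigned long long bio_sector)
{
	struct toy_queue *bio_q = toy_bio_queue(td, bio_sector);
	struct toy_queue *coop;

	if (!bio_q)
		return 0;

	coop = toy_find_close_cooperator(td, bio_q, bio_sector);
	if (coop)
		bio_q = toy_merge_queues(td, bio_q, coop);

	/* The bio merge is allowed only if the bio ends up in @active_q. */
	return bio_q == active_q;
}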


>  And let's please document what we're catching with the
> extra attempts.
> 

Definitely, thanks,

Paolo

> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
@ 2014-06-17 15:55                           ` Paolo Valente
  0 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-17 15:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Li Zefan, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups


On 2 Jun 2014, at 16:29, Jens Axboe <axboe@kernel.dk> wrote:

> On 2014-05-30 23:16, Tejun Heo wrote:
>>> for turning patch #2 into a series of changes for CFQ instead. We need to
>>> end up with something where we can potentially bisect our way down to
>>> whatever caused any given regression. The worst possible situation is "CFQ
>>> works fine for this workload, but BFQ does not" or vice versa. Or hangs, or
>>> whatever it might be.
>> 
>> It's likely that there will be some workloads out there which may be
>> affected adversely, which is true for any change really but with both
>> the core scheduling and heuristics properly characterized at least
>> finding a reasonable trade-off should be much less of a crapshoot and
>> the expected benefits seem to easily outweigh the risks as long as we
>> can properly sequence the changes.
> 
> Exactly, I think we are pretty much on the same page here. As I said in the previous email, the biggest thing I care about is not adding a new IO scheduler wholesale. If Paolo can turn the "add BFQ" patch into a series of patches against CFQ, then I would have no issue merging it for testing (and inclusion, when it's stable enough).

We have finished analyzing possible ways to turn cfq into bfq, and unfortunately
I think I need some help in this respect. In fact, we have found several, apparently
non-trivial issues. To describe them, I will start from some concrete examples, and
then try to discuss the overall problem in general terms, instead of providing a list
of all the issues we have found. I am sorry for providing so many details, but
I hope they will help me sync with you, and make further boring emails unnecessary.

First, suppose we start the transformation by adding the low-latency heuristics of bfq
on top of cfq's engine: one of the main issues is then that cfq chooses the next queue to serve,
within a group and class, in a different way than bfq does. In fact, cfq first chooses the
sub-group of queues to serve according to a workload-based priority scheme, and then
performs a round-robin scheduling among the queues in the sub-group. This priority scheme
not only has nothing to do with the logic of the low-latency heuristics of bfq (and actually with
bfq altogether), but also conflicts with the freedom in choosing the next queue that these
heuristics need in order to guarantee a low latency.

If, on the opposite end, we assume, because of the above issue, to proceed the other
way round, i.e., to start from replacing the engine of cfq with that of bfq, then similar, if not worse,
issues arise:
- the internal scheduler of bfq is hierarchical, whereas the internal, round-robin-based scheduler
  of cfq is not
- the hierarchy-flattening scheme adopted in cfq has no counterpart in the hierarchical scheduling
  algorithm of bfq
- preemption is not trivial to implement in bfq in such a way that service guarantees are preserved,
  but it would nevertheless be needed in the first place to keep a high throughput with interleaved I/O
- cfq uses the workload-based queue selection scheme I mentioned above, and this has no match
  with any mechanism in bfq
...

Instead of bothering you with the full list of issues, I want to try to describe the problem, in general
terms, through the following rough simplification (in which I am neglecting trivial common code
between cfq and bfq, such as the handling of I/O contexts). On one side, roughly 80% of bfq is a
hierarchical fair-queueing scheduler, and the remaining 20% is a set of heuristics to improve
some performance indexes. On the other side, roughly 40% of cfq is a simple round-robin
flat scheduler, and the remaining 60% consists of: an extension to support hierarchical scheduling,
workload-based improvements, preemption, virtual-time extensions, further low-latency mechanisms,
and so on. This remaining 60% of cfq has practically very little in common with the above 20% of
heuristics in bfq (although many of the goals of these parts are the same). The commonalities
probably amount to at most 10%. The problem is then the remaining, almost completely incompatible, 90%
of non-common mechanisms.

To make a long story short, to implement a smooth transition from cfq to bfq, this 90% should of
course be progressively transformed along the way. This would apparently imply that:
- the intermediate versions would not be partial versions of either cfq or bfq;
- the performance of these intermediate versions would most certainly be worse than that of both
  cfq and bfq, as the mechanisms of the latter schedulers have been fine-tuned over the years,
  whereas the hybrid mechanisms in the intermediate versions would just be an attempt to avoid
  abrupt changes;
- these hybrid mechanisms would likely be more complex than the original ones;
- in the final steps of the transformation, these hybrid mechanisms will all have to be further
  changed to become those of bfq, or just thrown away.

In the end, a smooth transition seems messy and confusing. On the opposite end, I thought about
a cleaner but sharper solution, which probably better matches the one proposed by Tejun:
1) removing the 60% of extra code of cfq from around the round-robin engine of cfq, 2) turning the
remaining core into a flat version of bfq-v0, 3) turning this flat scheduler into the actual, hierarchical
bfq-v0, 4) applying the remaining bfq patches.

In general, with both a smooth-but-messy and a sharp-but-clean transformation, there seem to be
the following common problems:
1) The main benefits highlighted by Jens, i.e., being able to move back and forth and easily
understand what works and what does not, seem to be lost, because, with both solutions,
intermediate versions would likely have a worse performance than the current version of cfq.
2) bfq, on one side, does not export some of the sysfs parameters of cfq, such as slice_sync, and,
on the other side, uses other common parameters in a different way. For example, bfq turns I/O priorities
into throughput shares in a different way than cfq does. As a consequence, existing configurations may
break or behave in unexpected ways.

I’m sorry for the long list of (only) problems, but, because of the extent to which cfq and bfq have diverged
over the years, we are having a really hard time finding a sensible way to turn the former into the latter.
Of course, we are willing to do our best once we find a viable solution.

Thanks,
Paolo


^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices
  2014-06-16 10:46                                 ` Paolo Valente
@ 2014-06-19  1:14                                     ` Tejun Heo
  -1 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-19  1:14 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe, Fabio Checconi, Arianna Avanzini,
	linux-kernel, containers, cgroups

Hello,

On Mon, Jun 16, 2014 at 12:46:36PM +0200, Paolo Valente wrote:
> To preserve the desired distribution of the device throughput (or time), this
> scheme entails updating weights every time the set of active queues changes.
> For example (sorry for the trivial example, but I just want to make sure that I am
> not misunderstanding what you are telling me), suppose that two groups,
> A and B, are reserved 50% of the throughput each, and that the first group contains
> two processes, whereas the second group just one process. Apart from the
> additional per-queue interventions of cfq to improve latency or throughput, the
> queues the two processes in in group A are reserved 50% of the group throughput
> each. It follows that, if all the three queues are backlogged, then a correct weight
> distribution for a flat underlying scheduler is, e.g., 25 and 25 for the two processes
> in group A, and 50 for the process in group B.
> 
> But, if one of the queues in group A becomes idle, then the correct weights
> of the still-active queues become, e.g., 50 and 50.

Yeap, that's what cfq is doing now: flattening priorities to the top
level each time the set of active queues changes.  It probably
introduces weird artifacts into the scheduling, but given the existing
amount of noise I don't think it'd make a noticeable difference.

> Changing weights this way has basically no influence of the per-request latency
> guarantees of cfq, because cfq is based on a round-robin scheme, and the latter
> already suffers from a large deviation with respect to an ideal service. In contrast,
> things change dramatically with an accurate scheduler, such as the internal
> B-WF2Q+ scheduler of BFQ. In that case, as explained, e.g., here for packet
> scheduling (but the problem is exactly the same)
> 
> http://www.algogroup.unimore.it/people/paolo/dynWF2Q+/dynWF2Q+.pdf
> 
> weight changes would just break service guarantees, and lead to the same
> deviation as a round-robin scheduler. As proved in the same
> document, to preserve guarantees, weight updates should be delayed/concealed
> in a non-trivial (although computationally cheap) way.

Ah, bummer.  Flattening is so much easier to deal with.

> If useful, I am willing to provide more details, although these aspects are most
> certainly quite theoretical and boring.

Including a simplified, intuitive explanation of why fully hierarchical
scheduling is necessary, with a reference to a more detailed
explanation, would be nice.

> > Another thing I'm curious about is that the logic that you're using to
> > disable idling assumes that the disk will serve the queued commands
> > more or less in a fair manner over time, right?  If so, why does it matter
> > whether the queues have the same weight?  Shouldn't the bandwidth scheduling
> > eventually make them converge to the specified weights over time?
> > Isn't wr_coeff > 1 test enough for maintaining reasonable
> > responsiveness?
> 
> Unfortunately, when I first defined bfq with Fabio, this turned out to be one of the main
> mistakes made in most existing research I/O schedulers. More precisely, your
> statement is true for async queues, but fails for sync queues. The problem is that
> the (only) way in which a process pushes a scheduler into giving it its reserved
> throughput is by issuing requests at a higher rate than that at which they are
> served. But, with sync queues this just cannot happen. In particular,
> postponing the service of a sync request delays the arrival of the next,
> but not-yet-arrived, sync request of the same process. Intuitively, for the scheduler,
> it is as if the process were happy with the throughput it is receiving, because
> it happens to issue requests exactly at that rate.
> 
> This problem is described in more detail, for example, in Section 5.3 of
> http://www.algogroup.unimore.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
> 
> bfq deals with this problem by properly back-shifting request timestamps when
> needed. But idling is necessary for this mechanism to work (unless more
> complex solutions are adopted).
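
As a rough, self-contained sketch of the timestamp idea (this is not
bfq's actual code: the structure, the helpers and the exact back-shift
rule are simplified assumptions), a virtual-time scheduler normally
re-stamps a queue that becomes busy again starting from the current
virtual time, which forgets the service the queue missed while waiting
for its own sync request; a back-shifted rule instead reuses the
queue's old finish time as the new start when the queue is lagging:

/*
 * Simplified sketch of timestamp back-shifting in a WF2Q+-like
 * scheduler.  Not bfq's code: the back-shift rule here is reduced to
 * its bare intuition and ignores budgets, bounds and corner cases.
 */
#include <stdio.h>

struct queue {
	double weight;
	double start, finish;	/* virtual start/finish timestamps */
};

/* Standard rule: a queue that was idle restarts from the current
 * virtual time, so any service it missed while idle is forgotten. */
static void activate_plain(struct queue *q, double vtime, double size)
{
	q->start = q->finish > vtime ? q->finish : vtime;
	q->finish = q->start + size / q->weight;
}

/* Back-shifted rule: if the queue is lagging (finish < vtime) only
 * because, being sync, it could not keep itself backlogged, keep its
 * old finish time as the new start, so the lost service is recovered. */
static void activate_backshifted(struct queue *q, double vtime, double size)
{
	q->start = q->finish < vtime ? q->finish : vtime;
	q->finish = q->start + size / q->weight;
}

int main(void)
{
	struct queue a = { .weight = 1.0, .start = 6.0, .finish = 10.0 };
	struct queue b = a;
	double vtime = 25.0;	/* virtual time has moved well past the queue */

	activate_plain(&a, vtime, 4.0);
	activate_backshifted(&b, vtime, 4.0);

	/* plain:        start=25.0 finish=29.0 (the lag is forgotten)
	 * back-shifted: start=10.0 finish=14.0 (served ahead of vtime) */
	printf("plain:        start=%.1f finish=%.1f\n", a.start, a.finish);
	printf("back-shifted: start=%.1f finish=%.1f\n", b.start, b.finish);
	return 0;
}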

Oh... I understand why idling is necessary to actually implement io
priorities.  I was wondering about the optimization for queued devices
and why it matters whether the active queues have equal weight or not.
If the optimization can be used for three queues of weight 1, why can't it
be used for weights 1 and 2?
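
For reference, the condition being discussed can be sketched as a
small predicate (hypothetical and simplified, not the code of the
patch: the names, the parameters and the exact set of checks are
assumptions): idling on a sync queue is skipped, to get more out of an
NCQ-capable flash device, only when all active queues have the same
weight and none is being weight-raised, i.e., only when skipping it
cannot skew the throughput distribution:

/*
 * Hypothetical sketch of the "disable idling only when the scenario is
 * symmetric" check being discussed.  Not the code of the patch: the
 * struct, the helper and its arguments are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

struct queue_state {
	int weight;
	bool weight_raised;	/* wr_coeff > 1, i.e., latency boosting */
};

static bool may_skip_idling(const struct queue_state *active, int nr,
			    bool ncq_capable_flash)
{
	int i;

	if (!ncq_capable_flash)
		return false;	/* on rotational devices idling still pays off */

	for (i = 0; i < nr; i++)
		if (active[i].weight != active[0].weight ||
		    active[i].weight_raised)
			return false;	/* asymmetric: idle to preserve shares */

	return true;
}

int main(void)
{
	struct queue_state same[]  = { {1, false}, {1, false}, {1, false} };
	struct queue_state mixed[] = { {1, false}, {2, false} };

	printf("three queues of weight 1: %s\n",
	       may_skip_idling(same, 3, true) ? "skip idling" : "idle");
	printf("weights 1 and 2:          %s\n",
	       may_skip_idling(mixed, 2, true) ? "skip idling" : "idle");
	return 0;
}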

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
       [not found]                           ` <0A5218F8-0215-4B4F-959B-EE5AAEFC164A-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-06-19  1:46                             ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-19  1:46 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA

Hello,

On Tue, Jun 17, 2014 at 05:55:57PM +0200, Paolo Valente wrote:
> In general, with both a smooth but messy and a sharp but clean
> transformation, there seems to be the following common problems:
>
> 1) The main benefits highlighted by Jens, i.e., being able to move
> back and forth and easily understand what works and what does not,
> seem to be lost, because, with both solutions, intermediate versions
> would likely have a worse performance than the current version of
> cfq.

So, the perfectly smooth and performant transformation is possible,
it'd be great, but I don't really think that'd be the case.  My
opinion is that if the infrastructure pieces can be mostly maintained
while making logical, gradual steps it should be fine.  ie. pick
whatever strategy seems executable, chop down the pieces which
get in the way (ie. tear down all the cfq heuristics if you have to),
transform the base and then build things on top again.  Ensuring that
each step is logical and keeps working should give us enough safety
net, IMO.

Jens, what do you think?

> 2) bfq, on one side, does not export some of the sysfs parameters of
> cfq, such as slice_sync, and, on the other side, uses other common
> parameters in a different way. For example, bfq turns I/O priorities
> into throughput shares in a different way than cfq does. As a
> consequence, existing configurations may break or behave in
> unexpected ways.

This is why I hate exposing internal knobs without layering proper
semantic interpretation on top.  It ends up creating an unnecessary
lock-in effect too often, just to serve some esoteric cases which
aren't all that useful.  For knobs which don't make any sense for the
new scheduler, the appropriate thing to do would be just making them
no-ops and generating a warning message when they are written to.
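
For example (a hypothetical, kernel-style sketch, not code from the
posted patches: the bfq_deprecated_* names are invented), an obsolete
cfq knob could keep its sysfs file while writes become a warned no-op:

/*
 * Hypothetical sketch: keep an obsolete cfq sysfs knob visible, but
 * make writes a no-op that warns once.  The bfq_deprecated_* names are
 * invented for illustration; only slice_sync is shown.
 */
#include <linux/elevator.h>
#include <linux/kernel.h>
#include <linux/printk.h>
#include <linux/stat.h>
#include <linux/sysfs.h>

static ssize_t bfq_deprecated_show(struct elevator_queue *e, char *page)
{
	return sprintf(page, "0\n");	/* the knob no longer has a meaning */
}

static ssize_t bfq_deprecated_store(struct elevator_queue *e,
				    const char *page, size_t count)
{
	pr_warn_once("bfq: this parameter is obsolete and is ignored\n");
	return count;			/* accept the write, do nothing */
}

/* hooked up through elevator_type->elevator_attrs as usual */
static struct elv_fs_entry bfq_attrs[] = {
	__ATTR(slice_sync, S_IRUGO | S_IWUSR,
	       bfq_deprecated_show, bfq_deprecated_store),
	__ATTR_NULL
};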

As for behavior changes for existing users, any change to the scheduler
does that.  I don't think it's practical to avoid any changes for that
reason.  I think there already is a pretty solid platform to base
things on and the way forward is making the changes and iterating as
testing goes on and issues get reported.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
       [not found]                             ` <20140619014600.GA20100-9pTldWuhBndy/B6EtB590w@public.gmane.org>
@ 2014-06-19  1:49                               ` Tejun Heo
  2014-06-19  2:29                                 ` Jens Axboe
  1 sibling, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-19  1:49 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Jun 18, 2014 at 09:46:00PM -0400, Tejun Heo wrote:
...
> So, the perfectly smooth and performant transformation is possible,
     ^
     if
> it'd be great, but I don't really think that'd be the case.  My

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-06-19  1:46                             ` Tejun Heo
@ 2014-06-19  2:29                                 ` Jens Axboe
  -1 siblings, 0 replies; 247+ messages in thread
From: Jens Axboe @ 2014-06-19  2:29 UTC (permalink / raw)
  To: Tejun Heo, Paolo Valente
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA

On 2014-06-18 18:46, Tejun Heo wrote:
> Hello,
>
> On Tue, Jun 17, 2014 at 05:55:57PM +0200, Paolo Valente wrote:
>> In general, with both a smooth but messy and a sharp but clean
>> transformation, there seems to be the following common problems:
>>
>> 1) The main benefits highlighted by Jens, i.e., being able to move
>> back and forth and easily understand what works and what does not,
>> seem to be lost, because, with both solutions, intermediate versions
>> would likely have a worse performance than the current version of
>> cfq.
>
> So, the perfectly smooth and performant transformation is possible,
> it'd be great, but I don't really think that'd be the case.  My
> opinion is that if the infrastructure pieces can be mostly maintained
> while making logical gradual steps it should be fine.  ie. pick
> whatever strategy which seems executable, chop down the pieces which
> get in the way (ie. tear down all the cfq heuristics if you have to),
> transform the base and then build things on top again.  Ensuring that
> each step is logical and keeps working should give us enough safety
> net, IMO.
>
> Jens, what do you think?

I was thinking the same - strip CFQ back down, getting rid of the 
heuristics, then go forward to BFQ. That should be feasible. You need to 
find the common core first.

>> 2) bfq, on one side, does not export some of the sysfs parameters of
>> cfq, such as slice_sync, and, on the other side, uses other common
>> parameters in a different way. For example, bfq turns I/O priorities
>> into throughput shares in a different way than cfq does. As a
>> consequence, existing configurations may break or behave in
>> unexpected ways.
>
> This is why I hate exposing internal knobs without layering proper
> semantic interpretation on top.  It ends up creating unnecessary
> lock-in effect too often just to serve some esoteric cases which
> aren't all that useful.  For knobs which don't make any sense for the
> new scheduler, the appropriate thing to do would be just making them
> noop and generate a warning message when it's written to.
>
> As for behavior change for existing users, any change to scheduler
> does that.  I don't think it's practical to avoid any changes for that
> reason.  I think there already is a pretty solid platform to base
> things on and the way forward is making the changes and iterating as
> testing goes on and issues get reported.

Completely agree, don't worry about that. It's not like we advertise
hard guarantees on the priorities right now, for instance, so as long as
the end result isn't orders of magnitude different for the
classes/levels, it'll likely be good enough.

Ditto on the sysfs files, as some of those are likely fairly widely
used. But if we warn and do nothing, that'll allow us to sort out
popular uses of them before we (later on) remove the files.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-06-19  2:29                                 ` Jens Axboe
@ 2014-06-23 13:53                                     ` Paolo Valente
  -1 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-06-23 13:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA


On 19 Jun 2014, at 04:29, Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> wrote:

> On 2014-06-18 18:46, Tejun Heo wrote:
>> Hello,
>> 
>> On Tue, Jun 17, 2014 at 05:55:57PM +0200, Paolo Valente wrote:
>>> In general, with both a smooth but messy and a sharp but clean
>>> transformation, there seems to be the following common problems:
>>> 
>>> 1) The main benefits highlighted by Jens, i.e., being able to move
>>> back and forth and easily understand what works and what does not,
>>> seem to be lost, because, with both solutions, intermediate versions
>>> would likely have a worse performance than the current version of
>>> cfq.
>> 
>> So, the perfectly smooth and performant transformation is possible,
>> it'd be great, but I don't really think that'd be the case.  My
>> opinion is that if the infrastructure pieces can be mostly maintained
>> while making logical gradual steps it should be fine.  ie. pick
>> whatever strategy which seems executable, chop down the pieces which
>> get in the way (ie. tear down all the cfq heuristics if you have to),
>> transform the base and then build things on top again.  Ensuring that
>> each step is logical and keeps working should give us enough safety
>> net, IMO.
>> 
>> Jens, what do you think?
> 
> I was thinking the same - strip CFQ back down, getting rid of the heuristics, then go forward to BFQ. That should be feasible. You need to find the common core first.

OK, I will try exactly this approach (hoping not to have misunderstood anything).
Here is, very briefly, the strategy I am thinking about:
1) In a first, purely destructive phase, bring CFQ back, more or less, to its state
at the time when BFQ was initially forked, and justify the removal of every heuristic
and improvement. Depending on how many patches come out of this phase,
possibly pack them into a first, separate patch series.
2) In a second, purely constructive phase: (a) turn the stripped-down version of CFQ into
a flat BFQ-v0, (b) turn the latter into the actual, hierarchical BFQ-v0, and, finally, (c)
progressively turn the result into the latest version of BFQ, through the previously-submitted
patches. Of course, this will be done only after fixing and improving all the involved patches
according to Tejun's suggestions and corrections.

I will wait a short while for feedback on this proposal and then, if nothing still needs to be
changed or refined, silently start the process.

> 
>>> 2) bfq, on one side, does not export some of the sysfs parameters of
>>> cfq, such as slice_sync, and, on the other side, uses other common
>>> parameters in a different way. For example, bfq turns I/O priorities
>>> into throughput shares in a different way than cfq does. As a
>>> consequence, existing configurations may break or behave in
>>> unexpected ways.
>> 
>> This is why I hate exposing internal knobs without layering proper
>> semantic interpretation on top.  It ends up creating unnecessary
>> lock-in effect too often just to serve some esoteric cases which
>> aren't all that useful.  For knobs which don't make any sense for the
>> new scheduler, the appropriate thing to do would be just making them
>> noop and generate a warning message when it's written to.
>> 
>> As for behavior change for existing users, any change to scheduler
>> does that.  I don't think it's practical to avoid any changes for that
>> reason.  I think there already is a pretty solid platform to base
>> things on and the way forward is making the changes and iterating as
>> testing goes on and issues get reported.
> 
> Completely agree, don't worry about that. It's not like we advertise hard guarantees on the priorities right now, for instance, so as long as the end result isn't orders of magnitude different for the classes/levels, then it'll likely be good enough.
> 
> Ditto on the sysfs files, as some of those are likely fairly widely used. But if we warn and do nothing, then that'll allow us to sort out popular uses of it before we (later on) remove the files.

Great, thanks. BTW, most of the ‘internal’ parameters inappropriately exposed by BFQ,
as noted by Tejun, were exposed just because we forgot to remove them while turning
the testing version of BFQ into the submitted one. Sorry about that.

Thanks,
Paolo

> 
> -- 
> Jens Axboe


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
       [not found]                                     ` <8F719638-0CD7-4BD2-8F4F-088913A0EE2D-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
@ 2014-06-23 19:20                                       ` Tejun Heo
  0 siblings, 0 replies; 247+ messages in thread
From: Tejun Heo @ 2014-06-23 19:20 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Jun 23, 2014 at 03:53:09PM +0200, Paolo Valente wrote:
> I will wait a short while for feedback on this proposal and then, if nothing
> still needs to be changed or refined, silently start the process.

We'll prolly end up doing a few iterations but overall it sounds good
to me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 247+ messages in thread

* Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler
  2014-06-23 19:20                                       ` Tejun Heo
@ 2014-07-09 20:54                                           ` Paolo Valente
  -1 siblings, 0 replies; 247+ messages in thread
From: Paolo Valente @ 2014-07-09 20:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabio Checconi,
	Arianna Avanzini, cgroups-u79uwXL29TY76Z2rM5mHXA

Hoping that it may help people get a better idea of the features of bfq while we work on the patches, I have just uploaded a new, shorter (7-minute) demo of BFQ on an SSD:
http://youtu.be/KhZl9LjCKuU

Paolo

On 23 Jun 2014, at 21:20, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:

> On Mon, Jun 23, 2014 at 03:53:09PM +0200, Paolo Valente wrote:
>> I will wait shortly for a possible feedback on this proposal, and,
>> then, if nothing has still to be changed or refined, silently start
>> the process.
> 
> We'll prolly end up doing a few iterations but overall it sounds good
> to me.
> 
> Thanks.
> 
> -- 
> tejun


--
Paolo Valente                                                 
Algogroup
Dipartimento di Fisica, Informatica e Matematica		
Via Campi, 213/B
41125 Modena - Italy        				  
homepage:  http://algogroup.unimore.it/people/paolo/

^ permalink raw reply	[flat|nested] 247+ messages in thread

end of thread, other threads:[~2014-07-09 20:54 UTC | newest]

Thread overview: 247+ messages
2014-05-27 12:42 [PATCH RFC RESEND 00/14] New version of the BFQ I/O Scheduler paolo
2014-05-27 12:42 ` paolo
2014-05-27 12:42 ` [PATCH RFC RESEND 12/14] block, bfq: add Early Queue Merge (EQM) paolo
2014-05-27 12:42 ` [PATCH RFC RESEND 14/14] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs paolo
2014-05-27 12:42   ` paolo
     [not found] ` <1401194558-5283-1-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-05-27 12:42   ` [PATCH RFC RESEND 01/14] block: kconfig update and build bits for BFQ paolo
2014-05-27 12:42     ` paolo
2014-05-28 22:19     ` Tejun Heo
2014-05-28 22:19       ` Tejun Heo
     [not found]       ` <20140528221929.GG1419-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-29  9:05         ` [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler Paolo Valente
2014-05-29  9:05           ` Paolo Valente
2014-05-29  9:05           ` [PATCH RFC - TAKE TWO - 05/12] block, bfq: add more fairness to boost throughput and reduce latency Paolo Valente
2014-05-29  9:05             ` Paolo Valente
2014-05-29  9:05           ` [PATCH RFC - TAKE TWO - 06/12] block, bfq: improve responsiveness Paolo Valente
     [not found]             ` <1401354343-5527-7-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-05-30 15:41               ` Tejun Heo
2014-05-30 15:41                 ` Tejun Heo
2014-05-29  9:05           ` [PATCH RFC - TAKE TWO - 08/12] block, bfq: preserve a low latency also with NCQ-capable drives Paolo Valente
2014-05-29  9:05             ` Paolo Valente
     [not found]             ` <1401354343-5527-9-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-05-31 13:48               ` Tejun Heo
2014-05-31 13:48                 ` Tejun Heo
     [not found]                 ` <20140531134823.GB24557-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2014-06-02  9:58                   ` Paolo Valente
2014-06-02  9:58                     ` Paolo Valente
2014-05-29  9:05           ` [PATCH RFC - TAKE TWO - 09/12] block, bfq: reduce latency during request-pool saturation Paolo Valente
     [not found]             ` <1401354343-5527-10-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-05-31 13:54               ` Tejun Heo
2014-05-31 13:54                 ` Tejun Heo
2014-06-02  9:54                 ` Paolo Valente
2014-06-02  9:54                   ` Paolo Valente
     [not found]                 ` <20140531135402.GC24557-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2014-06-02  9:54                   ` Paolo Valente
     [not found]           ` <1401354343-5527-1-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 01/12] block: introduce the BFQ-v0 I/O scheduler Paolo Valente
2014-05-29  9:05               ` Paolo Valente
     [not found]               ` <1401354343-5527-2-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-05-30 15:36                 ` Tejun Heo
2014-05-30 15:36                   ` Tejun Heo
2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 02/12] block, bfq: add full hierarchical scheduling and cgroups support Paolo Valente
2014-05-29  9:05               ` Paolo Valente
     [not found]               ` <1401354343-5527-3-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-05-30 15:37                 ` Tejun Heo
2014-05-30 15:37               ` Tejun Heo
2014-05-30 15:37                 ` Tejun Heo
2014-05-30 15:39                 ` Tejun Heo
2014-05-30 15:39                   ` Tejun Heo
     [not found]                   ` <20140530153943.GC24871-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-30 21:49                     ` Paolo Valente
2014-05-30 21:49                       ` Paolo Valente
     [not found]                 ` <20140530153718.GB24871-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-30 15:39                   ` Tejun Heo
2014-05-30 21:49                   ` Paolo Valente
2014-05-30 21:49                 ` Paolo Valente
2014-05-30 21:49                   ` Paolo Valente
2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 03/12] block, bfq: improve throughput boosting Paolo Valente
2014-05-29  9:05               ` Paolo Valente
2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 04/12] block, bfq: modify the peak-rate estimator Paolo Valente
2014-05-29  9:05               ` Paolo Valente
2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 05/12] block, bfq: add more fairness to boost throughput and reduce latency Paolo Valente
2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 06/12] block, bfq: improve responsiveness Paolo Valente
2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 07/12] block, bfq: reduce I/O latency for soft real-time applications Paolo Valente
2014-05-29  9:05               ` Paolo Valente
2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 08/12] block, bfq: preserve a low latency also with NCQ-capable drives Paolo Valente
2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 09/12] block, bfq: reduce latency during request-pool saturation Paolo Valente
2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 10/12] block, bfq: add Early Queue Merge (EQM) Paolo Valente
2014-05-29  9:05               ` Paolo Valente
     [not found]               ` <1401354343-5527-11-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-06-01  0:03                 ` Tejun Heo
2014-06-01  0:03               ` Tejun Heo
2014-06-01  0:03                 ` Tejun Heo
2014-06-02  9:46                 ` Paolo Valente
2014-06-02  9:46                   ` Paolo Valente
     [not found]                   ` <3B7B1A46-46EB-4C52-A52C-4F79C71D14C2-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-06-03 16:28                     ` Tejun Heo
2014-06-03 16:28                       ` Tejun Heo
     [not found]                       ` <20140603162844.GD26210-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-04 11:47                         ` Paolo Valente
2014-06-04 11:47                           ` Paolo Valente
     [not found]                           ` <91383F1F-69C3-4B88-B51E-30204818F1AB-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-06-04 13:04                             ` Tejun Heo
2014-06-04 13:04                           ` Tejun Heo
     [not found]                             ` <20140604130446.GA5004-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-16 11:23                               ` Paolo Valente
2014-06-16 11:23                             ` Paolo Valente
2014-06-16 11:23                               ` Paolo Valente
     [not found]                 ` <20140601000331.GA29085-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-02  9:46                   ` Paolo Valente
2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices Paolo Valente
2014-05-29  9:05             ` [PATCH RFC - TAKE TWO - 12/12] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs Paolo Valente
2014-05-30 16:07             ` [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler Tejun Heo
2014-05-30 16:07               ` Tejun Heo
     [not found]               ` <20140530160712.GG24871-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-30 22:23                 ` Paolo Valente
2014-05-30 22:23                   ` Paolo Valente
     [not found]                   ` <464F6CBE-A63E-46EF-A90D-BF8450430444-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-05-30 23:28                     ` Tejun Heo
2014-05-30 23:28                   ` Tejun Heo
2014-05-30 23:28                     ` Tejun Heo
     [not found]                     ` <20140530232804.GA5057-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-30 23:54                       ` Paolo Valente
2014-05-30 23:54                         ` Paolo Valente
2014-06-02 11:14                       ` Pavel Machek
2014-06-02 11:14                     ` Pavel Machek
2014-06-02 11:14                       ` Pavel Machek
     [not found]                       ` <20140602111432.GA3737-tWAi6jLit6GreWDznjuHag@public.gmane.org>
2014-06-02 13:02                         ` Pavel Machek
2014-06-02 13:02                           ` Pavel Machek
     [not found]                           ` <20140602130226.GA14654-tWAi6jLit6GreWDznjuHag@public.gmane.org>
2014-06-03 16:54                             ` Paolo Valente
2014-06-03 16:54                               ` Paolo Valente
2014-06-04  8:39                               ` Pavel Machek
2014-06-04  8:39                                 ` Pavel Machek
     [not found]                               ` <FCFE0106-A4DD-4DEF-AAAE-040F3823A447-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-06-03 20:40                                 ` Pavel Machek
2014-06-03 20:40                                   ` Pavel Machek
2014-06-04  8:39                                 ` Pavel Machek
2014-06-04  9:08                                 ` Pavel Machek
2014-06-04  9:08                                   ` Pavel Machek
2014-06-04 10:03                                 ` BFQ speed tests [was Re: [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler] Pavel Machek
2014-06-04 10:03                               ` Pavel Machek
2014-06-04 10:03                                 ` Pavel Machek
     [not found]                                 ` <20140604100358.GA4799-tWAi6jLit6GreWDznjuHag@public.gmane.org>
2014-06-04 10:24                                   ` Paolo Valente
2014-06-04 10:24                                     ` Paolo Valente
     [not found]                                     ` <4888F93F-D58D-48DD-81A6-A6D61C452D92-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-06-04 11:59                                       ` Takashi Iwai
2014-06-04 11:59                                         ` Takashi Iwai
2014-06-04 11:59                                         ` Takashi Iwai
2014-06-04 12:12                                         ` Paolo Valente
2014-06-04 12:12                                           ` Paolo Valente
     [not found]                                         ` <s5hsink3mxk.wl%tiwai-l3A5Bk7waGM@public.gmane.org>
2014-06-04 12:12                                           ` Paolo Valente
2014-06-11 20:45                                           ` Paolo Valente
2014-06-11 20:45                                             ` Paolo Valente
2014-06-13 16:21                                             ` Takashi Iwai
2014-06-13 16:21                                               ` Takashi Iwai
     [not found]                                             ` <6A4905B2-ACAA-419D-9C83-659BE9A5B20B-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-06-13 16:21                                               ` Takashi Iwai
2014-06-11 20:39                                   ` Paolo Valente
2014-06-11 20:39                                     ` Paolo Valente
2014-06-02 17:33                         ` [PATCH RFC - TAKE TWO - 00/12] New version of the BFQ I/O Scheduler Tejun Heo
2014-06-02 17:33                           ` Tejun Heo
     [not found]                           ` <20140602173332.GB8912-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-03  4:12                             ` Mike Galbraith
2014-06-03  4:12                               ` Mike Galbraith
2014-06-04 22:31                             ` Pavel Machek
2014-06-04 22:31                               ` Pavel Machek
     [not found]                               ` <20140604223152.GA7881-tWAi6jLit6GreWDznjuHag@public.gmane.org>
2014-06-05  2:14                                 ` Jens Axboe
2014-06-05  2:14                                   ` Jens Axboe
2014-05-31  0:48                 ` Jens Axboe
2014-05-31  0:48               ` Jens Axboe
2014-05-31  0:48                 ` Jens Axboe
     [not found]                 ` <538926F6.7080409-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2014-05-31  5:16                   ` Tejun Heo
2014-05-31  5:16                 ` Tejun Heo
2014-05-31  5:16                   ` Tejun Heo
     [not found]                   ` <20140531051635.GA19925-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2014-06-02 14:29                     ` Jens Axboe
2014-06-02 14:29                       ` Jens Axboe
     [not found]                       ` <538C8A47.1050502-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2014-06-02 17:24                         ` Tejun Heo
2014-06-17 15:55                         ` Paolo Valente
2014-06-17 15:55                           ` Paolo Valente
     [not found]                           ` <0A5218F8-0215-4B4F-959B-EE5AAEFC164A-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-06-19  1:46                             ` Tejun Heo
2014-06-19  1:46                           ` Tejun Heo
2014-06-19  1:46                             ` Tejun Heo
     [not found]                             ` <20140619014600.GA20100-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2014-06-19  1:49                               ` Tejun Heo
2014-06-19  2:29                               ` Jens Axboe
2014-06-19  2:29                                 ` Jens Axboe
     [not found]                                 ` <53A24B1C.1070004-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2014-06-23 13:53                                   ` Paolo Valente
2014-06-23 13:53                                     ` Paolo Valente
2014-06-23 19:20                                     ` Tejun Heo
2014-06-23 19:20                                       ` Tejun Heo
     [not found]                                       ` <20140623192022.GA19660-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2014-07-09 20:54                                         ` Paolo Valente
2014-07-09 20:54                                           ` Paolo Valente
     [not found]                                     ` <8F719638-0CD7-4BD2-8F4F-088913A0EE2D-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-06-23 19:20                                       ` Tejun Heo
2014-06-19  1:49                             ` Tejun Heo
2014-06-19  1:49                               ` Tejun Heo
2014-06-02 17:24                       ` Tejun Heo
2014-06-02 17:24                         ` Tejun Heo
2014-06-02 17:32                         ` Jens Axboe
2014-06-02 17:32                           ` Jens Axboe
     [not found]                           ` <538CB515.3090700-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2014-06-02 17:42                             ` Tejun Heo
2014-06-02 17:42                               ` Tejun Heo
     [not found]                               ` <20140602174250.GC8912-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-02 17:46                                 ` Jens Axboe
2014-06-02 17:46                                   ` Jens Axboe
2014-06-02 18:51                                   ` Tejun Heo
2014-06-02 18:51                                     ` Tejun Heo
     [not found]                                     ` <20140602185138.GD8912-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-02 20:57                                       ` Jens Axboe
2014-06-02 20:57                                         ` Jens Axboe
     [not found]                                         ` <20140602205713.GB8357-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2014-06-04 14:31                                           ` Christoph Hellwig
2014-06-04 14:31                                         ` Christoph Hellwig
2014-06-04 14:31                                           ` Christoph Hellwig
     [not found]                                           ` <20140604143136.GA1920-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2014-06-04 14:50                                             ` Tejun Heo
2014-06-04 14:50                                               ` Tejun Heo
     [not found]                                               ` <20140604145053.GE5004-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-04 14:53                                                 ` Christoph Hellwig
2014-06-04 14:53                                                   ` Christoph Hellwig
     [not found]                                                   ` <20140604145330.GA2955-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2014-06-04 14:58                                                     ` Tejun Heo
2014-06-04 14:58                                                       ` Tejun Heo
     [not found]                                                       ` <20140604145829.GF5004-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-04 17:51                                                         ` Christoph Hellwig
2014-06-04 17:51                                                           ` Christoph Hellwig
     [not found]                                   ` <538CB87C.7030600-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2014-06-02 18:51                                     ` Tejun Heo
     [not found]                         ` <20140602172454.GA8912-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-02 17:32                           ` Jens Axboe
2014-05-29  9:05           ` [PATCH RFC - TAKE TWO - 11/12] block, bfq: boost the throughput on NCQ-capable flash-based devices Paolo Valente
2014-05-29  9:05             ` Paolo Valente
     [not found]             ` <1401354343-5527-12-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-05-30 15:46               ` Tejun Heo
2014-05-30 15:46                 ` Tejun Heo
     [not found]                 ` <20140530154654.GE24871-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-30 22:01                   ` Paolo Valente
2014-05-30 22:01                     ` Paolo Valente
2014-05-31 11:52               ` Tejun Heo
2014-05-31 11:52                 ` Tejun Heo
     [not found]                 ` <20140531115216.GB5057-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-02  9:26                   ` Paolo Valente
2014-06-02  9:26                     ` Paolo Valente
2014-06-03 17:11                     ` Tejun Heo
2014-06-03 17:11                       ` Tejun Heo
     [not found]                       ` <20140603171124.GE26210-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-04  7:29                         ` Paolo Valente
2014-06-04  7:29                           ` Paolo Valente
     [not found]                           ` <03CDD106-DB18-4E8F-B3D6-2AAD45782A06-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-06-04 13:56                             ` Tejun Heo
2014-06-04 13:56                           ` Tejun Heo
2014-06-04 13:56                             ` Tejun Heo
     [not found]                             ` <20140604135627.GB5004-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-06-16 10:46                               ` Paolo Valente
2014-06-16 10:46                                 ` Paolo Valente
     [not found]                                 ` <D163E069-ED77-4BF5-A488-9A90C41C60C1-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-06-19  1:14                                   ` Tejun Heo
2014-06-19  1:14                                     ` Tejun Heo
     [not found]                     ` <36BFDB73-AEC2-4B87-9FD6-205E9431E722-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-06-03 17:11                       ` Tejun Heo
2014-05-29  9:05           ` [PATCH RFC - TAKE TWO - 12/12] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs Paolo Valente
2014-05-29  9:05             ` Paolo Valente
     [not found]             ` <1401354343-5527-13-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-05-30 15:51               ` Tejun Heo
2014-05-30 15:51                 ` Tejun Heo
2014-05-31 13:34               ` Tejun Heo
2014-05-31 13:34                 ` Tejun Heo
     [not found]     ` <1401194558-5283-2-git-send-email-paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
2014-05-28 22:19       ` [PATCH RFC RESEND 01/14] block: kconfig update and build bits for BFQ Tejun Heo
2014-05-27 12:42   ` [PATCH RFC RESEND 02/14] block: introduce the BFQ-v0 I/O scheduler paolo
2014-05-27 12:42     ` paolo
2014-05-27 12:42   ` [PATCH RFC RESEND 03/14] block: add hierarchical-support option to kconfig paolo
2014-05-27 12:42     ` paolo
2014-05-27 12:42   ` [PATCH RFC RESEND 04/14] block, bfq: add full hierarchical scheduling and cgroups support paolo
2014-05-27 12:42     ` paolo
2014-05-27 12:42   ` [PATCH RFC RESEND 05/14] block, bfq: improve throughput boosting paolo
2014-05-27 12:42     ` paolo
2014-05-27 12:42   ` [PATCH RFC RESEND 06/14] block, bfq: modify the peak-rate estimator paolo
2014-05-27 12:42     ` paolo
2014-05-27 12:42   ` [PATCH RFC RESEND 07/14] block, bfq: add more fairness to boost throughput and reduce latency paolo
2014-05-27 12:42     ` paolo
2014-05-27 12:42   ` [PATCH RFC RESEND 08/14] block, bfq: improve responsiveness paolo
2014-05-27 12:42     ` paolo
2014-05-27 12:42   ` [PATCH RFC RESEND 09/14] block, bfq: reduce I/O latency for soft real-time applications paolo
2014-05-27 12:42     ` paolo
2014-05-27 12:42   ` [PATCH RFC RESEND 10/14] block, bfq: preserve a low latency also with NCQ-capable drives paolo
2014-05-27 12:42     ` paolo
2014-05-27 12:42   ` [PATCH RFC RESEND 11/14] block, bfq: reduce latency during request-pool saturation paolo
2014-05-27 12:42     ` paolo
2014-05-27 12:42   ` [PATCH RFC RESEND 12/14] block, bfq: add Early Queue Merge (EQM) paolo
2014-05-27 12:42   ` [PATCH RFC RESEND 13/14] block, bfq: boost the throughput on NCQ-capable flash-based devices paolo
2014-05-27 12:42     ` paolo
2014-05-27 12:42   ` [PATCH RFC RESEND 14/14] block, bfq: boost the throughput with random I/O on NCQ-capable HDDs paolo
2014-05-30 15:32   ` [PATCH RFC RESEND 00/14] New version of the BFQ I/O Scheduler Vivek Goyal
2014-05-30 15:32     ` Vivek Goyal
2014-05-30 16:16     ` Tejun Heo
2014-05-30 16:16       ` Tejun Heo
     [not found]       ` <20140530161650.GH24871-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-30 17:09         ` Vivek Goyal
2014-05-30 17:09           ` Vivek Goyal
     [not found]           ` <20140530170958.GF16605-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-05-30 17:26             ` Tejun Heo
2014-05-30 17:26               ` Tejun Heo
     [not found]               ` <20140530172609.GI24871-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2014-05-30 17:55                 ` Vivek Goyal
2014-05-30 17:55                   ` Vivek Goyal
     [not found]                   ` <20140530175527.GH16605-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-05-30 17:59                     ` Tejun Heo
2014-05-30 17:59                   ` Tejun Heo
2014-05-30 17:59                     ` Tejun Heo
2014-05-30 23:33             ` Paolo Valente
2014-05-30 23:33               ` Paolo Valente
     [not found]     ` <20140530153228.GE16605-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-05-30 16:16       ` Tejun Heo
2014-05-30 17:31   ` Vivek Goyal
2014-05-30 17:31     ` Vivek Goyal
     [not found]     ` <20140530173146.GG16605-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-05-30 17:39       ` Tejun Heo
2014-05-30 17:39         ` Tejun Heo
