* [RFC] IO Controller
@ 2009-03-12  1:56 ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers
  Cc: vgoyal, akpm, menage, peterz



Hi All,

Here is another posting for IO controller patches. Last time I had posted
RFC patches for an IO controller which did bio control per cgroup.

http://lkml.org/lkml/2008/11/6/227

One of the takeaways from the discussion in that thread was that we should
implement a common layer containing the proportional weight scheduling code,
which can be shared by all the IO schedulers.

Implementing the IO controller this way will not cover devices which don't
use IO schedulers, but it should cover the common case.

There were more discussions regarding 2-level vs 1-level IO control at the
following link.

https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html

So in the meantime we took the discussion off the list and spent time on
making the 1-level control approach work, where the majority of the
proportional weight control code is shared by the four schedulers instead of
each one having to replicate it. We make use of the BFQ code for fair queuing
as posted by Paolo and Fabio here.

http://lkml.org/lkml/2008/11/11/148

Details about the design and a howto have been put in the documentation patch.

I have done very basic testing of running 2 or 3 "dd" threads in different
cgroups. I wanted to get the patchset out for feedback/review before we dive
into more bug fixing, benchmarking, optimizations etc.

Your feedback/comments are welcome.

The patch series contains 10 patches. It should be compilable and bootable
after every patch. The initial 2 patches implement flat fair queuing (no
cgroup support) and make CFQ use it. Later patches introduce hierarchical
fair queuing support in the elevator layer and modify the other IO schedulers
to use it.

Thanks
Vivek


* [PATCH 01/10] Documentation
@ 2009-03-12  1:56     ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers
  Cc: vgoyal, akpm, menage, peterz

o Documentation for io-controller.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
 1 files changed, 221 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/block/io-controller.txt

diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
index 0000000..8884c5a
--- /dev/null
+++ b/Documentation/block/io-controller.txt
@@ -0,0 +1,221 @@
+				IO Controller
+				=============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is, one
+can create cgroups and assign prio/weights to those cgroups, and each task
+group will get access to the disk in proportion to the weight of the group.
+
+These patches modify the elevator layer and individual IO schedulers to do
+IO control, hence this io controller works only on block devices which use
+one of the standard io schedulers; it can not be used with an arbitrary
+logical block device.
+
+The assumption behind modifying the IO schedulers is that resource control
+is needed only on the leaf nodes, where the actual contention for resources
+is present, and not on intermediate logical block devices.
+
+Consider the following hypothetical scenario. Let's say there are three
+physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1)
+have been created on top of these; some part of sdb is in lv0 and some in lv1.
+
+			    lv0      lv1
+			  /	\  /     \
+			sda      sdb      sdc
+
+Also consider the following cgroup hierarchy:
+
+				root
+				/   \
+			       A     B
+			      / \    / \
+			     T1 T2  T3  T4
+
+A and B are two cgroups, and T1, T2, T3 and T4 are tasks within those cgroups.
+Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1, these tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on the intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
+only, there will not be any contention for resources between groups A and B
+if the IO is going to sda or sdc. But if the actual IO gets translated to
+disk sdb, then the IO scheduler associated with sdb will distribute disk
+bandwidth to groups A and B in proportion to their weights.
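+
+For example, with the ioprio-to-weight mapping used by this patchset
+(weight = IOPRIO_BE_NR - ioprio, i.e. weight = 8 - ioprio), a group with
+ioprio 0 gets weight 8 and a group with ioprio 4 gets weight 4, so while both
+groups are backlogged on sdb they should see disk time in roughly a 2:1 ratio.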
+
+CFQ already has the notion of fairness and provides differential disk
+access based on the priority and class of the task. It is just that it is
+flat, and with cgroups in the picture, it needs to be made hierarchical.
+
+The rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code which has been moved to a common layer (elevator
+layer). Hence we don't end up replicating code across IO schedulers.
+
+Design
+======
+This patchset primarily uses the BFQ (Budget Fair Queuing) code to provide
+fairness among different IO queues. Fabio and Paolo implemented BFQ, which
+uses the B-WF2Q+ algorithm for fair queuing.
+
+Why BFQ?
+
+- Not sure if the weighted round robin logic of CFQ can be easily extended
+  for hierarchical mode. One of the issues is that we can not keep dividing
+  the time slice of a parent group among its children; the deeper we go in
+  the hierarchy, the smaller the time slice gets.
+
+  One of the ways to implement hierarchical support could be to keep track
+  of the virtual time and service provided to each queue/group and select a
+  queue/group for service based on one of the various available algorithms.
+
+  BFQ already had support for hierarchical scheduling, so taking those
+  patches was easier.
+
+- BFQ was designed to provide tighter bounds on the delay w.r.t. the service
+  provided to a queue. Delay/jitter with BFQ is supposed to be O(1).
+
+  Note: BFQ originally used the amount of IO done (number of sectors) as the
+        notion of service provided. IOW, it tried to provide fairness in
+        terms of the actual IO done and not in terms of the actual time disk
+        access was given to a queue.
+
+        This patchset modifies BFQ to provide fairness in the time domain,
+        because that's what CFQ does. So the idea was to try not to deviate
+        too much from the CFQ behavior initially.
+
+        Providing fairness in the time domain makes accounting tricky,
+        because with command queueing there might be multiple requests from
+        different queues on the device at the same time, and there is no
+        easy way to find out how much disk time was actually consumed by the
+        requests of a particular queue. More about this in the comments in
+        the source code.
+
+So it is yet to be seen whether changing to the time domain still retains the
+BFQ guarantees or not.
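+
+As a rough sketch of how B-WF2Q+ works (ignoring the fixed point scaling done
+in the actual code): every entity gets a virtual start and finish timestamp,
+with finish = start + service/weight, and the scheduler repeatedly picks,
+among the entities whose start time does not exceed the tree's virtual time,
+the one with the smallest finish time. For example, two continuously
+backlogged queues with weights 8 and 4 that are charged 100ms slices see
+their finish timestamps advance in steps of 12.5 and 25 respectively, so over
+time the weight 8 queue is selected twice as often and receives roughly twice
+the disk time.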
+
+From a data structure point of view, one can think of a tree per device, from
+which io groups and io queues hang and are scheduled using the B-WF2Q+
+algorithm. An io_queue is the end queue where requests are actually stored
+and dispatched from (like cfqq).
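+
+Roughly, the main objects introduced by the following patches look like this
+(an abridged sketch, not the full definitions; see block/elevator-fq.h):
+
+	struct io_service_tree {	/* one tree per ioprio class */
+		struct rb_root active;	/* backlogged entities, by finish time */
+		struct rb_root idle;	/* deactivated entities whose finish
+					   time is still in the future */
+		bfq_timestamp_t vtime;	/* virtual time of this tree */
+		bfq_weight_t wsum;	/* sum of weights of active entities */
+	};
+
+	struct io_entity {		/* schedulable node: io queue or group */
+		bfq_timestamp_t start, finish;	/* virtual timestamps */
+		bfq_weight_t weight;		/* derived from ioprio */
+		bfq_service_t budget, service;
+	};
+
+	struct io_queue {		/* leaf node, like cfqq */
+		struct io_entity entity;
+		void *sched_queue;	/* io scheduler's private queue */
+	};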
+
+These io queues are primarily created and managed by the end io schedulers,
+depending on their semantics. For example, the noop, deadline and AS
+ioschedulers keep one io queue per cgroup, and CFQ keeps one io queue per
+io_context in a cgroup (apart from async queues).
+
+A request is mapped to an io group by the elevator layer, and which io queue
+within the group it is mapped to depends on the ioscheduler. Currently the
+"current" task is used to determine the cgroup (hence io group) of the
+request. Down the line we need to make use of the bio-cgroup patches to map
+delayed writes to the right group.
+
+Going back to old behavior
+==========================
+In the new scheme of things we are essentially creating hierarchical fair
+queuing logic in the elevator layer and changing the IO schedulers to make
+use of that logic, so that the end IO schedulers start supporting
+hierarchical scheduling.
+
+The elevator layer continues to support the old interfaces. So even if fair
+queuing is enabled at the elevator layer, one can have both the new
+hierarchical schedulers and the old non-hierarchical schedulers operating.
+
+Also, noop, deadline and AS have the option of enabling hierarchical
+scheduling. If it is selected, fair queuing is done in a hierarchical manner.
+If hierarchical scheduling is disabled, noop, deadline and AS should retain
+their existing behavior.
+
+CFQ is the only exception where one can not disable fair queuing, as it is
+needed for providing fairness among various threads even in non-hierarchical
+mode.
+
+Various user visible config options
+===================================
+CONFIG_IOSCHED_NOOP_HIER
+	- Enables hierarchical fair queuing in noop. Not selecting this option
+	  leads to the old behavior of noop.
+
+CONFIG_IOSCHED_DEADLINE_HIER
+	- Enables hierarchical fair queuing in deadline. Not selecting this
+	  option leads to the old behavior of deadline.
+
+CONFIG_IOSCHED_AS_HIER
+	- Enables hierarchical fair queuing in AS. Not selecting this option
+	  leads to the old behavior of AS.
+
+CONFIG_IOSCHED_CFQ_HIER
+	- Enables hierarchical fair queuing in CFQ. Not selecting this option
+	  still does fair queuing among the various queues, but it is flat and
+	  not hierarchical.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+
+CONFIG_ELV_FAIR_QUEUING
+	- Enables/Disables the fair queuing logic at elevator layer.
+
+CONFIG_GROUP_IOSCHED
+	- Enables/Disables hierarchical queuing and associated cgroup bits.
+
+TODO
+====
+- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
+- Convert cgroup ioprio to notion of weight.
+- Anticipatory code will need more work. It is not working properly currently
+  and needs more thought.
+- Use of bio-cgroup patches.
+- Use of Nauman's per cgroup request descriptor patches.
+
+HOWTO
+=====
+So far I have done very simple testing of running two dd threads in two
+different cgroups. Here is what you can do.
+
+- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
+	CONFIG_IOSCHED_CFQ_HIER=y
+
+- Compile and boot into the kernel, and mount the IO controller.
+
+	mount -t cgroup -o io none /cgroup
+
+- Create two cgroups
+	mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set io priority of group test1 and test2
+	echo 0 > /cgroup/test1/io.ioprio
+	echo 4 > /cgroup/test2/io.ioprio
+
+- Create two files of the same size (say 512MB each) on the same disk
+  (file1, file2) and launch two dd threads in different cgroups to read those
+  files. Make sure the right io scheduler is being used for the block device
+  where the files are present (the one you compiled in hierarchical mode).
+
+	echo 1 > /proc/sys/vm/drop_caches
+
+	dd if=/mnt/lv0/zerofile1 of=/dev/null &
+	echo $! > /cgroup/test1/tasks
+	cat /cgroup/test1/tasks
+
+	dd if=/mnt/lv0/zerofile2 of=/dev/null &
+	echo $! > /cgroup/test2/tasks
+	cat /cgroup/test2/tasks
+
+- The first dd should finish first.
+
+Some Test Results
+=================
+- Two dd threads in two cgroups with prio 0 and 4.
+
+234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
+234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
+
+- Three dd threads in three cgroups with prio 0, 4 and 4.
+
+234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
+234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
+234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
-- 
1.6.0.1



* [PATCH 02/10] Common flat fair queuing code in elevator layer
       [not found] ` <1236823015-4183-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-03-12  1:56     ` Vivek Goyal
@ 2009-03-12  1:56   ` Vivek Goyal
  2009-03-12  1:56   ` [PATCH 03/10] Modify cfq to make use of flat elevator fair queuing Vivek Goyal
                     ` (11 subsequent siblings)
  13 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, menage-hpIqsD4AKlfQT0dZR+AlfA

This is the common fair queuing code in the elevator layer, controlled by the
config option CONFIG_ELV_FAIR_QUEUING. This patch initially only introduces
flat fair queuing support, where there is only one group, the "root group",
and all tasks belong to it.

These elevator layer changes are backward compatible. That means any
ioscheduler using the old interfaces will continue to work.

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched    |   13 +
 block/Makefile           |    1 +
 block/blk-sysfs.c        |   10 +
 block/elevator-fq.c      | 1882 ++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h      |  479 ++++++++++++
 block/elevator.c         |   46 +-
 include/linux/blkdev.h   |    5 +
 include/linux/elevator.h |   48 ++
 8 files changed, 2473 insertions(+), 11 deletions(-)
 create mode 100644 block/elevator-fq.c
 create mode 100644 block/elevator-fq.h

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
 
 menu "IO Schedulers"
 
+config ELV_FAIR_QUEUING
+	bool "Elevator Fair Queuing Support"
+	default n
+	---help---
+	  Traditionally only CFQ had the notion of multiple queues and did
+	  fair queuing on its own. With cgroups and the need to control IO,
+	  even the simple io schedulers like noop, deadline and AS will now
+	  have one queue per cgroup and will need hierarchical fair queuing.
+	  Instead of every io scheduler implementing its own fair queuing
+	  logic, this option enables fair queuing in the elevator layer so
+	  that other ioschedulers can make use of it.
+	  If unsure, say N.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/Makefile b/block/Makefile
index bfe7304..6f410d5 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING)	+= elevator-fq.o
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e29ddfc..0d98c96 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -276,6 +276,13 @@ static struct queue_sysfs_entry queue_iostats_entry = {
 	.store = queue_iostats_store,
 };
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+static struct queue_sysfs_entry queue_slice_idle_entry = {
+	.attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_slice_idle_show,
+	.store = elv_slice_idle_store,
+};
+#endif
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -287,6 +294,9 @@ static struct attribute *default_attrs[] = {
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
 	&queue_iostats_entry.attr,
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	&queue_slice_idle_entry.attr,
+#endif
 	NULL,
 };
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..a8addd1
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,1882 @@
+/*
+ * BFQ: Hierarchical B-WF2Q+ scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+#include <linux/blktrace_api.h>
+
+/* Values taken from cfq */
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+int elv_slice_idle = HZ / 125;
+static struct kmem_cache *elv_ioq_pool;
+
+#define ELV_HW_QUEUE_MIN	(5)
+#define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+/* Mainly the BFQ scheduling code follows */
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(bfq_timestamp_t a, bfq_timestamp_t b)
+{
+	return (s64)(a - b) > 0;
+}
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor.
+ */
+static inline bfq_timestamp_t bfq_delta(bfq_service_t service,
+					bfq_weight_t weight)
+{
+	bfq_timestamp_t d = (bfq_timestamp_t)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct io_entity *entity,
+				   bfq_service_t service)
+{
+	BUG_ON(entity->weight == 0);
+
+	entity->finish = entity->start + bfq_delta(service, entity->weight);
+}
+
+static inline struct io_queue *io_entity_to_ioq(struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data == NULL)
+		ioq = container_of(entity, struct io_queue, entity);
+	return ioq;
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct io_entity *bfq_entity_of(struct rb_node *node)
+{
+	struct io_entity *entity = NULL;
+
+	if (node != NULL)
+		entity = rb_entry(node, struct io_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_extract(struct rb_root *root, struct io_entity *entity)
+{
+	BUG_ON(entity->tree != root);
+
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *next;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	BUG_ON(entity->tree != &st->idle);
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = bfq_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = bfq_entity_of(next);
+	}
+
+	bfq_extract(&st->idle, entity);
+
+	/* Delete queue from idle list */
+	if (ioq)
+		list_del(&ioq->queue_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct io_entity *entity)
+{
+	struct io_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	BUG_ON(entity->tree != NULL);
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct io_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of a entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function  assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct io_entity *entity,
+					struct rb_node *node)
+{
+	struct io_entity *child;
+
+	if (node != NULL) {
+		child = rb_entry(node, struct io_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value.  The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+	struct io_entity *entity = rb_entry(node, struct io_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (parent == NULL)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right != NULL)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left != NULL)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * per each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
+	if (node->rb_left != NULL)
+		node = node->rb_left;
+	else if (node->rb_right != NULL)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	return IOPRIO_BE_NR - ioprio;
+}
+
+void bfq_get_entity(struct io_entity *entity)
+{
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (ioq)
+		elv_get_ioq(ioq);
+}
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (node->rb_right == NULL && node->rb_left == NULL)
+		deepest = rb_parent(node);
+	else if (node->rb_right == NULL)
+		deepest = node->rb_left;
+	else if (node->rb_left == NULL)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right != NULL)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_extract(&st->active, entity);
+
+	if (node != NULL)
+		bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+
+	/* Add this queue to idle list */
+	if (ioq)
+		list_add(&ioq->queue_list, &ioq->efqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(!entity->on_st);
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	ioq = io_entity_to_ioq(entity);
+	if (!ioq)
+		return;
+	elv_put_ioq(ioq);
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+void bfq_put_idle_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	bfq_idle_extract(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+void bfq_forget_idle(struct io_service_tree *st)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * Active tree is empty. Pull back vtime to finish time of
+		 * last idle entity on idle tree.
+		 * The rationale seems to be that it reduces the possibility of
+		 * vtime wraparound (bfq_gt(V-F) < 0).
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+
+static struct io_service_tree *
+__bfq_entity_update_prio(struct io_service_tree *old_st,
+				struct io_entity *entity)
+{
+	struct io_service_tree *new_st = old_st;
+
+	if (entity->ioprio_changed) {
+		entity->ioprio = entity->new_ioprio;
+		entity->ioprio_class = entity->new_ioprio_class;
+		entity->ioprio_changed = 0;
+
+		old_st->wsum -= entity->weight;
+		entity->weight = bfq_ioprio_to_weight(entity->ioprio);
+
+		/*
+		 * NOTE: here we may be changing the weight too early,
+		 * this will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = io_entity_service_tree(entity);
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the current budget of the entity (and the
+ * service received if @entity is active) of the queue to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct io_entity *entity)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	if (entity == sd->active_entity) {
+		BUG_ON(entity->tree != NULL);
+		/*
+		 * If we are requeueing the current entity we have
+		 * to take care of not charging to it service it has
+		 * not received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_active entity below it.  We reuse the old
+		 * start time.
+		 */
+		bfq_active_extract(st, entity);
+	} else if (entity->tree == &st->idle) {
+		/*
+		 * Must be on the idle tree, bfq_idle_extract() will
+		 * check for that.
+		 */
+		bfq_idle_extract(st, entity);
+		entity->start = bfq_gt(st->vtime, entity->finish) ?
+				       st->vtime : entity->finish;
+	} else {
+		/*
+		 * The finish time of the entity may be invalid, and
+		 * it is in the past for sure, otherwise the queue
+		 * would have been on the idle tree.
+		 */
+		entity->start = st->vtime;
+		st->wsum += entity->weight;
+		bfq_get_entity(entity);
+
+		BUG_ON(entity->on_st);
+		entity->on_st = 1;
+	}
+
+	st = __bfq_entity_update_prio(st, entity);
+	bfq_calc_finish(entity, entity->budget);
+	bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity.
+ * @entity: the entity to activate.
+ */
+void bfq_activate_entity(struct io_entity *entity)
+{
+	__bfq_activate_entity(entity);
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently from its previous state.  If the
+ * entity was not on a service tree just return, otherwise if it is on
+ * any scheduler tree, extract it from that tree, and if necessary
+ * and if the caller did not specify @requeue, put it on the idle tree.
+ *
+ */
+int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+	int was_active = entity == sd->active_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	BUG_ON(was_active && entity->tree != NULL);
+
+	if (was_active) {
+		bfq_calc_finish(entity, entity->service);
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_extract(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_extract(st, entity);
+	else if (entity->tree != NULL)
+		BUG();
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	BUG_ON(sd->active_entity == entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+void bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	__bfq_deactivate_entity(entity, requeue);
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated tasks getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct io_service_tree *st)
+{
+	struct io_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
+	entry = rb_entry(node, struct io_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active - find the eligible entity with the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches the first schedulable entity, starting from the
+ * root of the tree and going on the left every time on this side there is
+ * a subtree with at least one eligible (start <= vtime) entity.  The path
+ * on the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct io_entity *bfq_first_active_entity(struct io_service_tree *st)
+{
+	struct io_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
+	while (node != NULL) {
+		entry = rb_entry(node, struct io_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		BUG_ON(bfq_gt(entry->min_start, st->vtime));
+
+		if (node->rb_left != NULL) {
+			entry = rb_entry(node->rb_left,
+					 struct io_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first != NULL)
+			break;
+		node = node->rb_right;
+	}
+
+	BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
+	return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct io_entity *__bfq_lookup_next_entity(struct io_service_tree *st)
+{
+	struct io_entity *entity;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+	BUG_ON(bfq_gt(entity->start, st->vtime));
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_active entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort just returning the cached next_active value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract)
+{
+	struct io_service_tree *st = sd->service_tree;
+	struct io_entity *entity;
+	int i;
+
+	/*
+	 * One can check which entity will be selected next without
+	 * expiring the current one.
+	 */
+	BUG_ON(extract && sd->active_entity != NULL);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+		entity = __bfq_lookup_next_entity(st);
+		if (entity != NULL) {
+			if (extract) {
+				bfq_active_extract(st, entity);
+				sd->active_entity = entity;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
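+/**
+ * entity_served - account for service received by an entity.
+ * @entity: the entity that received service.
+ * @served: the amount of service received.
+ *
+ * Charge @served to @entity and advance the virtual time of the owning
+ * service tree by served/wsum, i.e. in proportion to the total weight of
+ * the entities currently backlogged on that tree.
+ */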
+void entity_served(struct io_entity *entity, bfq_service_t served)
+{
+	struct io_service_tree *st;
+
+	st = io_entity_service_tree(entity);
+	entity->service += served;
+	WARN_ON_ONCE(entity->service > entity->budget);
+	BUG_ON(st->wsum == 0);
+	st->vtime += bfq_delta(served, st->wsum);
+	bfq_forget_idle(st);
+}
+
+/* Elevator fair queuing function */
+struct io_queue *rq_ioq(struct request *rq)
+{
+	return rq->ioq;
+}
+
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+	return e->efqd.active_queue;
+}
+
+void *elv_active_sched_queue(struct elevator_queue *e)
+{
+	return ioq_sched_queue(elv_active_ioq(e));
+}
+EXPORT_SYMBOL(elv_active_sched_queue);
+
+int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_ioq);
+
+int elv_nr_busy_rt_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_rt_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_rt_ioq);
+
+int elv_hw_tag(struct elevator_queue *e)
+{
+	return e->efqd.hw_tag;
+}
+EXPORT_SYMBOL(elv_hw_tag);
+
+/* Helper functions for operating on elevator idle slice timer */
+int elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return mod_timer(&efqd->idle_slice_timer, expires);
+}
+EXPORT_SYMBOL(elv_mod_idle_slice_timer);
+
+int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return del_timer(&efqd->idle_slice_timer);
+}
+EXPORT_SYMBOL(elv_del_idle_slice_timer);
+
+unsigned int elv_get_slice_idle(struct elevator_queue *eq)
+{
+	return eq->efqd.elv_slice_idle;
+}
+EXPORT_SYMBOL(elv_get_slice_idle);
+
+void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+
+	entity_served(&ioq->entity, served);
+
+	ioq->total_service += served;
+	elv_log_ioq(efqd, ioq, "ioq served=0x%lx total service=0x%lx", served,
+			ioq->total_service);
+}
+
+/* Functions to show and store the elv_slice_idle value through sysfs */
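+/* Exposed as the "slice_idle" request queue sysfs attribute, in milliseconds */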
+ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = jiffies_to_msecs(efqd->elv_slice_idle);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data < 0)
+		data = 0;
+	else if (data > INT_MAX)
+		data = INT_MAX;
+
+	data = msecs_to_jiffies(data);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->elv_slice_idle = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (elv_nr_busy_ioq(q->elevator)) {
+		elv_log(efqd, "schedule dispatch");
+		kblockd_schedule_work(efqd->queue, &efqd->unplug_work);
+	}
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
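+/*
+ * Work handler used by elv_schedule_dispatch() to restart request dispatching
+ * on the queue.
+ */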
+void elv_kick_queue(struct work_struct *work)
+{
+	struct elv_fq_data *efqd =
+		container_of(work, struct elv_fq_data, unplug_work);
+	struct request_queue *q = efqd->queue;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	blk_start_queueing(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+	del_timer_sync(&e->efqd.idle_slice_timer);
+	cancel_work_sync(&e->efqd.unplug_work);
+}
+EXPORT_SYMBOL(elv_shutdown_timer_wq);
+
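+/* Set the end of the queue's time slice based on its current budget. */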
+void elv_ioq_set_prio_slice(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	ioq->slice_end = jiffies + ioq->entity.budget;
+	elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->entity.budget);
+}
+
+void elv_ioq_init_prio_data(struct io_queue *ioq, int ioprio_class, int ioprio)
+{
+	struct io_entity *entity = &ioq->entity;
+
+	entity->new_ioprio_class = ioprio_class;
+	entity->new_ioprio = ioprio;
+	entity->ioprio_changed = 1;
+	return;
+}
+
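+/*
+ * Update a decaying average of the queue's "think time", measured as the gap
+ * between the completion of its last request (last_end_request) and now,
+ * using the same fixed point scheme as CFQ. The mean think time is used by
+ * elv_ioq_update_idle_window() to decide whether idling on this queue is
+ * worthwhile.
+ */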
+static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	unsigned long elapsed = jiffies - ioq->last_end_request;
+	unsigned long ttime = min(elapsed, 2UL * efqd->elv_slice_idle);
+
+	ioq->ttime_samples = (7*ioq->ttime_samples + 256) / 8;
+	ioq->ttime_total = (7*ioq->ttime_total + 256*ttime) / 8;
+	ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
+}
+
+/*
+ * Disable idle window if the process thinks too long.
+ * This idle flag can also be updated by io scheduler.
+ */
+static void elv_ioq_update_idle_window(struct elevator_queue *eq,
+				struct io_queue *ioq, struct request *rq)
+{
+	int old_idle, enable_idle;
+	struct elv_fq_data *efqd = ioq->efqd;
+
+	/*
+	 * Don't idle for async or idle io prio class
+	 */
+	if (!elv_ioq_sync(ioq) || elv_ioq_class_idle(ioq))
+		return;
+
+	enable_idle = old_idle = elv_ioq_idle_window(ioq);
+
+	if (!efqd->elv_slice_idle)
+		enable_idle = 0;
+	else if (ioq_sample_valid(ioq->ttime_samples)) {
+		if (ioq->ttime_mean > efqd->elv_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+
+	/*
+	 * From a think time perspective idling should be enabled. Check with
+	 * the io scheduler if it wants to disable idling based on additional
+	 * considerations like the seek pattern.
+	 */
+	if (enable_idle) {
+		if (eq->ops->elevator_update_idle_window_fn)
+			enable_idle = eq->ops->elevator_update_idle_window_fn(
+						eq, ioq->sched_queue, rq);
+		if (!enable_idle)
+			elv_log_ioq(efqd, ioq, "iosched disabled idle");
+	}
+
+	if (old_idle != enable_idle) {
+		elv_log_ioq(efqd, ioq, "idle=%d", enable_idle);
+		if (enable_idle)
+			elv_mark_ioq_idle_window(ioq);
+		else
+			elv_clear_ioq_idle_window(ioq);
+	}
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+	struct io_queue *ioq = NULL;
+
+	ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+	return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+	kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+			void *sched_queue, int ioprio_class, int ioprio,
+			int is_sync)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+	struct io_group *iog = io_lookup_io_group_current(efqd->queue);
+
+	RB_CLEAR_NODE(&ioq->entity.rb_node);
+	atomic_set(&ioq->ref, 0);
+	ioq->efqd = efqd;
+	ioq->entity.budget = efqd->elv_slice[is_sync];
+	elv_ioq_set_ioprio_class(ioq, ioprio_class);
+	elv_ioq_set_ioprio(ioq, ioprio);
+	ioq->pid = current->pid;
+	ioq->sched_queue = sched_queue;
+	elv_mark_ioq_idle_window(ioq);
+	bfq_init_entity(&ioq->entity, iog);
+	return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
+						efqd);
+
+	BUG_ON(atomic_read(&ioq->ref) <= 0);
+	if (!atomic_dec_and_test(&ioq->ref))
+		return;
+	BUG_ON(ioq->nr_queued);
+	BUG_ON(ioq->entity.tree != NULL);
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(efqd->active_queue == ioq);
+
+	/* Can be called by outgoing elevator. Don't use q */
+	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+
+	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+	elv_log_ioq(efqd, ioq, "freed");
+	elv_free_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_put_ioq);
+
+void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+	struct io_queue *ioq = *ioq_ptr;
+
+	if (ioq != NULL) {
+		/* Drop the reference taken by the io group */
+		elv_put_ioq(ioq);
+		*ioq_ptr = NULL;
+	}
+}
+
+/* Get next queue for service. */
+struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = NULL;
+	struct io_queue *ioq = NULL;
+	struct io_sched_data *sd;
+
+	/*
+	 * One can check which queue will be selected next while another
+	 * queue is active. The preempt logic uses it.
+	 */
+	BUG_ON(extract && efqd->active_queue != NULL);
+
+	if (!efqd->busy_queues)
+		return NULL;
+
+	sd = &efqd->root_group->sched_data;
+	if (extract)
+		entity = bfq_lookup_next_entity(sd, 1);
+	else
+		entity = bfq_lookup_next_entity(sd, 0);
+
+	BUG_ON(!entity);
+	if (extract)
+		entity->service = 0;
+	ioq = io_entity_to_ioq(entity);
+
+	return ioq;
+}
+
+static void __elv_set_active_ioq(struct elv_fq_data *efqd,
+					struct io_queue *ioq)
+{
+	struct request_queue *q = efqd->queue;
+
+	if (ioq) {
+		elv_log_ioq(efqd, ioq, "set_active, busy_queues=%d",
+							efqd->busy_queues);
+		ioq->slice_end = 0;
+		elv_mark_ioq_slice_new(ioq);
+	}
+
+	efqd->active_queue = ioq;
+
+	/* Let iosched know if it wants to take some action */
+	if (ioq) {
+		if (q->elevator->ops->elevator_active_ioq_set_fn)
+			q->elevator->ops->elevator_active_ioq_set_fn(q,
+							ioq->sched_queue);
+	}
+}
+
+/* Get and set a new active queue for service. */
+struct io_queue *elv_set_active_ioq(struct request_queue *q)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	ioq = elv_get_next_ioq(q, 1);
+	__elv_set_active_ioq(efqd, ioq);
+	return ioq;
+}
+
+void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+	struct request_queue *q = efqd->queue;
+
+	if (q->elevator->ops->elevator_active_ioq_reset_fn)
+		q->elevator->ops->elevator_active_ioq_reset_fn(q);
+	efqd->active_queue = NULL;
+	del_timer(&efqd->idle_slice_timer);
+}
+
+void elv_activate_ioq(struct io_queue *ioq)
+{
+	bfq_activate_entity(&ioq->entity);
+}
+
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue)
+{
+	if (ioq == efqd->active_queue)
+		elv_reset_active_ioq(efqd);
+
+	bfq_deactivate_entity(&ioq->entity, requeue);
+}
+
+/* Called when an inactive queue receives a new request. */
+void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(ioq == efqd->active_queue);
+	elv_log_ioq(efqd, ioq, "add to busy");
+	elv_activate_ioq(ioq);
+	elv_mark_ioq_busy(ioq);
+	efqd->busy_queues++;
+	if (elv_ioq_class_rt(ioq))
+		efqd->busy_rt_queues++;
+}
+
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	BUG_ON(!elv_ioq_busy(ioq));
+	BUG_ON(ioq->nr_queued);
+	elv_log_ioq(efqd, ioq, "del from busy");
+	elv_clear_ioq_busy(ioq);
+	BUG_ON(efqd->busy_queues == 0);
+	efqd->busy_queues--;
+	if (elv_ioq_class_rt(ioq))
+		efqd->busy_rt_queues--;
+
+	elv_deactivate_ioq(efqd, ioq, requeue);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * the current queue used and adjust the start/finish time of the queue and
+ * the vtime of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations. Especially when the underlying device supports command
+ * queuing and requests from multiple queues can be outstanding at the same
+ * time, it is not clear which queue consumed how much of the disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of a queue only
+ * after the first request from the queue has completed. This does not work
+ * very well if we expire the queue before waiting for the first (and
+ * further) requests from the queue to finish. For seeky queues, we expire
+ * the queue after dispatching a few requests, without waiting, and start
+ * dispatching from the next queue.
+ *
+ * It is not clear how to determine the time consumed by a queue in such
+ * scenarios. Currently, as a crude approximation, we charge 25% of the time
+ * slice for such cases. A better mechanism is needed for accurate
+ * accounting.
+ */
+void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq,
+				int budget_update)
+{
+	struct elevator_queue *e = q->elevator;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_entity *entity = &ioq->entity;
+	unsigned long slice, step, min_slice;
+	long slice_unused, slice_used;
+
+	assert_spin_locked(q->queue_lock);
+	elv_log_ioq(efqd, ioq, "slice expired t=%d", budget_update);
+
+	if (elv_ioq_wait_request(ioq))
+		del_timer(&efqd->idle_slice_timer);
+
+	elv_clear_ioq_wait_request(ioq);
+
+	/*
+	 * If ioq->slice_end == 0, the queue was expired before the first
+	 * request from it completed. Of course we are not planning to idle
+	 * on the queue, otherwise we would not have expired it.
+	 *
+	 * Charge for 25% of the slice in such cases. This is not the best
+	 * thing to do, but it is not clear what the next best thing would
+	 * be.
+	 *
+	 * This arises from the fact that we don't have the notion of only
+	 * one queue being operational at a time. The io scheduler can
+	 * dispatch requests from multiple queues in one dispatch round.
+	 * Ideally, for more accurate accounting of the disk time used, one
+	 * should dispatch requests from only one queue and wait for all of
+	 * its requests to finish. But this would reduce throughput.
+	 */
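+	/*
+	 * For illustration only, assuming HZ=1000: a sync queue starts with
+	 * a budget of elv_slice_sync = HZ/10 = 100 jiffies, so expiring it
+	 * before the first completion charges it budget/4 = 25 jiffies.
+	 */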
+	if (!ioq->slice_end)
+		slice_unused = 3*entity->budget/4;
+	else {
+		slice_unused = ioq->slice_end - jiffies;
+		/*
+		 * The queue consumed more slice than it was allocated.
+		 * Currently we still charge it only for the allocated slice
+		 * and not for the consumed slice, as charging more could
+		 * increase the latency of when this queue is scheduled next.
+		 *
+		 * Maybe we can live with a little bit of unfairness. How to
+		 * handle this correctly is still an open problem.
+		 */
+		if (slice_unused < 0)
+			slice_unused = 0;
+
+		if (slice_unused == entity->budget) {
+			/*
+			 * The queue got expired immediately after completing
+			 * its first request. Charge 25% of the slice.
+			 */
+			slice_unused = (3*entity->budget)/4;
+		}
+
+	}
+
+	slice_used = entity->budget - slice_unused;
+	elv_ioq_served(ioq, slice_used);
+
+	if (budget_update && !elv_ioq_slice_new(ioq)) {
+		slice = efqd->elv_slice[elv_ioq_sync(ioq)];
+		step = slice / 16;
+		min_slice = slice - slice / 4;
+
+		/*
+		 * Try to adapt the slice length to the behavior of the
+		 * queue.  If it has not exhausted the assigned budget,
+		 * assign it a shorter new one, otherwise assign it a
+		 * longer new one.  The increments/decrements are done
+		 * linearly with a step of elv_slice / 16, and slices of
+		 * less than 11 / 16 * elv_slice are never assigned, to
+		 * avoid performance degradation.
+		 */
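+		/*
+		 * For illustration only, assuming HZ=1000: a 100 jiffy sync
+		 * slice gives step = 6 and min_slice = 75, so the budget is
+		 * kept roughly within [69, 100] jiffies, i.e. never below
+		 * about 11/16 of the base slice.
+		 */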
+		if (slice_unused != 0 && entity->budget >= min_slice)
+			entity->budget -= step;
+		else if (slice_unused == 0 && entity->budget <= slice - step)
+			entity->budget += step;
+
+		elv_log_ioq(efqd, ioq, "slice_unused=%ld, budget=%ld",
+					slice_unused, entity->budget);
+	}
+
+	BUG_ON(ioq != efqd->active_queue);
+	elv_reset_active_ioq(efqd);
+
+	if (!ioq->nr_queued)
+		elv_del_ioq_busy(e, ioq, 1);
+	else
+		elv_activate_ioq(ioq);
+}
+EXPORT_SYMBOL(__elv_ioq_slice_expired);
+
+/*
+ * budget_update signifies whether budget increment/decrement accounting
+ * should be done on this queue for this expiry.
+ * In some circumstances, like preemption or forced dispatch, it might
+ * not make much sense to adjust budgets.
+ */
+void elv_ioq_slice_expired(struct request_queue *q, int budget_update)
+{
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (ioq)
+		__elv_ioq_slice_expired(q, ioq, budget_update);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no, or if we aren't sure; a 1 will cause a preemption attempt.
+ */
+int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+			struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elevator_queue *eq = q->elevator;
+
+	ioq = elv_active_ioq(eq);
+
+	if (!ioq)
+		return 0;
+
+	if (elv_ioq_slice_used(ioq))
+		return 1;
+
+	if (elv_ioq_class_idle(new_ioq))
+		return 0;
+
+	if (elv_ioq_class_idle(ioq))
+		return 1;
+
+	/*
+	 * Allow an RT request to preempt an ongoing non-RT queue's timeslice.
+	 */
+	if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
+		return 1;
+
+	/*
+	 * Check with io scheduler if it has additional criterion based on
+	 * which it wants to preempt existing queue.
+	 */
+	if (eq->ops->elevator_should_preempt_fn)
+		return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
+
+	return 0;
+}
+
+int elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	struct io_queue *new_ioq;
+
+	elv_log_ioq(&q->elevator->efqd, ioq, "preemption attempt");
+
+	new_ioq = elv_get_next_ioq(q, 0);
+	if (new_ioq == ioq) {
+		/*
+		 * We might need expire_ioq logic here to check with the io
+		 * scheduler whether the queue can be preempted. This might
+		 * not be needed for cfq, but AS might need it.
+		 */
+		elv_ioq_slice_expired(q, 0);
+		elv_ioq_set_slice_end(ioq, 0);
+		elv_mark_ioq_slice_new(ioq);
+		return 1;
+	}
+
+	return 0;
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	BUG_ON(!efqd);
+	BUG_ON(!ioq);
+	efqd->rq_queued++;
+	ioq->nr_queued++;
+
+	if (!elv_ioq_busy(ioq))
+		elv_add_ioq_busy(efqd, ioq);
+
+	elv_ioq_update_io_thinktime(ioq);
+	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+
+	if (ioq == elv_active_ioq(q->elevator)) {
+		/*
+		 * if we are waiting for a request for this queue, let it rip
+		 * immediately and flag that we must not expire this queue
+		 * just now
+		 */
+		if (elv_ioq_wait_request(ioq)) {
+			del_timer(&efqd->idle_slice_timer);
+			blk_start_queueing(q);
+		}
+	} else if (elv_should_preempt(q, ioq, rq)) {
+		/*
+		 * Not the active queue - expire the current slice if it is
+		 * idle and has expired its mean thinktime, or this new queue
+		 * has some old slice time left and is of higher priority, or
+		 * this new queue is RT and the current one is BE.
+		 */
+		/*
+		 * Try to preempt the active queue; we still respect the
+		 * scheduler's decision, so we try to reschedule, but if the
+		 * queue has received more service than allocated, the
+		 * scheduler will refuse the preemption.
+		 */
+		if (elv_preempt_queue(q, ioq))
+			blk_start_queueing(q);
+	}
+}
+
+void elv_idle_slice_timer(unsigned long data)
+{
+	struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+	struct io_queue *ioq;
+	unsigned long flags;
+	struct request_queue *q = efqd->queue;
+
+	elv_log(efqd, "idle timer fired");
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	ioq = efqd->active_queue;
+
+	if (ioq) {
+
+		/*
+		 * expired
+		 */
+		if (elv_ioq_slice_used(ioq))
+			goto expire;
+
+		/*
+		 * only expire and reinvoke request handler, if there are
+		 * other queues with pending requests
+		 */
+		if (!elv_nr_busy_ioq(q->elevator))
+			goto out_cont;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (ioq->nr_queued)
+			goto out_kick;
+	}
+expire:
+	elv_ioq_slice_expired(q, 1);
+out_kick:
+	elv_schedule_dispatch(q);
+out_cont:
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	unsigned long sl;
+
+	BUG_ON(!ioq);
+
+	/*
+	 * SSD device without seek penalty, disable idling. But only do so
+	 * for devices that support queuing, otherwise we still have a problem
+	 * with sync vs async workloads.
+	 */
+	if (blk_queue_nonrot(q) && efqd->hw_tag)
+		return;
+
+	/*
+	 * still requests with the driver, don't idle
+	 */
+	if (efqd->rq_in_driver)
+		return;
+
+	/*
+	 * idle is disabled, either manually or by past process history
+	 */
+	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+		return;
+
+	/*
+	 * Maybe the iosched has its own idling logic. In that case the io
+	 * scheduler will take care of arming the timer, if need be.
+	 */
+	if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+		q->elevator->ops->elevator_arm_slice_timer_fn(q,
+						ioq->sched_queue);
+	} else {
+		elv_mark_ioq_wait_request(ioq);
+		sl = efqd->elv_slice_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log(efqd, "arm idle: %lu", sl);
+	}
+}
+
+void elv_free_idle_ioq_list(struct elevator_queue *e)
+{
+	struct io_queue *ioq, *n;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
+		elv_deactivate_ioq(efqd, ioq, 0);
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	int budget_update = 1;
+
+	if (!elv_nr_busy_ioq(q->elevator))
+		return NULL;
+
+	if (ioq == NULL)
+		goto new_queue;
+
+	/*
+	 * Force dispatch. Continue to dispatch from current queue as long
+	 * as it has requests.
+	 */
+	if (unlikely(force)) {
+		if (ioq->nr_queued)
+			goto keep_queue;
+		else {
+			/*
+			 * Don't try to update the queue's budget based on
+			 * forced dispatch behavior.
+			 */
+			budget_update = 0;
+			goto expire;
+		}
+	}
+
+	/*
+	 * The active queue has run out of time, expire it and select new.
+	 */
+	if (elv_ioq_slice_used(ioq))
+		goto expire;
+
+	/*
+	 * If we have an RT queue waiting, then we preempt the current non-RT
+	 * queue.
+	 */
+	if (!elv_ioq_class_rt(ioq) && efqd->busy_rt_queues) {
+		/*
+		 * We simulate this as if the queue timed out so that it gets
+		 * to bank the remainder of its time slice.
+		 */
+		elv_log_ioq(efqd, ioq, "preempt");
+
+		/* Don't do budget adjustments for queue being preempted. */
+		budget_update = 0;
+		goto expire;
+	}
+
+	/*
+	 * The active queue has requests and isn't expired, allow it to
+	 * dispatch.
+	 */
+
+	if (ioq->nr_queued)
+		goto keep_queue;
+
+	/*
+	 * No requests pending. If the active queue still has requests in
+	 * flight or is idling for a new request, allow either of these
+	 * conditions to happen (or time out) before selecting a new queue.
+	 */
+
+	if (timer_pending(&efqd->idle_slice_timer) ||
+	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
+expire:
+	elv_ioq_slice_expired(q, budget_update);
+new_queue:
+	ioq = elv_set_active_ioq(q);
+keep_queue:
+	return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	ioq = rq->ioq;
+	BUG_ON(!ioq);
+	ioq->nr_queued--;
+
+	efqd = ioq->efqd;
+	BUG_ON(!efqd);
+	efqd->rq_queued--;
+
+	if (elv_ioq_busy(ioq) && (elv_active_ioq(e) != ioq) && !ioq->nr_queued)
+		elv_del_ioq_busy(e, ioq, 1);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	BUG_ON(!ioq);
+	elv_ioq_request_dispatched(ioq);
+	elv_ioq_request_removed(e, rq);
+}
+
+void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	efqd->rq_in_driver++;
+	elv_log_ioq(efqd, rq_ioq(rq), "activate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+	elv_log_ioq(efqd, rq_ioq(rq), "deactivate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+/*
+ * Update hw_tag based on peak queue depth over 50 samples under
+ * sufficient load.
+ */
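+/*
+ * For illustration: with ELV_HW_QUEUE_MIN at 5, hw_tag ends up set only if
+ * the in-driver request count has peaked at 5 or more by the time 50
+ * samples have been taken under sufficient load (more than 5 requests
+ * queued or in the driver).
+ */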
+static void elv_update_hw_tag(struct elv_fq_data *efqd)
+{
+	if (efqd->rq_in_driver > efqd->rq_in_driver_peak)
+		efqd->rq_in_driver_peak = efqd->rq_in_driver;
+
+	if (efqd->rq_queued <= ELV_HW_QUEUE_MIN &&
+	    efqd->rq_in_driver <= ELV_HW_QUEUE_MIN)
+		return;
+
+	if (efqd->hw_tag_samples++ < 50)
+		return;
+
+	if (efqd->rq_in_driver_peak >= ELV_HW_QUEUE_MIN)
+		efqd->hw_tag = 1;
+	else
+		efqd->hw_tag = 0;
+
+	efqd->hw_tag_samples = 0;
+	efqd->rq_in_driver_peak = 0;
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+	const int sync = rq_is_sync(rq);
+	struct io_queue *ioq = rq->ioq;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	elv_log_ioq(efqd, ioq, "complete");
+
+	elv_update_hw_tag(efqd);
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+
+	WARN_ON(!ioq->dispatched);
+	ioq->dispatched--;
+
+	if (sync)
+		ioq->last_end_request = jiffies;
+
+	/*
+	 * If this is the active queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+
+	if (elv_active_ioq(q->elevator) == ioq) {
+		if (elv_ioq_slice_new(ioq)) {
+			elv_ioq_set_prio_slice(q, ioq);
+			elv_clear_ioq_slice_new(ioq);
+		}
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+			elv_ioq_slice_expired(q, 1);
+		else if (sync && !ioq->nr_queued)
+			elv_ioq_arm_slice_timer(q);
+	}
+
+	if (!efqd->rq_in_driver)
+		elv_schedule_dispatch(q);
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio)
+{
+	struct io_queue *ioq = NULL;
+
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		ioq = iog->async_queue[0][ioprio];
+		break;
+	case IOPRIO_CLASS_BE:
+		ioq = iog->async_queue[1][ioprio];
+		break;
+	case IOPRIO_CLASS_IDLE:
+		ioq = iog->async_idle_queue;
+		break;
+	default:
+		BUG();
+	}
+
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+EXPORT_SYMBOL(io_group_async_queue_prio);
+
+void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		iog->async_queue[0][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_BE:
+		iog->async_queue[1][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		iog->async_idle_queue = ioq;
+		break;
+	default:
+		BUG();
+	}
+
+	/*
+	 * Take the group reference and pin the queue. Group exit will
+	 * clean it up
+	 */
+	elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(io_group_set_async_queue);
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+static void elv_slab_kill(void)
+{
+	/*
+	 * Caller already ensured that pending RCU callbacks are completed,
+	 * so we should have no busy allocations at this point.
+	 */
+	if (elv_ioq_pool)
+		kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+	elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+	if (!elv_ioq_pool)
+		goto fail;
+
+	return 0;
+fail:
+	elv_slab_kill();
+	return -ENOMEM;
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	iog = io_alloc_root_group(q, e, efqd);
+	if (iog == NULL)
+		return 1;
+
+	efqd->root_group = iog;
+	efqd->queue = q;
+
+	init_timer(&efqd->idle_slice_timer);
+	efqd->idle_slice_timer.function = elv_idle_slice_timer;
+	efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+	INIT_LIST_HEAD(&efqd->idle_list);
+
+	efqd->elv_slice[0] = elv_slice_async;
+	efqd->elv_slice[1] = elv_slice_sync;
+	efqd->elv_slice_idle = elv_slice_idle;
+	efqd->hw_tag = 1;
+
+	return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask the elevator to clean up its queues, we do the cleanup here so
+ * that all the group and idle tree references to ioq are dropped. Later,
+ * during elevator cleanup, the ioc reference will be dropped, which will
+ * lead to removal of the ioscheduler queue as well as the associated ioq
+ * object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+	struct request_queue *q = efqd->queue;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+
+	spin_lock_irq(q->queue_lock);
+	/* This should drop all the idle tree references of ioq */
+	elv_free_idle_ioq_list(e);
+	spin_unlock_irq(q->queue_lock);
+
+	elv_shutdown_timer_wq(e);
+
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+	io_free_root_group(e);
+}
+
+/*
+ * This is called after the io scheduler has cleaned up its data structures.
+ * This function may not be required. Right now it is kept only because cfq
+ * cleans up its timer and work queue again after freeing up io contexts. By
+ * this point the io scheduler has already been drained and all the active
+ * queues have already been expired, so the timer and work queue should not
+ * get activated during the cleanup process.
+ *
+ * Keeping it here for the time being. Will get rid of it later.
+ */
+void elv_exit_fq_data_post(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+}
+
+
+static int __init elv_fq_init(void)
+{
+	if (elv_slab_setup())
+		return -ENOMEM;
+
+	/* could be 0 on HZ < 1000 setups */
+
+	if (!elv_slice_async)
+		elv_slice_async = 1;
+
+	if (elv_slice_idle == 0)
+		elv_slice_idle = 1;
+
+	return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..b5a0d08
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,479 @@
+/*
+ * BFQ: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ */
+
+#include <linux/blkdev.h>
+
+#ifndef _BFQ_SCHED_H
+#define _BFQ_SCHED_H
+
+#define IO_IOPRIO_CLASSES	3
+
+typedef u64 bfq_timestamp_t;
+typedef unsigned long bfq_weight_t;
+typedef unsigned long bfq_service_t;
+struct io_entity;
+struct io_queue;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/**
+ * struct bfq_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * bfq_service_tree.  All the fields are protected by the queue lock
+ * of the containing efqd.
+ */
+struct io_service_tree {
+	struct rb_root active;
+	struct rb_root idle;
+
+	struct io_entity *first_idle;
+	struct io_entity *last_idle;
+
+	bfq_timestamp_t vtime;
+	bfq_weight_t wsum;
+};
+
+/**
+ * struct bfq_sched_data - multi-class scheduler.
+ * @active_entity: entity under service.
+ * @next_active: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * bfq_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_active points to the active entity of the sched_data service
+ * trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among requests of the same
+ * queue requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io_sched_data {
+	struct io_entity *active_entity;
+	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+/**
+ * struct bfq_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ *         the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ *             this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
+ *                 associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ *                    ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested an ioprio or
+ *                  ioprio_class change.
+ *
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
+ * cgroup hierarchy) or a bfq_group into the upper level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities have also their own sched_data, stored
+ * in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would allow
+ * different weights on different devices, but this functionality is not
+ * exported to userspace by now.  Priorities are updated lazily, first
+ * storing the new values into the new_* fields, then setting the
+ * @ioprio_changed flag.  As soon as there is a transition in the entity
+ * state that allows the priority update to take place the effective and
+ * the requested priority values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.  When dealing with ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time to consume their budget and
+ * have true sequential behavior, and when there are no external factors
+ * breaking anticipation) the relative weights at each level of the
+ * cgroups hierarchy should be guaranteed.
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io_entity {
+	struct rb_node rb_node;
+
+	int on_st;
+
+	bfq_timestamp_t finish;
+	bfq_timestamp_t start;
+
+	struct rb_root *tree;
+
+	bfq_timestamp_t min_start;
+
+	bfq_service_t service, budget;
+	bfq_weight_t weight;
+
+	struct io_entity *parent;
+
+	struct io_sched_data *my_sched_data;
+	struct io_sched_data *sched_data;
+
+	unsigned short ioprio, new_ioprio;
+	unsigned short ioprio_class, new_ioprio_class;
+
+	int ioprio_changed;
+};
+
+/*
+ * A common structure embedded by every io scheduler into its respective
+ * queue structure.
+ */
+struct io_queue {
+	struct io_entity entity;
+	atomic_t ref;
+	unsigned int flags;
+
+	/* Pointer to generic elevator data structure */
+	struct elv_fq_data *efqd;
+	struct list_head queue_list;
+	pid_t pid;
+
+	/* Number of requests queued on this io queue */
+	unsigned long nr_queued;
+
+	/* Requests dispatched from this queue */
+	int dispatched;
+
+	/* Keep track of the think time of processes in this queue */
+	unsigned long last_end_request;
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+
+	unsigned long slice_end;
+
+	/* Pointer to io scheduler's queue */
+	void *sched_queue;
+
+	/*
+	 * Keeps track of the total slice time assigned to the queue, for
+	 * debugging purposes.
+	 */
+	unsigned long total_service;
+};
+
+struct io_group {
+	struct io_sched_data sched_data;
+
+	/* async_queue and idle_queue are used only for cfq */
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+};
+
+struct elv_fq_data {
+	struct io_group *root_group;
+
+	/* List of io queues on idle tree. */
+	struct list_head idle_list;
+
+	struct request_queue *queue;
+	unsigned int busy_queues;
+	/*
+	 * Used to track any pending RT requests so that we can preempt the
+	 * current non-RT queue in service when this value is non-zero.
+	 */
+	unsigned int busy_rt_queues;
+
+	/* Number of requests queued */
+	int rq_queued;
+
+	/* Pointer to the ioscheduler queue being served */
+	void *active_queue;
+
+	int rq_in_driver;
+	int hw_tag;
+	int hw_tag_samples;
+	int rq_in_driver_peak;
+
+	/*
+	 * The elevator fair queuing layer can provide idling to ensure
+	 * fairness for processes doing dependent reads. This might be needed
+	 * to ensure fairness between two processes doing synchronous reads
+	 * in two different cgroups. noop and deadline don't have any notion
+	 * of anticipation/idling of their own; as of now, they are the users
+	 * of this functionality.
+	 */
+	unsigned int elv_slice_idle;
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
+	unsigned int elv_slice[2];
+};
+
+extern int elv_slice_idle;
+extern int elv_slice_async;
+
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "%d" fmt, (ioq)->pid, ##args)
+
+#define elv_log(efqd, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "" fmt, ##args)
+
+#define ioq_sample_valid(samples)   ((samples) > 80)
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_busy = 0,          /* has requests or is under service */
+	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+	ELV_QUEUE_FLAG_idle_window,	  /* elevator slice idling enabled */
+	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
+	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
+	ELV_QUEUE_FLAG_NR,
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name)					\
+static inline void elv_mark_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline int elv_ioq_##name(struct io_queue *ioq)         		\
+{                                                                       \
+	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
+}
+
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
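+
+/*
+ * For example, ELV_IO_QUEUE_FLAG_FNS(busy) above expands to
+ * elv_mark_ioq_busy(), elv_clear_ioq_busy() and elv_ioq_busy(), which set,
+ * clear and test ELV_QUEUE_FLAG_busy in ioq->flags.
+ */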
+
+static inline struct io_service_tree *
+io_entity_service_tree(struct io_entity *entity)
+{
+	struct io_sched_data *sched_data = entity->sched_data;
+	unsigned int idx = entity->ioprio_class - 1;
+
+	BUG_ON(idx >= IO_IOPRIO_CLASSES);
+	BUG_ON(sched_data == NULL);
+
+	return sched_data->service_tree + idx;
+}
+
+/* A request got dispatched from the io_queue. Do the accounting. */
+static inline void elv_ioq_request_dispatched(struct io_queue *ioq)
+{
+	ioq->dispatched++;
+}
+
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+	if (elv_ioq_slice_new(ioq))
+		return 0;
+	if (time_before(jiffies, ioq->slice_end))
+		return 0;
+
+	return 1;
+}
+
+/* How many requests are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+	return ioq->dispatched;
+}
+
+static inline pid_t elv_ioq_pid(struct io_queue *ioq)
+{
+	return ioq->pid;
+}
+
+static inline unsigned long elv_ioq_ttime_mean(struct io_queue *ioq)
+{
+	return ioq->ttime_mean;
+}
+
+static inline unsigned long elv_ioq_sample_valid(struct io_queue *ioq)
+{
+	return ioq_sample_valid(ioq->ttime_samples);
+}
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+	atomic_inc(&ioq->ref);
+}
+
+static inline void elv_ioq_set_slice_end(struct io_queue *ioq,
+						unsigned long slice_end)
+{
+	ioq->slice_end = slice_end;
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+						int ioprio_class)
+{
+	ioq->entity.new_ioprio_class = ioprio_class;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+	ioq->entity.new_ioprio = ioprio;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq)
+{
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+
+static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return container_of(ioq->entity.sched_data, struct io_group,
+						sched_data);
+}
+
+/* Functions used by blksysfs.c */
+extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+						size_t count);
+
+/* Functions used by elevator.c */
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+extern void elv_exit_fq_data_post(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+					struct request *rq);
+extern void elv_fq_dispatched_request(struct elevator_queue *e,
+					struct request *rq);
+
+extern void elv_fq_activate_rq(struct request_queue *q, struct request *rq);
+extern void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+				struct request *rq);
+
+extern void *elv_fq_select_ioq(struct request_queue *q, int force);
+extern struct io_queue *rq_ioq(struct request *rq);
+
+/* Functions used by io schedulers */
+extern void elv_put_ioq(struct io_queue *ioq);
+extern void __elv_ioq_slice_expired(struct request_queue *q,
+					struct io_queue *ioq, int timed_out);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+		void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern int elv_hw_tag(struct elevator_queue *e);
+extern void *elv_active_sched_queue(struct elevator_queue *e);
+extern int elv_mod_idle_slice_timer(struct elevator_queue *eq,
+					unsigned long expires);
+extern int elv_del_idle_slice_timer(struct elevator_queue *eq);
+extern unsigned int elv_get_slice_idle(struct elevator_queue *eq);
+extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio);
+extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq);
+extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern int elv_nr_busy_ioq(struct elevator_queue *e);
+extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_init_fq_data(struct request_queue *q,
+					struct elevator_queue *e)
+{
+	return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+static inline void elv_exit_fq_data_post(struct elevator_queue *e) {}
+
+static inline void elv_fq_activate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_deactivate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_dispatched_request(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_removed(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_add(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_ioq_completed_request(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline struct io_queue *rq_ioq(struct request *rq) { return NULL; }
+static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 98259ed..7a3a7e9 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -231,6 +231,9 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
 	for (i = 0; i < ELV_HASH_ENTRIES; i++)
 		INIT_HLIST_HEAD(&eq->hash[i]);
 
+	if (elv_init_fq_data(q, eq))
+		goto err;
+
 	return eq;
 err:
 	kfree(eq);
@@ -301,9 +304,11 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
 	e->ops = NULL;
+	elv_exit_fq_data_post(e);
 	mutex_unlock(&e->sysfs_lock);
 
 	kobject_put(&e->kobj);
@@ -314,6 +319,8 @@ static void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_activate_rq(q, rq);
+
 	if (e->ops->elevator_activate_req_fn)
 		e->ops->elevator_activate_req_fn(q, rq);
 }
@@ -322,6 +329,8 @@ static void elv_deactivate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_deactivate_rq(q, rq);
+
 	if (e->ops->elevator_deactivate_req_fn)
 		e->ops->elevator_deactivate_req_fn(q, rq);
 }
@@ -446,6 +455,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	boundary = q->end_sector;
 	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -486,6 +496,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	q->end_sector = rq_end_sector(rq);
 	q->boundary_rq = rq;
@@ -553,6 +564,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	elv_rqhash_del(q, next);
 
 	q->nr_sorted--;
+	elv_ioq_request_removed(e, next);
 	q->last_merge = rq;
 }
 
@@ -632,12 +644,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 				q->last_merge = rq;
 		}
 
-		/*
-		 * Some ioscheds (cfq) run q->request_fn directly, so
-		 * rq cannot be accessed after calling
-		 * elevator_add_req_fn.
-		 */
 		q->elevator->ops->elevator_add_req_fn(q, rq);
+		elv_ioq_request_add(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_REQUEUE:
@@ -847,13 +855,12 @@ void elv_dequeue_request(struct request_queue *q, struct request *rq)
 
 int elv_queue_empty(struct request_queue *q)
 {
-	struct elevator_queue *e = q->elevator;
-
 	if (!list_empty(&q->queue_head))
 		return 0;
 
-	if (e->ops->elevator_queue_empty_fn)
-		return e->ops->elevator_queue_empty_fn(q);
+	/* Hopefully nr_sorted works and no need to call queue_empty_fn */
+	if (q->nr_sorted)
+		return 0;
 
 	return 1;
 }
@@ -928,8 +935,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight--;
-		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
-			e->ops->elevator_completed_req_fn(q, rq);
+		if (blk_sorted_rq(rq)) {
+			if (e->ops->elevator_completed_req_fn)
+				e->ops->elevator_completed_req_fn(q, rq);
+			elv_ioq_completed_request(q, rq);
+		}
 	}
 
 	/*
@@ -1228,3 +1238,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 	return NULL;
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq */
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+	return ioq_sched_queue(rq_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+	return ioq_sched_queue(elv_fq_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 465d6ba..cf02216 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -234,6 +234,11 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* io queue request belongs to */
+	struct io_queue *ioq;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 7a20425..6f2dea5 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -2,6 +2,7 @@
 #define _LINUX_ELEVATOR_H
 
 #include <linux/percpu.h>
+#include "../../block/elevator-fq.h"
 
 #ifdef CONFIG_BLOCK
 
@@ -29,6 +30,16 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+						struct request*);
+typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
+						struct request*);
+#endif
 
 struct elevator_ops
 {
@@ -56,6 +67,16 @@ struct elevator_ops
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+	elevator_should_preempt_fn *elevator_should_preempt_fn;
+	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+#endif
 };
 
 #define ELV_NAME_MAX	(16)
@@ -76,6 +97,9 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	int elevator_features;
+#endif
 };
 
 /*
@@ -89,6 +113,10 @@ struct elevator_queue
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
 	struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* fair queuing data */
+	struct elv_fq_data efqd;
+#endif
 };
 
 /*
@@ -208,5 +236,25 @@ enum {
 	__val;							\
 })
 
+/* An iosched can let the elevator know its feature set/capabilities */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fq logic of elevator layer */
+#define	ELV_IOSCHED_NEED_FQ	1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* ELV_IOSCHED_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+#endif /* ELV_IOSCHED_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 02/10] Common flat fair queuing code in elevator layer
  2009-03-12  1:56 ` Vivek Goyal
  (?)
@ 2009-03-12  1:56 ` Vivek Goyal
  2009-03-19  6:27   ` Gui Jianfeng
                     ` (3 more replies)
  -1 siblings, 4 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers
  Cc: vgoyal, akpm, menage, peterz

This is the common fair queuing code in the elevator layer, controlled by the
config option CONFIG_ELV_FAIR_QUEUING. This patch initially introduces only
flat fair queuing support, where there is a single group, the "root group",
and all tasks belong to it.

These elevator layer changes are backward compatible; any ioscheduler using
the old interfaces will continue to work.
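
As an illustration only (not part of this patch; the foo_* names are made
up), an io scheduler that wants to use the common fair queuing layer would
advertise it via elevator_features and fill in the new optional hooks,
roughly like:

	static struct elevator_type iosched_foo = {
		.ops = {
			/* the usual elevator_ops hooks, plus the new
			 * CONFIG_ELV_FAIR_QUEUING ones, for example: */
			.elevator_free_sched_queue_fn	= foo_free_sched_queue,
			.elevator_should_preempt_fn	= foo_should_preempt,
		},
		.elevator_name		= "foo",
		.elevator_owner		= THIS_MODULE,
		.elevator_features	= ELV_IOSCHED_NEED_FQ,
	};

The elv_fq_*/elv_ioq_* hooks in the elevator layer return early for
schedulers that do not set ELV_IOSCHED_NEED_FQ, which is what keeps the old
interfaces working unchanged.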

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   13 +
 block/Makefile           |    1 +
 block/blk-sysfs.c        |   10 +
 block/elevator-fq.c      | 1882 ++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h      |  479 ++++++++++++
 block/elevator.c         |   46 +-
 include/linux/blkdev.h   |    5 +
 include/linux/elevator.h |   48 ++
 8 files changed, 2473 insertions(+), 11 deletions(-)
 create mode 100644 block/elevator-fq.c
 create mode 100644 block/elevator-fq.h

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
 
 menu "IO Schedulers"
 
+config ELV_FAIR_QUEUING
+	bool "Elevator Fair Queuing Support"
+	default n
+	---help---
+	  Traditionally only cfq had the notion of multiple queues and did
+	  fair queuing on its own. With cgroups and the need to control IO,
+	  even the simple io schedulers like noop, deadline and as will have
+	  one queue per cgroup and will need hierarchical fair queuing.
+	  Instead of every io scheduler implementing its own fair queuing
+	  logic, this option enables fair queuing in the elevator layer so
+	  that other ioschedulers can make use of it.
+	  If unsure, say N.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/Makefile b/block/Makefile
index bfe7304..6f410d5 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING)	+= elevator-fq.o
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e29ddfc..0d98c96 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -276,6 +276,13 @@ static struct queue_sysfs_entry queue_iostats_entry = {
 	.store = queue_iostats_store,
 };
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+static struct queue_sysfs_entry queue_slice_idle_entry = {
+	.attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_slice_idle_show,
+	.store = elv_slice_idle_store,
+};
+#endif
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -287,6 +294,9 @@ static struct attribute *default_attrs[] = {
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
 	&queue_iostats_entry.attr,
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	&queue_slice_idle_entry.attr,
+#endif
 	NULL,
 };
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..a8addd1
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,1882 @@
+/*
+ * BFQ: Hierarchical B-WF2Q+ scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+#include <linux/blktrace_api.h>
+
+/* Values taken from cfq */
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+int elv_slice_idle = HZ / 125;
+static struct kmem_cache *elv_ioq_pool;
+
+#define ELV_HW_QUEUE_MIN	(5)
+#define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+/* Mainly the BFQ scheduling code follows */
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(bfq_timestamp_t a, bfq_timestamp_t b)
+{
+	return (s64)(a - b) > 0;
+}
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor.
+ */
+static inline bfq_timestamp_t bfq_delta(bfq_service_t service,
+					bfq_weight_t weight)
+{
+	bfq_timestamp_t d = (bfq_timestamp_t)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct io_entity *entity,
+				   bfq_service_t service)
+{
+	BUG_ON(entity->weight == 0);
+
+	entity->finish = entity->start + bfq_delta(service, entity->weight);
+}
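+
+/*
+ * For illustration: queue weights are derived from the ioprio as
+ * IOPRIO_BE_NR - ioprio, i.e. 1..8. Charging 100 units of service to a
+ * weight-4 entity advances its finish time by (100 << WFQ_SERVICE_SHIFT) / 4,
+ * half the advance of a weight-2 entity charged the same amount, so over
+ * time the weight-4 entity receives twice as much service.
+ */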
+
+static inline struct io_queue *io_entity_to_ioq(struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data == NULL)
+		ioq = container_of(entity, struct io_queue, entity);
+	return ioq;
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct io_entity *bfq_entity_of(struct rb_node *node)
+{
+	struct io_entity *entity = NULL;
+
+	if (node != NULL)
+		entity = rb_entry(node, struct io_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_extract(struct rb_root *root, struct io_entity *entity)
+{
+	BUG_ON(entity->tree != root);
+
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *next;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	BUG_ON(entity->tree != &st->idle);
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = bfq_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = bfq_entity_of(next);
+	}
+
+	bfq_extract(&st->idle, entity);
+
+	/* Delete queue from idle list */
+	if (ioq)
+		list_del(&ioq->queue_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct io_entity *entity)
+{
+	struct io_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	BUG_ON(entity->tree != NULL);
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct io_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of an entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function  assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct io_entity *entity,
+					struct rb_node *node)
+{
+	struct io_entity *child;
+
+	if (node != NULL) {
+		child = rb_entry(node, struct io_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value.  The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+	struct io_entity *entity = rb_entry(node, struct io_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (parent == NULL)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right != NULL)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left != NULL)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * per node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
+	if (node->rb_left != NULL)
+		node = node->rb_left;
+	else if (node->rb_right != NULL)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	return IOPRIO_BE_NR - ioprio;
+}
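+
+/*
+ * For example, with the usual IOPRIO_BE_NR value of 8, ioprio 0 maps to
+ * weight 8 and ioprio 7 maps to weight 1, so an ioprio 0 queue should
+ * receive roughly eight times the service of an ioprio 7 queue under
+ * B-WF2Q+.
+ */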
+
+void bfq_get_entity(struct io_entity *entity)
+{
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (ioq)
+		elv_get_ioq(ioq);
+}
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (node->rb_right == NULL && node->rb_left == NULL)
+		deepest = rb_parent(node);
+	else if (node->rb_right == NULL)
+		deepest = node->rb_left;
+	else if (node->rb_left == NULL)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right != NULL)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_extract(&st->active, entity);
+
+	if (node != NULL)
+		bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+
+	/* Add this queue to idle list */
+	if (ioq)
+		list_add(&ioq->queue_list, &ioq->efqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(!entity->on_st);
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	ioq = io_entity_to_ioq(entity);
+	if (!ioq)
+		return;
+	elv_put_ioq(ioq);
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+void bfq_put_idle_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	bfq_idle_extract(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+void bfq_forget_idle(struct io_service_tree *st)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * Active tree is empty. Pull back vtime to finish time of
+		 * last idle entity on idle tree.
+		 * The rationale seems to be that it reduces the possibility
+		 * of vtime wraparound (i.e., V - F going negative in bfq_gt()).
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+static struct io_service_tree *
+__bfq_entity_update_prio(struct io_service_tree *old_st,
+				struct io_entity *entity)
+{
+	struct io_service_tree *new_st = old_st;
+
+	if (entity->ioprio_changed) {
+		entity->ioprio = entity->new_ioprio;
+		entity->ioprio_class = entity->new_ioprio_class;
+		entity->ioprio_changed = 0;
+
+		old_st->wsum -= entity->weight;
+		entity->weight = bfq_ioprio_to_weight(entity->ioprio);
+
+		/*
+		 * NOTE: here we may be changing the weight too early,
+		 * this will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = io_entity_service_tree(entity);
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the entity's current budget (and, if the
+ * entity is already active, the service it has received) to calculate
+ * its timestamps.
+ */
+static void __bfq_activate_entity(struct io_entity *entity)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	if (entity == sd->active_entity) {
+		BUG_ON(entity->tree != NULL);
+		/*
+		 * If we are requeueing the current entity we have
+		 * to take care of not charging to it service it has
+		 * not received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_active entity below it.  We reuse the old
+		 * start time.
+		 */
+		bfq_active_extract(st, entity);
+	} else if (entity->tree == &st->idle) {
+		/*
+		 * Must be on the idle tree, bfq_idle_extract() will
+		 * check for that.
+		 */
+		bfq_idle_extract(st, entity);
+		entity->start = bfq_gt(st->vtime, entity->finish) ?
+				       st->vtime : entity->finish;
+	} else {
+		/*
+		 * The finish time of the entity may be invalid, and
+		 * it is in the past for sure, otherwise the queue
+		 * would have been on the idle tree.
+		 */
+		entity->start = st->vtime;
+		st->wsum += entity->weight;
+		bfq_get_entity(entity);
+
+		BUG_ON(entity->on_st);
+		entity->on_st = 1;
+	}
+
+	st = __bfq_entity_update_prio(st, entity);
+	bfq_calc_finish(entity, entity->budget);
+	bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity.
+ * @entity: the entity to activate.
+ */
+void bfq_activate_entity(struct io_entity *entity)
+{
+	__bfq_activate_entity(entity);
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently of its previous state.  If the
+ * entity was not on a service tree just return, otherwise extract it
+ * from whichever scheduler tree it is on.  If @requeue is set and the
+ * entity's finish time is still ahead of the service tree's vtime, put
+ * it on the idle tree; otherwise forget it (dropping the reference if it
+ * is a queue).
+ */
+int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+	int was_active = entity == sd->active_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	BUG_ON(was_active && entity->tree != NULL);
+
+	if (was_active) {
+		bfq_calc_finish(entity, entity->service);
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_extract(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_extract(st, entity);
+	else if (entity->tree != NULL)
+		BUG();
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	BUG_ON(sd->active_entity == entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+void bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	__bfq_deactivate_entity(entity, requeue);
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often;
+ * we may end up with reactivated tasks getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct io_service_tree *st)
+{
+	struct io_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
+	entry = rb_entry(node, struct io_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active - find the eligible entity with the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches for the first schedulable entity, starting from
+ * the root of the tree and descending into the left subtree whenever it
+ * contains at least one eligible (start <= vtime) entity.  The path
+ * on the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct io_entity *bfq_first_active_entity(struct io_service_tree *st)
+{
+	struct io_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
+	while (node != NULL) {
+		entry = rb_entry(node, struct io_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		BUG_ON(bfq_gt(entry->min_start, st->vtime));
+
+		if (node->rb_left != NULL) {
+			entry = rb_entry(node->rb_left,
+					 struct io_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first != NULL)
+			break;
+		node = node->rb_right;
+	}
+
+	BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
+	return first;
+}
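+
+/*
+ * The min_start annotation keeps this lookup O(log N): a subtree is
+ * descended into only when its min_start does not exceed the vtime,
+ * which guarantees that it holds at least one eligible entity.
+ */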
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct io_entity *__bfq_lookup_next_entity(struct io_service_tree *st)
+{
+	struct io_entity *entity;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+	BUG_ON(bfq_gt(entity->start, st->vtime));
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_active entity at each level of the
+ * hierarchy, the complexity of the lookup could be decreased at no cost
+ * by simply returning the cached next_active value; we prefer to do
+ * full lookups to test the consistency of the data structures.
+ */
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract)
+{
+	struct io_service_tree *st = sd->service_tree;
+	struct io_entity *entity;
+	int i;
+
+	/*
+	 * One can check which entity will be selected next without
+	 * expiring the current one.
+	 */
+	BUG_ON(extract && sd->active_entity != NULL);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+		entity = __bfq_lookup_next_entity(st);
+		if (entity != NULL) {
+			if (extract) {
+				bfq_active_extract(st, entity);
+				sd->active_entity = entity;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+void entity_served(struct io_entity *entity, bfq_service_t served)
+{
+	struct io_service_tree *st;
+
+	st = io_entity_service_tree(entity);
+	entity->service += served;
+	WARN_ON_ONCE(entity->service > entity->budget);
+	BUG_ON(st->wsum == 0);
+	st->vtime += bfq_delta(served, st->wsum);
+	bfq_forget_idle(st);
+}
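+
+/*
+ * Note that the virtual time advances by roughly served/wsum, so an
+ * entity holding a fraction weight/wsum of the total weight is entitled
+ * to about that fraction of the total service dispatched, which is what
+ * the finish timestamps (start + budget/weight) enforce.
+ */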
+
+/* Elevator fair queuing function */
+struct io_queue *rq_ioq(struct request *rq)
+{
+	return rq->ioq;
+}
+
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+	return e->efqd.active_queue;
+}
+
+void *elv_active_sched_queue(struct elevator_queue *e)
+{
+	return ioq_sched_queue(elv_active_ioq(e));
+}
+EXPORT_SYMBOL(elv_active_sched_queue);
+
+int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_ioq);
+
+int elv_nr_busy_rt_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_rt_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_rt_ioq);
+
+int elv_hw_tag(struct elevator_queue *e)
+{
+	return e->efqd.hw_tag;
+}
+EXPORT_SYMBOL(elv_hw_tag);
+
+/* Helper functions for operating on elevator idle slice timer */
+int elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return mod_timer(&efqd->idle_slice_timer, expires);
+}
+EXPORT_SYMBOL(elv_mod_idle_slice_timer);
+
+int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return del_timer(&efqd->idle_slice_timer);
+}
+EXPORT_SYMBOL(elv_del_idle_slice_timer);
+
+unsigned int elv_get_slice_idle(struct elevator_queue *eq)
+{
+	return eq->efqd.elv_slice_idle;
+}
+EXPORT_SYMBOL(elv_get_slice_idle);
+
+void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+
+	entity_served(&ioq->entity, served);
+
+	ioq->total_service += served;
+	elv_log_ioq(efqd, ioq, "ioq served=0x%lx total service=0x%lx", served,
+			ioq->total_service);
+}
+
+/* Functions to show and store elv_idle_slice value through sysfs */
+ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = jiffies_to_msecs(efqd->elv_slice_idle);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned long data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data > INT_MAX)
+		data = INT_MAX;
+
+	data = msecs_to_jiffies(data);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->elv_slice_idle = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (elv_nr_busy_ioq(q->elevator)) {
+		elv_log(efqd, "schedule dispatch");
+		kblockd_schedule_work(efqd->queue, &efqd->unplug_work);
+	}
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+void elv_kick_queue(struct work_struct *work)
+{
+	struct elv_fq_data *efqd =
+		container_of(work, struct elv_fq_data, unplug_work);
+	struct request_queue *q = efqd->queue;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	blk_start_queueing(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+	del_timer_sync(&e->efqd.idle_slice_timer);
+	cancel_work_sync(&e->efqd.unplug_work);
+}
+EXPORT_SYMBOL(elv_shutdown_timer_wq);
+
+void elv_ioq_set_prio_slice(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	ioq->slice_end = jiffies + ioq->entity.budget;
+	elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->entity.budget);
+}
+
+void elv_ioq_init_prio_data(struct io_queue *ioq, int ioprio_class, int ioprio)
+{
+	struct io_entity *entity = &ioq->entity;
+
+	entity->new_ioprio_class = ioprio_class;
+	entity->new_ioprio = ioprio;
+	entity->ioprio_changed = 1;
+	return;
+}
+
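+/*
+ * Think time is tracked as an exponential moving average with a 7/8
+ * decay and a 256x fixed-point sample count: in steady state
+ * ttime_samples converges to 256 and ttime_total to 256 * ttime, so
+ * ttime_mean approaches the typical gap between a request completion
+ * and the next request from the queue (capped at 2 * elv_slice_idle).
+ */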
+static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	unsigned long elapsed = jiffies - ioq->last_end_request;
+	unsigned long ttime = min(elapsed, 2UL * efqd->elv_slice_idle);
+
+	ioq->ttime_samples = (7*ioq->ttime_samples + 256) / 8;
+	ioq->ttime_total = (7*ioq->ttime_total + 256*ttime) / 8;
+	ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
+}
+
+/*
+ * Disable idle window if the process thinks too long.
+ * This idle flag can also be updated by io scheduler.
+ */
+static void elv_ioq_update_idle_window(struct elevator_queue *eq,
+				struct io_queue *ioq, struct request *rq)
+{
+	int old_idle, enable_idle;
+	struct elv_fq_data *efqd = ioq->efqd;
+
+	/*
+	 * Don't idle for async or idle io prio class
+	 */
+	if (!elv_ioq_sync(ioq) || elv_ioq_class_idle(ioq))
+		return;
+
+	enable_idle = old_idle = elv_ioq_idle_window(ioq);
+
+	if (!efqd->elv_slice_idle)
+		enable_idle = 0;
+	else if (ioq_sample_valid(ioq->ttime_samples)) {
+		if (ioq->ttime_mean > efqd->elv_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+
+	/*
+	 * From the think time perspective, idling should be enabled. Check
+	 * with the io scheduler if it wants to disable idling based on
+	 * additional considerations such as the seek pattern.
+	 */
+	if (enable_idle) {
+		if (eq->ops->elevator_update_idle_window_fn)
+			enable_idle = eq->ops->elevator_update_idle_window_fn(
+						eq, ioq->sched_queue, rq);
+		if (!enable_idle)
+			elv_log_ioq(efqd, ioq, "iosched disabled idle");
+	}
+
+	if (old_idle != enable_idle) {
+		elv_log_ioq(efqd, ioq, "idle=%d", enable_idle);
+		if (enable_idle)
+			elv_mark_ioq_idle_window(ioq);
+		else
+			elv_clear_ioq_idle_window(ioq);
+	}
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+	struct io_queue *ioq = NULL;
+
+	ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+	return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+	kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+			void *sched_queue, int ioprio_class, int ioprio,
+			int is_sync)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+	struct io_group *iog = io_lookup_io_group_current(efqd->queue);
+
+	RB_CLEAR_NODE(&ioq->entity.rb_node);
+	atomic_set(&ioq->ref, 0);
+	ioq->efqd = efqd;
+	ioq->entity.budget = efqd->elv_slice[is_sync];
+	elv_ioq_set_ioprio_class(ioq, ioprio_class);
+	elv_ioq_set_ioprio(ioq, ioprio);
+	ioq->pid = current->pid;
+	ioq->sched_queue = sched_queue;
+	elv_mark_ioq_idle_window(ioq);
+	bfq_init_entity(&ioq->entity, iog);
+	return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
+						efqd);
+
+	BUG_ON(atomic_read(&ioq->ref) <= 0);
+	if (!atomic_dec_and_test(&ioq->ref))
+		return;
+	BUG_ON(ioq->nr_queued);
+	BUG_ON(ioq->entity.tree != NULL);
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(efqd->active_queue == ioq);
+
+	/* Can be called by outgoing elevator. Don't use q */
+	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+
+	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+	elv_log_ioq(efqd, ioq, "freed");
+	elv_free_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_put_ioq);
+
+void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+	struct io_queue *ioq = *ioq_ptr;
+
+	if (ioq != NULL) {
+		/* Drop the reference taken by the io group */
+		elv_put_ioq(ioq);
+		*ioq_ptr = NULL;
+	}
+}
+
+/* Get next queue for service. */
+struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = NULL;
+	struct io_queue *ioq = NULL;
+	struct io_sched_data *sd;
+
+	/*
+	 * One can check which queue will be selected next while a queue
+	 * is still active.  The preempt logic uses it.
+	 */
+	BUG_ON(extract && efqd->active_queue != NULL);
+
+	if (!efqd->busy_queues)
+		return NULL;
+
+	sd = &efqd->root_group->sched_data;
+	if (extract)
+		entity = bfq_lookup_next_entity(sd, 1);
+	else
+		entity = bfq_lookup_next_entity(sd, 0);
+
+	BUG_ON(!entity);
+	if (extract)
+		entity->service = 0;
+	ioq = io_entity_to_ioq(entity);
+
+	return ioq;
+}
+
+static void __elv_set_active_ioq(struct elv_fq_data *efqd,
+					struct io_queue *ioq)
+{
+	struct request_queue *q = efqd->queue;
+
+	if (ioq) {
+		elv_log_ioq(efqd, ioq, "set_active, busy_queues=%d",
+							efqd->busy_queues);
+		ioq->slice_end = 0;
+		elv_mark_ioq_slice_new(ioq);
+	}
+
+	efqd->active_queue = ioq;
+
+	/* Let iosched know if it wants to take some action */
+	if (ioq) {
+		if (q->elevator->ops->elevator_active_ioq_set_fn)
+			q->elevator->ops->elevator_active_ioq_set_fn(q,
+							ioq->sched_queue);
+	}
+}
+
+/* Get and set a new active queue for service. */
+struct io_queue *elv_set_active_ioq(struct request_queue *q)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	ioq = elv_get_next_ioq(q, 1);
+	__elv_set_active_ioq(efqd, ioq);
+	return ioq;
+}
+
+void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+	struct request_queue *q = efqd->queue;
+
+	if (q->elevator->ops->elevator_active_ioq_reset_fn)
+		q->elevator->ops->elevator_active_ioq_reset_fn(q);
+	efqd->active_queue = NULL;
+	del_timer(&efqd->idle_slice_timer);
+}
+
+void elv_activate_ioq(struct io_queue *ioq)
+{
+	bfq_activate_entity(&ioq->entity);
+}
+
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue)
+{
+	if (ioq == efqd->active_queue)
+		elv_reset_active_ioq(efqd);
+
+	bfq_deactivate_entity(&ioq->entity, requeue);
+}
+
+/* Called when an inactive queue receives a new request. */
+void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(ioq == efqd->active_queue);
+	elv_log_ioq(efqd, ioq, "add to busy");
+	elv_activate_ioq(ioq);
+	elv_mark_ioq_busy(ioq);
+	efqd->busy_queues++;
+	if (elv_ioq_class_rt(ioq))
+		efqd->busy_rt_queues++;
+}
+
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	BUG_ON(!elv_ioq_busy(ioq));
+	BUG_ON(ioq->nr_queued);
+	elv_log_ioq(efqd, ioq, "del from busy");
+	elv_clear_ioq_busy(ioq);
+	BUG_ON(efqd->busy_queues == 0);
+	efqd->busy_queues--;
+	if (elv_ioq_class_rt(ioq))
+		efqd->busy_rt_queues--;
+
+	elv_deactivate_ioq(efqd, ioq, requeue);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * the current queue used and adjust the start and finish times of the
+ * queue and the vtime of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations, especially when the underlying device supports command
+ * queuing and requests from multiple queues can be outstanding at the
+ * same time; then it is not clear which queue consumed how much disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of a queue only
+ * after the first request from the queue has completed. This does not
+ * work very well if we expire the queue before its first request has
+ * finished. For seeky queues, we will expire the queue after dispatching
+ * a few requests without waiting and start dispatching from the next
+ * queue.
+ *
+ * It is not clear how to determine the time consumed by a queue in such
+ * scenarios. Currently, as a crude approximation, we charge 25% of the
+ * time slice for such cases. A better mechanism is needed for accurate
+ * accounting.
+ */
+void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq,
+				int budget_update)
+{
+	struct elevator_queue *e = q->elevator;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_entity *entity = &ioq->entity;
+	unsigned long slice, step, min_slice;
+	long slice_unused, slice_used;
+
+	assert_spin_locked(q->queue_lock);
+	elv_log_ioq(efqd, ioq, "slice expired, budget_update=%d", budget_update);
+
+	if (elv_ioq_wait_request(ioq))
+		del_timer(&efqd->idle_slice_timer);
+
+	elv_clear_ioq_wait_request(ioq);
+
+	/*
+	 * If ioq->slice_end == 0, the queue was expired before its first
+	 * request completed. Of course we were not planning to idle on the
+	 * queue, otherwise we would not have expired it.
+	 *
+	 * Charge 25% of the slice in such cases. This is not the best thing
+	 * to do, but it is not clear what the next best thing would be.
+	 *
+	 * This arises from the fact that we don't have the notion of only
+	 * one queue being operational at a time: the io scheduler can
+	 * dispatch requests from multiple queues in one dispatch round.
+	 * Ideally, for more accurate accounting of the exact disk time used
+	 * by each queue, one should dispatch requests from only one queue
+	 * and wait for all of them to finish. But this would reduce
+	 * throughput.
+	 */
+	if (!ioq->slice_end)
+		slice_unused = 3*entity->budget/4;
+	else {
+		slice_unused = ioq->slice_end - jiffies;
+		/*
+		 * The queue consumed more slice than it was allocated for.
+		 * Currently we charge it only for the allocated slice and
+		 * not for the slice actually consumed, since charging the
+		 * latter would increase the latency before this queue is
+		 * scheduled next.
+		 *
+		 * Maybe we can live with a little bit of unfairness; how to
+		 * handle this correctly is still an open problem.
+		 */
+		if (slice_unused < 0)
+			slice_unused = 0;
+
+		if (slice_unused == entity->budget) {
+			/*
+			 * The queue was expired immediately after completing
+			 * its first request. Charge 25% of the slice.
+			 */
+			slice_unused = (3*entity->budget)/4;
+		}
+
+	}
+
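+	/*
+	 * For example, with a 100ms budget: a queue expired before its
+	 * first completion is charged 25ms, a queue that used 60ms of its
+	 * slice is charged 60ms, and a queue that overran its slice is
+	 * charged at most the full 100ms budget (slice_unused is clamped
+	 * to 0).
+	 */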
+	slice_used = entity->budget - slice_unused;
+	elv_ioq_served(ioq, slice_used);
+
+	if (budget_update && !elv_ioq_slice_new(ioq)) {
+		slice = efqd->elv_slice[elv_ioq_sync(ioq)];
+		step = slice / 16;
+		min_slice = slice - slice / 4;
+
+		/*
+		 * Try to adapt the slice length to the behavior of the
+		 * queue.  If it has not exhausted the assigned budget,
+		 * assign it a shorter new one, otherwise assign it a
+		 * longer new one.  The increments/decrements are done
+		 * linearly with a step of elv_slice / 16, and slices of
+		 * less than 11 / 16 * elv_slice are never assigned, to
+		 * avoid performance degradation.
+		 */
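+		/*
+		 * For instance, with a 100ms slice the step is about 6ms
+		 * and min_slice is 75ms, so the budget oscillates roughly
+		 * between 69ms and 100ms depending on whether the queue
+		 * keeps exhausting its slice.
+		 */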
+		if (slice_unused != 0 && entity->budget >= min_slice)
+			entity->budget -= step;
+		else if (slice_unused == 0 && entity->budget <= slice - step)
+			entity->budget += step;
+
+		elv_log_ioq(efqd, ioq, "slice_unused=%ld, budget=%ld",
+					slice_unused, entity->budget);
+	}
+
+	BUG_ON(ioq != efqd->active_queue);
+	elv_reset_active_ioq(efqd);
+
+	if (!ioq->nr_queued)
+		elv_del_ioq_busy(e, ioq, 1);
+	else
+		elv_activate_ioq(ioq);
+}
+EXPORT_SYMBOL(__elv_ioq_slice_expired);
+
+/*
+ * budget_update signifies whether budget increment/decrement accounting
+ * should be done on this queue at this expiry.
+ * In some circumstances, such as preemption or forced dispatch, it may
+ * not make much sense to adjust budgets.
+ */
+void elv_ioq_slice_expired(struct request_queue *q, int budget_update)
+{
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (ioq)
+		__elv_ioq_slice_expired(q, ioq, budget_update);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no (or if we aren't sure); returning 1 will cause a preemption attempt.
+ */
+int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+			struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elevator_queue *eq = q->elevator;
+
+	ioq = elv_active_ioq(eq);
+
+	if (!ioq)
+		return 0;
+
+	if (elv_ioq_slice_used(ioq))
+		return 1;
+
+	if (elv_ioq_class_idle(new_ioq))
+		return 0;
+
+	if (elv_ioq_class_idle(ioq))
+		return 1;
+
+	/*
+	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+	 */
+	if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
+		return 1;
+
+	/*
+	 * Check with io scheduler if it has additional criterion based on
+	 * which it wants to preempt existing queue.
+	 */
+	if (eq->ops->elevator_should_preempt_fn)
+		return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
+
+	return 0;
+}
+
+int elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	struct io_queue *new_ioq;
+
+	elv_log_ioq(&q->elevator->efqd, ioq, "preemption attempt");
+
+	new_ioq = elv_get_next_ioq(q, 0);
+	if (new_ioq == ioq) {
+		/*
+		 * We might need expire_ioq logic here to check with the io
+		 * scheduler if the queue can be preempted. This might not
+		 * be needed for cfq, but AS might need it.
+		 */
+		elv_ioq_slice_expired(q, 0);
+		elv_ioq_set_slice_end(ioq, 0);
+		elv_mark_ioq_slice_new(ioq);
+		return 1;
+	}
+
+	return 0;
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	BUG_ON(!efqd);
+	BUG_ON(!ioq);
+	efqd->rq_queued++;
+	ioq->nr_queued++;
+
+	if (!elv_ioq_busy(ioq))
+		elv_add_ioq_busy(efqd, ioq);
+
+	elv_ioq_update_io_thinktime(ioq);
+	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+
+	if (ioq == elv_active_ioq(q->elevator)) {
+		/*
+		 * if we are waiting for a request for this queue, let it rip
+		 * immediately and flag that we must not expire this queue
+		 * just now
+		 */
+		if (elv_ioq_wait_request(ioq)) {
+			del_timer(&efqd->idle_slice_timer);
+			blk_start_queueing(q);
+		}
+	} else if (elv_should_preempt(q, ioq, rq)) {
+		/*
+		 * not the active queue - expire the current slice if it is
+		 * idle and has exceeded its mean thinktime, or this new queue
+		 * has some old slice time left and is of higher priority, or
+		 * this new queue is RT and the current one is BE
+		 */
+		/*
+		 * try to preempt the active queue; we still respect the
+		 * scheduler decision, so we try to reschedule, but if cfqq
+		 * has received more service than allocated, the scheduler
+		 * will refuse the preemption.
+		 */
+		if (elv_preempt_queue(q, ioq))
+			blk_start_queueing(q);
+	}
+}
+
+void elv_idle_slice_timer(unsigned long data)
+{
+	struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+	struct io_queue *ioq;
+	unsigned long flags;
+	struct request_queue *q = efqd->queue;
+
+	elv_log(efqd, "idle timer fired");
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	ioq = efqd->active_queue;
+
+	if (ioq) {
+
+		/*
+		 * expired
+		 */
+		if (elv_ioq_slice_used(ioq))
+			goto expire;
+
+		/*
+		 * only expire and reinvoke the request handler if there are
+		 * other queues with pending requests
+		 */
+		if (!elv_nr_busy_ioq(q->elevator))
+			goto out_cont;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (ioq->nr_queued)
+			goto out_kick;
+	}
+expire:
+	elv_ioq_slice_expired(q, 1);
+out_kick:
+	elv_schedule_dispatch(q);
+out_cont:
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	unsigned long sl;
+
+	BUG_ON(!ioq);
+
+	/*
+	 * SSD device without seek penalty, disable idling. But only do so
+	 * for devices that support queuing, otherwise we still have a problem
+	 * with sync vs async workloads.
+	 */
+	if (blk_queue_nonrot(q) && efqd->hw_tag)
+		return;
+
+	/*
+	 * still requests with the driver, don't idle
+	 */
+	if (efqd->rq_in_driver)
+		return;
+
+	/*
+	 * idle is disabled, either manually or by past process history
+	 */
+	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+		return;
+
+	/*
+	 * Maybe the iosched has its own idling logic. In that case the io
+	 * scheduler will take care of arming the timer, if need be.
+	 */
+	if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+		q->elevator->ops->elevator_arm_slice_timer_fn(q,
+						ioq->sched_queue);
+	} else {
+		elv_mark_ioq_wait_request(ioq);
+		sl = efqd->elv_slice_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log(efqd, "arm idle: %lu", sl);
+	}
+}
+
+void elv_free_idle_ioq_list(struct elevator_queue *e)
+{
+	struct io_queue *ioq, *n;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
+		elv_deactivate_ioq(efqd, ioq, 0);
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	int budget_update = 1;
+
+	if (!elv_nr_busy_ioq(q->elevator))
+		return NULL;
+
+	if (ioq == NULL)
+		goto new_queue;
+
+	/*
+	 * Force dispatch. Continue to dispatch from current queue as long
+	 * as it has requests.
+	 */
+	if (unlikely(force)) {
+		if (ioq->nr_queued)
+			goto keep_queue;
+		else {
+			/*
+			 * Don't try to update the queue's budget based on
+			 * forced dispatch behavior.
+			 */
+			budget_update = 0;
+			goto expire;
+		}
+	}
+
+	/*
+	 * The active queue has run out of time, expire it and select new.
+	 */
+	if (elv_ioq_slice_used(ioq))
+		goto expire;
+
+	/*
+	 * If we have an RT queue waiting, then we pre-empt the current
+	 * non-RT queue.
+	 */
+	if (!elv_ioq_class_rt(ioq) && efqd->busy_rt_queues) {
+		/*
+		 * We simulate this as the queue having timed out, so that it
+		 * gets to bank the remainder of its time slice.
+		 */
+		elv_log_ioq(efqd, ioq, "preempt");
+
+		/* Don't do budget adjustments for queue being preempted. */
+		budget_update = 0;
+		goto expire;
+	}
+
+	/*
+	 * The active queue has requests and isn't expired, allow it to
+	 * dispatch.
+	 */
+
+	if (ioq->nr_queued)
+		goto keep_queue;
+
+	/*
+	 * No requests pending. If the active queue still has requests in
+	 * flight or is idling for a new request, allow either of these
+	 * conditions to happen (or time out) before selecting a new queue.
+	 */
+
+	if (timer_pending(&efqd->idle_slice_timer) ||
+	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
+expire:
+	elv_ioq_slice_expired(q, budget_update);
+new_queue:
+	ioq = elv_set_active_ioq(q);
+keep_queue:
+	return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	ioq = rq->ioq;
+	BUG_ON(!ioq);
+	ioq->nr_queued--;
+
+	efqd = ioq->efqd;
+	BUG_ON(!efqd);
+	efqd->rq_queued--;
+
+	if (elv_ioq_busy(ioq) && (elv_active_ioq(e) != ioq) && !ioq->nr_queued)
+		elv_del_ioq_busy(e, ioq, 1);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	BUG_ON(!ioq);
+	elv_ioq_request_dispatched(ioq);
+	elv_ioq_request_removed(e, rq);
+}
+
+void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	efqd->rq_in_driver++;
+	elv_log_ioq(efqd, rq_ioq(rq), "activate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+	elv_log_ioq(efqd, rq_ioq(rq), "deactivate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+/*
+ * Update hw_tag based on peak queue depth over 50 samples under
+ * sufficient load.
+ */
+static void elv_update_hw_tag(struct elv_fq_data *efqd)
+{
+	if (efqd->rq_in_driver > efqd->rq_in_driver_peak)
+		efqd->rq_in_driver_peak = efqd->rq_in_driver;
+
+	if (efqd->rq_queued <= ELV_HW_QUEUE_MIN &&
+	    efqd->rq_in_driver <= ELV_HW_QUEUE_MIN)
+		return;
+
+	if (efqd->hw_tag_samples++ < 50)
+		return;
+
+	if (efqd->rq_in_driver_peak >= ELV_HW_QUEUE_MIN)
+		efqd->hw_tag = 1;
+	else
+		efqd->hw_tag = 0;
+
+	efqd->hw_tag_samples = 0;
+	efqd->rq_in_driver_peak = 0;
+}
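+
+/*
+ * In other words, samples are taken only when there is enough load
+ * (queued or in-driver requests above ELV_HW_QUEUE_MIN); after 50 such
+ * samples, hw_tag is set only if the peak driver depth reached
+ * ELV_HW_QUEUE_MIN, and the sampling window then restarts.
+ */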
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+	const int sync = rq_is_sync(rq);
+	struct io_queue *ioq = rq->ioq;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	elv_log_ioq(efqd, ioq, "complete");
+
+	elv_update_hw_tag(efqd);
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+
+	WARN_ON(!ioq->dispatched);
+	ioq->dispatched--;
+
+	if (sync)
+		ioq->last_end_request = jiffies;
+
+	/*
+	 * If this is the active queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+
+	if (elv_active_ioq(q->elevator) == ioq) {
+		if (elv_ioq_slice_new(ioq)) {
+			elv_ioq_set_prio_slice(q, ioq);
+			elv_clear_ioq_slice_new(ioq);
+		}
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+			elv_ioq_slice_expired(q, 1);
+		else if (sync && !ioq->nr_queued)
+			elv_ioq_arm_slice_timer(q);
+	}
+
+	if (!efqd->rq_in_driver)
+		elv_schedule_dispatch(q);
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio)
+{
+	struct io_queue *ioq = NULL;
+
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		ioq = iog->async_queue[0][ioprio];
+		break;
+	case IOPRIO_CLASS_BE:
+		ioq = iog->async_queue[1][ioprio];
+		break;
+	case IOPRIO_CLASS_IDLE:
+		ioq = iog->async_idle_queue;
+		break;
+	default:
+		BUG();
+	}
+
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+EXPORT_SYMBOL(io_group_async_queue_prio);
+
+void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		iog->async_queue[0][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_BE:
+		iog->async_queue[1][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		iog->async_idle_queue = ioq;
+		break;
+	default:
+		BUG();
+	}
+
+	/*
+	 * Take the group reference and pin the queue. Group exit will
+	 * clean it up
+	 */
+	elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(io_group_set_async_queue);
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+static void elv_slab_kill(void)
+{
+	/*
+	 * Caller already ensured that pending RCU callbacks are completed,
+	 * so we should have no busy allocations at this point.
+	 */
+	if (elv_ioq_pool)
+		kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+	elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+	if (!elv_ioq_pool)
+		goto fail;
+
+	return 0;
+fail:
+	elv_slab_kill();
+	return -ENOMEM;
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	iog = io_alloc_root_group(q, e, efqd);
+	if (iog == NULL)
+		return 1;
+
+	efqd->root_group = iog;
+	efqd->queue = q;
+
+	init_timer(&efqd->idle_slice_timer);
+	efqd->idle_slice_timer.function = elv_idle_slice_timer;
+	efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+	INIT_LIST_HEAD(&efqd->idle_list);
+
+	efqd->elv_slice[0] = elv_slice_async;
+	efqd->elv_slice[1] = elv_slice_sync;
+	efqd->elv_slice_idle = elv_slice_idle;
+	efqd->hw_tag = 1;
+
+	return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask the elevator to clean up its queues, we do the cleanup here so
+ * that all the group and idle tree references to the ioq are dropped.
+ * Later, during elevator cleanup, the ioc reference will be dropped,
+ * which will lead to removal of the io scheduler queue as well as the
+ * associated ioq object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+	struct request_queue *q = efqd->queue;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+
+	spin_lock_irq(q->queue_lock);
+	/* This should drop all the idle tree references of ioq */
+	elv_free_idle_ioq_list(e);
+	spin_unlock_irq(q->queue_lock);
+
+	elv_shutdown_timer_wq(e);
+
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+	io_free_root_group(e);
+}
+
+/*
+ * This is called after the io scheduler has cleaned up its data structures.
+ * I don't think this function is required. Right now it is kept only
+ * because cfq cleans up the timer and work queue again after freeing up
+ * io contexts. To me the io scheduler has already been drained out, and
+ * all the active queues have already been expired, so the timer and work
+ * queue should not have been activated during the cleanup process.
+ *
+ * Keeping it here for the time being. Will get rid of it later.
+ */
+void elv_exit_fq_data_post(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+}
+
+static int __init elv_fq_init(void)
+{
+	if (elv_slab_setup())
+		return -ENOMEM;
+
+	/* could be 0 on HZ < 1000 setups */
+
+	if (!elv_slice_async)
+		elv_slice_async = 1;
+
+	if (elv_slice_idle == 0)
+		elv_slice_idle = 1;
+
+	return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..b5a0d08
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,479 @@
+/*
+ * BFQ: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#include <linux/blkdev.h>
+
+#ifndef _BFQ_SCHED_H
+#define _BFQ_SCHED_H
+
+#define IO_IOPRIO_CLASSES	3
+
+typedef u64 bfq_timestamp_t;
+typedef unsigned long bfq_weight_t;
+typedef unsigned long bfq_service_t;
+struct io_entity;
+struct io_queue;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/**
+ * struct io_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * io_service_tree.  All the fields are protected by the queue lock
+ * of the containing efqd.
+ */
+struct io_service_tree {
+	struct rb_root active;
+	struct rb_root idle;
+
+	struct io_entity *first_idle;
+	struct io_entity *last_idle;
+
+	bfq_timestamp_t vtime;
+	bfq_weight_t wsum;
+};
+
+/**
+ * struct io_sched_data - multi-class scheduler.
+ * @active_entity: entity under service.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * io_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue in a hierarchical setup.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority classes are served before all the
+ * requests from lower priority classes; within the same class, queues
+ * are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io_sched_data {
+	struct io_entity *active_entity;
+	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+/**
+ * struct io_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ *         the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ *             this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
+ *                 associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ *                    ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested an ioprio or
+ *                  ioprio_class change.
+ *
+ * An io_entity is used to represent either an io_queue (leaf node in the
+ * cgroup hierarchy) or an io_group in the upper level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities also have their own sched_data, stored
+ * in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would allow
+ * different weights on different devices, but this functionality is not
+ * yet exported to userspace.  Priorities are updated lazily, first
+ * storing the new values into the new_* fields, then setting the
+ * @ioprio_changed flag.  As soon as there is a transition in the entity
+ * state that allows the priority update to take place the effective and
+ * the requested priority values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.  When dealing with ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time to consume their budget and
+ * have true sequential behavior, and when there are no external factors
+ * breaking anticipation) the relative weights at each level of the
+ * cgroups hierarchy should be guaranteed.
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io_entity {
+	struct rb_node rb_node;
+
+	int on_st;
+
+	bfq_timestamp_t finish;
+	bfq_timestamp_t start;
+
+	struct rb_root *tree;
+
+	bfq_timestamp_t min_start;
+
+	bfq_service_t service, budget;
+	bfq_weight_t weight;
+
+	struct io_entity *parent;
+
+	struct io_sched_data *my_sched_data;
+	struct io_sched_data *sched_data;
+
+	unsigned short ioprio, new_ioprio;
+	unsigned short ioprio_class, new_ioprio_class;
+
+	int ioprio_changed;
+};
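+
+/*
+ * Timestamp example (ignoring the internal fixed-point scaling): an
+ * entity of weight 4 activated at virtual time V = 1000 with a budget
+ * of 80 gets start S = 1000 and finish F = S + budget/weight = 1020;
+ * among the eligible entities (those with S <= V) the one with the
+ * smallest F is served first.
+ */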
+
+/*
+ * A common structure embedded by every io scheduler into their respective
+ * queue structure.
+ */
+struct io_queue {
+	struct io_entity entity;
+	atomic_t ref;
+	unsigned int flags;
+
+	/* Pointer to generic elevator data structure */
+	struct elv_fq_data *efqd;
+	struct list_head queue_list;
+	pid_t pid;
+
+	/* Number of requests queued on this io queue */
+	unsigned long nr_queued;
+
+	/* Requests dispatched from this queue */
+	int dispatched;
+
+	/* Keep track of the think time of processes in this queue */
+	unsigned long last_end_request;
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+
+	unsigned long slice_end;
+
+	/* Pointer to io scheduler's queue */
+	void *sched_queue;
+
+	/*
+	 * Keeps track of the total slice time assigned to this queue,
+	 * for debugging purposes.
+	 */
+	unsigned long total_service;
+};
+
+struct io_group {
+	struct io_sched_data sched_data;
+
+	/* async_queue and idle_queue are used only for cfq */
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+};
+
+struct elv_fq_data {
+	struct io_group *root_group;
+
+	/* List of io queues on idle tree. */
+	struct list_head idle_list;
+
+	struct request_queue *queue;
+	unsigned int busy_queues;
+	/*
+	 * Used to track any pending rt requests, so we can pre-empt the
+	 * current non-RT queue in service when this value is non-zero.
+	 */
+	unsigned int busy_rt_queues;
+
+	/* Number of requests queued */
+	int rq_queued;
+
+	/* Pointer to the ioscheduler queue being served */
+	void *active_queue;
+
+	int rq_in_driver;
+	int hw_tag;
+	int hw_tag_samples;
+	int rq_in_driver_peak;
+
+	/*
+	 * The elevator fair queuing layer can idle on a queue to ensure
+	 * fairness for processes doing dependent reads, e.g., between two
+	 * processes doing synchronous reads in two different cgroups.
+	 * noop and deadline have no notion of anticipation/idling of their
+	 * own; as of now, they are the users of this functionality.
+	 */
+	unsigned int elv_slice_idle;
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
+	unsigned int elv_slice[2];
+};
+
+extern int elv_slice_idle;
+extern int elv_slice_async;
+
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "%d" fmt, (ioq)->pid, ##args)
+
+#define elv_log(efqd, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "" fmt, ##args)
+
+#define ioq_sample_valid(samples)   ((samples) > 80)
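+
+/*
+ * With the 7/8 moving average used for think times, ttime_samples
+ * crosses 80 after the third update, so think-time heuristics only kick
+ * in once a queue has built up at least a short history.
+ */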
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_busy = 0,          /* has requests or is under service */
+	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+	ELV_QUEUE_FLAG_idle_window,	  /* elevator slice idling enabled */
+	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
+	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
+	ELV_QUEUE_FLAG_NR,
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name)					\
+static inline void elv_mark_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline int elv_ioq_##name(struct io_queue *ioq)         		\
+{                                                                       \
+	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
+}
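+
+/*
+ * For example, ELV_IO_QUEUE_FLAG_FNS(busy) below expands into
+ * elv_mark_ioq_busy(), elv_clear_ioq_busy() and elv_ioq_busy(), which
+ * set, clear and test ELV_QUEUE_FLAG_busy respectively.
+ */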
+
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
+
+static inline struct io_service_tree *
+io_entity_service_tree(struct io_entity *entity)
+{
+	struct io_sched_data *sched_data = entity->sched_data;
+	unsigned int idx = entity->ioprio_class - 1;
+
+	BUG_ON(idx >= IO_IOPRIO_CLASSES);
+	BUG_ON(sched_data == NULL);
+
+	return sched_data->service_tree + idx;
+}
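+
+/*
+ * With the standard kernel ioprio class values (IOPRIO_CLASS_RT = 1,
+ * IOPRIO_CLASS_BE = 2, IOPRIO_CLASS_IDLE = 3), the RT, BE and IDLE
+ * classes map to service_tree[0], [1] and [2] respectively.
+ */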
+
+/* A request got dispatched from the io_queue. Do the accounting. */
+static inline void elv_ioq_request_dispatched(struct io_queue *ioq)
+{
+	ioq->dispatched++;
+}
+
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+	if (elv_ioq_slice_new(ioq))
+		return 0;
+	if (time_before(jiffies, ioq->slice_end))
+		return 0;
+
+	return 1;
+}
+
+/* How many request are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+	return ioq->dispatched;
+}
+
+static inline pid_t elv_ioq_pid(struct io_queue *ioq)
+{
+	return ioq->pid;
+}
+
+static inline unsigned long elv_ioq_ttime_mean(struct io_queue *ioq)
+{
+	return ioq->ttime_mean;
+}
+
+static inline unsigned long elv_ioq_sample_valid(struct io_queue *ioq)
+{
+	return ioq_sample_valid(ioq->ttime_samples);
+}
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+	atomic_inc(&ioq->ref);
+}
+
+static inline void elv_ioq_set_slice_end(struct io_queue *ioq,
+						unsigned long slice_end)
+{
+	ioq->slice_end = slice_end;
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+						int ioprio_class)
+{
+	ioq->entity.new_ioprio_class = ioprio_class;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+	ioq->entity.new_ioprio = ioprio;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq)
+{
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+
+static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return container_of(ioq->entity.sched_data, struct io_group,
+						sched_data);
+}
+
+/* Functions used by blksysfs.c */
+extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+						size_t count);
+
+/* Functions used by elevator.c */
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+extern void elv_exit_fq_data_post(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+					struct request *rq);
+extern void elv_fq_dispatched_request(struct elevator_queue *e,
+					struct request *rq);
+
+extern void elv_fq_activate_rq(struct request_queue *q, struct request *rq);
+extern void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+				struct request *rq);
+
+extern void *elv_fq_select_ioq(struct request_queue *q, int force);
+extern struct io_queue *rq_ioq(struct request *rq);
+
+/* Functions used by io schedulers */
+extern void elv_put_ioq(struct io_queue *ioq);
+extern void __elv_ioq_slice_expired(struct request_queue *q,
+					struct io_queue *ioq, int timed_out);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+		void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern int elv_hw_tag(struct elevator_queue *e);
+extern void *elv_active_sched_queue(struct elevator_queue *e);
+extern int elv_mod_idle_slice_timer(struct elevator_queue *eq,
+					unsigned long expires);
+extern int elv_del_idle_slice_timer(struct elevator_queue *eq);
+extern unsigned int elv_get_slice_idle(struct elevator_queue *eq);
+extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio);
+extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq);
+extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern int elv_nr_busy_ioq(struct elevator_queue *e);
+extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_init_fq_data(struct request_queue *q,
+					struct elevator_queue *e)
+{
+	return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+static inline void elv_exit_fq_data_post(struct elevator_queue *e) {}
+
+static inline void elv_fq_activate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_deactivate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_dispatched_request(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_removed(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_add(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_ioq_completed_request(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline struct io_queue *rq_ioq(struct request *rq) { return NULL; }
+static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 98259ed..7a3a7e9 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -231,6 +231,9 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
 	for (i = 0; i < ELV_HASH_ENTRIES; i++)
 		INIT_HLIST_HEAD(&eq->hash[i]);
 
+	if (elv_init_fq_data(q, eq))
+		goto err;
+
 	return eq;
 err:
 	kfree(eq);
@@ -301,9 +304,11 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
 	e->ops = NULL;
+	elv_exit_fq_data_post(e);
 	mutex_unlock(&e->sysfs_lock);
 
 	kobject_put(&e->kobj);
@@ -314,6 +319,8 @@ static void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_activate_rq(q, rq);
+
 	if (e->ops->elevator_activate_req_fn)
 		e->ops->elevator_activate_req_fn(q, rq);
 }
@@ -322,6 +329,8 @@ static void elv_deactivate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_deactivate_rq(q, rq);
+
 	if (e->ops->elevator_deactivate_req_fn)
 		e->ops->elevator_deactivate_req_fn(q, rq);
 }
@@ -446,6 +455,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	boundary = q->end_sector;
 	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -486,6 +496,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	q->end_sector = rq_end_sector(rq);
 	q->boundary_rq = rq;
@@ -553,6 +564,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	elv_rqhash_del(q, next);
 
 	q->nr_sorted--;
+	elv_ioq_request_removed(e, next);
 	q->last_merge = rq;
 }
 
@@ -632,12 +644,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 				q->last_merge = rq;
 		}
 
-		/*
-		 * Some ioscheds (cfq) run q->request_fn directly, so
-		 * rq cannot be accessed after calling
-		 * elevator_add_req_fn.
-		 */
 		q->elevator->ops->elevator_add_req_fn(q, rq);
+		elv_ioq_request_add(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_REQUEUE:
@@ -847,13 +855,12 @@ void elv_dequeue_request(struct request_queue *q, struct request *rq)
 
 int elv_queue_empty(struct request_queue *q)
 {
-	struct elevator_queue *e = q->elevator;
-
 	if (!list_empty(&q->queue_head))
 		return 0;
 
-	if (e->ops->elevator_queue_empty_fn)
-		return e->ops->elevator_queue_empty_fn(q);
+	/* Hopefully nr_sorted works and there is no need to call queue_empty_fn */
+	if (q->nr_sorted)
+		return 0;
 
 	return 1;
 }
@@ -928,8 +935,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight--;
-		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
-			e->ops->elevator_completed_req_fn(q, rq);
+		if (blk_sorted_rq(rq)) {
+			if (e->ops->elevator_completed_req_fn)
+				e->ops->elevator_completed_req_fn(q, rq);
+			elv_ioq_completed_request(q, rq);
+		}
 	}
 
 	/*
@@ -1228,3 +1238,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 	return NULL;
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq */
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+	return ioq_sched_queue(rq_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+	return ioq_sched_queue(elv_fq_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 465d6ba..cf02216 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -234,6 +234,11 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* io queue request belongs to */
+	struct io_queue *ioq;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 7a20425..6f2dea5 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -2,6 +2,7 @@
 #define _LINUX_ELEVATOR_H
 
 #include <linux/percpu.h>
+#include "../../block/elevator-fq.h"
 
 #ifdef CONFIG_BLOCK
 
@@ -29,6 +30,16 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+						struct request*);
+typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
+						struct request*);
+#endif
 
 struct elevator_ops
 {
@@ -56,6 +67,16 @@ struct elevator_ops
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+	elevator_should_preempt_fn *elevator_should_preempt_fn;
+	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+#endif
 };
 
 #define ELV_NAME_MAX	(16)
@@ -76,6 +97,9 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	int elevator_features;
+#endif
 };
 
 /*
@@ -89,6 +113,10 @@ struct elevator_queue
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
 	struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* fair queuing data */
+	struct elv_fq_data efqd;
+#endif
 };
 
 /*
@@ -208,5 +236,25 @@ enum {
 	__val;							\
 })
 
+/* an iosched can let the elevator know its feature set/capability */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fq logic of elevator layer */
+#define	ELV_IOSCHED_NEED_FQ	1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 03/10] Modify cfq to make use of flat elevator fair queuing
       [not found] ` <1236823015-4183-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-03-12  1:56     ` Vivek Goyal
  2009-03-12  1:56   ` [PATCH 02/10] Common flat fair queuing code in elevator layer Vivek Goyal
@ 2009-03-12  1:56   ` Vivek Goyal
  2009-03-12  1:56     ` Vivek Goyal
                     ` (10 subsequent siblings)
  13 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, menage-hpIqsD4AKlfQT0dZR+AlfA

This patch changes cfq to use the fair queuing code from the elevator layer.

o must_dispatch logic sounds like dead code. Nobody seems to be making
  use of that flag. Retaining it for the time being.
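
As an illustration, below is a minimal sketch of how an io scheduler is
expected to plug into the elevator fair queuing layer. Only the symbols
introduced by this series (ELV_IOSCHED_NEED_FQ, elv_select_sched_queue())
and the existing elevator/dispatch API are real; the "foo" scheduler, its
queue structure and its fifo are hypothetical placeholders, not code from
these patches.

#include <linux/module.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>

/* Hypothetical per-process queue of the "foo" scheduler. */
struct foo_queue {
	struct io_queue *ioq;		/* handle into the common fq layer */
	struct list_head fifo;		/* pending requests, FIFO order */
};

static int foo_dispatch_requests(struct request_queue *q, int force)
{
	/* the common layer decides which queue may dispatch (fairness) */
	struct foo_queue *fooq = elv_select_sched_queue(q, force);
	struct request *rq;

	if (!fooq || list_empty(&fooq->fifo))
		return 0;

	rq = list_entry(fooq->fifo.next, struct request, queuelist);
	list_del_init(&rq->queuelist);
	elv_dispatch_sort(q, rq);	/* fq dispatch accounting is done here */
	return 1;
}

static struct elevator_type iosched_foo = {
	.ops = {
		.elevator_dispatch_fn	= foo_dispatch_requests,
	},
	.elevator_name	= "foo",
	.elevator_owner	= THIS_MODULE,
#ifdef CONFIG_ELV_FAIR_QUEUING
	/* ask the elevator layer to do fair queuing on our behalf */
	.elevator_features	= ELV_IOSCHED_NEED_FQ,
#endif
};

The idea is that the queue selection and idling decisions which used to
live in cfq_select_queue() now sit behind elv_select_sched_queue(), so
other schedulers get them without duplicating the code.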

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched |    3 +-
 block/cfq-iosched.c   | 1082 +++++++++++--------------------------------------
 2 files changed, 232 insertions(+), 853 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
 menu "IO Schedulers"
 
 config ELV_FAIR_QUEUING
-	bool "Elevator Fair Queuing Support"
+	bool
 	default n
 	---help---
 	  Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
+	select ELV_FAIR_QUEUING
 	default y
 	---help---
 	  The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 664ebfd..5b41b54 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,7 +12,6 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
-
 /*
  * tunables
  */
@@ -23,15 +22,7 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 static const int cfq_back_max = 16 * 1024;
 /* penalty of a backwards seek */
 static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
-static int cfq_slice_idle = HZ / 125;
-
-/*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY		(HZ / 5)
 
 /*
  * below this threshold, we consider thinktime immediate
@@ -43,7 +34,7 @@ static int cfq_slice_idle = HZ / 125;
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq)	(struct cfq_queue *) (ioq_sched_queue((rq)->ioq))
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +44,6 @@ static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
-#define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
 
 #define ASYNC			(0)
 #define SYNC			(1)
@@ -77,45 +66,16 @@ struct cfq_rb_root {
  * Per block device queue structure
  */
 struct cfq_data {
-	struct request_queue *queue;
-
-	/*
-	 * rr list of queues with requests and the count of them
-	 */
-	struct cfq_rb_root service_tree;
-	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
+	struct request_queue *q;
 
-	int rq_in_driver;
 	int sync_flight;
 
 	/*
-	 * queue-depth detection
-	 */
-	int rq_queued;
-	int hw_tag;
-	int hw_tag_samples;
-	int rq_in_driver_peak;
-
-	/*
 	 * idle window management
 	 */
-	struct timer_list idle_slice_timer;
-	struct work_struct unplug_work;
 
-	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 	unsigned long last_end_request;
 
@@ -126,9 +86,7 @@ struct cfq_data {
 	unsigned int cfq_fifo_expire[2];
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
-	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
-	unsigned int cfq_slice_idle;
 
 	struct list_head cic_list;
 };
@@ -137,16 +95,11 @@ struct cfq_data {
  * Per process-grouping structure
  */
 struct cfq_queue {
-	/* reference count */
-	atomic_t ref;
+	struct io_queue *ioq;
 	/* various state flags, see below */
 	unsigned int flags;
 	/* parent cfq_data */
 	struct cfq_data *cfqd;
-	/* service_tree member */
-	struct rb_node rb_node;
-	/* service_tree key */
-	unsigned long rb_key;
 	/* sorted list of pending requests */
 	struct rb_root sort_list;
 	/* if fifo isn't expired, next request to serve */
@@ -158,33 +111,23 @@ struct cfq_queue {
 	/* fifo list of requests in sort_list */
 	struct list_head fifo;
 
-	unsigned long slice_end;
-	long slice_resid;
-
 	/* pending metadata requests */
 	int meta_pending;
-	/* number of requests that are on the dispatch list or inside driver */
-	int dispatched;
 
 	/* io prio of this group */
-	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class, org_ioprio_class;
+	unsigned short org_ioprio;
+	unsigned short org_ioprio_class;
 
 	pid_t pid;
 };
 
 enum cfqq_state_flags {
-	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
-	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
 	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
-	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
-	CFQ_CFQQ_FLAG_must_dispatch,	/* must dispatch, even if expired */
-	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
-	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
-	CFQ_CFQQ_FLAG_prio_changed,	/* task priority has changed */
-	CFQ_CFQQ_FLAG_queue_new,	/* queue never been serviced */
-	CFQ_CFQQ_FLAG_slice_new,	/* no requests dispatched in slice */
-	CFQ_CFQQ_FLAG_sync,		/* synchronous queue */
+	CFQ_CFQQ_FLAG_must_alloc_slice, /* per-slice must_alloc flag */
+	CFQ_CFQQ_FLAG_must_dispatch,    /* must dispatch, even if expired */
+	CFQ_CFQQ_FLAG_fifo_expire,      /* FIFO checked in this slice */
+	CFQ_CFQQ_FLAG_prio_changed,     /* task priority has changed */
+	CFQ_CFQQ_FLAG_queue_new,        /* queue never been serviced */
 };
 
 #define CFQ_CFQQ_FNS(name)						\
@@ -199,116 +142,78 @@ static inline void cfq_clear_cfqq_##name(struct cfq_queue *cfqq)	\
 static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 {									\
 	return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0;	\
-}
+}									\
 
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
 CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(must_dispatch);
 CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
 CFQ_CFQQ_FNS(queue_new);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
 #undef CFQ_CFQQ_FNS
 
 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...)	\
-	blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args)
+	blk_add_trace_msg((cfqd)->q, "cfq%d " fmt, elv_ioq_pid(cfqq->ioq), \
+					##args)
 #define cfq_log(cfqd, fmt, args...)	\
-	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
+	blk_add_trace_msg((cfqd)->q, "cfq " fmt, ##args)
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
 static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
-				       struct io_context *, gfp_t);
+						struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 
-static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
-					    int is_sync)
-{
-	return cic->cfqq[!!is_sync];
-}
-
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
-				struct cfq_queue *cfqq, int is_sync)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
 {
-	cic->cfqq[!!is_sync] = cfqq;
+	return ioq_to_io_group(cfqq->ioq);
 }
 
-/*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
- */
-static inline int cfq_bio_sync(struct bio *bio)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
 {
-	if (bio_data_dir(bio) == READ || bio_sync(bio))
-		return 1;
-
-	return 0;
+	return elv_ioq_class_idle(cfqq->ioq);
 }
 
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline int cfq_class_rt(struct cfq_queue *cfqq)
 {
-	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
-	}
+	return elv_ioq_class_rt(cfqq->ioq);
 }
 
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	return !cfqd->busy_queues;
+	return elv_ioq_sync(cfqq->ioq);
 }
 
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
-				 unsigned short prio)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
 {
-	const int base_slice = cfqd->cfq_slice[sync];
-
-	WARN_ON(prio >= IOPRIO_BE_NR);
+	struct cfq_data *cfqd = cfqq->cfqd;
+	struct elevator_queue *e = cfqd->q->elevator;
 
-	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
+	return (elv_active_sched_queue(e) == cfqq);
 }
 
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
+					    int is_sync)
 {
-	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
+	return cic->cfqq[!!is_sync];
 }
 
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+				struct cfq_queue *cfqq, int is_sync)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
-	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
+	cic->cfqq[!!is_sync] = cfqq;
 }
 
 /*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
+ * We regard a request as SYNC, if it's either a read or has the SYNC bit
+ * set (in which case it could also be direct WRITE).
  */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
+static inline int cfq_bio_sync(struct bio *bio)
 {
-	if (cfq_cfqq_slice_new(cfqq))
-		return 0;
-	if (time_before(jiffies, cfqq->slice_end))
-		return 0;
+	if (bio_data_dir(bio) == READ || bio_sync(bio))
+		return 1;
 
-	return 1;
+	return 0;
 }
 
 /*
@@ -406,32 +311,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 	}
 }
 
-/*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
-	if (!root->left)
-		root->left = rb_first(&root->rb);
-
-	if (root->left)
-		return rb_entry(root->left, struct cfq_queue, rb_node);
-
-	return NULL;
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
-	if (root->left == n)
-		root->left = NULL;
-
-	rb_erase(n, &root->rb);
-	RB_CLEAR_NODE(n);
-}
-
-/*
- * would be nice to take fifo expire time into account as well
- */
 static struct request *
 cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		  struct request *last)
@@ -442,10 +321,10 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	BUG_ON(RB_EMPTY_NODE(&last->rb_node));
 
-	if (rbprev)
+	if (rbprev != NULL)
 		prev = rb_entry_rq(rbprev);
 
-	if (rbnext)
+	if (rbnext != NULL)
 		next = rb_entry_rq(rbnext);
 	else {
 		rbnext = rb_first(&cfqq->sort_list);
@@ -456,140 +335,25 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	return cfq_choose_req(cfqd, next, prev);
 }
 
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
-				      struct cfq_queue *cfqq)
+/* An active ioq has been reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q)
 {
-	/*
-	 * just an approximation, should be ok.
-	 */
-	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd,
-				    struct cfq_queue *cfqq, int add_front)
-{
-	struct rb_node **p, *parent;
-	struct cfq_queue *__cfqq;
-	unsigned long rb_key;
-	int left;
-
-	if (cfq_class_idle(cfqq)) {
-		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
-		if (parent && parent != &cfqq->rb_node) {
-			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-			rb_key += __cfqq->rb_key;
-		} else
-			rb_key += jiffies;
-	} else if (!add_front) {
-		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
-		rb_key += cfqq->slice_resid;
-		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
-		/*
-		 * same position, nothing more to do
-		 */
-		if (rb_key == cfqq->rb_key)
-			return;
-
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	}
-
-	left = 1;
-	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
-	while (*p) {
-		struct rb_node **n;
-
-		parent = *p;
-		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
-		/*
-		 * sort RT queues first, we always want to give
-		 * preference to them. IDLE queues goes to the back.
-		 * after that, sort on the next service time.
-		 */
-		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			n = &(*p)->rb_right;
-		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
-			n = &(*p)->rb_right;
-		else if (rb_key < __cfqq->rb_key)
-			n = &(*p)->rb_left;
-		else
-			n = &(*p)->rb_right;
-
-		if (n == &(*p)->rb_right)
-			left = 0;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-		p = n;
+	if (cfqd->active_cic) {
+		put_io_context(cfqd->active_cic->ioc);
+		cfqd->active_cic = NULL;
 	}
-
-	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
-
-	cfqq->rb_key = rb_key;
-	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
 }
 
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue)
 {
-	/*
-	 * Resorting requires the cfqq to be on the RR list already.
-	 */
-	if (cfq_cfqq_on_rr(cfqq))
-		cfq_service_tree_add(cfqd, cfqq, 0);
-}
-
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
-	cfq_mark_cfqq_on_rr(cfqq);
-	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
-
-	cfq_resort_rr_list(cfqd, cfqq);
-}
-
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-	cfq_clear_cfqq_on_rr(cfqq);
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+	struct cfq_queue *cfqq = sched_queue;
 
-	BUG_ON(!cfqd->busy_queues);
-	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
+	cfq_mark_cfqq_must_alloc(cfqq);
+	cfq_clear_cfqq_fifo_expire(cfqq);
+	cfq_clear_cfqq_queue_new(cfqq);
 }
 
 /*
@@ -598,22 +362,19 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_del_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
-	struct cfq_data *cfqd = cfqq->cfqd;
 	const int sync = rq_is_sync(rq);
 
 	BUG_ON(!cfqq->queued[sync]);
 	cfqq->queued[sync]--;
 
 	elv_rb_del(&cfqq->sort_list, rq);
-
-	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
-		cfq_del_cfqq_rr(cfqd, cfqq);
 }
 
 static void cfq_add_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 	struct cfq_data *cfqd = cfqq->cfqd;
+	struct request_queue *q = cfqd->q;
 	struct request *__alias;
 
 	cfqq->queued[rq_is_sync(rq)]++;
@@ -623,10 +384,7 @@ static void cfq_add_rq_rb(struct request *rq)
 	 * if that happens, put the alias on the dispatch list
 	 */
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
-		cfq_dispatch_insert(cfqd->queue, __alias);
-
-	if (!cfq_cfqq_on_rr(cfqq))
-		cfq_add_cfqq_rr(cfqd, cfqq);
+		cfq_dispatch_insert(q, __alias);
 
 	/*
 	 * check if this request is a better next-serve candidate
@@ -667,23 +425,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
-
 	cfqd->last_position = rq->hard_sector + rq->hard_nr_sectors;
 }
 
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
-}
-
 static void cfq_remove_request(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -694,7 +438,6 @@ static void cfq_remove_request(struct request *rq)
 	list_del_init(&rq->queuelist);
 	cfq_del_rq_rb(rq);
 
-	cfqq->cfqd->rq_queued--;
 	if (rq_is_meta(rq)) {
 		WARN_ON(!cfqq->meta_pending);
 		cfqq->meta_pending--;
@@ -768,85 +511,23 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	return 0;
 }
 
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
-				   struct cfq_queue *cfqq)
-{
-	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active");
-		cfqq->slice_end = 0;
-		cfq_clear_cfqq_must_alloc_slice(cfqq);
-		cfq_clear_cfqq_fifo_expire(cfqq);
-		cfq_mark_cfqq_slice_new(cfqq);
-		cfq_clear_cfqq_queue_new(cfqq);
-	}
-
-	cfqd->active_queue = cfqq;
-}
-
 /*
  * current cfqq expired its slice (or was too idle), select new one
  */
 static void
 __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int timed_out)
+				int budget_update)
 {
-	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
-	if (cfq_cfqq_wait_request(cfqq))
-		del_timer(&cfqd->idle_slice_timer);
-
 	cfq_clear_cfqq_must_dispatch(cfqq);
-	cfq_clear_cfqq_wait_request(cfqq);
-
-	/*
-	 * store what was left of this slice, if the queue idled/timed out
-	 */
-	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
-		cfqq->slice_resid = cfqq->slice_end - jiffies;
-		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
-	}
-
-	cfq_resort_rr_list(cfqd, cfqq);
-
-	if (cfqq == cfqd->active_queue)
-		cfqd->active_queue = NULL;
-
-	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
-		cfqd->active_cic = NULL;
-	}
+	__elv_ioq_slice_expired(cfqd->q, cfqq->ioq, budget_update);
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd, int budget_update)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->q->elevator);
 
 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
-
-	return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq;
-
-	cfqq = cfq_get_next_queue(cfqd);
-	__cfq_set_active_queue(cfqd, cfqq);
-	return cfqq;
+		__cfq_slice_expired(cfqd, cfqq, budget_update);
 }
 
 static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -881,35 +562,15 @@ static int cfq_close_cooperator(struct cfq_data *cfq_data,
 
 #define CIC_SEEKY(cic) ((cic)->seek_mean > (8 * 1024))
 
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_io_context *cic;
 	unsigned long sl;
 
-	/*
-	 * SSD device without seek penalty, disable idling. But only do so
-	 * for devices that support queuing, otherwise we still have a problem
-	 * with sync vs async workloads.
-	 */
-	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
-		return;
-
 	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
-	WARN_ON(cfq_cfqq_slice_new(cfqq));
-
-	/*
-	 * idle is disabled, either manually or by past process history
-	 */
-	if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
-		return;
-
-	/*
-	 * still requests with the driver, don't idle
-	 */
-	if (cfqd->rq_in_driver)
-		return;
-
+	WARN_ON(elv_ioq_slice_new(cfqq->ioq));
 	/*
 	 * task has exited, don't wait
 	 */
@@ -921,22 +582,23 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	 * See if this prio level has a good candidate
 	 */
 	if (cfq_close_cooperator(cfqd, cfqq) &&
-	    (sample_valid(cic->ttime_samples) && cic->ttime_mean > 2))
+	    (elv_ioq_sample_valid(cfqq->ioq) &&
+	    elv_ioq_ttime_mean(cfqq->ioq) > 2))
 		return;
 
 	cfq_mark_cfqq_must_dispatch(cfqq);
-	cfq_mark_cfqq_wait_request(cfqq);
+	elv_mark_ioq_wait_request(cfqq->ioq);
 
 	/*
 	 * we don't want to idle for seeks, but we do want to allow
 	 * fair distribution of slice time for a process doing back-to-back
 	 * seeks. so allow a little bit of time for him to submit a new rq
 	 */
-	sl = cfqd->cfq_slice_idle;
+	sl = elv_get_slice_idle(q->elevator);
 	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
 
-	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+	elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
 	cfq_log(cfqd, "arm_idle: %lu", sl);
 }
 
@@ -945,13 +607,12 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
  */
 static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
 	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
 
 	cfq_remove_request(rq);
-	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
 
 	if (cfq_cfqq_sync(cfqq))
@@ -989,68 +650,11 @@ static inline int
 cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	const int base_rq = cfqd->cfq_slice_async_rq;
+	unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
 
-	WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
+	WARN_ON(ioprio >= IOPRIO_BE_NR);
 
-	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq;
-
-	cfqq = cfqd->active_queue;
-	if (!cfqq)
-		goto new_queue;
-
-	/*
-	 * The active queue has run out of time, expire it and select new.
-	 */
-	if (cfq_slice_used(cfqq))
-		goto expire;
-
-	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
-	 * The active queue has requests and isn't expired, allow it to
-	 * dispatch.
-	 */
-	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-		goto keep_queue;
-
-	/*
-	 * No requests pending. If the active queue still has requests in
-	 * flight or is idling for a new request, allow either of these
-	 * conditions to happen (or time out) before selecting a new queue.
-	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
-	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
-		cfqq = NULL;
-		goto keep_queue;
-	}
-
-expire:
-	cfq_slice_expired(cfqd, 0);
-new_queue:
-	cfqq = cfq_set_active_queue(cfqd);
-keep_queue:
-	return cfqq;
+	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
 }
 
 /*
@@ -1062,6 +666,7 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 			int max_dispatch)
 {
 	int dispatched = 0;
+	struct request_queue *q = cfqd->q;
 
 	BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
 
@@ -1078,7 +683,7 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		/*
 		 * finally, insert request into driver dispatch list
 		 */
-		cfq_dispatch_insert(cfqd->queue, rq);
+		cfq_dispatch_insert(q, rq);
 
 		dispatched++;
 
@@ -1094,7 +699,7 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		 * If there is a non-empty RT cfqq waiting for current
 		 * cfqq's timeslice to complete, pre-empt this cfqq
 		 */
-		if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues)
+		if (!cfq_class_rt(cfqq) && elv_nr_busy_rt_ioq(q->elevator))
 			break;
 
 	} while (dispatched < max_dispatch);
@@ -1103,11 +708,12 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+
+	if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    dispatched >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
-		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0);
+		elv_ioq_set_slice_end(cfqq->ioq, jiffies + 1);
+		cfq_slice_expired(cfqd, 1);
 	}
 
 	return dispatched;
@@ -1118,7 +724,7 @@ static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
 	int dispatched = 0;
 
 	while (cfqq->next_rq) {
-		cfq_dispatch_insert(cfqq->cfqd->queue, cfqq->next_rq);
+		cfq_dispatch_insert(cfqq->cfqd->q, cfqq->next_rq);
 		dispatched++;
 	}
 
@@ -1135,12 +741,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = elv_select_sched_queue(cfqd->q, 1)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
+	/* This is probably redundant now. The above loop should make sure
+	 * that all the busy queues have expired. */
 	cfq_slice_expired(cfqd, 0);
 
-	BUG_ON(cfqd->busy_queues);
+	BUG_ON(elv_nr_busy_ioq(cfqd->q->elevator));
 
 	cfq_log(cfqd, "forced_dispatch=%d\n", dispatched);
 	return dispatched;
@@ -1152,29 +760,27 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	struct cfq_queue *cfqq;
 	int dispatched;
 
-	if (!cfqd->busy_queues)
-		return 0;
-
 	if (unlikely(force))
 		return cfq_forced_dispatch(cfqd);
 
 	dispatched = 0;
-	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
+	while ((cfqq = elv_select_sched_queue(q, 0)) != NULL) {
 		int max_dispatch;
 
 		max_dispatch = cfqd->cfq_quantum;
 		if (cfq_class_idle(cfqq))
 			max_dispatch = 1;
 
-		if (cfqq->dispatched >= max_dispatch && cfqd->busy_queues > 1)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch &&
+		    elv_nr_busy_ioq(q->elevator) > 1)
 			break;
 
 		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
 			break;
 
 		cfq_clear_cfqq_must_dispatch(cfqq);
-		cfq_clear_cfqq_wait_request(cfqq);
-		del_timer(&cfqd->idle_slice_timer);
+		elv_clear_ioq_wait_request(cfqq->ioq);
+		elv_del_idle_slice_timer(q->elevator);
 
 		dispatched += __cfq_dispatch_requests(cfqd, cfqq, max_dispatch);
 	}
@@ -1183,34 +789,30 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	return dispatched;
 }
 
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
 {
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
 
-	BUG_ON(atomic_read(&cfqq->ref) <= 0);
-
-	if (!atomic_dec_and_test(&cfqq->ref))
-		return;
+	BUG_ON(!cfqq);
 
-	cfq_log_cfqq(cfqd, cfqq, "put_queue");
+	cfq_log_cfqq(cfqd, cfqq, "free_queue");
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
-	if (unlikely(cfqd->active_queue == cfqq)) {
+	if (unlikely(cfqq_is_active_queue(cfqq))) {
 		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+		elv_schedule_dispatch(cfqd->q);
 	}
 
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+	elv_put_ioq(cfqq->ioq);
+}
+
 /*
  * Must always be called with the rcu_read_lock() held
  */
@@ -1298,9 +900,9 @@ static void cfq_free_io_context(struct io_context *ioc)
 
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	if (unlikely(cfqq == cfqd->active_queue)) {
+	if (unlikely(cfqq == elv_active_sched_queue(cfqd->q->elevator))) {
 		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+		elv_schedule_dispatch(cfqd->q);
 	}
 
 	cfq_put_queue(cfqq);
@@ -1340,7 +942,7 @@ static void cfq_exit_single_io_context(struct io_context *ioc,
 	struct cfq_data *cfqd = cic->key;
 
 	if (cfqd) {
-		struct request_queue *q = cfqd->queue;
+		struct request_queue *q = cfqd->q;
 		unsigned long flags;
 
 		spin_lock_irqsave(q->queue_lock, flags);
@@ -1370,9 +972,10 @@ static struct cfq_io_context *
 cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->q;
 
 	cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
-							cfqd->queue->node);
+							q->node);
 	if (cic) {
 		cic->last_end_request = jiffies;
 		INIT_LIST_HEAD(&cic->queue_list);
@@ -1388,7 +991,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
-	int ioprio_class;
+	int ioprio_class, ioprio;
 
 	if (!cfq_cfqq_prio_changed(cfqq))
 		return;
@@ -1401,30 +1004,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 		/*
 		 * no prio set, inherit CPU scheduling settings
 		 */
-		cfqq->ioprio = task_nice_ioprio(tsk);
-		cfqq->ioprio_class = task_nice_ioclass(tsk);
+		ioprio = task_nice_ioprio(tsk);
+		ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_RT;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
-		cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
-		cfqq->ioprio = 7;
-		cfq_clear_cfqq_idle_window(cfqq);
+		ioprio_class = IOPRIO_CLASS_IDLE;
+		ioprio = 7;
+		elv_clear_ioq_idle_window(cfqq->ioq);
 		break;
 	}
 
+	elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+	elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
 	/*
 	 * keep track of original prio settings in case we have to temporarily
 	 * elevate the priority of this queue
 	 */
-	cfqq->org_ioprio = cfqq->ioprio;
-	cfqq->org_ioprio_class = cfqq->ioprio_class;
+	cfqq->org_ioprio = ioprio;
+	cfqq->org_ioprio_class = ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -1433,11 +1039,12 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	struct cfq_data *cfqd = cic->key;
 	struct cfq_queue *cfqq;
 	unsigned long flags;
+	struct request_queue *q = cfqd->q;
 
 	if (unlikely(!cfqd))
 		return;
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[ASYNC];
 	if (cfqq) {
@@ -1453,7 +1060,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	if (cfqq)
 		cfq_mark_cfqq_prio_changed(cfqq);
 
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
 static void cfq_ioc_set_ioprio(struct io_context *ioc)
@@ -1464,11 +1071,12 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
-		     struct io_context *ioc, gfp_t gfp_mask)
+				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
-
+	struct request_queue *q = cfqd->q;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
 retry:
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
@@ -1476,8 +1084,7 @@ retry:
 
 	if (!cfqq) {
 		if (new_cfqq) {
-			cfqq = new_cfqq;
-			new_cfqq = NULL;
+			goto alloc_ioq;
 		} else if (gfp_mask & __GFP_WAIT) {
 			/*
 			 * Inform the allocator of the fact that we will
@@ -1485,35 +1092,66 @@ retry:
 			 * the allocator to do whatever it needs to attempt to
 			 * free memory.
 			 */
-			spin_unlock_irq(cfqd->queue->queue_lock);
+			spin_unlock_irq(q->queue_lock);
 			new_cfqq = kmem_cache_alloc_node(cfq_pool,
 					gfp_mask | __GFP_NOFAIL | __GFP_ZERO,
-					cfqd->queue->node);
-			spin_lock_irq(cfqd->queue->queue_lock);
+					q->node);
+			spin_lock_irq(q->queue_lock);
 			goto retry;
 		} else {
 			cfqq = kmem_cache_alloc_node(cfq_pool,
 					gfp_mask | __GFP_ZERO,
-					cfqd->queue->node);
+					q->node);
 			if (!cfqq)
 				goto out;
 		}
 
-		RB_CLEAR_NODE(&cfqq->rb_node);
-		INIT_LIST_HEAD(&cfqq->fifo);
+alloc_ioq:
+		if (new_ioq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			cfqq = new_cfqq;
+			new_cfqq = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q,
+					gfp_mask | __GFP_NOFAIL | __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq) {
+				kmem_cache_free(cfq_pool, cfqq);
+				cfqq = NULL;
+				goto out;
+			}
+		}
 
-		atomic_set(&cfqq->ref, 0);
+		/*
+		 * Both cfqq and ioq objects allocated. Do the initializations
+		 * now.
+		 */
+		INIT_LIST_HEAD(&cfqq->fifo);
 		cfqq->cfqd = cfqd;
 
 		cfq_mark_cfqq_prio_changed(cfqq);
 		cfq_mark_cfqq_queue_new(cfqq);
 
+		cfqq->ioq = ioq;
 		cfq_init_prio_data(cfqq, ioc);
+		elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
+					cfqq->org_ioprio, is_sync);
 
 		if (is_sync) {
 			if (!cfq_class_idle(cfqq))
-				cfq_mark_cfqq_idle_window(cfqq);
-			cfq_mark_cfqq_sync(cfqq);
+				elv_mark_ioq_idle_window(cfqq->ioq);
+			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
@@ -1522,38 +1160,28 @@ retry:
 	if (new_cfqq)
 		kmem_cache_free(cfq_pool, new_cfqq);
 
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
 out:
 	WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
 	return cfqq;
 }
 
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
-	switch (ioprio_class) {
-	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
-	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
-	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
-	default:
-		BUG();
-	}
-}
-
 static struct cfq_queue *
 cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-	      gfp_t gfp_mask)
+					gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
-	struct cfq_queue **async_cfqq = NULL;
+	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct io_group *iog = io_lookup_io_group_current(cfqd->q);
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
-		cfqq = *async_cfqq;
+		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
+								ioprio);
+		cfqq = async_cfqq;
 	}
 
 	if (!cfqq) {
@@ -1562,15 +1190,11 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 			return NULL;
 	}
 
-	/*
-	 * pin the queue now that it's allocated, scheduler exit will prune it
-	 */
-	if (!is_sync && !(*async_cfqq)) {
-		atomic_inc(&cfqq->ref);
-		*async_cfqq = cfqq;
-	}
+	if (!is_sync && !async_cfqq)
+		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	atomic_inc(&cfqq->ref);
+	/* ioc reference */
+	elv_get_ioq(cfqq->ioq);
 	return cfqq;
 }
 
@@ -1649,6 +1273,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 {
 	unsigned long flags;
 	int ret;
+	struct request_queue *q = cfqd->q;
 
 	ret = radix_tree_preload(gfp_mask);
 	if (!ret) {
@@ -1665,9 +1290,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 		radix_tree_preload_end();
 
 		if (!ret) {
-			spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+			spin_lock_irqsave(q->queue_lock, flags);
 			list_add(&cic->queue_list, &cfqd->cic_list);
-			spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+			spin_unlock_irqrestore(q->queue_lock, flags);
 		}
 	}
 
@@ -1687,10 +1312,11 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct io_context *ioc = NULL;
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->q;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
-	ioc = get_io_context(gfp_mask, cfqd->queue->node);
+	ioc = get_io_context(gfp_mask, q->node);
 	if (!ioc)
 		return NULL;
 
@@ -1709,7 +1335,6 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
-
 	return cic;
 err_free:
 	cfq_cic_free(cic);
@@ -1719,17 +1344,6 @@ err:
 }
 
 static void
-cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
-{
-	unsigned long elapsed = jiffies - cic->last_end_request;
-	unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
-
-	cic->ttime_samples = (7*cic->ttime_samples + 256) / 8;
-	cic->ttime_total = (7*cic->ttime_total + 256*ttime) / 8;
-	cic->ttime_mean = (cic->ttime_total + 128) / cic->ttime_samples;
-}
-
-static void
 cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 		       struct request *rq)
 {
@@ -1758,65 +1372,40 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 }
 
 /*
- * Disable idle window if the process thinks too long or seeks so much that
- * it doesn't matter
+ * Disable idle window if the process seeks so much that it doesn't matter
  */
-static void
-cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		       struct cfq_io_context *cic)
+static int
+cfq_update_idle_window(struct elevator_queue *eq, void *cfqq,
+					struct request *rq)
 {
-	int old_idle, enable_idle;
+	struct cfq_io_context *cic = RQ_CIC(rq);
 
 	/*
-	 * Don't idle for async or idle io prio class
+	 * Enabling/disabling idling based on thinktime has been moved
+	 * to the common layer.
 	 */
-	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
-		return;
-
-	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
-
-	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
-		enable_idle = 0;
-	else if (sample_valid(cic->ttime_samples)) {
-		if (cic->ttime_mean > cfqd->cfq_slice_idle)
-			enable_idle = 0;
-		else
-			enable_idle = 1;
-	}
+	if (!atomic_read(&cic->ioc->nr_tasks) ||
+	    (elv_hw_tag(eq) && CIC_SEEKY(cic)))
+		return 0;
 
-	if (old_idle != enable_idle) {
-		cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
-		if (enable_idle)
-			cfq_mark_cfqq_idle_window(cfqq);
-		else
-			cfq_clear_cfqq_idle_window(cfqq);
-	}
+	return 1;
 }
 
 /*
  * Check if new_cfqq should preempt the currently active queue. Return 0 for
- * no or if we aren't sure, a 1 will cause a preempt.
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ * Some of the preemption logic has been moved to the common layer. Only
+ * cfq-specific parts are left here.
  */
 static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
-		   struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
 
-	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		return 0;
 
-	if (cfq_slice_used(cfqq))
-		return 1;
-
-	if (cfq_class_idle(new_cfqq))
-		return 0;
-
-	if (cfq_class_idle(cfqq))
-		return 1;
-
 	/*
 	 * if the new request is sync, but the currently running queue is
 	 * not, let the sync request have priority.
@@ -1831,13 +1420,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_meta(rq) && !cfqq->meta_pending)
 		return 1;
 
-	/*
-	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
-	 */
-	if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
-		return 1;
-
-	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+	if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
 		return 0;
 
 	/*
@@ -1851,29 +1434,10 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 }
 
 /*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
-
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
-	cfq_service_tree_add(cfqd, cfqq, 1);
-
-	cfqq->slice_end = 0;
-	cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
  * Called when a new fs request (rq) is added (to cfqq). Check if there's
  * something we should do about it
+ * After enqueuing the request, the decision whether the queue should be
+ * preempted or kicked is taken by the common layer.
  */
 static void
 cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -1881,38 +1445,12 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 {
 	struct cfq_io_context *cic = RQ_CIC(rq);
 
-	cfqd->rq_queued++;
 	if (rq_is_meta(rq))
 		cfqq->meta_pending++;
 
-	cfq_update_io_thinktime(cfqd, cic);
 	cfq_update_io_seektime(cfqd, cic, rq);
-	cfq_update_idle_window(cfqd, cfqq, cic);
 
 	cic->last_request_pos = rq->sector + rq->nr_sectors;
-
-	if (cfqq == cfqd->active_queue) {
-		/*
-		 * if we are waiting for a request for this queue, let it rip
-		 * immediately and flag that we must not expire this queue
-		 * just now
-		 */
-		if (cfq_cfqq_wait_request(cfqq)) {
-			cfq_mark_cfqq_must_dispatch(cfqq);
-			del_timer(&cfqd->idle_slice_timer);
-			blk_start_queueing(cfqd->queue);
-		}
-	} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
-		/*
-		 * not the active queue - expire current slice if it is
-		 * idle and has expired it's mean thinktime or this new queue
-		 * has some old slice time left and is of higher priority or
-		 * this new queue is RT and the current one is BE
-		 */
-		cfq_preempt_queue(cfqd, cfqq);
-		cfq_mark_cfqq_must_dispatch(cfqq);
-		blk_start_queueing(cfqd->queue);
-	}
 }
 
 static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -1930,31 +1468,6 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	cfq_rq_enqueued(cfqd, cfqq, rq);
 }
 
-/*
- * Update hw_tag based on peak queue depth over 50 samples under
- * sufficient load.
- */
-static void cfq_update_hw_tag(struct cfq_data *cfqd)
-{
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
-
-	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
-		return;
-
-	if (cfqd->hw_tag_samples++ < 50)
-		return;
-
-	if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
-		cfqd->hw_tag = 1;
-	else
-		cfqd->hw_tag = 0;
-
-	cfqd->hw_tag_samples = 0;
-	cfqd->rq_in_driver_peak = 0;
-}
-
 static void cfq_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -1965,13 +1478,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 	now = jiffies;
 	cfq_log_cfqq(cfqd, cfqq, "complete");
 
-	cfq_update_hw_tag(cfqd);
-
-	WARN_ON(!cfqd->rq_in_driver);
-	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
-	cfqq->dispatched--;
-
 	if (cfq_cfqq_sync(cfqq))
 		cfqd->sync_flight--;
 
@@ -1980,24 +1486,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	if (sync)
 		RQ_CIC(rq)->last_end_request = now;
-
-	/*
-	 * If this is the active queue, check if it needs to be expired,
-	 * or if we want to idle in case it has no pending requests.
-	 */
-	if (cfqd->active_queue == cfqq) {
-		if (cfq_cfqq_slice_new(cfqq)) {
-			cfq_set_prio_slice(cfqd, cfqq);
-			cfq_clear_cfqq_slice_new(cfqq);
-		}
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
-			cfq_slice_expired(cfqd, 1);
-		else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
-			cfq_arm_slice_timer(cfqd);
-	}
-
-	if (!cfqd->rq_in_driver)
-		cfq_schedule_dispatch(cfqd);
 }
 
 /*
@@ -2006,30 +1494,33 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_prio_boost(struct cfq_queue *cfqq)
 {
+	struct io_queue *ioq = cfqq->ioq;
+
 	if (has_fs_excl()) {
 		/*
 		 * boost idle prio on transactions that would lock out other
 		 * users of the filesystem
 		 */
 		if (cfq_class_idle(cfqq))
-			cfqq->ioprio_class = IOPRIO_CLASS_BE;
-		if (cfqq->ioprio > IOPRIO_NORM)
-			cfqq->ioprio = IOPRIO_NORM;
+			elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+		if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+			elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
 	} else {
 		/*
 		 * check if we need to unboost the queue
 		 */
-		if (cfqq->ioprio_class != cfqq->org_ioprio_class)
-			cfqq->ioprio_class = cfqq->org_ioprio_class;
-		if (cfqq->ioprio != cfqq->org_ioprio)
-			cfqq->ioprio = cfqq->org_ioprio;
+		if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+			elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+		if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+			elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
 	}
 }
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
-	    !cfq_cfqq_must_alloc_slice(cfqq)) {
+	if ((elv_ioq_wait_request(cfqq->ioq) ||
+	   cfq_cfqq_must_alloc(cfqq)) && !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
 	}
@@ -2121,116 +1612,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq->allocated[rw]++;
 	cfq_clear_cfqq_must_alloc(cfqq);
-	atomic_inc(&cfqq->ref);
+	elv_get_ioq(cfqq->ioq);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
+	rq->ioq = cfqq->ioq;
 	return 0;
 
 queue_fail:
 	if (cic)
 		put_io_context(cic->ioc);
 
-	cfq_schedule_dispatch(cfqd);
+	elv_schedule_dispatch(cfqd->q);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	cfq_log(cfqd, "set_request fail");
 	return 1;
 }
 
-static void cfq_kick_queue(struct work_struct *work)
-{
-	struct cfq_data *cfqd =
-		container_of(work, struct cfq_data, unplug_work);
-	struct request_queue *q = cfqd->queue;
-	unsigned long flags;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	blk_start_queueing(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
-	struct cfq_data *cfqd = (struct cfq_data *) data;
-	struct cfq_queue *cfqq;
-	unsigned long flags;
-	int timed_out = 1;
-
-	cfq_log(cfqd, "idle timer fired");
-
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
-	cfqq = cfqd->active_queue;
-	if (cfqq) {
-		timed_out = 0;
-
-		/*
-		 * expired
-		 */
-		if (cfq_slice_used(cfqq))
-			goto expire;
-
-		/*
-		 * only expire and reinvoke request handler, if there are
-		 * other queues with pending requests
-		 */
-		if (!cfqd->busy_queues)
-			goto out_cont;
-
-		/*
-		 * not expired and it has a request pending, let it dispatch
-		 */
-		if (!RB_EMPTY_ROOT(&cfqq->sort_list)) {
-			cfq_mark_cfqq_must_dispatch(cfqq);
-			goto out_kick;
-		}
-	}
-expire:
-	cfq_slice_expired(cfqd, timed_out);
-out_kick:
-	cfq_schedule_dispatch(cfqd);
-out_cont:
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
-	del_timer_sync(&cfqd->idle_slice_timer);
-	cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
-	int i;
-
-	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
-	}
-
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
-	struct request_queue *q = cfqd->queue;
-
-	cfq_shutdown_timer_wq(cfqd);
+	struct request_queue *q = cfqd->q;
 
 	spin_lock_irq(q->queue_lock);
 
-	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
 							struct cfq_io_context,
@@ -2239,12 +1645,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		__cfq_exit_single_io_context(cfqd, cic);
 	}
 
-	cfq_put_async_queues(cfqd);
-
 	spin_unlock_irq(q->queue_lock);
-
-	cfq_shutdown_timer_wq(cfqd);
-
 	kfree(cfqd);
 }
 
@@ -2256,16 +1657,9 @@ static void *cfq_init_queue(struct request_queue *q)
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
 	INIT_LIST_HEAD(&cfqd->cic_list);
 
-	cfqd->queue = q;
-
-	init_timer(&cfqd->idle_slice_timer);
-	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
-	cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
-	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
+	cfqd->q = q;
 
 	cfqd->last_end_request = jiffies;
 	cfqd->cfq_quantum = cfq_quantum;
@@ -2273,11 +1667,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
-	cfqd->cfq_slice[0] = cfq_slice_async;
-	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
-	cfqd->cfq_slice_idle = cfq_slice_idle;
-	cfqd->hw_tag = 1;
 
 	return cfqd;
 }
@@ -2342,9 +1732,6 @@ SHOW_FUNCTION(cfq_fifo_expire_sync_show, cfqd->cfq_fifo_expire[1], 1);
 SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
-SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 #undef SHOW_FUNCTION
 
@@ -2372,9 +1759,6 @@ STORE_FUNCTION(cfq_fifo_expire_async_store, &cfqd->cfq_fifo_expire[0], 1,
 STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
 		UINT_MAX, 0);
-STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 #undef STORE_FUNCTION
@@ -2388,10 +1772,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(fifo_expire_async),
 	CFQ_ATTR(back_seek_max),
 	CFQ_ATTR(back_seek_penalty),
-	CFQ_ATTR(slice_sync),
-	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
-	CFQ_ATTR(slice_idle),
 	__ATTR_NULL
 };
 
@@ -2404,8 +1785,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_dispatch_fn =		cfq_dispatch_requests,
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
-		.elevator_deactivate_req_fn =	cfq_deactivate_request,
-		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -2415,7 +1794,14 @@ static struct elevator_type iosched_cfq = {
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
 		.trim =				cfq_free_io_context,
+		.elevator_free_sched_queue_fn =	cfq_free_cfq_queue,
+		.elevator_active_ioq_set_fn = 	cfq_active_ioq_set,
+		.elevator_active_ioq_reset_fn =	cfq_active_ioq_reset,
+		.elevator_arm_slice_timer_fn = 	cfq_arm_slice_timer,
+		.elevator_should_preempt_fn = 	cfq_should_preempt,
+		.elevator_update_idle_window_fn = cfq_update_idle_window,
 	},
+	.elevator_features =    ELV_IOSCHED_NEED_FQ,
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
 	.elevator_owner =	THIS_MODULE,
@@ -2423,14 +1809,6 @@ static struct elevator_type iosched_cfq = {
 
 static int __init cfq_init(void)
 {
-	/*
-	 * could be 0 on HZ < 1000 setups
-	 */
-	if (!cfq_slice_async)
-		cfq_slice_async = 1;
-	if (!cfq_slice_idle)
-		cfq_slice_idle = 1;
-
 	if (cfq_slab_setup())
 		return -ENOMEM;
 
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 03/10] Modify cfq to make use of flat elevator fair queuing
  2009-03-12  1:56 ` Vivek Goyal
@ 2009-03-12  1:56 ` Vivek Goyal
  -1 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers
  Cc: vgoyal, akpm, menage, peterz

This patch changes cfq to use the fair queuing code from the elevator layer
(a standalone sketch of the overall conversion pattern follows the note below).

o The must_dispatch logic looks like dead code; nobody seems to be making
  use of that flag. Retaining it for the time being.
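
Purely as an illustration (not part of the patch), below is a tiny standalone
C sketch of the shape of this conversion: the generic per-queue scheduling
state moves into a common io_queue object owned by the elevator layer, and the
io scheduler keeps only its private data plus a pointer to that object,
reached through small accessor helpers. All names in the sketch are made up;
the real interface is the elv_ioq_*() helpers used in the diff below.

#include <stdio.h>

/* common layer: generic fair queuing state shared by all io schedulers */
struct io_queue {
	int ioprio;
	int ioprio_class;
	int sync;
};

/* io scheduler: only its private bits plus a pointer to the common object */
struct cfq_queue_sketch {
	struct io_queue *ioq;
	int meta_pending;
};

/* accessor helpers standing in for the real elv_ioq_*() interface */
static int ioq_get_ioprio(struct io_queue *ioq)		{ return ioq->ioprio; }
static void ioq_set_ioprio(struct io_queue *ioq, int p)	{ ioq->ioprio = p; }

int main(void)
{
	struct io_queue ioq = { .ioprio = 4, .ioprio_class = 2, .sync = 1 };
	struct cfq_queue_sketch cfqq = { .ioq = &ioq, .meta_pending = 0 };

	/* scheduler code no longer touches the prio fields directly */
	ioq_set_ioprio(cfqq.ioq, 0);
	printf("ioprio now %d\n", ioq_get_ioprio(cfqq.ioq));
	return 0;
}

This is the same split the diff performs on struct cfq_queue: the rb_node,
slice and ioprio members go away and are reached through cfqq->ioq instead.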

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |    3 +-
 block/cfq-iosched.c   | 1082 +++++++++++--------------------------------------
 2 files changed, 232 insertions(+), 853 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
 menu "IO Schedulers"
 
 config ELV_FAIR_QUEUING
-	bool "Elevator Fair Queuing Support"
+	bool
 	default n
 	---help---
 	  Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
+	select ELV_FAIR_QUEUING
 	default y
 	---help---
 	  The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 664ebfd..5b41b54 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,7 +12,6 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
-
 /*
  * tunables
  */
@@ -23,15 +22,7 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 static const int cfq_back_max = 16 * 1024;
 /* penalty of a backwards seek */
 static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
-static int cfq_slice_idle = HZ / 125;
-
-/*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY		(HZ / 5)
 
 /*
  * below this threshold, we consider thinktime immediate
@@ -43,7 +34,7 @@ static int cfq_slice_idle = HZ / 125;
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq)	(struct cfq_queue *) (ioq_sched_queue((rq)->ioq))
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +44,6 @@ static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
-#define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
 
 #define ASYNC			(0)
 #define SYNC			(1)
@@ -77,45 +66,16 @@ struct cfq_rb_root {
  * Per block device queue structure
  */
 struct cfq_data {
-	struct request_queue *queue;
-
-	/*
-	 * rr list of queues with requests and the count of them
-	 */
-	struct cfq_rb_root service_tree;
-	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
+	struct request_queue *q;
 
-	int rq_in_driver;
 	int sync_flight;
 
 	/*
-	 * queue-depth detection
-	 */
-	int rq_queued;
-	int hw_tag;
-	int hw_tag_samples;
-	int rq_in_driver_peak;
-
-	/*
 	 * idle window management
 	 */
-	struct timer_list idle_slice_timer;
-	struct work_struct unplug_work;
 
-	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 	unsigned long last_end_request;
 
@@ -126,9 +86,7 @@ struct cfq_data {
 	unsigned int cfq_fifo_expire[2];
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
-	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
-	unsigned int cfq_slice_idle;
 
 	struct list_head cic_list;
 };
@@ -137,16 +95,11 @@ struct cfq_data {
  * Per process-grouping structure
  */
 struct cfq_queue {
-	/* reference count */
-	atomic_t ref;
+	struct io_queue *ioq;
 	/* various state flags, see below */
 	unsigned int flags;
 	/* parent cfq_data */
 	struct cfq_data *cfqd;
-	/* service_tree member */
-	struct rb_node rb_node;
-	/* service_tree key */
-	unsigned long rb_key;
 	/* sorted list of pending requests */
 	struct rb_root sort_list;
 	/* if fifo isn't expired, next request to serve */
@@ -158,33 +111,23 @@ struct cfq_queue {
 	/* fifo list of requests in sort_list */
 	struct list_head fifo;
 
-	unsigned long slice_end;
-	long slice_resid;
-
 	/* pending metadata requests */
 	int meta_pending;
-	/* number of requests that are on the dispatch list or inside driver */
-	int dispatched;
 
 	/* io prio of this group */
-	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class, org_ioprio_class;
+	unsigned short org_ioprio;
+	unsigned short org_ioprio_class;
 
 	pid_t pid;
 };
 
 enum cfqq_state_flags {
-	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
-	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
 	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
-	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
-	CFQ_CFQQ_FLAG_must_dispatch,	/* must dispatch, even if expired */
-	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
-	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
-	CFQ_CFQQ_FLAG_prio_changed,	/* task priority has changed */
-	CFQ_CFQQ_FLAG_queue_new,	/* queue never been serviced */
-	CFQ_CFQQ_FLAG_slice_new,	/* no requests dispatched in slice */
-	CFQ_CFQQ_FLAG_sync,		/* synchronous queue */
+	CFQ_CFQQ_FLAG_must_alloc_slice, /* per-slice must_alloc flag */
+	CFQ_CFQQ_FLAG_must_dispatch,    /* must dispatch, even if expired */
+	CFQ_CFQQ_FLAG_fifo_expire,      /* FIFO checked in this slice */
+	CFQ_CFQQ_FLAG_prio_changed,     /* task priority has changed */
+	CFQ_CFQQ_FLAG_queue_new,        /* queue never been serviced */
 };
 
 #define CFQ_CFQQ_FNS(name)						\
@@ -199,116 +142,78 @@ static inline void cfq_clear_cfqq_##name(struct cfq_queue *cfqq)	\
 static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 {									\
 	return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0;	\
-}
+}									\
 
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
 CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(must_dispatch);
 CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
 CFQ_CFQQ_FNS(queue_new);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
 #undef CFQ_CFQQ_FNS
 
 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...)	\
-	blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args)
+	blk_add_trace_msg((cfqd)->q, "cfq%d " fmt, elv_ioq_pid(cfqq->ioq), \
+					##args)
 #define cfq_log(cfqd, fmt, args...)	\
-	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
+	blk_add_trace_msg((cfqd)->q, "cfq " fmt, ##args)
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
 static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
-				       struct io_context *, gfp_t);
+						struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 
-static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
-					    int is_sync)
-{
-	return cic->cfqq[!!is_sync];
-}
-
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
-				struct cfq_queue *cfqq, int is_sync)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
 {
-	cic->cfqq[!!is_sync] = cfqq;
+	return ioq_to_io_group(cfqq->ioq);
 }
 
-/*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
- */
-static inline int cfq_bio_sync(struct bio *bio)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
 {
-	if (bio_data_dir(bio) == READ || bio_sync(bio))
-		return 1;
-
-	return 0;
+	return elv_ioq_class_idle(cfqq->ioq);
 }
 
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline int cfq_class_rt(struct cfq_queue *cfqq)
 {
-	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
-	}
+	return elv_ioq_class_rt(cfqq->ioq);
 }
 
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	return !cfqd->busy_queues;
+	return elv_ioq_sync(cfqq->ioq);
 }
 
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
-				 unsigned short prio)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
 {
-	const int base_slice = cfqd->cfq_slice[sync];
-
-	WARN_ON(prio >= IOPRIO_BE_NR);
+	struct cfq_data *cfqd = cfqq->cfqd;
+	struct elevator_queue *e = cfqd->q->elevator;
 
-	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
+	return (elv_active_sched_queue(e) == cfqq);
 }
 
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
+					    int is_sync)
 {
-	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
+	return cic->cfqq[!!is_sync];
 }
 
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+				struct cfq_queue *cfqq, int is_sync)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
-	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
+	cic->cfqq[!!is_sync] = cfqq;
 }
 
 /*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
+ * We regard a request as SYNC, if it's either a read or has the SYNC bit
+ * set (in which case it could also be direct WRITE).
  */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
+static inline int cfq_bio_sync(struct bio *bio)
 {
-	if (cfq_cfqq_slice_new(cfqq))
-		return 0;
-	if (time_before(jiffies, cfqq->slice_end))
-		return 0;
+	if (bio_data_dir(bio) == READ || bio_sync(bio))
+		return 1;
 
-	return 1;
+	return 0;
 }
 
 /*
@@ -406,32 +311,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 	}
 }
 
-/*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
-	if (!root->left)
-		root->left = rb_first(&root->rb);
-
-	if (root->left)
-		return rb_entry(root->left, struct cfq_queue, rb_node);
-
-	return NULL;
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
-	if (root->left == n)
-		root->left = NULL;
-
-	rb_erase(n, &root->rb);
-	RB_CLEAR_NODE(n);
-}
-
-/*
- * would be nice to take fifo expire time into account as well
- */
 static struct request *
 cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		  struct request *last)
@@ -442,10 +321,10 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	BUG_ON(RB_EMPTY_NODE(&last->rb_node));
 
-	if (rbprev)
+	if (rbprev != NULL)
 		prev = rb_entry_rq(rbprev);
 
-	if (rbnext)
+	if (rbnext != NULL)
 		next = rb_entry_rq(rbnext);
 	else {
 		rbnext = rb_first(&cfqq->sort_list);
@@ -456,140 +335,25 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	return cfq_choose_req(cfqd, next, prev);
 }
 
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
-				      struct cfq_queue *cfqq)
+/* An active ioq has been reset. A chance to do cic-related cleanup. */
+static void cfq_active_ioq_reset(struct request_queue *q)
 {
-	/*
-	 * just an approximation, should be ok.
-	 */
-	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd,
-				    struct cfq_queue *cfqq, int add_front)
-{
-	struct rb_node **p, *parent;
-	struct cfq_queue *__cfqq;
-	unsigned long rb_key;
-	int left;
-
-	if (cfq_class_idle(cfqq)) {
-		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
-		if (parent && parent != &cfqq->rb_node) {
-			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-			rb_key += __cfqq->rb_key;
-		} else
-			rb_key += jiffies;
-	} else if (!add_front) {
-		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
-		rb_key += cfqq->slice_resid;
-		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
-		/*
-		 * same position, nothing more to do
-		 */
-		if (rb_key == cfqq->rb_key)
-			return;
-
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	}
-
-	left = 1;
-	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
-	while (*p) {
-		struct rb_node **n;
-
-		parent = *p;
-		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
-		/*
-		 * sort RT queues first, we always want to give
-		 * preference to them. IDLE queues goes to the back.
-		 * after that, sort on the next service time.
-		 */
-		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			n = &(*p)->rb_right;
-		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
-			n = &(*p)->rb_right;
-		else if (rb_key < __cfqq->rb_key)
-			n = &(*p)->rb_left;
-		else
-			n = &(*p)->rb_right;
-
-		if (n == &(*p)->rb_right)
-			left = 0;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-		p = n;
+	if (cfqd->active_cic) {
+		put_io_context(cfqd->active_cic->ioc);
+		cfqd->active_cic = NULL;
 	}
-
-	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
-
-	cfqq->rb_key = rb_key;
-	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
 }
 
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as the active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue)
 {
-	/*
-	 * Resorting requires the cfqq to be on the RR list already.
-	 */
-	if (cfq_cfqq_on_rr(cfqq))
-		cfq_service_tree_add(cfqd, cfqq, 0);
-}
-
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
-	cfq_mark_cfqq_on_rr(cfqq);
-	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
-
-	cfq_resort_rr_list(cfqd, cfqq);
-}
-
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-	cfq_clear_cfqq_on_rr(cfqq);
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+	struct cfq_queue *cfqq = sched_queue;
 
-	BUG_ON(!cfqd->busy_queues);
-	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
+	cfq_mark_cfqq_must_alloc(cfqq);
+	cfq_clear_cfqq_fifo_expire(cfqq);
+	cfq_clear_cfqq_queue_new(cfqq);
 }
 
 /*
@@ -598,22 +362,19 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_del_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
-	struct cfq_data *cfqd = cfqq->cfqd;
 	const int sync = rq_is_sync(rq);
 
 	BUG_ON(!cfqq->queued[sync]);
 	cfqq->queued[sync]--;
 
 	elv_rb_del(&cfqq->sort_list, rq);
-
-	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
-		cfq_del_cfqq_rr(cfqd, cfqq);
 }
 
 static void cfq_add_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 	struct cfq_data *cfqd = cfqq->cfqd;
+	struct request_queue *q = cfqd->q;
 	struct request *__alias;
 
 	cfqq->queued[rq_is_sync(rq)]++;
@@ -623,10 +384,7 @@ static void cfq_add_rq_rb(struct request *rq)
 	 * if that happens, put the alias on the dispatch list
 	 */
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
-		cfq_dispatch_insert(cfqd->queue, __alias);
-
-	if (!cfq_cfqq_on_rr(cfqq))
-		cfq_add_cfqq_rr(cfqd, cfqq);
+		cfq_dispatch_insert(q, __alias);
 
 	/*
 	 * check if this request is a better next-serve candidate
@@ -667,23 +425,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
-
 	cfqd->last_position = rq->hard_sector + rq->hard_nr_sectors;
 }
 
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
-}
-
 static void cfq_remove_request(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -694,7 +438,6 @@ static void cfq_remove_request(struct request *rq)
 	list_del_init(&rq->queuelist);
 	cfq_del_rq_rb(rq);
 
-	cfqq->cfqd->rq_queued--;
 	if (rq_is_meta(rq)) {
 		WARN_ON(!cfqq->meta_pending);
 		cfqq->meta_pending--;
@@ -768,85 +511,23 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	return 0;
 }
 
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
-				   struct cfq_queue *cfqq)
-{
-	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active");
-		cfqq->slice_end = 0;
-		cfq_clear_cfqq_must_alloc_slice(cfqq);
-		cfq_clear_cfqq_fifo_expire(cfqq);
-		cfq_mark_cfqq_slice_new(cfqq);
-		cfq_clear_cfqq_queue_new(cfqq);
-	}
-
-	cfqd->active_queue = cfqq;
-}
-
 /*
  * current cfqq expired its slice (or was too idle), select new one
  */
 static void
 __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int timed_out)
+				int budget_update)
 {
-	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
-	if (cfq_cfqq_wait_request(cfqq))
-		del_timer(&cfqd->idle_slice_timer);
-
 	cfq_clear_cfqq_must_dispatch(cfqq);
-	cfq_clear_cfqq_wait_request(cfqq);
-
-	/*
-	 * store what was left of this slice, if the queue idled/timed out
-	 */
-	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
-		cfqq->slice_resid = cfqq->slice_end - jiffies;
-		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
-	}
-
-	cfq_resort_rr_list(cfqd, cfqq);
-
-	if (cfqq == cfqd->active_queue)
-		cfqd->active_queue = NULL;
-
-	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
-		cfqd->active_cic = NULL;
-	}
+	__elv_ioq_slice_expired(cfqd->q, cfqq->ioq, budget_update);
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd, int budget_update)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->q->elevator);
 
 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
-
-	return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq;
-
-	cfqq = cfq_get_next_queue(cfqd);
-	__cfq_set_active_queue(cfqd, cfqq);
-	return cfqq;
+		__cfq_slice_expired(cfqd, cfqq, budget_update);
 }
 
 static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -881,35 +562,15 @@ static int cfq_close_cooperator(struct cfq_data *cfq_data,
 
 #define CIC_SEEKY(cic) ((cic)->seek_mean > (8 * 1024))
 
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_io_context *cic;
 	unsigned long sl;
 
-	/*
-	 * SSD device without seek penalty, disable idling. But only do so
-	 * for devices that support queuing, otherwise we still have a problem
-	 * with sync vs async workloads.
-	 */
-	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
-		return;
-
 	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
-	WARN_ON(cfq_cfqq_slice_new(cfqq));
-
-	/*
-	 * idle is disabled, either manually or by past process history
-	 */
-	if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
-		return;
-
-	/*
-	 * still requests with the driver, don't idle
-	 */
-	if (cfqd->rq_in_driver)
-		return;
-
+	WARN_ON(elv_ioq_slice_new(cfqq->ioq));
 	/*
 	 * task has exited, don't wait
 	 */
@@ -921,22 +582,23 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	 * See if this prio level has a good candidate
 	 */
 	if (cfq_close_cooperator(cfqd, cfqq) &&
-	    (sample_valid(cic->ttime_samples) && cic->ttime_mean > 2))
+	    (elv_ioq_sample_valid(cfqq->ioq) &&
+	    elv_ioq_ttime_mean(cfqq->ioq) > 2))
 		return;
 
 	cfq_mark_cfqq_must_dispatch(cfqq);
-	cfq_mark_cfqq_wait_request(cfqq);
+	elv_mark_ioq_wait_request(cfqq->ioq);
 
 	/*
 	 * we don't want to idle for seeks, but we do want to allow
 	 * fair distribution of slice time for a process doing back-to-back
 	 * seeks. so allow a little bit of time for him to submit a new rq
 	 */
-	sl = cfqd->cfq_slice_idle;
+	sl = elv_get_slice_idle(q->elevator);
 	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
 
-	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+	elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
 	cfq_log(cfqd, "arm_idle: %lu", sl);
 }
 
@@ -945,13 +607,12 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
  */
 static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
 	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
 
 	cfq_remove_request(rq);
-	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
 
 	if (cfq_cfqq_sync(cfqq))
@@ -989,68 +650,11 @@ static inline int
 cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	const int base_rq = cfqd->cfq_slice_async_rq;
+	unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
 
-	WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
+	WARN_ON(ioprio >= IOPRIO_BE_NR);
 
-	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq;
-
-	cfqq = cfqd->active_queue;
-	if (!cfqq)
-		goto new_queue;
-
-	/*
-	 * The active queue has run out of time, expire it and select new.
-	 */
-	if (cfq_slice_used(cfqq))
-		goto expire;
-
-	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
-	 * The active queue has requests and isn't expired, allow it to
-	 * dispatch.
-	 */
-	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-		goto keep_queue;
-
-	/*
-	 * No requests pending. If the active queue still has requests in
-	 * flight or is idling for a new request, allow either of these
-	 * conditions to happen (or time out) before selecting a new queue.
-	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
-	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
-		cfqq = NULL;
-		goto keep_queue;
-	}
-
-expire:
-	cfq_slice_expired(cfqd, 0);
-new_queue:
-	cfqq = cfq_set_active_queue(cfqd);
-keep_queue:
-	return cfqq;
+	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
 }
 
 /*
@@ -1062,6 +666,7 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 			int max_dispatch)
 {
 	int dispatched = 0;
+	struct request_queue *q = cfqd->q;
 
 	BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
 
@@ -1078,7 +683,7 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		/*
 		 * finally, insert request into driver dispatch list
 		 */
-		cfq_dispatch_insert(cfqd->queue, rq);
+		cfq_dispatch_insert(q, rq);
 
 		dispatched++;
 
@@ -1094,7 +699,7 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		 * If there is a non-empty RT cfqq waiting for current
 		 * cfqq's timeslice to complete, pre-empt this cfqq
 		 */
-		if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues)
+		if (!cfq_class_rt(cfqq) && elv_nr_busy_rt_ioq(q->elevator))
 			break;
 
 	} while (dispatched < max_dispatch);
@@ -1103,11 +708,12 @@ __cfq_dispatch_requests(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+
+	if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    dispatched >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
-		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0);
+		elv_ioq_set_slice_end(cfqq->ioq, jiffies + 1);
+		cfq_slice_expired(cfqd, 1);
 	}
 
 	return dispatched;
@@ -1118,7 +724,7 @@ static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
 	int dispatched = 0;
 
 	while (cfqq->next_rq) {
-		cfq_dispatch_insert(cfqq->cfqd->queue, cfqq->next_rq);
+		cfq_dispatch_insert(cfqq->cfqd->q, cfqq->next_rq);
 		dispatched++;
 	}
 
@@ -1135,12 +741,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = elv_select_sched_queue(cfqd->q, 1)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
+	/* This is probably redundant now. The above loop should make sure
+	 * that all the busy queues have expired. */
 	cfq_slice_expired(cfqd, 0);
 
-	BUG_ON(cfqd->busy_queues);
+	BUG_ON(elv_nr_busy_ioq(cfqd->q->elevator));
 
 	cfq_log(cfqd, "forced_dispatch=%d\n", dispatched);
 	return dispatched;
@@ -1152,29 +760,27 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	struct cfq_queue *cfqq;
 	int dispatched;
 
-	if (!cfqd->busy_queues)
-		return 0;
-
 	if (unlikely(force))
 		return cfq_forced_dispatch(cfqd);
 
 	dispatched = 0;
-	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
+	while ((cfqq = elv_select_sched_queue(q, 0)) != NULL) {
 		int max_dispatch;
 
 		max_dispatch = cfqd->cfq_quantum;
 		if (cfq_class_idle(cfqq))
 			max_dispatch = 1;
 
-		if (cfqq->dispatched >= max_dispatch && cfqd->busy_queues > 1)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch &&
+		    elv_nr_busy_ioq(q->elevator) > 1)
 			break;
 
 		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
 			break;
 
 		cfq_clear_cfqq_must_dispatch(cfqq);
-		cfq_clear_cfqq_wait_request(cfqq);
-		del_timer(&cfqd->idle_slice_timer);
+		elv_clear_ioq_wait_request(cfqq->ioq);
+		elv_del_idle_slice_timer(q->elevator);
 
 		dispatched += __cfq_dispatch_requests(cfqd, cfqq, max_dispatch);
 	}
@@ -1183,34 +789,30 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	return dispatched;
 }
 
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
 {
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
 
-	BUG_ON(atomic_read(&cfqq->ref) <= 0);
-
-	if (!atomic_dec_and_test(&cfqq->ref))
-		return;
+	BUG_ON(!cfqq);
 
-	cfq_log_cfqq(cfqd, cfqq, "put_queue");
+	cfq_log_cfqq(cfqd, cfqq, "free_queue");
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
-	if (unlikely(cfqd->active_queue == cfqq)) {
+	if (unlikely(cfqq_is_active_queue(cfqq))) {
 		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+		elv_schedule_dispatch(cfqd->q);
 	}
 
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+	elv_put_ioq(cfqq->ioq);
+}
+
 /*
  * Must always be called with the rcu_read_lock() held
  */
@@ -1298,9 +900,9 @@ static void cfq_free_io_context(struct io_context *ioc)
 
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	if (unlikely(cfqq == cfqd->active_queue)) {
+	if (unlikely(cfqq == elv_active_sched_queue(cfqd->q->elevator))) {
 		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+		elv_schedule_dispatch(cfqd->q);
 	}
 
 	cfq_put_queue(cfqq);
@@ -1340,7 +942,7 @@ static void cfq_exit_single_io_context(struct io_context *ioc,
 	struct cfq_data *cfqd = cic->key;
 
 	if (cfqd) {
-		struct request_queue *q = cfqd->queue;
+		struct request_queue *q = cfqd->q;
 		unsigned long flags;
 
 		spin_lock_irqsave(q->queue_lock, flags);
@@ -1370,9 +972,10 @@ static struct cfq_io_context *
 cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->q;
 
 	cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
-							cfqd->queue->node);
+							q->node);
 	if (cic) {
 		cic->last_end_request = jiffies;
 		INIT_LIST_HEAD(&cic->queue_list);
@@ -1388,7 +991,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
-	int ioprio_class;
+	int ioprio_class, ioprio;
 
 	if (!cfq_cfqq_prio_changed(cfqq))
 		return;
@@ -1401,30 +1004,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 		/*
 		 * no prio set, inherit CPU scheduling settings
 		 */
-		cfqq->ioprio = task_nice_ioprio(tsk);
-		cfqq->ioprio_class = task_nice_ioclass(tsk);
+		ioprio = task_nice_ioprio(tsk);
+		ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_RT;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
-		cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
-		cfqq->ioprio = 7;
-		cfq_clear_cfqq_idle_window(cfqq);
+		ioprio_class = IOPRIO_CLASS_IDLE;
+		ioprio = 7;
+		elv_clear_ioq_idle_window(cfqq->ioq);
 		break;
 	}
 
+	elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+	elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
 	/*
 	 * keep track of original prio settings in case we have to temporarily
 	 * elevate the priority of this queue
 	 */
-	cfqq->org_ioprio = cfqq->ioprio;
-	cfqq->org_ioprio_class = cfqq->ioprio_class;
+	cfqq->org_ioprio = ioprio;
+	cfqq->org_ioprio_class = ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -1433,11 +1039,12 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	struct cfq_data *cfqd = cic->key;
 	struct cfq_queue *cfqq;
 	unsigned long flags;
+	struct request_queue *q = cfqd->q;
 
 	if (unlikely(!cfqd))
 		return;
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[ASYNC];
 	if (cfqq) {
@@ -1453,7 +1060,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	if (cfqq)
 		cfq_mark_cfqq_prio_changed(cfqq);
 
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
 static void cfq_ioc_set_ioprio(struct io_context *ioc)
@@ -1464,11 +1071,12 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
-		     struct io_context *ioc, gfp_t gfp_mask)
+				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
-
+	struct request_queue *q = cfqd->q;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
 retry:
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
@@ -1476,8 +1084,7 @@ retry:
 
 	if (!cfqq) {
 		if (new_cfqq) {
-			cfqq = new_cfqq;
-			new_cfqq = NULL;
+			goto alloc_ioq;
 		} else if (gfp_mask & __GFP_WAIT) {
 			/*
 			 * Inform the allocator of the fact that we will
@@ -1485,35 +1092,66 @@ retry:
 			 * the allocator to do whatever it needs to attempt to
 			 * free memory.
 			 */
-			spin_unlock_irq(cfqd->queue->queue_lock);
+			spin_unlock_irq(q->queue_lock);
 			new_cfqq = kmem_cache_alloc_node(cfq_pool,
 					gfp_mask | __GFP_NOFAIL | __GFP_ZERO,
-					cfqd->queue->node);
-			spin_lock_irq(cfqd->queue->queue_lock);
+					q->node);
+			spin_lock_irq(q->queue_lock);
 			goto retry;
 		} else {
 			cfqq = kmem_cache_alloc_node(cfq_pool,
 					gfp_mask | __GFP_ZERO,
-					cfqd->queue->node);
+					q->node);
 			if (!cfqq)
 				goto out;
 		}
 
-		RB_CLEAR_NODE(&cfqq->rb_node);
-		INIT_LIST_HEAD(&cfqq->fifo);
+alloc_ioq:
+		if (new_ioq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			cfqq = new_cfqq;
+			new_cfqq = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q,
+					gfp_mask | __GFP_NOFAIL | __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq) {
+				kmem_cache_free(cfq_pool, cfqq);
+				cfqq = NULL;
+				goto out;
+			}
+		}
 
-		atomic_set(&cfqq->ref, 0);
+		/*
+		 * Both cfqq and ioq objects allocated. Do the initializations
+		 * now.
+		 */
+		INIT_LIST_HEAD(&cfqq->fifo);
 		cfqq->cfqd = cfqd;
 
 		cfq_mark_cfqq_prio_changed(cfqq);
 		cfq_mark_cfqq_queue_new(cfqq);
 
+		cfqq->ioq = ioq;
 		cfq_init_prio_data(cfqq, ioc);
+		elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
+					cfqq->org_ioprio, is_sync);
 
 		if (is_sync) {
 			if (!cfq_class_idle(cfqq))
-				cfq_mark_cfqq_idle_window(cfqq);
-			cfq_mark_cfqq_sync(cfqq);
+				elv_mark_ioq_idle_window(cfqq->ioq);
+			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
@@ -1522,38 +1160,28 @@ retry:
 	if (new_cfqq)
 		kmem_cache_free(cfq_pool, new_cfqq);
 
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
 out:
 	WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
 	return cfqq;
 }
 
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
-	switch (ioprio_class) {
-	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
-	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
-	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
-	default:
-		BUG();
-	}
-}
-
 static struct cfq_queue *
 cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-	      gfp_t gfp_mask)
+					gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
-	struct cfq_queue **async_cfqq = NULL;
+	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct io_group *iog = io_lookup_io_group_current(cfqd->q);
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
-		cfqq = *async_cfqq;
+		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
+								ioprio);
+		cfqq = async_cfqq;
 	}
 
 	if (!cfqq) {
@@ -1562,15 +1190,11 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 			return NULL;
 	}
 
-	/*
-	 * pin the queue now that it's allocated, scheduler exit will prune it
-	 */
-	if (!is_sync && !(*async_cfqq)) {
-		atomic_inc(&cfqq->ref);
-		*async_cfqq = cfqq;
-	}
+	if (!is_sync && !async_cfqq)
+		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	atomic_inc(&cfqq->ref);
+	/* ioc reference */
+	elv_get_ioq(cfqq->ioq);
 	return cfqq;
 }
 
@@ -1649,6 +1273,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 {
 	unsigned long flags;
 	int ret;
+	struct request_queue *q = cfqd->q;
 
 	ret = radix_tree_preload(gfp_mask);
 	if (!ret) {
@@ -1665,9 +1290,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 		radix_tree_preload_end();
 
 		if (!ret) {
-			spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+			spin_lock_irqsave(q->queue_lock, flags);
 			list_add(&cic->queue_list, &cfqd->cic_list);
-			spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+			spin_unlock_irqrestore(q->queue_lock, flags);
 		}
 	}
 
@@ -1687,10 +1312,11 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct io_context *ioc = NULL;
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->q;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
-	ioc = get_io_context(gfp_mask, cfqd->queue->node);
+	ioc = get_io_context(gfp_mask, q->node);
 	if (!ioc)
 		return NULL;
 
@@ -1709,7 +1335,6 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
-
 	return cic;
 err_free:
 	cfq_cic_free(cic);
@@ -1719,17 +1344,6 @@ err:
 }
 
 static void
-cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
-{
-	unsigned long elapsed = jiffies - cic->last_end_request;
-	unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
-
-	cic->ttime_samples = (7*cic->ttime_samples + 256) / 8;
-	cic->ttime_total = (7*cic->ttime_total + 256*ttime) / 8;
-	cic->ttime_mean = (cic->ttime_total + 128) / cic->ttime_samples;
-}
-
-static void
 cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 		       struct request *rq)
 {
@@ -1758,65 +1372,40 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 }
 
 /*
- * Disable idle window if the process thinks too long or seeks so much that
- * it doesn't matter
+ * Disable idle window if the process seeks so much that it doesn't matter
  */
-static void
-cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		       struct cfq_io_context *cic)
+static int
+cfq_update_idle_window(struct elevator_queue *eq, void *cfqq,
+					struct request *rq)
 {
-	int old_idle, enable_idle;
+	struct cfq_io_context *cic = RQ_CIC(rq);
 
 	/*
-	 * Don't idle for async or idle io prio class
+	 * Enabling/disabling idling based on thinktime has been moved
+	 * to the common layer.
 	 */
-	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
-		return;
-
-	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
-
-	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
-		enable_idle = 0;
-	else if (sample_valid(cic->ttime_samples)) {
-		if (cic->ttime_mean > cfqd->cfq_slice_idle)
-			enable_idle = 0;
-		else
-			enable_idle = 1;
-	}
+	if (!atomic_read(&cic->ioc->nr_tasks) ||
+	    (elv_hw_tag(eq) && CIC_SEEKY(cic)))
+		return 0;
 
-	if (old_idle != enable_idle) {
-		cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
-		if (enable_idle)
-			cfq_mark_cfqq_idle_window(cfqq);
-		else
-			cfq_clear_cfqq_idle_window(cfqq);
-	}
+	return 1;
 }
 
 /*
  * Check if new_cfqq should preempt the currently active queue. Return 0 for
- * no or if we aren't sure, a 1 will cause a preempt.
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ * Some of the preemption logic has been moved to the common layer. Only
+ * cfq-specific parts are left here.
  */
 static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
-		   struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
 
-	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		return 0;
 
-	if (cfq_slice_used(cfqq))
-		return 1;
-
-	if (cfq_class_idle(new_cfqq))
-		return 0;
-
-	if (cfq_class_idle(cfqq))
-		return 1;
-
 	/*
 	 * if the new request is sync, but the currently running queue is
 	 * not, let the sync request have priority.
@@ -1831,13 +1420,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_meta(rq) && !cfqq->meta_pending)
 		return 1;
 
-	/*
-	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
-	 */
-	if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
-		return 1;
-
-	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+	if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
 		return 0;
 
 	/*
@@ -1851,29 +1434,10 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 }
 
 /*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
-
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
-	cfq_service_tree_add(cfqd, cfqq, 1);
-
-	cfqq->slice_end = 0;
-	cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
  * Called when a new fs request (rq) is added (to cfqq). Check if there's
  * something we should do about it
+ * After enqueuing the request, the decision whether the queue should be
+ * preempted or kicked is taken by the common layer.
  */
 static void
 cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -1881,38 +1445,12 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 {
 	struct cfq_io_context *cic = RQ_CIC(rq);
 
-	cfqd->rq_queued++;
 	if (rq_is_meta(rq))
 		cfqq->meta_pending++;
 
-	cfq_update_io_thinktime(cfqd, cic);
 	cfq_update_io_seektime(cfqd, cic, rq);
-	cfq_update_idle_window(cfqd, cfqq, cic);
 
 	cic->last_request_pos = rq->sector + rq->nr_sectors;
-
-	if (cfqq == cfqd->active_queue) {
-		/*
-		 * if we are waiting for a request for this queue, let it rip
-		 * immediately and flag that we must not expire this queue
-		 * just now
-		 */
-		if (cfq_cfqq_wait_request(cfqq)) {
-			cfq_mark_cfqq_must_dispatch(cfqq);
-			del_timer(&cfqd->idle_slice_timer);
-			blk_start_queueing(cfqd->queue);
-		}
-	} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
-		/*
-		 * not the active queue - expire current slice if it is
-		 * idle and has expired it's mean thinktime or this new queue
-		 * has some old slice time left and is of higher priority or
-		 * this new queue is RT and the current one is BE
-		 */
-		cfq_preempt_queue(cfqd, cfqq);
-		cfq_mark_cfqq_must_dispatch(cfqq);
-		blk_start_queueing(cfqd->queue);
-	}
 }
 
 static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -1930,31 +1468,6 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	cfq_rq_enqueued(cfqd, cfqq, rq);
 }
 
-/*
- * Update hw_tag based on peak queue depth over 50 samples under
- * sufficient load.
- */
-static void cfq_update_hw_tag(struct cfq_data *cfqd)
-{
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
-
-	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
-		return;
-
-	if (cfqd->hw_tag_samples++ < 50)
-		return;
-
-	if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
-		cfqd->hw_tag = 1;
-	else
-		cfqd->hw_tag = 0;
-
-	cfqd->hw_tag_samples = 0;
-	cfqd->rq_in_driver_peak = 0;
-}
-
 static void cfq_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -1965,13 +1478,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 	now = jiffies;
 	cfq_log_cfqq(cfqd, cfqq, "complete");
 
-	cfq_update_hw_tag(cfqd);
-
-	WARN_ON(!cfqd->rq_in_driver);
-	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
-	cfqq->dispatched--;
-
 	if (cfq_cfqq_sync(cfqq))
 		cfqd->sync_flight--;
 
@@ -1980,24 +1486,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	if (sync)
 		RQ_CIC(rq)->last_end_request = now;
-
-	/*
-	 * If this is the active queue, check if it needs to be expired,
-	 * or if we want to idle in case it has no pending requests.
-	 */
-	if (cfqd->active_queue == cfqq) {
-		if (cfq_cfqq_slice_new(cfqq)) {
-			cfq_set_prio_slice(cfqd, cfqq);
-			cfq_clear_cfqq_slice_new(cfqq);
-		}
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
-			cfq_slice_expired(cfqd, 1);
-		else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list))
-			cfq_arm_slice_timer(cfqd);
-	}
-
-	if (!cfqd->rq_in_driver)
-		cfq_schedule_dispatch(cfqd);
 }
 
 /*
@@ -2006,30 +1494,33 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_prio_boost(struct cfq_queue *cfqq)
 {
+	struct io_queue *ioq = cfqq->ioq;
+
 	if (has_fs_excl()) {
 		/*
 		 * boost idle prio on transactions that would lock out other
 		 * users of the filesystem
 		 */
 		if (cfq_class_idle(cfqq))
-			cfqq->ioprio_class = IOPRIO_CLASS_BE;
-		if (cfqq->ioprio > IOPRIO_NORM)
-			cfqq->ioprio = IOPRIO_NORM;
+			elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+		if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+			elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
 	} else {
 		/*
 		 * check if we need to unboost the queue
 		 */
-		if (cfqq->ioprio_class != cfqq->org_ioprio_class)
-			cfqq->ioprio_class = cfqq->org_ioprio_class;
-		if (cfqq->ioprio != cfqq->org_ioprio)
-			cfqq->ioprio = cfqq->org_ioprio;
+		if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+			elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+		if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+			elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
 	}
 }
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
-	    !cfq_cfqq_must_alloc_slice(cfqq)) {
+	if ((elv_ioq_wait_request(cfqq->ioq) ||
+	   cfq_cfqq_must_alloc(cfqq)) && !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
 	}
@@ -2121,116 +1612,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq->allocated[rw]++;
 	cfq_clear_cfqq_must_alloc(cfqq);
-	atomic_inc(&cfqq->ref);
+	elv_get_ioq(cfqq->ioq);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
+	rq->ioq = cfqq->ioq;
 	return 0;
 
 queue_fail:
 	if (cic)
 		put_io_context(cic->ioc);
 
-	cfq_schedule_dispatch(cfqd);
+	elv_schedule_dispatch(cfqd->q);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	cfq_log(cfqd, "set_request fail");
 	return 1;
 }
 
-static void cfq_kick_queue(struct work_struct *work)
-{
-	struct cfq_data *cfqd =
-		container_of(work, struct cfq_data, unplug_work);
-	struct request_queue *q = cfqd->queue;
-	unsigned long flags;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	blk_start_queueing(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
-	struct cfq_data *cfqd = (struct cfq_data *) data;
-	struct cfq_queue *cfqq;
-	unsigned long flags;
-	int timed_out = 1;
-
-	cfq_log(cfqd, "idle timer fired");
-
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
-	cfqq = cfqd->active_queue;
-	if (cfqq) {
-		timed_out = 0;
-
-		/*
-		 * expired
-		 */
-		if (cfq_slice_used(cfqq))
-			goto expire;
-
-		/*
-		 * only expire and reinvoke request handler, if there are
-		 * other queues with pending requests
-		 */
-		if (!cfqd->busy_queues)
-			goto out_cont;
-
-		/*
-		 * not expired and it has a request pending, let it dispatch
-		 */
-		if (!RB_EMPTY_ROOT(&cfqq->sort_list)) {
-			cfq_mark_cfqq_must_dispatch(cfqq);
-			goto out_kick;
-		}
-	}
-expire:
-	cfq_slice_expired(cfqd, timed_out);
-out_kick:
-	cfq_schedule_dispatch(cfqd);
-out_cont:
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
-	del_timer_sync(&cfqd->idle_slice_timer);
-	cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
-	int i;
-
-	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
-	}
-
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
-	struct request_queue *q = cfqd->queue;
-
-	cfq_shutdown_timer_wq(cfqd);
+	struct request_queue *q = cfqd->q;
 
 	spin_lock_irq(q->queue_lock);
 
-	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
 							struct cfq_io_context,
@@ -2239,12 +1645,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		__cfq_exit_single_io_context(cfqd, cic);
 	}
 
-	cfq_put_async_queues(cfqd);
-
 	spin_unlock_irq(q->queue_lock);
-
-	cfq_shutdown_timer_wq(cfqd);
-
 	kfree(cfqd);
 }
 
@@ -2256,16 +1657,9 @@ static void *cfq_init_queue(struct request_queue *q)
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
 	INIT_LIST_HEAD(&cfqd->cic_list);
 
-	cfqd->queue = q;
-
-	init_timer(&cfqd->idle_slice_timer);
-	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
-	cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
-	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
+	cfqd->q = q;
 
 	cfqd->last_end_request = jiffies;
 	cfqd->cfq_quantum = cfq_quantum;
@@ -2273,11 +1667,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
-	cfqd->cfq_slice[0] = cfq_slice_async;
-	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
-	cfqd->cfq_slice_idle = cfq_slice_idle;
-	cfqd->hw_tag = 1;
 
 	return cfqd;
 }
@@ -2342,9 +1732,6 @@ SHOW_FUNCTION(cfq_fifo_expire_sync_show, cfqd->cfq_fifo_expire[1], 1);
 SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
-SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 #undef SHOW_FUNCTION
 
@@ -2372,9 +1759,6 @@ STORE_FUNCTION(cfq_fifo_expire_async_store, &cfqd->cfq_fifo_expire[0], 1,
 STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
 		UINT_MAX, 0);
-STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 #undef STORE_FUNCTION
@@ -2388,10 +1772,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(fifo_expire_async),
 	CFQ_ATTR(back_seek_max),
 	CFQ_ATTR(back_seek_penalty),
-	CFQ_ATTR(slice_sync),
-	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
-	CFQ_ATTR(slice_idle),
 	__ATTR_NULL
 };
 
@@ -2404,8 +1785,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_dispatch_fn =		cfq_dispatch_requests,
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
-		.elevator_deactivate_req_fn =	cfq_deactivate_request,
-		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -2415,7 +1794,14 @@ static struct elevator_type iosched_cfq = {
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
 		.trim =				cfq_free_io_context,
+		.elevator_free_sched_queue_fn =	cfq_free_cfq_queue,
+		.elevator_active_ioq_set_fn = 	cfq_active_ioq_set,
+		.elevator_active_ioq_reset_fn =	cfq_active_ioq_reset,
+		.elevator_arm_slice_timer_fn = 	cfq_arm_slice_timer,
+		.elevator_should_preempt_fn = 	cfq_should_preempt,
+		.elevator_update_idle_window_fn = cfq_update_idle_window,
 	},
+	.elevator_features =    ELV_IOSCHED_NEED_FQ,
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
 	.elevator_owner =	THIS_MODULE,
@@ -2423,14 +1809,6 @@ static struct elevator_type iosched_cfq = {
 
 static int __init cfq_init(void)
 {
-	/*
-	 * could be 0 on HZ < 1000 setups
-	 */
-	if (!cfq_slice_async)
-		cfq_slice_async = 1;
-	if (!cfq_slice_idle)
-		cfq_slice_idle = 1;
-
 	if (cfq_slab_setup())
 		return -ENOMEM;
 
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 04/10] Common hierarchical fair queuing code in elevator layer
  2009-03-12  1:56 ` Vivek Goyal
@ 2009-03-12  1:56     ` Vivek Goyal
  -1 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, menage-hpIqsD4AKlfQT0dZR+AlfA

This patch enables hierarchical fair queuing in the common layer. It is
controlled by config option CONFIG_GROUP_IOSCHED.
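
For reference, here is a minimal sketch (illustrative only, not part of the
patch) of how an io scheduler could consume the helpers exported below once
it notices ioc->cgroup_changed set by the cgroup attach callback. The
function name is made up for illustration; it assumes the caller already
holds q->queue_lock, and error handling is omitted.

	/*
	 * Hypothetical helper: re-parent a task's io_queue onto the io_group
	 * of the task's current cgroup.  Caller holds q->queue_lock.
	 */
	static void example_ioq_cgroup_changed(struct request_queue *q,
					       struct io_queue *ioq)
	{
		struct io_group *iog;

		/*
		 * Look up (allocating if needed) the io_group chain for the
		 * current task's cgroup; io_get_io_group() falls back to the
		 * root group if the allocation fails.
		 */
		iog = io_get_io_group(q);

		/* Move the queue's entity under the new group's sched_data */
		io_ioq_move(q->elevator, ioq, iog);
	}

As the io_ioq_move() hunk below shows, a busy queue with pending requests is
deactivated and then re-activated on the new group, so its requests keep
getting scheduled after the move.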

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-ioc.c               |    3 +
 block/elevator-fq.c           |  991 ++++++++++++++++++++++++++++++++++++++---
 block/elevator-fq.h           |  113 +++++
 block/elevator.c              |    2 +
 include/linux/blkdev.h        |    7 +-
 include/linux/cgroup_subsys.h |    7 +
 include/linux/iocontext.h     |    5 +
 init/Kconfig                  |    8 +
 8 files changed, 1064 insertions(+), 72 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 012f065..8f0f6cf 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -95,6 +95,9 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 		spin_lock_init(&ret->lock);
 		ret->ioprio_changed = 0;
 		ret->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+		ret->cgroup_changed = 0;
+#endif
 		ret->last_waited = jiffies; /* doesn't matter... */
 		ret->nr_batch_requests = 0; /* because this is 0 */
 		ret->aic = NULL;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index a8addd1..389f68e 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -20,10 +20,82 @@ int elv_slice_idle = HZ / 125;
 static struct kmem_cache *elv_ioq_pool;
 
 #define ELV_HW_QUEUE_MIN	(5)
+
+#define IO_DEFAULT_GRP_IOPRIO  4
+#define IO_DEFAULT_GRP_CLASS   IOPRIO_CLASS_BE
+
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
+void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
 /* Mainly the BFQ scheduling code Follows */
+#ifdef CONFIG_GROUP_IOSCHED
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract);
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue);
+void elv_activate_ioq(struct io_queue *ioq);
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue);
+
+static int bfq_update_next_active(struct io_sched_data *sd)
+{
+	struct io_group *iog;
+	struct io_entity *entity, *next_active;
+
+	if (sd->active_entity != NULL)
+		/* will update/requeue at the end of service */
+		return 0;
+
+	/*
+	 * NOTE: this can be improved in many ways, such as returning
+	 * 1 (and thus propagating upwards the update) only when the
+	 * budget changes, or caching the bfqq that will be scheduled
+	 * next from this subtree.  For now we worry more about
+	 * correctness than about performance...
+	 */
+	next_active = bfq_lookup_next_entity(sd, 0);
+	sd->next_active = next_active;
+
+	if (next_active != NULL) {
+		iog = container_of(sd, struct io_group, sched_data);
+		entity = iog->my_entity;
+		if (entity != NULL)
+			entity->budget = next_active->budget;
+	}
+
+	return 1;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+					 struct io_entity *entity)
+{
+	BUG_ON(sd->next_active != entity);
+}
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+	for (parent = NULL; entity != NULL; entity = parent)
+
+static inline int bfq_update_next_active(struct io_sched_data *sd)
+{
+	return 0;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+					 struct io_entity *entity)
+{
+}
+#endif
 
 /*
  * Shift for timestamp calculations.  This actually limits the maximum
@@ -288,13 +360,6 @@ void bfq_get_entity(struct io_entity *entity)
 		elv_get_ioq(ioq);
 }
 
-void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
-{
-	entity->ioprio = entity->new_ioprio;
-	entity->ioprio_class = entity->new_ioprio_class;
-	entity->sched_data = &iog->sched_data;
-}
-
 /**
  * bfq_find_deepest - find the deepest node that an extraction can modify.
  * @node: the node being removed.
@@ -520,12 +585,27 @@ static void __bfq_activate_entity(struct io_entity *entity)
 }
 
 /**
- * bfq_activate_entity - activate an entity.
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
  * @entity: the entity to activate.
+ *
+ * Activate @entity and all the entities on the path from it to the root.
  */
 void bfq_activate_entity(struct io_entity *entity)
 {
-	__bfq_activate_entity(entity);
+	struct io_sched_data *sd;
+
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_active(sd))
+			/*
+			 * No need to propagate the activation to the
+			 * upper entities, as they will be updated when
+			 * the active entity is rescheduled.
+			 */
+			break;
+	}
 }
 
 /**
@@ -561,12 +641,16 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 	else if (entity->tree != NULL)
 		BUG();
 
+	if (was_active || sd->next_active == entity)
+		ret = bfq_update_next_active(sd);
+
 	if (!requeue || !bfq_gt(entity->finish, st->vtime))
 		bfq_forget_entity(st, entity);
 	else
 		bfq_idle_insert(st, entity);
 
 	BUG_ON(sd->active_entity == entity);
+	BUG_ON(sd->next_active == entity);
 
 	return ret;
 }
@@ -578,7 +662,46 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
  */
 void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 {
-	__bfq_deactivate_entity(entity, requeue);
+	struct io_sched_data *sd;
+	struct io_entity *parent;
+
+	for_each_entity_safe(entity, parent) {
+		sd = entity->sched_data;
+
+		if (!__bfq_deactivate_entity(entity, requeue))
+			/*
+			 * The parent entity is still backlogged, and
+			 * we don't need to update it as it is still
+			 * under service.
+			 */
+			break;
+
+		if (sd->next_active != NULL)
+			/*
+			 * The parent entity is still backlogged and
+			 * the budgets on the path towards the root
+			 * need to be updated.
+			 */
+			goto update;
+
+		/*
+		 * If we reach this point the parent is no longer backlogged and
+		 * we want to propagate the dequeue upwards.
+		 */
+		requeue = 1;
+	}
+
+	return;
+
+update:
+	entity = parent;
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_active(sd))
+			break;
+	}
 }
 
 /**
@@ -695,8 +818,10 @@ struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 		entity = __bfq_lookup_next_entity(st);
 		if (entity != NULL) {
 			if (extract) {
+				bfq_check_next_active(sd, entity);
 				bfq_active_extract(st, entity);
 				sd->active_entity = entity;
+				sd->next_active = NULL;
 			}
 			break;
 		}
@@ -709,14 +834,756 @@ void entity_served(struct io_entity *entity, bfq_service_t served)
 {
 	struct io_service_tree *st;
 
-	st = io_entity_service_tree(entity);
-	entity->service += served;
-	WARN_ON_ONCE(entity->service > entity->budget);
-	BUG_ON(st->wsum == 0);
-	st->vtime += bfq_delta(served, st->wsum);
-	bfq_forget_idle(st);
+	for_each_entity(entity) {
+		st = io_entity_service_tree(entity);
+		entity->service += served;
+		WARN_ON_ONCE(entity->service > entity->budget);
+		BUG_ON(st->wsum == 0);
+		st->vtime += bfq_delta(served, st->wsum);
+		bfq_forget_idle(st);
+	}
+}
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+
+/* Mainly hierarchical grouping code */
+#ifdef CONFIG_GROUP_IOSCHED
+
+struct io_cgroup io_root_cgroup = {
+	.ioprio = IO_DEFAULT_GRP_IOPRIO,
+	.ioprio_class = IO_DEFAULT_GRP_CLASS,
+};
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+}
+
+struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
+{
+	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
+			    struct io_cgroup, css);
+}
+
+/*
+ * Search for the bfq_group associated with bfqd in the hash table (for now
+ * only a list) of bgrp.  Must be called under rcu_read_lock().
+ */
+struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	void *__key;
+
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		__key = rcu_dereference(iog->key);
+		if (__key == key)
+			return iog;
+	}
+
+	return NULL;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	struct cgroup *cgroup;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	cgroup = task_cgroup(current, io_subsys_id);
+	iocg = cgroup_to_io_cgroup(cgroup);
+	iog = io_cgroup_lookup_group(iocg, efqd);
+	return iog;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct io_entity *entity = &iog->entity;
+
+	entity->ioprio = entity->new_ioprio = iocg->ioprio;
+	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
+	entity->ioprio_changed = 1;
+	entity->my_sched_data = &iog->sched_data;
+}
+
+void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+	struct io_entity *entity;
+
+	BUG_ON(parent == NULL);
+	BUG_ON(iog == NULL);
+
+	entity = &iog->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+void io_flush_idle_tree(struct io_service_tree *st)
+{
+	struct io_entity *entity = st->first_idle;
+
+	for (; entity != NULL; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
+
+#define SHOW_FUNCTION(__VAR)						\
+static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
+				       struct cftype *cftype)		\
+{									\
+	struct io_cgroup *iocg;					\
+	u64 ret;							\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+	spin_lock_irq(&iocg->lock);					\
+	ret = iocg->__VAR;						\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return ret;							\
+}
+
+SHOW_FUNCTION(ioprio);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
+static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
+					struct cftype *cftype,		\
+					u64 val)			\
+{									\
+	struct io_cgroup *iocg;					\
+	struct io_group *iog;						\
+	struct hlist_node *n;						\
+									\
+	if (val < (__MIN) || val > (__MAX))				\
+		return -EINVAL;						\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+									\
+	spin_lock_irq(&iocg->lock);					\
+	iocg->__VAR = (unsigned char)val;				\
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		iog->entity.new_##__VAR = (unsigned char)val;		\
+		smp_wmb();						\
+		iog->entity.ioprio_changed = 1;			\
+	}								\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return 0;							\
+}
+
+STORE_FUNCTION(ioprio, 0, IOPRIO_BE_NR - 1);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+/**
+ * bfq_group_chain_alloc - allocate a chain of groups.
+ * @bfqd: queue descriptor.
+ * @cgroup: the leaf cgroup this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup.  Stop if a cgroup on the chain
+ * to the root has already an allocated group on @bfqd.
+ */
+struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
+					struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *leaf = NULL, *prev = NULL;
+	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+
+	for (; cgroup != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		if (iog != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a bfq_group for bfqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		iog = kzalloc_node(sizeof(*iog), flags, q->node);
+		if (!iog)
+			goto cleanup;
+
+		io_group_init_entity(iocg, iog);
+		iog->my_entity = &iog->entity;
+
+		if (leaf == NULL) {
+			leaf = iog;
+			prev = leaf;
+		} else {
+			io_group_set_parent(prev, iog);
+			/*
+			 * Build a list of allocated nodes using the bfqd
+			 * field, which is still unused and will be initialized
+			 * only once the node is connected.
+			 */
+			prev->key = iog;
+			prev = iog;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->key;
+		kfree(iog);
+	}
+
+	return NULL;
 }
 
+/**
+ * bfq_group_chain_link - link an allocated group chain to a cgroup hierarchy.
+ * @bfqd: the queue descriptor.
+ * @cgroup: the leaf cgroup to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @bfqd all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the bfqio_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+void io_group_chain_link(struct request_queue *q, void *key,
+				struct cgroup *cgroup,
+				struct io_group *leaf,
+				struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(q->queue_lock);
+
+	for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		next = leaf->key;
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		BUG_ON(iog != NULL);
+
+		spin_lock_irqsave(&iocg->lock, flags);
+
+		rcu_assign_pointer(leaf->key, key);
+		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+		spin_unlock_irqrestore(&iocg->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	BUG_ON(cgroup == NULL && leaf != NULL);
+
+	if (cgroup != NULL && prev != NULL) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		iog = io_cgroup_lookup_group(iocg, key);
+		io_group_set_parent(prev, iog);
+	}
+}
+
+/**
+ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
+ * @bfqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ *
+ * Return a group associated to @bfqd in @cgroup, allocating one if
+ * necessary.  When a group is returned all the cgroups in the path
+ * to the root have a group associated to @bfqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback.  If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+struct io_group *io_find_alloc_group(struct request_queue *q,
+			struct cgroup *cgroup, struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog;
+	/* Note: Use efqd as key */
+	void *key = efqd;
+
+	iog = io_cgroup_lookup_group(iocg, key);
+	if (iog != NULL)
+		return iog;
+
+	iog = io_group_chain_alloc(q, key, cgroup);
+	if (iog != NULL)
+		io_group_chain_link(q, key, cgroup, iog, efqd);
+
+	return iog;
+}
+
+/*
+ * Generic function to make sure the cgroup hierarchy is set up once a request
+ * from a cgroup is received by the io scheduler.
+ */
+struct io_group *io_get_io_group(struct request_queue *q)
+{
+	struct cgroup *cgroup;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	rcu_read_lock();
+	cgroup = task_cgroup(current, io_subsys_id);
+	iog = io_find_alloc_group(q, cgroup, efqd);
+	if (iog == NULL)
+		iog = efqd->root_group;
+	rcu_read_unlock();
+
+	return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_cgroup *iocg = &io_root_cgroup;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_group *iog = efqd->root_group;
+
+	BUG_ON(!iog);
+	spin_lock_irq(&iocg->lock);
+	hlist_del_rcu(&iog->group_node);
+	spin_unlock_irq(&iocg->lock);
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	iog->entity.parent = NULL;
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	iocg = &io_root_cgroup;
+	spin_lock_irq(&iocg->lock);
+	rcu_assign_pointer(iog->key, key);
+	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+	spin_unlock_irq(&iocg->lock);
+
+	return iog;
+}
+
+struct cftype bfqio_files[] = {
+	{
+		.name = "ioprio",
+		.read_u64 = io_cgroup_ioprio_read,
+		.write_u64 = io_cgroup_ioprio_write,
+	},
+	{
+		.name = "ioprio_class",
+		.read_u64 = io_cgroup_ioprio_class_read,
+		.write_u64 = io_cgroup_ioprio_class_write,
+	},
+};
+
+int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	return cgroup_add_files(cgroup, subsys, bfqio_files,
+				ARRAY_SIZE(bfqio_files));
+}
+
+struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
+						struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+
+	if (cgroup->parent != NULL) {
+		iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
+		if (iocg == NULL)
+			return ERR_PTR(-ENOMEM);
+	} else
+		iocg = &io_root_cgroup;
+
+	spin_lock_init(&iocg->lock);
+	INIT_HLIST_HEAD(&iocg->group_data);
+	iocg->ioprio = IO_DEFAULT_GRP_IOPRIO;
+	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+
+	return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic/bfqq data structures.  For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+int iocg_can_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			    struct task_struct *tsk)
+{
+	struct io_context *ioc;
+	int ret = 0;
+
+	/* task_lock() is needed to avoid races with exit_io_context() */
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+		/*
+		 * ioc == NULL means that the task is either too young or
+		 * exiting: if it still has no ioc then the ioc can't be shared,
+		 * if the task is exiting the attach will fail anyway, no
+		 * matter what we return here.
+		 */
+		ret = -EINVAL;
+	task_unlock(tsk);
+
+	return ret;
+}
+
+void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			 struct cgroup *prev, struct task_struct *tsk)
+{
+	struct io_context *ioc;
+
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL)
+		ioc->cgroup_changed = 1;
+	task_unlock(tsk);
+}
+
+/*
+ * Move the queue to the root group if it is active. This is needed when
+ * a cgroup is being deleted and all the IO is not done yet. This is not a
+ * very good scheme, as a user might get an unfair share. This needs to be
+ * fixed.
+ */
+void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+				struct io_group *iog)
+{
+	int busy, resume;
+	struct io_entity *entity = &ioq->entity;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	busy = elv_ioq_busy(ioq);
+	resume = !!ioq->nr_queued;
+
+	BUG_ON(resume && !entity->on_st);
+	BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+
+	/*
+	 * We could be moving a queue which is on the idle tree of its previous
+	 * group. What to do? This queue presumably does not have any requests,
+	 * so just forget the entity and free it up from the idle tree.
+	 *
+	 * This needs cleanup. Hackish.
+	 */
+	if (entity->tree == &st->idle) {
+		BUG_ON(atomic_read(&ioq->ref) < 2);
+		bfq_put_idle_entity(st, entity);
+	}
+
+	if (busy) {
+		BUG_ON(atomic_read(&ioq->ref) < 2);
+
+		if (!resume)
+			elv_del_ioq_busy(e, ioq, 0);
+		else
+			elv_deactivate_ioq(efqd, ioq, 0);
+	}
+
+	/*
+	 * Here we use a reference to bfqg.  We don't need a refcounter
+	 * as the cgroup reference will not be dropped, so that its
+	 * destroy() callback will not be invoked.
+	 */
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+
+	if (busy && resume)
+		elv_activate_ioq(ioq);
+}
+EXPORT_SYMBOL(io_ioq_move);
+
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	struct elevator_queue *eq;
+	struct io_entity *entity = iog->my_entity;
+	struct io_service_tree *st;
+	int i;
+
+	eq = container_of(efqd, struct elevator_queue, efqd);
+	hlist_del(&iog->elv_data_node);
+	__bfq_deactivate_entity(entity, 0);
+	io_put_io_group_queues(eq, iog);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		/*
+		 * The idle tree may still contain bfq_queues belonging
+		 * to exited tasks because they never migrated to a different
+		 * cgroup from the one being destroyed now.  No one else
+		 * can access them so it's safe to act without any lock.
+		 */
+		io_flush_idle_tree(st);
+
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+	}
+
+	BUG_ON(iog->sched_data.next_active != NULL);
+	BUG_ON(iog->sched_data.active_entity != NULL);
+	BUG_ON(entity->tree != NULL);
+}
+
+/**
+ * bfq_destroy_group - destroy @bfqg.
+ * @bgrp: the bfqio_cgroup containing @bfqg.
+ * @bfqg: the group being destroyed.
+ *
+ * Destroy @bfqg, making sure that it is not referenced from its parent.
+ */
+static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct elv_fq_data *efqd = NULL;
+	unsigned long uninitialized_var(flags);
+
+	/* Remove io group from cgroup list */
+	hlist_del(&iog->group_node);
+
+	/*
+	 * io groups are linked in two lists. One list is maintained
+	 * in elevator (efqd->group_list) and other is maintained
+	 * per cgroup structure (iocg->group_data).
+	 *
+	 * While a cgroup is being deleted, the elevator also might be
+	 * exiting and both might try to clean up the same io group,
+	 * so we need to be a little careful.
+	 *
+	 * The following code first accesses efqd under RCU to make sure
+	 * iog->key is pointing to a valid efqd and then takes the
+	 * associated queue lock. After getting the queue lock it
+	 * again checks whether the elevator exit path has already got
+	 * hold of the io group (iog->key == NULL). If so, it does not
+	 * try to free up async queues again or flush the idle tree.
+	 */
+
+	rcu_read_lock();
+	efqd = rcu_dereference(iog->key);
+	if (efqd != NULL) {
+		spin_lock_irqsave(efqd->queue->queue_lock, flags);
+		if (iog->key == efqd)
+			__io_destroy_group(efqd, iog);
+		spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+	}
+	rcu_read_unlock();
+
+	/*
+	 * No need to defer the kfree() to the end of the RCU grace
+	 * period: we are called from the destroy() callback of our
+	 * cgroup, so we can be sure that no one is a) still using
+	 * this cgroup or b) doing lookups in it.
+	 */
+	kfree(iog);
+}
+
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct hlist_node *n, *tmp;
+	struct io_group *iog;
+
+	/*
+	 * Since we are destroying the cgroup, there are no more tasks
+	 * referencing it, and all the RCU grace periods that may have
+	 * referenced it are ended (as the destruction of the parent
+	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
+	 * anything else and we don't need any synchronization.
+	 */
+	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
+		io_destroy_group(iocg, iog);
+
+	BUG_ON(!hlist_empty(&iocg->group_data));
+
+	kfree(iocg);
+}
+
+void io_disconnect_groups(struct elevator_queue *e)
+{
+	struct hlist_node *pos, *n;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+					elv_data_node) {
+		hlist_del(&iog->elv_data_node);
+
+		__bfq_deactivate_entity(iog->my_entity, 0);
+
+		/*
+		 * Don't remove from the group hash, just set an
+		 * invalid key.  No lookups can race with the
+		 * assignment as bfqd is being destroyed; this
+		 * implies also that new elements cannot be added
+		 * to the list.
+		 */
+		rcu_assign_pointer(iog->key, NULL);
+		io_put_io_group_queues(e, iog);
+	}
+}
+
+struct cgroup_subsys io_subsys = {
+	.name = "io",
+	.create = iocg_create,
+	.can_attach = iocg_can_attach,
+	.attach = iocg_attach,
+	.destroy = iocg_destroy,
+	.populate = iocg_populate,
+	.subsys_id = io_subsys_id,
+};
+
+/* If the bio submitting task and rq don't belong to the same io_group, they
+ * can't be merged */
+int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	struct request_queue *q = rq->q;
+	struct io_queue *ioq = rq->ioq;
+	struct io_group *iog, *__iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return 1;
+
+	/* Determine the io group of the bio submitting task */
+	iog = io_lookup_io_group_current(q);
+	if (!iog) {
+		/* Maybe the task belongs to a different cgroup for which the
+		 * io group has not been set up yet. */
+		return 0;
+	}
+
+	/* Determine the io group of the ioq, rq belongs to*/
+	__iog = ioq_to_io_group(ioq);
+
+	return (iog == __iog);
+}
+
+/* find/create the io group request belongs to and put that info in rq */
+void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+	struct io_group *iog;
+	unsigned long flags;
+
+	/* Make sure the io group hierarchy has been set up and also set the
+	 * io group to which rq belongs. Later we should make use of
+	 * bio cgroup patches to determine the io group */
+	spin_lock_irqsave(q->queue_lock, flags);
+	iog = io_get_io_group(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	BUG_ON(!iog);
+
+	/* Store iog in rq. TODO: take care of referencing */
+	rq->iog = iog;
+}
+
+#else /* GROUP_IOSCHED */
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q)
+{
+	return q->elevator->efqd.root_group;
+}
+
+#endif /* CONFIG_GROUP_IOSCHED*/
+
 /* Elevator fair queuing function */
 struct io_queue *rq_ioq(struct request *rq)
 {
@@ -995,9 +1862,11 @@ EXPORT_SYMBOL(elv_put_ioq);
 
 void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
 {
+	struct io_group *root_group = e->efqd.root_group;
 	struct io_queue *ioq = *ioq_ptr;
 
 	if (ioq != NULL) {
+		io_ioq_move(e, ioq, root_group);
 		/* Drop the reference taken by the io group */
 		elv_put_ioq(ioq);
 		*ioq_ptr = NULL;
@@ -1022,14 +1891,27 @@ struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
 		return NULL;
 
 	sd = &efqd->root_group->sched_data;
-	if (extract)
-		entity = bfq_lookup_next_entity(sd, 1);
-	else
-		entity = bfq_lookup_next_entity(sd, 0);
+	for (; sd != NULL; sd = entity->my_sched_data) {
+		if (extract)
+			entity = bfq_lookup_next_entity(sd, 1);
+		else
+			entity = bfq_lookup_next_entity(sd, 0);
+
+		/*
+		 * The entity can be NULL, despite the fact that there are
+		 * busy queues, if all the busy queues are under a group which
+		 * is currently under service. So if we are just looking for
+		 * the next ioq while something is being served, a NULL entity
+		 * is not an error.
+		 */
+		BUG_ON(!entity && extract);
 
-	BUG_ON(!entity);
-	if (extract)
-		entity->service = 0;
+		if (extract)
+			entity->service = 0;
+
+		if (!entity)
+			return NULL;
+	}
 	ioq = io_entity_to_ioq(entity);
 
 	return ioq;
@@ -1262,6 +2144,7 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 {
 	struct io_queue *ioq;
 	struct elevator_queue *eq = q->elevator;
+	struct io_group *iog = NULL, *new_iog = NULL;
 
 	ioq = elv_active_ioq(eq);
 
@@ -1283,10 +2166,17 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
 		return 1;
 
+	iog = ioq_to_io_group(ioq);
+	new_iog = ioq_to_io_group(new_ioq);
+
 	/*
-	 * Check with io scheduler if it has additional criterion based on
-	 * which it wants to preempt existing queue.
+	 * If both the queues belong to the same group, check with the io
+	 * scheduler if it has an additional criterion based on which it
+	 * wants to preempt the existing queue.
 	 */
+	if (iog != new_iog)
+		return 0;
+
 	if (eq->ops->elevator_should_preempt_fn)
 		return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
 
@@ -1663,14 +2553,6 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		elv_schedule_dispatch(q);
 }
 
-struct io_group *io_lookup_io_group_current(struct request_queue *q)
-{
-	struct elv_fq_data *efqd = &q->elevator->efqd;
-
-	return efqd->root_group;
-}
-EXPORT_SYMBOL(io_lookup_io_group_current);
-
 void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio)
 {
@@ -1721,44 +2603,6 @@ void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 }
 EXPORT_SYMBOL(io_group_set_async_queue);
 
-/*
- * Release all the io group references to its async queues.
- */
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
-{
-	int i, j;
-
-	for (i = 0; i < 2; i++)
-		for (j = 0; j < IOPRIO_BE_NR; j++)
-			elv_release_ioq(e, &iog->async_queue[i][j]);
-
-	/* Free up async idle queue */
-	elv_release_ioq(e, &iog->async_idle_queue);
-}
-
-struct io_group *io_alloc_root_group(struct request_queue *q,
-					struct elevator_queue *e, void *key)
-{
-	struct io_group *iog;
-	int i;
-
-	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
-	if (iog == NULL)
-		return NULL;
-
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
-		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
-
-	return iog;
-}
-
-void io_free_root_group(struct elevator_queue *e)
-{
-	struct io_group *iog = e->efqd.root_group;
-	io_put_io_group_queues(e, iog);
-	kfree(iog);
-}
-
 static void elv_slab_kill(void)
 {
 	/*
@@ -1804,6 +2648,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
 
 	INIT_LIST_HEAD(&efqd->idle_list);
+	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
@@ -1833,10 +2678,14 @@ void elv_exit_fq_data(struct elevator_queue *e)
 	spin_lock_irq(q->queue_lock);
 	/* This should drop all the idle tree references of ioq */
 	elv_free_idle_ioq_list(e);
+	/* This should drop all the io group references of async queues */
+	io_disconnect_groups(e);
 	spin_unlock_irq(q->queue_lock);
 
 	elv_shutdown_timer_wq(e);
 
+	/* Wait for iog->key accessors to exit their grace periods. */
+	synchronize_rcu();
 	BUG_ON(timer_pending(&efqd->idle_slice_timer));
 	io_free_root_group(e);
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index b5a0d08..3fab8f8 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -9,6 +9,7 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/cgroup.h>
 
 #ifndef _BFQ_SCHED_H
 #define _BFQ_SCHED_H
@@ -69,6 +70,7 @@ struct io_service_tree {
  */
 struct io_sched_data {
 	struct io_entity *active_entity;
+	struct io_entity *next_active;
 	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
 };
 
@@ -183,17 +185,90 @@ struct io_queue {
 	unsigned long total_service;
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both bfq_queues and bfq_groups).
+ * @group_node: node to be inserted into the bfqio_cgroup->group_data
+ *              list of the containing cgroup's bfqio_cgroup.
+ * @bfqd_node: node to be inserted into the @bfqd->group_list list
+ *             of the groups active on the same device; used for cleanup.
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/migration.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @group_node is protected by the bfqio_cgroup lock, and is accessed
+ *      via RCU from its readers.
+ *    o @bfqd is protected by the queue lock, RCU is used to access it
+ *      from the readers.
+ *    o All the other fields are protected by the @bfqd queue lock.
+ */
 struct io_group {
+	struct io_entity entity;
+	struct hlist_node elv_data_node;
+	struct hlist_node group_node;
 	struct io_sched_data sched_data;
 
+	struct io_entity *my_entity;
+
+	/*
+	 * A cgroup has multiple io_groups, one for each request queue.
+	 * To find the io group belonging to a particular queue, the
+	 * elv_fq_data pointer is stored as a key.
+	 */
+	void *key;
+
 	/* async_queue and idle_queue are used only for cfq */
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
 };
 
+/**
+ * struct bfqio_cgroup - bfq cgroup data structure.
+ * @css: subsystem state for bfq in the containing cgroup.
+ * @ioprio: cgroup ioprio.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @ioprio, @ioprio_class and @group_data.
+ * @group_data: list containing the bfq_group belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates,
+ * @ioprio and @ioprio_class are protected by @lock.
+ */
+struct io_cgroup {
+	struct cgroup_subsys_state css;
+
+	unsigned short ioprio, ioprio_class;
+
+	spinlock_t lock;
+	struct hlist_head group_data;
+};
+#else
+struct io_group {
+	struct io_sched_data sched_data;
+
+	/* async_queue and idle_queue are used only for cfq */
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+};
+#endif
+
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	/* List of io groups hanging on this elevator */
+	struct hlist_head group_list;
+
 	/* List of io queues on idle tree. */
 	struct list_head idle_list;
 
@@ -380,6 +455,39 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 						sched_data);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+					struct io_group *iog);
+extern void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq);
+#else /* !GROUP_IOSCHED */
+/*
+ * No ioq movement is needed in case of a flat setup. The root io group gets
+ * cleaned up upon elevator exit, and before that it is made sure that both
+ * the active and idle trees are empty.
+ */
+static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+					struct io_group *iog)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
+/*
+ * Currently the root group is not part of the elevator group list and is
+ * freed separately. Hence, for a non-hierarchical setup there is nothing to do.
+ */
+static inline void io_disconnect_groups(struct elevator_queue *e) {}
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+#endif /* GROUP_IOSCHED */
+
 /* Functions used by blksysfs.c */
 extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
 extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
@@ -475,5 +583,10 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
 {
 	return NULL;
 }
+
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+}
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 7a3a7e9..27889bc 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -888,6 +888,8 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_set_request_io_group(q, rq);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cf02216..0baeb8e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -238,7 +238,12 @@ struct request {
 #ifdef CONFIG_ELV_FAIR_QUEUING
 	/* io queue request belongs to */
 	struct io_queue *ioq;
-#endif
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* io group request belongs to */
+	struct io_group *iog;
+#endif /* GROUP_IOSCHED */
+#endif /* ELV_FAIR_QUEUING */
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..68ea6bd 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,10 @@ SUBSYS(net_cls)
 #endif
 
 /* */
+
+#ifdef CONFIG_GROUP_IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
+
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 08b987b..51664bb 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -73,6 +73,11 @@ struct io_context {
 	unsigned short ioprio;
 	unsigned short ioprio_changed;
 
+#ifdef CONFIG_GROUP_IOSCHED
+	/* If the task changes cgroup, the elevator processes it asynchronously */
+	unsigned short cgroup_changed;
+#endif
+
 	/*
 	 * For request batching
 	 */
diff --git a/init/Kconfig b/init/Kconfig
index 6a5c5fe..66c2310 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -538,6 +538,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  there will be no overhead from this. Even when you set this config=y,
 	  if boot option "noswapaccount" is set, swap will not be accounted.
 
+config GROUP_IOSCHED
+	bool "Group IO Scheduler"
+	depends on CGROUPS && ELV_FAIR_QUEUING
+	default n
+	---help---
+	  This feature lets the IO scheduler recognize task groups and control
+	  disk bandwidth allocation to such task groups.
+
 endif # CGROUPS
 
 config MM_OWNER
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 04/10] Common hierarchical fair queuing code in elevaotor layer
@ 2009-03-12  1:56     ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers
  Cc: vgoyal, akpm, menage, peterz

This patch enables hierarchical fair queuing in common layer. It is
controlled by config option CONFIG_GROUP_IOSCHED.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-ioc.c               |    3 +
 block/elevator-fq.c           |  991 ++++++++++++++++++++++++++++++++++++++---
 block/elevator-fq.h           |  113 +++++
 block/elevator.c              |    2 +
 include/linux/blkdev.h        |    7 +-
 include/linux/cgroup_subsys.h |    7 +
 include/linux/iocontext.h     |    5 +
 init/Kconfig                  |    8 +
 8 files changed, 1064 insertions(+), 72 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 012f065..8f0f6cf 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -95,6 +95,9 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 		spin_lock_init(&ret->lock);
 		ret->ioprio_changed = 0;
 		ret->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+		ret->cgroup_changed = 0;
+#endif
 		ret->last_waited = jiffies; /* doesn't matter... */
 		ret->nr_batch_requests = 0; /* because this is 0 */
 		ret->aic = NULL;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index a8addd1..389f68e 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -20,10 +20,82 @@ int elv_slice_idle = HZ / 125;
 static struct kmem_cache *elv_ioq_pool;
 
 #define ELV_HW_QUEUE_MIN	(5)
+
+#define IO_DEFAULT_GRP_IOPRIO  4
+#define IO_DEFAULT_GRP_CLASS   IOPRIO_CLASS_BE
+
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
+void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
 /* Mainly the BFQ scheduling code Follows */
+#ifdef CONFIG_GROUP_IOSCHED
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract);
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue);
+void elv_activate_ioq(struct io_queue *ioq);
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue);
+
+static int bfq_update_next_active(struct io_sched_data *sd)
+{
+	struct io_group *iog;
+	struct io_entity *entity, *next_active;
+
+	if (sd->active_entity != NULL)
+		/* will update/requeue at the end of service */
+		return 0;
+
+	/*
+	 * NOTE: this can be improved in may ways, such as returning
+	 * 1 (and thus propagating upwards the update) only when the
+	 * budget changes, or caching the bfqq that will be scheduled
+	 * next from this subtree.  By now we worry more about
+	 * correctness than about performance...
+	 */
+	next_active = bfq_lookup_next_entity(sd, 0);
+	sd->next_active = next_active;
+
+	if (next_active != NULL) {
+		iog = container_of(sd, struct io_group, sched_data);
+		entity = iog->my_entity;
+		if (entity != NULL)
+			entity->budget = next_active->budget;
+	}
+
+	return 1;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+					 struct io_entity *entity)
+{
+	BUG_ON(sd->next_active != entity);
+}
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+	for (parent = NULL; entity != NULL; entity = parent)
+
+static inline int bfq_update_next_active(struct io_sched_data *sd)
+{
+	return 0;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+					 struct io_entity *entity)
+{
+}
+#endif
 
 /*
  * Shift for timestamp calculations.  This actually limits the maximum
@@ -288,13 +360,6 @@ void bfq_get_entity(struct io_entity *entity)
 		elv_get_ioq(ioq);
 }
 
-void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
-{
-	entity->ioprio = entity->new_ioprio;
-	entity->ioprio_class = entity->new_ioprio_class;
-	entity->sched_data = &iog->sched_data;
-}
-
 /**
  * bfq_find_deepest - find the deepest node that an extraction can modify.
  * @node: the node being removed.
@@ -520,12 +585,27 @@ static void __bfq_activate_entity(struct io_entity *entity)
 }
 
 /**
- * bfq_activate_entity - activate an entity.
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
  * @entity: the entity to activate.
+ *
+ * Activate @entity and all the entities on the path from it to the root.
  */
 void bfq_activate_entity(struct io_entity *entity)
 {
-	__bfq_activate_entity(entity);
+	struct io_sched_data *sd;
+
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_active(sd))
+			/*
+			 * No need to propagate the activation to the
+			 * upper entities, as they will be updated when
+			 * the active entity is rescheduled.
+			 */
+			break;
+	}
 }
 
 /**
@@ -561,12 +641,16 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 	else if (entity->tree != NULL)
 		BUG();
 
+	if (was_active || sd->next_active == entity)
+		ret = bfq_update_next_active(sd);
+
 	if (!requeue || !bfq_gt(entity->finish, st->vtime))
 		bfq_forget_entity(st, entity);
 	else
 		bfq_idle_insert(st, entity);
 
 	BUG_ON(sd->active_entity == entity);
+	BUG_ON(sd->next_active == entity);
 
 	return ret;
 }
@@ -578,7 +662,46 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
  */
 void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 {
-	__bfq_deactivate_entity(entity, requeue);
+	struct io_sched_data *sd;
+	struct io_entity *parent;
+
+	for_each_entity_safe(entity, parent) {
+		sd = entity->sched_data;
+
+		if (!__bfq_deactivate_entity(entity, requeue))
+			/*
+			 * The parent entity is still backlogged, and
+			 * we don't need to update it as it is still
+			 * under service.
+			 */
+			break;
+
+		if (sd->next_active != NULL)
+			/*
+			 * The parent entity is still backlogged and
+			 * the budgets on the path towards the root
+			 * need to be updated.
+			 */
+			goto update;
+
+		/*
+		 * If we reach there the parent is no more backlogged and
+		 * we want to propagate the dequeue upwards.
+		 */
+		requeue = 1;
+	}
+
+	return;
+
+update:
+	entity = parent;
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_active(sd))
+			break;
+	}
 }
 
 /**
@@ -695,8 +818,10 @@ struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 		entity = __bfq_lookup_next_entity(st);
 		if (entity != NULL) {
 			if (extract) {
+				bfq_check_next_active(sd, entity);
 				bfq_active_extract(st, entity);
 				sd->active_entity = entity;
+				sd->next_active = NULL;
 			}
 			break;
 		}
@@ -709,14 +834,756 @@ void entity_served(struct io_entity *entity, bfq_service_t served)
 {
 	struct io_service_tree *st;
 
-	st = io_entity_service_tree(entity);
-	entity->service += served;
-	WARN_ON_ONCE(entity->service > entity->budget);
-	BUG_ON(st->wsum == 0);
-	st->vtime += bfq_delta(served, st->wsum);
-	bfq_forget_idle(st);
+	for_each_entity(entity) {
+		st = io_entity_service_tree(entity);
+		entity->service += served;
+		WARN_ON_ONCE(entity->service > entity->budget);
+		BUG_ON(st->wsum == 0);
+		st->vtime += bfq_delta(served, st->wsum);
+		bfq_forget_idle(st);
+	}
+}
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+
+/* Mainly hierarchical grouping code */
+#ifdef CONFIG_GROUP_IOSCHED
+
+struct io_cgroup io_root_cgroup = {
+	.ioprio = IO_DEFAULT_GRP_IOPRIO,
+	.ioprio_class = IO_DEFAULT_GRP_CLASS,
+};
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+}
+
+struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
+{
+	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
+			    struct io_cgroup, css);
+}
+
+/*
+ * Search the bfq_group for bfqd into the hash table (by now only a list)
+ * of bgrp.  Must be called under rcu_read_lock().
+ */
+struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	void *__key;
+
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		__key = rcu_dereference(iog->key);
+		if (__key == key)
+			return iog;
+	}
+
+	return NULL;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	struct cgroup *cgroup;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	cgroup = task_cgroup(current, io_subsys_id);
+	iocg = cgroup_to_io_cgroup(cgroup);
+	iog = io_cgroup_lookup_group(iocg, efqd);
+	return iog;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct io_entity *entity = &iog->entity;
+
+	entity->ioprio = entity->new_ioprio = iocg->ioprio;
+	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
+	entity->ioprio_changed = 1;
+	entity->my_sched_data = &iog->sched_data;
+}
+
+void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+	struct io_entity *entity;
+
+	BUG_ON(parent == NULL);
+	BUG_ON(iog == NULL);
+
+	entity = &iog->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+void io_flush_idle_tree(struct io_service_tree *st)
+{
+	struct io_entity *entity = st->first_idle;
+
+	for (; entity != NULL; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
+
+#define SHOW_FUNCTION(__VAR)						\
+static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
+				       struct cftype *cftype)		\
+{									\
+	struct io_cgroup *iocg;					\
+	u64 ret;							\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+	spin_lock_irq(&iocg->lock);					\
+	ret = iocg->__VAR;						\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return ret;							\
+}
+
+SHOW_FUNCTION(ioprio);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
+static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
+					struct cftype *cftype,		\
+					u64 val)			\
+{									\
+	struct io_cgroup *iocg;					\
+	struct io_group *iog;						\
+	struct hlist_node *n;						\
+									\
+	if (val < (__MIN) || val > (__MAX))				\
+		return -EINVAL;						\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+									\
+	spin_lock_irq(&iocg->lock);					\
+	iocg->__VAR = (unsigned char)val;				\
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		iog->entity.new_##__VAR = (unsigned char)val;		\
+		smp_wmb();						\
+		iog->entity.ioprio_changed = 1;			\
+	}								\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return 0;							\
+}
+
+STORE_FUNCTION(ioprio, 0, IOPRIO_BE_NR - 1);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+/**
+ * bfq_group_chain_alloc - allocate a chain of groups.
+ * @bfqd: queue descriptor.
+ * @cgroup: the leaf cgroup this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup.  Stop if a cgroup on the chain
+ * to the root already has an allocated group on @bfqd.
+ */
+struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
+					struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *leaf = NULL, *prev = NULL;
+	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+
+	for (; cgroup != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		if (iog != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a bfq_group for bfqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		iog = kzalloc_node(sizeof(*iog), flags, q->node);
+		if (!iog)
+			goto cleanup;
+
+		io_group_init_entity(iocg, iog);
+		iog->my_entity = &iog->entity;
+
+		if (leaf == NULL) {
+			leaf = iog;
+			prev = leaf;
+		} else {
+			io_group_set_parent(prev, iog);
+			/*
+			 * Build a list of allocated nodes using the bfqd
+			 * field, which is still unused and will be initialized
+			 * only after the node is connected.
+			 */
+			prev->key = iog;
+			prev = iog;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->key;
+		kfree(prev);
+	}
+
+	return NULL;
 }
 
+/**
+ * bfq_group_chain_link - link an allocated group chain to a cgroup hierarchy.
+ * @bfqd: the queue descriptor.
+ * @cgroup: the leaf cgroup to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @bfqd all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the bfqio_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+void io_group_chain_link(struct request_queue *q, void *key,
+				struct cgroup *cgroup,
+				struct io_group *leaf,
+				struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(q->queue_lock);
+
+	for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		next = leaf->key;
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		BUG_ON(iog != NULL);
+
+		spin_lock_irqsave(&iocg->lock, flags);
+
+		rcu_assign_pointer(leaf->key, key);
+		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+		spin_unlock_irqrestore(&iocg->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	BUG_ON(cgroup == NULL && leaf != NULL);
+
+	if (cgroup != NULL && prev != NULL) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		iog = io_cgroup_lookup_group(iocg, key);
+		io_group_set_parent(prev, iog);
+	}
+}
+
+/**
+ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
+ * @bfqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ *
+ * Return a group associated to @bfqd in @cgroup, allocating one if
+ * necessary.  When a group is returned all the cgroups in the path
+ * to the root have a group associated to @bfqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback.  If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+struct io_group *io_find_alloc_group(struct request_queue *q,
+			struct cgroup *cgroup, struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog;
+	/* Note: Use efqd as key */
+	void *key = efqd;
+
+	iog = io_cgroup_lookup_group(iocg, key);
+	if (iog != NULL)
+		return iog;
+
+	iog = io_group_chain_alloc(q, key, cgroup);
+	if (iog != NULL)
+		io_group_chain_link(q, key, cgroup, iog, efqd);
+
+	return iog;
+}
+
+/*
+ * Generic function to make sure the cgroup hierarchy is fully set up once a
+ * request from a cgroup is received by the io scheduler.
+ */
+struct io_group *io_get_io_group(struct request_queue *q)
+{
+	struct cgroup *cgroup;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	rcu_read_lock();
+	cgroup = task_cgroup(current, io_subsys_id);
+	iog = io_find_alloc_group(q, cgroup, efqd);
+	if (iog == NULL)
+		iog = efqd->root_group;
+	rcu_read_unlock();
+
+	return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_cgroup *iocg = &io_root_cgroup;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_group *iog = efqd->root_group;
+
+	BUG_ON(!iog);
+	spin_lock_irq(&iocg->lock);
+	hlist_del_rcu(&iog->group_node);
+	spin_unlock_irq(&iocg->lock);
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	iog->entity.parent = NULL;
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	iocg = &io_root_cgroup;
+	spin_lock_irq(&iocg->lock);
+	rcu_assign_pointer(iog->key, key);
+	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+	spin_unlock_irq(&iocg->lock);
+
+	return iog;
+}
+
+struct cftype bfqio_files[] = {
+	{
+		.name = "ioprio",
+		.read_u64 = io_cgroup_ioprio_read,
+		.write_u64 = io_cgroup_ioprio_write,
+	},
+	{
+		.name = "ioprio_class",
+		.read_u64 = io_cgroup_ioprio_class_read,
+		.write_u64 = io_cgroup_ioprio_class_write,
+	},
+};
+
+int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	return cgroup_add_files(cgroup, subsys, bfqio_files,
+				ARRAY_SIZE(bfqio_files));
+}
+
+struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
+						struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+
+	if (cgroup->parent != NULL) {
+		iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
+		if (iocg == NULL)
+			return ERR_PTR(-ENOMEM);
+	} else
+		iocg = &io_root_cgroup;
+
+	spin_lock_init(&iocg->lock);
+	INIT_HLIST_HEAD(&iocg->group_data);
+	iocg->ioprio = IO_DEFAULT_GRP_IOPRIO;
+	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+
+	return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic/bfqq data structures.  For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+int iocg_can_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			    struct task_struct *tsk)
+{
+	struct io_context *ioc;
+	int ret = 0;
+
+	/* task_lock() is needed to avoid races with exit_io_context() */
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+		/*
+		 * ioc == NULL means that the task is either too young or
+		 * exiting: if it still has no ioc, the ioc can't be shared;
+		 * if the task is exiting, the attach will fail anyway, no
+		 * matter what we return here.
+		 */
+		ret = -EINVAL;
+	task_unlock(tsk);
+
+	return ret;
+}
+
+void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			 struct cgroup *prev, struct task_struct *tsk)
+{
+	struct io_context *ioc;
+
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL)
+		ioc->cgroup_changed = 1;
+	task_unlock(tsk);
+}
+
+/*
+ * Move the queue to the root group if it is active. This is needed when
+ * a cgroup is being deleted and all the IO is not done yet. This is not
+ * a very good scheme, as a user might get an unfair share. This needs to
+ * be fixed.
+ */
+void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+				struct io_group *iog)
+{
+	int busy, resume;
+	struct io_entity *entity = &ioq->entity;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	busy = elv_ioq_busy(ioq);
+	resume = !!ioq->nr_queued;
+
+	BUG_ON(resume && !entity->on_st);
+	BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+
+	/*
+	 * We could be moving a queue which is on the idle tree of the previous
+	 * group. In that case the queue does not have any pending requests, so
+	 * just forget the entity and free it up from the idle tree.
+	 *
+	 * This needs cleanup; it is hackish.
+	 */
+	if (entity->tree == &st->idle) {
+		BUG_ON(atomic_read(&ioq->ref) < 2);
+		bfq_put_idle_entity(st, entity);
+	}
+
+	if (busy) {
+		BUG_ON(atomic_read(&ioq->ref) < 2);
+
+		if (!resume)
+			elv_del_ioq_busy(e, ioq, 0);
+		else
+			elv_deactivate_ioq(efqd, ioq, 0);
+	}
+
+	/*
+	 * Here we use a reference to bfqg.  We don't need a refcounter
+	 * as the cgroup reference will not be dropped, so that its
+	 * destroy() callback will not be invoked.
+	 */
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+
+	if (busy && resume)
+		elv_activate_ioq(ioq);
+}
+EXPORT_SYMBOL(io_ioq_move);
+
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	struct elevator_queue *eq;
+	struct io_entity *entity = iog->my_entity;
+	struct io_service_tree *st;
+	int i;
+
+	eq = container_of(efqd, struct elevator_queue, efqd);
+	hlist_del(&iog->elv_data_node);
+	__bfq_deactivate_entity(entity, 0);
+	io_put_io_group_queues(eq, iog);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		/*
+		 * The idle tree may still contain bfq_queues belonging
+		 * to exited tasks because they never migrated to a different
+		 * cgroup from the one being destroyed now.  No one else
+		 * can access them so it's safe to act without any lock.
+		 */
+		io_flush_idle_tree(st);
+
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+	}
+
+	BUG_ON(iog->sched_data.next_active != NULL);
+	BUG_ON(iog->sched_data.active_entity != NULL);
+	BUG_ON(entity->tree != NULL);
+}
+
+/**
+ * bfq_destroy_group - destroy @bfqg.
+ * @bgrp: the bfqio_cgroup containing @bfqg.
+ * @bfqg: the group being destroyed.
+ *
+ * Destroy @bfqg, making sure that it is not referenced from its parent.
+ */
+static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct elv_fq_data *efqd = NULL;
+	unsigned long uninitialized_var(flags);
+
+	/* Remove io group from cgroup list */
+	hlist_del(&iog->group_node);
+
+	/*
+	 * io groups are linked in two lists. One list is maintained
+	 * in the elevator (efqd->group_list) and the other is maintained
+	 * in the per-cgroup structure (iocg->group_data).
+	 *
+	 * While a cgroup is being deleted, the elevator also might be
+	 * exiting, and both might try to clean up the same io group,
+	 * so we need to be a little careful.
+	 *
+	 * The following code first accesses efqd under RCU to make sure
+	 * iog->key is pointing to a valid efqd and then takes the
+	 * associated queue lock. After getting the queue lock it
+	 * again checks whether the elevator exit path has already got
+	 * hold of the io group (iog->key == NULL). If so, it does not
+	 * try to free up the async queues again or flush the idle tree.
+	 */
+
+	rcu_read_lock();
+	efqd = rcu_dereference(iog->key);
+	if (efqd != NULL) {
+		spin_lock_irqsave(efqd->queue->queue_lock, flags);
+		if (iog->key == efqd)
+			__io_destroy_group(efqd, iog);
+		spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+	}
+	rcu_read_unlock();
+
+	/*
+	 * No need to defer the kfree() to the end of the RCU grace
+	 * period: we are called from the destroy() callback of our
+	 * cgroup, so we can be sure that no one is a) still using
+	 * this cgroup or b) doing lookups in it.
+	 */
+	kfree(iog);
+}
+
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct hlist_node *n, *tmp;
+	struct io_group *iog;
+
+	/*
+	 * Since we are destroying the cgroup, there are no more tasks
+	 * referencing it, and all the RCU grace periods that may have
+	 * referenced it are ended (as the destruction of the parent
+	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
+	 * anything else and we don't need any synchronization.
+	 */
+	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
+		io_destroy_group(iocg, iog);
+
+	BUG_ON(!hlist_empty(&iocg->group_data));
+
+	kfree(iocg);
+}
+
+void io_disconnect_groups(struct elevator_queue *e)
+{
+	struct hlist_node *pos, *n;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+					elv_data_node) {
+		hlist_del(&iog->elv_data_node);
+
+		__bfq_deactivate_entity(iog->my_entity, 0);
+
+		/*
+		 * Don't remove from the group hash, just set an
+		 * invalid key.  No lookups can race with the
+		 * assignment as bfqd is being destroyed; this
+		 * implies also that new elements cannot be added
+		 * to the list.
+		 */
+		rcu_assign_pointer(iog->key, NULL);
+		io_put_io_group_queues(e, iog);
+	}
+}
+
+struct cgroup_subsys io_subsys = {
+	.name = "io",
+	.create = iocg_create,
+	.can_attach = iocg_can_attach,
+	.attach = iocg_attach,
+	.destroy = iocg_destroy,
+	.populate = iocg_populate,
+	.subsys_id = io_subsys_id,
+};
+
+/* If the bio-submitting task and rq don't belong to the same io_group,
+ * they can't be merged */
+int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	struct request_queue *q = rq->q;
+	struct io_queue *ioq = rq->ioq;
+	struct io_group *iog, *__iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return 1;
+
+	/* Determine the io group of the bio submitting task */
+	iog = io_lookup_io_group_current(q);
+	if (!iog) {
+		/* Maybe the task belongs to a different cgroup for which the
+		 * io group has not been set up yet. */
+		return 0;
+	}
+
+	/* Determine the io group of the ioq that rq belongs to */
+	__iog = ioq_to_io_group(ioq);
+
+	return (iog == __iog);
+}
+
+/* find/create the io group request belongs to and put that info in rq */
+void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+	struct io_group *iog;
+	unsigned long flags;
+
+	/* Make sure the io group hierarchy has been set up and also set the
+	 * io group to which rq belongs. Later we should make use of
+	 * bio cgroup patches to determine the io group. */
+	spin_lock_irqsave(q->queue_lock, flags);
+	iog = io_get_io_group(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	BUG_ON(!iog);
+
+	/* Store iog in rq. TODO: take care of referencing */
+	rq->iog = iog;
+}
+
+#else /* GROUP_IOSCHED */
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q)
+{
+	return q->elevator->efqd.root_group;
+}
+
+#endif /* CONFIG_GROUP_IOSCHED*/
+
 /* Elevator fair queuing function */
 struct io_queue *rq_ioq(struct request *rq)
 {
@@ -995,9 +1862,11 @@ EXPORT_SYMBOL(elv_put_ioq);
 
 void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
 {
+	struct io_group *root_group = e->efqd.root_group;
 	struct io_queue *ioq = *ioq_ptr;
 
 	if (ioq != NULL) {
+		io_ioq_move(e, ioq, root_group);
 		/* Drop the reference taken by the io group */
 		elv_put_ioq(ioq);
 		*ioq_ptr = NULL;
@@ -1022,14 +1891,27 @@ struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
 		return NULL;
 
 	sd = &efqd->root_group->sched_data;
-	if (extract)
-		entity = bfq_lookup_next_entity(sd, 1);
-	else
-		entity = bfq_lookup_next_entity(sd, 0);
+	for (; sd != NULL; sd = entity->my_sched_data) {
+		if (extract)
+			entity = bfq_lookup_next_entity(sd, 1);
+		else
+			entity = bfq_lookup_next_entity(sd, 0);
+
+		/*
+		 * entity can be NULL despite the fact that there are busy
+		 * queues, if all the busy queues are under a group which is
+		 * currently under service.
+		 * So if we are just looking for the next ioq while something
+		 * is being served, a NULL entity is not an error.
+		 */
+		BUG_ON(!entity && extract);
 
-	BUG_ON(!entity);
-	if (extract)
-		entity->service = 0;
+		if (extract)
+			entity->service = 0;
+
+		if (!entity)
+			return NULL;
+	}
 	ioq = io_entity_to_ioq(entity);
 
 	return ioq;
@@ -1262,6 +2144,7 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 {
 	struct io_queue *ioq;
 	struct elevator_queue *eq = q->elevator;
+	struct io_group *iog = NULL, *new_iog = NULL;
 
 	ioq = elv_active_ioq(eq);
 
@@ -1283,10 +2166,17 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
 		return 1;
 
+	iog = ioq_to_io_group(ioq);
+	new_iog = ioq_to_io_group(new_ioq);
+
 	/*
-	 * Check with io scheduler if it has additional criterion based on
-	 * which it wants to preempt existing queue.
+	 * If both the queues belong to same group, check with io scheduler
+	 * if it has additional criterion based on which it wants to
+	 * preempt existing queue.
 	 */
+	if (iog != new_iog)
+		return 0;
+
 	if (eq->ops->elevator_should_preempt_fn)
 		return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
 
@@ -1663,14 +2553,6 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		elv_schedule_dispatch(q);
 }
 
-struct io_group *io_lookup_io_group_current(struct request_queue *q)
-{
-	struct elv_fq_data *efqd = &q->elevator->efqd;
-
-	return efqd->root_group;
-}
-EXPORT_SYMBOL(io_lookup_io_group_current);
-
 void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio)
 {
@@ -1721,44 +2603,6 @@ void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 }
 EXPORT_SYMBOL(io_group_set_async_queue);
 
-/*
- * Release all the io group references to its async queues.
- */
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
-{
-	int i, j;
-
-	for (i = 0; i < 2; i++)
-		for (j = 0; j < IOPRIO_BE_NR; j++)
-			elv_release_ioq(e, &iog->async_queue[i][j]);
-
-	/* Free up async idle queue */
-	elv_release_ioq(e, &iog->async_idle_queue);
-}
-
-struct io_group *io_alloc_root_group(struct request_queue *q,
-					struct elevator_queue *e, void *key)
-{
-	struct io_group *iog;
-	int i;
-
-	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
-	if (iog == NULL)
-		return NULL;
-
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
-		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
-
-	return iog;
-}
-
-void io_free_root_group(struct elevator_queue *e)
-{
-	struct io_group *iog = e->efqd.root_group;
-	io_put_io_group_queues(e, iog);
-	kfree(iog);
-}
-
 static void elv_slab_kill(void)
 {
 	/*
@@ -1804,6 +2648,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
 
 	INIT_LIST_HEAD(&efqd->idle_list);
+	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
@@ -1833,10 +2678,14 @@ void elv_exit_fq_data(struct elevator_queue *e)
 	spin_lock_irq(q->queue_lock);
 	/* This should drop all the idle tree references of ioq */
 	elv_free_idle_ioq_list(e);
+	/* This should drop all the io group references of async queues */
+	io_disconnect_groups(e);
 	spin_unlock_irq(q->queue_lock);
 
 	elv_shutdown_timer_wq(e);
 
+	/* Wait for iog->key accessors to exit their grace periods. */
+	synchronize_rcu();
 	BUG_ON(timer_pending(&efqd->idle_slice_timer));
 	io_free_root_group(e);
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index b5a0d08..3fab8f8 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -9,6 +9,7 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/cgroup.h>
 
 #ifndef _BFQ_SCHED_H
 #define _BFQ_SCHED_H
@@ -69,6 +70,7 @@ struct io_service_tree {
  */
 struct io_sched_data {
 	struct io_entity *active_entity;
+	struct io_entity *next_active;
 	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
 };
 
@@ -183,17 +185,90 @@ struct io_queue {
 	unsigned long total_service;
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both bfq_queues and bfq_groups).
+ * @group_node: node to be inserted into the bfqio_cgroup->group_data
+ *              list of the containing cgroup's bfqio_cgroup.
+ * @bfqd_node: node to be inserted into the @bfqd->group_list list
+ *             of the groups active on the same device; used for cleanup.
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/migration.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @group_node is protected by the bfqio_cgroup lock, and is accessed
+ *      via RCU from its readers.
+ *    o @bfqd is protected by the queue lock, RCU is used to access it
+ *      from the readers.
+ *    o All the other fields are protected by the @bfqd queue lock.
+ */
 struct io_group {
+	struct io_entity entity;
+	struct hlist_node elv_data_node;
+	struct hlist_node group_node;
 	struct io_sched_data sched_data;
 
+	struct io_entity *my_entity;
+
+	/*
+	 * A cgroup has multiple io_groups, one for each request queue.
+	 * To find the io group belonging to a particular queue, the
+	 * elv_fq_data pointer is stored as the key.
+	 */
+	void *key;
+
 	/* async_queue and idle_queue are used only for cfq */
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
 };
 
+/**
+ * struct bfqio_cgroup - bfq cgroup data structure.
+ * @css: subsystem state for bfq in the containing cgroup.
+ * @ioprio: cgroup ioprio.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @ioprio, @ioprio_class and @group_data.
+ * @group_data: list containing the bfq_group belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates,
+ * @ioprio and @ioprio_class are protected by @lock.
+ */
+struct io_cgroup {
+	struct cgroup_subsys_state css;
+
+	unsigned short ioprio, ioprio_class;
+
+	spinlock_t lock;
+	struct hlist_head group_data;
+};
+#else
+struct io_group {
+	struct io_sched_data sched_data;
+
+	/* async_queue and idle_queue are used only for cfq */
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+};
+#endif
+
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	/* List of io groups hanging on this elevator */
+	struct hlist_head group_list;
+
 	/* List of io queues on idle tree. */
 	struct list_head idle_list;
 
@@ -380,6 +455,39 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 						sched_data);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+					struct io_group *iog);
+extern void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq);
+#else /* !GROUP_IOSCHED */
+/*
+ * No ioq movement is needed in the flat setup. The root io group gets cleaned
+ * up upon elevator exit, and before that it has been made sure that both the
+ * active and idle trees are empty.
+ */
+static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+					struct io_group *iog)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
+/*
+ * Currently the root group is not part of the elevator group list and is freed
+ * separately. Hence, in the non-hierarchical setup there is nothing to do.
+ */
+static inline void io_disconnect_groups(struct elevator_queue *e) {}
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+#endif /* GROUP_IOSCHED */
+
 /* Functions used by blksysfs.c */
 extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
 extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
@@ -475,5 +583,10 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
 {
 	return NULL;
 }
+
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+}
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 7a3a7e9..27889bc 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -888,6 +888,8 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_set_request_io_group(q, rq);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cf02216..0baeb8e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -238,7 +238,12 @@ struct request {
 #ifdef CONFIG_ELV_FAIR_QUEUING
 	/* io queue request belongs to */
 	struct io_queue *ioq;
-#endif
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* io group request belongs to */
+	struct io_group *iog;
+#endif /* GROUP_IOSCHED */
+#endif /* ELV_FAIR_QUEUING */
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..68ea6bd 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,10 @@ SUBSYS(net_cls)
 #endif
 
 /* */
+
+#ifdef CONFIG_GROUP_IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
+
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 08b987b..51664bb 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -73,6 +73,11 @@ struct io_context {
 	unsigned short ioprio;
 	unsigned short ioprio_changed;
 
+#ifdef CONFIG_GROUP_IOSCHED
+	/* If a task changes its cgroup, the elevator processes it asynchronously */
+	unsigned short cgroup_changed;
+#endif
+
 	/*
 	 * For request batching
 	 */
diff --git a/init/Kconfig b/init/Kconfig
index 6a5c5fe..66c2310 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -538,6 +538,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  there will be no overhead from this. Even when you set this config=y,
 	  if boot option "noswapaccount" is set, swap will not be accounted.
 
+config GROUP_IOSCHED
+	bool "Group IO Scheduler"
+	depends on CGROUPS && ELV_FAIR_QUEUING
+	default n
+	---help---
+	  This feature lets the IO scheduler recognize task groups and control
+	  disk bandwidth allocation to such task groups.
+
 endif # CGROUPS
 
 config MM_OWNER
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 05/10] cfq changes to use hierarchical fair queuing code in elevator layer
  2009-03-12  1:56 ` Vivek Goyal
@ 2009-03-12  1:56     ` Vivek Goyal
  -1 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, menage-hpIqsD4AKlfQT0dZR+AlfA

Make cfq hierarchical: add a Kconfig option for hierarchical CFQ scheduling
and, when a task changes its cgroup, move its sync queue to the new io group
and drop the reference to the old async queue.
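
For orientation, here is a minimal sketch (illustrative only, not code from
this patch) of what the cgroup-change handling boils down to for the sync
queue. It reuses the io_lookup_io_group_current(), cfqq_to_io_group() and
io_ioq_move() helpers introduced earlier in the series; the function name is
made up, and the queue lock is assumed to be held:

	/* Illustrative sketch: re-parent a task's sync cfq queue. */
	static void example_requeue_sync_cfqq(struct request_queue *q,
					      struct cfq_queue *sync_cfqq)
	{
		/* io group of the cgroup the task belongs to now */
		struct io_group *iog = io_lookup_io_group_current(q);

		/* Move the queue only if the task moved to a different group. */
		if (iog && iog != cfqq_to_io_group(sync_cfqq))
			io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
	}

The async queue is handled differently in changed_cgroup() below: the old
reference is simply dropped so that a new async queue gets created in the
right io group on the next request.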

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched |    8 ++++++++
 block/cfq-iosched.c   |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 init/Kconfig          |    2 +-
 3 files changed, 57 insertions(+), 1 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index dd5224d..a91a807 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -54,6 +54,14 @@ config IOSCHED_CFQ
 	  working environment, suitable for desktop systems.
 	  This is the default I/O scheduler.
 
+config IOSCHED_CFQ_HIER
+	bool "CFQ Hierarchical Scheduling support"
+	depends on IOSCHED_CFQ && CGROUPS
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in CFQ.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5b41b54..0ecf7c7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1069,6 +1069,50 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 	ioc->ioprio_changed = 0;
 }
 
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+{
+	struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
+	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+	struct cfq_data *cfqd = cic->key;
+	struct io_group *iog, *__iog;
+	unsigned long flags;
+	struct request_queue *q;
+
+	if (unlikely(!cfqd))
+		return;
+
+	q = cfqd->q;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	iog = io_lookup_io_group_current(q);
+
+	if (async_cfqq != NULL) {
+		__iog = cfqq_to_io_group(async_cfqq);
+
+		if (iog != __iog) {
+			cic_set_cfqq(cic, NULL, 0);
+			cfq_put_queue(async_cfqq);
+		}
+	}
+
+	if (sync_cfqq != NULL) {
+		__iog = cfqq_to_io_group(sync_cfqq);
+		if (iog != __iog)
+			io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+	}
+
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void cfq_ioc_set_cgroup(struct io_context *ioc)
+{
+	call_for_each_cic(ioc, changed_cgroup);
+	ioc->cgroup_changed = 0;
+}
+#endif  /* CONFIG_IOSCHED_CFQ_HIER */
+
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 				struct io_context *ioc, gfp_t gfp_mask)
@@ -1335,6 +1379,10 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+	if (unlikely(ioc->cgroup_changed))
+		cfq_ioc_set_cgroup(ioc);
+#endif
 	return cic;
 err_free:
 	cfq_cic_free(cic);
diff --git a/init/Kconfig b/init/Kconfig
index 66c2310..d7bc054 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -539,7 +539,7 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  if boot option "noswapaccount" is set, swap will not be accounted.
 
 config GROUP_IOSCHED
-	bool "Group IO Scheduler"
+	bool
 	depends on CGROUPS && ELV_FAIR_QUEUING
 	default n
 	---help---
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 05/10] cfq changes to use hierarchical fair queuing code in elevator layer
@ 2009-03-12  1:56     ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers
  Cc: vgoyal, akpm, menage, peterz

Make cfq hierarchical: add a Kconfig option for hierarchical CFQ scheduling
and, when a task changes its cgroup, move its sync queue to the new io group
and drop the reference to the old async queue.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |    8 ++++++++
 block/cfq-iosched.c   |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 init/Kconfig          |    2 +-
 3 files changed, 57 insertions(+), 1 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index dd5224d..a91a807 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -54,6 +54,14 @@ config IOSCHED_CFQ
 	  working environment, suitable for desktop systems.
 	  This is the default I/O scheduler.
 
+config IOSCHED_CFQ_HIER
+	bool "CFQ Hierarchical Scheduling support"
+	depends on IOSCHED_CFQ && CGROUPS
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in CFQ.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5b41b54..0ecf7c7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1069,6 +1069,50 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 	ioc->ioprio_changed = 0;
 }
 
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+{
+	struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
+	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+	struct cfq_data *cfqd = cic->key;
+	struct io_group *iog, *__iog;
+	unsigned long flags;
+	struct request_queue *q;
+
+	if (unlikely(!cfqd))
+		return;
+
+	q = cfqd->q;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	iog = io_lookup_io_group_current(q);
+
+	if (async_cfqq != NULL) {
+		__iog = cfqq_to_io_group(async_cfqq);
+
+		if (iog != __iog) {
+			cic_set_cfqq(cic, NULL, 0);
+			cfq_put_queue(async_cfqq);
+		}
+	}
+
+	if (sync_cfqq != NULL) {
+		__iog = cfqq_to_io_group(sync_cfqq);
+		if (iog != __iog)
+			io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+	}
+
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void cfq_ioc_set_cgroup(struct io_context *ioc)
+{
+	call_for_each_cic(ioc, changed_cgroup);
+	ioc->cgroup_changed = 0;
+}
+#endif  /* CONFIG_IOSCHED_CFQ_HIER */
+
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 				struct io_context *ioc, gfp_t gfp_mask)
@@ -1335,6 +1379,10 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+	if (unlikely(ioc->cgroup_changed))
+		cfq_ioc_set_cgroup(ioc);
+#endif
 	return cic;
 err_free:
 	cfq_cic_free(cic);
diff --git a/init/Kconfig b/init/Kconfig
index 66c2310..d7bc054 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -539,7 +539,7 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  if boot option "noswapaccount" is set, swap will not be accounted.
 
 config GROUP_IOSCHED
-	bool "Group IO Scheduler"
+	bool
 	depends on CGROUPS && ELV_FAIR_QUEUING
 	default n
 	---help---
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 06/10] Separate out queue and data
  2009-03-12  1:56 ` Vivek Goyal
@ 2009-03-12  1:56     ` Vivek Goyal
  -1 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, menage-hpIqsD4AKlfQT0dZR+AlfA

o So far noop, deadline and AS each had one common structure called *_data,
  which contained both the queue information (where requests are queued) and
  the common data used for scheduling. This patch breaks that common structure
  into two parts, *_queue and *_data, along the lines of cfq, where all the
  requests are queued in the queue while the common data and tunables are part
  of the data (see the sketch below).

o It does not change functionality, but this re-organization helps once noop,
  deadline and AS are changed to use hierarchical fair queuing.

o The queue_empty function does not look necessary; the elevator layer can
  check q->nr_sorted to see whether the ioscheduler queues are empty or not.
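
To illustrate the split, here is a minimal sketch with made-up structure
names (it is not code from this patch; the real per-scheduler versions are
the as_queue/as_data and deadline_queue/deadline_data pairs in the hunks
below):

	#include <linux/blkdev.h>
	#include <linux/list.h>
	#include <linux/rbtree.h>

	/* Per-queue part: instantiated once per io group later in the series. */
	struct example_queue {
		struct rb_root sort_list[2];	/* requests sorted by sector */
		struct list_head fifo_list[2];	/* requests in FIFO order */
		struct request *next_rq[2];	/* next request in sort order */
	};

	/* Per-device part: tunables and global state stay here. */
	struct example_data {
		struct request_queue *q;	/* the "owner" queue */
		int fifo_expire[2];		/* tunables remain device-wide */
	};

The io scheduler then reaches its per-group queue through helpers like
elv_get_sched_queue() and elv_select_sched_queue(), as the as-iosched.c and
deadline-iosched.c changes below do.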

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/as-iosched.c       |  209 ++++++++++++++++++++++++++--------------------
 block/deadline-iosched.c |  117 ++++++++++++++++----------
 block/elevator.c         |  111 +++++++++++++++++++++----
 block/noop-iosched.c     |   59 ++++++-------
 include/linux/elevator.h |    8 ++-
 5 files changed, 319 insertions(+), 185 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index 631f6f4..6d2890c 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -79,13 +79,7 @@ enum anticipation_status {
 				 * or timed out */
 };
 
-struct as_data {
-	/*
-	 * run time data
-	 */
-
-	struct request_queue *q;	/* the "owner" queue */
-
+struct as_queue {
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -93,6 +87,14 @@ struct as_data {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+	unsigned long last_check_fifo[2];
+	int write_batch_count;		/* max # of reqs in a write batch */
+	int current_write_count;	/* how many requests left this batch */
+	int write_batch_idled;		/* has the write batch gone idle? */
+};
+
+struct as_data {
+	struct request_queue *q;	/* the "owner" queue */
 	sector_t last_sector[2];	/* last REQ_SYNC & REQ_ASYNC sectors */
 
 	unsigned long exit_prob;	/* probability a task will exit while
@@ -104,23 +106,19 @@ struct as_data {
 	unsigned long new_ttime_mean;
 	u64 new_seek_total;		/* mean seek on new proc */
 	sector_t new_seek_mean;
-
 	unsigned long current_batch_expires;
-	unsigned long last_check_fifo[2];
+
 	int changed_batch;		/* 1: waiting for old batch to end */
 	int new_batch;			/* 1: waiting on first read complete */
-	int batch_data_dir;		/* current batch REQ_SYNC / REQ_ASYNC */
-	int write_batch_count;		/* max # of reqs in a write batch */
-	int current_write_count;	/* how many requests left this batch */
-	int write_batch_idled;		/* has the write batch gone idle? */
 
 	enum anticipation_status antic_status;
 	unsigned long antic_start;	/* jiffies: when it started */
 	struct timer_list antic_timer;	/* anticipatory scheduling timer */
-	struct work_struct antic_work;	/* Deferred unplugging */
+	struct work_struct antic_work;  /* Deferred unplugging */
 	struct io_context *io_context;	/* Identify the expected process */
 	int ioc_finished; /* IO associated with io_context is finished */
 	int nr_dispatched;
+	int batch_data_dir;		/* current batch REQ_SYNC / REQ_ASYNC */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
@@ -261,13 +259,14 @@ static void as_put_io_context(struct request *rq)
 /*
  * rb tree support functions
  */
-#define RQ_RB_ROOT(ad, rq)	(&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq)	(&(asq)->sort_list[rq_is_sync((rq))])
 
 static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 {
 	struct request *alias;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
-	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
 		as_move_to_dispatch(ad, alias);
 		as_antic_stop(ad);
 	}
@@ -275,7 +274,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 
 static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
 {
-	elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+	elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
 }
 
 /*
@@ -369,7 +370,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
  * what request to process next. Anticipation works on top of this.
  */
 static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
 {
 	struct rb_node *rbnext = rb_next(&last->rb_node);
 	struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -385,7 +386,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
 	else {
 		const int data_dir = rq_is_sync(last);
 
-		rbnext = rb_first(&ad->sort_list[data_dir]);
+		rbnext = rb_first(&asq->sort_list[data_dir]);
 		if (rbnext && rbnext != &last->rb_node)
 			next = rb_entry_rq(rbnext);
 	}
@@ -790,9 +791,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
 static void as_update_rq(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	/* keep the next_rq cache up to date */
-	ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+	asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
 
 	/*
 	 * have we been anticipating this request?
@@ -813,25 +815,26 @@ static void update_write_batch(struct as_data *ad)
 {
 	unsigned long batch = ad->batch_expire[REQ_ASYNC];
 	long write_time;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
-	if (write_time > batch && !ad->write_batch_idled) {
+	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
-			ad->write_batch_count /= 2;
+			asq->write_batch_count /= 2;
 		else
-			ad->write_batch_count--;
-	} else if (write_time < batch && ad->current_write_count == 0) {
+			asq->write_batch_count--;
+	} else if (write_time < batch && asq->current_write_count == 0) {
 		if (batch > write_time * 3)
-			ad->write_batch_count *= 2;
+			asq->write_batch_count *= 2;
 		else
-			ad->write_batch_count++;
+			asq->write_batch_count++;
 	}
 
-	if (ad->write_batch_count < 1)
-		ad->write_batch_count = 1;
+	if (asq->write_batch_count < 1)
+		asq->write_batch_count = 1;
 }
 
 /*
@@ -902,6 +905,7 @@ static void as_remove_queued_request(struct request_queue *q,
 	const int data_dir = rq_is_sync(rq);
 	struct as_data *ad = q->elevator->elevator_data;
 	struct io_context *ioc;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
@@ -915,8 +919,8 @@ static void as_remove_queued_request(struct request_queue *q,
 	 * Update the "next_rq" cache if we are about to remove its
 	 * entry
 	 */
-	if (ad->next_rq[data_dir] == rq)
-		ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	if (asq->next_rq[data_dir] == rq)
+		asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	rq_fifo_clear(rq);
 	as_del_rq_rb(ad, rq);
@@ -930,23 +934,23 @@ static void as_remove_queued_request(struct request_queue *q,
  *
  * See as_antic_expired comment.
  */
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
 {
 	struct request *rq;
 	long delta_jif;
 
-	delta_jif = jiffies - ad->last_check_fifo[adir];
+	delta_jif = jiffies - asq->last_check_fifo[adir];
 	if (unlikely(delta_jif < 0))
 		delta_jif = -delta_jif;
 	if (delta_jif < ad->fifo_expire[adir])
 		return 0;
 
-	ad->last_check_fifo[adir] = jiffies;
+	asq->last_check_fifo[adir] = jiffies;
 
-	if (list_empty(&ad->fifo_list[adir]))
+	if (list_empty(&asq->fifo_list[adir]))
 		return 0;
 
-	rq = rq_entry_fifo(ad->fifo_list[adir].next);
+	rq = rq_entry_fifo(asq->fifo_list[adir].next);
 
 	return time_after(jiffies, rq_fifo_time(rq));
 }
@@ -955,7 +959,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
  * as_batch_expired returns true if the current batch has expired. A batch
  * is a set of reads or a set of writes.
  */
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
 {
 	if (ad->changed_batch || ad->new_batch)
 		return 0;
@@ -965,7 +969,7 @@ static inline int as_batch_expired(struct as_data *ad)
 		return time_after(jiffies, ad->current_batch_expires);
 
 	return time_after(jiffies, ad->current_batch_expires)
-		|| ad->current_write_count == 0;
+		|| asq->current_write_count == 0;
 }
 
 /*
@@ -974,6 +978,7 @@ static inline int as_batch_expired(struct as_data *ad)
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
 
@@ -996,12 +1001,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 			ad->io_context = NULL;
 		}
 
-		if (ad->current_write_count != 0)
-			ad->current_write_count--;
+		if (asq->current_write_count != 0)
+			asq->current_write_count--;
 	}
 	ad->ioc_finished = 0;
 
-	ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	/*
 	 * take it off the sort and fifo list, add to dispatch queue
@@ -1025,10 +1030,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 static int as_dispatch_request(struct request_queue *q, int force)
 {
 	struct as_data *ad = q->elevator->elevator_data;
-	const int reads = !list_empty(&ad->fifo_list[REQ_SYNC]);
-	const int writes = !list_empty(&ad->fifo_list[REQ_ASYNC]);
+	struct as_queue *asq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 
+	if (!asq)
+		return 0;
+
+	reads = !list_empty(&asq->fifo_list[REQ_SYNC]);
+	writes = !list_empty(&asq->fifo_list[REQ_ASYNC]);
+
 	if (unlikely(force)) {
 		/*
 		 * Forced dispatch, accounting is useless.  Reset
@@ -1043,25 +1054,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		ad->changed_batch = 0;
 		ad->new_batch = 0;
 
-		while (ad->next_rq[REQ_SYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[REQ_SYNC]);
+		while (asq->next_rq[REQ_SYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[REQ_SYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[REQ_SYNC] = jiffies;
+		asq->last_check_fifo[REQ_SYNC] = jiffies;
 
-		while (ad->next_rq[REQ_ASYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[REQ_ASYNC]);
+		while (asq->next_rq[REQ_ASYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[REQ_ASYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[REQ_ASYNC] = jiffies;
+		asq->last_check_fifo[REQ_ASYNC] = jiffies;
 
 		return dispatched;
 	}
 
 	/* Signal that the write batch was uncontended, so we can't time it */
 	if (ad->batch_data_dir == REQ_ASYNC && !reads) {
-		if (ad->current_write_count == 0 || !writes)
-			ad->write_batch_idled = 1;
+		if (asq->current_write_count == 0 || !writes)
+			asq->write_batch_idled = 1;
 	}
 
 	if (!(reads || writes)
@@ -1070,14 +1081,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->changed_batch)
 		return 0;
 
-	if (!(reads && writes && as_batch_expired(ad))) {
+	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
 		 * batch is still running or no reads or no writes
 		 */
-		rq = ad->next_rq[ad->batch_data_dir];
+		rq = asq->next_rq[ad->batch_data_dir];
 
 		if (ad->batch_data_dir == REQ_SYNC && ad->antic_expire) {
-			if (as_fifo_expired(ad, REQ_SYNC))
+			if (as_fifo_expired(ad, asq, REQ_SYNC))
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
@@ -1101,7 +1112,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[REQ_SYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[REQ_SYNC]));
 
 		if (writes && ad->batch_data_dir == REQ_SYNC)
 			/*
@@ -1114,8 +1125,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = REQ_SYNC;
-		rq = rq_entry_fifo(ad->fifo_list[REQ_SYNC].next);
-		ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+		rq = rq_entry_fifo(asq->fifo_list[REQ_SYNC].next);
+		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1125,7 +1136,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[REQ_ASYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[REQ_ASYNC]));
 
 		if (ad->batch_data_dir == REQ_SYNC) {
 			ad->changed_batch = 1;
@@ -1138,10 +1149,10 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = REQ_ASYNC;
-		ad->current_write_count = ad->write_batch_count;
-		ad->write_batch_idled = 0;
-		rq = rq_entry_fifo(ad->fifo_list[REQ_ASYNC].next);
-		ad->last_check_fifo[REQ_ASYNC] = jiffies;
+		asq->current_write_count = asq->write_batch_count;
+		asq->write_batch_idled = 0;
+		rq = rq_entry_fifo(asq->fifo_list[REQ_ASYNC].next);
+		asq->last_check_fifo[REQ_ASYNC] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1153,9 +1164,9 @@ dispatch_request:
 	 * If a request has expired, service it.
 	 */
 
-	if (as_fifo_expired(ad, ad->batch_data_dir)) {
+	if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
 fifo_expired:
-		rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+		rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
 	}
 
 	if (ad->changed_batch) {
@@ -1188,6 +1199,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
 	int data_dir;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	RQ_SET_STATE(rq, AS_RQ_NEW);
 
@@ -1206,7 +1218,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
 
 	as_update_rq(ad, rq); /* keep state machine up to date */
 	RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1228,31 +1240,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 }
 
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
-	struct as_data *ad = q->elevator->elevator_data;
-
-	return list_empty(&ad->fifo_list[REQ_ASYNC])
-		&& list_empty(&ad->fifo_list[REQ_SYNC]);
-}
-
 static int
 as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
-	struct as_data *ad = q->elevator->elevator_data;
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
+	struct as_queue *asq = elv_get_sched_queue_current(q);
+
+	if (!asq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
 	 */
-	__rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+	__rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_FRONT_MERGE;
@@ -1339,6 +1340,41 @@ static int as_may_queue(struct request_queue *q, int rw)
 	return ret;
 }
 
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
+{
+	struct as_queue *asq;
+	struct as_data *ad = eq->elevator_data;
+
+	asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+	if (asq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&asq->fifo_list[REQ_SYNC]);
+	INIT_LIST_HEAD(&asq->fifo_list[REQ_ASYNC]);
+	asq->sort_list[REQ_SYNC] = RB_ROOT;
+	asq->sort_list[REQ_ASYNC] = RB_ROOT;
+	if (ad)
+		asq->write_batch_count = ad->batch_expire[REQ_ASYNC] / 10;
+	else
+		asq->write_batch_count = default_write_batch_expire / 10;
+
+	if (asq->write_batch_count < 2)
+		asq->write_batch_count = 2;
+out:
+	return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+
+	BUG_ON(!list_empty(&asq->fifo_list[REQ_SYNC]));
+	BUG_ON(!list_empty(&asq->fifo_list[REQ_ASYNC]));
+	kfree(asq);
+}
+
 static void as_exit_queue(struct elevator_queue *e)
 {
 	struct as_data *ad = e->elevator_data;
@@ -1346,9 +1382,6 @@ static void as_exit_queue(struct elevator_queue *e)
 	del_timer_sync(&ad->antic_timer);
 	cancel_work_sync(&ad->antic_work);
 
-	BUG_ON(!list_empty(&ad->fifo_list[REQ_SYNC]));
-	BUG_ON(!list_empty(&ad->fifo_list[REQ_ASYNC]));
-
 	put_io_context(ad->io_context);
 	kfree(ad);
 }
@@ -1372,10 +1405,6 @@ static void *as_init_queue(struct request_queue *q)
 	init_timer(&ad->antic_timer);
 	INIT_WORK(&ad->antic_work, as_work_handler);
 
-	INIT_LIST_HEAD(&ad->fifo_list[REQ_SYNC]);
-	INIT_LIST_HEAD(&ad->fifo_list[REQ_ASYNC]);
-	ad->sort_list[REQ_SYNC] = RB_ROOT;
-	ad->sort_list[REQ_ASYNC] = RB_ROOT;
 	ad->fifo_expire[REQ_SYNC] = default_read_expire;
 	ad->fifo_expire[REQ_ASYNC] = default_write_expire;
 	ad->antic_expire = default_antic_expire;
@@ -1383,9 +1412,6 @@ static void *as_init_queue(struct request_queue *q)
 	ad->batch_expire[REQ_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[REQ_SYNC];
-	ad->write_batch_count = ad->batch_expire[REQ_ASYNC] / 10;
-	if (ad->write_batch_count < 2)
-		ad->write_batch_count = 2;
 
 	return ad;
 }
@@ -1482,7 +1508,6 @@ static struct elevator_type iosched_as = {
 		.elevator_add_req_fn =		as_add_request,
 		.elevator_activate_req_fn =	as_activate_request,
 		.elevator_deactivate_req_fn = 	as_deactivate_request,
-		.elevator_queue_empty_fn =	as_queue_empty,
 		.elevator_completed_req_fn =	as_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -1490,6 +1515,8 @@ static struct elevator_type iosched_as = {
 		.elevator_init_fn =		as_init_queue,
 		.elevator_exit_fn =		as_exit_queue,
 		.trim =				as_trim,
+		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+		.elevator_free_sched_queue_fn = as_free_as_queue,
 	},
 
 	.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index c4d991d..5e65041 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
-struct deadline_data {
-	/*
-	 * run time data
-	 */
-
+struct deadline_queue {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];	
+	struct rb_root sort_list[2];
 	struct list_head fifo_list[2];
-
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
 	struct request *next_rq[2];
 	unsigned int batching;		/* number of sequential requests made */
-	sector_t last_sector;		/* head position */
 	unsigned int starved;		/* times reads have starved writes */
+};
 
+struct deadline_data {
+	struct request_queue *q;
+	sector_t last_sector;		/* head position */
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
-	return &dd->sort_list[rq_data_dir(rq)];
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+	return &dq->sort_list[rq_data_dir(rq)];
 }
 
 /*
@@ -87,9 +87,10 @@ static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	if (dd->next_rq[data_dir] == rq)
-		dd->next_rq[data_dir] = deadline_latter_request(rq);
+	if (dq->next_rq[data_dir] == rq)
+		dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	elv_rb_del(deadline_rb_root(dd, rq), rq);
 }
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(q, rq);
 
 	deadline_add_rq_rb(dd, rq);
 
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
 }
 
 /*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	struct request *__rq;
 	int ret;
+	struct deadline_queue *dq;
+
+	dq = elv_get_sched_queue_current(q);
+	if (!dq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	if (dd->front_merges) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
-		__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+		__rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
 		if (__rq) {
 			BUG_ON(sector != __rq->sector);
 
@@ -207,10 +214,11 @@ static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
-	dd->next_rq[data_dir] = deadline_latter_request(rq);
+	dq->next_rq[READ] = NULL;
+	dq->next_rq[WRITE] = NULL;
+	dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	dd->last_sector = rq_end_sector(rq);
 
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
 
 	/*
 	 * rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
 static int deadline_dispatch_requests(struct request_queue *q, int force)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int reads = !list_empty(&dd->fifo_list[READ]);
-	const int writes = !list_empty(&dd->fifo_list[WRITE]);
+	struct deadline_queue *dq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 	int data_dir;
 
+	if (!dq)
+		return 0;
+
+	reads = !list_empty(&dq->fifo_list[READ]);
+	writes = !list_empty(&dq->fifo_list[WRITE]);
+
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	if (dd->next_rq[WRITE])
-		rq = dd->next_rq[WRITE];
+	if (dq->next_rq[WRITE])
+		rq = dq->next_rq[WRITE];
 	else
-		rq = dd->next_rq[READ];
+		rq = dq->next_rq[READ];
 
-	if (rq && dd->batching < dd->fifo_batch)
+	if (rq && dq->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
 		goto dispatch_request;
 
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
 
-		if (writes && (dd->starved++ >= dd->writes_starved))
+		if (writes && (dq->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
 		data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
 
-		dd->starved = 0;
+		dq->starved = 0;
 
 		data_dir = WRITE;
 
@@ -299,48 +313,62 @@ dispatch_find_request:
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+	if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+		rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
 		 * sort order. No expired requests so continue on from here.
 		 */
-		rq = dd->next_rq[data_dir];
+		rq = dq->next_rq[data_dir];
 	}
 
-	dd->batching = 0;
+	dq->batching = 0;
 
 dispatch_request:
 	/*
 	 * rq is the selected appropriate request.
 	 */
-	dd->batching++;
+	dq->batching++;
 	deadline_move_request(dd, rq);
 
 	return 1;
 }
 
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_queue *dq;
 
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
+	dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+	if (dq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&dq->fifo_list[READ]);
+	INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+	dq->sort_list[READ] = RB_ROOT;
+	dq->sort_list[WRITE] = RB_ROOT;
+out:
+	return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+						void *sched_queue)
+{
+	struct deadline_queue *dq = sched_queue;
+
+	kfree(dq);
 }
 
 static void deadline_exit_queue(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
 	kfree(dd);
 }
 
@@ -355,10 +383,7 @@ static void *deadline_init_queue(struct request_queue *q)
 	if (!dd)
 		return NULL;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
+	dd->q = q;
 	dd->fifo_expire[READ] = read_expire;
 	dd->fifo_expire[WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
@@ -445,13 +470,13 @@ static struct elevator_type iosched_deadline = {
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
-		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_init_fn =		deadline_init_queue,
 		.elevator_exit_fn =		deadline_exit_queue,
+		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
-
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index 27889bc..5df13c4 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -176,17 +176,54 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static void *elevator_init_queue(struct request_queue *q,
-				 struct elevator_queue *eq)
+static void *elevator_init_data(struct request_queue *q,
+					struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q);
+	void *data = NULL;
+
+	if (eq->ops->elevator_init_fn) {
+		data = eq->ops->elevator_init_fn(q);
+		if (data)
+			return data;
+		else
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* IO scheduler does not instantiate data (noop), it is not an error */
+	return NULL;
+}
+
+static void elevator_free_sched_queue(struct elevator_queue *eq,
+						void *sched_queue)
+{
+	/* Not all io schedulers (cfq) store sched_queue */
+	if (!sched_queue)
+		return;
+	eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *elevator_alloc_sched_queue(struct request_queue *q,
+					struct elevator_queue *eq)
+{
+	void *sched_queue = NULL;
+
+	if (eq->ops->elevator_alloc_sched_queue_fn) {
+		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+								GFP_KERNEL);
+		if (!sched_queue)
+			return ERR_PTR(-ENOMEM);
+
+	}
+
+	return sched_queue;
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
-			   void *data)
+			   void *data, void *sched_queue)
 {
 	q->elevator = eq;
 	eq->elevator_data = data;
+	eq->sched_queue = sched_queue;
 }
 
 static char chosen_elevator[16];
@@ -256,7 +293,7 @@ int elevator_init(struct request_queue *q, char *name)
 	struct elevator_type *e = NULL;
 	struct elevator_queue *eq;
 	int ret = 0;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	INIT_LIST_HEAD(&q->queue_head);
 	q->last_merge = NULL;
@@ -290,13 +327,21 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!eq)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, eq);
-	if (!data) {
+	data = elevator_init_data(q, eq);
+
+	if (IS_ERR(data)) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, eq);
+
+	if (IS_ERR(sched_queue)) {
 		kobject_put(&eq->kobj);
 		return -ENOMEM;
 	}
 
-	elevator_attach(q, eq, data);
+	elevator_attach(q, eq, data, sched_queue);
 	return ret;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -304,6 +349,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elevator_free_sched_queue(e, e->sched_queue);
 	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
@@ -1094,7 +1140,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old_elevator, *e;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	/*
 	 * Allocate new elevator
@@ -1103,10 +1149,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (!e)
 		return 0;
 
-	data = elevator_init_queue(q, e);
-	if (!data) {
+	data = elevator_init_data(q, e);
+
+	if (IS_ERR(data)) {
 		kobject_put(&e->kobj);
-		return 0;
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, e);
+
+	if (IS_ERR(sched_queue)) {
+		kobject_put(&e->kobj);
+		return -ENOMEM;
 	}
 
 	/*
@@ -1134,7 +1188,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * attach and start new elevator
 	 */
-	elevator_attach(q, e, data);
+	elevator_attach(q, e, data, sched_queue);
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -1241,16 +1295,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
 
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
 void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
 {
-	return ioq_sched_queue(rq_ioq(rq));
+	/*
+	 * io scheduler is not using fair queuing. Return sched_queue
+	 * pointer stored in elevator_queue. It will be null if io
+	 * scheduler never stored anything there to begin with (cfq)
+	 */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	/*
+	 * IO scheduler is using fair queuing infrastructure. If io scheduler
+	 * has passed a non null rq, retrieve sched_queue pointer from
+	 * there. */
+	if (rq)
+		return ioq_sched_queue(rq_ioq(rq));
+
+	return NULL;
 }
 EXPORT_SYMBOL(elv_get_sched_queue);
 
 /* Select an ioscheduler queue to dispatch request from. */
 void *elv_select_sched_queue(struct request_queue *q, int force)
 {
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
 	return ioq_sched_queue(elv_fq_select_ioq(q, force));
 }
 EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+	return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 
-struct noop_data {
+struct noop_queue {
 	struct list_head queue;
 };
 
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
 
 static int noop_dispatch(struct request_queue *q, int force)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_select_sched_queue(q, force);
 
-	if (!list_empty(&nd->queue)) {
+	if (!nq)
+		return 0;
+
+	if (!list_empty(&nq->queue)) {
 		struct request *rq;
-		rq = list_entry(nd->queue.next, struct request, queuelist);
+		rq = list_entry(nq->queue.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		elv_dispatch_sort(q, rq);
 		return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
 
 static void noop_add_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
-	struct noop_data *nd = q->elevator->elevator_data;
-
-	return list_empty(&nd->queue);
+	list_add_tail(&rq->queuelist, &nq->queue);
 }
 
 static struct request *
 noop_former_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.prev == &nd->queue)
+	if (rq->queuelist.prev == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.prev, struct request, queuelist);
 }
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
 static struct request *
 noop_latter_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.next == &nd->queue)
+	if (rq->queuelist.next == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct noop_data *nd;
+	struct noop_queue *nq;
 
-	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
-	if (!nd)
-		return NULL;
-	INIT_LIST_HEAD(&nd->queue);
-	return nd;
+	nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+	if (nq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&nq->queue);
+out:
+	return nq;
 }
 
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 {
-	struct noop_data *nd = e->elevator_data;
+	struct noop_queue *nq = sched_queue;
 
-	BUG_ON(!list_empty(&nd->queue));
-	kfree(nd);
+	kfree(nq);
 }
 
 static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
-		.elevator_queue_empty_fn	= noop_queue_empty,
 		.elevator_former_req_fn		= noop_former_request,
 		.elevator_latter_req_fn		= noop_latter_request,
-		.elevator_init_fn		= noop_init_queue,
-		.elevator_exit_fn		= noop_exit_queue,
+		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
+		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 6f2dea5..bb5ae3a 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *);
 typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -68,8 +69,9 @@ struct elevator_ops
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
 
-#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
 	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
 	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
 	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
 
@@ -109,6 +111,7 @@ struct elevator_queue
 {
 	struct elevator_ops *ops;
 	void *elevator_data;
+	void *sched_queue;
 	struct kobject kobj;
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
@@ -256,5 +259,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 06/10] Separate out queue and data
@ 2009-03-12  1:56     ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers
  Cc: vgoyal, akpm, menage, peterz

o So far noop, deadline and AS each had a single structure, *_data, which
  contained both the queues where requests are held and the common data used
  for scheduling. This patch breaks that structure into two parts, *_queue
  and *_data. This is along the lines of cfq, where all the requests are
  queued in the queue and the common data and tunables are part of the data.
  A minimal sketch of the resulting layout follows this list.

o It does not change any functionality, but this re-organization helps once
  noop, deadline and AS are changed to use hierarchical fair queuing.

o It looks like the queue_empty function is not required; we can check
  q->nr_sorted in the elevator layer to see if the ioscheduler queues are
  empty or not.
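
A minimal sketch (illustration only, not part of the patch) of what a
scheduler looks like after the split; the example_* names are made up,
while elv_get_sched_queue() is the real helper used by the conversions
below:

struct example_queue {			/* per-queue (later per-group) state */
	struct list_head fifo_list[2];
	struct rb_root sort_list[2];
};

struct example_data {			/* per-device tunables and bookkeeping */
	struct request_queue *q;
	int fifo_expire[2];
};

static void example_add_request(struct request_queue *q, struct request *rq)
{
	struct example_data *ed = q->elevator->elevator_data;
	struct example_queue *exq = elv_get_sched_queue(q, rq);
	const int data_dir = rq_data_dir(rq);

	/* tunables come from *_data, the request goes onto the *_queue lists */
	rq_set_fifo_time(rq, jiffies + ed->fifo_expire[data_dir]);
	list_add_tail(&rq->queuelist, &exq->fifo_list[data_dir]);
}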

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/as-iosched.c       |  209 ++++++++++++++++++++++++++--------------------
 block/deadline-iosched.c |  117 ++++++++++++++++----------
 block/elevator.c         |  111 +++++++++++++++++++++----
 block/noop-iosched.c     |   59 ++++++-------
 include/linux/elevator.h |    8 ++-
 5 files changed, 319 insertions(+), 185 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index 631f6f4..6d2890c 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -79,13 +79,7 @@ enum anticipation_status {
 				 * or timed out */
 };
 
-struct as_data {
-	/*
-	 * run time data
-	 */
-
-	struct request_queue *q;	/* the "owner" queue */
-
+struct as_queue {
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -93,6 +87,14 @@ struct as_data {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+	unsigned long last_check_fifo[2];
+	int write_batch_count;		/* max # of reqs in a write batch */
+	int current_write_count;	/* how many requests left this batch */
+	int write_batch_idled;		/* has the write batch gone idle? */
+};
+
+struct as_data {
+	struct request_queue *q;	/* the "owner" queue */
 	sector_t last_sector[2];	/* last REQ_SYNC & REQ_ASYNC sectors */
 
 	unsigned long exit_prob;	/* probability a task will exit while
@@ -104,23 +106,19 @@ struct as_data {
 	unsigned long new_ttime_mean;
 	u64 new_seek_total;		/* mean seek on new proc */
 	sector_t new_seek_mean;
-
 	unsigned long current_batch_expires;
-	unsigned long last_check_fifo[2];
+
 	int changed_batch;		/* 1: waiting for old batch to end */
 	int new_batch;			/* 1: waiting on first read complete */
-	int batch_data_dir;		/* current batch REQ_SYNC / REQ_ASYNC */
-	int write_batch_count;		/* max # of reqs in a write batch */
-	int current_write_count;	/* how many requests left this batch */
-	int write_batch_idled;		/* has the write batch gone idle? */
 
 	enum anticipation_status antic_status;
 	unsigned long antic_start;	/* jiffies: when it started */
 	struct timer_list antic_timer;	/* anticipatory scheduling timer */
-	struct work_struct antic_work;	/* Deferred unplugging */
+	struct work_struct antic_work;  /* Deferred unplugging */
 	struct io_context *io_context;	/* Identify the expected process */
 	int ioc_finished; /* IO associated with io_context is finished */
 	int nr_dispatched;
+	int batch_data_dir;		/* current batch REQ_SYNC / REQ_ASYNC */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
@@ -261,13 +259,14 @@ static void as_put_io_context(struct request *rq)
 /*
  * rb tree support functions
  */
-#define RQ_RB_ROOT(ad, rq)	(&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq)	(&(asq)->sort_list[rq_is_sync((rq))])
 
 static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 {
 	struct request *alias;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
-	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
 		as_move_to_dispatch(ad, alias);
 		as_antic_stop(ad);
 	}
@@ -275,7 +274,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 
 static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
 {
-	elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+	elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
 }
 
 /*
@@ -369,7 +370,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
  * what request to process next. Anticipation works on top of this.
  */
 static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
 {
 	struct rb_node *rbnext = rb_next(&last->rb_node);
 	struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -385,7 +386,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
 	else {
 		const int data_dir = rq_is_sync(last);
 
-		rbnext = rb_first(&ad->sort_list[data_dir]);
+		rbnext = rb_first(&asq->sort_list[data_dir]);
 		if (rbnext && rbnext != &last->rb_node)
 			next = rb_entry_rq(rbnext);
 	}
@@ -790,9 +791,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
 static void as_update_rq(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	/* keep the next_rq cache up to date */
-	ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+	asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
 
 	/*
 	 * have we been anticipating this request?
@@ -813,25 +815,26 @@ static void update_write_batch(struct as_data *ad)
 {
 	unsigned long batch = ad->batch_expire[REQ_ASYNC];
 	long write_time;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
-	if (write_time > batch && !ad->write_batch_idled) {
+	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
-			ad->write_batch_count /= 2;
+			asq->write_batch_count /= 2;
 		else
-			ad->write_batch_count--;
-	} else if (write_time < batch && ad->current_write_count == 0) {
+			asq->write_batch_count--;
+	} else if (write_time < batch && asq->current_write_count == 0) {
 		if (batch > write_time * 3)
-			ad->write_batch_count *= 2;
+			asq->write_batch_count *= 2;
 		else
-			ad->write_batch_count++;
+			asq->write_batch_count++;
 	}
 
-	if (ad->write_batch_count < 1)
-		ad->write_batch_count = 1;
+	if (asq->write_batch_count < 1)
+		asq->write_batch_count = 1;
 }
 
 /*
@@ -902,6 +905,7 @@ static void as_remove_queued_request(struct request_queue *q,
 	const int data_dir = rq_is_sync(rq);
 	struct as_data *ad = q->elevator->elevator_data;
 	struct io_context *ioc;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
@@ -915,8 +919,8 @@ static void as_remove_queued_request(struct request_queue *q,
 	 * Update the "next_rq" cache if we are about to remove its
 	 * entry
 	 */
-	if (ad->next_rq[data_dir] == rq)
-		ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	if (asq->next_rq[data_dir] == rq)
+		asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	rq_fifo_clear(rq);
 	as_del_rq_rb(ad, rq);
@@ -930,23 +934,23 @@ static void as_remove_queued_request(struct request_queue *q,
  *
  * See as_antic_expired comment.
  */
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
 {
 	struct request *rq;
 	long delta_jif;
 
-	delta_jif = jiffies - ad->last_check_fifo[adir];
+	delta_jif = jiffies - asq->last_check_fifo[adir];
 	if (unlikely(delta_jif < 0))
 		delta_jif = -delta_jif;
 	if (delta_jif < ad->fifo_expire[adir])
 		return 0;
 
-	ad->last_check_fifo[adir] = jiffies;
+	asq->last_check_fifo[adir] = jiffies;
 
-	if (list_empty(&ad->fifo_list[adir]))
+	if (list_empty(&asq->fifo_list[adir]))
 		return 0;
 
-	rq = rq_entry_fifo(ad->fifo_list[adir].next);
+	rq = rq_entry_fifo(asq->fifo_list[adir].next);
 
 	return time_after(jiffies, rq_fifo_time(rq));
 }
@@ -955,7 +959,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
  * as_batch_expired returns true if the current batch has expired. A batch
  * is a set of reads or a set of writes.
  */
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
 {
 	if (ad->changed_batch || ad->new_batch)
 		return 0;
@@ -965,7 +969,7 @@ static inline int as_batch_expired(struct as_data *ad)
 		return time_after(jiffies, ad->current_batch_expires);
 
 	return time_after(jiffies, ad->current_batch_expires)
-		|| ad->current_write_count == 0;
+		|| asq->current_write_count == 0;
 }
 
 /*
@@ -974,6 +978,7 @@ static inline int as_batch_expired(struct as_data *ad)
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
 
@@ -996,12 +1001,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 			ad->io_context = NULL;
 		}
 
-		if (ad->current_write_count != 0)
-			ad->current_write_count--;
+		if (asq->current_write_count != 0)
+			asq->current_write_count--;
 	}
 	ad->ioc_finished = 0;
 
-	ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	/*
 	 * take it off the sort and fifo list, add to dispatch queue
@@ -1025,10 +1030,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 static int as_dispatch_request(struct request_queue *q, int force)
 {
 	struct as_data *ad = q->elevator->elevator_data;
-	const int reads = !list_empty(&ad->fifo_list[REQ_SYNC]);
-	const int writes = !list_empty(&ad->fifo_list[REQ_ASYNC]);
+	struct as_queue *asq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 
+	if (!asq)
+		return 0;
+
+	reads = !list_empty(&asq->fifo_list[REQ_SYNC]);
+	writes = !list_empty(&asq->fifo_list[REQ_ASYNC]);
+
 	if (unlikely(force)) {
 		/*
 		 * Forced dispatch, accounting is useless.  Reset
@@ -1043,25 +1054,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		ad->changed_batch = 0;
 		ad->new_batch = 0;
 
-		while (ad->next_rq[REQ_SYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[REQ_SYNC]);
+		while (asq->next_rq[REQ_SYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[REQ_SYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[REQ_SYNC] = jiffies;
+		asq->last_check_fifo[REQ_SYNC] = jiffies;
 
-		while (ad->next_rq[REQ_ASYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[REQ_ASYNC]);
+		while (asq->next_rq[REQ_ASYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[REQ_ASYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[REQ_ASYNC] = jiffies;
+		asq->last_check_fifo[REQ_ASYNC] = jiffies;
 
 		return dispatched;
 	}
 
 	/* Signal that the write batch was uncontended, so we can't time it */
 	if (ad->batch_data_dir == REQ_ASYNC && !reads) {
-		if (ad->current_write_count == 0 || !writes)
-			ad->write_batch_idled = 1;
+		if (asq->current_write_count == 0 || !writes)
+			asq->write_batch_idled = 1;
 	}
 
 	if (!(reads || writes)
@@ -1070,14 +1081,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->changed_batch)
 		return 0;
 
-	if (!(reads && writes && as_batch_expired(ad))) {
+	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
 		 * batch is still running or no reads or no writes
 		 */
-		rq = ad->next_rq[ad->batch_data_dir];
+		rq = asq->next_rq[ad->batch_data_dir];
 
 		if (ad->batch_data_dir == REQ_SYNC && ad->antic_expire) {
-			if (as_fifo_expired(ad, REQ_SYNC))
+			if (as_fifo_expired(ad, asq, REQ_SYNC))
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
@@ -1101,7 +1112,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[REQ_SYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[REQ_SYNC]));
 
 		if (writes && ad->batch_data_dir == REQ_SYNC)
 			/*
@@ -1114,8 +1125,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = REQ_SYNC;
-		rq = rq_entry_fifo(ad->fifo_list[REQ_SYNC].next);
-		ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+		rq = rq_entry_fifo(asq->fifo_list[REQ_SYNC].next);
+		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1125,7 +1136,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[REQ_ASYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[REQ_ASYNC]));
 
 		if (ad->batch_data_dir == REQ_SYNC) {
 			ad->changed_batch = 1;
@@ -1138,10 +1149,10 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = REQ_ASYNC;
-		ad->current_write_count = ad->write_batch_count;
-		ad->write_batch_idled = 0;
-		rq = rq_entry_fifo(ad->fifo_list[REQ_ASYNC].next);
-		ad->last_check_fifo[REQ_ASYNC] = jiffies;
+		asq->current_write_count = asq->write_batch_count;
+		asq->write_batch_idled = 0;
+		rq = rq_entry_fifo(asq->fifo_list[REQ_ASYNC].next);
+		asq->last_check_fifo[REQ_ASYNC] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1153,9 +1164,9 @@ dispatch_request:
 	 * If a request has expired, service it.
 	 */
 
-	if (as_fifo_expired(ad, ad->batch_data_dir)) {
+	if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
 fifo_expired:
-		rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+		rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
 	}
 
 	if (ad->changed_batch) {
@@ -1188,6 +1199,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
 	int data_dir;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	RQ_SET_STATE(rq, AS_RQ_NEW);
 
@@ -1206,7 +1218,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
 
 	as_update_rq(ad, rq); /* keep state machine up to date */
 	RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1228,31 +1240,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 }
 
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
-	struct as_data *ad = q->elevator->elevator_data;
-
-	return list_empty(&ad->fifo_list[REQ_ASYNC])
-		&& list_empty(&ad->fifo_list[REQ_SYNC]);
-}
-
 static int
 as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
-	struct as_data *ad = q->elevator->elevator_data;
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
+	struct as_queue *asq = elv_get_sched_queue_current(q);
+
+	if (!asq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
 	 */
-	__rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+	__rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_FRONT_MERGE;
@@ -1339,6 +1340,41 @@ static int as_may_queue(struct request_queue *q, int rw)
 	return ret;
 }
 
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
+{
+	struct as_queue *asq;
+	struct as_data *ad = eq->elevator_data;
+
+	asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+	if (asq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&asq->fifo_list[REQ_SYNC]);
+	INIT_LIST_HEAD(&asq->fifo_list[REQ_ASYNC]);
+	asq->sort_list[REQ_SYNC] = RB_ROOT;
+	asq->sort_list[REQ_ASYNC] = RB_ROOT;
+	if (ad)
+		asq->write_batch_count = ad->batch_expire[REQ_ASYNC] / 10;
+	else
+		asq->write_batch_count = default_write_batch_expire / 10;
+
+	if (asq->write_batch_count < 2)
+		asq->write_batch_count = 2;
+out:
+	return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+
+	BUG_ON(!list_empty(&asq->fifo_list[REQ_SYNC]));
+	BUG_ON(!list_empty(&asq->fifo_list[REQ_ASYNC]));
+	kfree(asq);
+}
+
 static void as_exit_queue(struct elevator_queue *e)
 {
 	struct as_data *ad = e->elevator_data;
@@ -1346,9 +1382,6 @@ static void as_exit_queue(struct elevator_queue *e)
 	del_timer_sync(&ad->antic_timer);
 	cancel_work_sync(&ad->antic_work);
 
-	BUG_ON(!list_empty(&ad->fifo_list[REQ_SYNC]));
-	BUG_ON(!list_empty(&ad->fifo_list[REQ_ASYNC]));
-
 	put_io_context(ad->io_context);
 	kfree(ad);
 }
@@ -1372,10 +1405,6 @@ static void *as_init_queue(struct request_queue *q)
 	init_timer(&ad->antic_timer);
 	INIT_WORK(&ad->antic_work, as_work_handler);
 
-	INIT_LIST_HEAD(&ad->fifo_list[REQ_SYNC]);
-	INIT_LIST_HEAD(&ad->fifo_list[REQ_ASYNC]);
-	ad->sort_list[REQ_SYNC] = RB_ROOT;
-	ad->sort_list[REQ_ASYNC] = RB_ROOT;
 	ad->fifo_expire[REQ_SYNC] = default_read_expire;
 	ad->fifo_expire[REQ_ASYNC] = default_write_expire;
 	ad->antic_expire = default_antic_expire;
@@ -1383,9 +1412,6 @@ static void *as_init_queue(struct request_queue *q)
 	ad->batch_expire[REQ_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[REQ_SYNC];
-	ad->write_batch_count = ad->batch_expire[REQ_ASYNC] / 10;
-	if (ad->write_batch_count < 2)
-		ad->write_batch_count = 2;
 
 	return ad;
 }
@@ -1482,7 +1508,6 @@ static struct elevator_type iosched_as = {
 		.elevator_add_req_fn =		as_add_request,
 		.elevator_activate_req_fn =	as_activate_request,
 		.elevator_deactivate_req_fn = 	as_deactivate_request,
-		.elevator_queue_empty_fn =	as_queue_empty,
 		.elevator_completed_req_fn =	as_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -1490,6 +1515,8 @@ static struct elevator_type iosched_as = {
 		.elevator_init_fn =		as_init_queue,
 		.elevator_exit_fn =		as_exit_queue,
 		.trim =				as_trim,
+		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+		.elevator_free_sched_queue_fn = as_free_as_queue,
 	},
 
 	.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index c4d991d..5e65041 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
-struct deadline_data {
-	/*
-	 * run time data
-	 */
-
+struct deadline_queue {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];	
+	struct rb_root sort_list[2];
 	struct list_head fifo_list[2];
-
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
 	struct request *next_rq[2];
 	unsigned int batching;		/* number of sequential requests made */
-	sector_t last_sector;		/* head position */
 	unsigned int starved;		/* times reads have starved writes */
+};
 
+struct deadline_data {
+	struct request_queue *q;
+	sector_t last_sector;		/* head position */
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
-	return &dd->sort_list[rq_data_dir(rq)];
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+	return &dq->sort_list[rq_data_dir(rq)];
 }
 
 /*
@@ -87,9 +87,10 @@ static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	if (dd->next_rq[data_dir] == rq)
-		dd->next_rq[data_dir] = deadline_latter_request(rq);
+	if (dq->next_rq[data_dir] == rq)
+		dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	elv_rb_del(deadline_rb_root(dd, rq), rq);
 }
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(q, rq);
 
 	deadline_add_rq_rb(dd, rq);
 
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
 }
 
 /*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	struct request *__rq;
 	int ret;
+	struct deadline_queue *dq;
+
+	dq = elv_get_sched_queue_current(q);
+	if (!dq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	if (dd->front_merges) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
-		__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+		__rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
 		if (__rq) {
 			BUG_ON(sector != __rq->sector);
 
@@ -207,10 +214,11 @@ static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
-	dd->next_rq[data_dir] = deadline_latter_request(rq);
+	dq->next_rq[READ] = NULL;
+	dq->next_rq[WRITE] = NULL;
+	dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	dd->last_sector = rq_end_sector(rq);
 
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
 
 	/*
 	 * rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
 static int deadline_dispatch_requests(struct request_queue *q, int force)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int reads = !list_empty(&dd->fifo_list[READ]);
-	const int writes = !list_empty(&dd->fifo_list[WRITE]);
+	struct deadline_queue *dq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 	int data_dir;
 
+	if (!dq)
+		return 0;
+
+	reads = !list_empty(&dq->fifo_list[READ]);
+	writes = !list_empty(&dq->fifo_list[WRITE]);
+
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	if (dd->next_rq[WRITE])
-		rq = dd->next_rq[WRITE];
+	if (dq->next_rq[WRITE])
+		rq = dq->next_rq[WRITE];
 	else
-		rq = dd->next_rq[READ];
+		rq = dq->next_rq[READ];
 
-	if (rq && dd->batching < dd->fifo_batch)
+	if (rq && dq->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
 		goto dispatch_request;
 
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
 
-		if (writes && (dd->starved++ >= dd->writes_starved))
+		if (writes && (dq->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
 		data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
 
-		dd->starved = 0;
+		dq->starved = 0;
 
 		data_dir = WRITE;
 
@@ -299,48 +313,62 @@ dispatch_find_request:
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+	if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+		rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
 		 * sort order. No expired requests so continue on from here.
 		 */
-		rq = dd->next_rq[data_dir];
+		rq = dq->next_rq[data_dir];
 	}
 
-	dd->batching = 0;
+	dq->batching = 0;
 
 dispatch_request:
 	/*
 	 * rq is the selected appropriate request.
 	 */
-	dd->batching++;
+	dq->batching++;
 	deadline_move_request(dd, rq);
 
 	return 1;
 }
 
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_queue *dq;
 
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
+	dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+	if (dq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&dq->fifo_list[READ]);
+	INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+	dq->sort_list[READ] = RB_ROOT;
+	dq->sort_list[WRITE] = RB_ROOT;
+out:
+	return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+						void *sched_queue)
+{
+	struct deadline_queue *dq = sched_queue;
+
+	kfree(dq);
 }
 
 static void deadline_exit_queue(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
 	kfree(dd);
 }
 
@@ -355,10 +383,7 @@ static void *deadline_init_queue(struct request_queue *q)
 	if (!dd)
 		return NULL;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
+	dd->q = q;
 	dd->fifo_expire[READ] = read_expire;
 	dd->fifo_expire[WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
@@ -445,13 +470,13 @@ static struct elevator_type iosched_deadline = {
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
-		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_init_fn =		deadline_init_queue,
 		.elevator_exit_fn =		deadline_exit_queue,
+		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
-
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index 27889bc..5df13c4 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -176,17 +176,54 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static void *elevator_init_queue(struct request_queue *q,
-				 struct elevator_queue *eq)
+static void *elevator_init_data(struct request_queue *q,
+					struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q);
+	void *data = NULL;
+
+	if (eq->ops->elevator_init_fn) {
+		data = eq->ops->elevator_init_fn(q);
+		if (data)
+			return data;
+		else
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* IO scheduler does not instantiate data (noop), it is not an error */
+	return NULL;
+}
+
+static void elevator_free_sched_queue(struct elevator_queue *eq,
+						void *sched_queue)
+{
+	/* Not all io schedulers (cfq) store sched_queue */
+	if (!sched_queue)
+		return;
+	eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *elevator_alloc_sched_queue(struct request_queue *q,
+					struct elevator_queue *eq)
+{
+	void *sched_queue = NULL;
+
+	if (eq->ops->elevator_alloc_sched_queue_fn) {
+		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+								GFP_KERNEL);
+		if (!sched_queue)
+			return ERR_PTR(-ENOMEM);
+
+	}
+
+	return sched_queue;
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
-			   void *data)
+			   void *data, void *sched_queue)
 {
 	q->elevator = eq;
 	eq->elevator_data = data;
+	eq->sched_queue = sched_queue;
 }
 
 static char chosen_elevator[16];
@@ -256,7 +293,7 @@ int elevator_init(struct request_queue *q, char *name)
 	struct elevator_type *e = NULL;
 	struct elevator_queue *eq;
 	int ret = 0;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	INIT_LIST_HEAD(&q->queue_head);
 	q->last_merge = NULL;
@@ -290,13 +327,21 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!eq)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, eq);
-	if (!data) {
+	data = elevator_init_data(q, eq);
+
+	if (IS_ERR(data)) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, eq);
+
+	if (IS_ERR(sched_queue)) {
 		kobject_put(&eq->kobj);
 		return -ENOMEM;
 	}
 
-	elevator_attach(q, eq, data);
+	elevator_attach(q, eq, data, sched_queue);
 	return ret;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -304,6 +349,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elevator_free_sched_queue(e, e->sched_queue);
 	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
@@ -1094,7 +1140,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old_elevator, *e;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	/*
 	 * Allocate new elevator
@@ -1103,10 +1149,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (!e)
 		return 0;
 
-	data = elevator_init_queue(q, e);
-	if (!data) {
+	data = elevator_init_data(q, e);
+
+	if (IS_ERR(data)) {
 		kobject_put(&e->kobj);
-		return 0;
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, e);
+
+	if (IS_ERR(sched_queue)) {
+		kobject_put(&e->kobj);
+		return -ENOMEM;
 	}
 
 	/*
@@ -1134,7 +1188,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * attach and start new elevator
 	 */
-	elevator_attach(q, e, data);
+	elevator_attach(q, e, data, sched_queue);
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -1241,16 +1295,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
 
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
 void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
 {
-	return ioq_sched_queue(rq_ioq(rq));
+	/*
+	 * io scheduler is not using fair queuing. Return sched_queue
+	 * pointer stored in elevator_queue. It will be null if io
+	 * scheduler never stored anything there to begin with (cfq)
+	 */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	/*
+	 * IO scheduler is using fair queuing infrastructure. If io scheduler
+	 * has passed a non null rq, retrieve sched_queue pointer from
+	 * there. */
+	if (rq)
+		return ioq_sched_queue(rq_ioq(rq));
+
+	return NULL;
 }
 EXPORT_SYMBOL(elv_get_sched_queue);
 
 /* Select an ioscheduler queue to dispatch request from. */
 void *elv_select_sched_queue(struct request_queue *q, int force)
 {
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
 	return ioq_sched_queue(elv_fq_select_ioq(q, force));
 }
 EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+	return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 
-struct noop_data {
+struct noop_queue {
 	struct list_head queue;
 };
 
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
 
 static int noop_dispatch(struct request_queue *q, int force)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_select_sched_queue(q, force);
 
-	if (!list_empty(&nd->queue)) {
+	if (!nq)
+		return 0;
+
+	if (!list_empty(&nq->queue)) {
 		struct request *rq;
-		rq = list_entry(nd->queue.next, struct request, queuelist);
+		rq = list_entry(nq->queue.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		elv_dispatch_sort(q, rq);
 		return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
 
 static void noop_add_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
-	struct noop_data *nd = q->elevator->elevator_data;
-
-	return list_empty(&nd->queue);
+	list_add_tail(&rq->queuelist, &nq->queue);
 }
 
 static struct request *
 noop_former_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.prev == &nd->queue)
+	if (rq->queuelist.prev == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.prev, struct request, queuelist);
 }
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
 static struct request *
 noop_latter_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.next == &nd->queue)
+	if (rq->queuelist.next == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct noop_data *nd;
+	struct noop_queue *nq;
 
-	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
-	if (!nd)
-		return NULL;
-	INIT_LIST_HEAD(&nd->queue);
-	return nd;
+	nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+	if (nq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&nq->queue);
+out:
+	return nq;
 }
 
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 {
-	struct noop_data *nd = e->elevator_data;
+	struct noop_queue *nq = sched_queue;
 
-	BUG_ON(!list_empty(&nd->queue));
-	kfree(nd);
+	kfree(nq);
 }
 
 static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
-		.elevator_queue_empty_fn	= noop_queue_empty,
 		.elevator_former_req_fn		= noop_former_request,
 		.elevator_latter_req_fn		= noop_latter_request,
-		.elevator_init_fn		= noop_init_queue,
-		.elevator_exit_fn		= noop_exit_queue,
+		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
+		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 6f2dea5..bb5ae3a 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *);
 typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -68,8 +69,9 @@ struct elevator_ops
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
 
-#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
 	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
 	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
 	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
 
@@ -109,6 +111,7 @@ struct elevator_queue
 {
 	struct elevator_ops *ops;
 	void *elevator_data;
+	void *sched_queue;
 	struct kobject kobj;
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
@@ -256,5 +259,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 07/10] Prepare elevator layer for single queue schedulers
       [not found] ` <1236823015-4183-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2009-03-12  1:56     ` Vivek Goyal
@ 2009-03-12  1:56   ` Vivek Goyal
  2009-03-12  1:56     ` Vivek Goyal
                     ` (6 subsequent siblings)
  13 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, menage-hpIqsD4AKlfQT0dZR+AlfA

The elevator layer now has support for hierarchical fair queuing. cfq has
been migrated to make use of it, and now it is time to do the groundwork for
noop, deadline and AS.

noop, deadline and AS don't maintain separate queues for different processes;
there is only a single queue. Effectively, in a hierarchical setup there will
be one queue per cgroup, where requests from all the processes in the cgroup
are queued.

Generally the io scheduler takes care of creating queues. Because there is
only one queue per group here, the common layer has been modified to take
care of queue creation and some other functionality. This special casing
helps keep the changes to noop, deadline and AS to a minimum; the hooks a
single-queue scheduler now provides are sketched below.
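
As an illustration (not part of this patch; a simplified composite of the noop
and deadline conversions later in the series, using a hypothetical "foo"
scheduler), a single-queue io scheduler now only provides hooks to allocate
and free its per-group queue and advertises the single-ioq feature; the
elevator layer does the rest:

	static struct elevator_type elevator_foo = {
		.ops = {
			.elevator_dispatch_fn		= foo_dispatch,
			.elevator_add_req_fn		= foo_add_request,
			/*
			 * Called by the common layer when it creates or
			 * destroys the per-cgroup queue.
			 */
			.elevator_alloc_sched_queue_fn	= foo_alloc_queue,
			.elevator_free_sched_queue_fn	= foo_free_queue,
		},
		/* one ioq per cgroup, managed by the elevator fq layer */
		.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
		.elevator_name	= "foo",
		.elevator_owner	= THIS_MODULE,
	};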

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c      |  153 ++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h      |   67 ++++++++++++++++++++
 block/elevator.c         |   35 ++++++++++-
 include/linux/elevator.h |   14 ++++
 4 files changed, 268 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 389f68e..172f9e3 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -857,6 +857,12 @@ void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
 
 	/* Free up async idle queue */
 	elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* Optimization for io schedulers having single ioq */
+	if (elv_iosched_single_ioq(e))
+		elv_release_ioq(e, &iog->ioq);
+#endif
 }
 
 
@@ -1538,6 +1544,153 @@ void elv_fq_set_request_io_group(struct request_queue *q,
 	rq->iog = iog;
 }
 
+/*
+ * Find/create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain only a single
+ * io queue per cgroup. In this case the common layer can just keep a
+ * pointer in the group data structure and track the queue itself.
+ *
+ * For io schedulers like cfq, which maintain multiple io queues per
+ * cgroup and decide the io queue of a request based on the process, this
+ * function is not invoked.
+ */
+int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask)
+{
+	struct elevator_queue *e = q->elevator;
+	unsigned long flags;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog;
+	void *sched_q = NULL, *new_sched_q = NULL;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	/* Determine the io group request belongs to */
+	iog = rq->iog;
+	BUG_ON(!iog);
+
+retry:
+	/* Get the iosched queue */
+	ioq = io_group_ioq(iog);
+	if (!ioq) {
+		/* io queue and sched_queue need to be allocated */
+		BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+		if (new_sched_q) {
+			goto alloc_ioq;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			/* Call io scheduler to create the scheduler queue */
+			new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+					e, gfp_mask | __GFP_NOFAIL
+					| __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+						gfp_mask | __GFP_ZERO);
+			if (!sched_q)
+				goto queue_fail;
+		}
+
+alloc_ioq:
+		if (new_ioq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			sched_q = new_sched_q;
+			new_sched_q = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+							| __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq) {
+				e->ops->elevator_free_sched_queue_fn(e,
+							sched_q);
+				sched_q = NULL;
+				goto queue_fail;
+			}
+		}
+
+		elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
+		io_group_set_ioq(iog, ioq);
+		elv_mark_ioq_sync(ioq);
+	}
+
+	if (new_sched_q)
+		e->ops->elevator_free_sched_queue_fn(q->elevator, sched_q);
+
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
+	/* Request reference */
+	elv_get_ioq(ioq);
+	rq->ioq = ioq;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 0;
+
+queue_fail:
+	WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+	elv_schedule_dispatch(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 1;
+}
+
+/*
+ * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	struct io_group *iog;
+
+	/* Determine the io group and io queue of the bio submitting task */
+	iog = io_lookup_io_group_current(q);
+	if (!iog) {
+		/* Maybe the task belongs to a cgroup for which the io group
+		 * has not been set up yet. */
+		return NULL;
+	}
+	return io_group_ioq(iog);
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	if (ioq) {
+		rq->ioq = NULL;
+		elv_put_ioq(ioq);
+	}
+}
+
 #else /* GROUP_IOSCHED */
 void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
 {
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 3fab8f8..fc4110d 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -232,6 +232,9 @@ struct io_group {
 	/* async_queue and idle_queue are used only for cfq */
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
+
+	/* Single ioq per group, used for noop, deadline, anticipatory */
+	struct io_queue *ioq;
 };
 
 /**
@@ -461,6 +464,28 @@ extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
 					struct io_group *iog);
 extern void elv_fq_set_request_io_group(struct request_queue *q,
 						struct request *rq);
+extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask);
+extern void elv_fq_unset_request_ioq(struct request_queue *q,
+					struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
+/* Returns single ioq associated with the io group. */
+static inline struct io_queue *io_group_ioq(struct io_group *iog)
+{
+	BUG_ON(!iog);
+	return iog->ioq;
+}
+
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+	BUG_ON(!iog);
+	/* io group reference. Will be dropped when group is destroyed. */
+	elv_get_ioq(ioq);
+	iog->ioq = ioq;
+}
+
 #else /* !GROUP_IOSCHED */
 /*
  * No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -486,6 +511,32 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
 {
 }
 
+/* Returns single ioq associated with the io group. */
+static inline struct io_queue *io_group_ioq(struct io_group *iog)
+{
+	return NULL;
+}
+
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+}
+
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+					struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* GROUP_IOSCHED */
 
 /* Functions used by blksysfs.c */
@@ -588,5 +639,21 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
 						struct request *rq)
 {
 }
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+					struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 5df13c4..bce4421 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -207,6 +207,14 @@ static void *elevator_alloc_sched_queue(struct request_queue *q,
 {
 	void *sched_queue = NULL;
 
+	/*
+	 * If fair queuing is enabled, then queue allocation takes place
+	 * in set_request() when the request actually comes in.
+	 */
+	if (elv_iosched_fair_queuing_enabled(eq))
+		return NULL;
+
 	if (eq->ops->elevator_alloc_sched_queue_fn) {
 		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
 								GFP_KERNEL);
@@ -936,6 +944,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	elv_fq_set_request_io_group(q, rq);
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e))
+		return elv_fq_set_request_ioq(q, rq, gfp_mask);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
@@ -947,6 +962,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e)) {
+		elv_fq_unset_request_ioq(q, rq);
+		return;
+	}
+
 	if (e->ops->elevator_put_req_fn)
 		e->ops->elevator_put_req_fn(rq);
 }
@@ -1329,9 +1353,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
  * Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of the task and retrieve
+ * the ioq pointer from that. This is used only by single-queue ioschedulers
+ * to retrieve the queue associated with the group and decide whether the
+ * new bio can do a front merge or not.
  */
 void *elv_get_sched_queue_current(struct request_queue *q)
 {
-	return q->elevator->sched_queue;
+	/* Fair queuing is not enabled. There is only one queue. */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	return ioq_sched_queue(elv_lookup_ioq_current(q));
 }
 EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index bb5ae3a..8cee877 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -245,17 +245,31 @@ enum {
 /* iosched wants to use fq logic of elevator layer */
 #define	ELV_IOSCHED_NEED_FQ	1
 
+/* iosched maintains only single ioq per group.*/
+#define ELV_IOSCHED_SINGLE_IOQ        2
+
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
 }
 
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
 #else /* ELV_IOSCHED_FAIR_QUEUING */
 
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return 0;
 }
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return 0;
+}
+
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 07/10] Prepare elevator layer for single queue schedulers
  2009-03-12  1:56 ` Vivek Goyal
                   ` (2 preceding siblings ...)
  (?)
@ 2009-03-12  1:56 ` Vivek Goyal
  -1 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers
  Cc: vgoyal, akpm, menage, peterz

The elevator layer now has support for hierarchical fair queuing. cfq has
been migrated to make use of it, and now it is time to do the groundwork for
noop, deadline and AS.

noop, deadline and AS don't maintain separate queues for different processes;
there is only a single queue. Effectively, in a hierarchical setup there will
be one queue per cgroup, where requests from all the processes in the cgroup
are queued.

Generally the io scheduler takes care of creating queues. Because there is
only one queue per group here, the common layer has been modified to take
care of queue creation and some other functionality. This special casing
helps keep the changes to noop, deadline and AS to a minimum; a simplified
sketch of the resulting request setup path is included below.
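
For reference, the per-group queue setup on the common layer side reduces to
roughly the following (a simplified sketch of elv_fq_set_request_ioq() below;
the helper name is made up, and locking, the allocation retry logic and most
error handling are omitted):

	static int single_ioq_set_request(struct request_queue *q,
					struct request *rq, gfp_t gfp_mask)
	{
		struct elevator_queue *e = q->elevator;
		struct io_group *iog = rq->iog;	/* set by elv_fq_set_request_io_group() */
		struct io_queue *ioq = io_group_ioq(iog);

		if (!ioq) {
			/* first request from this cgroup: create its queue */
			void *sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
							gfp_mask | __GFP_ZERO);
			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
			if (!sched_q || !ioq)
				return 1;	/* see queue_fail in the real code */

			elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
			io_group_set_ioq(iog, ioq);
			elv_mark_ioq_sync(ioq);
		}

		elv_get_ioq(ioq);	/* the request holds a reference */
		rq->ioq = ioq;
		return 0;
	}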

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c      |  153 ++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h      |   67 ++++++++++++++++++++
 block/elevator.c         |   35 ++++++++++-
 include/linux/elevator.h |   14 ++++
 4 files changed, 268 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 389f68e..172f9e3 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -857,6 +857,12 @@ void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
 
 	/* Free up async idle queue */
 	elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* Optimization for io schedulers having single ioq */
+	if (elv_iosched_single_ioq(e))
+		elv_release_ioq(e, &iog->ioq);
+#endif
 }
 
 
@@ -1538,6 +1544,153 @@ void elv_fq_set_request_io_group(struct request_queue *q,
 	rq->iog = iog;
 }
 
+/*
+ * Find/create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain only a single
+ * io queue per cgroup. In this case the common layer can just keep a
+ * pointer in the group data structure and track the queue itself.
+ *
+ * For io schedulers like cfq, which maintain multiple io queues per
+ * cgroup and decide the io queue of a request based on the process, this
+ * function is not invoked.
+ */
+int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask)
+{
+	struct elevator_queue *e = q->elevator;
+	unsigned long flags;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog;
+	void *sched_q = NULL, *new_sched_q = NULL;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	/* Determine the io group request belongs to */
+	iog = rq->iog;
+	BUG_ON(!iog);
+
+retry:
+	/* Get the iosched queue */
+	ioq = io_group_ioq(iog);
+	if (!ioq) {
+		/* io queue and sched_queue need to be allocated */
+		BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+		if (new_sched_q) {
+			goto alloc_ioq;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			/* Call io scheduler to create the scheduler queue */
+			new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+					e, gfp_mask | __GFP_NOFAIL
+					| __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+						gfp_mask | __GFP_ZERO);
+			if (!sched_q)
+				goto queue_fail;
+		}
+
+alloc_ioq:
+		if (new_ioq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			sched_q = new_sched_q;
+			new_sched_q = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+							| __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq) {
+				e->ops->elevator_free_sched_queue_fn(e,
+							sched_q);
+				sched_q = NULL;
+				goto queue_fail;
+			}
+		}
+
+		elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
+		io_group_set_ioq(iog, ioq);
+		elv_mark_ioq_sync(ioq);
+	}
+
+	if (new_sched_q)
+		e->ops->elevator_free_sched_queue_fn(q->elevator, sched_q);
+
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
+	/* Request reference */
+	elv_get_ioq(ioq);
+	rq->ioq = ioq;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 0;
+
+queue_fail:
+	WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+	elv_schedule_dispatch(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 1;
+}
+
+/*
+ * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	struct io_group *iog;
+
+	/* Determine the io group and io queue of the bio submitting task */
+	iog = io_lookup_io_group_current(q);
+	if (!iog) {
+		/* Maybe the task belongs to a cgroup for which the io group
+		 * has not been set up yet. */
+		return NULL;
+	}
+	return io_group_ioq(iog);
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	if (ioq) {
+		rq->ioq = NULL;
+		elv_put_ioq(ioq);
+	}
+}
+
 #else /* GROUP_IOSCHED */
 void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
 {
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 3fab8f8..fc4110d 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -232,6 +232,9 @@ struct io_group {
 	/* async_queue and idle_queue are used only for cfq */
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
+
+	/* Single ioq per group, used for noop, deadline, anticipatory */
+	struct io_queue *ioq;
 };
 
 /**
@@ -461,6 +464,28 @@ extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
 					struct io_group *iog);
 extern void elv_fq_set_request_io_group(struct request_queue *q,
 						struct request *rq);
+extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask);
+extern void elv_fq_unset_request_ioq(struct request_queue *q,
+					struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
+/* Returns single ioq associated with the io group. */
+static inline struct io_queue *io_group_ioq(struct io_group *iog)
+{
+	BUG_ON(!iog);
+	return iog->ioq;
+}
+
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+	BUG_ON(!iog);
+	/* io group reference. Will be dropped when group is destroyed. */
+	elv_get_ioq(ioq);
+	iog->ioq = ioq;
+}
+
 #else /* !GROUP_IOSCHED */
 /*
  * No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -486,6 +511,32 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
 {
 }
 
+/* Returns single ioq associated with the io group. */
+static inline struct io_queue *io_group_ioq(struct io_group *iog)
+{
+	return NULL;
+}
+
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+}
+
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+					struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* GROUP_IOSCHED */
 
 /* Functions used by blksysfs.c */
@@ -588,5 +639,21 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
 						struct request *rq)
 {
 }
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+					struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 5df13c4..bce4421 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -207,6 +207,14 @@ static void *elevator_alloc_sched_queue(struct request_queue *q,
 {
 	void *sched_queue = NULL;
 
+	/*
+	 * If fair queuing is enabled, then queue allocation takes place
+	 * in set_request() when the request actually comes in.
+	 */
+	if (elv_iosched_fair_queuing_enabled(eq))
+		return NULL;
+
 	if (eq->ops->elevator_alloc_sched_queue_fn) {
 		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
 								GFP_KERNEL);
@@ -936,6 +944,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	elv_fq_set_request_io_group(q, rq);
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e))
+		return elv_fq_set_request_ioq(q, rq, gfp_mask);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
@@ -947,6 +962,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e)) {
+		elv_fq_unset_request_ioq(q, rq);
+		return;
+	}
+
 	if (e->ops->elevator_put_req_fn)
 		e->ops->elevator_put_req_fn(rq);
 }
@@ -1329,9 +1353,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
  * Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of the task and retrieve
+ * the ioq pointer from that. This is used only by single-queue ioschedulers
+ * to retrieve the queue associated with the group and decide whether the
+ * new bio can do a front merge or not.
  */
 void *elv_get_sched_queue_current(struct request_queue *q)
 {
-	return q->elevator->sched_queue;
+	/* Fair queuing is not enabled. There is only one queue. */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	return ioq_sched_queue(elv_lookup_ioq_current(q));
 }
 EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index bb5ae3a..8cee877 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -245,17 +245,31 @@ enum {
 /* iosched wants to use fq logic of elevator layer */
 #define	ELV_IOSCHED_NEED_FQ	1
 
+/* iosched maintains only single ioq per group.*/
+#define ELV_IOSCHED_SINGLE_IOQ        2
+
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
 }
 
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
 #else /* ELV_IOSCHED_FAIR_QUEUING */
 
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return 0;
 }
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return 0;
+}
+
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 08/10] noop changes for hierarchical fair queuing
  2009-03-12  1:56 ` Vivek Goyal
@ 2009-03-12  1:56     ` Vivek Goyal
  -1 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, menage-hpIqsD4AKlfQT0dZR+AlfA

This patch changes noop to use the queue scheduling code from the elevator
layer. One can go back to the old noop behaviour by deselecting
CONFIG_IOSCHED_NOOP_HIER.
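
For context, with the elevator layer doing the fair queuing, noop's hot paths
no longer touch a global noop_data; they ask the common layer for the right
per-group queue (condensed from the noop preparation patch earlier in the
series):

	static void noop_add_request(struct request_queue *q, struct request *rq)
	{
		/* queue the request on its own cgroup's noop queue */
		struct noop_queue *nq = elv_get_sched_queue(q, rq);

		list_add_tail(&rq->queuelist, &nq->queue);
	}

	static int noop_dispatch(struct request_queue *q, int force)
	{
		/* the elevator fair queuing layer picks which cgroup runs now */
		struct noop_queue *nq = elv_select_sched_queue(q, force);
		struct request *rq;

		if (!nq || list_empty(&nq->queue))
			return 0;

		rq = list_entry(nq->queue.next, struct request, queuelist);
		list_del_init(&rq->queuelist);
		elv_dispatch_sort(q, rq);
		return 1;
	}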

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched |   11 +++++++++++
 block/noop-iosched.c  |    3 +++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a91a807..9da6657 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,17 @@ config IOSCHED_NOOP
 	  that do their own scheduling and require only minimal assistance from
 	  the kernel.
 
+config IOSCHED_NOOP_HIER
+	bool "Noop Hierarchical Scheduling support"
+	depends on IOSCHED_NOOP && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in noop. In this mode noop keeps
+	  one IO queue per cgroup instead of a global queue. The elevator
+	  fair queuing logic ensures fairness among the queues.
+
 config IOSCHED_AS
 	tristate "Anticipatory I/O scheduler"
 	default y
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..73e571d 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -92,6 +92,9 @@ static struct elevator_type elevator_noop = {
 		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
 		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
 };
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 08/10] noop changes for hierarchical fair queuing
@ 2009-03-12  1:56     ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers
  Cc: vgoyal, akpm, menage, peterz

This patch changes noop to use the queue scheduling code from the elevator
layer. One can go back to the old noop behaviour by deselecting
CONFIG_IOSCHED_NOOP_HIER.
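
Selecting CONFIG_IOSCHED_NOOP_HIER pulls in ELV_FAIR_QUEUING, which also
changes when noop's queue is created: the elevator layer skips the allocation
at init time and defers it until the first request from a cgroup arrives
(a condensed sketch of the elevator.c hunk in the previous patch):

	static void *elevator_alloc_sched_queue(struct request_queue *q,
						struct elevator_queue *eq)
	{
		void *sched_queue = NULL;

		/*
		 * With fair queuing, per-cgroup queues are created lazily
		 * from set_request(), so nothing is allocated here.
		 */
		if (elv_iosched_fair_queuing_enabled(eq))
			return NULL;

		/* legacy path: one global queue allocated up front */
		if (eq->ops->elevator_alloc_sched_queue_fn)
			sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q,
							eq, GFP_KERNEL);

		return sched_queue;
	}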

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |   11 +++++++++++
 block/noop-iosched.c  |    3 +++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a91a807..9da6657 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,17 @@ config IOSCHED_NOOP
 	  that do their own scheduling and require only minimal assistance from
 	  the kernel.
 
+config IOSCHED_NOOP_HIER
+	bool "Noop Hierarchical Scheduling support"
+	depends on IOSCHED_NOOP && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in noop. In this mode noop keeps
+	  one IO queue per cgroup instead of a global queue. The elevator
+	  fair queuing logic ensures fairness among the queues.
+
 config IOSCHED_AS
 	tristate "Anticipatory I/O scheduler"
 	default y
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..73e571d 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -92,6 +92,9 @@ static struct elevator_type elevator_noop = {
 		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
 		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
 };
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 09/10] deadline changes for hierarchical fair queuing
  2009-03-12  1:56 ` Vivek Goyal
@ 2009-03-12  1:56     ` Vivek Goyal
  -1 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, menage-hpIqsD4AKlfQT0dZR+AlfA

This patch changes deadline to use the queue scheduling code from the elevator
layer. One can go back to the old deadline behaviour by deselecting
CONFIG_IOSCHED_DEADLINE_HIER.
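
The deadline-side hooks named in the hunk below follow the same pattern as the
noop ones: the sort and fifo lists that used to live in the per-device
deadline data now live in a per-cgroup deadline queue, allocated on demand by
the common layer. A rough sketch of what the allocation hook looks like (field
names here mirror the stock deadline_data layout and are illustrative, not
copied from the actual patch):

	static void *
	deadline_alloc_deadline_queue(struct request_queue *q,
				struct elevator_queue *eq, gfp_t gfp_mask)
	{
		struct deadline_queue *dq;

		/* one of these now exists per cgroup instead of per device */
		dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
		if (!dq)
			return NULL;

		INIT_LIST_HEAD(&dq->fifo_list[READ]);
		INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
		dq->sort_list[READ] = RB_ROOT;
		dq->sort_list[WRITE] = RB_ROOT;

		return dq;
	}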

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched    |   11 +++++++++++
 block/deadline-iosched.c |    3 +++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 9da6657..3a9e7d7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
 	  a disk at any one time, its behaviour is almost identical to the
 	  anticipatory I/O scheduler and so is a good choice.
 
+config IOSCHED_DEADLINE_HIER
+	bool "Deadline Hierarchical Scheduling support"
+	depends on IOSCHED_DEADLINE && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in deadline. In this mode deadline
+	  keeps one IO queue per cgroup instead of a global queue. The elevator
+	  fair queuing logic ensures fairness among the queues.
+
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
 	select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5e65041..27b77b9 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -477,6 +477,9 @@ static struct elevator_type iosched_deadline = {
 		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
 		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 09/10] deadline changes for hierarchical fair queuing
@ 2009-03-12  1:56     ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers
  Cc: vgoyal, akpm, menage, peterz

This patch changes deadline to use the queue scheduling code from the elevator
layer. One can go back to the old deadline behaviour by deselecting
CONFIG_IOSCHED_DEADLINE_HIER.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   11 +++++++++++
 block/deadline-iosched.c |    3 +++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 9da6657..3a9e7d7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
 	  a disk at any one time, its behaviour is almost identical to the
 	  anticipatory I/O scheduler and so is a good choice.
 
+config IOSCHED_DEADLINE_HIER
+	bool "Deadline Hierarchical Scheduling support"
+	depends on IOSCHED_DEADLINE && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in deadline. In this mode deadline
+	  keeps one IO queue per cgroup instead of a global queue. The elevator
+	  fair queuing logic ensures fairness among the queues.
+
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
 	select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5e65041..27b77b9 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -477,6 +477,9 @@ static struct elevator_type iosched_deadline = {
 		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
 		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 10/10] anticipatory changes for hierarchical fair queuing
  2009-03-12  1:56 ` Vivek Goyal
@ 2009-03-12  1:56     ` Vivek Goyal
  -1 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, menage-hpIqsD4AKlfQT0dZR+AlfA

This patch changes the anticipatory scheduler to use the queue scheduling code
from the elevator layer. One can go back to the old AS behaviour by deselecting
CONFIG_IOSCHED_AS_HIER.

TODO/Issues
===========
- The AS anticipation logic does not seem to be sufficient to provide a BW
  difference when two "dd" threads run in two different cgroups. Needs to be
  looked into.

- The AS write batch size (number of requests) adjustment happens upon every
  W->R batch direction switch. This automatic adjustment depends on how much
  time a read takes after a W->R switch.

  This does not gel very well when hierarchical scheduling is enabled and
  every io group can have its own read/write batch. If io group switching
  takes place, it creates issues.

  Currently I have disabled write batch length adjustment in hierarchical
  mode.

- Currently performance seems to be very bad in hierarchical mode. Needs
  to be looked into.

- I think the whole idea of the common layer doing time slice switching
  between queues, with each queue in turn running timed batches, is not very
  good. Maybe AS could maintain two queues (one for READs and the other for
  WRITEs) and let the common layer do the time slice switching between these
  two queues. The expiry handshake between the common layer and AS added by
  this patch is sketched below.
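
For reference, the expiry handshake looks roughly like this (simplified from
the elevator-fq.c hunks below; locking, budget updates and corner cases
omitted). The common layer proposes a queue switch and the io scheduler may
veto it, which is how AS protects its anticipation window and batch
changeovers:

	int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
					int force)
	{
		struct elevator_queue *e = q->elevator;
		struct io_queue *ioq = elv_active_ioq(q->elevator);

		if (e->ops->elevator_expire_ioq_fn)
			return e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
							slice_expired, force);

		return 1;	/* schedulers without the hook always allow expiry */
	}

	/* caller side, in elv_fq_select_ioq() */
		if (elv_iosched_expire_ioq(q, slice_expired, force)) {
			elv_ioq_slice_expired(q, budget_update);
		} else {
			/* the io scheduler vetoed the switch; keep the queue */
			ioq = NULL;
			goto keep_queue;
		}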

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched    |   12 +++
 block/as-iosched.c       |  176 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.c      |   77 ++++++++++++++++----
 include/linux/elevator.h |   16 ++++
 4 files changed, 265 insertions(+), 16 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3a9e7d7..77fc786 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,6 +45,18 @@ config IOSCHED_AS
 	  deadline I/O scheduler, it can also be slower in some cases
 	  especially some database loads.
 
+config IOSCHED_AS_HIER
+	bool "Anticipatory Hierarchical Scheduling support"
+	depends on IOSCHED_AS && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in anticipatory. In this mode
+	  anticipatory keeps one IO queue per cgroup instead of a global
+	  queue. The elevator fair queuing logic ensures fairness among the
+	  queues.
+
 config IOSCHED_DEADLINE
 	tristate "Deadline I/O scheduler"
 	default y
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 6d2890c..27c14a7 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -87,6 +87,19 @@ struct as_queue {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+
+	/*
+	 * If an as_queue is switched while a batch is running, then we
+	 * store the time left before current batch will expire
+	 */
+	long current_batch_time_left;
+
+	/*
+	 * batch data dir when queue was scheduled out. This will be used
+	 * to setup ad->batch_data_dir when queue is scheduled in.
+	 */
+	int saved_batch_data_dir;
+
 	unsigned long last_check_fifo[2];
 	int write_batch_count;		/* max # of reqs in a write batch */
 	int current_write_count;	/* how many requests left this batch */
@@ -153,6 +166,140 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
 
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq);
 static void as_antic_stop(struct as_data *ad);
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
+
+#ifdef CONFIG_IOSCHED_AS_HIER
+static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Save batch data dir */
+	asq->saved_batch_data_dir = ad->batch_data_dir;
+
+	if (ad->changed_batch) {
+		/*
+		 * In case of force expire, we come here. Batch changeover
+		 * has been signalled but we are waiting for all the
+		 * requests to finish from the previous batch before starting
+		 * the new batch. Can't wait now. Mark that full batch time
+		 * needs to be allocated when this queue is scheduled again.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->changed_batch = 0;
+		return;
+	}
+
+	if (ad->new_batch) {
+		/*
+		 * We should come here only when new_batch has been set
+		 * but no read request has been issued or if it is a forced
+		 * expiry.
+		 *
+		 * In both the cases, new batch has not started yet so
+		 * allocate full batch length for next scheduling opportunity.
+		 * We don't do write batch size adjustment in hierarchical
+		 * AS so that should not be an issue.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->new_batch = 0;
+		return;
+	}
+
+	/* Save how much time is left before current batch expires */
+	if (as_batch_expired(ad, asq))
+		asq->current_batch_time_left = 0;
+	else {
+		asq->current_batch_time_left = ad->current_batch_expires
+							- jiffies;
+		BUG_ON((asq->current_batch_time_left) < 0);
+	}
+}
+
+/*
+ * FIXME: In the original AS, the read batch's time accounting started only
+ * after the first request had completed (if the last batch was a write
+ * batch). But here we might be rescheduling a read batch right away,
+ * irrespective of the state of the disk cache.
+ */
+static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Adjust the batch expire time */
+	if (asq->current_batch_time_left)
+		ad->current_batch_expires = jiffies +
+						asq->current_batch_time_left;
+	/* restore asq batch_data_dir info */
+	ad->batch_data_dir = asq->saved_batch_data_dir;
+}
+
+/* ioq has been set. */
+static void as_active_ioq_set(struct request_queue *q, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+	struct as_data *ad = q->elevator->elevator_data;
+
+	as_restore_batch_context(ad, asq);
+}
+
+/*
+ * This is a notification from common layer that it wishes to expire this
+ * io queue. AS decides whether queue can be expired, if yes, it also
+ * saves the batch context.
+ */
+static int as_expire_ioq(struct request_queue *q, void *sched_queue,
+				int slice_expired, int force)
+{
+	struct as_data *ad = q->elevator->elevator_data;
+	int status = ad->antic_status;
+	struct as_queue *asq = sched_queue;
+
+	/* Forced expiry. We don't have a choice */
+	if (force) {
+		as_antic_stop(ad);
+		as_save_batch_context(ad, asq);
+		return 1;
+	}
+
+	/*
+	 * We are waiting for requests to finish from last
+	 * batch. Don't expire the queue now
+	 */
+	if (ad->changed_batch)
+		goto keep_queue;
+
+	/*
+	 * Wait for all requests from existing batch to finish before we
+	 * switch the queue. New queue might change the batch direction
+	 * and this is to be consistent with the AS philosophy of not
+	 * dispatching new requests to the underlying drive till requests
+	 * from the previous batch are completed.
+	 */
+	if (ad->nr_dispatched)
+		goto keep_queue;
+
+	/*
+	 * If AS anticipation is ON, stop it if slice expired, otherwise
+	 * keep the queue.
+	 */
+	if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
+		if (slice_expired)
+			as_antic_stop(ad);
+		else
+			/*
+			 * We are anticipating and time slice has not expired
+			 * so I would rather prefer waiting than break the
+			 * anticipation and expire the queue.
+			 */
+			goto keep_queue;
+	}
+
+	/* We are good to expire the queue. Save batch context */
+	as_save_batch_context(ad, asq);
+	return 1;
+
+keep_queue:
+	return 0;
+}
+#endif
 
 /*
  * IO Context helper functions
@@ -808,6 +955,7 @@ static void as_update_rq(struct as_data *ad, struct request *rq)
 	}
 }
 
+#ifndef CONFIG_IOSCHED_AS_HIER
 /*
  * Gathers timings and resizes the write batch automatically
  */
@@ -836,6 +984,7 @@ static void update_write_batch(struct as_data *ad)
 	if (asq->write_batch_count < 1)
 		asq->write_batch_count = 1;
 }
+#endif /* !CONFIG_IOSCHED_AS_HIER */
 
 /*
  * as_completed_request is to be called when a request has completed and
@@ -870,7 +1019,26 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	 * and writeback caches
 	 */
 	if (ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)) {
+#ifndef CONFIG_IOSCHED_AS_HIER
+		/*
+		 * Dynamic updating of the write batch length is disabled
+		 * for hierarchical scheduling. It is difficult to do
+		 * accurate accounting when a queue switch can take place
+		 * in the middle of a batch.
+		 *
+		 * Say, A, B are two groups. Following is the sequence of
+		 * events.
+		 *
+		 * Servicing Write batch of A.
+		 * Queue switch takes place and write batch of B starts.
+		 * Batch switch takes place and read batch of B starts.
+		 *
+		 * In the above scenario, writes issued in the write batch
+		 * of A might impact the write batch length of B, which is
+		 * not good.
+		 */
 		update_write_batch(ad);
+#endif
 		ad->current_batch_expires = jiffies +
 				ad->batch_expire[REQ_SYNC];
 		ad->new_batch = 0;
@@ -1517,8 +1685,14 @@ static struct elevator_type iosched_as = {
 		.trim =				as_trim,
 		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
 		.elevator_free_sched_queue_fn = as_free_as_queue,
+#ifdef CONFIG_IOSCHED_AS_HIER
+		.elevator_expire_ioq_fn =       as_expire_ioq,
+		.elevator_active_ioq_set_fn =   as_active_ioq_set,
 	},
-
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
+#else
+	},
+#endif
 	.elevator_attrs = as_attrs,
 	.elevator_name = "anticipatory",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 172f9e3..df53418 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -28,6 +28,9 @@ static struct kmem_cache *elv_ioq_pool;
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
 void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+					int force);
+
 /* Mainly the BFQ scheduling code Follows */
 #ifdef CONFIG_GROUP_IOSCHED
 #define for_each_entity(entity)	\
@@ -1915,6 +1918,9 @@ static void elv_ioq_update_idle_window(struct elevator_queue *eq,
 	int old_idle, enable_idle;
 	struct elv_fq_data *efqd = ioq->efqd;
 
+	/* If idling is disabled from ioscheduler, return */
+	if (!elv_gen_idling_enabled(eq))
+		return;
 	/*
 	 * Don't idle for async or idle io prio class
 	 */
@@ -1984,7 +1990,11 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
 	elv_ioq_set_ioprio(ioq, ioprio);
 	ioq->pid = current->pid;
 	ioq->sched_queue = sched_queue;
-	elv_mark_ioq_idle_window(ioq);
+
+	/* If generic idle logic is enabled, mark it */
+	if (elv_gen_idling_enabled(eq))
+		elv_mark_ioq_idle_window(ioq);
+
 	bfq_init_entity(&ioq->entity, iog);
 	return 0;
 }
@@ -2344,15 +2354,13 @@ int elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
 
 	new_ioq = elv_get_next_ioq(q, 0);
 	if (new_ioq == ioq) {
-		/*
-		 * We might need expire_ioq logic here to check with io
-		 * scheduler if queue can be preempted. This might not
-		 * be need for cfq but AS might need it.
-		 */
-		elv_ioq_slice_expired(q, 0);
-		elv_ioq_set_slice_end(ioq, 0);
-		elv_mark_ioq_slice_new(ioq);
-		return 1;
+		/* Is forced expiry too strong an action here? */
+		if (elv_iosched_expire_ioq(q, 0, 1)) {
+			elv_ioq_slice_expired(q, 0);
+			elv_ioq_set_slice_end(ioq, 0);
+			elv_mark_ioq_slice_new(ioq);
+			return 1;
+		}
 	}
 
 	return 0;
@@ -2499,12 +2507,44 @@ void elv_free_idle_ioq_list(struct elevator_queue *e)
 		elv_deactivate_ioq(efqd, ioq, 0);
 }
 
+/*
+ * Call iosched to tell it that the elevator wants to expire the queue. This
+ * gives an iosched like AS a chance to say no (if it is in the middle of a
+ * batch changeover or it is anticipating). It also allows the iosched to do
+ * some housekeeping.
+ *
+ * force--> this is a forced dispatch and the iosched must clean up its state.
+ * 	     This is useful when the elevator wants to drain the iosched and
+ * 	     expire the current active queue.
+ *
+ * slice_expired--> if 1, the ioq slice expired, hence the elevator fair
+ * 		    queuing logic wants to switch the queue. The iosched should
+ * 		    allow that unless it really cannot. Currently AS can deny
+ * 		    the switch if it is in the middle of a batch switch.
+ *
+ * 		    if 0, time slice is still remaining. It is up to the iosched
+ * 		    whether it wants to keep waiting on this queue or expire it
+ * 		    and move on to the next queue.
+ *
+ */
+int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+					int force)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (e->ops->elevator_expire_ioq_fn)
+		return e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
+							slice_expired, force);
+
+	return 1;
+}
+
 /* Common layer function to select the next queue to dispatch from */
 void *elv_fq_select_ioq(struct request_queue *q, int force)
 {
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 	struct io_queue *ioq = elv_active_ioq(q->elevator);
-	int budget_update = 1;
+	int slice_expired = 1, budget_update = 1;
 
 	if (!elv_nr_busy_ioq(q->elevator))
 		return NULL;
@@ -2571,8 +2611,14 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 		goto keep_queue;
 	}
 
+	slice_expired = 0;
 expire:
-	elv_ioq_slice_expired(q, budget_update);
+	if (elv_iosched_expire_ioq(q, slice_expired, force))
+		elv_ioq_slice_expired(q, budget_update);
+	else {
+		ioq = NULL;
+		goto keep_queue;
+	}
 new_queue:
 	ioq = elv_set_active_ioq(q);
 keep_queue:
@@ -2696,9 +2742,10 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 			elv_ioq_set_prio_slice(q, ioq);
 			elv_clear_ioq_slice_new(ioq);
 		}
-		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
-			elv_ioq_slice_expired(q, 1);
-		else if (sync && !ioq->nr_queued)
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq)) {
+			if (elv_iosched_expire_ioq(q, 1, 0))
+				elv_ioq_slice_expired(q, 1);
+		} else if (sync && !ioq->nr_queued)
 			elv_ioq_arm_slice_timer(q);
 	}
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 8cee877..9b5c9b9 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -40,6 +40,7 @@ typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
 						struct request*);
 typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
 						struct request*);
+typedef int (elevator_expire_ioq_fn) (struct request_queue*, void *, int, int);
 #endif
 
 struct elevator_ops
@@ -78,6 +79,7 @@ struct elevator_ops
 	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
 	elevator_should_preempt_fn *elevator_should_preempt_fn;
 	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+	elevator_expire_ioq_fn  *elevator_expire_ioq_fn;
 #endif
 };
 
@@ -248,6 +250,9 @@ enum {
 /* iosched maintains only single ioq per group.*/
 #define ELV_IOSCHED_SINGLE_IOQ        2
 
+/* iosched does not need anticipation/idling logic support from common layer */
+#define ELV_IOSCHED_DONT_IDLE	4
+
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
@@ -258,6 +263,12 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
 }
 
+/* returns 1 if elevator layer should enable its idling logic, 0 otherwise */
+static inline int elv_gen_idling_enabled(struct elevator_queue *e)
+{
+	return !((e->elevator_type->elevator_features) & ELV_IOSCHED_DONT_IDLE);
+}
+
 #else /* ELV_IOSCHED_FAIR_QUEUING */
 
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
@@ -270,6 +281,11 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
 	return 0;
 }
 
+static inline int elv_gen_idling_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH 10/10] anticipatory changes for hierarchical fair queuing
@ 2009-03-12  1:56     ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12  1:56 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers
  Cc: vgoyal, akpm, menage, peterz

This patch changes the anticipatory scheduler to use the queue scheduling code
from the elevator layer. One can go back to the old AS behaviour by deselecting
CONFIG_IOSCHED_AS_HIER.

TODO/Issues
===========
- The AS anticipation logic does not seem to be sufficient to provide a BW
  difference when two "dd" threads run in two different cgroups. Needs to be
  looked into.

- The AS write batch size (number of requests) adjustment happens upon every
  W->R batch direction switch. This automatic adjustment depends on how much
  time a read takes after a W->R switch.

  This does not gel very well when hierarchical scheduling is enabled and
  every io group can have its own read/write batch. If io group switching
  takes place, it creates issues.

  Currently I have disabled write batch length adjustment in hierarchical
  mode.

- Currently performance seems to be very bad in hierarchical mode. Needs
  to be looked into.

- I think the whole idea of the common layer doing time slice switching
  between queues, with each queue in turn running timed batches, is not very
  good. Maybe AS could maintain two queues (one for READs and the other for
  WRITEs) and let the common layer do the time slice switching between these
  two queues. A minimal skeleton of the new expiry hook is sketched below.
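
As a reference for other single-queue schedulers, the minimal shape of the new
elevator_expire_ioq_fn hook is sketched below (a hypothetical "foo" scheduler
with made-up helpers; AS's real implementation, with its batch bookkeeping, is
in the as-iosched.c hunk that follows). Returning 1 lets the common layer
expire the active queue, returning 0 keeps it:

	static int foo_expire_ioq(struct request_queue *q, void *sched_queue,
					int slice_expired, int force)
	{
		struct foo_data *fd = q->elevator->elevator_data;
		struct foo_queue *fq = sched_queue;

		/* forced expiry: the elevator is draining, we must comply */
		if (force) {
			foo_save_context(fd, fq);	/* hypothetical helper */
			return 1;
		}

		/* refuse while requests from the current batch are in flight */
		if (fd->nr_dispatched)
			return 0;

		/* otherwise allow the switch after saving per-queue state */
		foo_save_context(fd, fq);
		return 1;
	}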

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   12 +++
 block/as-iosched.c       |  176 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.c      |   77 ++++++++++++++++----
 include/linux/elevator.h |   16 ++++
 4 files changed, 265 insertions(+), 16 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3a9e7d7..77fc786 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,6 +45,18 @@ config IOSCHED_AS
 	  deadline I/O scheduler, it can also be slower in some cases
 	  especially some database loads.
 
+config IOSCHED_AS_HIER
+	bool "Anticipatory Hierarchical Scheduling support"
+	depends on IOSCHED_AS && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in anticipatory. In this mode
+	  anticipatory keeps one IO queue per cgroup instead of a global
+	  queue. The elevator fair queuing logic ensures fairness among the
+	  queues.
+
 config IOSCHED_DEADLINE
 	tristate "Deadline I/O scheduler"
 	default y
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 6d2890c..27c14a7 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -87,6 +87,19 @@ struct as_queue {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+
+	/*
+	 * If an as_queue is switched while a batch is running, then we
+	 * store the time left before current batch will expire
+	 */
+	long current_batch_time_left;
+
+	/*
+	 * batch data dir when queue was scheduled out. This will be used
+	 * to setup ad->batch_data_dir when queue is scheduled in.
+	 */
+	int saved_batch_data_dir;
+
 	unsigned long last_check_fifo[2];
 	int write_batch_count;		/* max # of reqs in a write batch */
 	int current_write_count;	/* how many requests left this batch */
@@ -153,6 +166,140 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
 
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq);
 static void as_antic_stop(struct as_data *ad);
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
+
+#ifdef CONFIG_IOSCHED_AS_HIER
+static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Save batch data dir */
+	asq->saved_batch_data_dir = ad->batch_data_dir;
+
+	if (ad->changed_batch) {
+		/*
+		 * In case of force expire, we come here. Batch changeover
+		 * has been signalled but we are waiting for all the
+		 * requests to finish from the previous batch before starting
+		 * the new batch. Can't wait now. Mark that full batch time
+		 * needs to be allocated when this queue is scheduled again.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->changed_batch = 0;
+		return;
+	}
+
+	if (ad->new_batch) {
+		/*
+		 * We should come here only when new_batch has been set
+		 * but no read request has been issued or if it is a forced
+		 * expiry.
+		 *
+		 * In both the cases, new batch has not started yet so
+		 * allocate full batch length for next scheduling opportunity.
+		 * We don't do write batch size adjustment in hierarchical
+		 * AS so that should not be an issue.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->new_batch = 0;
+		return;
+	}
+
+	/* Save how much time is left before current batch expires */
+	if (as_batch_expired(ad, asq))
+		asq->current_batch_time_left = 0;
+	else {
+		asq->current_batch_time_left = ad->current_batch_expires
+							- jiffies;
+		BUG_ON((asq->current_batch_time_left) < 0);
+	}
+}
+
+/*
+ * FIXME: In original AS, a read batch's time accounting started only after
+ * the first request had completed (if the last batch was a write batch). But
+ * here we might be rescheduling a read batch right away, irrespective of the
+ * disk cache state.
+ */
+static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Adjust the batch expire time */
+	if (asq->current_batch_time_left)
+		ad->current_batch_expires = jiffies +
+						asq->current_batch_time_left;
+	/* restore asq batch_data_dir info */
+	ad->batch_data_dir = asq->saved_batch_data_dir;
+}
+
+/* ioq has been set. */
+static void as_active_ioq_set(struct request_queue *q, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+	struct as_data *ad = q->elevator->elevator_data;
+
+	as_restore_batch_context(ad, asq);
+}
+
+/*
+ * This is a notification from common layer that it wishes to expire this
+ * io queue. AS decides whether queue can be expired, if yes, it also
+ * saves the batch context.
+ */
+static int as_expire_ioq(struct request_queue *q, void *sched_queue,
+				int slice_expired, int force)
+{
+	struct as_data *ad = q->elevator->elevator_data;
+	int status = ad->antic_status;
+	struct as_queue *asq = sched_queue;
+
+	/* Forced expiry. We don't have a choice */
+	if (force) {
+		as_antic_stop(ad);
+		as_save_batch_context(ad, asq);
+		return 1;
+	}
+
+	/*
+	 * We are waiting for requests to finish from last
+	 * batch. Don't expire the queue now
+	 */
+	if (ad->changed_batch)
+		goto keep_queue;
+
+	/*
+	 * Wait for all requests from existing batch to finish before we
+	 * switch the queue. New queue might change the batch direction
+	 * and this is to be consistent with AS philosophy of not dispatching
+	 * new requests to the underlying drive till requests from the
+	 * previous batch are completed.
+	 */
+	if (ad->nr_dispatched)
+		goto keep_queue;
+
+	/*
+	 * If AS anticipation is ON, stop it if slice expired, otherwise
+	 * keep the queue.
+	 */
+	if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
+		if (slice_expired)
+			as_antic_stop(ad);
+		else
+			/*
+			 * We are anticipating and time slice has not expired
+			 * so we would rather wait than break the
+			 * anticipation and expire the queue.
+			 */
+			goto keep_queue;
+	}
+
+	/* We are good to expire the queue. Save batch context */
+	as_save_batch_context(ad, asq);
+	return 1;
+
+keep_queue:
+	return 0;
+}
+#endif
 
 /*
  * IO Context helper functions
@@ -808,6 +955,7 @@ static void as_update_rq(struct as_data *ad, struct request *rq)
 	}
 }
 
+#ifndef CONFIG_IOSCHED_AS_HIER
 /*
  * Gathers timings and resizes the write batch automatically
  */
@@ -836,6 +984,7 @@ static void update_write_batch(struct as_data *ad)
 	if (asq->write_batch_count < 1)
 		asq->write_batch_count = 1;
 }
+#endif /* !CONFIG_IOSCHED_AS_HIER */
 
 /*
  * as_completed_request is to be called when a request has completed and
@@ -870,7 +1019,26 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	 * and writeback caches
 	 */
 	if (ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)) {
+#ifndef CONFIG_IOSCHED_AS_HIER
+		/*
+		 * Dynamic updating of write batch length is disabled
+		 * for hierarchical scheduling. It is difficult to do
+		 * accurate accounting when queue switch can take place
+		 * in the middle of the batch.
+		 *
+		 * Say, A, B are two groups. Following is the sequence of
+		 * events.
+		 *
+		 * Servicing Write batch of A.
+		 * Queue switch takes place and write batch of B starts.
+		 * Batch switch takes place and read batch of B starts.
+		 *
+		 * In above scenario, writes issued in write batch of A
+		 * might impact the write batch length of B, which is not
+		 * good.
+		 */
 		update_write_batch(ad);
+#endif
 		ad->current_batch_expires = jiffies +
 				ad->batch_expire[REQ_SYNC];
 		ad->new_batch = 0;
@@ -1517,8 +1685,14 @@ static struct elevator_type iosched_as = {
 		.trim =				as_trim,
 		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
 		.elevator_free_sched_queue_fn = as_free_as_queue,
+#ifdef CONFIG_IOSCHED_AS_HIER
+		.elevator_expire_ioq_fn =       as_expire_ioq,
+		.elevator_active_ioq_set_fn =   as_active_ioq_set,
 	},
-
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
+#else
+	},
+#endif
 	.elevator_attrs = as_attrs,
 	.elevator_name = "anticipatory",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 172f9e3..df53418 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -28,6 +28,9 @@ static struct kmem_cache *elv_ioq_pool;
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
 void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+					int force);
+
 /* Mainly the BFQ scheduling code Follows */
 #ifdef CONFIG_GROUP_IOSCHED
 #define for_each_entity(entity)	\
@@ -1915,6 +1918,9 @@ static void elv_ioq_update_idle_window(struct elevator_queue *eq,
 	int old_idle, enable_idle;
 	struct elv_fq_data *efqd = ioq->efqd;
 
+	/* If idling is disabled from ioscheduler, return */
+	if (!elv_gen_idling_enabled(eq))
+		return;
 	/*
 	 * Don't idle for async or idle io prio class
 	 */
@@ -1984,7 +1990,11 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
 	elv_ioq_set_ioprio(ioq, ioprio);
 	ioq->pid = current->pid;
 	ioq->sched_queue = sched_queue;
-	elv_mark_ioq_idle_window(ioq);
+
+	/* If generic idle logic is enabled, mark it */
+	if (elv_gen_idling_enabled(eq))
+		elv_mark_ioq_idle_window(ioq);
+
 	bfq_init_entity(&ioq->entity, iog);
 	return 0;
 }
@@ -2344,15 +2354,13 @@ int elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
 
 	new_ioq = elv_get_next_ioq(q, 0);
 	if (new_ioq == ioq) {
-		/*
-		 * We might need expire_ioq logic here to check with io
-		 * scheduler if queue can be preempted. This might not
-		 * be need for cfq but AS might need it.
-		 */
-		elv_ioq_slice_expired(q, 0);
-		elv_ioq_set_slice_end(ioq, 0);
-		elv_mark_ioq_slice_new(ioq);
-		return 1;
+		/* Is forced expire too strong an action here? */
+		if (elv_iosched_expire_ioq(q, 0, 1)) {
+			elv_ioq_slice_expired(q, 0);
+			elv_ioq_set_slice_end(ioq, 0);
+			elv_mark_ioq_slice_new(ioq);
+			return 1;
+		}
 	}
 
 	return 0;
@@ -2499,12 +2507,44 @@ void elv_free_idle_ioq_list(struct elevator_queue *e)
 		elv_deactivate_ioq(efqd, ioq, 0);
 }
 
+/*
+ * Call iosched to let it know that the elevator wants to expire the queue.
+ * This gives an iosched like AS a chance to say no (if it is in the middle
+ * of a batch changeover or is anticipating). It also allows housekeeping.
+ *
+ * force--> it is a forced dispatch and iosched must clean up its state. This
+ * 	     is useful when the elevator wants to drain iosched and wants to
+ * 	     expire the current active queue.
+ *
+ * slice_expired--> if 1, the ioq slice expired, hence the elevator fair
+ * 		    queuing logic wants to switch the queue. iosched should
+ * 		    allow that unless it really has to deny it. Currently AS
+ * 		    can deny the switch if in the middle of a batch switch.
+ *
+ * 		    if 0, time slice is still remaining. It is up to the iosched
+ * 		    whether it wants to wait on this queue or just wants to
+ * 		    expire it and move on to the next queue.
+ *
+ */
+int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+					int force)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (e->ops->elevator_expire_ioq_fn)
+		return e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
+							slice_expired, force);
+
+	return 1;
+}
+
 /* Common layer function to select the next queue to dispatch from */
 void *elv_fq_select_ioq(struct request_queue *q, int force)
 {
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 	struct io_queue *ioq = elv_active_ioq(q->elevator);
-	int budget_update = 1;
+	int slice_expired = 1, budget_update = 1;
 
 	if (!elv_nr_busy_ioq(q->elevator))
 		return NULL;
@@ -2571,8 +2611,14 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 		goto keep_queue;
 	}
 
+	slice_expired = 0;
 expire:
-	elv_ioq_slice_expired(q, budget_update);
+	if (elv_iosched_expire_ioq(q, slice_expired, force))
+		elv_ioq_slice_expired(q, budget_update);
+	else {
+		ioq = NULL;
+		goto keep_queue;
+	}
 new_queue:
 	ioq = elv_set_active_ioq(q);
 keep_queue:
@@ -2696,9 +2742,10 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 			elv_ioq_set_prio_slice(q, ioq);
 			elv_clear_ioq_slice_new(ioq);
 		}
-		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
-			elv_ioq_slice_expired(q, 1);
-		else if (sync && !ioq->nr_queued)
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq)) {
+			if (elv_iosched_expire_ioq(q, 1, 0))
+				elv_ioq_slice_expired(q, 1);
+		} else if (sync && !ioq->nr_queued)
 			elv_ioq_arm_slice_timer(q);
 	}
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 8cee877..9b5c9b9 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -40,6 +40,7 @@ typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
 						struct request*);
 typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
 						struct request*);
+typedef int (elevator_expire_ioq_fn) (struct request_queue*, void *, int, int);
 #endif
 
 struct elevator_ops
@@ -78,6 +79,7 @@ struct elevator_ops
 	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
 	elevator_should_preempt_fn *elevator_should_preempt_fn;
 	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+	elevator_expire_ioq_fn  *elevator_expire_ioq_fn;
 #endif
 };
 
@@ -248,6 +250,9 @@ enum {
 /* iosched maintains only single ioq per group.*/
 #define ELV_IOSCHED_SINGLE_IOQ        2
 
+/* iosched does not need anticipation/idling logic support from common layer */
+#define ELV_IOSCHED_DONT_IDLE	4
+
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
@@ -258,6 +263,12 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
 }
 
+/* returns 1 if elevator layer should enable its idling logic, 0 otherwise */
+static inline int elv_gen_idling_enabled(struct elevator_queue *e)
+{
+	return !((e->elevator_type->elevator_features) & ELV_IOSCHED_DONT_IDLE);
+}
+
 #else /* ELV_IOSCHED_FAIR_QUEUING */
 
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
@@ -270,6 +281,11 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
 	return 0;
 }
 
+static inline int elv_gen_idling_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found] ` <1236823015-4183-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2009-03-12  1:56     ` Vivek Goyal
@ 2009-03-12  3:27   ` Takuya Yoshikawa
  2009-04-02  6:39   ` Gui Jianfeng
                     ` (2 subsequent siblings)
  13 siblings, 0 replies; 190+ messages in thread
From: Takuya Yoshikawa @ 2009-03-12  3:27 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Hi Vivek,

Could you tell me to which kernel I can apply your patches?
   # latest mm?
I would like to test your controller.

Thank you,
   Takuya Yoshikawa


Vivek Goyal wrote:
> 
> Hi All,
> 
> Here is another posting for IO controller patches. Last time I had posted
> RFC patches for an IO controller which did bio control per cgroup.
> 
> http://lkml.org/lkml/2008/11/6/227
> 
> One of the takeaway from the discussion in this thread was that let us
> implement a common layer which contains the proportional weight scheduling
> code which can be shared by all the IO schedulers.
> 
> Implementing IO controller will not cover the devices which don't use
> IO schedulers but it should cover the common case.
> 
> There were more discussions regarding 2 level vs 1 level IO control at
> following link.
> 
> https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
> 
> So in the mean time we took the discussion off the list and spent time on
> making the 1 level control apporoach work where majority of the proportional
> weight control is shared by the four schedulers instead of each one having
> to replicate the code. We make use of BFQ code for fair queuing as posted
> by Paolo and Fabio here.
> 
> http://lkml.org/lkml/2008/11/11/148
> 
> Details about design and howto have been put in documentation patch.
> 
> I have done very basic testing of running 2 or 3 "dd" threads in different
> cgroups. Wanted to get the patchset out for feedback/review before we dive
> into more bug fixing, benchmarking, optimizations etc.
> 
> Your feedback/comments are welcome.
> 
> Patch series contains 10 patches. It should be compilable and bootable after
> every patch. Intial 2 patches implement flat fair queuing (no cgroup
> support) and make cfq to use that. Later patches introduce hierarchical
> fair queuing support in elevator layer and modify other IO schdulers to use
> that.
> 
> Thanks
> Vivek
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
> 

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-03-12  1:56 ` Vivek Goyal
                   ` (3 preceding siblings ...)
  (?)
@ 2009-03-12  3:27 ` Takuya Yoshikawa
  2009-03-12  6:40   ` anqin
       [not found]   ` <49B8810B.7030603-gVGce1chcLdL9jVzuh4AOg@public.gmane.org>
  -1 siblings, 2 replies; 190+ messages in thread
From: Takuya Yoshikawa @ 2009-03-12  3:27 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers, akpm, menage

Hi Vivek,

Could you tell me to which kernel I can apply your patches?
   # latest mm?
I would like to test your controller.

Thank you,
   Takuya Yoshikawa


Vivek Goyal wrote:
> 
> Hi All,
> 
> Here is another posting for IO controller patches. Last time I had posted
> RFC patches for an IO controller which did bio control per cgroup.
> 
> http://lkml.org/lkml/2008/11/6/227
> 
> One of the takeaway from the discussion in this thread was that let us
> implement a common layer which contains the proportional weight scheduling
> code which can be shared by all the IO schedulers.
> 
> Implementing IO controller will not cover the devices which don't use
> IO schedulers but it should cover the common case.
> 
> There were more discussions regarding 2 level vs 1 level IO control at
> following link.
> 
> https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
> 
> So in the mean time we took the discussion off the list and spent time on
> making the 1 level control apporoach work where majority of the proportional
> weight control is shared by the four schedulers instead of each one having
> to replicate the code. We make use of BFQ code for fair queuing as posted
> by Paolo and Fabio here.
> 
> http://lkml.org/lkml/2008/11/11/148
> 
> Details about design and howto have been put in documentation patch.
> 
> I have done very basic testing of running 2 or 3 "dd" threads in different
> cgroups. Wanted to get the patchset out for feedback/review before we dive
> into more bug fixing, benchmarking, optimizations etc.
> 
> Your feedback/comments are welcome.
> 
> Patch series contains 10 patches. It should be compilable and bootable after
> every patch. Intial 2 patches implement flat fair queuing (no cgroup
> support) and make cfq to use that. Later patches introduce hierarchical
> fair queuing support in elevator layer and modify other IO schdulers to use
> that.
> 
> Thanks
> Vivek
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
> 


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found]   ` <49B8810B.7030603-gVGce1chcLdL9jVzuh4AOg@public.gmane.org>
@ 2009-03-12  6:40     ` anqin
  2009-03-12 13:43       ` Vivek Goyal
  1 sibling, 0 replies; 190+ messages in thread
From: anqin @ 2009-03-12  6:40 UTC (permalink / raw)
  To: Takuya Yoshikawa, Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Hi Vivek,

It would be very appreciated if the patches can be based on 2.6.28.

Thanks a lot.

Anqin

On Thu, Mar 12, 2009 at 11:27 AM, Takuya Yoshikawa
<yoshikawa.takuya-gVGce1chcLdL9jVzuh4AOg@public.gmane.org> wrote:
> Hi Vivek,
>
> Could you tell me to which kernel I can apply your patches?
>   # latest mm?
> I would like to test your controller.
>
> Thank you,
>   Takuya Yoshikawa
>
>
> Vivek Goyal wrote:
>>
>> Hi All,
>>
>> Here is another posting for IO controller patches. Last time I had posted
>> RFC patches for an IO controller which did bio control per cgroup.
>>
>> http://lkml.org/lkml/2008/11/6/227
>>
>> One of the takeaway from the discussion in this thread was that let us
>> implement a common layer which contains the proportional weight scheduling
>> code which can be shared by all the IO schedulers.
>>
>> Implementing IO controller will not cover the devices which don't use
>> IO schedulers but it should cover the common case.
>>
>> There were more discussions regarding 2 level vs 1 level IO control at
>> following link.
>>
>> https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
>>
>> So in the mean time we took the discussion off the list and spent time on
>> making the 1 level control apporoach work where majority of the proportional
>> weight control is shared by the four schedulers instead of each one having
>> to replicate the code. We make use of BFQ code for fair queuing as posted
>> by Paolo and Fabio here.
>>
>> http://lkml.org/lkml/2008/11/11/148
>>
>> Details about design and howto have been put in documentation patch.
>>
>> I have done very basic testing of running 2 or 3 "dd" threads in different
>> cgroups. Wanted to get the patchset out for feedback/review before we dive
>> into more bug fixing, benchmarking, optimizations etc.
>>
>> Your feedback/comments are welcome.
>>
>> Patch series contains 10 patches. It should be compilable and bootable after
>> every patch. Intial 2 patches implement flat fair queuing (no cgroup
>> support) and make cfq to use that. Later patches introduce hierarchical
>> fair queuing support in elevator layer and modify other IO schdulers to use
>> that.
>>
>> Thanks
>> Vivek
>> _______________________________________________
>> Containers mailing list
>> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
>> https://lists.linux-foundation.org/mailman/listinfo/containers
>>
>
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-03-12  3:27 ` [RFC] IO Controller Takuya Yoshikawa
@ 2009-03-12  6:40   ` anqin
       [not found]     ` <d95d44a20903112340s3c77807dt465e68901747ad89-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-03-12  6:55     ` Li Zefan
       [not found]   ` <49B8810B.7030603-gVGce1chcLdL9jVzuh4AOg@public.gmane.org>
  1 sibling, 2 replies; 190+ messages in thread
From: anqin @ 2009-03-12  6:40 UTC (permalink / raw)
  To: Takuya Yoshikawa, Vivek Goyal
  Cc: oz-kernel, paolo.valente, linux-kernel, dhaval, containers,
	menage, jmoyer, fchecconi, arozansk, jens.axboe, akpm, fernando,
	balbir

Hi Vivek,

It would be very appreciated if the patches can be based on 2.6.28.

Thanks a lot.

Anqin

On Thu, Mar 12, 2009 at 11:27 AM, Takuya Yoshikawa
<yoshikawa.takuya@oss.ntt.co.jp> wrote:
> Hi Vivek,
>
> Could you tell me to which kernel I can apply your patches?
>   # latest mm?
> I would like to test your controller.
>
> Thank you,
>   Takuya Yoshikawa
>
>
> Vivek Goyal wrote:
>>
>> Hi All,
>>
>> Here is another posting for IO controller patches. Last time I had posted
>> RFC patches for an IO controller which did bio control per cgroup.
>>
>> http://lkml.org/lkml/2008/11/6/227
>>
>> One of the takeaway from the discussion in this thread was that let us
>> implement a common layer which contains the proportional weight scheduling
>> code which can be shared by all the IO schedulers.
>>
>> Implementing IO controller will not cover the devices which don't use
>> IO schedulers but it should cover the common case.
>>
>> There were more discussions regarding 2 level vs 1 level IO control at
>> following link.
>>
>> https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
>>
>> So in the mean time we took the discussion off the list and spent time on
>> making the 1 level control apporoach work where majority of the proportional
>> weight control is shared by the four schedulers instead of each one having
>> to replicate the code. We make use of BFQ code for fair queuing as posted
>> by Paolo and Fabio here.
>>
>> http://lkml.org/lkml/2008/11/11/148
>>
>> Details about design and howto have been put in documentation patch.
>>
>> I have done very basic testing of running 2 or 3 "dd" threads in different
>> cgroups. Wanted to get the patchset out for feedback/review before we dive
>> into more bug fixing, benchmarking, optimizations etc.
>>
>> Your feedback/comments are welcome.
>>
>> Patch series contains 10 patches. It should be compilable and bootable after
>> every patch. Intial 2 patches implement flat fair queuing (no cgroup
>> support) and make cfq to use that. Later patches introduce hierarchical
>> fair queuing support in elevator layer and modify other IO schdulers to use
>> that.
>>
>> Thanks
>> Vivek
>> _______________________________________________
>> Containers mailing list
>> Containers@lists.linux-foundation.org
>> https://lists.linux-foundation.org/mailman/listinfo/containers
>>
>
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/containers
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found]     ` <d95d44a20903112340s3c77807dt465e68901747ad89-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-03-12  6:55       ` Li Zefan
  2009-03-12 13:46         ` Vivek Goyal
  1 sibling, 0 replies; 190+ messages in thread
From: Li Zefan @ 2009-03-12  6:55 UTC (permalink / raw)
  To: anqin
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

(Please don't top-post...)

anqin wrote:
> Hi Vivek,
> 
> It would be very appreciated if the patches can be based on 2.6.28.
> 

Why? When this is ready to be merged, then it should be based on Jens' block-tree,
or akpm's mm tree. And this version currently is based on 2.6.29-rc4, so if you
want to try it out, just prepare a 2.6.29-rc4 kernel tree.

> Thanks a lot.
> 
> Anqin
> 
> On Thu, Mar 12, 2009 at 11:27 AM, Takuya Yoshikawa
> <yoshikawa.takuya-gVGce1chcLdL9jVzuh4AOg@public.gmane.org> wrote:
>> Hi Vivek,
>>
>> Could you tell me to which kernel I can apply your patches?
>>   # latest mm?
>> I would like to test your controller.
>>
>> Thank you,
>>   Takuya Yoshikawa

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-03-12  6:40   ` anqin
       [not found]     ` <d95d44a20903112340s3c77807dt465e68901747ad89-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-03-12  6:55     ` Li Zefan
  2009-03-12  7:11       ` anqin
       [not found]       ` <49B8B1FB.1040506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  1 sibling, 2 replies; 190+ messages in thread
From: Li Zefan @ 2009-03-12  6:55 UTC (permalink / raw)
  To: anqin
  Cc: Takuya Yoshikawa, Vivek Goyal, oz-kernel, paolo.valente,
	linux-kernel, dhaval, containers, menage, jmoyer, fchecconi,
	arozansk, jens.axboe, akpm, fernando, balbir

(Please don't top-post...)

anqin wrote:
> Hi Vivek,
> 
> It would be very appreciated if the patches can be based on 2.6.28.
> 

Why? When this is ready to be merged, then it should be based on Jens' block-tree,
or akpm's mm tree. And this version currently is based on 2.6.29-rc4, so if you
want to try it out, just prepare a 2.6.29-rc4 kernel tree.

> Thanks a lot.
> 
> Anqin
> 
> On Thu, Mar 12, 2009 at 11:27 AM, Takuya Yoshikawa
> <yoshikawa.takuya@oss.ntt.co.jp> wrote:
>> Hi Vivek,
>>
>> Could you tell me to which kernel I can apply your patches?
>>   # latest mm?
>> I would like to test your controller.
>>
>> Thank you,
>>   Takuya Yoshikawa

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found]       ` <49B8B1FB.1040506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-03-12  7:11         ` anqin
  0 siblings, 0 replies; 190+ messages in thread
From: anqin @ 2009-03-12  7:11 UTC (permalink / raw)
  To: Li Zefan
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

> Why? When this is ready to be merged, then it should be based on Jens' block-tree,
> or akpm's mm tree. And this version currently is based on 2.6.29-rc4, so if you
> want to try it out, just prepare a 2.6.29-rc4 kernel tree.
>

I have checked LKML and saw that these patches (on the web pages) are based on
2.6.27, which seemed too old.

Do you mean that the code now has newer patches based on 2.6.29-rc4?

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-03-12  6:55     ` Li Zefan
@ 2009-03-12  7:11       ` anqin
       [not found]         ` <d95d44a20903120011m4a7281enf17b31b9aaf7c937-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]       ` <49B8B1FB.1040506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  1 sibling, 1 reply; 190+ messages in thread
From: anqin @ 2009-03-12  7:11 UTC (permalink / raw)
  To: Li Zefan
  Cc: Takuya Yoshikawa, Vivek Goyal, oz-kernel, paolo.valente,
	linux-kernel, dhaval, containers, menage, jmoyer, fchecconi,
	arozansk, jens.axboe, akpm, fernando, balbir

> Why? When this is ready to be merged, then it should be based on Jens' block-tree,
> or akpm's mm tree. And this version currently is based on 2.6.29-rc4, so if you
> want to try it out, just prepare a 2.6.29-rc4 kernel tree.
>

I have checked LKML and saw that these patches (on the web pages) are based on
2.6.27, which seemed too old.

Do you mean that the code now has newer patches based on 2.6.29-rc4?

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12  1:56     ` Vivek Goyal
@ 2009-03-12  7:11         ` Andrew Morton
  -1 siblings, 0 replies; 190+ messages in thread
From: Andrew Morton @ 2009-03-12  7:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Wed, 11 Mar 2009 21:56:46 -0400 Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> +Currently "current" task
> +is used to determine the cgroup (hence io group) of the request. Down the
> +line we need to make use of bio-cgroup patches to map delayed writes to
> +right group.

You handled this problem pretty neatly!

It's always been a BIG problem for all the io-controlling schemes, and
most of them seem to have "handled" it in the above way :(

But for many workloads, writeback is the majority of the IO and it has
always been the form of IO which has caused us the worst contention and
latency problems.  So I don't think that we can proceed with _anything_
until we at least have a convincing plan here.




Also..  there are so many IO controller implementations that I've lost
track of who is doing what.  I do have one private report here that
Andreas's controller "is incredibly productive for us and has allowed
us to put twice as many users per server with faster times for all
users".  Which is pretty stunning, although it should be viewed as a
condemnation of the current code, I'm afraid.

So my question is: what is the definitive list of
proposed-io-controller-implementations and how do I cunningly get all
you guys to check each others homework? :)

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
@ 2009-03-12  7:11         ` Andrew Morton
  0 siblings, 0 replies; 190+ messages in thread
From: Andrew Morton @ 2009-03-12  7:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers, menage, peterz

On Wed, 11 Mar 2009 21:56:46 -0400 Vivek Goyal <vgoyal@redhat.com> wrote:

> +Currently "current" task
> +is used to determine the cgroup (hence io group) of the request. Down the
> +line we need to make use of bio-cgroup patches to map delayed writes to
> +right group.

You handled this problem pretty neatly!

It's always been a BIG problem for all the io-controlling schemes, and
most of them seem to have "handled" it in the above way :(

But for many workloads, writeback is the majority of the IO and it has
always been the form of IO which has caused us the worst contention and
latency problems.  So I don't think that we can proceed with _anything_
until we at least have a convincing plan here.




Also..  there are so many IO controller implementations that I've lost
track of who is doing what.  I do have one private report here that
Andreas's controller "is incredibly productive for us and has allowed
us to put twice as many users per server with faster times for all
users".  Which is pretty stunning, although it should be viewed as a
condemnation of the current code, I'm afraid.

So my question is: what is the definitive list of
proposed-io-controller-implementations and how do I cunningly get all
you guys to check each others homework? :)

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12  1:56     ` Vivek Goyal
@ 2009-03-12  7:45         ` Yang Hongyang
  -1 siblings, 0 replies; 190+ messages in thread
From: Yang Hongyang @ 2009-03-12  7:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	menage-hpIqsD4AKlfQT0dZR+AlfA

Don't forget to update the 00-INDEX file when you add a new doc.^!^

Vivek Goyal wrote:
> o Documentation for io-controller.
> 
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
>  1 files changed, 221 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/block/io-controller.txt
> 
> diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
> new file mode 100644
> index 0000000..8884c5a
> --- /dev/null
> +++ b/Documentation/block/io-controller.txt
> @@ -0,0 +1,221 @@
> +				IO Controller
> +				=============
> +
> +Overview
> +========
> +
> +This patchset implements a proportional weight IO controller. That is one
> +can create cgroups and assign prio/weights to those cgroups and task group
> +will get access to disk proportionate to the weight of the group.
> +
> +These patches modify elevator layer and individual IO schedulers to do
> +IO control hence this io controller works only on block devices which use
> +one of the standard io schedulers can not be used with any xyz logical block
> +device.
> +
> +The assumption/thought behind modifying IO scheduler is that resource control
> +is needed only on leaf nodes where the actual contention for resources is
> +present and not on intertermediate logical block devices.
> +
> +Consider following hypothetical scenario. Lets say there are three physical
> +disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
> +created on top of these. Some part of sdb is in lv0 and some part is in lv1.
> +
> +			    lv0      lv1
> +			  /	\  /     \
> +			sda      sdb      sdc
> +
> +Also consider following cgroup hierarchy
> +
> +				root
> +				/   \
> +			       A     B
> +			      / \    / \
> +			     T1 T2  T3  T4
> +
> +A and B are two cgroups and T1, T2, T3 and T4 are tasks with-in those cgroups.
> +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
> +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
> +IO control on intermediate logical block nodes (lv0, lv1).
> +
> +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
> +only, there will not be any contention for resources between group A and B if
> +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
> +IO scheduler associated with the sdb will distribute disk bandwidth to
> +group A and B proportionate to their weight.
> +
> +CFQ already has the notion of fairness and it provides differential disk
> +access based on priority and class of the task. Just that it is flat and
> +with cgroup stuff, it needs to be made hierarchical.
> +
> +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
> +of fairness among various threads.
> +
> +One of the concerns raised with modifying IO schedulers was that we don't
> +want to replicate the code in all the IO schedulers. These patches share
> +the fair queuing code which has been moved to a common layer (elevator
> +layer). Hence we don't end up replicating code across IO schedulers.
> +
> +Design
> +======
> +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
> +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
> +B-WF2Q+ algorithm for fair queuing.
> +
> +Why BFQ?
> +
> +- Not sure if weighted round robin logic of CFQ can be easily extended for
> +  hierarchical mode. One of the things is that we can not keep dividing
> +  the time slice of parent group among children. Deeper we go in hierarchy
> +  time slice will get smaller.
> +
> +  One of the ways to implement hierarchical support could be to keep track
> +  of virtual time and service provided to queue/group and select a queue/group
> +  for service based on any of the various available algorithms.
> +
> +  BFQ already had support for hierarchical scheduling, taking those patches
> +  was easier.
> +
> +- BFQ was designed to provide tighter bounds/delay w.r.t service provided
> +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
> +
> +  Note: BFQ originally used amount of IO done (number of sectors) as notion
> +        of service provided. IOW, it tried to provide fairness in terms of
> +        actual IO done and not in terms of actual time disk access was
> +	given to a queue.
> +
> +	This patchset modified BFQ to provide fairness in time domain because
> +	that's what CFQ does. So idea was try not to deviate too much from
> +	the CFQ behavior initially.
> +
> +	Providing fairness in time domain makes accounting tricky because
> +	due to command queueing, at one time there might be multiple requests
> +	from different queues and there is no easy way to find out how much
> +	disk time actually was consumed by the requests of a particular
> +	queue. More about this in comments in source code.
> +
> +So it is yet to be seen if changing to time domain still retains BFQ guarantees
> +or not.
> +
> +From data structure point of view, one can think of a tree per device, where
> +io groups and io queues are hanging and are being scheduled using B-WF2Q+
> +algorithm. io_queue, is end queue where requests are actually stored and
> +dispatched from (like cfqq).
> +
> +These io queues are primarily created by and managed by end io schedulers
> +depending on its semantics. For example, noop, deadline and AS ioschedulers
> +keep one io queues per cgroup and cfqq keeps one io queue per io_context in
> +a cgroup (apart from async queues).
> +
> +A request is mapped to an io group by elevator layer and which io queue it
> +is mapped to with in group depends on ioscheduler. Currently "current" task
> +is used to determine the cgroup (hence io group) of the request. Down the
> +line we need to make use of bio-cgroup patches to map delayed writes to
> +right group.
> +
> +Going back to old behavior
> +==========================
> +In new scheme of things essentially we are creating hierarchical fair
> +queuing logic in elevator layer and changing IO schedulers to make use of
> +that logic so that end IO schedulers start supporting hierarchical scheduling.
> +
> +Elevator layer continues to support the old interfaces. So even if fair queuing
> +is enabled at elevator layer, one can have both new hierarchical scheduler as
> +well as old non-hierarchical scheduler operating.
> +
> +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> +scheduling is disabled, noop, deadline and AS should retain their existing
> +behavior.
> +
> +CFQ is the only exception where one can not disable fair queuing as it is
> +needed for providing fairness among various threads even in non-hierarchical
> +mode.
> +
> +Various user visible config options
> +===================================
> +CONFIG_IOSCHED_NOOP_HIER
> +	- Enables hierarchical fair queuing in noop. Not selecting this option
> +	  leads to old behavior of noop.
> +
> +CONFIG_IOSCHED_DEADLINE_HIER
> +	- Enables hierarchical fair queuing in deadline. Not selecting this
> +	  option leads to old behavior of deadline.
> +
> +CONFIG_IOSCHED_AS_HIER
> +	- Enables hierarchical fair queuing in AS. Not selecting this option
> +	  leads to old behavior of AS.
> +
> +CONFIG_IOSCHED_CFQ_HIER
> +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
> +	  still does fair queuing among various queues but it is flat and not
> +	  hierarchical.
> +
> +Config options selected automatically
> +=====================================
> +These config options are not user visible and are selected/deselected
> +automatically based on IO scheduler configurations.
> +
> +CONFIG_ELV_FAIR_QUEUING
> +	- Enables/Disables the fair queuing logic at elevator layer.
> +
> +CONFIG_GROUP_IOSCHED
> +	- Enables/Disables hierarchical queuing and associated cgroup bits.
> +
> +TODO
> +====
> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
> +- Convert cgroup ioprio to notion of weight.
> +- Anticipatory code will need more work. It is not working properly currently
> +  and needs more thought.
> +- Use of bio-cgroup patches.
> +- Use of Nauman's per cgroup request descriptor patches.
> +
> +HOWTO
> +=====
> +So far I have done very simple testing of running two dd threads in two
> +different cgroups. Here is what you can do.
> +
> +- Enable hierarchical scheduling in io scheduler of your choice (say cfq).
> +	CONFIG_IOSCHED_CFQ_HIER=y
> +
> +- Compile and boot into kernel and mount IO controller.
> +
> +	mount -t cgroup -o io none /cgroup
> +
> +- Create two cgroups
> +	mkdir -p /cgroup/test1/ /cgroup/test2
> +
> +- Set io priority of group test1 and test2
> +	echo 0 > /cgroup/test1/io.ioprio
> +	echo 4 > /cgroup/test2/io.ioprio
> +
> +- Create two same size files (say 512MB each) on same disk (file1, file2) and
> +  launch two dd threads in different cgroup to read those files. Make sure
> +  right io scheduler is being used for the block device where files are
> +  present (the one you compiled in hierarchical mode).
> +
> +	echo 1 > /proc/sys/vm/drop_caches
> +
> +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> +	echo $! > /cgroup/test1/tasks
> +	cat /cgroup/test1/tasks
> +
> +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> +	echo $! > /cgroup/test2/tasks
> +	cat /cgroup/test2/tasks
> +
> +- First dd should finish first.
> +
> +Some Test Results
> +=================
> +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
> +
> +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> +
> +- Three dd in three cgroups with prio 0, 4, 4.
> +
> +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s


-- 
Regards
Yang Hongyang

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
@ 2009-03-12  7:45         ` Yang Hongyang
  0 siblings, 0 replies; 190+ messages in thread
From: Yang Hongyang @ 2009-03-12  7:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers, akpm, menage, peterz

Don't forget to update the 00-INDEX file when you add a new doc.^!^

Vivek Goyal wrote:
> o Documentation for io-controller.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
>  1 files changed, 221 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/block/io-controller.txt
> 
> diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
> new file mode 100644
> index 0000000..8884c5a
> --- /dev/null
> +++ b/Documentation/block/io-controller.txt
> @@ -0,0 +1,221 @@
> +				IO Controller
> +				=============
> +
> +Overview
> +========
> +
> +This patchset implements a proportional weight IO controller. That is one
> +can create cgroups and assign prio/weights to those cgroups and task group
> +will get access to disk proportionate to the weight of the group.
> +
> +These patches modify elevator layer and individual IO schedulers to do
> +IO control hence this io controller works only on block devices which use
> +one of the standard io schedulers can not be used with any xyz logical block
> +device.
> +
> +The assumption/thought behind modifying IO scheduler is that resource control
> +is needed only on leaf nodes where the actual contention for resources is
> +present and not on intertermediate logical block devices.
> +
> +Consider following hypothetical scenario. Lets say there are three physical
> +disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
> +created on top of these. Some part of sdb is in lv0 and some part is in lv1.
> +
> +			    lv0      lv1
> +			  /	\  /     \
> +			sda      sdb      sdc
> +
> +Also consider following cgroup hierarchy
> +
> +				root
> +				/   \
> +			       A     B
> +			      / \    / \
> +			     T1 T2  T3  T4
> +
> +A and B are two cgroups and T1, T2, T3 and T4 are tasks with-in those cgroups.
> +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
> +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
> +IO control on intermediate logical block nodes (lv0, lv1).
> +
> +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
> +only, there will not be any contention for resources between group A and B if
> +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
> +IO scheduler associated with the sdb will distribute disk bandwidth to
> +group A and B proportionate to their weight.
> +
> +CFQ already has the notion of fairness and it provides differential disk
> +access based on priority and class of the task. Just that it is flat and
> +with cgroup stuff, it needs to be made hierarchical.
> +
> +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
> +of fairness among various threads.
> +
> +One of the concerns raised with modifying IO schedulers was that we don't
> +want to replicate the code in all the IO schedulers. These patches share
> +the fair queuing code which has been moved to a common layer (elevator
> +layer). Hence we don't end up replicating code across IO schedulers.
> +
> +Design
> +======
> +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
> +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
> +B-WF2Q+ algorithm for fair queuing.
> +
> +Why BFQ?
> +
> +- Not sure if weighted round robin logic of CFQ can be easily extended for
> +  hierarchical mode. One of the things is that we can not keep dividing
> +  the time slice of parent group among children. Deeper we go in hierarchy
> +  time slice will get smaller.
> +
> +  One of the ways to implement hierarchical support could be to keep track
> +  of virtual time and service provided to queue/group and select a queue/group
> +  for service based on any of the various available algorithms.
> +
> +  BFQ already had support for hierarchical scheduling, taking those patches
> +  was easier.
> +
> +- BFQ was designed to provide tighter bounds/delay w.r.t service provided
> +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
> +
> +  Note: BFQ originally used amount of IO done (number of sectors) as notion
> +        of service provided. IOW, it tried to provide fairness in terms of
> +        actual IO done and not in terms of actual time disk access was
> +	given to a queue.
> +
> +	This patchset modified BFQ to provide fairness in time domain because
> +	that's what CFQ does. So idea was try not to deviate too much from
> +	the CFQ behavior initially.
> +
> +	Providing fairness in time domain makes accounting tricky because
> +	due to command queueing, at one time there might be multiple requests
> +	from different queues and there is no easy way to find out how much
> +	disk time actually was consumed by the requests of a particular
> +	queue. More about this in comments in source code.
> +
> +So it is yet to be seen if changing to time domain still retains BFQ guarantees
> +or not.
> +
> +From data structure point of view, one can think of a tree per device, where
> +io groups and io queues are hanging and are being scheduled using B-WF2Q+
> +algorithm. io_queue, is end queue where requests are actually stored and
> +dispatched from (like cfqq).
> +
> +These io queues are primarily created by and managed by end io schedulers
> +depending on its semantics. For example, noop, deadline and AS ioschedulers
> +keep one io queues per cgroup and cfqq keeps one io queue per io_context in
> +a cgroup (apart from async queues).
> +
> +A request is mapped to an io group by elevator layer and which io queue it
> +is mapped to with in group depends on ioscheduler. Currently "current" task
> +is used to determine the cgroup (hence io group) of the request. Down the
> +line we need to make use of bio-cgroup patches to map delayed writes to
> +right group.
> +
> +Going back to old behavior
> +==========================
> +In new scheme of things essentially we are creating hierarchical fair
> +queuing logic in elevator layer and changing IO schedulers to make use of
> +that logic so that end IO schedulers start supporting hierarchical scheduling.
> +
> +Elevator layer continues to support the old interfaces. So even if fair queuing
> +is enabled at elevator layer, one can have both new hierarchical scheduler as
> +well as old non-hierarchical scheduler operating.
> +
> +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> +scheduling is disabled, noop, deadline and AS should retain their existing
> +behavior.
> +
> +CFQ is the only exception where one can not disable fair queuing as it is
> +needed for providing fairness among various threads even in non-hierarchical
> +mode.
> +
> +Various user visible config options
> +===================================
> +CONFIG_IOSCHED_NOOP_HIER
> +	- Enables hierarchical fair queuing in noop. Not selecting this option
> +	  leads to old behavior of noop.
> +
> +CONFIG_IOSCHED_DEADLINE_HIER
> +	- Enables hierarchical fair queuing in deadline. Not selecting this
> +	  option leads to old behavior of deadline.
> +
> +CONFIG_IOSCHED_AS_HIER
> +	- Enables hierarchical fair queuing in AS. Not selecting this option
> +	  leads to old behavior of AS.
> +
> +CONFIG_IOSCHED_CFQ_HIER
> +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
> +	  still does fair queuing among various queues but it is flat and not
> +	  hierarchical.
> +
> +Config options selected automatically
> +=====================================
> +These config options are not user visible and are selected/deselected
> +automatically based on IO scheduler configurations.
> +
> +CONFIG_ELV_FAIR_QUEUING
> +	- Enables/Disables the fair queuing logic at elevator layer.
> +
> +CONFIG_GROUP_IOSCHED
> +	- Enables/Disables hierarchical queuing and associated cgroup bits.
> +
> +TODO
> +====
> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
> +- Convert cgroup ioprio to notion of weight.
> +- Anticipatory code will need more work. It is not working properly currently
> +  and needs more thought.
> +- Use of bio-cgroup patches.
> +- Use of Nauman's per cgroup request descriptor patches.
> +
> +HOWTO
> +=====
> +So far I have done very simple testing of running two dd threads in two
> +different cgroups. Here is what you can do.
> +
> +- Enable hierarchical scheduling in io scheduler of your choice (say cfq).
> +	CONFIG_IOSCHED_CFQ_HIER=y
> +
> +- Compile and boot into kernel and mount IO controller.
> +
> +	mount -t cgroup -o io none /cgroup
> +
> +- Create two cgroups
> +	mkdir -p /cgroup/test1/ /cgroup/test2
> +
> +- Set io priority of group test1 and test2
> +	echo 0 > /cgroup/test1/io.ioprio
> +	echo 4 > /cgroup/test2/io.ioprio
> +
> +- Create two same size files (say 512MB each) on same disk (file1, file2) and
> +  launch two dd threads in different cgroup to read those files. Make sure
> +  right io scheduler is being used for the block device where files are
> +  present (the one you compiled in hierarchical mode).
> +
> +	echo 1 > /proc/sys/vm/drop_caches
> +
> +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> +	echo $! > /cgroup/test1/tasks
> +	cat /cgroup/test1/tasks
> +
> +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> +	echo $! > /cgroup/test2/tasks
> +	cat /cgroup/test2/tasks
> +
> +- First dd should finish first.
> +
> +Some Test Results
> +=================
> +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
> +
> +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> +
> +- Three dd in three cgroups with prio 0, 4, 4.
> +
> +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s


-- 
Regards
Yang Hongyang

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]     ` <1236823015-4183-2-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-03-12  7:11         ` Andrew Morton
  2009-03-12  7:45         ` Yang Hongyang
@ 2009-03-12 10:00       ` Dhaval Giani
  2009-03-12 10:24         ` Peter Zijlstra
  2009-04-06 14:35         ` Balbir Singh
  4 siblings, 0 replies; 190+ messages in thread
From: Dhaval Giani @ 2009-03-12 10:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Wed, Mar 11, 2009 at 09:56:46PM -0400, Vivek Goyal wrote:
> o Documentation for io-controller.
> 
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
>  1 files changed, 221 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/block/io-controller.txt
> 
> diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
> new file mode 100644
> index 0000000..8884c5a
> --- /dev/null
> +++ b/Documentation/block/io-controller.txt
> @@ -0,0 +1,221 @@
> +				IO Controller
> +				=============
> +
> +Overview
> +========
> +
> +This patchset implements a proportional weight IO controller. That is one
> +can create cgroups and assign prio/weights to those cgroups and task group
> +will get access to disk proportionate to the weight of the group.
> +
> +These patches modify elevator layer and individual IO schedulers to do
> +IO control hence this io controller works only on block devices which use
> +one of the standard io schedulers can not be used with any xyz logical block
> +device.
> +
> +The assumption/thought behind modifying the IO schedulers is that resource
> +control is needed only on leaf nodes, where the actual contention for
> +resources is present, and not on intermediate logical block devices.
> +
> +Consider the following hypothetical scenario. Let's say there are three
> +physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1)
> +have been created on top of these. Some part of sdb is in lv0 and some part
> +is in lv1.
> +
> +			    lv0      lv1
> +			  /	\  /     \
> +			sda      sdb      sdc
> +
> +Also consider following cgroup hierarchy
> +
> +				root
> +				/   \
> +			       A     B
> +			      / \    / \
> +			     T1 T2  T3  T4
> +
> +A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
> +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1, these tasks should
> +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
> +IO control on the intermediate logical block nodes (lv0, lv1).
> +
> +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
> +only, there will not be any contention for resources between group A and B if
> +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
> +IO scheduler associated with the sdb will distribute disk bandwidth to
> +group A and B proportionate to their weight.
> +
> +CFQ already has the notion of fairness and it provides differential disk
> +access based on priority and class of the task. Just that it is flat and
> +with cgroup stuff, it needs to be made hierarchical.
> +
> +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
> +of fairness among various threads.
> +
> +One of the concerns raised with modifying IO schedulers was that we don't
> +want to replicate the code in all the IO schedulers. These patches share
> +the fair queuing code which has been moved to a common layer (elevator
> +layer). Hence we don't end up replicating code across IO schedulers.
> +
> +Design
> +======
> +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
> +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
> +B-WF2Q+ algorithm for fair queuing.
> +
> +Why BFQ?
> +
> +- Not sure if the weighted round robin logic of CFQ can be easily extended
> +  for hierarchical mode. One of the issues is that we can not keep dividing
> +  the time slice of a parent group among its children: the deeper we go in
> +  the hierarchy, the smaller the time slice gets.
> +
> +  One of the ways to implement hierarchical support could be to keep track
> +  of virtual time and service provided to a queue/group and select a
> +  queue/group for service based on any of the various available algorithms.
> +
> +  BFQ already had support for hierarchical scheduling, taking those patches
> +  was easier.
> +
> +- BFQ was designed to provide tighter bounds/delay w.r.t service provided
> +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
> +
> +  Note: BFQ originally used amount of IO done (number of sectors) as notion
> +        of service provided. IOW, it tried to provide fairness in terms of
> +        actual IO done and not in terms of actual time disk access was
> +	given to a queue.
> +
> +	This patchset modified BFQ to provide fairness in the time domain
> +	because that's what CFQ does. So the idea was to try not to deviate
> +	too much from the CFQ behavior initially.
> +
> +	Providing fairness in the time domain makes accounting tricky because
> +	due to command queueing, at one time there might be multiple requests
> +	from different queues and there is no easy way to find out how much
> +	disk time actually was consumed by the requests of a particular
> +	queue. More about this in comments in source code.
> +
> +So it is yet to be seen whether changing to the time domain still retains the
> +BFQ guarantees or not.
> +
> +From data structure point of view, one can think of a tree per device, where
> +io groups and io queues are hanging and are being scheduled using B-WF2Q+
> +algorithm. io_queue is the end queue where requests are actually stored and
> +dispatched from (like cfqq).
> +
> +These io queues are primarily created and managed by the end io schedulers
> +depending on their semantics. For example, the noop, deadline and AS
> +ioschedulers keep one io queue per cgroup, and cfq keeps one io queue per
> +io_context in a cgroup (apart from async queues).
> +
> +A request is mapped to an io group by the elevator layer, and which io queue
> +it is mapped to within the group depends on the ioscheduler. Currently the
> +"current" task is used to determine the cgroup (hence io group) of the
> +request. Down the line we need to make use of the bio-cgroup patches to map
> +delayed writes to the right group.
> +
> +Going back to old behavior
> +==========================
> +In the new scheme of things we are essentially creating hierarchical fair
> +queuing logic in the elevator layer and changing the IO schedulers to make
> +use of that logic so that the end IO schedulers start supporting hierarchical
> +scheduling.
> +
> +The elevator layer continues to support the old interfaces. So even if fair
> +queuing is enabled at the elevator layer, one can have both the new
> +hierarchical scheduler and the old non-hierarchical scheduler operating.
> +
> +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> +scheduling is disabled, noop, deadline and AS should retain their existing
> +behavior.
> +
> +CFQ is the only exception where one can not disable fair queuing as it is
> +needed for providing fairness among various threads even in non-hierarchical
> +mode.
> +
> +Various user visible config options
> +===================================
> +CONFIG_IOSCHED_NOOP_HIER
> +	- Enables hierarchical fair queuing in noop. Not selecting this option
> +	  leads to old behavior of noop.
> +
> +CONFIG_IOSCHED_DEADLINE_HIER
> +	- Enables hierarchical fair queuing in deadline. Not selecting this
> +	  option leads to old behavior of deadline.
> +
> +CONFIG_IOSCHED_AS_HIER
> +	- Enables hierarchical fair queuing in AS. Not selecting this option
> +	  leads to old behavior of AS.
> +
> +CONFIG_IOSCHED_CFQ_HIER
> +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
> +	  still does fair queuing among various queues but it is flat and not
> +	  hierarchical.
> +
> +Config options selected automatically
> +=====================================
> +These config options are not user visible and are selected/deselected
> +automatically based on IO scheduler configurations.
> +
> +CONFIG_ELV_FAIR_QUEUING
> +	- Enables/Disables the fair queuing logic at elevator layer.
> +
> +CONFIG_GROUP_IOSCHED
> +	- Enables/Disables hierarchical queuing and associated cgroup bits.
> +
> +TODO
> +====
> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
> +- Convert cgroup ioprio to notion of weight.
> +- Anticipatory code will need more work. It is not working properly currently
> +  and needs more thought.
> +- Use of bio-cgroup patches.
> +- Use of Nauman's per cgroup request descriptor patches.
> +
> +HOWTO
> +=====
> +So far I have done very simple testing of running two dd threads in two
> +different cgroups. Here is what you can do.
> +
> +- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
> +	CONFIG_IOSCHED_CFQ_HIER=y
> +
> +- Compile and boot into kernel and mount IO controller.
> +
> +	mount -t cgroup -o io none /cgroup
> +
> +- Create two cgroups
> +	mkdir -p /cgroup/test1/ /cgroup/test2
> +
> +- Set io priority of group test1 and test2
> +	echo 0 > /cgroup/test1/io.ioprio
> +	echo 4 > /cgroup/test2/io.ioprio
> +
> +- Create two files of the same size (say 512MB each) on the same disk
> +  (zerofile1, zerofile2) and launch two dd threads in different cgroups to
> +  read those files. Make sure the
> +  right io scheduler is being used for the block device where files are
> +  present (the one you compiled in hierarchical mode).
> +
> +	echo 1 > /proc/sys/vm/drop_caches
> +
> +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> +	echo $! > /cgroup/test1/tasks
> +	cat /cgroup/test1/tasks
> +
> +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> +	echo $! > /cgroup/test2/tasks
> +	cat /cgroup/test2/tasks
> +
> +- First dd should finish first.
> +
> +Some Test Results
> +=================
> +- Two dd threads in two cgroups with prio 0 and 4.
> +
> +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> +
> +- Three dd in three cgroups with prio 0, 4, 4.
> +
> +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s

Hi Vivek,

I would be interested in knowing whether these are the expected results.

-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12  1:56     ` Vivek Goyal
  (?)
  (?)
@ 2009-03-12 10:00     ` Dhaval Giani
       [not found]       ` <20090312100054.GA8024-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2009-03-12 14:04       ` Vivek Goyal
  -1 siblings, 2 replies; 190+ messages in thread
From: Dhaval Giani @ 2009-03-12 10:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, balbir, linux-kernel, containers,
	akpm, menage, peterz

On Wed, Mar 11, 2009 at 09:56:46PM -0400, Vivek Goyal wrote:
> o Documentation for io-controller.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
>  1 files changed, 221 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/block/io-controller.txt
> 
> diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
> new file mode 100644
> index 0000000..8884c5a
> --- /dev/null
> +++ b/Documentation/block/io-controller.txt
> @@ -0,0 +1,221 @@
> +				IO Controller
> +				=============
> +
> +Overview
> +========
> +
> +This patchset implements a proportional weight IO controller. That is, one
> +can create cgroups and assign prios/weights to those cgroups, and each task
> +group will get access to the disk in proportion to the weight of the group.
> +
> +These patches modify the elevator layer and individual IO schedulers to do
> +IO control, hence this io controller works only on block devices which use
> +one of the standard io schedulers; it can not be used with an arbitrary
> +logical block device.
> +
> +The assumption/thought behind modifying the IO schedulers is that resource
> +control is needed only on leaf nodes, where the actual contention for
> +resources is present, and not on intermediate logical block devices.
> +
> +Consider the following hypothetical scenario. Let's say there are three
> +physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1)
> +have been created on top of these. Some part of sdb is in lv0 and some part
> +is in lv1.
> +
> +			    lv0      lv1
> +			  /	\  /     \
> +			sda      sdb      sdc
> +
> +Also consider following cgroup hierarchy
> +
> +				root
> +				/   \
> +			       A     B
> +			      / \    / \
> +			     T1 T2  T3  T4
> +
> +A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
> +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1, these tasks should
> +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
> +IO control on the intermediate logical block nodes (lv0, lv1).
> +
> +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
> +only, there will not be any contention for resources between group A and B if
> +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
> +IO scheduler associated with the sdb will distribute disk bandwidth to
> +group A and B proportionate to their weight.
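A toy illustration of what "proportionate to their weight" means for the sdb
case above (the weights and numbers are made up for the example; the posted
patches take an io.ioprio per cgroup rather than a raw weight): two groups
contending for one disk simply split its time in the ratio of their weights.

#include <stdio.h>

int main(void)
{
	unsigned int weight_a = 200, weight_b = 100;	/* cgroups A and B */
	unsigned int sdb_ms = 3000;			/* 3s of sdb disk time */
	unsigned int total = weight_a + weight_b;

	printf("group A gets %u ms, group B gets %u ms of sdb time\n",
	       sdb_ms * weight_a / total, sdb_ms * weight_b / total);
	return 0;
}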
> +
> +CFQ already has the notion of fairness and it provides differential disk
> +access based on priority and class of the task. Just that it is flat and
> +with cgroup stuff, it needs to be made hierarchical.
> +
> +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
> +of fairness among various threads.
> +
> +One of the concerns raised with modifying IO schedulers was that we don't
> +want to replicate the code in all the IO schedulers. These patches share
> +the fair queuing code which has been moved to a common layer (elevator
> +layer). Hence we don't end up replicating code across IO schedulers.
> +
> +Design
> +======
> +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
> +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
> +B-WF2Q+ algorithm for fair queuing.
> +
> +Why BFQ?
> +
> +- Not sure if the weighted round robin logic of CFQ can be easily extended
> +  for hierarchical mode. One of the issues is that we can not keep dividing
> +  the time slice of a parent group among its children: the deeper we go in
> +  the hierarchy, the smaller the time slice gets.
> +
> +  One of the ways to implement hierarchical support could be to keep track
> +  of virtual time and service provided to a queue/group and select a
> +  queue/group for service based on any of the various available algorithms.
> +
> +  BFQ already had support for hierarchical scheduling, taking those patches
> +  was easier.
> +
> +- BFQ was designed to provide tighter bounds/delay w.r.t service provided
> +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
> +
> +  Note: BFQ originally used amount of IO done (number of sectors) as notion
> +        of service provided. IOW, it tried to provide fairness in terms of
> +        actual IO done and not in terms of actual time disk access was
> +	given to a queue.
> +
> +	This patchset modified BFQ to provide fairness in the time domain
> +	because that's what CFQ does. So the idea was to try not to deviate
> +	too much from the CFQ behavior initially.
> +
> +	Providing fairness in the time domain makes accounting tricky because
> +	due to command queueing, at one time there might be multiple requests
> +	from different queues and there is no easy way to find out how much
> +	disk time actually was consumed by the requests of a particular
> +	queue. More about this in comments in source code.
> +
> +So it is yet to be seen whether changing to the time domain still retains the
> +BFQ guarantees or not.
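To make the virtual time idea mentioned above concrete, here is a small,
self-contained sketch of WF2Q+-style bookkeeping. This is not the BFQ code;
the queue names, weights and the 10ms slice are invented for the example.
Each queue is stamped with a virtual finish time (its virtual start time plus
the service it received scaled by the inverse of its weight), and the
scheduler repeatedly serves the eligible queue with the smallest finish time.

#include <stdio.h>

struct ioq {
	const char *name;
	unsigned int weight;			/* share of disk time */
	unsigned long long vstart, vfinish;	/* virtual start/finish time */
};

static unsigned long long vclock;	/* global virtual clock */

/* pick an eligible queue (vstart <= vclock) with the smallest vfinish */
static struct ioq *pick_queue(struct ioq *q, int n)
{
	struct ioq *best = NULL;
	int i;

	for (i = 0; i < n; i++)
		if (q[i].vstart <= vclock &&
		    (!best || q[i].vfinish < best->vfinish))
			best = &q[i];
	return best;
}

/* charge 'served' ms of disk time to a queue and restamp it */
static void charge_queue(struct ioq *q, unsigned long long served)
{
	q->vstart = q->vfinish;
	q->vfinish = q->vstart + served * 1000 / q->weight;
}

int main(void)
{
	struct ioq qs[2] = {
		{ "test1", 200, 0, 0 },		/* higher weight */
		{ "test2", 100, 0, 0 },
	};
	int round;

	for (round = 0; round < 6; round++) {
		struct ioq *q = pick_queue(qs, 2);

		printf("round %d: serve %s\n", round, q->name);
		charge_queue(q, 10);			/* one 10ms time slice */
		vclock += 10 * 1000 / (200 + 100);	/* total weight 300 */
	}
	return 0;
}

Over the six rounds the weight-200 queue is served four times and the
weight-100 queue twice, i.e. a 2:1 split of disk time matching the weights.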
> +
> +From data structure point of view, one can think of a tree per device, where
> +io groups and io queues are hanging and are being scheduled using B-WF2Q+
> +algorithm. io_queue is the end queue where requests are actually stored and
> +dispatched from (like cfqq).
> +
> +These io queues are primarily created and managed by the end io schedulers
> +depending on their semantics. For example, the noop, deadline and AS
> +ioschedulers keep one io queue per cgroup, and cfq keeps one io queue per
> +io_context in a cgroup (apart from async queues).
> +
> +A request is mapped to an io group by the elevator layer, and which io queue
> +it is mapped to within the group depends on the ioscheduler. Currently the
> +"current" task is used to determine the cgroup (hence io group) of the
> +request. Down the line we need to make use of the bio-cgroup patches to map
> +delayed writes to the right group.
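A rough userspace sketch of that lookup path; the types and helpers below
(task_io_group(), group_find_queue()) are illustrative stand-ins, not the
interfaces from these patches. The elevator derives the io group from the
submitting task's cgroup, and a noop/deadline/AS style scheduler then uses
the single io queue of that group.

#include <stdio.h>

struct io_queue { const char *owner; };
struct io_group { const char *cgroup_path; struct io_queue ioq; };
struct task     { const char *comm; struct io_group *iog; };

/* stand-in for the cgroup lookup the elevator layer does on "current" */
static struct io_group *task_io_group(struct task *tsk)
{
	return tsk->iog;
}

/* noop/deadline/AS style: a single io queue per group */
static struct io_queue *group_find_queue(struct io_group *iog)
{
	return &iog->ioq;
}

int main(void)
{
	struct io_group test1 = { "/cgroup/test1", { "test1" } };
	struct task dd = { "dd", &test1 };

	struct io_group *iog = task_io_group(&dd);
	struct io_queue *ioq = group_find_queue(iog);

	printf("request from '%s' -> group %s -> queue owned by %s\n",
	       dd.comm, iog->cgroup_path, ioq->owner);
	return 0;
}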
> +
> +Going back to old behavior
> +==========================
> +In the new scheme of things we are essentially creating hierarchical fair
> +queuing logic in the elevator layer and changing the IO schedulers to make
> +use of that logic so that the end IO schedulers start supporting hierarchical
> +scheduling.
> +
> +The elevator layer continues to support the old interfaces. So even if fair
> +queuing is enabled at the elevator layer, one can have both the new
> +hierarchical scheduler and the old non-hierarchical scheduler operating.
> +
> +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> +scheduling is disabled, noop, deadline and AS should retain their existing
> +behavior.
> +
> +CFQ is the only exception where one can not disable fair queuing as it is
> +needed for providing fairness among various threads even in non-hierarchical
> +mode.
> +
> +Various user visible config options
> +===================================
> +CONFIG_IOSCHED_NOOP_HIER
> +	- Enables hierarchical fair queuing in noop. Not selecting this option
> +	  leads to old behavior of noop.
> +
> +CONFIG_IOSCHED_DEADLINE_HIER
> +	- Enables hierarchical fair queuing in deadline. Not selecting this
> +	  option leads to old behavior of deadline.
> +
> +CONFIG_IOSCHED_AS_HIER
> +	- Enables hierarchical fair queuing in AS. Not selecting this option
> +	  leads to old behavior of AS.
> +
> +CONFIG_IOSCHED_CFQ_HIER
> +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
> +	  still does fair queuing among various queues but it is flat and not
> +	  hierarchical.
> +
> +Config options selected automatically
> +=====================================
> +These config options are not user visible and are selected/deselected
> +automatically based on IO scheduler configurations.
> +
> +CONFIG_ELV_FAIR_QUEUING
> +	- Enables/Disables the fair queuing logic at elevator layer.
> +
> +CONFIG_GROUP_IOSCHED
> +	- Enables/Disables hierarchical queuing and associated cgroup bits.
> +
> +TODO
> +====
> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
> +- Convert cgroup ioprio to notion of weight.
> +- Anticipatory code will need more work. It is not working properly currently
> +  and needs more thought.
> +- Use of bio-cgroup patches.
> +- Use of Nauman's per cgroup request descriptor patches.
> +
> +HOWTO
> +=====
> +So far I have done very simple testing of running two dd threads in two
> +different cgroups. Here is what you can do.
> +
> +- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
> +	CONFIG_IOSCHED_CFQ_HIER=y
> +
> +- Compile and boot into kernel and mount IO controller.
> +
> +	mount -t cgroup -o io none /cgroup
> +
> +- Create two cgroups
> +	mkdir -p /cgroup/test1/ /cgroup/test2
> +
> +- Set io priority of group test1 and test2
> +	echo 0 > /cgroup/test1/io.ioprio
> +	echo 4 > /cgroup/test2/io.ioprio
> +
> +- Create two files of the same size (say 512MB each) on the same disk
> +  (zerofile1, zerofile2) and launch two dd threads in different cgroups to
> +  read those files. Make sure the
> +  right io scheduler is being used for the block device where files are
> +  present (the one you compiled in hierarchical mode).
> +
> +	echo 1 > /proc/sys/vm/drop_caches
> +
> +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> +	echo $! > /cgroup/test1/tasks
> +	cat /cgroup/test1/tasks
> +
> +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> +	echo $! > /cgroup/test2/tasks
> +	cat /cgroup/test2/tasks
> +
> +- First dd should finish first.
> +
> +Some Test Results
> +=================
> +- Two dd threads in two cgroups with prio 0 and 4.
> +
> +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> +
> +- Three dd in three cgroups with prio 0, 4, 4.
> +
> +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s

Hi Vivek,

I would be interested in knowing whether these are the expected results.

-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]         ` <20090312001146.74591b9d.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2009-03-12 10:07           ` Ryo Tsuruta
  2009-03-12 18:01           ` Vivek Goyal
  1 sibling, 0 replies; 190+ messages in thread
From: Ryo Tsuruta @ 2009-03-12 10:07 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	menage-hpIqsD4AKlfQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA

Hi Andrew,

> Also..  there are so many IO controller implementations that I've lost
> track of who is doing what.  I do have one private report here that
> Andreas's controller "is incredibly productive for us and has allowed
> us to put twice as many users per server with faster times for all
> users".  Which is pretty stunning, although it should be viewed as a
> condemnation of the current code, I'm afraid.

I'm developing dm-ioband, which is another IO controller, and I
would like to hear your comments about dm-ioband if you have.

dm-ioband web page:
http://people.valinux.co.jp/~ryov/dm-ioband/

> So my question is: what is the definitive list of
> proposed-io-controller-implementations and how do I cunningly get all
> you guys to check each others homework? :)

Dm-ioband is implemented as a device-mapper driver, so I'm proposing
dm-ioband to dm-devel and I hope the device-mapper folks take care of it.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12  7:11         ` Andrew Morton
  (?)
@ 2009-03-12 10:07         ` Ryo Tsuruta
  -1 siblings, 0 replies; 190+ messages in thread
From: Ryo Tsuruta @ 2009-03-12 10:07 UTC (permalink / raw)
  To: akpm
  Cc: vgoyal, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, fernando, s-uchida, taka, guijianfeng, arozansk,
	jmoyer, oz-kernel, dhaval, balbir, linux-kernel, containers,
	menage, peterz

Hi Andrew,

> Also..  there are so many IO controller implementations that I've lost
> track of who is doing what.  I do have one private report here that
> Andreas's controller "is incredibly productive for us and has allowed
> us to put twice as many users per server with faster times for all
> users".  Which is pretty stunning, although it should be viewed as a
> condemnation of the current code, I'm afraid.

I'm developing dm-ioband, which is another IO controller, and I
would like to hear your comments about dm-ioband if you have.

dm-ioband web page:
http://people.valinux.co.jp/~ryov/dm-ioband/

> So my question is: what is the definitive list of
> proposed-io-controller-implementations and how do I cunningly get all
> you guys to check each others homework? :)

Dm-ioband is implemented as a device-mapper driver, so I'm proposing
dm-ioband to dm-devel and I hope the device-mapper folks take care of it.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12  1:56     ` Vivek Goyal
@ 2009-03-12 10:24         ` Peter Zijlstra
  -1 siblings, 0 replies; 190+ messages in thread
From: Peter Zijlstra @ 2009-03-12 10:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Wed, 2009-03-11 at 21:56 -0400, Vivek Goyal wrote:
> +Going back to old behavior
> +==========================
> +In new scheme of things essentially we are creating hierarchical fair
> +queuing logic in elevator layer and changing IO schedulers to make use of
> +that logic so that end IO schedulers start supporting hierarchical scheduling.
> +
> +Elevator layer continues to support the old interfaces. So even if fair queuing
> +is enabled at elevator layer, one can have both new hierarchical scheduler as
> +well as old non-hierarchical scheduler operating.
> +
> +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> +scheduling is disabled, noop, deadline and AS should retain their existing
> +behavior.
> +
> +CFQ is the only exception where one can not disable fair queuing as it is
> +needed for providing fairness among various threads even in non-hierarchical
> +mode.
> +
> +Various user visible config options
> +===================================
> +CONFIG_IOSCHED_NOOP_HIER
> +       - Enables hierarchical fair queuing in noop. Not selecting this option
> +         leads to old behavior of noop.
> +
> +CONFIG_IOSCHED_DEADLINE_HIER
> +       - Enables hierarchical fair queuing in deadline. Not selecting this
> +         option leads to old behavior of deadline.
> +
> +CONFIG_IOSCHED_AS_HIER
> +       - Enables hierarchical fair queuing in AS. Not selecting this option
> +         leads to old behavior of AS.
> +
> +CONFIG_IOSCHED_CFQ_HIER
> +       - Enables hierarchical fair queuing in CFQ. Not selecting this option
> +         still does fair queuing among various queues but it is flat and not
> +         hierarchical.

One worry I have is that these are compile time switches. Is there any
way you can get the old AS/DEADLINE back when these are enabled but
you're not actively using cgroups?

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
@ 2009-03-12 10:24         ` Peter Zijlstra
  0 siblings, 0 replies; 190+ messages in thread
From: Peter Zijlstra @ 2009-03-12 10:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers, akpm, menage

On Wed, 2009-03-11 at 21:56 -0400, Vivek Goyal wrote:
> +Going back to old behavior
> +==========================
> +In new scheme of things essentially we are creating hierarchical fair
> +queuing logic in elevator layer and changing IO schedulers to make use of
> +that logic so that end IO schedulers start supporting hierarchical scheduling.
> +
> +Elevator layer continues to support the old interfaces. So even if fair queuing
> +is enabled at elevator layer, one can have both new hierarchical scheduler as
> +well as old non-hierarchical scheduler operating.
> +
> +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> +scheduling is disabled, noop, deadline and AS should retain their existing
> +behavior.
> +
> +CFQ is the only exception where one can not disable fair queuing as it is
> +needed for providing fairness among various threads even in non-hierarchical
> +mode.
> +
> +Various user visible config options
> +===================================
> +CONFIG_IOSCHED_NOOP_HIER
> +       - Enables hierarchical fair queuing in noop. Not selecting this option
> +         leads to old behavior of noop.
> +
> +CONFIG_IOSCHED_DEADLINE_HIER
> +       - Enables hierarchical fair queuing in deadline. Not selecting this
> +         option leads to old behavior of deadline.
> +
> +CONFIG_IOSCHED_AS_HIER
> +       - Enables hierarchical fair queuing in AS. Not selecting this option
> +         leads to old behavior of AS.
> +
> +CONFIG_IOSCHED_CFQ_HIER
> +       - Enables hierarchical fair queuing in CFQ. Not selecting this option
> +         still does fair queuing among various queues but it is flat and not
> +         hierarchical.

One worry I have is that these are compile time switches. Is there any
way you can get the old AS/DEADLINE back when these are enabled but
you're not actively using cgroups?
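For instance (purely illustrative, nothing like this is claimed to exist in
the posted patches), if the per-device state tracked how many io groups are
present, a scheduler built with one of the _HIER options could still take its
old flat path whenever only the root group is doing IO:

#include <stdbool.h>
#include <stdio.h>

/* hypothetical per-device bookkeeping */
struct io_dev_state { unsigned int nr_groups; };

static bool need_hier_scheduling(const struct io_dev_state *st)
{
	/* more than the root cgroup is doing IO on this device */
	return st->nr_groups > 1;
}

int main(void)
{
	struct io_dev_state root_only = { 1 }, three_groups = { 3 };

	printf("root only    -> %s path\n",
	       need_hier_scheduling(&root_only) ? "hierarchical" : "flat");
	printf("three groups -> %s path\n",
	       need_hier_scheduling(&three_groups) ? "hierarchical" : "flat");
	return 0;
}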


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-03-12  3:27 ` [RFC] IO Controller Takuya Yoshikawa
@ 2009-03-12 13:43       ` Vivek Goyal
       [not found]   ` <49B8810B.7030603-gVGce1chcLdL9jVzuh4AOg@public.gmane.org>
  1 sibling, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 13:43 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Thu, Mar 12, 2009 at 12:27:07PM +0900, Takuya Yoshikawa wrote:
> Hi Vivek,
>
> Could you tell me to which kernel I can apply your patches?
>   # latest mm?
> I would like to test your controller.
>

Hi Takuya,

These apply on Linus' git tree (2.6.29-rc7).

Thanks
Vivek

> Thank you,
>   Takuya Yoshikawa
>
>
> Vivek Goyal wrote:
>>
>> Hi All,
>>
>> Here is another posting for IO controller patches. Last time I had posted
>> RFC patches for an IO controller which did bio control per cgroup.
>>
>> http://lkml.org/lkml/2008/11/6/227
>>
>> One of the takeaway from the discussion in this thread was that let us
>> implement a common layer which contains the proportional weight scheduling
>> code which can be shared by all the IO schedulers.
>>
>> Implementing IO controller will not cover the devices which don't use
>> IO schedulers but it should cover the common case.
>>
>> There were more discussions regarding 2 level vs 1 level IO control at
>> following link.
>>
>> https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
>>
>> So in the mean time we took the discussion off the list and spent time on
>> making the 1 level control apporoach work where majority of the proportional
>> weight control is shared by the four schedulers instead of each one having
>> to replicate the code. We make use of BFQ code for fair queuing as posted
>> by Paolo and Fabio here.
>>
>> http://lkml.org/lkml/2008/11/11/148
>>
>> Details about design and howto have been put in documentation patch.
>>
>> I have done very basic testing of running 2 or 3 "dd" threads in different
>> cgroups. Wanted to get the patchset out for feedback/review before we dive
>> into more bug fixing, benchmarking, optimizations etc.
>>
>> Your feedback/comments are welcome.
>>
>> Patch series contains 10 patches. It should be compilable and bootable after
>> every patch. Intial 2 patches implement flat fair queuing (no cgroup
>> support) and make cfq to use that. Later patches introduce hierarchical
>> fair queuing support in elevator layer and modify other IO schdulers to use
>> that.
>>
>> Thanks
>> Vivek
>> _______________________________________________
>> Containers mailing list
>> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
>> https://lists.linux-foundation.org/mailman/listinfo/containers
>>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
@ 2009-03-12 13:43       ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 13:43 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers, akpm, menage

On Thu, Mar 12, 2009 at 12:27:07PM +0900, Takuya Yoshikawa wrote:
> Hi Vivek,
>
> Could you tell me to which kernel I can apply your patches?
>   # latest mm?
> I would like to test your controller.
>

Hi Takuya,

These apply on Linus' git tree (2.6.29-rc7).

Thanks
Vivek

> Thank you,
>   Takuya Yoshikawa
>
>
> Vivek Goyal wrote:
>>
>> Hi All,
>>
>> Here is another posting for IO controller patches. Last time I had posted
>> RFC patches for an IO controller which did bio control per cgroup.
>>
>> http://lkml.org/lkml/2008/11/6/227
>>
>> One of the takeaway from the discussion in this thread was that let us
>> implement a common layer which contains the proportional weight scheduling
>> code which can be shared by all the IO schedulers.
>>
>> Implementing IO controller will not cover the devices which don't use
>> IO schedulers but it should cover the common case.
>>
>> There were more discussions regarding 2 level vs 1 level IO control at
>> following link.
>>
>> https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
>>
>> So in the mean time we took the discussion off the list and spent time on
>> making the 1 level control apporoach work where majority of the proportional
>> weight control is shared by the four schedulers instead of each one having
>> to replicate the code. We make use of BFQ code for fair queuing as posted
>> by Paolo and Fabio here.
>>
>> http://lkml.org/lkml/2008/11/11/148
>>
>> Details about design and howto have been put in documentation patch.
>>
>> I have done very basic testing of running 2 or 3 "dd" threads in different
>> cgroups. Wanted to get the patchset out for feedback/review before we dive
>> into more bug fixing, benchmarking, optimizations etc.
>>
>> Your feedback/comments are welcome.
>>
>> Patch series contains 10 patches. It should be compilable and bootable after
>> every patch. Intial 2 patches implement flat fair queuing (no cgroup
>> support) and make cfq to use that. Later patches introduce hierarchical
>> fair queuing support in elevator layer and modify other IO schdulers to use
>> that.
>>
>> Thanks
>> Vivek
>> _______________________________________________
>> Containers mailing list
>> Containers@lists.linux-foundation.org
>> https://lists.linux-foundation.org/mailman/listinfo/containers
>>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-03-12  6:40   ` anqin
@ 2009-03-12 13:46         ` Vivek Goyal
  2009-03-12  6:55     ` Li Zefan
  1 sibling, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 13:46 UTC (permalink / raw)
  To: anqin
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Thu, Mar 12, 2009 at 02:40:23PM +0800, anqin wrote:
> Hi Vivek,
> 
> It would be very appreciated if the patches can be based on 2.6.28.
> 

Hi Anqin,

I think most of the people want to test new patches on latest kernels
so I will keep it that way. You can backport it to previous kernels if
you really need to. For me it will become very difficult to maintain
two versions. 

Is there any reason why you can't move to latest kernels?  

Thanks
Vivek

> Thanks a lot.
> 
> Anqin
> 
> On Thu, Mar 12, 2009 at 11:27 AM, Takuya Yoshikawa
> <yoshikawa.takuya-gVGce1chcLdL9jVzuh4AOg@public.gmane.org> wrote:
> > Hi Vivek,
> >
> > Could you tell me to which kernel I can apply your patches?
> >   # latest mm?
> > I would like to test your controller.
> >
> > Thank you,
> >   Takuya Yoshikawa
> >
> >
> > Vivek Goyal wrote:
> >>
> >> Hi All,
> >>
> >> Here is another posting for IO controller patches. Last time I had posted
> >> RFC patches for an IO controller which did bio control per cgroup.
> >>
> >> http://lkml.org/lkml/2008/11/6/227
> >>
> >> One of the takeaway from the discussion in this thread was that let us
> >> implement a common layer which contains the proportional weight scheduling
> >> code which can be shared by all the IO schedulers.
> >>
> >> Implementing IO controller will not cover the devices which don't use
> >> IO schedulers but it should cover the common case.
> >>
> >> There were more discussions regarding 2 level vs 1 level IO control at
> >> following link.
> >>
> >> https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
> >>
> >> So in the mean time we took the discussion off the list and spent time on
> >> making the 1 level control apporoach work where majority of the proportional
> >> weight control is shared by the four schedulers instead of each one having
> >> to replicate the code. We make use of BFQ code for fair queuing as posted
> >> by Paolo and Fabio here.
> >>
> >> http://lkml.org/lkml/2008/11/11/148
> >>
> >> Details about design and howto have been put in documentation patch.
> >>
> >> I have done very basic testing of running 2 or 3 "dd" threads in different
> >> cgroups. Wanted to get the patchset out for feedback/review before we dive
> >> into more bug fixing, benchmarking, optimizations etc.
> >>
> >> Your feedback/comments are welcome.
> >>
> >> Patch series contains 10 patches. It should be compilable and bootable after
> >> every patch. Intial 2 patches implement flat fair queuing (no cgroup
> >> support) and make cfq to use that. Later patches introduce hierarchical
> >> fair queuing support in elevator layer and modify other IO schdulers to use
> >> that.
> >>
> >> Thanks
> >> Vivek
> >> _______________________________________________
> >> Containers mailing list
> >> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> >> https://lists.linux-foundation.org/mailman/listinfo/containers
> >>
> >
> > _______________________________________________
> > Containers mailing list
> > Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> > https://lists.linux-foundation.org/mailman/listinfo/containers
> >

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
@ 2009-03-12 13:46         ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 13:46 UTC (permalink / raw)
  To: anqin
  Cc: Takuya Yoshikawa, oz-kernel, paolo.valente, linux-kernel, dhaval,
	containers, menage, jmoyer, fchecconi, arozansk, jens.axboe,
	akpm, fernando, balbir

On Thu, Mar 12, 2009 at 02:40:23PM +0800, anqin wrote:
> Hi Vivek,
> 
> It would be very appreciated if the patches can be based on 2.6.28.
> 

Hi Anqin,

I think most of the people want to test new patches on latest kernels
so I will keep it that way. You can backport it to previous kernels if
you really need to. For me it will become very difficult to maintain
two versions. 

Is there any reason why you can't move to latest kernels?  

Thanks
Vivek

> Thanks a lot.
> 
> Anqin
> 
> On Thu, Mar 12, 2009 at 11:27 AM, Takuya Yoshikawa
> <yoshikawa.takuya@oss.ntt.co.jp> wrote:
> > Hi Vivek,
> >
> > Could you tell me to which kernel I can apply your patches?
> >   # latest mm?
> > I would like to test your controller.
> >
> > Thank you,
> >   Takuya Yoshikawa
> >
> >
> > Vivek Goyal wrote:
> >>
> >> Hi All,
> >>
> >> Here is another posting for IO controller patches. Last time I had posted
> >> RFC patches for an IO controller which did bio control per cgroup.
> >>
> >> http://lkml.org/lkml/2008/11/6/227
> >>
> >> One of the takeaway from the discussion in this thread was that let us
> >> implement a common layer which contains the proportional weight scheduling
> >> code which can be shared by all the IO schedulers.
> >>
> >> Implementing IO controller will not cover the devices which don't use
> >> IO schedulers but it should cover the common case.
> >>
> >> There were more discussions regarding 2 level vs 1 level IO control at
> >> following link.
> >>
> >> https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
> >>
> >> So in the mean time we took the discussion off the list and spent time on
> >> making the 1 level control apporoach work where majority of the proportional
> >> weight control is shared by the four schedulers instead of each one having
> >> to replicate the code. We make use of BFQ code for fair queuing as posted
> >> by Paolo and Fabio here.
> >>
> >> http://lkml.org/lkml/2008/11/11/148
> >>
> >> Details about design and howto have been put in documentation patch.
> >>
> >> I have done very basic testing of running 2 or 3 "dd" threads in different
> >> cgroups. Wanted to get the patchset out for feedback/review before we dive
> >> into more bug fixing, benchmarking, optimizations etc.
> >>
> >> Your feedback/comments are welcome.
> >>
> >> Patch series contains 10 patches. It should be compilable and bootable after
> >> every patch. Intial 2 patches implement flat fair queuing (no cgroup
> >> support) and make cfq to use that. Later patches introduce hierarchical
> >> fair queuing support in elevator layer and modify other IO schdulers to use
> >> that.
> >>
> >> Thanks
> >> Vivek
> >> _______________________________________________
> >> Containers mailing list
> >> Containers@lists.linux-foundation.org
> >> https://lists.linux-foundation.org/mailman/listinfo/containers
> >>
> >
> > _______________________________________________
> > Containers mailing list
> > Containers@lists.linux-foundation.org
> > https://lists.linux-foundation.org/mailman/listinfo/containers
> >

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]         ` <49B8BDB3.40808-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-03-12 13:51           ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 13:51 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Thu, Mar 12, 2009 at 03:45:55PM +0800, Yang Hongyang wrote:
> Don't forget to update the 00-INDEX file when you add a new doc.^!^
> 

Thanks. Will do it.

Vivek

> Vivek Goyal wrote:
> > o Documentation for io-controller.
> > 
> > Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
> >  1 files changed, 221 insertions(+), 0 deletions(-)
> >  create mode 100644 Documentation/block/io-controller.txt
> > 
> > diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
> > new file mode 100644
> > index 0000000..8884c5a
> > --- /dev/null
> > +++ b/Documentation/block/io-controller.txt
> > @@ -0,0 +1,221 @@
> > +				IO Controller
> > +				=============
> > +
> > +Overview
> > +========
> > +
> > +This patchset implements a proportional weight IO controller. That is, one
> > +can create cgroups and assign prios/weights to those cgroups, and each task
> > +group will get access to the disk in proportion to the weight of the group.
> > +
> > +These patches modify the elevator layer and individual IO schedulers to do
> > +IO control, hence this io controller works only on block devices which use
> > +one of the standard io schedulers; it can not be used with an arbitrary
> > +logical block device.
> > +
> > +The assumption/thought behind modifying the IO schedulers is that resource
> > +control is needed only on leaf nodes, where the actual contention for
> > +resources is present, and not on intermediate logical block devices.
> > +
> > +Consider the following hypothetical scenario. Let's say there are three
> > +physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1)
> > +have been created on top of these. Some part of sdb is in lv0 and some part
> > +is in lv1.
> > +
> > +			    lv0      lv1
> > +			  /	\  /     \
> > +			sda      sdb      sdc
> > +
> > +Also consider following cgroup hierarchy
> > +
> > +				root
> > +				/   \
> > +			       A     B
> > +			      / \    / \
> > +			     T1 T2  T3  T4
> > +
> > +A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
> > +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1, these tasks should
> > +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
> > +IO control on the intermediate logical block nodes (lv0, lv1).
> > +
> > +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
> > +only, there will not be any contention for resources between group A and B if
> > +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
> > +IO scheduler associated with the sdb will distribute disk bandwidth to
> > +group A and B proportionate to their weight.
> > +
> > +CFQ already has the notion of fairness and it provides differential disk
> > +access based on priority and class of the task. Just that it is flat and
> > +with cgroup stuff, it needs to be made hierarchical.
> > +
> > +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
> > +of fairness among various threads.
> > +
> > +One of the concerns raised with modifying IO schedulers was that we don't
> > +want to replicate the code in all the IO schedulers. These patches share
> > +the fair queuing code which has been moved to a common layer (elevator
> > +layer). Hence we don't end up replicating code across IO schedulers.
> > +
> > +Design
> > +======
> > +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
> > +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
> > +B-WF2Q+ algorithm for fair queuing.
> > +
> > +Why BFQ?
> > +
> > +- Not sure if the weighted round robin logic of CFQ can be easily extended
> > +  for hierarchical mode. One of the issues is that we can not keep dividing
> > +  the time slice of a parent group among its children: the deeper we go in
> > +  the hierarchy, the smaller the time slice gets.
> > +
> > +  One of the ways to implement hierarchical support could be to keep track
> > +  of virtual time and service provided to a queue/group and select a
> > +  queue/group for service based on any of the various available algorithms.
> > +
> > +  BFQ already had support for hierarchical scheduling, taking those patches
> > +  was easier.
> > +
> > +- BFQ was designed to provide tighter bounds/delay w.r.t service provided
> > +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
> > +
> > +  Note: BFQ originally used amount of IO done (number of sectors) as notion
> > +        of service provided. IOW, it tried to provide fairness in terms of
> > +        actual IO done and not in terms of actual time disk access was
> > +	given to a queue.
> > +
> > +	This patchset modified BFQ to provide fairness in the time domain
> > +	because that's what CFQ does. So the idea was to try not to deviate
> > +	too much from the CFQ behavior initially.
> > +
> > +	Providing fairness in the time domain makes accounting tricky because
> > +	due to command queueing, at one time there might be multiple requests
> > +	from different queues and there is no easy way to find out how much
> > +	disk time actually was consumed by the requests of a particular
> > +	queue. More about this in comments in source code.
> > +
> > +So it is yet to be seen whether changing to the time domain still retains the
> > +BFQ guarantees or not.
> > +
> > +From data structure point of view, one can think of a tree per device, where
> > +io groups and io queues are hanging and are being scheduled using B-WF2Q+
> > +algorithm. io_queue is the end queue where requests are actually stored and
> > +dispatched from (like cfqq).
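Roughly the per-device picture being described, with invented type and field
names (the real patches reuse the BFQ entity/service-tree machinery): each
block device has a set of io groups, each io group carries a weight and its
io queue(s), and requests sit in the io queues.

#include <stdio.h>

struct io_queue {
	const char *name;	/* e.g. one per io_context for cfq */
	int nr_requests;	/* requests currently queued here */
};

struct io_group {
	const char *cgroup;	/* cgroup this group represents */
	unsigned int weight;	/* derived from the cgroup's io.ioprio */
	struct io_queue ioq;	/* single queue, noop/deadline/AS style */
};

struct io_device {		/* one scheduling tree per block device */
	const char *disk;
	struct io_group *groups[8];
	int nr_groups;
};

int main(void)
{
	struct io_group g1 = { "/cgroup/test1", 200, { "dd-1", 4 } };
	struct io_group g2 = { "/cgroup/test2", 100, { "dd-2", 7 } };
	struct io_device sdb = { "sdb", { &g1, &g2 }, 2 };
	int i;

	for (i = 0; i < sdb.nr_groups; i++)
		printf("%s: group %s, weight %u, %d queued requests\n",
		       sdb.disk, sdb.groups[i]->cgroup, sdb.groups[i]->weight,
		       sdb.groups[i]->ioq.nr_requests);
	return 0;
}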
> > +
> > +These io queues are primarily created and managed by the end io schedulers
> > +depending on their semantics. For example, the noop, deadline and AS
> > +ioschedulers keep one io queue per cgroup, and cfq keeps one io queue per
> > +io_context in a cgroup (apart from async queues).
> > +
> > +A request is mapped to an io group by the elevator layer, and which io queue
> > +it is mapped to within the group depends on the ioscheduler. Currently the
> > +"current" task is used to determine the cgroup (hence io group) of the
> > +request. Down the line we need to make use of the bio-cgroup patches to map
> > +delayed writes to the right group.
> > +
> > +Going back to old behavior
> > +==========================
> > +In the new scheme of things we are essentially creating hierarchical fair
> > +queuing logic in the elevator layer and changing the IO schedulers to make
> > +use of that logic so that the end IO schedulers start supporting hierarchical
> > +scheduling.
> > +
> > +The elevator layer continues to support the old interfaces. So even if fair
> > +queuing is enabled at the elevator layer, one can have both the new
> > +hierarchical scheduler and the old non-hierarchical scheduler operating.
> > +
> > +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> > +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> > +scheduling is disabled, noop, deadline and AS should retain their existing
> > +behavior.
> > +
> > +CFQ is the only exception where one can not disable fair queuing as it is
> > +needed for providing fairness among various threads even in non-hierarchical
> > +mode.
> > +
> > +Various user visible config options
> > +===================================
> > +CONFIG_IOSCHED_NOOP_HIER
> > +	- Enables hierarchical fair queuing in noop. Not selecting this option
> > +	  leads to old behavior of noop.
> > +
> > +CONFIG_IOSCHED_DEADLINE_HIER
> > +	- Enables hierarchical fair queuing in deadline. Not selecting this
> > +	  option leads to old behavior of deadline.
> > +
> > +CONFIG_IOSCHED_AS_HIER
> > +	- Enables hierarchical fair queuing in AS. Not selecting this option
> > +	  leads to old behavior of AS.
> > +
> > +CONFIG_IOSCHED_CFQ_HIER
> > +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
> > +	  still does fair queuing among various queues but it is flat and not
> > +	  hierarchical.
> > +
> > +Config options selected automatically
> > +=====================================
> > +These config options are not user visible and are selected/deselected
> > +automatically based on IO scheduler configurations.
> > +
> > +CONFIG_ELV_FAIR_QUEUING
> > +	- Enables/Disables the fair queuing logic at elevator layer.
> > +
> > +CONFIG_GROUP_IOSCHED
> > +	- Enables/Disables hierarchical queuing and associated cgroup bits.
> > +
> > +TODO
> > +====
> > +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > +- Convert cgroup ioprio to notion of weight.
> > +- Anticipatory code will need more work. It is not working properly currently
> > +  and needs more thought.
> > +- Use of bio-cgroup patches.
> > +- Use of Nauman's per cgroup request descriptor patches.
> > +
> > +HOWTO
> > +=====
> > +So far I have done very simple testing of running two dd threads in two
> > +different cgroups. Here is what you can do.
> > +
> > +- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
> > +	CONFIG_IOSCHED_CFQ_HIER=y
> > +
> > +- Compile and boot into kernel and mount IO controller.
> > +
> > +	mount -t cgroup -o io none /cgroup
> > +
> > +- Create two cgroups
> > +	mkdir -p /cgroup/test1/ /cgroup/test2
> > +
> > +- Set io priority of group test1 and test2
> > +	echo 0 > /cgroup/test1/io.ioprio
> > +	echo 4 > /cgroup/test2/io.ioprio
> > +
> > +- Create two files of the same size (say 512MB each) on the same disk
> > +  (zerofile1, zerofile2) and launch two dd threads in different cgroups to
> > +  read those files. Make sure the
> > +  right io scheduler is being used for the block device where files are
> > +  present (the one you compiled in hierarchical mode).
> > +
> > +	echo 1 > /proc/sys/vm/drop_caches
> > +
> > +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> > +	echo $! > /cgroup/test1/tasks
> > +	cat /cgroup/test1/tasks
> > +
> > +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> > +	echo $! > /cgroup/test2/tasks
> > +	cat /cgroup/test2/tasks
> > +
> > +- First dd should finish first.
> > +
> > +Some Test Results
> > +=================
> > +- Two dd threads in two cgroups with prio 0 and 4.
> > +
> > +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> > +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> > +
> > +- Three dd in three cgroups with prio 0, 4, 4.
> > +
> > +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> > +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> > +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
> 
> 
> -- 
> Regards
> Yang Hongyang

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12  7:45         ` Yang Hongyang
  (?)
  (?)
@ 2009-03-12 13:51         ` Vivek Goyal
  -1 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 13:51 UTC (permalink / raw)
  To: Yang Hongyang
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers, akpm, menage, peterz

On Thu, Mar 12, 2009 at 03:45:55PM +0800, Yang Hongyang wrote:
> Don't forget to update the 00-INDEX file when you add a new doc.^!^
> 

Thanks. Will do it.

Vivek

> Vivek Goyal wrote:
> > o Documentation for io-controller.
> > 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
> >  1 files changed, 221 insertions(+), 0 deletions(-)
> >  create mode 100644 Documentation/block/io-controller.txt
> > 
> > diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
> > new file mode 100644
> > index 0000000..8884c5a
> > --- /dev/null
> > +++ b/Documentation/block/io-controller.txt
> > @@ -0,0 +1,221 @@
> > +				IO Controller
> > +				=============
> > +
> > +Overview
> > +========
> > +
> > +This patchset implements a proportional weight IO controller. That is, one
> > +can create cgroups and assign prios/weights to those cgroups, and each task
> > +group will get access to the disk in proportion to the weight of the group.
> > +
> > +These patches modify the elevator layer and individual IO schedulers to do
> > +IO control, hence this io controller works only on block devices which use
> > +one of the standard io schedulers; it can not be used with an arbitrary
> > +logical block device.
> > +
> > +The assumption/thought behind modifying the IO schedulers is that resource
> > +control is needed only on leaf nodes, where the actual contention for
> > +resources is present, and not on intermediate logical block devices.
> > +
> > +Consider the following hypothetical scenario. Let's say there are three
> > +physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1)
> > +have been created on top of these. Some part of sdb is in lv0 and some part
> > +is in lv1.
> > +
> > +			    lv0      lv1
> > +			  /	\  /     \
> > +			sda      sdb      sdc
> > +
> > +Also consider following cgroup hierarchy
> > +
> > +				root
> > +				/   \
> > +			       A     B
> > +			      / \    / \
> > +			     T1 T2  T3  T4
> > +
> > +A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
> > +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1, these tasks should
> > +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
> > +IO control on the intermediate logical block nodes (lv0, lv1).
> > +
> > +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
> > +only, there will not be any contention for resources between group A and B if
> > +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
> > +IO scheduler associated with the sdb will distribute disk bandwidth to
> > +group A and B proportionate to their weight.
> > +
> > +CFQ already has the notion of fairness and it provides differential disk
> > +access based on priority and class of the task. Just that it is flat and
> > +with cgroup stuff, it needs to be made hierarchical.
> > +
> > +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
> > +of fairness among various threads.
> > +
> > +One of the concerns raised with modifying IO schedulers was that we don't
> > +want to replicate the code in all the IO schedulers. These patches share
> > +the fair queuing code which has been moved to a common layer (elevator
> > +layer). Hence we don't end up replicating code across IO schedulers.
> > +
> > +Design
> > +======
> > +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
> > +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
> > +B-WF2Q+ algorithm for fair queuing.
> > +
> > +Why BFQ?
> > +
> > +- Not sure if the weighted round robin logic of CFQ can be easily extended to
> > +  hierarchical mode. One of the issues is that we can not keep dividing
> > +  the time slice of a parent group among its children. The deeper we go in
> > +  the hierarchy, the smaller the time slice gets.
> > +
> > +  One of the ways to implement hierarchical support could be to keep track
> > +  of virtual time and service provided to a queue/group and select a queue/group
> > +  for service based on any of the various available algorithms.
> > +
> > +  BFQ already had support for hierarchical scheduling, so taking those patches
> > +  was easier.
> > +
> > +- BFQ was designed to provide tighter bounds/delay w.r.t. service provided
> > +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
> > +
> > +  Note: BFQ originally used the amount of IO done (number of sectors) as the
> > +        notion of service provided. IOW, it tried to provide fairness in
> > +        terms of actual IO done and not in terms of the actual time disk
> > +	access was given to a queue.
> > +
> > +	This patchset modified BFQ to provide fairness in the time domain
> > +	because that's what CFQ does. So the idea was to try not to deviate
> > +	too much from the CFQ behavior initially.
> > +
> > +	Providing fairness in the time domain makes accounting tricky because,
> > +	due to command queueing, at one time there might be multiple requests
> > +	from different queues and there is no easy way to find out how much
> > +	disk time was actually consumed by the requests of a particular
> > +	queue. More about this in the comments in the source code.
> > +
> > +So it is yet to be seen whether changing to the time domain still retains the
> > +BFQ guarantees or not.
> > +
> > +From the data structure point of view, one can think of a tree per device,
> > +where io groups and io queues hang and are scheduled using the B-WF2Q+
> > +algorithm. An io_queue is the end queue where requests are actually stored
> > +and dispatched from (like cfqq).
> > +
> > +These io queues are primarily created and managed by the end io schedulers
> > +depending on their semantics. For example, the noop, deadline and AS
> > +ioschedulers keep one io queue per cgroup and cfq keeps one io queue per
> > +io_context in a cgroup (apart from async queues).
> > +
> > +A request is mapped to an io group by the elevator layer, and which io queue
> > +it is mapped to within the group depends on the ioscheduler. Currently the
> > +"current" task is used to determine the cgroup (hence io group) of the
> > +request. Down the line we need to make use of the bio-cgroup patches to map
> > +delayed writes to the right group.
> > +
> > +Going back to old behavior
> > +==========================
> > +In the new scheme of things we are essentially creating hierarchical fair
> > +queuing logic in the elevator layer and changing the IO schedulers to make use
> > +of that logic so that the end IO schedulers start supporting hierarchical
> > +scheduling.
> > +
> > +The elevator layer continues to support the old interfaces. So even if fair
> > +queuing is enabled at the elevator layer, one can have both the new
> > +hierarchical scheduler as well as the old non-hierarchical scheduler operating.
> > +
> > +Also, noop, deadline and AS have the option of enabling hierarchical
> > +scheduling. If it is selected, fair queuing is done in a hierarchical manner.
> > +If hierarchical scheduling is disabled, noop, deadline and AS should retain
> > +their existing behavior.
> > +
> > +CFQ is the only exception where one can not disable fair queuing, as it is
> > +needed for providing fairness among various threads even in non-hierarchical
> > +mode.
> > +
> > +Various user visible config options
> > +===================================
> > +CONFIG_IOSCHED_NOOP_HIER
> > +	- Enables hierarchical fair queuing in noop. Not selecting this option
> > +	  leads to the old behavior of noop.
> > +
> > +CONFIG_IOSCHED_DEADLINE_HIER
> > +	- Enables hierarchical fair queuing in deadline. Not selecting this
> > +	  option leads to the old behavior of deadline.
> > +
> > +CONFIG_IOSCHED_AS_HIER
> > +	- Enables hierarchical fair queuing in AS. Not selecting this option
> > +	  leads to the old behavior of AS.
> > +
> > +CONFIG_IOSCHED_CFQ_HIER
> > +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
> > +	  still does fair queuing among various queues but it is flat and not
> > +	  hierarchical.
> > +
> > +Config options selected automatically
> > +=====================================
> > +These config options are not user visible and are selected/deselected
> > +automatically based on IO scheduler configurations.
> > +
> > +CONFIG_ELV_FAIR_QUEUING
> > +	- Enables/Disables the fair queuing logic at elevator layer.
> > +
> > +CONFIG_GROUP_IOSCHED
> > +	- Enables/Disables hierarchical queuing and associated cgroup bits.
> > +
> > +TODO
> > +====
> > +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > +- Convert cgroup ioprio to notion of weight.
> > +- Anticipatory code will need more work. It is not working properly currently
> > +  and needs more thought.
> > +- Use of bio-cgroup patches.
> > +- Use of Nauman's per cgroup request descriptor patches.
> > +
> > +HOWTO
> > +=====
> > +So far I have done very simple testing of running two dd threads in two
> > +different cgroups. Here is what you can do.
> > +
> > +- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
> > +	CONFIG_IOSCHED_CFQ_HIER=y
> > +
> > +- Compile and boot into kernel and mount IO controller.
> > +
> > +	mount -t cgroup -o io none /cgroup
> > +
> > +- Create two cgroups
> > +	mkdir -p /cgroup/test1/ /cgroup/test2
> > +
> > +- Set io priority of group test1 and test2
> > +	echo 0 > /cgroup/test1/io.ioprio
> > +	echo 4 > /cgroup/test2/io.ioprio
> > +
> > +- Create two files of the same size (say 512MB each) on the same disk (file1,
> > +  file2) and launch two dd threads in different cgroups to read those files.
> > +  Make sure the right io scheduler is being used for the block device where
> > +  the files are present (the one you compiled in hierarchical mode).
> > +
> > +	echo 1 > /proc/sys/vm/drop_caches
> > +
> > +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> > +	echo $! > /cgroup/test1/tasks
> > +	cat /cgroup/test1/tasks
> > +
> > +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> > +	echo $! > /cgroup/test2/tasks
> > +	cat /cgroup/test2/tasks
> > +
> > +- First dd should finish first.
> > +
> > +Some Test Results
> > +=================
> > +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
> > +
> > +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> > +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> > +
> > +- Three dd in three cgroups with prio 0, 4, 4.
> > +
> > +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> > +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> > +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
> 
> 
> -- 
> Regards
> Yang Hongyang

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]       ` <20090312100054.GA8024-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2009-03-12 14:04         ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 14:04 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Thu, Mar 12, 2009 at 03:30:54PM +0530, Dhaval Giani wrote:
> On Wed, Mar 11, 2009 at 09:56:46PM -0400, Vivek Goyal wrote:
> > o Documentation for io-controller.
> > 
> > Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
> >  1 files changed, 221 insertions(+), 0 deletions(-)
> >  create mode 100644 Documentation/block/io-controller.txt
> > 
> > diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
> > new file mode 100644
> > index 0000000..8884c5a
> > --- /dev/null
> > +++ b/Documentation/block/io-controller.txt
> > @@ -0,0 +1,221 @@
> > +				IO Controller
> > +				=============
> > +
> > +Overview
> > +========
> > +
> > +This patchset implements a proportional weight IO controller. That is, one
> > +can create cgroups and assign prio/weights to those cgroups, and a task group
> > +will get access to the disk in proportion to the weight of the group.
> > +
> > +These patches modify the elevator layer and individual IO schedulers to do
> > +IO control, hence this io controller works only on block devices which use
> > +one of the standard io schedulers; it can not be used with arbitrary logical
> > +block devices.
> > +
> > +The assumption/thought behind modifying the IO schedulers is that resource
> > +control is needed only on leaf nodes, where the actual contention for
> > +resources is present, and not on intermediate logical block devices.
> > +
> > +Consider the following hypothetical scenario. Let's say there are three physical
> > +disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
> > +created on top of these. Some part of sdb is in lv0 and some part is in lv1.
> > +
> > +			    lv0      lv1
> > +			  /	\  /     \
> > +			sda      sdb      sdc
> > +
> > +Also consider following cgroup hierarchy
> > +
> > +				root
> > +				/   \
> > +			       A     B
> > +			      / \    / \
> > +			     T1 T2  T3  T4
> > +
> > +A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
> > +Assume T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
> > +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
> > +IO control on intermediate logical block nodes (lv0, lv1).
> > +
> > +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
> > +only, there will not be any contention for resources between groups A and B if
> > +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
> > +the IO scheduler associated with sdb will distribute disk bandwidth to
> > +groups A and B in proportion to their weights.
> > +
> > +CFQ already has the notion of fairness and it provides differential disk
> > +access based on the priority and class of the task. It is just that it is
> > +flat, and with cgroups it needs to be made hierarchical.
> > +
> > +The rest of the IO schedulers (noop, deadline and AS) don't have any notion
> > +of fairness among various threads.
> > +
> > +One of the concerns raised with modifying IO schedulers was that we don't
> > +want to replicate the code in all the IO schedulers. These patches share
> > +the fair queuing code which has been moved to a common layer (elevator
> > +layer). Hence we don't end up replicating code across IO schedulers.
> > +
> > +Design
> > +======
> > +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
> > +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
> > +B-WF2Q+ algorithm for fair queuing.
> > +
> > +Why BFQ?
> > +
> > +- Not sure if the weighted round robin logic of CFQ can be easily extended to
> > +  hierarchical mode. One of the issues is that we can not keep dividing
> > +  the time slice of a parent group among its children. The deeper we go in
> > +  the hierarchy, the smaller the time slice gets.
> > +
> > +  One of the ways to implement hierarchical support could be to keep track
> > +  of virtual time and service provided to a queue/group and select a queue/group
> > +  for service based on any of the various available algorithms.
> > +
> > +  BFQ already had support for hierarchical scheduling, so taking those patches
> > +  was easier.
> > +
> > +- BFQ was designed to provide tighter bounds/delay w.r.t. service provided
> > +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
> > +
> > +  Note: BFQ originally used the amount of IO done (number of sectors) as the
> > +        notion of service provided. IOW, it tried to provide fairness in
> > +        terms of actual IO done and not in terms of the actual time disk
> > +	access was given to a queue.
> > +
> > +	This patchset modified BFQ to provide fairness in the time domain
> > +	because that's what CFQ does. So the idea was to try not to deviate
> > +	too much from the CFQ behavior initially.
> > +
> > +	Providing fairness in the time domain makes accounting tricky because,
> > +	due to command queueing, at one time there might be multiple requests
> > +	from different queues and there is no easy way to find out how much
> > +	disk time was actually consumed by the requests of a particular
> > +	queue. More about this in the comments in the source code.
> > +
> > +So it is yet to be seen whether changing to the time domain still retains the
> > +BFQ guarantees or not.
> > +
> > +From the data structure point of view, one can think of a tree per device,
> > +where io groups and io queues hang and are scheduled using the B-WF2Q+
> > +algorithm. An io_queue is the end queue where requests are actually stored
> > +and dispatched from (like cfqq).
> > +
> > +These io queues are primarily created and managed by the end io schedulers
> > +depending on their semantics. For example, the noop, deadline and AS
> > +ioschedulers keep one io queue per cgroup and cfq keeps one io queue per
> > +io_context in a cgroup (apart from async queues).
> > +
> > +A request is mapped to an io group by the elevator layer, and which io queue
> > +it is mapped to within the group depends on the ioscheduler. Currently the
> > +"current" task is used to determine the cgroup (hence io group) of the
> > +request. Down the line we need to make use of the bio-cgroup patches to map
> > +delayed writes to the right group.
> > +
> > +Going back to old behavior
> > +==========================
> > +In the new scheme of things we are essentially creating hierarchical fair
> > +queuing logic in the elevator layer and changing the IO schedulers to make use
> > +of that logic so that the end IO schedulers start supporting hierarchical
> > +scheduling.
> > +
> > +The elevator layer continues to support the old interfaces. So even if fair
> > +queuing is enabled at the elevator layer, one can have both the new
> > +hierarchical scheduler as well as the old non-hierarchical scheduler operating.
> > +
> > +Also, noop, deadline and AS have the option of enabling hierarchical
> > +scheduling. If it is selected, fair queuing is done in a hierarchical manner.
> > +If hierarchical scheduling is disabled, noop, deadline and AS should retain
> > +their existing behavior.
> > +
> > +CFQ is the only exception where one can not disable fair queuing, as it is
> > +needed for providing fairness among various threads even in non-hierarchical
> > +mode.
> > +
> > +Various user visible config options
> > +===================================
> > +CONFIG_IOSCHED_NOOP_HIER
> > +	- Enables hierarchical fair queuing in noop. Not selecting this option
> > +	  leads to the old behavior of noop.
> > +
> > +CONFIG_IOSCHED_DEADLINE_HIER
> > +	- Enables hierarchical fair queuing in deadline. Not selecting this
> > +	  option leads to the old behavior of deadline.
> > +
> > +CONFIG_IOSCHED_AS_HIER
> > +	- Enables hierarchical fair queuing in AS. Not selecting this option
> > +	  leads to the old behavior of AS.
> > +
> > +CONFIG_IOSCHED_CFQ_HIER
> > +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
> > +	  still does fair queuing among various queues but it is flat and not
> > +	  hierarchical.
> > +
> > +Config options selected automatically
> > +=====================================
> > +These config options are not user visible and are selected/deselected
> > +automatically based on IO scheduler configurations.
> > +
> > +CONFIG_ELV_FAIR_QUEUING
> > +	- Enables/Disables the fair queuing logic at elevator layer.
> > +
> > +CONFIG_GROUP_IOSCHED
> > +	- Enables/Disables hierarchical queuing and associated cgroup bits.
> > +
> > +TODO
> > +====
> > +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > +- Convert cgroup ioprio to notion of weight.
> > +- Anticipatory code will need more work. It is not working properly currently
> > +  and needs more thought.
> > +- Use of bio-cgroup patches.
> > +- Use of Nauman's per cgroup request descriptor patches.
> > +
> > +HOWTO
> > +=====
> > +So far I have done very simple testing of running two dd threads in two
> > +different cgroups. Here is what you can do.
> > +
> > +- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
> > +	CONFIG_IOSCHED_CFQ_HIER=y
> > +
> > +- Compile and boot into kernel and mount IO controller.
> > +
> > +	mount -t cgroup -o io none /cgroup
> > +
> > +- Create two cgroups
> > +	mkdir -p /cgroup/test1/ /cgroup/test2
> > +
> > +- Set io priority of group test1 and test2
> > +	echo 0 > /cgroup/test1/io.ioprio
> > +	echo 4 > /cgroup/test2/io.ioprio
> > +
> > +- Create two files of the same size (say 512MB each) on the same disk (file1,
> > +  file2) and launch two dd threads in different cgroups to read those files.
> > +  Make sure the right io scheduler is being used for the block device where
> > +  the files are present (the one you compiled in hierarchical mode).
> > +
> > +	echo 1 > /proc/sys/vm/drop_caches
> > +
> > +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> > +	echo $! > /cgroup/test1/tasks
> > +	cat /cgroup/test1/tasks
> > +
> > +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> > +	echo $! > /cgroup/test2/tasks
> > +	cat /cgroup/test2/tasks
> > +
> > +- First dd should finish first.
> > +
> > +Some Test Results
> > +=================
> > +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
> > +
> > +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> > +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> > +
> > +- Three dd in three cgroups with prio 0, 4, 4.
> > +
> > +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> > +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> > +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
> 
> Hi Vivek,
> 
> I would be interested in knowing if these are the results expected?
> 

Hi Dhaval, 

Good question. Keeping the current expectations in mind, yes, these are the
expected results. To begin with, the current expectation is to try to emulate
cfq behavior: the kind of service differentiation we get between threads of
different priority is the same kind of service differentiation we should get
from different cgroups.
 
Having said that, in theory a more accurate measure would be the amount
of actual disk time a queue/cgroup got. I have put in a tracing message
to keep track of the total service received by a queue. If you run "blktrace"
then you can see that. Ideally, the total service received by two threads
over a period of time should be in the same proportion as their cgroup
weights.
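
As a rough sketch of how to watch those messages (just a blktrace/blkparse
pipeline; the exact message text is whatever the patchset prints, and /dev/sdb
is only a stand-in for the disk the test files live on):

	# live-trace the disk the dd's are reading from and filter the
	# fair queuing trace messages
	blktrace -d /dev/sdb -o - | blkparse -i - | grep -i total_service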

It will not be easy to achieve that given the constraints we have in
terms of how accurately we can account for the disk time actually used by a
queue in certain situations. So to begin with I am targeting getting the
same kind of service differentiation between cgroups as cfq provides
between threads, and then slowly refining it to see how close one can come
to getting accurate numbers in terms of the "total_service" received by
each queue.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12 10:00     ` Dhaval Giani
       [not found]       ` <20090312100054.GA8024-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2009-03-12 14:04       ` Vivek Goyal
       [not found]         ` <20090312140450.GE10919-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-03-18  7:23         ` Gui Jianfeng
  1 sibling, 2 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 14:04 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, balbir, linux-kernel, containers,
	akpm, menage, peterz

On Thu, Mar 12, 2009 at 03:30:54PM +0530, Dhaval Giani wrote:
> On Wed, Mar 11, 2009 at 09:56:46PM -0400, Vivek Goyal wrote:
> > o Documentation for io-controller.
> > 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
> >  1 files changed, 221 insertions(+), 0 deletions(-)
> >  create mode 100644 Documentation/block/io-controller.txt
> > 
> > diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
> > new file mode 100644
> > index 0000000..8884c5a
> > --- /dev/null
> > +++ b/Documentation/block/io-controller.txt
> > @@ -0,0 +1,221 @@
> > +				IO Controller
> > +				=============
> > +
> > +Overview
> > +========
> > +
> > +This patchset implements a proportional weight IO controller. That is, one
> > +can create cgroups and assign prio/weights to those cgroups, and a task group
> > +will get access to the disk in proportion to the weight of the group.
> > +
> > +These patches modify the elevator layer and individual IO schedulers to do
> > +IO control, hence this io controller works only on block devices which use
> > +one of the standard io schedulers; it can not be used with arbitrary logical
> > +block devices.
> > +
> > +The assumption/thought behind modifying the IO schedulers is that resource
> > +control is needed only on leaf nodes, where the actual contention for
> > +resources is present, and not on intermediate logical block devices.
> > +
> > +Consider the following hypothetical scenario. Let's say there are three physical
> > +disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
> > +created on top of these. Some part of sdb is in lv0 and some part is in lv1.
> > +
> > +			    lv0      lv1
> > +			  /	\  /     \
> > +			sda      sdb      sdc
> > +
> > +Also consider following cgroup hierarchy
> > +
> > +				root
> > +				/   \
> > +			       A     B
> > +			      / \    / \
> > +			     T1 T2  T3  T4
> > +
> > +A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
> > +Assume T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
> > +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
> > +IO control on intermediate logical block nodes (lv0, lv1).
> > +
> > +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
> > +only, there will not be any contention for resources between groups A and B if
> > +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
> > +the IO scheduler associated with sdb will distribute disk bandwidth to
> > +groups A and B in proportion to their weights.
> > +
> > +CFQ already has the notion of fairness and it provides differential disk
> > +access based on the priority and class of the task. It is just that it is
> > +flat, and with cgroups it needs to be made hierarchical.
> > +
> > +The rest of the IO schedulers (noop, deadline and AS) don't have any notion
> > +of fairness among various threads.
> > +
> > +One of the concerns raised with modifying IO schedulers was that we don't
> > +want to replicate the code in all the IO schedulers. These patches share
> > +the fair queuing code which has been moved to a common layer (elevator
> > +layer). Hence we don't end up replicating code across IO schedulers.
> > +
> > +Design
> > +======
> > +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
> > +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
> > +B-WF2Q+ algorithm for fair queuing.
> > +
> > +Why BFQ?
> > +
> > +- Not sure if the weighted round robin logic of CFQ can be easily extended to
> > +  hierarchical mode. One of the issues is that we can not keep dividing
> > +  the time slice of a parent group among its children. The deeper we go in
> > +  the hierarchy, the smaller the time slice gets.
> > +
> > +  One of the ways to implement hierarchical support could be to keep track
> > +  of virtual time and service provided to a queue/group and select a queue/group
> > +  for service based on any of the various available algorithms.
> > +
> > +  BFQ already had support for hierarchical scheduling, so taking those patches
> > +  was easier.
> > +
> > +- BFQ was designed to provide tighter bounds/delay w.r.t. service provided
> > +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
> > +
> > +  Note: BFQ originally used the amount of IO done (number of sectors) as the
> > +        notion of service provided. IOW, it tried to provide fairness in
> > +        terms of actual IO done and not in terms of the actual time disk
> > +	access was given to a queue.
> > +
> > +	This patchset modified BFQ to provide fairness in the time domain
> > +	because that's what CFQ does. So the idea was to try not to deviate
> > +	too much from the CFQ behavior initially.
> > +
> > +	Providing fairness in the time domain makes accounting tricky because,
> > +	due to command queueing, at one time there might be multiple requests
> > +	from different queues and there is no easy way to find out how much
> > +	disk time was actually consumed by the requests of a particular
> > +	queue. More about this in the comments in the source code.
> > +
> > +So it is yet to be seen whether changing to the time domain still retains the
> > +BFQ guarantees or not.
> > +
> > +From the data structure point of view, one can think of a tree per device,
> > +where io groups and io queues hang and are scheduled using the B-WF2Q+
> > +algorithm. An io_queue is the end queue where requests are actually stored
> > +and dispatched from (like cfqq).
> > +
> > +These io queues are primarily created and managed by the end io schedulers
> > +depending on their semantics. For example, the noop, deadline and AS
> > +ioschedulers keep one io queue per cgroup and cfq keeps one io queue per
> > +io_context in a cgroup (apart from async queues).
> > +
> > +A request is mapped to an io group by the elevator layer, and which io queue
> > +it is mapped to within the group depends on the ioscheduler. Currently the
> > +"current" task is used to determine the cgroup (hence io group) of the
> > +request. Down the line we need to make use of the bio-cgroup patches to map
> > +delayed writes to the right group.
> > +
> > +Going back to old behavior
> > +==========================
> > +In the new scheme of things we are essentially creating hierarchical fair
> > +queuing logic in the elevator layer and changing the IO schedulers to make use
> > +of that logic so that the end IO schedulers start supporting hierarchical
> > +scheduling.
> > +
> > +The elevator layer continues to support the old interfaces. So even if fair
> > +queuing is enabled at the elevator layer, one can have both the new
> > +hierarchical scheduler as well as the old non-hierarchical scheduler operating.
> > +
> > +Also, noop, deadline and AS have the option of enabling hierarchical
> > +scheduling. If it is selected, fair queuing is done in a hierarchical manner.
> > +If hierarchical scheduling is disabled, noop, deadline and AS should retain
> > +their existing behavior.
> > +
> > +CFQ is the only exception where one can not disable fair queuing, as it is
> > +needed for providing fairness among various threads even in non-hierarchical
> > +mode.
> > +
> > +Various user visible config options
> > +===================================
> > +CONFIG_IOSCHED_NOOP_HIER
> > +	- Enables hierarchical fair queuing in noop. Not selecting this option
> > +	  leads to the old behavior of noop.
> > +
> > +CONFIG_IOSCHED_DEADLINE_HIER
> > +	- Enables hierarchical fair queuing in deadline. Not selecting this
> > +	  option leads to the old behavior of deadline.
> > +
> > +CONFIG_IOSCHED_AS_HIER
> > +	- Enables hierarchical fair queuing in AS. Not selecting this option
> > +	  leads to the old behavior of AS.
> > +
> > +CONFIG_IOSCHED_CFQ_HIER
> > +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
> > +	  still does fair queuing among various queues but it is flat and not
> > +	  hierarchical.
> > +
> > +Config options selected automatically
> > +=====================================
> > +These config options are not user visible and are selected/deselected
> > +automatically based on IO scheduler configurations.
> > +
> > +CONFIG_ELV_FAIR_QUEUING
> > +	- Enables/Disables the fair queuing logic at elevator layer.
> > +
> > +CONFIG_GROUP_IOSCHED
> > +	- Enables/Disables hierarchical queuing and associated cgroup bits.
> > +
> > +TODO
> > +====
> > +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > +- Convert cgroup ioprio to notion of weight.
> > +- Anticipatory code will need more work. It is not working properly currently
> > +  and needs more thought.
> > +- Use of bio-cgroup patches.
> > +- Use of Nauman's per cgroup request descriptor patches.
> > +
> > +HOWTO
> > +=====
> > +So far I have done very simple testing of running two dd threads in two
> > +different cgroups. Here is what you can do.
> > +
> > +- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
> > +	CONFIG_IOSCHED_CFQ_HIER=y
> > +
> > +- Compile and boot into kernel and mount IO controller.
> > +
> > +	mount -t cgroup -o io none /cgroup
> > +
> > +- Create two cgroups
> > +	mkdir -p /cgroup/test1/ /cgroup/test2
> > +
> > +- Set io priority of group test1 and test2
> > +	echo 0 > /cgroup/test1/io.ioprio
> > +	echo 4 > /cgroup/test2/io.ioprio
> > +
> > +- Create two files of the same size (say 512MB each) on the same disk (file1,
> > +  file2) and launch two dd threads in different cgroups to read those files.
> > +  Make sure the right io scheduler is being used for the block device where
> > +  the files are present (the one you compiled in hierarchical mode).
> > +
> > +	echo 1 > /proc/sys/vm/drop_caches
> > +
> > +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> > +	echo $! > /cgroup/test1/tasks
> > +	cat /cgroup/test1/tasks
> > +
> > +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> > +	echo $! > /cgroup/test2/tasks
> > +	cat /cgroup/test2/tasks
> > +
> > +- First dd should finish first.
> > +
> > +Some Test Results
> > +=================
> > +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
> > +
> > +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> > +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> > +
> > +- Three dd in three cgroups with prio 0, 4, 4.
> > +
> > +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> > +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> > +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
> 
> Hi Vivek,
> 
> I would be interested in knowing if these are the results expected?
> 

Hi Dhaval, 

Good question. Keeping the current expectations in mind, yes, these are the
expected results. To begin with, the current expectation is to try to emulate
cfq behavior: the kind of service differentiation we get between threads of
different priority is the same kind of service differentiation we should get
from different cgroups.

Having said that, in theory a more accurate measure would be the amount
of actual disk time a queue/cgroup got. I have put in a tracing message
to keep track of the total service received by a queue. If you run "blktrace"
then you can see that. Ideally, the total service received by two threads
over a period of time should be in the same proportion as their cgroup
weights.

It will not be easy to achieve that given the constraints we have in
terms of how accurately we can account for the disk time actually used by a
queue in certain situations. So to begin with I am targeting getting the
same kind of service differentiation between cgroups as cfq provides
between threads, and then slowly refining it to see how close one can come
to getting accurate numbers in terms of the "total_service" received by
each queue.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12 10:24         ` Peter Zijlstra
  (?)
  (?)
@ 2009-03-12 14:09         ` Vivek Goyal
  -1 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Thu, Mar 12, 2009 at 11:24:50AM +0100, Peter Zijlstra wrote:
> On Wed, 2009-03-11 at 21:56 -0400, Vivek Goyal wrote:
> > +Going back to old behavior
> > +==========================
> > +In new scheme of things essentially we are creating hierarchical fair
> > +queuing logic in elevator layer and changing IO schedulers to make use of
> > +that logic so that end IO schedulers start supporting hierarchical scheduling.
> > +
> > +Elevator layer continues to support the old interfaces. So even if fair queuing
> > +is enabled at elevator layer, one can have both new hierarchical scheduler as
> > +well as old non-hierarchical scheduler operating.
> > +
> > +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> > +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> > +scheduling is disabled, noop, deadline and AS should retain their existing
> > +behavior.
> > +
> > +CFQ is the only exception where one can not disable fair queuing as it is
> > +needed for providing fairness among various threads even in non-hierarchical
> > +mode.
> > +
> > +Various user visible config options
> > +===================================
> > +CONFIG_IOSCHED_NOOP_HIER
> > +       - Enables hierarchical fair queuing in noop. Not selecting this option
> > +         leads to the old behavior of noop.
> > +
> > +CONFIG_IOSCHED_DEADLINE_HIER
> > +       - Enables hierarchical fair queuing in deadline. Not selecting this
> > +         option leads to the old behavior of deadline.
> > +
> > +CONFIG_IOSCHED_AS_HIER
> > +       - Enables hierarchical fair queuing in AS. Not selecting this option
> > +         leads to the old behavior of AS.
> > +
> > +CONFIG_IOSCHED_CFQ_HIER
> > +       - Enables hierarchical fair queuing in CFQ. Not selecting this option
> > +         still does fair queuing among various queues but it is flat and not
> > +         hierarchical.
> 
> One worry I have is that these are compile time switches. Is there any
> way you can get the old AS/DEADLINE back when these are enabled but
> you're not actively using cgroups?

Hi Peter,

In principle, if one is not using cgroups, there is only one io queue
in the root group and most likely we should achieve the same behavior
as the old schedulers. It is just that some extra code gets executed at
runtime.

I have not yet got a chance to do some numbers, but I think this path
can be optimized enough that at run time we effectively don't see any
significant performance penalty and the behavior of the schedulers is
almost the same as the old ones.
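
For what it's worth, a minimal sanity check (just a sketch; sdb and the
zerofile path are the stand-ins used in the HOWTO) would be to run the same
sequential read on a kernel built with and without the *_HIER option, without
putting the task into any cgroup, and compare the numbers:

	echo deadline > /sys/block/sdb/queue/scheduler
	echo 1 > /proc/sys/vm/drop_caches
	time dd if=/mnt/lv0/zerofile1 of=/dev/null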

Thanks
Vivek 

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12 10:24         ` Peter Zijlstra
  (?)
@ 2009-03-12 14:09         ` Vivek Goyal
  -1 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers, akpm, menage

On Thu, Mar 12, 2009 at 11:24:50AM +0100, Peter Zijlstra wrote:
> On Wed, 2009-03-11 at 21:56 -0400, Vivek Goyal wrote:
> > +Going back to old behavior
> > +==========================
> > +In new scheme of things essentially we are creating hierarchical fair
> > +queuing logic in elevator layer and changing IO schedulers to make use of
> > +that logic so that end IO schedulers start supporting hierarchical scheduling.
> > +
> > +Elevator layer continues to support the old interfaces. So even if fair queuing
> > +is enabled at elevator layer, one can have both new hierarchical scheduler as
> > +well as old non-hierarchical scheduler operating.
> > +
> > +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> > +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> > +scheduling is disabled, noop, deadline and AS should retain their existing
> > +behavior.
> > +
> > +CFQ is the only exception where one can not disable fair queuing as it is
> > +needed for providing fairness among various threads even in non-hierarchical
> > +mode.
> > +
> > +Various user visible config options
> > +===================================
> > +CONFIG_IOSCHED_NOOP_HIER
> > +       - Enables hierarchical fair queuing in noop. Not selecting this option
> > +         leads to the old behavior of noop.
> > +
> > +CONFIG_IOSCHED_DEADLINE_HIER
> > +       - Enables hierarchical fair queuing in deadline. Not selecting this
> > +         option leads to the old behavior of deadline.
> > +
> > +CONFIG_IOSCHED_AS_HIER
> > +       - Enables hierarchical fair queuing in AS. Not selecting this option
> > +         leads to the old behavior of AS.
> > +
> > +CONFIG_IOSCHED_CFQ_HIER
> > +       - Enables hierarchical fair queuing in CFQ. Not selecting this option
> > +         still does fair queuing among various queues but it is flat and not
> > +         hierarchical.
> 
> One worry I have is that these are compile time switches. Is there any
> way you can get the old AS/DEADLINE back when these are enabled but
> you're not actively using cgroups?

Hi Peter,

In principle, if one is not using cgroups, there is only one io queue
in the root group and most likely we should achieve the same behavior
as the old schedulers. It is just that some extra code gets executed at
runtime.

I have not yet got a chance to do some numbers, but I think this path
can be optimized enough that at run time we effectively don't see any
significant performance penalty and the behavior of the schedulers is
almost the same as the old ones.

Thanks
Vivek 

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12 14:04       ` Vivek Goyal
@ 2009-03-12 14:48             ` Fabio Checconi
  2009-03-18  7:23         ` Gui Jianfeng
  1 sibling, 0 replies; 190+ messages in thread
From: Fabio Checconi @ 2009-03-12 14:48 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA, Dhaval Giani,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

> From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Date: Thu, Mar 12, 2009 10:04:50AM -0400
>
> On Thu, Mar 12, 2009 at 03:30:54PM +0530, Dhaval Giani wrote:
...
> > > +Some Test Results
> > > +=================
> > > +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
> > > +
> > > +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> > > +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> > > +
> > > +- Three dd in three cgroups with prio 0, 4, 4.
> > > +
> > > +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> > > +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> > > +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
> > 
> > Hi Vivek,
> > 
> > I would be interested in knowing if these are the results expected?
> > 
> 
> Hi Dhaval, 
> 
> Good question. Keeping the current expectations in mind, yes, these are the
> expected results. To begin with, the current expectation is to try to emulate
> cfq behavior: the kind of service differentiation we get between threads of
> different priority is the same kind of service differentiation we should get
> from different cgroups.
>  
> Having said that, in theory a more accurate measure would be the amount
> of actual disk time a queue/cgroup got. I have put in a tracing message
> to keep track of the total service received by a queue. If you run "blktrace"
> then you can see that. Ideally, the total service received by two threads
> over a period of time should be in the same proportion as their cgroup
> weights.
> 
> It will not be easy to achieve that given the constraints we have in
> terms of how accurately we can account for the disk time actually used by a
> queue in certain situations. So to begin with I am targeting getting the
> same kind of service differentiation between cgroups as cfq provides
> between threads, and then slowly refining it to see how close one can come
> to getting accurate numbers in terms of the "total_service" received by
> each queue.
> 

There is also another issue to consider; to achieve a proper weighted
distribution of ``service time'' (assuming that service time can be
attributed accurately) over any time window, we also need the tasks to
actually compete for disk service during this window.

For example, in the case above with three tasks, the highest weight task
terminates earlier than the other ones, so we have two time frames:
during the first one disk time is divided among all the three tasks
according to their weights, then the highest weight one terminates,
and disk time is divided (equally) among the remaining ones.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
@ 2009-03-12 14:48             ` Fabio Checconi
  0 siblings, 0 replies; 190+ messages in thread
From: Fabio Checconi @ 2009-03-12 14:48 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dhaval Giani, nauman, dpshah, lizf, mikew, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, balbir, linux-kernel, containers,
	akpm, menage, peterz

> From: Vivek Goyal <vgoyal@redhat.com>
> Date: Thu, Mar 12, 2009 10:04:50AM -0400
>
> On Thu, Mar 12, 2009 at 03:30:54PM +0530, Dhaval Giani wrote:
...
> > > +Some Test Results
> > > +=================
> > > +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
> > > +
> > > +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> > > +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> > > +
> > > +- Three dd in three cgroups with prio 0, 4, 4.
> > > +
> > > +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> > > +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> > > +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
> > 
> > Hi Vivek,
> > 
> > I would be interested in knowing if these are the results expected?
> > 
> 
> Hi Dhaval, 
> 
> Good question. Keeping the current expectations in mind, yes, these are the
> expected results. To begin with, the current expectation is to try to emulate
> cfq behavior: the kind of service differentiation we get between threads of
> different priority is the same kind of service differentiation we should get
> from different cgroups.
>  
> Having said that, in theory a more accurate measure would be the amount
> of actual disk time a queue/cgroup got. I have put in a tracing message
> to keep track of the total service received by a queue. If you run "blktrace"
> then you can see that. Ideally, the total service received by two threads
> over a period of time should be in the same proportion as their cgroup
> weights.
> 
> It will not be easy to achieve that given the constraints we have in
> terms of how accurately we can account for the disk time actually used by a
> queue in certain situations. So to begin with I am targeting getting the
> same kind of service differentiation between cgroups as cfq provides
> between threads, and then slowly refining it to see how close one can come
> to getting accurate numbers in terms of the "total_service" received by
> each queue.
> 

There is also another issue to consider; to achieve a proper weighted
distribution of ``service time'' (assuming that service time can be
attributed accurately) over any time window, we also need the tasks to
actually compete for disk service during this window.

For example, in the case above with three tasks, the highest weight task
terminates earlier than the other ones, so we have two time frames:
during the first one disk time is divided among all the three tasks
according to their weights, then the highest weight one terminates,
and disk time is divided (equally) among the remaining ones.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-03-12  7:11       ` anqin
@ 2009-03-12 14:57             ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 14:57 UTC (permalink / raw)
  To: anqin
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Thu, Mar 12, 2009 at 03:11:07PM +0800, anqin wrote:
> > Why? When this is ready to be merged, then it should be based on Jens' block-tree,
> > or akpm's mm tree. And this version currently is based on 2.6.29-rc4, so if you
> > want to try it out, just prepare a 2.6.29-rc4 kernel tree.
> >
> 
> I have checked the LKML and see these patches (on web pages) are based on
> 2.6.27. It seemed too old.
> 
> You mean that the codes have new patch files in 2.6.29-rc4?

I guess it was my mistake that I did not specify the kernel version in my mail.

These patches should apply on top of 2.6.29-rc7. 
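
In case it helps, one way to prepare a matching tree (the patches/ directory
below is just a placeholder for wherever you saved the mails of this series):

	git checkout -b io-controller v2.6.29-rc7
	git am patches/*.patch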

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
@ 2009-03-12 14:57             ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 14:57 UTC (permalink / raw)
  To: anqin
  Cc: Li Zefan, Takuya Yoshikawa, oz-kernel, paolo.valente,
	linux-kernel, dhaval, containers, menage, jmoyer, fchecconi,
	arozansk, jens.axboe, akpm, fernando, balbir

On Thu, Mar 12, 2009 at 03:11:07PM +0800, anqin wrote:
> > Why? When this is ready to be merged, then it should be based on Jens' block-tree,
> > or akpm's mm tree. And this version currently is based on 2.6.29-rc4, so if you
> > want to try it out, just prepare a 2.6.29-rc4 kernel tree.
> >
> 
> I have checked the LKML and see these patches (on web pages) are based on
> 2.6.27. It seemed too old.
> 
> You mean that the codes have new patch files in 2.6.29-rc4?

I guess it was my mistake that I did not specify the kernel version in my mail.

These patches should apply on top of 2.6.29-rc7. 

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]             ` <20090312144842.GS12361-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
@ 2009-03-12 15:03               ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 15:03 UTC (permalink / raw)
  To: Fabio Checconi
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA, Dhaval Giani,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Thu, Mar 12, 2009 at 03:48:42PM +0100, Fabio Checconi wrote:
> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Date: Thu, Mar 12, 2009 10:04:50AM -0400
> >
> > On Thu, Mar 12, 2009 at 03:30:54PM +0530, Dhaval Giani wrote:
> ...
> > > > +Some Test Results
> > > > +=================
> > > > +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
> > > > +
> > > > +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> > > > +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> > > > +
> > > > +- Three dd in three cgroups with prio 0, 4, 4.
> > > > +
> > > > +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> > > > +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> > > > +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
> > > 
> > > Hi Vivek,
> > > 
> > > I would be interested in knowing if these are the results expected?
> > > 
> > 
> > Hi Dhaval, 
> > 
> > Good question. Keeping the current expectations in mind, yes, these are the
> > expected results. To begin with, the current expectation is to try to emulate
> > cfq behavior: the kind of service differentiation we get between threads of
> > different priority is the same kind of service differentiation we should get
> > from different cgroups.
> >  
> > Having said that, in theory a more accurate measure would be the amount
> > of actual disk time a queue/cgroup got. I have put in a tracing message
> > to keep track of the total service received by a queue. If you run "blktrace"
> > then you can see that. Ideally, the total service received by two threads
> > over a period of time should be in the same proportion as their cgroup
> > weights.
> > 
> > It will not be easy to achieve that given the constraints we have in
> > terms of how accurately we can account for the disk time actually used by a
> > queue in certain situations. So to begin with I am targeting getting the
> > same kind of service differentiation between cgroups as cfq provides
> > between threads, and then slowly refining it to see how close one can come
> > to getting accurate numbers in terms of the "total_service" received by
> > each queue.
> > 
> 
> There is also another issue to consider; to achieve a proper weighted
> distribution of ``service time'' (assuming that service time can be
> attributed accurately) over any time window, we need also that the tasks
> actually compete for disk service during this window.
> 
> For example, in the case above with three tasks, the highest weight task
> terminates earlier than the other ones, so we have two time frames:
> during the first one disk time is divided among all the three tasks
> according to their weights, then the highest weight one terminates,
> and disk time is divided (equally) among the remaining ones.

True. But we can do one thing. I am printing total_service every time
a queue expires (elv_ioq_served()). So when the first task exits, at that
point in time we can see how much service each competing queue has
received till then, and it should be proportionate to the queue's weight.
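
As a rough illustration of that comparison (assuming the trace was stored to
sdb.blktrace.* files; the message format, including the queue name and the
service figure, is an assumption about what the patchset prints):

	# replay the stored trace and keep only the fair queuing messages
	blkparse -i sdb | grep total_service > service.log
	# then compare the last figure logged for each queue before the first
	# dd exits; the ratio should roughly match the cgroup weights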

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12 14:48             ` Fabio Checconi
  (?)
@ 2009-03-12 15:03             ` Vivek Goyal
  -1 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 15:03 UTC (permalink / raw)
  To: Fabio Checconi
  Cc: Dhaval Giani, nauman, dpshah, lizf, mikew, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, balbir, linux-kernel, containers,
	akpm, menage, peterz

On Thu, Mar 12, 2009 at 03:48:42PM +0100, Fabio Checconi wrote:
> > From: Vivek Goyal <vgoyal@redhat.com>
> > Date: Thu, Mar 12, 2009 10:04:50AM -0400
> >
> > On Thu, Mar 12, 2009 at 03:30:54PM +0530, Dhaval Giani wrote:
> ...
> > > > +Some Test Results
> > > > +=================
> > > > +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
> > > > +
> > > > +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> > > > +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> > > > +
> > > > +- Three dd in three cgroups with prio 0, 4, 4.
> > > > +
> > > > +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> > > > +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> > > > +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
> > > 
> > > Hi Vivek,
> > > 
> > > I would be interested in knowing if these are the results expected?
> > > 
> > 
> > Hi Dhaval, 
> > 
> > Good question. Keeping the current expectations in mind, yes, these are the
> > expected results. To begin with, the current expectation is to try to emulate
> > cfq behavior: the kind of service differentiation we get between threads of
> > different priority is the same kind of service differentiation we should get
> > from different cgroups.
> >  
> > Having said that, in theory a more accurate measure would be the amount
> > of actual disk time a queue/cgroup got. I have put in a tracing message
> > to keep track of the total service received by a queue. If you run "blktrace"
> > then you can see that. Ideally, the total service received by two threads
> > over a period of time should be in the same proportion as their cgroup
> > weights.
> > 
> > It will not be easy to achieve that given the constraints we have in
> > terms of how accurately we can account for the disk time actually used by a
> > queue in certain situations. So to begin with I am targeting getting the
> > same kind of service differentiation between cgroups as cfq provides
> > between threads, and then slowly refining it to see how close one can come
> > to getting accurate numbers in terms of the "total_service" received by
> > each queue.
> > 
> 
> There is also another issue to consider; to achieve a proper weighted
> distribution of ``service time'' (assuming that service time can be
> attributed accurately) over any time window, we need also that the tasks
> actually compete for disk service during this window.
> 
> For example, in the case above with three tasks, the highest weight task
> terminates earlier than the other ones, so we have two time frames:
> during the first one disk time is divided among all the three tasks
> according to their weights, then the highest weight one terminates,
> and disk time is divided (equally) among the remaining ones.

True. But we can do one thing. I am printing total_service every time
a queue expires (elv_ioq_served()). So when the first task exits, at that
point in time we can see how much service each competing queue has
received till then, and it should be proportionate to the queue's weight.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]         ` <20090312001146.74591b9d.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2009-03-12 10:07           ` Ryo Tsuruta
@ 2009-03-12 18:01           ` Vivek Goyal
  1 sibling, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 18:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil, Andrea Righi,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, menage-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Thu, Mar 12, 2009 at 12:11:46AM -0700, Andrew Morton wrote:
> On Wed, 11 Mar 2009 21:56:46 -0400 Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> > +Currently "current" task
> > +is used to determine the cgroup (hence io group) of the request. Down the
> > +line we need to make use of bio-cgroup patches to map delayed writes to
> > +right group.
> 
> You handled this problem pretty neatly!
> 
> It's always been a BIG problem for all the io-controlling schemes, and
> most of them seem to have "handled" it in the above way :(
> 
> But for many workloads, writeback is the majority of the IO and it has
> always been the form of IO which has caused us the worst contention and
> latency problems.  So I don't think that we can proceed with _anything_
> until we at least have a convincing plan here.
> 

Hi Andrew,

Nauman is already maintaining the bio-cgroup patches (originally from the
valinux folks) on top of this patchset for attributing write requests to
the correct cgroup. We did not include those in the initial posting,
thinking that the patchset would bloat further.

We can also pull the bio-cgroup patches into this series to attribute writes
to the right cgroup.

> 
> Also..  there are so many IO controller implementations that I've lost
> track of who is doing what.  I do have one private report here that
> Andreas's controller "is incredibly productive for us and has allowed
> us to put twice as many users per server with faster times for all
> users".  Which is pretty stunning, although it should be viewed as a
> condemnation of the current code, I'm afraid.
> 

I had looked briefly at Andrea's implementation in the past. I will look
again. I had thought that this approach did not get much traction.

Some quick thoughts about this approach though.

- It is not a proportional weight controller. It is more of limiting
  bandwidth in absolute numbers for each cgroup on each disk.
 
  So each cgroup will define a rule for each disk in the system mentioning
  at what maximum rate that cgroup can issue IO to that disk and throttle
  the IO from that cgroup if rate has excedded.

  Above requirement can create configuration problems.

	- If there are large number of disks in system, per cgroup one shall
	  have to create rules for each disk. Until and unless admin knows
	  what applications are in which cgroup and strictly what disk
	  these applications do IO to and create rules for only those
 	  disks.

	- I think problem gets compounded if there is a hierarchy of
	  logical devices. I think in that case one shall have to create
	  rules for logical devices and not actual physical devices.

- Because it is not proportional weight distribution, if some
  cgroup is not using its planned BW, other group sharing the
  disk can not make use of spare BW.  
	
- I think one should know in advance the throughput rate of underlying media
  and also know competing applications so that one can statically define
  the BW assigned to each cgroup on each disk.

  This will be difficult. Effective BW extracted out of a rotational media
  is dependent on the seek pattern so one shall have to either try to make
  some conservative estimates and try to divide BW (we will not utilize disk
  fully) or take some peak numbers and divide BW (cgroup might not get the
  maximum rate configured).

- Above problems will comound when one goes for deeper hierarhical
  configurations.

I think for renewable resources like disk time, it might be a good idea
to do a proportional weight controller to ensure fairness at the same time
achive best throughput possible.

Andrea, please correct me if I have misunderstood the things.

> So my question is: what is the definitive list of
> proposed-io-controller-implementations and how do I cunningly get all
> you guys to check each others homework? :)
 
I will try to summarize some of the proposals I am aware of. 

- Elevator/IO scheduler modification based IO controllers
	- This proposal
	- cfq io scheduler based control (Satoshi Uchida, NEC)
	- One more cfq based io control (vasily, OpenVZ)
	- AS io scheduler based control (Naveen Gupta, Google)

- Io-throttling (Andrea Righi)
	- Max Bandwidth Controller

- dm-ioband (valinux)
	- Proportional weight IO controller.

- Generic IO controller (Vivek Goyal, RedHat)
	- My initial attempt to do proportional division of amount of bio
	  per cgroup at request queue level. This was inspired from
	  dm-ioband.

I think this proposal should hopefully meet the requirements as envisoned
by other elevator based IO controller solutions.

dm-ioband
---------
I have briefly looked at dm-ioband also and following were some of the
concerns I had raised in the past.

- Need of a dm device for every device we want to control

	- This requirement looks odd. It forces everybody to use dm-tools
	  and if there are lots of disks in the system, configuation is
	  pain.

- It does not support hiearhical grouping.

- Possibly can break the assumptions of underlying IO schedulers.

	- There is no notion of task classes. So tasks of all the classes
	  are at same level from resource contention point of view.
	  The only thing which differentiates them is cgroup weight. Which
	  does not answer the question that an RT task or RT cgroup should
	  starve the peer cgroup if need be as RT cgroup should get priority
	  access.

	- Because of FIFO release of buffered bios, it is possible that
	  task of lower priority gets more IO done than the task of higher
	  priority.

	- Buffering at multiple levels and FIFO dispatch can have more
	  interesting hard to solve issues.

		- Assume there is sequential reader and an aggressive
		  writer in the cgroup. It might happen that writer
		  pushed lot of write requests in the FIFO queue first
		  and then a read request from reader comes. Now it might
		  happen that cfq does not see this read request for a long
		  time (if cgroup weight is less) and this writer will 
		  starve the reader in this cgroup.

		  Even cfq anticipation logic will not help here because
		  when that first read request actually gets to cfq, cfq might
		  choose to idle for more read requests to come, but the
		  agreesive writer might have again flooded the FIFO queue
		  in the group and cfq will not see subsequent read request
		  for a long time and will unnecessarily idle for read.

- Task grouping logic
	- We already have the notion of cgroup where tasks can be grouped
	  in hierarhical manner. dm-ioband does not make full use of that
	  and comes up with own mechansim of grouping tasks (apart from
	  cgroup).  And there are odd ways of specifying cgroup id while
	  configuring the dm-ioband device.

	  IMHO, once somebody has created the cgroup hieararchy, any IO
	  controller logic should be able to internally read that hiearchy
	  and provide control. There should not be need of any other
	  configuration utity on top of cgroup.

	  My RFC patches had tried to get rid of this external
	  configuration requirement.

- Task and Groups can not be treated at same level.

	- Because at any second level solution we are controlling bio
	  per cgroup and don't have any notion of which task queue bio
	  belongs to, one can not treat task and group  at same level.
	
	  What I meant is following.

			root
			/ | \
		       1  2  A
			    / \
			   3   4

	In dm-ioband approach, at top level tasks 1 and 2 will get 50%
	of BW together and group A will get 50%. Ideally along the lines
	of cpu controller, I would expect it to be 33% each for task 1
	task 2 and group A.

	This can create interesting scenarios where assumg task1 is
	an RT class task. Now one would expect task 1 get all the BW
	possible starving task 2 and group A, but that will not be the
	case and task1 will get 50% of BW.

 	Not that it is critically important but it would probably be
	nice if we can maitain same semantics as cpu controller. In
	elevator layer solution we can do it at least for CFQ scheduler
	as it maintains separate io queue per io context. 	

	This is in general an issue for any 2nd level IO controller which
	only accounts for io groups and not for io queues per process.

- We will end copying a lot of code/logic from cfq

	- To address many of the concerns like multi class scheduler
	  we will end up duplicating code of IO scheduler. Why can't
	  we have a one point hierarchical IO scheduling (This patchset).
Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12  7:11         ` Andrew Morton
                           ` (2 preceding siblings ...)
  (?)
@ 2009-03-12 18:01         ` Vivek Goyal
  2009-03-16  8:40           ` Ryo Tsuruta
                             ` (2 more replies)
  -1 siblings, 3 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-12 18:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, balbir, linux-kernel,
	containers, menage, peterz, Andrea Righi

On Thu, Mar 12, 2009 at 12:11:46AM -0700, Andrew Morton wrote:
> On Wed, 11 Mar 2009 21:56:46 -0400 Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > +Currently "current" task
> > +is used to determine the cgroup (hence io group) of the request. Down the
> > +line we need to make use of bio-cgroup patches to map delayed writes to
> > +right group.
> 
> You handled this problem pretty neatly!
> 
> It's always been a BIG problem for all the io-controlling schemes, and
> most of them seem to have "handled" it in the above way :(
> 
> But for many workloads, writeback is the majority of the IO and it has
> always been the form of IO which has caused us the worst contention and
> latency problems.  So I don't think that we can proceed with _anything_
> until we at least have a convincing plan here.
> 

Hi Andrew,

Nauman is already maintaining the bio-cgroup patches (originally from
the valinux folks) on top of this patchset for attributing write requests
to the correct cgroup. We did not include those in the initial posting,
thinking that the patchset would bloat further.

We can also pull the bio-cgroup patches into this series to attribute
writes to the right cgroup.

> 
> Also..  there are so many IO controller implementations that I've lost
> track of who is doing what.  I do have one private report here that
> Andreas's controller "is incredibly productive for us and has allowed
> us to put twice as many users per server with faster times for all
> users".  Which is pretty stunning, although it should be viewed as a
> condemnation of the current code, I'm afraid.
> 

I had looked briefly at Andrea's implementation in the past. I will look
again. I had thought that this approach did not get much traction.

Some quick thoughts about this approach though.

- It is not a proportional weight controller. It is more about limiting
  bandwidth in absolute numbers for each cgroup on each disk.

  So each cgroup will define a rule for each disk in the system, specifying
  the maximum rate at which that cgroup can issue IO to that disk, and the
  IO from that cgroup is throttled if the rate is exceeded.

  The above requirement can create configuration problems.

        - If there are a large number of disks in the system, each cgroup
          will need a rule for every disk, unless the admin knows exactly
          which applications are in which cgroup and which disks those
          applications do IO to, and creates rules only for those disks.

        - I think the problem gets compounded if there is a hierarchy of
          logical devices. In that case one will have to create rules for
          the logical devices and not the actual physical devices.

- Because it is not proportional weight distribution, if some cgroup is
  not using its planned BW, other groups sharing the disk cannot make use
  of the spare BW.

- I think one should know in advance the throughput rate of the underlying
  media and also know the competing applications, so that one can statically
  define the BW assigned to each cgroup on each disk.

  This will be difficult. The effective BW extracted from rotational media
  depends on the seek pattern, so one has to either make conservative
  estimates and divide that BW (the disk will not be fully utilized) or take
  peak numbers and divide those (a cgroup might not get the maximum rate
  configured).

- The above problems will compound when one goes for deeper hierarchical
  configurations.

I think for renewable resources like disk time, a proportional weight
controller is a good way to ensure fairness while at the same time achieving
the best throughput possible.
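
To make the contrast concrete, here is a toy sketch (made-up numbers, not
code from io-throttling or from this patchset) of why a proportional weight
scheme keeps the disk busy while absolute limits can leave it underutilized
when one group goes idle:

  DISK_BW = 100.0   # MB/s the disk can sustain (hypothetical)

  def absolute_limits(caps, active):
      # Each active group is clipped to its configured cap; the headroom of
      # idle groups is not reusable, so part of the disk can sit unused.
      return {g: min(caps[g], DISK_BW) for g in active}

  def proportional_weights(weights, active):
      # Backlogged groups split the full capacity by weight, so spare
      # capacity from idle groups is redistributed automatically.
      total = sum(weights[g] for g in active)
      return {g: DISK_BW * weights[g] / total for g in active}

  caps = {"A": 40.0, "B": 30.0, "C": 30.0}      # per-group MB/s limits
  weights = {"A": 4, "B": 3, "C": 3}
  active = ["A", "B"]                           # C is not issuing IO right now

  print(absolute_limits(caps, active))          # A: 40, B: 30 -> 30 MB/s unused
  print(proportional_weights(weights, active))  # A: ~57, B: ~43 -> disk fully used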

Andrea, please correct me if I have misunderstood anything.

> So my question is: what is the definitive list of
> proposed-io-controller-implementations and how do I cunningly get all
> you guys to check each others homework? :)
 
I will try to summarize some of the proposals I am aware of. 

- Elevator/IO scheduler modification based IO controllers
	- This proposal
	- cfq io scheduler based control (Satoshi Uchida, NEC)
	- One more cfq based io control (vasily, OpenVZ)
	- AS io scheduler based control (Naveen Gupta, Google)

- Io-throttling (Andrea Righi)
	- Max Bandwidth Controller

- dm-ioband (valinux)
	- Proportional weight IO controller.

- Generic IO controller (Vivek Goyal, RedHat)
	- My initial attempt to do proportional division of amount of bio
	  per cgroup at request queue level. This was inspired from
	  dm-ioband.

I think this proposal should hopefully meet the requirements envisioned
by the other elevator based IO controller solutions.

dm-ioband
---------
I have also looked briefly at dm-ioband, and the following are some of the
concerns I have raised in the past.

- Need of a dm device for every device we want to control

        - This requirement looks odd. It forces everybody to use dm-tools,
          and if there are lots of disks in the system, configuration is a
          pain.

- It does not support hierarchical grouping.

- It can possibly break the assumptions of the underlying IO schedulers.

        - There is no notion of task classes, so tasks of all classes are
          at the same level from a resource contention point of view. The
          only thing which differentiates them is cgroup weight, which does
          not capture the requirement that an RT task or RT cgroup should
          be able to starve a peer cgroup if need be, as the RT cgroup
          should get priority access.

        - Because of the FIFO release of buffered bios, it is possible that
          a task of lower priority gets more IO done than a task of higher
          priority.

        - Buffering at multiple levels and FIFO dispatch can lead to more
          interesting, hard-to-solve issues.

                - Assume there is a sequential reader and an aggressive
                  writer in the cgroup. The writer might push a lot of
                  write requests into the FIFO queue first, and then a read
                  request from the reader comes in. cfq might not see this
                  read request for a long time (if the cgroup weight is
                  low), and the writer will starve the reader in this
                  cgroup.

                  Even cfq's anticipation logic will not help here, because
                  when that first read request actually gets to cfq, cfq
                  might choose to idle for more read requests to come; but
                  the aggressive writer might have again flooded the FIFO
                  queue in the group, so cfq will not see the subsequent
                  read request for a long time and will idle for reads
                  unnecessarily.

- Task grouping logic
        - We already have the notion of cgroups, where tasks can be grouped
          in a hierarchical manner. dm-ioband does not make full use of
          that and comes up with its own mechanism for grouping tasks
          (apart from cgroups). And there are odd ways of specifying the
          cgroup id while configuring the dm-ioband device.

          IMHO, once somebody has created the cgroup hierarchy, any IO
          controller logic should be able to read that hierarchy internally
          and provide control. There should be no need for any other
          configuration utility on top of cgroups.

	  My RFC patches had tried to get rid of this external
	  configuration requirement.

- Tasks and groups cannot be treated at the same level.

        - Because any second-level solution controls bios per cgroup and
          has no notion of which task queue a bio belongs to, one cannot
          treat tasks and groups at the same level.

          What I mean is the following.

			root
			/ | \
		       1  2  A
			    / \
			   3   4

        In the dm-ioband approach, at the top level tasks 1 and 2 together
        will get 50% of the BW and group A will get 50%. Ideally, along the
        lines of the cpu controller, I would expect it to be 33% each for
        task 1, task 2 and group A (see the sketch after this list).

        This can create interesting scenarios. Assume task 1 is an RT class
        task. One would expect task 1 to get all the BW possible, starving
        task 2 and group A, but that will not be the case and task 1 will
        get 50% of the BW.

        Not that it is critically important, but it would probably be nice
        if we could maintain the same semantics as the cpu controller. In
        an elevator layer solution we can do it at least for the CFQ
        scheduler, as it maintains a separate io queue per io context.

	This is in general an issue for any 2nd level IO controller which
	only accounts for io groups and not for io queues per process.

- We will end up copying a lot of code/logic from cfq.

        - To address many of the concerns, like a multi-class scheduler,
          we will end up duplicating IO scheduler code. Why can't we have a
          single point of hierarchical IO scheduling (this patchset)?
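
Here is the sketch referred to above: a toy calculation (illustrative only,
not code from dm-ioband or from this patchset) of the top-level split for
the root/1/2/A hierarchy drawn earlier, assuming equal weights everywhere
and that tasks 1 and 2 split their combined share equally:

  def dm_ioband_style():
      # Root-level tasks are accounted only as the root group's own IO, so
      # tasks 1 and 2 share one 50% slot and group A gets the other 50%.
      return {"task1": 0.25, "task2": 0.25, "groupA": 0.50}

  def cpu_controller_style():
      # Tasks and groups are peers at the same level: three equal entities.
      return {"task1": 1.0 / 3, "task2": 1.0 / 3, "groupA": 1.0 / 3}

  print(dm_ioband_style())       # {'task1': 0.25, 'task2': 0.25, 'groupA': 0.5}
  print(cpu_controller_style())  # each entity gets ~33%
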
Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12 18:01         ` Vivek Goyal
@ 2009-03-16  8:40           ` Ryo Tsuruta
  2009-03-16 13:39             ` Vivek Goyal
       [not found]             ` <20090316.174043.193698189.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
       [not found]           ` <20090312180126.GI10919-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-04-05 15:15           ` Andrea Righi
  2 siblings, 2 replies; 190+ messages in thread
From: Ryo Tsuruta @ 2009-03-16  8:40 UTC (permalink / raw)
  To: vgoyal
  Cc: akpm, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, fernando, s-uchida, taka, guijianfeng, arozansk,
	jmoyer, oz-kernel, dhaval, balbir, linux-kernel, containers,
	menage, peterz, righi.andrea

Hi Vivek,

> dm-ioband
> ---------
> I have briefly looked at dm-ioband also and following were some of the
> concerns I had raised in the past.
> 
> - Need of a dm device for every device we want to control
> 
> 	- This requirement looks odd. It forces everybody to use dm-tools
> 	  and if there are lots of disks in the system, configuation is
> 	  pain.

I don't think it's a pain. Couldn't it easily be done by writing a small
script?

> - It does not support hiearhical grouping.

I can implement hierarchical grouping in dm-ioband if it's really
necessary, but at this point I don't think it is really necessary,
and I want to keep the code simple.

> - Possibly can break the assumptions of underlying IO schedulers.
> 
> 	- There is no notion of task classes. So tasks of all the classes
> 	  are at same level from resource contention point of view.
> 	  The only thing which differentiates them is cgroup weight. Which
> 	  does not answer the question that an RT task or RT cgroup should
> 	  starve the peer cgroup if need be as RT cgroup should get priority
> 	  access.
> 
> 	- Because of FIFO release of buffered bios, it is possible that
> 	  task of lower priority gets more IO done than the task of higher
> 	  priority.
> 
> 	- Buffering at multiple levels and FIFO dispatch can have more
> 	  interesting hard to solve issues.
> 
> 		- Assume there is sequential reader and an aggressive
> 		  writer in the cgroup. It might happen that writer
> 		  pushed lot of write requests in the FIFO queue first
> 		  and then a read request from reader comes. Now it might
> 		  happen that cfq does not see this read request for a long
> 		  time (if cgroup weight is less) and this writer will 
> 		  starve the reader in this cgroup.
> 
> 		  Even cfq anticipation logic will not help here because
> 		  when that first read request actually gets to cfq, cfq might
> 		  choose to idle for more read requests to come, but the
> 		  agreesive writer might have again flooded the FIFO queue
> 		  in the group and cfq will not see subsequent read request
> 		  for a long time and will unnecessarily idle for read.

I think it's just a matter of which you prioritize, bandwidth or
io-class. What do you do when the RT task issues a lot of I/O?

> - Task grouping logic
> 	- We already have the notion of cgroup where tasks can be grouped
> 	  in hierarhical manner. dm-ioband does not make full use of that
> 	  and comes up with own mechansim of grouping tasks (apart from
> 	  cgroup).  And there are odd ways of specifying cgroup id while
> 	  configuring the dm-ioband device.
> 
> 	  IMHO, once somebody has created the cgroup hieararchy, any IO
> 	  controller logic should be able to internally read that hiearchy
> 	  and provide control. There should not be need of any other
> 	  configuration utity on top of cgroup.
> 
> 	  My RFC patches had tried to get rid of this external
> 	  configuration requirement.

The reason is that it makes bio-cgroup easy to use for dm-ioband,
but it's not the final design of the interface between dm-ioband and
cgroup.

> - Task and Groups can not be treated at same level.
> 
> 	- Because at any second level solution we are controlling bio
> 	  per cgroup and don't have any notion of which task queue bio
> 	  belongs to, one can not treat task and group  at same level.
> 	
> 	  What I meant is following.
> 
> 			root
> 			/ | \
> 		       1  2  A
> 			    / \
> 			   3   4
> 
> 	In dm-ioband approach, at top level tasks 1 and 2 will get 50%
> 	of BW together and group A will get 50%. Ideally along the lines
> 	of cpu controller, I would expect it to be 33% each for task 1
> 	task 2 and group A.
> 
> 	This can create interesting scenarios where assumg task1 is
> 	an RT class task. Now one would expect task 1 get all the BW
> 	possible starving task 2 and group A, but that will not be the
> 	case and task1 will get 50% of BW.
> 
>  	Not that it is critically important but it would probably be
> 	nice if we can maitain same semantics as cpu controller. In
> 	elevator layer solution we can do it at least for CFQ scheduler
> 	as it maintains separate io queue per io context. 	

I will consider following the CPU controller's manner when dm-ioband
supports hierarchical grouping.

> 	This is in general an issue for any 2nd level IO controller which
> 	only accounts for io groups and not for io queues per process.
> 
> - We will end copying a lot of code/logic from cfq
> 
> 	- To address many of the concerns like multi class scheduler
> 	  we will end up duplicating code of IO scheduler. Why can't
> 	  we have a one point hierarchical IO scheduling (This patchset).
> Thanks
> Vivek

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-16  8:40           ` Ryo Tsuruta
@ 2009-03-16 13:39             ` Vivek Goyal
       [not found]             ` <20090316.174043.193698189.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  1 sibling, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-16 13:39 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: akpm, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, fernando, s-uchida, taka, guijianfeng, arozansk,
	jmoyer, oz-kernel, dhaval, balbir, linux-kernel, containers,
	menage, peterz, righi.andrea

On Mon, Mar 16, 2009 at 05:40:43PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> > dm-ioband
> > ---------
> > I have briefly looked at dm-ioband also and following were some of the
> > concerns I had raised in the past.
> > 
> > - Need of a dm device for every device we want to control
> > 
> > 	- This requirement looks odd. It forces everybody to use dm-tools
> > 	  and if there are lots of disks in the system, configuation is
> > 	  pain.
> 
> I don't think it's a pain. Could it be easily done by writing a small
> script?
> 

I think it is an extra hassle which can be avoided. Following are some
thoughts about the configuration and its issues. Looking at these, IMHO,
it is not simple to configure dm-ioband.

- So if there are 100 disks in a system, and let's say 5 partitions on each
  disk, then the script needs to create a dm-ioband device for every
  partition. So I will end up creating 500 dm-ioband devices. This does not
  take into account the dm-ioband devices people might end up creating on
  intermediate logical nodes.

- Need for dm tools to create devices and groups.

- I am looking at the dm-ioband help on the web and wondering whether these
  commands are really simple and hassle free for a user who does not use dm
  in his setup.

  For example, two dm-ioband device creations on two partitions:

 # echo "0 $(blockdev --getsize /dev/sda1) ioband /dev/sda1 1 0 0 none" \
     "weight 0 :40" | dmsetup create ioband1
 # echo "0 $(blockdev --getsize /dev/sda2) ioband /dev/sda2 1 0 0 none" \
     "weight 0 :10" | dmsetup create ioband2 

- Following are the commands just to create two groups on a single io-band
  device.

 # dmsetup message ioband1 0 type user
 # dmsetup message ioband1 0 attach 1000
 # dmsetup message ioband1 0 attach 2000
 # dmsetup message ioband1 0 weight 1000:30
 # dmsetup message ioband1 0 weight 2000:20

Now think of a decent-sized group hierarchy (say 50 groups) on a 500 ioband
device system. That would be 50*500 = 25000 group creation commands, as the
sketch below illustrates.
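
To get a feel for the volume, a small script along these lines (using the
command templates quoted above; the partition names, cgroup ids and weights
are made-up placeholders) is what an admin would effectively have to run:

  def ioband_setup_commands(partitions, cgroup_weights, default_weight=100):
      cmds = []
      for i, part in enumerate(partitions, 1):
          dev = "ioband%d" % i
          cmds.append('echo "0 $(blockdev --getsize %s) ioband %s 1 0 0 none" '
                      '"weight 0 :%d" | dmsetup create %s'
                      % (part, part, default_weight, dev))
          cmds.append("dmsetup message %s 0 type user" % dev)
          for cgid, weight in cgroup_weights.items():
              cmds.append("dmsetup message %s 0 attach %d" % (dev, cgid))
              cmds.append("dmsetup message %s 0 weight %d:%d" % (dev, cgid, weight))
      return cmds

  # 500 partitions x 50 groups gives on the order of the 25000 commands
  # mentioned above; a tiny example just to show the shape:
  parts = ["/dev/sda1", "/dev/sda2", "/dev/sdb1", "/dev/sdb2"]
  for cmd in ioband_setup_commands(parts, {1000: 30, 2000: 20}):
      print(cmd)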

- So if an admin wants to group applications using cgroups, first he needs
  to create the cgroup hierarchy. Then he needs to take all the cgroup ids
  and provide them to the dm-ioband device with the help of the dmsetup
  command.

        dmsetup message ioband1 0 attach <cgroup id>

  cgroups already provide us a nice hierarchical grouping facility. This
  extra step is cumbersome and completely unnecessary.

- These configuration commands will become even more complicated once you
  start supporting a hierarchical setup. All the hierarchy information will
  have to be passed in the command itself in one way or another when a
  group is created.

- You will be limited in terms of functionality. I am assuming these group
  creation operations will be limited to the "root" user. A very common
  requirement we are seeing nowadays is that the admin creates a top level
  cgroup and then lets users create/manage more groups within that top
  level group.

  For example.

			  root
			/  |  \
		     u1    u2 others 		      

  Here u1 and u2 are two different users on the system. The admin can
  create top level cgroups for the users and assign each user a weight from
  the IO point of view. Now individual users should be able to create
  groups of their own and manage their tasks. The cgroup infrastructure
  allows all this.

  In the above setup it will become very hard to also let users create
  their own groups within the top level group. You will have to keep all
  the information the filesystem keeps in terms of file permissions, etc.
   
So IMHO, configuration of dm-ioband devices and groups is complicated and
can be simplified a lot. Secondly, it does not seem to be a good idea to
ignore the cgroup infrastructure and come up with one's own way of grouping
things.

> > - It does not support hiearhical grouping.
> 
> I can implement hierarchical grouping to dm-ioband if it's really
> necessary, but at this point, I don't think it's really necessary
> and I want to keep the code simple.
> 

We do need hierarchical support.

In fact, later in the mail you state that you will consider treating tasks
and groups at the same level. The moment you do that, one flat hierarchy
will mean a single "root" group only and no groups within it. Until and
unless you implement hierarchical support you can't create even a single
level of groups within "root".

Secondly, I think dm-ioband will become very complex (especially in terms
of managing the configuration) the moment hierarchical support is
introduced. So it would be a good idea to implement the hierarchical
support now and get to know the full complexity of the system.

> > - Possibly can break the assumptions of underlying IO schedulers.
> > 
> > 	- There is no notion of task classes. So tasks of all the classes
> > 	  are at same level from resource contention point of view.
> > 	  The only thing which differentiates them is cgroup weight. Which
> > 	  does not answer the question that an RT task or RT cgroup should
> > 	  starve the peer cgroup if need be as RT cgroup should get priority
> > 	  access.
> > 
> > 	- Because of FIFO release of buffered bios, it is possible that
> > 	  task of lower priority gets more IO done than the task of higher
> > 	  priority.
> > 
> > 	- Buffering at multiple levels and FIFO dispatch can have more
> > 	  interesting hard to solve issues.
> > 
> > 		- Assume there is sequential reader and an aggressive
> > 		  writer in the cgroup. It might happen that writer
> > 		  pushed lot of write requests in the FIFO queue first
> > 		  and then a read request from reader comes. Now it might
> > 		  happen that cfq does not see this read request for a long
> > 		  time (if cgroup weight is less) and this writer will 
> > 		  starve the reader in this cgroup.
> > 
> > 		  Even cfq anticipation logic will not help here because
> > 		  when that first read request actually gets to cfq, cfq might
> > 		  choose to idle for more read requests to come, but the
> > 		  agreesive writer might have again flooded the FIFO queue
> > 		  in the group and cfq will not see subsequent read request
> > 		  for a long time and will unnecessarily idle for read.
> 
> I think it's just a matter of which you prioritize, bandwidth or
> io-class. What do you do when the RT task issues a lot of I/O?
> 

This is a multi-class scheduler. We first prioritize by class and then
handle tasks within a class. So the RT class will always get to dispatch
first and can starve best-effort class tasks if it is issuing lots of IO.

You just don't have any notion of RT groups. So if the admin wants to make
sure that an RT task always gets disk access first, there is no way to
ensure that. The best one can do in this setup is assign a higher weight to
the RT task's group. That group will still be doing proportional weight
scheduling with best-effort class groups or idle task groups. That is not
what multi-class scheduling is.
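
A minimal sketch of what "class first, then weight within the class" means
(conceptual illustration only, not code taken from CFQ or from this
patchset):

  RT, BE, IDLE = 0, 1, 2        # lower value = higher class priority

  class Queue(object):
      def __init__(self, name, io_class, weight):
          self.name = name
          self.io_class = io_class
          self.weight = weight
          self.vtime = 0.0      # weight-scaled service already received
          self.backlogged = True

  def pick_next(queues):
      # Serve the highest-priority class that has work; within that class,
      # pick the queue with the smallest weight-scaled virtual time.
      busy = [q for q in queues if q.backlogged]
      if not busy:
          return None
      best_class = min(q.io_class for q in busy)
      return min((q for q in busy if q.io_class == best_class),
                 key=lambda q: q.vtime)

  def charge(q, slice_ms):
      q.vtime += slice_ms / float(q.weight)

  qs = [Queue("rt-task", RT, 1), Queue("T1", BE, 2), Queue("T2", BE, 1)]
  for _ in range(6):
      q = pick_next(qs)
      charge(q, 10)
      print(q.name)             # "rt-task" every time, as long as it has IO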

So in your patches there is no differentiation between classes. A
best-effort task competes just as hard as an RT task. For example:

			root
		 	/  \
		   RT task  Group (best effort class)
				/ \
			       T1  T2 

Here T1 and T2 are best-effort class tasks and they are sharing disk
bandwidth with the RT task. Instead, the RT task should get exclusive
access to the disk.

Secondly, two of the issues I have mentioned above are about tasks within
the same class and how FIFO dispatch creates problems there. These are
problems with any second-level controller. They will be really hard to
solve and will force us to copy more code from cfq and other subsystems.

> > - Task grouping logic
> > 	- We already have the notion of cgroup where tasks can be grouped
> > 	  in hierarhical manner. dm-ioband does not make full use of that
> > 	  and comes up with own mechansim of grouping tasks (apart from
> > 	  cgroup).  And there are odd ways of specifying cgroup id while
> > 	  configuring the dm-ioband device.
> > 
> > 	  IMHO, once somebody has created the cgroup hieararchy, any IO
> > 	  controller logic should be able to internally read that hiearchy
> > 	  and provide control. There should not be need of any other
> > 	  configuration utity on top of cgroup.
> > 
> > 	  My RFC patches had tried to get rid of this external
> > 	  configuration requirement.
> 
> The reason is that it makes bio-cgroup easy to use for dm-ioband.
> But It's not a final design of the interface between dm-ioband and
> cgroup.

It makes things easy for the dm-ioband implementation but harder for the
user.

What is the alternative interface?

> 
> > - Task and Groups can not be treated at same level.
> > 
> > 	- Because at any second level solution we are controlling bio
> > 	  per cgroup and don't have any notion of which task queue bio
> > 	  belongs to, one can not treat task and group  at same level.
> > 	
> > 	  What I meant is following.
> > 
> > 			root
> > 			/ | \
> > 		       1  2  A
> > 			    / \
> > 			   3   4
> > 
> > 	In dm-ioband approach, at top level tasks 1 and 2 will get 50%
> > 	of BW together and group A will get 50%. Ideally along the lines
> > 	of cpu controller, I would expect it to be 33% each for task 1
> > 	task 2 and group A.
> > 
> > 	This can create interesting scenarios where assumg task1 is
> > 	an RT class task. Now one would expect task 1 get all the BW
> > 	possible starving task 2 and group A, but that will not be the
> > 	case and task1 will get 50% of BW.
> > 
> >  	Not that it is critically important but it would probably be
> > 	nice if we can maitain same semantics as cpu controller. In
> > 	elevator layer solution we can do it at least for CFQ scheduler
> > 	as it maintains separate io queue per io context. 	
> 
> I will consider following the CPU controller's manner when dm-ioband
> supports hierarchical grouping.

But this is an issue even now. If you want to consider tasks and groups at
the same level, then you will end up creating separate queues for all the
tasks (and not only queues for groups). This will essentially become CFQ.
 
> 
> > 	This is in general an issue for any 2nd level IO controller which
> > 	only accounts for io groups and not for io queues per process.
> > 
> > - We will end copying a lot of code/logic from cfq
> > 
> > 	- To address many of the concerns like multi class scheduler
> > 	  we will end up duplicating code of IO scheduler. Why can't
> > 	  we have a one point hierarchical IO scheduling (This patchset).

More details about this point:

- To make dm-ioband support multi-class tasks/groups, we will end up
  inheriting logic from cfq/bfq.

- To treat tasks and groups at the same level, we will end up creating
  separate queues for each task and then importing a lot of cfq/bfq logic
  for managing those queues.

- The moment we move to hierarchical support, you will end up recreating
  logic equivalent to our patches.

The point is, why do all this? CFQ has already solved the problem of a
multi-class IO scheduler providing service differentiation between tasks of
different priority. With the cgroup support, we just need to extend the
existing CFQ logic so that it supports hierarchical scheduling, and we will
have a good IO controller in place.

Can you please point out specifically why you think extending the CFQ logic
to support hierarchical scheduling and sharing code with the other IO
schedulers is not a good way to implement hierarchical IO control?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12 14:04       ` Vivek Goyal
       [not found]         ` <20090312140450.GE10919-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-03-18  7:23         ` Gui Jianfeng
       [not found]           ` <49C0A171.8060009-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  1 sibling, 1 reply; 190+ messages in thread
From: Gui Jianfeng @ 2009-03-18  7:23 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Dhaval Giani, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	arozansk, jmoyer, oz-kernel, balbir, linux-kernel, containers,
	akpm, menage, peterz

Vivek Goyal wrote:
>> Hi Vivek,
>>
>> I would be interested in knowing if these are the results expected?
>>
> 
> Hi Dhaval, 
> 
> Good question. Keeping current expectation in mind, yes these are expected
> results. To begin with, current expectations are that try to emulate
> cfq behavior and the kind of service differentiation we get between
> threads of different priority, same kind of service differentiation we
> should get from different cgroups.
>  
> Having said that, in theory a more accurate estimate should be amount 
> of actual disk time a queue/cgroup got. I have put a tracing message
> to keep track of total service received by a queue. If you run "blktrace"
> then you can see that. Ideally, total service received by two threads
> over a period of time should be in same proportion as their cgroup
> weights.
> 
> It will not be easy to achive it given the constraints we have got in
> terms of how to accurately we can account for disk time actually used by a
> queue in certain situations. So to begin with I am targetting that
> try to meet same kind of service differentation between cgroups as
> cfq provides between threads and then slowly refine it to see how
> close one can come to get accurate numbers in terms of "total_serivce"
> received by each queue.

  Hi Vivek,

  I simply tested with blktrace enabled. I created two groups and set
  ioprio 4 and 7 respectively (the corresponding weights should be 4:1,
  right?), and started two dd's concurrently. UUIC, ideally the proportion
  of service the two dd's get should be 4:1 over the period in which both
  are running. I extracted the *served* values from the blktrace output and
  summed them up. I found the proportion of the sums of the *served* values
  is about 1.7:1.
  Am I missing something?

  I extracted the following highlighted value:
  8,0  0   0   18.914906549     0  m   N 6601ioq served=*0x13* total service=0x184d

> 
> Thanks
> Vivek
> 
> 
> 

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-18  7:23         ` Gui Jianfeng
@ 2009-03-18 21:55               ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-18 21:55 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dhaval Giani,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Wed, Mar 18, 2009 at 03:23:29PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> >> Hi Vivek,
> >>
> >> I would be interested in knowing if these are the results expected?
> >>
> > 
> > Hi Dhaval, 
> > 
> > Good question. Keeping current expectation in mind, yes these are expected
> > results. To begin with, current expectations are that try to emulate
> > cfq behavior and the kind of service differentiation we get between
> > threads of different priority, same kind of service differentiation we
> > should get from different cgroups.
> >  
> > Having said that, in theory a more accurate estimate should be amount 
> > of actual disk time a queue/cgroup got. I have put a tracing message
> > to keep track of total service received by a queue. If you run "blktrace"
> > then you can see that. Ideally, total service received by two threads
> > over a period of time should be in same proportion as their cgroup
> > weights.
> > 
> > It will not be easy to achive it given the constraints we have got in
> > terms of how to accurately we can account for disk time actually used by a
> > queue in certain situations. So to begin with I am targetting that
> > try to meet same kind of service differentation between cgroups as
> > cfq provides between threads and then slowly refine it to see how
> > close one can come to get accurate numbers in terms of "total_serivce"
> > received by each queue.
> 
>   Hi Vivek,
> 
>   I simply tested with blktrace opened. I create two groups and set ioprio
>   4 and 7 respectively(the corresponding weight should 4:1, right?),

Hi Gui,

Thanks for testing. You are right about weight proportions.
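
For reference, the 4:1 figure is consistent with a linear, BFQ-style
ioprio-to-weight mapping; a minimal sketch of the assumed mapping follows
(the helper name is illustrative, not taken from the posted series):

	#include <linux/ioprio.h>	/* IOPRIO_BE_NR == 8 */

	/*
	 * Assumed mapping: weight = IOPRIO_BE_NR - ioprio, so ioprio 4
	 * maps to weight 4 and ioprio 7 maps to weight 1, giving the
	 * expected 4:1 service split between the two groups.
	 */
	static inline unsigned int ioprio_to_weight(int ioprio)
	{
		return IOPRIO_BE_NR - ioprio;	/* 8 - 4 = 4, 8 - 7 = 1 */
	}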

> and 
>   start two dd concurrently. UUIC, Ideally, the proportion of service two 
>   dd got should be 4:1 in a period of time when they are running. I extract 
>   *served* value from blktrace output and sum them up. I found the proportion 
>   of the sum of *served* value is about 1.7:1
>   Am i missing something?

Actually, getting the service proportion in the same ratio as the weight
proportion is quite hard for sync queues. The biggest issue is that sync
queues are often not continuously backlogged: they do some IO and only then
dispatch the next round of requests.

Most of the time idling seems to be the solution for giving the impression
that a sync queue is continuously backlogged, but it also has the potential
to reduce throughput on faster hardware.

Anyway, can you please send me your complete blkparse output? There are
many places where the code has been designed to favor throughput over
fairness. Looking at your blkparse output will give me a better idea of
what the issue is in your setup.

Also, please try the attached patch. I have experimented with waiting for a
new request to arrive before the sync queue is expired. It helps me in
getting the fairness numbers, at least with noop on non-queueing rotational
media.

I have also introduced a new tunable, "fairness". The above code will kick
in only if this variable is set to 1. In the many places where we favor
throughput over fairness, I plan to use this variable as a condition to let
the user decide whether to choose fairness over throughput. I am not sure
how many places it really makes sense, but it at least gives us something
to play with and to compare the throughput in the two cases.
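
In essence (this just isolates the core of the elv_ioq_arm_slice_timer()
hunk from the patch below; it is not a separate change), the tunable gates
the new wait-busy idling:

	/*
	 * The queue has used up its slice and gone empty: if "fairness"
	 * is set, keep it active and idle until it gets busy again
	 * instead of expiring it, so it stays backlogged from the
	 * scheduler's point of view.
	 */
	if (efqd->fairness && wait_for_busy) {
		elv_mark_ioq_wait_busy(ioq);
		mod_timer(&efqd->idle_slice_timer,
			  jiffies + efqd->elv_slice_idle);
		return;
	}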

This patch applies on top of my current tree after removing the topmost
patch "anticipatory scheduling changes". My code has changed a bit since
that posting, so you might have to massage this patch a bit.

Thanks
Vivek


DESC
io-controller: idle for sometime on sync queue before expiring it
EDESC 

o When a sync queue expires, in many cases it might be empty and will then
  be deleted from the active tree. This leads to a scenario where, out of
  two competing queues, only one is on the tree; when a new queue is
  selected, a vtime jump takes place and we don't see service provided in
  proportion to weight.

o In general this is a fundamental problem with fairness for sync queues
  that are not continuously backlogged. Idling looks like the only solution
  to make sure such queues can get a decent amount of disk bandwidth in the
  face of competition from continuously backlogged queues. But excessive
  idling has the potential to reduce performance on SSDs and on disks with
  command queuing.

o This patch experiments with waiting for the next request to arrive before
  a queue is expired after it has consumed its time slice. This can produce
  more accurate fairness numbers in some cases.

o Introduced a tunable "fairness". If set, the io-controller will put more
  focus on getting fairness right than on getting throughput right.


---
 block/blk-sysfs.c   |    7 ++++
 block/elevator-fq.c |   85 +++++++++++++++++++++++++++++++++++++++++++++-------
 block/elevator-fq.h |   12 +++++++
 3 files changed, 94 insertions(+), 10 deletions(-)

Index: linux1/block/elevator-fq.h
===================================================================
--- linux1.orig/block/elevator-fq.h	2009-03-18 17:34:46.000000000 -0400
+++ linux1/block/elevator-fq.h	2009-03-18 17:34:53.000000000 -0400
@@ -318,6 +318,13 @@ struct elv_fq_data {
 	unsigned long long rate_sampling_start; /*sampling window start jifies*/
 	/* number of sectors finished io during current sampling window */
 	unsigned long rate_sectors_current;
+
+	/*
+	 * If set to 1, will disable many optimizations done for boost
+	 * throughput and focus more on providing fairness for sync
+	 * queues.
+	 */
+	int fairness;
 };
 
 extern int elv_slice_idle;
@@ -340,6 +347,7 @@ enum elv_queue_state_flags {
 	ELV_QUEUE_FLAG_idle_window,	  /* elevator slice idling enabled */
 	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
 	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
+	ELV_QUEUE_FLAG_wait_busy,	  /* wait for this queue to get busy */
 	ELV_QUEUE_FLAG_NR,
 };
 
@@ -362,6 +370,7 @@ ELV_IO_QUEUE_FLAG_FNS(sync)
 ELV_IO_QUEUE_FLAG_FNS(wait_request)
 ELV_IO_QUEUE_FLAG_FNS(idle_window)
 ELV_IO_QUEUE_FLAG_FNS(slice_new)
+ELV_IO_QUEUE_FLAG_FNS(wait_busy)
 
 static inline struct io_service_tree *
 io_entity_service_tree(struct io_entity *entity)
@@ -554,6 +563,9 @@ static inline struct io_queue *elv_looku
 extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
 extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
 						size_t count);
+extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
+extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+						size_t count);
 
 /* Functions used by elevator.c */
 extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
Index: linux1/block/elevator-fq.c
===================================================================
--- linux1.orig/block/elevator-fq.c	2009-03-18 17:34:46.000000000 -0400
+++ linux1/block/elevator-fq.c	2009-03-18 17:34:53.000000000 -0400
@@ -1837,6 +1837,44 @@ void elv_ioq_served(struct io_queue *ioq
 			ioq->total_service);
 }
 
+/* Functions to show and store fairness value through sysfs */
+ssize_t elv_fairness_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = efqd->fairness;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data < 0)
+		data = 0;
+	else if (data > INT_MAX)
+		data = INT_MAX;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->fairness = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
 /* Functions to show and store elv_idle_slice value through sysfs */
 ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
 {
@@ -2263,10 +2301,11 @@ void __elv_ioq_slice_expired(struct requ
 	assert_spin_locked(q->queue_lock);
 	elv_log_ioq(efqd, ioq, "slice expired upd=%d", budget_update);
 
-	if (elv_ioq_wait_request(ioq))
+	if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
 		del_timer(&efqd->idle_slice_timer);
 
 	elv_clear_ioq_wait_request(ioq);
+	elv_clear_ioq_wait_busy(ioq);
 
 	/*
 	 * if ioq->slice_end = 0, that means a queue was expired before first
@@ -2482,8 +2521,9 @@ void elv_ioq_request_add(struct request_
 		 * immediately and flag that we must not expire this queue
 		 * just now
 		 */
-		if (elv_ioq_wait_request(ioq)) {
+		if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq)) {
 			del_timer(&efqd->idle_slice_timer);
+			elv_clear_ioq_wait_busy(ioq);
 			blk_start_queueing(q);
 		}
 	} else if (elv_should_preempt(q, ioq, rq)) {
@@ -2519,6 +2559,9 @@ void elv_idle_slice_timer(unsigned long 
 
 	if (ioq) {
 
+		if (elv_ioq_wait_busy(ioq))
+			goto expire;
+
 		/*
 		 * expired
 		 */
@@ -2546,7 +2589,7 @@ out_cont:
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
-void elv_ioq_arm_slice_timer(struct request_queue *q)
+void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
 {
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 	struct io_queue *ioq = elv_active_ioq(q->elevator);
@@ -2563,15 +2606,27 @@ void elv_ioq_arm_slice_timer(struct requ
 		return;
 
 	/*
-	 * still requests with the driver, don't idle
+	 * idle is disabled, either manually or by past process history
 	 */
-	if (efqd->rq_in_driver)
+	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
 		return;
 
 	/*
-	 * idle is disabled, either manually or by past process history
+	 * This queue has consumed its time slice. We are waiting only for
+	 * it to become busy before we select next queue for dispatch.
 	 */
-	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+	if (efqd->fairness && wait_for_busy) {
+		elv_mark_ioq_wait_busy(ioq);
+		sl = efqd->elv_slice_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log(efqd, "arm idle: %lu wait busy=1", sl);
+		return;
+	}
+
+	/*
+	 * still requests with the driver, don't idle
+	 */
+	if (efqd->rq_in_driver)
 		return;
 
 	/*
@@ -2628,6 +2683,12 @@ void *elv_fq_select_ioq(struct request_q
 		}
 	}
 
+	/* We are waiting for this queue to become busy before it expires.*/
+	if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
 	/*
 	 * The active queue has run out of time, expire it and select new.
 	 */
@@ -2802,10 +2863,14 @@ void elv_ioq_completed_request(struct re
 			elv_ioq_set_prio_slice(q, ioq);
 			elv_clear_ioq_slice_new(ioq);
 		}
-		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+		if (elv_ioq_class_idle(ioq))
 			elv_ioq_slice_expired(q, 1);
-		else if (sync && !ioq->nr_queued)
-			elv_ioq_arm_slice_timer(q);
+		else if (sync && !ioq->nr_queued) {
+			if (elv_ioq_slice_used(ioq))
+				elv_ioq_arm_slice_timer(q, 1);
+			else
+				elv_ioq_arm_slice_timer(q, 0);
+		}
 	}
 
 	if (!efqd->rq_in_driver)
Index: linux1/block/blk-sysfs.c
===================================================================
--- linux1.orig/block/blk-sysfs.c	2009-03-18 17:34:28.000000000 -0400
+++ linux1/block/blk-sysfs.c	2009-03-18 17:34:53.000000000 -0400
@@ -282,6 +282,12 @@ static struct queue_sysfs_entry queue_sl
 	.show = elv_slice_idle_show,
 	.store = elv_slice_idle_store,
 };
+
+static struct queue_sysfs_entry queue_fairness_entry = {
+	.attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_fairness_show,
+	.store = elv_fairness_store,
+};
 #endif
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
@@ -296,6 +302,7 @@ static struct attribute *default_attrs[]
 	&queue_iostats_entry.attr,
 #ifdef CONFIG_ELV_FAIR_QUEUING
 	&queue_slice_idle_entry.attr,
+	&queue_fairness_entry.attr,
 #endif
 	NULL,
 };

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]               ` <20090318215529.GA3338-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-03-19  3:38                 ` Gui Jianfeng
  2009-03-24  5:32                 ` Nauman Rafique
  1 sibling, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-03-19  3:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dhaval Giani,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Vivek Goyal wrote:
> On Wed, Mar 18, 2009 at 03:23:29PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>>> Hi Vivek,
>>>>
>>>> I would be interested in knowing if these are the results expected?
>>>>
>>> Hi Dhaval, 
>>>
>>> Good question. Keeping current expectation in mind, yes these are expected
>>> results. To begin with, current expectations are that try to emulate
>>> cfq behavior and the kind of service differentiation we get between
>>> threads of different priority, same kind of service differentiation we
>>> should get from different cgroups.
>>>  
>>> Having said that, in theory a more accurate estimate should be amount 
>>> of actual disk time a queue/cgroup got. I have put a tracing message
>>> to keep track of total service received by a queue. If you run "blktrace"
>>> then you can see that. Ideally, total service received by two threads
>>> over a period of time should be in same proportion as their cgroup
>>> weights.
>>>
>>> It will not be easy to achive it given the constraints we have got in
>>> terms of how to accurately we can account for disk time actually used by a
>>> queue in certain situations. So to begin with I am targetting that
>>> try to meet same kind of service differentation between cgroups as
>>> cfq provides between threads and then slowly refine it to see how
>>> close one can come to get accurate numbers in terms of "total_serivce"
>>> received by each queue.
>>   Hi Vivek,
>>
>>   I simply tested with blktrace opened. I create two groups and set ioprio
>>   4 and 7 respectively(the corresponding weight should 4:1, right?),
> 
> Hi Gui,
> 
> Thanks for testing. You are right about weight proportions.
> 
>> and 
>>   start two dd concurrently. UUIC, Ideally, the proportion of service two 
>>   dd got should be 4:1 in a period of time when they are running. I extract 
>>   *served* value from blktrace output and sum them up. I found the proportion 
>>   of the sum of *served* value is about 1.7:1
>>   Am i missing something?
> 
> Actually getting the service proportion in same ratio as weight proportion
> is quite hard for sync queues. The biggest issue is that many a times sync
> queues are not continuously backlogged and they do some IO and then dispatch
> a next round of requests.
> 
> Most of the time idling seems to be the solution for giving an impression
> that sync queue is continuously backlogged but it also has potential to
> reduce throughput on faster hardware.
> 
> Anyway, can you please send me your complete blkparse output. There are
> many a places where code has been designed to favor throughput than
> fairness. Looking at your blkparse output, will give me better idea what's
> the issue in your setup.
> 
> Also please try the attached patch. I have experimented with waiting for
> new request to come before sync queue is expired. It helps me in getting
> the fairness numbers at least with noop on non-queueing rotational media.
> 
> I also have introduced a new tunable "fairness". Above code will kick in
> only if this variable is set to 1. Many a places where we favor throughput
> over fairness, I plan to use this variable as condition to let user
> decide whether to choose fairness over throughput. I am not sure at how many
> places it really makes sense, but it atleast gives us something to play and
> compare the throughput in two cases.
> 
> This patch applies on my current tree after removing tomost patceh
> "anticipatory scheduling changes". My code has changed a bit since the
> posting, so you might have to message this patch a bit.

  Hi Vivek,

  This time I ran two dd tasks doing pure sync reads, like this:
  dd if=/mnt/500M.1 of=/dev/null, and the proportion of service each got
  is very close to the proportion of their weights.
  Previously, I ran the concurrent dd tasks like this:
  dd if=/mnt/500M.1 of=/mnt/500M.2

  I'd like to try this patch out.

> 
> Thanks
> Vivek
> 
> 
> DESC
> io-controller: idle for sometime on sync queue before expiring it
> EDESC 
> 
> o When a sync queue expires, in many cases it might be empty and then
>   it will be deleted from the active tree. This will lead to a scenario
>   where out of two competing queues, only one is on the tree and when a
>   new queue is selected, vtime jump takes place and we don't see services
>   provided in proportion to weight.
> 
> o In general this is a fundamental problem with fairness of sync queues
>   where queues are not continuously backlogged. Looks like idling is
>   only solution to make sure such kind of queues can get some decent amount
>   of disk bandwidth in the face of competion from continusouly backlogged
>   queues. But excessive idling has potential to reduce performance on SSD
>   and disks with commnad queuing.
> 
> o This patch experiments with waiting for next request to come before a
>   queue is expired after it has consumed its time slice. This can ensure
>   more accurate fairness numbers in some cases.
> 
> o Introduced a tunable "fairness". If set, io-controller will put more
>   focus on getting fairness right than getting throughput right. 
> 
> 
> ---
>  block/blk-sysfs.c   |    7 ++++
>  block/elevator-fq.c |   85 +++++++++++++++++++++++++++++++++++++++++++++-------
>  block/elevator-fq.h |   12 +++++++
>  3 files changed, 94 insertions(+), 10 deletions(-)
> 
> Index: linux1/block/elevator-fq.h
> ===================================================================
> --- linux1.orig/block/elevator-fq.h	2009-03-18 17:34:46.000000000 -0400
> +++ linux1/block/elevator-fq.h	2009-03-18 17:34:53.000000000 -0400
> @@ -318,6 +318,13 @@ struct elv_fq_data {
>  	unsigned long long rate_sampling_start; /*sampling window start jifies*/
>  	/* number of sectors finished io during current sampling window */
>  	unsigned long rate_sectors_current;
> +
> +	/*
> +	 * If set to 1, will disable many optimizations done for boost
> +	 * throughput and focus more on providing fairness for sync
> +	 * queues.
> +	 */
> +	int fairness;
>  };
>  
>  extern int elv_slice_idle;
> @@ -340,6 +347,7 @@ enum elv_queue_state_flags {
>  	ELV_QUEUE_FLAG_idle_window,	  /* elevator slice idling enabled */
>  	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
>  	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
> +	ELV_QUEUE_FLAG_wait_busy,	  /* wait for this queue to get busy */
>  	ELV_QUEUE_FLAG_NR,
>  };
>  
> @@ -362,6 +370,7 @@ ELV_IO_QUEUE_FLAG_FNS(sync)
>  ELV_IO_QUEUE_FLAG_FNS(wait_request)
>  ELV_IO_QUEUE_FLAG_FNS(idle_window)
>  ELV_IO_QUEUE_FLAG_FNS(slice_new)
> +ELV_IO_QUEUE_FLAG_FNS(wait_busy)
>  
>  static inline struct io_service_tree *
>  io_entity_service_tree(struct io_entity *entity)
> @@ -554,6 +563,9 @@ static inline struct io_queue *elv_looku
>  extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
>  extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
>  						size_t count);
> +extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
> +extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> +						size_t count);
>  
>  /* Functions used by elevator.c */
>  extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
> Index: linux1/block/elevator-fq.c
> ===================================================================
> --- linux1.orig/block/elevator-fq.c	2009-03-18 17:34:46.000000000 -0400
> +++ linux1/block/elevator-fq.c	2009-03-18 17:34:53.000000000 -0400
> @@ -1837,6 +1837,44 @@ void elv_ioq_served(struct io_queue *ioq
>  			ioq->total_service);
>  }
>  
> +/* Functions to show and store fairness value through sysfs */
> +ssize_t elv_fairness_show(struct request_queue *q, char *name)
> +{
> +	struct elv_fq_data *efqd;
> +	unsigned int data;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +	efqd = &q->elevator->efqd;
> +	data = efqd->fairness;
> +	spin_unlock_irqrestore(q->queue_lock, flags);
> +	return sprintf(name, "%d\n", data);
> +}
> +
> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> +			  size_t count)
> +{
> +	struct elv_fq_data *efqd;
> +	unsigned int data;
> +	unsigned long flags;
> +
> +	char *p = (char *)name;
> +
> +	data = simple_strtoul(p, &p, 10);
> +
> +	if (data < 0)
> +		data = 0;
> +	else if (data > INT_MAX)
> +		data = INT_MAX;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +	efqd = &q->elevator->efqd;
> +	efqd->fairness = data;
> +	spin_unlock_irqrestore(q->queue_lock, flags);
> +
> +	return count;
> +}
> +
>  /* Functions to show and store elv_idle_slice value through sysfs */
>  ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
>  {
> @@ -2263,10 +2301,11 @@ void __elv_ioq_slice_expired(struct requ
>  	assert_spin_locked(q->queue_lock);
>  	elv_log_ioq(efqd, ioq, "slice expired upd=%d", budget_update);
>  
> -	if (elv_ioq_wait_request(ioq))
> +	if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
>  		del_timer(&efqd->idle_slice_timer);
>  
>  	elv_clear_ioq_wait_request(ioq);
> +	elv_clear_ioq_wait_busy(ioq);
>  
>  	/*
>  	 * if ioq->slice_end = 0, that means a queue was expired before first
> @@ -2482,8 +2521,9 @@ void elv_ioq_request_add(struct request_
>  		 * immediately and flag that we must not expire this queue
>  		 * just now
>  		 */
> -		if (elv_ioq_wait_request(ioq)) {
> +		if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq)) {
>  			del_timer(&efqd->idle_slice_timer);
> +			elv_clear_ioq_wait_busy(ioq);
>  			blk_start_queueing(q);
>  		}
>  	} else if (elv_should_preempt(q, ioq, rq)) {
> @@ -2519,6 +2559,9 @@ void elv_idle_slice_timer(unsigned long 
>  
>  	if (ioq) {
>  
> +		if (elv_ioq_wait_busy(ioq))
> +			goto expire;
> +
>  		/*
>  		 * expired
>  		 */
> @@ -2546,7 +2589,7 @@ out_cont:
>  	spin_unlock_irqrestore(q->queue_lock, flags);
>  }
>  
> -void elv_ioq_arm_slice_timer(struct request_queue *q)
> +void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
>  {
>  	struct elv_fq_data *efqd = &q->elevator->efqd;
>  	struct io_queue *ioq = elv_active_ioq(q->elevator);
> @@ -2563,15 +2606,27 @@ void elv_ioq_arm_slice_timer(struct requ
>  		return;
>  
>  	/*
> -	 * still requests with the driver, don't idle
> +	 * idle is disabled, either manually or by past process history
>  	 */
> -	if (efqd->rq_in_driver)
> +	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
>  		return;
>  
>  	/*
> -	 * idle is disabled, either manually or by past process history
> +	 * This queue has consumed its time slice. We are waiting only for
> +	 * it to become busy before we select next queue for dispatch.
>  	 */
> -	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
> +	if (efqd->fairness && wait_for_busy) {
> +		elv_mark_ioq_wait_busy(ioq);
> +		sl = efqd->elv_slice_idle;
> +		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
> +		elv_log(efqd, "arm idle: %lu wait busy=1", sl);
> +		return;
> +	}
> +
> +	/*
> +	 * still requests with the driver, don't idle
> +	 */
> +	if (efqd->rq_in_driver)
>  		return;
>  
>  	/*
> @@ -2628,6 +2683,12 @@ void *elv_fq_select_ioq(struct request_q
>  		}
>  	}
>  
> +	/* We are waiting for this queue to become busy before it expires.*/
> +	if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
> +		ioq = NULL;
> +		goto keep_queue;
> +	}
> +
>  	/*
>  	 * The active queue has run out of time, expire it and select new.
>  	 */
> @@ -2802,10 +2863,14 @@ void elv_ioq_completed_request(struct re
>  			elv_ioq_set_prio_slice(q, ioq);
>  			elv_clear_ioq_slice_new(ioq);
>  		}
> -		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
> +		if (elv_ioq_class_idle(ioq))
>  			elv_ioq_slice_expired(q, 1);
> -		else if (sync && !ioq->nr_queued)
> -			elv_ioq_arm_slice_timer(q);
> +		else if (sync && !ioq->nr_queued) {
> +			if (elv_ioq_slice_used(ioq))
> +				elv_ioq_arm_slice_timer(q, 1);
> +			else
> +				elv_ioq_arm_slice_timer(q, 0);
> +		}
>  	}
>  
>  	if (!efqd->rq_in_driver)
> Index: linux1/block/blk-sysfs.c
> ===================================================================
> --- linux1.orig/block/blk-sysfs.c	2009-03-18 17:34:28.000000000 -0400
> +++ linux1/block/blk-sysfs.c	2009-03-18 17:34:53.000000000 -0400
> @@ -282,6 +282,12 @@ static struct queue_sysfs_entry queue_sl
>  	.show = elv_slice_idle_show,
>  	.store = elv_slice_idle_store,
>  };
> +
> +static struct queue_sysfs_entry queue_fairness_entry = {
> +	.attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
> +	.show = elv_fairness_show,
> +	.store = elv_fairness_store,
> +};
>  #endif
>  static struct attribute *default_attrs[] = {
>  	&queue_requests_entry.attr,
> @@ -296,6 +302,7 @@ static struct attribute *default_attrs[]
>  	&queue_iostats_entry.attr,
>  #ifdef CONFIG_ELV_FAIR_QUEUING
>  	&queue_slice_idle_entry.attr,
> +	&queue_fairness_entry.attr,
>  #endif
>  	NULL,
>  };
> 
> 
> 

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 02/10] Common flat fair queuing code in elevaotor layer
       [not found]   ` <1236823015-4183-3-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-03-19  6:27     ` Gui Jianfeng
  2009-03-27  8:30     ` [PATCH] IO Controller: Don't store the pid in single queue circumstances Gui Jianfeng
  2009-04-02  4:06     ` [PATCH 02/10] Common flat fair queuing code in elevaotor layer Divyesh Shah
  2 siblings, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-03-19  6:27 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Vivek Goyal wrote:
...
> +
> +int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
> +			void *sched_queue, int ioprio_class, int ioprio,
> +			int is_sync)
> +{
> +	struct elv_fq_data *efqd = &eq->efqd;
> +	struct io_group *iog = io_lookup_io_group_current(efqd->queue);
> +
> +	RB_CLEAR_NODE(&ioq->entity.rb_node);
> +	atomic_set(&ioq->ref, 0);
> +	ioq->efqd = efqd;
> +	ioq->entity.budget = efqd->elv_slice[is_sync];
> +	elv_ioq_set_ioprio_class(ioq, ioprio_class);
> +	elv_ioq_set_ioprio(ioq, ioprio);
> +	ioq->pid = current->pid;

  Hi Vivek,

  If we are using a scheduler other than cfq, IOW a single ioq is used,
  then storing a pid in the ioq makes no sense; it just records the first
  task that was serviced.
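
  A minimal sketch of that idea (the single-ioq check is a hypothetical
  helper, not something from this posting):

	/*
	 * Only record a pid when the elevator keeps one ioq per task
	 * (CFQ-style). A single shared ioq has no single owner, so a
	 * stored pid would only name whichever task arrived first.
	 */
	if (elv_iosched_single_ioq(eq))		/* hypothetical helper */
		ioq->pid = -1;			/* shared queue, no owner */
	else
		ioq->pid = current->pid;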

> +	ioq->sched_queue = sched_queue;
> +	elv_mark_ioq_idle_window(ioq);
> +	bfq_init_entity(&ioq->entity, iog);
> +	return 0;
> +}
> +EXPORT_SYMBOL(elv_init_ioq);
...
> +
> +extern int elv_slice_idle;
> +extern int elv_slice_async;
> +
> +/* Logging facilities. */
> +#define elv_log_ioq(efqd, ioq, fmt, args...) \
> +	blk_add_trace_msg((efqd)->queue, "%d" fmt, (ioq)->pid, ##args)

   Maybe we need to use current->pid here instead, since ioq->pid is not
   always valid.
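
   A minimal sketch of that suggestion (the same macro as above, but taking
   the pid at trace time):

	#define elv_log_ioq(efqd, ioq, fmt, args...) \
		blk_add_trace_msg((efqd)->queue, "%d" fmt, current->pid, ##args)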

> +
> +#define elv_log(efqd, fmt, args...) \
> +	blk_add_trace_msg((efqd)->queue, "" fmt, ##args)
> +
> +#define ioq_sample_valid(samples)   ((samples) > 80)
> +


-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]               ` <20090318215529.GA3338-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-03-19  3:38                 ` Gui Jianfeng
@ 2009-03-24  5:32                 ` Nauman Rafique
  1 sibling, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-03-24  5:32 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dhaval Giani,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Wed, Mar 18, 2009 at 2:55 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Wed, Mar 18, 2009 at 03:23:29PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>> >> Hi Vivek,
>> >>
>> >> I would be interested in knowing if these are the results expected?
>> >>
>> >
>> > Hi Dhaval,
>> >
>> > Good question. Keeping the current expectations in mind, yes, these are the
>> > expected results. To begin with, the current expectation is to try to emulate
>> > cfq behavior: the same kind of service differentiation we get between
>> > threads of different priority is the kind of service differentiation we
>> > should get from different cgroups.
>> >
>> > Having said that, in theory a more accurate estimate should be the amount
>> > of actual disk time a queue/cgroup got. I have put in a tracing message
>> > to keep track of the total service received by a queue. If you run "blktrace"
>> > then you can see that. Ideally, the total service received by two threads
>> > over a period of time should be in the same proportion as their cgroup
>> > weights.
>> >
>> > It will not be easy to achieve that, given the constraints on how
>> > accurately we can account for the disk time actually used by a
>> > queue in certain situations. So to begin with I am targeting to
>> > meet the same kind of service differentiation between cgroups as
>> > cfq provides between threads, and then slowly refine it to see how
>> > close one can come to getting accurate numbers in terms of "total_service"
>> > received by each queue.
>>
>>   Hi Vivek,
>>
>>   I simply tested with blktrace enabled. I created two groups and set ioprio
>>   4 and 7 respectively (the corresponding weights should be 4:1, right?),
>
> Hi Gui,
>
> Thanks for testing. You are right about weight proportions.
>
>> and
>>   started two dd concurrently. IIUC, ideally, the proportion of service the two
>>   dd got should be 4:1 over the period of time when they are running. I extracted
>>   the *served* values from the blktrace output and summed them up. I found the
>>   proportion of the summed *served* values is about 1.7:1.
>>   Am I missing something?
>
> Actually, getting the service proportion in the same ratio as the weight
> proportion is quite hard for sync queues. The biggest issue is that sync
> queues are often not continuously backlogged: they do some IO and then
> dispatch the next round of requests.
>
> Most of the time idling seems to be the solution for giving the impression
> that a sync queue is continuously backlogged, but it also has the potential
> to reduce throughput on faster hardware.
>
> Anyway, can you please send me your complete blkparse output? There are
> many places where the code has been designed to favor throughput over
> fairness. Looking at your blkparse output will give me a better idea of
> what the issue is in your setup.
>
> Also please try the attached patch. I have experimented with waiting for a
> new request to come before the sync queue is expired. It helps me in getting
> the fairness numbers, at least with noop on non-queueing rotational media.
>
> I have also introduced a new tunable "fairness". The above code will kick in
> only if this variable is set to 1. In the many places where we favor throughput
> over fairness, I plan to use this variable as a condition to let the user
> decide whether to choose fairness over throughput. I am not sure at how many
> places it really makes sense, but it at least gives us something to play with
> and compare the throughput in the two cases.
>
> This patch applies on my current tree after removing the topmost patch
> "anticipatory scheduling changes". My code has changed a bit since the
> posting, so you might have to massage this patch a bit.
>
> Thanks
> Vivek
>
>
> DESC
> io-controller: idle for sometime on sync queue before expiring it
> EDESC
>
> o When a sync queue expires, in many cases it might be empty and then
>  it will be deleted from the active tree. This will lead to a scenario
>  where out of two competing queues, only one is on the tree and when a
>  new queue is selected, vtime jump takes place and we don't see services
>  provided in proportion to weight.
>
> o In general this is a fundamental problem with fairness of sync queues
>  where queues are not continuously backlogged. Looks like idling is
>  only solution to make sure such kind of queues can get some decent amount
>  of disk bandwidth in the face of competition from continuously backlogged
>  queues. But excessive idling has potential to reduce performance on SSD
>  and disks with command queuing.
>
> o This patch experiments with waiting for next request to come before a
>  queue is expired after it has consumed its time slice. This can ensure
>  more accurate fairness numbers in some cases.

Vivek, have you introduced this option just to play with it, or are you
planning to make it a part of the patch set? Waiting for a new
request to come before expiring the time slice sounds problematic.

>
> o Introduced a tunable "fairness". If set, io-controller will put more
>  focus on getting fairness right than getting throughput right.
>
>
> ---
>  block/blk-sysfs.c   |    7 ++++
>  block/elevator-fq.c |   85 +++++++++++++++++++++++++++++++++++++++++++++-------
>  block/elevator-fq.h |   12 +++++++
>  3 files changed, 94 insertions(+), 10 deletions(-)
>
> Index: linux1/block/elevator-fq.h
> ===================================================================
> --- linux1.orig/block/elevator-fq.h     2009-03-18 17:34:46.000000000 -0400
> +++ linux1/block/elevator-fq.h  2009-03-18 17:34:53.000000000 -0400
> @@ -318,6 +318,13 @@ struct elv_fq_data {
>        unsigned long long rate_sampling_start; /*sampling window start jifies*/
>        /* number of sectors finished io during current sampling window */
>        unsigned long rate_sectors_current;
> +
> +       /*
> +        * If set to 1, will disable many optimizations done for boost
> +        * throughput and focus more on providing fairness for sync
> +        * queues.
> +        */
> +       int fairness;
>  };
>
>  extern int elv_slice_idle;
> @@ -340,6 +347,7 @@ enum elv_queue_state_flags {
>        ELV_QUEUE_FLAG_idle_window,       /* elevator slice idling enabled */
>        ELV_QUEUE_FLAG_wait_request,      /* waiting for a request */
>        ELV_QUEUE_FLAG_slice_new,         /* no requests dispatched in slice */
> +       ELV_QUEUE_FLAG_wait_busy,         /* wait for this queue to get busy */
>        ELV_QUEUE_FLAG_NR,
>  };
>
> @@ -362,6 +370,7 @@ ELV_IO_QUEUE_FLAG_FNS(sync)
>  ELV_IO_QUEUE_FLAG_FNS(wait_request)
>  ELV_IO_QUEUE_FLAG_FNS(idle_window)
>  ELV_IO_QUEUE_FLAG_FNS(slice_new)
> +ELV_IO_QUEUE_FLAG_FNS(wait_busy)
>
>  static inline struct io_service_tree *
>  io_entity_service_tree(struct io_entity *entity)
> @@ -554,6 +563,9 @@ static inline struct io_queue *elv_looku
>  extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
>  extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
>                                                size_t count);
> +extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
> +extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> +                                               size_t count);
>
>  /* Functions used by elevator.c */
>  extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
> Index: linux1/block/elevator-fq.c
> ===================================================================
> --- linux1.orig/block/elevator-fq.c     2009-03-18 17:34:46.000000000 -0400
> +++ linux1/block/elevator-fq.c  2009-03-18 17:34:53.000000000 -0400
> @@ -1837,6 +1837,44 @@ void elv_ioq_served(struct io_queue *ioq
>                        ioq->total_service);
>  }
>
> +/* Functions to show and store fairness value through sysfs */
> +ssize_t elv_fairness_show(struct request_queue *q, char *name)
> +{
> +       struct elv_fq_data *efqd;
> +       unsigned int data;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(q->queue_lock, flags);
> +       efqd = &q->elevator->efqd;
> +       data = efqd->fairness;
> +       spin_unlock_irqrestore(q->queue_lock, flags);
> +       return sprintf(name, "%d\n", data);
> +}
> +
> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> +                         size_t count)
> +{
> +       struct elv_fq_data *efqd;
> +       unsigned int data;
> +       unsigned long flags;
> +
> +       char *p = (char *)name;
> +
> +       data = simple_strtoul(p, &p, 10);
> +
> +       if (data < 0)
> +               data = 0;
> +       else if (data > INT_MAX)
> +               data = INT_MAX;
> +
> +       spin_lock_irqsave(q->queue_lock, flags);
> +       efqd = &q->elevator->efqd;
> +       efqd->fairness = data;
> +       spin_unlock_irqrestore(q->queue_lock, flags);
> +
> +       return count;
> +}
> +
>  /* Functions to show and store elv_idle_slice value through sysfs */
>  ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
>  {
> @@ -2263,10 +2301,11 @@ void __elv_ioq_slice_expired(struct requ
>        assert_spin_locked(q->queue_lock);
>        elv_log_ioq(efqd, ioq, "slice expired upd=%d", budget_update);
>
> -       if (elv_ioq_wait_request(ioq))
> +       if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
>                del_timer(&efqd->idle_slice_timer);
>
>        elv_clear_ioq_wait_request(ioq);
> +       elv_clear_ioq_wait_busy(ioq);
>
>        /*
>         * if ioq->slice_end = 0, that means a queue was expired before first
> @@ -2482,8 +2521,9 @@ void elv_ioq_request_add(struct request_
>                 * immediately and flag that we must not expire this queue
>                 * just now
>                 */
> -               if (elv_ioq_wait_request(ioq)) {
> +               if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq)) {
>                        del_timer(&efqd->idle_slice_timer);
> +                       elv_clear_ioq_wait_busy(ioq);
>                        blk_start_queueing(q);
>                }
>        } else if (elv_should_preempt(q, ioq, rq)) {
> @@ -2519,6 +2559,9 @@ void elv_idle_slice_timer(unsigned long
>
>        if (ioq) {
>
> +               if (elv_ioq_wait_busy(ioq))
> +                       goto expire;
> +
>                /*
>                 * expired
>                 */
> @@ -2546,7 +2589,7 @@ out_cont:
>        spin_unlock_irqrestore(q->queue_lock, flags);
>  }
>
> -void elv_ioq_arm_slice_timer(struct request_queue *q)
> +void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
>  {
>        struct elv_fq_data *efqd = &q->elevator->efqd;
>        struct io_queue *ioq = elv_active_ioq(q->elevator);
> @@ -2563,15 +2606,27 @@ void elv_ioq_arm_slice_timer(struct requ
>                return;
>
>        /*
> -        * still requests with the driver, don't idle
> +        * idle is disabled, either manually or by past process history
>         */
> -       if (efqd->rq_in_driver)
> +       if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
>                return;
>
>        /*
> -        * idle is disabled, either manually or by past process history
> +        * This queue has consumed its time slice. We are waiting only for
> +        * it to become busy before we select next queue for dispatch.
>         */
> -       if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
> +       if (efqd->fairness && wait_for_busy) {
> +               elv_mark_ioq_wait_busy(ioq);
> +               sl = efqd->elv_slice_idle;
> +               mod_timer(&efqd->idle_slice_timer, jiffies + sl);
> +               elv_log(efqd, "arm idle: %lu wait busy=1", sl);
> +               return;
> +       }
> +
> +       /*
> +        * still requests with the driver, don't idle
> +        */
> +       if (efqd->rq_in_driver)
>                return;
>
>        /*
> @@ -2628,6 +2683,12 @@ void *elv_fq_select_ioq(struct request_q
>                }
>        }
>
> +       /* We are waiting for this queue to become busy before it expires.*/
> +       if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
> +               ioq = NULL;
> +               goto keep_queue;
> +       }
> +
>        /*
>         * The active queue has run out of time, expire it and select new.
>         */
> @@ -2802,10 +2863,14 @@ void elv_ioq_completed_request(struct re
>                        elv_ioq_set_prio_slice(q, ioq);
>                        elv_clear_ioq_slice_new(ioq);
>                }
> -               if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
> +               if (elv_ioq_class_idle(ioq))
>                        elv_ioq_slice_expired(q, 1);
> -               else if (sync && !ioq->nr_queued)
> -                       elv_ioq_arm_slice_timer(q);
> +               else if (sync && !ioq->nr_queued) {
> +                       if (elv_ioq_slice_used(ioq))
> +                               elv_ioq_arm_slice_timer(q, 1);
> +                       else
> +                               elv_ioq_arm_slice_timer(q, 0);
> +               }
>        }
>
>        if (!efqd->rq_in_driver)
> Index: linux1/block/blk-sysfs.c
> ===================================================================
> --- linux1.orig/block/blk-sysfs.c       2009-03-18 17:34:28.000000000 -0400
> +++ linux1/block/blk-sysfs.c    2009-03-18 17:34:53.000000000 -0400
> @@ -282,6 +282,12 @@ static struct queue_sysfs_entry queue_sl
>        .show = elv_slice_idle_show,
>        .store = elv_slice_idle_store,
>  };
> +
> +static struct queue_sysfs_entry queue_fairness_entry = {
> +       .attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
> +       .show = elv_fairness_show,
> +       .store = elv_fairness_store,
> +};
>  #endif
>  static struct attribute *default_attrs[] = {
>        &queue_requests_entry.attr,
> @@ -296,6 +302,7 @@ static struct attribute *default_attrs[]
>        &queue_iostats_entry.attr,
>  #ifdef CONFIG_ELV_FAIR_QUEUING
>        &queue_slice_idle_entry.attr,
> +       &queue_fairness_entry.attr,
>  #endif
>        NULL,
>  };
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-24  5:32               ` Nauman Rafique
@ 2009-03-24 12:58                     ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-24 12:58 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dhaval Giani,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Mon, Mar 23, 2009 at 10:32:41PM -0700, Nauman Rafique wrote:

[..]
> > DESC
> > io-controller: idle for sometime on sync queue before expiring it
> > EDESC
> >
> > o When a sync queue expires, in many cases it might be empty and then
> >  it will be deleted from the active tree. This will lead to a scenario
> >  where out of two competing queues, only one is on the tree and when a
> >  new queue is selected, vtime jump takes place and we don't see services
> >  provided in proportion to weight.
> >
> > o In general this is a fundamental problem with fairness of sync queues
> >  where queues are not continuously backlogged. Looks like idling is
> >  only solution to make sure such kind of queues can get some decent amount
> >  of disk bandwidth in the face of competition from continuously backlogged
> >  queues. But excessive idling has potential to reduce performance on SSD
> >  and disks with command queuing.
> >
> > o This patch experiments with waiting for next request to come before a
> >  queue is expired after it has consumed its time slice. This can ensure
> >  more accurate fairness numbers in some cases.
> 
> Vivek, have you introduced this option just to play with it, or are you
> planning to make it a part of the patch set? Waiting for a new
> request to come before expiring the time slice sounds problematic.

What are the issues you foresee with it? This is just an extra 8ms of idling
on the sync queue, and that too only if the think time of the queue is not high.

We already do idling on sync queues. In this case we are doing an extra
idle even after the queue has consumed its allocated quota. It helps me get
fairness numbers, and I have put it under a tunable "fairness". So by
default this code will not kick in.
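
For reference, with the blk-sysfs hunk below the tunable should show up as a
per-device queue attribute, so enabling it would presumably look something
like this (path inferred from the patch, not tested):

	echo 1 > /sys/block/<dev>/queue/fairness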

Another possible option could be that when expiring a sync queue, we don't
remove the queue immediately from the tree, and remove it later if there
is no request from the queue within 8ms or so. I am not sure whether, with BFQ,
that is feasible without creating issues in the current implementation.
The current implementation was simple, so I stuck with it to begin with.
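
Roughly, that alternative could look something like the sketch below. The
expire_pending helpers are invented names just to illustrate the idea; I
have not checked how this would interact with the BFQ service tree code.

	/* on expiry, keep the queue on the service tree but mark it */
	elv_mark_ioq_expire_pending(ioq);
	mod_timer(&efqd->idle_slice_timer, jiffies + efqd->elv_slice_idle);

	/* when the timer fires and no request has shown up, expire for real */
	if (elv_ioq_expire_pending(ioq) && !ioq->nr_queued)
		elv_ioq_slice_expired(q, 1);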

So yes, I am planning to keep it under a tunable, unless there are
significant issues in doing that.

Thanks
Vivek

> 
> >
> > o Introduced a tunable "fairness". If set, io-controller will put more
> >  focus on getting fairness right than getting throughput right.
> >
> >
> > ---
> >  block/blk-sysfs.c   |    7 ++++
> >  block/elevator-fq.c |   85 +++++++++++++++++++++++++++++++++++++++++++++-------
> >  block/elevator-fq.h |   12 +++++++
> >  3 files changed, 94 insertions(+), 10 deletions(-)
> >
> > Index: linux1/block/elevator-fq.h
> > ===================================================================
> > --- linux1.orig/block/elevator-fq.h     2009-03-18 17:34:46.000000000 -0400
> > +++ linux1/block/elevator-fq.h  2009-03-18 17:34:53.000000000 -0400
> > @@ -318,6 +318,13 @@ struct elv_fq_data {
> >        unsigned long long rate_sampling_start; /*sampling window start jifies*/
> >        /* number of sectors finished io during current sampling window */
> >        unsigned long rate_sectors_current;
> > +
> > +       /*
> > +        * If set to 1, will disable many optimizations done for boost
> > +        * throughput and focus more on providing fairness for sync
> > +        * queues.
> > +        */
> > +       int fairness;
> >  };
> >
> >  extern int elv_slice_idle;
> > @@ -340,6 +347,7 @@ enum elv_queue_state_flags {
> >        ELV_QUEUE_FLAG_idle_window,       /* elevator slice idling enabled */
> >        ELV_QUEUE_FLAG_wait_request,      /* waiting for a request */
> >        ELV_QUEUE_FLAG_slice_new,         /* no requests dispatched in slice */
> > +       ELV_QUEUE_FLAG_wait_busy,         /* wait for this queue to get busy */
> >        ELV_QUEUE_FLAG_NR,
> >  };
> >
> > @@ -362,6 +370,7 @@ ELV_IO_QUEUE_FLAG_FNS(sync)
> >  ELV_IO_QUEUE_FLAG_FNS(wait_request)
> >  ELV_IO_QUEUE_FLAG_FNS(idle_window)
> >  ELV_IO_QUEUE_FLAG_FNS(slice_new)
> > +ELV_IO_QUEUE_FLAG_FNS(wait_busy)
> >
> >  static inline struct io_service_tree *
> >  io_entity_service_tree(struct io_entity *entity)
> > @@ -554,6 +563,9 @@ static inline struct io_queue *elv_looku
> >  extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
> >  extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
> >                                                size_t count);
> > +extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
> > +extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> > +                                               size_t count);
> >
> >  /* Functions used by elevator.c */
> >  extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
> > Index: linux1/block/elevator-fq.c
> > ===================================================================
> > --- linux1.orig/block/elevator-fq.c     2009-03-18 17:34:46.000000000 -0400
> > +++ linux1/block/elevator-fq.c  2009-03-18 17:34:53.000000000 -0400
> > @@ -1837,6 +1837,44 @@ void elv_ioq_served(struct io_queue *ioq
> >                        ioq->total_service);
> >  }
> >
> > +/* Functions to show and store fairness value through sysfs */
> > +ssize_t elv_fairness_show(struct request_queue *q, char *name)
> > +{
> > +       struct elv_fq_data *efqd;
> > +       unsigned int data;
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(q->queue_lock, flags);
> > +       efqd = &q->elevator->efqd;
> > +       data = efqd->fairness;
> > +       spin_unlock_irqrestore(q->queue_lock, flags);
> > +       return sprintf(name, "%d\n", data);
> > +}
> > +
> > +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> > +                         size_t count)
> > +{
> > +       struct elv_fq_data *efqd;
> > +       unsigned int data;
> > +       unsigned long flags;
> > +
> > +       char *p = (char *)name;
> > +
> > +       data = simple_strtoul(p, &p, 10);
> > +
> > +       if (data < 0)
> > +               data = 0;
> > +       else if (data > INT_MAX)
> > +               data = INT_MAX;
> > +
> > +       spin_lock_irqsave(q->queue_lock, flags);
> > +       efqd = &q->elevator->efqd;
> > +       efqd->fairness = data;
> > +       spin_unlock_irqrestore(q->queue_lock, flags);
> > +
> > +       return count;
> > +}
> > +
> >  /* Functions to show and store elv_idle_slice value through sysfs */
> >  ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
> >  {
> > @@ -2263,10 +2301,11 @@ void __elv_ioq_slice_expired(struct requ
> >        assert_spin_locked(q->queue_lock);
> >        elv_log_ioq(efqd, ioq, "slice expired upd=%d", budget_update);
> >
> > -       if (elv_ioq_wait_request(ioq))
> > +       if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
> >                del_timer(&efqd->idle_slice_timer);
> >
> >        elv_clear_ioq_wait_request(ioq);
> > +       elv_clear_ioq_wait_busy(ioq);
> >
> >        /*
> >         * if ioq->slice_end = 0, that means a queue was expired before first
> > @@ -2482,8 +2521,9 @@ void elv_ioq_request_add(struct request_
> >                 * immediately and flag that we must not expire this queue
> >                 * just now
> >                 */
> > -               if (elv_ioq_wait_request(ioq)) {
> > +               if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq)) {
> >                        del_timer(&efqd->idle_slice_timer);
> > +                       elv_clear_ioq_wait_busy(ioq);
> >                        blk_start_queueing(q);
> >                }
> >        } else if (elv_should_preempt(q, ioq, rq)) {
> > @@ -2519,6 +2559,9 @@ void elv_idle_slice_timer(unsigned long
> >
> >        if (ioq) {
> >
> > +               if (elv_ioq_wait_busy(ioq))
> > +                       goto expire;
> > +
> >                /*
> >                 * expired
> >                 */
> > @@ -2546,7 +2589,7 @@ out_cont:
> >        spin_unlock_irqrestore(q->queue_lock, flags);
> >  }
> >
> > -void elv_ioq_arm_slice_timer(struct request_queue *q)
> > +void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
> >  {
> >        struct elv_fq_data *efqd = &q->elevator->efqd;
> >        struct io_queue *ioq = elv_active_ioq(q->elevator);
> > @@ -2563,15 +2606,27 @@ void elv_ioq_arm_slice_timer(struct requ
> >                return;
> >
> >        /*
> > -        * still requests with the driver, don't idle
> > +        * idle is disabled, either manually or by past process history
> >         */
> > -       if (efqd->rq_in_driver)
> > +       if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
> >                return;
> >
> >        /*
> > -        * idle is disabled, either manually or by past process history
> > +        * This queue has consumed its time slice. We are waiting only for
> > +        * it to become busy before we select next queue for dispatch.
> >         */
> > -       if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
> > +       if (efqd->fairness && wait_for_busy) {
> > +               elv_mark_ioq_wait_busy(ioq);
> > +               sl = efqd->elv_slice_idle;
> > +               mod_timer(&efqd->idle_slice_timer, jiffies + sl);
> > +               elv_log(efqd, "arm idle: %lu wait busy=1", sl);
> > +               return;
> > +       }
> > +
> > +       /*
> > +        * still requests with the driver, don't idle
> > +        */
> > +       if (efqd->rq_in_driver)
> >                return;
> >
> >        /*
> > @@ -2628,6 +2683,12 @@ void *elv_fq_select_ioq(struct request_q
> >                }
> >        }
> >
> > +       /* We are waiting for this queue to become busy before it expires.*/
> > +       if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
> > +               ioq = NULL;
> > +               goto keep_queue;
> > +       }
> > +
> >        /*
> >         * The active queue has run out of time, expire it and select new.
> >         */
> > @@ -2802,10 +2863,14 @@ void elv_ioq_completed_request(struct re
> >                        elv_ioq_set_prio_slice(q, ioq);
> >                        elv_clear_ioq_slice_new(ioq);
> >                }
> > -               if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
> > +               if (elv_ioq_class_idle(ioq))
> >                        elv_ioq_slice_expired(q, 1);
> > -               else if (sync && !ioq->nr_queued)
> > -                       elv_ioq_arm_slice_timer(q);
> > +               else if (sync && !ioq->nr_queued) {
> > +                       if (elv_ioq_slice_used(ioq))
> > +                               elv_ioq_arm_slice_timer(q, 1);
> > +                       else
> > +                               elv_ioq_arm_slice_timer(q, 0);
> > +               }
> >        }
> >
> >        if (!efqd->rq_in_driver)
> > Index: linux1/block/blk-sysfs.c
> > ===================================================================
> > --- linux1.orig/block/blk-sysfs.c       2009-03-18 17:34:28.000000000 -0400
> > +++ linux1/block/blk-sysfs.c    2009-03-18 17:34:53.000000000 -0400
> > @@ -282,6 +282,12 @@ static struct queue_sysfs_entry queue_sl
> >        .show = elv_slice_idle_show,
> >        .store = elv_slice_idle_store,
> >  };
> > +
> > +static struct queue_sysfs_entry queue_fairness_entry = {
> > +       .attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
> > +       .show = elv_fairness_show,
> > +       .store = elv_fairness_store,
> > +};
> >  #endif
> >  static struct attribute *default_attrs[] = {
> >        &queue_requests_entry.attr,
> > @@ -296,6 +302,7 @@ static struct attribute *default_attrs[]
> >        &queue_iostats_entry.attr,
> >  #ifdef CONFIG_ELV_FAIR_QUEUING
> >        &queue_slice_idle_entry.attr,
> > +       &queue_fairness_entry.attr,
> >  #endif
> >        NULL,
> >  };
> >

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]                     ` <20090324125842.GA21389-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-03-24 18:14                       ` Nauman Rafique
  0 siblings, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-03-24 18:14 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dhaval Giani,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Tue, Mar 24, 2009 at 5:58 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Mar 23, 2009 at 10:32:41PM -0700, Nauman Rafique wrote:
>
> [..]
>> > DESC
>> > io-controller: idle for sometime on sync queue before expiring it
>> > EDESC
>> >
>> > o When a sync queue expires, in many cases it might be empty and then
>> >  it will be deleted from the active tree. This will lead to a scenario
>> >  where out of two competing queues, only one is on the tree and when a
>> >  new queue is selected, vtime jump takes place and we don't see services
>> >  provided in proportion to weight.
>> >
>> > o In general this is a fundamental problem with fairness of sync queues
>> >  where queues are not continuously backlogged. Looks like idling is
>> >  only solution to make sure such kind of queues can get some decent amount
>> >  of disk bandwidth in the face of competition from continuously backlogged
>> >  queues. But excessive idling has potential to reduce performance on SSD
>> >  and disks with command queuing.
>> >
>> > o This patch experiments with waiting for next request to come before a
>> >  queue is expired after it has consumed its time slice. This can ensure
>> >  more accurate fairness numbers in some cases.
>>
>> Vivek, have you introduced this option just to play with it, or are you
>> planning to make it a part of the patch set? Waiting for a new
>> request to come before expiring the time slice sounds problematic.
>
> What are the issues you foresee with it? This is just an extra 8ms of idling
> on the sync queue, and that too only if the think time of the queue is not high.
>
> We already do idling on sync queues. In this case we are doing an extra
> idle even after the queue has consumed its allocated quota. It helps me get
> fairness numbers, and I have put it under a tunable "fairness". So by
> default this code will not kick in.
>
> Another possible option could be that when expiring a sync queue, we don't
> remove the queue immediately from the tree, and remove it later if there
> is no request from the queue within 8ms or so. I am not sure whether, with BFQ,
> that is feasible without creating issues in the current implementation.
> The current implementation was simple, so I stuck with it to begin with.

If the maximum wait is bounded by 8ms, then it should be fine. The
comments on the patch did not mention such a limit; it sounded like an
unbounded wait to me.

Does keeping the sync queue in the ready tree solve the problem too? Is
that because it avoids a virtual time jump?
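
To make the vtime jump concrete, here is a small standalone toy model
(illustrative only, not kernel code; the slice size, the round pattern and
all names are made up). With equal weights, a queue that briefly goes idle
after each slice ends up with about a third of the disk once its timestamp
is re-anchored on every return, while keeping its old timestamp restores
the 1:1 split:

#include <stdio.h>

/*
 * Toy model: queue A is continuously backlogged, queue B goes idle for one
 * round after every slice it receives.  Both have equal weight, so ideally
 * each should get ~50% of the service.  With VTIME_JUMP set, B is dropped
 * from the tree while idle and re-enqueued at the current virtual time, so
 * it never recovers the service it missed and ends up with ~33%.
 */
#define VTIME_JUMP 1

struct toy_queue {
	unsigned long long vstart;      /* virtual start time on the tree */
	unsigned long long service;     /* total service received */
	int backlogged;
};

int main(void)
{
	struct toy_queue a = { 0, 0, 1 }, b = { 0, 0, 1 };
	unsigned long long vtime = 0;
	const unsigned long long slice = 100;
	int round;

	for (round = 0; round < 60; round++) {
		/* Serve the backlogged queue with the smallest virtual start. */
		struct toy_queue *q =
			(!b.backlogged || a.vstart <= b.vstart) ? &a : &b;

		q->service += slice;
		q->vstart += slice;             /* equal weights */
		if (q->vstart > vtime)
			vtime = q->vstart;      /* virtual time advances */

		if (q == &b) {
			b.backlogged = 0;       /* B briefly runs dry */
		} else if (!b.backlogged) {
			b.backlogged = 1;       /* B is back one round later */
			if (VTIME_JUMP)
				b.vstart = vtime;  /* restart from "now" */
			/* else: B keeps its old timestamp and catches up */
		}
	}

	printf("A: %llu  B: %llu (equal weights, ideal is 1:1)\n",
	       a.service, b.service);
	return 0;
}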

>
> So yes, I am planning to keep it under a tunable, unless there are
> significant issues in doing that.
>
> Thanks
> Vivek
>
>>
>> >
>> > o Introduced a tunable "fairness". If set, io-controller will put more
>> >  focus on getting fairness right than getting throughput right.
>> >
>> >
>> > ---
>> > áblock/blk-sysfs.c á | á á7 ++++
>> > áblock/elevator-fq.c | á 85 +++++++++++++++++++++++++++++++++++++++++++++-------
>> > áblock/elevator-fq.h | á 12 +++++++
>> > á3 files changed, 94 insertions(+), 10 deletions(-)
>> >
>> > Index: linux1/block/elevator-fq.h
>> > ===================================================================
>> > --- linux1.orig/block/elevator-fq.h á á 2009-03-18 17:34:46.000000000 -0400
>> > +++ linux1/block/elevator-fq.h á2009-03-18 17:34:53.000000000 -0400
>> > @@ -318,6 +318,13 @@ struct elv_fq_data {
>> > á á á áunsigned long long rate_sampling_start; /*sampling window start jifies*/
>> > á á á á/* number of sectors finished io during current sampling window */
>> > á á á áunsigned long rate_sectors_current;
>> > +
>> > + á á á /*
>> > + á á á á* If set to 1, will disable many optimizations done for boost
>> > + á á á á* throughput and focus more on providing fairness for sync
>> > + á á á á* queues.
>> > + á á á á*/
>> > + á á á int fairness;
>> > á};
>> >
>> > áextern int elv_slice_idle;
>> > @@ -340,6 +347,7 @@ enum elv_queue_state_flags {
>> > á á á áELV_QUEUE_FLAG_idle_window, á á á /* elevator slice idling enabled */
>> > á á á áELV_QUEUE_FLAG_wait_request, á á á/* waiting for a request */
>> > á á á áELV_QUEUE_FLAG_slice_new, á á á á /* no requests dispatched in slice */
>> > + á á á ELV_QUEUE_FLAG_wait_busy, á á á á /* wait for this queue to get busy */
>> > á á á áELV_QUEUE_FLAG_NR,
>> > á};
>> >
>> > @@ -362,6 +370,7 @@ ELV_IO_QUEUE_FLAG_FNS(sync)
>> > áELV_IO_QUEUE_FLAG_FNS(wait_request)
>> > áELV_IO_QUEUE_FLAG_FNS(idle_window)
>> > áELV_IO_QUEUE_FLAG_FNS(slice_new)
>> > +ELV_IO_QUEUE_FLAG_FNS(wait_busy)
>> >
>> > ástatic inline struct io_service_tree *
>> > áio_entity_service_tree(struct io_entity *entity)
>> > @@ -554,6 +563,9 @@ static inline struct io_queue *elv_looku
>> > áextern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
>> > áextern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
>> > á á á á á á á á á á á á á á á á á á á á á á á ásize_t count);
>> > +extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
>> > +extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
>> > + á á á á á á á á á á á á á á á á á á á á á á á size_t count);
>> >
>> > á/* Functions used by elevator.c */
>> > áextern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
>> > Index: linux1/block/elevator-fq.c
>> > ===================================================================
>> > --- linux1.orig/block/elevator-fq.c á á 2009-03-18 17:34:46.000000000 -0400
>> > +++ linux1/block/elevator-fq.c á2009-03-18 17:34:53.000000000 -0400
>> > @@ -1837,6 +1837,44 @@ void elv_ioq_served(struct io_queue *ioq
>> > á á á á á á á á á á á áioq->total_service);
>> > á}
>> >
>> > +/* Functions to show and store fairness value through sysfs */
>> > +ssize_t elv_fairness_show(struct request_queue *q, char *name)
>> > +{
>> > + á á á struct elv_fq_data *efqd;
>> > + á á á unsigned int data;
>> > + á á á unsigned long flags;
>> > +
>> > + á á á spin_lock_irqsave(q->queue_lock, flags);
>> > + á á á efqd = &q->elevator->efqd;
>> > + á á á data = efqd->fairness;
>> > + á á á spin_unlock_irqrestore(q->queue_lock, flags);
>> > + á á á return sprintf(name, "%d\n", data);
>> > +}
>> > +
>> > +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
>> > + á á á á á á á á á á á á size_t count)
>> > +{
>> > + á á á struct elv_fq_data *efqd;
>> > + á á á unsigned int data;
>> > + á á á unsigned long flags;
>> > +
>> > + á á á char *p = (char *)name;
>> > +
>> > + á á á data = simple_strtoul(p, &p, 10);
>> > +
>> > + á á á if (data < 0)
>> > + á á á á á á á data = 0;
>> > + á á á else if (data > INT_MAX)
>> > + á á á á á á á data = INT_MAX;
>> > +
>> > + á á á spin_lock_irqsave(q->queue_lock, flags);
>> > + á á á efqd = &q->elevator->efqd;
>> > + á á á efqd->fairness = data;
>> > + á á á spin_unlock_irqrestore(q->queue_lock, flags);
>> > +
>> > + á á á return count;
>> > +}
>> > +
>> > á/* Functions to show and store elv_idle_slice value through sysfs */
>> > ássize_t elv_slice_idle_show(struct request_queue *q, char *name)
>> > á{
>> > @@ -2263,10 +2301,11 @@ void __elv_ioq_slice_expired(struct requ
>> > á á á áassert_spin_locked(q->queue_lock);
>> > á á á áelv_log_ioq(efqd, ioq, "slice expired upd=%d", budget_update);
>> >
>> > - á á á if (elv_ioq_wait_request(ioq))
>> > + á á á if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
>> > á á á á á á á ádel_timer(&efqd->idle_slice_timer);
>> >
>> > á á á áelv_clear_ioq_wait_request(ioq);
>> > + á á á elv_clear_ioq_wait_busy(ioq);
>> >
>> > á á á á/*
>> > á á á á * if ioq->slice_end = 0, that means a queue was expired before first
>> > @@ -2482,8 +2521,9 @@ void elv_ioq_request_add(struct request_
>> > á á á á á á á á * immediately and flag that we must not expire this queue
>> > á á á á á á á á * just now
>> > á á á á á á á á */
>> > - á á á á á á á if (elv_ioq_wait_request(ioq)) {
>> > + á á á á á á á if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq)) {
>> > á á á á á á á á á á á ádel_timer(&efqd->idle_slice_timer);
>> > + á á á á á á á á á á á elv_clear_ioq_wait_busy(ioq);
>> > á á á á á á á á á á á áblk_start_queueing(q);
>> > á á á á á á á á}
>> > á á á á} else if (elv_should_preempt(q, ioq, rq)) {
>> > @@ -2519,6 +2559,9 @@ void elv_idle_slice_timer(unsigned long
>> >
>> > á á á áif (ioq) {
>> >
>> > + á á á á á á á if (elv_ioq_wait_busy(ioq))
>> > + á á á á á á á á á á á goto expire;
>> > +
>> > á á á á á á á á/*
>> > á á á á á á á á * expired
>> > á á á á á á á á */
>> > @@ -2546,7 +2589,7 @@ out_cont:
>> > á á á áspin_unlock_irqrestore(q->queue_lock, flags);
>> > á}
>> >
>> > -void elv_ioq_arm_slice_timer(struct request_queue *q)
>> > +void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
>> > á{
>> > á á á ástruct elv_fq_data *efqd = &q->elevator->efqd;
>> > á á á ástruct io_queue *ioq = elv_active_ioq(q->elevator);
>> > @@ -2563,15 +2606,27 @@ void elv_ioq_arm_slice_timer(struct requ
>> > á á á á á á á áreturn;
>> >
>> > á á á á/*
>> > - á á á á* still requests with the driver, don't idle
>> > + á á á á* idle is disabled, either manually or by past process history
>> > á á á á */
>> > - á á á if (efqd->rq_in_driver)
>> > + á á á if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
>> > á á á á á á á áreturn;
>> >
>> > á á á á/*
>> > - á á á á* idle is disabled, either manually or by past process history
>> > + á á á á* This queue has consumed its time slice. We are waiting only for
>> > + á á á á* it to become busy before we select next queue for dispatch.
>> > á á á á */
>> > - á á á if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
>> > + á á á if (efqd->fairness && wait_for_busy) {
>> > + á á á á á á á elv_mark_ioq_wait_busy(ioq);
>> > + á á á á á á á sl = efqd->elv_slice_idle;
>> > + á á á á á á á mod_timer(&efqd->idle_slice_timer, jiffies + sl);
>> > + á á á á á á á elv_log(efqd, "arm idle: %lu wait busy=1", sl);
>> > + á á á á á á á return;
>> > + á á á }
>> > +
>> > + á á á /*
>> > + á á á á* still requests with the driver, don't idle
>> > + á á á á*/
>> > + á á á if (efqd->rq_in_driver)
>> > á á á á á á á áreturn;
>> >
>> > á á á á/*
>> > @@ -2628,6 +2683,12 @@ void *elv_fq_select_ioq(struct request_q
>> > á á á á á á á á}
>> > á á á á}
>> >
>> > + á á á /* We are waiting for this queue to become busy before it expires.*/
>> > + á á á if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
>> > + á á á á á á á ioq = NULL;
>> > + á á á á á á á goto keep_queue;
>> > + á á á }
>> > +
>> > á á á á/*
>> > á á á á * The active queue has run out of time, expire it and select new.
>> > á á á á */
>> > @@ -2802,10 +2863,14 @@ void elv_ioq_completed_request(struct re
>> > á á á á á á á á á á á áelv_ioq_set_prio_slice(q, ioq);
>> > á á á á á á á á á á á áelv_clear_ioq_slice_new(ioq);
>> > á á á á á á á á}
>> > - á á á á á á á if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
>> > + á á á á á á á if (elv_ioq_class_idle(ioq))
>> > á á á á á á á á á á á áelv_ioq_slice_expired(q, 1);
>> > - á á á á á á á else if (sync && !ioq->nr_queued)
>> > - á á á á á á á á á á á elv_ioq_arm_slice_timer(q);
>> > + á á á á á á á else if (sync && !ioq->nr_queued) {
>> > + á á á á á á á á á á á if (elv_ioq_slice_used(ioq))
>> > + á á á á á á á á á á á á á á á elv_ioq_arm_slice_timer(q, 1);
>> > + á á á á á á á á á á á else
>> > + á á á á á á á á á á á á á á á elv_ioq_arm_slice_timer(q, 0);
>> > + á á á á á á á }
>> > á á á á}
>> >
>> > á á á áif (!efqd->rq_in_driver)
>> > Index: linux1/block/blk-sysfs.c
>> > ===================================================================
>> > --- linux1.orig/block/blk-sysfs.c á á á 2009-03-18 17:34:28.000000000 -0400
>> > +++ linux1/block/blk-sysfs.c á á2009-03-18 17:34:53.000000000 -0400
>> > @@ -282,6 +282,12 @@ static struct queue_sysfs_entry queue_sl
>> > á á á á.show = elv_slice_idle_show,
>> > á á á á.store = elv_slice_idle_store,
>> > á};
>> > +
>> > +static struct queue_sysfs_entry queue_fairness_entry = {
>> > + á á á .attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
>> > + á á á .show = elv_fairness_show,
>> > + á á á .store = elv_fairness_store,
>> > +};
>> > á#endif
>> > ástatic struct attribute *default_attrs[] = {
>> > á á á á&queue_requests_entry.attr,
>> > @@ -296,6 +302,7 @@ static struct attribute *default_attrs[]
>> > á á á á&queue_iostats_entry.attr,
>> > á#ifdef CONFIG_ELV_FAIR_QUEUING
>> > á á á á&queue_slice_idle_entry.attr,
>> > + á á á &queue_fairness_entry.attr,
>> > á#endif
>> > á á á áNULL,
>> > á};
>> >
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-24 12:58                     ` Vivek Goyal
  (?)
@ 2009-03-24 18:14                     ` Nauman Rafique
       [not found]                       ` <e98e18940903241114u1e03ae7dhf654d7d8d0fc0302-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 1 reply; 190+ messages in thread
From: Nauman Rafique @ 2009-03-24 18:14 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Gui Jianfeng, Dhaval Giani, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	arozansk, jmoyer, oz-kernel, balbir, linux-kernel, containers,
	akpm, menage, peterz

On Tue, Mar 24, 2009 at 5:58 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Mar 23, 2009 at 10:32:41PM -0700, Nauman Rafique wrote:
>
> [..]
>> > DESC
>> > io-controller: idle for sometime on sync queue before expiring it
>> > EDESC
>> >
>> > o When a sync queue expires, in many cases it might be empty and then
>> >  it will be deleted from the active tree. This will lead to a scenario
>> >  where, out of two competing queues, only one is on the tree, and when a
>> >  new queue is selected, a vtime jump takes place and we don't see service
>> >  provided in proportion to weight.
>> >
>> > o In general this is a fundamental problem with fairness of sync queues
>> >  where queues are not continuously backlogged. Idling looks like the
>> >  only solution to make sure such queues can get a decent amount
>> >  of disk bandwidth in the face of competition from continuously backlogged
>> >  queues. But excessive idling has the potential to reduce performance on SSDs
>> >  and disks with command queuing.
>> >
>> > o This patch experiments with waiting for the next request to come before a
>> >  queue is expired after it has consumed its time slice. This can ensure
>> >  more accurate fairness numbers in some cases.
>>
>> Vivek, have you introduced this option just to play with it, or are you
>> planning to make it a part of the patch set? Waiting for a new
>> request to come before expiring the time slice sounds problematic.
>
> What issues do you foresee with it? This is just an extra 8ms of idling
> on the sync queue, and that only if the think time of the queue is not high.
>
> We already idle on sync queues. In this case we do an extra idle even if
> the queue has consumed its allocated quota. It helps me get fairness
> numbers, and I have put it under a tunable, "fairness", so by default
> this code will not kick in.
>
> Another possible option could be: when expiring a sync queue, don't
> remove the queue from the tree immediately, and remove it later if there
> is no request from the queue within 8ms or so. I am not sure whether,
> with BFQ, that is feasible without creating issues for the current
> implementation. The current implementation was simple, so I stuck with it
> to begin with.

If the maximum wait is bounded by 8ms, then it should be fine. The
comments on the patch did not mention such a limit; it sounded like an
unbounded wait to me.

Does keeping the sync queue in the ready tree solve the problem too? Is
that because it avoids a virtual time jump?

>
> So yes, I am planning to keep it under a tunable, unless there are
> significant issues in doing that.
>
> Thanks
> Vivek
>
>>
>> >
>> > o Introduced a tunable "fairness". If set, io-controller will put more
>> >  focus on getting fairness right than getting throughput right.
>> >
>> >
>> > ---
>> > áblock/blk-sysfs.c á | á á7 ++++
>> > áblock/elevator-fq.c | á 85 +++++++++++++++++++++++++++++++++++++++++++++-------
>> > áblock/elevator-fq.h | á 12 +++++++
>> > á3 files changed, 94 insertions(+), 10 deletions(-)
>> >
>> > Index: linux1/block/elevator-fq.h
>> > ===================================================================
>> > --- linux1.orig/block/elevator-fq.h á á 2009-03-18 17:34:46.000000000 -0400
>> > +++ linux1/block/elevator-fq.h á2009-03-18 17:34:53.000000000 -0400
>> > @@ -318,6 +318,13 @@ struct elv_fq_data {
>> > á á á áunsigned long long rate_sampling_start; /*sampling window start jifies*/
>> > á á á á/* number of sectors finished io during current sampling window */
>> > á á á áunsigned long rate_sectors_current;
>> > +
>> > + á á á /*
>> > + á á á á* If set to 1, will disable many optimizations done for boost
>> > + á á á á* throughput and focus more on providing fairness for sync
>> > + á á á á* queues.
>> > + á á á á*/
>> > + á á á int fairness;
>> > á};
>> >
>> > áextern int elv_slice_idle;
>> > @@ -340,6 +347,7 @@ enum elv_queue_state_flags {
>> > á á á áELV_QUEUE_FLAG_idle_window, á á á /* elevator slice idling enabled */
>> > á á á áELV_QUEUE_FLAG_wait_request, á á á/* waiting for a request */
>> > á á á áELV_QUEUE_FLAG_slice_new, á á á á /* no requests dispatched in slice */
>> > + á á á ELV_QUEUE_FLAG_wait_busy, á á á á /* wait for this queue to get busy */
>> > á á á áELV_QUEUE_FLAG_NR,
>> > á};
>> >
>> > @@ -362,6 +370,7 @@ ELV_IO_QUEUE_FLAG_FNS(sync)
>> > áELV_IO_QUEUE_FLAG_FNS(wait_request)
>> > áELV_IO_QUEUE_FLAG_FNS(idle_window)
>> > áELV_IO_QUEUE_FLAG_FNS(slice_new)
>> > +ELV_IO_QUEUE_FLAG_FNS(wait_busy)
>> >
>> > ástatic inline struct io_service_tree *
>> > áio_entity_service_tree(struct io_entity *entity)
>> > @@ -554,6 +563,9 @@ static inline struct io_queue *elv_looku
>> > áextern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
>> > áextern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
>> > á á á á á á á á á á á á á á á á á á á á á á á ásize_t count);
>> > +extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
>> > +extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
>> > + á á á á á á á á á á á á á á á á á á á á á á á size_t count);
>> >
>> > á/* Functions used by elevator.c */
>> > áextern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
>> > Index: linux1/block/elevator-fq.c
>> > ===================================================================
>> > --- linux1.orig/block/elevator-fq.c á á 2009-03-18 17:34:46.000000000 -0400
>> > +++ linux1/block/elevator-fq.c á2009-03-18 17:34:53.000000000 -0400
>> > @@ -1837,6 +1837,44 @@ void elv_ioq_served(struct io_queue *ioq
>> > á á á á á á á á á á á áioq->total_service);
>> > á}
>> >
>> > +/* Functions to show and store fairness value through sysfs */
>> > +ssize_t elv_fairness_show(struct request_queue *q, char *name)
>> > +{
>> > + á á á struct elv_fq_data *efqd;
>> > + á á á unsigned int data;
>> > + á á á unsigned long flags;
>> > +
>> > + á á á spin_lock_irqsave(q->queue_lock, flags);
>> > + á á á efqd = &q->elevator->efqd;
>> > + á á á data = efqd->fairness;
>> > + á á á spin_unlock_irqrestore(q->queue_lock, flags);
>> > + á á á return sprintf(name, "%d\n", data);
>> > +}
>> > +
>> > +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
>> > + á á á á á á á á á á á á size_t count)
>> > +{
>> > + á á á struct elv_fq_data *efqd;
>> > + á á á unsigned int data;
>> > + á á á unsigned long flags;
>> > +
>> > + á á á char *p = (char *)name;
>> > +
>> > + á á á data = simple_strtoul(p, &p, 10);
>> > +
>> > + á á á if (data < 0)
>> > + á á á á á á á data = 0;
>> > + á á á else if (data > INT_MAX)
>> > + á á á á á á á data = INT_MAX;
>> > +
>> > + á á á spin_lock_irqsave(q->queue_lock, flags);
>> > + á á á efqd = &q->elevator->efqd;
>> > + á á á efqd->fairness = data;
>> > + á á á spin_unlock_irqrestore(q->queue_lock, flags);
>> > +
>> > + á á á return count;
>> > +}
>> > +
>> > á/* Functions to show and store elv_idle_slice value through sysfs */
>> > ássize_t elv_slice_idle_show(struct request_queue *q, char *name)
>> > á{
>> > @@ -2263,10 +2301,11 @@ void __elv_ioq_slice_expired(struct requ
>> > á á á áassert_spin_locked(q->queue_lock);
>> > á á á áelv_log_ioq(efqd, ioq, "slice expired upd=%d", budget_update);
>> >
>> > - á á á if (elv_ioq_wait_request(ioq))
>> > + á á á if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
>> > á á á á á á á ádel_timer(&efqd->idle_slice_timer);
>> >
>> > á á á áelv_clear_ioq_wait_request(ioq);
>> > + á á á elv_clear_ioq_wait_busy(ioq);
>> >
>> > á á á á/*
>> > á á á á * if ioq->slice_end = 0, that means a queue was expired before first
>> > @@ -2482,8 +2521,9 @@ void elv_ioq_request_add(struct request_
>> > á á á á á á á á * immediately and flag that we must not expire this queue
>> > á á á á á á á á * just now
>> > á á á á á á á á */
>> > - á á á á á á á if (elv_ioq_wait_request(ioq)) {
>> > + á á á á á á á if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq)) {
>> > á á á á á á á á á á á ádel_timer(&efqd->idle_slice_timer);
>> > + á á á á á á á á á á á elv_clear_ioq_wait_busy(ioq);
>> > á á á á á á á á á á á áblk_start_queueing(q);
>> > á á á á á á á á}
>> > á á á á} else if (elv_should_preempt(q, ioq, rq)) {
>> > @@ -2519,6 +2559,9 @@ void elv_idle_slice_timer(unsigned long
>> >
>> > á á á áif (ioq) {
>> >
>> > + á á á á á á á if (elv_ioq_wait_busy(ioq))
>> > + á á á á á á á á á á á goto expire;
>> > +
>> > á á á á á á á á/*
>> > á á á á á á á á * expired
>> > á á á á á á á á */
>> > @@ -2546,7 +2589,7 @@ out_cont:
>> > á á á áspin_unlock_irqrestore(q->queue_lock, flags);
>> > á}
>> >
>> > -void elv_ioq_arm_slice_timer(struct request_queue *q)
>> > +void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
>> > á{
>> > á á á ástruct elv_fq_data *efqd = &q->elevator->efqd;
>> > á á á ástruct io_queue *ioq = elv_active_ioq(q->elevator);
>> > @@ -2563,15 +2606,27 @@ void elv_ioq_arm_slice_timer(struct requ
>> > á á á á á á á áreturn;
>> >
>> > á á á á/*
>> > - á á á á* still requests with the driver, don't idle
>> > + á á á á* idle is disabled, either manually or by past process history
>> > á á á á */
>> > - á á á if (efqd->rq_in_driver)
>> > + á á á if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
>> > á á á á á á á áreturn;
>> >
>> > á á á á/*
>> > - á á á á* idle is disabled, either manually or by past process history
>> > + á á á á* This queue has consumed its time slice. We are waiting only for
>> > + á á á á* it to become busy before we select next queue for dispatch.
>> > á á á á */
>> > - á á á if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
>> > + á á á if (efqd->fairness && wait_for_busy) {
>> > + á á á á á á á elv_mark_ioq_wait_busy(ioq);
>> > + á á á á á á á sl = efqd->elv_slice_idle;
>> > + á á á á á á á mod_timer(&efqd->idle_slice_timer, jiffies + sl);
>> > + á á á á á á á elv_log(efqd, "arm idle: %lu wait busy=1", sl);
>> > + á á á á á á á return;
>> > + á á á }
>> > +
>> > + á á á /*
>> > + á á á á* still requests with the driver, don't idle
>> > + á á á á*/
>> > + á á á if (efqd->rq_in_driver)
>> > á á á á á á á áreturn;
>> >
>> > á á á á/*
>> > @@ -2628,6 +2683,12 @@ void *elv_fq_select_ioq(struct request_q
>> > á á á á á á á á}
>> > á á á á}
>> >
>> > + á á á /* We are waiting for this queue to become busy before it expires.*/
>> > + á á á if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
>> > + á á á á á á á ioq = NULL;
>> > + á á á á á á á goto keep_queue;
>> > + á á á }
>> > +
>> > á á á á/*
>> > á á á á * The active queue has run out of time, expire it and select new.
>> > á á á á */
>> > @@ -2802,10 +2863,14 @@ void elv_ioq_completed_request(struct re
>> > á á á á á á á á á á á áelv_ioq_set_prio_slice(q, ioq);
>> > á á á á á á á á á á á áelv_clear_ioq_slice_new(ioq);
>> > á á á á á á á á}
>> > - á á á á á á á if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
>> > + á á á á á á á if (elv_ioq_class_idle(ioq))
>> > á á á á á á á á á á á áelv_ioq_slice_expired(q, 1);
>> > - á á á á á á á else if (sync && !ioq->nr_queued)
>> > - á á á á á á á á á á á elv_ioq_arm_slice_timer(q);
>> > + á á á á á á á else if (sync && !ioq->nr_queued) {
>> > + á á á á á á á á á á á if (elv_ioq_slice_used(ioq))
>> > + á á á á á á á á á á á á á á á elv_ioq_arm_slice_timer(q, 1);
>> > + á á á á á á á á á á á else
>> > + á á á á á á á á á á á á á á á elv_ioq_arm_slice_timer(q, 0);
>> > + á á á á á á á }
>> > á á á á}
>> >
>> > á á á áif (!efqd->rq_in_driver)
>> > Index: linux1/block/blk-sysfs.c
>> > ===================================================================
>> > --- linux1.orig/block/blk-sysfs.c á á á 2009-03-18 17:34:28.000000000 -0400
>> > +++ linux1/block/blk-sysfs.c á á2009-03-18 17:34:53.000000000 -0400
>> > @@ -282,6 +282,12 @@ static struct queue_sysfs_entry queue_sl
>> > á á á á.show = elv_slice_idle_show,
>> > á á á á.store = elv_slice_idle_store,
>> > á};
>> > +
>> > +static struct queue_sysfs_entry queue_fairness_entry = {
>> > + á á á .attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
>> > + á á á .show = elv_fairness_show,
>> > + á á á .store = elv_fairness_store,
>> > +};
>> > á#endif
>> > ástatic struct attribute *default_attrs[] = {
>> > á á á á&queue_requests_entry.attr,
>> > @@ -296,6 +302,7 @@ static struct attribute *default_attrs[]
>> > á á á á&queue_iostats_entry.attr,
>> > á#ifdef CONFIG_ELV_FAIR_QUEUING
>> > á á á á&queue_slice_idle_entry.attr,
>> > + á á á &queue_fairness_entry.attr,
>> > á#endif
>> > á á á áNULL,
>> > á};
>> >
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-24 18:14                     ` Nauman Rafique
@ 2009-03-24 18:29                           ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-24 18:29 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dhaval Giani,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Tue, Mar 24, 2009 at 11:14:13AM -0700, Nauman Rafique wrote:
> On Tue, Mar 24, 2009 at 5:58 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Mon, Mar 23, 2009 at 10:32:41PM -0700, Nauman Rafique wrote:
> >
> > [..]
> >> > DESC
> >> > io-controller: idle for sometime on sync queue before expiring it
> >> > EDESC
> >> >
> >> > o When a sync queue expires, in many cases it might be empty and then
> >> >  it will be deleted from the active tree. This will lead to a scenario
> >> >  where, out of two competing queues, only one is on the tree, and when a
> >> >  new queue is selected, a vtime jump takes place and we don't see service
> >> >  provided in proportion to weight.
> >> >
> >> > o In general this is a fundamental problem with fairness of sync queues
> >> >  where queues are not continuously backlogged. Idling looks like the
> >> >  only solution to make sure such queues can get a decent amount
> >> >  of disk bandwidth in the face of competition from continuously backlogged
> >> >  queues. But excessive idling has the potential to reduce performance on SSDs
> >> >  and disks with command queuing.
> >> >
> >> > o This patch experiments with waiting for the next request to come before a
> >> >  queue is expired after it has consumed its time slice. This can ensure
> >> >  more accurate fairness numbers in some cases.
> >>
> >> Vivek, have you introduced this option just to play with it, or are you
> >> planning to make it a part of the patch set? Waiting for a new
> >> request to come before expiring the time slice sounds problematic.
> >
> > What issues do you foresee with it? This is just an extra 8ms of idling
> > on the sync queue, and that only if the think time of the queue is not high.
> >
> > We already idle on sync queues. In this case we do an extra idle even if
> > the queue has consumed its allocated quota. It helps me get fairness
> > numbers, and I have put it under a tunable, "fairness", so by default
> > this code will not kick in.
> >
> > Another possible option could be: when expiring a sync queue, don't
> > remove the queue from the tree immediately, and remove it later if there
> > is no request from the queue within 8ms or so. I am not sure whether,
> > with BFQ, that is feasible without creating issues for the current
> > implementation. The current implementation was simple, so I stuck with it
> > to begin with.
> 
> If the maximum wait is bounded by 8ms, then it should be fine. The
> comments on the patch did not mention such a limit; it sounded like an
> unbounded wait to me.
> 
> Does keeping the sync queue in the ready tree solve the problem too? Is
> that because it avoids a virtual time jump?
> 

I have not tried the second approach yet. But that also should solve the
vtime jump issue.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
@ 2009-03-24 18:29                           ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-24 18:29 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: Gui Jianfeng, Dhaval Giani, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	arozansk, jmoyer, oz-kernel, balbir, linux-kernel, containers,
	akpm, menage, peterz

On Tue, Mar 24, 2009 at 11:14:13AM -0700, Nauman Rafique wrote:
> On Tue, Mar 24, 2009 at 5:58 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Mon, Mar 23, 2009 at 10:32:41PM -0700, Nauman Rafique wrote:
> >
> > [..]
> >> > DESC
> >> > io-controller: idle for sometime on sync queue before expiring it
> >> > EDESC
> >> >
> >> > o When a sync queue expires, in many cases it might be empty and then
> >> >  it will be deleted from the active tree. This will lead to a scenario
> >> >  where, out of two competing queues, only one is on the tree, and when a
> >> >  new queue is selected, a vtime jump takes place and we don't see service
> >> >  provided in proportion to weight.
> >> >
> >> > o In general this is a fundamental problem with fairness of sync queues
> >> >  where queues are not continuously backlogged. Idling looks like the
> >> >  only solution to make sure such queues can get a decent amount
> >> >  of disk bandwidth in the face of competition from continuously backlogged
> >> >  queues. But excessive idling has the potential to reduce performance on SSDs
> >> >  and disks with command queuing.
> >> >
> >> > o This patch experiments with waiting for the next request to come before a
> >> >  queue is expired after it has consumed its time slice. This can ensure
> >> >  more accurate fairness numbers in some cases.
> >>
> >> Vivek, have you introduced this option just to play with it, or are you
> >> planning to make it a part of the patch set? Waiting for a new
> >> request to come before expiring the time slice sounds problematic.
> >
> > What issues do you foresee with it? This is just an extra 8ms of idling
> > on the sync queue, and that only if the think time of the queue is not high.
> >
> > We already idle on sync queues. In this case we do an extra idle even if
> > the queue has consumed its allocated quota. It helps me get fairness
> > numbers, and I have put it under a tunable, "fairness", so by default
> > this code will not kick in.
> >
> > Another possible option could be: when expiring a sync queue, don't
> > remove the queue from the tree immediately, and remove it later if there
> > is no request from the queue within 8ms or so. I am not sure whether,
> > with BFQ, that is feasible without creating issues for the current
> > implementation. The current implementation was simple, so I stuck with it
> > to begin with.
> 
> If the maximum wait is bounded by 8ms, then it should be fine. The
> comments on the patch did not mention such a limit; it sounded like an
> unbounded wait to me.
> 
> Does keeping the sync queue in the ready tree solve the problem too? Is
> that because it avoids a virtual time jump?
> 

I have not tried the second approach yet. But that also should solve the
vtime jump issue.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-24 18:41                           ` Fabio Checconi
@ 2009-03-24 18:35                                 ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-24 18:35 UTC (permalink / raw)
  To: Fabio Checconi
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA, Dhaval Giani,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Tue, Mar 24, 2009 at 07:41:01PM +0100, Fabio Checconi wrote:
> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Date: Tue, Mar 24, 2009 02:29:06PM -0400
> >
> ...
> > > Does keeping the sync queue in the ready tree solve the problem too? Is
> > > that because it avoids a virtual time jump?
> > > 
> > 
> > I have not tried the second approach yet. But that also should solve the
> > vtime jump issue.
> > 
> 
> Do you mean that you intend to keep a queue with no backlog in the
> active tree?

Yes. Is it possible to keep a non-backlogged queue in the tree for later
expiry? That way we don't actively wait/idle for the next request in the
hope that the queue will become backlogged soon. Otherwise, it will be
deleted from the active tree. This is just a thought; I am not even sure
how it would interfere with the BFQ code.

All this to solve the vtime jump issue for sync queues.
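
For what it's worth, a rough standalone sketch of that lazy-removal idea
(purely illustrative; the helper names and the 8-tick grace figure are
assumptions based on this thread, not code from the patches):

#include <stdbool.h>
#include <stdio.h>

/* A queue that expires while empty stays on the tree for a short grace
 * period and is only removed if it is still empty when the grace ends. */
#define GRACE_TICKS 8   /* stand-in for ~8ms */

struct sketch_ioq {
	bool on_tree;
	bool empty;
	unsigned long grace_deadline;   /* 0 = no grace period armed */
};

/* Slice expiry: instead of deleting an empty queue, arm a grace period. */
static void sketch_expire(struct sketch_ioq *ioq, unsigned long now)
{
	ioq->grace_deadline = ioq->empty ? now + GRACE_TICKS : 0;
}

/* A new request arrived in time: cancel the pending removal. */
static void sketch_request_add(struct sketch_ioq *ioq)
{
	ioq->empty = false;
	ioq->grace_deadline = 0;
}

/* Called from the selection path: reap queues whose grace has run out. */
static void sketch_reap_if_idle(struct sketch_ioq *ioq, unsigned long now)
{
	if (ioq->grace_deadline && ioq->empty && now >= ioq->grace_deadline) {
		ioq->on_tree = false;           /* delete from the tree now */
		ioq->grace_deadline = 0;
	}
}

int main(void)
{
	struct sketch_ioq ioq = { .on_tree = true, .empty = true };

	sketch_expire(&ioq, 100);               /* expires empty at t=100 */
	sketch_reap_if_idle(&ioq, 104);         /* t=104: still in grace */
	printf("t=104: on_tree=%d\n", ioq.on_tree);
	sketch_request_add(&ioq);               /* request shows up in time */
	sketch_reap_if_idle(&ioq, 109);         /* no longer empty: stays */
	printf("t=109: on_tree=%d\n", ioq.on_tree);
	return 0;
}

The selection path would then reap such queues lazily instead of arming an
idle timer and actively waiting on them.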

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
@ 2009-03-24 18:35                                 ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-24 18:35 UTC (permalink / raw)
  To: Fabio Checconi
  Cc: Nauman Rafique, Gui Jianfeng, Dhaval Giani, dpshah, lizf, mikew,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	arozansk, jmoyer, oz-kernel, balbir, linux-kernel, containers,
	akpm, menage, peterz

On Tue, Mar 24, 2009 at 07:41:01PM +0100, Fabio Checconi wrote:
> > From: Vivek Goyal <vgoyal@redhat.com>
> > Date: Tue, Mar 24, 2009 02:29:06PM -0400
> >
> ...
> > > Does keeping the sync queue in the ready tree solve the problem too? Is
> > > that because it avoids a virtual time jump?
> > > 
> > 
> > I have not tried the second approach yet. But that also should solve the
> > vtime jump issue.
> > 
> 
> Do you mean that you intend to keep a queue with no backlog in the
> active tree?

Yes. Is it possible to keep a non-backlogged queue in the tree for later
expiry? That way we don't actively wait/idle for the next request in the
hope that the queue will become backlogged soon. Otherwise, it will be
deleted from the active tree. This is just a thought; I am not even sure
how it would interfere with the BFQ code.

All this to solve the vtime jump issue for sync queues.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]                           ` <20090324182906.GF21389-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-03-24 18:41                             ` Fabio Checconi
  0 siblings, 0 replies; 190+ messages in thread
From: Fabio Checconi @ 2009-03-24 18:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA, Dhaval Giani,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

> From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Date: Tue, Mar 24, 2009 02:29:06PM -0400
>
...
> > Does keeping the sync queue in the ready tree solve the problem too? Is
> > that because it avoids a virtual time jump?
> > 
> 
> I have not tried the second approach yet. But that also should solve the
> vtime jump issue.
> 

Do you mean that you intend to keep a queue with no backlog in the
active tree?

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-24 18:29                           ` Vivek Goyal
  (?)
@ 2009-03-24 18:41                           ` Fabio Checconi
       [not found]                             ` <20090324184101.GO18554-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
  -1 siblings, 1 reply; 190+ messages in thread
From: Fabio Checconi @ 2009-03-24 18:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Nauman Rafique, Gui Jianfeng, Dhaval Giani, dpshah, lizf, mikew,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	arozansk, jmoyer, oz-kernel, balbir, linux-kernel, containers,
	akpm, menage, peterz

> From: Vivek Goyal <vgoyal@redhat.com>
> Date: Tue, Mar 24, 2009 02:29:06PM -0400
>
...
> > Does keeping the sync queue in the ready tree solve the problem too? Is
> > that because it avoids a virtual time jump?
> > 
> 
> I have not tried the second approach yet. But that also should solve the
> vtime jump issue.
> 

Do you mean that you intend to keep a queue with no backlog in the
active tree?

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]                                 ` <20090324183532.GG21389-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-03-24 18:49                                   ` Nauman Rafique
  2009-03-24 19:04                                   ` Fabio Checconi
  1 sibling, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-03-24 18:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dhaval Giani,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Fabio Checconi,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Tue, Mar 24, 2009 at 11:35 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Mar 24, 2009 at 07:41:01PM +0100, Fabio Checconi wrote:
>> > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> > Date: Tue, Mar 24, 2009 02:29:06PM -0400
>> >
>> ...
>> > > Does keeping the sync queue in the ready tree solve the problem too? Is
>> > > that because it avoids a virtual time jump?
>> > >
>> >
>> > I have not tried the second approach yet. But that also should solve the
>> > vtime jump issue.
>> >
>>
>> Do you mean that you intend to keep a queue with no backlog in the
>> active tree?
>
> Yes. Is it possible to keep a non-backlogged queue in the tree for later
> expiry? That way we don't actively wait/idle for the next request in the
> hope that the queue will become backlogged soon. Otherwise, it will be
> deleted from the active tree. This is just a thought; I am not even sure
> how it would interfere with the BFQ code.
>
> All this to solve the vtime jump issue for sync queues.

If the vtime jump is the only issue, can we solve it by delaying the
vtime jump? That is, even if we serve an entity with a bigger vtime, we
don't update the reference vtime of the service tree until some time has
passed?
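
Something along these lines, perhaps -- a purely hypothetical standalone
sketch of the "stage the jump, commit it later" idea; none of these names
exist in the posted patches:

#include <stdio.h>

/* The tree's virtual time only jumps forward after a grace delay, giving a
 * briefly-idle queue a window in which its old, smaller timestamp still
 * counts. */
struct sketch_tree {
	unsigned long long vtime;           /* committed virtual time */
	unsigned long long pending_vtime;   /* staged jump target */
	unsigned long commit_deadline;      /* when the jump takes effect */
};

/* Serving an entity whose start is ahead of vtime: stage the jump. */
static void sketch_charge(struct sketch_tree *st, unsigned long long vstart,
			  unsigned long now, unsigned long delay)
{
	if (vstart > st->vtime && !st->pending_vtime) {
		st->pending_vtime = vstart;
		st->commit_deadline = now + delay;
	}
}

/* On each selection (or periodically), commit an overdue jump. */
static void sketch_commit(struct sketch_tree *st, unsigned long now)
{
	if (st->pending_vtime && now >= st->commit_deadline) {
		st->vtime = st->pending_vtime;
		st->pending_vtime = 0;
	}
}

int main(void)
{
	struct sketch_tree st = { .vtime = 100 };

	sketch_charge(&st, 300, 1000, 8);   /* would normally jump to 300 */
	sketch_commit(&st, 1004);           /* too early: vtime stays 100 */
	printf("vtime=%llu\n", st.vtime);
	sketch_commit(&st, 1010);           /* deadline passed: jump lands */
	printf("vtime=%llu\n", st.vtime);
	return 0;
}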

>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-24 18:35                                 ` Vivek Goyal
  (?)
  (?)
@ 2009-03-24 18:49                                 ` Nauman Rafique
  -1 siblings, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-03-24 18:49 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fabio Checconi, Gui Jianfeng, Dhaval Giani, dpshah, lizf, mikew,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	arozansk, jmoyer, oz-kernel, balbir, linux-kernel, containers,
	akpm, menage, peterz

On Tue, Mar 24, 2009 at 11:35 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Tue, Mar 24, 2009 at 07:41:01PM +0100, Fabio Checconi wrote:
>> > From: Vivek Goyal <vgoyal@redhat.com>
>> > Date: Tue, Mar 24, 2009 02:29:06PM -0400
>> >
>> ...
>> > > Does keeping the sync queue in the ready tree solve the problem too? Is
>> > > that because it avoids a virtual time jump?
>> > >
>> >
>> > I have not tried the second approach yet. But that also should solve the
>> > vtime jump issue.
>> >
>>
>> Do you mean that you intend to keep a queue with no backlog in the
>> active tree?
>
> Yes. Is it possible to keep a non-backlogged queue in the tree for later
> expiry? That way we don't actively wait/idle for the next request in the
> hope that the queue will become backlogged soon. Otherwise, it will be
> deleted from the active tree. This is just a thought; I am not even sure
> how it would interfere with the BFQ code.
>
> All this to solve the vtime jump issue for sync queues.

If the vtime jump is the only issue, can we solve it by delaying the
vtime jump? That is, even if we serve an entity with a bigger vtime, we
don't update the reference vtime of the service tree until some time has
passed?

>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]                                 ` <20090324183532.GG21389-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-03-24 18:49                                   ` Nauman Rafique
@ 2009-03-24 19:04                                   ` Fabio Checconi
  1 sibling, 0 replies; 190+ messages in thread
From: Fabio Checconi @ 2009-03-24 19:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA, Dhaval Giani,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

> From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Date: Tue, Mar 24, 2009 02:35:32PM -0400
>
> On Tue, Mar 24, 2009 at 07:41:01PM +0100, Fabio Checconi wrote:
> > > From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > Date: Tue, Mar 24, 2009 02:29:06PM -0400
> > >
> > ...
> > > > Does keeping the sync queue in the ready tree solve the problem too? Is
> > > > that because it avoids a virtual time jump?
> > > > 
> > > 
> > > I have not tried the second approach yet. But that also should solve the
> > > vtime jump issue.
> > > 
> > 
> > Do you mean that you intend to keep a queue with no backlog in the
> > active tree?
> 
> Yes. Is it possible to keep a non-backlogged queue in the tree for later
> expiry? That way we don't actively wait/idle for the next request in the
> hope that the queue will become backlogged soon. Otherwise, it will be
> deleted from the active tree. This is just a thought; I am not even sure
> how it would interfere with the BFQ code.
> 
> All this to solve the vtime jump issue for sync queues.
> 

Of course it is possible, but if you stick with WF2Q+ the virtual time
will jump anyway, and the only gain would be that each scheduling decision
has O(N log N) complexity instead of O(log N), just to skip the empty
queues.

Otherwise, if you do your own timestamping (where any new request
can get a timestamp smaller than the virtual time), then nothing from
the theory BFQ is based on can give any hint about the guarantees that
the resulting algorithm can provide.
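
To put that cost in concrete terms, an illustrative standalone sketch (not
BFQ code; a sorted singly-linked list stands in for the rbtree and the
names are invented): if non-backlogged queues are left on the tree, the
selection path may have to pop and discard up to N of them before finding
one with requests, and each discard is an O(log N) tree erase for real.

#include <stddef.h>
#include <stdio.h>

struct sketch_queue {
	int id;
	int backlogged;
	struct sketch_queue *next;  /* stand-in for rb_next() ordering */
};

/* Pop entries in timestamp order until a backlogged one is found; every
 * discarded entry would be an O(log N) rbtree erase in the real thing. */
static struct sketch_queue *select_next(struct sketch_queue **head)
{
	while (*head && !(*head)->backlogged)
		*head = (*head)->next;      /* "erase" the idle queue */
	return *head;                       /* NULL if nothing has work */
}

int main(void)
{
	struct sketch_queue q3 = { 3, 1, NULL };
	struct sketch_queue q2 = { 2, 0, &q3 };
	struct sketch_queue q1 = { 1, 0, &q2 };
	struct sketch_queue *tree = &q1;    /* q1 < q2 < q3 by timestamp */
	struct sketch_queue *next = select_next(&tree);

	printf("selected queue %d\n", next ? next->id : -1);
	return 0;
}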

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-24 18:35                                 ` Vivek Goyal
                                                   ` (2 preceding siblings ...)
  (?)
@ 2009-03-24 19:04                                 ` Fabio Checconi
  -1 siblings, 0 replies; 190+ messages in thread
From: Fabio Checconi @ 2009-03-24 19:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Nauman Rafique, Gui Jianfeng, Dhaval Giani, dpshah, lizf, mikew,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	arozansk, jmoyer, oz-kernel, balbir, linux-kernel, containers,
	akpm, menage, peterz

> From: Vivek Goyal <vgoyal@redhat.com>
> Date: Tue, Mar 24, 2009 02:35:32PM -0400
>
> On Tue, Mar 24, 2009 at 07:41:01PM +0100, Fabio Checconi wrote:
> > > From: Vivek Goyal <vgoyal@redhat.com>
> > > Date: Tue, Mar 24, 2009 02:29:06PM -0400
> > >
> > ...
> > > > Does keeping the sync queue in the ready tree solve the problem too? Is
> > > > that because it avoids a virtual time jump?
> > > > 
> > > 
> > > I have not tried the second approach yet. But that also should solve the
> > > vtime jump issue.
> > > 
> > 
> > Do you mean that you intend to keep a queue with no backlog in the
> > active tree?
> 
> Yes. Is it possible to keep a non-backlogged queue in the tree for later
> expiry? That way we don't actively wait/idle for the next request in the
> hope that the queue will become backlogged soon. Otherwise, it will be
> deleted from the active tree. This is just a thought; I am not even sure
> how it would interfere with the BFQ code.
> 
> All this to solve the vtime jump issue for sync queues.
> 

Of course it is possible, but if you stick with WF2Q+ the virtual time
will jump anyway, and the only gain would be that each scheduling decision
has O(N log N) complexity instead of O(log N), just to skip the empty
queues.

Otherwise, if you do your own timestamping (where any new request
can get a timestamp smaller than the virtual time), then nothing from
the theory BFQ is based on can give any hint about the guarantees that
the resulting algorithm can provide.

^ permalink raw reply	[flat|nested] 190+ messages in thread

* [PATCH] IO Controller: No need to stop idling in as
       [not found]     ` <1236823015-4183-11-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-03-27  6:58       ` Gui Jianfeng
  0 siblings, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-03-27  6:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Vivek Goyal wrote:

>  		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
>  		.elevator_free_sched_queue_fn = as_free_as_queue,
> +#ifdef CONFIG_IOSCHED_AS_HIER
> +		.elevator_expire_ioq_fn =       as_expire_ioq,
> +		.elevator_active_ioq_set_fn =   as_active_ioq_set,
>  	},
> -
> +	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,

  Hi Vivek,

  I found that the IO controller doesn't work with AS.
  I dug into this issue and noticed that you stop idling in AS. IMHO, this causes
  the active ioq to always be expired when trying to choose a new ioq to serve
  (elv_fq_select_ioq()), because with idling disabled the active ioq can't be
  kept any longer.
  So I just got rid of ELV_IOSCHED_DONT_IDLE, and it works fine this time.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 block/as-iosched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index 27c14a7..499c521 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1689,7 +1689,7 @@ static struct elevator_type iosched_as = {
 		.elevator_expire_ioq_fn =       as_expire_ioq,
 		.elevator_active_ioq_set_fn =   as_active_ioq_set,
 	},
-	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
 #else
 	},
 #endif
-- 
1.5.4.rc3
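
To see the effect described above in isolation, a minimal standalone sketch
of the decision involved (simplified and hypothetical; this is not the
actual elv_fq_select_ioq() logic): with a single ioq per group the queue is
often empty between request batches, and once idling is disallowed the last
reason to hold on to the active queue disappears, so it is expired every
time.

#include <stdbool.h>
#include <stdio.h>

struct sel_ioq {
	int nr_queued;          /* requests currently queued */
	bool idle_allowed;      /* cleared by ELV_IOSCHED_DONT_IDLE */
	bool slice_used;        /* has the queue used up its time slice? */
};

/* Should the active queue keep the disk for now? */
static bool keep_active_queue(const struct sel_ioq *ioq)
{
	if (ioq->slice_used)
		return false;       /* slice over: expire normally */
	if (ioq->nr_queued)
		return true;        /* it has work: keep dispatching */
	/* Empty but still within its slice: only idling can hold it. */
	return ioq->idle_allowed;
}

int main(void)
{
	struct sel_ioq idle_ok = { .nr_queued = 0, .idle_allowed = true };
	struct sel_ioq no_idle = { .nr_queued = 0, .idle_allowed = false };

	printf("idling allowed:  keep=%d\n", keep_active_queue(&idle_ok));
	printf("idling disabled: keep=%d\n", keep_active_queue(&no_idle));
	return 0;
}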

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH] IO Controller: No need to stop idling in as
  2009-03-12  1:56     ` Vivek Goyal
  (?)
  (?)
@ 2009-03-27  6:58     ` Gui Jianfeng
       [not found]       ` <49CC791A.10008-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-03-27 14:05       ` Vivek Goyal
  -1 siblings, 2 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-03-27  6:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	dhaval, balbir, linux-kernel, containers, akpm, menage, peterz

Vivek Goyal wrote:

>  		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
>  		.elevator_free_sched_queue_fn = as_free_as_queue,
> +#ifdef CONFIG_IOSCHED_AS_HIER
> +		.elevator_expire_ioq_fn =       as_expire_ioq,
> +		.elevator_active_ioq_set_fn =   as_active_ioq_set,
>  	},
> -
> +	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,

  Hi Vivek,

  I found that the IO controller doesn't work with AS.
  I dug into this issue and noticed that you stop idling in AS. IMHO, this causes
  the active ioq to always be expired when trying to choose a new ioq to serve
  (elv_fq_select_ioq()), because with idling disabled the active ioq can't be
  kept any longer.
  So I just got rid of ELV_IOSCHED_DONT_IDLE, and it works fine this time.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/as-iosched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index 27c14a7..499c521 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1689,7 +1689,7 @@ static struct elevator_type iosched_as = {
 		.elevator_expire_ioq_fn =       as_expire_ioq,
 		.elevator_active_ioq_set_fn =   as_active_ioq_set,
 	},
-	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
 #else
 	},
 #endif
-- 
1.5.4.rc3



^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH] IO Controller: Don't store the pid in single queue circumstances
       [not found]   ` <1236823015-4183-3-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-03-19  6:27     ` Gui Jianfeng
@ 2009-03-27  8:30     ` Gui Jianfeng
  2009-04-02  4:06     ` [PATCH 02/10] Common flat fair queuing code in elevaotor layer Divyesh Shah
  2 siblings, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-03-27  8:30 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Vivek Goyal wrote:
...
> +int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
> +			void *sched_queue, int ioprio_class, int ioprio,
> +			int is_sync)
> +{
> +	struct elv_fq_data *efqd = &eq->efqd;
> +	struct io_group *iog = io_lookup_io_group_current(efqd->queue);
> +
> +	RB_CLEAR_NODE(&ioq->entity.rb_node);
> +	atomic_set(&ioq->ref, 0);
> +	ioq->efqd = efqd;
> +	ioq->entity.budget = efqd->elv_slice[is_sync];
> +	elv_ioq_set_ioprio_class(ioq, ioprio_class);
> +	elv_ioq_set_ioprio(ioq, ioprio);
> +	ioq->pid = current->pid;

  Hi Vivek,

  Storing a pid in single-queue circumstances doesn't make sense.
  So just store the pid only when CFQ is used.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
 ---
 block/elevator-fq.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index df53418..c72f7e6 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1988,7 +1988,10 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
 	ioq->entity.budget = efqd->elv_slice[is_sync];
 	elv_ioq_set_ioprio_class(ioq, ioprio_class);
 	elv_ioq_set_ioprio(ioq, ioprio);
-	ioq->pid = current->pid;
+	if (elv_iosched_single_ioq(eq))
+		ioq->pid = 0;
+	else
+		ioq->pid = current->pid;
 	ioq->sched_queue = sched_queue;
 
 	/* If generic idle logic is enabled, mark it */
-- 
1.5.4.rc3

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH] IO Controller: Don't store the pid in single queue circumstances
  2009-03-12  1:56 ` [PATCH 02/10] Common flat fair queuing code in elevaotor layer Vivek Goyal
  2009-03-19  6:27   ` Gui Jianfeng
@ 2009-03-27  8:30   ` Gui Jianfeng
       [not found]     ` <49CC8EBA.9040804-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-03-27 13:52     ` Vivek Goyal
  2009-04-02  4:06   ` [PATCH 02/10] Common flat fair queuing code in elevaotor layer Divyesh Shah
       [not found]   ` <1236823015-4183-3-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  3 siblings, 2 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-03-27  8:30 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

Vivek Goyal wrote:
...
> +int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
> +			void *sched_queue, int ioprio_class, int ioprio,
> +			int is_sync)
> +{
> +	struct elv_fq_data *efqd = &eq->efqd;
> +	struct io_group *iog = io_lookup_io_group_current(efqd->queue);
> +
> +	RB_CLEAR_NODE(&ioq->entity.rb_node);
> +	atomic_set(&ioq->ref, 0);
> +	ioq->efqd = efqd;
> +	ioq->entity.budget = efqd->elv_slice[is_sync];
> +	elv_ioq_set_ioprio_class(ioq, ioprio_class);
> +	elv_ioq_set_ioprio(ioq, ioprio);
> +	ioq->pid = current->pid;

  Hi Vivek,

  Storing a pid in single-queue circumstances doesn't make sense.
  So just store the pid only when CFQ is used.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
 ---
 block/elevator-fq.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index df53418..c72f7e6 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1988,7 +1988,10 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
 	ioq->entity.budget = efqd->elv_slice[is_sync];
 	elv_ioq_set_ioprio_class(ioq, ioprio_class);
 	elv_ioq_set_ioprio(ioq, ioprio);
-	ioq->pid = current->pid;
+	if (elv_iosched_single_ioq(eq))
+		ioq->pid = 0;
+	else
+		ioq->pid = current->pid;
 	ioq->sched_queue = sched_queue;
 
 	/* If generic idle logic is enabled, mark it */
-- 
1.5.4.rc3



^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH] IO Controller: Don't store the pid in single queue circumstances
       [not found]     ` <49CC8EBA.9040804-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-03-27 13:52       ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-27 13:52 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Fri, Mar 27, 2009 at 04:30:50PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > +int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
> > +			void *sched_queue, int ioprio_class, int ioprio,
> > +			int is_sync)
> > +{
> > +	struct elv_fq_data *efqd = &eq->efqd;
> > +	struct io_group *iog = io_lookup_io_group_current(efqd->queue);
> > +
> > +	RB_CLEAR_NODE(&ioq->entity.rb_node);
> > +	atomic_set(&ioq->ref, 0);
> > +	ioq->efqd = efqd;
> > +	ioq->entity.budget = efqd->elv_slice[is_sync];
> > +	elv_ioq_set_ioprio_class(ioq, ioprio_class);
> > +	elv_ioq_set_ioprio(ioq, ioprio);
> > +	ioq->pid = current->pid;
> 
>   Hi Vivek,
> 
>   Storing a pid in single queue circumstances doesn't make sence.
>   So just store the pid when cfq is used.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>  ---
>  block/elevator-fq.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index df53418..c72f7e6 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1988,7 +1988,10 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
>  	ioq->entity.budget = efqd->elv_slice[is_sync];
>  	elv_ioq_set_ioprio_class(ioq, ioprio_class);
>  	elv_ioq_set_ioprio(ioq, ioprio);
> -	ioq->pid = current->pid;
> +	if (elv_iosched_single_ioq(eq))
> +		ioq->pid = 0;
> +	else
> +		ioq->pid = current->pid;
>  	ioq->sched_queue = sched_queue;

Thanks Gui. Yes, if there is a single ioq, this pid will reflect the
pid of the process that caused the creation of the io queue, and later
requests from all the other processes will go into the same queue.

In fact CFQ has the same issue for async queues, where the async queue
stores the pid of the process that created it, and all other processes of
the same prio level then use it.

So if you think displaying "0" is better than displaying the pid of
the process that created the queue, then I will include this patch. Right
now I don't have a very strong opinion about it.
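
If the pid really is meaningless for a shared queue, another option would
be to make the reporting explicit instead of storing 0 -- a purely
hypothetical sketch, not part of either patch:

#include <stdio.h>

/* single_ioq: one shared queue per group (e.g. noop/deadline/AS under the
 * common layer); otherwise a per-process queue as in cfq. */
static void show_ioq_owner(int single_ioq, int creator_pid)
{
	if (single_ioq)
		printf("ioq: shared (first used by pid %d)\n", creator_pid);
	else
		printf("ioq: pid %d\n", creator_pid);
}

int main(void)
{
	show_ioq_owner(1, 4242);
	show_ioq_owner(0, 4242);
	return 0;
}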

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] IO Controller: Don't store the pid in single queue circumstances
  2009-03-27  8:30   ` [PATCH] IO Controller: Don't store the pid in single queue circumstances Gui Jianfeng
       [not found]     ` <49CC8EBA.9040804-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-03-27 13:52     ` Vivek Goyal
  1 sibling, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-27 13:52 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

On Fri, Mar 27, 2009 at 04:30:50PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > +int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
> > +			void *sched_queue, int ioprio_class, int ioprio,
> > +			int is_sync)
> > +{
> > +	struct elv_fq_data *efqd = &eq->efqd;
> > +	struct io_group *iog = io_lookup_io_group_current(efqd->queue);
> > +
> > +	RB_CLEAR_NODE(&ioq->entity.rb_node);
> > +	atomic_set(&ioq->ref, 0);
> > +	ioq->efqd = efqd;
> > +	ioq->entity.budget = efqd->elv_slice[is_sync];
> > +	elv_ioq_set_ioprio_class(ioq, ioprio_class);
> > +	elv_ioq_set_ioprio(ioq, ioprio);
> > +	ioq->pid = current->pid;
> 
>   Hi Vivek,
> 
>   Storing a pid in single queue circumstances doesn't make sense.
>   So just store the pid when cfq is used.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>  ---
>  block/elevator-fq.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index df53418..c72f7e6 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1988,7 +1988,10 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
>  	ioq->entity.budget = efqd->elv_slice[is_sync];
>  	elv_ioq_set_ioprio_class(ioq, ioprio_class);
>  	elv_ioq_set_ioprio(ioq, ioprio);
> -	ioq->pid = current->pid;
> +	if (elv_iosched_single_ioq(eq))
> +		ioq->pid = 0;
> +	else
> +		ioq->pid = current->pid;
>  	ioq->sched_queue = sched_queue;

Thanks Gui. Yes, if there is a single ioq, this pid will reflect the
pid of the process that caused the creation of the io queue, and later
requests from all the other processes will go into the same queue.

In fact cfq also has the same issue for async queues, where the async
queue will store the pid of the process that created it and later all
other processes of the same prio level will use it.

So if you think displaying "0" is better than displaying the pid of
the process that created the queue, then I will include this patch. Right
now I don't have a very strong opinion about it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] IO Controller: No need to stop idling in as
       [not found]       ` <49CC791A.10008-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-03-27 14:05         ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-27 14:05 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Fri, Mar 27, 2009 at 02:58:34PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> 
> >  		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
> >  		.elevator_free_sched_queue_fn = as_free_as_queue,
> > +#ifdef CONFIG_IOSCHED_AS_HIER
> > +		.elevator_expire_ioq_fn =       as_expire_ioq,
> > +		.elevator_active_ioq_set_fn =   as_active_ioq_set,
> >  	},
> > -
> > +	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
> 
>   Hi Vivek,
> 
>   I found the IO Controller doesn't work in as.
>   I dug into this issue, and noticed that you stop idling in as. IMHO, this causes the
>   active ioq to always be expired when trying to choose a new ioq to serve (elv_fq_select_ioq).
>   Because idling is disabled, the active ioq can't be kept anymore.
>   So I just got rid of ELV_IOSCHED_DONT_IDLE, and it works fine this time.
> 

Hi Gui,

Thanks for the testing. I have not enabled idling for AS in the common layer
because AS has its own idling/anticipation logic. I think we should not have
anticipation going on in two places, the common layer as well as the individual
io scheduler. That's the reason I have implemented a function,
elv_iosched_expire_ioq(), which calls into the io scheduler to find out
whether an ioq can be expired now.

So in elv_fq_select_ioq(), we call elv_iosched_expire_ioq(), and if the
io scheduler denies expiration, then we don't expire the queue. In this case
AS can deny the expiration if it is anticipating the next request.
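
To make the handshake concrete, here is a minimal sketch of the selection
path described above (the function names follow the patches' naming style,
but the signatures and the other helpers are illustrative, not copied from
the patchset):

/*
 * Sketch only: when the active queue's slice is used up, the common
 * layer asks the io scheduler before expiring it, so AS can refuse
 * while it is still anticipating the next request.
 */
static struct io_queue *select_ioq_sketch(struct elevator_queue *eq)
{
	struct io_queue *ioq = elv_active_ioq(eq);	/* illustrative helper */

	if (ioq && elv_ioq_slice_used(ioq)) {		/* illustrative helper */
		if (!elv_iosched_expire_ioq(eq, ioq))
			return ioq;	/* scheduler vetoed expiry, keep it */
		elv_ioq_slice_expired(eq);		/* illustrative helper */
		ioq = NULL;
	}

	if (!ioq)
		ioq = elv_get_next_ioq(eq);	/* next backlogged queue, illustrative */

	return ioq;
}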

Actually AS is very different, and it's a little tricky to make it work with
the common layer, especially in terms of anticipation. This is on my TODO list,
but before I fix AS, I wanted to get other things right with the common layer,
cfq, noop and deadline.

Thanks
Vivek


> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
>  block/as-iosched.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/block/as-iosched.c b/block/as-iosched.c
> index 27c14a7..499c521 100644
> --- a/block/as-iosched.c
> +++ b/block/as-iosched.c
> @@ -1689,7 +1689,7 @@ static struct elevator_type iosched_as = {
>  		.elevator_expire_ioq_fn =       as_expire_ioq,
>  		.elevator_active_ioq_set_fn =   as_active_ioq_set,
>  	},
> -	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
> +	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
>  #else
>  	},
>  #endif
> -- 
> 1.5.4.rc3
> 

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] IO Controller: No need to stop idling in as
  2009-03-27  6:58     ` Gui Jianfeng
       [not found]       ` <49CC791A.10008-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-03-27 14:05       ` Vivek Goyal
  2009-03-30  1:09         ` Gui Jianfeng
       [not found]         ` <20090327140530.GE30476-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-03-27 14:05 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	dhaval, balbir, linux-kernel, containers, akpm, menage, peterz

On Fri, Mar 27, 2009 at 02:58:34PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> 
> >  		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
> >  		.elevator_free_sched_queue_fn = as_free_as_queue,
> > +#ifdef CONFIG_IOSCHED_AS_HIER
> > +		.elevator_expire_ioq_fn =       as_expire_ioq,
> > +		.elevator_active_ioq_set_fn =   as_active_ioq_set,
> >  	},
> > -
> > +	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
> 
>   Hi Vivek,
> 
>   I found the IO Controller doesn't work in as.
>   I dug into this issue, and noticed that you stop idling in as. IMHO, this causes the
>   active ioq to always be expired when trying to choose a new ioq to serve (elv_fq_select_ioq).
>   Because idling is disabled, the active ioq can't be kept anymore.
>   So I just got rid of ELV_IOSCHED_DONT_IDLE, and it works fine this time.
> 

Hi Gui,

Thanks for the testing. I have not enabled idling for AS in the common layer
because AS has its own idling/anticipation logic. I think we should not have
anticipation going on in two places, the common layer as well as the individual
io scheduler. That's the reason I have implemented a function,
elv_iosched_expire_ioq(), which calls into the io scheduler to find out
whether an ioq can be expired now.

So in elv_fq_select_ioq(), we call elv_iosched_expire_ioq(), and if the
io scheduler denies expiration, then we don't expire the queue. In this case
AS can deny the expiration if it is anticipating the next request.

Actually AS is very different, and it's a little tricky to make it work with
the common layer, especially in terms of anticipation. This is on my TODO list,
but before I fix AS, I wanted to get other things right with the common layer,
cfq, noop and deadline.

Thanks
Vivek


> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/as-iosched.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/block/as-iosched.c b/block/as-iosched.c
> index 27c14a7..499c521 100644
> --- a/block/as-iosched.c
> +++ b/block/as-iosched.c
> @@ -1689,7 +1689,7 @@ static struct elevator_type iosched_as = {
>  		.elevator_expire_ioq_fn =       as_expire_ioq,
>  		.elevator_active_ioq_set_fn =   as_active_ioq_set,
>  	},
> -	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
> +	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
>  #else
>  	},
>  #endif
> -- 
> 1.5.4.rc3
> 

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] IO Controller: No need to stop idling in as
       [not found]         ` <20090327140530.GE30476-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-03-30  1:09           ` Gui Jianfeng
  0 siblings, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-03-30  1:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Vivek Goyal wrote:
> On Fri, Mar 27, 2009 at 02:58:34PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>
>>>  		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
>>>  		.elevator_free_sched_queue_fn = as_free_as_queue,
>>> +#ifdef CONFIG_IOSCHED_AS_HIER
>>> +		.elevator_expire_ioq_fn =       as_expire_ioq,
>>> +		.elevator_active_ioq_set_fn =   as_active_ioq_set,
>>>  	},
>>> -
>>> +	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
>>   Hi Vivek,
>>
>>   I found the IO Controller doesn't work in as.
>>   I dug into this issue, and noticed that you stop idling in as. IMHO, this causes the
>>   active ioq to always be expired when trying to choose a new ioq to serve (elv_fq_select_ioq).
>>   Because idling is disabled, the active ioq can't be kept anymore.
>>   So I just got rid of ELV_IOSCHED_DONT_IDLE, and it works fine this time.
>>
> 
> Hi Gui,
> 
> Thanks for the testing. I have not enabled idling for AS in the common layer
> because AS has its own idling/anticipation logic. I think we should not have
> anticipation going on in two places, the common layer as well as the individual
> io scheduler. That's the reason I have implemented a function,
> elv_iosched_expire_ioq(), which calls into the io scheduler to find out
> whether an ioq can be expired now.

  Hi Vivek,

  If a user chooses fairness rather than throughput, just like what your fairness
  patch is trying to do, do we need to enable the common idling logic for as in
  this scenario?

> 
> So in elv_fq_select_ioq(), we call elv_iosched_expire_ioq(), and if the
> io scheduler denies expiration, then we don't expire the queue. In this case
> AS can deny the expiration if it is anticipating the next request.
> 
> Actually AS is very different, and it's a little tricky to make it work with
> the common layer, especially in terms of anticipation. This is on my TODO list,
> but before I fix AS, I wanted to get other things right with the common layer,
> cfq, noop and deadline.
> 
> Thanks
> Vivek
> 
> 
>> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> ---
>>  block/as-iosched.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/block/as-iosched.c b/block/as-iosched.c
>> index 27c14a7..499c521 100644
>> --- a/block/as-iosched.c
>> +++ b/block/as-iosched.c
>> @@ -1689,7 +1689,7 @@ static struct elevator_type iosched_as = {
>>  		.elevator_expire_ioq_fn =       as_expire_ioq,
>>  		.elevator_active_ioq_set_fn =   as_active_ioq_set,
>>  	},
>> -	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
>> +	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
>>  #else
>>  	},
>>  #endif
>> -- 
>> 1.5.4.rc3
>>
> 
> 
> 

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] IO Controller: No need to stop idling in as
  2009-03-27 14:05       ` Vivek Goyal
@ 2009-03-30  1:09         ` Gui Jianfeng
       [not found]         ` <20090327140530.GE30476-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-03-30  1:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	dhaval, balbir, linux-kernel, containers, akpm, menage, peterz

Vivek Goyal wrote:
> On Fri, Mar 27, 2009 at 02:58:34PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>
>>>  		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
>>>  		.elevator_free_sched_queue_fn = as_free_as_queue,
>>> +#ifdef CONFIG_IOSCHED_AS_HIER
>>> +		.elevator_expire_ioq_fn =       as_expire_ioq,
>>> +		.elevator_active_ioq_set_fn =   as_active_ioq_set,
>>>  	},
>>> -
>>> +	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
>>   Hi Vivek,
>>
>>   I found the IO Controller doesn't work in as.
>>   I dug into this issue, and noticed that you stop idling in as. IMHO, this causes the
>>   active ioq to always be expired when trying to choose a new ioq to serve (elv_fq_select_ioq).
>>   Because idling is disabled, the active ioq can't be kept anymore.
>>   So I just got rid of ELV_IOSCHED_DONT_IDLE, and it works fine this time.
>>
> 
> Hi Gui,
> 
> Thanks for the testing. I have not enabled idling for AS in the common layer
> because AS has its own idling/anticipation logic. I think we should not have
> anticipation going on in two places, the common layer as well as the individual
> io scheduler. That's the reason I have implemented a function,
> elv_iosched_expire_ioq(), which calls into the io scheduler to find out
> whether an ioq can be expired now.

  Hi Vivek,

  If a user chooses fairness rather than throughput, just like what your fairness
  patch is trying to do, do we need to enable the common idling logic for as in
  this scenario?

> 
> So in elv_fq_select_ioq(), we call elv_iosched_expire_ioq(), and if the
> io scheduler denies expiration, then we don't expire the queue. In this case
> AS can deny the expiration if it is anticipating the next request.
> 
> Actually AS is very different, and it's a little tricky to make it work with
> the common layer, especially in terms of anticipation. This is on my TODO list,
> but before I fix AS, I wanted to get other things right with the common layer,
> cfq, noop and deadline.
> 
> Thanks
> Vivek
> 
> 
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>>  block/as-iosched.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/block/as-iosched.c b/block/as-iosched.c
>> index 27c14a7..499c521 100644
>> --- a/block/as-iosched.c
>> +++ b/block/as-iosched.c
>> @@ -1689,7 +1689,7 @@ static struct elevator_type iosched_as = {
>>  		.elevator_expire_ioq_fn =       as_expire_ioq,
>>  		.elevator_active_ioq_set_fn =   as_active_ioq_set,
>>  	},
>> -	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
>> +	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
>>  #else
>>  	},
>>  #endif
>> -- 
>> 1.5.4.rc3
>>
> 
> 
> 

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 02/10] Common flat fair queuing code in elevaotor layer
       [not found]   ` <1236823015-4183-3-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-03-19  6:27     ` Gui Jianfeng
  2009-03-27  8:30     ` [PATCH] IO Controller: Don't store the pid in single queue circumstances Gui Jianfeng
@ 2009-04-02  4:06     ` Divyesh Shah
  2 siblings, 0 replies; 190+ messages in thread
From: Divyesh Shah @ 2009-04-02  4:06 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	menage-hpIqsD4AKlfQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Wed, Mar 11, 2009 at 6:56 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> +/*
> + * elv_exit_fq_data is called before we call elevator_exit_fn. Before
> + * we ask elevator to cleanup its queues, we do the cleanup here so
> + * that all the group and idle tree references to ioq are dropped. Later
> + * during elevator cleanup, ioc reference will be dropped which will lead
> + * to removal of ioscheduler queue as well as associated ioq object.
> + */
> +void elv_exit_fq_data(struct elevator_queue *e)
> +{
> +       struct elv_fq_data *efqd = &e->efqd;
> +       struct request_queue *q = efqd->queue;
> +
> +       if (!elv_iosched_fair_queuing_enabled(e))
> +               return;
> +
> +       elv_shutdown_timer_wq(e);
> +
> +       spin_lock_irq(q->queue_lock);
> +       /* This should drop all the idle tree references of ioq */
> +       elv_free_idle_ioq_list(e);
> +       spin_unlock_irq(q->queue_lock);
> +
> +       elv_shutdown_timer_wq(e);
> +
> +       BUG_ON(timer_pending(&efqd->idle_slice_timer));
> +       io_free_root_group(e);
> +}
>

Hi Vivek,
        When cleaning up the elv_fq_data and ioqs for the iogs
associated with a device on elv_exit(), I don't see any iogs except
the root group being freed. In io_disconnect_groups() you remove the
ioqs from each of the iog and move them to the root iog and then
delete the root iog. Am I missing something here or are there leftover
iogs at elv_exit?

-Divyesh

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 02/10] Common flat fair queuing code in elevaotor layer
  2009-03-12  1:56 ` [PATCH 02/10] Common flat fair queuing code in elevaotor layer Vivek Goyal
  2009-03-19  6:27   ` Gui Jianfeng
  2009-03-27  8:30   ` [PATCH] IO Controller: Don't store the pid in single queue circumstances Gui Jianfeng
@ 2009-04-02  4:06   ` Divyesh Shah
       [not found]     ` <af41c7c40904012106h41d3cb50i2eeab2a02277a4c9-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-04-02 13:52     ` Vivek Goyal
       [not found]   ` <1236823015-4183-3-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  3 siblings, 2 replies; 190+ messages in thread
From: Divyesh Shah @ 2009-04-02  4:06 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, lizf, mikew, fchecconi, paolo.valente, jens.axboe, ryov,
	fernando, s-uchida, taka, guijianfeng, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

On Wed, Mar 11, 2009 at 6:56 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> +/*
> + * elv_exit_fq_data is called before we call elevator_exit_fn. Before
> + * we ask elevator to cleanup its queues, we do the cleanup here so
> + * that all the group and idle tree references to ioq are dropped. Later
> + * during elevator cleanup, ioc reference will be dropped which will lead
> + * to removal of ioscheduler queue as well as associated ioq object.
> + */
> +void elv_exit_fq_data(struct elevator_queue *e)
> +{
> +       struct elv_fq_data *efqd = &e->efqd;
> +       struct request_queue *q = efqd->queue;
> +
> +       if (!elv_iosched_fair_queuing_enabled(e))
> +               return;
> +
> +       elv_shutdown_timer_wq(e);
> +
> +       spin_lock_irq(q->queue_lock);
> +       /* This should drop all the idle tree references of ioq */
> +       elv_free_idle_ioq_list(e);
> +       spin_unlock_irq(q->queue_lock);
> +
> +       elv_shutdown_timer_wq(e);
> +
> +       BUG_ON(timer_pending(&efqd->idle_slice_timer));
> +       io_free_root_group(e);
> +}
>

Hi Vivek,
        When cleaning up the elv_fq_data and ioqs for the iogs
associated with a device on elv_exit(), I don't see any iogs except
the root group being freed. In io_disconnect_groups() you remove the
ioqs from each of the iogs and move them to the root iog and then
delete the root iog. Am I missing something here or are there leftover
iogs at elv_exit?

-Divyesh

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found] ` <1236823015-4183-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (10 preceding siblings ...)
  2009-03-12  3:27   ` [RFC] IO Controller Takuya Yoshikawa
@ 2009-04-02  6:39   ` Gui Jianfeng
  2009-04-10  9:33   ` Gui Jianfeng
  2009-05-01  1:25   ` Divyesh Shah
  13 siblings, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-04-02  6:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Vivek Goyal wrote:
> Hi All,
> 
> Here is another posting for IO controller patches. Last time I had posted
> RFC patches for an IO controller which did bio control per cgroup.
> 
> http://lkml.org/lkml/2008/11/6/227
> 
> One of the takeaway from the discussion in this thread was that let us
> implement a common layer which contains the proportional weight scheduling
> code which can be shared by all the IO schedulers.
> 
  
  Hi Vivek,

  I did some tests on my *old* i386 box (with two concurrent dd runs), and noticed
  that the IO Controller doesn't work well in that situation. But it works perfectly
  on my *new* x86 box. I dug into this problem, and I guess the major reason is that
  my *old* i386 box is too slow; it can't ensure that the two running ioqs are always backlogged.
  If that is the case, I happen to have a thought: when an ioq uses up its time slice,
  we don't expire it immediately. Maybe we can give it a bit of bonus idling time to
  wait for new requests, if this ioq's finish time and its ancestors' finish times are all
  much smaller than those of the other entities on each corresponding service tree.
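
  A rough sketch of the check I have in mind (all helper names, fields and the
  threshold below are hypothetical, nothing here is taken from the posted patches):

/* rough sketch of the bonus-idling idea above; helpers, fields and the
 * threshold are hypothetical */
static int ioq_deserves_bonus_idle(struct io_queue *ioq)
{
	struct io_entity *entity;

	for (entity = &ioq->entity; entity != NULL; entity = entity->parent) {
		/* only grant bonus idling if this entity and all of its
		 * ancestors are well ahead of the other entities on their
		 * respective service trees */
		if (entity->finish + BONUS_FINISH_MARGIN >=
		    min_finish_of_others(entity))
			return 0;
	}

	return 1;
}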

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-03-12  1:56 ` Vivek Goyal
                   ` (5 preceding siblings ...)
  (?)
@ 2009-04-02  6:39 ` Gui Jianfeng
       [not found]   ` <49D45DAC.2060508-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  -1 siblings, 1 reply; 190+ messages in thread
From: Gui Jianfeng @ 2009-04-02  6:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

Vivek Goyal wrote:
> Hi All,
> 
> Here is another posting for IO controller patches. Last time I had posted
> RFC patches for an IO controller which did bio control per cgroup.
> 
> http://lkml.org/lkml/2008/11/6/227
> 
> One of the takeaway from the discussion in this thread was that let us
> implement a common layer which contains the proportional weight scheduling
> code which can be shared by all the IO schedulers.
> 
  
  Hi Vivek,

  I did some tests on my *old* i386 box (with two concurrent dd runs), and noticed
  that the IO Controller doesn't work well in that situation. But it works perfectly
  on my *new* x86 box. I dug into this problem, and I guess the major reason is that
  my *old* i386 box is too slow; it can't ensure that the two running ioqs are always backlogged.
  If that is the case, I happen to have a thought: when an ioq uses up its time slice,
  we don't expire it immediately. Maybe we can give it a bit of bonus idling time to
  wait for new requests, if this ioq's finish time and its ancestors' finish times are all
  much smaller than those of the other entities on each corresponding service tree.

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 02/10] Common flat fair queuing code in elevaotor layer
       [not found]     ` <af41c7c40904012106h41d3cb50i2eeab2a02277a4c9-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-04-02 13:52       ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-02 13:52 UTC (permalink / raw)
  To: Divyesh Shah
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	menage-hpIqsD4AKlfQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Wed, Apr 01, 2009 at 09:06:40PM -0700, Divyesh Shah wrote:
> On Wed, Mar 11, 2009 at 6:56 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > +/*
> > + * elv_exit_fq_data is called before we call elevator_exit_fn. Before
> > + * we ask elevator to cleanup its queues, we do the cleanup here so
> > + * that all the group and idle tree references to ioq are dropped. Later
> > + * during elevator cleanup, ioc reference will be dropped which will lead
> > + * to removal of ioscheduler queue as well as associated ioq object.
> > + */
> > +void elv_exit_fq_data(struct elevator_queue *e)
> > +{
> > +       struct elv_fq_data *efqd = &e->efqd;
> > +       struct request_queue *q = efqd->queue;
> > +
> > +       if (!elv_iosched_fair_queuing_enabled(e))
> > +               return;
> > +
> > +       elv_shutdown_timer_wq(e);
> > +
> > +       spin_lock_irq(q->queue_lock);
> > +       /* This should drop all the idle tree references of ioq */
> > +       elv_free_idle_ioq_list(e);
> > +       spin_unlock_irq(q->queue_lock);
> > +
> > +       elv_shutdown_timer_wq(e);
> > +
> > +       BUG_ON(timer_pending(&efqd->idle_slice_timer));
> > +       io_free_root_group(e);
> > +}
> >
> 
> Hi Vivek,
>         When cleaning up the elv_fq_data and ioqs for the iogs
> associated with a device on elv_exit(), I don't see any iogs except
> the root group being freed. In io_disconnect_groups() you remove the
> ioqs from each of the iogs and move them to the root iog and then
> delete the root iog. Am I missing something here or are there leftover
> iogs at elv_exit?

Hi Divyesh,

io_groups are linked in two lists. One list is maintained by the io_cgroup
to keep track of the io_groups associated with that cgroup, and the other
list is maintained in elv_fq_data to keep track of the io_groups actually
doing IO to this device.

Upon elevator exit, we remove the io_groups (io_disconnect_groups()) from the
list maintained by elv_fq_data, but we don't free them up. Freeing is
finally done when the cgroup is deleted (iocg_destroy()).
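
For reference, the split boils down to an io_group sitting on two lists at
once; a sketch only, with field names that are illustrative rather than
copied from the patches:

/*
 * Sketch: an io_group is linked on two lists, which is why elevator exit
 * only unlinks it from the per-device list while the object itself lives
 * until the cgroup goes away.
 */
struct io_group_sketch {
	struct hlist_node elv_data_node;	/* elv_fq_data list: groups doing
						 * IO to this device; unlinked in
						 * io_disconnect_groups() */
	struct hlist_node group_node;		/* io_cgroup list: groups owned by
						 * this cgroup; freed only from
						 * iocg_destroy() */
};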

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 02/10] Common flat fair queuing code in elevaotor layer
  2009-04-02  4:06   ` [PATCH 02/10] Common flat fair queuing code in elevaotor layer Divyesh Shah
       [not found]     ` <af41c7c40904012106h41d3cb50i2eeab2a02277a4c9-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-04-02 13:52     ` Vivek Goyal
  1 sibling, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-02 13:52 UTC (permalink / raw)
  To: Divyesh Shah
  Cc: nauman, lizf, mikew, fchecconi, paolo.valente, jens.axboe, ryov,
	fernando, s-uchida, taka, guijianfeng, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

On Wed, Apr 01, 2009 at 09:06:40PM -0700, Divyesh Shah wrote:
> On Wed, Mar 11, 2009 at 6:56 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > +/*
> > + * elv_exit_fq_data is called before we call elevator_exit_fn. Before
> > + * we ask elevator to cleanup its queues, we do the cleanup here so
> > + * that all the group and idle tree references to ioq are dropped. Later
> > + * during elevator cleanup, ioc reference will be dropped which will lead
> > + * to removal of ioscheduler queue as well as associated ioq object.
> > + */
> > +void elv_exit_fq_data(struct elevator_queue *e)
> > +{
> > +       struct elv_fq_data *efqd = &e->efqd;
> > +       struct request_queue *q = efqd->queue;
> > +
> > +       if (!elv_iosched_fair_queuing_enabled(e))
> > +               return;
> > +
> > +       elv_shutdown_timer_wq(e);
> > +
> > +       spin_lock_irq(q->queue_lock);
> > +       /* This should drop all the idle tree references of ioq */
> > +       elv_free_idle_ioq_list(e);
> > +       spin_unlock_irq(q->queue_lock);
> > +
> > +       elv_shutdown_timer_wq(e);
> > +
> > +       BUG_ON(timer_pending(&efqd->idle_slice_timer));
> > +       io_free_root_group(e);
> > +}
> >
> 
> Hi Vivek,
>         When cleaning up the elv_fq_data and ioqs for the iogs
> associated with a device on elv_exit(), I don't see any iogs except
> the root group being freed. In io_disconnect_groups() you remove the
> ioqs from each of the iogs and move them to the root iog and then
> delete the root iog. Am I missing something here or are there leftover
> iogs at elv_exit?

Hi Divyesh,

io_groups are linked in two lists. One list is maintained by the io_cgroup
to keep track of the io_groups associated with that cgroup, and the other
list is maintained in elv_fq_data to keep track of the io_groups actually
doing IO to this device.

Upon elevator exit, we remove the io_groups (io_disconnect_groups()) from the
list maintained by elv_fq_data, but we don't free them up. Freeing is
finally done when the cgroup is deleted (iocg_destroy()).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-04-02  6:39 ` Gui Jianfeng
@ 2009-04-02 14:00       ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-02 14:00 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Thu, Apr 02, 2009 at 02:39:40PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is another posting for IO controller patches. Last time I had posted
> > RFC patches for an IO controller which did bio control per cgroup.
> > 
> > http://lkml.org/lkml/2008/11/6/227
> > 
> > One of the takeaway from the discussion in this thread was that let us
> > implement a common layer which contains the proportional weight scheduling
> > code which can be shared by all the IO schedulers.
> > 
>   
>   Hi Vivek,
> 
>   I did some tests on my *old* i386 box (with two concurrent dd runs), and noticed
>   that the IO Controller doesn't work well in that situation. But it works perfectly
>   on my *new* x86 box. I dug into this problem, and I guess the major reason is that
>   my *old* i386 box is too slow; it can't ensure that the two running ioqs are always backlogged.

Hi Gui,

Have you run top to see what the cpu usage percentage is? I suspect that
the cpu is not keeping pace with the disk to enqueue enough requests. I think
the process might be blocked somewhere else so that it could not issue
requests.

>   If that is the case, I happen to have a thought: when an ioq uses up its time slice,
>   we don't expire it immediately. Maybe we can give it a bit of bonus idling time to
>   wait for new requests, if this ioq's finish time and its ancestors' finish times are all
>   much smaller than those of the other entities on each corresponding service tree.

Have you tried it with "fairness" enabled? With "fairness" enabled, for
sync queues I wait for one extra idle time slice ("8ms") for the queue
to get backlogged again before I move to the next queue.
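
In code terms the extra wait is roughly the following (a sketch only; the
field and helper names are illustrative, not taken from the patches, and
8ms is just the value mentioned above):

/* sketch: with "fairness" set, give a sync queue one extra idle slice
 * (~8ms) to get backlogged again before it is expired */
static void maybe_arm_extra_idle(struct elv_fq_data *efqd, struct io_queue *ioq)
{
	if (efqd->fairness && elv_ioq_sync(ioq) && !ioq->nr_queued)
		mod_timer(&efqd->idle_slice_timer,
			  jiffies + msecs_to_jiffies(8));
}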

Otherwise, try increasing the idle time length to a higher value, say "12ms",
just to see if that has any impact.

Can you please also send me the output of blkparse? It might give some idea
of how the IO scheduler sees the IO pattern.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
@ 2009-04-02 14:00       ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-02 14:00 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

On Thu, Apr 02, 2009 at 02:39:40PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is another posting for IO controller patches. Last time I had posted
> > RFC patches for an IO controller which did bio control per cgroup.
> > 
> > http://lkml.org/lkml/2008/11/6/227
> > 
> > One of the takeaway from the discussion in this thread was that let us
> > implement a common layer which contains the proportional weight scheduling
> > code which can be shared by all the IO schedulers.
> > 
>   
>   Hi Vivek,
> 
>   I did some tests on my *old* i386 box (with two concurrent dd runs), and noticed
>   that the IO Controller doesn't work well in that situation. But it works perfectly
>   on my *new* x86 box. I dug into this problem, and I guess the major reason is that
>   my *old* i386 box is too slow; it can't ensure that the two running ioqs are always backlogged.

Hi Gui,

Have you run top to see what the cpu usage percentage is? I suspect that
the cpu is not keeping pace with the disk to enqueue enough requests. I think
the process might be blocked somewhere else so that it could not issue
requests.

>   If that is the case, I happen to have a thought: when an ioq uses up its time slice,
>   we don't expire it immediately. Maybe we can give it a bit of bonus idling time to
>   wait for new requests, if this ioq's finish time and its ancestors' finish times are all
>   much smaller than those of the other entities on each corresponding service tree.

Have you tried it with "fairness" enabled? With "fairness" enabled, for
sync queues I wait for one extra idle time slice ("8ms") for the queue
to get backlogged again before I move to the next queue.

Otherwise, try increasing the idle time length to a higher value, say "12ms",
just to see if that has any impact.

Can you please also send me the output of blkparse? It might give some idea
of how the IO scheduler sees the IO pattern.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]           ` <20090312180126.GI10919-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-03-16  8:40             ` Ryo Tsuruta
@ 2009-04-05 15:15             ` Andrea Righi
  1 sibling, 0 replies; 190+ messages in thread
From: Andrea Righi @ 2009-04-05 15:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On 2009-03-12 19:01, Vivek Goyal wrote:
> On Thu, Mar 12, 2009 at 12:11:46AM -0700, Andrew Morton wrote:
>> On Wed, 11 Mar 2009 21:56:46 -0400 Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
[snip]
>> Also..  there are so many IO controller implementations that I've lost
>> track of who is doing what.  I do have one private report here that
>> Andreas's controller "is incredibly productive for us and has allowed
>> us to put twice as many users per server with faster times for all
>> users".  Which is pretty stunning, although it should be viewed as a
>> condemnation of the current code, I'm afraid.
>>
> 
> I had looked briefly at Andrea's implementation in the past. I will look
> again. I had thought that this approach did not get much traction.

Hi Vivek, sorry for my late reply. I periodically upload the latest
versions of io-throttle here if you're still interested:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/

There are no substantial changes with respect to the latest version I posted
to the LKML, just rebasing to recent kernels.

> 
> Some quick thoughts about this approach though.
> 
> - It is not a proportional weight controller. It is more of limiting
>   bandwidth in absolute numbers for each cgroup on each disk.
>  
>   So each cgroup will define a rule for each disk in the system mentioning
>   at what maximum rate that cgroup can issue IO to that disk and throttle
>   the IO from that cgroup if rate has excedded.

Correct. Also, proportional weight control has been on the TODO list
since the early versions, but I never dedicated much effort to
implementing this feature. I can focus on this and try to write something
if we all think it is worth doing.

> 
>   Above requirement can create configuration problems.
> 
> 	- If there are large number of disks in system, per cgroup one shall
> 	  have to create rules for each disk. Until and unless admin knows
> 	  what applications are in which cgroup and strictly what disk
> 	  these applications do IO to and create rules for only those
>  	  disks.

I don't think this is a huge problem anyway. IMHO a userspace tool, e.g.
a script, would be able to efficiently create/modify rules by parsing
user-defined rules in some human-readable form (config files, etc.), even in
the presence of hundreds of disks. The same is valid for dm-ioband, I think.

> 
> 	- I think problem gets compounded if there is a hierarchy of
> 	  logical devices. I think in that case one shall have to create
> 	  rules for logical devices and not actual physical devices.

By logical devices do you mean device-mapper devices (i.e. LVM, software
RAID, etc.)? Or do you mean that we need to introduce the concept of a
"logical device" to easily (quickly) configure IO requirements and then
map those logical devices to the actual physical devices? In that case I
think this can be addressed in userspace. Or maybe I'm totally missing
the point here.

> 
> - Because it is not proportional weight distribution, if some
>   cgroup is not using its planned BW, other group sharing the
>   disk can not make use of spare BW.  
> 	

Right.

> - I think one should know in advance the throughput rate of underlying media
>   and also know competing applications so that one can statically define
>   the BW assigned to each cgroup on each disk.
> 
>   This will be difficult. Effective BW extracted out of a rotational media
>   is dependent on the seek pattern so one shall have to either try to make
>   some conservative estimates and try to divide BW (we will not utilize disk
>   fully) or take some peak numbers and divide BW (cgroup might not get the
>   maximum rate configured).

Correct. I think the proportional weight approach is the only solution
to efficiently use the whole BW. OTOH absolute limiting rules offer
better control over QoS, because you can totally remove performance
bursts/peaks that could break QoS requirements for short periods of
time. So, my "ideal" IO controller should allow defining both kinds of
rules: absolute and proportional limits.

I still have to look closely at your patchset anyway. I will do so and
give feedback.

-Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12 18:01         ` Vivek Goyal
  2009-03-16  8:40           ` Ryo Tsuruta
       [not found]           ` <20090312180126.GI10919-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-04-05 15:15           ` Andrea Righi
  2009-04-06  6:50             ` Nauman Rafique
       [not found]             ` <49D8CB17.7040501-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2 siblings, 2 replies; 190+ messages in thread
From: Andrea Righi @ 2009-04-05 15:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz

On 2009-03-12 19:01, Vivek Goyal wrote:
> On Thu, Mar 12, 2009 at 12:11:46AM -0700, Andrew Morton wrote:
>> On Wed, 11 Mar 2009 21:56:46 -0400 Vivek Goyal <vgoyal@redhat.com> wrote:
[snip]
>> Also..  there are so many IO controller implementations that I've lost
>> track of who is doing what.  I do have one private report here that
>> Andreas's controller "is incredibly productive for us and has allowed
>> us to put twice as many users per server with faster times for all
>> users".  Which is pretty stunning, although it should be viewed as a
>> condemnation of the current code, I'm afraid.
>>
> 
> I had looked briefly at Andrea's implementation in the past. I will look
> again. I had thought that this approach did not get much traction.

Hi Vivek, sorry for my late reply. I periodically upload the latest
versions of io-throttle here if you're still interested:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/

There are no substantial changes with respect to the latest version I posted
to the LKML, just rebasing to recent kernels.

> 
> Some quick thoughts about this approach though.
> 
> - It is not a proportional weight controller. It is more of limiting
>   bandwidth in absolute numbers for each cgroup on each disk.
>  
>   So each cgroup will define a rule for each disk in the system mentioning
>   at what maximum rate that cgroup can issue IO to that disk and throttle
>   the IO from that cgroup if rate has excedded.

Correct. Also, proportional weight control has been on the TODO list
since the early versions, but I never dedicated much effort to
implementing this feature. I can focus on this and try to write something
if we all think it is worth doing.

> 
>   Above requirement can create configuration problems.
> 
> 	- If there are large number of disks in system, per cgroup one shall
> 	  have to create rules for each disk. Until and unless admin knows
> 	  what applications are in which cgroup and strictly what disk
> 	  these applications do IO to and create rules for only those
>  	  disks.

I don't think this is a huge problem anyway. IMHO a userspace tool, e.g.
a script, would be able to efficiently create/modify rules by parsing
user-defined rules in some human-readable form (config files, etc.), even in
the presence of hundreds of disks. The same is valid for dm-ioband, I think.

> 
> 	- I think problem gets compounded if there is a hierarchy of
> 	  logical devices. I think in that case one shall have to create
> 	  rules for logical devices and not actual physical devices.

By logical devices do you mean device-mapper devices (i.e. LVM, software
RAID, etc.)? Or do you mean that we need to introduce the concept of a
"logical device" to easily (quickly) configure IO requirements and then
map those logical devices to the actual physical devices? In that case I
think this can be addressed in userspace. Or maybe I'm totally missing
the point here.

> 
> - Because it is not proportional weight distribution, if some
>   cgroup is not using its planned BW, other group sharing the
>   disk can not make use of spare BW.  
> 	

Right.

> - I think one should know in advance the throughput rate of underlying media
>   and also know competing applications so that one can statically define
>   the BW assigned to each cgroup on each disk.
> 
>   This will be difficult. Effective BW extracted out of a rotational media
>   is dependent on the seek pattern so one shall have to either try to make
>   some conservative estimates and try to divide BW (we will not utilize disk
>   fully) or take some peak numbers and divide BW (cgroup might not get the
>   maximum rate configured).

Correct. I think the proportional weight approach is the only solution
to efficiently use the whole BW. OTOH absolute limiting rules offer
better control over QoS, because you can totally remove performance
bursts/peaks that could break QoS requirements for short periods of
time. So, my "ideal" IO controller should allow defining both kinds of
rules: absolute and proportional limits.

I still have to look closely at your patchset anyway. I will do so and
give feedback.

-Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]             ` <49D8CB17.7040501-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2009-04-06  6:50               ` Nauman Rafique
  2009-04-07  6:40                 ` Vivek Goyal
  1 sibling, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-06  6:50 UTC (permalink / raw)
  To: righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Sun, Apr 5, 2009 at 8:15 AM, Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On 2009-03-12 19:01, Vivek Goyal wrote:
>> On Thu, Mar 12, 2009 at 12:11:46AM -0700, Andrew Morton wrote:
>>> On Wed, 11 Mar 2009 21:56:46 -0400 Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> [snip]
>>> Also..  there are so many IO controller implementations that I've lost
>>> track of who is doing what.  I do have one private report here that
>>> Andreas's controller "is incredibly productive for us and has allowed
>>> us to put twice as many users per server with faster times for all
>>> users".  Which is pretty stunning, although it should be viewed as a
>>> condemnation of the current code, I'm afraid.
>>>
>>
>> I had looked briefly at Andrea's implementation in the past. I will look
>> again. I had thought that this approach did not get much traction.
>
> Hi Vivek, sorry for my late reply. I periodically upload the latest
> versions of io-throttle here if you're still interested:
> http://download.systemimager.org/~arighi/linux/patches/io-throttle/
>
> There's no consistent changes respect to the latest version I posted to
> the LKML, just rebasing to the recent kernels.
>
>>
>> Some quick thoughts about this approach though.
>>
>> - It is not a proportional weight controller. It is more of limiting
>>   bandwidth in absolute numbers for each cgroup on each disk.
>>
>>   So each cgroup will define a rule for each disk in the system mentioning
>>   at what maximum rate that cgroup can issue IO to that disk and throttle
>>   the IO from that cgroup if rate has excedded.
>
> Correct. Add also the proportional weight control has been in the TODO
> list since the early versions, but I never dedicated too much effort to
> implement this feature, I can focus on this and try to write something
> if we all think it is worth to be done.
>
>>
>>   Above requirement can create configuration problems.
>>
>>       - If there are large number of disks in system, per cgroup one shall
>>         have to create rules for each disk. Until and unless admin knows
>>         what applications are in which cgroup and strictly what disk
>>         these applications do IO to and create rules for only those
>>         disks.
>
> I don't think this is a huge problem anyway. IMHO a userspace tool, e.g.
> a script, would be able to efficiently create/modify rules parsing user
> defined rules in some human-readable form (config files, etc.), even in
> presence of hundreds of disk. The same is valid for dm-ioband I think.
>
>>
>>       - I think problem gets compounded if there is a hierarchy of
>>         logical devices. I think in that case one shall have to create
>>         rules for logical devices and not actual physical devices.
>
> With logical devices you mean device-mapper devices (i.e. LVM, software
> RAID, etc.)? or do you mean that we need to introduce the concept of
> "logical device" to easily (quickly) configure IO requirements and then
> map those logical devices to the actual physical devices? In this case I
> think this can be addressed in userspace. Or maybe I'm totally missing
> the point here.
>
>>
>> - Because it is not proportional weight distribution, if some
>>   cgroup is not using its planned BW, other group sharing the
>>   disk can not make use of spare BW.
>>
>
> Right.
>
>> - I think one should know in advance the throughput rate of underlying media
>>   and also know competing applications so that one can statically define
>>   the BW assigned to each cgroup on each disk.
>>
>>   This will be difficult. Effective BW extracted out of a rotational media
>>   is dependent on the seek pattern so one shall have to either try to make
>>   some conservative estimates and try to divide BW (we will not utilize disk
>>   fully) or take some peak numbers and divide BW (cgroup might not get the
>>   maximum rate configured).
>
> Correct. I think the proportional weight approach is the only solution
> to efficiently use the whole BW. OTOH absolute limiting rules offer a
> better control over QoS, because you can totally remove performance
> bursts/peaks that could break QoS requirements for short periods of
> time. So, my "ideal" IO controller should allow to define both rules:
> absolute and proportional limits.

I completely agree with Andrea here. The final solution has to have
both absolute limits and proportions. But instead of adding a token-based
approach on top of a proportional-based system, I have been
thinking about modifying the proportional approach to support absolute
limits. This might not work, but I think it is an interesting idea
to think about.

Here are my thoughts on it so far. We start with the patches that
Vivek has sent, and change the notion of weights to percents. That is,
the user space specifies percents of disk time instead of weights. We
do not put entities in the idle tree; whenever an entity is not
backlogged, we still keep it in the active trees, and allocate it
time slices. But since it has no requests in it, no requests
will get dispatched during the time slices allocated to these
entities. Moreover, if the percents of all entities do not add up to a
hundred, we introduce a dummy entity at each level to soak up the rest
of the "percent". This dummy entity would get time slices just like any
other entity, but will not dispatch any requests. With these
modifications, we can limit entities to their allocated percent of disk
time.
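
Purely as an illustration of that bookkeeping (nothing below exists in the
posted patches; all names are made up):

/* illustrative only: the share left over for the dummy entity at one
 * level of the hierarchy */
static int dummy_entity_percent(struct io_sched_data *sd)
{
	struct io_entity *entity;
	int used = 0;

	list_for_each_entry(entity, &sd->active_list, list)
		used += entity->percent;

	/* the unallocated share is soaked up by the dummy entity, whose
	 * time slices dispatch no requests at all */
	return used < 100 ? 100 - used : 0;
}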

We might want a situation in which we want to allow certain entities
to exceed their "percent" of disk time, while others should be
limited. In this case, we can extend the above mentioned approach by
introducing a secondary active tree. All entities which are allowed to
exceed their "percent" can be queued in the secondary tree, besides
the primary tree. Whenever an idle (or dummy) entity gets a time
slice, instead of idling the disk, an entity can be picked from the
secondary tree.

The advantage of an approach like this is that it would be a relatively
small modification of the proposed proportional approach. Moreover,
entities would be throttled by getting time slices less frequently,
instead of being allowed to send a burst and then getting starved
(like in a ticket-based approach). The downside is that this approach
sounds unconventional and has probably not been tried in other
domains either. Thoughts? Opinions?

I will create patches based on the above idea in a few weeks.

>
> I still have to look closely at your patchset anyway. I will do and give
> a feedback.
>
> -Andrea
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-04-05 15:15           ` Andrea Righi
@ 2009-04-06  6:50             ` Nauman Rafique
       [not found]             ` <49D8CB17.7040501-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-06  6:50 UTC (permalink / raw)
  To: righi.andrea
  Cc: Vivek Goyal, Andrew Morton, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz

On Sun, Apr 5, 2009 at 8:15 AM, Andrea Righi <righi.andrea@gmail.com> wrote:
> On 2009-03-12 19:01, Vivek Goyal wrote:
>> On Thu, Mar 12, 2009 at 12:11:46AM -0700, Andrew Morton wrote:
>>> On Wed, 11 Mar 2009 21:56:46 -0400 Vivek Goyal <vgoyal@redhat.com> wrote:
> [snip]
>>> Also..  there are so many IO controller implementations that I've lost
>>> track of who is doing what.  I do have one private report here that
>>> Andreas's controller "is incredibly productive for us and has allowed
>>> us to put twice as many users per server with faster times for all
>>> users".  Which is pretty stunning, although it should be viewed as a
>>> condemnation of the current code, I'm afraid.
>>>
>>
>> I had looked briefly at Andrea's implementation in the past. I will look
>> again. I had thought that this approach did not get much traction.
>
> Hi Vivek, sorry for my late reply. I periodically upload the latest
> versions of io-throttle here if you're still interested:
> http://download.systemimager.org/~arighi/linux/patches/io-throttle/
>
> There's no consistent changes respect to the latest version I posted to
> the LKML, just rebasing to the recent kernels.
>
>>
>> Some quick thoughts about this approach though.
>>
>> - It is not a proportional weight controller. It is more of limiting
>>   bandwidth in absolute numbers for each cgroup on each disk.
>>
>>   So each cgroup will define a rule for each disk in the system mentioning
>>   at what maximum rate that cgroup can issue IO to that disk and throttle
>>   the IO from that cgroup if rate has excedded.
>
> Correct. Add also the proportional weight control has been in the TODO
> list since the early versions, but I never dedicated too much effort to
> implement this feature, I can focus on this and try to write something
> if we all think it is worth to be done.
>
>>
>>   Above requirement can create configuration problems.
>>
>>       - If there are large number of disks in system, per cgroup one shall
>>         have to create rules for each disk. Until and unless admin knows
>>         what applications are in which cgroup and strictly what disk
>>         these applications do IO to and create rules for only those
>>         disks.
>
> I don't think this is a huge problem anyway. IMHO a userspace tool, e.g.
> a script, would be able to efficiently create/modify rules parsing user
> defined rules in some human-readable form (config files, etc.), even in
> presence of hundreds of disk. The same is valid for dm-ioband I think.
>
>>
>>       - I think problem gets compounded if there is a hierarchy of
>>         logical devices. I think in that case one shall have to create
>>         rules for logical devices and not actual physical devices.
>
> With logical devices you mean device-mapper devices (i.e. LVM, software
> RAID, etc.)? or do you mean that we need to introduce the concept of
> "logical device" to easily (quickly) configure IO requirements and then
> map those logical devices to the actual physical devices? In this case I
> think this can be addressed in userspace. Or maybe I'm totally missing
> the point here.
>
>>
>> - Because it is not proportional weight distribution, if some
>>   cgroup is not using its planned BW, other groups sharing the
>>   disk can not make use of the spare BW.
>>
>
> Right.
>
>> - I think one should know in advance the throughput rate of underlying media
>>   and also know competing applications so that one can statically define
>>   the BW assigned to each cgroup on each disk.
>>
>>   This will be difficult. The effective BW extracted out of rotational media
>>   is dependent on the seek pattern so one shall have to either try to make
>>   some conservative estimates and try to divide BW (we will not utilize disk
>>   fully) or take some peak numbers and divide BW (cgroup might not get the
>>   maximum rate configured).
>
> Correct. I think the proportional weight approach is the only solution
> to efficiently use the whole BW. OTOH absolute limiting rules offer a
> better control over QoS, because you can totally remove performance
> bursts/peaks that could break QoS requirements for short periods of
> time. So, my "ideal" IO controller should allow defining both kinds of rules:
> absolute and proportional limits.

I completely agree with Andrea here. The final solution has to have
both absolute limits and proportions. But instead of adding a token-based
approach on top of a proportional system, I have been
thinking about modifying the proportional approach to support absolute
limits. This might not work, but I think this is an interesting idea
to think about.

Here are my thoughts on it so far. We start with the patches that
Vivek has sent, and change the notion of weights to percent. That is,
the user space specifies percents of disk times instead of weights. We
do not put entities in idle tree; whenever an entity is not
backlogged, we still keep them in the active trees, and allocate them
time slices. But since they have no requests in them, no requests
will get dispatched during the time slices allocated to these
entities. Moreover, if the percents of all entities do not add up to a
hundred, we introduce a dummy entity at each level to soak up the rest
of the "percent". This dummy entity would get time slices just like any
other entity, but will not dispatch any requests. With these
modifications, we can limit entities to their allocated percent of disk
time.

We might want to allow certain entities
to exceed their "percent" of disk time, while others should be
limited. In this case, we can extend the above-mentioned approach by
introducing a secondary active tree. All entities which are allowed to
exceed their "percent" can be queued in the secondary tree, besides
the primary tree. Whenever an idle (or dummy) entity gets a time
slice, instead of idling the disk, an entity can be picked from the
secondary tree.
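
To make the idea a bit more concrete, here is a rough userspace toy sketch
(this is not code from the patchset; the entity names, percentages and field
names are all made up) of how a dummy entity could soak up the unallocated
percent, and how a slice that would otherwise idle the disk could be handed
to an entity from the secondary tree:

#include <stdio.h>

struct entity {
	const char *name;
	int percent;     /* share of disk time */
	int backlogged;  /* has queued requests */
	int may_exceed;  /* allowed to use spare slices (secondary tree) */
};

int main(void)
{
	struct entity ents[] = {
		{ "A",     40, 1, 0 },
		{ "B",     30, 1, 1 },
		{ "dummy", 30, 0, 0 },  /* soaks up the unallocated percent */
	};
	int i, j, n = sizeof(ents) / sizeof(ents[0]);

	for (i = 0; i < n; i++) {
		if (ents[i].backlogged) {
			printf("%s dispatches for %d%% of disk time\n",
			       ents[i].name, ents[i].percent);
			continue;
		}
		/* Idle or dummy slice: pick from the secondary tree instead of idling. */
		for (j = 0; j < n; j++)
			if (ents[j].backlogged && ents[j].may_exceed)
				break;
		if (j < n)
			printf("%s's %d%% slice is handed to %s\n",
			       ents[i].name, ents[i].percent, ents[j].name);
		else
			printf("%s's %d%% slice idles the disk\n",
			       ents[i].name, ents[i].percent);
	}
	return 0;
}

The real implementation would of course do this inside the elevator's
B-WF2Q+ trees rather than over a flat array; the sketch is only meant to
show the accounting idea.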

The advantage of an approach like this is that it would be a relatively
small modification of the proposed proportional approach. Moreover,
entities would be throttled by getting time slices less frequently,
instead of being allowed to send a burst and then getting starved
(as in a ticket-based approach). The downside is that this approach
sounds unconventional and probably has not been tried in other
domains either. Thoughts? Opinions?

I will create patches based on the above idea in a few weeks.

>
> I still have to look closely at your patchset anyway. I will do and give
> a feedback.
>
> -Andrea
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-03-12  1:56     ` Vivek Goyal
@ 2009-04-06 14:35         ` Balbir Singh
  -1 siblings, 0 replies; 190+ messages in thread
From: Balbir Singh @ 2009-04-06 14:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil

* Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> [2009-03-11 21:56:46]:

> o Documentation for io-controller.
> 
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
>  1 files changed, 221 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/block/io-controller.txt
> 
> diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
> new file mode 100644
> index 0000000..8884c5a
> --- /dev/null
> +++ b/Documentation/block/io-controller.txt
> @@ -0,0 +1,221 @@
> +				IO Controller
> +				=============
> +
> +Overview
> +========
> +
> +This patchset implements a proportional weight IO controller. That is one
> +can create cgroups and assign prio/weights to those cgroups and task group
> +will get access to disk proportionate to the weight of the group.
> +
> +These patches modify the elevator layer and individual IO schedulers to do
> +IO control, hence this io controller works only on block devices which use
> +one of the standard io schedulers; it can not be used with any xyz logical
> +block device.
> +
> +The assumption/thought behind modifying IO scheduler is that resource control
> +is needed only on leaf nodes where the actual contention for resources is
> +present and not on intermediate logical block devices.
> +
> +Consider the following hypothetical scenario. Let's say there are three physical
> +disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
> +created on top of these. Some part of sdb is in lv0 and some part is in lv1.
> +
> +			    lv0      lv1
> +			  /	\  /     \
> +			sda      sdb      sdc
> +
> +Also consider following cgroup hierarchy
> +
> +				root
> +				/   \
> +			       A     B
> +			      / \    / \
> +			     T1 T2  T3  T4
> +
> +A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
> +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
> +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
> +IO control on intermediate logical block nodes (lv0, lv1).
> +
> +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
> +only, there will not be any contention for resources between groups A and B if
> +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
> +IO scheduler associated with the sdb will distribute disk bandwidth to
> +group A and B proportionate to their weight.

What if we have partitions sda1, sda2 and sda3 instead of sda, sdb and
sdc?

> +
> +CFQ already has the notion of fairness and it provides differential disk
> +access based on priority and class of the task. Just that it is flat and
> +with cgroup stuff, it needs to be made hierarchical.
> +
> +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
> +of fairness among various threads.
> +
> +One of the concerns raised with modifying IO schedulers was that we don't
> +want to replicate the code in all the IO schedulers. These patches share
> +the fair queuing code which has been moved to a common layer (elevator
> +layer). Hence we don't end up replicating code across IO schedulers.
> +
> +Design
> +======
> +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
> +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
> +B-WF2Q+ algorithm for fair queuing.
> +

References to BFQ, please. I can search them, but having them in the
doc would be nice.

> +Why BFQ?
> +
> +- Not sure if weighted round robin logic of CFQ can be easily extended for
> +  hierarchical mode. One of the things is that we can not keep dividing
> +  the time slice of the parent group among its children. The deeper we go in
> +  the hierarchy, the smaller the time slice gets.
> +
> +  One of the ways to implement hierarchical support could be to keep track
> +  of virtual time and service provided to queue/group and select a queue/group
> +  for service based on any of the various available algorithms.
> +
> +  BFQ already had support for hierarchical scheduling, taking those patches
> +  was easier.
> +

Could you elaborate, when you say timeslices get smaller -

1. Are you referring to inability to use higher resolution time?
2. Loss of throughput due to timeslice degradation?

> +- BFQ was designed to provide tighter bounds/delay w.r.t service provided
> +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
> +
> +  Note: BFQ originally used amount of IO done (number of sectors) as notion
> +        of service provided. IOW, it tried to provide fairness in terms of
> +        actual IO done and not in terms of actual time disk access was
> +	given to a queue.

I assume by sectors you mean the kernel sector size?

> +
> +	This patchset modified BFQ to provide fairness in the time domain because
> +	that's what CFQ does. So idea was try not to deviate too much from
> +	the CFQ behavior initially.
> +
> +	Providing fairness in the time domain makes accounting tricky because
> +	due to command queueing, at one time there might be multiple requests
> +	from different queues and there is no easy way to find out how much
> +	disk time actually was consumed by the requests of a particular
> +	queue. More about this in comments in source code.
> +
> +So it is yet to be seen if changing to the time domain still retains BFQ guarantees
> +or not.
> +
> +From data structure point of view, one can think of a tree per device, where
> +io groups and io queues are hanging and are being scheduled using B-WF2Q+
> +algorithm. io_queue is the end queue where requests are actually stored and
> +dispatched from (like cfqq).
> +
> +These io queues are primarily created by and managed by end io schedulers
> +depending on their semantics. For example, noop, deadline and AS ioschedulers
> +keep one io queue per cgroup while cfq keeps one io queue per io_context in
> +a cgroup (apart from async queues).
> +

I assume there is one io_context per cgroup.

> +A request is mapped to an io group by the elevator layer, and which io queue it
> +is mapped to within the group depends on the ioscheduler. Currently the "current" task
> +is used to determine the cgroup (hence io group) of the request. Down the
> +line we need to make use of bio-cgroup patches to map delayed writes to
> +right group.

That seems acceptable

> +
> +Going back to old behavior
> +==========================
> +In new scheme of things essentially we are creating hierarchical fair
> +queuing logic in the elevator layer and changing IO schedulers to make use of
> +that logic so that end IO schedulers start supporting hierarchical scheduling.
> +
> +Elevator layer continues to support the old interfaces. So even if fair queuing
> +is enabled at the elevator layer, one can have both the new hierarchical scheduler as
> +well as old non-hierarchical scheduler operating.
> +
> +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> +scheduling is disabled, noop, deadline and AS should retain their existing
> +behavior.
> +
> +CFQ is the only exception where one can not disable fair queuing as it is
> +needed for providing fairness among various threads even in non-hierarchical
> +mode.
> +
> +Various user visible config options
> +===================================
> +CONFIG_IOSCHED_NOOP_HIER
> +	- Enables hierarchical fair queuing in noop. Not selecting this option
> +	  leads to old behavior of noop.
> +
> +CONFIG_IOSCHED_DEADLINE_HIER
> +	- Enables hierarchical fair queuing in deadline. Not selecting this
> +	  option leads to old behavior of deadline.
> +
> +CONFIG_IOSCHED_AS_HIER
> +	- Enables hierarchical fair queuing in AS. Not selecting this option
> +	  leads to old behavior of AS.
> +
> +CONFIG_IOSCHED_CFQ_HIER
> +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
> +	  still does fair queuing among various queues but it is flat and not
> +	  hierarchical.
> +
> +Config options selected automatically
> +=====================================
> +These config options are not user visible and are selected/deselected
> +automatically based on IO scheduler configurations.
> +
> +CONFIG_ELV_FAIR_QUEUING
> +	- Enables/Disables the fair queuing logic at elevator layer.
> +
> +CONFIG_GROUP_IOSCHED
> +	- Enables/Disables hierarchical queuing and associated cgroup bits.
> +
> +TODO
> +====
> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
> +- Convert cgroup ioprio to notion of weight.
> +- Anticipatory code will need more work. It is not working properly currently
> +  and needs more thought.

What are the problems with the code?

> +- Use of bio-cgroup patches.

I saw these posted as well

> +- Use of Nauman's per cgroup request descriptor patches.
> +

More details would be nice, I am not sure I understand

> +HOWTO
> +=====
> +So far I have done very simple testing of running two dd threads in two
> +different cgroups. Here is what you can do.
> +
> +- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
> +	CONFIG_IOSCHED_CFQ_HIER=y
> +
> +- Compile and boot into kernel and mount IO controller.
> +
> +	mount -t cgroup -o io none /cgroup
> +
> +- Create two cgroups
> +	mkdir -p /cgroup/test1/ /cgroup/test2
> +
> +- Set io priority of group test1 and test2
> +	echo 0 > /cgroup/test1/io.ioprio
> +	echo 4 > /cgroup/test2/io.ioprio
> +

What is the meaning of priorities? Which is higher, which is lower?
What is the maximum? How does it impact b/w?

> +- Create two same size files (say 512MB each) on same disk (file1, file2) and
> +  launch two dd threads in different cgroup to read those files. Make sure
> +  right io scheduler is being used for the block device where files are
> +  present (the one you compiled in hierarchical mode).
> +
> +	echo 1 > /proc/sys/vm/drop_caches
> +
> +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> +	echo $! > /cgroup/test1/tasks
> +	cat /cgroup/test1/tasks
> +
> +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> +	echo $! > /cgroup/test2/tasks
> +	cat /cgroup/test2/tasks
> +
> +- First dd should finish first.
> +
> +Some Test Results
> +=================
> +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
> +
> +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> +
> +- Three dd in three cgroups with prio 0, 4, 4.
> +
> +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
> -- 
> 1.6.0.1
> 
> 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
@ 2009-04-06 14:35         ` Balbir Singh
  0 siblings, 0 replies; 190+ messages in thread
From: Balbir Singh @ 2009-04-06 14:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, linux-kernel, containers,
	akpm, menage, peterz

* Vivek Goyal <vgoyal@redhat.com> [2009-03-11 21:56:46]:

> o Documentation for io-controller.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
>  1 files changed, 221 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/block/io-controller.txt
> 
> diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
> new file mode 100644
> index 0000000..8884c5a
> --- /dev/null
> +++ b/Documentation/block/io-controller.txt
> @@ -0,0 +1,221 @@
> +				IO Controller
> +				=============
> +
> +Overview
> +========
> +
> +This patchset implements a proportional weight IO controller. That is one
> +can create cgroups and assign prio/weights to those cgroups and task group
> +will get access to disk proportionate to the weight of the group.
> +
> +These patches modify the elevator layer and individual IO schedulers to do
> +IO control, hence this io controller works only on block devices which use
> +one of the standard io schedulers; it can not be used with any xyz logical
> +block device.
> +
> +The assumption/thought behind modifying IO scheduler is that resource control
> +is needed only on leaf nodes where the actual contention for resources is
> +present and not on intermediate logical block devices.
> +
> +Consider the following hypothetical scenario. Let's say there are three physical
> +disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
> +created on top of these. Some part of sdb is in lv0 and some part is in lv1.
> +
> +			    lv0      lv1
> +			  /	\  /     \
> +			sda      sdb      sdc
> +
> +Also consider following cgroup hierarchy
> +
> +				root
> +				/   \
> +			       A     B
> +			      / \    / \
> +			     T1 T2  T3  T4
> +
> +A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
> +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
> +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
> +IO control on intermediate logical block nodes (lv0, lv1).
> +
> +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
> +only, there will not be any contention for resources between groups A and B if
> +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
> +IO scheduler associated with the sdb will distribute disk bandwidth to
> +group A and B proportionate to their weight.

What if we have partitions sda1, sda2 and sda3 instead of sda, sdb and
sdc?

> +
> +CFQ already has the notion of fairness and it provides differential disk
> +access based on priority and class of the task. Just that it is flat and
> +with cgroup stuff, it needs to be made hierarchical.
> +
> +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
> +of fairness among various threads.
> +
> +One of the concerns raised with modifying IO schedulers was that we don't
> +want to replicate the code in all the IO schedulers. These patches share
> +the fair queuing code which has been moved to a common layer (elevator
> +layer). Hence we don't end up replicating code across IO schedulers.
> +
> +Design
> +======
> +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
> +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
> +B-WF2Q+ algorithm for fair queuing.
> +

References to BFQ, please. I can search them, but having them in the
doc would be nice.

> +Why BFQ?
> +
> +- Not sure if weighted round robin logic of CFQ can be easily extended for
> +  hierarchical mode. One of the things is that we can not keep dividing
> +  the time slice of the parent group among its children. The deeper we go in
> +  the hierarchy, the smaller the time slice gets.
> +
> +  One of the ways to implement hierarchical support could be to keep track
> +  of virtual time and service provided to queue/group and select a queue/group
> +  for service based on any of the various available algorithms.
> +
> +  BFQ already had support for hierarchical scheduling, taking those patches
> +  was easier.
> +

Could you elaborate, when you say timeslices get smaller -

1. Are you referring to inability to use higher resolution time?
2. Loss of throughput due to timeslice degradation?

> +- BFQ was designed to provide tighter bounds/delay w.r.t service provided
> +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
> +
> +  Note: BFQ originally used amount of IO done (number of sectors) as notion
> +        of service provided. IOW, it tried to provide fairness in terms of
> +        actual IO done and not in terms of actual time disk access was
> +	given to a queue.

I assume by sectors you mean the kernel sector size?

> +
> +	This patchset modified BFQ to provide fairness in the time domain because
> +	that's what CFQ does. So idea was try not to deviate too much from
> +	the CFQ behavior initially.
> +
> +	Providing fairness in the time domain makes accounting tricky because
> +	due to command queueing, at one time there might be multiple requests
> +	from different queues and there is no easy way to find out how much
> +	disk time actually was consumed by the requests of a particular
> +	queue. More about this in comments in source code.
> +
> +So it is yet to be seen if changing to the time domain still retains BFQ guarantees
> +or not.
> +
> +From data structure point of view, one can think of a tree per device, where
> +io groups and io queues are hanging and are being scheduled using B-WF2Q+
> +algorithm. io_queue is the end queue where requests are actually stored and
> +dispatched from (like cfqq).
> +
> +These io queues are primarily created by and managed by end io schedulers
> +depending on their semantics. For example, noop, deadline and AS ioschedulers
> +keep one io queue per cgroup while cfq keeps one io queue per io_context in
> +a cgroup (apart from async queues).
> +

I assume there is one io_context per cgroup.

> +A request is mapped to an io group by the elevator layer, and which io queue it
> +is mapped to within the group depends on the ioscheduler. Currently the "current" task
> +is used to determine the cgroup (hence io group) of the request. Down the
> +line we need to make use of bio-cgroup patches to map delayed writes to
> +right group.

That seems acceptable

> +
> +Going back to old behavior
> +==========================
> +In new scheme of things essentially we are creating hierarchical fair
> +queuing logic in the elevator layer and changing IO schedulers to make use of
> +that logic so that end IO schedulers start supporting hierarchical scheduling.
> +
> +Elevator layer continues to support the old interfaces. So even if fair queuing
> +is enabled at the elevator layer, one can have both the new hierarchical scheduler as
> +well as old non-hierarchical scheduler operating.
> +
> +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> +scheduling is disabled, noop, deadline and AS should retain their existing
> +behavior.
> +
> +CFQ is the only exception where one can not disable fair queuing as it is
> +needed for providing fairness among various threads even in non-hierarchical
> +mode.
> +
> +Various user visible config options
> +===================================
> +CONFIG_IOSCHED_NOOP_HIER
> +	- Enables hierarchical fair queuing in noop. Not selecting this option
> +	  leads to old behavior of noop.
> +
> +CONFIG_IOSCHED_DEADLINE_HIER
> +	- Enables hierarchical fair queuing in deadline. Not selecting this
> +	  option leads to old behavior of deadline.
> +
> +CONFIG_IOSCHED_AS_HIER
> +	- Enables hierarchical fair queuing in AS. Not selecting this option
> +	  leads to old behavior of AS.
> +
> +CONFIG_IOSCHED_CFQ_HIER
> +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
> +	  still does fair queuing among various queues but it is flat and not
> +	  hierarchical.
> +
> +Config options selected automatically
> +=====================================
> +These config options are not user visible and are selected/deselected
> +automatically based on IO scheduler configurations.
> +
> +CONFIG_ELV_FAIR_QUEUING
> +	- Enables/Disables the fair queuing logic at elevator layer.
> +
> +CONFIG_GROUP_IOSCHED
> +	- Enables/Disables hierarchical queuing and associated cgroup bits.
> +
> +TODO
> +====
> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
> +- Convert cgroup ioprio to notion of weight.
> +- Anticipatory code will need more work. It is not working properly currently
> +  and needs more thought.

What are the problems with the code?

> +- Use of bio-cgroup patches.

I saw these posted as well

> +- Use of Nauman's per cgroup request descriptor patches.
> +

More details would be nice, I am not sure I understand

> +HOWTO
> +=====
> +So far I have done very simple testing of running two dd threads in two
> +different cgroups. Here is what you can do.
> +
> +- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
> +	CONFIG_IOSCHED_CFQ_HIER=y
> +
> +- Compile and boot into kernel and mount IO controller.
> +
> +	mount -t cgroup -o io none /cgroup
> +
> +- Create two cgroups
> +	mkdir -p /cgroup/test1/ /cgroup/test2
> +
> +- Set io priority of group test1 and test2
> +	echo 0 > /cgroup/test1/io.ioprio
> +	echo 4 > /cgroup/test2/io.ioprio
> +

What is the meaning of priorities? Which is higher, which is lower?
What is the maximum? How does it impact b/w?

> +- Create two same size files (say 512MB each) on same disk (file1, file2) and
> +  launch two dd threads in different cgroup to read those files. Make sure
> +  right io scheduler is being used for the block device where files are
> +  present (the one you compiled in hierarchical mode).
> +
> +	echo 1 > /proc/sys/vm/drop_caches
> +
> +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> +	echo $! > /cgroup/test1/tasks
> +	cat /cgroup/test1/tasks
> +
> +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> +	echo $! > /cgroup/test2/tasks
> +	cat /cgroup/test2/tasks
> +
> +- First dd should finish first.
> +
> +Some Test Results
> +=================
> +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
> +
> +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
> +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
> +
> +- Three dd in three cgroups with prio 0, 4, 4.
> +
> +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
> +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
> +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
> -- 
> 1.6.0.1
> 
> 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-04-06 14:35         ` Balbir Singh
@ 2009-04-06 22:00             ` Nauman Rafique
  -1 siblings, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-06 22:00 UTC (permalink / raw)
  To: balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA

On Mon, Apr 6, 2009 at 7:35 AM, Balbir Singh <balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> * Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> [2009-03-11 21:56:46]:
>
>> o Documentation for io-controller.
>>
>> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> ---
>>  Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
>>  1 files changed, 221 insertions(+), 0 deletions(-)
>>  create mode 100644 Documentation/block/io-controller.txt
>>
>> diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
>> new file mode 100644
>> index 0000000..8884c5a
>> --- /dev/null
>> +++ b/Documentation/block/io-controller.txt
>> @@ -0,0 +1,221 @@
>> +                             IO Controller
>> +                             =============
>> +
>> +Overview
>> +========
>> +
>> +This patchset implements a proportional weight IO controller. That is one
>> +can create cgroups and assign prio/weights to those cgroups and task group
>> +will get access to disk proportionate to the weight of the group.
>> +
>> +These patches modify the elevator layer and individual IO schedulers to do
>> +IO control, hence this io controller works only on block devices which use
>> +one of the standard io schedulers; it can not be used with any xyz logical
>> +block device.
>> +
>> +The assumption/thought behind modifying IO scheduler is that resource control
>> +is needed only on leaf nodes where the actual contention for resources is
>> +present and not on intermediate logical block devices.
>> +
>> +Consider the following hypothetical scenario. Let's say there are three physical
>> +disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
>> +created on top of these. Some part of sdb is in lv0 and some part is in lv1.
>> +
>> +                         lv0      lv1
>> +                       /     \  /     \
>> +                     sda      sdb      sdc
>> +
>> +Also consider following cgroup hierarchy
>> +
>> +                             root
>> +                             /   \
>> +                            A     B
>> +                           / \    / \
>> +                          T1 T2  T3  T4
>> +
>> +A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
>> +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
>> +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
>> +IO control on intermediate logical block nodes (lv0, lv1).
>> +
>> +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
>> +only, there will not be any contention for resources between groups A and B if
>> +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
>> +IO scheduler associated with the sdb will distribute disk bandwidth to
>> +group A and B proportionate to their weight.
>
> What if we have partitions sda1, sda2 and sda3 instead of sda, sdb and
> sdc?
>
>> +
>> +CFQ already has the notion of fairness and it provides differential disk
>> +access based on priority and class of the task. Just that it is flat and
>> +with cgroup stuff, it needs to be made hierarchical.
>> +
>> +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
>> +of fairness among various threads.
>> +
>> +One of the concerns raised with modifying IO schedulers was that we don't
>> +want to replicate the code in all the IO schedulers. These patches share
>> +the fair queuing code which has been moved to a common layer (elevator
>> +layer). Hence we don't end up replicating code across IO schedulers.
>> +
>> +Design
>> +======
>> +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
>> +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
>> +B-WF2Q+ algorithm for fair queuing.
>> +
>
> References to BFQ, please. I can search them, but having them in the
> doc would be nice.
>
>> +Why BFQ?
>> +
>> +- Not sure if weighted round robin logic of CFQ can be easily extended for
>> +  hierarchical mode. One of the things is that we can not keep dividing
>> +  the time slice of the parent group among its children. The deeper we go in
>> +  the hierarchy, the smaller the time slice gets.
>> +
>> +  One of the ways to implement hierarchical support could be to keep track
>> +  of virtual time and service provided to queue/group and select a queue/group
>> +  for service based on any of the various available algorithms.
>> +
>> +  BFQ already had support for hierarchical scheduling, taking those patches
>> +  was easier.
>> +
>
> Could you elaborate, when you say timeslices get smaller -
>
> 1. Are you referring to inability to use higher resolution time?
> 2. Loss of throughput due to timeslice degradation?
>
>> +- BFQ was designed to provide tighter bounds/delay w.r.t service provided
>> +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
>> +
>> +  Note: BFQ originally used amount of IO done (number of sectors) as notion
>> +        of service provided. IOW, it tried to provide fairness in terms of
>> +        actual IO done and not in terms of actual time disk access was
>> +     given to a queue.
>
> I assume by sectors you mean the kernel sector size?
>
>> +
>> +     This patchset modified BFQ to provide fairness in the time domain because
>> +     that's what CFQ does. So idea was try not to deviate too much from
>> +     the CFQ behavior initially.
>> +
>> +     Providing fairness in the time domain makes accounting tricky because
>> +     due to command queueing, at one time there might be multiple requests
>> +     from different queues and there is no easy way to find out how much
>> +     disk time actually was consumed by the requests of a particular
>> +     queue. More about this in comments in source code.
>> +
>> +So it is yet to be seen if changing to the time domain still retains BFQ guarantees
>> +or not.
>> +
>> +From data structure point of view, one can think of a tree per device, where
>> +io groups and io queues are hanging and are being scheduled using B-WF2Q+
>> +algorithm. io_queue is the end queue where requests are actually stored and
>> +dispatched from (like cfqq).
>> +
>> +These io queues are primarily created by and managed by end io schedulers
>> +depending on their semantics. For example, noop, deadline and AS ioschedulers
>> +keep one io queue per cgroup while cfq keeps one io queue per io_context in
>> +a cgroup (apart from async queues).
>> +
>
> I assume there is one io_context per cgroup.
>
>> +A request is mapped to an io group by the elevator layer, and which io queue it
>> +is mapped to within the group depends on the ioscheduler. Currently the "current" task
>> +is used to determine the cgroup (hence io group) of the request. Down the
>> +line we need to make use of bio-cgroup patches to map delayed writes to
>> +right group.
>
> That seems acceptable
>
>> +
>> +Going back to old behavior
>> +==========================
>> +In new scheme of things essentially we are creating hierarchical fair
>> +queuing logic in the elevator layer and changing IO schedulers to make use of
>> +that logic so that end IO schedulers start supporting hierarchical scheduling.
>> +
>> +Elevator layer continues to support the old interfaces. So even if fair queuing
>> +is enabled at the elevator layer, one can have both the new hierarchical scheduler as
>> +well as old non-hierarchical scheduler operating.
>> +
>> +Also noop, deadline and AS have option of enabling hierarchical scheduling.
>> +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
>> +scheduling is disabled, noop, deadline and AS should retain their existing
>> +behavior.
>> +
>> +CFQ is the only exception where one can not disable fair queuing as it is
>> +needed for providing fairness among various threads even in non-hierarchical
>> +mode.
>> +
>> +Various user visible config options
>> +===================================
>> +CONFIG_IOSCHED_NOOP_HIER
>> +     - Enables hierarchical fair queuing in noop. Not selecting this option
>> +       leads to old behavior of noop.
>> +
>> +CONFIG_IOSCHED_DEADLINE_HIER
>> +     - Enables hierarchical fair queuing in deadline. Not selecting this
>> +       option leads to old behavior of deadline.
>> +
>> +CONFIG_IOSCHED_AS_HIER
>> +     - Enables hierarchical fair queuing in AS. Not selecting this option
>> +       leads to old behavior of AS.
>> +
>> +CONFIG_IOSCHED_CFQ_HIER
>> +     - Enables hierarchical fair queuing in CFQ. Not selecting this option
>> +       still does fair queuing among various queues but it is flat and not
>> +       hierarchical.
>> +
>> +Config options selected automatically
>> +=====================================
>> +These config options are not user visible and are selected/deselected
>> +automatically based on IO scheduler configurations.
>> +
>> +CONFIG_ELV_FAIR_QUEUING
>> +     - Enables/Disables the fair queuing logic at elevator layer.
>> +
>> +CONFIG_GROUP_IOSCHED
>> +     - Enables/Disables hierarchical queuing and associated cgroup bits.
>> +
>> +TODO
>> +====
>> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
>> +- Convert cgroup ioprio to notion of weight.
>> +- Anticipatory code will need more work. It is not working properly currently
>> +  and needs more thought.
>
> What are the problems with the code?
>
>> +- Use of bio-cgroup patches.
>
> I saw these posted as well

I have refactored the bio-cgroup patches to work on top of this patch
set, and keep track of async writes. But we have not been able to get
proportional division for async writes. The problem seems to stem from
the fact that pdflush is cgroup agnostic. Getting proportional IO
scheduling to work might need work beyond block layer. Vivek has been
able to do more testing with those patches, and can explain more.

>
>> +- Use of Nauman's per cgroup request descriptor patches.
>> +
>
> More details would be nice, I am not sure I understand

Right now, the block layer has a limit on request descriptors that can
be allocated. Once that limit is reached, a process trying to get a
request descriptor would be blocked. I wrote a patch in which I made
the request descriptor limit per cgroup, i.e a process will only be
blocked if request descriptors allocated to a give cgroup exceed a
certain limit.

This patch set is already big, and we are trying to be careful about
including all the work we have done for solving the problem at once. So I was
planning to hold onto that patch, and send it out for comments once
the basic infrastructure gets some traction.
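
For the record, a bare-bones sketch of that idea (this is not the actual
patch; the struct and function names below are made up for illustration) is
simply to move the "too many requests, go to sleep" check from the
request_queue to a per-cgroup counter:

/* Hypothetical sketch only -- not the real per-cgroup request descriptor patch. */
struct iocg_rq_pool {
	unsigned int nr_allocated;	/* request descriptors held by this cgroup */
	unsigned int nr_limit;		/* per-cgroup cap instead of the global queue-wide one */
};

/* Return 1 if the caller may allocate a request descriptor, 0 if it must sleep. */
static int iocg_may_get_request(struct iocg_rq_pool *pool)
{
	if (pool->nr_allocated >= pool->nr_limit)
		return 0;		/* only this cgroup's tasks block */
	pool->nr_allocated++;
	return 1;
}

static void iocg_put_request(struct iocg_rq_pool *pool)
{
	pool->nr_allocated--;		/* and wake up any waiters of this cgroup */
}

The point is just that the blocking decision moves from the queue to the
cgroup, so one cgroup hogging descriptors no longer starves submitters in
other cgroups.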

>
>> +HOWTO
>> +=====
>> +So far I have done very simple testing of running two dd threads in two
>> +different cgroups. Here is what you can do.
>> +
>> +- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
>> +     CONFIG_IOSCHED_CFQ_HIER=y
>> +
>> +- Compile and boot into kernel and mount IO controller.
>> +
>> +     mount -t cgroup -o io none /cgroup
>> +
>> +- Create two cgroups
>> +     mkdir -p /cgroup/test1/ /cgroup/test2
>> +
>> +- Set io priority of group test1 and test2
>> +     echo 0 > /cgroup/test1/io.ioprio
>> +     echo 4 > /cgroup/test2/io.ioprio
>> +
>
> What is the meaning of priorities? Which is higher, which is lower?
> What is the maximum? How does it impact b/w?
>
>> +- Create two same size files (say 512MB each) on same disk (file1, file2) and
>> +  launch two dd threads in different cgroup to read those files. Make sure
>> +  right io scheduler is being used for the block device where files are
>> +  present (the one you compiled in hierarchical mode).
>> +
>> +     echo 1 > /proc/sys/vm/drop_caches
>> +
>> +     dd if=/mnt/lv0/zerofile1 of=/dev/null &
>> +     echo $! > /cgroup/test1/tasks
>> +     cat /cgroup/test1/tasks
>> +
>> +     dd if=/mnt/lv0/zerofile2 of=/dev/null &
>> +     echo $! > /cgroup/test2/tasks
>> +     cat /cgroup/test2/tasks
>> +
>> +- First dd should finish first.
>> +
>> +Some Test Results
>> +=================
>> +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
>> +
>> +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
>> +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
>> +
>> +- Three dd in three cgroups with prio 0, 4, 4.
>> +
>> +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
>> +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
>> +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
>> --
>> 1.6.0.1
>>
>>
>
> --
>        Balbir
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
@ 2009-04-06 22:00             ` Nauman Rafique
  0 siblings, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-06 22:00 UTC (permalink / raw)
  To: balbir
  Cc: Vivek Goyal, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, linux-kernel, containers,
	akpm, menage, peterz

On Mon, Apr 6, 2009 at 7:35 AM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Vivek Goyal <vgoyal@redhat.com> [2009-03-11 21:56:46]:
>
>> o Documentation for io-controller.
>>
>> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>> ---
>>  Documentation/block/io-controller.txt |  221 +++++++++++++++++++++++++++++++++
>>  1 files changed, 221 insertions(+), 0 deletions(-)
>>  create mode 100644 Documentation/block/io-controller.txt
>>
>> diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
>> new file mode 100644
>> index 0000000..8884c5a
>> --- /dev/null
>> +++ b/Documentation/block/io-controller.txt
>> @@ -0,0 +1,221 @@
>> +                             IO Controller
>> +                             =============
>> +
>> +Overview
>> +========
>> +
>> +This patchset implements a proportional weight IO controller. That is one
>> +can create cgroups and assign prio/weights to those cgroups and task group
>> +will get access to disk proportionate to the weight of the group.
>> +
>> +These patches modify the elevator layer and individual IO schedulers to do
>> +IO control, hence this io controller works only on block devices which use
>> +one of the standard io schedulers; it can not be used with any xyz logical
>> +block device.
>> +
>> +The assumption/thought behind modifying IO scheduler is that resource control
>> +is needed only on leaf nodes where the actual contention for resources is
>> +present and not on intermediate logical block devices.
>> +
>> +Consider the following hypothetical scenario. Let's say there are three physical
>> +disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
>> +created on top of these. Some part of sdb is in lv0 and some part is in lv1.
>> +
>> +                         lv0      lv1
>> +                       /     \  /     \
>> +                     sda      sdb      sdc
>> +
>> +Also consider following cgroup hierarchy
>> +
>> +                             root
>> +                             /   \
>> +                            A     B
>> +                           / \    / \
>> +                          T1 T2  T3  T4
>> +
>> +A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
>> +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
>> +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
>> +IO control on intermediate logical block nodes (lv0, lv1).
>> +
>> +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
>> +only, there will not be any contention for resources between groups A and B if
>> +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
>> +IO scheduler associated with the sdb will distribute disk bandwidth to
>> +group A and B proportionate to their weight.
>
> What if we have partitions sda1, sda2 and sda3 instead of sda, sdb and
> sdc?
>
>> +
>> +CFQ already has the notion of fairness and it provides differential disk
>> +access based on priority and class of the task. Just that it is flat and
>> +with cgroup stuff, it needs to be made hierarchical.
>> +
>> +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
>> +of fairness among various threads.
>> +
>> +One of the concerns raised with modifying IO schedulers was that we don't
>> +want to replicate the code in all the IO schedulers. These patches share
>> +the fair queuing code which has been moved to a common layer (elevator
>> +layer). Hence we don't end up replicating code across IO schedulers.
>> +
>> +Design
>> +======
>> +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
>> +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
>> +B-WF2Q+ algorithm for fair queuing.
>> +
>
> References to BFQ, please. I can search them, but having them in the
> doc would be nice.
>
>> +Why BFQ?
>> +
>> +- Not sure if weighted round robin logic of CFQ can be easily extended for
>> +  hierarchical mode. One of the things is that we can not keep dividing
>> +  the time slice of the parent group among its children. The deeper we go in
>> +  the hierarchy, the smaller the time slice gets.
>> +
>> +  One of the ways to implement hierarchical support could be to keep track
>> +  of virtual time and service provided to queue/group and select a queue/group
>> +  for service based on any of the various available algorithms.
>> +
>> +  BFQ already had support for hierarchical scheduling, taking those patches
>> +  was easier.
>> +
>
> Could you elaborate, when you say timeslices get smaller -
>
> 1. Are you referring to inability to use higher resolution time?
> 2. Loss of throughput due to timeslice degradation?
>
>> +- BFQ was designed to provide tighter bounds/delay w.r.t service provided
>> +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
>> +
>> +  Note: BFQ originally used amount of IO done (number of sectors) as notion
>> +        of service provided. IOW, it tried to provide fairness in terms of
>> +        actual IO done and not in terms of actual time disk access was
>> +     given to a queue.
>
> I assume by sectors you mean the kernel sector size?
>
>> +
>> +     This patchset modified BFQ to provide fairness in the time domain because
>> +     that's what CFQ does. So idea was try not to deviate too much from
>> +     the CFQ behavior initially.
>> +
>> +     Providing fairness in the time domain makes accounting tricky because
>> +     due to command queueing, at one time there might be multiple requests
>> +     from different queues and there is no easy way to find out how much
>> +     disk time actually was consumed by the requests of a particular
>> +     queue. More about this in comments in source code.
>> +
>> +So it is yet to be seen if changing to the time domain still retains BFQ guarantees
>> +or not.
>> +
>> +From data structure point of view, one can think of a tree per device, where
>> +io groups and io queues are hanging and are being scheduled using B-WF2Q+
>> +algorithm. io_queue is the end queue where requests are actually stored and
>> +dispatched from (like cfqq).
>> +
>> +These io queues are primarily created by and managed by end io schedulers
>> +depending on their semantics. For example, noop, deadline and AS ioschedulers
>> +keep one io queue per cgroup while cfq keeps one io queue per io_context in
>> +a cgroup (apart from async queues).
>> +
>
> I assume there is one io_context per cgroup.
>
>> +A request is mapped to an io group by the elevator layer, and which io queue it
>> +is mapped to within the group depends on the ioscheduler. Currently the "current" task
>> +is used to determine the cgroup (hence io group) of the request. Down the
>> +line we need to make use of bio-cgroup patches to map delayed writes to
>> +right group.
>
> That seems acceptable
>
>> +
>> +Going back to old behavior
>> +==========================
>> +In new scheme of things essentially we are creating hierarchical fair
>> +queuing logic in the elevator layer and changing IO schedulers to make use of
>> +that logic so that end IO schedulers start supporting hierarchical scheduling.
>> +
>> +Elevator layer continues to support the old interfaces. So even if fair queuing
>> +is enabled at the elevator layer, one can have both the new hierarchical scheduler as
>> +well as old non-hierarchical scheduler operating.
>> +
>> +Also noop, deadline and AS have option of enabling hierarchical scheduling.
>> +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
>> +scheduling is disabled, noop, deadline and AS should retain their existing
>> +behavior.
>> +
>> +CFQ is the only exception where one can not disable fair queuing as it is
>> +needed for providing fairness among various threads even in non-hierarchical
>> +mode.
>> +
>> +Various user visible config options
>> +===================================
>> +CONFIG_IOSCHED_NOOP_HIER
>> +     - Enables hierarchical fair queuing in noop. Not selecting this option
>> +       leads to old behavior of noop.
>> +
>> +CONFIG_IOSCHED_DEADLINE_HIER
>> +     - Enables hierarchical fair queuing in deadline. Not selecting this
>> +       option leads to old behavior of deadline.
>> +
>> +CONFIG_IOSCHED_AS_HIER
>> +     - Enables hierarchical fair queuing in AS. Not selecting this option
>> +       leads to old behavior of AS.
>> +
>> +CONFIG_IOSCHED_CFQ_HIER
>> +     - Enables hierarchical fair queuing in CFQ. Not selecting this option
>> +       still does fair queuing among various queues but it is flat and not
>> +       hierarchical.
>> +
>> +Config options selected automatically
>> +=====================================
>> +These config options are not user visible and are selected/deselected
>> +automatically based on IO scheduler configurations.
>> +
>> +CONFIG_ELV_FAIR_QUEUING
>> +     - Enables/Disables the fair queuing logic at elevator layer.
>> +
>> +CONFIG_GROUP_IOSCHED
>> +     - Enables/Disables hierarchical queuing and associated cgroup bits.
>> +
>> +TODO
>> +====
>> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
>> +- Convert cgroup ioprio to notion of weight.
>> +- Anticipatory code will need more work. It is not working properly currently
>> +  and needs more thought.
>
> What are the problems with the code?
>
>> +- Use of bio-cgroup patches.
>
> I saw these posted as well

I have refactored the bio-cgroup patches to work on top of this patch
set, and keep track of async writes. But we have not been able to get
proportional division for async writes. The problem seems to stem from
the fact that pdflush is cgroup agnostic. Getting proportional IO
scheduling to work might need work beyond block layer. Vivek has been
able to do more testing with those patches, and can explain more.

>
>> +- Use of Nauman's per cgroup request descriptor patches.
>> +
>
> More details would be nice, I am not sure I understand

Right now, the block layer has a limit on request descriptors that can
be allocated. Once that limit is reached, a process trying to get a
request descriptor would be blocked. I wrote a patch in which I made
the request descriptor limit per cgroup, i.e a process will only be
blocked if request descriptors allocated to a give cgroup exceed a
certain limit.

This patch set is already big, and we are trying to be careful about
including all the work we have done for solving the problem at once. So I was
planning to hold onto that patch, and send it out for comments once
the basic infrastructure gets some traction.

>
>> +HOWTO
>> +=====
>> +So far I have done very simple testing of running two dd threads in two
>> +different cgroups. Here is what you can do.
>> +
>> +- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
>> +     CONFIG_IOSCHED_CFQ_HIER=y
>> +
>> +- Compile and boot into kernel and mount IO controller.
>> +
>> +     mount -t cgroup -o io none /cgroup
>> +
>> +- Create two cgroups
>> +     mkdir -p /cgroup/test1/ /cgroup/test2
>> +
>> +- Set io priority of group test1 and test2
>> +     echo 0 > /cgroup/test1/io.ioprio
>> +     echo 4 > /cgroup/test2/io.ioprio
>> +
>
> What is the meaning of priorities? Which is higher, which is lower?
> What is the maximum? How does it impact b/w?
>
>> +- Create two same size files (say 512MB each) on same disk (file1, file2) and
>> +  launch two dd threads in different cgroup to read those files. Make sure
>> +  right io scheduler is being used for the block device where files are
>> +  present (the one you compiled in hierarchical mode).
>> +
>> +     echo 1 > /proc/sys/vm/drop_caches
>> +
>> +     dd if=/mnt/lv0/zerofile1 of=/dev/null &
>> +     echo $! > /cgroup/test1/tasks
>> +     cat /cgroup/test1/tasks
>> +
>> +     dd if=/mnt/lv0/zerofile2 of=/dev/null &
>> +     echo $! > /cgroup/test2/tasks
>> +     cat /cgroup/test2/tasks
>> +
>> +- First dd should finish first.
>> +
>> +Some Test Results
>> +=================
>> +- Two dd in two cgroups with prio 0 and 4. Ran two "dd" in those cgroups.
>> +
>> +234179072 bytes (234 MB) copied, 10.1811 s, 23.0 MB/s
>> +234179072 bytes (234 MB) copied, 12.6187 s, 18.6 MB/s
>> +
>> +- Three dd in three cgroups with prio 0, 4, 4.
>> +
>> +234179072 bytes (234 MB) copied, 13.7654 s, 17.0 MB/s
>> +234179072 bytes (234 MB) copied, 19.476 s, 12.0 MB/s
>> +234179072 bytes (234 MB) copied, 20.1858 s, 11.6 MB/s
>> --
>> 1.6.0.1
>>
>>
>
> --
>        Balbir
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found]       ` <20090402140037.GC12851-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-04-07  1:40         ` Gui Jianfeng
  0 siblings, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-04-07  1:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

[-- Attachment #1: Type: text/plain, Size: 2466 bytes --]

Vivek Goyal wrote:
> On Thu, Apr 02, 2009 at 02:39:40PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> Hi All,
>>>
>>> Here is another posting for IO controller patches. Last time I had posted
>>> RFC patches for an IO controller which did bio control per cgroup.
>>>
>>> http://lkml.org/lkml/2008/11/6/227
>>>
>>> One of the takeaway from the discussion in this thread was that let us
>>> implement a common layer which contains the proportional weight scheduling
>>> code which can be shared by all the IO schedulers.
>>>
>>   
>>   Hi Vivek,
>>
>>   I did some tests on my *old* i386 box (with two concurrent dd running), and noticed
>>   that the IO Controller doesn't work well in such a situation. But it works perfectly
>>   on my *new* x86 box. I dug into this problem, and I guess the major reason is that
>>   my *old* i386 box is too slow; it can't ensure two running ioqs are always backlogged.
> 
> Hi Gui,
> 
> Have you run top to see what the percentage cpu usage is? I suspect that
> the cpu is not keeping pace with the disk to enqueue enough requests. I think
> the process might be blocked somewhere else so that it could not issue
> requests.
> 
>>   If that is the case, I happen to have a thought: when an ioq uses up its time slice,
>>   we don't expire it immediately. Maybe we can give it a bit of bonus idling time to
>>   wait for new requests if this ioq's finish time and its ancestors' finish times are all
>>   much smaller than those of other entities in each corresponding service tree.
> 
> Have you tried it with "fairness" enabled? With "fairness" enabled, for
> sync queues I am waiting for one extra idle time slice "8ms" for queue
> to get backlogged again before I move to the next queue?
> 
> Otherwise try to increase the idle time length to higher value say "12ms"
> just to see if that has any impact.
> 
> Can you please also send me output of blkparse. It might give some idea
> how IO schedulers see IO pattern.

  Hi Vivek,

  Sorry for the late reply. I tried the "fairness" patch, but it does not seem to work.
  I've also tried to extend the idle value, which does not work either.
  The blktrace output is attached. It seems that the high priority ioq is deleted
  from the busy tree too often due to a lack of requests. My box has a single CPU and the
  CPU speed is a little slow. Maybe the two concurrent dd processes are contending for the
  CPU to submit requests; that's the reason the ioqs are not always backlogged.
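
  (For reference, blktrace output like the attached log is usually gathered with
  something along these lines; the device name and run length below are only
  examples, not necessarily what was used here.)

	blktrace -d /dev/sda -w 30 -o dd-test
	blkparse -i dd-test > dd-test.txt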

> 
> Thanks
> Vivek
> 
> 
> 

-- 
Regards
Gui Jianfeng

[-- Attachment #2: log.tgz --]
[-- Type: application/x-compressed, Size: 40948 bytes --]


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]         ` <20090406143556.GK7082-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
  2009-04-06 22:00             ` Nauman Rafique
@ 2009-04-07  5:59           ` Gui Jianfeng
  2009-04-13 13:40           ` Vivek Goyal
  2 siblings, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-04-07  5:59 UTC (permalink / raw)
  To: balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA

Balbir Singh wrote:
> * Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> [2009-03-11 21:56:46]:
> 
>> +
>> +			    lv0      lv1
>> +			  /	\  /     \
>> +			sda      sdb      sdc
>> +
>> +Also consider following cgroup hierarchy
>> +
>> +				root
>> +				/   \
>> +			       A     B
>> +			      / \    / \
>> +			     T1 T2  T3  T4
>> +
>> +A and B are two cgroups and T1, T2, T3 and T4 are tasks with-in those cgroups.
>> +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
>> +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
>> +IO control on intermediate logical block nodes (lv0, lv1).
>> +
>> +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
>> +only, there will not be any contention for resources between group A and B if
>> +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
>> +IO scheduler associated with the sdb will distribute disk bandwidth to
>> +group A and B proportionate to their weight.
> 
> What if we have partitions sda1, sda2 and sda3 instead of sda, sdb and
> sdc?

  Bandwidth control is done on a per-device basis, so with sda1, sda2 and sda3
  instead, they will all be contending on sda.

> 
>> +
>> +CFQ already has the notion of fairness and it provides differential disk
>> +access based on priority and class of the task. Just that it is flat and
>> +with cgroup stuff, it needs to be made hierarchical.
>> +
>> +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
>> +of fairness among various threads.
>> +
>> +One of the concerns raised with modifying IO schedulers was that we don't
>> +want to replicate the code in all the IO schedulers. These patches share
>> +the fair queuing code which has been moved to a common layer (elevator
>> +layer). Hence we don't end up replicating code across IO schedulers.
>> +
>> +Design
>> +======
>> +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
>> +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
>> +B-WF2Q+ algorithm for fair queuing.
>> +
> 
> References to BFQ, please. I can search them, but having them in the
> doc would be nice.
> 
>> +Why BFQ?
>> +
>> +- Not sure if weighted round robin logic of CFQ can be easily extended for
>> +  hierarchical mode. One of the things is that we can not keep dividing
>> +  the time slice of the parent group among its children. The deeper we go in
>> +  the hierarchy, the smaller the time slice gets.
>> +
>> +  One of the ways to implement hierarchical support could be to keep track
>> +  of virtual time and service provided to queue/group and select a queue/group
>> +  for service based on any of the various available algorithms.
>> +
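  As a side note, a minimal user-space flavour of the virtual time / finish tag
  bookkeeping mentioned above might look like the sketch below (illustrative names
  only; BFQ of course keeps the backlogged entities on augmented rbtrees instead of
  scanning an array):

    struct sched_entity {
            unsigned long weight;
            unsigned long start, finish;   /* virtual start/finish tags */
    };

    /* (re)compute tags when an entity becomes backlogged or is re-queued */
    static void update_tags(struct sched_entity *e, unsigned long vtime,
                            unsigned long service)
    {
            e->start  = (vtime > e->finish) ? vtime : e->finish;
            e->finish = e->start + service / e->weight;
    }

    /* pick the eligible entity (start <= vtime) with the smallest finish tag */
    static struct sched_entity *pick_entity(struct sched_entity **busy, int n,
                                            unsigned long vtime)
    {
            struct sched_entity *best = NULL;
            int i;

            for (i = 0; i < n; i++)
                    if (busy[i]->start <= vtime &&
                        (!best || busy[i]->finish < best->finish))
                            best = busy[i];
            return best;
    }
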
>> +  BFQ already had support for hierarchical scheduling, taking those patches
>> +  was easier.
>> +
> 
> Could you elaborate, when you say timeslices get smaller -
> 
> 1. Are you referring to inability to use higher resolution time?
> 2. Loss of throughput due to timeslice degradation?
> 
>> +- BFQ was designed to provide tighter bounds/delay w.r.t service provided
>> +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
>> +
>> +  Note: BFQ originally used amount of IO done (number of sectors) as notion
>> +        of service provided. IOW, it tried to provide fairness in terms of
>> +        actual IO done and not in terms of actual time disk access was
>> +	given to a queue.
> 
> I assume by sectors you mean the kernel sector size?
> 
>> +
>> +	This patchset modifies BFQ to provide fairness in the time domain because
>> +	that's what CFQ does. So the idea was to try not to deviate too much from
>> +	the CFQ behavior initially.
>> +
>> +	Providing fairness in the time domain makes accounting tricky because
>> +	due to command queueing, at one time there might be multiple requests
>> +	from different queues and there is no easy way to find out how much
>> +	disk time actually was consumed by the requests of a particular
>> +	queue. More about this in comments in source code.
>> +
>> +So it is yet to be seen if changing to the time domain still retains the BFQ guarantees
>> +or not.
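
  (A note on the accounting point above: the usual way out, and roughly what CFQ's
  time slices amount to, is to charge the whole wall-clock slice to whichever queue
  was active, since per-request attribution is impossible once several queues have
  requests sitting in the device. A sketch with illustrative names only:)

    struct ioq_acct {
            unsigned long slice_start;   /* when this queue was selected     */
            unsigned long service;       /* total disk time charged so far   */
    };

    /* on expiry or preemption: charge the elapsed slice to the active queue */
    static void ioq_charge_slice(struct ioq_acct *ioq, unsigned long now)
    {
            ioq->service += now - ioq->slice_start;
    }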
>> +
>> +From a data structure point of view, one can think of a tree per device, where
>> +io groups and io queues are hanging and are being scheduled using the B-WF2Q+
>> +algorithm. An io_queue is the end queue where requests are actually stored and
>> +dispatched from (like cfqq).
>> +
>> +These io queues are primarily created and managed by the end io schedulers
>> +depending on their semantics. For example, the noop, deadline and AS ioschedulers
>> +keep one io queue per cgroup and cfq keeps one io queue per io_context in
>> +a cgroup (apart from async queues).
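
  To picture the per-device tree being described, here is a rough sketch of the
  shape of the data (member names are illustrative, not the patchset's actual
  declarations):

    struct io_entity {                    /* anything B-WF2Q+ can schedule     */
            unsigned int  weight;
            unsigned long start, finish;  /* virtual start/finish tags         */
            struct io_group *parent;      /* NULL for the per-device root      */
    };

    struct io_queue {                     /* leaf; holds the actual requests   */
            struct io_entity entity;
            void *requests;               /* stand-in for the request list     */
    };

    struct io_group {                     /* one instance per cgroup per device */
            struct io_entity entity;      /* scheduled inside its parent        */
            void *service_tree;           /* stand-in for the rbtree of
                                             backlogged child entities          */
    };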
>> +
> 
> I assume there is one io_context per cgroup.
> 
>> +A request is mapped to an io group by elevator layer and which io queue it
>> +is mapped to with in group depends on ioscheduler. Currently "current" task
>> +is used to determine the cgroup (hence io group) of the request. Down the
>> +line we need to make use of bio-cgroup patches to map delayed writes to
>> +right group.
> 
> That seem acceptable
> 
>> +
>> +Going back to old behavior
>> +==========================
>> +In new scheme of things essentially we are creating hierarchical fair
>> +queuing logic in the elevator layer and changing IO schedulers to make use of
>> +that logic so that end IO schedulers start supporting hierarchical scheduling.
>> +
>> +Elevator layer continues to support the old interfaces. So even if fair queuing
>> +is enabled at the elevator layer, one can have both the new hierarchical scheduler as
>> +well as old non-hierarchical scheduler operating.
>> +
>> +Also noop, deadline and AS have the option of enabling hierarchical scheduling.
>> +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
>> +scheduling is disabled, noop, deadline and AS should retain their existing
>> +behavior.
>> +
>> +CFQ is the only exception where one can not disable fair queuing as it is
>> +needed for providing fairness among various threads even in non-hierarchical
>> +mode.
>> +
>> +Various user visible config options
>> +===================================
>> +CONFIG_IOSCHED_NOOP_HIER
>> +	- Enables hierarchical fair queuing in noop. Not selecting this option
>> +	  leads to old behavior of noop.
>> +
>> +CONFIG_IOSCHED_DEADLINE_HIER
>> +	- Enables hierarchical fair queuing in deadline. Not selecting this
>> +	  option leads to old behavior of deadline.
>> +
>> +CONFIG_IOSCHED_AS_HIER
>> +	- Enables hierarchical fair queuing in AS. Not selecting this option
>> +	  leads to old behavior of AS.
>> +
>> +CONFIG_IOSCHED_CFQ_HIER
>> +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
>> +	  still does fair queuing among various queues but it is flat and not
>> +	  hierarchical.
>> +
>> +Config options selected automatically
>> +=====================================
>> +These config options are not user visible and are selected/deselected
>> +automatically based on IO scheduler configurations.
>> +
>> +CONFIG_ELV_FAIR_QUEUING
>> +	- Enables/Disables the fair queuing logic at elevator layer.
>> +
>> +CONFIG_GROUP_IOSCHED
>> +	- Enables/Disables hierarchical queuing and associated cgroup bits.
>> +
>> +TODO
>> +====
>> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
>> +- Convert cgroup ioprio to notion of weight.
>> +- Anticipatory code will need more work. It is not working properly currently
>> +  and needs more thought.
> 
> What are the problems with the code?

  Anticipatory has its own idling logic, so the concern here is how to make
  AS work together with the common layer.

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-04-07  1:40       ` Gui Jianfeng
@ 2009-04-07  6:40             ` Gui Jianfeng
  0 siblings, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-04-07  6:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Gui Jianfeng wrote:
> Vivek Goyal wrote:
>> On Thu, Apr 02, 2009 at 02:39:40PM +0800, Gui Jianfeng wrote:
>>> Vivek Goyal wrote:
>>>> Hi All,
>>>>
>>>> Here is another posting for IO controller patches. Last time I had posted
>>>> RFC patches for an IO controller which did bio control per cgroup.
>>>>
>>>> http://lkml.org/lkml/2008/11/6/227
>>>>
>>>> One of the takeaway from the discussion in this thread was that let us
>>>> implement a common layer which contains the proportional weight scheduling
>>>> code which can be shared by all the IO schedulers.
>>>>
>>>   
>>>   Hi Vivek,
>>>
>>>   I did some tests on my *old* i386 box(with two concurrent dd running), and notice 
>>>   that IO Controller doesn't work fine in such situation. But it can work perfectly 
>>>   in my *new* x86 box. I dig into this problem, and i guess the major reason is that
>>>   my *old* i386 box is too slow, it can't ensure two running ioqs are always backlogged.
>> Hi Gui,
>>
>> Have you run top to see what's the percentage cpu usage. I suspect that
>> cpu is not keeping up pace disk to enqueue enough requests. I think
>> process might be blocked somewhere else so that it could not issue
>> requests. 
>>
>>>   If that is the case, I happens to have a thought. when an ioq uses up it time slice, 
>>>   we don't expire it immediately. May be we can give a piece of bonus time for idling 
>>>   to wait new requests if this ioq's finish time and its ancestor's finish time are all 
>>>   much smaller than other entities in each corresponding service tree.
>> Have you tried it with "fairness" enabled? With "fairness" enabled, for
>> sync queues I am waiting for one extra idle time slice "8ms" for queue
>> to get backlogged again before I move to the next queue?
>>
>> Otherwise try to increase the idle time length to higher value say "12ms"
>> just to see if that has any impact.
>>
>> Can you please also send me output of blkparse. It might give some idea
>> how IO schedulers see IO pattern.
> 
>   Hi Vivek,
> 
>   Sorry for the late reply, I tried the "fairness" patch, it seems not working.
>   I'v also tried to extend the idle value, not working either.
>   The blktrace output is attached. It seems that the high priority ioq is deleting
>   from busy tree too often due to lacking of requests. My box is single CPU and CPU
>   speed is a little slow. May be two concurrent dd is contending CPU to submit
>   requests, that's the reason for not always backlogged for ioqs.

  Hi Vivek,

  Sorry for the noise; there were some configuration errors when I tested, which led to the
  improper results.
  The "fairness" patch seems to work fine now! It keeps the high-priority ioq *always* backlogged :)

> 
>> Thanks
>> Vivek
>>
>>
>>
> 

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-04-05 15:15           ` Andrea Righi
@ 2009-04-07  6:40                 ` Vivek Goyal
       [not found]             ` <49D8CB17.7040501-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  1 sibling, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-07  6:40 UTC (permalink / raw)
  To: Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Sun, Apr 05, 2009 at 05:15:35PM +0200, Andrea Righi wrote:
> On 2009-03-12 19:01, Vivek Goyal wrote:
> > On Thu, Mar 12, 2009 at 12:11:46AM -0700, Andrew Morton wrote:
> >> On Wed, 11 Mar 2009 21:56:46 -0400 Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> [snip]
> >> Also..  there are so many IO controller implementations that I've lost
> >> track of who is doing what.  I do have one private report here that
> >> Andreas's controller "is incredibly productive for us and has allowed
> >> us to put twice as many users per server with faster times for all
> >> users".  Which is pretty stunning, although it should be viewed as a
> >> condemnation of the current code, I'm afraid.
> >>
> > 
> > I had looked briefly at Andrea's implementation in the past. I will look
> > again. I had thought that this approach did not get much traction.
> 
> Hi Vivek, sorry for my late reply. I periodically upload the latest
> versions of io-throttle here if you're still interested:
> http://download.systemimager.org/~arighi/linux/patches/io-throttle/
> 
> There's no consistent changes respect to the latest version I posted to
> the LKML, just rebasing to the recent kernels.
> 

Thanks Andrea. I will spend more time in looking through your patches
and do a bit of testing.

> > 
> > Some quick thoughts about this approach though.
> > 
> > - It is not a proportional weight controller. It is more of limiting
> >   bandwidth in absolute numbers for each cgroup on each disk.
> >  
> >   So each cgroup will define a rule for each disk in the system mentioning
> >   at what maximum rate that cgroup can issue IO to that disk and throttle
> >   the IO from that cgroup if rate has excedded.
> 
> Correct. Add also the proportional weight control has been in the TODO
> list since the early versions, but I never dedicated too much effort to
> implement this feature, I can focus on this and try to write something
> if we all think it is worth to be done.
> 

Please do have a look at this patchset; would you have done it differently
to implement proportional weight control?

Few thoughts/queries.

- Max bandwidth control and proportional weight control are two entirely
  different ways of controlling IO. The former tries to put an upper
  limit on the IO rate, and the latter tries to guarantee a minimum
  percentage share of the disk.

  How does one determine what throughput rate you will get from a disk? That
  is very much dependent on the workload, and miscalculations can lead to a
  particular cgroup getting lower BW than intended.

  I am assuming that one can probably do some random read-write IO test
  to try to get some idea of disk throughput. If that's the case, then
  with proportional weight control you should also be able to predict the
  minimum BW a cgroup will be getting (see the worked example after this
  list). The only difference will be that a cgroup can also get higher BW
  if there is no contention present, and I am wondering how getting more BW
  than the promised minimum is harmful.

- I can think of at least one use of an upper limit controller where we
  might have spare IO resources but still don't want to give them to a
  cgroup because the customer has not paid for that kind of service level. In
  those cases we need to implement an upper limit as well.

  Maybe the proportional weight and max bw controllers can co-exist depending
  on what the user's requirements are.
 
  If yes, then can't this control be done at the same layer/level where
  proportional weight control is being done? IOW, this set of patches is
  trying to do proportional weight control at the IO scheduler level. I think
  we should be able to store a max rate as another per-cgroup attribute
  (apart from weight) and not dispatch requests from the queue if
  we have exceeded the max BW specified by the user.

- Have you thought of doing hierarchical control? 

- What happens to the notion of CFQ task classes and task priority? It looks
  like the max bw rule supersedes everything. There is no way for an RT task
  to get an unlimited amount of disk BW even if it wants to? (There is no notion
  of an RT cgroup etc.)
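
To make the comparison above a bit more concrete, here is a tiny stand-alone
sketch (all numbers are made up for illustration only): with weights one can
still state a minimum share once a worst-case disk rate is assumed, while a
fixed cap also gives up any spare bandwidth.

    #include <stdio.h>

    int main(void)
    {
            double disk_bw   = 60.0;               /* assumed worst-case MB/s */
            double weight[2] = { 2.0, 1.0 };       /* proportional weights    */
            double cap[2]    = { 20.0, 10.0 };     /* absolute caps in MB/s   */
            double wsum      = weight[0] + weight[1];
            int i;

            for (i = 0; i < 2; i++) {
                    printf("cgroup%d: weight %.0f => at least %.1f MB/s under "
                           "contention, the full %.1f MB/s when alone\n",
                           i, weight[i], disk_bw * weight[i] / wsum, disk_bw);
                    printf("cgroup%d: capped   => at most %.1f MB/s even when "
                           "the disk is otherwise idle\n", i, cap[i]);
            }
            return 0;
    }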

> > 
> >   Above requirement can create configuration problems.
> > 
> > 	- If there are large number of disks in system, per cgroup one shall
> > 	  have to create rules for each disk. Until and unless admin knows
> > 	  what applications are in which cgroup and strictly what disk
> > 	  these applications do IO to and create rules for only those
> >  	  disks.
> 
> I don't think this is a huge problem anyway. IMHO a userspace tool, e.g.
> a script, would be able to efficiently create/modify rules parsing user
> defined rules in some human-readable form (config files, etc.), even in
> presence of hundreds of disk. The same is valid for dm-ioband I think.
> 
> > 
> > 	- I think problem gets compounded if there is a hierarchy of
> > 	  logical devices. I think in that case one shall have to create
> > 	  rules for logical devices and not actual physical devices.
> 
> With logical devices you mean device-mapper devices (i.e. LVM, software
> RAID, etc.)? or do you mean that we need to introduce the concept of
> "logical device" to easily (quickly) configure IO requirements and then
> map those logical devices to the actual physical devices? In this case I
> think this can be addressed in userspace. Or maybe I'm totally missing
> the point here.

Yes, I meant LVM, software RAID etc. So if I have got many disks in the system
and I have created a software raid on some of them, do I need to create rules for
the lvm devices or for the physical devices behind those lvm devices? I am assuming
that it will be the logical devices.

So I need to know exactly which devices the applications in a particular
cgroup are going to do IO to, and also know exactly how many cgroups are
contending for that device, and also know what worst-case disk rate I can
expect from that device, before I can do a good job of giving a
reasonable value to the max rate of that cgroup on a particular device?

> 
> > 
> > - Because it is not proportional weight distribution, if some
> >   cgroup is not using its planned BW, other group sharing the
> >   disk can not make use of spare BW.  
> > 	
> 
> Right.
> 
> > - I think one should know in advance the throughput rate of underlying media
> >   and also know competing applications so that one can statically define
> >   the BW assigned to each cgroup on each disk.
> > 
> >   This will be difficult. Effective BW extracted out of a rotational media
> >   is dependent on the seek pattern so one shall have to either try to make
> >   some conservative estimates and try to divide BW (we will not utilize disk
> >   fully) or take some peak numbers and divide BW (cgroup might not get the
> >   maximum rate configured).
> 
> Correct. I think the proportional weight approach is the only solution
> to efficiently use the whole BW. OTOH absolute limiting rules offer a
> better control over QoS, because you can totally remove performance
> bursts/peaks that could break QoS requirements for short periods of
> time.

Can you please give a little more detail here regarding how QoS requirements
are not met with proportional weight?

> So, my "ideal" IO controller should allow to define both rules:
> absolute and proportional limits.
> 
> I still have to look closely at your patchset anyway. I will do and give
> a feedback.

Your feedback is always welcome.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]                 ` <20090407064046.GB20498-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-04-08 20:37                   ` Andrea Righi
  0 siblings, 0 replies; 190+ messages in thread
From: Andrea Righi @ 2009-04-08 20:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Tue, Apr 07, 2009 at 02:40:46AM -0400, Vivek Goyal wrote:
> On Sun, Apr 05, 2009 at 05:15:35PM +0200, Andrea Righi wrote:
> > On 2009-03-12 19:01, Vivek Goyal wrote:
> > > On Thu, Mar 12, 2009 at 12:11:46AM -0700, Andrew Morton wrote:
> > >> On Wed, 11 Mar 2009 21:56:46 -0400 Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > [snip]
> > >> Also..  there are so many IO controller implementations that I've lost
> > >> track of who is doing what.  I do have one private report here that
> > >> Andreas's controller "is incredibly productive for us and has allowed
> > >> us to put twice as many users per server with faster times for all
> > >> users".  Which is pretty stunning, although it should be viewed as a
> > >> condemnation of the current code, I'm afraid.
> > >>
> > > 
> > > I had looked briefly at Andrea's implementation in the past. I will look
> > > again. I had thought that this approach did not get much traction.
> > 
> > Hi Vivek, sorry for my late reply. I periodically upload the latest
> > versions of io-throttle here if you're still interested:
> > http://download.systemimager.org/~arighi/linux/patches/io-throttle/
> > 
> > There's no consistent changes respect to the latest version I posted to
> > the LKML, just rebasing to the recent kernels.
> > 
> 
> Thanks Andrea. I will spend more time in looking through your patches
> and do a bit of testing.
> 
> > > 
> > > Some quick thoughts about this approach though.
> > > 
> > > - It is not a proportional weight controller. It is more of limiting
> > >   bandwidth in absolute numbers for each cgroup on each disk.
> > >  
> > >   So each cgroup will define a rule for each disk in the system mentioning
> > >   at what maximum rate that cgroup can issue IO to that disk and throttle
> > >   the IO from that cgroup if rate has excedded.
> > 
> > Correct. Add also the proportional weight control has been in the TODO
> > list since the early versions, but I never dedicated too much effort to
> > implement this feature, I can focus on this and try to write something
> > if we all think it is worth to be done.
> > 
> 
> Please do have a look at this patchset and would you do it differently
> to implement proportional weight control?
> 
> Few thoughts/queries.
> 
> - Max bandwidth control and Prportional weight control are two entirely
>   different ways of controlling the IO. Former one tries to put an upper
>   limit on the IO rate and later one kind of tries to  gurantee minmum
>   percentage share of disk.  

Agree.

> 
>   How does an determine what throughput rate you will get from a disk? That
>   is so much dependent on workload and miscalculations can lead to getting
>   lower BW for a particular cgroup?
> 
>   I am assuming that one can probably do some random read-write IO test
>   to try to get some idea of disk throughput. If that's the case, then
>   in proportional weight control also you should be able to predict the
>   minimum BW a cgroup will be getting? The only difference will be that
>   a cgroup can get higher BW also if there is no contention present and
>   I am wondring that how getting more BW than promised minumum is harmful?

IMHO we shouldn't care too much about how to extract the exact BW from a
disk. With proportional weights we can directly map different levels of
service to different weights.

With absolute limiting we can measure the consumed BW post facto and try
to do our best to satisfy the limits defined by the user (absolute max,
min or proportional). Predicting a priori how much BW a particular
application's workload will consume is a very hard task (maybe even
impossible), and it probably doesn't give huge advantages with respect to
the approach we're currently using. I think this is true for both solutions.

> 
> - I can think of atleast one usage of uppper limit controller where we
>   might have spare IO resources still we don't want to give it to a
>   cgroup because customer has not paid for that kind of service level. In
>   those cases we need to implement uppper limit also.
> 
>   May be prportional weight and max bw controller can co-exist depending
>   on what user's requirements are.
>  
>   If yes, then can't this control be done at the same layer/level where
>   proportional weight control is being done? IOW, this set of patches is
>   trying to do prportional weight control at IO scheduler level. I think
>   we should be able to store another max rate as another feature in 
>   cgroup (apart from weight) and not dispatch requests from the queue if
>   we have exceeded the max BW as specified by the user?

The more I think about a "perfect" solution (at least for my
requirements), the more I'm convinced that we need both functionalities.

I think it would be possible to implement both proportional and limiting
rules at the same level (e.g., the IO scheduler), but we also need to
address the memory consumption problem (I still need to review your
patchset in detail and I'm going to test it soon :), so I don't know if
you already addressed this issue).

IOW if we simply don't dispatch requests and we don't throttle the tasks
in the cgroup that exceeds its limit, how do we avoid the waste of
memory due to the succeeding IO requests and the increasingly dirty
pages in the page cache (that are also hard to reclaim)? I may be wrong,
but I think we talked about this problem in a previous email... sorry, I
can't find the discussion in my mail archives.

IMHO a nice approach would be to measure IO consumption at the IO
scheduler level, and control IO by applying proportional weights / absolute
limits _both_ at the IO scheduler / elevator level _and_ at the same
time block the tasks from dirtying memory that would generate additional
IO requests.

Anyway, there's no need to provide this with a single IO controller; we
could split the problem into two parts: 1) provide a proportional /
absolute IO controller in the IO schedulers and 2) allow setting, for
example, a maximum limit of dirty pages for each cgroup (a sketch of such
a check follows below).
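
Just to illustrate what point 2) could look like, a minimal sketch follows; the
structure and the check are hypothetical, not an existing kernel interface:

    /* hypothetical per-cgroup dirty-page accounting, illustration only */
    struct iocg_dirty {
            unsigned long nr_dirty;      /* pages currently dirty in the cgroup */
            unsigned long dirty_limit;   /* configured maximum for the cgroup   */
    };

    /*
     * Would be called on the write path before a task is allowed to dirty
     * another page: returns 1 if the task should be throttled (made to wait
     * for writeback) instead of generating yet more IO.
     */
    static int iocg_should_throttle_dirty(const struct iocg_dirty *iocg)
    {
            return iocg->nr_dirty >= iocg->dirty_limit;
    }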

Maybe I'm just repeating what we already said in a previous
discussion... in this case sorry for the duplicate thoughts. :)

> 
> - Have you thought of doing hierarchical control? 
> 

Providing hierarchies in cgroups is in general expensive; deeper
hierarchies imply checking all the way up to the root cgroup, so I think
we need to be very careful and be aware of the trade-offs before
providing such a feature. For this particular case (IO controller),
wouldn't it be simpler and more efficient to just ignore hierarchies in
the kernel and handle them appropriately in userspace? For absolute
limiting rules this isn't difficult at all: just imagine a config file
and a script or a daemon that dynamically creates the appropriate cgroups
and configures them according to what is defined in the configuration
file.

I think we can simply define hierarchical dependencies in the
configuration file, translate them into absolute values and use the
absolute values to configure the cgroups' properties.

For example, we can just check that the BW allocated for a particular
parent cgroup is not greater than the total BW allocated for the
children. And for each child just use the min(parent_BW, BW) or equally
divide the parent's BW among the children, etc.
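
A tiny sketch of that userspace translation step (the values and the helper are
made up, just to show the two policies mentioned above):

    #include <stdio.h>

    /* a child's effective limit never exceeds its parent's limit */
    static double child_effective_bw(double parent_bw, double child_bw)
    {
            return child_bw < parent_bw ? child_bw : parent_bw;
    }

    int main(void)
    {
            double parent_bw   = 50.0;                  /* MB/s, illustrative */
            double child_bw[3] = { 30.0, 60.0, 20.0 };
            int i, n = 3;

            for (i = 0; i < n; i++)
                    printf("child%d: min(parent, child) = %.1f MB/s, "
                           "equal split = %.1f MB/s\n", i,
                           child_effective_bw(parent_bw, child_bw[i]),
                           parent_bw / n);
            return 0;
    }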

> - What happens to the notion of CFQ task classes and task priority. Looks
>   like max bw rule supercede everything. There is no way that an RT task
>   get unlimited amount of disk BW even if it wants to? (There is no notion
>   of RT cgroup etc)

What about moving all the RT tasks into a separate cgroup with unlimited
BW?

> 
> > > 
> > >   Above requirement can create configuration problems.
> > > 
> > > 	- If there are large number of disks in system, per cgroup one shall
> > > 	  have to create rules for each disk. Until and unless admin knows
> > > 	  what applications are in which cgroup and strictly what disk
> > > 	  these applications do IO to and create rules for only those
> > >  	  disks.
> > 
> > I don't think this is a huge problem anyway. IMHO a userspace tool, e.g.
> > a script, would be able to efficiently create/modify rules parsing user
> > defined rules in some human-readable form (config files, etc.), even in
> > presence of hundreds of disk. The same is valid for dm-ioband I think.
> > 
> > > 
> > > 	- I think problem gets compounded if there is a hierarchy of
> > > 	  logical devices. I think in that case one shall have to create
> > > 	  rules for logical devices and not actual physical devices.
> > 
> > With logical devices you mean device-mapper devices (i.e. LVM, software
> > RAID, etc.)? or do you mean that we need to introduce the concept of
> > "logical device" to easily (quickly) configure IO requirements and then
> > map those logical devices to the actual physical devices? In this case I
> > think this can be addressed in userspace. Or maybe I'm totally missing
> > the point here.
> 
> Yes, I meant LVM, Software RAID etc. So if I have got many disks in the system
> and I have created software raid on some of them, I need to create rules for
> lvm devices or physical devices behind those lvm devices? I am assuming
> that it will be logical devices.
> 
> So I need to know exactly to what all devices applications in a particular
> cgroup is going to do IO, and also know exactly how many cgroups are
> contending for that cgroup, and also know what worst case disk rate I can
> expect from that device and then I can do a good job of giving a
> reasonable value to the max rate of that cgroup on a particular device?

ok, I understand. For these cases dm-ioband perfectly addresses the
problem. For the general case, I think the only solution is to provide a
common interface that each dm subsystem must call to account IO and
apply limiting and proportional rules.

> 
> > 
> > > 
> > > - Because it is not proportional weight distribution, if some
> > >   cgroup is not using its planned BW, other group sharing the
> > >   disk can not make use of spare BW.  
> > > 	
> > 
> > Right.
> > 
> > > - I think one should know in advance the throughput rate of underlying media
> > >   and also know competing applications so that one can statically define
> > >   the BW assigned to each cgroup on each disk.
> > > 
> > >   This will be difficult. Effective BW extracted out of a rotational media
> > >   is dependent on the seek pattern so one shall have to either try to make
> > >   some conservative estimates and try to divide BW (we will not utilize disk
> > >   fully) or take some peak numbers and divide BW (cgroup might not get the
> > >   maximum rate configured).
> > 
> > Correct. I think the proportional weight approach is the only solution
> > to efficiently use the whole BW. OTOH absolute limiting rules offer a
> > better control over QoS, because you can totally remove performance
> > bursts/peaks that could break QoS requirements for short periods of
> > time.
> 
> Can you please give little more details here regarding how QoS requirements
> are not met with proportional weight?

With proportional weights the whole bandwidth is allocated if no one
else is using it. When IO is submitted, other tasks with a higher weight
can be forced to sleep until the IO generated by the low-weight tasks has
been completely dispatched. Or, to some extent, the usual priority
inversion problems.

Maybe it's not an issue at all in most cases, but a solution that is also
able to provide a real partitioning of the available resources can be
profitably used by those who need to guarantee _strict_ BW requirements
(soft real-time, maximizing the responsiveness of certain services, etc.),
because in that case we're sure that a certain amount of "spare" BW will
always be available when needed by some "critical" services.

> 
> > So, my "ideal" IO controller should allow to define both rules:
> > absolute and proportional limits.
> > 
> > I still have to look closely at your patchset anyway. I will do and give
> > a feedback.
> 
> You feedback is always welcome.
> 
> Thanks
> Vivek

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-04-07  6:40                 ` Vivek Goyal
  (?)
  (?)
@ 2009-04-08 20:37                 ` Andrea Righi
  2009-04-16 18:37                     ` Vivek Goyal
  -1 siblings, 1 reply; 190+ messages in thread
From: Andrea Righi @ 2009-04-08 20:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz

On Tue, Apr 07, 2009 at 02:40:46AM -0400, Vivek Goyal wrote:
> On Sun, Apr 05, 2009 at 05:15:35PM +0200, Andrea Righi wrote:
> > On 2009-03-12 19:01, Vivek Goyal wrote:
> > > On Thu, Mar 12, 2009 at 12:11:46AM -0700, Andrew Morton wrote:
> > >> On Wed, 11 Mar 2009 21:56:46 -0400 Vivek Goyal <vgoyal@redhat.com> wrote:
> > [snip]
> > >> Also..  there are so many IO controller implementations that I've lost
> > >> track of who is doing what.  I do have one private report here that
> > >> Andreas's controller "is incredibly productive for us and has allowed
> > >> us to put twice as many users per server with faster times for all
> > >> users".  Which is pretty stunning, although it should be viewed as a
> > >> condemnation of the current code, I'm afraid.
> > >>
> > > 
> > > I had looked briefly at Andrea's implementation in the past. I will look
> > > again. I had thought that this approach did not get much traction.
> > 
> > Hi Vivek, sorry for my late reply. I periodically upload the latest
> > versions of io-throttle here if you're still interested:
> > http://download.systemimager.org/~arighi/linux/patches/io-throttle/
> > 
> > There's no consistent changes respect to the latest version I posted to
> > the LKML, just rebasing to the recent kernels.
> > 
> 
> Thanks Andrea. I will spend more time in looking through your patches
> and do a bit of testing.
> 
> > > 
> > > Some quick thoughts about this approach though.
> > > 
> > > - It is not a proportional weight controller. It is more about limiting
> > >   bandwidth in absolute numbers for each cgroup on each disk.
> > > 
> > >   So each cgroup will define a rule for each disk in the system specifying
> > >   the maximum rate at which that cgroup can issue IO to that disk, and the
> > >   IO from that cgroup is throttled if that rate is exceeded.
> > 
> > Correct. Also, proportional weight control has been on the TODO list
> > since the early versions, but I never dedicated much effort to
> > implementing this feature. I can focus on it and try to write something
> > if we all think it is worth doing.
> > 
> 
> Please do have a look at this patchset - would you do anything differently
> to implement proportional weight control?
> 
> Few thoughts/queries.
> 
> - Max bandwidth control and proportional weight control are two entirely
>   different ways of controlling IO. The former tries to put an upper
>   limit on the IO rate and the latter tries to guarantee a minimum
>   percentage share of the disk.

Agree.

> 
>   How does one determine what throughput rate you will get from a disk? That
>   is very much dependent on the workload, and miscalculations can lead to a
>   particular cgroup getting lower BW.
> 
>   I am assuming that one can probably do some random read-write IO test
>   to try to get some idea of disk throughput. If that's the case, then
>   with proportional weight control you should also be able to predict the
>   minimum BW a cgroup will be getting. The only difference is that a
>   cgroup can also get higher BW if there is no contention present, and
>   I am wondering how getting more BW than the promised minimum is harmful?

IMHO we shouldn't care too much about how to extract the exact BW from a
disk. With proportional weights we can directly map different levels of
service to different weights.

With absolute limiting we can measure the consumed BW post facto and do
our best to satisfy the limits defined by the user (absolute max, min or
proportional). Predicting a priori how much BW a particular application's
workload will consume is a very hard task (maybe even impossible), and it
probably doesn't give huge advantages with respect to the approach we're
currently using. I think this is true for both solutions.
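
Just to make the "measure post facto and enforce an absolute max" idea
concrete, here is a minimal token-bucket sketch in plain C. It is purely
illustrative: the names and the one-second credit cap are my own
assumptions, not code from io-throttle or from this patchset.

#include <stdbool.h>

struct bw_limiter {
	unsigned long long iorate;	/* allowed bytes per second */
	long long bucket;		/* accumulated credit, in bytes */
	unsigned long long last_ns;	/* timestamp of the last refill */
};

/* Returns true if this IO may be dispatched, false if the caller should sleep. */
static bool bw_account(struct bw_limiter *l, unsigned long bytes,
		       unsigned long long now_ns)
{
	/* Refill the credit in proportion to the elapsed time. */
	l->bucket += (long long)((now_ns - l->last_ns) * l->iorate / 1000000000ULL);
	l->last_ns = now_ns;

	/* Never accumulate more than one second worth of credit. */
	if (l->bucket > (long long)l->iorate)
		l->bucket = (long long)l->iorate;

	/* Charge this IO; a negative bucket means the group exceeded its rate. */
	l->bucket -= (long long)bytes;
	return l->bucket >= 0;
}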

> 
> - I can think of at least one use of an upper limit controller where we
>   might have spare IO resources but still don't want to give them to a
>   cgroup because the customer has not paid for that kind of service level.
>   In those cases we need to implement an upper limit as well.
> 
>   Maybe the proportional weight and max BW controllers can co-exist,
>   depending on what the user's requirements are.
> 
>   If yes, then can't this control be done at the same layer/level where
>   proportional weight control is being done? IOW, this set of patches is
>   trying to do proportional weight control at the IO scheduler level. I think
>   we should be able to store a max rate as another per-cgroup attribute
>   (apart from weight) and not dispatch requests from the queue if
>   we have exceeded the max BW specified by the user?

The more I think about a "perfect" solution (at least for my
requirements), the more I'm convinced that we need both functionalities.

I think it would be possible to implement both proportional and limiting
rules at the same level (e.g., the IO scheduler), but we also need to
address the memory consumption problem (I still need to review your
patchset in detail and I'm going to test it soon :), so I don't know if
you have already addressed this issue).

IOW, if we simply don't dispatch requests but don't throttle the tasks
in the cgroup that exceeds its limit, how do we avoid the waste of
memory caused by the subsequent IO requests and by the growing number of
dirty pages in the page cache (which are also hard to reclaim)? I may be
wrong, but I think we talked about this problem in a previous email...
sorry, I can't find the discussion in my mail archives.

IMHO a nice approach would be to measure IO consumption at the IO
scheduler level, and to control IO by applying proportional weights /
absolute limits at the IO scheduler / elevator level _and_ at the same
time blocking the tasks from dirtying memory that would generate
additional IO requests.

Anyway, there's no need to provide this with a single IO controller; we
could split the problem in two parts: 1) provide a proportional /
absolute IO controller in the IO schedulers and 2) allow setting, for
example, a maximum limit on dirty pages for each cgroup.
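
As a rough sketch of point 2) - purely conceptual, with invented names,
not code from any posted patchset - the check could look like a
per-cgroup analogue of the global dirty threshold test:

/*
 * Conceptual sketch only: a per-cgroup variant of the global dirty
 * threshold check.  Structure and function names are invented.
 */
struct io_cgroup_dirty {
	unsigned long nr_dirty;		/* dirty pages charged to this cgroup */
	unsigned long dirty_limit;	/* per-cgroup limit on dirty pages */
};

/* Non-zero means the dirtying task should be throttled (made to write back). */
static int io_cgroup_over_dirty_limit(struct io_cgroup_dirty *icd)
{
	return icd->nr_dirty >= icd->dirty_limit;
}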

Maybe I'm just repeating what we already said in a previous
discussion... in this case sorry for the duplicate thoughts. :)

> 
> - Have you thought of doing hierarchical control? 
> 

Providing hierarchies in cgroups is in general expensive; deeper
hierarchies imply checking all the way up to the root cgroup, so I think
we need to be very careful and be aware of the trade-offs before
providing such a feature. For this particular case (IO controller),
wouldn't it be simpler and more efficient to just ignore hierarchies in
the kernel and handle them appropriately in userspace? For absolute
limiting rules this isn't difficult at all: just imagine a config file
and a script or a daemon that dynamically creates the appropriate cgroups
and configures them according to what is defined in the configuration
file.

I think we can simply define hierarchical dependencies in the
configuration file, translate them into absolute values and use those
absolute values to configure the cgroups' properties.

For example, we can just check that the BW allocated to a particular
parent cgroup is not greater than the total BW allocated to the
children, and for each child simply use min(parent_BW, BW), or equally
divide the parent's BW among the children, etc. (a sketch of such a
translation follows).
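
To make that userspace translation concrete, a minimal sketch in plain C
(my own illustration: the config representation, the KB/s unit and the
min(parent_BW, BW) rule are assumptions - equal division among children
would be a one-line change):

#include <stdio.h>

#define MAX_CHILDREN 16

struct cg_node {
	const char *name;
	unsigned long long bw;			/* requested BW in KB/s, 0 = unset */
	int nr_children;
	struct cg_node *children[MAX_CHILDREN];
};

/* Clamp every child to its parent's limit (min(parent_BW, BW)) and recurse. */
static void flatten(struct cg_node *node, unsigned long long parent_bw)
{
	int i;

	if (node->bw == 0 || node->bw > parent_bw)
		node->bw = parent_bw;

	/* A real tool would write this value into the cgroup's rule file. */
	printf("cgroup %s -> absolute limit %llu KB/s\n", node->name, node->bw);

	for (i = 0; i < node->nr_children; i++)
		flatten(node->children[i], node->bw);
}

int main(void)
{
	struct cg_node a = { "A", 80000, 0, { NULL } };
	struct cg_node b = { "B", 0, 0, { NULL } };
	struct cg_node root = { "root", 100000, 2, { &a, &b } };

	flatten(&root, root.bw);
	return 0;
}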

> - What happens to the notion of CFQ task classes and task priority? It looks
>   like the max BW rule supersedes everything. There is no way for an RT task
>   to get an unlimited amount of disk BW even if it wants to? (There is no
>   notion of an RT cgroup, etc.)

What about moving all the RT tasks in a separate cgroup with unlimited
BW?

> 
> > > 
> > >   Above requirement can create configuration problems.
> > > 
> > > 	- If there are a large number of disks in the system, one shall have
> > > 	  to create rules for each disk per cgroup, unless the admin knows
> > > 	  exactly which applications are in which cgroup and which disks
> > > 	  these applications do IO to, and creates rules for only those
> > > 	  disks.
> > 
> > I don't think this is a huge problem anyway. IMHO a userspace tool, e.g.
> > a script, would be able to efficiently create/modify rules by parsing
> > user-defined rules in some human-readable form (config files, etc.), even
> > in the presence of hundreds of disks. The same holds for dm-ioband I think.
> > 
> > > 
> > > 	- I think problem gets compounded if there is a hierarchy of
> > > 	  logical devices. I think in that case one shall have to create
> > > 	  rules for logical devices and not actual physical devices.
> > 
> > By logical devices do you mean device-mapper devices (i.e. LVM, software
> > RAID, etc.)? Or do you mean that we need to introduce the concept of a
> > "logical device" to easily (quickly) configure IO requirements and then
> > map those logical devices to the actual physical devices? In that case I
> > think this can be addressed in userspace. Or maybe I'm totally missing
> > the point here.
> 
> Yes, I meant LVM, software RAID etc. So if I have many disks in the system
> and I have created a software RAID on some of them, do I need to create rules
> for the LVM devices or for the physical devices behind those LVM devices? I am
> assuming it would be the logical devices.
> 
> So I need to know exactly which devices the applications in a particular
> cgroup are going to do IO to, and also know exactly how many cgroups are
> contending for each of those devices, and also know what worst-case disk rate
> I can expect from each device, and then I can do a good job of giving a
> reasonable value to the max rate of that cgroup on a particular device?

OK, I understand. For these cases dm-ioband addresses the problem
perfectly. For the general case, I think the only solution is to provide
a common interface that each dm subsystem must call to account IO and
apply limiting and proportional rules.

> 
> > 
> > > 
> > > - Because it is not proportional weight distribution, if some
> > >   cgroup is not using its planned BW, other groups sharing the
> > >   disk cannot make use of the spare BW.
> > > 	
> > 
> > Right.
> > 
> > > - I think one should know in advance the throughput rate of underlying media
> > >   and also know competing applications so that one can statically define
> > >   the BW assigned to each cgroup on each disk.
> > > 
> > >   This will be difficult. Effective BW extracted out of a rotational media
> > >   is dependent on the seek pattern so one shall have to either try to make
> > >   some conservative estimates and try to divide BW (we will not utilize disk
> > >   fully) or take some peak numbers and divide BW (cgroup might not get the
> > >   maximum rate configured).
> > 
> > Correct. I think the proportional weight approach is the only solution
> > to efficiently use the whole BW. OTOH absolute limiting rules offer
> > better control over QoS, because you can completely remove the performance
> > bursts/peaks that could break QoS requirements for short periods of
> > time.
> 
> Can you please give a little more detail here regarding how QoS requirements
> are not met with proportional weights?

With proportional weights the whole bandwidth is allocated if no one
else is using it. So when IO is submitted, tasks with a higher weight
can be forced to sleep until the IO already generated by the low-weight
tasks has been completely dispatched, with all the usual priority
inversion problems that follow from that.

Maybe it's not an issue at all in most cases, but a solution that also
provides real partitioning of the available resources can be used
profitably by those who need to guarantee _strict_ BW requirements
(soft real-time, maximizing the responsiveness of certain services,
etc.), because in that case we're sure that a certain amount of "spare"
BW will always be available when some "critical" service needs it.

> 
> > So, my "ideal" IO controller should allow to define both rules:
> > absolute and proportional limits.
> > 
> > I still have to look closely at your patchset anyway. I will do and give
> > a feedback.
> 
> You feedback is always welcome.
> 
> Thanks
> Vivek

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found] ` <1236823015-4183-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (11 preceding siblings ...)
  2009-04-02  6:39   ` Gui Jianfeng
@ 2009-04-10  9:33   ` Gui Jianfeng
  2009-05-01  1:25   ` Divyesh Shah
  13 siblings, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-04-10  9:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Vivek Goyal wrote:
> Hi All,
> 
> Here is another posting for IO controller patches. Last time I had posted
> RFC patches for an IO controller which did bio control per cgroup.

  Hi Vivek,

  I got the following OOPS when testing, can't reproduce again :(

kernel BUG at block/elevator-fq.c:1396!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/block/hdb/queue/scheduler
Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbd
Pid: 5032, comm: rmdir Not tainted (2.6.29-rc7-vivek #17) Veriton M460
EIP: 0060:[<c04ec7de>] EFLAGS: 00010082 CPU: 0
EIP is at iocg_destroy+0xdc/0x14e
EAX: 00000000 EBX: f62278b4 ECX: f6207800 EDX: f6227904
ESI: f6227800 EDI: f62278a0 EBP: 00000003 ESP: c8790f00
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process rmdir (pid: 5032, ti=c8790000 task=f6636960 task.ti=c8790000)
Stack:
 f53cc5c0 f62b7258 f10991c8 00000282 f6227800 c0733fa0 ec1c5140 edfc6d34
 c8790000 c04463ce f6a4c84c edfc6d34 0804c840 c048883c f6a4c84c f6a4c84c
 c04888d6 f6a4c84c 00000000 c04897de f6a4c84c c048504c f115f4c0 ebc37954
Call Trace:
 [<c04463ce>] cgroup_diput+0x41/0x8c
 [<c048883c>] dentry_iput+0x45/0x5e
 [<c04888d6>] d_kill+0x19/0x32
 [<c04897de>] dput+0xd8/0xdf
 [<c048504c>] do_rmdir+0x8f/0xb6
 [<c06330fc>] do_page_fault+0x2a2/0x579
 [<c0402fc1>] sysenter_do_call+0x12/0x21
 [<c0630000>] schedule+0x641/0x830
Code: 08 00 74 04 0f 0b eb fe 83 7f 04 00 74 04 0f 0b eb fe 45 83 c3 1c 83 fd 0
EIP: [<c04ec7de>] iocg_destroy+0xdc/0x14e SS:ESP 0068:c8790f00

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-03-12  1:56 ` Vivek Goyal
                   ` (6 preceding siblings ...)
  (?)
@ 2009-04-10  9:33 ` Gui Jianfeng
       [not found]   ` <49DF1256.7080403-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
                     ` (2 more replies)
  -1 siblings, 3 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-04-10  9:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

Vivek Goyal wrote:
> Hi All,
> 
> Here is another posting for IO controller patches. Last time I had posted
> RFC patches for an IO controller which did bio control per cgroup.

  Hi Vivek,

  I got the following OOPS when testing, can't reproduce again :(

kernel BUG at block/elevator-fq.c:1396!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/block/hdb/queue/scheduler
Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbd
Pid: 5032, comm: rmdir Not tainted (2.6.29-rc7-vivek #17) Veriton M460
EIP: 0060:[<c04ec7de>] EFLAGS: 00010082 CPU: 0
EIP is at iocg_destroy+0xdc/0x14e
EAX: 00000000 EBX: f62278b4 ECX: f6207800 EDX: f6227904
ESI: f6227800 EDI: f62278a0 EBP: 00000003 ESP: c8790f00
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process rmdir (pid: 5032, ti=c8790000 task=f6636960 task.ti=c8790000)
Stack:
 f53cc5c0 f62b7258 f10991c8 00000282 f6227800 c0733fa0 ec1c5140 edfc6d34
 c8790000 c04463ce f6a4c84c edfc6d34 0804c840 c048883c f6a4c84c f6a4c84c
 c04888d6 f6a4c84c 00000000 c04897de f6a4c84c c048504c f115f4c0 ebc37954
Call Trace:
 [<c04463ce>] cgroup_diput+0x41/0x8c
 [<c048883c>] dentry_iput+0x45/0x5e
 [<c04888d6>] d_kill+0x19/0x32
 [<c04897de>] dput+0xd8/0xdf
 [<c048504c>] do_rmdir+0x8f/0xb6
 [<c06330fc>] do_page_fault+0x2a2/0x579
 [<c0402fc1>] sysenter_do_call+0x12/0x21
 [<c0630000>] schedule+0x641/0x830
Code: 08 00 74 04 0f 0b eb fe 83 7f 04 00 74 04 0f 0b eb fe 45 83 c3 1c 83 fd 0
EIP: [<c04ec7de>] iocg_destroy+0xdc/0x14e SS:ESP 0068:c8790f00

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found]   ` <49DF1256.7080403-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-04-10 17:49     ` Nauman Rafique
  2009-04-13 13:09     ` Vivek Goyal
  1 sibling, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-10 17:49 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: menage-hpIqsD4AKlfQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA

On Fri, Apr 10, 2009 at 2:33 AM, Gui Jianfeng
<guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
> Vivek Goyal wrote:
>> Hi All,
>>
>> Here is another posting for IO controller patches. Last time I had posted
>> RFC patches for an IO controller which did bio control per cgroup.
>
>  Hi Vivek,
>
>  I got the following OOPS when testing, can't reproduce again :(
>
> kernel BUG at block/elevator-fq.c:1396!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/block/hdb/queue/scheduler
> Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbd
> Pid: 5032, comm: rmdir Not tainted (2.6.29-rc7-vivek #17) Veriton M460
> EIP: 0060:[<c04ec7de>] EFLAGS: 00010082 CPU: 0
> EIP is at iocg_destroy+0xdc/0x14e
> EAX: 00000000 EBX: f62278b4 ECX: f6207800 EDX: f6227904
> ESI: f6227800 EDI: f62278a0 EBP: 00000003 ESP: c8790f00
>  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process rmdir (pid: 5032, ti=c8790000 task=f6636960 task.ti=c8790000)
> Stack:
>  f53cc5c0 f62b7258 f10991c8 00000282 f6227800 c0733fa0 ec1c5140 edfc6d34
>  c8790000 c04463ce f6a4c84c edfc6d34 0804c840 c048883c f6a4c84c f6a4c84c
>  c04888d6 f6a4c84c 00000000 c04897de f6a4c84c c048504c f115f4c0 ebc37954
> Call Trace:
>  [<c04463ce>] cgroup_diput+0x41/0x8c
>  [<c048883c>] dentry_iput+0x45/0x5e
>  [<c04888d6>] d_kill+0x19/0x32
>  [<c04897de>] dput+0xd8/0xdf
>  [<c048504c>] do_rmdir+0x8f/0xb6
>  [<c06330fc>] do_page_fault+0x2a2/0x579
>  [<c0402fc1>] sysenter_do_call+0x12/0x21
>  [<c0630000>] schedule+0x641/0x830
> Code: 08 00 74 04 0f 0b eb fe 83 7f 04 00 74 04 0f 0b eb fe 45 83 c3 1c 83 fd 0
> EIP: [<c04ec7de>] iocg_destroy+0xdc/0x14e SS:ESP 0068:c8790f00

We have seen this too, and have been able to reproduce it. I have not
had a chance to fix it so far, but my understanding is that one of the
async queues was active when the cgroup was being destroyed. We
moved it to the root cgroup, but did not deactivate it; so active_entity
still points to the entity of the async queue which has now been moved
to the root cgroup. I will send an update if I can verify this, or fix
it.

>
> --
> Regards
> Gui Jianfeng
>
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-04-10  9:33 ` Gui Jianfeng
       [not found]   ` <49DF1256.7080403-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-04-10 17:49   ` Nauman Rafique
  2009-04-13 13:09   ` Vivek Goyal
  2 siblings, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-10 17:49 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: Vivek Goyal, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

On Fri, Apr 10, 2009 at 2:33 AM, Gui Jianfeng
<guijianfeng@cn.fujitsu.com> wrote:
> Vivek Goyal wrote:
>> Hi All,
>>
>> Here is another posting for IO controller patches. Last time I had posted
>> RFC patches for an IO controller which did bio control per cgroup.
>
>  Hi Vivek,
>
>  I got the following OOPS when testing, can't reproduce again :(
>
> kernel BUG at block/elevator-fq.c:1396!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/block/hdb/queue/scheduler
> Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbd
> Pid: 5032, comm: rmdir Not tainted (2.6.29-rc7-vivek #17) Veriton M460
> EIP: 0060:[<c04ec7de>] EFLAGS: 00010082 CPU: 0
> EIP is at iocg_destroy+0xdc/0x14e
> EAX: 00000000 EBX: f62278b4 ECX: f6207800 EDX: f6227904
> ESI: f6227800 EDI: f62278a0 EBP: 00000003 ESP: c8790f00
>  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process rmdir (pid: 5032, ti=c8790000 task=f6636960 task.ti=c8790000)
> Stack:
>  f53cc5c0 f62b7258 f10991c8 00000282 f6227800 c0733fa0 ec1c5140 edfc6d34
>  c8790000 c04463ce f6a4c84c edfc6d34 0804c840 c048883c f6a4c84c f6a4c84c
>  c04888d6 f6a4c84c 00000000 c04897de f6a4c84c c048504c f115f4c0 ebc37954
> Call Trace:
>  [<c04463ce>] cgroup_diput+0x41/0x8c
>  [<c048883c>] dentry_iput+0x45/0x5e
>  [<c04888d6>] d_kill+0x19/0x32
>  [<c04897de>] dput+0xd8/0xdf
>  [<c048504c>] do_rmdir+0x8f/0xb6
>  [<c06330fc>] do_page_fault+0x2a2/0x579
>  [<c0402fc1>] sysenter_do_call+0x12/0x21
>  [<c0630000>] schedule+0x641/0x830
> Code: 08 00 74 04 0f 0b eb fe 83 7f 04 00 74 04 0f 0b eb fe 45 83 c3 1c 83 fd 0
> EIP: [<c04ec7de>] iocg_destroy+0xdc/0x14e SS:ESP 0068:c8790f00

We have seen this too, and have been able to reproduce it. I have not
had a chance to fix it so far, but my understanding is that one of the
async queues was active when the cgroup was being destroyed. We
moved it to the root cgroup, but did not deactivate it; so active_entity
still points to the entity of the async queue which has now been moved
to the root cgroup. I will send an update if I can verify this, or fix
it.

>
> --
> Regards
> Gui Jianfeng
>
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found]   ` <49DF1256.7080403-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-04-10 17:49     ` Nauman Rafique
@ 2009-04-13 13:09     ` Vivek Goyal
  1 sibling, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-13 13:09 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Fri, Apr 10, 2009 at 05:33:10PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is another posting for IO controller patches. Last time I had posted
> > RFC patches for an IO controller which did bio control per cgroup.
> 
>   Hi Vivek,
> 
>   I got the following OOPS when testing, can't reproduce again :(
> 

Hi Gui,

Thanks for the report. Will look into it and see if I can reproduce it.

Thanks
Vivek

> kernel BUG at block/elevator-fq.c:1396!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/block/hdb/queue/scheduler
> Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbd
> Pid: 5032, comm: rmdir Not tainted (2.6.29-rc7-vivek #17) Veriton M460
> EIP: 0060:[<c04ec7de>] EFLAGS: 00010082 CPU: 0
> EIP is at iocg_destroy+0xdc/0x14e
> EAX: 00000000 EBX: f62278b4 ECX: f6207800 EDX: f6227904
> ESI: f6227800 EDI: f62278a0 EBP: 00000003 ESP: c8790f00
>  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process rmdir (pid: 5032, ti=c8790000 task=f6636960 task.ti=c8790000)
> Stack:
>  f53cc5c0 f62b7258 f10991c8 00000282 f6227800 c0733fa0 ec1c5140 edfc6d34
>  c8790000 c04463ce f6a4c84c edfc6d34 0804c840 c048883c f6a4c84c f6a4c84c
>  c04888d6 f6a4c84c 00000000 c04897de f6a4c84c c048504c f115f4c0 ebc37954
> Call Trace:
>  [<c04463ce>] cgroup_diput+0x41/0x8c
>  [<c048883c>] dentry_iput+0x45/0x5e
>  [<c04888d6>] d_kill+0x19/0x32
>  [<c04897de>] dput+0xd8/0xdf
>  [<c048504c>] do_rmdir+0x8f/0xb6
>  [<c06330fc>] do_page_fault+0x2a2/0x579
>  [<c0402fc1>] sysenter_do_call+0x12/0x21
>  [<c0630000>] schedule+0x641/0x830
> Code: 08 00 74 04 0f 0b eb fe 83 7f 04 00 74 04 0f 0b eb fe 45 83 c3 1c 83 fd 0
> EIP: [<c04ec7de>] iocg_destroy+0xdc/0x14e SS:ESP 0068:c8790f00
> 
> -- 
> Regards
> Gui Jianfeng

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-04-10  9:33 ` Gui Jianfeng
       [not found]   ` <49DF1256.7080403-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-04-10 17:49   ` Nauman Rafique
@ 2009-04-13 13:09   ` Vivek Goyal
  2009-04-22  3:04     ` Gui Jianfeng
       [not found]     ` <20090413130958.GB18007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2 siblings, 2 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-13 13:09 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

On Fri, Apr 10, 2009 at 05:33:10PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is another posting for IO controller patches. Last time I had posted
> > RFC patches for an IO controller which did bio control per cgroup.
> 
>   Hi Vivek,
> 
>   I got the following OOPS when testing, can't reproduce again :(
> 

Hi Gui,

Thanks for the report. Will look into it and see if I can reproduce it.

Thanks
Vivek

> kernel BUG at block/elevator-fq.c:1396!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/block/hdb/queue/scheduler
> Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbd
> Pid: 5032, comm: rmdir Not tainted (2.6.29-rc7-vivek #17) Veriton M460
> EIP: 0060:[<c04ec7de>] EFLAGS: 00010082 CPU: 0
> EIP is at iocg_destroy+0xdc/0x14e
> EAX: 00000000 EBX: f62278b4 ECX: f6207800 EDX: f6227904
> ESI: f6227800 EDI: f62278a0 EBP: 00000003 ESP: c8790f00
>  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process rmdir (pid: 5032, ti=c8790000 task=f6636960 task.ti=c8790000)
> Stack:
>  f53cc5c0 f62b7258 f10991c8 00000282 f6227800 c0733fa0 ec1c5140 edfc6d34
>  c8790000 c04463ce f6a4c84c edfc6d34 0804c840 c048883c f6a4c84c f6a4c84c
>  c04888d6 f6a4c84c 00000000 c04897de f6a4c84c c048504c f115f4c0 ebc37954
> Call Trace:
>  [<c04463ce>] cgroup_diput+0x41/0x8c
>  [<c048883c>] dentry_iput+0x45/0x5e
>  [<c04888d6>] d_kill+0x19/0x32
>  [<c04897de>] dput+0xd8/0xdf
>  [<c048504c>] do_rmdir+0x8f/0xb6
>  [<c06330fc>] do_page_fault+0x2a2/0x579
>  [<c0402fc1>] sysenter_do_call+0x12/0x21
>  [<c0630000>] schedule+0x641/0x830
> Code: 08 00 74 04 0f 0b eb fe 83 7f 04 00 74 04 0f 0b eb fe 45 83 c3 1c 83 fd 0
> EIP: [<c04ec7de>] iocg_destroy+0xdc/0x14e SS:ESP 0068:c8790f00
> 
> -- 
> Regards
> Gui Jianfeng

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]         ` <20090406143556.GK7082-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
  2009-04-06 22:00             ` Nauman Rafique
  2009-04-07  5:59           ` Gui Jianfeng
@ 2009-04-13 13:40           ` Vivek Goyal
  2 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-13 13:40 UTC (permalink / raw)
  To: Balbir Singh
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil

On Mon, Apr 06, 2009 at 08:05:56PM +0530, Balbir Singh wrote:
> * Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> [2009-03-11 21:56:46]:
> 

Thanks for having a look, Balbir. Sorry for the late reply..

[..]
> > +Consider following hypothetical scenario. Lets say there are three physical
> > +disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
> > +created on top of these. Some part of sdb is in lv0 and some part is in lv1.
> > +
> > +			    lv0      lv1
> > +			  /	\  /     \
> > +			sda      sdb      sdc
> > +
> > +Also consider following cgroup hierarchy
> > +
> > +				root
> > +				/   \
> > +			       A     B
> > +			      / \    / \
> > +			     T1 T2  T3  T4
> > +
> > +A and B are two cgroups and T1, T2, T3 and T4 are tasks with-in those cgroups.
> > +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
> > +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
> > +IO control on intermediate logical block nodes (lv0, lv1).
> > +
> > +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
> > +only, there will not be any contetion for resources between group A and B if
> > +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
> > +IO scheduler associated with the sdb will distribute disk bandwidth to
> > +group A and B proportionate to their weight.
> 
> What if we have partitions sda1, sda2 and sda3 instead of sda, sdb and
> sdc?

As Gui already mentioned, IO control is on a per-device basis (like the IO
scheduler) and we don't try to control it on a per-partition basis.

> 
> > +
> > +CFQ already has the notion of fairness and it provides differential disk
> > +access based on priority and class of the task. Just that it is flat and
> > +with cgroup stuff, it needs to be made hierarchical.
> > +
> > +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
> > +of fairness among various threads.
> > +
> > +One of the concerns raised with modifying IO schedulers was that we don't
> > +want to replicate the code in all the IO schedulers. These patches share
> > +the fair queuing code which has been moved to a common layer (elevator
> > +layer). Hence we don't end up replicating code across IO schedulers.
> > +
> > +Design
> > +======
> > +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
> > +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
> > +B-WF2Q+ algorithm for fair queuing.
> > +
> 
> References to BFQ, please. I can search them, but having them in the
> doc would be nice.

That's a good point. In the next posting I will include the references as well.

> 
> > +Why BFQ?
> > +
> > +- Not sure if weighted round robin logic of CFQ can be easily extended for
> > +  hierarchical mode. One of the things is that we can not keep dividing
> > +  the time slice of parent group among childrens. Deeper we go in hierarchy
> > +  time slice will get smaller.
> > +
> > +  One of the ways to implement hierarchical support could be to keep track
> > +  of virtual time and service provided to queue/group and select a queue/group
> > +  for service based on any of the various available algoriths.
> > +
> > +  BFQ already had support for hierarchical scheduling, taking those patches
> > +  was easier.
> > +
> 
> Could you elaborate, when you say timeslices get smaller -
> 
> 1. Are you referring to inability to use higher resolution time?
> 2. Loss of throughput due to timeslice degradation?

I think keeping track of time at a higher resolution should not be a
problem; the concern is rather the loss of throughput due to smaller
timeslices and frequent queue switching.

> 
> > +- BFQ was designed to provide tighter bounds/delay w.r.t service provided
> > +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
> > +
> > +  Note: BFQ originally used amount of IO done (number of sectors) as notion
> > +        of service provided. IOW, it tried to provide fairness in terms of
> > +        actual IO done and not in terms of actual time disk access was
> > +	given to a queue.
> 
> I assume by sectors you mean the kernel sector size?

Yes.

> 
> > +
> > +	This patcheset modified BFQ to provide fairness in time domain because
> > +	that's what CFQ does. So idea was try not to deviate too much from
> > +	the CFQ behavior initially.
> > +
> > +	Providing fairness in time domain makes accounting trciky because
> > +	due to command queueing, at one time there might be multiple requests
> > +	from different queues and there is no easy way to find out how much
> > +	disk time actually was consumed by the requests of a particular
> > +	queue. More about this in comments in source code.
> > +
> > +So it is yet to be seen if changing to time domain still retains BFQ gurantees
> > +or not.
> > +
> > +From data structure point of view, one can think of a tree per device, where
> > +io groups and io queues are hanging and are being scheduled using B-WF2Q+
> > +algorithm. io_queue, is end queue where requests are actually stored and
> > +dispatched from (like cfqq).
> > +
> > +These io queues are primarily created by and managed by end io schedulers
> > +depending on its semantics. For example, noop, deadline and AS ioschedulers
> > +keep one io queues per cgroup and cfqq keeps one io queue per io_context in
> > +a cgroup (apart from async queues).
> > +
> 
> I assume there is one io_context per cgroup.

No. There can be multiple io_contexts per cgroup. Currently an
io_context represents a set of threads that share their IO; cfq keeps
them in one queue from the IO point of view and does not create multiple
queues for them. So there can be many processes/threads in a cgroup
that do not necessarily share an io_context.

> 
> > +A request is mapped to an io group by elevator layer and which io queue it
> > +is mapped to with in group depends on ioscheduler. Currently "current" task
> > +is used to determine the cgroup (hence io group) of the request. Down the
> > +line we need to make use of bio-cgroup patches to map delayed writes to
> > +right group.
> 
> That seem acceptable

Andrew first wants to see a solid plan for handling async writes :-) So
currently I am playing with patches to map writes to the correct cgroup.
Mapping the IO to the right cgroup is only one part of the problem. The
other part is that I am not seeing a continuous stream of writes at the
IO scheduler level. If two dd processes are running in user space, one
would ideally expect two continuous streams of write requests at the IO
scheduler, but instead I see bursty, serialized traffic: a bunch of write
requests from the first dd, then another bunch of write requests from the
second dd, and so on. This leads to no service differentiation between
the two writers, because while the higher priority task is not dispatching
any IO (for 0.2 seconds), the lower priority task/group gets to use the
full disk and soon catches up with the higher priority one.

Part of this serialization was taking place in the request descriptor
allocation infrastructure: the number of request descriptors is limited,
and if one writer consumes most of the descriptors first, it will
block/serialize the other writer.

Now I have a crude working patch where I can limit request descriptors
per group so that one group cannot block another group. But I still don't
see continuously backlogged write queues at the IO scheduler...

Time to do more debugging and move up the stack to see where this
serialization is taking place (I guess the page cache...).

> 
> > +
> > +Going back to old behavior
> > +==========================
> > +In new scheme of things essentially we are creating hierarchical fair
> > +queuing logic in elevator layer and chaning IO schedulers to make use of
> > +that logic so that end IO schedulers start supporting hierarchical scheduling.
> > +
> > +Elevator layer continues to support the old interfaces. So even if fair queuing
> > +is enabled at elevator layer, one can have both new hierchical scheduler as
> > +well as old non-hierarchical scheduler operating.
> > +
> > +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> > +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> > +scheduling is disabled, noop, deadline and AS should retain their existing
> > +behavior.
> > +
> > +CFQ is the only exception where one can not disable fair queuing as it is
> > +needed for provding fairness among various threads even in non-hierarchical
> > +mode.
> > +
> > +Various user visible config options
> > +===================================
> > +CONFIG_IOSCHED_NOOP_HIER
> > +	- Enables hierchical fair queuing in noop. Not selecting this option
> > +	  leads to old behavior of noop.
> > +
> > +CONFIG_IOSCHED_DEADLINE_HIER
> > +	- Enables hierchical fair queuing in deadline. Not selecting this
> > +	  option leads to old behavior of deadline.
> > +
> > +CONFIG_IOSCHED_AS_HIER
> > +	- Enables hierchical fair queuing in AS. Not selecting this option
> > +	  leads to old behavior of AS.
> > +
> > +CONFIG_IOSCHED_CFQ_HIER
> > +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
> > +	  still does fair queuing among various queus but it is flat and not
> > +	  hierarchical.
> > +
> > +Config options selected automatically
> > +=====================================
> > +These config options are not user visible and are selected/deselected
> > +automatically based on IO scheduler configurations.
> > +
> > +CONFIG_ELV_FAIR_QUEUING
> > +	- Enables/Disables the fair queuing logic at elevator layer.
> > +
> > +CONFIG_GROUP_IOSCHED
> > +	- Enables/Disables hierarchical queuing and associated cgroup bits.
> > +
> > +TODO
> > +====
> > +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > +- Convert cgroup ioprio to notion of weight.
> > +- Anticipatory code will need more work. It is not working properly currently
> > +  and needs more thought.
> 
> What are the problems with the code?

I have not had a chance to look into the issues in detail yet; a crude run
showed a drop in performance. I will debug it later, once I have async
writes handled...

> > +- Use of bio-cgroup patches.
> 
> I saw these posted as well
> 
> > +- Use of Nauman's per cgroup request descriptor patches.
> > +
> 
> More details would be nice, I am not sure I understand

Currently the number of request descriptors which can be allocated per
device/request queue is fixed by a sysfs tunable (q->nr_requests). So
if there is a lot of IO going on from one cgroup, it will consume all
the available request descriptors, and other cgroups might starve and not
get their fair share.

Hence we also need to introduce a per-cgroup request descriptor limit, so
that if the request descriptors of one group are exhausted, it does not
impact the IO of other cgroups.
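
To illustrate the idea, a conceptual sketch in C (this is not the crude
patch mentioned above; the structure and function names are invented for
illustration only):

/*
 * Conceptual sketch of per-group request descriptor limiting.
 */
struct io_group_rq_pool {
	unsigned int nr_allocated;	/* descriptors currently held by this group */
	unsigned int nr_limit;		/* this group's share of q->nr_requests */
};

/*
 * Called from the request allocation path.  Once a group has used up its
 * own share it must wait for one of its requests to complete, so a busy
 * group can no longer exhaust the global pool and starve the others.
 */
static int io_group_may_alloc_rq(struct io_group_rq_pool *pool)
{
	if (pool->nr_allocated >= pool->nr_limit)
		return 0;	/* caller sleeps on a per-group wait queue */
	pool->nr_allocated++;
	return 1;
}

static void io_group_put_rq(struct io_group_rq_pool *pool)
{
	pool->nr_allocated--;	/* and wake up per-group waiters */
}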

> 
> > +HOWTO
> > +=====
> > +So far I have done very simple testing of running two dd threads in two
> > +different cgroups. Here is what you can do.
> > +
> > +- Enable hierarchical scheduling in io scheuduler of your choice (say cfq).
> > +	CONFIG_IOSCHED_CFQ_HIER=y
> > +
> > +- Compile and boot into kernel and mount IO controller.
> > +
> > +	mount -t cgroup -o io none /cgroup
> > +
> > +- Create two cgroups
> > +	mkdir -p /cgroup/test1/ /cgroup/test2
> > +
> > +- Set io priority of group test1 and test2
> > +	echo 0 > /cgroup/test1/io.ioprio
> > +	echo 4 > /cgroup/test2/io.ioprio
> > +
> 
> What is the meaning of priorities? Which is higher, which is lower?
> What is the maximum? How does it impact b/w?

Currently cfq has a notion of a priority range 0-7 (0 being highest). To
begin with we simply adopted that notion, though we are now converting it
to weights for groups.

The mapping from group priority to group weight is linear, so a prio 0
group should get double the BW of a prio 4 group.
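
For illustration, one linear mapping that is consistent with the example
above (the actual conversion in the patches may use a different scale):

/* prio 0 -> weight 8, prio 4 -> weight 4, prio 7 -> weight 1 */
static inline unsigned int io_prio_to_weight(unsigned int ioprio)
{
	return 8 - ioprio;	/* valid for the CFQ prio range 0-7 */
}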

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-04-06 14:35         ` Balbir Singh
                           ` (2 preceding siblings ...)
  (?)
@ 2009-04-13 13:40         ` Vivek Goyal
  2009-05-01 22:04           ` IKEDA, Munehiro
       [not found]           ` <20090413134017.GC18007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  -1 siblings, 2 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-13 13:40 UTC (permalink / raw)
  To: Balbir Singh
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng,
	arozansk, jmoyer, oz-kernel, dhaval, linux-kernel, containers,
	akpm, menage, peterz

On Mon, Apr 06, 2009 at 08:05:56PM +0530, Balbir Singh wrote:
> * Vivek Goyal <vgoyal@redhat.com> [2009-03-11 21:56:46]:
> 

Thanks for having a look, Balbir. Sorry for the late reply..

[..]
> > +Consider following hypothetical scenario. Lets say there are three physical
> > +disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
> > +created on top of these. Some part of sdb is in lv0 and some part is in lv1.
> > +
> > +			    lv0      lv1
> > +			  /	\  /     \
> > +			sda      sdb      sdc
> > +
> > +Also consider following cgroup hierarchy
> > +
> > +				root
> > +				/   \
> > +			       A     B
> > +			      / \    / \
> > +			     T1 T2  T3  T4
> > +
> > +A and B are two cgroups and T1, T2, T3 and T4 are tasks with-in those cgroups.
> > +Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
> > +get their fair share of bandwidth on disks sda, sdb and sdc. There is no
> > +IO control on intermediate logical block nodes (lv0, lv1).
> > +
> > +So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
> > +only, there will not be any contetion for resources between group A and B if
> > +IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
> > +IO scheduler associated with the sdb will distribute disk bandwidth to
> > +group A and B proportionate to their weight.
> 
> What if we have partitions sda1, sda2 and sda3 instead of sda, sdb and
> sdc?

As Gui already mentioned, IO control is on a per-device basis (like the IO
scheduler) and we don't try to control it on a per-partition basis.

> 
> > +
> > +CFQ already has the notion of fairness and it provides differential disk
> > +access based on priority and class of the task. Just that it is flat and
> > +with cgroup stuff, it needs to be made hierarchical.
> > +
> > +Rest of the IO schedulers (noop, deadline and AS) don't have any notion
> > +of fairness among various threads.
> > +
> > +One of the concerns raised with modifying IO schedulers was that we don't
> > +want to replicate the code in all the IO schedulers. These patches share
> > +the fair queuing code which has been moved to a common layer (elevator
> > +layer). Hence we don't end up replicating code across IO schedulers.
> > +
> > +Design
> > +======
> > +This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
> > +fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
> > +B-WF2Q+ algorithm for fair queuing.
> > +
> 
> References to BFQ, please. I can search them, but having them in the
> doc would be nice.

That's a good point. In the next posting I will include the references as well.

> 
> > +Why BFQ?
> > +
> > +- Not sure if weighted round robin logic of CFQ can be easily extended for
> > +  hierarchical mode. One of the things is that we can not keep dividing
> > +  the time slice of parent group among childrens. Deeper we go in hierarchy
> > +  time slice will get smaller.
> > +
> > +  One of the ways to implement hierarchical support could be to keep track
> > +  of virtual time and service provided to queue/group and select a queue/group
> > +  for service based on any of the various available algoriths.
> > +
> > +  BFQ already had support for hierarchical scheduling, taking those patches
> > +  was easier.
> > +
> 
> Could you elaborate, when you say timeslices get smaller -
> 
> 1. Are you referring to inability to use higher resolution time?
> 2. Loss of throughput due to timeslice degradation?

I think keeping track of time at a higher resolution should not be a
problem; the concern is rather the loss of throughput due to smaller
timeslices and frequent queue switching.

> 
> > +- BFQ was designed to provide tighter bounds/delay w.r.t service provided
> > +  to a queue. Delay/Jitter with BFQ is supposed to be O(1).
> > +
> > +  Note: BFQ originally used amount of IO done (number of sectors) as notion
> > +        of service provided. IOW, it tried to provide fairness in terms of
> > +        actual IO done and not in terms of actual time disk access was
> > +	given to a queue.
> 
> I assume by sectors you mean the kernel sector size?

Yes.

> 
> > +
> > +	This patcheset modified BFQ to provide fairness in time domain because
> > +	that's what CFQ does. So idea was try not to deviate too much from
> > +	the CFQ behavior initially.
> > +
> > +	Providing fairness in time domain makes accounting trciky because
> > +	due to command queueing, at one time there might be multiple requests
> > +	from different queues and there is no easy way to find out how much
> > +	disk time actually was consumed by the requests of a particular
> > +	queue. More about this in comments in source code.
> > +
> > +So it is yet to be seen if changing to time domain still retains BFQ gurantees
> > +or not.
> > +
> > +From data structure point of view, one can think of a tree per device, where
> > +io groups and io queues are hanging and are being scheduled using B-WF2Q+
> > +algorithm. io_queue, is end queue where requests are actually stored and
> > +dispatched from (like cfqq).
> > +
> > +These io queues are primarily created by and managed by end io schedulers
> > +depending on its semantics. For example, noop, deadline and AS ioschedulers
> > +keep one io queues per cgroup and cfqq keeps one io queue per io_context in
> > +a cgroup (apart from async queues).
> > +
> 
> I assume there is one io_context per cgroup.

No. There can be multiple io_contexts per cgroup. Currently an
io_context represents a set of threads that share their IO; cfq keeps
them in one queue from the IO point of view and does not create multiple
queues for them. So there can be many processes/threads in a cgroup
that do not necessarily share an io_context.

> 
> > +A request is mapped to an io group by elevator layer and which io queue it
> > +is mapped to with in group depends on ioscheduler. Currently "current" task
> > +is used to determine the cgroup (hence io group) of the request. Down the
> > +line we need to make use of bio-cgroup patches to map delayed writes to
> > +right group.
> 
> That seem acceptable

Andrew first wants to see a solid plan for handling async writes :-) So
currently I am playing with patches to map writes to the correct cgroup.
Mapping the IO to the right cgroup is only one part of the problem. The
other part is that I am not seeing a continuous stream of writes at the
IO scheduler level. If two dd processes are running in user space, one
would ideally expect two continuous streams of write requests at the IO
scheduler, but instead I see bursty, serialized traffic: a bunch of write
requests from the first dd, then another bunch of write requests from the
second dd, and so on. This leads to no service differentiation between
the two writers, because while the higher priority task is not dispatching
any IO (for 0.2 seconds), the lower priority task/group gets to use the
full disk and soon catches up with the higher priority one.

Part of this serialization was taking place in the request descriptor
allocation infrastructure: the number of request descriptors is limited,
and if one writer consumes most of the descriptors first, it will
block/serialize the other writer.

Now I have a crude working patch where I can limit request descriptors
per group so that one group cannot block another group. But I still don't
see continuously backlogged write queues at the IO scheduler...

Time to do more debugging and move up the stack to see where this
serialization is taking place (I guess the page cache...).

> 
> > +
> > +Going back to old behavior
> > +==========================
> > +In new scheme of things essentially we are creating hierarchical fair
> > +queuing logic in elevator layer and chaning IO schedulers to make use of
> > +that logic so that end IO schedulers start supporting hierarchical scheduling.
> > +
> > +Elevator layer continues to support the old interfaces. So even if fair queuing
> > +is enabled at elevator layer, one can have both new hierchical scheduler as
> > +well as old non-hierarchical scheduler operating.
> > +
> > +Also noop, deadline and AS have option of enabling hierarchical scheduling.
> > +If it is selected, fair queuing is done in hierarchical manner. If hierarchical
> > +scheduling is disabled, noop, deadline and AS should retain their existing
> > +behavior.
> > +
> > +CFQ is the only exception where one can not disable fair queuing as it is
> > +needed for provding fairness among various threads even in non-hierarchical
> > +mode.
> > +
> > +Various user visible config options
> > +===================================
> > +CONFIG_IOSCHED_NOOP_HIER
> > +	- Enables hierchical fair queuing in noop. Not selecting this option
> > +	  leads to old behavior of noop.
> > +
> > +CONFIG_IOSCHED_DEADLINE_HIER
> > +	- Enables hierchical fair queuing in deadline. Not selecting this
> > +	  option leads to old behavior of deadline.
> > +
> > +CONFIG_IOSCHED_AS_HIER
> > +	- Enables hierchical fair queuing in AS. Not selecting this option
> > +	  leads to old behavior of AS.
> > +
> > +CONFIG_IOSCHED_CFQ_HIER
> > +	- Enables hierarchical fair queuing in CFQ. Not selecting this option
> > +	  still does fair queuing among various queus but it is flat and not
> > +	  hierarchical.
> > +
> > +Config options selected automatically
> > +=====================================
> > +These config options are not user visible and are selected/deselected
> > +automatically based on IO scheduler configurations.
> > +
> > +CONFIG_ELV_FAIR_QUEUING
> > +	- Enables/Disables the fair queuing logic at elevator layer.
> > +
> > +CONFIG_GROUP_IOSCHED
> > +	- Enables/Disables hierarchical queuing and associated cgroup bits.
> > +
> > +TODO
> > +====
> > +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > +- Convert cgroup ioprio to notion of weight.
> > +- Anticipatory code will need more work. It is not working properly currently
> > +  and needs more thought.
> 
> What are the problems with the code?

I have not had a chance to look into the issues in detail yet; a crude run
showed a drop in performance. I will debug it later, once I have async
writes handled...

> > +- Use of bio-cgroup patches.
> 
> I saw these posted as well
> 
> > +- Use of Nauman's per cgroup request descriptor patches.
> > +
> 
> More details would be nice, I am not sure I understand

Currently the number of request descriptors which can be allocated per
device/request queue is fixed by a sysfs tunable (q->nr_requests). So
if there is a lot of IO going on from one cgroup, it will consume all
the available request descriptors, and other cgroups might starve and not
get their fair share.

Hence we also need to introduce a per-cgroup request descriptor limit, so
that if the request descriptors of one group are exhausted, it does not
impact the IO of other cgroups.

> 
> > +HOWTO
> > +=====
> > +So far I have done very simple testing of running two dd threads in two
> > +different cgroups. Here is what you can do.
> > +
> > +- Enable hierarchical scheduling in io scheuduler of your choice (say cfq).
> > +	CONFIG_IOSCHED_CFQ_HIER=y
> > +
> > +- Compile and boot into kernel and mount IO controller.
> > +
> > +	mount -t cgroup -o io none /cgroup
> > +
> > +- Create two cgroups
> > +	mkdir -p /cgroup/test1/ /cgroup/test2
> > +
> > +- Set io priority of group test1 and test2
> > +	echo 0 > /cgroup/test1/io.ioprio
> > +	echo 4 > /cgroup/test2/io.ioprio
> > +
> 
> What is the meaning of priorities? Which is higher, which is lower?
> What is the maximum? How does it impact b/w?

Currently cfq has a notion of a priority range 0-7 (0 being highest). To
begin with we simply adopted that notion, though we are now converting it
to weights for groups.

The mapping from group priority to group weight is linear, so a prio 0
group should get double the BW of a prio 4 group.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* [PATCH] IO-Controller: Fix kernel panic after moving a task
       [not found]     ` <1236823015-4183-6-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-04-16  5:25       ` Gui Jianfeng
  0 siblings, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-04-16  5:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Vivek Goyal wrote:
> +#ifdef CONFIG_IOSCHED_CFQ_HIER
> +static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
> +{
> +	struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
> +	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
> +	struct cfq_data *cfqd = cic->key;
> +	struct io_group *iog, *__iog;
> +	unsigned long flags;
> +	struct request_queue *q;
> +
> +	if (unlikely(!cfqd))
> +		return;
> +
> +	q = cfqd->q;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +
> +	iog = io_lookup_io_group_current(q);
> +

  Hi Vivek,

  I triggered another kernel panic when testing. When moving a task to another
  cgroup, the corresponding iog may not always be set up properly yet, so "iog"
  might be NULL here. io_ioq_move() then receives a NULL iog and the kernel
  crashes.

  Consider the following piece of code:

 941 int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 942 {
 943         struct elevator_queue *e = q->elevator;
 944 
 945         elv_fq_set_request_io_group(q, rq);
 
 -->task moving to a new group is happening here.

 946 
 947         /*
 948          * Optimization for noop, deadline and AS which maintain only single
 949          * ioq per io group
 950          */
 951         if (elv_iosched_single_ioq(e))
 952                 return elv_fq_set_request_ioq(q, rq, gfp_mask);
 953 
 954         if (e->ops->elevator_set_req_fn)
 955                 return e->ops->elevator_set_req_fn(q, rq, gfp_mask);

cfq_set_request() will finally call io_ioq_move(), but the iog is NULL because
the iogs in the hierarchy have not been built yet. So the kernel crashes.

 956 
 957         rq->elevator_private = NULL;
 958         return 0;
 959 }

BUG: unable to handle kernel NULL pointer dereference at 000000bc
IP: [<c04ebf8f>] io_ioq_move+0xf2/0x109
*pde = 6cc00067
Oops: 0000 [#1] SMP
last sysfs file: /sys/block/hdb/queue/slice_idle
Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbs sbshc battery ac lp snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm serio_raw snd_timer rtc_cmos parport_pc snd r8169 button rtc_core parport soundcore mii i2c_i801 rtc_lib snd_page_alloc pcspkr i2c_core dm_region_hash dm_log dm_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd

Pid: 5431, comm: dd Not tainted (2.6.29-rc7-vivek #19) Veriton M460
EIP: 0060:[<c04ebf8f>] EFLAGS: 00010046 CPU: 0
EIP is at io_ioq_move+0xf2/0x109
EAX: f6203a88 EBX: f6792c94 ECX: f6203a84 EDX: 00000006
ESI: 00000000 EDI: 00000000 EBP: f6203a60 ESP: f6304c28
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process dd (pid: 5431, ti=f6304000 task=f669dae0 task.ti=f6304000)
Stack:
 f62478c0 0100dd40 f6247908 f62d995c 00000000 00000000 f675b54c c04e9182
 f638e9b0 00000282 f62d99a4 f6325a2c c04e9113 f5a707c0 c04e7ae0 f675b000
 f62d95fc f6325a2c c04e8501 00000010 f631e4e8 f675b000 00080000 ffffff10
Call Trace:
 [<c04e9182>] changed_cgroup+0x6f/0x8d
 [<c04e9113>] changed_cgroup+0x0/0x8d
 [<c04e7ae0>] __call_for_each_cic+0x1b/0x25
 [<c04e8501>] cfq_set_request+0x158/0x2c7
 [<c06316e6>] _spin_unlock_irqrestore+0x5/0x6
 [<c04eb106>] elv_fq_set_request_io_group+0x2b/0x3e
 [<c04e83a9>] cfq_set_request+0x0/0x2c7
 [<c04dddcb>] elv_set_request+0x3e/0x4e
 [<c04df3da>] get_request+0x1ed/0x29b
 [<c04df9bb>] get_request_wait+0xdf/0xf2
 [<c04dfd89>] __make_request+0x2c6/0x372
 [<c049bd76>] do_mpage_readpage+0x4fe/0x5e3
 [<c04deba5>] generic_make_request+0x2d0/0x355
 [<c04dff47>] submit_bio+0x92/0x97
 [<c045bfcb>] add_to_page_cache_locked+0x8a/0xb7
 [<c049bfa4>] mpage_end_io_read+0x0/0x50
 [<c049b1b6>] mpage_bio_submit+0x19/0x1d
 [<c049bf9a>] mpage_readpages+0x9b/0xa5
 [<f7dd18c7>] ext3_readpages+0x0/0x15 [ext3]
 [<c0462192>] __do_page_cache_readahead+0xea/0x154
 [<f7dd2286>] ext3_get_block+0x0/0xbe [ext3]
 [<c045d34d>] generic_file_aio_read+0x276/0x569
 [<c047cdd9>] do_sync_read+0xbf/0xfe
 [<c043a3f2>] getnstimeofday+0x51/0xdb
 [<c0434d3c>] autoremove_wake_function+0x0/0x2d
 [<c041bdc3>] sched_slice+0x61/0x6a
 [<c0423114>] task_tick_fair+0x3d/0x60
 [<c04c1d79>] security_file_permission+0xc/0xd
 [<c047cd1a>] do_sync_read+0x0/0xfe
 [<c047d35a>] vfs_read+0x6c/0x8b
 [<c047d67e>] sys_read+0x3c/0x63
 [<c0402fc1>] sysenter_do_call+0x12/0x21
 [<c0630000>] schedule+0x551/0x830
Code: 08 31 c9 89 da e8 77 fc ff ff 8b 86 bc 00 00 00 85 ff 89 43 38 8d 46 60 89 43 40 74 1d 83 c4 0c 89 d8 5b 5e 5f 5d e9 aa f9 ff ff <8b> 86 bc 00 00 00 89 43 38 8d 46 60 89 43 40 83 c4 0c 5b 5e 5f
EIP: [<c04ebf8f>] io_ioq_move+0xf2/0x109 SS:ESP 0068:f6304c28

Changelog:

Make sure iogs in the hierarchy are built properly after moving a task to a new cgroup.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 block/cfq-iosched.c |    4 +++-
 block/elevator-fq.c |    1 +
 block/elevator-fq.h |    1 +
 3 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 0ecf7c7..6d7bb8a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,6 +12,8 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
+#include "elevator-fq.h"
+
 /*
  * tunables
  */
@@ -1086,7 +1088,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
-	iog = io_lookup_io_group_current(q);
+	iog = io_get_io_group(q);
 
 	if (async_cfqq != NULL) {
 		__iog = cfqq_to_io_group(async_cfqq);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index df53418..f81cf6a 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1191,6 +1191,7 @@ struct io_group *io_get_io_group(struct request_queue *q)
 
 	return iog;
 }
+EXPORT_SYMBOL(io_get_io_group);
 
 void io_free_root_group(struct elevator_queue *e)
 {
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index fc4110d..f17e425 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -459,6 +459,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 }
 
 #ifdef CONFIG_GROUP_IOSCHED
+extern struct io_group *io_get_io_group(struct request_queue *q);
 extern int io_group_allow_merge(struct request *rq, struct bio *bio);
 extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
 					struct io_group *iog);
-- 
1.5.4.rc3

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* [PATCH] IO-Controller: Fix kernel panic after moving a task
  2009-03-12  1:56     ` Vivek Goyal
  (?)
  (?)
@ 2009-04-16  5:25     ` Gui Jianfeng
       [not found]       ` <49E6C14F.3090009-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  -1 siblings, 1 reply; 190+ messages in thread
From: Gui Jianfeng @ 2009-04-16  5:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

Vivek Goyal wrote:
> +#ifdef CONFIG_IOSCHED_CFQ_HIER
> +static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
> +{
> +	struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
> +	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
> +	struct cfq_data *cfqd = cic->key;
> +	struct io_group *iog, *__iog;
> +	unsigned long flags;
> +	struct request_queue *q;
> +
> +	if (unlikely(!cfqd))
> +		return;
> +
> +	q = cfqd->q;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +
> +	iog = io_lookup_io_group_current(q);
> +

  Hi Vivek,

  I triggered another kernel panic when testing. When moving a task to another
  cgroup, the corresponding iog may not be set up properly all the time. "iog"
  might be NULL here, so io_ioq_move() receives a NULL iog and the kernel crashes.

  Consider the following piece of code:

 941 int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 942 {
 943         struct elevator_queue *e = q->elevator;
 944 
 945         elv_fq_set_request_io_group(q, rq);
 
 -->task moving to a new group is happening here.

 946 
 947         /*
 948          * Optimization for noop, deadline and AS which maintain only single
 949          * ioq per io group
 950          */
 951         if (elv_iosched_single_ioq(e))
 952                 return elv_fq_set_request_ioq(q, rq, gfp_mask);
 953 
 954         if (e->ops->elevator_set_req_fn)
 955                 return e->ops->elevator_set_req_fn(q, rq, gfp_mask);

cfq_set_request() will finally call io_ioq_move(), but the iog is NULL, because the iogs in the
hierarchy are not built yet. So the kernel crashes. (A rough sketch of the lookup-vs-get
difference the fix relies on follows the listing below.)

 956 
 957         rq->elevator_private = NULL;
 958         return 0;
 959 }
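
For readers skimming the fix below: conceptually, the difference between the plain
lookup and the "get" variant is roughly the following. This is only an illustrative
sketch of the intended semantics, not the actual elevator-fq.c code, and
io_group_chain_create() is a made-up name for whatever allocates the missing groups
up to the root.

/* Sketch only -- not the real implementation of io_get_io_group(). */
struct io_group *io_get_io_group(struct request_queue *q)
{
        struct io_group *iog;

        /*
         * Plain lookup: returns NULL if the current task's cgroup has no
         * io_group set up on this queue yet, e.g. right after the task
         * has been moved to a new cgroup.
         */
        iog = io_lookup_io_group_current(q);
        if (iog)
                return iog;

        /*
         * Hypothetical helper: allocate and link io_groups from the
         * task's cgroup up to this queue's root group and return the
         * leaf, so callers like changed_cgroup() never see NULL.
         */
        return io_group_chain_create(q);
}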

BUG: unable to handle kernel NULL pointer dereference at 000000bc
IP: [<c04ebf8f>] io_ioq_move+0xf2/0x109
*pde = 6cc00067
Oops: 0000 [#1] SMP
last sysfs file: /sys/block/hdb/queue/slice_idle
Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbs sbshc battery ac lp snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm serio_raw snd_timer rtc_cmos parport_pc snd r8169 button rtc_core parport soundcore mii i2c_i801 rtc_lib snd_page_alloc pcspkr i2c_core dm_region_hash dm_log dm_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd

Pid: 5431, comm: dd Not tainted (2.6.29-rc7-vivek #19) Veriton M460
EIP: 0060:[<c04ebf8f>] EFLAGS: 00010046 CPU: 0
EIP is at io_ioq_move+0xf2/0x109
EAX: f6203a88 EBX: f6792c94 ECX: f6203a84 EDX: 00000006
ESI: 00000000 EDI: 00000000 EBP: f6203a60 ESP: f6304c28
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process dd (pid: 5431, ti=f6304000 task=f669dae0 task.ti=f6304000)
Stack:
 f62478c0 0100dd40 f6247908 f62d995c 00000000 00000000 f675b54c c04e9182
 f638e9b0 00000282 f62d99a4 f6325a2c c04e9113 f5a707c0 c04e7ae0 f675b000
 f62d95fc f6325a2c c04e8501 00000010 f631e4e8 f675b000 00080000 ffffff10
Call Trace:
 [<c04e9182>] changed_cgroup+0x6f/0x8d
 [<c04e9113>] changed_cgroup+0x0/0x8d
 [<c04e7ae0>] __call_for_each_cic+0x1b/0x25
 [<c04e8501>] cfq_set_request+0x158/0x2c7
 [<c06316e6>] _spin_unlock_irqrestore+0x5/0x6
 [<c04eb106>] elv_fq_set_request_io_group+0x2b/0x3e
 [<c04e83a9>] cfq_set_request+0x0/0x2c7
 [<c04dddcb>] elv_set_request+0x3e/0x4e
 [<c04df3da>] get_request+0x1ed/0x29b
 [<c04df9bb>] get_request_wait+0xdf/0xf2
 [<c04dfd89>] __make_request+0x2c6/0x372
 [<c049bd76>] do_mpage_readpage+0x4fe/0x5e3
 [<c04deba5>] generic_make_request+0x2d0/0x355
 [<c04dff47>] submit_bio+0x92/0x97
 [<c045bfcb>] add_to_page_cache_locked+0x8a/0xb7
 [<c049bfa4>] mpage_end_io_read+0x0/0x50
 [<c049b1b6>] mpage_bio_submit+0x19/0x1d
 [<c049bf9a>] mpage_readpages+0x9b/0xa5
 [<f7dd18c7>] ext3_readpages+0x0/0x15 [ext3]
 [<c0462192>] __do_page_cache_readahead+0xea/0x154
 [<f7dd2286>] ext3_get_block+0x0/0xbe [ext3]
 [<c045d34d>] generic_file_aio_read+0x276/0x569
 [<c047cdd9>] do_sync_read+0xbf/0xfe
 [<c043a3f2>] getnstimeofday+0x51/0xdb
 [<c0434d3c>] autoremove_wake_function+0x0/0x2d
 [<c041bdc3>] sched_slice+0x61/0x6a
 [<c0423114>] task_tick_fair+0x3d/0x60
 [<c04c1d79>] security_file_permission+0xc/0xd
 [<c047cd1a>] do_sync_read+0x0/0xfe
 [<c047d35a>] vfs_read+0x6c/0x8b
 [<c047d67e>] sys_read+0x3c/0x63
 [<c0402fc1>] sysenter_do_call+0x12/0x21
 [<c0630000>] schedule+0x551/0x830
Code: 08 31 c9 89 da e8 77 fc ff ff 8b 86 bc 00 00 00 85 ff 89 43 38 8d 46 60 89 43 40 74 1d 83 c4 0c 89 d8 5b 5e 5f 5d e9 aa f9 ff ff <8b> 86 bc 00 00 00 89 43 38 8d 46 60 89 43 40 83 c4 0c 5b 5e 5f
EIP: [<c04ebf8f>] io_ioq_move+0xf2/0x109 SS:ESP 0068:f6304c28

Changelog:

Make sure iogs in the hierarchy are built properly after moving a task to a new cgroup.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/cfq-iosched.c |    4 +++-
 block/elevator-fq.c |    1 +
 block/elevator-fq.h |    1 +
 3 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 0ecf7c7..6d7bb8a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,6 +12,8 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
+#include "elevator-fq.h"
+
 /*
  * tunables
  */
@@ -1086,7 +1088,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
-	iog = io_lookup_io_group_current(q);
+	iog = io_get_io_group(q);
 
 	if (async_cfqq != NULL) {
 		__iog = cfqq_to_io_group(async_cfqq);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index df53418..f81cf6a 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1191,6 +1191,7 @@ struct io_group *io_get_io_group(struct request_queue *q)
 
 	return iog;
 }
+EXPORT_SYMBOL(io_get_io_group);
 
 void io_free_root_group(struct elevator_queue *e)
 {
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index fc4110d..f17e425 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -459,6 +459,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 }
 
 #ifdef CONFIG_GROUP_IOSCHED
+extern struct io_group *io_get_io_group(struct request_queue *q);
 extern int io_group_allow_merge(struct request *rq, struct bio *bio);
 extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
 					struct io_group *iog);
-- 
1.5.4.rc3




^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-04-08 20:37                 ` Andrea Righi
@ 2009-04-16 18:37                     ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-16 18:37 UTC (permalink / raw)
  To: Andrew Morton, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA, jen

On Wed, Apr 08, 2009 at 10:37:59PM +0200, Andrea Righi wrote:

[..]
> > 
> > - I can think of atleast one usage of uppper limit controller where we
> >   might have spare IO resources still we don't want to give it to a
> >   cgroup because customer has not paid for that kind of service level. In
> >   those cases we need to implement uppper limit also.
> > 
> >   May be prportional weight and max bw controller can co-exist depending
> >   on what user's requirements are.
> >  
> >   If yes, then can't this control be done at the same layer/level where
> >   proportional weight control is being done? IOW, this set of patches is
> >   trying to do prportional weight control at IO scheduler level. I think
> >   we should be able to store another max rate as another feature in 
> >   cgroup (apart from weight) and not dispatch requests from the queue if
> >   we have exceeded the max BW as specified by the user?
> 
> The more I think about a "perfect" solution (at least for my
> requirements), the more I'm convinced that we need both functionalities.
> 

I agree here. In some scenarios people might want to put an upper cap on BW
even if more BW is available, and in other scenarios people would like to do
proportional distribution and let a group get a larger share of the disk if it
is free.
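
Just to make that idea concrete, a per-group cap could in principle be enforced
at dispatch time with something like the sketch below. This is not code from any
posted patchset; the structure, its fields and iog_over_bw_limit() are invented
purely for illustration (kernel-style C using the jiffies helpers from
linux/jiffies.h).

/* Illustrative only -- a per-group byte budget refilled once per second. */
struct iog_bw_limit {
        u64     max_bytes_per_sec;      /* 0 == unlimited */
        u64     dispatched;             /* bytes dispatched in this window */
        u64     window_start;           /* jiffies_64 at start of window */
};

/*
 * The elevator would skip dispatching from a group while this returns
 * true, and add "bytes" to l->dispatched whenever it does dispatch.
 */
static bool iog_over_bw_limit(struct iog_bw_limit *l, unsigned int bytes)
{
        u64 now = get_jiffies_64();

        if (!l->max_bytes_per_sec)
                return false;

        if (time_after64(now, l->window_start + HZ)) {
                /* New one-second window: reset the budget. */
                l->window_start = now;
                l->dispatched = 0;
        }

        return l->dispatched + bytes > l->max_bytes_per_sec;
}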

> I think it would be possible to implement both proportional and limiting
> rules at the same level (e.g., the IO scheduler), but we need also to
> address the memory consumption problem (I still need to review your
> patchset in details and I'm going to test it soon :), so I don't know if
> you already addressed this issue).
> 

Can you please elaborate a bit on this? Are you concerned that the data
structures created to solve the problem consume a lot of memory?

> IOW if we simply don't dispatch requests and we don't throttle the tasks
> in the cgroup that exceeds its limit, how do we avoid the waste of
> memory due to the succeeding IO requests and the increasingly dirty
> pages in the page cache (that are also hard to reclaim)? I may be wrong,
> but I think we talked about this problem in a previous email... sorry I
> don't find the discussion in my mail archives.
> 
> IMHO a nice approach would be to measure IO consumption at the IO
> scheduler level, and control IO applying proportional weights / absolute
> limits _both_ at the IO scheduler / elevator level _and_ at the same
> time block the tasks from dirtying memory that will generate additional
> IO requests.
> 
> Anyway, there's no need to provide this with a single IO controller, we
> could split the problem in two parts: 1) provide a proportional /
> absolute IO controller in the IO schedulers and 2) allow to set, for
> example, a maximum limit of dirty pages for each cgroup.
> 

I think setting a maximum limit on dirty pages is an interesting thought.
It sounds as if the memory controller could handle it?

I guess currently the memory controller puts a limit on the total amount of
memory consumed by a cgroup and there are no knobs for the type of memory
consumed. So if one can limit the amount of dirty page cache memory per
cgroup, it automatically throttles the async writes at the input itself.
 
So I agree that if we can keep a process from dirtying too much memory,
then an IO scheduler level controller should be able to do both
proportional weight and max BW control.
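
As a rough sketch of that idea (nothing like this exists today; both
mem_cgroup_dirty_pages() and mem_cgroup_dirty_limit() are hypothetical names),
the write path could throttle dirtiers per cgroup much like the global dirty
limit does:

/* Hypothetical sketch -- not existing memory controller code. */
static void mem_cgroup_balance_dirty_pages(struct mem_cgroup *memcg)
{
        /*
         * Block the dirtying task until its cgroup drops below its own
         * dirty limit, throttling buffered writers at the source instead
         * of letting unbounded dirty pages pile up in the page cache.
         */
        while (mem_cgroup_dirty_pages(memcg) > mem_cgroup_dirty_limit(memcg))
                congestion_wait(WRITE, HZ / 10);
}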

Currently doing proportional weight control for async writes is very
tricky. I am not seeing constantly backlogged traffic at the IO scheduler
level and hence two processes with different weights seem to be getting the
same BW.

I will dive deeper into the dm-ioband patches to see how they have
solved this issue. It looks like they are just waiting longer for the slowest
group to consume its tokens, and that will keep the disk idle. Extended
delays might not show up immediately as a performance hog, because they might
also promote increased merging, but they should lead to increased latency of
response. And proving latency issues is hard. :-)

> Maybe I'm just repeating what we already said in a previous
> discussion... in this case sorry for the duplicate thoughts. :)
> 
> > 
> > - Have you thought of doing hierarchical control? 
> > 
> 
> Providing hiearchies in cgroups is in general expensive, deeper
> hierarchies imply checking all the way up to the root cgroup, so I think
> we need to be very careful and be aware of the trade-offs before
> providing such feature. For this particular case (IO controller)
> wouldn't it be simpler and more efficient to just ignore hierarchies in
> the kernel and opportunely handle them in userspace? for absolute
> limiting rules this isn't difficult at all, just imagine a config file
> and a script or a deamon that dynamically create the opportune cgroups
> and configure them accordingly to what is defined in the configuration
> file.
> 
> I think we can simply define hierarchical dependencies in the
> configuration file, translate them in absolute values and use the
> absolute values to configure the cgroups' properties.
> 
> For example, we can just check that the BW allocated for a particular
> parent cgroup is not greater than the total BW allocated for the
> children. And for each child just use the min(parent_BW, BW) or equally
> divide the parent's BW among the children, etc.

IIUC, you are saying to allow hierarchy in user space and then flatten it
out and pass it to the kernel?

Hmm.., agree that handling hierarchies is hard and expensive. But at the
same time the rest of the controllers, like cpu and memory, are handling it in
the kernel, so it probably makes sense to keep the IO controller in line.

In practice I am not expecting deep hierarchies. Maybe 2-3 levels would
be good for most people.
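
Just to illustrate what the userspace flattening described above would look
like, here is a plain sketch (the config structure and field names are made
up for illustration only):

/* Compute a child's effective cap as min(parent's effective cap, own cap). */
#include <stdint.h>

struct cfg_node {
        uint64_t max_bw;                /* configured cap in bytes/s, 0 == inherit */
        struct cfg_node *parent;        /* NULL for the root of the config tree */
};

static uint64_t effective_bw(const struct cfg_node *n)
{
        uint64_t parent_bw = n->parent ? effective_bw(n->parent) : UINT64_MAX;

        if (!n->max_bw)
                return parent_bw;

        return n->max_bw < parent_bw ? n->max_bw : parent_bw;
}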

> 
> > - What happens to the notion of CFQ task classes and task priority. Looks
> >   like max bw rule supercede everything. There is no way that an RT task
> >   get unlimited amount of disk BW even if it wants to? (There is no notion
> >   of RT cgroup etc)
> 
> What about moving all the RT tasks in a separate cgroup with unlimited
> BW?

Hmm.., I think that should work. I have yet to look at your patches in
detail but it looks like unlimited BW group will not be throttled at all
hence RT tasks can just go right through without getting impacted.

> 
> > 
> > > > 
> > > >   Above requirement can create configuration problems.
> > > > 
> > > > 	- If there are large number of disks in system, per cgroup one shall
> > > > 	  have to create rules for each disk. Until and unless admin knows
> > > > 	  what applications are in which cgroup and strictly what disk
> > > > 	  these applications do IO to and create rules for only those
> > > >  	  disks.
> > > 
> > > I don't think this is a huge problem anyway. IMHO a userspace tool, e.g.
> > > a script, would be able to efficiently create/modify rules parsing user
> > > defined rules in some human-readable form (config files, etc.), even in
> > > presence of hundreds of disk. The same is valid for dm-ioband I think.
> > > 
> > > > 
> > > > 	- I think problem gets compounded if there is a hierarchy of
> > > > 	  logical devices. I think in that case one shall have to create
> > > > 	  rules for logical devices and not actual physical devices.
> > > 
> > > With logical devices you mean device-mapper devices (i.e. LVM, software
> > > RAID, etc.)? or do you mean that we need to introduce the concept of
> > > "logical device" to easily (quickly) configure IO requirements and then
> > > map those logical devices to the actual physical devices? In this case I
> > > think this can be addressed in userspace. Or maybe I'm totally missing
> > > the point here.
> > 
> > Yes, I meant LVM, Software RAID etc. So if I have got many disks in the system
> > and I have created software raid on some of them, I need to create rules for
> > lvm devices or physical devices behind those lvm devices? I am assuming
> > that it will be logical devices.
> > 
> > So I need to know exactly to what all devices applications in a particular
> > cgroup is going to do IO, and also know exactly how many cgroups are
> > contending for that cgroup, and also know what worst case disk rate I can
> > expect from that device and then I can do a good job of giving a
> > reasonable value to the max rate of that cgroup on a particular device?
> 
> ok, I understand. For these cases dm-ioband perfectly addresses the
> problem. For the general case, I think the only solution is to provide a
> common interface that each dm subsystem must call to account IO and
> apply limiting and proportional rules.
> 
> > 
> > > 
> > > > 
> > > > - Because it is not proportional weight distribution, if some
> > > >   cgroup is not using its planned BW, other group sharing the
> > > >   disk can not make use of spare BW.  
> > > > 	
> > > 
> > > Right.
> > > 
> > > > - I think one should know in advance the throughput rate of underlying media
> > > >   and also know competing applications so that one can statically define
> > > >   the BW assigned to each cgroup on each disk.
> > > > 
> > > >   This will be difficult. Effective BW extracted out of a rotational media
> > > >   is dependent on the seek pattern so one shall have to either try to make
> > > >   some conservative estimates and try to divide BW (we will not utilize disk
> > > >   fully) or take some peak numbers and divide BW (cgroup might not get the
> > > >   maximum rate configured).
> > > 
> > > Correct. I think the proportional weight approach is the only solution
> > > to efficiently use the whole BW. OTOH absolute limiting rules offer a
> > > better control over QoS, because you can totally remove performance
> > > bursts/peaks that could break QoS requirements for short periods of
> > > time.
> > 
> > Can you please give little more details here regarding how QoS requirements
> > are not met with proportional weight?
> 
> With proportional weights the whole bandwidth is allocated if no one
> else is using it. When IO is submitted other tasks with a higher weight
> can be forced to sleep until the IO generated by the low weight tasks is
> not completely dispatched. Or any extent of the priority inversion
> problems.

Hmm..., I am not very sure here. When the admin is allocating the weights, he
has the whole picture. He knows how many groups are contending for the disk
and what the worst case scenario could be. So if I have got two groups A and B
with weights 1 and 2 and both are contending, then as an admin one would
expect group A to get 33% of the BW in the worst case (each group's share is
weight/total weight, i.e. 1/3 and 2/3 here when group B is continuously
backlogged). If B is not contending then A can get 100% of the BW. So while
configuring the system, will one not plan for the worst case (33% for A and
66% for B)?
  
> 
> Maybe it's not an issue at all for the most part of the cases, but using
> a solution that is able to provide also a real partitioning of the
> available resources can be profitely used by those who need to guarantee
> _strict_ BW requirements (soft real-time, maximize the responsiveness of
> certain services, etc.), because in this case we're sure that a certain
> amount of "spare" BW will be always available when needed by some
> "critical" services.
> 

Will the same thing not happen in proportional weight? If it is an RT
application, one can put it in RT groups to make sure it always gets
the BW first even if there is contention. 

Even in regular group, the moment you issue the IO and IO scheduler sees
it, you will start getting your reserved share according to your weight.

How will it be different in the case of io throttling? Even if I don't
utilize the disk fully, cfq will still put the new guy in the queue and
then try to give it its share (based on prio).

Are you saying that by keeping the disk relatively free, the latency of
response for a soft real-time application will become better? In that
case can't one simply underprovision the disk?

But having said that, I am not disputing the need for a max BW controller,
as some people have expressed the need for a constant BW view and don't
want too big a fluctuation even if BW is available. A max BW controller
can't guarantee the minimum BW and hence can't avoid the fluctuations
completely, but it can still help in smoothing the traffic because
other competitors will be stopped from doing too much IO.

Thanks
Vivek

> > 
> > > So, my "ideal" IO controller should allow to define both rules:
> > > absolute and proportional limits.
> > > 
> > > I still have to look closely at your patchset anyway. I will do and give
> > > a feedback.
> > 
> > You feedback is always welcome.
> > 
> > Thanks
> > Vivek
> 
> Thanks,
> -Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
@ 2009-04-16 18:37                     ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-16 18:37 UTC (permalink / raw)
  To: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz

On Wed, Apr 08, 2009 at 10:37:59PM +0200, Andrea Righi wrote:

[..]
> > 
> > - I can think of atleast one usage of uppper limit controller where we
> >   might have spare IO resources still we don't want to give it to a
> >   cgroup because customer has not paid for that kind of service level. In
> >   those cases we need to implement uppper limit also.
> > 
> >   May be prportional weight and max bw controller can co-exist depending
> >   on what user's requirements are.
> >  
> >   If yes, then can't this control be done at the same layer/level where
> >   proportional weight control is being done? IOW, this set of patches is
> >   trying to do prportional weight control at IO scheduler level. I think
> >   we should be able to store another max rate as another feature in 
> >   cgroup (apart from weight) and not dispatch requests from the queue if
> >   we have exceeded the max BW as specified by the user?
> 
> The more I think about a "perfect" solution (at least for my
> requirements), the more I'm convinced that we need both functionalities.
> 

I agree here. In some scenarios people might want to put an upper cap on BW
even if more BW is available, and in other scenarios people would like to do
proportional distribution and let a group get a larger share of the disk if it
is free.

> I think it would be possible to implement both proportional and limiting
> rules at the same level (e.g., the IO scheduler), but we need also to
> address the memory consumption problem (I still need to review your
> patchset in details and I'm going to test it soon :), so I don't know if
> you already addressed this issue).
> 

Can you please elaborate a bit on this? Are you concerned that the data
structures created to solve the problem consume a lot of memory?

> IOW if we simply don't dispatch requests and we don't throttle the tasks
> in the cgroup that exceeds its limit, how do we avoid the waste of
> memory due to the succeeding IO requests and the increasingly dirty
> pages in the page cache (that are also hard to reclaim)? I may be wrong,
> but I think we talked about this problem in a previous email... sorry I
> don't find the discussion in my mail archives.
> 
> IMHO a nice approach would be to measure IO consumption at the IO
> scheduler level, and control IO applying proportional weights / absolute
> limits _both_ at the IO scheduler / elevator level _and_ at the same
> time block the tasks from dirtying memory that will generate additional
> IO requests.
> 
> Anyway, there's no need to provide this with a single IO controller, we
> could split the problem in two parts: 1) provide a proportional /
> absolute IO controller in the IO schedulers and 2) allow to set, for
> example, a maximum limit of dirty pages for each cgroup.
> 

I think setting a maximum limit on dirty pages is an interesting thought.
It sounds as if the memory controller could handle it?

I guess currently the memory controller puts a limit on the total amount of
memory consumed by a cgroup and there are no knobs for the type of memory
consumed. So if one can limit the amount of dirty page cache memory per
cgroup, it automatically throttles the async writes at the input itself.
 
So I agree that if we can keep a process from dirtying too much memory,
then an IO scheduler level controller should be able to do both
proportional weight and max BW control.

Currently doing proportional weight control for async writes is very
tricky. I am not seeing constantly backlogged traffic at the IO scheduler
level and hence two processes with different weights seem to be getting the
same BW.

I will dive deeper into the dm-ioband patches to see how they have
solved this issue. It looks like they are just waiting longer for the slowest
group to consume its tokens, and that will keep the disk idle. Extended
delays might not show up immediately as a performance hog, because they might
also promote increased merging, but they should lead to increased latency of
response. And proving latency issues is hard. :-)

> Maybe I'm just repeating what we already said in a previous
> discussion... in this case sorry for the duplicate thoughts. :)
> 
> > 
> > - Have you thought of doing hierarchical control? 
> > 
> 
> Providing hiearchies in cgroups is in general expensive, deeper
> hierarchies imply checking all the way up to the root cgroup, so I think
> we need to be very careful and be aware of the trade-offs before
> providing such feature. For this particular case (IO controller)
> wouldn't it be simpler and more efficient to just ignore hierarchies in
> the kernel and opportunely handle them in userspace? for absolute
> limiting rules this isn't difficult at all, just imagine a config file
> and a script or a deamon that dynamically create the opportune cgroups
> and configure them accordingly to what is defined in the configuration
> file.
> 
> I think we can simply define hierarchical dependencies in the
> configuration file, translate them in absolute values and use the
> absolute values to configure the cgroups' properties.
> 
> For example, we can just check that the BW allocated for a particular
> parent cgroup is not greater than the total BW allocated for the
> children. And for each child just use the min(parent_BW, BW) or equally
> divide the parent's BW among the children, etc.

IIUC, you are saying to allow hierarchy in user space and then flatten it
out and pass it to the kernel?

Hmm.., agree that handling hierarchies is hard and expensive. But at the
same time the rest of the controllers, like cpu and memory, are handling it in
the kernel, so it probably makes sense to keep the IO controller in line.

In practice I am not expecting deep hierarchies. Maybe 2-3 levels would
be good for most people.

> 
> > - What happens to the notion of CFQ task classes and task priority. Looks
> >   like max bw rule supercede everything. There is no way that an RT task
> >   get unlimited amount of disk BW even if it wants to? (There is no notion
> >   of RT cgroup etc)
> 
> What about moving all the RT tasks in a separate cgroup with unlimited
> BW?

Hmm.., I think that should work. I have yet to look at your patches in
detail but it looks like unlimited BW group will not be throttled at all
hence RT tasks can just go right through without getting impacted.

> 
> > 
> > > > 
> > > >   Above requirement can create configuration problems.
> > > > 
> > > > 	- If there are large number of disks in system, per cgroup one shall
> > > > 	  have to create rules for each disk. Until and unless admin knows
> > > > 	  what applications are in which cgroup and strictly what disk
> > > > 	  these applications do IO to and create rules for only those
> > > >  	  disks.
> > > 
> > > I don't think this is a huge problem anyway. IMHO a userspace tool, e.g.
> > > a script, would be able to efficiently create/modify rules parsing user
> > > defined rules in some human-readable form (config files, etc.), even in
> > > presence of hundreds of disk. The same is valid for dm-ioband I think.
> > > 
> > > > 
> > > > 	- I think problem gets compounded if there is a hierarchy of
> > > > 	  logical devices. I think in that case one shall have to create
> > > > 	  rules for logical devices and not actual physical devices.
> > > 
> > > With logical devices you mean device-mapper devices (i.e. LVM, software
> > > RAID, etc.)? or do you mean that we need to introduce the concept of
> > > "logical device" to easily (quickly) configure IO requirements and then
> > > map those logical devices to the actual physical devices? In this case I
> > > think this can be addressed in userspace. Or maybe I'm totally missing
> > > the point here.
> > 
> > Yes, I meant LVM, Software RAID etc. So if I have got many disks in the system
> > and I have created software raid on some of them, I need to create rules for
> > lvm devices or physical devices behind those lvm devices? I am assuming
> > that it will be logical devices.
> > 
> > So I need to know exactly to what all devices applications in a particular
> > cgroup is going to do IO, and also know exactly how many cgroups are
> > contending for that cgroup, and also know what worst case disk rate I can
> > expect from that device and then I can do a good job of giving a
> > reasonable value to the max rate of that cgroup on a particular device?
> 
> ok, I understand. For these cases dm-ioband perfectly addresses the
> problem. For the general case, I think the only solution is to provide a
> common interface that each dm subsystem must call to account IO and
> apply limiting and proportional rules.
> 
> > 
> > > 
> > > > 
> > > > - Because it is not proportional weight distribution, if some
> > > >   cgroup is not using its planned BW, other group sharing the
> > > >   disk can not make use of spare BW.  
> > > > 	
> > > 
> > > Right.
> > > 
> > > > - I think one should know in advance the throughput rate of underlying media
> > > >   and also know competing applications so that one can statically define
> > > >   the BW assigned to each cgroup on each disk.
> > > > 
> > > >   This will be difficult. Effective BW extracted out of a rotational media
> > > >   is dependent on the seek pattern so one shall have to either try to make
> > > >   some conservative estimates and try to divide BW (we will not utilize disk
> > > >   fully) or take some peak numbers and divide BW (cgroup might not get the
> > > >   maximum rate configured).
> > > 
> > > Correct. I think the proportional weight approach is the only solution
> > > to efficiently use the whole BW. OTOH absolute limiting rules offer a
> > > better control over QoS, because you can totally remove performance
> > > bursts/peaks that could break QoS requirements for short periods of
> > > time.
> > 
> > Can you please give little more details here regarding how QoS requirements
> > are not met with proportional weight?
> 
> With proportional weights the whole bandwidth is allocated if no one
> else is using it. When IO is submitted other tasks with a higher weight
> can be forced to sleep until the IO generated by the low weight tasks is
> not completely dispatched. Or any extent of the priority inversion
> problems.

Hmm..., I am not very sure here. When the admin is allocating the weights, he
has the whole picture. He knows how many groups are contending for the disk
and what the worst case scenario could be. So if I have got two groups A and B
with weights 1 and 2 and both are contending, then as an admin one would
expect group A to get 33% of the BW in the worst case (each group's share is
weight/total weight, i.e. 1/3 and 2/3 here when group B is continuously
backlogged). If B is not contending then A can get 100% of the BW. So while
configuring the system, will one not plan for the worst case (33% for A and
66% for B)?
  
> 
> Maybe it's not an issue at all for the most part of the cases, but using
> a solution that is able to provide also a real partitioning of the
> available resources can be profitely used by those who need to guarantee
> _strict_ BW requirements (soft real-time, maximize the responsiveness of
> certain services, etc.), because in this case we're sure that a certain
> amount of "spare" BW will be always available when needed by some
> "critical" services.
> 

Will the same thing not happen in proportional weight? If it is an RT
application, one can put it in RT groups to make sure it always gets
the BW first even if there is contention. 

Even in regular group, the moment you issue the IO and IO scheduler sees
it, you will start getting your reserved share according to your weight.

How will it be different in the case of io throttling? Even if I don't
utilize the disk fully, cfq will still put the new guy in the queue and
then try to give it its share (based on prio).

Are you saying that by keeping the disk relatively free, the latency of
response for a soft real-time application will become better? In that
case can't one simply underprovision the disk?

But having said that, I am not disputing the need for a max BW controller,
as some people have expressed the need for a constant BW view and don't
want too big a fluctuation even if BW is available. A max BW controller
can't guarantee the minimum BW and hence can't avoid the fluctuations
completely, but it can still help in smoothing the traffic because
other competitors will be stopped from doing too much IO.

Thanks
Vivek

> > 
> > > So, my "ideal" IO controller should allow to define both rules:
> > > absolute and proportional limits.
> > > 
> > > I still have to look closely at your patchset anyway. I will do and give
> > > a feedback.
> > 
> > You feedback is always welcome.
> > 
> > Thanks
> > Vivek
> 
> Thanks,
> -Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] IO-Controller: Fix kernel panic after moving a task
  2009-04-16  5:25     ` Gui Jianfeng
@ 2009-04-16 19:15           ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-16 19:15 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Thu, Apr 16, 2009 at 01:25:35PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > +#ifdef CONFIG_IOSCHED_CFQ_HIER
> > +static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
> > +{
> > +	struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
> > +	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
> > +	struct cfq_data *cfqd = cic->key;
> > +	struct io_group *iog, *__iog;
> > +	unsigned long flags;
> > +	struct request_queue *q;
> > +
> > +	if (unlikely(!cfqd))
> > +		return;
> > +
> > +	q = cfqd->q;
> > +
> > +	spin_lock_irqsave(q->queue_lock, flags);
> > +
> > +	iog = io_lookup_io_group_current(q);
> > +
> 
>   Hi Vivek,
> 
>   I triggered another kernel panic when testing. When moving a task to another 
>   cgroup, the corresponding iog may not be setup properly all the time. "iog"
>   might be NULL here. io_ioq_move() receives a NULL iog, kernel crash.
> 
>   Consider the following piece of code:
> 
>  941 int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
>  942 {
>  943         struct elevator_queue *e = q->elevator;
>  944 
>  945         elv_fq_set_request_io_group(q, rq);
>  
>  -->task moving to a new group is happenning here.
> 
>  946 
>  947         /*
>  948          * Optimization for noop, deadline and AS which maintain only single
>  949          * ioq per io group
>  950          */
>  951         if (elv_iosched_single_ioq(e))
>  952                 return elv_fq_set_request_ioq(q, rq, gfp_mask);
>  953 
>  954         if (e->ops->elevator_set_req_fn)
>  955                 return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
> 
> cfq_set_request() will finally call io_ioq_move(), but the iog is NULL, beacause the iogs in the 
> hierarchy are not built yet. So kernel crashes.
> 
>  956 
>  957         rq->elevator_private = NULL;
>  958         return 0;
>  959 }
> 

Thanks Gui. Good catch. 

> BUG: unable to handle kernel NULL pointer dereference at 000000bc
> IP: [<c04ebf8f>] io_ioq_move+0xf2/0x109
> *pde = 6cc00067
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/block/hdb/queue/slice_idle
> Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbs sbshc battery ac lp snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm serio_raw snd_timer rtc_cmos parport_pc snd r8169 button rtc_core parport soundcore mii i2c_i801 rtc_lib snd_page_alloc pcspkr i2c_core dm_region_hash dm_log dm_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
> 
> Pid: 5431, comm: dd Not tainted (2.6.29-rc7-vivek #19) Veriton M460
> EIP: 0060:[<c04ebf8f>] EFLAGS: 00010046 CPU: 0
> EIP is at io_ioq_move+0xf2/0x109
> EAX: f6203a88 EBX: f6792c94 ECX: f6203a84 EDX: 00000006
> ESI: 00000000 EDI: 00000000 EBP: f6203a60 ESP: f6304c28
>  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process dd (pid: 5431, ti=f6304000 task=f669dae0 task.ti=f6304000)
> Stack:
>  f62478c0 0100dd40 f6247908 f62d995c 00000000 00000000 f675b54c c04e9182
>  f638e9b0 00000282 f62d99a4 f6325a2c c04e9113 f5a707c0 c04e7ae0 f675b000
>  f62d95fc f6325a2c c04e8501 00000010 f631e4e8 f675b000 00080000 ffffff10
> Call Trace:
>  [<c04e9182>] changed_cgroup+0x6f/0x8d
>  [<c04e9113>] changed_cgroup+0x0/0x8d
>  [<c04e7ae0>] __call_for_each_cic+0x1b/0x25
>  [<c04e8501>] cfq_set_request+0x158/0x2c7
>  [<c06316e6>] _spin_unlock_irqrestore+0x5/0x6
>  [<c04eb106>] elv_fq_set_request_io_group+0x2b/0x3e
>  [<c04e83a9>] cfq_set_request+0x0/0x2c7
>  [<c04dddcb>] elv_set_request+0x3e/0x4e
>  [<c04df3da>] get_request+0x1ed/0x29b
>  [<c04df9bb>] get_request_wait+0xdf/0xf2
>  [<c04dfd89>] __make_request+0x2c6/0x372
>  [<c049bd76>] do_mpage_readpage+0x4fe/0x5e3
>  [<c04deba5>] generic_make_request+0x2d0/0x355
>  [<c04dff47>] submit_bio+0x92/0x97
>  [<c045bfcb>] add_to_page_cache_locked+0x8a/0xb7
>  [<c049bfa4>] mpage_end_io_read+0x0/0x50
>  [<c049b1b6>] mpage_bio_submit+0x19/0x1d
>  [<c049bf9a>] mpage_readpages+0x9b/0xa5
>  [<f7dd18c7>] ext3_readpages+0x0/0x15 [ext3]
>  [<c0462192>] __do_page_cache_readahead+0xea/0x154
>  [<f7dd2286>] ext3_get_block+0x0/0xbe [ext3]
>  [<c045d34d>] generic_file_aio_read+0x276/0x569
>  [<c047cdd9>] do_sync_read+0xbf/0xfe
>  [<c043a3f2>] getnstimeofday+0x51/0xdb
>  [<c0434d3c>] autoremove_wake_function+0x0/0x2d
>  [<c041bdc3>] sched_slice+0x61/0x6a
>  [<c0423114>] task_tick_fair+0x3d/0x60
>  [<c04c1d79>] security_file_permission+0xc/0xd
>  [<c047cd1a>] do_sync_read+0x0/0xfe
>  [<c047d35a>] vfs_read+0x6c/0x8b
>  [<c047d67e>] sys_read+0x3c/0x63
>  [<c0402fc1>] sysenter_do_call+0x12/0x21
>  [<c0630000>] schedule+0x551/0x830
> Code: 08 31 c9 89 da e8 77 fc ff ff 8b 86 bc 00 00 00 85 ff 89 43 38 8d 46 60 89 43 40 74 1d 83 c4 0c 89 d8 5b 5e 5f 5d e9 aa f9 ff ff <8b> 86 bc 00 00 00 89 43 38 8d 46 60 89 43 40 83 c4 0c 5b 5e 5f
> EIP: [<c04ebf8f>] io_ioq_move+0xf2/0x109 SS:ESP 0068:f6304c28
> 
> Changelog:
> 
> Make sure iogs in the hierarchy are built properly after moving a task to a new cgroup.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
>  block/cfq-iosched.c |    4 +++-
>  block/elevator-fq.c |    1 +
>  block/elevator-fq.h |    1 +
>  3 files changed, 5 insertions(+), 1 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 0ecf7c7..6d7bb8a 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -12,6 +12,8 @@
>  #include <linux/rbtree.h>
>  #include <linux/ioprio.h>
>  #include <linux/blktrace_api.h>
> +#include "elevator-fq.h"
> +

I think the above explicit inclusion of "elevator-fq.h" might be unnecessary,
as elevator.h already includes elevator-fq.h and cfq-iosched.c includes
elevator.h.

>  /*
>   * tunables
>   */
> @@ -1086,7 +1088,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
>  
>  	spin_lock_irqsave(q->queue_lock, flags);
>  
> -	iog = io_lookup_io_group_current(q);
> +	iog = io_get_io_group(q);

A one-line comment here explaining the need for get_io_group instead of
lookup_io_group would be nice.
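
Something along these lines would do (wording is only a suggestion, based on
the crash analysis above):

	/*
	 * Use io_get_io_group() rather than a plain lookup so that the
	 * io_group (and any missing ancestors) is created if the task has
	 * just been moved to a new cgroup; a bare lookup can return NULL
	 * here and io_ioq_move() would then crash.
	 */
	iog = io_get_io_group(q);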

Thanks
Vivek

>  
>  	if (async_cfqq != NULL) {
>  		__iog = cfqq_to_io_group(async_cfqq);
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index df53418..f81cf6a 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1191,6 +1191,7 @@ struct io_group *io_get_io_group(struct request_queue *q)
>  
>  	return iog;
>  }
> +EXPORT_SYMBOL(io_get_io_group);
>  
>  void io_free_root_group(struct elevator_queue *e)
>  {
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index fc4110d..f17e425 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -459,6 +459,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
>  }
>  
>  #ifdef CONFIG_GROUP_IOSCHED
> +extern struct io_group *io_get_io_group(struct request_queue *q);
>  extern int io_group_allow_merge(struct request *rq, struct bio *bio);
>  extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
>  					struct io_group *iog);
> -- 
> 1.5.4.rc3
> 
> 

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH] IO-Controller: Fix kernel panic after moving a task
@ 2009-04-16 19:15           ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-16 19:15 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

On Thu, Apr 16, 2009 at 01:25:35PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > +#ifdef CONFIG_IOSCHED_CFQ_HIER
> > +static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
> > +{
> > +	struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
> > +	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
> > +	struct cfq_data *cfqd = cic->key;
> > +	struct io_group *iog, *__iog;
> > +	unsigned long flags;
> > +	struct request_queue *q;
> > +
> > +	if (unlikely(!cfqd))
> > +		return;
> > +
> > +	q = cfqd->q;
> > +
> > +	spin_lock_irqsave(q->queue_lock, flags);
> > +
> > +	iog = io_lookup_io_group_current(q);
> > +
> 
>   Hi Vivek,
> 
>   I triggered another kernel panic when testing. When moving a task to another 
>   cgroup, the corresponding iog may not be setup properly all the time. "iog"
>   might be NULL here. io_ioq_move() receives a NULL iog, kernel crash.
> 
>   Consider the following piece of code:
> 
>  941 int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
>  942 {
>  943         struct elevator_queue *e = q->elevator;
>  944 
>  945         elv_fq_set_request_io_group(q, rq);
>  
>  -->task moving to a new group is happenning here.
> 
>  946 
>  947         /*
>  948          * Optimization for noop, deadline and AS which maintain only single
>  949          * ioq per io group
>  950          */
>  951         if (elv_iosched_single_ioq(e))
>  952                 return elv_fq_set_request_ioq(q, rq, gfp_mask);
>  953 
>  954         if (e->ops->elevator_set_req_fn)
>  955                 return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
> 
> cfq_set_request() will finally call io_ioq_move(), but the iog is NULL, beacause the iogs in the 
> hierarchy are not built yet. So kernel crashes.
> 
>  956 
>  957         rq->elevator_private = NULL;
>  958         return 0;
>  959 }
> 

Thanks Gui. Good catch. 

> BUG: unable to handle kernel NULL pointer dereference at 000000bc
> IP: [<c04ebf8f>] io_ioq_move+0xf2/0x109
> *pde = 6cc00067
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/block/hdb/queue/slice_idle
> Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbs sbshc battery ac lp snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm serio_raw snd_timer rtc_cmos parport_pc snd r8169 button rtc_core parport soundcore mii i2c_i801 rtc_lib snd_page_alloc pcspkr i2c_core dm_region_hash dm_log dm_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
> 
> Pid: 5431, comm: dd Not tainted (2.6.29-rc7-vivek #19) Veriton M460
> EIP: 0060:[<c04ebf8f>] EFLAGS: 00010046 CPU: 0
> EIP is at io_ioq_move+0xf2/0x109
> EAX: f6203a88 EBX: f6792c94 ECX: f6203a84 EDX: 00000006
> ESI: 00000000 EDI: 00000000 EBP: f6203a60 ESP: f6304c28
>  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process dd (pid: 5431, ti=f6304000 task=f669dae0 task.ti=f6304000)
> Stack:
>  f62478c0 0100dd40 f6247908 f62d995c 00000000 00000000 f675b54c c04e9182
>  f638e9b0 00000282 f62d99a4 f6325a2c c04e9113 f5a707c0 c04e7ae0 f675b000
>  f62d95fc f6325a2c c04e8501 00000010 f631e4e8 f675b000 00080000 ffffff10
> Call Trace:
>  [<c04e9182>] changed_cgroup+0x6f/0x8d
>  [<c04e9113>] changed_cgroup+0x0/0x8d
>  [<c04e7ae0>] __call_for_each_cic+0x1b/0x25
>  [<c04e8501>] cfq_set_request+0x158/0x2c7
>  [<c06316e6>] _spin_unlock_irqrestore+0x5/0x6
>  [<c04eb106>] elv_fq_set_request_io_group+0x2b/0x3e
>  [<c04e83a9>] cfq_set_request+0x0/0x2c7
>  [<c04dddcb>] elv_set_request+0x3e/0x4e
>  [<c04df3da>] get_request+0x1ed/0x29b
>  [<c04df9bb>] get_request_wait+0xdf/0xf2
>  [<c04dfd89>] __make_request+0x2c6/0x372
>  [<c049bd76>] do_mpage_readpage+0x4fe/0x5e3
>  [<c04deba5>] generic_make_request+0x2d0/0x355
>  [<c04dff47>] submit_bio+0x92/0x97
>  [<c045bfcb>] add_to_page_cache_locked+0x8a/0xb7
>  [<c049bfa4>] mpage_end_io_read+0x0/0x50
>  [<c049b1b6>] mpage_bio_submit+0x19/0x1d
>  [<c049bf9a>] mpage_readpages+0x9b/0xa5
>  [<f7dd18c7>] ext3_readpages+0x0/0x15 [ext3]
>  [<c0462192>] __do_page_cache_readahead+0xea/0x154
>  [<f7dd2286>] ext3_get_block+0x0/0xbe [ext3]
>  [<c045d34d>] generic_file_aio_read+0x276/0x569
>  [<c047cdd9>] do_sync_read+0xbf/0xfe
>  [<c043a3f2>] getnstimeofday+0x51/0xdb
>  [<c0434d3c>] autoremove_wake_function+0x0/0x2d
>  [<c041bdc3>] sched_slice+0x61/0x6a
>  [<c0423114>] task_tick_fair+0x3d/0x60
>  [<c04c1d79>] security_file_permission+0xc/0xd
>  [<c047cd1a>] do_sync_read+0x0/0xfe
>  [<c047d35a>] vfs_read+0x6c/0x8b
>  [<c047d67e>] sys_read+0x3c/0x63
>  [<c0402fc1>] sysenter_do_call+0x12/0x21
>  [<c0630000>] schedule+0x551/0x830
> Code: 08 31 c9 89 da e8 77 fc ff ff 8b 86 bc 00 00 00 85 ff 89 43 38 8d 46 60 89 43 40 74 1d 83 c4 0c 89 d8 5b 5e 5f 5d e9 aa f9 ff ff <8b> 86 bc 00 00 00 89 43 38 8d 46 60 89 43 40 83 c4 0c 5b 5e 5f
> EIP: [<c04ebf8f>] io_ioq_move+0xf2/0x109 SS:ESP 0068:f6304c28
> 
> Changelog:
> 
> Make sure iogs in the hierarchy are built properly after moving a task to a new cgroup.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/cfq-iosched.c |    4 +++-
>  block/elevator-fq.c |    1 +
>  block/elevator-fq.h |    1 +
>  3 files changed, 5 insertions(+), 1 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 0ecf7c7..6d7bb8a 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -12,6 +12,8 @@
>  #include <linux/rbtree.h>
>  #include <linux/ioprio.h>
>  #include <linux/blktrace_api.h>
> +#include "elevator-fq.h"
> +

I think the above explicit inclusion of "elevator-fq.h" might be unnecessary,
as elevator.h already includes elevator-fq.h and cfq-iosched.c includes
elevator.h.

>  /*
>   * tunables
>   */
> @@ -1086,7 +1088,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
>  
>  	spin_lock_irqsave(q->queue_lock, flags);
>  
> -	iog = io_lookup_io_group_current(q);
> +	iog = io_get_io_group(q);

A one-line comment here explaining the need for get_io_group instead of
lookup_io_group would be nice.

Thanks
Vivek

>  
>  	if (async_cfqq != NULL) {
>  		__iog = cfqq_to_io_group(async_cfqq);
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index df53418..f81cf6a 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1191,6 +1191,7 @@ struct io_group *io_get_io_group(struct request_queue *q)
>  
>  	return iog;
>  }
> +EXPORT_SYMBOL(io_get_io_group);
>  
>  void io_free_root_group(struct elevator_queue *e)
>  {
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index fc4110d..f17e425 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -459,6 +459,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
>  }
>  
>  #ifdef CONFIG_GROUP_IOSCHED
> +extern struct io_group *io_get_io_group(struct request_queue *q);
>  extern int io_group_allow_merge(struct request *rq, struct bio *bio);
>  extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
>  					struct io_group *iog);
> -- 
> 1.5.4.rc3
> 
> 

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]                     ` <20090416183753.GE8896-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-04-17  5:35                       ` Dhaval Giani
  2009-04-17  9:37                       ` Andrea Righi
  1 sibling, 0 replies; 190+ messages in thread
From: Dhaval Giani @ 2009-04-17  5:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Andrew Morton,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> On Wed, Apr 08, 2009 at 10:37:59PM +0200, Andrea Righi wrote:
> 
> [..]
> > > 
> > > - I can think of atleast one usage of uppper limit controller where we
> > >   might have spare IO resources still we don't want to give it to a
> > >   cgroup because customer has not paid for that kind of service level. In
> > >   those cases we need to implement uppper limit also.
> > > 
> > >   May be prportional weight and max bw controller can co-exist depending
> > >   on what user's requirements are.
> > >  
> > >   If yes, then can't this control be done at the same layer/level where
> > >   proportional weight control is being done? IOW, this set of patches is
> > >   trying to do prportional weight control at IO scheduler level. I think
> > >   we should be able to store another max rate as another feature in 
> > >   cgroup (apart from weight) and not dispatch requests from the queue if
> > >   we have exceeded the max BW as specified by the user?
> > 
> > The more I think about a "perfect" solution (at least for my
> > requirements), the more I'm convinced that we need both functionalities.
> > 

The hard limits vs. work conserving argument again :). I agree, we need
both functionalities. I think the aim should first be to get the
proportional weight functionality and then look at doing hard limits.

[..]

> > > 
> > > - Have you thought of doing hierarchical control? 
> > > 
> > 
> > Providing hiearchies in cgroups is in general expensive, deeper
> > hierarchies imply checking all the way up to the root cgroup, so I think
> > we need to be very careful and be aware of the trade-offs before
> > providing such feature. For this particular case (IO controller)
> > wouldn't it be simpler and more efficient to just ignore hierarchies in
> > the kernel and opportunely handle them in userspace? for absolute
> > limiting rules this isn't difficult at all, just imagine a config file
> > and a script or a deamon that dynamically create the opportune cgroups
> > and configure them accordingly to what is defined in the configuration
> > file.
> > 
> > I think we can simply define hierarchical dependencies in the
> > configuration file, translate them in absolute values and use the
> > absolute values to configure the cgroups' properties.
> > 
> > For example, we can just check that the BW allocated for a particular
> > parent cgroup is not greater than the total BW allocated for the
> > children. And for each child just use the min(parent_BW, BW) or equally
> > divide the parent's BW among the children, etc.
> 
> IIUC, you are saying that allow hiearchy in user space and then flatten it
> out and pass it to kernel?
> 
> Hmm.., agree that handling hierarchies is hard and expensive. But at the
> same time rest of the controllers like cpu and memory are handling it in
> kernel so it probably makes sense to keep the IO controller also in line.
> 
> In practice I am not expecting deep hiearchices. May be 2- 3 levels would
> be good for most of the people.
> 

FWIW, even in the CPU controller having deep hierarchies is not a good idea.
I think this can be documented for the IO controller as well. Beyond that,
we realized that having a proportional system and doing it in userspace
is not a good idea. It would require a lot of calculations depending
on the system load. (Because the sub-group should be treated just the same
as a process in the parent group.) Having the hierarchy in the kernel just
makes it much easier and much more accurate.

> > 
> > > - What happens to the notion of CFQ task classes and task priority. Looks
> > >   like max bw rule supercede everything. There is no way that an RT task
> > >   get unlimited amount of disk BW even if it wants to? (There is no notion
> > >   of RT cgroup etc)
> > 
> > What about moving all the RT tasks in a separate cgroup with unlimited
> > BW?
> 
> Hmm.., I think that should work. I have yet to look at your patches in
> detail but it looks like unlimited BW group will not be throttled at all
> hence RT tasks can just go right through without getting impacted.
> 

This is where the cpu scheduler design helped a lot :). Having different
classes for different types of processes allowed us to handle them
separately.

thanks,
-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-04-16 18:37                     ` Vivek Goyal
  (?)
@ 2009-04-17  5:35                     ` Dhaval Giani
       [not found]                       ` <20090417053517.GC26437-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  -1 siblings, 1 reply; 190+ messages in thread
From: Dhaval Giani @ 2009-04-17  5:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, balbir, linux-kernel,
	containers, menage, peterz

On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> On Wed, Apr 08, 2009 at 10:37:59PM +0200, Andrea Righi wrote:
> 
> [..]
> > > 
> > > - I can think of atleast one usage of uppper limit controller where we
> > >   might have spare IO resources still we don't want to give it to a
> > >   cgroup because customer has not paid for that kind of service level. In
> > >   those cases we need to implement uppper limit also.
> > > 
> > >   May be prportional weight and max bw controller can co-exist depending
> > >   on what user's requirements are.
> > >  
> > >   If yes, then can't this control be done at the same layer/level where
> > >   proportional weight control is being done? IOW, this set of patches is
> > >   trying to do prportional weight control at IO scheduler level. I think
> > >   we should be able to store another max rate as another feature in 
> > >   cgroup (apart from weight) and not dispatch requests from the queue if
> > >   we have exceeded the max BW as specified by the user?
> > 
> > The more I think about a "perfect" solution (at least for my
> > requirements), the more I'm convinced that we need both functionalities.
> > 

The hard limits vs. work-conserving argument again :). I agree, we need
both functionalities. I think the aim should first be to get the
proportional weight functionality in and then look at doing hard limits.

[..]

> > > 
> > > - Have you thought of doing hierarchical control? 
> > > 
> > 
> > Providing hiearchies in cgroups is in general expensive, deeper
> > hierarchies imply checking all the way up to the root cgroup, so I think
> > we need to be very careful and be aware of the trade-offs before
> > providing such feature. For this particular case (IO controller)
> > wouldn't it be simpler and more efficient to just ignore hierarchies in
> > the kernel and opportunely handle them in userspace? for absolute
> > limiting rules this isn't difficult at all, just imagine a config file
> > and a script or a deamon that dynamically create the opportune cgroups
> > and configure them accordingly to what is defined in the configuration
> > file.
> > 
> > I think we can simply define hierarchical dependencies in the
> > configuration file, translate them in absolute values and use the
> > absolute values to configure the cgroups' properties.
> > 
> > For example, we can just check that the BW allocated for a particular
> > parent cgroup is not greater than the total BW allocated for the
> > children. And for each child just use the min(parent_BW, BW) or equally
> > divide the parent's BW among the children, etc.
> 
> IIUC, you are saying that allow hiearchy in user space and then flatten it
> out and pass it to kernel?
> 
> Hmm.., agree that handling hierarchies is hard and expensive. But at the
> same time rest of the controllers like cpu and memory are handling it in
> kernel so it probably makes sense to keep the IO controller also in line.
> 
> In practice I am not expecting deep hiearchices. May be 2- 3 levels would
> be good for most of the people.
> 

FWIW, even in the CPU controller deep hierarchies are not a good idea.
I think this can be documented for the IO controller as well. Beyond that,
we realized that implementing a proportional system in userspace is not a
good idea. It would require a lot of recalculation depending on the system
load (because a sub-group should behave just like a process in the parent
group). Handling the hierarchy in the kernel makes it much easier and much
more accurate.

> > 
> > > - What happens to the notion of CFQ task classes and task priority. Looks
> > >   like max bw rule supercede everything. There is no way that an RT task
> > >   get unlimited amount of disk BW even if it wants to? (There is no notion
> > >   of RT cgroup etc)
> > 
> > What about moving all the RT tasks in a separate cgroup with unlimited
> > BW?
> 
> Hmm.., I think that should work. I have yet to look at your patches in
> detail but it looks like unlimited BW group will not be throttled at all
> hence RT tasks can just go right through without getting impacted.
> 

This is where the CPU scheduler design helped a lot :). Having different
classes for different types of processes allowed us to handle them
separately.

thanks,
-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]                     ` <20090416183753.GE8896-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-04-17  5:35                       ` [PATCH 01/10] Documentation Dhaval Giani
@ 2009-04-17  9:37                       ` Andrea Righi
  1 sibling, 0 replies; 190+ messages in thread
From: Andrea Righi @ 2009-04-17  9:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> > I think it would be possible to implement both proportional and limiting
> > rules at the same level (e.g., the IO scheduler), but we need also to
> > address the memory consumption problem (I still need to review your
> > patchset in details and I'm going to test it soon :), so I don't know if
> > you already addressed this issue).
> > 
> 
> Can you please elaborate a bit on this? Are you concerned about that data
> structures created to solve the problem consume a lot of memory?

Sorry, I was not very clear here. By memory consumption I mean wasting
memory on dirty pages that are hard/slow to reclaim, or on pending IO
requests.

If there's only a global limit on dirty pages, any cgroup can exhaust
that limit and cause other cgroups/processes to block when they try to
write to disk.

But, OK, the IO controller is probably not the best place to implement
such functionality. I should rework the per-cgroup dirty_ratio patches:

https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html

Last time we focused too much on the best interface for defining the dirty
pages limit, and I never re-posted an updated version of that patchset.
Now I think we can simply provide the same dirty_ratio/dirty_bytes
interface that we provide globally, but per cgroup.
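
Purely as a sketch of what such a per-cgroup interface could look like (the
"memory.dirty_ratio" file name and the cgroup mount point below are my own
assumptions, mirroring the global /proc/sys/vm/dirty_ratio knob; no such
interface exists in the posted patches):

/*
 * Hypothetical example: set a per-cgroup dirty ratio by writing to a
 * cgroup control file.  Both the mount point and the file name are
 * assumptions for illustration only.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static int set_cgroup_dirty_ratio(const char *cgroup_path, int percent)
{
	char path[256], buf[16];
	int fd, len;

	snprintf(path, sizeof(path), "%s/memory.dirty_ratio", cgroup_path);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	len = snprintf(buf, sizeof(buf), "%d", percent);
	if (write(fd, buf, len) != len) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/* limit the "backup" cgroup to dirtying at most 5% of memory */
	return set_cgroup_dirty_ratio("/cgroups/backup", 5);
}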

> 
> > IOW if we simply don't dispatch requests and we don't throttle the tasks
> > in the cgroup that exceeds its limit, how do we avoid the waste of
> > memory due to the succeeding IO requests and the increasingly dirty
> > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> > but I think we talked about this problem in a previous email... sorry I
> > don't find the discussion in my mail archives.
> > 
> > IMHO a nice approach would be to measure IO consumption at the IO
> > scheduler level, and control IO applying proportional weights / absolute
> > limits _both_ at the IO scheduler / elevator level _and_ at the same
> > time block the tasks from dirtying memory that will generate additional
> > IO requests.
> > 
> > Anyway, there's no need to provide this with a single IO controller, we
> > could split the problem in two parts: 1) provide a proportional /
> > absolute IO controller in the IO schedulers and 2) allow to set, for
> > example, a maximum limit of dirty pages for each cgroup.
> > 
> 
> I think setting a maximum limit on dirty pages is an interesting thought.
> It sounds like as if memory controller can handle it?

Exactly, the same as above.

> 
> I guess currently memory controller puts limit on total amount of memory
> consumed by cgroup and there are no knobs on type of memory consumed. So
> if one can limit amount of dirty page cache memory per cgroup, it
> automatically throttles the aysnc writes at the input itself.
>  
> So I agree that if we can limit the process from dirtying too much of
> memory than IO scheduler level controller should be able to do both
> proportional weight and max bw controller.
> 
> Currently doing proportional weight control for async writes is very
> tricky. I am not seeing constantly backlogged traffic at IO scheudler
> level and hence two different weight processes seem to be getting same
> BW.
> 
> I will dive deeper into the patches on dm-ioband to see how they have
> solved this issue. Looks like they are just waiting longer for slowest
> group to consume its tokens and that will keep the disk idle. Extended
> delays might now show up immediately as performance hog, because it might
> also promote increased merging but it should lead to increased latency of
> response. And proving latency issues is hard. :-)   
> 
> > Maybe I'm just repeating what we already said in a previous
> > discussion... in this case sorry for the duplicate thoughts. :)
> > 
> > > 
> > > - Have you thought of doing hierarchical control? 
> > > 
> > 
> > Providing hiearchies in cgroups is in general expensive, deeper
> > hierarchies imply checking all the way up to the root cgroup, so I think
> > we need to be very careful and be aware of the trade-offs before
> > providing such feature. For this particular case (IO controller)
> > wouldn't it be simpler and more efficient to just ignore hierarchies in
> > the kernel and opportunely handle them in userspace? for absolute
> > limiting rules this isn't difficult at all, just imagine a config file
> > and a script or a deamon that dynamically create the opportune cgroups
> > and configure them accordingly to what is defined in the configuration
> > file.
> > 
> > I think we can simply define hierarchical dependencies in the
> > configuration file, translate them in absolute values and use the
> > absolute values to configure the cgroups' properties.
> > 
> > For example, we can just check that the BW allocated for a particular
> > parent cgroup is not greater than the total BW allocated for the
> > children. And for each child just use the min(parent_BW, BW) or equally
> > divide the parent's BW among the children, etc.
> 
> IIUC, you are saying that allow hiearchy in user space and then flatten it
> out and pass it to kernel?
> 
> Hmm.., agree that handling hierarchies is hard and expensive. But at the
> same time rest of the controllers like cpu and memory are handling it in
> kernel so it probably makes sense to keep the IO controller also in line.
> 
> In practice I am not expecting deep hiearchices. May be 2- 3 levels would
> be good for most of the people.
> 
> > 
> > > - What happens to the notion of CFQ task classes and task priority. Looks
> > >   like max bw rule supercede everything. There is no way that an RT task
> > >   get unlimited amount of disk BW even if it wants to? (There is no notion
> > >   of RT cgroup etc)
> > 
> > What about moving all the RT tasks in a separate cgroup with unlimited
> > BW?
> 
> Hmm.., I think that should work. I have yet to look at your patches in
> detail but it looks like unlimited BW group will not be throttled at all
> hence RT tasks can just go right through without getting impacted.

Correct.

> 
> > 
> > > 
> > > > > 
> > > > >   Above requirement can create configuration problems.
> > > > > 
> > > > > 	- If there are large number of disks in system, per cgroup one shall
> > > > > 	  have to create rules for each disk. Until and unless admin knows
> > > > > 	  what applications are in which cgroup and strictly what disk
> > > > > 	  these applications do IO to and create rules for only those
> > > > >  	  disks.
> > > > 
> > > > I don't think this is a huge problem anyway. IMHO a userspace tool, e.g.
> > > > a script, would be able to efficiently create/modify rules parsing user
> > > > defined rules in some human-readable form (config files, etc.), even in
> > > > presence of hundreds of disk. The same is valid for dm-ioband I think.
> > > > 
> > > > > 
> > > > > 	- I think problem gets compounded if there is a hierarchy of
> > > > > 	  logical devices. I think in that case one shall have to create
> > > > > 	  rules for logical devices and not actual physical devices.
> > > > 
> > > > With logical devices you mean device-mapper devices (i.e. LVM, software
> > > > RAID, etc.)? or do you mean that we need to introduce the concept of
> > > > "logical device" to easily (quickly) configure IO requirements and then
> > > > map those logical devices to the actual physical devices? In this case I
> > > > think this can be addressed in userspace. Or maybe I'm totally missing
> > > > the point here.
> > > 
> > > Yes, I meant LVM, Software RAID etc. So if I have got many disks in the system
> > > and I have created software raid on some of them, I need to create rules for
> > > lvm devices or physical devices behind those lvm devices? I am assuming
> > > that it will be logical devices.
> > > 
> > > So I need to know exactly to what all devices applications in a particular
> > > cgroup is going to do IO, and also know exactly how many cgroups are
> > > contending for that cgroup, and also know what worst case disk rate I can
> > > expect from that device and then I can do a good job of giving a
> > > reasonable value to the max rate of that cgroup on a particular device?
> > 
> > ok, I understand. For these cases dm-ioband perfectly addresses the
> > problem. For the general case, I think the only solution is to provide a
> > common interface that each dm subsystem must call to account IO and
> > apply limiting and proportional rules.
> > 
> > > 
> > > > 
> > > > > 
> > > > > - Because it is not proportional weight distribution, if some
> > > > >   cgroup is not using its planned BW, other group sharing the
> > > > >   disk can not make use of spare BW.  
> > > > > 	
> > > > 
> > > > Right.
> > > > 
> > > > > - I think one should know in advance the throughput rate of underlying media
> > > > >   and also know competing applications so that one can statically define
> > > > >   the BW assigned to each cgroup on each disk.
> > > > > 
> > > > >   This will be difficult. Effective BW extracted out of a rotational media
> > > > >   is dependent on the seek pattern so one shall have to either try to make
> > > > >   some conservative estimates and try to divide BW (we will not utilize disk
> > > > >   fully) or take some peak numbers and divide BW (cgroup might not get the
> > > > >   maximum rate configured).
> > > > 
> > > > Correct. I think the proportional weight approach is the only solution
> > > > to efficiently use the whole BW. OTOH absolute limiting rules offer a
> > > > better control over QoS, because you can totally remove performance
> > > > bursts/peaks that could break QoS requirements for short periods of
> > > > time.
> > > 
> > > Can you please give little more details here regarding how QoS requirements
> > > are not met with proportional weight?
> > 
> > With proportional weights the whole bandwidth is allocated if no one
> > else is using it. When IO is submitted other tasks with a higher weight
> > can be forced to sleep until the IO generated by the low weight tasks is
> > not completely dispatched. Or any extent of the priority inversion
> > problems.
> 
> Hmm..., I am not very sure here. When admin is allocating the weights, he
> has the whole picture. He knows how many groups are conteding for the disk
> and what could be the worst case scenario. So if I have got two groups
> with A and B with weight 1 and 2 and both are contending, then as an 
> admin one would expect to get 33% of BW for group A in worst case (if
> group B is continuously backlogged). If B is not contending than A can get
> 100% of BW. So while configuring the system, will one not plan for worst
> case (33% for A, and 66 % for B)?

OK, I'm quite convinced.. :)

To a large degree, if we want to provide a BW reservation strategy we
must provide an interface that allows cgroups to ask for time slices,
such as max/min 5 IO requests every 50ms or something like that.
Probably the same functionality can be achieved by translating weights,
percentages or absolute BW limits into time slices.
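
To make the planning arithmetic concrete, here is a tiny sketch (mine, not
from the patches) of the worst-case/best-case calculation for the A/B
example quoted above: with proportional weights a group's worst-case share
is weight/total_weight when every group is backlogged, and 100% when it is
alone on the disk.

/*
 * Sketch of the admin's planning arithmetic under proportional weights.
 */
#include <stdio.h>

int main(void)
{
	int weights[] = { 1, 2 };	/* groups A and B from the example */
	int n = sizeof(weights) / sizeof(weights[0]);
	int total = 0, i;

	for (i = 0; i < n; i++)
		total += weights[i];

	for (i = 0; i < n; i++)
		printf("group %c: worst case %.0f%% of BW, best case 100%%\n",
		       'A' + i, 100.0 * weights[i] / total);
	return 0;
}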

>   
> > 
> > Maybe it's not an issue at all for the most part of the cases, but using
> > a solution that is able to provide also a real partitioning of the
> > available resources can be profitely used by those who need to guarantee
> > _strict_ BW requirements (soft real-time, maximize the responsiveness of
> > certain services, etc.), because in this case we're sure that a certain
> > amount of "spare" BW will be always available when needed by some
> > "critical" services.
> > 
> 
> Will the same thing not happen in proportional weight? If it is an RT
> application, one can put it in RT groups to make sure it always gets
> the BW first even if there is contention. 
> 
> Even in regular group, the moment you issue the IO and IO scheduler sees
> it, you will start getting your reserved share according to your weight.
> 
> How it will be different in the case of io throttling? Even if I don't
> utilize the disk fully, cfq will still put the new guy in the queue and
> then try to give its share (based on prio).
> 
> Are you saying that by keeping disk relatively free, the latency of
> response for soft real time application will become better? In that
> case can't one simply underprovision the disk?
> 
> But having said that I am not disputing the need of max BW controller
> as some people have expressed the need of a constant BW view and don't
> want too big a fluctuations even if BW is available. Max BW controller
> can't gurantee the minumum BW hence can't avoid the fluctuations
> completely, but it can still help in smoothing the traffic because
> other competitiors will be stopped from doing too much of IO.

Agree.

-Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
  2009-04-16 18:37                     ` Vivek Goyal
                                       ` (2 preceding siblings ...)
  (?)
@ 2009-04-17  9:37                     ` Andrea Righi
  2009-04-17 14:13                       ` IO controller discussion (Was: Re: [PATCH 01/10] Documentation) Vivek Goyal
  2009-04-17 14:13                       ` Vivek Goyal
  -1 siblings, 2 replies; 190+ messages in thread
From: Andrea Righi @ 2009-04-17  9:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz

On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> > I think it would be possible to implement both proportional and limiting
> > rules at the same level (e.g., the IO scheduler), but we need also to
> > address the memory consumption problem (I still need to review your
> > patchset in details and I'm going to test it soon :), so I don't know if
> > you already addressed this issue).
> > 
> 
> Can you please elaborate a bit on this? Are you concerned about that data
> structures created to solve the problem consume a lot of memory?

Sorry, I was not very clear here. By memory consumption I mean wasting
memory on dirty pages that are hard/slow to reclaim, or on pending IO
requests.

If there's only a global limit on dirty pages, any cgroup can exhaust
that limit and cause other cgroups/processes to block when they try to
write to disk.

But, OK, the IO controller is probably not the best place to implement
such functionality. I should rework the per-cgroup dirty_ratio patches:

https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html

Last time we focused too much on the best interface for defining the dirty
pages limit, and I never re-posted an updated version of that patchset.
Now I think we can simply provide the same dirty_ratio/dirty_bytes
interface that we provide globally, but per cgroup.

> 
> > IOW if we simply don't dispatch requests and we don't throttle the tasks
> > in the cgroup that exceeds its limit, how do we avoid the waste of
> > memory due to the succeeding IO requests and the increasingly dirty
> > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> > but I think we talked about this problem in a previous email... sorry I
> > don't find the discussion in my mail archives.
> > 
> > IMHO a nice approach would be to measure IO consumption at the IO
> > scheduler level, and control IO applying proportional weights / absolute
> > limits _both_ at the IO scheduler / elevator level _and_ at the same
> > time block the tasks from dirtying memory that will generate additional
> > IO requests.
> > 
> > Anyway, there's no need to provide this with a single IO controller, we
> > could split the problem in two parts: 1) provide a proportional /
> > absolute IO controller in the IO schedulers and 2) allow to set, for
> > example, a maximum limit of dirty pages for each cgroup.
> > 
> 
> I think setting a maximum limit on dirty pages is an interesting thought.
> It sounds like as if memory controller can handle it?

Exactly, the same as above.

> 
> I guess currently memory controller puts limit on total amount of memory
> consumed by cgroup and there are no knobs on type of memory consumed. So
> if one can limit amount of dirty page cache memory per cgroup, it
> automatically throttles the aysnc writes at the input itself.
>  
> So I agree that if we can limit the process from dirtying too much of
> memory than IO scheduler level controller should be able to do both
> proportional weight and max bw controller.
> 
> Currently doing proportional weight control for async writes is very
> tricky. I am not seeing constantly backlogged traffic at IO scheudler
> level and hence two different weight processes seem to be getting same
> BW.
> 
> I will dive deeper into the patches on dm-ioband to see how they have
> solved this issue. Looks like they are just waiting longer for slowest
> group to consume its tokens and that will keep the disk idle. Extended
> delays might now show up immediately as performance hog, because it might
> also promote increased merging but it should lead to increased latency of
> response. And proving latency issues is hard. :-)   
> 
> > Maybe I'm just repeating what we already said in a previous
> > discussion... in this case sorry for the duplicate thoughts. :)
> > 
> > > 
> > > - Have you thought of doing hierarchical control? 
> > > 
> > 
> > Providing hiearchies in cgroups is in general expensive, deeper
> > hierarchies imply checking all the way up to the root cgroup, so I think
> > we need to be very careful and be aware of the trade-offs before
> > providing such feature. For this particular case (IO controller)
> > wouldn't it be simpler and more efficient to just ignore hierarchies in
> > the kernel and opportunely handle them in userspace? for absolute
> > limiting rules this isn't difficult at all, just imagine a config file
> > and a script or a deamon that dynamically create the opportune cgroups
> > and configure them accordingly to what is defined in the configuration
> > file.
> > 
> > I think we can simply define hierarchical dependencies in the
> > configuration file, translate them in absolute values and use the
> > absolute values to configure the cgroups' properties.
> > 
> > For example, we can just check that the BW allocated for a particular
> > parent cgroup is not greater than the total BW allocated for the
> > children. And for each child just use the min(parent_BW, BW) or equally
> > divide the parent's BW among the children, etc.
> 
> IIUC, you are saying that allow hiearchy in user space and then flatten it
> out and pass it to kernel?
> 
> Hmm.., agree that handling hierarchies is hard and expensive. But at the
> same time rest of the controllers like cpu and memory are handling it in
> kernel so it probably makes sense to keep the IO controller also in line.
> 
> In practice I am not expecting deep hiearchices. May be 2- 3 levels would
> be good for most of the people.
> 
> > 
> > > - What happens to the notion of CFQ task classes and task priority. Looks
> > >   like max bw rule supercede everything. There is no way that an RT task
> > >   get unlimited amount of disk BW even if it wants to? (There is no notion
> > >   of RT cgroup etc)
> > 
> > What about moving all the RT tasks in a separate cgroup with unlimited
> > BW?
> 
> Hmm.., I think that should work. I have yet to look at your patches in
> detail but it looks like unlimited BW group will not be throttled at all
> hence RT tasks can just go right through without getting impacted.

Correct.

> 
> > 
> > > 
> > > > > 
> > > > >   Above requirement can create configuration problems.
> > > > > 
> > > > > 	- If there are large number of disks in system, per cgroup one shall
> > > > > 	  have to create rules for each disk. Until and unless admin knows
> > > > > 	  what applications are in which cgroup and strictly what disk
> > > > > 	  these applications do IO to and create rules for only those
> > > > >  	  disks.
> > > > 
> > > > I don't think this is a huge problem anyway. IMHO a userspace tool, e.g.
> > > > a script, would be able to efficiently create/modify rules parsing user
> > > > defined rules in some human-readable form (config files, etc.), even in
> > > > presence of hundreds of disk. The same is valid for dm-ioband I think.
> > > > 
> > > > > 
> > > > > 	- I think problem gets compounded if there is a hierarchy of
> > > > > 	  logical devices. I think in that case one shall have to create
> > > > > 	  rules for logical devices and not actual physical devices.
> > > > 
> > > > With logical devices you mean device-mapper devices (i.e. LVM, software
> > > > RAID, etc.)? or do you mean that we need to introduce the concept of
> > > > "logical device" to easily (quickly) configure IO requirements and then
> > > > map those logical devices to the actual physical devices? In this case I
> > > > think this can be addressed in userspace. Or maybe I'm totally missing
> > > > the point here.
> > > 
> > > Yes, I meant LVM, Software RAID etc. So if I have got many disks in the system
> > > and I have created software raid on some of them, I need to create rules for
> > > lvm devices or physical devices behind those lvm devices? I am assuming
> > > that it will be logical devices.
> > > 
> > > So I need to know exactly to what all devices applications in a particular
> > > cgroup is going to do IO, and also know exactly how many cgroups are
> > > contending for that cgroup, and also know what worst case disk rate I can
> > > expect from that device and then I can do a good job of giving a
> > > reasonable value to the max rate of that cgroup on a particular device?
> > 
> > ok, I understand. For these cases dm-ioband perfectly addresses the
> > problem. For the general case, I think the only solution is to provide a
> > common interface that each dm subsystem must call to account IO and
> > apply limiting and proportional rules.
> > 
> > > 
> > > > 
> > > > > 
> > > > > - Because it is not proportional weight distribution, if some
> > > > >   cgroup is not using its planned BW, other group sharing the
> > > > >   disk can not make use of spare BW.  
> > > > > 	
> > > > 
> > > > Right.
> > > > 
> > > > > - I think one should know in advance the throughput rate of underlying media
> > > > >   and also know competing applications so that one can statically define
> > > > >   the BW assigned to each cgroup on each disk.
> > > > > 
> > > > >   This will be difficult. Effective BW extracted out of a rotational media
> > > > >   is dependent on the seek pattern so one shall have to either try to make
> > > > >   some conservative estimates and try to divide BW (we will not utilize disk
> > > > >   fully) or take some peak numbers and divide BW (cgroup might not get the
> > > > >   maximum rate configured).
> > > > 
> > > > Correct. I think the proportional weight approach is the only solution
> > > > to efficiently use the whole BW. OTOH absolute limiting rules offer a
> > > > better control over QoS, because you can totally remove performance
> > > > bursts/peaks that could break QoS requirements for short periods of
> > > > time.
> > > 
> > > Can you please give little more details here regarding how QoS requirements
> > > are not met with proportional weight?
> > 
> > With proportional weights the whole bandwidth is allocated if no one
> > else is using it. When IO is submitted other tasks with a higher weight
> > can be forced to sleep until the IO generated by the low weight tasks is
> > not completely dispatched. Or any extent of the priority inversion
> > problems.
> 
> Hmm..., I am not very sure here. When admin is allocating the weights, he
> has the whole picture. He knows how many groups are conteding for the disk
> and what could be the worst case scenario. So if I have got two groups
> with A and B with weight 1 and 2 and both are contending, then as an 
> admin one would expect to get 33% of BW for group A in worst case (if
> group B is continuously backlogged). If B is not contending than A can get
> 100% of BW. So while configuring the system, will one not plan for worst
> case (33% for A, and 66 % for B)?

OK, I'm quite convinced.. :)

To a large degree, if we want to provide a BW reservation strategy we
must provide an interface that allows cgroups to ask for time slices,
such as max/min 5 IO requests every 50ms or something like that.
Probably the same functionality can be achieved by translating weights,
percentages or absolute BW limits into time slices.

>   
> > 
> > Maybe it's not an issue at all for the most part of the cases, but using
> > a solution that is able to provide also a real partitioning of the
> > available resources can be profitely used by those who need to guarantee
> > _strict_ BW requirements (soft real-time, maximize the responsiveness of
> > certain services, etc.), because in this case we're sure that a certain
> > amount of "spare" BW will be always available when needed by some
> > "critical" services.
> > 
> 
> Will the same thing not happen in proportional weight? If it is an RT
> application, one can put it in RT groups to make sure it always gets
> the BW first even if there is contention. 
> 
> Even in regular group, the moment you issue the IO and IO scheduler sees
> it, you will start getting your reserved share according to your weight.
> 
> How it will be different in the case of io throttling? Even if I don't
> utilize the disk fully, cfq will still put the new guy in the queue and
> then try to give its share (based on prio).
> 
> Are you saying that by keeping disk relatively free, the latency of
> response for soft real time application will become better? In that
> case can't one simply underprovision the disk?
> 
> But having said that I am not disputing the need of max BW controller
> as some people have expressed the need of a constant BW view and don't
> want too big a fluctuations even if BW is available. Max BW controller
> can't gurantee the minumum BW hence can't avoid the fluctuations
> completely, but it can still help in smoothing the traffic because
> other competitiors will be stopped from doing too much of IO.

Agree.

-Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* IO Controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-17  5:35                     ` Dhaval Giani
@ 2009-04-17 13:49                           ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-17 13:49 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Andrew Morton,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Fri, Apr 17, 2009 at 11:05:17AM +0530, Dhaval Giani wrote:
> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> > On Wed, Apr 08, 2009 at 10:37:59PM +0200, Andrea Righi wrote:
> > 
> > [..]
> > > > 
> > > > - I can think of atleast one usage of uppper limit controller where we
> > > >   might have spare IO resources still we don't want to give it to a
> > > >   cgroup because customer has not paid for that kind of service level. In
> > > >   those cases we need to implement uppper limit also.
> > > > 
> > > >   May be prportional weight and max bw controller can co-exist depending
> > > >   on what user's requirements are.
> > > >  
> > > >   If yes, then can't this control be done at the same layer/level where
> > > >   proportional weight control is being done? IOW, this set of patches is
> > > >   trying to do prportional weight control at IO scheduler level. I think
> > > >   we should be able to store another max rate as another feature in 
> > > >   cgroup (apart from weight) and not dispatch requests from the queue if
> > > >   we have exceeded the max BW as specified by the user?
> > > 
> > > The more I think about a "perfect" solution (at least for my
> > > requirements), the more I'm convinced that we need both functionalities.
> > > 
> 
> hard limits vs work conserving argument again :). I agree, we need
> both of the functionalities. I think first the aim should be to get the
> proportional weight functionality and then look at doing hard limits.
> 

Agreed.

> [..]
> 
> > > > 
> > > > - Have you thought of doing hierarchical control? 
> > > > 
> > > 
> > > Providing hiearchies in cgroups is in general expensive, deeper
> > > hierarchies imply checking all the way up to the root cgroup, so I think
> > > we need to be very careful and be aware of the trade-offs before
> > > providing such feature. For this particular case (IO controller)
> > > wouldn't it be simpler and more efficient to just ignore hierarchies in
> > > the kernel and opportunely handle them in userspace? for absolute
> > > limiting rules this isn't difficult at all, just imagine a config file
> > > and a script or a deamon that dynamically create the opportune cgroups
> > > and configure them accordingly to what is defined in the configuration
> > > file.
> > > 
> > > I think we can simply define hierarchical dependencies in the
> > > configuration file, translate them in absolute values and use the
> > > absolute values to configure the cgroups' properties.
> > > 
> > > For example, we can just check that the BW allocated for a particular
> > > parent cgroup is not greater than the total BW allocated for the
> > > children. And for each child just use the min(parent_BW, BW) or equally
> > > divide the parent's BW among the children, etc.
> > 
> > IIUC, you are saying that allow hiearchy in user space and then flatten it
> > out and pass it to kernel?
> > 
> > Hmm.., agree that handling hierarchies is hard and expensive. But at the
> > same time rest of the controllers like cpu and memory are handling it in
> > kernel so it probably makes sense to keep the IO controller also in line.
> > 
> > In practice I am not expecting deep hiearchices. May be 2- 3 levels would
> > be good for most of the people.
> > 
> 
> FWIW, even in the CPU controller having deep hierarchies is not a good idea.
> I think this can be documented for IO Controller as well. Beyond that,
> we realized that having a proportional system and doing it in userspace
> is not a good idea. It would require a lot of calculations dependending
> on the system load. (Because, the sub-group should be just the same as a
> process in the parent group). Having hierarchy in the kernel just makes it way
> more easier and way more accurate.

Agreed. I would prefer to keep hierarchical support in the kernel, in
line with the other controllers.

> 
> > > 
> > > > - What happens to the notion of CFQ task classes and task priority. Looks
> > > >   like max bw rule supercede everything. There is no way that an RT task
> > > >   get unlimited amount of disk BW even if it wants to? (There is no notion
> > > >   of RT cgroup etc)
> > > 
> > > What about moving all the RT tasks in a separate cgroup with unlimited
> > > BW?
> > 
> > Hmm.., I think that should work. I have yet to look at your patches in
> > detail but it looks like unlimited BW group will not be throttled at all
> > hence RT tasks can just go right through without getting impacted.
> > 
> 
> This is where the cpu scheduler design helped a lot :). Having different
> classes for differnet types of processes allowed us to handle them
> separately.

In the common layer scheduling approach we do have separate classes (RT,
BE and IDLE) and scheduling is done accordingly. The code is primarily
taken from BFQ and CFQ.

dm-ioband has no notion of separate classes and everything is treated at
the same level, which is a problem: the end-level IO scheduler loses its
ability to differentiate if we mix things up above it.
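
For readers not familiar with the class notion, a simplified sketch of what
"scheduling done accordingly" means (this mirrors the idea only, not the
actual BFQ/CFQ code; all names below are made up): RT queues are always
served before BE queues, and IDLE queues run only when nothing else is
pending, which is exactly the differentiation a flat model loses.

/*
 * Simplified class-based selection: strict priority between classes,
 * fairness within a class omitted for brevity.
 */
#include <stddef.h>

enum io_class { CLASS_RT, CLASS_BE, CLASS_IDLE, NR_CLASSES };

struct io_queue {
	enum io_class ioclass;
	int nr_pending;		/* requests waiting for dispatch */
	struct io_queue *next;	/* sibling on the same class list */
};

/* pick the next queue to dispatch from, honouring class priority */
struct io_queue *select_next_queue(struct io_queue *classes[NR_CLASSES])
{
	int c;

	for (c = CLASS_RT; c < NR_CLASSES; c++) {
		struct io_queue *q;

		for (q = classes[c]; q; q = q->next)
			if (q->nr_pending)
				return q;
	}
	return NULL;		/* nothing to do, let the disk idle */
}

int main(void)
{
	struct io_queue be = { CLASS_BE, 4, NULL };
	struct io_queue rt = { CLASS_RT, 1, NULL };
	struct io_queue *classes[NR_CLASSES] = { &rt, &be, NULL };

	/* RT is picked even though BE has more pending requests */
	return select_next_queue(classes) == &rt ? 0 : 1;
}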

Time to play with the max bw controller patches; then I can probably have
more insights into it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* IO Controller discussion (Was: Re: [PATCH 01/10] Documentation)
@ 2009-04-17 13:49                           ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-17 13:49 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, balbir, linux-kernel,
	containers, menage, peterz

On Fri, Apr 17, 2009 at 11:05:17AM +0530, Dhaval Giani wrote:
> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> > On Wed, Apr 08, 2009 at 10:37:59PM +0200, Andrea Righi wrote:
> > 
> > [..]
> > > > 
> > > > - I can think of atleast one usage of uppper limit controller where we
> > > >   might have spare IO resources still we don't want to give it to a
> > > >   cgroup because customer has not paid for that kind of service level. In
> > > >   those cases we need to implement uppper limit also.
> > > > 
> > > >   May be prportional weight and max bw controller can co-exist depending
> > > >   on what user's requirements are.
> > > >  
> > > >   If yes, then can't this control be done at the same layer/level where
> > > >   proportional weight control is being done? IOW, this set of patches is
> > > >   trying to do prportional weight control at IO scheduler level. I think
> > > >   we should be able to store another max rate as another feature in 
> > > >   cgroup (apart from weight) and not dispatch requests from the queue if
> > > >   we have exceeded the max BW as specified by the user?
> > > 
> > > The more I think about a "perfect" solution (at least for my
> > > requirements), the more I'm convinced that we need both functionalities.
> > > 
> 
> hard limits vs work conserving argument again :). I agree, we need
> both of the functionalities. I think first the aim should be to get the
> proportional weight functionality and then look at doing hard limits.
> 

Agreed.

> [..]
> 
> > > > 
> > > > - Have you thought of doing hierarchical control? 
> > > > 
> > > 
> > > Providing hiearchies in cgroups is in general expensive, deeper
> > > hierarchies imply checking all the way up to the root cgroup, so I think
> > > we need to be very careful and be aware of the trade-offs before
> > > providing such feature. For this particular case (IO controller)
> > > wouldn't it be simpler and more efficient to just ignore hierarchies in
> > > the kernel and opportunely handle them in userspace? for absolute
> > > limiting rules this isn't difficult at all, just imagine a config file
> > > and a script or a deamon that dynamically create the opportune cgroups
> > > and configure them accordingly to what is defined in the configuration
> > > file.
> > > 
> > > I think we can simply define hierarchical dependencies in the
> > > configuration file, translate them in absolute values and use the
> > > absolute values to configure the cgroups' properties.
> > > 
> > > For example, we can just check that the BW allocated for a particular
> > > parent cgroup is not greater than the total BW allocated for the
> > > children. And for each child just use the min(parent_BW, BW) or equally
> > > divide the parent's BW among the children, etc.
> > 
> > IIUC, you are saying that allow hiearchy in user space and then flatten it
> > out and pass it to kernel?
> > 
> > Hmm.., agree that handling hierarchies is hard and expensive. But at the
> > same time rest of the controllers like cpu and memory are handling it in
> > kernel so it probably makes sense to keep the IO controller also in line.
> > 
> > In practice I am not expecting deep hiearchices. May be 2- 3 levels would
> > be good for most of the people.
> > 
> 
> FWIW, even in the CPU controller having deep hierarchies is not a good idea.
> I think this can be documented for IO Controller as well. Beyond that,
> we realized that having a proportional system and doing it in userspace
> is not a good idea. It would require a lot of calculations dependending
> on the system load. (Because, the sub-group should be just the same as a
> process in the parent group). Having hierarchy in the kernel just makes it way
> more easier and way more accurate.

Agreed. I would prefer to keep hierarchical support in the kernel, in
line with the other controllers.

> 
> > > 
> > > > - What happens to the notion of CFQ task classes and task priority. Looks
> > > >   like max bw rule supercede everything. There is no way that an RT task
> > > >   get unlimited amount of disk BW even if it wants to? (There is no notion
> > > >   of RT cgroup etc)
> > > 
> > > What about moving all the RT tasks in a separate cgroup with unlimited
> > > BW?
> > 
> > Hmm.., I think that should work. I have yet to look at your patches in
> > detail but it looks like unlimited BW group will not be throttled at all
> > hence RT tasks can just go right through without getting impacted.
> > 
> 
> This is where the cpu scheduler design helped a lot :). Having different
> classes for differnet types of processes allowed us to handle them
> separately.

In the common layer scheduling approach we do have separate classes (RT,
BE and IDLE) and scheduling is done accordingly. The code is primarily
taken from BFQ and CFQ.

dm-ioband has no notion of separate classes and everything is treated at
the same level, which is a problem: the end-level IO scheduler loses its
ability to differentiate if we mix things up above it.

Time to play with the max bw controller patches; then I can probably have
more insights into it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-17  9:37                     ` Andrea Righi
@ 2009-04-17 14:13                       ` Vivek Goyal
  2009-04-17 14:13                       ` Vivek Goyal
  1 sibling, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-17 14:13 UTC (permalink / raw)
  To: Andrea Righi, Andrew Morton, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, fch

On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> > > I think it would be possible to implement both proportional and limiting
> > > rules at the same level (e.g., the IO scheduler), but we need also to
> > > address the memory consumption problem (I still need to review your
> > > patchset in details and I'm going to test it soon :), so I don't know if
> > > you already addressed this issue).
> > > 
> > 
> > Can you please elaborate a bit on this? Are you concerned about that data
> > structures created to solve the problem consume a lot of memory?
> 
> Sorry I was not very clear here. With memory consumption I mean wasting
> the memory with hard/slow reclaimable dirty pages or pending IO
> requests.
> 
> If there's only a global limit on dirty pages, any cgroup can exhaust
> that limit and cause other cgroups/processes to block when they try to
> write to disk.
> 
> But, ok, the IO controller is not probably the best place to implement
> such functionality. I should rework on the per cgroup dirty_ratio:
> 
> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> 
> Last time we focused too much on the best interfaces to define dirty
> pages limit, and I never re-posted an updated version of this patchset.
> Now I think we can simply provide the same dirty_ratio/dirty_bytes
> interface that we provide globally, but per cgroup.
> 
> > 
> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
> > > in the cgroup that exceeds its limit, how do we avoid the waste of
> > > memory due to the succeeding IO requests and the increasingly dirty
> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> > > but I think we talked about this problem in a previous email... sorry I
> > > don't find the discussion in my mail archives.
> > > 
> > > IMHO a nice approach would be to measure IO consumption at the IO
> > > scheduler level, and control IO applying proportional weights / absolute
> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
> > > time block the tasks from dirtying memory that will generate additional
> > > IO requests.
> > > 
> > > Anyway, there's no need to provide this with a single IO controller, we
> > > could split the problem in two parts: 1) provide a proportional /
> > > absolute IO controller in the IO schedulers and 2) allow to set, for
> > > example, a maximum limit of dirty pages for each cgroup.
> > > 
> > 
> > I think setting a maximum limit on dirty pages is an interesting thought.
> > It sounds like as if memory controller can handle it?
> 
> Exactly, the same above.

Thinking more about it, the memory controller can probably enforce the
upper limit, but it would not easily translate into a fixed upper async
write rate. Until the process hits the page cache limit or is slowed down
by dirty page writeout, it can get a very high async write BW.

So a memory controller page cache limit will help, but it would not
directly translate into what the max bw limit patches are doing.

Even if we do max bw control at the IO scheduler level, async writes are
problematic again. The IO controller will not be able to throttle the
process until it sees the actual write request. On big-memory systems,
writeout might not happen for some time, and until then the process will
see a high throughput.

So doing async write throttling at a higher layer, and not at the IO
scheduler layer, gives us the opportunity to produce more accurate results.

For sync requests, I think IO scheduler max bw control should work fine.

BTW, Andrea, what is the use case of your patches? Andrew had mentioned
that some people are already using them. I am curious to know whether a
proportional BW controller will solve the issues/requirements of these
people, or whether they specifically require traffic shaping and a max bw
controller only.

[..]
> > > > Can you please give little more details here regarding how QoS requirements
> > > > are not met with proportional weight?
> > > 
> > > With proportional weights the whole bandwidth is allocated if no one
> > > else is using it. When IO is submitted other tasks with a higher weight
> > > can be forced to sleep until the IO generated by the low weight tasks is
> > > not completely dispatched. Or any extent of the priority inversion
> > > problems.
> > 
> > Hmm..., I am not very sure here. When admin is allocating the weights, he
> > has the whole picture. He knows how many groups are conteding for the disk
> > and what could be the worst case scenario. So if I have got two groups
> > with A and B with weight 1 and 2 and both are contending, then as an 
> > admin one would expect to get 33% of BW for group A in worst case (if
> > group B is continuously backlogged). If B is not contending than A can get
> > 100% of BW. So while configuring the system, will one not plan for worst
> > case (33% for A, and 66 % for B)?
> 
> OK, I'm quite convinced.. :)
> 
> To a large degree, if we want to provide a BW reservation strategy we
> must provide an interface that allows cgroups to ask for time slices
> such as max/min 5 IO requests every 50ms or something like that.
> Probably the same functionality can be achieved translating time slices
> from weights, percentages or absolute BW limits.

OK, I would like to split it into two parts.

I think providing minimum guarantees in absolute terms, like 5 IO requests
every 50ms, will be very hard because the IO scheduler has no control over
how many competitors there are. An easier thing would be to provide minimum
guarantees on a share basis. For a minimum BW (disk time slice) guarantee,
the admin will have to create the right cgroup hierarchy and assign weights
properly, and then the admin can calculate what % of the disk time a
particular group will get as a minimum guarantee. (It is actually more
complicated than this, as there are time slices which are not accounted to
any group. During a queue switch CFQ starts the time slice accounting only
after the first request has completed, to offset the impact of seeking and,
I guess, also NCQ.)

I think it should be possible to give max bandwidth guarantees in absolute
terms, like io/s, sectors/sec or MB/sec, because the only thing the IO
scheduler has to do is not allow dispatch from a particular queue once it
has crossed its limit, and then either let the disk idle or move on to the
next eligible queue.
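
A minimal sketch of that dispatch-time gate (the structure and field names
are my own, not from the posted patches): account what a group has
dispatched in the current window and refuse further dispatch from its queue
once the configured ceiling is crossed.

/*
 * Illustrative per-group dispatch gate for an absolute sectors/sec cap.
 */
#include <stdbool.h>

struct bw_group {
	unsigned long long max_sectors_per_sec;	/* 0 == unlimited */
	unsigned long long sectors_dispatched;	/* in the current window */
	unsigned long long window_start_ns;
};

#define NSEC_PER_SEC 1000000000ULL

/* may we dispatch 'sectors' more from this group right now? */
bool bw_group_may_dispatch(struct bw_group *grp, unsigned long long now_ns,
			   unsigned long long sectors)
{
	if (!grp->max_sectors_per_sec)
		return true;		/* unlimited group, e.g. for RT tasks */

	if (now_ns - grp->window_start_ns >= NSEC_PER_SEC) {
		/* start a new accounting window */
		grp->window_start_ns = now_ns;
		grp->sectors_dispatched = 0;
	}

	/* over the limit: skip this queue, idle or pick the next one */
	if (grp->sectors_dispatched + sectors > grp->max_sectors_per_sec)
		return false;

	grp->sectors_dispatched += sectors;
	return true;
}

int main(void)
{
	struct bw_group grp = { 1000, 0, 0 };	/* 1000 sectors/sec cap */

	/* an 800-sector dispatch fits the budget, a further 400 does not */
	return bw_group_may_dispatch(&grp, 100, 800) &&
	       !bw_group_may_dispatch(&grp, 200, 400) ? 0 : 1;
}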

The only issue here will be async writes. A max bw guarantee for async
writes at the IO scheduler level might not mean much to the application
because of the page cache.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-17  9:37                     ` Andrea Righi
  2009-04-17 14:13                       ` IO controller discussion (Was: Re: [PATCH 01/10] Documentation) Vivek Goyal
@ 2009-04-17 14:13                       ` Vivek Goyal
       [not found]                         ` <20090417141358.GD29086-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                                           ` (4 more replies)
  1 sibling, 5 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-17 14:13 UTC (permalink / raw)
  To: Andrea Righi, Andrew Morton, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, jens.axboe, ryov, fernando, s-uchida,
	taka, guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz

On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> > > I think it would be possible to implement both proportional and limiting
> > > rules at the same level (e.g., the IO scheduler), but we need also to
> > > address the memory consumption problem (I still need to review your
> > > patchset in details and I'm going to test it soon :), so I don't know if
> > > you already addressed this issue).
> > > 
> > 
> > Can you please elaborate a bit on this? Are you concerned about that data
> > structures created to solve the problem consume a lot of memory?
> 
> Sorry I was not very clear here. With memory consumption I mean wasting
> the memory with hard/slow reclaimable dirty pages or pending IO
> requests.
> 
> If there's only a global limit on dirty pages, any cgroup can exhaust
> that limit and cause other cgroups/processes to block when they try to
> write to disk.
> 
> But, ok, the IO controller is not probably the best place to implement
> such functionality. I should rework on the per cgroup dirty_ratio:
> 
> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> 
> Last time we focused too much on the best interfaces to define dirty
> pages limit, and I never re-posted an updated version of this patchset.
> Now I think we can simply provide the same dirty_ratio/dirty_bytes
> interface that we provide globally, but per cgroup.
> 
> > 
> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
> > > in the cgroup that exceeds its limit, how do we avoid the waste of
> > > memory due to the succeeding IO requests and the increasingly dirty
> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> > > but I think we talked about this problem in a previous email... sorry I
> > > don't find the discussion in my mail archives.
> > > 
> > > IMHO a nice approach would be to measure IO consumption at the IO
> > > scheduler level, and control IO applying proportional weights / absolute
> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
> > > time block the tasks from dirtying memory that will generate additional
> > > IO requests.
> > > 
> > > Anyway, there's no need to provide this with a single IO controller, we
> > > could split the problem in two parts: 1) provide a proportional /
> > > absolute IO controller in the IO schedulers and 2) allow to set, for
> > > example, a maximum limit of dirty pages for each cgroup.
> > > 
> > 
> > I think setting a maximum limit on dirty pages is an interesting thought.
> > It sounds like as if memory controller can handle it?
> 
> Exactly, the same above.

Thinking more about it, the memory controller can probably enforce the
upper limit, but it would not easily translate into a fixed upper async
write rate. Until the process hits the page cache limit or is slowed down
by dirty page writeout, it can get a very high async write BW.

So a memory controller page cache limit will help, but it would not
directly translate into what the max bw limit patches are doing.

Even if we do max bw control at the IO scheduler level, async writes are
problematic again. The IO controller will not be able to throttle the
process until it sees the actual write request. On big-memory systems,
writeout might not happen for some time, and until then the process will
see a high throughput.

So doing async write throttling at a higher layer, and not at the IO
scheduler layer, gives us the opportunity to produce more accurate results.

For sync requests, I think IO scheduler max bw control should work fine.

BTW, Andrea, what is the use case of your patches? Andrew had mentioned
that some people are already using them. I am curious to know whether a
proportional BW controller will solve the issues/requirements of these
people, or whether they specifically require traffic shaping and a max bw
controller only.

[..]
> > > > Can you please give little more details here regarding how QoS requirements
> > > > are not met with proportional weight?
> > > 
> > > With proportional weights the whole bandwidth is allocated if no one
> > > else is using it. When IO is submitted other tasks with a higher weight
> > > can be forced to sleep until the IO generated by the low weight tasks is
> > > not completely dispatched. Or any extent of the priority inversion
> > > problems.
> > 
> > Hmm..., I am not very sure here. When admin is allocating the weights, he
> > has the whole picture. He knows how many groups are conteding for the disk
> > and what could be the worst case scenario. So if I have got two groups
> > with A and B with weight 1 and 2 and both are contending, then as an 
> > admin one would expect to get 33% of BW for group A in worst case (if
> > group B is continuously backlogged). If B is not contending than A can get
> > 100% of BW. So while configuring the system, will one not plan for worst
> > case (33% for A, and 66 % for B)?
> 
> OK, I'm quite convinced.. :)
> 
> To a large degree, if we want to provide a BW reservation strategy we
> must provide an interface that allows cgroups to ask for time slices
> such as max/min 5 IO requests every 50ms or something like that.
> Probably the same functionality can be achieved translating time slices
> from weights, percentages or absolute BW limits.

OK, I would like to split it into two parts.

I think providing minimum guarantees in absolute terms, like 5 IO requests
every 50ms, will be very hard because the IO scheduler has no control over
how many competitors there are. An easier thing would be to provide minimum
guarantees on a share basis. For a minimum BW (disk time slice) guarantee,
the admin will have to create the right cgroup hierarchy and assign weights
properly, and then the admin can calculate what % of the disk time a
particular group will get as a minimum guarantee. (It is actually more
complicated than this, as there are time slices which are not accounted to
any group. During a queue switch CFQ starts the time slice accounting only
after the first request has completed, to offset the impact of seeking and,
I guess, also NCQ.)

I think it should be possible to give max bandwidth guarantees in absolute
terms, like io/s, sectors/sec or MB/sec, because the only thing the IO
scheduler has to do is not allow dispatch from a particular queue once it
has crossed its limit, and then either let the disk idle or move on to the
next eligible queue.

The only issue here will be async writes. A max bw guarantee for async
writes at the IO scheduler level might not mean much to the application
because of the page cache.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-17 14:13                       ` Vivek Goyal
       [not found]                         ` <20090417141358.GD29086-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-04-17 18:09                         ` Nauman Rafique
       [not found]                           ` <e98e18940904171109r17ccb054kb7879f8d02ac26b5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
                                             ` (2 more replies)
  2009-04-17 22:38                         ` Andrea Righi
                                           ` (2 subsequent siblings)
  4 siblings, 3 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-17 18:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrea Righi, Andrew Morton, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz

On Fri, Apr 17, 2009 at 7:13 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
>> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
>> > > I think it would be possible to implement both proportional and limiting
>> > > rules at the same level (e.g., the IO scheduler), but we need also to
>> > > address the memory consumption problem (I still need to review your
>> > > patchset in details and I'm going to test it soon :), so I don't know if
>> > > you already addressed this issue).
>> > >
>> >
>> > Can you please elaborate a bit on this? Are you concerned about that data
>> > structures created to solve the problem consume a lot of memory?
>>
>> Sorry I was not very clear here. With memory consumption I mean wasting
>> the memory with hard/slow reclaimable dirty pages or pending IO
>> requests.
>>
>> If there's only a global limit on dirty pages, any cgroup can exhaust
>> that limit and cause other cgroups/processes to block when they try to
>> write to disk.
>>
>> But, ok, the IO controller is not probably the best place to implement
>> such functionality. I should rework on the per cgroup dirty_ratio:
>>
>> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
>>
>> Last time we focused too much on the best interfaces to define dirty
>> pages limit, and I never re-posted an updated version of this patchset.
>> Now I think we can simply provide the same dirty_ratio/dirty_bytes
>> interface that we provide globally, but per cgroup.
>>
>> >
>> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
>> > > in the cgroup that exceeds its limit, how do we avoid the waste of
>> > > memory due to the succeeding IO requests and the increasingly dirty
>> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
>> > > but I think we talked about this problem in a previous email... sorry I
>> > > don't find the discussion in my mail archives.
>> > >
>> > > IMHO a nice approach would be to measure IO consumption at the IO
>> > > scheduler level, and control IO applying proportional weights / absolute
>> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
>> > > time block the tasks from dirtying memory that will generate additional
>> > > IO requests.
>> > >
>> > > Anyway, there's no need to provide this with a single IO controller, we
>> > > could split the problem in two parts: 1) provide a proportional /
>> > > absolute IO controller in the IO schedulers and 2) allow to set, for
>> > > example, a maximum limit of dirty pages for each cgroup.
>> > >
>> >
>> > I think setting a maximum limit on dirty pages is an interesting thought.
>> > It sounds like as if memory controller can handle it?
>>
>> Exactly, the same above.
>
> Thinking more about it. Memory controller can probably enforce the higher
> limit but it would not easily translate into a fixed upper async write
> rate. Till the process hits the page cache limit or is slowed down by
> dirty page writeout, it can get a very high async write BW.
>
> So memory controller page cache limit will help but it would not direclty
> translate into what max bw limit patches are doing.
>
> Even if we do max bw control at IO scheduler level, async writes are
> problematic again. IO controller will not be able to throttle the process
> until it sees actuall write request. In big memory systems, writeout might
> not happen for some time and till then it will see a high throughput.
>
> So doing async write throttling at higher layer and not at IO scheduler
> layer gives us the opprotunity to produce more accurate results.

Wouldn't 'doing control on writes at a higher layer' have the same
problems as the ones we talk about with dm-ioband? What if the cgroup
being throttled for dirtying pages has a high weight assigned to it at
the IO scheduler level? What if there are threads of different classes
within that cgroup, and we would want to let an RT task dirty pages
before BE tasks? I am not sure all these questions make sense, but I
just wanted to raise issues that might pop up.

If the whole system is designed with cgroups in mind, then throttling
at the IO scheduler layer should lead to a backlog that can be seen at a
higher level. For example, if a cgroup is not getting service at the IO
scheduler level, it should run out of request descriptors, and thus
the thread writing back dirty pages should notice it (if it's pdflush,
blocking it is probably not the best idea). And that should mean the
cgroup hits the dirty threshold, and the task is disallowed from dirtying
further pages. There is a possibility, though, that getting all this
right might be overkill and we can get away with a simpler
solution. One possibility seems to be that we provide some feedback
from the IO scheduling layer to the higher layers that a cgroup is hitting
its write bandwidth limit and should not be allowed to dirty any more
pages.
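
To illustrate that last possibility, the feedback could be little more than a
per-cgroup comparison that the page-dirtying path consults before allowing
another page to be dirtied. All names below are invented; nothing like this
exists in the posted patches:

    #include <stdio.h>
    #include <stdbool.h>

    struct io_cgroup {
        unsigned long write_bw_limit;   /* configured cap, e.g. in KB/s */
        unsigned long current_write_bw; /* as measured at the IO scheduler */
    };

    /* the "feedback" from the IO scheduling layer */
    static bool io_cgroup_over_write_limit(const struct io_cgroup *iocg)
    {
        return iocg->write_bw_limit &&
               iocg->current_write_bw >= iocg->write_bw_limit;
    }

    /* consulted before a task in this cgroup is allowed to dirty a page */
    static bool may_dirty_page(const struct io_cgroup *iocg)
    {
        return !io_cgroup_over_write_limit(iocg);
    }

    int main(void)
    {
        struct io_cgroup g = { .write_bw_limit = 10240, .current_write_bw = 20480 };

        printf("may dirty another page: %s\n",
               may_dirty_page(&g) ? "yes" : "no, throttle the task");
        return 0;
    }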

>
> For sync requests, I think IO scheduler max bw control should work fine.
>
> BTW, andrea, what is the use case of your patches? Andrew had mentioned
> that some people are already using it. I am curious to know will a
> proportional BW controller will solve the issues/requirements of these
> people or they have specific requirement of traffic shaping and max bw
> controller only.
>
> [..]
>> > > > Can you please give little more details here regarding how QoS requirements
>> > > > are not met with proportional weight?
>> > >
>> > > With proportional weights the whole bandwidth is allocated if no one
>> > > else is using it. When IO is submitted other tasks with a higher weight
>> > > can be forced to sleep until the IO generated by the low weight tasks is
>> > > not completely dispatched. Or any extent of the priority inversion
>> > > problems.
>> >
>> > Hmm..., I am not very sure here. When admin is allocating the weights, he
>> > has the whole picture. He knows how many groups are conteding for the disk
>> > and what could be the worst case scenario. So if I have got two groups
>> > with A and B with weight 1 and 2 and both are contending, then as an
>> > admin one would expect to get 33% of BW for group A in worst case (if
>> > group B is continuously backlogged). If B is not contending than A can get
>> > 100% of BW. So while configuring the system, will one not plan for worst
>> > case (33% for A, and 66 % for B)?
>>
>> OK, I'm quite convinced.. :)
>>
>> To a large degree, if we want to provide a BW reservation strategy we
>> must provide an interface that allows cgroups to ask for time slices
>> such as max/min 5 IO requests every 50ms or something like that.
>> Probably the same functionality can be achieved translating time slices
>> from weights, percentages or absolute BW limits.
>
> Ok, I would like to split it in two parts.
>
> I think providng minimum gurantee in absolute terms like 5 IO request
> every 50ms will be very hard because IO scheduler has no control over
> how many competitors are there. An easier thing will be to have minimum
> gurantees on share basis. For minimum BW (disk time slice) gurantee, admin
> shall have to create right cgroup hierarchy and assign weights properly and
> then admin can calculate what % of disk slice a particular group will get
> as minimum gurantee. (This is more complicated than this as there are
> time slices which are not accounted to any groups. During queue switch
> cfq starts the time slice counting only after first request has completed
> to offset the impact of seeking and i guess also NCQ).
>
> I think it should be possible to give max bandwidth gurantees in absolute
> terms, like io/s or sectors/sec or MB/sec etc, because only thing IO
> scheduler has to do is to not allow dispatch from a particular queue if
> it has crossed its limit and then either let the disk idle or move onto
> next eligible queue.
>
> The only issue here will be async writes. max bw gurantee for async writes
> at IO scheduler level might not mean much to application because of page
> cache.
>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-17 14:13                       ` Vivek Goyal
       [not found]                         ` <20090417141358.GD29086-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-04-17 18:09                         ` Nauman Rafique
@ 2009-04-17 22:38                         ` Andrea Righi
  2009-04-19 13:21                             ` Vivek Goyal
  2009-04-18 13:19                         ` Balbir Singh
  2009-04-19  4:35                         ` Nauman Rafique
  4 siblings, 1 reply; 190+ messages in thread
From: Andrea Righi @ 2009-04-17 22:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz

On Fri, Apr 17, 2009 at 10:13:58AM -0400, Vivek Goyal wrote:
> > > I think setting a maximum limit on dirty pages is an interesting thought.
> > > It sounds like as if memory controller can handle it?
> > 
> > Exactly, the same above.
> 
> Thinking more about it. Memory controller can probably enforce the higher
> limit but it would not easily translate into a fixed upper async write
> rate. Till the process hits the page cache limit or is slowed down by
> dirty page writeout, it can get a very high async write BW.
> 
> So memory controller page cache limit will help but it would not direclty
> translate into what max bw limit patches are doing.

The memory controller can be used to set an upper limit on dirty
pages. When this limit is exceeded, the tasks in the cgroup can be forced
to write the excess dirty pages to disk. At this point the IO
controller can: 1) throttle the task that is going to submit the IO
requests, if the guy that dirtied the pages was actually the task
itself, or 2) delay the submission of those requests to the elevator (or
at the IO scheduler level) if it's writeback IO (e.g., made by pdflush).

Both functionalities should provide BW control and prevent any single
cgroup from entirely exhausting the global limit of dirty pages.
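
A toy sketch of that 1)/2) decision (invented names, not an existing kernel
interface): look at whether the request is writeback IO and whether the
submitter is the task that dirtied the pages.

    #include <stdio.h>
    #include <stdbool.h>

    enum io_action {
        IO_THROTTLE_TASK,   /* case 1: block the task that dirtied the pages */
        IO_DELAY_REQUEST    /* case 2: hold the request back at the elevator */
    };

    struct io_request {
        bool is_writeback;  /* submitted by pdflush/background writeback? */
        int  dirtier_pid;   /* task that originally dirtied the pages */
        int  submitter_pid; /* task submitting the request now */
    };

    static enum io_action classify(const struct io_request *rq)
    {
        if (!rq->is_writeback && rq->submitter_pid == rq->dirtier_pid)
            return IO_THROTTLE_TASK;
        return IO_DELAY_REQUEST;
    }

    int main(void)
    {
        struct io_request direct = { false, 42, 42 };   /* task writes its own pages */
        struct io_request flush  = { true,  42,  7 };   /* pdflush writing on its behalf */

        printf("direct write -> %s\n",
               classify(&direct) == IO_THROTTLE_TASK ? "throttle task" : "delay request");
        printf("writeback    -> %s\n",
               classify(&flush)  == IO_THROTTLE_TASK ? "throttle task" : "delay request");
        return 0;
    }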

> 
> Even if we do max bw control at IO scheduler level, async writes are
> problematic again. IO controller will not be able to throttle the process
> until it sees actuall write request. In big memory systems, writeout might
> not happen for some time and till then it will see a high throughput.
> 
> So doing async write throttling at higher layer and not at IO scheduler
> layer gives us the opprotunity to produce more accurate results.

Totally agree.

> 
> For sync requests, I think IO scheduler max bw control should work fine.

ditto

-Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-17 18:09                         ` Nauman Rafique
       [not found]                           ` <e98e18940904171109r17ccb054kb7879f8d02ac26b5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-04-18  8:13                           ` Andrea Righi
  2009-04-19 13:08                           ` Vivek Goyal
  2 siblings, 0 replies; 190+ messages in thread
From: Andrea Righi @ 2009-04-18  8:13 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: Vivek Goyal, Andrew Morton, dpshah, lizf, mikew, fchecconi,
	paolo.valente, axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz, matt, dradford

On Fri, Apr 17, 2009 at 11:09:51AM -0700, Nauman Rafique wrote:
> > Thinking more about it. Memory controller can probably enforce the higher
> > limit but it would not easily translate into a fixed upper async write
> > rate. Till the process hits the page cache limit or is slowed down by
> > dirty page writeout, it can get a very high async write BW.
> >
> > So memory controller page cache limit will help but it would not direclty
> > translate into what max bw limit patches are doing.
> >
> > Even if we do max bw control at IO scheduler level, async writes are
> > problematic again. IO controller will not be able to throttle the process
> > until it sees actuall write request. In big memory systems, writeout might
> > not happen for some time and till then it will see a high throughput.
> >
> > So doing async write throttling at higher layer and not at IO scheduler
> > layer gives us the opprotunity to produce more accurate results.
> 
> Wouldn't 'doing control on writes at a higher layer' have the same
> problems as the ones we talk about in dm-ioband? What if the cgroup
> being throttled for dirtying pages has a high weight assigned to it at
> the IO scheduler level? What if there are threads of different classes
> within that cgroup, and we would want to let RT task dirty the pages
> before BE tasks? I am not sure all these questions make sense, but
> just wanted to raise issues that might pop up.

To a large degree, this seems to be related to providing "fair throttling"
at a higher level. I mean, throttle equally the tasks belonging to a cgroup
that exceeded the limits. By "equally" I mean proportionally to the IO
traffic previously generated _and_ the IO priority.

Otherwise a low-priority task doing a lot of IO can consume all the
available cgroup BW, and other high-priority tasks in the same cgroup may
be blocked when they try to write to disk, even if they only try to write a
small number of bytes.
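
Purely as an illustration (the names, units and formula below are made up),
the throttle delay could scale with the IO a task generated and with its IO
priority:

    #include <stdio.h>

    struct task_io_stats {
        unsigned long bytes_generated;  /* recent IO attributed to this task */
        unsigned int  ioprio;           /* 0 (highest) .. 7 (lowest), CFQ-style */
    };

    /* heavier and lower-priority tasks sleep longer when the cgroup is over limit */
    static unsigned long throttle_delay_ms(const struct task_io_stats *t,
                                           unsigned long cgroup_bytes,
                                           unsigned long base_delay_ms)
    {
        if (!cgroup_bytes)
            return 0;
        return base_delay_ms * (t->ioprio + 1) * t->bytes_generated / cgroup_bytes;
    }

    int main(void)
    {
        struct task_io_stats heavy_be = { 900000, 7 };  /* low prio, lots of IO */
        struct task_io_stats light_rt = { 100000, 0 };  /* high prio, little IO */
        unsigned long total = 1000000;                  /* cgroup's recent IO */

        printf("heavy BE task sleeps %lu ms\n", throttle_delay_ms(&heavy_be, total, 100));
        printf("light RT task sleeps %lu ms\n", throttle_delay_ms(&light_rt, total, 100));
        return 0;
    }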

> 
> If the whole system is designed with cgroups in mind, then throttling
> at IO scheduler layer should lead to backlog, that could be seen at
> higher level. For example, if a cgroup is not getting service at IO
> scheduler level, it should run out of request descriptors, and thus
> the thread writing back dirty pages should notice it (if its pdflush,
> blocking it is probably not the best idea). And that should mean the
> cgroup should hit the dirty threshold, and disallow the task to dirty
> further pages. There is a possibility though that getting all this
> right might be an overkill and we can get away with a simpler
> solution. One possibility seems to be that we provide some feedback
> from IO scheduling layer to higher layers, that cgroup is hitting its
> write bandwith limit, and should not be allowed to dirty any more
> pages.
> 

IMHO accounting the IO activity in the IO scheduler and blocking the
offending application at the higher level is a good solution.

Throttling the dirty page ratio could be a nice feature, but it's probably
enough to provide a maximum amount of dirty pages per cgroup and force the
tasks to directly write back those pages when the cgroup exceeds the
dirty limit. In this way the dirty page ratio will be automatically
throttled by the underlying IO controller.

-Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-17 14:13                       ` Vivek Goyal
                                           ` (2 preceding siblings ...)
  2009-04-17 22:38                         ` Andrea Righi
@ 2009-04-18 13:19                         ` Balbir Singh
  2009-04-19 13:45                           ` Vivek Goyal
       [not found]                           ` <661de9470904180619k34e7998ch755a2ad3bed9ce5e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-04-19  4:35                         ` Nauman Rafique
  4 siblings, 2 replies; 190+ messages in thread
From: Balbir Singh @ 2009-04-18 13:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrea Righi, Andrew Morton, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, jens.axboe, ryov, fernando, s-uchida,
	taka, guijianfeng, arozansk, jmoyer, oz-kernel, dhaval,
	linux-kernel, containers, menage, peterz

On Fri, Apr 17, 2009 at 7:43 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
>> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
>> > > I think it would be possible to implement both proportional and limiting
>> > > rules at the same level (e.g., the IO scheduler), but we need also to
>> > > address the memory consumption problem (I still need to review your
>> > > patchset in details and I'm going to test it soon :), so I don't know if
>> > > you already addressed this issue).
>> > >
>> >
>> > Can you please elaborate a bit on this? Are you concerned about that data
>> > structures created to solve the problem consume a lot of memory?
>>
>> Sorry I was not very clear here. With memory consumption I mean wasting
>> the memory with hard/slow reclaimable dirty pages or pending IO
>> requests.
>>
>> If there's only a global limit on dirty pages, any cgroup can exhaust
>> that limit and cause other cgroups/processes to block when they try to
>> write to disk.
>>
>> But, ok, the IO controller is not probably the best place to implement
>> such functionality. I should rework on the per cgroup dirty_ratio:
>>
>> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
>>
>> Last time we focused too much on the best interfaces to define dirty
>> pages limit, and I never re-posted an updated version of this patchset.
>> Now I think we can simply provide the same dirty_ratio/dirty_bytes
>> interface that we provide globally, but per cgroup.
>>
>> >
>> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
>> > > in the cgroup that exceeds its limit, how do we avoid the waste of
>> > > memory due to the succeeding IO requests and the increasingly dirty
>> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
>> > > but I think we talked about this problem in a previous email... sorry I
>> > > don't find the discussion in my mail archives.
>> > >
>> > > IMHO a nice approach would be to measure IO consumption at the IO
>> > > scheduler level, and control IO applying proportional weights / absolute
>> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
>> > > time block the tasks from dirtying memory that will generate additional
>> > > IO requests.
>> > >
>> > > Anyway, there's no need to provide this with a single IO controller, we
>> > > could split the problem in two parts: 1) provide a proportional /
>> > > absolute IO controller in the IO schedulers and 2) allow to set, for
>> > > example, a maximum limit of dirty pages for each cgroup.
>> > >
>> >
>> > I think setting a maximum limit on dirty pages is an interesting thought.
>> > It sounds like as if memory controller can handle it?
>>
>> Exactly, the same above.
>
> Thinking more about it. Memory controller can probably enforce the higher
> limit but it would not easily translate into a fixed upper async write
> rate. Till the process hits the page cache limit or is slowed down by
> dirty page writeout, it can get a very high async write BW.
>
> So memory controller page cache limit will help but it would not direclty
> translate into what max bw limit patches are doing.
>
> Even if we do max bw control at IO scheduler level, async writes are
> problematic again. IO controller will not be able to throttle the process
> until it sees actuall write request. In big memory systems, writeout might
> not happen for some time and till then it will see a high throughput.
>
> So doing async write throttling at higher layer and not at IO scheduler
> layer gives us the opprotunity to produce more accurate results.
>
> For sync requests, I think IO scheduler max bw control should work fine.
>
> BTW, andrea, what is the use case of your patches? Andrew had mentioned
> that some people are already using it. I am curious to know will a
> proportional BW controller will solve the issues/requirements of these
> people or they have specific requirement of traffic shaping and max bw
> controller only.
>
> [..]
>> > > > Can you please give little more details here regarding how QoS requirements
>> > > > are not met with proportional weight?
>> > >
>> > > With proportional weights the whole bandwidth is allocated if no one
>> > > else is using it. When IO is submitted other tasks with a higher weight
>> > > can be forced to sleep until the IO generated by the low weight tasks is
>> > > not completely dispatched. Or any extent of the priority inversion
>> > > problems.
>> >
>> > Hmm..., I am not very sure here. When admin is allocating the weights, he
>> > has the whole picture. He knows how many groups are conteding for the disk
>> > and what could be the worst case scenario. So if I have got two groups
>> > with A and B with weight 1 and 2 and both are contending, then as an
>> > admin one would expect to get 33% of BW for group A in worst case (if
>> > group B is continuously backlogged). If B is not contending than A can get
>> > 100% of BW. So while configuring the system, will one not plan for worst
>> > case (33% for A, and 66 % for B)?
>>
>> OK, I'm quite convinced.. :)
>>
>> To a large degree, if we want to provide a BW reservation strategy we
>> must provide an interface that allows cgroups to ask for time slices
>> such as max/min 5 IO requests every 50ms or something like that.
>> Probably the same functionality can be achieved translating time slices
>> from weights, percentages or absolute BW limits.
>
> Ok, I would like to split it in two parts.
>
> I think providng minimum gurantee in absolute terms like 5 IO request
> every 50ms will be very hard because IO scheduler has no control over
> how many competitors are there. An easier thing will be to have minimum
> gurantees on share basis. For minimum BW (disk time slice) gurantee, admin
> shall have to create right cgroup hierarchy and assign weights properly and
> then admin can calculate what % of disk slice a particular group will get
> as minimum gurantee. (This is more complicated than this as there are
> time slices which are not accounted to any groups. During queue switch
> cfq starts the time slice counting only after first request has completed
> to offset the impact of seeking and i guess also NCQ).
>
> I think it should be possible to give max bandwidth gurantees in absolute
> terms, like io/s or sectors/sec or MB/sec etc, because only thing IO
> scheduler has to do is to not allow dispatch from a particular queue if
> it has crossed its limit and then either let the disk idle or move onto
> next eligible queue.
>
> The only issue here will be async writes. max bw gurantee for async writes
> at IO scheduler level might not mean much to application because of page
> cache.

I see the memory controller coming up a lot here. Since we've been
discussing so many of these design points over email, I wonder if it
makes sense to summarize them somewhere (a wiki?). Would anyone like
to take a shot at it?

Balbir

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
       [not found]                         ` <20090417141358.GD29086-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                                             ` (2 preceding siblings ...)
  2009-04-18 13:19                           ` Balbir Singh
@ 2009-04-19  4:35                           ` Nauman Rafique
  3 siblings, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-19  4:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, John Wilkes,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil, Andrea Righi,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Fri, Apr 17, 2009 at 7:13 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
>> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
>> > > I think it would be possible to implement both proportional and limiting
>> > > rules at the same level (e.g., the IO scheduler), but we need also to
>> > > address the memory consumption problem (I still need to review your
>> > > patchset in details and I'm going to test it soon :), so I don't know if
>> > > you already addressed this issue).
>> > >
>> >
>> > Can you please elaborate a bit on this? Are you concerned about that data
>> > structures created to solve the problem consume a lot of memory?
>>
>> Sorry I was not very clear here. With memory consumption I mean wasting
>> the memory with hard/slow reclaimable dirty pages or pending IO
>> requests.
>>
>> If there's only a global limit on dirty pages, any cgroup can exhaust
>> that limit and cause other cgroups/processes to block when they try to
>> write to disk.
>>
>> But, ok, the IO controller is not probably the best place to implement
>> such functionality. I should rework on the per cgroup dirty_ratio:
>>
>> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
>>
>> Last time we focused too much on the best interfaces to define dirty
>> pages limit, and I never re-posted an updated version of this patchset.
>> Now I think we can simply provide the same dirty_ratio/dirty_bytes
>> interface that we provide globally, but per cgroup.
>>
>> >
>> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
>> > > in the cgroup that exceeds its limit, how do we avoid the waste of
>> > > memory due to the succeeding IO requests and the increasingly dirty
>> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
>> > > but I think we talked about this problem in a previous email... sorry I
>> > > don't find the discussion in my mail archives.
>> > >
>> > > IMHO a nice approach would be to measure IO consumption at the IO
>> > > scheduler level, and control IO applying proportional weights / absolute
>> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
>> > > time block the tasks from dirtying memory that will generate additional
>> > > IO requests.
>> > >
>> > > Anyway, there's no need to provide this with a single IO controller, we
>> > > could split the problem in two parts: 1) provide a proportional /
>> > > absolute IO controller in the IO schedulers and 2) allow to set, for
>> > > example, a maximum limit of dirty pages for each cgroup.
>> > >
>> >
>> > I think setting a maximum limit on dirty pages is an interesting thought.
>> > It sounds like as if memory controller can handle it?
>>
>> Exactly, the same above.
>
> Thinking more about it. Memory controller can probably enforce the higher
> limit but it would not easily translate into a fixed upper async write
> rate. Till the process hits the page cache limit or is slowed down by
> dirty page writeout, it can get a very high async write BW.
>
> So memory controller page cache limit will help but it would not direclty
> translate into what max bw limit patches are doing.
>
> Even if we do max bw control at IO scheduler level, async writes are
> problematic again. IO controller will not be able to throttle the process
> until it sees actuall write request. In big memory systems, writeout might
> not happen for some time and till then it will see a high throughput.
>
> So doing async write throttling at higher layer and not at IO scheduler
> layer gives us the opprotunity to produce more accurate results.
>
> For sync requests, I think IO scheduler max bw control should work fine.
>
> BTW, andrea, what is the use case of your patches? Andrew had mentioned
> that some people are already using it. I am curious to know will a
> proportional BW controller will solve the issues/requirements of these
> people or they have specific requirement of traffic shaping and max bw
> controller only.
>
> [..]
>> > > > Can you please give little more details here regarding how QoS requirements
>> > > > are not met with proportional weight?
>> > >
>> > > With proportional weights the whole bandwidth is allocated if no one
>> > > else is using it. When IO is submitted other tasks with a higher weight
>> > > can be forced to sleep until the IO generated by the low weight tasks is
>> > > not completely dispatched. Or any extent of the priority inversion
>> > > problems.
>> >
>> > Hmm..., I am not very sure here. When admin is allocating the weights, he
>> > has the whole picture. He knows how many groups are conteding for the disk
>> > and what could be the worst case scenario. So if I have got two groups
>> > with A and B with weight 1 and 2 and both are contending, then as an
>> > admin one would expect to get 33% of BW for group A in worst case (if
>> > group B is continuously backlogged). If B is not contending than A can get
>> > 100% of BW. So while configuring the system, will one not plan for worst
>> > case (33% for A, and 66 % for B)?
>>
>> OK, I'm quite convinced.. :)
>>
>> To a large degree, if we want to provide a BW reservation strategy we
>> must provide an interface that allows cgroups to ask for time slices
>> such as max/min 5 IO requests every 50ms or something like that.
>> Probably the same functionality can be achieved translating time slices
>> from weights, percentages or absolute BW limits.
>
> Ok, I would like to split it in two parts.
>
> I think providng minimum gurantee in absolute terms like 5 IO request
> every 50ms will be very hard because IO scheduler has no control over
> how many competitors are there. An easier thing will be to have minimum
> gurantees on share basis. For minimum BW (disk time slice) gurantee, admin
> shall have to create right cgroup hierarchy and assign weights properly and
> then admin can calculate what % of disk slice a particular group will get
> as minimum gurantee. (This is more complicated than this as there are
> time slices which are not accounted to any groups. During queue switch
> cfq starts the time slice counting only after first request has completed
> to offset the impact of seeking and i guess also NCQ).

I agree with Vivek that absolute metrics like 5 IO requests every 50ms
might be hard to offer. But 'x ms of disk time every y ms, for a given
cgroup' might be a desirable goal. That said, for now we can focus on
weight-based allocation of disk time, and leave such goals for the future.
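
For what it's worth, a toy model of such an 'x ms of disk time every y ms'
budget (invented names; nothing like this is in the current patches) could
look like:

    #include <stdio.h>

    struct disk_time_budget {
        unsigned long slice_ms;     /* x: disk time granted per period */
        unsigned long period_ms;    /* y: length of the period */
        unsigned long used_ms;      /* disk time consumed in this period */
        unsigned long period_start;
    };

    static int group_may_run(struct disk_time_budget *b, unsigned long now_ms)
    {
        if (now_ms - b->period_start >= b->period_ms) {
            b->period_start = now_ms;   /* start a new period */
            b->used_ms = 0;
        }
        return b->used_ms < b->slice_ms;
    }

    int main(void)
    {
        struct disk_time_budget b = { .slice_ms = 10, .period_ms = 50 };
        unsigned long now;
        int granted = 0;

        for (now = 0; now < 100; now++) {
            if (group_may_run(&b, now)) {
                b.used_ms++;    /* charge 1ms of disk time to the group */
                granted++;
            }
        }
        printf("disk time granted over 100 ms: %d ms\n", granted);  /* 20 ms */
        return 0;
    }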

>
> I think it should be possible to give max bandwidth gurantees in absolute
> terms, like io/s or sectors/sec or MB/sec etc, because only thing IO
> scheduler has to do is to not allow dispatch from a particular queue if
> it has crossed its limit and then either let the disk idle or move onto
> next eligible queue.
>
> The only issue here will be async writes. max bw gurantee for async writes
> at IO scheduler level might not mean much to application because of page
> cache.
>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-17 14:13                       ` Vivek Goyal
                                           ` (3 preceding siblings ...)
  2009-04-18 13:19                         ` Balbir Singh
@ 2009-04-19  4:35                         ` Nauman Rafique
  4 siblings, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-19  4:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrea Righi, Andrew Morton, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz, John Wilkes

On Fri, Apr 17, 2009 at 7:13 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
>> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
>> > > I think it would be possible to implement both proportional and limiting
>> > > rules at the same level (e.g., the IO scheduler), but we need also to
>> > > address the memory consumption problem (I still need to review your
>> > > patchset in details and I'm going to test it soon :), so I don't know if
>> > > you already addressed this issue).
>> > >
>> >
>> > Can you please elaborate a bit on this? Are you concerned about that data
>> > structures created to solve the problem consume a lot of memory?
>>
>> Sorry I was not very clear here. With memory consumption I mean wasting
>> the memory with hard/slow reclaimable dirty pages or pending IO
>> requests.
>>
>> If there's only a global limit on dirty pages, any cgroup can exhaust
>> that limit and cause other cgroups/processes to block when they try to
>> write to disk.
>>
>> But, ok, the IO controller is not probably the best place to implement
>> such functionality. I should rework on the per cgroup dirty_ratio:
>>
>> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
>>
>> Last time we focused too much on the best interfaces to define dirty
>> pages limit, and I never re-posted an updated version of this patchset.
>> Now I think we can simply provide the same dirty_ratio/dirty_bytes
>> interface that we provide globally, but per cgroup.
>>
>> >
>> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
>> > > in the cgroup that exceeds its limit, how do we avoid the waste of
>> > > memory due to the succeeding IO requests and the increasingly dirty
>> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
>> > > but I think we talked about this problem in a previous email... sorry I
>> > > don't find the discussion in my mail archives.
>> > >
>> > > IMHO a nice approach would be to measure IO consumption at the IO
>> > > scheduler level, and control IO applying proportional weights / absolute
>> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
>> > > time block the tasks from dirtying memory that will generate additional
>> > > IO requests.
>> > >
>> > > Anyway, there's no need to provide this with a single IO controller, we
>> > > could split the problem in two parts: 1) provide a proportional /
>> > > absolute IO controller in the IO schedulers and 2) allow to set, for
>> > > example, a maximum limit of dirty pages for each cgroup.
>> > >
>> >
>> > I think setting a maximum limit on dirty pages is an interesting thought.
>> > It sounds like as if memory controller can handle it?
>>
>> Exactly, the same above.
>
> Thinking more about it. Memory controller can probably enforce the higher
> limit but it would not easily translate into a fixed upper async write
> rate. Till the process hits the page cache limit or is slowed down by
> dirty page writeout, it can get a very high async write BW.
>
> So memory controller page cache limit will help but it would not direclty
> translate into what max bw limit patches are doing.
>
> Even if we do max bw control at IO scheduler level, async writes are
> problematic again. IO controller will not be able to throttle the process
> until it sees actuall write request. In big memory systems, writeout might
> not happen for some time and till then it will see a high throughput.
>
> So doing async write throttling at higher layer and not at IO scheduler
> layer gives us the opprotunity to produce more accurate results.
>
> For sync requests, I think IO scheduler max bw control should work fine.
>
> BTW, andrea, what is the use case of your patches? Andrew had mentioned
> that some people are already using it. I am curious to know will a
> proportional BW controller will solve the issues/requirements of these
> people or they have specific requirement of traffic shaping and max bw
> controller only.
>
> [..]
>> > > > Can you please give little more details here regarding how QoS requirements
>> > > > are not met with proportional weight?
>> > >
>> > > With proportional weights the whole bandwidth is allocated if no one
>> > > else is using it. When IO is submitted other tasks with a higher weight
>> > > can be forced to sleep until the IO generated by the low weight tasks is
>> > > not completely dispatched. Or any extent of the priority inversion
>> > > problems.
>> >
>> > Hmm..., I am not very sure here. When admin is allocating the weights, he
>> > has the whole picture. He knows how many groups are conteding for the disk
>> > and what could be the worst case scenario. So if I have got two groups
>> > with A and B with weight 1 and 2 and both are contending, then as an
>> > admin one would expect to get 33% of BW for group A in worst case (if
>> > group B is continuously backlogged). If B is not contending than A can get
>> > 100% of BW. So while configuring the system, will one not plan for worst
>> > case (33% for A, and 66 % for B)?
>>
>> OK, I'm quite convinced.. :)
>>
>> To a large degree, if we want to provide a BW reservation strategy we
>> must provide an interface that allows cgroups to ask for time slices
>> such as max/min 5 IO requests every 50ms or something like that.
>> Probably the same functionality can be achieved translating time slices
>> from weights, percentages or absolute BW limits.
>
> Ok, I would like to split it in two parts.
>
> I think providng minimum gurantee in absolute terms like 5 IO request
> every 50ms will be very hard because IO scheduler has no control over
> how many competitors are there. An easier thing will be to have minimum
> gurantees on share basis. For minimum BW (disk time slice) gurantee, admin
> shall have to create right cgroup hierarchy and assign weights properly and
> then admin can calculate what % of disk slice a particular group will get
> as minimum gurantee. (This is more complicated than this as there are
> time slices which are not accounted to any groups. During queue switch
> cfq starts the time slice counting only after first request has completed
> to offset the impact of seeking and i guess also NCQ).

I agree with Vivek that absolute metrics like 5 IO requests every 50ms
might be hard to offer. But 'x ms of disk time every y ms, for a given
cgroup' might be a desirable goal. That said, for now we can focus on
weight-based allocation of disk time, and leave such goals for the future.

>
> I think it should be possible to give max bandwidth gurantees in absolute
> terms, like io/s or sectors/sec or MB/sec etc, because only thing IO
> scheduler has to do is to not allow dispatch from a particular queue if
> it has crossed its limit and then either let the disk idle or move onto
> next eligible queue.
>
> The only issue here will be async writes. max bw gurantee for async writes
> at IO scheduler level might not mean much to application because of page
> cache.
>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-17 18:09                         ` Nauman Rafique
@ 2009-04-19 12:59                               ` Vivek Goyal
  2009-04-18  8:13                           ` Andrea Righi
  2009-04-19 13:08                           ` Vivek Goyal
  2 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-19 12:59 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil, Andrea Righi,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Fri, Apr 17, 2009 at 11:09:51AM -0700, Nauman Rafique wrote:
> On Fri, Apr 17, 2009 at 7:13 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
> >> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> >> > > I think it would be possible to implement both proportional and limiting
> >> > > rules at the same level (e.g., the IO scheduler), but we need also to
> >> > > address the memory consumption problem (I still need to review your
> >> > > patchset in details and I'm going to test it soon :), so I don't know if
> >> > > you already addressed this issue).
> >> > >
> >> >
> >> > Can you please elaborate a bit on this? Are you concerned about that data
> >> > structures created to solve the problem consume a lot of memory?
> >>
> >> Sorry I was not very clear here. With memory consumption I mean wasting
> >> the memory with hard/slow reclaimable dirty pages or pending IO
> >> requests.
> >>
> >> If there's only a global limit on dirty pages, any cgroup can exhaust
> >> that limit and cause other cgroups/processes to block when they try to
> >> write to disk.
> >>
> >> But, ok, the IO controller is not probably the best place to implement
> >> such functionality. I should rework on the per cgroup dirty_ratio:
> >>
> >> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> >>
> >> Last time we focused too much on the best interfaces to define dirty
> >> pages limit, and I never re-posted an updated version of this patchset.
> >> Now I think we can simply provide the same dirty_ratio/dirty_bytes
> >> interface that we provide globally, but per cgroup.
> >>
> >> >
> >> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
> >> > > in the cgroup that exceeds its limit, how do we avoid the waste of
> >> > > memory due to the succeeding IO requests and the increasingly dirty
> >> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> >> > > but I think we talked about this problem in a previous email... sorry I
> >> > > don't find the discussion in my mail archives.
> >> > >
> >> > > IMHO a nice approach would be to measure IO consumption at the IO
> >> > > scheduler level, and control IO applying proportional weights / absolute
> >> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
> >> > > time block the tasks from dirtying memory that will generate additional
> >> > > IO requests.
> >> > >
> >> > > Anyway, there's no need to provide this with a single IO controller, we
> >> > > could split the problem in two parts: 1) provide a proportional /
> >> > > absolute IO controller in the IO schedulers and 2) allow to set, for
> >> > > example, a maximum limit of dirty pages for each cgroup.
> >> > >
> >> >
> >> > I think setting a maximum limit on dirty pages is an interesting thought.
> >> > It sounds like as if memory controller can handle it?
> >>
> >> Exactly, the same above.
> >
> > Thinking more about it. Memory controller can probably enforce the higher
> > limit but it would not easily translate into a fixed upper async write
> > rate. Till the process hits the page cache limit or is slowed down by
> > dirty page writeout, it can get a very high async write BW.
> >
> > So memory controller page cache limit will help but it would not direclty
> > translate into what max bw limit patches are doing.
> >
> > Even if we do max bw control at IO scheduler level, async writes are
> > problematic again. IO controller will not be able to throttle the process
> > until it sees actuall write request. In big memory systems, writeout might
> > not happen for some time and till then it will see a high throughput.
> >
> > So doing async write throttling at higher layer and not at IO scheduler
> > layer gives us the opprotunity to produce more accurate results.
> 
> Wouldn't 'doing control on writes at a higher layer' have the same
> problems as the ones we talk about in dm-ioband? What if the cgroup
> being throttled for dirtying pages has a high weight assigned to it at
> the IO scheduler level? What if there are threads of different classes
> within that cgroup, and we would want to let RT task dirty the pages
> before BE tasks? I am not sure all these questions make sense, but
> just wanted to raise issues that might pop up.
> 

Yes, I would think that you will run into the same issues if one is not
maintaining separate queues. The only difference is that we are throttling
only async writes, and I am not sure how well the notion of fairness is
defined for async writes in the current system, as most of the time nobody
has cared too much about the latency of async writes (fsync is the exception).

But in general, yes, putting things into a single queue puts us back into
the same situation where we lose the notion of class and prio within a
cgroup. So any design which tries to do that is probably not a very good
idea.
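
To make the "separate queues" point concrete, the layout being argued for is
roughly the following sketch (the type and field names are invented for
illustration and are not taken from any posted patchset):

enum io_class { IOC_RT, IOC_BE, IOC_IDLE, IOC_NR_CLASSES };
#define IOPRIO_NR_LEVELS 8

struct io_queue;                        /* one FIFO of pending requests */

struct io_group {                       /* one instance per cgroup, per device */
        unsigned int weight;            /* proportional disk-time share */
        struct io_queue *queues[IOC_NR_CLASSES][IOPRIO_NR_LEVELS];
};

/*
 * Collapsing everything into a single queue per cgroup flattens the two
 * inner dimensions, and that is exactly where the RT-vs-BE and prio
 * distinctions get lost.
 */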

> If the whole system is designed with cgroups in mind, then throttling
> at IO scheduler layer should lead to backlog, that could be seen at
> higher level. For example, if a cgroup is not getting service at IO
> scheduler level, it should run out of request descriptors, and thus
> the thread writing back dirty pages should notice it (if its pdflush,
> blocking it is probably not the best idea). And that should mean the
> cgroup should hit the dirty threshold, and disallow the task to dirty
> further pages. There is a possibility though that getting all this
> right might be an overkill and we can get away with a simpler
> solution. One possibility seems to be that we provide some feedback
> from IO scheduling layer to higher layers, that cgroup is hitting its
> write bandwith limit, and should not be allowed to dirty any more
> pages.

I think bdi_write_congested() already provides feedback to the upper
layer about how congested the queue is, and it is likely that any thread
submitting a new write request to this bdi will be made to sleep before
it gets the request descriptor.

I think in the new scheme of things, this notion of per-device congestion will
have to be changed into per-device, per-cgroup congestion.
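
Concretely, that could look something like the sketch below.
bdi_write_congested() is the existing per-device helper; the io_group_congest
structure and bdi_cgroup_write_congested() are hypothetical names used only
to illustrate the per-device, per-cgroup idea:

#include <linux/backing-dev.h>
#include <linux/bitops.h>

/* Hypothetical sketch, not from the kernel or the posted patches. */
struct io_group_congest {
        unsigned long state;            /* per-cgroup, per-device congestion bits */
};
#define IOG_write_congested 0

static inline int iog_write_congested(struct io_group_congest *iog)
{
        return test_bit(IOG_write_congested, &iog->state);
}

/*
 * A writer would back off if either the device as a whole or its own
 * cgroup's share of that device is congested.
 */
static inline int bdi_cgroup_write_congested(struct backing_dev_info *bdi,
                                             struct io_group_congest *iog)
{
        return bdi_write_congested(bdi) || iog_write_congested(iog);
}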

Thanks
Vivek


> 
> >
> > For sync requests, I think IO scheduler max bw control should work fine.
> >
> > BTW, andrea, what is the use case of your patches? Andrew had mentioned
> > that some people are already using it. I am curious to know will a
> > proportional BW controller will solve the issues/requirements of these
> > people or they have specific requirement of traffic shaping and max bw
> > controller only.
> >
> > [..]
> >> > > > Can you please give little more details here regarding how QoS requirements
> >> > > > are not met with proportional weight?
> >> > >
> >> > > With proportional weights the whole bandwidth is allocated if no one
> >> > > else is using it. When IO is submitted other tasks with a higher weight
> >> > > can be forced to sleep until the IO generated by the low weight tasks is
> >> > > not completely dispatched. Or any extent of the priority inversion
> >> > > problems.
> >> >
> >> > Hmm..., I am not very sure here. When admin is allocating the weights, he
> >> > has the whole picture. He knows how many groups are conteding for the disk
> >> > and what could be the worst case scenario. So if I have got two groups
> >> > with A and B with weight 1 and 2 and both are contending, then as an
> >> > admin one would expect to get 33% of BW for group A in worst case (if
> >> > group B is continuously backlogged). If B is not contending than A can get
> >> > 100% of BW. So while configuring the system, will one not plan for worst
> >> > case (33% for A, and 66 % for B)?
> >>
> >> OK, I'm quite convinced.. :)
> >>
> >> To a large degree, if we want to provide a BW reservation strategy we
> >> must provide an interface that allows cgroups to ask for time slices
> >> such as max/min 5 IO requests every 50ms or something like that.
> >> Probably the same functionality can be achieved translating time slices
> >> from weights, percentages or absolute BW limits.
> >
> > Ok, I would like to split it in two parts.
> >
> > I think providng minimum gurantee in absolute terms like 5 IO request
> > every 50ms will be very hard because IO scheduler has no control over
> > how many competitors are there. An easier thing will be to have minimum
> > gurantees on share basis. For minimum BW (disk time slice) gurantee, admin
> > shall have to create right cgroup hierarchy and assign weights properly and
> > then admin can calculate what % of disk slice a particular group will get
> > as minimum gurantee. (This is more complicated than this as there are
> > time slices which are not accounted to any groups. During queue switch
> > cfq starts the time slice counting only after first request has completed
> > to offset the impact of seeking and i guess also NCQ).
> >
> > I think it should be possible to give max bandwidth gurantees in absolute
> > terms, like io/s or sectors/sec or MB/sec etc, because only thing IO
> > scheduler has to do is to not allow dispatch from a particular queue if
> > it has crossed its limit and then either let the disk idle or move onto
> > next eligible queue.
> >
> > The only issue here will be async writes. max bw gurantee for async writes
> > at IO scheduler level might not mean much to application because of page
> > cache.
> >
> > Thanks
> > Vivek
> >

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
@ 2009-04-19 12:59                               ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-19 12:59 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: Andrea Righi, Andrew Morton, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz

On Fri, Apr 17, 2009 at 11:09:51AM -0700, Nauman Rafique wrote:
> On Fri, Apr 17, 2009 at 7:13 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
> >> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> >> > > I think it would be possible to implement both proportional and limiting
> >> > > rules at the same level (e.g., the IO scheduler), but we need also to
> >> > > address the memory consumption problem (I still need to review your
> >> > > patchset in details and I'm going to test it soon :), so I don't know if
> >> > > you already addressed this issue).
> >> > >
> >> >
> >> > Can you please elaborate a bit on this? Are you concerned about that data
> >> > structures created to solve the problem consume a lot of memory?
> >>
> >> Sorry I was not very clear here. With memory consumption I mean wasting
> >> the memory with hard/slow reclaimable dirty pages or pending IO
> >> requests.
> >>
> >> If there's only a global limit on dirty pages, any cgroup can exhaust
> >> that limit and cause other cgroups/processes to block when they try to
> >> write to disk.
> >>
> >> But, ok, the IO controller is not probably the best place to implement
> >> such functionality. I should rework on the per cgroup dirty_ratio:
> >>
> >> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> >>
> >> Last time we focused too much on the best interfaces to define dirty
> >> pages limit, and I never re-posted an updated version of this patchset.
> >> Now I think we can simply provide the same dirty_ratio/dirty_bytes
> >> interface that we provide globally, but per cgroup.
> >>
> >> >
> >> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
> >> > > in the cgroup that exceeds its limit, how do we avoid the waste of
> >> > > memory due to the succeeding IO requests and the increasingly dirty
> >> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> >> > > but I think we talked about this problem in a previous email... sorry I
> >> > > don't find the discussion in my mail archives.
> >> > >
> >> > > IMHO a nice approach would be to measure IO consumption at the IO
> >> > > scheduler level, and control IO applying proportional weights / absolute
> >> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
> >> > > time block the tasks from dirtying memory that will generate additional
> >> > > IO requests.
> >> > >
> >> > > Anyway, there's no need to provide this with a single IO controller, we
> >> > > could split the problem in two parts: 1) provide a proportional /
> >> > > absolute IO controller in the IO schedulers and 2) allow to set, for
> >> > > example, a maximum limit of dirty pages for each cgroup.
> >> > >
> >> >
> >> > I think setting a maximum limit on dirty pages is an interesting thought.
> >> > It sounds like as if memory controller can handle it?
> >>
> >> Exactly, the same above.
> >
> > Thinking more about it. Memory controller can probably enforce the higher
> > limit but it would not easily translate into a fixed upper async write
> > rate. Till the process hits the page cache limit or is slowed down by
> > dirty page writeout, it can get a very high async write BW.
> >
> > So memory controller page cache limit will help but it would not direclty
> > translate into what max bw limit patches are doing.
> >
> > Even if we do max bw control at IO scheduler level, async writes are
> > problematic again. IO controller will not be able to throttle the process
> > until it sees actuall write request. In big memory systems, writeout might
> > not happen for some time and till then it will see a high throughput.
> >
> > So doing async write throttling at higher layer and not at IO scheduler
> > layer gives us the opprotunity to produce more accurate results.
> 
> Wouldn't 'doing control on writes at a higher layer' have the same
> problems as the ones we talk about in dm-ioband? What if the cgroup
> being throttled for dirtying pages has a high weight assigned to it at
> the IO scheduler level? What if there are threads of different classes
> within that cgroup, and we would want to let RT task dirty the pages
> before BE tasks? I am not sure all these questions make sense, but
> just wanted to raise issues that might pop up.
> 

Yes, I would think that you will run into the same issues if one is not
maintaining separate queues. The only difference is that we are throttling
only async writes, and I am not sure how well the notion of fairness is
defined for async writes in the current system, as most of the time nobody
has cared too much about the latency of async writes (fsync is the exception).

But in general, yes, putting things into a single queue puts us back into
the same situation where we lose the notion of class and prio within a
cgroup. So any design which tries to do that is probably not a very good
idea.

> If the whole system is designed with cgroups in mind, then throttling
> at IO scheduler layer should lead to backlog, that could be seen at
> higher level. For example, if a cgroup is not getting service at IO
> scheduler level, it should run out of request descriptors, and thus
> the thread writing back dirty pages should notice it (if its pdflush,
> blocking it is probably not the best idea). And that should mean the
> cgroup should hit the dirty threshold, and disallow the task to dirty
> further pages. There is a possibility though that getting all this
> right might be an overkill and we can get away with a simpler
> solution. One possibility seems to be that we provide some feedback
> from IO scheduling layer to higher layers, that cgroup is hitting its
> write bandwith limit, and should not be allowed to dirty any more
> pages.

I think bdi_write_congested() already provides feedback to the upper
layer about how congested the queue is, and it is likely that any thread
submitting a new write request to this bdi will be made to sleep before
it gets the request descriptor.

I think in the new scheme of things, this notion of per-device congestion will
have to be changed into per-device, per-cgroup congestion.

Thanks
Vivek


> 
> >
> > For sync requests, I think IO scheduler max bw control should work fine.
> >
> > BTW, andrea, what is the use case of your patches? Andrew had mentioned
> > that some people are already using it. I am curious to know will a
> > proportional BW controller will solve the issues/requirements of these
> > people or they have specific requirement of traffic shaping and max bw
> > controller only.
> >
> > [..]
> >> > > > Can you please give little more details here regarding how QoS requirements
> >> > > > are not met with proportional weight?
> >> > >
> >> > > With proportional weights the whole bandwidth is allocated if no one
> >> > > else is using it. When IO is submitted other tasks with a higher weight
> >> > > can be forced to sleep until the IO generated by the low weight tasks is
> >> > > not completely dispatched. Or any extent of the priority inversion
> >> > > problems.
> >> >
> >> > Hmm..., I am not very sure here. When admin is allocating the weights, he
> >> > has the whole picture. He knows how many groups are conteding for the disk
> >> > and what could be the worst case scenario. So if I have got two groups
> >> > with A and B with weight 1 and 2 and both are contending, then as an
> >> > admin one would expect to get 33% of BW for group A in worst case (if
> >> > group B is continuously backlogged). If B is not contending than A can get
> >> > 100% of BW. So while configuring the system, will one not plan for worst
> >> > case (33% for A, and 66 % for B)?
> >>
> >> OK, I'm quite convinced.. :)
> >>
> >> To a large degree, if we want to provide a BW reservation strategy we
> >> must provide an interface that allows cgroups to ask for time slices
> >> such as max/min 5 IO requests every 50ms or something like that.
> >> Probably the same functionality can be achieved translating time slices
> >> from weights, percentages or absolute BW limits.
> >
> > Ok, I would like to split it in two parts.
> >
> > I think providng minimum gurantee in absolute terms like 5 IO request
> > every 50ms will be very hard because IO scheduler has no control over
> > how many competitors are there. An easier thing will be to have minimum
> > gurantees on share basis. For minimum BW (disk time slice) gurantee, admin
> > shall have to create right cgroup hierarchy and assign weights properly and
> > then admin can calculate what % of disk slice a particular group will get
> > as minimum gurantee. (This is more complicated than this as there are
> > time slices which are not accounted to any groups. During queue switch
> > cfq starts the time slice counting only after first request has completed
> > to offset the impact of seeking and i guess also NCQ).
> >
> > I think it should be possible to give max bandwidth gurantees in absolute
> > terms, like io/s or sectors/sec or MB/sec etc, because only thing IO
> > scheduler has to do is to not allow dispatch from a particular queue if
> > it has crossed its limit and then either let the disk idle or move onto
> > next eligible queue.
> >
> > The only issue here will be async writes. max bw gurantee for async writes
> > at IO scheduler level might not mean much to application because of page
> > cache.
> >
> > Thanks
> > Vivek
> >

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
       [not found]                           ` <e98e18940904171109r17ccb054kb7879f8d02ac26b5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-04-18  8:13                             ` Andrea Righi
  2009-04-19 12:59                               ` Vivek Goyal
@ 2009-04-19 13:08                             ` Vivek Goyal
  2 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-19 13:08 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil, Andrea Righi,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Fri, Apr 17, 2009 at 11:09:51AM -0700, Nauman Rafique wrote:
> On Fri, Apr 17, 2009 at 7:13 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
> >> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> >> > > I think it would be possible to implement both proportional and limiting
> >> > > rules at the same level (e.g., the IO scheduler), but we need also to
> >> > > address the memory consumption problem (I still need to review your
> >> > > patchset in details and I'm going to test it soon :), so I don't know if
> >> > > you already addressed this issue).
> >> > >
> >> >
> >> > Can you please elaborate a bit on this? Are you concerned about that data
> >> > structures created to solve the problem consume a lot of memory?
> >>
> >> Sorry I was not very clear here. With memory consumption I mean wasting
> >> the memory with hard/slow reclaimable dirty pages or pending IO
> >> requests.
> >>
> >> If there's only a global limit on dirty pages, any cgroup can exhaust
> >> that limit and cause other cgroups/processes to block when they try to
> >> write to disk.
> >>
> >> But, ok, the IO controller is not probably the best place to implement
> >> such functionality. I should rework on the per cgroup dirty_ratio:
> >>
> >> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> >>
> >> Last time we focused too much on the best interfaces to define dirty
> >> pages limit, and I never re-posted an updated version of this patchset.
> >> Now I think we can simply provide the same dirty_ratio/dirty_bytes
> >> interface that we provide globally, but per cgroup.
> >>
> >> >
> >> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
> >> > > in the cgroup that exceeds its limit, how do we avoid the waste of
> >> > > memory due to the succeeding IO requests and the increasingly dirty
> >> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> >> > > but I think we talked about this problem in a previous email... sorry I
> >> > > don't find the discussion in my mail archives.
> >> > >
> >> > > IMHO a nice approach would be to measure IO consumption at the IO
> >> > > scheduler level, and control IO applying proportional weights / absolute
> >> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
> >> > > time block the tasks from dirtying memory that will generate additional
> >> > > IO requests.
> >> > >
> >> > > Anyway, there's no need to provide this with a single IO controller, we
> >> > > could split the problem in two parts: 1) provide a proportional /
> >> > > absolute IO controller in the IO schedulers and 2) allow to set, for
> >> > > example, a maximum limit of dirty pages for each cgroup.
> >> > >
> >> >
> >> > I think setting a maximum limit on dirty pages is an interesting thought.
> >> > It sounds like as if memory controller can handle it?
> >>
> >> Exactly, the same above.
> >
> > Thinking more about it. Memory controller can probably enforce the higher
> > limit but it would not easily translate into a fixed upper async write
> > rate. Till the process hits the page cache limit or is slowed down by
> > dirty page writeout, it can get a very high async write BW.
> >
> > So memory controller page cache limit will help but it would not direclty
> > translate into what max bw limit patches are doing.
> >
> > Even if we do max bw control at IO scheduler level, async writes are
> > problematic again. IO controller will not be able to throttle the process
> > until it sees actuall write request. In big memory systems, writeout might
> > not happen for some time and till then it will see a high throughput.
> >
> > So doing async write throttling at higher layer and not at IO scheduler
> > layer gives us the opprotunity to produce more accurate results.
> 
> Wouldn't 'doing control on writes at a higher layer' have the same
> problems as the ones we talk about in dm-ioband? What if the cgroup
> being throttled for dirtying pages has a high weight assigned to it at
> the IO scheduler level? What if there are threads of different classes
> within that cgroup, and we would want to let RT task dirty the pages
> before BE tasks? I am not sure all these questions make sense, but
> just wanted to raise issues that might pop up.
> 
> If the whole system is designed with cgroups in mind, then throttling
> at IO scheduler layer should lead to backlog, that could be seen at
> higher level. For example, if a cgroup is not getting service at IO
> scheduler level, it should run out of request descriptors, and thus
> the thread writing back dirty pages should notice it (if its pdflush,
> blocking it is probably not the best idea). And that should mean the
> cgroup should hit the dirty threshold, and disallow the task to dirty
> further pages. There is a possibility though that getting all this
> right might be an overkill and we can get away with a simpler
> solution.

Currently, if pdflush can't keep up and processes are dirtying the
page cache at a higher rate, then we will cross vm_dirty_ratio and the
process will be made to write back some of the dirty pages. That should
make sure that processes are automatically throttled at the IO scheduler
(assuming the process tries to write back its own pages and does not randomly
pick some other process's pages). Currently, I think a process can pick
any inode for writeback, not necessarily the inode the process is
dirtying.
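
As a toy model of the threshold test described above (the real check lives in
balance_dirty_pages(); the per-cgroup variant mentioned elsewhere in this
thread would apply the same test per group and is not shown here):

#include <stdbool.h>
#include <stdio.h>

/*
 * Simplified vm_dirty_ratio test: once a writer pushes dirty pages past
 * ratio% of the pages being considered, it is made to do some writeback
 * itself instead of dirtying more. A per-cgroup dirty limit would apply
 * the same test to the group's own dirty pages and its own limit.
 */
static bool must_do_writeback(unsigned long long dirty_pages,
                              unsigned long long total_pages,
                              unsigned int dirty_ratio_pct)
{
        return dirty_pages * 100 > total_pages * dirty_ratio_pct;
}

int main(void)
{
        unsigned long long total = 1048576ULL;  /* 4 GB worth of 4 KiB pages */

        /* ~11.4% dirty against vm_dirty_ratio = 10: the dirtier gets throttled */
        printf("%d\n", must_do_writeback(120000, total, 10));
        return 0;
}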

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-17 18:09                         ` Nauman Rafique
       [not found]                           ` <e98e18940904171109r17ccb054kb7879f8d02ac26b5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-04-18  8:13                           ` Andrea Righi
@ 2009-04-19 13:08                           ` Vivek Goyal
  2 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-19 13:08 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: Andrea Righi, Andrew Morton, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz

On Fri, Apr 17, 2009 at 11:09:51AM -0700, Nauman Rafique wrote:
> On Fri, Apr 17, 2009 at 7:13 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
> >> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> >> > > I think it would be possible to implement both proportional and limiting
> >> > > rules at the same level (e.g., the IO scheduler), but we need also to
> >> > > address the memory consumption problem (I still need to review your
> >> > > patchset in details and I'm going to test it soon :), so I don't know if
> >> > > you already addressed this issue).
> >> > >
> >> >
> >> > Can you please elaborate a bit on this? Are you concerned about that data
> >> > structures created to solve the problem consume a lot of memory?
> >>
> >> Sorry I was not very clear here. With memory consumption I mean wasting
> >> the memory with hard/slow reclaimable dirty pages or pending IO
> >> requests.
> >>
> >> If there's only a global limit on dirty pages, any cgroup can exhaust
> >> that limit and cause other cgroups/processes to block when they try to
> >> write to disk.
> >>
> >> But, ok, the IO controller is not probably the best place to implement
> >> such functionality. I should rework on the per cgroup dirty_ratio:
> >>
> >> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> >>
> >> Last time we focused too much on the best interfaces to define dirty
> >> pages limit, and I never re-posted an updated version of this patchset.
> >> Now I think we can simply provide the same dirty_ratio/dirty_bytes
> >> interface that we provide globally, but per cgroup.
> >>
> >> >
> >> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
> >> > > in the cgroup that exceeds its limit, how do we avoid the waste of
> >> > > memory due to the succeeding IO requests and the increasingly dirty
> >> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> >> > > but I think we talked about this problem in a previous email... sorry I
> >> > > don't find the discussion in my mail archives.
> >> > >
> >> > > IMHO a nice approach would be to measure IO consumption at the IO
> >> > > scheduler level, and control IO applying proportional weights / absolute
> >> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
> >> > > time block the tasks from dirtying memory that will generate additional
> >> > > IO requests.
> >> > >
> >> > > Anyway, there's no need to provide this with a single IO controller, we
> >> > > could split the problem in two parts: 1) provide a proportional /
> >> > > absolute IO controller in the IO schedulers and 2) allow to set, for
> >> > > example, a maximum limit of dirty pages for each cgroup.
> >> > >
> >> >
> >> > I think setting a maximum limit on dirty pages is an interesting thought.
> >> > It sounds like as if memory controller can handle it?
> >>
> >> Exactly, the same above.
> >
> > Thinking more about it. Memory controller can probably enforce the higher
> > limit but it would not easily translate into a fixed upper async write
> > rate. Till the process hits the page cache limit or is slowed down by
> > dirty page writeout, it can get a very high async write BW.
> >
> > So memory controller page cache limit will help but it would not direclty
> > translate into what max bw limit patches are doing.
> >
> > Even if we do max bw control at IO scheduler level, async writes are
> > problematic again. IO controller will not be able to throttle the process
> > until it sees actuall write request. In big memory systems, writeout might
> > not happen for some time and till then it will see a high throughput.
> >
> > So doing async write throttling at higher layer and not at IO scheduler
> > layer gives us the opprotunity to produce more accurate results.
> 
> Wouldn't 'doing control on writes at a higher layer' have the same
> problems as the ones we talk about in dm-ioband? What if the cgroup
> being throttled for dirtying pages has a high weight assigned to it at
> the IO scheduler level? What if there are threads of different classes
> within that cgroup, and we would want to let RT task dirty the pages
> before BE tasks? I am not sure all these questions make sense, but
> just wanted to raise issues that might pop up.
> 
> If the whole system is designed with cgroups in mind, then throttling
> at IO scheduler layer should lead to backlog, that could be seen at
> higher level. For example, if a cgroup is not getting service at IO
> scheduler level, it should run out of request descriptors, and thus
> the thread writing back dirty pages should notice it (if its pdflush,
> blocking it is probably not the best idea). And that should mean the
> cgroup should hit the dirty threshold, and disallow the task to dirty
> further pages. There is a possibility though that getting all this
> right might be an overkill and we can get away with a simpler
> solution.

Currently, if pdflush can't keep up and processes are dirtying the
page cache at a higher rate, then we will cross vm_dirty_ratio and the
process will be made to write back some of the dirty pages. That should
make sure that processes are automatically throttled at the IO scheduler
(assuming the process tries to write back its own pages and does not randomly
pick some other process's pages). Currently, I think a process can pick
any inode for writeback, not necessarily the inode the process is
dirtying.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-17 22:38                         ` Andrea Righi
@ 2009-04-19 13:21                             ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-19 13:21 UTC (permalink / raw)
  To: Andrew Morton, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA, jen

On Sat, Apr 18, 2009 at 12:38:10AM +0200, Andrea Righi wrote:
> On Fri, Apr 17, 2009 at 10:13:58AM -0400, Vivek Goyal wrote:
> > > > I think setting a maximum limit on dirty pages is an interesting thought.
> > > > It sounds like as if memory controller can handle it?
> > > 
> > > Exactly, the same above.
> > 
> > Thinking more about it. Memory controller can probably enforce the higher
> > limit but it would not easily translate into a fixed upper async write
> > rate. Till the process hits the page cache limit or is slowed down by
> > dirty page writeout, it can get a very high async write BW.
> > 
> > So memory controller page cache limit will help but it would not direclty
> > translate into what max bw limit patches are doing.
> 
> The memory controller can be used to set an upper limit of the dirty
> pages. When this limit is exceeded the tasks in the cgroup can be forced
> to write the exceeding dirty pages to disk. At this point the IO
> controller can: 1) throttle the task that is going to submit the IO
> requests, if the guy that dirtied the pages was actually the task
> itself, or 2) delay the submission of those requests to the elevator (or
> at the IO scheduler level) if it's writeback IO (e.g., made by pdflush).
> 

True, a per-cgroup dirty pages limit will help in making sure one cgroup
does not run away with the majority share of the page cache. And once a cgroup
hits its dirty limit it is forced to do writeback.

But my point is that it helps with bandwidth control, yet it does not directly
translate into what the max bw patches are doing. I thought your goal with
the max bw patches was to provide a consistent upper limit on the BW seen by
the application. So until an application hits the per-cgroup dirty limit,
it might see a spike in async write BW (much more than what has been
specified by the per-cgroup max bw limit), and that will defeat the purpose
of the max bw controller to some extent?

> Both functionalities should allow to have a BW control and avoid that
> any single cgroup can entirely exhaust the global limit of dirty pages.
> 
> > 
> > Even if we do max bw control at IO scheduler level, async writes are
> > problematic again. IO controller will not be able to throttle the process
> > until it sees actuall write request. In big memory systems, writeout might
> > not happen for some time and till then it will see a high throughput.
> > 
> > So doing async write throttling at higher layer and not at IO scheduler
> > layer gives us the opprotunity to produce more accurate results.
> 
> Totally agree.

I will correct myself here. After going through the documentation of the
max bw controller patches, it looks like you are also controlling
async writes only after they are actually being written to the disk, and
not at the time of async write admission into the page cache.

If that's the case, then doing this control at the IO scheduler level should
produce results similar to what you are seeing now with higher level
control. In fact, throttling at the IO scheduler has the advantage that one does
not have to worry about maintaining multiple queues for separate class
and prio requests, as the IO scheduler already does it.
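
To spell out what "do not allow dispatch from a queue that has crossed its
limit" reduces to, here is a minimal sketch with invented names (not code from
any posted patchset):

#include <stdbool.h>

struct iog_bw_limit {
        unsigned long max_sectors_per_sec;      /* configured max bandwidth */
        unsigned long dispatched_sectors;       /* dispatched in current window */
        unsigned long window_start_ms;          /* start of the 1-second window */
};

/*
 * May this cgroup's queue dispatch 'sectors' more sectors right now?
 * If not, the scheduler either lets the disk idle or moves on to the next
 * eligible queue, which is all the max-bw enforcement amounts to.
 */
static bool iog_may_dispatch(struct iog_bw_limit *l, unsigned long now_ms,
                             unsigned long sectors)
{
        if (now_ms - l->window_start_ms >= 1000) {      /* new accounting window */
                l->window_start_ms = now_ms;
                l->dispatched_sectors = 0;
        }
        if (l->dispatched_sectors + sectors > l->max_sectors_per_sec)
                return false;
        l->dispatched_sectors += sectors;
        return true;
}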

Thanks
Vivek

> 
> > 
> > For sync requests, I think IO scheduler max bw control should work fine.
> 
> ditto
> 
> -Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
@ 2009-04-19 13:21                             ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-19 13:21 UTC (permalink / raw)
  To: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, arozansk, jmoyer, oz-kernel, dhaval, balbir,
	linux-kernel, containers, menage, peterz

On Sat, Apr 18, 2009 at 12:38:10AM +0200, Andrea Righi wrote:
> On Fri, Apr 17, 2009 at 10:13:58AM -0400, Vivek Goyal wrote:
> > > > I think setting a maximum limit on dirty pages is an interesting thought.
> > > > It sounds like as if memory controller can handle it?
> > > 
> > > Exactly, the same above.
> > 
> > Thinking more about it. Memory controller can probably enforce the higher
> > limit but it would not easily translate into a fixed upper async write
> > rate. Till the process hits the page cache limit or is slowed down by
> > dirty page writeout, it can get a very high async write BW.
> > 
> > So memory controller page cache limit will help but it would not direclty
> > translate into what max bw limit patches are doing.
> 
> The memory controller can be used to set an upper limit of the dirty
> pages. When this limit is exceeded the tasks in the cgroup can be forced
> to write the exceeding dirty pages to disk. At this point the IO
> controller can: 1) throttle the task that is going to submit the IO
> requests, if the guy that dirtied the pages was actually the task
> itself, or 2) delay the submission of those requests to the elevator (or
> at the IO scheduler level) if it's writeback IO (e.g., made by pdflush).
> 

True, a per-cgroup dirty pages limit will help in making sure one cgroup
does not run away with the majority share of the page cache. And once a cgroup
hits its dirty limit it is forced to do writeback.

But my point is that it helps with bandwidth control, yet it does not directly
translate into what the max bw patches are doing. I thought your goal with
the max bw patches was to provide a consistent upper limit on the BW seen by
the application. So until an application hits the per-cgroup dirty limit,
it might see a spike in async write BW (much more than what has been
specified by the per-cgroup max bw limit), and that will defeat the purpose
of the max bw controller to some extent?

> Both functionalities should allow to have a BW control and avoid that
> any single cgroup can entirely exhaust the global limit of dirty pages.
> 
> > 
> > Even if we do max bw control at IO scheduler level, async writes are
> > problematic again. IO controller will not be able to throttle the process
> > until it sees actuall write request. In big memory systems, writeout might
> > not happen for some time and till then it will see a high throughput.
> > 
> > So doing async write throttling at higher layer and not at IO scheduler
> > layer gives us the opprotunity to produce more accurate results.
> 
> Totally agree.

I will correct myself here. After going through the documentation of the
max bw controller patches, it looks like you are also controlling
async writes only after they are actually being written to the disk, and
not at the time of async write admission into the page cache.

If that's the case, then doing this control at the IO scheduler level should
produce results similar to what you are seeing now with higher level
control. In fact, throttling at the IO scheduler has the advantage that one does
not have to worry about maintaining multiple queues for separate class
and prio requests, as the IO scheduler already does it.

Thanks
Vivek

> 
> > 
> > For sync requests, I think IO scheduler max bw control should work fine.
> 
> ditto
> 
> -Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
       [not found]                           ` <661de9470904180619k34e7998ch755a2ad3bed9ce5e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-04-19 13:45                             ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-19 13:45 UTC (permalink / raw)
  To: Balbir Singh
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil, Andrea Righi,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Sat, Apr 18, 2009 at 06:49:33PM +0530, Balbir Singh wrote:
> On Fri, Apr 17, 2009 at 7:43 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
> >> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> >> > > I think it would be possible to implement both proportional and limiting
> >> > > rules at the same level (e.g., the IO scheduler), but we need also to
> >> > > address the memory consumption problem (I still need to review your
> >> > > patchset in details and I'm going to test it soon :), so I don't know if
> >> > > you already addressed this issue).
> >> > >
> >> >
> >> > Can you please elaborate a bit on this? Are you concerned about that data
> >> > structures created to solve the problem consume a lot of memory?
> >>
> >> Sorry I was not very clear here. With memory consumption I mean wasting
> >> the memory with hard/slow reclaimable dirty pages or pending IO
> >> requests.
> >>
> >> If there's only a global limit on dirty pages, any cgroup can exhaust
> >> that limit and cause other cgroups/processes to block when they try to
> >> write to disk.
> >>
> >> But, ok, the IO controller is not probably the best place to implement
> >> such functionality. I should rework on the per cgroup dirty_ratio:
> >>
> >> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> >>
> >> Last time we focused too much on the best interfaces to define dirty
> >> pages limit, and I never re-posted an updated version of this patchset.
> >> Now I think we can simply provide the same dirty_ratio/dirty_bytes
> >> interface that we provide globally, but per cgroup.
> >>
> >> >
> >> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
> >> > > in the cgroup that exceeds its limit, how do we avoid the waste of
> >> > > memory due to the succeeding IO requests and the increasingly dirty
> >> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> >> > > but I think we talked about this problem in a previous email... sorry I
> >> > > don't find the discussion in my mail archives.
> >> > >
> >> > > IMHO a nice approach would be to measure IO consumption at the IO
> >> > > scheduler level, and control IO applying proportional weights / absolute
> >> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
> >> > > time block the tasks from dirtying memory that will generate additional
> >> > > IO requests.
> >> > >
> >> > > Anyway, there's no need to provide this with a single IO controller, we
> >> > > could split the problem in two parts: 1) provide a proportional /
> >> > > absolute IO controller in the IO schedulers and 2) allow to set, for
> >> > > example, a maximum limit of dirty pages for each cgroup.
> >> > >
> >> >
> >> > I think setting a maximum limit on dirty pages is an interesting thought.
> >> > It sounds like as if memory controller can handle it?
> >>
> >> Exactly, the same above.
> >
> > Thinking more about it. Memory controller can probably enforce the higher
> > limit but it would not easily translate into a fixed upper async write
> > rate. Till the process hits the page cache limit or is slowed down by
> > dirty page writeout, it can get a very high async write BW.
> >
> > So memory controller page cache limit will help but it would not direclty
> > translate into what max bw limit patches are doing.
> >
> > Even if we do max bw control at IO scheduler level, async writes are
> > problematic again. IO controller will not be able to throttle the process
> > until it sees actuall write request. In big memory systems, writeout might
> > not happen for some time and till then it will see a high throughput.
> >
> > So doing async write throttling at higher layer and not at IO scheduler
> > layer gives us the opprotunity to produce more accurate results.
> >
> > For sync requests, I think IO scheduler max bw control should work fine.
> >
> > BTW, andrea, what is the use case of your patches? Andrew had mentioned
> > that some people are already using it. I am curious to know will a
> > proportional BW controller will solve the issues/requirements of these
> > people or they have specific requirement of traffic shaping and max bw
> > controller only.
> >
> > [..]
> >> > > > Can you please give little more details here regarding how QoS requirements
> >> > > > are not met with proportional weight?
> >> > >
> >> > > With proportional weights the whole bandwidth is allocated if no one
> >> > > else is using it. When IO is submitted other tasks with a higher weight
> >> > > can be forced to sleep until the IO generated by the low weight tasks is
> >> > > not completely dispatched. Or any extent of the priority inversion
> >> > > problems.
> >> >
> >> > Hmm..., I am not very sure here. When admin is allocating the weights, he
> >> > has the whole picture. He knows how many groups are conteding for the disk
> >> > and what could be the worst case scenario. So if I have got two groups
> >> > with A and B with weight 1 and 2 and both are contending, then as an
> >> > admin one would expect to get 33% of BW for group A in worst case (if
> >> > group B is continuously backlogged). If B is not contending than A can get
> >> > 100% of BW. So while configuring the system, will one not plan for worst
> >> > case (33% for A, and 66 % for B)?
> >>
> >> OK, I'm quite convinced.. :)
> >>
> >> To a large degree, if we want to provide a BW reservation strategy we
> >> must provide an interface that allows cgroups to ask for time slices
> >> such as max/min 5 IO requests every 50ms or something like that.
> >> Probably the same functionality can be achieved translating time slices
> >> from weights, percentages or absolute BW limits.
> >
> > Ok, I would like to split it in two parts.
> >
> > I think providng minimum gurantee in absolute terms like 5 IO request
> > every 50ms will be very hard because IO scheduler has no control over
> > how many competitors are there. An easier thing will be to have minimum
> > gurantees on share basis. For minimum BW (disk time slice) gurantee, admin
> > shall have to create right cgroup hierarchy and assign weights properly and
> > then admin can calculate what % of disk slice a particular group will get
> > as minimum gurantee. (This is more complicated than this as there are
> > time slices which are not accounted to any groups. During queue switch
> > cfq starts the time slice counting only after first request has completed
> > to offset the impact of seeking and i guess also NCQ).
> >
> > I think it should be possible to give max bandwidth gurantees in absolute
> > terms, like io/s or sectors/sec or MB/sec etc, because only thing IO
> > scheduler has to do is to not allow dispatch from a particular queue if
> > it has crossed its limit and then either let the disk idle or move onto
> > next eligible queue.
> >
> > The only issue here will be async writes. max bw gurantee for async writes
> > at IO scheduler level might not mean much to application because of page
> > cache.
> 
> I see so much of the memory controller coming up. Since we've been
> discussing so many of these design points on mail, I wonder if it
> makes sense to summarize them somewhere (a wiki?). Would anyone like
> to take a shot at it?

Balbir, this is definitely a good idea. It will probably make more sense,
though, once we have had some more discussion and reached some shared
understanding of the issues.

Got a question for you. Does the memory controller already have a per-cgroup
dirty pages limit? If not, has this been discussed in the past? If yes,
what was the conclusion?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-18 13:19                         ` Balbir Singh
@ 2009-04-19 13:45                           ` Vivek Goyal
  2009-04-19 15:53                             ` Andrea Righi
       [not found]                             ` <20090419134508.GG8493-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
       [not found]                           ` <661de9470904180619k34e7998ch755a2ad3bed9ce5e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 2 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-19 13:45 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrea Righi, Andrew Morton, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, jens.axboe, ryov, fernando, s-uchida,
	taka, guijianfeng, arozansk, jmoyer, oz-kernel, dhaval,
	linux-kernel, containers, menage, peterz

On Sat, Apr 18, 2009 at 06:49:33PM +0530, Balbir Singh wrote:
> On Fri, Apr 17, 2009 at 7:43 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
> >> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> >> > > I think it would be possible to implement both proportional and limiting
> >> > > rules at the same level (e.g., the IO scheduler), but we need also to
> >> > > address the memory consumption problem (I still need to review your
> >> > > patchset in details and I'm going to test it soon :), so I don't know if
> >> > > you already addressed this issue).
> >> > >
> >> >
> >> > Can you please elaborate a bit on this? Are you concerned about that data
> >> > structures created to solve the problem consume a lot of memory?
> >>
> >> Sorry I was not very clear here. With memory consumption I mean wasting
> >> the memory with hard/slow reclaimable dirty pages or pending IO
> >> requests.
> >>
> >> If there's only a global limit on dirty pages, any cgroup can exhaust
> >> that limit and cause other cgroups/processes to block when they try to
> >> write to disk.
> >>
> >> But, ok, the IO controller is not probably the best place to implement
> >> such functionality. I should rework on the per cgroup dirty_ratio:
> >>
> >> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> >>
> >> Last time we focused too much on the best interfaces to define dirty
> >> pages limit, and I never re-posted an updated version of this patchset.
> >> Now I think we can simply provide the same dirty_ratio/dirty_bytes
> >> interface that we provide globally, but per cgroup.
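
A minimal sketch of what such a per-cgroup check might look like, assuming
a dirty_ratio/dirty_bytes pair with the same semantics as the global knobs
(all names here are hypothetical, not an existing interface):

struct cgroup_dirty_limits {
	unsigned long dirty_ratio;	/* percent of the cgroup's memory limit */
	unsigned long dirty_bytes;	/* absolute limit; 0 means "use ratio" */
};

static unsigned long cgroup_dirty_limit(struct cgroup_dirty_limits *d,
					unsigned long limit_pages,
					unsigned long page_size)
{
	if (d->dirty_bytes)
		return d->dirty_bytes / page_size;
	return limit_pages * d->dirty_ratio / 100;
}

/* The dirtying task is throttled once its cgroup crosses its own limit. */
static int cgroup_over_dirty_limit(struct cgroup_dirty_limits *d,
				   unsigned long dirty_pages,
				   unsigned long limit_pages,
				   unsigned long page_size)
{
	return dirty_pages > cgroup_dirty_limit(d, limit_pages, page_size);
}
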
> >>
> >> >
> >> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
> >> > > in the cgroup that exceeds its limit, how do we avoid the waste of
> >> > > memory due to the succeeding IO requests and the increasingly dirty
> >> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> >> > > but I think we talked about this problem in a previous email... sorry I
> >> > > don't find the discussion in my mail archives.
> >> > >
> >> > > IMHO a nice approach would be to measure IO consumption at the IO
> >> > > scheduler level, and control IO applying proportional weights / absolute
> >> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
> >> > > time block the tasks from dirtying memory that will generate additional
> >> > > IO requests.
> >> > >
> >> > > Anyway, there's no need to provide this with a single IO controller, we
> >> > > could split the problem in two parts: 1) provide a proportional /
> >> > > absolute IO controller in the IO schedulers and 2) allow to set, for
> >> > > example, a maximum limit of dirty pages for each cgroup.
> >> > >
> >> >
> >> > I think setting a maximum limit on dirty pages is an interesting thought.
> >> > It sounds like as if memory controller can handle it?
> >>
> >> Exactly, the same above.
> >
> > Thinking more about it. Memory controller can probably enforce the higher
> > limit but it would not easily translate into a fixed upper async write
> > rate. Till the process hits the page cache limit or is slowed down by
> > dirty page writeout, it can get a very high async write BW.
> >
> > So memory controller page cache limit will help but it would not direclty
> > translate into what max bw limit patches are doing.
> >
> > Even if we do max bw control at IO scheduler level, async writes are
> > problematic again. IO controller will not be able to throttle the process
> > until it sees actuall write request. In big memory systems, writeout might
> > not happen for some time and till then it will see a high throughput.
> >
> > So doing async write throttling at higher layer and not at IO scheduler
> > layer gives us the opprotunity to produce more accurate results.
> >
> > For sync requests, I think IO scheduler max bw control should work fine.
> >
> > BTW, andrea, what is the use case of your patches? Andrew had mentioned
> > that some people are already using it. I am curious to know will a
> > proportional BW controller will solve the issues/requirements of these
> > people or they have specific requirement of traffic shaping and max bw
> > controller only.
> >
> > [..]
> >> > > > Can you please give little more details here regarding how QoS requirements
> >> > > > are not met with proportional weight?
> >> > >
> >> > > With proportional weights the whole bandwidth is allocated if no one
> >> > > else is using it. When IO is submitted other tasks with a higher weight
> >> > > can be forced to sleep until the IO generated by the low weight tasks is
> >> > > not completely dispatched. Or any extent of the priority inversion
> >> > > problems.
> >> >
> >> > Hmm..., I am not very sure here. When admin is allocating the weights, he
> >> > has the whole picture. He knows how many groups are conteding for the disk
> >> > and what could be the worst case scenario. So if I have got two groups
> >> > with A and B with weight 1 and 2 and both are contending, then as an
> >> > admin one would expect to get 33% of BW for group A in worst case (if
> >> > group B is continuously backlogged). If B is not contending than A can get
> >> > 100% of BW. So while configuring the system, will one not plan for worst
> >> > case (33% for A, and 66 % for B)?
> >>
> >> OK, I'm quite convinced.. :)
> >>
> >> To a large degree, if we want to provide a BW reservation strategy we
> >> must provide an interface that allows cgroups to ask for time slices
> >> such as max/min 5 IO requests every 50ms or something like that.
> >> Probably the same functionality can be achieved translating time slices
> >> from weights, percentages or absolute BW limits.
> >
> > Ok, I would like to split it in two parts.
> >
> > I think providng minimum gurantee in absolute terms like 5 IO request
> > every 50ms will be very hard because IO scheduler has no control over
> > how many competitors are there. An easier thing will be to have minimum
> > gurantees on share basis. For minimum BW (disk time slice) gurantee, admin
> > shall have to create right cgroup hierarchy and assign weights properly and
> > then admin can calculate what % of disk slice a particular group will get
> > as minimum gurantee. (This is more complicated than this as there are
> > time slices which are not accounted to any groups. During queue switch
> > cfq starts the time slice counting only after first request has completed
> > to offset the impact of seeking and i guess also NCQ).
> >
> > I think it should be possible to give max bandwidth gurantees in absolute
> > terms, like io/s or sectors/sec or MB/sec etc, because only thing IO
> > scheduler has to do is to not allow dispatch from a particular queue if
> > it has crossed its limit and then either let the disk idle or move onto
> > next eligible queue.
> >
> > The only issue here will be async writes. max bw gurantee for async writes
> > at IO scheduler level might not mean much to application because of page
> > cache.
> 
> I see so much of the memory controller coming up. Since we've been
> discussing so many of these design points on mail, I wonder if it
> makes sense to summarize them somewhere (a wiki?). Would anyone like
> to take a shot at it?

Balbir, this is definitely a good idea. It will probably make more sense,
though, once we have had some more discussion and reached some shared
understanding of the issues.

Got a question for you. Does the memory controller already have a per-cgroup
dirty pages limit? If not, has this been discussed in the past? If yes,
what was the conclusion?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
       [not found]                             ` <20090419134508.GG8493-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-04-19 15:53                               ` Andrea Righi
  0 siblings, 0 replies; 190+ messages in thread
From: Andrea Righi @ 2009-04-19 15:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Balbir Singh,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	menage-hpIqsD4AKlfQT0dZR+AlfA

On Sun, Apr 19, 2009 at 09:45:08AM -0400, Vivek Goyal wrote:
> On Sat, Apr 18, 2009 at 06:49:33PM +0530, Balbir Singh wrote:
> > On Fri, Apr 17, 2009 at 7:43 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
> > >> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> > >> > > I think it would be possible to implement both proportional and limiting
> > >> > > rules at the same level (e.g., the IO scheduler), but we need also to
> > >> > > address the memory consumption problem (I still need to review your
> > >> > > patchset in details and I'm going to test it soon :), so I don't know if
> > >> > > you already addressed this issue).
> > >> > >
> > >> >
> > >> > Can you please elaborate a bit on this? Are you concerned about that data
> > >> > structures created to solve the problem consume a lot of memory?
> > >>
> > >> Sorry I was not very clear here. With memory consumption I mean wasting
> > >> the memory with hard/slow reclaimable dirty pages or pending IO
> > >> requests.
> > >>
> > >> If there's only a global limit on dirty pages, any cgroup can exhaust
> > >> that limit and cause other cgroups/processes to block when they try to
> > >> write to disk.
> > >>
> > >> But, ok, the IO controller is not probably the best place to implement
> > >> such functionality. I should rework on the per cgroup dirty_ratio:
> > >>
> > >> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> > >>
> > >> Last time we focused too much on the best interfaces to define dirty
> > >> pages limit, and I never re-posted an updated version of this patchset.
> > >> Now I think we can simply provide the same dirty_ratio/dirty_bytes
> > >> interface that we provide globally, but per cgroup.
> > >>
> > >> >
> > >> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
> > >> > > in the cgroup that exceeds its limit, how do we avoid the waste of
> > >> > > memory due to the succeeding IO requests and the increasingly dirty
> > >> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> > >> > > but I think we talked about this problem in a previous email... sorry I
> > >> > > don't find the discussion in my mail archives.
> > >> > >
> > >> > > IMHO a nice approach would be to measure IO consumption at the IO
> > >> > > scheduler level, and control IO applying proportional weights / absolute
> > >> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
> > >> > > time block the tasks from dirtying memory that will generate additional
> > >> > > IO requests.
> > >> > >
> > >> > > Anyway, there's no need to provide this with a single IO controller, we
> > >> > > could split the problem in two parts: 1) provide a proportional /
> > >> > > absolute IO controller in the IO schedulers and 2) allow to set, for
> > >> > > example, a maximum limit of dirty pages for each cgroup.
> > >> > >
> > >> >
> > >> > I think setting a maximum limit on dirty pages is an interesting thought.
> > >> > It sounds like as if memory controller can handle it?
> > >>
> > >> Exactly, the same above.
> > >
> > > Thinking more about it. Memory controller can probably enforce the higher
> > > limit but it would not easily translate into a fixed upper async write
> > > rate. Till the process hits the page cache limit or is slowed down by
> > > dirty page writeout, it can get a very high async write BW.
> > >
> > > So memory controller page cache limit will help but it would not direclty
> > > translate into what max bw limit patches are doing.
> > >
> > > Even if we do max bw control at IO scheduler level, async writes are
> > > problematic again. IO controller will not be able to throttle the process
> > > until it sees actuall write request. In big memory systems, writeout might
> > > not happen for some time and till then it will see a high throughput.
> > >
> > > So doing async write throttling at higher layer and not at IO scheduler
> > > layer gives us the opprotunity to produce more accurate results.
> > >
> > > For sync requests, I think IO scheduler max bw control should work fine.
> > >
> > > BTW, andrea, what is the use case of your patches? Andrew had mentioned
> > > that some people are already using it. I am curious to know will a
> > > proportional BW controller will solve the issues/requirements of these
> > > people or they have specific requirement of traffic shaping and max bw
> > > controller only.
> > >
> > > [..]
> > >> > > > Can you please give little more details here regarding how QoS requirements
> > >> > > > are not met with proportional weight?
> > >> > >
> > >> > > With proportional weights the whole bandwidth is allocated if no one
> > >> > > else is using it. When IO is submitted other tasks with a higher weight
> > >> > > can be forced to sleep until the IO generated by the low weight tasks is
> > >> > > not completely dispatched. Or any extent of the priority inversion
> > >> > > problems.
> > >> >
> > >> > Hmm..., I am not very sure here. When admin is allocating the weights, he
> > >> > has the whole picture. He knows how many groups are conteding for the disk
> > >> > and what could be the worst case scenario. So if I have got two groups
> > >> > with A and B with weight 1 and 2 and both are contending, then as an
> > >> > admin one would expect to get 33% of BW for group A in worst case (if
> > >> > group B is continuously backlogged). If B is not contending than A can get
> > >> > 100% of BW. So while configuring the system, will one not plan for worst
> > >> > case (33% for A, and 66 % for B)?
> > >>
> > >> OK, I'm quite convinced.. :)
> > >>
> > >> To a large degree, if we want to provide a BW reservation strategy we
> > >> must provide an interface that allows cgroups to ask for time slices
> > >> such as max/min 5 IO requests every 50ms or something like that.
> > >> Probably the same functionality can be achieved translating time slices
> > >> from weights, percentages or absolute BW limits.
> > >
> > > Ok, I would like to split it in two parts.
> > >
> > > I think providng minimum gurantee in absolute terms like 5 IO request
> > > every 50ms will be very hard because IO scheduler has no control over
> > > how many competitors are there. An easier thing will be to have minimum
> > > gurantees on share basis. For minimum BW (disk time slice) gurantee, admin
> > > shall have to create right cgroup hierarchy and assign weights properly and
> > > then admin can calculate what % of disk slice a particular group will get
> > > as minimum gurantee. (This is more complicated than this as there are
> > > time slices which are not accounted to any groups. During queue switch
> > > cfq starts the time slice counting only after first request has completed
> > > to offset the impact of seeking and i guess also NCQ).
> > >
> > > I think it should be possible to give max bandwidth gurantees in absolute
> > > terms, like io/s or sectors/sec or MB/sec etc, because only thing IO
> > > scheduler has to do is to not allow dispatch from a particular queue if
> > > it has crossed its limit and then either let the disk idle or move onto
> > > next eligible queue.
> > >
> > > The only issue here will be async writes. max bw gurantee for async writes
> > > at IO scheduler level might not mean much to application because of page
> > > cache.
> > 
> > I see so much of the memory controller coming up. Since we've been
> > discussing so many of these design points on mail, I wonder if it
> > makes sense to summarize them somewhere (a wiki?). Would anyone like
> > to take a shot at it?
> 
> Balbir, this is definitely a good idea. Just that once we have had some
> more discussion and some sort of understanding of issues, it might make
> more sense.

Sounds good. A wiki would be perfect IMHO: we could all contribute to the
documentation, integrate thoughts and ideas, and easily keep everything
updated.

> 
> Got a question for you. Does memory controller already have the per cgroup
> dirty pages limit? If no, has this been discussed in the past? if yes,
> what was the conclsion?

I think the answer is in the previous email. :)

-Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-19 13:45                           ` Vivek Goyal
@ 2009-04-19 15:53                             ` Andrea Righi
  2009-04-21  1:16                               ` KAMEZAWA Hiroyuki
  2009-04-21  1:16                               ` KAMEZAWA Hiroyuki
       [not found]                             ` <20090419134508.GG8493-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 190+ messages in thread
From: Andrea Righi @ 2009-04-19 15:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Balbir Singh, Andrew Morton, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, jens.axboe, ryov, fernando, s-uchida,
	taka, guijianfeng, arozansk, jmoyer, oz-kernel, dhaval,
	linux-kernel, containers, menage, peterz

On Sun, Apr 19, 2009 at 09:45:08AM -0400, Vivek Goyal wrote:
> On Sat, Apr 18, 2009 at 06:49:33PM +0530, Balbir Singh wrote:
> > On Fri, Apr 17, 2009 at 7:43 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > On Fri, Apr 17, 2009 at 11:37:28AM +0200, Andrea Righi wrote:
> > >> On Thu, Apr 16, 2009 at 02:37:53PM -0400, Vivek Goyal wrote:
> > >> > > I think it would be possible to implement both proportional and limiting
> > >> > > rules at the same level (e.g., the IO scheduler), but we need also to
> > >> > > address the memory consumption problem (I still need to review your
> > >> > > patchset in details and I'm going to test it soon :), so I don't know if
> > >> > > you already addressed this issue).
> > >> > >
> > >> >
> > >> > Can you please elaborate a bit on this? Are you concerned about that data
> > >> > structures created to solve the problem consume a lot of memory?
> > >>
> > >> Sorry I was not very clear here. With memory consumption I mean wasting
> > >> the memory with hard/slow reclaimable dirty pages or pending IO
> > >> requests.
> > >>
> > >> If there's only a global limit on dirty pages, any cgroup can exhaust
> > >> that limit and cause other cgroups/processes to block when they try to
> > >> write to disk.
> > >>
> > >> But, ok, the IO controller is not probably the best place to implement
> > >> such functionality. I should rework on the per cgroup dirty_ratio:
> > >>
> > >> https://lists.linux-foundation.org/pipermail/containers/2008-September/013140.html
> > >>
> > >> Last time we focused too much on the best interfaces to define dirty
> > >> pages limit, and I never re-posted an updated version of this patchset.
> > >> Now I think we can simply provide the same dirty_ratio/dirty_bytes
> > >> interface that we provide globally, but per cgroup.
> > >>
> > >> >
> > >> > > IOW if we simply don't dispatch requests and we don't throttle the tasks
> > >> > > in the cgroup that exceeds its limit, how do we avoid the waste of
> > >> > > memory due to the succeeding IO requests and the increasingly dirty
> > >> > > pages in the page cache (that are also hard to reclaim)? I may be wrong,
> > >> > > but I think we talked about this problem in a previous email... sorry I
> > >> > > don't find the discussion in my mail archives.
> > >> > >
> > >> > > IMHO a nice approach would be to measure IO consumption at the IO
> > >> > > scheduler level, and control IO applying proportional weights / absolute
> > >> > > limits _both_ at the IO scheduler / elevator level _and_ at the same
> > >> > > time block the tasks from dirtying memory that will generate additional
> > >> > > IO requests.
> > >> > >
> > >> > > Anyway, there's no need to provide this with a single IO controller, we
> > >> > > could split the problem in two parts: 1) provide a proportional /
> > >> > > absolute IO controller in the IO schedulers and 2) allow to set, for
> > >> > > example, a maximum limit of dirty pages for each cgroup.
> > >> > >
> > >> >
> > >> > I think setting a maximum limit on dirty pages is an interesting thought.
> > >> > It sounds like as if memory controller can handle it?
> > >>
> > >> Exactly, the same above.
> > >
> > > Thinking more about it. Memory controller can probably enforce the higher
> > > limit but it would not easily translate into a fixed upper async write
> > > rate. Till the process hits the page cache limit or is slowed down by
> > > dirty page writeout, it can get a very high async write BW.
> > >
> > > So memory controller page cache limit will help but it would not direclty
> > > translate into what max bw limit patches are doing.
> > >
> > > Even if we do max bw control at IO scheduler level, async writes are
> > > problematic again. IO controller will not be able to throttle the process
> > > until it sees actuall write request. In big memory systems, writeout might
> > > not happen for some time and till then it will see a high throughput.
> > >
> > > So doing async write throttling at higher layer and not at IO scheduler
> > > layer gives us the opprotunity to produce more accurate results.
> > >
> > > For sync requests, I think IO scheduler max bw control should work fine.
> > >
> > > BTW, andrea, what is the use case of your patches? Andrew had mentioned
> > > that some people are already using it. I am curious to know will a
> > > proportional BW controller will solve the issues/requirements of these
> > > people or they have specific requirement of traffic shaping and max bw
> > > controller only.
> > >
> > > [..]
> > >> > > > Can you please give little more details here regarding how QoS requirements
> > >> > > > are not met with proportional weight?
> > >> > >
> > >> > > With proportional weights the whole bandwidth is allocated if no one
> > >> > > else is using it. When IO is submitted other tasks with a higher weight
> > >> > > can be forced to sleep until the IO generated by the low weight tasks is
> > >> > > not completely dispatched. Or any extent of the priority inversion
> > >> > > problems.
> > >> >
> > >> > Hmm..., I am not very sure here. When admin is allocating the weights, he
> > >> > has the whole picture. He knows how many groups are conteding for the disk
> > >> > and what could be the worst case scenario. So if I have got two groups
> > >> > with A and B with weight 1 and 2 and both are contending, then as an
> > >> > admin one would expect to get 33% of BW for group A in worst case (if
> > >> > group B is continuously backlogged). If B is not contending than A can get
> > >> > 100% of BW. So while configuring the system, will one not plan for worst
> > >> > case (33% for A, and 66 % for B)?
> > >>
> > >> OK, I'm quite convinced.. :)
> > >>
> > >> To a large degree, if we want to provide a BW reservation strategy we
> > >> must provide an interface that allows cgroups to ask for time slices
> > >> such as max/min 5 IO requests every 50ms or something like that.
> > >> Probably the same functionality can be achieved translating time slices
> > >> from weights, percentages or absolute BW limits.
> > >
> > > Ok, I would like to split it in two parts.
> > >
> > > I think providng minimum gurantee in absolute terms like 5 IO request
> > > every 50ms will be very hard because IO scheduler has no control over
> > > how many competitors are there. An easier thing will be to have minimum
> > > gurantees on share basis. For minimum BW (disk time slice) gurantee, admin
> > > shall have to create right cgroup hierarchy and assign weights properly and
> > > then admin can calculate what % of disk slice a particular group will get
> > > as minimum gurantee. (This is more complicated than this as there are
> > > time slices which are not accounted to any groups. During queue switch
> > > cfq starts the time slice counting only after first request has completed
> > > to offset the impact of seeking and i guess also NCQ).
> > >
> > > I think it should be possible to give max bandwidth gurantees in absolute
> > > terms, like io/s or sectors/sec or MB/sec etc, because only thing IO
> > > scheduler has to do is to not allow dispatch from a particular queue if
> > > it has crossed its limit and then either let the disk idle or move onto
> > > next eligible queue.
> > >
> > > The only issue here will be async writes. max bw gurantee for async writes
> > > at IO scheduler level might not mean much to application because of page
> > > cache.
> > 
> > I see so much of the memory controller coming up. Since we've been
> > discussing so many of these design points on mail, I wonder if it
> > makes sense to summarize them somewhere (a wiki?). Would anyone like
> > to take a shot at it?
> 
> Balbir, this is definitely a good idea. Just that once we have had some
> more discussion and some sort of understanding of issues, it might make
> more sense.

Sounds good. A wiki would be perfect IMHO: we could all contribute to the
documentation, integrate thoughts and ideas, and easily keep everything
updated.

> 
> Got a question for you. Does memory controller already have the per cgroup
> dirty pages limit? If no, has this been discussed in the past? if yes,
> what was the conclsion?

I think the answer is in the previous email. :)

-Andrea

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-19 15:53                             ` Andrea Righi
@ 2009-04-21  1:16                               ` KAMEZAWA Hiroyuki
  2009-04-21  1:16                               ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 190+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-21  1:16 UTC (permalink / raw)
  To: Andrea Righi
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Andrew Morton,
	menage-hpIqsD4AKlfQT0dZR+AlfA, Balbir Singh

On Sun, 19 Apr 2009 17:53:59 +0200
Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > 
> > Got a question for you. Does memory controller already have the per cgroup
> > dirty pages limit? If no, has this been discussed in the past? if yes,
> > what was the conclsion?
> 

IMHO, dirty page handling and I/O throttling are different problems.

 - the task (or cgroup) which makes a page dirty
 and
 - the task (or cgroup) to which that page is accounted

are, in general, not the same (for example, a task in one cgroup may
redirty a page cache page that was first charged to a different cgroup).

I have a plan to add dirty_ratio to memcg, but it's for avoiding massive
starvation in memory reclaim, not for I/O control.

If you want to implement I/O throttling in the MM layer, please don't
depend on memcg. The purpose is different.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO controller discussion (Was: Re: [PATCH 01/10] Documentation)
  2009-04-19 15:53                             ` Andrea Righi
  2009-04-21  1:16                               ` KAMEZAWA Hiroyuki
@ 2009-04-21  1:16                               ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 190+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-21  1:16 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Vivek Goyal, dhaval, arozansk, jens.axboe, Balbir Singh,
	paolo.valente, jmoyer, fernando, oz-kernel, fchecconi,
	Andrew Morton, containers, linux-kernel, menage

On Sun, 19 Apr 2009 17:53:59 +0200
Andrea Righi <righi.andrea@gmail.com> wrote:
> > 
> > Got a question for you. Does memory controller already have the per cgroup
> > dirty pages limit? If no, has this been discussed in the past? if yes,
> > what was the conclsion?
> 

IMHO, dirty page handling and I/O throttling are different problems.

 - the task (or cgroup) which makes a page dirty
 and
 - the task (or cgroup) to which that page is accounted

are, in general, not the same (for example, a task in one cgroup may
redirty a page cache page that was first charged to a different cgroup).

I have a plan to add dirty_ratio to memcg, but it's for avoiding massive
starvation in memory reclaim, not for I/O control.

If you want to implement I/O throttling in the MM layer, please don't
depend on memcg. The purpose is different.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found]     ` <20090413130958.GB18007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-04-22  3:04       ` Gui Jianfeng
  0 siblings, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-04-22  3:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Vivek Goyal wrote:
> On Fri, Apr 10, 2009 at 05:33:10PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> Hi All,
>>>
>>> Here is another posting for IO controller patches. Last time I had posted
>>> RFC patches for an IO controller which did bio control per cgroup.
>>   Hi Vivek,
>>
>>   I got the following OOPS when testing, can't reproduce again :(
>>
> 
> Hi Gui,
> 
> Thanks for the report. Will look into it and see if I can reproduce it.

  Hi Vivek,

  The following script can reproduce the bug on my box.

#!/bin/sh

mkdir /cgroup
mount -t cgroup -o io io /cgroup
mkdir /cgroup/test1
mkdir /cgroup/test2

echo cfq > /sys/block/sda/queue/scheduler
echo 7 > /cgroup/test1/io.ioprio
echo 1 > /cgroup/test2/io.ioprio
echo 1 > /proc/sys/vm/drop_caches
dd if=1000M.1 of=/dev/null &
pid1=$!
echo $pid1
echo $pid1 > /cgroup/test1/tasks
dd if=1000M.2 of=/dev/null &
pid2=$!
echo $pid2
echo $pid2 > /cgroup/test2/tasks


rmdir /cgroup/test1
rmdir /cgroup/test2
umount /cgroup
rmdir /cgroup

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-04-13 13:09   ` Vivek Goyal
@ 2009-04-22  3:04     ` Gui Jianfeng
  2009-04-22  3:10       ` Nauman Rafique
                         ` (2 more replies)
       [not found]     ` <20090413130958.GB18007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 3 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-04-22  3:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

Vivek Goyal wrote:
> On Fri, Apr 10, 2009 at 05:33:10PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> Hi All,
>>>
>>> Here is another posting for IO controller patches. Last time I had posted
>>> RFC patches for an IO controller which did bio control per cgroup.
>>   Hi Vivek,
>>
>>   I got the following OOPS when testing, can't reproduce again :(
>>
> 
> Hi Gui,
> 
> Thanks for the report. Will look into it and see if I can reproduce it.

  Hi Vivek,

  The following script can reproduce the bug on my box.

#!/bin/sh

mkdir /cgroup
mount -t cgroup -o io io /cgroup
mkdir /cgroup/test1
mkdir /cgroup/test2

echo cfq > /sys/block/sda/queue/scheduler
echo 7 > /cgroup/test1/io.ioprio
echo 1 > /cgroup/test2/io.ioprio
echo 1 > /proc/sys/vm/drop_caches
dd if=1000M.1 of=/dev/null &
pid1=$!
echo $pid1
echo $pid1 > /cgroup/test1/tasks
dd if=1000M.2 of=/dev/null &
pid2=$!
echo $pid2
echo $pid2 > /cgroup/test2/tasks


rmdir /cgroup/test1
rmdir /cgroup/test2
umount /cgroup
rmdir /cgroup

-- 
Regards
Gui Jianfeng




^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found]       ` <49EE895A.1060101-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-04-22  3:10         ` Nauman Rafique
  2009-04-22 13:23         ` Vivek Goyal
  1 sibling, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-22  3:10 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: menage-hpIqsD4AKlfQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA

On Tue, Apr 21, 2009 at 8:04 PM, Gui Jianfeng
<guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
> Vivek Goyal wrote:
>> On Fri, Apr 10, 2009 at 05:33:10PM +0800, Gui Jianfeng wrote:
>>> Vivek Goyal wrote:
>>>> Hi All,
>>>>
>>>> Here is another posting for IO controller patches. Last time I had posted
>>>> RFC patches for an IO controller which did bio control per cgroup.
>>>   Hi Vivek,
>>>
>>>   I got the following OOPS when testing, can't reproduce again :(
>>>
>>
>> Hi Gui,
>>
>> Thanks for the report. Will look into it and see if I can reproduce it.
>
>  Hi Vivek,
>
>  The following script can reproduce the bug in my box.
>
> #!/bin/sh
>
> mkdir /cgroup
> mount -t cgroup -o io io /cgroup
> mkdir /cgroup/test1
> mkdir /cgroup/test2
>
> echo cfq > /sys/block/sda/queue/scheduler
> echo 7 > /cgroup/test1/io.ioprio
> echo 1 > /cgroup/test2/io.ioprio
> echo 1 > /proc/sys/vm/drop_caches
> dd if=1000M.1 of=/dev/null &
> pid1=$!
> echo $pid1
> echo $pid1 > /cgroup/test1/tasks
> dd if=1000M.2 of=/dev/null
> pid2=$!
> echo $pid2
> echo $pid2 > /cgroup/test2/tasks
>
>
> rmdir /cgroup/test1
> rmdir /cgroup/test2
> umount /cgroup
> rmdir /cgroup

Yes, this bug happens when we move a task from one cgroup to another and
then delete the old cgroup. Since the actual move to the new cgroup is
performed in a delayed fashion, if the cgroup is removed before another
request from the task is seen (and the actual move is performed), we hit
a BUG_ON. I am working on a patch that will solve this problem and a few
others; basically it adds reference counting for the io_group structure.
I am having a few problems with it at the moment; I will post the patch
as soon as I can get it to work.
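
A minimal sketch of the refcounting pattern described here (the real patch
appears later in this thread; these names are only illustrative):

#include <stdatomic.h>
#include <stdlib.h>

struct iog_sketch {
	atomic_int ref;
	/* ... per-group scheduling data ... */
};

static void iog_get(struct iog_sketch *iog)
{
	/* taken e.g. when a queue is attached to the group */
	atomic_fetch_add(&iog->ref, 1);
}

static void iog_put(struct iog_sketch *iog)
{
	/* the group is freed only after the last holder drops its reference,
	 * so a pending delayed queue move cannot outlive the group */
	if (atomic_fetch_sub(&iog->ref, 1) == 1)
		free(iog);
}
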

>
> --
> Regards
> Gui Jianfeng
>
>
>
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-04-22  3:04     ` Gui Jianfeng
@ 2009-04-22  3:10       ` Nauman Rafique
  2009-04-22 13:23       ` Vivek Goyal
       [not found]       ` <49EE895A.1060101-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2 siblings, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-22  3:10 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: Vivek Goyal, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

On Tue, Apr 21, 2009 at 8:04 PM, Gui Jianfeng
<guijianfeng@cn.fujitsu.com> wrote:
> Vivek Goyal wrote:
>> On Fri, Apr 10, 2009 at 05:33:10PM +0800, Gui Jianfeng wrote:
>>> Vivek Goyal wrote:
>>>> Hi All,
>>>>
>>>> Here is another posting for IO controller patches. Last time I had posted
>>>> RFC patches for an IO controller which did bio control per cgroup.
>>>   Hi Vivek,
>>>
>>>   I got the following OOPS when testing, can't reproduce again :(
>>>
>>
>> Hi Gui,
>>
>> Thanks for the report. Will look into it and see if I can reproduce it.
>
>  Hi Vivek,
>
>  The following script can reproduce the bug in my box.
>
> #!/bin/sh
>
> mkdir /cgroup
> mount -t cgroup -o io io /cgroup
> mkdir /cgroup/test1
> mkdir /cgroup/test2
>
> echo cfq > /sys/block/sda/queue/scheduler
> echo 7 > /cgroup/test1/io.ioprio
> echo 1 > /cgroup/test2/io.ioprio
> echo 1 > /proc/sys/vm/drop_caches
> dd if=1000M.1 of=/dev/null &
> pid1=$!
> echo $pid1
> echo $pid1 > /cgroup/test1/tasks
> dd if=1000M.2 of=/dev/null
> pid2=$!
> echo $pid2
> echo $pid2 > /cgroup/test2/tasks
>
>
> rmdir /cgroup/test1
> rmdir /cgroup/test2
> umount /cgroup
> rmdir /cgroup

Yes, this bug happens when we move a task from one cgroup to another and
then delete the old cgroup. Since the actual move to the new cgroup is
performed in a delayed fashion, if the cgroup is removed before another
request from the task is seen (and the actual move is performed), we hit
a BUG_ON. I am working on a patch that will solve this problem and a few
others; basically it adds reference counting for the io_group structure.
I am having a few problems with it at the moment; I will post the patch
as soon as I can get it to work.

>
> --
> Regards
> Gui Jianfeng
>
>
>
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found]       ` <49EE895A.1060101-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-04-22  3:10         ` Nauman Rafique
@ 2009-04-22 13:23         ` Vivek Goyal
  1 sibling, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-22 13:23 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	menage-hpIqsD4AKlfQT0dZR+AlfA, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Wed, Apr 22, 2009 at 11:04:58AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Fri, Apr 10, 2009 at 05:33:10PM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >>> Hi All,
> >>>
> >>> Here is another posting for IO controller patches. Last time I had posted
> >>> RFC patches for an IO controller which did bio control per cgroup.
> >>   Hi Vivek,
> >>
> >>   I got the following OOPS when testing, can't reproduce again :(
> >>
> > 
> > Hi Gui,
> > 
> > Thanks for the report. Will look into it and see if I can reproduce it.
> 
>   Hi Vivek,
> 
>   The following script can reproduce the bug in my box.
> 
> #!/bin/sh
> 
> mkdir /cgroup
> mount -t cgroup -o io io /cgroup
> mkdir /cgroup/test1
> mkdir /cgroup/test2
> 
> echo cfq > /sys/block/sda/queue/scheduler
> echo 7 > /cgroup/test1/io.ioprio
> echo 1 > /cgroup/test2/io.ioprio
> echo 1 > /proc/sys/vm/drop_caches
> dd if=1000M.1 of=/dev/null &
> pid1=$!
> echo $pid1
> echo $pid1 > /cgroup/test1/tasks
> dd if=1000M.2 of=/dev/null
> pid2=$!
> echo $pid2
> echo $pid2 > /cgroup/test2/tasks
> 
> 
> rmdir /cgroup/test1
> rmdir /cgroup/test2
> umount /cgroup
> rmdir /cgroup

Thanks Gui. We have got races between task movement and cgroup deletion.
In the original bfq patch, Fabio had implemented the logic to migrate the
task's queue synchronously. I found that logic to be a little complicated,
so I changed it to a delayed movement of the queue from the old cgroup to
the new cgroup. Fabio later mentioned that this introduces a race where
the old cgroup is deleted before the task's queue has actually moved to
the new cgroup.

Nauman is currently implementing reference counting for io groups. That
will solve this problem and, at the same time, some other problems, like
the movement of a queue to the root group during cgroup deletion, which
can potentially result in an unfair share for that queue for some time.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-04-22  3:04     ` Gui Jianfeng
  2009-04-22  3:10       ` Nauman Rafique
@ 2009-04-22 13:23       ` Vivek Goyal
       [not found]         ` <20090422132307.GA23098-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-04-30 19:38         ` Nauman Rafique
       [not found]       ` <49EE895A.1060101-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2 siblings, 2 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-04-22 13:23 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

On Wed, Apr 22, 2009 at 11:04:58AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Fri, Apr 10, 2009 at 05:33:10PM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >>> Hi All,
> >>>
> >>> Here is another posting for IO controller patches. Last time I had posted
> >>> RFC patches for an IO controller which did bio control per cgroup.
> >>   Hi Vivek,
> >>
> >>   I got the following OOPS when testing, can't reproduce again :(
> >>
> > 
> > Hi Gui,
> > 
> > Thanks for the report. Will look into it and see if I can reproduce it.
> 
>   Hi Vivek,
> 
>   The following script can reproduce the bug in my box.
> 
> #!/bin/sh
> 
> mkdir /cgroup
> mount -t cgroup -o io io /cgroup
> mkdir /cgroup/test1
> mkdir /cgroup/test2
> 
> echo cfq > /sys/block/sda/queue/scheduler
> echo 7 > /cgroup/test1/io.ioprio
> echo 1 > /cgroup/test2/io.ioprio
> echo 1 > /proc/sys/vm/drop_caches
> dd if=1000M.1 of=/dev/null &
> pid1=$!
> echo $pid1
> echo $pid1 > /cgroup/test1/tasks
> dd if=1000M.2 of=/dev/null
> pid2=$!
> echo $pid2
> echo $pid2 > /cgroup/test2/tasks
> 
> 
> rmdir /cgroup/test1
> rmdir /cgroup/test2
> umount /cgroup
> rmdir /cgroup

Thanks Gui. We have got races between task movement and cgroup deletion.
In the original bfq patch, Fabio had implemented the logic to migrate the
task's queue synchronously. I found that logic to be a little complicated,
so I changed it to a delayed movement of the queue from the old cgroup to
the new cgroup. Fabio later mentioned that this introduces a race where
the old cgroup is deleted before the task's queue has actually moved to
the new cgroup.

Nauman is currently implementing reference counting for io groups. That
will solve this problem and, at the same time, some other problems, like
the movement of a queue to the root group during cgroup deletion, which
can potentially result in an unfair share for that queue for some time.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found]         ` <20090422132307.GA23098-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-04-30 19:38           ` Nauman Rafique
  0 siblings, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-30 19:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Vivek Goyal wrote:
> On Wed, Apr 22, 2009 at 11:04:58AM +0800, Gui Jianfeng wrote:
>   
>> Vivek Goyal wrote:
>>     
>>> On Fri, Apr 10, 2009 at 05:33:10PM +0800, Gui Jianfeng wrote:
>>>       
>>>> Vivek Goyal wrote:
>>>>         
>>>>> Hi All,
>>>>>
>>>>> Here is another posting for IO controller patches. Last time I had posted
>>>>> RFC patches for an IO controller which did bio control per cgroup.
>>>>>           
>>>>   Hi Vivek,
>>>>
>>>>   I got the following OOPS when testing, can't reproduce again :(
>>>>
>>>>         
>>> Hi Gui,
>>>
>>> Thanks for the report. Will look into it and see if I can reproduce it.
>>>       
>>   Hi Vivek,
>>
>>   The following script can reproduce the bug in my box.
>>
>> #!/bin/sh
>>
>> mkdir /cgroup
>> mount -t cgroup -o io io /cgroup
>> mkdir /cgroup/test1
>> mkdir /cgroup/test2
>>
>> echo cfq > /sys/block/sda/queue/scheduler
>> echo 7 > /cgroup/test1/io.ioprio
>> echo 1 > /cgroup/test2/io.ioprio
>> echo 1 > /proc/sys/vm/drop_caches
>> dd if=1000M.1 of=/dev/null &
>> pid1=$!
>> echo $pid1
>> echo $pid1 > /cgroup/test1/tasks
>> dd if=1000M.2 of=/dev/null
>> pid2=$!
>> echo $pid2
>> echo $pid2 > /cgroup/test2/tasks
>>
>>
>> rmdir /cgroup/test1
>> rmdir /cgroup/test2
>> umount /cgroup
>> rmdir /cgroup
>>     
>
> Thanks Gui. We have got races with task movement and cgroup deletion. In
> the original bfq patch, Fabio had implemented the logic to migrate the
> task queue synchronously. It found the logic to be little complicated so I
> changed it to delayed movement of queue from old cgroup to new cgroup.
> Fabio later mentioned that it introduces a race where old cgroup is
> deleted before task queue has actually moved to new cgroup.
>
> Nauman is currently implementing reference counting for io groups and that
> will solve this problem at the same time some other problems like movement
> of queue to root group during cgroup deletion and which can potentially 
> result in unfair share for some time to that queue etc.
>
> Thanks
> Vivek
>   
Hi Gui,
This patch should solve the problems you reported. Please let me know if it does not work.
@Vivek, this has a few more changes on top of the patch I sent you separately.

DESC
Add ref counting for io_group.
EDESC
    
        Reference counting for io_group solves many problems, most of which
        occurred when we tried to delete the cgroup. Earlier, ioqs were being
        moved out of the cgroup to the root cgroup. That is problematic in
        many ways: first, the pending requests in those queues might get
        unfair service, and also cause unfairness for other cgroups at the
        root level. This problem can become significant if cgroups are
        created and destroyed relatively frequently. Second, moving queues
        to the root cgroup was complicated and was causing many BUG_ONs to
        trigger. Third, there is a single io queue in AS, Deadline and Noop
        within a cgroup, and it does not make sense to move it to the root
        cgroup. The same is true of async queues.

        Requests already keep a reference on the ioq, so queues keep a
        reference on the cgroup. For async queues in CFQ, and the single
        ioq in the other schedulers, the io_group also keeps a reference on
        the io_queue. This reference on the ioq is dropped when the queue
        is released (elv_release_ioq), so the queue can be freed.

        When a queue is released, it puts its reference to the io_group,
        and the io_group is released after all the queues are released.
        Child groups also take a reference on their parent groups, and
        release it when they are destroyed.

        Also, we no longer need to maintain a separate linked list of idle
        entities, which was maintained only to help release the ioq
        references during elevator switch. The code for releasing io_groups
        is reused for elevator switch, resulting in simpler and tighter code.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 0ecf7c7..21e8ab8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1090,8 +1090,8 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	if (async_cfqq != NULL) {
 		__iog = cfqq_to_io_group(async_cfqq);
-
 		if (iog != __iog) {
+			/* Cgroup has changed, drop the reference to async queue */
 			cic_set_cfqq(cic, NULL, 0);
 			cfq_put_queue(async_cfqq);
 		}
@@ -1099,8 +1099,10 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	if (sync_cfqq != NULL) {
 		__iog = cfqq_to_io_group(sync_cfqq);
-		if (iog != __iog)
-			io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+		if (iog != __iog) {
+			cic_set_cfqq(cic, NULL, 1);
+			cfq_put_queue(sync_cfqq);
+		}
 	}
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
@@ -1114,8 +1116,8 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
 #endif  /* CONFIG_IOSCHED_CFQ_HIER */
 
 static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
-				struct io_context *ioc, gfp_t gfp_mask)
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct io_group *iog,
+		     int is_sync, struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
@@ -1198,6 +1200,8 @@ alloc_ioq:
 			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
+		/* ioq reference on iog */
+		elv_get_iog(iog);
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
 	}
 
@@ -1229,7 +1233,8 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	}
 
 	if (!cfqq) {
-		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, iog, 
+					    is_sync, ioc, gfp_mask);
 		if (!cfqq)
 			return NULL;
 	}
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7474f6d..52419d1 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -198,7 +198,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
 				struct io_entity *entity)
 {
 	struct rb_node *next;
-	struct io_queue *ioq = io_entity_to_ioq(entity);
 
 	BUG_ON(entity->tree != &st->idle);
 
@@ -213,10 +212,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
 	}
 
 	bfq_extract(&st->idle, entity);
-
-	/* Delete queue from idle list */
-	if (ioq)
-		list_del(&ioq->queue_list);
 }
 
 /**
@@ -420,7 +415,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
 {
 	struct io_entity *first_idle = st->first_idle;
 	struct io_entity *last_idle = st->last_idle;
-	struct io_queue *ioq = io_entity_to_ioq(entity);
 
 	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
 		st->first_idle = entity;
@@ -428,10 +422,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
 		st->last_idle = entity;
 
 	bfq_insert(&st->idle, entity);
-
-	/* Add this queue to idle list */
-	if (ioq)
-		list_add(&ioq->queue_list, &ioq->efqd->idle_list);
 }
 
 /**
@@ -666,8 +656,13 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 {
 	struct io_sched_data *sd;
+	struct io_group *iog;
 	struct io_entity *parent;
 
+	iog = container_of(entity->sched_data, struct io_group, sched_data);
+	/* Hold a reference to entity's iog until we are done */
+	elv_get_iog(iog);
+
 	for_each_entity_safe(entity, parent) {
 		sd = entity->sched_data;
 
@@ -679,13 +674,15 @@ void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 			 */
 			break;
 
-		if (sd->next_active != NULL)
+		if (sd->next_active != NULL) {
 			/*
 			 * The parent entity is still backlogged and
 			 * the budgets on the path towards the root
 			 * need to be updated.
 			 */
+			elv_put_iog(iog);
 			goto update;
+		}
 
 		/*
 		 * If we reach there the parent is no more backlogged and
@@ -694,6 +691,7 @@ void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 		requeue = 1;
 	}
 
+	elv_put_iog(iog);
 	return;
 
 update:
@@ -944,6 +942,8 @@ void io_group_set_parent(struct io_group *iog, struct io_group *parent)
 	entity = &iog->entity;
 	entity->parent = parent->my_entity;
 	entity->sched_data = &parent->sched_data;
+	if (entity->parent)
+		elv_get_iog(parent);
 }
 
 /**
@@ -1052,6 +1052,10 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
+		/* Take the initial reference that will be released on destroy */
+		atomic_set(&iog->ref, 0);
+		iog->deleting = 0;
+		elv_get_iog(iog);
 
 		if (leaf == NULL) {
 			leaf = iog;
@@ -1074,7 +1078,7 @@ cleanup:
 	while (leaf != NULL) {
 		prev = leaf;
 		leaf = leaf->key;
-		kfree(iog);
+		kfree(prev);
 	}
 
 	return NULL;
@@ -1197,13 +1201,20 @@ void io_free_root_group(struct elevator_queue *e)
 	struct io_cgroup *iocg = &io_root_cgroup;
 	struct elv_fq_data *efqd = &e->efqd;
 	struct io_group *iog = efqd->root_group;
+	struct io_service_tree *st;
+	int i;
 
 	BUG_ON(!iog);
 	spin_lock_irq(&iocg->lock);
 	hlist_del_rcu(&iog->group_node);
 	spin_unlock_irq(&iocg->lock);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
 	io_put_io_group_queues(e, iog);
-	kfree(iog);
+	elv_put_iog(iog);
 }
 
 struct io_group *io_alloc_root_group(struct request_queue *q,
@@ -1217,6 +1228,7 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	if (iog == NULL)
 		return NULL;
 
+	elv_get_iog(iog);
 	iog->entity.parent = NULL;
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
@@ -1311,90 +1323,89 @@ void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
 	task_unlock(tsk);
 }
 
-/*
- * Move the queue to the root group if it is active. This is needed when
- * a cgroup is being deleted and all the IO is not done yet. This is not
- * very good scheme as a user might get unfair share. This needs to be
- * fixed.
+/* This cleanup function is does the last bit of things to destroy cgroup.
+   It should only get called after io_destroy_group has been invoked.
  */
-void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
-				struct io_group *iog)
+void io_group_cleanup(struct io_group *iog)
 {
-	int busy, resume;
-	struct io_entity *entity = &ioq->entity;
-	struct elv_fq_data *efqd = &e->efqd;
-	struct io_service_tree *st = io_entity_service_tree(entity);
-
-	busy = elv_ioq_busy(ioq);
-	resume = !!ioq->nr_queued;
+	struct io_service_tree *st;
+	struct io_entity *entity = iog->my_entity;
+	int i;
 
-	BUG_ON(resume && !entity->on_st);
-	BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
 
-	/*
-	 * We could be moving an queue which is on idle tree of previous group
-	 * What to do? I guess anyway this queue does not have any requests.
-	 * just forget the entity and free up from idle tree.
-	 *
-	 * This needs cleanup. Hackish.
-	 */
-	if (entity->tree == &st->idle) {
-		BUG_ON(atomic_read(&ioq->ref) < 2);
-		bfq_put_idle_entity(st, entity);
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+		BUG_ON(st->wsum != 0);
 	}
 
-	if (busy) {
-		BUG_ON(atomic_read(&ioq->ref) < 2);
+	BUG_ON(iog->sched_data.next_active != NULL);
+	BUG_ON(iog->sched_data.active_entity != NULL);
+	BUG_ON(entity != NULL && entity->tree != NULL);
 
-		if (!resume)
-			elv_del_ioq_busy(e, ioq, 0);
-		else
-			elv_deactivate_ioq(efqd, ioq, 0);
-	}
+	kfree(iog);
+}
 
-	/*
-	 * Here we use a reference to bfqg.  We don't need a refcounter
-	 * as the cgroup reference will not be dropped, so that its
-	 * destroy() callback will not be invoked.
-	 */
-	entity->parent = iog->my_entity;
-	entity->sched_data = &iog->sched_data;
+void elv_put_iog(struct io_group *iog)
+{
+	struct io_group *parent = NULL;
+	struct io_entity *entity;
 
-	if (busy && resume)
-		elv_activate_ioq(ioq);
+	BUG_ON(!iog);
+
+	entity = iog->my_entity;
+
+	BUG_ON(atomic_read(&iog->ref) <= 0);
+	if (!atomic_dec_and_test(&iog->ref))
+		return;
+
+	if (iog->my_entity)
+		parent = container_of(iog->my_entity->parent,
+				      struct io_group, entity);
+
+	if (entity)
+		__bfq_deactivate_entity(entity, 0);
+
+	io_group_cleanup(iog);
+
+	if (parent)
+		elv_put_iog(parent);
 }
-EXPORT_SYMBOL(io_ioq_move);
+EXPORT_SYMBOL(elv_put_iog);
 
+/* After the group is destroyed, no new sync IO should come to the group.
+   It might still have pending IOs in some busy queues. It should be able to 
+   send those IOs down to the disk. The async IOs (due to dirty page writeback)
+   would go in the root group queues after this, as the group does not exist
+   anymore.
+   When one of those busy queues gets new requests, the queue
+   is moved to the new cgroup. 
+*/
 static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
 {
 	struct elevator_queue *eq;
-	struct io_entity *entity = iog->my_entity;
 	struct io_service_tree *st;
 	int i;
 
-	eq = container_of(efqd, struct elevator_queue, efqd);
-	hlist_del(&iog->elv_data_node);
-	__bfq_deactivate_entity(entity, 0);
-	io_put_io_group_queues(eq, iog);
+	BUG_ON(iog->my_entity == NULL);
 
+	/* We flush idle tree now, and don't put things in there
+	   any more.
+	 */
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
 		st = iog->sched_data.service_tree + i;
-
-		/*
-		 * The idle tree may still contain bfq_queues belonging
-		 * to exited task because they never migrated to a different
-		 * cgroup from the one being destroyed now.  Noone else
-		 * can access them so it's safe to act without any lock.
-		 */
 		io_flush_idle_tree(st);
-
-		BUG_ON(!RB_EMPTY_ROOT(&st->active));
-		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
 	}
+	iog->deleting = 1;
 
-	BUG_ON(iog->sched_data.next_active != NULL);
-	BUG_ON(iog->sched_data.active_entity != NULL);
-	BUG_ON(entity->tree != NULL);
+	eq = container_of(efqd, struct elevator_queue, efqd);
+	hlist_del(&iog->elv_data_node);
+	io_put_io_group_queues(eq, iog);
+	/* Put the reference taken at the time of creation
+	   so that when all queues are gone, cgroup can be destroyed.
+	 */
+	elv_put_iog(iog);
 }
 
 /**
@@ -1438,14 +1449,6 @@ static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
 		spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
 	}
 	rcu_read_unlock();
-
-	/*
-	 * No need to defer the kfree() to the end of the RCU grace
-	 * period: we are called from the destroy() callback of our
-	 * cgroup, so we can be sure that noone is a) still using
-	 * this cgroup or b) doing lookups in it.
-	 */
-	kfree(iog);
 }
 
 void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1477,19 +1480,8 @@ void io_disconnect_groups(struct elevator_queue *e)
 
 	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
 					elv_data_node) {
-		hlist_del(&iog->elv_data_node);
-
-		__bfq_deactivate_entity(iog->my_entity, 0);
-
-		/*
-		 * Don't remove from the group hash, just set an
-		 * invalid key.  No lookups can race with the
-		 * assignment as bfqd is being destroyed; this
-		 * implies also that new elements cannot be added
-		 * to the list.
-		 */
-		rcu_assign_pointer(iog->key, NULL);
-		io_put_io_group_queues(e, iog);
+		hlist_del(&iog->group_node);
+		__io_destroy_group(efqd, iog);
 	}
 }
 
@@ -1637,6 +1629,7 @@ alloc_ioq:
 		elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
 		io_group_set_ioq(iog, ioq);
 		elv_mark_ioq_sync(ioq);
+		elv_get_iog(iog);
 	}
 
 	if (new_sched_q)
@@ -1997,10 +1990,14 @@ void elv_put_ioq(struct io_queue *ioq)
 	struct elv_fq_data *efqd = ioq->efqd;
 	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
 						efqd);
+	struct io_group *iog;
 
 	BUG_ON(atomic_read(&ioq->ref) <= 0);
 	if (!atomic_dec_and_test(&ioq->ref))
 		return;
+
+	iog = ioq_to_io_group(ioq);
+
 	BUG_ON(ioq->nr_queued);
 	BUG_ON(ioq->entity.tree != NULL);
 	BUG_ON(elv_ioq_busy(ioq));
@@ -2012,16 +2009,15 @@ void elv_put_ioq(struct io_queue *ioq)
 	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
 	elv_log_ioq(efqd, ioq, "freed");
 	elv_free_ioq(ioq);
+	elv_put_iog(iog);
 }
 EXPORT_SYMBOL(elv_put_ioq);
 
 void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
 {
-	struct io_group *root_group = e->efqd.root_group;
 	struct io_queue *ioq = *ioq_ptr;
 
 	if (ioq != NULL) {
-		io_ioq_move(e, ioq, root_group);
 		/* Drop the reference taken by the io group */
 		elv_put_ioq(ioq);
 		*ioq_ptr = NULL;
@@ -2122,9 +2118,14 @@ void elv_activate_ioq(struct io_queue *ioq)
 void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 					int requeue)
 {
+	struct io_group *iog = ioq_to_io_group(ioq);
+
 	if (ioq == efqd->active_queue)
 		elv_reset_active_ioq(efqd);
 
+	if (iog->deleting == 1)
+		requeue = 0;
+
 	bfq_deactivate_entity(&ioq->entity, requeue);
 }
 
@@ -2460,15 +2461,6 @@ void elv_ioq_arm_slice_timer(struct request_queue *q)
 	}
 }
 
-void elv_free_idle_ioq_list(struct elevator_queue *e)
-{
-	struct io_queue *ioq, *n;
-	struct elv_fq_data *efqd = &e->efqd;
-
-	list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
-		elv_deactivate_ioq(efqd, ioq, 0);
-}
-
 /*
  * Call iosched to let that elevator wants to expire the queue. This gives
  * iosched like AS to say no (if it is in the middle of batch changeover or
@@ -2838,8 +2830,6 @@ void elv_exit_fq_data(struct elevator_queue *e)
 	elv_shutdown_timer_wq(e);
 
 	spin_lock_irq(q->queue_lock);
-	/* This should drop all the idle tree references of ioq */
-	elv_free_idle_ioq_list(e);
 	/* This should drop all the io group references of async queues */
 	io_disconnect_groups(e);
 	spin_unlock_irq(q->queue_lock);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 62b2ee2..7622b28 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -213,6 +213,7 @@ struct io_group {
 	struct hlist_node elv_data_node;
 	struct hlist_node group_node;
 	struct io_sched_data sched_data;
+	atomic_t ref;
 
 	struct io_entity *my_entity;
 
@@ -229,6 +230,8 @@ struct io_group {
 
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
+
+	int deleting;
 };
 
 /**
@@ -462,6 +465,8 @@ extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
 					gfp_t gfp_mask);
 extern void elv_fq_unset_request_ioq(struct request_queue *q,
 					struct request *rq);
+extern void elv_put_iog(struct io_group *iog);
+
 extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
 
 /* Returns single ioq associated with the io group. */
@@ -480,6 +485,11 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
 	iog->ioq = ioq;
 }
 
+static inline void elv_get_iog(struct io_group *iog)
+{
+	atomic_inc(&iog->ref);
+}
+
 #else /* !GROUP_IOSCHED */
 /*
  * No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -531,6 +541,15 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
 	return NULL;
 }
 
+static inline void elv_get_iog(struct io_group *iog)
+{
+}
+
+static inline void elv_put_iog(struct io_group *iog)
+{
+}
+
+
 #endif /* GROUP_IOSCHED */
 
 /* Functions used by blksysfs.c */

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-04-22 13:23       ` Vivek Goyal
       [not found]         ` <20090422132307.GA23098-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-04-30 19:38         ` Nauman Rafique
  2009-05-05  3:18           ` Gui Jianfeng
       [not found]           ` <49F9FE3C.3070000-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-04-30 19:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Gui Jianfeng, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

Vivek Goyal wrote:
> On Wed, Apr 22, 2009 at 11:04:58AM +0800, Gui Jianfeng wrote:
>   
>> Vivek Goyal wrote:
>>     
>>> On Fri, Apr 10, 2009 at 05:33:10PM +0800, Gui Jianfeng wrote:
>>>       
>>>> Vivek Goyal wrote:
>>>>         
>>>>> Hi All,
>>>>>
>>>>> Here is another posting for IO controller patches. Last time I had posted
>>>>> RFC patches for an IO controller which did bio control per cgroup.
>>>>>           
>>>>   Hi Vivek,
>>>>
>>>>   I got the following OOPS when testing, can't reproduce again :(
>>>>
>>>>         
>>> Hi Gui,
>>>
>>> Thanks for the report. Will look into it and see if I can reproduce it.
>>>       
>>   Hi Vivek,
>>
>>   The following script can reproduce the bug in my box.
>>
>> #!/bin/sh
>>
>> mkdir /cgroup
>> mount -t cgroup -o io io /cgroup
>> mkdir /cgroup/test1
>> mkdir /cgroup/test2
>>
>> echo cfq > /sys/block/sda/queue/scheduler
>> echo 7 > /cgroup/test1/io.ioprio
>> echo 1 > /cgroup/test2/io.ioprio
>> echo 1 > /proc/sys/vm/drop_caches
>> dd if=1000M.1 of=/dev/null &
>> pid1=$!
>> echo $pid1
>> echo $pid1 > /cgroup/test1/tasks
>> dd if=1000M.2 of=/dev/null
>> pid2=$!
>> echo $pid2
>> echo $pid2 > /cgroup/test2/tasks
>>
>>
>> rmdir /cgroup/test1
>> rmdir /cgroup/test2
>> umount /cgroup
>> rmdir /cgroup
>>     
>
> Thanks Gui. We have got races with task movement and cgroup deletion. In
> the original bfq patch, Fabio had implemented the logic to migrate the
> task queue synchronously. I found the logic to be a little complicated, so I
> changed it to delayed movement of the queue from the old cgroup to the new one.
> Fabio later mentioned that it introduces a race where old cgroup is
> deleted before task queue has actually moved to new cgroup.
>
> Nauman is currently implementing reference counting for io groups, and that
> will solve this problem and, at the same time, some other problems, such as
> the movement of a queue to the root group during cgroup deletion, which can
> potentially give that queue an unfair share for some time.
>
> Thanks
> Vivek
>   
Hi Gui,
This patch should solve the problems reported by you. Please let me know if it does not work.
@Vivek, this has a few more changes after the patch I sent you separately.

DESC
Add ref counting for io_group.
EDESC
    
        Reference counting for io_group solves many problems, most of which
        occurred when we tried to delete the cgroup. Earlier, ioqs were being
        moved out of the cgroup to the root cgroup. That is problematic in many ways:
        First, the pending requests in queues might get unfair service, and
        will also cause unfairness for other cgroups at the root level. This
        problem can become significant if cgroups are created and destroyed
        relatively frequently. Second, moving queues to root cgroups was
        complicated and was causing many BUG_ON's to trigger. Third, there is
        a single io queue in AS, Deadline and Noop within a cgroup; and it
        does not make sense to move it to the root cgroup. The same is true of
        async queues.
    
        Requests already keep a reference on ioq, so queues keep a reference on
        cgroup. For async queues in CFQ, and the single ioq in other schedulers,
        io_group also keeps a reference on the io_queue. This reference on ioq
        is dropped when the queue is released (elv_release_ioq). So the queue
        can be freed.
    
        When a queue is released, it puts the reference to io_group and the
        io_group is released after all the queues are released. Child groups
        also take reference on parent groups, and release it when they are
        destroyed.
    
        Also we no longer need to maintain a separate linked list of idle
        entities, which was maintained only to help release the ioq references
        during elevator switch. The code for releasing io_groups is reused for
        elevator switch, resulting in simpler and tighter code.

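For reference, a minimal user-space sketch of the get/put life cycle described
above is below.  This is purely illustrative and not part of the patch; the
names mirror elv_get_iog()/elv_put_iog(), but locking, atomic operations and
all other io_group state are left out.

#include <stdlib.h>

struct io_group {
	int ref;			/* atomic_t in the real code */
	struct io_group *parent;	/* a child group pins its parent */
};

static void elv_get_iog(struct io_group *iog)
{
	iog->ref++;
}

static void elv_put_iog(struct io_group *iog)
{
	struct io_group *parent = iog->parent;

	if (--iog->ref)
		return;
	/* last reference gone: io_group_cleanup() in the patch */
	free(iog);
	/* dropping a child also drops its reference on the parent */
	if (parent)
		elv_put_iog(parent);
}

int main(void)
{
	struct io_group *root = calloc(1, sizeof(*root));
	struct io_group *child = calloc(1, sizeof(*child));

	elv_get_iog(root);		/* reference taken at creation */
	elv_get_iog(child);		/* reference taken at creation */
	child->parent = root;
	elv_get_iog(root);		/* child holds its parent */

	elv_put_iog(child);		/* group destroy: frees child, puts root */
	elv_put_iog(root);		/* root's own creation reference */
	return 0;
}
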
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 0ecf7c7..21e8ab8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1090,8 +1090,8 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	if (async_cfqq != NULL) {
 		__iog = cfqq_to_io_group(async_cfqq);
-
 		if (iog != __iog) {
+			/* Cgroup has changed, drop the reference to async queue */
 			cic_set_cfqq(cic, NULL, 0);
 			cfq_put_queue(async_cfqq);
 		}
@@ -1099,8 +1099,10 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	if (sync_cfqq != NULL) {
 		__iog = cfqq_to_io_group(sync_cfqq);
-		if (iog != __iog)
-			io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+		if (iog != __iog) {
+			cic_set_cfqq(cic, NULL, 1);
+			cfq_put_queue(sync_cfqq);
+		}
 	}
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
@@ -1114,8 +1116,8 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
 #endif  /* CONFIG_IOSCHED_CFQ_HIER */
 
 static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
-				struct io_context *ioc, gfp_t gfp_mask)
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct io_group *iog,
+		     int is_sync, struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
@@ -1198,6 +1200,8 @@ alloc_ioq:
 			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
+		/* ioq reference on iog */
+		elv_get_iog(iog);
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
 	}
 
@@ -1229,7 +1233,8 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	}
 
 	if (!cfqq) {
-		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, iog, 
+					    is_sync, ioc, gfp_mask);
 		if (!cfqq)
 			return NULL;
 	}
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7474f6d..52419d1 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -198,7 +198,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
 				struct io_entity *entity)
 {
 	struct rb_node *next;
-	struct io_queue *ioq = io_entity_to_ioq(entity);
 
 	BUG_ON(entity->tree != &st->idle);
 
@@ -213,10 +212,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
 	}
 
 	bfq_extract(&st->idle, entity);
-
-	/* Delete queue from idle list */
-	if (ioq)
-		list_del(&ioq->queue_list);
 }
 
 /**
@@ -420,7 +415,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
 {
 	struct io_entity *first_idle = st->first_idle;
 	struct io_entity *last_idle = st->last_idle;
-	struct io_queue *ioq = io_entity_to_ioq(entity);
 
 	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
 		st->first_idle = entity;
@@ -428,10 +422,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
 		st->last_idle = entity;
 
 	bfq_insert(&st->idle, entity);
-
-	/* Add this queue to idle list */
-	if (ioq)
-		list_add(&ioq->queue_list, &ioq->efqd->idle_list);
 }
 
 /**
@@ -666,8 +656,13 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 {
 	struct io_sched_data *sd;
+	struct io_group *iog;
 	struct io_entity *parent;
 
+	iog = container_of(entity->sched_data, struct io_group, sched_data);
+	/* Hold a reference to entity's iog until we are done */
+	elv_get_iog(iog);
+
 	for_each_entity_safe(entity, parent) {
 		sd = entity->sched_data;
 
@@ -679,13 +674,15 @@ void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 			 */
 			break;
 
-		if (sd->next_active != NULL)
+		if (sd->next_active != NULL) {
 			/*
 			 * The parent entity is still backlogged and
 			 * the budgets on the path towards the root
 			 * need to be updated.
 			 */
+			elv_put_iog(iog);
 			goto update;
+		}
 
 		/*
 		 * If we reach there the parent is no more backlogged and
@@ -694,6 +691,7 @@ void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 		requeue = 1;
 	}
 
+	elv_put_iog(iog);
 	return;
 
 update:
@@ -944,6 +942,8 @@ void io_group_set_parent(struct io_group *iog, struct io_group *parent)
 	entity = &iog->entity;
 	entity->parent = parent->my_entity;
 	entity->sched_data = &parent->sched_data;
+	if (entity->parent)
+		elv_get_iog(parent);
 }
 
 /**
@@ -1052,6 +1052,10 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
+		/* Take the initial reference that will be released on destroy */
+		atomic_set(&iog->ref, 0);
+		iog->deleting = 0;
+		elv_get_iog(iog);
 
 		if (leaf == NULL) {
 			leaf = iog;
@@ -1074,7 +1078,7 @@ cleanup:
 	while (leaf != NULL) {
 		prev = leaf;
 		leaf = leaf->key;
-		kfree(iog);
+		kfree(prev);
 	}
 
 	return NULL;
@@ -1197,13 +1201,20 @@ void io_free_root_group(struct elevator_queue *e)
 	struct io_cgroup *iocg = &io_root_cgroup;
 	struct elv_fq_data *efqd = &e->efqd;
 	struct io_group *iog = efqd->root_group;
+	struct io_service_tree *st;
+	int i;
 
 	BUG_ON(!iog);
 	spin_lock_irq(&iocg->lock);
 	hlist_del_rcu(&iog->group_node);
 	spin_unlock_irq(&iocg->lock);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
 	io_put_io_group_queues(e, iog);
-	kfree(iog);
+	elv_put_iog(iog);
 }
 
 struct io_group *io_alloc_root_group(struct request_queue *q,
@@ -1217,6 +1228,7 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	if (iog == NULL)
 		return NULL;
 
+	elv_get_iog(iog);
 	iog->entity.parent = NULL;
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
@@ -1311,90 +1323,89 @@ void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
 	task_unlock(tsk);
 }
 
-/*
- * Move the queue to the root group if it is active. This is needed when
- * a cgroup is being deleted and all the IO is not done yet. This is not
- * very good scheme as a user might get unfair share. This needs to be
- * fixed.
+/* This cleanup function does the last bit of work needed to destroy the cgroup.
+   It should only get called after io_destroy_group has been invoked.
  */
-void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
-				struct io_group *iog)
+void io_group_cleanup(struct io_group *iog)
 {
-	int busy, resume;
-	struct io_entity *entity = &ioq->entity;
-	struct elv_fq_data *efqd = &e->efqd;
-	struct io_service_tree *st = io_entity_service_tree(entity);
-
-	busy = elv_ioq_busy(ioq);
-	resume = !!ioq->nr_queued;
+	struct io_service_tree *st;
+	struct io_entity *entity = iog->my_entity;
+	int i;
 
-	BUG_ON(resume && !entity->on_st);
-	BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
 
-	/*
-	 * We could be moving an queue which is on idle tree of previous group
-	 * What to do? I guess anyway this queue does not have any requests.
-	 * just forget the entity and free up from idle tree.
-	 *
-	 * This needs cleanup. Hackish.
-	 */
-	if (entity->tree == &st->idle) {
-		BUG_ON(atomic_read(&ioq->ref) < 2);
-		bfq_put_idle_entity(st, entity);
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+		BUG_ON(st->wsum != 0);
 	}
 
-	if (busy) {
-		BUG_ON(atomic_read(&ioq->ref) < 2);
+	BUG_ON(iog->sched_data.next_active != NULL);
+	BUG_ON(iog->sched_data.active_entity != NULL);
+	BUG_ON(entity != NULL && entity->tree != NULL);
 
-		if (!resume)
-			elv_del_ioq_busy(e, ioq, 0);
-		else
-			elv_deactivate_ioq(efqd, ioq, 0);
-	}
+	kfree(iog);
+}
 
-	/*
-	 * Here we use a reference to bfqg.  We don't need a refcounter
-	 * as the cgroup reference will not be dropped, so that its
-	 * destroy() callback will not be invoked.
-	 */
-	entity->parent = iog->my_entity;
-	entity->sched_data = &iog->sched_data;
+void elv_put_iog(struct io_group *iog)
+{
+	struct io_group *parent = NULL;
+	struct io_entity *entity;
 
-	if (busy && resume)
-		elv_activate_ioq(ioq);
+	BUG_ON(!iog);
+
+	entity = iog->my_entity;
+
+	BUG_ON(atomic_read(&iog->ref) <= 0);
+	if (!atomic_dec_and_test(&iog->ref))
+		return;
+
+	if (iog->my_entity)
+		parent = container_of(iog->my_entity->parent,
+				      struct io_group, entity);
+
+	if (entity)
+		__bfq_deactivate_entity(entity, 0);
+
+	io_group_cleanup(iog);
+
+	if (parent)
+		elv_put_iog(parent);
 }
-EXPORT_SYMBOL(io_ioq_move);
+EXPORT_SYMBOL(elv_put_iog);
 
+/* After the group is destroyed, no new sync IO should come to the group.
+   It might still have pending IOs in some busy queues. It should be able to 
+   send those IOs down to the disk. The async IOs (due to dirty page writeback)
+   would go in the root group queues after this, as the group does not exist
+   anymore.
+   When one of those busy queues gets new requests, the queue
+   is moved to the new cgroup. 
+*/
 static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
 {
 	struct elevator_queue *eq;
-	struct io_entity *entity = iog->my_entity;
 	struct io_service_tree *st;
 	int i;
 
-	eq = container_of(efqd, struct elevator_queue, efqd);
-	hlist_del(&iog->elv_data_node);
-	__bfq_deactivate_entity(entity, 0);
-	io_put_io_group_queues(eq, iog);
+	BUG_ON(iog->my_entity == NULL);
 
+	/* We flush idle tree now, and don't put things in there
+	   any more.
+	 */
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
 		st = iog->sched_data.service_tree + i;
-
-		/*
-		 * The idle tree may still contain bfq_queues belonging
-		 * to exited task because they never migrated to a different
-		 * cgroup from the one being destroyed now.  Noone else
-		 * can access them so it's safe to act without any lock.
-		 */
 		io_flush_idle_tree(st);
-
-		BUG_ON(!RB_EMPTY_ROOT(&st->active));
-		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
 	}
+	iog->deleting = 1;
 
-	BUG_ON(iog->sched_data.next_active != NULL);
-	BUG_ON(iog->sched_data.active_entity != NULL);
-	BUG_ON(entity->tree != NULL);
+	eq = container_of(efqd, struct elevator_queue, efqd);
+	hlist_del(&iog->elv_data_node);
+	io_put_io_group_queues(eq, iog);
+	/* Put the reference taken at the time of creation
+	   so that when all queues are gone, cgroup can be destroyed.
+	 */
+	elv_put_iog(iog);
 }
 
 /**
@@ -1438,14 +1449,6 @@ static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
 		spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
 	}
 	rcu_read_unlock();
-
-	/*
-	 * No need to defer the kfree() to the end of the RCU grace
-	 * period: we are called from the destroy() callback of our
-	 * cgroup, so we can be sure that noone is a) still using
-	 * this cgroup or b) doing lookups in it.
-	 */
-	kfree(iog);
 }
 
 void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1477,19 +1480,8 @@ void io_disconnect_groups(struct elevator_queue *e)
 
 	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
 					elv_data_node) {
-		hlist_del(&iog->elv_data_node);
-
-		__bfq_deactivate_entity(iog->my_entity, 0);
-
-		/*
-		 * Don't remove from the group hash, just set an
-		 * invalid key.  No lookups can race with the
-		 * assignment as bfqd is being destroyed; this
-		 * implies also that new elements cannot be added
-		 * to the list.
-		 */
-		rcu_assign_pointer(iog->key, NULL);
-		io_put_io_group_queues(e, iog);
+		hlist_del(&iog->group_node);
+		__io_destroy_group(efqd, iog);
 	}
 }
 
@@ -1637,6 +1629,7 @@ alloc_ioq:
 		elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
 		io_group_set_ioq(iog, ioq);
 		elv_mark_ioq_sync(ioq);
+		elv_get_iog(iog);
 	}
 
 	if (new_sched_q)
@@ -1997,10 +1990,14 @@ void elv_put_ioq(struct io_queue *ioq)
 	struct elv_fq_data *efqd = ioq->efqd;
 	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
 						efqd);
+	struct io_group *iog;
 
 	BUG_ON(atomic_read(&ioq->ref) <= 0);
 	if (!atomic_dec_and_test(&ioq->ref))
 		return;
+
+	iog = ioq_to_io_group(ioq);
+
 	BUG_ON(ioq->nr_queued);
 	BUG_ON(ioq->entity.tree != NULL);
 	BUG_ON(elv_ioq_busy(ioq));
@@ -2012,16 +2009,15 @@ void elv_put_ioq(struct io_queue *ioq)
 	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
 	elv_log_ioq(efqd, ioq, "freed");
 	elv_free_ioq(ioq);
+	elv_put_iog(iog);
 }
 EXPORT_SYMBOL(elv_put_ioq);
 
 void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
 {
-	struct io_group *root_group = e->efqd.root_group;
 	struct io_queue *ioq = *ioq_ptr;
 
 	if (ioq != NULL) {
-		io_ioq_move(e, ioq, root_group);
 		/* Drop the reference taken by the io group */
 		elv_put_ioq(ioq);
 		*ioq_ptr = NULL;
@@ -2122,9 +2118,14 @@ void elv_activate_ioq(struct io_queue *ioq)
 void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 					int requeue)
 {
+	struct io_group *iog = ioq_to_io_group(ioq);
+
 	if (ioq == efqd->active_queue)
 		elv_reset_active_ioq(efqd);
 
+	if (iog->deleting == 1)
+		requeue = 0;
+
 	bfq_deactivate_entity(&ioq->entity, requeue);
 }
 
@@ -2460,15 +2461,6 @@ void elv_ioq_arm_slice_timer(struct request_queue *q)
 	}
 }
 
-void elv_free_idle_ioq_list(struct elevator_queue *e)
-{
-	struct io_queue *ioq, *n;
-	struct elv_fq_data *efqd = &e->efqd;
-
-	list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
-		elv_deactivate_ioq(efqd, ioq, 0);
-}
-
 /*
  * Call iosched to let that elevator wants to expire the queue. This gives
  * iosched like AS to say no (if it is in the middle of batch changeover or
@@ -2838,8 +2830,6 @@ void elv_exit_fq_data(struct elevator_queue *e)
 	elv_shutdown_timer_wq(e);
 
 	spin_lock_irq(q->queue_lock);
-	/* This should drop all the idle tree references of ioq */
-	elv_free_idle_ioq_list(e);
 	/* This should drop all the io group references of async queues */
 	io_disconnect_groups(e);
 	spin_unlock_irq(q->queue_lock);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 62b2ee2..7622b28 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -213,6 +213,7 @@ struct io_group {
 	struct hlist_node elv_data_node;
 	struct hlist_node group_node;
 	struct io_sched_data sched_data;
+	atomic_t ref;
 
 	struct io_entity *my_entity;
 
@@ -229,6 +230,8 @@ struct io_group {
 
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
+
+	int deleting;
 };
 
 /**
@@ -462,6 +465,8 @@ extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
 					gfp_t gfp_mask);
 extern void elv_fq_unset_request_ioq(struct request_queue *q,
 					struct request *rq);
+extern void elv_put_iog(struct io_group *iog);
+
 extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
 
 /* Returns single ioq associated with the io group. */
@@ -480,6 +485,11 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
 	iog->ioq = ioq;
 }
 
+static inline void elv_get_iog(struct io_group *iog)
+{
+	atomic_inc(&iog->ref);
+}
+
 #else /* !GROUP_IOSCHED */
 /*
  * No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -531,6 +541,15 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
 	return NULL;
 }
 
+static inline void elv_get_iog(struct io_group *iog)
+{
+}
+
+static inline void elv_put_iog(struct io_group *iog)
+{
+}
+
+
 #endif /* GROUP_IOSCHED */
 
 /* Functions used by blksysfs.c */



^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found] ` <1236823015-4183-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (12 preceding siblings ...)
  2009-04-10  9:33   ` Gui Jianfeng
@ 2009-05-01  1:25   ` Divyesh Shah
  13 siblings, 0 replies; 190+ messages in thread
From: Divyesh Shah @ 2009-05-01  1:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	menage-hpIqsD4AKlfQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Vivek Goyal wrote:
> Hi All,
> 
> Here is another posting for IO controller patches. Last time I had posted
> RFC patches for an IO controller which did bio control per cgroup.
> 
> http://lkml.org/lkml/2008/11/6/227
> 
> One of the takeaway from the discussion in this thread was that let us
> implement a common layer which contains the proportional weight scheduling
> code which can be shared by all the IO schedulers.
> 
> Implementing IO controller will not cover the devices which don't use
> IO schedulers but it should cover the common case.
> 
> There were more discussions regarding 2 level vs 1 level IO control at
> following link.
> 
> https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
> 
> So in the mean time we took the discussion off the list and spent time on
> making the 1 level control apporoach work where majority of the proportional
> weight control is shared by the four schedulers instead of each one having
> to replicate the code. We make use of BFQ code for fair queuing as posted
> by Paolo and Fabio here.
> 
> http://lkml.org/lkml/2008/11/11/148
> 
> Details about design and howto have been put in documentation patch.
> 
> I have done very basic testing of running 2 or 3 "dd" threads in different
> cgroups. Wanted to get the patchset out for feedback/review before we dive
> into more bug fixing, benchmarking, optimizations etc.
> 
> Your feedback/comments are welcome.
> 
> Patch series contains 10 patches. It should be compilable and bootable after
> every patch. Intial 2 patches implement flat fair queuing (no cgroup
> support) and make cfq to use that. Later patches introduce hierarchical
> fair queuing support in elevator layer and modify other IO schdulers to use
> that.
> 
> Thanks
> Vivek

Hi Vivek,
   While testing these patches along with the bio-cgroup patches I noticed that for the case of 2 buffered writers (dd) with different weights, one of them would be able to use up a very large timeslice (I've seen up to 500ms) when the other queue is empty, and not be accounted for it. This is due to the check in cfq_dispatch_requests() where a given cgroup can empty its entire queue (100 IOs or more) within its timeslice and have them sit in the dispatch queue ready for the disk driver to pick up. Moreover, this huge timeslice is not accounted for, as this cgroup is charged only for the length of the intended timeslice and not the actual time taken.
  The following patch fixes this by not applying the single-busy-queue optimization inside cfq_dispatch_requests(). Note that this does not hurt throughput in any sense but just causes more IOs to be dispatched only when the drive is ready for them, thus leading to better accounting too.

Fix bug where a given ioq can run through all its requests at once.

Signed-off-by: Divyesh Shah <dpshah-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
diff --git a/2.6.26/block/cfq-iosched.c b/2.6.26/block/cfq-iosched.c
index 5a275a2..c0199a6 100644
--- a/2.6.26/block/cfq-iosched.c
+++ b/2.6.26/block/cfq-iosched.c
@@ -848,8 +848,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
		if (cfq_class_idle(cfqq))
			max_dispatch = 1;

-		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch &&
-			elv_nr_busy_ioq(q->elevator) > 1)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch)
			break;

		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-03-12  1:56 ` Vivek Goyal
                   ` (7 preceding siblings ...)
  (?)
@ 2009-05-01  1:25 ` Divyesh Shah
  2009-05-01  2:45   ` Vivek Goyal
       [not found]   ` <49FA4F91.204-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  -1 siblings, 2 replies; 190+ messages in thread
From: Divyesh Shah @ 2009-05-01  1:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, lizf, mikew, fchecconi, paolo.valente, jens.axboe, ryov,
	fernando, s-uchida, taka, guijianfeng, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

Vivek Goyal wrote:
> Hi All,
> 
> Here is another posting for IO controller patches. Last time I had posted
> RFC patches for an IO controller which did bio control per cgroup.
> 
> http://lkml.org/lkml/2008/11/6/227
> 
> One of the takeaway from the discussion in this thread was that let us
> implement a common layer which contains the proportional weight scheduling
> code which can be shared by all the IO schedulers.
> 
> Implementing IO controller will not cover the devices which don't use
> IO schedulers but it should cover the common case.
> 
> There were more discussions regarding 2 level vs 1 level IO control at
> following link.
> 
> https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
> 
> So in the mean time we took the discussion off the list and spent time on
> making the 1 level control apporoach work where majority of the proportional
> weight control is shared by the four schedulers instead of each one having
> to replicate the code. We make use of BFQ code for fair queuing as posted
> by Paolo and Fabio here.
> 
> http://lkml.org/lkml/2008/11/11/148
> 
> Details about design and howto have been put in documentation patch.
> 
> I have done very basic testing of running 2 or 3 "dd" threads in different
> cgroups. Wanted to get the patchset out for feedback/review before we dive
> into more bug fixing, benchmarking, optimizations etc.
> 
> Your feedback/comments are welcome.
> 
> Patch series contains 10 patches. It should be compilable and bootable after
> every patch. Intial 2 patches implement flat fair queuing (no cgroup
> support) and make cfq to use that. Later patches introduce hierarchical
> fair queuing support in elevator layer and modify other IO schdulers to use
> that.
> 
> Thanks
> Vivek

Hi Vivek,
   While testing these patches along with the bio-cgroup patches I noticed that for the case of 2 buffered writers (dd) with different weights, one of them would be able to use up a very large timeslice (I've seen up to 500ms) when the other queue is empty, and not be accounted for it. This is due to the check in cfq_dispatch_requests() where a given cgroup can empty its entire queue (100 IOs or more) within its timeslice and have them sit in the dispatch queue ready for the disk driver to pick up. Moreover, this huge timeslice is not accounted for, as this cgroup is charged only for the length of the intended timeslice and not the actual time taken.
  The following patch fixes this by not applying the single-busy-queue optimization inside cfq_dispatch_requests(). Note that this does not hurt throughput in any sense but just causes more IOs to be dispatched only when the drive is ready for them, thus leading to better accounting too.

Fix bug where a given ioq can run through all its requests at once.

Signed-off-by: Divyesh Shah <dpshah@google.com>
---
diff --git a/2.6.26/block/cfq-iosched.c b/2.6.26/block/cfq-iosched.c
index 5a275a2..c0199a6 100644
--- a/2.6.26/block/cfq-iosched.c
+++ b/2.6.26/block/cfq-iosched.c
@@ -848,8 +848,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
		if (cfq_class_idle(cfqq))
			max_dispatch = 1;

-		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch &&
-			elv_nr_busy_ioq(q->elevator) > 1)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch)
			break;

		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found]   ` <49FA4F91.204-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2009-05-01  2:45     ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-05-01  2:45 UTC (permalink / raw)
  To: Divyesh Shah
  Cc: oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	menage-hpIqsD4AKlfQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Thu, Apr 30, 2009 at 06:25:37PM -0700, Divyesh Shah wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is another posting for IO controller patches. Last time I had posted
> > RFC patches for an IO controller which did bio control per cgroup.
> > 
> > http://lkml.org/lkml/2008/11/6/227
> > 
> > One of the takeaway from the discussion in this thread was that let us
> > implement a common layer which contains the proportional weight scheduling
> > code which can be shared by all the IO schedulers.
> > 
> > Implementing IO controller will not cover the devices which don't use
> > IO schedulers but it should cover the common case.
> > 
> > There were more discussions regarding 2 level vs 1 level IO control at
> > following link.
> > 
> > https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
> > 
> > So in the mean time we took the discussion off the list and spent time on
> > making the 1 level control apporoach work where majority of the proportional
> > weight control is shared by the four schedulers instead of each one having
> > to replicate the code. We make use of BFQ code for fair queuing as posted
> > by Paolo and Fabio here.
> > 
> > http://lkml.org/lkml/2008/11/11/148
> > 
> > Details about design and howto have been put in documentation patch.
> > 
> > I have done very basic testing of running 2 or 3 "dd" threads in different
> > cgroups. Wanted to get the patchset out for feedback/review before we dive
> > into more bug fixing, benchmarking, optimizations etc.
> > 
> > Your feedback/comments are welcome.
> > 
> > Patch series contains 10 patches. It should be compilable and bootable after
> > every patch. Intial 2 patches implement flat fair queuing (no cgroup
> > support) and make cfq to use that. Later patches introduce hierarchical
> > fair queuing support in elevator layer and modify other IO schdulers to use
> > that.
> > 
> > Thanks
> > Vivek
> 
> Hi Vivek,
>    While testing these patches along with the bio-cgroup patches I noticed that for the case of 2 buffered writers (dd) with different weights, one of them would be able to use up a very large timeslice (I've seen upto 500ms) when the other queue is empty and not be accounted for it. This is due to the check in cfq_dispatch_requests() where  a given cgroup can empty its entire queue (100 IOs or more) within its timeslice and have them sit in the dispatch queue ready for the disk driver to pick up. Moreover, this huge timeslice is not accounted for as this cgroup is charged only for the length of the intended timeslice and not the actual time taken.
>   The following patch fixes this by not optimizing on the single busy queue fact inside cfq_dispatch_requests. Note that this does not hurt throughput in any sense but just causes more IOs to be dispatched only when the drive is ready for them thus leading to better accounting too.

Hi Divyesh,

Thanks for the testing and noticing the issue. I also had noticed this
issue.

Couple of points.

- In 30-rc3 Jens has fixed the huge dispatch problem. Now in case of a single
  ioq doing dispatch, up to 4*quantum requests can be dispatched in one round.
  So that means in the default configuration with a single queue, the maximum
  number of requests on the dispatch list can be 16.

- Secondly, in my tree, now I have modified the patches to charge for
  actual consumption of the slice instead of capping it to budget. In a
  week's time I should be able to post V2 of the patches. Please do try
  it out then

Thanks
Vivek

> 
> Fix bug where a given ioq can run through all its requests at once.
> 
> Signed-off-by: Divyesh Shah <dpshah-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> diff --git a/2.6.26/block/cfq-iosched.c b/2.6.26/block/cfq-iosched.c
> index 5a275a2..c0199a6 100644
> --- a/2.6.26/block/cfq-iosched.c
> +++ b/2.6.26/block/cfq-iosched.c
> @@ -848,8 +848,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
> 		if (cfq_class_idle(cfqq))
> 			max_dispatch = 1;
> 
> -		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch &&
> -			elv_nr_busy_ioq(q->elevator) > 1)
> +		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch)
> 			break;
> 
> 		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-05-01  1:25 ` Divyesh Shah
@ 2009-05-01  2:45   ` Vivek Goyal
  2009-05-01  3:00     ` Divyesh Shah
       [not found]     ` <20090501024527.GA3730-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
       [not found]   ` <49FA4F91.204-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-05-01  2:45 UTC (permalink / raw)
  To: Divyesh Shah
  Cc: nauman, lizf, mikew, fchecconi, paolo.valente, jens.axboe, ryov,
	fernando, s-uchida, taka, guijianfeng, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

On Thu, Apr 30, 2009 at 06:25:37PM -0700, Divyesh Shah wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is another posting for IO controller patches. Last time I had posted
> > RFC patches for an IO controller which did bio control per cgroup.
> > 
> > http://lkml.org/lkml/2008/11/6/227
> > 
> > One of the takeaway from the discussion in this thread was that let us
> > implement a common layer which contains the proportional weight scheduling
> > code which can be shared by all the IO schedulers.
> > 
> > Implementing IO controller will not cover the devices which don't use
> > IO schedulers but it should cover the common case.
> > 
> > There were more discussions regarding 2 level vs 1 level IO control at
> > following link.
> > 
> > https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
> > 
> > So in the mean time we took the discussion off the list and spent time on
> > making the 1 level control apporoach work where majority of the proportional
> > weight control is shared by the four schedulers instead of each one having
> > to replicate the code. We make use of BFQ code for fair queuing as posted
> > by Paolo and Fabio here.
> > 
> > http://lkml.org/lkml/2008/11/11/148
> > 
> > Details about design and howto have been put in documentation patch.
> > 
> > I have done very basic testing of running 2 or 3 "dd" threads in different
> > cgroups. Wanted to get the patchset out for feedback/review before we dive
> > into more bug fixing, benchmarking, optimizations etc.
> > 
> > Your feedback/comments are welcome.
> > 
> > Patch series contains 10 patches. It should be compilable and bootable after
> > every patch. Intial 2 patches implement flat fair queuing (no cgroup
> > support) and make cfq to use that. Later patches introduce hierarchical
> > fair queuing support in elevator layer and modify other IO schdulers to use
> > that.
> > 
> > Thanks
> > Vivek
> 
> Hi Vivek,
>    While testing these patches along with the bio-cgroup patches I noticed that for the case of 2 buffered writers (dd) with different weights, one of them would be able to use up a very large timeslice (I've seen upto 500ms) when the other queue is empty and not be accounted for it. This is due to the check in cfq_dispatch_requests() where  a given cgroup can empty its entire queue (100 IOs or more) within its timeslice and have them sit in the dispatch queue ready for the disk driver to pick up. Moreover, this huge timeslice is not accounted for as this cgroup is charged only for the length of the intended timeslice and not the actual time taken.
>   The following patch fixes this by not optimizing on the single busy queue fact inside cfq_dispatch_requests. Note that this does not hurt throughput in any sense but just causes more IOs to be dispatched only when the drive is ready for them thus leading to better accounting too.

Hi Divyesh,

Thanks for the testing and noticing the issue. I also had noticed this
issue.

Couple of points.

- In 30-rc3 Jens has fixed the huge dispatch problem. Now in case of a single
  ioq doing dispatch, up to 4*quantum requests can be dispatched in one round.
  So that means in the default configuration with a single queue, the maximum
  number of requests on the dispatch list can be 16.

- Secondly, in my tree, now I have modified the patches to charge for
  actual consumption of the slice instead of capping it to budget. In a
  week's time I should be able to post V2 of the patches. Please do try
  it out then

Thanks
Vivek

> 
> Fix bug where a given ioq can run through all its requests at once.
> 
> Signed-off-by: Divyesh Shah <dpshah@google.com>
> ---
> diff --git a/2.6.26/block/cfq-iosched.c b/2.6.26/block/cfq-iosched.c
> index 5a275a2..c0199a6 100644
> --- a/2.6.26/block/cfq-iosched.c
> +++ b/2.6.26/block/cfq-iosched.c
> @@ -848,8 +848,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
> 		if (cfq_class_idle(cfqq))
> 			max_dispatch = 1;
> 
> -		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch &&
> -			elv_nr_busy_ioq(q->elevator) > 1)
> +		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch)
> 			break;
> 
> 		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
       [not found]     ` <20090501024527.GA3730-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-01  3:00       ` Divyesh Shah
  0 siblings, 0 replies; 190+ messages in thread
From: Divyesh Shah @ 2009-05-01  3:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	menage-hpIqsD4AKlfQT0dZR+AlfA, arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

Vivek Goyal wrote:
> On Thu, Apr 30, 2009 at 06:25:37PM -0700, Divyesh Shah wrote:
>> Vivek Goyal wrote:
>>> Hi All,
>>>
>>> Here is another posting for IO controller patches. Last time I had posted
>>> RFC patches for an IO controller which did bio control per cgroup.
>>>
>>> http://lkml.org/lkml/2008/11/6/227
>>>
>>> One of the takeaway from the discussion in this thread was that let us
>>> implement a common layer which contains the proportional weight scheduling
>>> code which can be shared by all the IO schedulers.
>>>
>>> Implementing IO controller will not cover the devices which don't use
>>> IO schedulers but it should cover the common case.
>>>
>>> There were more discussions regarding 2 level vs 1 level IO control at
>>> following link.
>>>
>>> https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
>>>
>>> So in the mean time we took the discussion off the list and spent time on
>>> making the 1 level control apporoach work where majority of the proportional
>>> weight control is shared by the four schedulers instead of each one having
>>> to replicate the code. We make use of BFQ code for fair queuing as posted
>>> by Paolo and Fabio here.
>>>
>>> http://lkml.org/lkml/2008/11/11/148
>>>
>>> Details about design and howto have been put in documentation patch.
>>>
>>> I have done very basic testing of running 2 or 3 "dd" threads in different
>>> cgroups. Wanted to get the patchset out for feedback/review before we dive
>>> into more bug fixing, benchmarking, optimizations etc.
>>>
>>> Your feedback/comments are welcome.
>>>
>>> Patch series contains 10 patches. It should be compilable and bootable after
>>> every patch. Intial 2 patches implement flat fair queuing (no cgroup
>>> support) and make cfq to use that. Later patches introduce hierarchical
>>> fair queuing support in elevator layer and modify other IO schdulers to use
>>> that.
>>>
>>> Thanks
>>> Vivek
>> Hi Vivek,
>>    While testing these patches along with the bio-cgroup patches I noticed that for the case of 2 buffered writers (dd) with different weights, one of them would be able to use up a very large timeslice (I've seen upto 500ms) when the other queue is empty and not be accounted for it. This is due to the check in cfq_dispatch_requests() where  a given cgroup can empty its entire queue (100 IOs or more) within its timeslice and have them sit in the dispatch queue ready for the disk driver to pick up. Moreover, this huge timeslice is not accounted for as this cgroup is charged only for the length of the intended timeslice and not the actual time taken.
>>   The following patch fixes this by not optimizing on the single busy queue fact inside cfq_dispatch_requests. Note that this does not hurt throughput in any sense but just causes more IOs to be dispatched only when the drive is ready for them thus leading to better accounting too.
> 
> Hi Divyesh,
> 
> Thanks for the testing and noticing the issue. I also had noticed this
> issue.
> 
> Couple of points.
> 
> - In 30-rc3 jens has fixed the huge dispatch problem. Now in case of single
>   ioq doing dispatch, in one round upto 4*quantum request can be dispatched.
>   So that means in default configuration with single queue, maximum request on
>   diaptch list can be 16.

I just synced my git tree and I see Jens' changes. That makes this much cleaner!

> 
> - Secondly, in my tree, now I have modified the patches to charge for
>   actual consumption of the slice instead of capping it to budget. In a
>   week's time I should be able to post V2 of the patches. Please do try
>   it out then

I have that in my tree as well and was going to send that out. No need now :)

> 
> Thanks
> Vivek
> 
>> Fix bug where a given ioq can run through all its requests at once.
>>
>> Signed-off-by: Divyesh Shah <dpshah-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> ---
>> diff --git a/2.6.26/block/cfq-iosched.c b/2.6.26/block/cfq-iosched.c
>> index 5a275a2..c0199a6 100644
>> --- a/2.6.26/block/cfq-iosched.c
>> +++ b/2.6.26/block/cfq-iosched.c
>> @@ -848,8 +848,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
>> 		if (cfq_class_idle(cfqq))
>> 			max_dispatch = 1;
>>
>> -		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch &&
>> -			elv_nr_busy_ioq(q->elevator) > 1)
>> +		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch)
>> 			break;
>>
>> 		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-05-01  2:45   ` Vivek Goyal
@ 2009-05-01  3:00     ` Divyesh Shah
       [not found]     ` <20090501024527.GA3730-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 190+ messages in thread
From: Divyesh Shah @ 2009-05-01  3:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, lizf, mikew, fchecconi, paolo.valente, jens.axboe, ryov,
	fernando, s-uchida, taka, guijianfeng, arozansk, jmoyer, dhaval,
	balbir, linux-kernel, containers, akpm, menage, peterz

Vivek Goyal wrote:
> On Thu, Apr 30, 2009 at 06:25:37PM -0700, Divyesh Shah wrote:
>> Vivek Goyal wrote:
>>> Hi All,
>>>
>>> Here is another posting for IO controller patches. Last time I had posted
>>> RFC patches for an IO controller which did bio control per cgroup.
>>>
>>> http://lkml.org/lkml/2008/11/6/227
>>>
>>> One of the takeaway from the discussion in this thread was that let us
>>> implement a common layer which contains the proportional weight scheduling
>>> code which can be shared by all the IO schedulers.
>>>
>>> Implementing IO controller will not cover the devices which don't use
>>> IO schedulers but it should cover the common case.
>>>
>>> There were more discussions regarding 2 level vs 1 level IO control at
>>> following link.
>>>
>>> https://lists.linux-foundation.org/pipermail/containers/2009-January/015402.html
>>>
>>> So in the mean time we took the discussion off the list and spent time on
>>> making the 1 level control apporoach work where majority of the proportional
>>> weight control is shared by the four schedulers instead of each one having
>>> to replicate the code. We make use of BFQ code for fair queuing as posted
>>> by Paolo and Fabio here.
>>>
>>> http://lkml.org/lkml/2008/11/11/148
>>>
>>> Details about design and howto have been put in documentation patch.
>>>
>>> I have done very basic testing of running 2 or 3 "dd" threads in different
>>> cgroups. Wanted to get the patchset out for feedback/review before we dive
>>> into more bug fixing, benchmarking, optimizations etc.
>>>
>>> Your feedback/comments are welcome.
>>>
>>> Patch series contains 10 patches. It should be compilable and bootable after
>>> every patch. Intial 2 patches implement flat fair queuing (no cgroup
>>> support) and make cfq to use that. Later patches introduce hierarchical
>>> fair queuing support in elevator layer and modify other IO schdulers to use
>>> that.
>>>
>>> Thanks
>>> Vivek
>> Hi Vivek,
>>    While testing these patches along with the bio-cgroup patches I noticed that for the case of 2 buffered writers (dd) with different weights, one of them would be able to use up a very large timeslice (I've seen upto 500ms) when the other queue is empty and not be accounted for it. This is due to the check in cfq_dispatch_requests() where  a given cgroup can empty its entire queue (100 IOs or more) within its timeslice and have them sit in the dispatch queue ready for the disk driver to pick up. Moreover, this huge timeslice is not accounted for as this cgroup is charged only for the length of the intended timeslice and not the actual time taken.
>>   The following patch fixes this by not optimizing on the single busy queue fact inside cfq_dispatch_requests. Note that this does not hurt throughput in any sense but just causes more IOs to be dispatched only when the drive is ready for them thus leading to better accounting too.
> 
> Hi Divyesh,
> 
> Thanks for the testing and noticing the issue. I also had noticed this
> issue.
> 
> Couple of points.
> 
> - In 30-rc3 jens has fixed the huge dispatch problem. Now in case of single
>   ioq doing dispatch, in one round upto 4*quantum request can be dispatched.
>   So that means in default configuration with single queue, maximum request on
>   diaptch list can be 16.

I just synced my git tree and I see Jens' changes. That makes this much cleaner!

> 
> - Secondly, in my tree, now I have modified the patches to charge for
>   actual consumption of the slice instead of capping it to budget. In a
>   week's time I should be able to post V2 of the patches. Please do try
>   it out then

I have that in my tree as well and was going to send that out. No need now :)

> 
> Thanks
> Vivek
> 
>> Fix bug where a given ioq can run through all its requests at once.
>>
>> Signed-off-by: Divyesh Shah <dpshah@google.com>
>> ---
>> diff --git a/2.6.26/block/cfq-iosched.c b/2.6.26/block/cfq-iosched.c
>> index 5a275a2..c0199a6 100644
>> --- a/2.6.26/block/cfq-iosched.c
>> +++ b/2.6.26/block/cfq-iosched.c
>> @@ -848,8 +848,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
>> 		if (cfq_class_idle(cfqq))
>> 			max_dispatch = 1;
>>
>> -		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch &&
>> -			elv_nr_busy_ioq(q->elevator) > 1)
>> +		if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch)
>> 			break;
>>
>> 		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [PATCH 01/10] Documentation
       [not found]           ` <20090413134017.GC18007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-01 22:04             ` IKEDA, Munehiro
  0 siblings, 0 replies; 190+ messages in thread
From: IKEDA, Munehiro @ 2009-05-01 22:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil, Balbir Singh

Vivek Goyal wrote:
>>> +TODO
>>> +====
>>> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
>>> +- Convert cgroup ioprio to notion of weight.
>>> +- Anticipatory code will need more work. It is not working properly currently
>>> +  and needs more thought.
>> What are the problems with the code?
> 
> Have not got a chance to look into the issues in detail yet. Just a crude run
> saw drop in performance. Will debug it later the moment I have got async writes
> handled...
> 
>>> +- Use of bio-cgroup patches.
>> I saw these posted as well
>>
>>> +- Use of Nauman's per cgroup request descriptor patches.
>>> +
>> More details would be nice, I am not sure I understand
> 
> Currently the number of request descriptors which can be allocated per
> device/request queue is fixed by a sysfs tunable (q->nr_requests). So
> if there is lots of IO going on from one cgroup then it will consume all
> the available request descriptors, and other cgroups might starve and not
> get their fair share.
> 
> Hence we also need to introduce the notion of a request descriptor limit per
> cgroup so that if request descriptors from one group are exhausted, then
> it does not impact the IO of other cgroups.

Unfortunately I couldn't find Nauman's patches, and I've never seen them.
So I tried to make a patch, below, against this todo item.  The reason I'm
posting it even though it is just a quick and ugly hack (and it might be a
reinvention of the wheel) is that I would like to discuss how we should
define the limit on requests per cgroup.
This patch should be applied on top of Vivek's I/O controller patches
posted on Mar 11.

This patch temporarily gives each cgroup the full q->nr_requests.
I think the number should instead be weighted, like BFQ's budget.  But in
that case, if the cgroup hierarchy is deep, leaf cgroups end up allowed
to allocate only a very small number of requests.  I don't think this is
reasonable... but I don't have a specific idea for solving this problem.
Does anyone have a good idea?
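
To make the worry concrete with numbers (a made-up example: equal weights
at every level, and the budget split per level):

    q->nr_requests = 128, two children per level:

        root          128
        /a             64
        /a/b           32
        /a/b/c         16
        /a/b/c/d        8

    i.e. budget(leaf) = nr_requests * product over levels of
    (weight / sum of sibling weights), which shrinks geometrically with
    the depth of the hierarchy.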

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
---
 block/blk-core.c    |   36 +++++++--
 block/blk-sysfs.c   |   22 ++++--
 block/elevator-fq.c |  133 ++++++++++++++++++++++++++++++++--
 block/elevator-fq.h |  201 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 371 insertions(+), 21 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 29bcfac..21023f7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -705,11 +705,15 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
 static void __freed_request(struct request_queue *q, int rw)
 {
 	struct request_list *rl = &q->rq;
-
-	if (rl->count[rw] < queue_congestion_off_threshold(q))
+	struct io_group *congested_iog, *full_iog;
+	
+	congested_iog = io_congested_io_group(q, rw);
+	if (rl->count[rw] < queue_congestion_off_threshold(q) &&
+	    !congested_iog)
 		blk_clear_queue_congested(q, rw);
 
-	if (rl->count[rw] + 1 <= q->nr_requests) {
+	full_iog = io_full_io_group(q, rw);
+	if (rl->count[rw] + 1 <= q->nr_requests && !full_iog) {
 		if (waitqueue_active(&rl->wait[rw]))
 			wake_up(&rl->wait[rw]);
 
@@ -721,13 +725,16 @@ static void __freed_request(struct request_queue *q, int rw)
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, int rw, int priv)
+static void freed_request(struct request_queue *q, struct io_group *iog,
+			  int rw, int priv)
 {
 	struct request_list *rl = &q->rq;
 
 	rl->count[rw]--;
 	if (priv)
 		rl->elvpriv--;
+	if (iog)
+		io_group_dec_count(iog, rw);
 
 	__freed_request(q, rw);
 
@@ -746,16 +753,21 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 {
 	struct request *rq = NULL;
 	struct request_list *rl = &q->rq;
+	struct io_group *iog;
 	struct io_context *ioc = NULL;
 	const int rw = rw_flags & 0x01;
 	int may_queue, priv;
 
+	iog = __io_get_io_group(q);
+
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
-	if (rl->count[rw]+1 >= queue_congestion_on_threshold(q)) {
-		if (rl->count[rw]+1 >= q->nr_requests) {
+	if (rl->count[rw]+1 >= queue_congestion_on_threshold(q) ||
+	    io_group_congestion_on(iog, rw)) {
+		if (rl->count[rw]+1 >= q->nr_requests ||
+		    io_group_full(iog, rw)) {
 			ioc = current_io_context(GFP_ATOMIC, q->node);
 			/*
 			 * The queue will fill after this allocation, so set
@@ -789,8 +801,15 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	if (rl->count[rw] >= (3 * q->nr_requests / 2))
 		goto out;
 
+	if (iog)
+		if (io_group_count(iog, rw) >=
+		   (3 * io_group_nr_requests(iog) / 2))
+			goto out;
+
 	rl->count[rw]++;
 	rl->starved[rw] = 0;
+	if (iog)
+		io_group_inc_count(iog, rw);
 
 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
 	if (priv)
@@ -808,7 +827,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, rw, priv);
+		freed_request(q, iog, rw, priv);
 
 		/*
 		 * in the very unlikely event that allocation failed and no
@@ -1073,12 +1092,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	if (req->cmd_flags & REQ_ALLOCED) {
 		int rw = rq_data_dir(req);
 		int priv = req->cmd_flags & REQ_ELVPRIV;
+		struct io_group *iog = io_request_io_group(req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
 		blk_free_request(q, req);
-		freed_request(q, rw, priv);
+		freed_request(q, iog, rw, priv);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0d98c96..af5191c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -40,6 +40,7 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
 	struct request_list *rl = &q->rq;
 	unsigned long nr;
+	int iog_congested[2], iog_full[2];
 	int ret = queue_var_store(&nr, page, count);
 	if (nr < BLKDEV_MIN_RQ)
 		nr = BLKDEV_MIN_RQ;
@@ -47,27 +48,32 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 	spin_lock_irq(q->queue_lock);
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
+	io_group_set_nrq_all(q, nr, iog_congested, iog_full);
 
-	if (rl->count[READ] >= queue_congestion_on_threshold(q))
+	if (rl->count[READ] >= queue_congestion_on_threshold(q) ||
+	    iog_congested[READ])
 		blk_set_queue_congested(q, READ);
-	else if (rl->count[READ] < queue_congestion_off_threshold(q))
+	else if (rl->count[READ] < queue_congestion_off_threshold(q) &&
+		 !iog_congested[READ])
 		blk_clear_queue_congested(q, READ);
 
-	if (rl->count[WRITE] >= queue_congestion_on_threshold(q))
+	if (rl->count[WRITE] >= queue_congestion_on_threshold(q) ||
+	    iog_congested[WRITE])
 		blk_set_queue_congested(q, WRITE);
-	else if (rl->count[WRITE] < queue_congestion_off_threshold(q))
+	else if (rl->count[WRITE] < queue_congestion_off_threshold(q) &&
+		 !iog_congested[WRITE])
 		blk_clear_queue_congested(q, WRITE);
 
-	if (rl->count[READ] >= q->nr_requests) {
+	if (rl->count[READ] >= q->nr_requests || iog_full[READ]) {
 		blk_set_queue_full(q, READ);
-	} else if (rl->count[READ]+1 <= q->nr_requests) {
+	} else if (rl->count[READ]+1 <= q->nr_requests && !iog_full[READ]) {
 		blk_clear_queue_full(q, READ);
 		wake_up(&rl->wait[READ]);
 	}
 
-	if (rl->count[WRITE] >= q->nr_requests) {
+	if (rl->count[WRITE] >= q->nr_requests || iog_full[WRITE]) {
 		blk_set_queue_full(q, WRITE);
-	} else if (rl->count[WRITE]+1 <= q->nr_requests) {
+	} else if (rl->count[WRITE]+1 <= q->nr_requests && !iog_full[WRITE]) {
 		blk_clear_queue_full(q, WRITE);
 		wake_up(&rl->wait[WRITE]);
 	}
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index df53418..3b021f3 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -924,6 +924,111 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
 }
 EXPORT_SYMBOL(io_lookup_io_group_current);
 
+/*
+ * TODO
+ * This is a complete duplication of blk_queue_congestion_threshold()
+ * except for the argument type and name.  Can we merge them?
+ */
+static void io_group_nrq_congestion_threshold(struct io_group_nrq *nrq)
+{
+	int nr;
+
+	nr = nrq->nr_requests - (nrq->nr_requests / 8) + 1;
+	if (nr > nrq->nr_requests)
+		nr = nrq->nr_requests;
+	nrq->nr_congestion_on = nr;
+
+	nr = nrq->nr_requests - (nrq->nr_requests / 8)
+		- (nrq->nr_requests / 16) - 1;
+	if (nr < 1)
+		nr = 1;
+	nrq->nr_congestion_off = nr;
+}
+
+static void io_group_set_nrq(struct io_group_nrq *nrq, int nr_requests,
+			 int *congested, int *full)
+{
+	int i;
+
+	BUG_ON(nr_requests < 0);
+
+	nrq->nr_requests = nr_requests;
+	io_group_nrq_congestion_threshold(nrq);
+
+	for (i=0; i<2; i++) {
+		if (nrq->count[i] >= nrq->nr_congestion_on)
+			congested[i] = 1;
+		else if (nrq->count[i] < nrq->nr_congestion_off)
+			congested[i] = 0;
+
+		if (nrq->count[i] >= nrq->nr_requests)
+			full[i] = 1;
+		else if (nrq->count[i]+1 <= nrq->nr_requests)
+			full[i] = 0;
+	}
+}
+
+void io_group_set_nrq_all(struct request_queue *q, int nr,
+			    int *congested, int *full)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct hlist_head *head = &efqd->group_list;
+	struct io_group *root = efqd->root_group;
+	struct hlist_node *n;
+	struct io_group *iog;
+	struct io_group_nrq *nrq;
+	int nrq_congested[2];
+	int nrq_full[2];
+	int i;
+
+	for (i=0; i<2; i++)
+		*(congested + i) = *(full + i) = 0;
+
+	nrq = &root->nrq;
+	io_group_set_nrq(nrq, nr, nrq_congested, nrq_full);
+	for (i=0; i<2; i++) {
+		*(congested + i) |= nrq_congested[i];
+		*(full + i) |= nrq_full[i];
+	}
+
+	hlist_for_each_entry(iog, n, head, elv_data_node) {
+		nrq = &iog->nrq;
+		io_group_set_nrq(nrq, nr, nrq_congested, nrq_full);
+		for (i=0; i<2; i++) {
+			*(congested + i) |= nrq_congested[i];
+			*(full + i) |= nrq_full[i];
+		}
+	}
+}
+
+struct io_group *io_congested_io_group(struct request_queue *q, int rw)
+{
+	struct hlist_head *head = &q->elevator->efqd.group_list;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	hlist_for_each_entry(iog, n, head, elv_data_node) {
+		struct io_group_nrq *nrq = &iog->nrq;
+		if (nrq->count[rw] >= nrq->nr_congestion_off)
+			return iog;
+	}
+	return NULL;
+}
+
+struct io_group *io_full_io_group(struct request_queue *q, int rw)
+{
+	struct hlist_head *head = &q->elevator->efqd.group_list;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	hlist_for_each_entry(iog, n, head, elv_data_node) {
+		struct io_group_nrq *nrq = &iog->nrq;
+		if (nrq->count[rw] >= nrq->nr_requests)
+			return iog;
+	}
+	return NULL;
+}
+
 void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
 {
 	struct io_entity *entity = &iog->entity;
@@ -934,6 +1039,12 @@ void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
 	entity->my_sched_data = &iog->sched_data;
 }
 
+static void io_group_init_nrq(struct request_queue *q, struct io_group_nrq *nrq)
+{
+	nrq->nr_requests = q->nr_requests;
+	io_group_nrq_congestion_threshold(nrq);
+}
+
 void io_group_set_parent(struct io_group *iog, struct io_group *parent)
 {
 	struct io_entity *entity;
@@ -1053,6 +1164,8 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
 
+		io_group_init_nrq(q, &iog->nrq);
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -1176,7 +1289,7 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
  * Generic function to make sure cgroup hierarchy is all setup once a request
  * from a cgroup is received by the io scheduler.
  */
-struct io_group *io_get_io_group(struct request_queue *q)
+struct io_group *__io_get_io_group(struct request_queue *q)
 {
 	struct cgroup *cgroup;
 	struct io_group *iog;
@@ -1192,6 +1305,19 @@ struct io_group *io_get_io_group(struct request_queue *q)
 	return iog;
 }
 
+struct io_group *io_get_io_group(struct request_queue *q)
+{
+	struct io_group *iog;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	iog = __io_get_io_group(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	BUG_ON(!iog);
+
+	return iog;
+}
+
 void io_free_root_group(struct elevator_queue *e)
 {
 	struct io_cgroup *iocg = &io_root_cgroup;
@@ -1220,6 +1346,7 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	iog->entity.parent = NULL;
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+	io_group_init_nrq(q, &iog->nrq);
 
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
@@ -1533,15 +1660,11 @@ void elv_fq_set_request_io_group(struct request_queue *q,
 						struct request *rq)
 {
 	struct io_group *iog;
-	unsigned long flags;
 
 	/* Make sure io group hierarchy has been setup and also set the
 	 * io group to which rq belongs. Later we should make use of
 	 * bio cgroup patches to determine the io group */
-	spin_lock_irqsave(q->queue_lock, flags);
 	iog = io_get_io_group(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
-	BUG_ON(!iog);
 
 	/* Store iog in rq. TODO: take care of referencing */
 	rq->iog = iog;
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index fc4110d..f8eabd4 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -187,6 +187,22 @@ struct io_queue {
 
 #ifdef CONFIG_GROUP_IOSCHED
 /**
+ * struct io_group_nrq - structure to store allocated requests info
+ * @nr_requests: maximum number of requests for the io_group
+ * @nr_congestion_on: threshold to determine that the io_group is congested.
+ * @nr_congestion_off: threshold to determine that the io_group is not congested.
+ * @count: number of allocated requests.
+ *
+ * All fields are protected by queue_lock.
+ */
+struct io_group_nrq {
+	unsigned long nr_requests;
+	unsigned int nr_congestion_on;
+	unsigned int nr_congestion_off;
+	int count[2];
+};
+
+/**
  * struct bfq_group - per (device, cgroup) data structure.
  * @entity: schedulable entity to insert into the parent group sched_data.
  * @sched_data: own sched_data, to contain child entities (they may be
@@ -235,6 +251,8 @@ struct io_group {
 
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
+
+	struct io_group_nrq nrq;
 };
 
 /**
@@ -469,6 +487,11 @@ extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
 extern void elv_fq_unset_request_ioq(struct request_queue *q,
 					struct request *rq);
 extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+extern void io_group_set_nrq_all(struct request_queue *q, int nr,
+			    int *congested, int *full);
+extern struct io_group *io_congested_io_group(struct request_queue *q, int rw);
+extern struct io_group *io_full_io_group(struct request_queue *q, int rw);
+extern struct io_group *__io_get_io_group(struct request_queue *q);
 
 /* Returns single ioq associated with the io group. */
 static inline struct io_queue *io_group_ioq(struct io_group *iog)
@@ -486,6 +509,52 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
 	iog->ioq = ioq;
 }
 
+static inline struct io_group *io_request_io_group(struct request *rq)
+{
+	return rq->iog;
+}
+
+static inline unsigned long io_group_nr_requests(struct io_group *iog)
+{
+	BUG_ON(!iog);
+	return iog->nrq.nr_requests;
+}
+
+static inline int io_group_inc_count(struct io_group *iog, int rw)
+{
+	BUG_ON(!iog);
+	return iog->nrq.count[rw]++;
+}
+
+static inline int io_group_dec_count(struct io_group *iog, int rw)
+{
+	BUG_ON(!iog);
+	return iog->nrq.count[rw]--;
+}
+
+static inline int io_group_count(struct io_group *iog, int rw)
+{
+	BUG_ON(!iog);
+	return iog->nrq.count[rw];
+}
+
+static inline int io_group_congestion_on(struct io_group *iog, int rw)
+{
+	BUG_ON(!iog);
+	return iog->nrq.count[rw] + 1 >= iog->nrq.nr_congestion_on;
+}
+
+static inline int io_group_congestion_off(struct io_group *iog, int rw)
+{
+	BUG_ON(!iog);
+	return iog->nrq.count[rw] < iog->nrq.nr_congestion_off;
+}
+
+static inline int io_group_full(struct io_group *iog, int rw)
+{
+	BUG_ON(!iog);
+	return iog->nrq.count[rw] + 1 >= iog->nrq.nr_requests;
+}
 #else /* !GROUP_IOSCHED */
 /*
  * No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -537,6 +606,71 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
 	return NULL;
 }
 
+static inline void io_group_set_nrq_all(struct request_queue *q, int nr,
+					int *congested, int *full)
+{
+	int i;
+	for (i=0; i<2; i++)
+		*(congested + i) = *(full + i) = 0;
+}
+
+static inline struct io_group *
+io_congested_io_group(struct request_queue *q, int rw)
+{
+	return NULL;
+}
+
+static inline struct io_group *
+io_full_io_group(struct request_queue *q, int rw)
+{
+	return NULL;
+}
+
+static inline struct io_group *__io_get_io_group(struct request_queue *q)
+{
+	return NULL;
+}
+
+static inline struct io_group *io_request_io_group(struct request *rq)
+{
+	return NULL;
+}
+
+static inline unsigned long io_group_nr_requests(struct io_group *iog)
+{
+	return 0;
+}
+
+static inline int io_group_inc_count(struct io_group *iog, int rw)
+{
+	return 0;
+}
+
+static inline int io_group_dec_count(struct io_group *iog, int rw)
+{
+	return 0;
+}
+
+static inline int io_group_count(struct io_group *iog, int rw)
+{
+	return 0;
+}
+
+static inline int io_group_congestion_on(struct io_group *iog, int rw)
+{
+	return 0;
+}
+
+static inline int io_group_congestion_off(struct io_group *iog, int rw)
+{
+	return 1;
+}
+
+static inline int io_group_full(struct io_group *iog, int rw)
+{
+	return 0;
+}
+
 #endif /* GROUP_IOSCHED */
 
 /* Functions used by blksysfs.c */
@@ -589,6 +723,9 @@ extern void elv_free_ioq(struct io_queue *ioq);
 
 #else /* CONFIG_ELV_FAIR_QUEUING */
 
+struct io_group {
+};
+
 static inline int elv_init_fq_data(struct request_queue *q,
 					struct elevator_queue *e)
 {
@@ -655,5 +792,69 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
 	return NULL;
 }
 
+static inline void io_group_set_nrq_all(struct request_queue *q, int nr,
+					int *congested, int *full)
+{
+	int i;
+	for (i=0; i<2; i++)
+		*(congested + i) = *(full + i) = 0;
+}
+
+static inline struct io_group *
+io_congested_io_group(struct request_queue *q, int rw)
+{
+	return NULL;
+}
+
+static inline struct io_group *
+io_full_io_group(struct request_queue *q, int rw)
+{
+	return NULL;
+}
+
+static inline struct io_group *__io_get_io_group(struct request_queue *q)
+{
+	return NULL;
+}
+
+static inline struct io_group *io_request_io_group(struct request *rq)
+{
+	return NULL;
+}
+
+static inline unsigned long io_group_nr_requests(struct io_group *iog)
+{
+	return 0;
+}
+
+static inline int io_group_inc_count(struct io_group *iog, int rw)
+{
+	return 0;
+}
+
+static inline int io_group_dec_count(struct io_group *iog, int rw)
+{
+	return 0;
+}
+
+static inline int io_group_count(struct io_group *iog, int rw)
+{
+	return 0;
+}
+
+static inline int io_group_congestion_on(struct io_group *iog, int rw)
+{
+	return 0;
+}
+
+static inline int io_group_congestion_off(struct io_group *iog, int rw)
+{
+	return 1;
+}
+
+static inline int io_group_full(struct io_group *iog, int rw)
+{
+	return 0;
+}
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
-- 
1.5.4.3


-- 
IKEDA, Munehiro
 NEC Corporation of America
   m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply related	[flat|nested] 190+ messages in thread

* IO Controller per cgroup request descriptors (Re: [PATCH 01/10] Documentation)
  2009-05-01 22:04           ` IKEDA, Munehiro
@ 2009-05-01 22:45                 ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-05-01 22:45 UTC (permalink / raw)
  To: IKEDA, Munehiro
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, menage-hpIqsD4AKlfQT0dZR+AlfA,
	Andrea Righi, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil, Balbir Singh

On Fri, May 01, 2009 at 06:04:39PM -0400, IKEDA, Munehiro wrote:
> Vivek Goyal wrote:
>>>> +TODO
>>>> +====
>>>> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
>>>> +- Convert cgroup ioprio to notion of weight.
>>>> +- Anticipatory code will need more work. It is not working properly currently
>>>> +  and needs more thought.
>>> What are the problems with the code?
>>
>> I have not had a chance to look into the issues in detail yet. Just a crude
>> run showed a drop in performance. I will debug it later, once I have got
>> async writes handled...
>>
>>>> +- Use of bio-cgroup patches.
>>> I saw these posted as well
>>>
>>>> +- Use of Nauman's per cgroup request descriptor patches.
>>>> +
>>> More details would be nice, I am not sure I understand
>>
>> Currently the number of request descriptors which can be allocated per
>> device/request queue is fixed by a sysfs tunable (q->nr_requests). So
>> if there is lots of IO going on from one cgroup then it will consume all
>> the available request descriptors, and other cgroups might starve and not
>> get their fair share.
>>
>> Hence we also need to introduce the notion of a request descriptor limit per
>> cgroup so that if request descriptors from one group are exhausted, then
>> it does not impact the IO of other cgroups.
>
> Unfortunately I couldn't find Nauman's patches, and I've never seen them.
> So I tried to make a patch, below, against this todo item.  The reason I'm
> posting it even though it is just a quick and ugly hack (and it might be a
> reinvention of the wheel) is that I would like to discuss how we should
> define the limit on requests per cgroup.
> This patch should be applied on top of Vivek's I/O controller patches
> posted on Mar 11.

Hi IKEDA,

Sorry for the confusion here. Actually Nauman had sent a patch to a select
group of people who were initially copied on the mail thread.

>
> This patch temporarily gives each cgroup the full q->nr_requests.
> I think the number should instead be weighted, like BFQ's budget.  But in
> that case, if the cgroup hierarchy is deep, leaf cgroups end up allowed
> to allocate only a very small number of requests.  I don't think this is
> reasonable... but I don't have a specific idea for solving this problem.
> Does anyone have a good idea?
>

Thanks for the patch. Yes, ideally one would expect request descriptors
to be allocated in proportion to the weight as well, but I guess that would
become very complicated.

In terms of simpler things, two thoughts come to mind.

- The first approach is to make q->nr_requests per group, so that every group
  is entitled to q->nr_requests as set by the user. This is what your patch
  seems to have done.

  I had some concerns with this approach. First of all, it does not seem to
  put an upper bound on the number of request descriptors allocated per queue,
  because as a user creates more cgroups, the total number of request
  descriptors increases.

- The second approach is to retain the meaning of q->nr_requests, which
  defines the total number of request descriptors on the queue (with the
  exception of 50% more descriptors for batching processes), and to define a
  new per-group limit, q->nr_group_requests, which defines how many requests
  can be assigned per group. So q->nr_requests defines the total pool size on
  the queue, and q->nr_group_requests defines how many requests each group can
  allocate out of that pool.

  Here the issue is that a user will have to balance q->nr_group_requests
  and q->nr_requests properly.
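
As a rough illustration of that balance, using the defaults from the patch
below (BLKDEV_MAX_RQ = 256 and BLKDEV_MAX_GROUP_RQ = 64):

    q->nr_requests       = 256
    q->nr_group_requests =  64

    => roughly 256/64 = 4 groups can have their full per-group quota
       allocated at the same time (ignoring the extra 50% headroom given
       to batching processes); once the queue-wide pool is exhausted,
       further allocations block even in groups that are still below
       their own 64-request limit.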

To experiment, I have implemented the second approach. I am attaching the
patch from my current tree. It probably will not apply on top of my March 11
posting, as the patches have changed since then, but I am posting it here so
that it at least gives an idea of the thought process.

Ideas are welcome...

Thanks
Vivek
   
o Currently a request queue has a fixed number of request descriptors for
  sync and async requests. Once the request descriptors are consumed, new
  processes are put to sleep and effectively become serialized. Because the
  sync and async pools are separate, async requests don't impact sync ones,
  but if one is looking for fairness between async requests, that is not
  achievable once request descriptors become the bottleneck.

o Make request descriptors per io group so that if there is lots of IO
  going on in one cgroup, it does not impact the IO of other groups.

o This patch implements per-cgroup request descriptors. The request pool per
  queue is still common, but every group has its own wait list and its own
  count of request descriptors allocated to that group for sync and async
  queues. So effectively the request_list becomes a per io group property and
  not a global request queue feature.

o Currently one can set q->nr_requests to limit the request descriptors
  allocated for the queue. Now there is another tunable, q->nr_group_requests,
  which controls the request descriptor limit per group. q->nr_requests
  supersedes q->nr_group_requests to make sure that if lots of groups are
  present, we don't end up allocating too many request descriptors on the
  queue.

o Issues: Currently the notion of congestion is per queue. With per-group
  request descriptors it is possible that the queue is not congested but the
  group the bio will go into is congested.

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

---
 block/blk-core.c       |  216 ++++++++++++++++++++++++++++++++++---------------
 block/blk-settings.c   |    3 
 block/blk-sysfs.c      |   57 ++++++++++--
 block/elevator-fq.c    |   15 +++
 block/elevator-fq.h    |    8 +
 block/elevator.c       |    6 -
 include/linux/blkdev.h |   62 +++++++++++++-
 7 files changed, 287 insertions(+), 80 deletions(-)

Index: linux9/include/linux/blkdev.h
===================================================================
--- linux9.orig/include/linux/blkdev.h	2009-04-30 15:43:53.000000000 -0400
+++ linux9/include/linux/blkdev.h	2009-04-30 16:18:29.000000000 -0400
@@ -32,21 +32,51 @@ struct request;
 struct sg_io_hdr;
 
 #define BLKDEV_MIN_RQ	4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ	256	/* Default maximum */
+#define BLKDEV_MAX_GROUP_RQ    64      /* Default maximum */
+#else
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+/*
+ * This is equivalent to the case of only one group present (root group).
+ * Let it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
+#endif
 
 struct request;
 typedef void (rq_end_io_fn)(struct request *, int);
 
 struct request_list {
 	/*
-	 * count[], starved[], and wait[] are indexed by
+	 * count[], starved and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
 	 */
 	int count[2];
 	int starved[2];
+	wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of the mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+	/*
+	 * Per queue request descriptor count. This is in addition to per
+	 * cgroup count
+	 */
+	int count[2];
 	int elvpriv;
 	mempool_t *rq_pool;
-	wait_queue_head_t wait[2];
+	int starved;
+	/*
+	 * Global list for starved tasks. A task will be queued here if
+	 * it could not allocate request descriptor and the associated
+	 * group request list does not have any requests pending.
+	 */
+	wait_queue_head_t starved_wait;
 };
 
 /*
@@ -251,6 +281,7 @@ struct request {
 #ifdef CONFIG_GROUP_IOSCHED
 	/* io group request belongs to */
 	struct io_group *iog;
+	struct request_list *rl;
 #endif /* GROUP_IOSCHED */
 #endif /* ELV_FAIR_QUEUING */
 };
@@ -340,6 +371,9 @@ struct request_queue
 	 */
 	struct request_list	rq;
 
+	/* Contains request pool and other data like starved data */
+	struct request_data	rq_data;
+
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
 	prep_rq_fn		*prep_rq_fn;
@@ -402,6 +436,8 @@ struct request_queue
 	 * queue settings
 	 */
 	unsigned long		nr_requests;	/* Max # of requests */
+	/* Max # of per io group requests */
+	unsigned long		nr_group_requests;
 	unsigned int		nr_congestion_on;
 	unsigned int		nr_congestion_off;
 	unsigned int		nr_batching;
@@ -773,6 +809,28 @@ extern int scsi_cmd_ioctl(struct request
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern void blk_init_request_list(struct request_list *rl);
+
+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	return io_group_get_request_list(q, bio);
+#else
+	return &q->rq;
+#endif
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+						struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	return rq->rl;
+#else
+	return blk_get_request_list(q, NULL);
+#endif
+}
+
 /*
  * Temporary export, until SCSI gets fixed up.
  */
Index: linux9/block/elevator.c
===================================================================
--- linux9.orig/block/elevator.c	2009-04-30 16:17:53.000000000 -0400
+++ linux9/block/elevator.c	2009-04-30 16:18:29.000000000 -0400
@@ -664,7 +664,7 @@ void elv_quiesce_start(struct request_qu
 	 * make sure we don't have any requests in flight
 	 */
 	elv_drain_elevator(q);
-	while (q->rq.elvpriv) {
+	while (q->rq_data.elvpriv) {
 		blk_start_queueing(q);
 		spin_unlock_irq(q->queue_lock);
 		msleep(10);
@@ -764,8 +764,8 @@ void elv_insert(struct request_queue *q,
 	}
 
 	if (unplug_it && blk_queue_plugged(q)) {
-		int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
-			- q->in_flight;
+		int nrq = q->rq_data.count[BLK_RW_SYNC] +
+				q->rq_data.count[BLK_RW_ASYNC] - q->in_flight;
 
 		if (nrq >= q->unplug_thresh)
 			__generic_unplug_device(q);
Index: linux9/block/blk-core.c
===================================================================
--- linux9.orig/block/blk-core.c	2009-04-30 16:17:53.000000000 -0400
+++ linux9/block/blk-core.c	2009-04-30 16:18:29.000000000 -0400
@@ -480,20 +480,31 @@ void blk_cleanup_queue(struct request_qu
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
 
 	rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
-	rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
-	rl->elvpriv = 0;
 	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
 	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
 
-	rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
-				mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+#ifndef CONFIG_GROUP_IOSCHED
+	struct request_list *rl = blk_get_request_list(q, NULL);
+
+	/*
+	 * In case of group scheduling, the request list is inside the associated
+	 * group and when that group is instantiated, it takes care of
+	 * initializing the request list as well.
+	 */
+	blk_init_request_list(rl);
+#endif
+	q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+				mempool_alloc_slab, mempool_free_slab,
+				request_cachep, q->node);
 
-	if (!rl->rq_pool)
+	if (!q->rq_data.rq_pool)
 		return -ENOMEM;
 
 	return 0;
@@ -590,6 +601,9 @@ blk_init_queue_node(request_fn_proc *rfn
 		return NULL;
 	}
 
+	/* init starved waiter wait queue */
+	init_waitqueue_head(&q->rq_data.starved_wait);
+
 	/*
 	 * if caller didn't supply a lock, they get per-queue locking with
 	 * our embedded lock
@@ -639,14 +653,14 @@ static inline void blk_free_request(stru
 {
 	if (rq->cmd_flags & REQ_ELVPRIV)
 		elv_put_request(q, rq);
-	mempool_free(rq, q->rq.rq_pool);
+	mempool_free(rq, q->rq_data.rq_pool);
 }
 
 static struct request *
 blk_alloc_request(struct request_queue *q, struct bio *bio, int rw, int priv,
 					gfp_t gfp_mask)
 {
-	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
 
 	if (!rq)
 		return NULL;
@@ -657,7 +671,7 @@ blk_alloc_request(struct request_queue *
 
 	if (priv) {
 		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
+			mempool_free(rq, q->rq_data.rq_pool);
 			return NULL;
 		}
 		rq->cmd_flags |= REQ_ELVPRIV;
@@ -700,18 +714,18 @@ static void ioc_set_batching(struct requ
 	ioc->last_waited = jiffies;
 }
 
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
-	if (rl->count[sync] < queue_congestion_off_threshold(q))
+	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
-	if (rl->count[sync] + 1 <= q->nr_requests) {
+	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+		blk_clear_queue_full(q, sync);
+
+	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
-
-		blk_clear_queue_full(q, sync);
 	}
 }
 
@@ -719,18 +733,29 @@ static void __freed_request(struct reque
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
+	BUG_ON(!rl->count[sync]);
 	rl->count[sync]--;
+
+	BUG_ON(!q->rq_data.count[sync]);
+	q->rq_data.count[sync]--;
+
 	if (priv)
-		rl->elvpriv--;
+		q->rq_data.elvpriv--;
 
-	__freed_request(q, sync);
+	__freed_request(q, sync, rl);
 
 	if (unlikely(rl->starved[sync ^ 1]))
-		__freed_request(q, sync ^ 1);
+		__freed_request(q, sync ^ 1, rl);
+
+	/* Wake up the starved process on global list, if any */
+	if (unlikely(q->rq_data.starved)) {
+		if (waitqueue_active(&q->rq_data.starved_wait))
+			wake_up(&q->rq_data.starved_wait);
+		q->rq_data.starved--;
+	}
 }
 
 /*
@@ -739,10 +764,9 @@ static void freed_request(struct request
  * Returns !NULL on success, with queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, int rw_flags,
-				   struct bio *bio, gfp_t gfp_mask)
+		   struct bio *bio, gfp_t gfp_mask, struct request_list *rl)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
@@ -751,31 +775,38 @@ static struct request *get_request(struc
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
-	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
-		if (rl->count[is_sync]+1 >= q->nr_requests) {
-			ioc = current_io_context(GFP_ATOMIC, q->node);
-			/*
-			 * The queue will fill after this allocation, so set
-			 * it as full, and mark this process as "batching".
-			 * This process will be allowed to complete a batch of
-			 * requests, others will be blocked.
-			 */
-			if (!blk_queue_full(q, is_sync)) {
-				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
-			} else {
-				if (may_queue != ELV_MQUEUE_MUST
-						&& !ioc_batching(q, ioc)) {
-					/*
-					 * The queue is full and the allocating
-					 * process is not a "batcher", and not
-					 * exempted by the IO scheduler
-					 */
-					goto out;
-				}
+	if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+		blk_set_queue_congested(q, is_sync);
+
+	/*
+	 * Looks like there is no user of queue full now.
+	 * Keeping it for the time being.
+	 */
+	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+		blk_set_queue_full(q, is_sync);
+
+	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+		ioc = current_io_context(GFP_ATOMIC, q->node);
+		/*
+		 * The queue request descriptor group will fill after
+		 * this allocation, so set it as full, and mark this
+		 * process as "batching". This process will be allowed
+		 * to complete a batch of requests, others will be
+		 * blocked.
+		 */
+		if (rl->count[is_sync] <= q->nr_group_requests)
+			ioc_set_batching(q, ioc);
+		else {
+			if (may_queue != ELV_MQUEUE_MUST
+					&& !ioc_batching(q, ioc)) {
+				/*
+				 * The queue is full and the allocating
+				 * process is not a "batcher", and not
+				 * exempted by the IO scheduler
+				 */
+				goto out;
 			}
 		}
-		blk_set_queue_congested(q, is_sync);
 	}
 
 	/*
@@ -783,19 +814,41 @@ static struct request *get_request(struc
 	 * limit of requests, otherwise we could have thousands of requests
 	 * allocated with any setting of ->nr_requests
 	 */
-	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
+		goto out;
+
+	/*
+	 * Allocation of a request is allowed from the queue's perspective. Now
+	 * check against the per-group request list
+	 */
+
+	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
 		goto out;
 
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
+	q->rq_data.count[is_sync]++;
+
 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
 	if (priv)
-		rl->elvpriv++;
+		q->rq_data.elvpriv++;
 
 	spin_unlock_irq(q->queue_lock);
 
 	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	if (rq) {
+		/*
+		 * TODO. Implement group reference counting and take a
+		 * reference to the group to make sure the group, and hence
+		 * the request list, does not go away till rq finishes.
+		 */
+		rq->rl = rl;
+	}
+#endif
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
@@ -805,7 +858,7 @@ static struct request *get_request(struc
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 
 		/*
 		 * in the very unlikely event that allocation failed and no
@@ -815,10 +868,26 @@ static struct request *get_request(struc
 		 * rq mempool into READ and WRITE
 		 */
 rq_starved:
-		if (unlikely(rl->count[is_sync] == 0))
-			rl->starved[is_sync] = 1;
-
-		goto out;
+		if (unlikely(rl->count[is_sync] == 0)) {
+			/*
+			 * If there is a request pending in other direction
+			 * in same io group, then set the starved flag of
+			 * the group request list. Otherwise, we need to
+			 * make this process sleep in global starved list
+			 * to make sure it will not sleep indefinitely.
+			 */
+			if (rl->count[is_sync ^ 1] != 0) {
+				rl->starved[is_sync] = 1;
+				goto out;
+			} else {
+				/*
+				 * It indicates to calling function to put
+				 * task on global starved list. Not the best
+				 * way
+				 */
+				return ERR_PTR(-ENOMEM);
+			}
+		}
 	}
 
 	/*
@@ -846,15 +915,29 @@ static struct request *get_request_wait(
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, bio);
 
-	rq = get_request(q, rw_flags, bio, GFP_NOIO);
-	while (!rq) {
+	rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
+	while (!rq || (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM)) {
 		DEFINE_WAIT(wait);
 		struct io_context *ioc;
-		struct request_list *rl = &q->rq;
 
-		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
-				TASK_UNINTERRUPTIBLE);
+		if (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM) {
+			/*
+			 * Task failed allocation and needs to wait and
+			 * try again. There are no requests pending from
+			 * the io group hence need to sleep on global
+			 * wait queue. Most likely the allocation failed
+			 * because of memory issues.
+			 */
+
+			q->rq_data.starved++;
+			prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+					&wait, TASK_UNINTERRUPTIBLE);
+		} else {
+			prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+					TASK_UNINTERRUPTIBLE);
+		}
 
 		trace_block_sleeprq(q, bio, rw_flags & 1);
 
@@ -874,7 +957,12 @@ static struct request *get_request_wait(
 		spin_lock_irq(q->queue_lock);
 		finish_wait(&rl->wait[is_sync], &wait);
 
-		rq = get_request(q, rw_flags, bio, GFP_NOIO);
+		/*
+		 * After the sleep, check the rl again in case the cgroup the
+		 * bio belonged to is gone and it is now mapped to the root group
+		 */
+		rl = blk_get_request_list(q, bio);
+		rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
 	};
 
 	return rq;
@@ -883,6 +971,7 @@ static struct request *get_request_wait(
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, NULL);
 
 	BUG_ON(rw != READ && rw != WRITE);
 
@@ -890,7 +979,7 @@ struct request *blk_get_request(struct r
 	if (gfp_mask & __GFP_WAIT) {
 		rq = get_request_wait(q, rw, NULL);
 	} else {
-		rq = get_request(q, rw, NULL, gfp_mask);
+		rq = get_request(q, rw, NULL, gfp_mask, rl);
 		if (!rq)
 			spin_unlock_irq(q->queue_lock);
 	}
@@ -1073,12 +1162,13 @@ void __blk_put_request(struct request_qu
 	if (req->cmd_flags & REQ_ALLOCED) {
 		int is_sync = rq_is_sync(req) != 0;
 		int priv = req->cmd_flags & REQ_ELVPRIV;
+		struct request_list *rl = rq_rl(q, req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
 		blk_free_request(q, req);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
Index: linux9/block/blk-sysfs.c
===================================================================
--- linux9.orig/block/blk-sysfs.c	2009-04-30 16:18:27.000000000 -0400
+++ linux9/block/blk-sysfs.c	2009-04-30 16:18:29.000000000 -0400
@@ -38,7 +38,7 @@ static ssize_t queue_requests_show(struc
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
-	struct request_list *rl = &q->rq;
+	struct request_list *rl = blk_get_request_list(q, NULL);
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
 	if (nr < BLKDEV_MIN_RQ)
@@ -48,32 +48,55 @@ queue_requests_store(struct request_queu
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
 
-	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_SYNC);
-	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_SYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_SYNC);
 
-	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_ASYNC);
-	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_ASYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
-	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_SYNC);
-	} else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_SYNC);
 		wake_up(&rl->wait[BLK_RW_SYNC]);
 	}
 
-	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_ASYNC);
-	} else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_ASYNC);
 		wake_up(&rl->wait[BLK_RW_ASYNC]);
 	}
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+					size_t count)
+{
+	unsigned long nr;
+	int ret = queue_var_store(&nr, page, count);
+	if (nr < BLKDEV_MIN_RQ)
+		nr = BLKDEV_MIN_RQ;
+
+	spin_lock_irq(q->queue_lock);
+	q->nr_group_requests = nr;
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+#endif
 
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
 {
@@ -228,6 +251,14 @@ static struct queue_sysfs_entry queue_re
 	.store = queue_requests_store,
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_group_requests_show,
+	.store = queue_group_requests_store,
+};
+#endif
+
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -308,6 +339,9 @@ static struct queue_sysfs_entry queue_sl
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+	&queue_group_requests_entry.attr,
+#endif
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
@@ -389,12 +423,11 @@ static void blk_release_queue(struct kob
 {
 	struct request_queue *q =
 		container_of(kobj, struct request_queue, kobj);
-	struct request_list *rl = &q->rq;
 
 	blk_sync_queue(q);
 
-	if (rl->rq_pool)
-		mempool_destroy(rl->rq_pool);
+	if (q->rq_data.rq_pool)
+		mempool_destroy(q->rq_data.rq_pool);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
Index: linux9/block/blk-settings.c
===================================================================
--- linux9.orig/block/blk-settings.c	2009-04-30 15:43:53.000000000 -0400
+++ linux9/block/blk-settings.c	2009-04-30 16:18:29.000000000 -0400
@@ -123,6 +123,9 @@ void blk_queue_make_request(struct reque
 	 * set defaults
 	 */
 	q->nr_requests = BLKDEV_MAX_RQ;
+#ifdef CONFIG_GROUP_IOSCHED
+	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
+#endif
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
Index: linux9/block/elevator-fq.c
===================================================================
--- linux9.orig/block/elevator-fq.c	2009-04-30 16:18:27.000000000 -0400
+++ linux9/block/elevator-fq.c	2009-04-30 16:18:29.000000000 -0400
@@ -954,6 +954,17 @@ struct io_cgroup *cgroup_to_io_cgroup(st
 			    struct io_cgroup, css);
 }
 
+struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+	struct io_group *iog;
+
+	iog = io_get_io_group_bio(q, bio, 1);
+	BUG_ON(!iog);
+out:
+	return &iog->rl;
+}
+
 /*
  * Search the bfq_group for bfqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1203,6 +1214,8 @@ struct io_group *io_group_chain_alloc(st
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
 
+		blk_init_request_list(&iog->rl);
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -1446,6 +1459,8 @@ struct io_group *io_alloc_root_group(str
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
 
+	blk_init_request_list(&iog->rl);
+
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
Index: linux9/block/elevator-fq.h
===================================================================
--- linux9.orig/block/elevator-fq.h	2009-04-30 16:18:27.000000000 -0400
+++ linux9/block/elevator-fq.h	2009-04-30 16:18:29.000000000 -0400
@@ -239,8 +239,14 @@ struct io_group {
 
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
+
+	/* request list associated with the group */
+	struct request_list rl;
 };
 
+#define IOG_FLAG_READFULL	1	/* read queue has been filled */
+#define IOG_FLAG_WRITEFULL	2	/* write queue has been filled */
+
 /**
  * struct bfqio_cgroup - bfq cgroup data structure.
  * @css: subsystem state for bfq in the containing cgroup.
@@ -517,6 +523,8 @@ extern void elv_fq_unset_request_ioq(str
 extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
 extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 						struct bio *bio);
+extern struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio);
 
 /* Returns single ioq associated with the io group. */
 static inline struct io_queue *io_group_ioq(struct io_group *iog)

Thanks
Vivek

> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
> ---
> block/blk-core.c    |   36 +++++++--
> block/blk-sysfs.c   |   22 ++++--
> block/elevator-fq.c |  133 ++++++++++++++++++++++++++++++++--
> block/elevator-fq.h |  201 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 371 insertions(+), 21 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 29bcfac..21023f7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -705,11 +705,15 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
> static void __freed_request(struct request_queue *q, int rw)
> {
> 	struct request_list *rl = &q->rq;
> -
> -	if (rl->count[rw] < queue_congestion_off_threshold(q))
> +	struct io_group *congested_iog, *full_iog;
> +	
> +	congested_iog = io_congested_io_group(q, rw);
> +	if (rl->count[rw] < queue_congestion_off_threshold(q) &&
> +	    !congested_iog)
> 		blk_clear_queue_congested(q, rw);
>
> -	if (rl->count[rw] + 1 <= q->nr_requests) {
> +	full_iog = io_full_io_group(q, rw);
> +	if (rl->count[rw] + 1 <= q->nr_requests && !full_iog) {
> 		if (waitqueue_active(&rl->wait[rw]))
> 			wake_up(&rl->wait[rw]);
>
> @@ -721,13 +725,16 @@ static void __freed_request(struct request_queue *q, int rw)
>  * A request has just been released.  Account for it, update the full and
>  * congestion status, wake up any waiters.   Called under q->queue_lock.
>  */
> -static void freed_request(struct request_queue *q, int rw, int priv)
> +static void freed_request(struct request_queue *q, struct io_group *iog,
> +			  int rw, int priv)
> {
> 	struct request_list *rl = &q->rq;
>
> 	rl->count[rw]--;
> 	if (priv)
> 		rl->elvpriv--;
> +	if (iog)
> +		io_group_dec_count(iog, rw);
>
> 	__freed_request(q, rw);
>
> @@ -746,16 +753,21 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
> {
> 	struct request *rq = NULL;
> 	struct request_list *rl = &q->rq;
> +	struct io_group *iog;
> 	struct io_context *ioc = NULL;
> 	const int rw = rw_flags & 0x01;
> 	int may_queue, priv;
>
> +	iog = __io_get_io_group(q);
> +
> 	may_queue = elv_may_queue(q, rw_flags);
> 	if (may_queue == ELV_MQUEUE_NO)
> 		goto rq_starved;
>
> -	if (rl->count[rw]+1 >= queue_congestion_on_threshold(q)) {
> -		if (rl->count[rw]+1 >= q->nr_requests) {
> +	if (rl->count[rw]+1 >= queue_congestion_on_threshold(q) ||
> +	    io_group_congestion_on(iog, rw)) {
> +		if (rl->count[rw]+1 >= q->nr_requests ||
> +		    io_group_full(iog, rw)) {
> 			ioc = current_io_context(GFP_ATOMIC, q->node);
> 			/*
> 			 * The queue will fill after this allocation, so set
> @@ -789,8 +801,15 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
> 	if (rl->count[rw] >= (3 * q->nr_requests / 2))
> 		goto out;
>
> +	if (iog)
> +		if (io_group_count(iog, rw) >=
> +		   (3 * io_group_nr_requests(iog) / 2))
> +			goto out;
> +
> 	rl->count[rw]++;
> 	rl->starved[rw] = 0;
> +	if (iog)
> +		io_group_inc_count(iog, rw);
>
> 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> 	if (priv)
> @@ -808,7 +827,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
> 		 * wait queue, but this is pretty rare.
> 		 */
> 		spin_lock_irq(q->queue_lock);
> -		freed_request(q, rw, priv);
> +		freed_request(q, iog, rw, priv);
>
> 		/*
> 		 * in the very unlikely event that allocation failed and no
> @@ -1073,12 +1092,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
> 	if (req->cmd_flags & REQ_ALLOCED) {
> 		int rw = rq_data_dir(req);
> 		int priv = req->cmd_flags & REQ_ELVPRIV;
> +		struct io_group *iog = io_request_io_group(req);
>
> 		BUG_ON(!list_empty(&req->queuelist));
> 		BUG_ON(!hlist_unhashed(&req->hash));
>
> 		blk_free_request(q, req);
> -		freed_request(q, rw, priv);
> +		freed_request(q, iog, rw, priv);
> 	}
> }
> EXPORT_SYMBOL_GPL(__blk_put_request);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 0d98c96..af5191c 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -40,6 +40,7 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
> {
> 	struct request_list *rl = &q->rq;
> 	unsigned long nr;
> +	int iog_congested[2], iog_full[2];
> 	int ret = queue_var_store(&nr, page, count);
> 	if (nr < BLKDEV_MIN_RQ)
> 		nr = BLKDEV_MIN_RQ;
> @@ -47,27 +48,32 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
> 	spin_lock_irq(q->queue_lock);
> 	q->nr_requests = nr;
> 	blk_queue_congestion_threshold(q);
> +	io_group_set_nrq_all(q, nr, iog_congested, iog_full);
>
> -	if (rl->count[READ] >= queue_congestion_on_threshold(q))
> +	if (rl->count[READ] >= queue_congestion_on_threshold(q) ||
> +	    iog_congested[READ])
> 		blk_set_queue_congested(q, READ);
> -	else if (rl->count[READ] < queue_congestion_off_threshold(q))
> +	else if (rl->count[READ] < queue_congestion_off_threshold(q) &&
> +		 !iog_congested[READ])
> 		blk_clear_queue_congested(q, READ);
>
> -	if (rl->count[WRITE] >= queue_congestion_on_threshold(q))
> +	if (rl->count[WRITE] >= queue_congestion_on_threshold(q) ||
> +	    iog_congested[WRITE])
> 		blk_set_queue_congested(q, WRITE);
> -	else if (rl->count[WRITE] < queue_congestion_off_threshold(q))
> +	else if (rl->count[WRITE] < queue_congestion_off_threshold(q) &&
> +		 !iog_congested[WRITE])
> 		blk_clear_queue_congested(q, WRITE);
>
> -	if (rl->count[READ] >= q->nr_requests) {
> +	if (rl->count[READ] >= q->nr_requests || iog_full[READ]) {
> 		blk_set_queue_full(q, READ);
> -	} else if (rl->count[READ]+1 <= q->nr_requests) {
> +	} else if (rl->count[READ]+1 <= q->nr_requests && !iog_full[READ]) {
> 		blk_clear_queue_full(q, READ);
> 		wake_up(&rl->wait[READ]);
> 	}
>
> -	if (rl->count[WRITE] >= q->nr_requests) {
> +	if (rl->count[WRITE] >= q->nr_requests || iog_full[WRITE]) {
> 		blk_set_queue_full(q, WRITE);
> -	} else if (rl->count[WRITE]+1 <= q->nr_requests) {
> +	} else if (rl->count[WRITE]+1 <= q->nr_requests && !iog_full[WRITE]) {
> 		blk_clear_queue_full(q, WRITE);
> 		wake_up(&rl->wait[WRITE]);
> 	}
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index df53418..3b021f3 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -924,6 +924,111 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
> }
> EXPORT_SYMBOL(io_lookup_io_group_current);
>
> +/*
> + * TODO
> + * This is a complete duplication of blk_queue_congestion_threshold()
> + * except for the argument type and name.  Can we merge them?
> + */
> +static void io_group_nrq_congestion_threshold(struct io_group_nrq *nrq)
> +{
> +	int nr;
> +
> +	nr = nrq->nr_requests - (nrq->nr_requests / 8) + 1;
> +	if (nr > nrq->nr_requests)
> +		nr = nrq->nr_requests;
> +	nrq->nr_congestion_on = nr;
> +
> +	nr = nrq->nr_requests - (nrq->nr_requests / 8)
> +		- (nrq->nr_requests / 16) - 1;
> +	if (nr < 1)
> +		nr = 1;
> +	nrq->nr_congestion_off = nr;
> +}
> +
> +static void io_group_set_nrq(struct io_group_nrq *nrq, int nr_requests,
> +			 int *congested, int *full)
> +{
> +	int i;
> +
> +	BUG_ON(nr_requests < 0);
> +
> +	nrq->nr_requests = nr_requests;
> +	io_group_nrq_congestion_threshold(nrq);
> +
> +	for (i=0; i<2; i++) {
> +		if (nrq->count[i] >= nrq->nr_congestion_on)
> +			congested[i] = 1;
> +		else if (nrq->count[i] < nrq->nr_congestion_off)
> +			congested[i] = 0;
> +
> +		if (nrq->count[i] >= nrq->nr_requests)
> +			full[i] = 1;
> +		else if (nrq->count[i]+1 <= nrq->nr_requests)
> +			full[i] = 0;
> +	}
> +}
> +
> +void io_group_set_nrq_all(struct request_queue *q, int nr,
> +			    int *congested, int *full)
> +{
> +	struct elv_fq_data *efqd = &q->elevator->efqd;
> +	struct hlist_head *head = &efqd->group_list;
> +	struct io_group *root = efqd->root_group;
> +	struct hlist_node *n;
> +	struct io_group *iog;
> +	struct io_group_nrq *nrq;
> +	int nrq_congested[2];
> +	int nrq_full[2];
> +	int i;
> +
> +	for (i=0; i<2; i++)
> +		*(congested + i) = *(full + i) = 0;
> +
> +	nrq = &root->nrq;
> +	io_group_set_nrq(nrq, nr, nrq_congested, nrq_full);
> +	for (i=0; i<2; i++) {
> +		*(congested + i) |= nrq_congested[i];
> +		*(full + i) |= nrq_full[i];
> +	}
> +
> +	hlist_for_each_entry(iog, n, head, elv_data_node) {
> +		nrq = &iog->nrq;
> +		io_group_set_nrq(nrq, nr, nrq_congested, nrq_full);
> +		for (i=0; i<2; i++) {
> +			*(congested + i) |= nrq_congested[i];
> +			*(full + i) |= nrq_full[i];
> +		}
> +	}
> +}
> +
> +struct io_group *io_congested_io_group(struct request_queue *q, int rw)
> +{
> +	struct hlist_head *head = &q->elevator->efqd.group_list;
> +	struct hlist_node *n;
> +	struct io_group *iog;
> +
> +	hlist_for_each_entry(iog, n, head, elv_data_node) {
> +		struct io_group_nrq *nrq = &iog->nrq;
> +		if (nrq->count[rw] >= nrq->nr_congestion_off)
> +			return iog;
> +	}
> +	return NULL;
> +}
> +
> +struct io_group *io_full_io_group(struct request_queue *q, int rw)
> +{
> +	struct hlist_head *head = &q->elevator->efqd.group_list;
> +	struct hlist_node *n;
> +	struct io_group *iog;
> +
> +	hlist_for_each_entry(iog, n, head, elv_data_node) {
> +		struct io_group_nrq *nrq = &iog->nrq;
> +		if (nrq->count[rw] >= nrq->nr_requests)
> +			return iog;
> +	}
> +	return NULL;
> +}
> +
> void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> {
> 	struct io_entity *entity = &iog->entity;
> @@ -934,6 +1039,12 @@ void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> 	entity->my_sched_data = &iog->sched_data;
> }
>
> +static void io_group_init_nrq(struct request_queue *q, struct io_group_nrq *nrq)
> +{
> +	nrq->nr_requests = q->nr_requests;
> +	io_group_nrq_congestion_threshold(nrq);
> +}
> +
> void io_group_set_parent(struct io_group *iog, struct io_group *parent)
> {
> 	struct io_entity *entity;
> @@ -1053,6 +1164,8 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
> 		io_group_init_entity(iocg, iog);
> 		iog->my_entity = &iog->entity;
>
> +		io_group_init_nrq(q, &iog->nrq);
> +
> 		if (leaf == NULL) {
> 			leaf = iog;
> 			prev = leaf;
> @@ -1176,7 +1289,7 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
>  * Generic function to make sure cgroup hierarchy is all setup once a request
>  * from a cgroup is received by the io scheduler.
>  */
> -struct io_group *io_get_io_group(struct request_queue *q)
> +struct io_group *__io_get_io_group(struct request_queue *q)
> {
> 	struct cgroup *cgroup;
> 	struct io_group *iog;
> @@ -1192,6 +1305,19 @@ struct io_group *io_get_io_group(struct request_queue *q)
> 	return iog;
> }
>
> +struct io_group *io_get_io_group(struct request_queue *q)
> +{
> +	struct io_group *iog;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +	iog = __io_get_io_group(q);
> +	spin_unlock_irqrestore(q->queue_lock, flags);
> +	BUG_ON(!iog);
> +
> +	return iog;
> +}
> +
> void io_free_root_group(struct elevator_queue *e)
> {
> 	struct io_cgroup *iocg = &io_root_cgroup;
> @@ -1220,6 +1346,7 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
> 	iog->entity.parent = NULL;
> 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
> 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
> +	io_group_init_nrq(q, &iog->nrq);
>
> 	iocg = &io_root_cgroup;
> 	spin_lock_irq(&iocg->lock);
> @@ -1533,15 +1660,11 @@ void elv_fq_set_request_io_group(struct request_queue *q,
> 						struct request *rq)
> {
> 	struct io_group *iog;
> -	unsigned long flags;
>
> 	/* Make sure io group hierarchy has been setup and also set the
> 	 * io group to which rq belongs. Later we should make use of
> 	 * bio cgroup patches to determine the io group */
> -	spin_lock_irqsave(q->queue_lock, flags);
> 	iog = io_get_io_group(q);
> -	spin_unlock_irqrestore(q->queue_lock, flags);
> -	BUG_ON(!iog);
>
> 	/* Store iog in rq. TODO: take care of referencing */
> 	rq->iog = iog;
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index fc4110d..f8eabd4 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -187,6 +187,22 @@ struct io_queue {
>
> #ifdef CONFIG_GROUP_IOSCHED
> /**
> + * struct io_group_nrq - structure to store allocated requests info
> + * @nr_requests: maximum number of requests for the io_group
> + * @nr_congestion_on: threshold to determine the io_group is congested.
> + * @nr_congestion_off: threshold to determine the io_group is not congested.
> + * @count: num of allocated requests.
> + *
> + * All fields are protected by queue_lock.
> + */
> +struct io_group_nrq {
> +	unsigned long nr_requests;
> +	unsigned int nr_congestion_on;
> +	unsigned int nr_congestion_off;
> +	int count[2];
> +};
> +
> +/**
>  * struct bfq_group - per (device, cgroup) data structure.
>  * @entity: schedulable entity to insert into the parent group sched_data.
>  * @sched_data: own sched_data, to contain child entities (they may be
> @@ -235,6 +251,8 @@ struct io_group {
>
> 	/* Single ioq per group, used for noop, deadline, anticipatory */
> 	struct io_queue *ioq;
> +
> +	struct io_group_nrq nrq;
> };
>
> /**
> @@ -469,6 +487,11 @@ extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
> extern void elv_fq_unset_request_ioq(struct request_queue *q,
> 					struct request *rq);
> extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
> +extern void io_group_set_nrq_all(struct request_queue *q, int nr,
> +			    int *congested, int *full);
> +extern struct io_group *io_congested_io_group(struct request_queue *q, int rw);
> +extern struct io_group *io_full_io_group(struct request_queue *q, int rw);
> +extern struct io_group *__io_get_io_group(struct request_queue *q);
>
> /* Returns single ioq associated with the io group. */
> static inline struct io_queue *io_group_ioq(struct io_group *iog)
> @@ -486,6 +509,52 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
> 	iog->ioq = ioq;
> }
>
> +static inline struct io_group *io_request_io_group(struct request *rq)
> +{
> +	return rq->iog;
> +}
> +
> +static inline unsigned long io_group_nr_requests(struct io_group *iog)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.nr_requests;
> +}
> +
> +static inline int io_group_inc_count(struct io_group *iog, int rw)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.count[rw]++;
> +}
> +
> +static inline int io_group_dec_count(struct io_group *iog, int rw)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.count[rw]--;
> +}
> +
> +static inline int io_group_count(struct io_group *iog, int rw)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.count[rw];
> +}
> +
> +static inline int io_group_congestion_on(struct io_group *iog, int rw)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.count[rw] + 1 >= iog->nrq.nr_congestion_on;
> +}
> +
> +static inline int io_group_congestion_off(struct io_group *iog, int rw)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.count[rw] < iog->nrq.nr_congestion_off;
> +}
> +
> +static inline int io_group_full(struct io_group *iog, int rw)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.count[rw] + 1 >= iog->nrq.nr_requests;
> +}
> #else /* !GROUP_IOSCHED */
> /*
>  * No ioq movement is needed in case of flat setup. root io group gets cleaned
> @@ -537,6 +606,71 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
> 	return NULL;
> }
>
> +static inline void io_group_set_nrq_all(struct request_queue *q, int nr,
> +					int *congested, int *full)
> +{
> +	int i;
> +	for (i=0; i<2; i++)
> +		*(congested + i) = *(full + i) = 0;
> +}
> +
> +static inline struct io_group *
> +io_congested_io_group(struct request_queue *q, int rw)
> +{
> +	return NULL;
> +}
> +
> +static inline struct io_group *
> +io_full_io_group(struct request_queue *q, int rw)
> +{
> +	return NULL;
> +}
> +
> +static inline struct io_group *__io_get_io_group(struct request_queue *q)
> +{
> +	return NULL;
> +}
> +
> +static inline struct io_group *io_request_io_group(struct request *rq)
> +{
> +	return NULL;
> +}
> +
> +static inline unsigned long io_group_nr_requests(struct io_group *iog)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_inc_count(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_dec_count(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_count(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_congestion_on(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_congestion_off(struct io_group *iog, int rw)
> +{
> +	return 1;
> +}
> +
> +static inline int io_group_full(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> #endif /* GROUP_IOSCHED */
>
> /* Functions used by blksysfs.c */
> @@ -589,6 +723,9 @@ extern void elv_free_ioq(struct io_queue *ioq);
>
> #else /* CONFIG_ELV_FAIR_QUEUING */
>
> +struct io_group {
> +};
> +
> static inline int elv_init_fq_data(struct request_queue *q,
> 					struct elevator_queue *e)
> {
> @@ -655,5 +792,69 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
> 	return NULL;
> }
>
> +static inline void io_group_set_nrq_all(struct request_queue *q, int nr,
> +					int *congested, int *full)
> +{
> +	int i;
> +	for (i=0; i<2; i++)
> +		*(congested + i) = *(full + i) = 0;
> +}
> +
> +static inline struct io_group *
> +io_congested_io_group(struct request_queue *q, int rw)
> +{
> +	return NULL;
> +}
> +
> +static inline struct io_group *
> +io_full_io_group(struct request_queue *q, int rw)
> +{
> +	return NULL;
> +}
> +
> +static inline struct io_group *__io_get_io_group(struct request_queue *q)
> +{
> +	return NULL;
> +}
> +
> +static inline struct io_group *io_request_io_group(struct request *rq)
> +{
> +	return NULL;
> +}
> +
> +static inline unsigned long io_group_nr_requests(struct io_group *iog)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_inc_count(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_dec_count(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_count(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_congestion_on(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_congestion_off(struct io_group *iog, int rw)
> +{
> +	return 1;
> +}
> +
> +static inline int io_group_full(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> #endif /* CONFIG_ELV_FAIR_QUEUING */
> #endif /* _BFQ_SCHED_H */
> -- 
> 1.5.4.3
>
>
> -- 
> IKEDA, Munehiro
> NEC Corporation of America
>   m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* IO Controller per cgroup request descriptors (Re: [PATCH 01/10] Documentation)
@ 2009-05-01 22:45                 ` Vivek Goyal
  0 siblings, 0 replies; 190+ messages in thread
From: Vivek Goyal @ 2009-05-01 22:45 UTC (permalink / raw)
  To: IKEDA, Munehiro
  Cc: Balbir Singh, oz-kernel, paolo.valente, linux-kernel, dhaval,
	containers, menage, jmoyer, fchecconi, arozansk, jens.axboe,
	akpm, fernando, Andrea Righi, Ryo Tsuruta, Nauman Rafique,
	Divyesh Shah, Gui Jianfeng

On Fri, May 01, 2009 at 06:04:39PM -0400, IKEDA, Munehiro wrote:
> Vivek Goyal wrote:
>>>> +TODO
>>>> +====
>>>> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
>>>> +- Convert cgroup ioprio to notion of weight.
>>>> +- Anticipatory code will need more work. It is not working properly currently
>>>> +  and needs more thought.
>>> What are the problems with the code?
>>
>> Have not got a chance to look into the issues in detail yet. Just a crude run
>> saw drop in performance. Will debug it later the moment I have got async writes
>> handled...
>>
>>>> +- Use of bio-cgroup patches.
>>> I saw these posted as well
>>>
>>>> +- Use of Nauman's per cgroup request descriptor patches.
>>>> +
>>> More details would be nice, I am not sure I understand
>>
>> Currently the number of request descriptors which can be allocated per
>> device/request queue are fixed by a sysfs tunable (q->nr_requests). So
>> if there is lots of IO going on from one cgroup then it will consume all
>> the available request descriptors and other cgroup might starve and not
>> get its fair share.
>>
>> Hence we also need to introduce the notion of request descriptor limit per
>> cgroup so that if request descriptors from one group are exhausted, then
>> it does not impact the IO of other cgroup.
>
> Unfortunately I couldn't find them, and I've never seen Nauman's patches.
> So I tried to make a patch below against this todo.  The reason why
> I'm posting this, even though it is just a quick and ugly hack (and it
> might be a reinvention of the wheel), is that I would like to discuss how
> we should define the limit on requests per cgroup.
> This patch should be applied on Vivek's I/O controller patches
> posted on Mar 11.

Hi IKEDA,

Sorry for the confusion here. Actually, Nauman had sent a patch to a select
group of people who were initially copied on the mail thread.

>
> This patch temporarily distributes q->nr_requests to each cgroup.
> I think the number should be weighted like BFQ's budget.  But in
> that case, if the cgroup hierarchy is deep, leaf cgroups would be
> allowed to allocate only a very small number of requests.  I don't think
> this is reasonable...but I don't have a specific idea to solve this
> problem.  Does anyone have a good idea?
>

Thanks for the patch. Yes, ideally one would expect request descriptors
to also be allocated in proportion to the weight, but I guess that would
become very complicated.

In terms of simpler things, two thoughts come to mind.

- First approach is to make q->nr_requests per group. So every group is
  entitled to q->nr_requests as set by the user. This is what your patch
  seems to have done.

  I had some concerns with this approach. First of all, it does not seem to
  put an upper bound on the number of request descriptors allocated per
  queue, because as a user creates more cgroups, the total number of request
  descriptors increases.

- Second approach is that we retain the meaning of q->nr_requests, which
  defines the total number of request descriptors on the queue (with the
  exception of 50% more descriptors for batching processes), and we define
  a new per group limit, q->nr_group_requests, which defines how many
  requests per group can be assigned. So q->nr_requests defines the total
  pool size on the queue and q->nr_group_requests defines how many requests
  each group can allocate out of that pool.

  The issue here is that a user has to balance q->nr_group_requests and
  q->nr_requests properly. (A condensed sketch of the resulting two-level
  check follows right after this list.)
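The sketch below is only meant to show the shape of that two-level check; it
is not a drop-in function. The congestion and batching handling lives in the
patch itself, and q->rq_data / q->nr_group_requests are fields this patch adds.

static bool can_alloc_request(struct request_queue *q,
			      struct request_list *rl, int is_sync)
{
	/* queue-wide pool limit (with 50% slack for batching processes) */
	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
		return false;

	/* per-group limit out of the same pool */
	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
		return false;

	return true;
}

Only when both checks pass are the group count (rl->count[]) and the
queue-wide count (q->rq_data.count[]) incremented and a request allocated.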

To experiment, I have implemented the second approach. I am attaching the
patch which is in my current tree. It probably will not apply to my March
11 posting, since the patches have changed since then. But I am posting it
here so that it at least gives an idea of the thought process.

Ideas are welcome...

Thanks
Vivek
   
o Currently a request queue has a fixed number of request descriptors for
  sync and async requests. Once the request descriptors are consumed, new
  processes are put to sleep and they effectively become serialized. Because
  sync and async queues are separate, async requests don't impact sync ones,
  but if one is looking for fairness between async requests, that is not
  achievable if request queue descriptors become the bottleneck.

o Make request descriptors per io group so that if there is lots of IO
  going on in one cgroup, it does not impact the IO of other groups.

o This patch implements the per cgroup request descriptors. The request pool
  per queue is still common, but every group has its own wait list and its
  own count of request descriptors allocated to that group for sync and async
  queues. So effectively request_list becomes a per io group property and not
  a global request queue feature (see the condensed sketch after this list).

o Currently one can define q->nr_requests to limit request descriptors
  allocated for the queue. Now there is another tunable, q->nr_group_requests,
  which controls the request descriptor limit per group. q->nr_requests
  supersedes q->nr_group_requests to make sure that if there are lots of
  groups present, we don't end up allocating too many request descriptors
  on the queue.

o Issues: Currently the notion of congestion is per queue. With per group
  request descriptors it is possible that the queue is not congested but the
  group the bio will go into is congested. For example, with the defaults in
  this patch, a group that has used up its 64 descriptors blocks further
  submitters even though the queue-wide pool of 256 is far from its
  congestion threshold.
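In data-structure terms, the split described above boils down to roughly the
following (condensed from the blkdev.h changes in the patch; only the fields
relevant to this discussion are shown):

#include <linux/wait.h>		/* wait_queue_head_t */
#include <linux/mempool.h>	/* mempool_t */

struct request_list {			/* now one instance per io group */
	int count[2];			/* indexed by BLK_RW_SYNC/BLK_RW_ASYNC */
	int starved[2];
	wait_queue_head_t wait[2];	/* per-group waiters */
};

struct request_data {			/* one per request queue */
	int count[2];			/* queue-wide totals */
	int elvpriv;
	mempool_t *rq_pool;		/* the common request mempool */
	int starved;
	wait_queue_head_t starved_wait;	/* global list for starved tasks */
};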

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

---
 block/blk-core.c       |  216 ++++++++++++++++++++++++++++++++++---------------
 block/blk-settings.c   |    3 
 block/blk-sysfs.c      |   57 ++++++++++--
 block/elevator-fq.c    |   15 +++
 block/elevator-fq.h    |    8 +
 block/elevator.c       |    6 -
 include/linux/blkdev.h |   62 +++++++++++++-
 7 files changed, 287 insertions(+), 80 deletions(-)

Index: linux9/include/linux/blkdev.h
===================================================================
--- linux9.orig/include/linux/blkdev.h	2009-04-30 15:43:53.000000000 -0400
+++ linux9/include/linux/blkdev.h	2009-04-30 16:18:29.000000000 -0400
@@ -32,21 +32,51 @@ struct request;
 struct sg_io_hdr;
 
 #define BLKDEV_MIN_RQ	4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ	256	/* Default maximum */
+#define BLKDEV_MAX_GROUP_RQ    64      /* Default maximum */
+#else
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+/*
+ * This is equivalent to the case of only one group present (root group). Let
+ * it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
+#endif
 
 struct request;
 typedef void (rq_end_io_fn)(struct request *, int);
 
 struct request_list {
 	/*
-	 * count[], starved[], and wait[] are indexed by
+	 * count[], starved and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
 	 */
 	int count[2];
 	int starved[2];
+	wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of the mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+	/*
+	 * Per queue request descriptor count. This is in addition to per
+	 * cgroup count
+	 */
+	int count[2];
 	int elvpriv;
 	mempool_t *rq_pool;
-	wait_queue_head_t wait[2];
+	int starved;
+	/*
+	 * Global list for starved tasks. A task will be queued here if
+	 * it could not allocate request descriptor and the associated
+	 * group request list does not have any requests pending.
+	 */
+	wait_queue_head_t starved_wait;
 };
 
 /*
@@ -251,6 +281,7 @@ struct request {
 #ifdef CONFIG_GROUP_IOSCHED
 	/* io group request belongs to */
 	struct io_group *iog;
+	struct request_list *rl;
 #endif /* GROUP_IOSCHED */
 #endif /* ELV_FAIR_QUEUING */
 };
@@ -340,6 +371,9 @@ struct request_queue
 	 */
 	struct request_list	rq;
 
+	/* Contains request pool and other data like starved data */
+	struct request_data	rq_data;
+
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
 	prep_rq_fn		*prep_rq_fn;
@@ -402,6 +436,8 @@ struct request_queue
 	 * queue settings
 	 */
 	unsigned long		nr_requests;	/* Max # of requests */
+	/* Max # of per io group requests */
+	unsigned long		nr_group_requests;
 	unsigned int		nr_congestion_on;
 	unsigned int		nr_congestion_off;
 	unsigned int		nr_batching;
@@ -773,6 +809,28 @@ extern int scsi_cmd_ioctl(struct request
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern void blk_init_request_list(struct request_list *rl);
+
+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	return io_group_get_request_list(q, bio);
+#else
+	return &q->rq;
+#endif
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+						struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	return rq->rl;
+#else
+	return blk_get_request_list(q, NULL);
+#endif
+}
+
 /*
  * Temporary export, until SCSI gets fixed up.
  */
Index: linux9/block/elevator.c
===================================================================
--- linux9.orig/block/elevator.c	2009-04-30 16:17:53.000000000 -0400
+++ linux9/block/elevator.c	2009-04-30 16:18:29.000000000 -0400
@@ -664,7 +664,7 @@ void elv_quiesce_start(struct request_qu
 	 * make sure we don't have any requests in flight
 	 */
 	elv_drain_elevator(q);
-	while (q->rq.elvpriv) {
+	while (q->rq_data.elvpriv) {
 		blk_start_queueing(q);
 		spin_unlock_irq(q->queue_lock);
 		msleep(10);
@@ -764,8 +764,8 @@ void elv_insert(struct request_queue *q,
 	}
 
 	if (unplug_it && blk_queue_plugged(q)) {
-		int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
-			- q->in_flight;
+		int nrq = q->rq_data.count[BLK_RW_SYNC] +
+				q->rq_data.count[BLK_RW_ASYNC] - q->in_flight;
 
 		if (nrq >= q->unplug_thresh)
 			__generic_unplug_device(q);
Index: linux9/block/blk-core.c
===================================================================
--- linux9.orig/block/blk-core.c	2009-04-30 16:17:53.000000000 -0400
+++ linux9/block/blk-core.c	2009-04-30 16:18:29.000000000 -0400
@@ -480,20 +480,31 @@ void blk_cleanup_queue(struct request_qu
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
 
 	rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
-	rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
-	rl->elvpriv = 0;
 	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
 	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
 
-	rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
-				mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+#ifndef CONFIG_GROUP_IOSCHED
+	struct request_list *rl = blk_get_request_list(q, NULL);
+
+	/*
+	 * In case of group scheduling, request list is inside the associated
+	 * group and when that group is instantiated, it takes care of
+	 * initializing the request list also.
+	 */
+	blk_init_request_list(rl);
+#endif
+	q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+				mempool_alloc_slab, mempool_free_slab,
+				request_cachep, q->node);
 
-	if (!rl->rq_pool)
+	if (!q->rq_data.rq_pool)
 		return -ENOMEM;
 
 	return 0;
@@ -590,6 +601,9 @@ blk_init_queue_node(request_fn_proc *rfn
 		return NULL;
 	}
 
+	/* init starved waiter wait queue */
+	init_waitqueue_head(&q->rq_data.starved_wait);
+
 	/*
 	 * if caller didn't supply a lock, they get per-queue locking with
 	 * our embedded lock
@@ -639,14 +653,14 @@ static inline void blk_free_request(stru
 {
 	if (rq->cmd_flags & REQ_ELVPRIV)
 		elv_put_request(q, rq);
-	mempool_free(rq, q->rq.rq_pool);
+	mempool_free(rq, q->rq_data.rq_pool);
 }
 
 static struct request *
 blk_alloc_request(struct request_queue *q, struct bio *bio, int rw, int priv,
 					gfp_t gfp_mask)
 {
-	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
 
 	if (!rq)
 		return NULL;
@@ -657,7 +671,7 @@ blk_alloc_request(struct request_queue *
 
 	if (priv) {
 		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
+			mempool_free(rq, q->rq_data.rq_pool);
 			return NULL;
 		}
 		rq->cmd_flags |= REQ_ELVPRIV;
@@ -700,18 +714,18 @@ static void ioc_set_batching(struct requ
 	ioc->last_waited = jiffies;
 }
 
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
-	if (rl->count[sync] < queue_congestion_off_threshold(q))
+	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
-	if (rl->count[sync] + 1 <= q->nr_requests) {
+	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+		blk_clear_queue_full(q, sync);
+
+	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
-
-		blk_clear_queue_full(q, sync);
 	}
 }
 
@@ -719,18 +733,29 @@ static void __freed_request(struct reque
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
+	BUG_ON(!rl->count[sync]);
 	rl->count[sync]--;
+
+	BUG_ON(!q->rq_data.count[sync]);
+	q->rq_data.count[sync]--;
+
 	if (priv)
-		rl->elvpriv--;
+		q->rq_data.elvpriv--;
 
-	__freed_request(q, sync);
+	__freed_request(q, sync, rl);
 
 	if (unlikely(rl->starved[sync ^ 1]))
-		__freed_request(q, sync ^ 1);
+		__freed_request(q, sync ^ 1, rl);
+
+	/* Wake up the starved process on global list, if any */
+	if (unlikely(q->rq_data.starved)) {
+		if (waitqueue_active(&q->rq_data.starved_wait))
+			wake_up(&q->rq_data.starved_wait);
+		q->rq_data.starved--;
+	}
 }
 
 /*
@@ -739,10 +764,9 @@ static void freed_request(struct request
  * Returns !NULL on success, with queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, int rw_flags,
-				   struct bio *bio, gfp_t gfp_mask)
+		   struct bio *bio, gfp_t gfp_mask, struct request_list *rl)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
@@ -751,31 +775,38 @@ static struct request *get_request(struc
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
-	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
-		if (rl->count[is_sync]+1 >= q->nr_requests) {
-			ioc = current_io_context(GFP_ATOMIC, q->node);
-			/*
-			 * The queue will fill after this allocation, so set
-			 * it as full, and mark this process as "batching".
-			 * This process will be allowed to complete a batch of
-			 * requests, others will be blocked.
-			 */
-			if (!blk_queue_full(q, is_sync)) {
-				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
-			} else {
-				if (may_queue != ELV_MQUEUE_MUST
-						&& !ioc_batching(q, ioc)) {
-					/*
-					 * The queue is full and the allocating
-					 * process is not a "batcher", and not
-					 * exempted by the IO scheduler
-					 */
-					goto out;
-				}
+	if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+		blk_set_queue_congested(q, is_sync);
+
+	/*
+	 * Looks like there is no user of queue full now.
+	 * Keeping it for time being.
+	 */
+	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+		blk_set_queue_full(q, is_sync);
+
+	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+		ioc = current_io_context(GFP_ATOMIC, q->node);
+		/*
+		 * The queue request descriptor group will fill after
+		 * this allocation, so set it as full, and mark this
+		 * process as "batching". This process will be allowed
+		 * to complete a batch of requests, others will be
+		 * blocked.
+		 */
+		if (rl->count[is_sync] <= q->nr_group_requests)
+			ioc_set_batching(q, ioc);
+		else {
+			if (may_queue != ELV_MQUEUE_MUST
+					&& !ioc_batching(q, ioc)) {
+				/*
+				 * The queue is full and the allocating
+				 * process is not a "batcher", and not
+				 * exempted by the IO scheduler
+				 */
+				goto out;
 			}
 		}
-		blk_set_queue_congested(q, is_sync);
 	}
 
 	/*
@@ -783,19 +814,41 @@ static struct request *get_request(struc
 	 * limit of requests, otherwise we could have thousands of requests
 	 * allocated with any setting of ->nr_requests
 	 */
-	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
+		goto out;
+
+	/*
+	 * Allocation of request is allowed from queue perspective. Now check
+	 * from per group request list
+	 */
+
+	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
 		goto out;
 
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
+	q->rq_data.count[is_sync]++;
+
 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
 	if (priv)
-		rl->elvpriv++;
+		q->rq_data.elvpriv++;
 
 	spin_unlock_irq(q->queue_lock);
 
 	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	if (rq) {
+		/*
+		 * TODO. Implement group reference counting and take a
+		 * reference to the group to make sure the group, and hence
+		 * the request list, does not go away till rq finishes.
+		 */
+		rq->rl = rl;
+	}
+#endif
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
@@ -805,7 +858,7 @@ static struct request *get_request(struc
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 
 		/*
 		 * in the very unlikely event that allocation failed and no
@@ -815,10 +868,26 @@ static struct request *get_request(struc
 		 * rq mempool into READ and WRITE
 		 */
 rq_starved:
-		if (unlikely(rl->count[is_sync] == 0))
-			rl->starved[is_sync] = 1;
-
-		goto out;
+		if (unlikely(rl->count[is_sync] == 0)) {
+			/*
+			 * If there is a request pending in other direction
+			 * in same io group, then set the starved flag of
+			 * the group request list. Otherwise, we need to
+			 * make this process sleep in global starved list
+			 * to make sure it will not sleep indefinitely.
+			 */
+			if (rl->count[is_sync ^ 1] != 0) {
+				rl->starved[is_sync] = 1;
+				goto out;
+			} else {
+				/*
+				 * It indicates to calling function to put
+				 * task on global starved list. Not the best
+				 * way
+				 */
+				return ERR_PTR(-ENOMEM);
+			}
+		}
 	}
 
 	/*
@@ -846,15 +915,29 @@ static struct request *get_request_wait(
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, bio);
 
-	rq = get_request(q, rw_flags, bio, GFP_NOIO);
-	while (!rq) {
+	rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
+	while (!rq || (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM)) {
 		DEFINE_WAIT(wait);
 		struct io_context *ioc;
-		struct request_list *rl = &q->rq;
 
-		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
-				TASK_UNINTERRUPTIBLE);
+		if (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM) {
+			/*
+			 * Task failed allocation and needs to wait and
+			 * try again. There are no requests pending from
+			 * the io group hence need to sleep on global
+			 * wait queue. Most likely the allocation failed
+			 * because of memory issues.
+			 */
+
+			q->rq_data.starved++;
+			prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+					&wait, TASK_UNINTERRUPTIBLE);
+		} else {
+			prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+					TASK_UNINTERRUPTIBLE);
+		}
 
 		trace_block_sleeprq(q, bio, rw_flags & 1);
 
@@ -874,7 +957,12 @@ static struct request *get_request_wait(
 		spin_lock_irq(q->queue_lock);
 		finish_wait(&rl->wait[is_sync], &wait);
 
-		rq = get_request(q, rw_flags, bio, GFP_NOIO);
+		/*
+		 * After the sleep, check the rl again in case the cgroup the
+		 * bio belonged to is gone and it is now mapped to the root group
+		 */
+		rl = blk_get_request_list(q, bio);
+		rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
 	};
 
 	return rq;
@@ -883,6 +971,7 @@ static struct request *get_request_wait(
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, NULL);
 
 	BUG_ON(rw != READ && rw != WRITE);
 
@@ -890,7 +979,7 @@ struct request *blk_get_request(struct r
 	if (gfp_mask & __GFP_WAIT) {
 		rq = get_request_wait(q, rw, NULL);
 	} else {
-		rq = get_request(q, rw, NULL, gfp_mask);
+		rq = get_request(q, rw, NULL, gfp_mask, rl);
 		if (!rq)
 			spin_unlock_irq(q->queue_lock);
 	}
@@ -1073,12 +1162,13 @@ void __blk_put_request(struct request_qu
 	if (req->cmd_flags & REQ_ALLOCED) {
 		int is_sync = rq_is_sync(req) != 0;
 		int priv = req->cmd_flags & REQ_ELVPRIV;
+		struct request_list *rl = rq_rl(q, req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
 		blk_free_request(q, req);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
Index: linux9/block/blk-sysfs.c
===================================================================
--- linux9.orig/block/blk-sysfs.c	2009-04-30 16:18:27.000000000 -0400
+++ linux9/block/blk-sysfs.c	2009-04-30 16:18:29.000000000 -0400
@@ -38,7 +38,7 @@ static ssize_t queue_requests_show(struc
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
-	struct request_list *rl = &q->rq;
+	struct request_list *rl = blk_get_request_list(q, NULL);
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
 	if (nr < BLKDEV_MIN_RQ)
@@ -48,32 +48,55 @@ queue_requests_store(struct request_queu
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
 
-	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_SYNC);
-	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_SYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_SYNC);
 
-	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_ASYNC);
-	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_ASYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
-	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_SYNC);
-	} else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_SYNC);
 		wake_up(&rl->wait[BLK_RW_SYNC]);
 	}
 
-	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_ASYNC);
-	} else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_ASYNC);
 		wake_up(&rl->wait[BLK_RW_ASYNC]);
 	}
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+					size_t count)
+{
+	unsigned long nr;
+	int ret = queue_var_store(&nr, page, count);
+	if (nr < BLKDEV_MIN_RQ)
+		nr = BLKDEV_MIN_RQ;
+
+	spin_lock_irq(q->queue_lock);
+	q->nr_group_requests = nr;
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+#endif
 
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
 {
@@ -228,6 +251,14 @@ static struct queue_sysfs_entry queue_re
 	.store = queue_requests_store,
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_group_requests_show,
+	.store = queue_group_requests_store,
+};
+#endif
+
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -308,6 +339,9 @@ static struct queue_sysfs_entry queue_sl
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+	&queue_group_requests_entry.attr,
+#endif
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
@@ -389,12 +423,11 @@ static void blk_release_queue(struct kob
 {
 	struct request_queue *q =
 		container_of(kobj, struct request_queue, kobj);
-	struct request_list *rl = &q->rq;
 
 	blk_sync_queue(q);
 
-	if (rl->rq_pool)
-		mempool_destroy(rl->rq_pool);
+	if (q->rq_data.rq_pool)
+		mempool_destroy(q->rq_data.rq_pool);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
Index: linux9/block/blk-settings.c
===================================================================
--- linux9.orig/block/blk-settings.c	2009-04-30 15:43:53.000000000 -0400
+++ linux9/block/blk-settings.c	2009-04-30 16:18:29.000000000 -0400
@@ -123,6 +123,9 @@ void blk_queue_make_request(struct reque
 	 * set defaults
 	 */
 	q->nr_requests = BLKDEV_MAX_RQ;
+#ifdef CONFIG_GROUP_IOSCHED
+	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
+#endif
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
Index: linux9/block/elevator-fq.c
===================================================================
--- linux9.orig/block/elevator-fq.c	2009-04-30 16:18:27.000000000 -0400
+++ linux9/block/elevator-fq.c	2009-04-30 16:18:29.000000000 -0400
@@ -954,6 +954,17 @@ struct io_cgroup *cgroup_to_io_cgroup(st
 			    struct io_cgroup, css);
 }
 
+struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+	struct io_group *iog;
+
+	iog = io_get_io_group_bio(q, bio, 1);
+	BUG_ON(!iog);
+out:
+	return &iog->rl;
+}
+
 /*
  * Search the bfq_group for bfqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1203,6 +1214,8 @@ struct io_group *io_group_chain_alloc(st
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
 
+		blk_init_request_list(&iog->rl);
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -1446,6 +1459,8 @@ struct io_group *io_alloc_root_group(str
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
 
+	blk_init_request_list(&iog->rl);
+
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
Index: linux9/block/elevator-fq.h
===================================================================
--- linux9.orig/block/elevator-fq.h	2009-04-30 16:18:27.000000000 -0400
+++ linux9/block/elevator-fq.h	2009-04-30 16:18:29.000000000 -0400
@@ -239,8 +239,14 @@ struct io_group {
 
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
+
+	/* request list associated with the group */
+	struct request_list rl;
 };
 
+#define IOG_FLAG_READFULL	1	/* read queue has been filled */
+#define IOG_FLAG_WRITEFULL	2	/* write queue has been filled */
+
 /**
  * struct bfqio_cgroup - bfq cgroup data structure.
  * @css: subsystem state for bfq in the containing cgroup.
@@ -517,6 +523,8 @@ extern void elv_fq_unset_request_ioq(str
 extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
 extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 						struct bio *bio);
+extern struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio);
 
 /* Returns single ioq associated with the io group. */
 static inline struct io_queue *io_group_ioq(struct io_group *iog)

Thanks
Vivek

> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
> ---
> block/blk-core.c    |   36 +++++++--
> block/blk-sysfs.c   |   22 ++++--
> block/elevator-fq.c |  133 ++++++++++++++++++++++++++++++++--
> block/elevator-fq.h |  201 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 371 insertions(+), 21 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 29bcfac..21023f7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -705,11 +705,15 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
> static void __freed_request(struct request_queue *q, int rw)
> {
> 	struct request_list *rl = &q->rq;
> -
> -	if (rl->count[rw] < queue_congestion_off_threshold(q))
> +	struct io_group *congested_iog, *full_iog;
> +	
> +	congested_iog = io_congested_io_group(q, rw);
> +	if (rl->count[rw] < queue_congestion_off_threshold(q) &&
> +	    !congested_iog)
> 		blk_clear_queue_congested(q, rw);
>
> -	if (rl->count[rw] + 1 <= q->nr_requests) {
> +	full_iog = io_full_io_group(q, rw);
> +	if (rl->count[rw] + 1 <= q->nr_requests && !full_iog) {
> 		if (waitqueue_active(&rl->wait[rw]))
> 			wake_up(&rl->wait[rw]);
>
> @@ -721,13 +725,16 @@ static void __freed_request(struct request_queue *q, int rw)
>  * A request has just been released.  Account for it, update the full and
>  * congestion status, wake up any waiters.   Called under q->queue_lock.
>  */
> -static void freed_request(struct request_queue *q, int rw, int priv)
> +static void freed_request(struct request_queue *q, struct io_group *iog,
> +			  int rw, int priv)
> {
> 	struct request_list *rl = &q->rq;
>
> 	rl->count[rw]--;
> 	if (priv)
> 		rl->elvpriv--;
> +	if (iog)
> +		io_group_dec_count(iog, rw);
>
> 	__freed_request(q, rw);
>
> @@ -746,16 +753,21 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
> {
> 	struct request *rq = NULL;
> 	struct request_list *rl = &q->rq;
> +	struct io_group *iog;
> 	struct io_context *ioc = NULL;
> 	const int rw = rw_flags & 0x01;
> 	int may_queue, priv;
>
> +	iog = __io_get_io_group(q);
> +
> 	may_queue = elv_may_queue(q, rw_flags);
> 	if (may_queue == ELV_MQUEUE_NO)
> 		goto rq_starved;
>
> -	if (rl->count[rw]+1 >= queue_congestion_on_threshold(q)) {
> -		if (rl->count[rw]+1 >= q->nr_requests) {
> +	if (rl->count[rw]+1 >= queue_congestion_on_threshold(q) ||
> +	    io_group_congestion_on(iog, rw)) {
> +		if (rl->count[rw]+1 >= q->nr_requests ||
> +		    io_group_full(iog, rw)) {
> 			ioc = current_io_context(GFP_ATOMIC, q->node);
> 			/*
> 			 * The queue will fill after this allocation, so set
> @@ -789,8 +801,15 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
> 	if (rl->count[rw] >= (3 * q->nr_requests / 2))
> 		goto out;
>
> +	if (iog)
> +		if (io_group_count(iog, rw) >=
> +		   (3 * io_group_nr_requests(iog) / 2))
> +			goto out;
> +
> 	rl->count[rw]++;
> 	rl->starved[rw] = 0;
> +	if (iog)
> +		io_group_inc_count(iog, rw);
>
> 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> 	if (priv)
> @@ -808,7 +827,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
> 		 * wait queue, but this is pretty rare.
> 		 */
> 		spin_lock_irq(q->queue_lock);
> -		freed_request(q, rw, priv);
> +		freed_request(q, iog, rw, priv);
>
> 		/*
> 		 * in the very unlikely event that allocation failed and no
> @@ -1073,12 +1092,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
> 	if (req->cmd_flags & REQ_ALLOCED) {
> 		int rw = rq_data_dir(req);
> 		int priv = req->cmd_flags & REQ_ELVPRIV;
> +		struct io_group *iog = io_request_io_group(req);
>
> 		BUG_ON(!list_empty(&req->queuelist));
> 		BUG_ON(!hlist_unhashed(&req->hash));
>
> 		blk_free_request(q, req);
> -		freed_request(q, rw, priv);
> +		freed_request(q, iog, rw, priv);
> 	}
> }
> EXPORT_SYMBOL_GPL(__blk_put_request);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 0d98c96..af5191c 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -40,6 +40,7 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
> {
> 	struct request_list *rl = &q->rq;
> 	unsigned long nr;
> +	int iog_congested[2], iog_full[2];
> 	int ret = queue_var_store(&nr, page, count);
> 	if (nr < BLKDEV_MIN_RQ)
> 		nr = BLKDEV_MIN_RQ;
> @@ -47,27 +48,32 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
> 	spin_lock_irq(q->queue_lock);
> 	q->nr_requests = nr;
> 	blk_queue_congestion_threshold(q);
> +	io_group_set_nrq_all(q, nr, iog_congested, iog_full);
>
> -	if (rl->count[READ] >= queue_congestion_on_threshold(q))
> +	if (rl->count[READ] >= queue_congestion_on_threshold(q) ||
> +	    iog_congested[READ])
> 		blk_set_queue_congested(q, READ);
> -	else if (rl->count[READ] < queue_congestion_off_threshold(q))
> +	else if (rl->count[READ] < queue_congestion_off_threshold(q) &&
> +		 !iog_congested[READ])
> 		blk_clear_queue_congested(q, READ);
>
> -	if (rl->count[WRITE] >= queue_congestion_on_threshold(q))
> +	if (rl->count[WRITE] >= queue_congestion_on_threshold(q) ||
> +	    iog_congested[WRITE])
> 		blk_set_queue_congested(q, WRITE);
> -	else if (rl->count[WRITE] < queue_congestion_off_threshold(q))
> +	else if (rl->count[WRITE] < queue_congestion_off_threshold(q) &&
> +		 !iog_congested[WRITE])
> 		blk_clear_queue_congested(q, WRITE);
>
> -	if (rl->count[READ] >= q->nr_requests) {
> +	if (rl->count[READ] >= q->nr_requests || iog_full[READ]) {
> 		blk_set_queue_full(q, READ);
> -	} else if (rl->count[READ]+1 <= q->nr_requests) {
> +	} else if (rl->count[READ]+1 <= q->nr_requests && !iog_full[READ]) {
> 		blk_clear_queue_full(q, READ);
> 		wake_up(&rl->wait[READ]);
> 	}
>
> -	if (rl->count[WRITE] >= q->nr_requests) {
> +	if (rl->count[WRITE] >= q->nr_requests || iog_full[WRITE]) {
> 		blk_set_queue_full(q, WRITE);
> -	} else if (rl->count[WRITE]+1 <= q->nr_requests) {
> +	} else if (rl->count[WRITE]+1 <= q->nr_requests && !iog_full[WRITE]) {
> 		blk_clear_queue_full(q, WRITE);
> 		wake_up(&rl->wait[WRITE]);
> 	}
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index df53418..3b021f3 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -924,6 +924,111 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
> }
> EXPORT_SYMBOL(io_lookup_io_group_current);
>
> +/*
> + * TODO
> + * This is a complete duplication of blk_queue_congestion_threshold()
> + * except for the argument type and name.  Can we merge them?
> + */
> +static void io_group_nrq_congestion_threshold(struct io_group_nrq *nrq)
> +{
> +	int nr;
> +
> +	nr = nrq->nr_requests - (nrq->nr_requests / 8) + 1;
> +	if (nr > nrq->nr_requests)
> +		nr = nrq->nr_requests;
> +	nrq->nr_congestion_on = nr;
> +
> +	nr = nrq->nr_requests - (nrq->nr_requests / 8)
> +		- (nrq->nr_requests / 16) - 1;
> +	if (nr < 1)
> +		nr = 1;
> +	nrq->nr_congestion_off = nr;
> +}
> +
> +static void io_group_set_nrq(struct io_group_nrq *nrq, int nr_requests,
> +			 int *congested, int *full)
> +{
> +	int i;
> +
> +	BUG_ON(nr_requests < 0);
> +
> +	nrq->nr_requests = nr_requests;
> +	io_group_nrq_congestion_threshold(nrq);
> +
> +	for (i=0; i<2; i++) {
> +		if (nrq->count[i] >= nrq->nr_congestion_on)
> +			congested[i] = 1;
> +		else if (nrq->count[i] < nrq->nr_congestion_off)
> +			congested[i] = 0;
> +
> +		if (nrq->count[i] >= nrq->nr_requests)
> +			full[i] = 1;
> +		else if (nrq->count[i]+1 <= nrq->nr_requests)
> +			full[i] = 0;
> +	}
> +}
> +
> +void io_group_set_nrq_all(struct request_queue *q, int nr,
> +			    int *congested, int *full)
> +{
> +	struct elv_fq_data *efqd = &q->elevator->efqd;
> +	struct hlist_head *head = &efqd->group_list;
> +	struct io_group *root = efqd->root_group;
> +	struct hlist_node *n;
> +	struct io_group *iog;
> +	struct io_group_nrq *nrq;
> +	int nrq_congested[2];
> +	int nrq_full[2];
> +	int i;
> +
> +	for (i=0; i<2; i++)
> +		*(congested + i) = *(full + i) = 0;
> +
> +	nrq = &root->nrq;
> +	io_group_set_nrq(nrq, nr, nrq_congested, nrq_full);
> +	for (i=0; i<2; i++) {
> +		*(congested + i) |= nrq_congested[i];
> +		*(full + i) |= nrq_full[i];
> +	}
> +
> +	hlist_for_each_entry(iog, n, head, elv_data_node) {
> +		nrq = &iog->nrq;
> +		io_group_set_nrq(nrq, nr, nrq_congested, nrq_full);
> +		for (i=0; i<2; i++) {
> +			*(congested + i) |= nrq_congested[i];
> +			*(full + i) |= nrq_full[i];
> +		}
> +	}
> +}
> +
> +struct io_group *io_congested_io_group(struct request_queue *q, int rw)
> +{
> +	struct hlist_head *head = &q->elevator->efqd.group_list;
> +	struct hlist_node *n;
> +	struct io_group *iog;
> +
> +	hlist_for_each_entry(iog, n, head, elv_data_node) {
> +		struct io_group_nrq *nrq = &iog->nrq;
> +		if (nrq->count[rw] >= nrq->nr_congestion_off)
> +			return iog;
> +	}
> +	return NULL;
> +}
> +
> +struct io_group *io_full_io_group(struct request_queue *q, int rw)
> +{
> +	struct hlist_head *head = &q->elevator->efqd.group_list;
> +	struct hlist_node *n;
> +	struct io_group *iog;
> +
> +	hlist_for_each_entry(iog, n, head, elv_data_node) {
> +		struct io_group_nrq *nrq = &iog->nrq;
> +		if (nrq->count[rw] >= nrq->nr_requests)
> +			return iog;
> +	}
> +	return NULL;
> +}
> +
> void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> {
> 	struct io_entity *entity = &iog->entity;
> @@ -934,6 +1039,12 @@ void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> 	entity->my_sched_data = &iog->sched_data;
> }
>
> +static void io_group_init_nrq(struct request_queue *q, struct io_group_nrq *nrq)
> +{
> +	nrq->nr_requests = q->nr_requests;
> +	io_group_nrq_congestion_threshold(nrq);
> +}
> +
> void io_group_set_parent(struct io_group *iog, struct io_group *parent)
> {
> 	struct io_entity *entity;
> @@ -1053,6 +1164,8 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
> 		io_group_init_entity(iocg, iog);
> 		iog->my_entity = &iog->entity;
>
> +		io_group_init_nrq(q, &iog->nrq);
> +
> 		if (leaf == NULL) {
> 			leaf = iog;
> 			prev = leaf;
> @@ -1176,7 +1289,7 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
>  * Generic function to make sure cgroup hierarchy is all setup once a request
>  * from a cgroup is received by the io scheduler.
>  */
> -struct io_group *io_get_io_group(struct request_queue *q)
> +struct io_group *__io_get_io_group(struct request_queue *q)
> {
> 	struct cgroup *cgroup;
> 	struct io_group *iog;
> @@ -1192,6 +1305,19 @@ struct io_group *io_get_io_group(struct request_queue *q)
> 	return iog;
> }
>
> +struct io_group *io_get_io_group(struct request_queue *q)
> +{
> +	struct io_group *iog;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +	iog = __io_get_io_group(q);
> +	spin_unlock_irqrestore(q->queue_lock, flags);
> +	BUG_ON(!iog);
> +
> +	return iog;
> +}
> +
> void io_free_root_group(struct elevator_queue *e)
> {
> 	struct io_cgroup *iocg = &io_root_cgroup;
> @@ -1220,6 +1346,7 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
> 	iog->entity.parent = NULL;
> 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
> 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
> +	io_group_init_nrq(q, &iog->nrq);
>
> 	iocg = &io_root_cgroup;
> 	spin_lock_irq(&iocg->lock);
> @@ -1533,15 +1660,11 @@ void elv_fq_set_request_io_group(struct request_queue *q,
> 						struct request *rq)
> {
> 	struct io_group *iog;
> -	unsigned long flags;
>
> 	/* Make sure io group hierarchy has been setup and also set the
> 	 * io group to which rq belongs. Later we should make use of
> 	 * bio cgroup patches to determine the io group */
> -	spin_lock_irqsave(q->queue_lock, flags);
> 	iog = io_get_io_group(q);
> -	spin_unlock_irqrestore(q->queue_lock, flags);
> -	BUG_ON(!iog);
>
> 	/* Store iog in rq. TODO: take care of referencing */
> 	rq->iog = iog;
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index fc4110d..f8eabd4 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -187,6 +187,22 @@ struct io_queue {
>
> #ifdef CONFIG_GROUP_IOSCHED
> /**
> + * struct io_group_nrq - structure to store allocated requests info
> + * @nr_requests: maximum number of requests for the io_group
> + * @nr_congestion_on: threshold to determine the io_group is congested.
> + * @nr_congestion_off: threshold to determine the io_group is not congested.
> + * @count: num of allocated requests.
> + *
> + * All fields are protected by queue_lock.
> + */
> +struct io_group_nrq {
> +	unsigned long nr_requests;
> +	unsigned int nr_congestion_on;
> +	unsigned int nr_congestion_off;
> +	int count[2];
> +};
> +
> +/**
>  * struct bfq_group - per (device, cgroup) data structure.
>  * @entity: schedulable entity to insert into the parent group sched_data.
>  * @sched_data: own sched_data, to contain child entities (they may be
> @@ -235,6 +251,8 @@ struct io_group {
>
> 	/* Single ioq per group, used for noop, deadline, anticipatory */
> 	struct io_queue *ioq;
> +
> +	struct io_group_nrq nrq;
> };
>
> /**
> @@ -469,6 +487,11 @@ extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
> extern void elv_fq_unset_request_ioq(struct request_queue *q,
> 					struct request *rq);
> extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
> +extern void io_group_set_nrq_all(struct request_queue *q, int nr,
> +			    int *congested, int *full);
> +extern struct io_group *io_congested_io_group(struct request_queue *q, int rw);
> +extern struct io_group *io_full_io_group(struct request_queue *q, int rw);
> +extern struct io_group *__io_get_io_group(struct request_queue *q);
>
> /* Returns single ioq associated with the io group. */
> static inline struct io_queue *io_group_ioq(struct io_group *iog)
> @@ -486,6 +509,52 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
> 	iog->ioq = ioq;
> }
>
> +static inline struct io_group *io_request_io_group(struct request *rq)
> +{
> +	return rq->iog;
> +}
> +
> +static inline unsigned long io_group_nr_requests(struct io_group *iog)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.nr_requests;
> +}
> +
> +static inline int io_group_inc_count(struct io_group *iog, int rw)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.count[rw]++;
> +}
> +
> +static inline int io_group_dec_count(struct io_group *iog, int rw)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.count[rw]--;
> +}
> +
> +static inline int io_group_count(struct io_group *iog, int rw)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.count[rw];
> +}
> +
> +static inline int io_group_congestion_on(struct io_group *iog, int rw)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.count[rw] + 1 >= iog->nrq.nr_congestion_on;
> +}
> +
> +static inline int io_group_congestion_off(struct io_group *iog, int rw)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.count[rw] < iog->nrq.nr_congestion_off;
> +}
> +
> +static inline int io_group_full(struct io_group *iog, int rw)
> +{
> +	BUG_ON(!iog);
> +	return iog->nrq.count[rw] + 1 >= iog->nrq.nr_requests;
> +}
> #else /* !GROUP_IOSCHED */
> /*
>  * No ioq movement is needed in case of flat setup. root io group gets cleaned
> @@ -537,6 +606,71 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
> 	return NULL;
> }
>
> +static inline void io_group_set_nrq_all(struct request_queue *q, int nr,
> +					int *congested, int *full)
> +{
> +	int i;
> +	for (i=0; i<2; i++)
> +		*(congested + i) = *(full + i) = 0;
> +}
> +
> +static inline struct io_group *
> +io_congested_io_group(struct request_queue *q, int rw)
> +{
> +	return NULL;
> +}
> +
> +static inline struct io_group *
> +io_full_io_group(struct request_queue *q, int rw)
> +{
> +	return NULL;
> +}
> +
> +static inline struct io_group *__io_get_io_group(struct request_queue *q)
> +{
> +	return NULL;
> +}
> +
> +static inline struct io_group *io_request_io_group(struct request *rq)
> +{
> +	return NULL;
> +}
> +
> +static inline unsigned long io_group_nr_requests(struct io_group *iog)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_inc_count(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_dec_count(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_count(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_congestion_on(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_congestion_off(struct io_group *iog, int rw)
> +{
> +	return 1;
> +}
> +
> +static inline int io_group_full(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> #endif /* GROUP_IOSCHED */
>
> /* Functions used by blksysfs.c */
> @@ -589,6 +723,9 @@ extern void elv_free_ioq(struct io_queue *ioq);
>
> #else /* CONFIG_ELV_FAIR_QUEUING */
>
> +struct io_group {
> +};
> +
> static inline int elv_init_fq_data(struct request_queue *q,
> 					struct elevator_queue *e)
> {
> @@ -655,5 +792,69 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
> 	return NULL;
> }
>
> +static inline void io_group_set_nrq_all(struct request_queue *q, int nr,
> +					int *congested, int *full)
> +{
> +	int i;
> +	for (i=0; i<2; i++)
> +		*(congested + i) = *(full + i) = 0;
> +}
> +
> +static inline struct io_group *
> +io_congested_io_group(struct request_queue *q, int rw)
> +{
> +	return NULL;
> +}
> +
> +static inline struct io_group *
> +io_full_io_group(struct request_queue *q, int rw)
> +{
> +	return NULL;
> +}
> +
> +static inline struct io_group *__io_get_io_group(struct request_queue *q)
> +{
> +	return NULL;
> +}
> +
> +static inline struct io_group *io_request_io_group(struct request *rq)
> +{
> +	return NULL;
> +}
> +
> +static inline unsigned long io_group_nr_requests(struct io_group *iog)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_inc_count(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_dec_count(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_count(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_congestion_on(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> +
> +static inline int io_group_congestion_off(struct io_group *iog, int rw)
> +{
> +	return 1;
> +}
> +
> +static inline int io_group_full(struct io_group *iog, int rw)
> +{
> +	return 0;
> +}
> #endif /* CONFIG_ELV_FAIR_QUEUING */
> #endif /* _BFQ_SCHED_H */
> -- 
> 1.5.4.3
>
>
> -- 
> IKEDA, Munehiro
> NEC Corporation of America
>   m-ikeda@ds.jp.nec.com
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO Controller per cgroup request descriptors (Re: [PATCH 01/10] Documentation)
       [not found]                 ` <20090501224506.GC6130-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-01 23:39                   ` Nauman Rafique
  0 siblings, 0 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-05-01 23:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oz-kernel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	arozansk-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Andrea Righi,
	menage-hpIqsD4AKlfQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fernando-w0OK63jvRlAuJ+9fw/WgBHgSJqDPrsil, Balbir Singh

On Fri, May 1, 2009 at 3:45 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, May 01, 2009 at 06:04:39PM -0400, IKEDA, Munehiro wrote:
>> Vivek Goyal wrote:
>>>>> +TODO
>>>>> +====
>>>>> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
>>>>> +- Convert cgroup ioprio to notion of weight.
>>>>> +- Anticipatory code will need more work. It is not working properly currently
>>>>> +  and needs more thought.
>>>> What are the problems with the code?
>>>
>>> Have not got a chance to look into the issues in detail yet; a crude run
>>> showed a drop in performance. Will debug it later, the moment I have got
>>> async writes handled...
>>>
>>>>> +- Use of bio-cgroup patches.
>>>> I saw these posted as well
>>>>
>>>>> +- Use of Nauman's per cgroup request descriptor patches.
>>>>> +
>>>> More details would be nice, I am not sure I understand
>>>
>>> Currently the number of request descriptors which can be allocated per
>>> device/request queue is fixed by a sysfs tunable (q->nr_requests). So
>>> if there is lots of IO going on from one cgroup, it can consume all
>>> the available request descriptors, and other cgroups might starve and not
>>> get their fair share.
>>>
>>> Hence we also need to introduce the notion of a request descriptor limit
>>> per cgroup, so that if the request descriptors of one group are exhausted,
>>> it does not impact the IO of other cgroups.
>>
>> Unfortunately I couldn't find Nauman's patches and have never seen them,
>> so I tried to make a patch below against this todo.  Even though it is
>> just a quick and ugly hack (and might be a reinvention of the wheel),
>> I'm posting it because I would like to discuss how we should define the
>> limit on requests per cgroup.
>> This patch should be applied on top of Vivek's I/O controller patches
>> posted on Mar 11.
>
> Hi IKEDA,
>
> Sorry for the confusion here. Actually Nauman had sent a patch to a select group
> of people who were initially copied on the mail thread.

I am sorry about that. Since I dropped my whole patch set in favor of
Vivek's stuff, this stuff fell through the cracks.

>
>>
>> This patch temporarily distributes q->nr_requests to each cgroup.
>> I think the number should be weighted like BFQ's budget.  But in
>> that case, if the cgroup hierarchy is deep, leaf cgroups are
>> allowed to allocate very few requests.  I don't think
>> this is reasonable...but I don't have a specific idea to solve this
>> problem.  Does anyone have a good idea?
>>
>
> Thanks for the patch. Yes, ideally one would expect the request descriptors
> to also be allocated in proportion to the weight, but I guess that would
> become very complicated.
>
> In terms of simpler things, two thoughts come to mind.
>
> - The first approach is to make q->nr_requests per group, so every group is
>  entitled to q->nr_requests as set by the user. This is what your patch
>  seems to have done.
>
>  I had some concerns with this approach. First of all, it does not seem to
>  have an upper bound on the number of request descriptors allocated per
>  queue, because if a user creates more cgroups, the total number of request
>  descriptors increases.
>
> - The second approach is that we retain the meaning of q->nr_requests,
>  which defines the total number of request descriptors on the queue (with
>  the exception of 50% more descriptors for batching processes), and we
>  define a new per-group limit q->nr_group_requests which defines how many
>  requests can be assigned per group. So q->nr_requests defines the total
>  pool size on the queue and q->nr_group_requests defines how many requests
>  each group can allocate out of that pool.
>
>  Here the issue is that a user will have to balance q->nr_group_requests
>  and q->nr_requests properly.
>
> To experiment, I have implemented the second approach. I am attaching the
> patch which is in my current tree. It probably will not apply to my March
> 11 posting, as the patches have changed since then, but I am posting it here
> so that it at least gives an idea of the thought process.
>
> Ideas are welcome...
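
Just to make sure I am reading the second approach right: ignoring the
congestion, batching and starvation handling, the admission logic in
get_request() boils down to something like the untested sketch below
(using the q->rq_data.count[], q->nr_requests and q->nr_group_requests
fields from the patch):

/*
 * Condensed sketch of the two-level check: the queue-wide pool
 * (q->nr_requests) bounds the total, while q->nr_group_requests
 * bounds what a single group's request_list may take out of it.
 */
static int may_alloc_request(struct request_queue *q,
			     struct request_list *rl, int is_sync)
{
	/* queue-wide pool, shared by all groups */
	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
		return 0;

	/* per-group share out of that pool */
	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
		return 0;

	return 1;
}

So a single group can never grab more than nr_group_requests (plus the
batching slack), and the total across all groups is still capped by
nr_requests.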

I had started with the first option, but the second option sounds good
too. One problem that comes to mind, though, is how we deal with
hierarchies. The sysadmin can limit the root-level cgroups to a
specific number of request descriptors, but if applications running in
a cgroup are allowed to create their own cgroups, then the total
request descriptors of all child cgroups should be capped by the
number assigned to their parent cgroup. A rough sketch of what I have
in mind is below.
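
This is only a hypothetical sketch; the parent pointer and the
per-group limit field do not exist in this form in the posted patches,
they are just there to illustrate charging every level of the
hierarchy at allocation time:

/*
 * Hypothetical: admit a request only if this group and all of its
 * ancestors still have room, and on success charge each level.
 * iog->parent and iog->limit are made-up fields for illustration;
 * iog->rl is the per-group request_list from the patch.
 */
static int io_group_hier_may_queue(struct io_group *iog, int is_sync)
{
	struct io_group *g;

	for (g = iog; g != NULL; g = g->parent)
		if (g->rl.count[is_sync] >= g->limit)
			return 0;

	for (g = iog; g != NULL; g = g->parent)
		g->rl.count[is_sync]++;

	return 1;
}

That way whatever limit the admin puts on a top-level cgroup is a hard
cap for everything created underneath it, instead of every new child
cgroup bringing its own full allotment.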

>
> Thanks
> Vivek
>
> o Currently a request queue has a fixed number of request descriptors for
>  sync and async requests. Once the request descriptors are consumed, new
>  processes are put to sleep and they effectively become serialized. Because
>  sync and async queues are separate, async requests don't impact sync ones,
>  but if one is looking for fairness between async requests, that is not
>  achievable if request queue descriptors become the bottleneck.
>
> o Make request descriptors per io group so that if there is lots of IO
>  going on in one cgroup, it does not impact the IO of other groups.
>
> o This patch implements per-cgroup request descriptors. The request pool per
>  queue is still common, but every group will have its own wait list and its
>  own count of request descriptors allocated to that group for sync and async
>  queues. So effectively the request_list becomes a per-io-group property and
>  not a global request queue feature.
>
> o Currently one can set q->nr_requests to limit the request descriptors
>  allocated for the queue. Now there is another tunable, q->nr_group_requests,
>  which controls the request descriptor limit per group. q->nr_requests
>  supersedes q->nr_group_requests to make sure that if there are lots of
>  groups present, we don't end up allocating too many request descriptors on
>  the queue.
>
> o Issues: Currently the notion of congestion is per queue. With per-group
>  request descriptors it is possible that the queue is not congested but the
>  group the bio will go into is congested.
>
> Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>
> ---
>  block/blk-core.c       |  216 ++++++++++++++++++++++++++++++++++---------------
>  block/blk-settings.c   |    3
>  block/blk-sysfs.c      |   57 ++++++++++--
>  block/elevator-fq.c    |   15 +++
>  block/elevator-fq.h    |    8 +
>  block/elevator.c       |    6 -
>  include/linux/blkdev.h |   62 +++++++++++++-
>  7 files changed, 287 insertions(+), 80 deletions(-)
>
> Index: linux9/include/linux/blkdev.h
> ===================================================================
> --- linux9.orig/include/linux/blkdev.h  2009-04-30 15:43:53.000000000 -0400
> +++ linux9/include/linux/blkdev.h       2009-04-30 16:18:29.000000000 -0400
> @@ -32,21 +32,51 @@ struct request;
>  struct sg_io_hdr;
>
>  #define BLKDEV_MIN_RQ  4
> +
> +#ifdef CONFIG_GROUP_IOSCHED
> +#define BLKDEV_MAX_RQ  256     /* Default maximum */
> +#define BLKDEV_MAX_GROUP_RQ    64      /* Default maximum */
> +#else
>  #define BLKDEV_MAX_RQ  128     /* Default maximum */
> +/*
> + * This is equivalent to the case of only one group present (root group). Let
> + * it consume all the request descriptors available on the queue.
> + */
> +#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
> +#endif
>
>  struct request;
>  typedef void (rq_end_io_fn)(struct request *, int);
>
>  struct request_list {
>        /*
> -        * count[], starved[], and wait[] are indexed by
> +        * count[], starved and wait[] are indexed by
>         * BLK_RW_SYNC/BLK_RW_ASYNC
>         */
>        int count[2];
>        int starved[2];
> +       wait_queue_head_t wait[2];
> +};
> +
> +/*
> + * This data structures keeps track of mempool of requests for the queue
> + * and some overall statistics.
> + */
> +struct request_data {
> +       /*
> +        * Per queue request descriptor count. This is in addition to per
> +        * cgroup count
> +        */
> +       int count[2];
>        int elvpriv;
>        mempool_t *rq_pool;
> -       wait_queue_head_t wait[2];
> +       int starved;
> +       /*
> +        * Global list for starved tasks. A task will be queued here if
> +        * it could not allocate request descriptor and the associated
> +        * group request list does not have any requests pending.
> +        */
> +       wait_queue_head_t starved_wait;
>  };
>
>  /*
> @@ -251,6 +281,7 @@ struct request {
>  #ifdef CONFIG_GROUP_IOSCHED
>        /* io group request belongs to */
>        struct io_group *iog;
> +       struct request_list *rl;
>  #endif /* GROUP_IOSCHED */
>  #endif /* ELV_FAIR_QUEUING */
>  };
> @@ -340,6 +371,9 @@ struct request_queue
>         */
>        struct request_list     rq;
>
> +       /* Contains request pool and other data like starved data */
> +       struct request_data     rq_data;
> +
>        request_fn_proc         *request_fn;
>        make_request_fn         *make_request_fn;
>        prep_rq_fn              *prep_rq_fn;
> @@ -402,6 +436,8 @@ struct request_queue
>         * queue settings
>         */
>        unsigned long           nr_requests;    /* Max # of requests */
> +       /* Max # of per io group requests */
> +       unsigned long           nr_group_requests;
>        unsigned int            nr_congestion_on;
>        unsigned int            nr_congestion_off;
>        unsigned int            nr_batching;
> @@ -773,6 +809,28 @@ extern int scsi_cmd_ioctl(struct request
>  extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
>                         struct scsi_ioctl_command __user *);
>
> +extern void blk_init_request_list(struct request_list *rl);
> +
> +static inline struct request_list *blk_get_request_list(struct request_queue *q,
> +                                               struct bio *bio)
> +{
> +#ifdef CONFIG_GROUP_IOSCHED
> +       return io_group_get_request_list(q, bio);
> +#else
> +       return &q->rq;
> +#endif
> +}
> +
> +static inline struct request_list *rq_rl(struct request_queue *q,
> +                                               struct request *rq)
> +{
> +#ifdef CONFIG_GROUP_IOSCHED
> +       return rq->rl;
> +#else
> +       return blk_get_request_list(q, NULL);
> +#endif
> +}
> +
>  /*
>  * Temporary export, until SCSI gets fixed up.
>  */
> Index: linux9/block/elevator.c
> ===================================================================
> --- linux9.orig/block/elevator.c        2009-04-30 16:17:53.000000000 -0400
> +++ linux9/block/elevator.c     2009-04-30 16:18:29.000000000 -0400
> @@ -664,7 +664,7 @@ void elv_quiesce_start(struct request_qu
>         * make sure we don't have any requests in flight
>         */
>        elv_drain_elevator(q);
> -       while (q->rq.elvpriv) {
> +       while (q->rq_data.elvpriv) {
>                blk_start_queueing(q);
>                spin_unlock_irq(q->queue_lock);
>                msleep(10);
> @@ -764,8 +764,8 @@ void elv_insert(struct request_queue *q,
>        }
>
>        if (unplug_it && blk_queue_plugged(q)) {
> -               int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
> -                       - q->in_flight;
> +               int nrq = q->rq_data.count[BLK_RW_SYNC] +
> +                               q->rq_data.count[BLK_RW_ASYNC] - q->in_flight;
>
>                if (nrq >= q->unplug_thresh)
>                        __generic_unplug_device(q);
> Index: linux9/block/blk-core.c
> ===================================================================
> --- linux9.orig/block/blk-core.c        2009-04-30 16:17:53.000000000 -0400
> +++ linux9/block/blk-core.c     2009-04-30 16:18:29.000000000 -0400
> @@ -480,20 +480,31 @@ void blk_cleanup_queue(struct request_qu
>  }
>  EXPORT_SYMBOL(blk_cleanup_queue);
>
> -static int blk_init_free_list(struct request_queue *q)
> +void blk_init_request_list(struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
>
>        rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
> -       rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
> -       rl->elvpriv = 0;
>        init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
>        init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
> +}
>
> -       rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
> -                               mempool_free_slab, request_cachep, q->node);
> +static int blk_init_free_list(struct request_queue *q)
> +{
> +#ifndef CONFIG_GROUP_IOSCHED
> +       struct request_list *rl = blk_get_request_list(q, NULL);
> +
> +       /*
> +        * In case of group scheduling, the request list is inside the
> +        * associated group, and when that group is instantiated, it takes
> +        * care of initializing the request list as well.
> +        */
> +       blk_init_request_list(rl);
> +#endif
> +       q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
> +                               mempool_alloc_slab, mempool_free_slab,
> +                               request_cachep, q->node);
>
> -       if (!rl->rq_pool)
> +       if (!q->rq_data.rq_pool)
>                return -ENOMEM;
>
>        return 0;
> @@ -590,6 +601,9 @@ blk_init_queue_node(request_fn_proc *rfn
>                return NULL;
>        }
>
> +       /* init starved waiter wait queue */
> +       init_waitqueue_head(&q->rq_data.starved_wait);
> +
>        /*
>         * if caller didn't supply a lock, they get per-queue locking with
>         * our embedded lock
> @@ -639,14 +653,14 @@ static inline void blk_free_request(stru
>  {
>        if (rq->cmd_flags & REQ_ELVPRIV)
>                elv_put_request(q, rq);
> -       mempool_free(rq, q->rq.rq_pool);
> +       mempool_free(rq, q->rq_data.rq_pool);
>  }
>
>  static struct request *
>  blk_alloc_request(struct request_queue *q, struct bio *bio, int rw, int priv,
>                                        gfp_t gfp_mask)
>  {
> -       struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
> +       struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
>
>        if (!rq)
>                return NULL;
> @@ -657,7 +671,7 @@ blk_alloc_request(struct request_queue *
>
>        if (priv) {
>                if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
> -                       mempool_free(rq, q->rq.rq_pool);
> +                       mempool_free(rq, q->rq_data.rq_pool);
>                        return NULL;
>                }
>                rq->cmd_flags |= REQ_ELVPRIV;
> @@ -700,18 +714,18 @@ static void ioc_set_batching(struct requ
>        ioc->last_waited = jiffies;
>  }
>
> -static void __freed_request(struct request_queue *q, int sync)
> +static void __freed_request(struct request_queue *q, int sync,
> +                                       struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
> -
> -       if (rl->count[sync] < queue_congestion_off_threshold(q))
> +       if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, sync);
>
> -       if (rl->count[sync] + 1 <= q->nr_requests) {
> +       if (q->rq_data.count[sync] + 1 <= q->nr_requests)
> +               blk_clear_queue_full(q, sync);
> +
> +       if (rl->count[sync] + 1 <= q->nr_group_requests) {
>                if (waitqueue_active(&rl->wait[sync]))
>                        wake_up(&rl->wait[sync]);
> -
> -               blk_clear_queue_full(q, sync);
>        }
>  }
>
> @@ -719,18 +733,29 @@ static void __freed_request(struct reque
>  * A request has just been released.  Account for it, update the full and
>  * congestion status, wake up any waiters.   Called under q->queue_lock.
>  */
> -static void freed_request(struct request_queue *q, int sync, int priv)
> +static void freed_request(struct request_queue *q, int sync, int priv,
> +                                       struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
> -
> +       BUG_ON(!rl->count[sync]);
>        rl->count[sync]--;
> +
> +       BUG_ON(!q->rq_data.count[sync]);
> +       q->rq_data.count[sync]--;
> +
>        if (priv)
> -               rl->elvpriv--;
> +               q->rq_data.elvpriv--;
>
> -       __freed_request(q, sync);
> +       __freed_request(q, sync, rl);
>
>        if (unlikely(rl->starved[sync ^ 1]))
> -               __freed_request(q, sync ^ 1);
> +               __freed_request(q, sync ^ 1, rl);
> +
> +       /* Wake up the starved process on global list, if any */
> +       if (unlikely(q->rq_data.starved)) {
> +               if (waitqueue_active(&q->rq_data.starved_wait))
> +                       wake_up(&q->rq_data.starved_wait);
> +               q->rq_data.starved--;
> +       }
>  }
>
>  /*
> @@ -739,10 +764,9 @@ static void freed_request(struct request
>  * Returns !NULL on success, with queue_lock *not held*.
>  */
>  static struct request *get_request(struct request_queue *q, int rw_flags,
> -                                  struct bio *bio, gfp_t gfp_mask)
> +                  struct bio *bio, gfp_t gfp_mask, struct request_list *rl)
>  {
>        struct request *rq = NULL;
> -       struct request_list *rl = &q->rq;
>        struct io_context *ioc = NULL;
>        const bool is_sync = rw_is_sync(rw_flags) != 0;
>        int may_queue, priv;
> @@ -751,31 +775,38 @@ static struct request *get_request(struc
>        if (may_queue == ELV_MQUEUE_NO)
>                goto rq_starved;
>
> -       if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
> -               if (rl->count[is_sync]+1 >= q->nr_requests) {
> -                       ioc = current_io_context(GFP_ATOMIC, q->node);
> -                       /*
> -                        * The queue will fill after this allocation, so set
> -                        * it as full, and mark this process as "batching".
> -                        * This process will be allowed to complete a batch of
> -                        * requests, others will be blocked.
> -                        */
> -                       if (!blk_queue_full(q, is_sync)) {
> -                               ioc_set_batching(q, ioc);
> -                               blk_set_queue_full(q, is_sync);
> -                       } else {
> -                               if (may_queue != ELV_MQUEUE_MUST
> -                                               && !ioc_batching(q, ioc)) {
> -                                       /*
> -                                        * The queue is full and the allocating
> -                                        * process is not a "batcher", and not
> -                                        * exempted by the IO scheduler
> -                                        */
> -                                       goto out;
> -                               }
> +       if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
> +               blk_set_queue_congested(q, is_sync);
> +
> +       /*
> +        * Looks like there is no user of the queue full flag now.
> +        * Keeping it for the time being.
> +        */
> +       if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
> +               blk_set_queue_full(q, is_sync);
> +
> +       if (rl->count[is_sync]+1 >= q->nr_group_requests) {
> +               ioc = current_io_context(GFP_ATOMIC, q->node);
> +               /*
> +                * The queue request descriptor group will fill after this
> +                * allocation, so set
> +                * it as full, and mark this process as "batching".
> +                * This process will be allowed to complete a batch of
> +                * requests, others will be blocked.
> +                */
> +               if (rl->count[is_sync] <= q->nr_group_requests)
> +                       ioc_set_batching(q, ioc);
> +               else {
> +                       if (may_queue != ELV_MQUEUE_MUST
> +                                       && !ioc_batching(q, ioc)) {
> +                               /*
> +                                * The queue is full and the allocating
> +                                * process is not a "batcher", and not
> +                                * exempted by the IO scheduler
> +                                */
> +                               goto out;
>                        }
>                }
> -               blk_set_queue_congested(q, is_sync);
>        }
>
>        /*
> @@ -783,19 +814,41 @@ static struct request *get_request(struc
>         * limit of requests, otherwise we could have thousands of requests
>         * allocated with any setting of ->nr_requests
>         */
> -       if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
> +
> +       if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
> +               goto out;
> +
> +       /*
> +        * Allocation of a request is allowed from the queue's perspective.
> +        * Now check the per-group request list.
> +        */
> +
> +       if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
>                goto out;
>
>        rl->count[is_sync]++;
>        rl->starved[is_sync] = 0;
>
> +       q->rq_data.count[is_sync]++;
> +
>        priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
>        if (priv)
> -               rl->elvpriv++;
> +               q->rq_data.elvpriv++;
>
>        spin_unlock_irq(q->queue_lock);
>
>        rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
> +
> +#ifdef CONFIG_GROUP_IOSCHED
> +       if (rq) {
> +               /*
> +                * TODO. Implement group reference counting and take the
> +                * reference to the group to make sure the group, and hence
> +                * the request list, does not go away till the rq finishes.
> +                */
> +               rq->rl = rl;
> +       }
> +#endif
>        if (unlikely(!rq)) {
>                /*
>                 * Allocation failed presumably due to memory. Undo anything
> @@ -805,7 +858,7 @@ static struct request *get_request(struc
>                 * wait queue, but this is pretty rare.
>                 */
>                spin_lock_irq(q->queue_lock);
> -               freed_request(q, is_sync, priv);
> +               freed_request(q, is_sync, priv, rl);
>
>                /*
>                 * in the very unlikely event that allocation failed and no
> @@ -815,10 +868,26 @@ static struct request *get_request(struc
>                 * rq mempool into READ and WRITE
>                 */
>  rq_starved:
> -               if (unlikely(rl->count[is_sync] == 0))
> -                       rl->starved[is_sync] = 1;
> -
> -               goto out;
> +               if (unlikely(rl->count[is_sync] == 0)) {
> +                       /*
> +                        * If there is a request pending in other direction
> +                        * in same io group, then set the starved flag of
> +                        * the group request list. Otherwise, we need to
> +                        * make this process sleep in global starved list
> +                        * to make sure it will not sleep indefinitely.
> +                        */
> +                       if (rl->count[is_sync ^ 1] != 0) {
> +                               rl->starved[is_sync] = 1;
> +                               goto out;
> +                       } else {
> +                               /*
> +                                * This indicates to the calling function
> +                                * that it should put the task on the global
> +                                * starved list. Not the best way.
> +                                */
> +                               return ERR_PTR(-ENOMEM);
> +                       }
> +               }
>        }
>
>        /*
> @@ -846,15 +915,29 @@ static struct request *get_request_wait(
>  {
>        const bool is_sync = rw_is_sync(rw_flags) != 0;
>        struct request *rq;
> +       struct request_list *rl = blk_get_request_list(q, bio);
>
> -       rq = get_request(q, rw_flags, bio, GFP_NOIO);
> -       while (!rq) {
> +       rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
> +       while (!rq || (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM)) {
>                DEFINE_WAIT(wait);
>                struct io_context *ioc;
> -               struct request_list *rl = &q->rq;
>
> -               prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
> -                               TASK_UNINTERRUPTIBLE);
> +               if (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM) {
> +                       /*
> +                        * Task failed allocation and needs to wait and
> +                        * try again. There are no requests pending from
> +                        * the io group hence need to sleep on global
> +                        * wait queue. Most likely the allocation failed
> +                        * because of memory issues.
> +                        */
> +
> +                       q->rq_data.starved++;
> +                       prepare_to_wait_exclusive(&q->rq_data.starved_wait,
> +                                       &wait, TASK_UNINTERRUPTIBLE);
> +               } else {
> +                       prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
> +                                       TASK_UNINTERRUPTIBLE);
> +               }
>
>                trace_block_sleeprq(q, bio, rw_flags & 1);
>
> @@ -874,7 +957,12 @@ static struct request *get_request_wait(
>                spin_lock_irq(q->queue_lock);
>                finish_wait(&rl->wait[is_sync], &wait);
>
> -               rq = get_request(q, rw_flags, bio, GFP_NOIO);
> +               /*
> +                * After the sleep, check the rl again in case the cgroup the
> +                * bio belonged to is gone and it is now mapped to the root group.
> +                */
> +               rl = blk_get_request_list(q, bio);
> +               rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
>        };
>
>        return rq;
> @@ -883,6 +971,7 @@ static struct request *get_request_wait(
>  struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
>  {
>        struct request *rq;
> +       struct request_list *rl = blk_get_request_list(q, NULL);
>
>        BUG_ON(rw != READ && rw != WRITE);
>
> @@ -890,7 +979,7 @@ struct request *blk_get_request(struct r
>        if (gfp_mask & __GFP_WAIT) {
>                rq = get_request_wait(q, rw, NULL);
>        } else {
> -               rq = get_request(q, rw, NULL, gfp_mask);
> +               rq = get_request(q, rw, NULL, gfp_mask, rl);
>                if (!rq)
>                        spin_unlock_irq(q->queue_lock);
>        }
> @@ -1073,12 +1162,13 @@ void __blk_put_request(struct request_qu
>        if (req->cmd_flags & REQ_ALLOCED) {
>                int is_sync = rq_is_sync(req) != 0;
>                int priv = req->cmd_flags & REQ_ELVPRIV;
> +               struct request_list *rl = rq_rl(q, req);
>
>                BUG_ON(!list_empty(&req->queuelist));
>                BUG_ON(!hlist_unhashed(&req->hash));
>
>                blk_free_request(q, req);
> -               freed_request(q, is_sync, priv);
> +               freed_request(q, is_sync, priv, rl);
>        }
>  }
>  EXPORT_SYMBOL_GPL(__blk_put_request);
> Index: linux9/block/blk-sysfs.c
> ===================================================================
> --- linux9.orig/block/blk-sysfs.c       2009-04-30 16:18:27.000000000 -0400
> +++ linux9/block/blk-sysfs.c    2009-04-30 16:18:29.000000000 -0400
> @@ -38,7 +38,7 @@ static ssize_t queue_requests_show(struc
>  static ssize_t
>  queue_requests_store(struct request_queue *q, const char *page, size_t count)
>  {
> -       struct request_list *rl = &q->rq;
> +       struct request_list *rl = blk_get_request_list(q, NULL);
>        unsigned long nr;
>        int ret = queue_var_store(&nr, page, count);
>        if (nr < BLKDEV_MIN_RQ)
> @@ -48,32 +48,55 @@ queue_requests_store(struct request_queu
>        q->nr_requests = nr;
>        blk_queue_congestion_threshold(q);
>
> -       if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
> +       if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
>                blk_set_queue_congested(q, BLK_RW_SYNC);
> -       else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
> +       else if (q->rq_data.count[BLK_RW_SYNC] <
> +                               queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, BLK_RW_SYNC);
>
> -       if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
> +       if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
>                blk_set_queue_congested(q, BLK_RW_ASYNC);
> -       else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
> +       else if (q->rq_data.count[BLK_RW_ASYNC] <
> +                               queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, BLK_RW_ASYNC);
>
> -       if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
> +       if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
>                blk_set_queue_full(q, BLK_RW_SYNC);
> -       } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
> +       } else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
>                blk_clear_queue_full(q, BLK_RW_SYNC);
>                wake_up(&rl->wait[BLK_RW_SYNC]);
>        }
>
> -       if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
> +       if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
>                blk_set_queue_full(q, BLK_RW_ASYNC);
> -       } else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
> +       } else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
>                blk_clear_queue_full(q, BLK_RW_ASYNC);
>                wake_up(&rl->wait[BLK_RW_ASYNC]);
>        }
>        spin_unlock_irq(q->queue_lock);
>        return ret;
>  }
> +#ifdef CONFIG_GROUP_IOSCHED
> +static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
> +{
> +       return queue_var_show(q->nr_group_requests, (page));
> +}
> +
> +static ssize_t
> +queue_group_requests_store(struct request_queue *q, const char *page,
> +                                       size_t count)
> +{
> +       unsigned long nr;
> +       int ret = queue_var_store(&nr, page, count);
> +       if (nr < BLKDEV_MIN_RQ)
> +               nr = BLKDEV_MIN_RQ;
> +
> +       spin_lock_irq(q->queue_lock);
> +       q->nr_group_requests = nr;
> +       spin_unlock_irq(q->queue_lock);
> +       return ret;
> +}
> +#endif
>
>  static ssize_t queue_ra_show(struct request_queue *q, char *page)
>  {
> @@ -228,6 +251,14 @@ static struct queue_sysfs_entry queue_re
>        .store = queue_requests_store,
>  };
>
> +#ifdef CONFIG_GROUP_IOSCHED
> +static struct queue_sysfs_entry queue_group_requests_entry = {
> +       .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
> +       .show = queue_group_requests_show,
> +       .store = queue_group_requests_store,
> +};
> +#endif
> +
>  static struct queue_sysfs_entry queue_ra_entry = {
>        .attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>        .show = queue_ra_show,
> @@ -308,6 +339,9 @@ static struct queue_sysfs_entry queue_sl
>
>  static struct attribute *default_attrs[] = {
>        &queue_requests_entry.attr,
> +#ifdef CONFIG_GROUP_IOSCHED
> +       &queue_group_requests_entry.attr,
> +#endif
>        &queue_ra_entry.attr,
>        &queue_max_hw_sectors_entry.attr,
>        &queue_max_sectors_entry.attr,
> @@ -389,12 +423,11 @@ static void blk_release_queue(struct kob
>  {
>        struct request_queue *q =
>                container_of(kobj, struct request_queue, kobj);
> -       struct request_list *rl = &q->rq;
>
>        blk_sync_queue(q);
>
> -       if (rl->rq_pool)
> -               mempool_destroy(rl->rq_pool);
> +       if (q->rq_data.rq_pool)
> +               mempool_destroy(q->rq_data.rq_pool);
>
>        if (q->queue_tags)
>                __blk_queue_free_tags(q);
> Index: linux9/block/blk-settings.c
> ===================================================================
> --- linux9.orig/block/blk-settings.c    2009-04-30 15:43:53.000000000 -0400
> +++ linux9/block/blk-settings.c 2009-04-30 16:18:29.000000000 -0400
> @@ -123,6 +123,9 @@ void blk_queue_make_request(struct reque
>         * set defaults
>         */
>        q->nr_requests = BLKDEV_MAX_RQ;
> +#ifdef CONFIG_GROUP_IOSCHED
> +       q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
> +#endif
>        blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
>        blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
>        blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
> Index: linux9/block/elevator-fq.c
> ===================================================================
> --- linux9.orig/block/elevator-fq.c     2009-04-30 16:18:27.000000000 -0400
> +++ linux9/block/elevator-fq.c  2009-04-30 16:18:29.000000000 -0400
> @@ -954,6 +954,17 @@ struct io_cgroup *cgroup_to_io_cgroup(st
>                            struct io_cgroup, css);
>  }
>
> +struct request_list *io_group_get_request_list(struct request_queue *q,
> +                                               struct bio *bio)
> +{
> +       struct io_group *iog;
> +
> +       iog = io_get_io_group_bio(q, bio, 1);
> +       BUG_ON(!iog);
> +out:
> +       return &iog->rl;
> +}
> +
>  /*
>  * Search the bfq_group for bfqd into the hash table (by now only a list)
>  * of bgrp.  Must be called under rcu_read_lock().
> @@ -1203,6 +1214,8 @@ struct io_group *io_group_chain_alloc(st
>                io_group_init_entity(iocg, iog);
>                iog->my_entity = &iog->entity;
>
> +               blk_init_request_list(&iog->rl);
> +
>                if (leaf == NULL) {
>                        leaf = iog;
>                        prev = leaf;
> @@ -1446,6 +1459,8 @@ struct io_group *io_alloc_root_group(str
>        for (i = 0; i < IO_IOPRIO_CLASSES; i++)
>                iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
>
> +       blk_init_request_list(&iog->rl);
> +
>        iocg = &io_root_cgroup;
>        spin_lock_irq(&iocg->lock);
>        rcu_assign_pointer(iog->key, key);
> Index: linux9/block/elevator-fq.h
> ===================================================================
> --- linux9.orig/block/elevator-fq.h     2009-04-30 16:18:27.000000000 -0400
> +++ linux9/block/elevator-fq.h  2009-04-30 16:18:29.000000000 -0400
> @@ -239,8 +239,14 @@ struct io_group {
>
>        /* Single ioq per group, used for noop, deadline, anticipatory */
>        struct io_queue *ioq;
> +
> +       /* request list associated with the group */
> +       struct request_list rl;
>  };
>
> +#define IOG_FLAG_READFULL      1       /* read queue has been filled */
> +#define IOG_FLAG_WRITEFULL     2       /* write queue has been filled */
> +
>  /**
>  * struct bfqio_cgroup - bfq cgroup data structure.
>  * @css: subsystem state for bfq in the containing cgroup.
> @@ -517,6 +523,8 @@ extern void elv_fq_unset_request_ioq(str
>  extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
>  extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
>                                                struct bio *bio);
> +extern struct request_list *io_group_get_request_list(struct request_queue *q,
> +                                               struct bio *bio);
>
>  /* Returns single ioq associated with the io group. */
>  static inline struct io_queue *io_group_ioq(struct io_group *iog)
>
> Thanks
> Vivek
>
>> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
>> ---
>> block/blk-core.c    |   36 +++++++--
>> block/blk-sysfs.c   |   22 ++++--
>> block/elevator-fq.c |  133 ++++++++++++++++++++++++++++++++--
>> block/elevator-fq.h |  201 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 371 insertions(+), 21 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 29bcfac..21023f7 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -705,11 +705,15 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
>> static void __freed_request(struct request_queue *q, int rw)
>> {
>>       struct request_list *rl = &q->rq;
>> -
>> -     if (rl->count[rw] < queue_congestion_off_threshold(q))
>> +     struct io_group *congested_iog, *full_iog;
>> +
>> +     congested_iog = io_congested_io_group(q, rw);
>> +     if (rl->count[rw] < queue_congestion_off_threshold(q) &&
>> +         !congested_iog)
>>               blk_clear_queue_congested(q, rw);
>>
>> -     if (rl->count[rw] + 1 <= q->nr_requests) {
>> +     full_iog = io_full_io_group(q, rw);
>> +     if (rl->count[rw] + 1 <= q->nr_requests && !full_iog) {
>>               if (waitqueue_active(&rl->wait[rw]))
>>                       wake_up(&rl->wait[rw]);
>>
>> @@ -721,13 +725,16 @@ static void __freed_request(struct request_queue *q, int rw)
>>  * A request has just been released.  Account for it, update the full and
>>  * congestion status, wake up any waiters.   Called under q->queue_lock.
>>  */
>> -static void freed_request(struct request_queue *q, int rw, int priv)
>> +static void freed_request(struct request_queue *q, struct io_group *iog,
>> +                       int rw, int priv)
>> {
>>       struct request_list *rl = &q->rq;
>>
>>       rl->count[rw]--;
>>       if (priv)
>>               rl->elvpriv--;
>> +     if (iog)
>> +             io_group_dec_count(iog, rw);
>>
>>       __freed_request(q, rw);
>>
>> @@ -746,16 +753,21 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>> {
>>       struct request *rq = NULL;
>>       struct request_list *rl = &q->rq;
>> +     struct io_group *iog;
>>       struct io_context *ioc = NULL;
>>       const int rw = rw_flags & 0x01;
>>       int may_queue, priv;
>>
>> +     iog = __io_get_io_group(q);
>> +
>>       may_queue = elv_may_queue(q, rw_flags);
>>       if (may_queue == ELV_MQUEUE_NO)
>>               goto rq_starved;
>>
>> -     if (rl->count[rw]+1 >= queue_congestion_on_threshold(q)) {
>> -             if (rl->count[rw]+1 >= q->nr_requests) {
>> +     if (rl->count[rw]+1 >= queue_congestion_on_threshold(q) ||
>> +         io_group_congestion_on(iog, rw)) {
>> +             if (rl->count[rw]+1 >= q->nr_requests ||
>> +                 io_group_full(iog, rw)) {
>>                       ioc = current_io_context(GFP_ATOMIC, q->node);
>>                       /*
>>                        * The queue will fill after this allocation, so set
>> @@ -789,8 +801,15 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>       if (rl->count[rw] >= (3 * q->nr_requests / 2))
>>               goto out;
>>
>> +     if (iog)
>> +             if (io_group_count(iog, rw) >=
>> +                (3 * io_group_nr_requests(iog) / 2))
>> +                     goto out;
>> +
>>       rl->count[rw]++;
>>       rl->starved[rw] = 0;
>> +     if (iog)
>> +             io_group_inc_count(iog, rw);
>>
>>       priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
>>       if (priv)
>> @@ -808,7 +827,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>                * wait queue, but this is pretty rare.
>>                */
>>               spin_lock_irq(q->queue_lock);
>> -             freed_request(q, rw, priv);
>> +             freed_request(q, iog, rw, priv);
>>
>>               /*
>>                * in the very unlikely event that allocation failed and no
>> @@ -1073,12 +1092,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
>>       if (req->cmd_flags & REQ_ALLOCED) {
>>               int rw = rq_data_dir(req);
>>               int priv = req->cmd_flags & REQ_ELVPRIV;
>> +             struct io_group *iog = io_request_io_group(req);
>>
>>               BUG_ON(!list_empty(&req->queuelist));
>>               BUG_ON(!hlist_unhashed(&req->hash));
>>
>>               blk_free_request(q, req);
>> -             freed_request(q, rw, priv);
>> +             freed_request(q, iog, rw, priv);
>>       }
>> }
>> EXPORT_SYMBOL_GPL(__blk_put_request);
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 0d98c96..af5191c 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -40,6 +40,7 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
>> {
>>       struct request_list *rl = &q->rq;
>>       unsigned long nr;
>> +     int iog_congested[2], iog_full[2];
>>       int ret = queue_var_store(&nr, page, count);
>>       if (nr < BLKDEV_MIN_RQ)
>>               nr = BLKDEV_MIN_RQ;
>> @@ -47,27 +48,32 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
>>       spin_lock_irq(q->queue_lock);
>>       q->nr_requests = nr;
>>       blk_queue_congestion_threshold(q);
>> +     io_group_set_nrq_all(q, nr, iog_congested, iog_full);
>>
>> -     if (rl->count[READ] >= queue_congestion_on_threshold(q))
>> +     if (rl->count[READ] >= queue_congestion_on_threshold(q) ||
>> +         iog_congested[READ])
>>               blk_set_queue_congested(q, READ);
>> -     else if (rl->count[READ] < queue_congestion_off_threshold(q))
>> +     else if (rl->count[READ] < queue_congestion_off_threshold(q) &&
>> +              !iog_congested[READ])
>>               blk_clear_queue_congested(q, READ);
>>
>> -     if (rl->count[WRITE] >= queue_congestion_on_threshold(q))
>> +     if (rl->count[WRITE] >= queue_congestion_on_threshold(q) ||
>> +         iog_congested[WRITE])
>>               blk_set_queue_congested(q, WRITE);
>> -     else if (rl->count[WRITE] < queue_congestion_off_threshold(q))
>> +     else if (rl->count[WRITE] < queue_congestion_off_threshold(q) &&
>> +              !iog_congested[WRITE])
>>               blk_clear_queue_congested(q, WRITE);
>>
>> -     if (rl->count[READ] >= q->nr_requests) {
>> +     if (rl->count[READ] >= q->nr_requests || iog_full[READ]) {
>>               blk_set_queue_full(q, READ);
>> -     } else if (rl->count[READ]+1 <= q->nr_requests) {
>> +     } else if (rl->count[READ]+1 <= q->nr_requests && !iog_full[READ]) {
>>               blk_clear_queue_full(q, READ);
>>               wake_up(&rl->wait[READ]);
>>       }
>>
>> -     if (rl->count[WRITE] >= q->nr_requests) {
>> +     if (rl->count[WRITE] >= q->nr_requests || iog_full[WRITE]) {
>>               blk_set_queue_full(q, WRITE);
>> -     } else if (rl->count[WRITE]+1 <= q->nr_requests) {
>> +     } else if (rl->count[WRITE]+1 <= q->nr_requests && !iog_full[WRITE]) {
>>               blk_clear_queue_full(q, WRITE);
>>               wake_up(&rl->wait[WRITE]);
>>       }
>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>> index df53418..3b021f3 100644
>> --- a/block/elevator-fq.c
>> +++ b/block/elevator-fq.c
>> @@ -924,6 +924,111 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>> }
>> EXPORT_SYMBOL(io_lookup_io_group_current);
>>
>> +/*
>> + * TODO
>> + * This is complete duplication of blk_queue_congestion_threshold()
>> + * except for the argument type and name.  Can we merge them?
>> + */
>> +static void io_group_nrq_congestion_threshold(struct io_group_nrq *nrq)
>> +{
>> +     int nr;
>> +
>> +     nr = nrq->nr_requests - (nrq->nr_requests / 8) + 1;
>> +     if (nr > nrq->nr_requests)
>> +             nr = nrq->nr_requests;
>> +     nrq->nr_congestion_on = nr;
>> +
>> +     nr = nrq->nr_requests - (nrq->nr_requests / 8)
>> +             - (nrq->nr_requests / 16) - 1;
>> +     if (nr < 1)
>> +             nr = 1;
>> +     nrq->nr_congestion_off = nr;
>> +}
>> +
>> +static void io_group_set_nrq(struct io_group_nrq *nrq, int nr_requests,
>> +                      int *congested, int *full)
>> +{
>> +     int i;
>> +
>> +     BUG_ON(nr_requests < 0);
>> +
>> +     nrq->nr_requests = nr_requests;
>> +     io_group_nrq_congestion_threshold(nrq);
>> +
>> +     for (i=0; i<2; i++) {
>> +             if (nrq->count[i] >= nrq->nr_congestion_on)
>> +                     congested[i] = 1;
>> +             else if (nrq->count[i] < nrq->nr_congestion_off)
>> +                     congested[i] = 0;
>> +
>> +             if (nrq->count[i] >= nrq->nr_requests)
>> +                     full[i] = 1;
>> +             else if (nrq->count[i]+1 <= nrq->nr_requests)
>> +                     full[i] = 0;
>> +     }
>> +}
>> +
>> +void io_group_set_nrq_all(struct request_queue *q, int nr,
>> +                         int *congested, int *full)
>> +{
>> +     struct elv_fq_data *efqd = &q->elevator->efqd;
>> +     struct hlist_head *head = &efqd->group_list;
>> +     struct io_group *root = efqd->root_group;
>> +     struct hlist_node *n;
>> +     struct io_group *iog;
>> +     struct io_group_nrq *nrq;
>> +     int nrq_congested[2];
>> +     int nrq_full[2];
>> +     int i;
>> +
>> +     for (i=0; i<2; i++)
>> +             *(congested + i) = *(full + i) = 0;
>> +
>> +     nrq = &root->nrq;
>> +     io_group_set_nrq(nrq, nr, nrq_congested, nrq_full);
>> +     for (i=0; i<2; i++) {
>> +             *(congested + i) |= nrq_congested[i];
>> +             *(full + i) |= nrq_full[i];
>> +     }
>> +
>> +     hlist_for_each_entry(iog, n, head, elv_data_node) {
>> +             nrq = &iog->nrq;
>> +             io_group_set_nrq(nrq, nr, nrq_congested, nrq_full);
>> +             for (i=0; i<2; i++) {
>> +                     *(congested + i) |= nrq_congested[i];
>> +                     *(full + i) |= nrq_full[i];
>> +             }
>> +     }
>> +}
>> +
>> +struct io_group *io_congested_io_group(struct request_queue *q, int rw)
>> +{
>> +     struct hlist_head *head = &q->elevator->efqd.group_list;
>> +     struct hlist_node *n;
>> +     struct io_group *iog;
>> +
>> +     hlist_for_each_entry(iog, n, head, elv_data_node) {
>> +             struct io_group_nrq *nrq = &iog->nrq;
>> +             if (nrq->count[rw] >= nrq->nr_congestion_off)
>> +                     return iog;
>> +     }
>> +     return NULL;
>> +}
>> +
>> +struct io_group *io_full_io_group(struct request_queue *q, int rw)
>> +{
>> +     struct hlist_head *head = &q->elevator->efqd.group_list;
>> +     struct hlist_node *n;
>> +     struct io_group *iog;
>> +
>> +     hlist_for_each_entry(iog, n, head, elv_data_node) {
>> +             struct io_group_nrq *nrq = &iog->nrq;
>> +             if (nrq->count[rw] >= nrq->nr_requests)
>> +                     return iog;
>> +     }
>> +     return NULL;
>> +}
>> +
>> void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
>> {
>>       struct io_entity *entity = &iog->entity;
>> @@ -934,6 +1039,12 @@ void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
>>       entity->my_sched_data = &iog->sched_data;
>> }
>>
>> +static void io_group_init_nrq(struct request_queue *q, struct io_group_nrq *nrq)
>> +{
>> +     nrq->nr_requests = q->nr_requests;
>> +     io_group_nrq_congestion_threshold(nrq);
>> +}
>> +
>> void io_group_set_parent(struct io_group *iog, struct io_group *parent)
>> {
>>       struct io_entity *entity;
>> @@ -1053,6 +1164,8 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
>>               io_group_init_entity(iocg, iog);
>>               iog->my_entity = &iog->entity;
>>
>> +             io_group_init_nrq(q, &iog->nrq);
>> +
>>               if (leaf == NULL) {
>>                       leaf = iog;
>>                       prev = leaf;
>> @@ -1176,7 +1289,7 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
>>  * Generic function to make sure cgroup hierarchy is all setup once a request
>>  * from a cgroup is received by the io scheduler.
>>  */
>> -struct io_group *io_get_io_group(struct request_queue *q)
>> +struct io_group *__io_get_io_group(struct request_queue *q)
>> {
>>       struct cgroup *cgroup;
>>       struct io_group *iog;
>> @@ -1192,6 +1305,19 @@ struct io_group *io_get_io_group(struct request_queue *q)
>>       return iog;
>> }
>>
>> +struct io_group *io_get_io_group(struct request_queue *q)
>> +{
>> +     struct io_group *iog;
>> +     unsigned long flags;
>> +
>> +     spin_lock_irqsave(q->queue_lock, flags);
>> +     iog = __io_get_io_group(q);
>> +     spin_unlock_irqrestore(q->queue_lock, flags);
>> +     BUG_ON(!iog);
>> +
>> +     return iog;
>> +}
>> +
>> void io_free_root_group(struct elevator_queue *e)
>> {
>>       struct io_cgroup *iocg = &io_root_cgroup;
>> @@ -1220,6 +1346,7 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
>>       iog->entity.parent = NULL;
>>       for (i = 0; i < IO_IOPRIO_CLASSES; i++)
>>               iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
>> +     io_group_init_nrq(q, &iog->nrq);
>>
>>       iocg = &io_root_cgroup;
>>       spin_lock_irq(&iocg->lock);
>> @@ -1533,15 +1660,11 @@ void elv_fq_set_request_io_group(struct request_queue *q,
>>                                               struct request *rq)
>> {
>>       struct io_group *iog;
>> -     unsigned long flags;
>>
>>       /* Make sure io group hierarchy has been setup and also set the
>>        * io group to which rq belongs. Later we should make use of
>>        * bio cgroup patches to determine the io group */
>> -     spin_lock_irqsave(q->queue_lock, flags);
>>       iog = io_get_io_group(q);
>> -     spin_unlock_irqrestore(q->queue_lock, flags);
>> -     BUG_ON(!iog);
>>
>>       /* Store iog in rq. TODO: take care of referencing */
>>       rq->iog = iog;
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index fc4110d..f8eabd4 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -187,6 +187,22 @@ struct io_queue {
>>
>> #ifdef CONFIG_GROUP_IOSCHED
>> /**
>> + * struct io_group_nrq - structure to store allocated requests info
>> + * @nr_requests: maximum num of requests for the io_group
>> + * @nr_congestion_on: threshold to determine the io_group is congested.
>> + * @nr_congestion_off: threshold to determine the io_group is not congested.
>> + * @count: num of allocated requests.
>> + *
>> + * All fields are protected by queue_lock.
>> + */
>> +struct io_group_nrq {
>> +     unsigned long nr_requests;
>> +     unsigned int nr_congestion_on;
>> +     unsigned int nr_congestion_off;
>> +     int count[2];
>> +};
>> +
>> +/**
>>  * struct bfq_group - per (device, cgroup) data structure.
>>  * @entity: schedulable entity to insert into the parent group sched_data.
>>  * @sched_data: own sched_data, to contain child entities (they may be
>> @@ -235,6 +251,8 @@ struct io_group {
>>
>>       /* Single ioq per group, used for noop, deadline, anticipatory */
>>       struct io_queue *ioq;
>> +
>> +     struct io_group_nrq nrq;
>> };
>>
>> /**
>> @@ -469,6 +487,11 @@ extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
>> extern void elv_fq_unset_request_ioq(struct request_queue *q,
>>                                       struct request *rq);
>> extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
>> +extern void io_group_set_nrq_all(struct request_queue *q, int nr,
>> +                         int *congested, int *full);
>> +extern struct io_group *io_congested_io_group(struct request_queue *q, int rw);
>> +extern struct io_group *io_full_io_group(struct request_queue *q, int rw);
>> +extern struct io_group *__io_get_io_group(struct request_queue *q);
>>
>> /* Returns single ioq associated with the io group. */
>> static inline struct io_queue *io_group_ioq(struct io_group *iog)
>> @@ -486,6 +509,52 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
>>       iog->ioq = ioq;
>> }
>>
>> +static inline struct io_group *io_request_io_group(struct request *rq)
>> +{
>> +     return rq->iog;
>> +}
>> +
>> +static inline unsigned long io_group_nr_requests(struct io_group *iog)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.nr_requests;
>> +}
>> +
>> +static inline int io_group_inc_count(struct io_group *iog, int rw)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.count[rw]++;
>> +}
>> +
>> +static inline int io_group_dec_count(struct io_group *iog, int rw)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.count[rw]--;
>> +}
>> +
>> +static inline int io_group_count(struct io_group *iog, int rw)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.count[rw];
>> +}
>> +
>> +static inline int io_group_congestion_on(struct io_group *iog, int rw)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.count[rw] + 1 >= iog->nrq.nr_congestion_on;
>> +}
>> +
>> +static inline int io_group_congestion_off(struct io_group *iog, int rw)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.count[rw] < iog->nrq.nr_congestion_off;
>> +}
>> +
>> +static inline int io_group_full(struct io_group *iog, int rw)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.count[rw] + 1 >= iog->nrq.nr_requests;
>> +}
>> #else /* !GROUP_IOSCHED */
>> /*
>>  * No ioq movement is needed in case of flat setup. root io group gets cleaned
>> @@ -537,6 +606,71 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
>>       return NULL;
>> }
>>
>> +static inline void io_group_set_nrq_all(struct request_queue *q, int nr,
>> +                                     int *congested, int *full)
>> +{
>> +     int i;
>> +     for (i=0; i<2; i++)
>> +             *(congested + i) = *(full + i) = 0;
>> +}
>> +
>> +static inline struct io_group *
>> +io_congested_io_group(struct request_queue *q, int rw)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline struct io_group *
>> +io_full_io_group(struct request_queue *q, int rw)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline struct io_group *__io_get_io_group(struct request_queue *q)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline struct io_group *io_request_io_group(struct request *rq)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline unsigned long io_group_nr_requests(struct io_group *iog)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_inc_count(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_dec_count(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_count(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_congestion_on(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_congestion_off(struct io_group *iog, int rw)
>> +{
>> +     return 1;
>> +}
>> +
>> +static inline int io_group_full(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> #endif /* GROUP_IOSCHED */
>>
>> /* Functions used by blksysfs.c */
>> @@ -589,6 +723,9 @@ extern void elv_free_ioq(struct io_queue *ioq);
>>
>> #else /* CONFIG_ELV_FAIR_QUEUING */
>>
>> +struct io_group {
>> +};
>> +
>> static inline int elv_init_fq_data(struct request_queue *q,
>>                                       struct elevator_queue *e)
>> {
>> @@ -655,5 +792,69 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
>>       return NULL;
>> }
>>
>> +static inline void io_group_set_nrq_all(struct request_queue *q, int nr,
>> +                                     int *congested, int *full)
>> +{
>> +     int i;
>> +     for (i=0; i<2; i++)
>> +             *(congested + i) = *(full + i) = 0;
>> +}
>> +
>> +static inline struct io_group *
>> +io_congested_io_group(struct request_queue *q, int rw)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline struct io_group *
>> +io_full_io_group(struct request_queue *q, int rw)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline struct io_group *__io_get_io_group(struct request_queue *q)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline struct io_group *io_request_io_group(struct request *rq)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline unsigned long io_group_nr_requests(struct io_group *iog)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_inc_count(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_dec_count(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_count(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_congestion_on(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_congestion_off(struct io_group *iog, int rw)
>> +{
>> +     return 1;
>> +}
>> +
>> +static inline int io_group_full(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> #endif /* CONFIG_ELV_FAIR_QUEUING */
>> #endif /* _BFQ_SCHED_H */
>> --
>> 1.5.4.3
>>
>>
>> --
>> IKEDA, Munehiro
>> NEC Corporation of America
>>   m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org
>>
>

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO Controller per cgroup request descriptors (Re: [PATCH 01/10]  Documentation)
  2009-05-01 22:45                 ` Vivek Goyal
  (?)
@ 2009-05-01 23:39                 ` Nauman Rafique
  2009-05-04 17:18                   ` IKEDA, Munehiro
       [not found]                   ` <e98e18940905011639o63c048f1n79c7e7648441a06d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  -1 siblings, 2 replies; 190+ messages in thread
From: Nauman Rafique @ 2009-05-01 23:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: IKEDA, Munehiro, Balbir Singh, oz-kernel, paolo.valente,
	linux-kernel, dhaval, containers, menage, jmoyer, fchecconi,
	arozansk, jens.axboe, akpm, fernando, Andrea Righi, Ryo Tsuruta,
	Divyesh Shah, Gui Jianfeng

On Fri, May 1, 2009 at 3:45 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, May 01, 2009 at 06:04:39PM -0400, IKEDA, Munehiro wrote:
>> Vivek Goyal wrote:
>>>>> +TODO
>>>>> +====
>>>>> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
>>>>> +- Convert cgroup ioprio to notion of weight.
>>>>> +- Anticipatory code will need more work. It is not working properly currently
>>>>> +  and needs more thought.
>>>> What are the problems with the code?
>>>
>>> I have not had a chance to look into the issues in detail yet. A crude run
>>> showed a drop in performance. I will debug it once I have async writes
>>> handled...
>>>
>>>>> +- Use of bio-cgroup patches.
>>>> I saw these posted as well
>>>>
>>>>> +- Use of Nauman's per cgroup request descriptor patches.
>>>>> +
>>>> More details would be nice, I am not sure I understand
>>>
>>> Currently the number of request descriptors which can be allocated per
>>> device/request queue is fixed by a sysfs tunable (q->nr_requests). So
>>> if there is lots of IO going on from one cgroup, it will consume all
>>> the available request descriptors and other cgroups might starve and not
>>> get their fair share.
>>>
>>> Hence we also need to introduce the notion of a request descriptor limit per
>>> cgroup, so that if the request descriptors of one group are exhausted, it
>>> does not impact the IO of other cgroups.
>>
>> Unfortunately I couldn't find Nauman's patches and have never seen them.
>> So I tried to make the patch below against this todo.  The reason why
>> I'm posting it even though it is just a quick and ugly hack (and it
>> might be a reinvention of the wheel) is that I would like to discuss how
>> we should define the per-cgroup request limit.
>> This patch should be applied on top of Vivek's I/O controller patches
>> posted on Mar 11.
>
> Hi IKEDA,
>
> Sorry for the confusion here. Actually Nauman had sent a patch to a select
> group of people who were initially copied on the mail thread.

I am sorry about that. Since I dropped my whole patch set in favor of
Vivek's stuff, this stuff fell through the cracks.

>
>>
>> This patch temporarily distributes q->nr_requests to each cgroup.
>> I think the number should be weighted like BFQ's budget.  But in
>> that case, if the cgroup hierarchy is deep, leaf cgroups would be
>> allowed to allocate only a very small number of requests.  I don't
>> think this is reasonable... but I don't have a specific idea to solve
>> this problem.  Does anyone have a good idea?
>>
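
To make the concern above concrete, here is a minimal userspace sketch (not
part of any posted patch) of the arithmetic: if q->nr_requests were simply
divided equally among the children at every level of the hierarchy, a leaf's
share would shrink geometrically with depth. The pool size of 128 (the stock
BLKDEV_MAX_RQ default) and the fanout of 4 are example numbers only.

#include <stdio.h>

int main(void)
{
        unsigned long nr_requests = 128;        /* example: default pool size */
        unsigned int fanout = 4;                /* example: children per group */
        unsigned int depth, d;

        for (depth = 0; depth <= 3; depth++) {
                unsigned long share = nr_requests;

                /* divide the parent's share equally at each level */
                for (d = 0; d < depth; d++)
                        share /= fanout;

                printf("depth %u: %lu request descriptors per group\n",
                       depth, share);
        }
        return 0;
}

Three levels of nesting already leave each leaf with only 2 descriptors in
this example, which is the problem described above.
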
>
> Thanks for the patch. Yes, ideally one would expect the request descriptors
> to be allocated in proportion to the weight as well, but I guess that would
> become very complicated.
>
> In terms of simpler things, two thoughts come to mind.
>
> - First approach is to make q->nr_requests per group. So every group is
>  entitled to q->nr_requests as set by the user. This is what your patch
>  seems to have done.
>
>  I had some concerns with this approach. First of all, it does not seem to
>  have an upper bound on the number of request descriptors allocated per
>  queue, because as a user creates more cgroups, the total number of request
>  descriptors increases.
>
> - Second approach is to retain the meaning of q->nr_requests, which defines
>  the total number of request descriptors on the queue (with the exception of
>  50% more descriptors for batching processes), and to define a new per-group
>  limit, q->nr_group_requests, which defines how many requests can be assigned
>  per group. So q->nr_requests defines the total pool size on the queue and
>  q->nr_group_requests defines how many requests each group can allocate out
>  of that pool.
>
>  Here the issue is that a user has to balance q->nr_group_requests and
>  q->nr_requests properly.
>
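For illustration only, here is a condensed model of the rule the second
approach enforces (these are not the kernel structures used in the patch
below; the names are hypothetical): an allocation must fit both under the
queue-wide pool q->nr_requests and under the group's q->nr_group_requests.

/* Hypothetical types for illustration; not the patched kernel structures. */
struct model_queue {
        int total_allocated;            /* descriptors handed out queue-wide */
        int nr_requests;                /* queue-wide pool size */
        int nr_group_requests;          /* per-group allotment */
};

struct model_group {
        int allocated;                  /* descriptors held by this group */
};

static int may_allocate(const struct model_queue *q,
                        const struct model_group *g)
{
        if (q->total_allocated + 1 > q->nr_requests)
                return 0;               /* queue-wide pool exhausted */
        if (g->allocated + 1 > q->nr_group_requests)
                return 0;               /* this group is at its limit */
        return 1;
}

In this simplified model, nr_requests=256 and nr_group_requests=64 mean at
most four groups can be fully backlogged at once; the real patch below adds
batching and starvation handling on top of this basic check, and exposes
q->nr_group_requests as an nr_group_requests sysfs attribute next to the
existing nr_requests one.
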
> To experiment, I have implemented the second approach. I am attaching the
> patch from my current tree. It probably will not apply on top of my March 11
> posting, since the patches have changed since then, but I am posting it here
> so that it at least gives an idea of the thought process.
>
> Ideas are welcome...

I had started with the first option, but the second option sounds good
too. But one problem that comes to mind is how we deal with
hierarchies. The sysadmin can limit the root-level cgroups to a
specific number of request descriptors, but if applications running in
a cgroup are allowed to create their own cgroups, then the total
request descriptors of all child cgroups should be capped by the
number assigned to the parent cgroup.
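
As a rough sketch of one way to express that capping (the io_group in the
posted patches does not have these fields; the parent pointer and the
per-group limit below are hypothetical), a child's effective limit could be
clamped by every ancestor on the path to the root. Note that this clamps each
child individually, which is a weaker policy than capping the sum over all
children as suggested above.

/* Hypothetical structure for illustration; not the patched io_group. */
struct grp {
        struct grp *parent;                     /* NULL for the root group */
        unsigned long nr_group_requests;        /* limit set for this group */
};

static unsigned long effective_group_limit(const struct grp *g)
{
        unsigned long limit = g->nr_group_requests;

        /* a child is never allowed more than any of its ancestors */
        for (g = g->parent; g != NULL; g = g->parent)
                if (g->nr_group_requests < limit)
                        limit = g->nr_group_requests;

        return limit;
}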

>
> Thanks
> Vivek
>
> o Currently a request queue has a fixed number of request descriptors for
>  sync and async requests. Once the request descriptors are consumed, new
>  processes are put to sleep and they effectively become serialized. Because
>  sync and async queues are separate, async requests don't impact sync ones,
>  but if one is looking for fairness between async requests, that is not
>  achievable when request queue descriptors become the bottleneck.
>
> o Make request descriptors per io group so that if there is lots of IO
>  going on in one cgroup, it does not impact the IO of other groups.
>
> o This patch implements per-cgroup request descriptors. The request pool per
>  queue is still common, but every group has its own wait list and its own
>  count of request descriptors allocated to that group for sync and async
>  queues. So effectively request_list becomes a per io group property and not
>  a global request queue feature.
>
> o Currently one can define q->nr_requests to limit the request descriptors
>  allocated for the queue. Now there is another tunable, q->nr_group_requests,
>  which controls the request descriptor limit per group. q->nr_requests
>  supersedes q->nr_group_requests to make sure that if there are lots of
>  groups present, we don't end up allocating too many request descriptors on
>  the queue.
>
> o Issues: Currently the notion of congestion is per queue. With per-group
>  request descriptors it is possible that the queue is not congested but the
>  group the bio will go into is congested.
>
> Signed-off-by: Nauman Rafique <nauman@google.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>
> ---
>  block/blk-core.c       |  216 ++++++++++++++++++++++++++++++++++---------------
>  block/blk-settings.c   |    3
>  block/blk-sysfs.c      |   57 ++++++++++--
>  block/elevator-fq.c    |   15 +++
>  block/elevator-fq.h    |    8 +
>  block/elevator.c       |    6 -
>  include/linux/blkdev.h |   62 +++++++++++++-
>  7 files changed, 287 insertions(+), 80 deletions(-)
>
> Index: linux9/include/linux/blkdev.h
> ===================================================================
> --- linux9.orig/include/linux/blkdev.h  2009-04-30 15:43:53.000000000 -0400
> +++ linux9/include/linux/blkdev.h       2009-04-30 16:18:29.000000000 -0400
> @@ -32,21 +32,51 @@ struct request;
>  struct sg_io_hdr;
>
>  #define BLKDEV_MIN_RQ  4
> +
> +#ifdef CONFIG_GROUP_IOSCHED
> +#define BLKDEV_MAX_RQ  256     /* Default maximum */
> +#define BLKDEV_MAX_GROUP_RQ    64      /* Default maximum */
> +#else
>  #define BLKDEV_MAX_RQ  128     /* Default maximum */
> +/*
> + * This is equivalent to the case of only one group present (root group). Let
> + * it consume all the request descriptors available on the queue .
> + */
> +#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
> +#endif
>
>  struct request;
>  typedef void (rq_end_io_fn)(struct request *, int);
>
>  struct request_list {
>        /*
> -        * count[], starved[], and wait[] are indexed by
> +        * count[], starved and wait[] are indexed by
>         * BLK_RW_SYNC/BLK_RW_ASYNC
>         */
>        int count[2];
>        int starved[2];
> +       wait_queue_head_t wait[2];
> +};
> +
> +/*
> + * This data structures keeps track of mempool of requests for the queue
> + * and some overall statistics.
> + */
> +struct request_data {
> +       /*
> +        * Per queue request descriptor count. This is in addition to per
> +        * cgroup count
> +        */
> +       int count[2];
>        int elvpriv;
>        mempool_t *rq_pool;
> -       wait_queue_head_t wait[2];
> +       int starved;
> +       /*
> +        * Global list for starved tasks. A task will be queued here if
> +        * it could not allocate request descriptor and the associated
> +        * group request list does not have any requests pending.
> +        */
> +       wait_queue_head_t starved_wait;
>  };
>
>  /*
> @@ -251,6 +281,7 @@ struct request {
>  #ifdef CONFIG_GROUP_IOSCHED
>        /* io group request belongs to */
>        struct io_group *iog;
> +       struct request_list *rl;
>  #endif /* GROUP_IOSCHED */
>  #endif /* ELV_FAIR_QUEUING */
>  };
> @@ -340,6 +371,9 @@ struct request_queue
>         */
>        struct request_list     rq;
>
> +       /* Contains request pool and other data like starved data */
> +       struct request_data     rq_data;
> +
>        request_fn_proc         *request_fn;
>        make_request_fn         *make_request_fn;
>        prep_rq_fn              *prep_rq_fn;
> @@ -402,6 +436,8 @@ struct request_queue
>         * queue settings
>         */
>        unsigned long           nr_requests;    /* Max # of requests */
> +       /* Max # of per io group requests */
> +       unsigned long           nr_group_requests;
>        unsigned int            nr_congestion_on;
>        unsigned int            nr_congestion_off;
>        unsigned int            nr_batching;
> @@ -773,6 +809,28 @@ extern int scsi_cmd_ioctl(struct request
>  extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
>                         struct scsi_ioctl_command __user *);
>
> +extern void blk_init_request_list(struct request_list *rl);
> +
> +static inline struct request_list *blk_get_request_list(struct request_queue *q,
> +                                               struct bio *bio)
> +{
> +#ifdef CONFIG_GROUP_IOSCHED
> +       return io_group_get_request_list(q, bio);
> +#else
> +       return &q->rq;
> +#endif
> +}
> +
> +static inline struct request_list *rq_rl(struct request_queue *q,
> +                                               struct request *rq)
> +{
> +#ifdef CONFIG_GROUP_IOSCHED
> +       return rq->rl;
> +#else
> +       return blk_get_request_list(q, NULL);
> +#endif
> +}
> +
>  /*
>  * Temporary export, until SCSI gets fixed up.
>  */
> Index: linux9/block/elevator.c
> ===================================================================
> --- linux9.orig/block/elevator.c        2009-04-30 16:17:53.000000000 -0400
> +++ linux9/block/elevator.c     2009-04-30 16:18:29.000000000 -0400
> @@ -664,7 +664,7 @@ void elv_quiesce_start(struct request_qu
>         * make sure we don't have any requests in flight
>         */
>        elv_drain_elevator(q);
> -       while (q->rq.elvpriv) {
> +       while (q->rq_data.elvpriv) {
>                blk_start_queueing(q);
>                spin_unlock_irq(q->queue_lock);
>                msleep(10);
> @@ -764,8 +764,8 @@ void elv_insert(struct request_queue *q,
>        }
>
>        if (unplug_it && blk_queue_plugged(q)) {
> -               int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
> -                       - q->in_flight;
> +               int nrq = q->rq_data.count[BLK_RW_SYNC] +
> +                               q->rq_data.count[BLK_RW_ASYNC] - q->in_flight;
>
>                if (nrq >= q->unplug_thresh)
>                        __generic_unplug_device(q);
> Index: linux9/block/blk-core.c
> ===================================================================
> --- linux9.orig/block/blk-core.c        2009-04-30 16:17:53.000000000 -0400
> +++ linux9/block/blk-core.c     2009-04-30 16:18:29.000000000 -0400
> @@ -480,20 +480,31 @@ void blk_cleanup_queue(struct request_qu
>  }
>  EXPORT_SYMBOL(blk_cleanup_queue);
>
> -static int blk_init_free_list(struct request_queue *q)
> +void blk_init_request_list(struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
>
>        rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
> -       rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
> -       rl->elvpriv = 0;
>        init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
>        init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
> +}
>
> -       rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
> -                               mempool_free_slab, request_cachep, q->node);
> +static int blk_init_free_list(struct request_queue *q)
> +{
> +#ifndef CONFIG_GROUP_IOSCHED
> +       struct request_list *rl = blk_get_request_list(q, NULL);
> +
> +       /*
> +        * In case of group scheduling, request list is inside the associated
> +        * group and when that group is instantiated, it takes care of
> +        * initializing the request list also.
> +        */
> +       blk_init_request_list(rl);
> +#endif
> +       q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
> +                               mempool_alloc_slab, mempool_free_slab,
> +                               request_cachep, q->node);
>
> -       if (!rl->rq_pool)
> +       if (!q->rq_data.rq_pool)
>                return -ENOMEM;
>
>        return 0;
> @@ -590,6 +601,9 @@ blk_init_queue_node(request_fn_proc *rfn
>                return NULL;
>        }
>
> +       /* init starved waiter wait queue */
> +       init_waitqueue_head(&q->rq_data.starved_wait);
> +
>        /*
>         * if caller didn't supply a lock, they get per-queue locking with
>         * our embedded lock
> @@ -639,14 +653,14 @@ static inline void blk_free_request(stru
>  {
>        if (rq->cmd_flags & REQ_ELVPRIV)
>                elv_put_request(q, rq);
> -       mempool_free(rq, q->rq.rq_pool);
> +       mempool_free(rq, q->rq_data.rq_pool);
>  }
>
>  static struct request *
>  blk_alloc_request(struct request_queue *q, struct bio *bio, int rw, int priv,
>                                        gfp_t gfp_mask)
>  {
> -       struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
> +       struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
>
>        if (!rq)
>                return NULL;
> @@ -657,7 +671,7 @@ blk_alloc_request(struct request_queue *
>
>        if (priv) {
>                if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
> -                       mempool_free(rq, q->rq.rq_pool);
> +                       mempool_free(rq, q->rq_data.rq_pool);
>                        return NULL;
>                }
>                rq->cmd_flags |= REQ_ELVPRIV;
> @@ -700,18 +714,18 @@ static void ioc_set_batching(struct requ
>        ioc->last_waited = jiffies;
>  }
>
> -static void __freed_request(struct request_queue *q, int sync)
> +static void __freed_request(struct request_queue *q, int sync,
> +                                       struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
> -
> -       if (rl->count[sync] < queue_congestion_off_threshold(q))
> +       if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, sync);
>
> -       if (rl->count[sync] + 1 <= q->nr_requests) {
> +       if (q->rq_data.count[sync] + 1 <= q->nr_requests)
> +               blk_clear_queue_full(q, sync);
> +
> +       if (rl->count[sync] + 1 <= q->nr_group_requests) {
>                if (waitqueue_active(&rl->wait[sync]))
>                        wake_up(&rl->wait[sync]);
> -
> -               blk_clear_queue_full(q, sync);
>        }
>  }
>
> @@ -719,18 +733,29 @@ static void __freed_request(struct reque
>  * A request has just been released.  Account for it, update the full and
>  * congestion status, wake up any waiters.   Called under q->queue_lock.
>  */
> -static void freed_request(struct request_queue *q, int sync, int priv)
> +static void freed_request(struct request_queue *q, int sync, int priv,
> +                                       struct request_list *rl)
>  {
> -       struct request_list *rl = &q->rq;
> -
> +       BUG_ON(!rl->count[sync]);
>        rl->count[sync]--;
> +
> +       BUG_ON(!q->rq_data.count[sync]);
> +       q->rq_data.count[sync]--;
> +
>        if (priv)
> -               rl->elvpriv--;
> +               q->rq_data.elvpriv--;
>
> -       __freed_request(q, sync);
> +       __freed_request(q, sync, rl);
>
>        if (unlikely(rl->starved[sync ^ 1]))
> -               __freed_request(q, sync ^ 1);
> +               __freed_request(q, sync ^ 1, rl);
> +
> +       /* Wake up the starved process on global list, if any */
> +       if (unlikely(q->rq_data.starved)) {
> +               if (waitqueue_active(&q->rq_data.starved_wait))
> +                       wake_up(&q->rq_data.starved_wait);
> +               q->rq_data.starved--;
> +       }
>  }
>
>  /*
> @@ -739,10 +764,9 @@ static void freed_request(struct request
>  * Returns !NULL on success, with queue_lock *not held*.
>  */
>  static struct request *get_request(struct request_queue *q, int rw_flags,
> -                                  struct bio *bio, gfp_t gfp_mask)
> +                  struct bio *bio, gfp_t gfp_mask, struct request_list *rl)
>  {
>        struct request *rq = NULL;
> -       struct request_list *rl = &q->rq;
>        struct io_context *ioc = NULL;
>        const bool is_sync = rw_is_sync(rw_flags) != 0;
>        int may_queue, priv;
> @@ -751,31 +775,38 @@ static struct request *get_request(struc
>        if (may_queue == ELV_MQUEUE_NO)
>                goto rq_starved;
>
> -       if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
> -               if (rl->count[is_sync]+1 >= q->nr_requests) {
> -                       ioc = current_io_context(GFP_ATOMIC, q->node);
> -                       /*
> -                        * The queue will fill after this allocation, so set
> -                        * it as full, and mark this process as "batching".
> -                        * This process will be allowed to complete a batch of
> -                        * requests, others will be blocked.
> -                        */
> -                       if (!blk_queue_full(q, is_sync)) {
> -                               ioc_set_batching(q, ioc);
> -                               blk_set_queue_full(q, is_sync);
> -                       } else {
> -                               if (may_queue != ELV_MQUEUE_MUST
> -                                               && !ioc_batching(q, ioc)) {
> -                                       /*
> -                                        * The queue is full and the allocating
> -                                        * process is not a "batcher", and not
> -                                        * exempted by the IO scheduler
> -                                        */
> -                                       goto out;
> -                               }
> +       if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
> +               blk_set_queue_congested(q, is_sync);
> +
> +       /*
> +        * Looks like there is no user of queue full now.
> +        * Keeping it for time being.
> +        */
> +       if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
> +               blk_set_queue_full(q, is_sync);
> +
> +       if (rl->count[is_sync]+1 >= q->nr_group_requests) {
> +               ioc = current_io_context(GFP_ATOMIC, q->node);
> +               /*
> +                * The queue request descriptor group will fill after this
> +                * allocation, so set
> +                * it as full, and mark this process as "batching".
> +                * This process will be allowed to complete a batch of
> +                * requests, others will be blocked.
> +                */
> +               if (rl->count[is_sync] <= q->nr_group_requests)
> +                       ioc_set_batching(q, ioc);
> +               else {
> +                       if (may_queue != ELV_MQUEUE_MUST
> +                                       && !ioc_batching(q, ioc)) {
> +                               /*
> +                                * The queue is full and the allocating
> +                                * process is not a "batcher", and not
> +                                * exempted by the IO scheduler
> +                                */
> +                               goto out;
>                        }
>                }
> -               blk_set_queue_congested(q, is_sync);
>        }
>
>        /*
> @@ -783,19 +814,41 @@ static struct request *get_request(struc
>         * limit of requests, otherwise we could have thousands of requests
>         * allocated with any setting of ->nr_requests
>         */
> -       if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
> +
> +       if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
> +               goto out;
> +
> +       /*
> +        * Allocation of request is allowed from queue perspective. Now check
> +        * from per group request list
> +        */
> +
> +       if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
>                goto out;
>
>        rl->count[is_sync]++;
>        rl->starved[is_sync] = 0;
>
> +       q->rq_data.count[is_sync]++;
> +
>        priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
>        if (priv)
> -               rl->elvpriv++;
> +               q->rq_data.elvpriv++;
>
>        spin_unlock_irq(q->queue_lock);
>
>        rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
> +
> +#ifdef CONFIG_GROUP_IOSCHED
> +       if (rq) {
> +               /*
> +                * TODO. Implement group reference counting and take the
> +                * reference to the group to make sure group hence request
> +                * list does not go away till rq finishes.
> +                */
> +               rq->rl = rl;
> +       }
> +#endif
>        if (unlikely(!rq)) {
>                /*
>                 * Allocation failed presumably due to memory. Undo anything
> @@ -805,7 +858,7 @@ static struct request *get_request(struc
>                 * wait queue, but this is pretty rare.
>                 */
>                spin_lock_irq(q->queue_lock);
> -               freed_request(q, is_sync, priv);
> +               freed_request(q, is_sync, priv, rl);
>
>                /*
>                 * in the very unlikely event that allocation failed and no
> @@ -815,10 +868,26 @@ static struct request *get_request(struc
>                 * rq mempool into READ and WRITE
>                 */
>  rq_starved:
> -               if (unlikely(rl->count[is_sync] == 0))
> -                       rl->starved[is_sync] = 1;
> -
> -               goto out;
> +               if (unlikely(rl->count[is_sync] == 0)) {
> +                       /*
> +                        * If there is a request pending in other direction
> +                        * in same io group, then set the starved flag of
> +                        * the group request list. Otherwise, we need to
> +                        * make this process sleep in global starved list
> +                        * to make sure it will not sleep indefinitely.
> +                        */
> +                       if (rl->count[is_sync ^ 1] != 0) {
> +                               rl->starved[is_sync] = 1;
> +                               goto out;
> +                       } else {
> +                               /*
> +                                * It indicates to calling function to put
> +                                * task on global starved list. Not the best
> +                                * way
> +                                */
> +                               return ERR_PTR(-ENOMEM);
> +                       }
> +               }
>        }
>
>        /*
> @@ -846,15 +915,29 @@ static struct request *get_request_wait(
>  {
>        const bool is_sync = rw_is_sync(rw_flags) != 0;
>        struct request *rq;
> +       struct request_list *rl = blk_get_request_list(q, bio);
>
> -       rq = get_request(q, rw_flags, bio, GFP_NOIO);
> -       while (!rq) {
> +       rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
> +       while (!rq || (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM)) {
>                DEFINE_WAIT(wait);
>                struct io_context *ioc;
> -               struct request_list *rl = &q->rq;
>
> -               prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
> -                               TASK_UNINTERRUPTIBLE);
> +               if (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM) {
> +                       /*
> +                        * Task failed allocation and needs to wait and
> +                        * try again. There are no requests pending from
> +                        * the io group hence need to sleep on global
> +                        * wait queue. Most likely the allocation failed
> +                        * because of memory issues.
> +                        */
> +
> +                       q->rq_data.starved++;
> +                       prepare_to_wait_exclusive(&q->rq_data.starved_wait,
> +                                       &wait, TASK_UNINTERRUPTIBLE);
> +               } else {
> +                       prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
> +                                       TASK_UNINTERRUPTIBLE);
> +               }
>
>                trace_block_sleeprq(q, bio, rw_flags & 1);
>
> @@ -874,7 +957,12 @@ static struct request *get_request_wait(
>                spin_lock_irq(q->queue_lock);
>                finish_wait(&rl->wait[is_sync], &wait);
>
> -               rq = get_request(q, rw_flags, bio, GFP_NOIO);
> +               /*
> +                * After the sleep, check the rl again in case the cgroup the
> +                * bio belonged to is gone and it is mapped to the root group now
> +                */
> +               rl = blk_get_request_list(q, bio);
> +               rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
>        };
>
>        return rq;
> @@ -883,6 +971,7 @@ static struct request *get_request_wait(
>  struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
>  {
>        struct request *rq;
> +       struct request_list *rl = blk_get_request_list(q, NULL);
>
>        BUG_ON(rw != READ && rw != WRITE);
>
> @@ -890,7 +979,7 @@ struct request *blk_get_request(struct r
>        if (gfp_mask & __GFP_WAIT) {
>                rq = get_request_wait(q, rw, NULL);
>        } else {
> -               rq = get_request(q, rw, NULL, gfp_mask);
> +               rq = get_request(q, rw, NULL, gfp_mask, rl);
>                if (!rq)
>                        spin_unlock_irq(q->queue_lock);
>        }
> @@ -1073,12 +1162,13 @@ void __blk_put_request(struct request_qu
>        if (req->cmd_flags & REQ_ALLOCED) {
>                int is_sync = rq_is_sync(req) != 0;
>                int priv = req->cmd_flags & REQ_ELVPRIV;
> +               struct request_list *rl = rq_rl(q, req);
>
>                BUG_ON(!list_empty(&req->queuelist));
>                BUG_ON(!hlist_unhashed(&req->hash));
>
>                blk_free_request(q, req);
> -               freed_request(q, is_sync, priv);
> +               freed_request(q, is_sync, priv, rl);
>        }
>  }
>  EXPORT_SYMBOL_GPL(__blk_put_request);
> Index: linux9/block/blk-sysfs.c
> ===================================================================
> --- linux9.orig/block/blk-sysfs.c       2009-04-30 16:18:27.000000000 -0400
> +++ linux9/block/blk-sysfs.c    2009-04-30 16:18:29.000000000 -0400
> @@ -38,7 +38,7 @@ static ssize_t queue_requests_show(struc
>  static ssize_t
>  queue_requests_store(struct request_queue *q, const char *page, size_t count)
>  {
> -       struct request_list *rl = &q->rq;
> +       struct request_list *rl = blk_get_request_list(q, NULL);
>        unsigned long nr;
>        int ret = queue_var_store(&nr, page, count);
>        if (nr < BLKDEV_MIN_RQ)
> @@ -48,32 +48,55 @@ queue_requests_store(struct request_queu
>        q->nr_requests = nr;
>        blk_queue_congestion_threshold(q);
>
> -       if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
> +       if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
>                blk_set_queue_congested(q, BLK_RW_SYNC);
> -       else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
> +       else if (q->rq_data.count[BLK_RW_SYNC] <
> +                               queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, BLK_RW_SYNC);
>
> -       if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
> +       if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
>                blk_set_queue_congested(q, BLK_RW_ASYNC);
> -       else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
> +       else if (q->rq_data.count[BLK_RW_ASYNC] <
> +                               queue_congestion_off_threshold(q))
>                blk_clear_queue_congested(q, BLK_RW_ASYNC);
>
> -       if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
> +       if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
>                blk_set_queue_full(q, BLK_RW_SYNC);
> -       } else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
> +       } else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
>                blk_clear_queue_full(q, BLK_RW_SYNC);
>                wake_up(&rl->wait[BLK_RW_SYNC]);
>        }
>
> -       if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
> +       if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
>                blk_set_queue_full(q, BLK_RW_ASYNC);
> -       } else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
> +       } else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
>                blk_clear_queue_full(q, BLK_RW_ASYNC);
>                wake_up(&rl->wait[BLK_RW_ASYNC]);
>        }
>        spin_unlock_irq(q->queue_lock);
>        return ret;
>  }
> +#ifdef CONFIG_GROUP_IOSCHED
> +static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
> +{
> +       return queue_var_show(q->nr_group_requests, (page));
> +}
> +
> +static ssize_t
> +queue_group_requests_store(struct request_queue *q, const char *page,
> +                                       size_t count)
> +{
> +       unsigned long nr;
> +       int ret = queue_var_store(&nr, page, count);
> +       if (nr < BLKDEV_MIN_RQ)
> +               nr = BLKDEV_MIN_RQ;
> +
> +       spin_lock_irq(q->queue_lock);
> +       q->nr_group_requests = nr;
> +       spin_unlock_irq(q->queue_lock);
> +       return ret;
> +}
> +#endif
>
>  static ssize_t queue_ra_show(struct request_queue *q, char *page)
>  {
> @@ -228,6 +251,14 @@ static struct queue_sysfs_entry queue_re
>        .store = queue_requests_store,
>  };
>
> +#ifdef CONFIG_GROUP_IOSCHED
> +static struct queue_sysfs_entry queue_group_requests_entry = {
> +       .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
> +       .show = queue_group_requests_show,
> +       .store = queue_group_requests_store,
> +};
> +#endif
> +
>  static struct queue_sysfs_entry queue_ra_entry = {
>        .attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>        .show = queue_ra_show,
> @@ -308,6 +339,9 @@ static struct queue_sysfs_entry queue_sl
>
>  static struct attribute *default_attrs[] = {
>        &queue_requests_entry.attr,
> +#ifdef CONFIG_GROUP_IOSCHED
> +       &queue_group_requests_entry.attr,
> +#endif
>        &queue_ra_entry.attr,
>        &queue_max_hw_sectors_entry.attr,
>        &queue_max_sectors_entry.attr,
> @@ -389,12 +423,11 @@ static void blk_release_queue(struct kob
>  {
>        struct request_queue *q =
>                container_of(kobj, struct request_queue, kobj);
> -       struct request_list *rl = &q->rq;
>
>        blk_sync_queue(q);
>
> -       if (rl->rq_pool)
> -               mempool_destroy(rl->rq_pool);
> +       if (q->rq_data.rq_pool)
> +               mempool_destroy(q->rq_data.rq_pool);
>
>        if (q->queue_tags)
>                __blk_queue_free_tags(q);
> Index: linux9/block/blk-settings.c
> ===================================================================
> --- linux9.orig/block/blk-settings.c    2009-04-30 15:43:53.000000000 -0400
> +++ linux9/block/blk-settings.c 2009-04-30 16:18:29.000000000 -0400
> @@ -123,6 +123,9 @@ void blk_queue_make_request(struct reque
>         * set defaults
>         */
>        q->nr_requests = BLKDEV_MAX_RQ;
> +#ifdef CONFIG_GROUP_IOSCHED
> +       q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
> +#endif
>        blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
>        blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
>        blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
> Index: linux9/block/elevator-fq.c
> ===================================================================
> --- linux9.orig/block/elevator-fq.c     2009-04-30 16:18:27.000000000 -0400
> +++ linux9/block/elevator-fq.c  2009-04-30 16:18:29.000000000 -0400
> @@ -954,6 +954,17 @@ struct io_cgroup *cgroup_to_io_cgroup(st
>                            struct io_cgroup, css);
>  }
>
> +struct request_list *io_group_get_request_list(struct request_queue *q,
> +                                               struct bio *bio)
> +{
> +       struct io_group *iog;
> +
> +       iog = io_get_io_group_bio(q, bio, 1);
> +       BUG_ON(!iog);
> +out:
> +       return &iog->rl;
> +}
> +
>  /*
>  * Search the bfq_group for bfqd into the hash table (by now only a list)
>  * of bgrp.  Must be called under rcu_read_lock().
> @@ -1203,6 +1214,8 @@ struct io_group *io_group_chain_alloc(st
>                io_group_init_entity(iocg, iog);
>                iog->my_entity = &iog->entity;
>
> +               blk_init_request_list(&iog->rl);
> +
>                if (leaf == NULL) {
>                        leaf = iog;
>                        prev = leaf;
> @@ -1446,6 +1459,8 @@ struct io_group *io_alloc_root_group(str
>        for (i = 0; i < IO_IOPRIO_CLASSES; i++)
>                iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
>
> +       blk_init_request_list(&iog->rl);
> +
>        iocg = &io_root_cgroup;
>        spin_lock_irq(&iocg->lock);
>        rcu_assign_pointer(iog->key, key);
> Index: linux9/block/elevator-fq.h
> ===================================================================
> --- linux9.orig/block/elevator-fq.h     2009-04-30 16:18:27.000000000 -0400
> +++ linux9/block/elevator-fq.h  2009-04-30 16:18:29.000000000 -0400
> @@ -239,8 +239,14 @@ struct io_group {
>
>        /* Single ioq per group, used for noop, deadline, anticipatory */
>        struct io_queue *ioq;
> +
> +       /* request list associated with the group */
> +       struct request_list rl;
>  };
>
> +#define IOG_FLAG_READFULL      1       /* read queue has been filled */
> +#define IOG_FLAG_WRITEFULL     2       /* write queue has been filled */
> +
>  /**
>  * struct bfqio_cgroup - bfq cgroup data structure.
>  * @css: subsystem state for bfq in the containing cgroup.
> @@ -517,6 +523,8 @@ extern void elv_fq_unset_request_ioq(str
>  extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
>  extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
>                                                struct bio *bio);
> +extern struct request_list *io_group_get_request_list(struct request_queue *q,
> +                                               struct bio *bio);
>
>  /* Returns single ioq associated with the io group. */
>  static inline struct io_queue *io_group_ioq(struct io_group *iog)
>
> Thanks
> Vivek
>
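
The hunks quoted above add a q->nr_group_requests tunable next to q->nr_requests and give each io_group its own request_list. As a rough illustration of the resulting admission rule, here is a standalone userspace model (not kernel code; apart from nr_requests and nr_group_requests, every name below is made up for the sketch):

/*
 * Standalone model of the dual-limit scheme: q->nr_requests bounds the
 * total request descriptors on the queue, q->nr_group_requests bounds
 * what a single io_group may take out of that pool.  Only those two
 * names are taken from the patch; everything else is illustrative.
 */
#include <stdio.h>

struct model_queue {
        int nr_requests;        /* total pool size (q->nr_requests) */
        int nr_group_requests;  /* per-group cap (q->nr_group_requests) */
        int allocated;          /* descriptors currently allocated, all groups */
};

struct model_group {
        int allocated;          /* descriptors allocated by this group */
};

/* Returns 1 if one more request may be allocated for this group. */
static int may_allocate(const struct model_queue *q,
                        const struct model_group *iog)
{
        if (q->allocated + 1 > q->nr_requests)
                return 0;       /* queue-wide pool exhausted */
        if (iog->allocated + 1 > q->nr_group_requests)
                return 0;       /* this group hit its own cap */
        return 1;
}

int main(void)
{
        struct model_queue q = { .nr_requests = 128, .nr_group_requests = 64 };
        struct model_group busy = { .allocated = 64 };
        struct model_group idle = { .allocated = 0 };

        q.allocated = busy.allocated + idle.allocated;

        /* the busy group is refused, the idle group can still allocate */
        printf("busy group may allocate: %d\n", may_allocate(&q, &busy));
        printf("idle group may allocate: %d\n", may_allocate(&q, &idle));
        return 0;
}

Compiled on its own, this prints that the busy group is refused another descriptor while the idle group is still allowed one, which is the behaviour the per-group limit is meant to guarantee.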
>> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
>> ---
>> block/blk-core.c    |   36 +++++++--
>> block/blk-sysfs.c   |   22 ++++--
>> block/elevator-fq.c |  133 ++++++++++++++++++++++++++++++++--
>> block/elevator-fq.h |  201 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 371 insertions(+), 21 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 29bcfac..21023f7 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -705,11 +705,15 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
>> static void __freed_request(struct request_queue *q, int rw)
>> {
>>       struct request_list *rl = &q->rq;
>> -
>> -     if (rl->count[rw] < queue_congestion_off_threshold(q))
>> +     struct io_group *congested_iog, *full_iog;
>> +
>> +     congested_iog = io_congested_io_group(q, rw);
>> +     if (rl->count[rw] < queue_congestion_off_threshold(q) &&
>> +         !congested_iog)
>>               blk_clear_queue_congested(q, rw);
>>
>> -     if (rl->count[rw] + 1 <= q->nr_requests) {
>> +     full_iog = io_full_io_group(q, rw);
>> +     if (rl->count[rw] + 1 <= q->nr_requests && !full_iog) {
>>               if (waitqueue_active(&rl->wait[rw]))
>>                       wake_up(&rl->wait[rw]);
>>
>> @@ -721,13 +725,16 @@ static void __freed_request(struct request_queue *q, int rw)
>>  * A request has just been released.  Account for it, update the full and
>>  * congestion status, wake up any waiters.   Called under q->queue_lock.
>>  */
>> -static void freed_request(struct request_queue *q, int rw, int priv)
>> +static void freed_request(struct request_queue *q, struct io_group *iog,
>> +                       int rw, int priv)
>> {
>>       struct request_list *rl = &q->rq;
>>
>>       rl->count[rw]--;
>>       if (priv)
>>               rl->elvpriv--;
>> +     if (iog)
>> +             io_group_dec_count(iog, rw);
>>
>>       __freed_request(q, rw);
>>
>> @@ -746,16 +753,21 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>> {
>>       struct request *rq = NULL;
>>       struct request_list *rl = &q->rq;
>> +     struct io_group *iog;
>>       struct io_context *ioc = NULL;
>>       const int rw = rw_flags & 0x01;
>>       int may_queue, priv;
>>
>> +     iog = __io_get_io_group(q);
>> +
>>       may_queue = elv_may_queue(q, rw_flags);
>>       if (may_queue == ELV_MQUEUE_NO)
>>               goto rq_starved;
>>
>> -     if (rl->count[rw]+1 >= queue_congestion_on_threshold(q)) {
>> -             if (rl->count[rw]+1 >= q->nr_requests) {
>> +     if (rl->count[rw]+1 >= queue_congestion_on_threshold(q) ||
>> +         io_group_congestion_on(iog, rw)) {
>> +             if (rl->count[rw]+1 >= q->nr_requests ||
>> +                 io_group_full(iog, rw)) {
>>                       ioc = current_io_context(GFP_ATOMIC, q->node);
>>                       /*
>>                        * The queue will fill after this allocation, so set
>> @@ -789,8 +801,15 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>       if (rl->count[rw] >= (3 * q->nr_requests / 2))
>>               goto out;
>>
>> +     if (iog)
>> +             if (io_group_count(iog, rw) >=
>> +                (3 * io_group_nr_requests(iog) / 2))
>> +                     goto out;
>> +
>>       rl->count[rw]++;
>>       rl->starved[rw] = 0;
>> +     if (iog)
>> +             io_group_inc_count(iog, rw);
>>
>>       priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
>>       if (priv)
>> @@ -808,7 +827,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>                * wait queue, but this is pretty rare.
>>                */
>>               spin_lock_irq(q->queue_lock);
>> -             freed_request(q, rw, priv);
>> +             freed_request(q, iog, rw, priv);
>>
>>               /*
>>                * in the very unlikely event that allocation failed and no
>> @@ -1073,12 +1092,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
>>       if (req->cmd_flags & REQ_ALLOCED) {
>>               int rw = rq_data_dir(req);
>>               int priv = req->cmd_flags & REQ_ELVPRIV;
>> +             struct io_group *iog = io_request_io_group(req);
>>
>>               BUG_ON(!list_empty(&req->queuelist));
>>               BUG_ON(!hlist_unhashed(&req->hash));
>>
>>               blk_free_request(q, req);
>> -             freed_request(q, rw, priv);
>> +             freed_request(q, iog, rw, priv);
>>       }
>> }
>> EXPORT_SYMBOL_GPL(__blk_put_request);
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 0d98c96..af5191c 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -40,6 +40,7 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
>> {
>>       struct request_list *rl = &q->rq;
>>       unsigned long nr;
>> +     int iog_congested[2], iog_full[2];
>>       int ret = queue_var_store(&nr, page, count);
>>       if (nr < BLKDEV_MIN_RQ)
>>               nr = BLKDEV_MIN_RQ;
>> @@ -47,27 +48,32 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
>>       spin_lock_irq(q->queue_lock);
>>       q->nr_requests = nr;
>>       blk_queue_congestion_threshold(q);
>> +     io_group_set_nrq_all(q, nr, iog_congested, iog_full);
>>
>> -     if (rl->count[READ] >= queue_congestion_on_threshold(q))
>> +     if (rl->count[READ] >= queue_congestion_on_threshold(q) ||
>> +         iog_congested[READ])
>>               blk_set_queue_congested(q, READ);
>> -     else if (rl->count[READ] < queue_congestion_off_threshold(q))
>> +     else if (rl->count[READ] < queue_congestion_off_threshold(q) &&
>> +              !iog_congested[READ])
>>               blk_clear_queue_congested(q, READ);
>>
>> -     if (rl->count[WRITE] >= queue_congestion_on_threshold(q))
>> +     if (rl->count[WRITE] >= queue_congestion_on_threshold(q) ||
>> +         iog_congested[WRITE])
>>               blk_set_queue_congested(q, WRITE);
>> -     else if (rl->count[WRITE] < queue_congestion_off_threshold(q))
>> +     else if (rl->count[WRITE] < queue_congestion_off_threshold(q) &&
>> +              !iog_congested[WRITE])
>>               blk_clear_queue_congested(q, WRITE);
>>
>> -     if (rl->count[READ] >= q->nr_requests) {
>> +     if (rl->count[READ] >= q->nr_requests || iog_full[READ]) {
>>               blk_set_queue_full(q, READ);
>> -     } else if (rl->count[READ]+1 <= q->nr_requests) {
>> +     } else if (rl->count[READ]+1 <= q->nr_requests && !iog_full[READ]) {
>>               blk_clear_queue_full(q, READ);
>>               wake_up(&rl->wait[READ]);
>>       }
>>
>> -     if (rl->count[WRITE] >= q->nr_requests) {
>> +     if (rl->count[WRITE] >= q->nr_requests || iog_full[WRITE]) {
>>               blk_set_queue_full(q, WRITE);
>> -     } else if (rl->count[WRITE]+1 <= q->nr_requests) {
>> +     } else if (rl->count[WRITE]+1 <= q->nr_requests && !iog_full[WRITE]) {
>>               blk_clear_queue_full(q, WRITE);
>>               wake_up(&rl->wait[WRITE]);
>>       }
>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>> index df53418..3b021f3 100644
>> --- a/block/elevator-fq.c
>> +++ b/block/elevator-fq.c
>> @@ -924,6 +924,111 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>> }
>> EXPORT_SYMBOL(io_lookup_io_group_current);
>>
>> +/*
>> + * TODO
>> + * This is a complete duplication of blk_queue_congestion_threshold()
>> + * except for the argument type and name.  Can we merge them?
>> + */
>> +static void io_group_nrq_congestion_threshold(struct io_group_nrq *nrq)
>> +{
>> +     int nr;
>> +
>> +     nr = nrq->nr_requests - (nrq->nr_requests / 8) + 1;
>> +     if (nr > nrq->nr_requests)
>> +             nr = nrq->nr_requests;
>> +     nrq->nr_congestion_on = nr;
>> +
>> +     nr = nrq->nr_requests - (nrq->nr_requests / 8)
>> +             - (nrq->nr_requests / 16) - 1;
>> +     if (nr < 1)
>> +             nr = 1;
>> +     nrq->nr_congestion_off = nr;
>> +}
>> +
>> +static void io_group_set_nrq(struct io_group_nrq *nrq, int nr_requests,
>> +                      int *congested, int *full)
>> +{
>> +     int i;
>> +
>> +     BUG_ON(nr_requests < 0);
>> +
>> +     nrq->nr_requests = nr_requests;
>> +     io_group_nrq_congestion_threshold(nrq);
>> +
>> +     for (i=0; i<2; i++) {
>> +             if (nrq->count[i] >= nrq->nr_congestion_on)
>> +                     congested[i] = 1;
>> +             else if (nrq->count[i] < nrq->nr_congestion_off)
>> +                     congested[i] = 0;
>> +
>> +             if (nrq->count[i] >= nrq->nr_requests)
>> +                     full[i] = 1;
>> +             else if (nrq->count[i]+1 <= nrq->nr_requests)
>> +                     full[i] = 0;
>> +     }
>> +}
>> +
>> +void io_group_set_nrq_all(struct request_queue *q, int nr,
>> +                         int *congested, int *full)
>> +{
>> +     struct elv_fq_data *efqd = &q->elevator->efqd;
>> +     struct hlist_head *head = &efqd->group_list;
>> +     struct io_group *root = efqd->root_group;
>> +     struct hlist_node *n;
>> +     struct io_group *iog;
>> +     struct io_group_nrq *nrq;
>> +     int nrq_congested[2];
>> +     int nrq_full[2];
>> +     int i;
>> +
>> +     for (i=0; i<2; i++)
>> +             *(congested + i) = *(full + i) = 0;
>> +
>> +     nrq = &root->nrq;
>> +     io_group_set_nrq(nrq, nr, nrq_congested, nrq_full);
>> +     for (i=0; i<2; i++) {
>> +             *(congested + i) |= nrq_congested[i];
>> +             *(full + i) |= nrq_full[i];
>> +     }
>> +
>> +     hlist_for_each_entry(iog, n, head, elv_data_node) {
>> +             nrq = &iog->nrq;
>> +             io_group_set_nrq(nrq, nr, nrq_congested, nrq_full);
>> +             for (i=0; i<2; i++) {
>> +                     *(congested + i) |= nrq_congested[i];
>> +                     *(full + i) |= nrq_full[i];
>> +             }
>> +     }
>> +}
>> +
>> +struct io_group *io_congested_io_group(struct request_queue *q, int rw)
>> +{
>> +     struct hlist_head *head = &q->elevator->efqd.group_list;
>> +     struct hlist_node *n;
>> +     struct io_group *iog;
>> +
>> +     hlist_for_each_entry(iog, n, head, elv_data_node) {
>> +             struct io_group_nrq *nrq = &iog->nrq;
>> +             if (nrq->count[rw] >= nrq->nr_congestion_off)
>> +                     return iog;
>> +     }
>> +     return NULL;
>> +}
>> +
>> +struct io_group *io_full_io_group(struct request_queue *q, int rw)
>> +{
>> +     struct hlist_head *head = &q->elevator->efqd.group_list;
>> +     struct hlist_node *n;
>> +     struct io_group *iog;
>> +
>> +     hlist_for_each_entry(iog, n, head, elv_data_node) {
>> +             struct io_group_nrq *nrq = &iog->nrq;
>> +             if (nrq->count[rw] >= nrq->nr_requests)
>> +                     return iog;
>> +     }
>> +     return NULL;
>> +}
>> +
>> void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
>> {
>>       struct io_entity *entity = &iog->entity;
>> @@ -934,6 +1039,12 @@ void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
>>       entity->my_sched_data = &iog->sched_data;
>> }
>>
>> +static void io_group_init_nrq(struct request_queue *q, struct io_group_nrq *nrq)
>> +{
>> +     nrq->nr_requests = q->nr_requests;
>> +     io_group_nrq_congestion_threshold(nrq);
>> +}
>> +
>> void io_group_set_parent(struct io_group *iog, struct io_group *parent)
>> {
>>       struct io_entity *entity;
>> @@ -1053,6 +1164,8 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
>>               io_group_init_entity(iocg, iog);
>>               iog->my_entity = &iog->entity;
>>
>> +             io_group_init_nrq(q, &iog->nrq);
>> +
>>               if (leaf == NULL) {
>>                       leaf = iog;
>>                       prev = leaf;
>> @@ -1176,7 +1289,7 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
>>  * Generic function to make sure cgroup hierarchy is all setup once a request
>>  * from a cgroup is received by the io scheduler.
>>  */
>> -struct io_group *io_get_io_group(struct request_queue *q)
>> +struct io_group *__io_get_io_group(struct request_queue *q)
>> {
>>       struct cgroup *cgroup;
>>       struct io_group *iog;
>> @@ -1192,6 +1305,19 @@ struct io_group *io_get_io_group(struct request_queue *q)
>>       return iog;
>> }
>>
>> +struct io_group *io_get_io_group(struct request_queue *q)
>> +{
>> +     struct io_group *iog;
>> +     unsigned long flags;
>> +
>> +     spin_lock_irqsave(q->queue_lock, flags);
>> +     iog = __io_get_io_group(q);
>> +     spin_unlock_irqrestore(q->queue_lock, flags);
>> +     BUG_ON(!iog);
>> +
>> +     return iog;
>> +}
>> +
>> void io_free_root_group(struct elevator_queue *e)
>> {
>>       struct io_cgroup *iocg = &io_root_cgroup;
>> @@ -1220,6 +1346,7 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
>>       iog->entity.parent = NULL;
>>       for (i = 0; i < IO_IOPRIO_CLASSES; i++)
>>               iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
>> +     io_group_init_nrq(q, &iog->nrq);
>>
>>       iocg = &io_root_cgroup;
>>       spin_lock_irq(&iocg->lock);
>> @@ -1533,15 +1660,11 @@ void elv_fq_set_request_io_group(struct request_queue *q,
>>                                               struct request *rq)
>> {
>>       struct io_group *iog;
>> -     unsigned long flags;
>>
>>       /* Make sure io group hierarchy has been setup and also set the
>>        * io group to which rq belongs. Later we should make use of
>>        * bio cgroup patches to determine the io group */
>> -     spin_lock_irqsave(q->queue_lock, flags);
>>       iog = io_get_io_group(q);
>> -     spin_unlock_irqrestore(q->queue_lock, flags);
>> -     BUG_ON(!iog);
>>
>>       /* Store iog in rq. TODO: take care of referencing */
>>       rq->iog = iog;
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index fc4110d..f8eabd4 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -187,6 +187,22 @@ struct io_queue {
>>
>> #ifdef CONFIG_GROUP_IOSCHED
>> /**
>> + * struct io_group_nrq - structure to store allocated requests info
>> + * @nr_requests: maximum number of requests for the io_group
>> + * @nr_congestion_on: threshold to determine that the io_group is congested.
>> + * @nr_congestion_off: threshold to determine that the io_group is not congested.
>> + * @count: number of allocated requests.
>> + *
>> + * All fields are protected by queue_lock.
>> + */
>> +struct io_group_nrq {
>> +     unsigned long nr_requests;
>> +     unsigned int nr_congestion_on;
>> +     unsigned int nr_congestion_off;
>> +     int count[2];
>> +};
>> +
>> +/**
>>  * struct bfq_group - per (device, cgroup) data structure.
>>  * @entity: schedulable entity to insert into the parent group sched_data.
>>  * @sched_data: own sched_data, to contain child entities (they may be
>> @@ -235,6 +251,8 @@ struct io_group {
>>
>>       /* Single ioq per group, used for noop, deadline, anticipatory */
>>       struct io_queue *ioq;
>> +
>> +     struct io_group_nrq nrq;
>> };
>>
>> /**
>> @@ -469,6 +487,11 @@ extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
>> extern void elv_fq_unset_request_ioq(struct request_queue *q,
>>                                       struct request *rq);
>> extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
>> +extern void io_group_set_nrq_all(struct request_queue *q, int nr,
>> +                         int *congested, int *full);
>> +extern struct io_group *io_congested_io_group(struct request_queue *q, int rw);
>> +extern struct io_group *io_full_io_group(struct request_queue *q, int rw);
>> +extern struct io_group *__io_get_io_group(struct request_queue *q);
>>
>> /* Returns single ioq associated with the io group. */
>> static inline struct io_queue *io_group_ioq(struct io_group *iog)
>> @@ -486,6 +509,52 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
>>       iog->ioq = ioq;
>> }
>>
>> +static inline struct io_group *io_request_io_group(struct request *rq)
>> +{
>> +     return rq->iog;
>> +}
>> +
>> +static inline unsigned long io_group_nr_requests(struct io_group *iog)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.nr_requests;
>> +}
>> +
>> +static inline int io_group_inc_count(struct io_group *iog, int rw)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.count[rw]++;
>> +}
>> +
>> +static inline int io_group_dec_count(struct io_group *iog, int rw)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.count[rw]--;
>> +}
>> +
>> +static inline int io_group_count(struct io_group *iog, int rw)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.count[rw];
>> +}
>> +
>> +static inline int io_group_congestion_on(struct io_group *iog, int rw)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.count[rw] + 1 >= iog->nrq.nr_congestion_on;
>> +}
>> +
>> +static inline int io_group_congestion_off(struct io_group *iog, int rw)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.count[rw] < iog->nrq.nr_congestion_off;
>> +}
>> +
>> +static inline int io_group_full(struct io_group *iog, int rw)
>> +{
>> +     BUG_ON(!iog);
>> +     return iog->nrq.count[rw] + 1 >= iog->nrq.nr_requests;
>> +}
>> #else /* !GROUP_IOSCHED */
>> /*
>>  * No ioq movement is needed in case of flat setup. root io group gets cleaned
>> @@ -537,6 +606,71 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
>>       return NULL;
>> }
>>
>> +static inline void io_group_set_nrq_all(struct request_queue *q, int nr,
>> +                                     int *congested, int *full)
>> +{
>> +     int i;
>> +     for (i=0; i<2; i++)
>> +             *(congested + i) = *(full + i) = 0;
>> +}
>> +
>> +static inline struct io_group *
>> +io_congested_io_group(struct request_queue *q, int rw)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline struct io_group *
>> +io_full_io_group(struct request_queue *q, int rw)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline struct io_group *__io_get_io_group(struct request_queue *q)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline struct io_group *io_request_io_group(struct request *rq)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline unsigned long io_group_nr_requests(struct io_group *iog)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_inc_count(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_dec_count(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_count(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_congestion_on(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_congestion_off(struct io_group *iog, int rw)
>> +{
>> +     return 1;
>> +}
>> +
>> +static inline int io_group_full(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> #endif /* GROUP_IOSCHED */
>>
>> /* Functions used by blksysfs.c */
>> @@ -589,6 +723,9 @@ extern void elv_free_ioq(struct io_queue *ioq);
>>
>> #else /* CONFIG_ELV_FAIR_QUEUING */
>>
>> +struct io_group {
>> +};
>> +
>> static inline int elv_init_fq_data(struct request_queue *q,
>>                                       struct elevator_queue *e)
>> {
>> @@ -655,5 +792,69 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
>>       return NULL;
>> }
>>
>> +static inline void io_group_set_nrq_all(struct request_queue *q, int nr,
>> +                                     int *congested, int *full)
>> +{
>> +     int i;
>> +     for (i=0; i<2; i++)
>> +             *(congested + i) = *(full + i) = 0;
>> +}
>> +
>> +static inline struct io_group *
>> +io_congested_io_group(struct request_queue *q, int rw)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline struct io_group *
>> +io_full_io_group(struct request_queue *q, int rw)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline struct io_group *__io_get_io_group(struct request_queue *q)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline struct io_group *io_request_io_group(struct request *rq)
>> +{
>> +     return NULL;
>> +}
>> +
>> +static inline unsigned long io_group_nr_requests(struct io_group *iog)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_inc_count(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_dec_count(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_count(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_congestion_on(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> +
>> +static inline int io_group_congestion_off(struct io_group *iog, int rw)
>> +{
>> +     return 1;
>> +}
>> +
>> +static inline int io_group_full(struct io_group *iog, int rw)
>> +{
>> +     return 0;
>> +}
>> #endif /* CONFIG_ELV_FAIR_QUEUING */
>> #endif /* _BFQ_SCHED_H */
>> --
>> 1.5.4.3
>>
>>
>> --
>> IKEDA, Munehiro
>> NEC Corporation of America
>>   m-ikeda@ds.jp.nec.com
>>
>
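
For reference, io_group_nrq_congestion_threshold() in the patch above follows the same arithmetic as blk_queue_congestion_threshold(). A standalone model of that calculation with a worked example (userspace C, illustrative only):

/*
 * Model of io_group_nrq_congestion_threshold() from the patch above:
 * the on-threshold is nr - nr/8 + 1 (clamped to nr) and the
 * off-threshold is nr - nr/8 - nr/16 - 1 (clamped to at least 1),
 * the same arithmetic blk_queue_congestion_threshold() uses.
 */
#include <stdio.h>

static void nrq_congestion_threshold(int nr_requests,
                                     int *nr_congestion_on,
                                     int *nr_congestion_off)
{
        int nr;

        nr = nr_requests - (nr_requests / 8) + 1;
        if (nr > nr_requests)
                nr = nr_requests;
        *nr_congestion_on = nr;

        nr = nr_requests - (nr_requests / 8) - (nr_requests / 16) - 1;
        if (nr < 1)
                nr = 1;
        *nr_congestion_off = nr;
}

int main(void)
{
        int on, off;

        /* with the default nr_requests of 128: on = 113, off = 103 */
        nrq_congestion_threshold(128, &on, &off);
        printf("nr_requests=128: congested at %d, clears below %d\n", on, off);
        return 0;
}

So with the default of 128 requests per group, a group is marked congested once 113 are in flight and is no longer considered congested once the count falls below 103.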

^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: IO Controller per cgroup request descriptors (Re: [PATCH 01/10] Documentation)
  2009-05-01 23:39                 ` Nauman Rafique
@ 2009-05-04 17:18                   ` IKEDA, Munehiro
       [not found]                   ` <e98e18940905011639o63c048f1n79c7e7648441a06d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 190+ messages in thread
From: IKEDA, Munehiro @ 2009-05-04 17:18 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: Vivek Goyal, Balbir Singh, oz-kernel, paolo.valente,
	linux-kernel, dhaval, containers, menage, jmoyer, fchecconi,
	arozansk, jens.axboe, akpm, fernando, Andrea Righi, Ryo Tsuruta,
	Divyesh Shah, Gui Jianfeng

Nauman Rafique wrote:
> On Fri, May 1, 2009 at 3:45 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> On Fri, May 01, 2009 at 06:04:39PM -0400, IKEDA, Munehiro wrote:
>>> Vivek Goyal wrote:
>>>>>> +TODO
>>>>>> +====
>>>>>> +- Lots of cleanups, testing, bug fixing, optimizations, benchmarking etc...
>>>>>> +- Convert cgroup ioprio to notion of weight.
>>>>>> +- Anticipatory code will need more work. It is not working properly currently
>>>>>> +  and needs more thought.
>>>>> What are the problems with the code?
>>>> Have not got a chance to look into the issues in detail yet. Just a crude run
>>>> saw drop in performance. Will debug it later the moment I have got async writes
>>>> handled...
>>>>
>>>>>> +- Use of bio-cgroup patches.
>>>>> I saw these posted as well
>>>>>
>>>>>> +- Use of Nauman's per cgroup request descriptor patches.
>>>>>> +
>>>>> More details would be nice, I am not sure I understand
>>>> Currently the number of request descriptors which can be allocated per
>>>> device/request queue is fixed by a sysfs tunable (q->nr_requests). So
>>>> if there is lots of IO going on from one cgroup, it will consume all
>>>> the available request descriptors and other cgroups might starve and not
>>>> get their fair share.
>>>>
>>>> Hence we also need to introduce the notion of a per-cgroup request
>>>> descriptor limit, so that if the request descriptors of one group are
>>>> exhausted, it does not impact the IO of other cgroups.
>>> Unfortunately I couldn't find, and have never seen, Nauman's patches.
>>> So I tried to make a patch against this todo item, below.  The reason
>>> I'm posting it, even though it is just a quick and ugly hack (and might
>>> be a reinvention of the wheel), is that I would like to discuss how we
>>> should define the limit on requests per cgroup.
>>> This patch should be applied on top of Vivek's I/O controller patches
>>> posted on Mar 11.
>> Hi IKEDA,
>>
>> Sorry for the confusion here. Actually Nauman had sent a patch to select group
>> of people who were initially copied on the mail thread.
> 
> I am sorry about that. Since I dropped my whole patch set in favor of
> Vivek's stuff, this stuff fell through the cracks.

No problem at all guys.  I'm glad to see your patch Vivek sent, thanks.


>>> This patch temporarily distributes q->nr_requests to each cgroup.
>>> I think the number should be weighted like BFQ's budget.  But in
>>> that case, if the cgroup hierarchy is deep, leaf cgroups are
>>> allowed to allocate only very few requests.  I don't think
>>> this is reasonable...but I don't have a specific idea to solve this
>>> problem.  Does anyone have a good idea?
>>>
>> Thanks for the patch. Yes, ideally one would expect request descriptors
>> to also be allocated in proportion to the weight, but I guess that would
>> become very complicated.
>>
>> In terms of simpler things, two thoughts come to mind.
>>
>> - First approach is to make q->nr_requests per group. So every group is
>>  entitled to q->nr_requests as set by the user. This is what your patch
>>  seems to have done.
>>
>>  I had some concerns with this approach. First of all, it does not seem to
>>  put an upper bound on the number of request descriptors allocated per queue,
>>  because if a user creates more cgroups, the total number of request
>>  descriptors increases (a sketch below illustrates this).
>>
>> - Second approach is that we retain the meaning of q->nr_requests,
>>  which defines the total number of request descriptors on the queue (with
>>  the exception of 50% more descriptors for batching processes), and we
>>  define a new per-group limit, q->nr_group_requests, which defines how many
>>  requests can be assigned per group. So q->nr_requests defines the total
>>  pool size on the queue and q->nr_group_requests defines how many requests
>>  each group can allocate out of that pool.
>>
>>  Here the issue is that a user will have to balance q->nr_group_requests
>>  and q->nr_requests properly.
>>
>> To experiment, I have implemented the second approach. I am attaching the
>> patch from my current tree. It probably will not apply on my March 11
>> posting, as the patches have changed since then, but I am posting it here
>> so that it at least gives an idea of the thought process.
>>
>> Ideas are welcome...
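
To make the upper-bound concern with the first approach concrete, a small standalone illustration (assuming the default nr_requests of 128, i.e. BLKDEV_MAX_RQ; the cgroup counts are arbitrary):

/*
 * Illustration of the upper-bound concern: with the first approach every
 * cgroup is entitled to q->nr_requests descriptors of its own, so the
 * total grows with the number of cgroups; with the second approach the
 * total stays bounded by q->nr_requests regardless of cgroup count.
 */
#include <stdio.h>

int main(void)
{
        int nr_requests = 128;  /* per-queue tunable (BLKDEV_MAX_RQ default) */
        int cgroups;

        for (cgroups = 1; cgroups <= 8; cgroups *= 2)
                printf("%d cgroups: approach 1 allows up to %d descriptors, "
                       "approach 2 stays capped at %d\n",
                       cgroups, cgroups * nr_requests, nr_requests);
        return 0;
}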
> 
> I had started with the first option, but the second option sounds good
> too. One problem that comes to mind, though, is how we deal with
> hierarchies. The sysadmin can limit the root-level cgroups to a
> specific number of request descriptors, but if applications running in
> a cgroup are allowed to create their own cgroups, then the total
> request descriptors of all child cgroups should be capped by the
> number assigned to the parent cgroup.

I think the second option cannot coexist with hierarchy support for
per-cgroup request descriptor limits.  I guess the fundamental
idea of the second approach is to make the logic simpler by giving up
hierarchy support, is that correct?
IIUC, for hierarchy support, we need a good way to solve the issue
that a cgroup deep in the hierarchy can have only a few requests,
as I mentioned.
Anyway, I will keep my eye on this issue.  I'm looking forward to the
next version of Vivek's patches.
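
To put numbers on the deep-hierarchy concern, a standalone sketch that assumes each level simply splits its budget equally among its children; this only illustrates the problem being described, it is not a proposal:

/*
 * Illustration of the deep-hierarchy concern: if each level simply
 * divides its request-descriptor budget equally among its children,
 * a leaf a few levels down is left with almost nothing.
 */
#include <stdio.h>

int main(void)
{
        int budget = 128;               /* top-level nr_requests */
        int children_per_level = 4;
        int depth;

        for (depth = 1; depth <= 4; depth++) {
                budget /= children_per_level;
                printf("depth %d: each cgroup gets %d request descriptors\n",
                       depth, budget);
        }
        return 0;
}

With four children per level, a leaf four levels down is left with no descriptors at all out of the original 128, which is why a plain split of a fixed budget looks unattractive for deep hierarchies.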


(snip)
>> +#ifdef CONFIG_GROUP_IOSCHED
>> +static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
>> +{
>> +       return queue_var_show(q->nr_group_requests, (page));
>> +}
>> +
>> +static ssize_t
>> +queue_group_requests_store(struct request_queue *q, const char *page,
>> +                                       size_t count)
>> +{
>> +       unsigned long nr;
>> +       int ret = queue_var_store(&nr, page, count);
>> +       if (nr < BLKDEV_MIN_RQ)
>> +               nr = BLKDEV_MIN_RQ;
>> +
>> +       spin_lock_irq(q->queue_lock);
>> +       q->nr_group_requests = nr;
>> +       spin_unlock_irq(q->queue_lock);
>> +       return ret;
>> +}
>> +#endif
(snip)

Is leaving the io_context "batching" status unchanged on purpose?



Thanks,
Muuhh

-- 
IKEDA, Munehiro
  NEC Corporation of America
    m-ikeda@ds.jp.nec.com


^ permalink raw reply	[flat|nested] 190+ messages in thread

* Re: [RFC] IO Controller
  2009-04-30 19:38         ` Nauman Rafique
@ 2009-05-05  3:18           ` Gui Jianfeng
       [not found]           ` <49F9FE3C.3070000-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 190+ messages in thread
From: Gui Jianfeng @ 2009-05-05  3:18 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: Vivek Goyal, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, arozansk, jmoyer,
	oz-kernel, dhaval, balbir, linux-kernel, containers, akpm,
	menage, peterz

Nauman Rafique wrote:
...
> Hi Gui,
> This patch should solve the problems you reported. Please let me know if it does not work.
> @Vivek, this has a few more changes after the patch I sent you separately.
> 

  Hi Nauman,

  I've tried your patch, and the bug seems to be fixed.  Thanks!

> DESC
> Add ref counting for io_group.
> EDESC
>     
>         Reference counting for io_group solves many problems, most of which
>         occurred when we tried to delete the cgroup. Earlier, ioqs were being
>         moved out of the cgroup to the root cgroup. That is problematic in many
>         ways: First, the pending requests in queues might get unfair service,
>         and will also cause unfairness for other cgroups at the root level. This
>         problem can become significant if cgroups are created and destroyed
>         relatively frequently. Second, moving queues to the root cgroup was
>         complicated and was causing many BUG_ON's to trigger. Third, there is
>         a single io queue in AS, Deadline and Noop within a cgroup, and it
>         does not make sense to move it to the root cgroup. The same is true of
>         async queues.
>
>         Requests already keep a reference on the ioq, so queues keep a reference
>         on the cgroup. For async queues in CFQ, and the single ioq in other
>         schedulers, the io_group also keeps a reference on the io_queue. This
>         reference on the ioq is dropped when the queue is released
>         (elv_release_ioq). So the queue can be freed.
>
>         When a queue is released, it puts its reference to the io_group, and the
>         io_group is released after all the queues are released. Child groups
>         also take a reference on parent groups, and release it when they are
>         destroyed.
>
>         Also, we no longer need to maintain a separate linked list of idle
>         entities, which was maintained only to help release the ioq references
>         during elevator switch. The code for releasing io_groups is reused for
>         elevator switch, resulting in simpler and tighter code.
> 
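
A standalone userspace model of the reference-counting scheme described in the DESC above, where a queue pins its io_group and a child group pins its parent; the types and helpers here are illustrative, not the actual kernel API:

/*
 * Userspace model of the reference counting described above: a queue
 * holds a reference on its io_group and a child group holds a reference
 * on its parent, so a group is freed only after all of its queues and
 * child groups are gone.
 */
#include <stdio.h>
#include <stdlib.h>

struct model_group {
        int refcount;
        struct model_group *parent;
        const char *name;
};

static struct model_group *group_get(struct model_group *iog)
{
        iog->refcount++;
        return iog;
}

static void group_put(struct model_group *iog)
{
        /* when a group dies, drop the reference it held on its parent */
        while (iog && --iog->refcount == 0) {
                struct model_group *parent = iog->parent;

                printf("freeing group %s\n", iog->name);
                free(iog);
                iog = parent;
        }
}

static struct model_group *group_alloc(const char *name,
                                       struct model_group *parent)
{
        struct model_group *iog = calloc(1, sizeof(*iog));

        if (!iog)
                exit(1);
        iog->name = name;
        iog->refcount = 1;              /* the cgroup's own reference */
        iog->parent = parent ? group_get(parent) : NULL;
        return iog;
}

int main(void)
{
        struct model_group *root = group_alloc("root", NULL);
        struct model_group *child = group_alloc("child", root);

        group_get(child);       /* a queue in "child" pins the group */

        group_put(root);        /* root cgroup removed; child still pins it */
        group_put(child);       /* child cgroup removed; queue still pins it */
        group_put(child);       /* queue released: child freed, then root */
        return 0;
}

Running it shows the child group surviving cgroup removal until its last queue reference is dropped, and the root group being freed only after the child releases the reference it took at creation time.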
 

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 190+ messages in thread

end of thread, other threads:[~2009-05-05  3:19 UTC | newest]

Thread overview: 190+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-03-12  1:56 [RFC] IO Controller Vivek Goyal
2009-03-12  1:56 ` Vivek Goyal
2009-03-12  1:56 ` [PATCH 02/10] Common flat fair queuing code in elevaotor layer Vivek Goyal
2009-03-19  6:27   ` Gui Jianfeng
2009-03-27  8:30   ` [PATCH] IO Controller: Don't store the pid in single queue circumstances Gui Jianfeng
     [not found]     ` <49CC8EBA.9040804-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-03-27 13:52       ` Vivek Goyal
2009-03-27 13:52     ` Vivek Goyal
2009-04-02  4:06   ` [PATCH 02/10] Common flat fair queuing code in elevaotor layer Divyesh Shah
     [not found]     ` <af41c7c40904012106h41d3cb50i2eeab2a02277a4c9-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-04-02 13:52       ` Vivek Goyal
2009-04-02 13:52     ` Vivek Goyal
     [not found]   ` <1236823015-4183-3-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-03-19  6:27     ` Gui Jianfeng
2009-03-27  8:30     ` [PATCH] IO Controller: Don't store the pid in single queue circumstances Gui Jianfeng
2009-04-02  4:06     ` [PATCH 02/10] Common flat fair queuing code in elevaotor layer Divyesh Shah
2009-03-12  1:56 ` [PATCH 03/10] Modify cfq to make use of flat elevator fair queuing Vivek Goyal
2009-03-12  1:56 ` [PATCH 07/10] Prepare elevator layer for single queue schedulers Vivek Goyal
2009-03-12  3:27 ` [RFC] IO Controller Takuya Yoshikawa
2009-03-12  6:40   ` anqin
     [not found]     ` <d95d44a20903112340s3c77807dt465e68901747ad89-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-03-12  6:55       ` Li Zefan
2009-03-12 13:46       ` Vivek Goyal
2009-03-12 13:46         ` Vivek Goyal
2009-03-12  6:55     ` Li Zefan
2009-03-12  7:11       ` anqin
     [not found]         ` <d95d44a20903120011m4a7281enf17b31b9aaf7c937-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-03-12 14:57           ` Vivek Goyal
2009-03-12 14:57             ` Vivek Goyal
     [not found]       ` <49B8B1FB.1040506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-03-12  7:11         ` anqin
     [not found]   ` <49B8810B.7030603-gVGce1chcLdL9jVzuh4AOg@public.gmane.org>
2009-03-12  6:40     ` anqin
2009-03-12 13:43     ` Vivek Goyal
2009-03-12 13:43       ` Vivek Goyal
     [not found] ` <1236823015-4183-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-03-12  1:56   ` [PATCH 01/10] Documentation Vivek Goyal
2009-03-12  1:56     ` Vivek Goyal
     [not found]     ` <1236823015-4183-2-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-03-12  7:11       ` Andrew Morton
2009-03-12  7:11         ` Andrew Morton
2009-03-12 10:07         ` Ryo Tsuruta
     [not found]         ` <20090312001146.74591b9d.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2009-03-12 10:07           ` Ryo Tsuruta
2009-03-12 18:01           ` Vivek Goyal
2009-03-12 18:01         ` Vivek Goyal
2009-03-16  8:40           ` Ryo Tsuruta
2009-03-16 13:39             ` Vivek Goyal
     [not found]             ` <20090316.174043.193698189.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-03-16 13:39               ` Vivek Goyal
     [not found]           ` <20090312180126.GI10919-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-03-16  8:40             ` Ryo Tsuruta
2009-04-05 15:15             ` Andrea Righi
2009-04-05 15:15           ` Andrea Righi
2009-04-06  6:50             ` Nauman Rafique
     [not found]             ` <49D8CB17.7040501-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2009-04-06  6:50               ` Nauman Rafique
2009-04-07  6:40               ` Vivek Goyal
2009-04-07  6:40                 ` Vivek Goyal
     [not found]                 ` <20090407064046.GB20498-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-04-08 20:37                   ` Andrea Righi
2009-04-08 20:37                 ` Andrea Righi
2009-04-16 18:37                   ` Vivek Goyal
2009-04-16 18:37                     ` Vivek Goyal
2009-04-17  5:35                     ` Dhaval Giani
     [not found]                       ` <20090417053517.GC26437-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2009-04-17 13:49                         ` IO Controller discussion (Was: Re: [PATCH 01/10] Documentation) Vivek Goyal
2009-04-17 13:49                           ` Vivek Goyal
     [not found]                     ` <20090416183753.GE8896-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-04-17  5:35                       ` [PATCH 01/10] Documentation Dhaval Giani
2009-04-17  9:37                       ` Andrea Righi
2009-04-17  9:37                     ` Andrea Righi
2009-04-17 14:13                       ` IO controller discussion (Was: Re: [PATCH 01/10] Documentation) Vivek Goyal
2009-04-17 14:13                       ` Vivek Goyal
     [not found]                         ` <20090417141358.GD29086-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-04-17 18:09                           ` Nauman Rafique
2009-04-17 22:38                           ` Andrea Righi
2009-04-18 13:19                           ` Balbir Singh
2009-04-19  4:35                           ` Nauman Rafique
2009-04-17 18:09                         ` Nauman Rafique
     [not found]                           ` <e98e18940904171109r17ccb054kb7879f8d02ac26b5-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-04-18  8:13                             ` Andrea Righi
2009-04-19 12:59                             ` Vivek Goyal
2009-04-19 12:59                               ` Vivek Goyal
2009-04-19 13:08                             ` Vivek Goyal
2009-04-18  8:13                           ` Andrea Righi
2009-04-19 13:08                           ` Vivek Goyal
2009-04-17 22:38                         ` Andrea Righi
2009-04-19 13:21                           ` Vivek Goyal
2009-04-19 13:21                             ` Vivek Goyal
2009-04-18 13:19                         ` Balbir Singh
2009-04-19 13:45                           ` Vivek Goyal
2009-04-19 15:53                             ` Andrea Righi
2009-04-21  1:16                               ` KAMEZAWA Hiroyuki
2009-04-21  1:16                               ` KAMEZAWA Hiroyuki
     [not found]                             ` <20090419134508.GG8493-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-04-19 15:53                               ` Andrea Righi
     [not found]                           ` <661de9470904180619k34e7998ch755a2ad3bed9ce5e-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-04-19 13:45                             ` Vivek Goyal
2009-04-19  4:35                         ` Nauman Rafique
2009-03-12  7:45       ` [PATCH 01/10] Documentation Yang Hongyang
2009-03-12  7:45         ` Yang Hongyang
     [not found]         ` <49B8BDB3.40808-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-03-12 13:51           ` Vivek Goyal
2009-03-12 13:51         ` Vivek Goyal
2009-03-12 10:00       ` Dhaval Giani
2009-03-12 10:24       ` Peter Zijlstra
2009-03-12 10:24         ` Peter Zijlstra
2009-03-12 14:09         ` Vivek Goyal
2009-03-12 14:09         ` Vivek Goyal
2009-04-06 14:35       ` Balbir Singh
2009-04-06 14:35         ` Balbir Singh
     [not found]         ` <20090406143556.GK7082-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
2009-04-06 22:00           ` Nauman Rafique
2009-04-06 22:00             ` Nauman Rafique
2009-04-07  5:59           ` Gui Jianfeng
2009-04-13 13:40           ` Vivek Goyal
2009-04-07  5:59         ` Gui Jianfeng
2009-04-13 13:40         ` Vivek Goyal
2009-05-01 22:04           ` IKEDA, Munehiro
     [not found]             ` <49FB71F7.90309-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
2009-05-01 22:45               ` IO Controller per cgroup request descriptors (Re: [PATCH 01/10] Documentation) Vivek Goyal
2009-05-01 22:45                 ` Vivek Goyal
2009-05-01 23:39                 ` Nauman Rafique
2009-05-04 17:18                   ` IKEDA, Munehiro
     [not found]                   ` <e98e18940905011639o63c048f1n79c7e7648441a06d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-05-04 17:18                     ` IKEDA, Munehiro
     [not found]                 ` <20090501224506.GC6130-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-01 23:39                   ` Nauman Rafique
     [not found]           ` <20090413134017.GC18007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-01 22:04             ` [PATCH 01/10] Documentation IKEDA, Munehiro
2009-03-12 10:00     ` Dhaval Giani
     [not found]       ` <20090312100054.GA8024-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2009-03-12 14:04         ` Vivek Goyal
2009-03-12 14:04       ` Vivek Goyal
     [not found]         ` <20090312140450.GE10919-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-03-12 14:48           ` Fabio Checconi
2009-03-12 14:48             ` Fabio Checconi
2009-03-12 15:03             ` Vivek Goyal
     [not found]             ` <20090312144842.GS12361-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
2009-03-12 15:03               ` Vivek Goyal
2009-03-18  7:23           ` Gui Jianfeng
2009-03-18  7:23         ` Gui Jianfeng
     [not found]           ` <49C0A171.8060009-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-03-18 21:55             ` Vivek Goyal
2009-03-18 21:55               ` Vivek Goyal
     [not found]               ` <20090318215529.GA3338-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-03-19  3:38                 ` Gui Jianfeng
2009-03-24  5:32                 ` Nauman Rafique
2009-03-19  3:38               ` Gui Jianfeng
2009-03-24  5:32               ` Nauman Rafique
     [not found]                 ` <e98e18940903232232i432f62c5r9dfd74268e1b2684-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-03-24 12:58                   ` Vivek Goyal
2009-03-24 12:58                     ` Vivek Goyal
2009-03-24 18:14                     ` Nauman Rafique
     [not found]                       ` <e98e18940903241114u1e03ae7dhf654d7d8d0fc0302-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-03-24 18:29                         ` Vivek Goyal
2009-03-24 18:29                           ` Vivek Goyal
2009-03-24 18:41                           ` Fabio Checconi
     [not found]                             ` <20090324184101.GO18554-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
2009-03-24 18:35                               ` Vivek Goyal
2009-03-24 18:35                                 ` Vivek Goyal
     [not found]                                 ` <20090324183532.GG21389-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-03-24 18:49                                   ` Nauman Rafique
2009-03-24 19:04                                   ` Fabio Checconi
2009-03-24 18:49                                 ` Nauman Rafique
2009-03-24 19:04                                 ` Fabio Checconi
     [not found]                           ` <20090324182906.GF21389-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-03-24 18:41                             ` Fabio Checconi
     [not found]                     ` <20090324125842.GA21389-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-03-24 18:14                       ` Nauman Rafique
2009-03-12  1:56   ` [PATCH 02/10] Common flat fair queuing code in elevaotor layer Vivek Goyal
2009-03-12  1:56   ` [PATCH 03/10] Modify cfq to make use of flat elevator fair queuing Vivek Goyal
2009-03-12  1:56   ` [PATCH 04/10] Common hierarchical fair queuing code in elevaotor layer Vivek Goyal
2009-03-12  1:56     ` Vivek Goyal
2009-03-12  1:56   ` [PATCH 05/10] cfq changes to use " Vivek Goyal
2009-03-12  1:56     ` Vivek Goyal
     [not found]     ` <1236823015-4183-6-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-04-16  5:25       ` [PATCH] IO-Controller: Fix kernel panic after moving a task Gui Jianfeng
2009-04-16  5:25     ` Gui Jianfeng
     [not found]       ` <49E6C14F.3090009-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-04-16 19:15         ` Vivek Goyal
2009-04-16 19:15           ` Vivek Goyal
2009-03-12  1:56   ` [PATCH 06/10] Separate out queue and data Vivek Goyal
2009-03-12  1:56     ` Vivek Goyal
2009-03-12  1:56   ` [PATCH 07/10] Prepare elevator layer for single queue schedulers Vivek Goyal
2009-03-12  1:56   ` [PATCH 08/10] noop changes for hierarchical fair queuing Vivek Goyal
2009-03-12  1:56     ` Vivek Goyal
2009-03-12  1:56   ` [PATCH 09/10] deadline " Vivek Goyal
2009-03-12  1:56     ` Vivek Goyal
2009-03-12  1:56   ` [PATCH 10/10] anticipatory " Vivek Goyal
2009-03-12  1:56     ` Vivek Goyal
     [not found]     ` <1236823015-4183-11-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-03-27  6:58       ` [PATCH] IO Controller: No need to stop idling in as Gui Jianfeng
2009-03-27  6:58     ` Gui Jianfeng
     [not found]       ` <49CC791A.10008-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-03-27 14:05         ` Vivek Goyal
2009-03-27 14:05       ` Vivek Goyal
2009-03-30  1:09         ` Gui Jianfeng
     [not found]         ` <20090327140530.GE30476-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-03-30  1:09           ` Gui Jianfeng
2009-03-12  3:27   ` [RFC] IO Controller Takuya Yoshikawa
2009-04-02  6:39   ` Gui Jianfeng
2009-04-10  9:33   ` Gui Jianfeng
2009-05-01  1:25   ` Divyesh Shah
2009-04-02  6:39 ` Gui Jianfeng
     [not found]   ` <49D45DAC.2060508-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-04-02 14:00     ` Vivek Goyal
2009-04-02 14:00       ` Vivek Goyal
     [not found]       ` <20090402140037.GC12851-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-04-07  1:40         ` Gui Jianfeng
2009-04-07  1:40       ` Gui Jianfeng
     [not found]         ` <49DAAF25.8010702-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-04-07  6:40           ` Gui Jianfeng
2009-04-07  6:40             ` Gui Jianfeng
2009-04-10  9:33 ` Gui Jianfeng
     [not found]   ` <49DF1256.7080403-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-04-10 17:49     ` Nauman Rafique
2009-04-13 13:09     ` Vivek Goyal
2009-04-10 17:49   ` Nauman Rafique
2009-04-13 13:09   ` Vivek Goyal
2009-04-22  3:04     ` Gui Jianfeng
2009-04-22  3:10       ` Nauman Rafique
2009-04-22 13:23       ` Vivek Goyal
     [not found]         ` <20090422132307.GA23098-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-04-30 19:38           ` Nauman Rafique
2009-04-30 19:38         ` Nauman Rafique
2009-05-05  3:18           ` Gui Jianfeng
     [not found]           ` <49F9FE3C.3070000-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2009-05-05  3:18             ` Gui Jianfeng
     [not found]       ` <49EE895A.1060101-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-04-22  3:10         ` Nauman Rafique
2009-04-22 13:23         ` Vivek Goyal
     [not found]     ` <20090413130958.GB18007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-04-22  3:04       ` Gui Jianfeng
2009-05-01  1:25 ` Divyesh Shah
2009-05-01  2:45   ` Vivek Goyal
2009-05-01  3:00     ` Divyesh Shah
     [not found]     ` <20090501024527.GA3730-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-01  3:00       ` Divyesh Shah
     [not found]   ` <49FA4F91.204-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2009-05-01  2:45     ` Vivek Goyal
