* IO scheduler based IO Controller V2
@ 2009-05-05 19:58 Vivek Goyal
  2009-05-05 19:58 ` [PATCH 01/18] io-controller: Documentation Vivek Goyal
                   ` (37 more replies)
  0 siblings, 38 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda
  Cc: vgoyal, akpm


Hi All,

Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
First version of the patches was posted here.

http://lkml.org/lkml/2009/3/11/486

This patchset is still a work in progress, but I want to keep putting snapshots
of my tree out at regular intervals to get feedback, hence V2.

Before I go into the details of the major changes from V1, I wanted to
highlight the other IO controller proposals on lkml.

Other active IO controller proposals
------------------------------------
Currently there are primarily two other IO controller proposals out there.

dm-ioband
---------
This patch set from Ryo Tsuruta of Valinux is a proportional bandwidth
controller implemented as a dm driver.

http://people.valinux.co.jp/~ryov/dm-ioband/

The biggest issue (apart from others) with a 2nd level IO controller is that
buffering of BIOs takes place in a single queue and dispatch of these BIOs
to the underlying IO scheduler is in FIFO order. That means whenever the
buffering takes place, it breaks the notion of CFQ's classes and priorities.

That means RT requests might get stuck behind some write requests, or some
read requests might get stuck behind write requests for a long time, etc. To
demonstrate the single FIFO dispatch issues, I ran some basic tests and posted
the results in the following mail thread.

http://lkml.org/lkml/2009/4/13/2
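
As a rough illustration of the kind of differentiation that gets lost, one can
pit an RT class reader against a low priority best-effort reader on the same
device (a hypothetical sketch, not the exact tests from the thread above;
$BLOCKDEV and the file names are placeholders):

ionice -c 1 -n 0 dd if=/mnt/$BLOCKDEV/file1 of=/dev/null &
ionice -c 2 -n 7 dd if=/mnt/$BLOCKDEV/file2 of=/dev/null &

With CFQ driving the disk directly, the RT reader should clearly win; with the
BIOs buffered and dispatched FIFO by a second level controller, that
differentiation is expected to largely disappear.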

These are hard issues to solve, and to fully resolve them one will end up
maintaining separate queues for separate classes and priorities, as CFQ does.
But that will make the 2nd level implementation complex. At the same time, if
somebody is trying to use the IO controller on a single disk or on a hardware
RAID using cfq as the scheduler, there will be two layers of queueing
maintaining separate queues per priority level: one at the dm-driver level and
the other in CFQ, which again does not make a lot of sense.

On the other hand, if a user is running noop at the device level, at the
higher level we will be maintaining multiple cfq-like queues, which also does
not make sense as the underlying IO scheduler never asked for that.

Hence, IMHO, controlling bios at the second level is probably not a very good
idea. We should instead do it at the IO scheduler level, where we already
maintain all the needed queues. We just need to make the scheduling
hierarchical and group aware so that the IO of one group is isolated from the
others.

IO-throttling
-------------
This patch set from Andrea Righi provides a max bandwidth controller. That
means it does not guarantee minimum bandwidth; it provides maximum bandwidth
limits and throttles an application if it crosses its bandwidth limit.

So it is not an apples-to-apples comparison. This patch set and dm-ioband
provide proportional bandwidth control, where a cgroup can use much more
bandwidth if there are no other users, and resource control comes into the
picture only if there is contention.

It seems that both kinds of users are out there: one set of people needing
proportional BW control and other people needing max bandwidth control.

Now the question is, where should max bandwidth control be implemented? At
higher layers or at the IO scheduler level? Should proportional bw control and
max bw control be implemented separately at different layers, or should they
be implemented in one place?

IMHO, if we are doing proportional bw control at the IO scheduler layer, it
should be possible to extend it to also do max bw control there without a lot
of effort. Then it probably does not make much sense to do the two types of
control at two different layers. Doing it in one place should lead to less
code and reduced complexity.

Secondly, the io-throttling solution also buffers writes at a higher layer,
which again leads to the issue of losing the notion of the priority of writes.

Hence, I personally think that users will need both proportional bw as well
as max bw control, and we should probably implement these in a single place
instead of splitting them. Once the elevator-based io controller patchset
matures, it can be enhanced to do max bw control as well.

Having said that, one issue with doing upper limit control at the elevator/IO
scheduler level is that it does not have a view of the higher level logical
devices. So if there is a software RAID with two disks, then one cannot do
max bw control on the logical device; instead it will have to be done on the
leaf nodes where the io scheduler is attached.

Now back to the description of this patchset and the changes from V1.

- Rebased patches to 2.6.30-rc4.

- Last time Andrew mentioned that async writes are a big issue for us, hence
  I introduced control for async writes also.

- Implemented per-group request descriptor support. This was needed to
  make sure that one group doing a lot of IO does not starve other groups of
  request descriptors, denying them their fair share. This is a basic patch
  right now which will probably require more changes after some discussion.

- Exported the disk time used and the number of sectors dispatched by a cgroup
  through the cgroup interface. This should help us in seeing how much disk
  time each group got and whether it is fair or not.

- Implemented group refcounting support. Lack of this was causing some
  cgroup related issues. There are still some races left which need to be
  fixed.

- For IO tracking/async write tracking, I started making use of the
  blkio-cgroup patches from Ryo Tsuruta posted here.

  http://lkml.org/lkml/2009/4/28/235

  Currently people seem to like the idea of a separate subsystem for
  tracking writes, so that the rest of the users can use that info instead of
  everybody implementing their own. How many of those prospective users will
  actually end up in the kernel is a different matter and is not clear yet.

  So instead of carrying my own version of the bio-cgroup patches and
  overloading the io controller cgroup subsystem, I am making use of the
  blkio-cgroup patches. One will have to mount the io controller and blkio
  subsystems together on the same hierarchy for the time being (see the
  example mount command right after this list). Later we can take care of the
  case where blkio is mounted on a different hierarchy.

- Replaced group priorities with group weights.
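
As mentioned in the blkio-cgroup item above, the io controller and blkio
subsystems have to be co-mounted for now. Concretely, that is the same mount
command used in the documentation patch below:

	mount -t cgroup -o io,blkio none /cgroup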

Testing
=======

Again, I have been able to do only very basic testing of reads and writes,
but I did not want to hold the patches back because of testing. Providing
support for async writes took much more time than expected, and work is still
left in that area. I will continue to do more testing.

Test1 (Fairness for synchronous reads)
======================================
- Two cgroups with cgroup weights of 1000 and 500. Ran one "dd" in each of
  those cgroups, with the CFQ scheduler and
  /sys/block/<device>/queue/fairness = 1:

dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

234179072 bytes (234 MB) copied, 4.13954 s, 56.6 MB/s
234179072 bytes (234 MB) copied, 5.2127 s, 44.9 MB/s

group1 time=3108 group1 sectors=460968
group2 time=1405 group2 sectors=264944

This patchset tries to provide fairness in terms of disk time received. group1
got almost double the disk time of group2 (3108 vs 1405, close to the 1000:500
weight ratio, measured at the time the first dd finished). These time and
sector statistics can be read from the io.disk_time and io.disk_sectors files
in the cgroup. More about this in the documentation file.
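
For reference, the cgroup setup behind this test roughly follows the HOWTO in
the documentation patch below (a sketch; io.ioprio is the weight file there,
and the fairness tunable is the one mentioned above):

	mount -t cgroup -o io,blkio none /cgroup
	mkdir -p /cgroup/group1 /cgroup/group2
	echo 1000 > /cgroup/group1/io.ioprio
	echo 500 > /cgroup/group2/io.ioprio
	echo 1 > /sys/block/$BLOCKDEV/queue/fairness

	dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
	echo $! > /cgroup/group1/tasks

	dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &
	echo $! > /cgroup/group2/tasks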

Test2 (Fairness for async writes)
=================================
Fairness for async writes is tricky, and the biggest reason is that async
writes are cached in higher layers (the page cache) and are not necessarily
dispatched to the lower layers in a proportional manner. For example, consider
two dd threads reading /dev/zero as the input file and writing out huge files.
Very soon we will cross vm_dirty_ratio, and the dd threads will be forced to
write out some pages to disk before more pages can be dirtied. But it is not
necessarily the dirty pages of the same thread that get picked; writeback can
very well pick the inode of the lower priority dd thread and do some writeout
on it. So effectively the higher weight dd ends up doing writeouts of the
lower weight dd's pages, and we don't see service differentiation.
IOW, the core problem with async write fairness is that the higher weight
thread does not throw enough IO traffic at the IO controller to keep its queue
continuously backlogged. There are many .2 to .8 second intervals where the
higher weight queue is empty, and in that time the lower weight queue gets
lots of work done, giving the impression that there was no service
differentiation.

In summary, from the IO controller's point of view, async write support is
there. Now we need to do some more work in the higher layers to make sure a
higher weight process is not blocked behind the IO of some lower weight
process. This is a TODO item.

So to test async writes, I generated lots of write traffic in two cgroups (50
fio threads) and watched the disk time statistics of the respective cgroups at
2-second intervals. Thanks to Ryo Tsuruta for the test case.

*****************************************************************
sync
echo 3 > /proc/sys/vm/drop_caches

fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/sdd1/fio/ --output=/mnt/sdd1/fio/test1.log &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/sdd2/fio/ --output=/mnt/sdd2/fio/test2.log &
*********************************************************************** 

I then watched the disk time and sector statistics for both cgroups every 2
seconds using a script. Here is a snippet from the output.

test1 statistics: time=9848   sectors=643152
test2 statistics: time=5224   sectors=258600

test1 statistics: time=11736   sectors=785792
test2 statistics: time=6509   sectors=333160

test1 statistics: time=13607   sectors=943968
test2 statistics: time=7443   sectors=394352

test1 statistics: time=15662   sectors=1089496
test2 statistics: time=8568   sectors=451152

So the disk time consumed by test1 is almost double that of test2.
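
The watching script itself is not shown above; a minimal sketch that produces
output in this format could look like the following (assuming the
/cgroup/bfqio mount point from the fio snippet and single-value
io.disk_time/io.disk_sectors files; the exact file contents may differ):

while true; do
	for grp in test1 test2; do
		t=$(cat /cgroup/bfqio/$grp/io.disk_time)
		s=$(cat /cgroup/bfqio/$grp/io.disk_sectors)
		echo "$grp statistics: time=$t   sectors=$s"
	done
	echo
	sleep 2
done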

Your feedback and comments are welcome.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* [PATCH 01/18] io-controller: Documentation
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-05 19:58   ` Vivek Goyal
  2009-05-05 19:58     ` Vivek Goyal
                     ` (20 subsequent siblings)
  21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o Documentation for io-controller.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 Documentation/block/00-INDEX          |    2 +
 Documentation/block/io-controller.txt |  264 +++++++++++++++++++++++++++++++++
 2 files changed, 266 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/block/io-controller.txt

diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index 961a051..dc8bf95 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -10,6 +10,8 @@ capability.txt
 	- Generic Block Device Capability (/sys/block/<disk>/capability)
 deadline-iosched.txt
 	- Deadline IO scheduler tunables
+io-controller.txt
+	- IO controller for providing hierarchical IO scheduling
 ioprio.txt
 	- Block io priorities (in CFQ scheduler)
 request.txt
diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
index 0000000..1290ada
--- /dev/null
+++ b/Documentation/block/io-controller.txt
@@ -0,0 +1,264 @@
+				IO Controller
+				=============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is, one
+can create cgroups and assign prios/weights to those cgroups, and each task
+group will get access to the disk in proportion to the weight of the group.
+
+These patches modify the elevator layer and individual IO schedulers to do
+IO control. Hence this io controller works only on block devices which use
+one of the standard io schedulers; it cannot be used with an arbitrary
+logical block device.
+
+The assumption/thought behind modifying the IO scheduler is that resource
+control is needed only on the leaf nodes, where the actual contention for
+resources is present, and not on intermediate logical block devices.
+
+Consider the following hypothetical scenario. Let's say there are three
+physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1)
+have been created on top of them. Some part of sdb is in lv0 and some in lv1.
+
+			    lv0      lv1
+			  /	\  /     \
+			sda      sdb      sdc
+
+Also consider the following cgroup hierarchy:
+
+				root
+				/   \
+			       A     B
+			      / \    / \
+			     T1 T2  T3  T4
+
+A and B are two cgroups, and T1, T2, T3 and T4 are tasks within those cgroups.
+Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1, these tasks should
+get their fair share of bandwidth on the disks sda, sdb and sdc. There is no
+IO control on the intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
+only, there will not be any contention for resources between groups A and B
+if the IO is going to sda or sdc. But if the actual IO gets translated to
+disk sdb, then the IO scheduler associated with sdb will distribute disk
+bandwidth to groups A and B in proportion to their weights.
+
+CFQ already has a notion of fairness, and it provides differential disk
+access based on the priority and class of the task. It is just that it is
+flat, and with cgroups it needs to be made hierarchical to achieve good
+hierarchical control of IO.
+
+The rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split into read and write queues
+for deadline and AS). With this patchset, we now maintain one queue per
+cgroup per device and then try to do fair queuing among those queues.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code, which has been moved to a common layer (the elevator
+layer). Hence we don't end up replicating code across IO schedulers. The
+following diagram depicts the concept.
+
+			--------------------------------
+			| Elevator Layer + Fair Queuing |
+			--------------------------------
+			 |	     |	     |       |
+			NOOP     DEADLINE    AS     CFQ
+
+Design
+======
+This patchset primarily uses the BFQ (Budget Fair Queuing) code to provide
+fairness among different IO queues. Fabio and Paolo implemented BFQ, which
+uses the B-WF2Q+ algorithm for fair queuing.
+
+Why BFQ?
+
+- It is not clear that the weighted round robin logic of CFQ can easily be
+  extended to hierarchical mode. One of the issues is that we cannot keep
+  dividing the time slice of a parent group among its children; the deeper we
+  go in the hierarchy, the smaller the time slice gets.
+
+  One of the ways to implement hierarchical support could be to keep track
+  of the virtual time and service provided to each queue/group and select a
+  queue/group for service based on one of the various available algorithms.
+
+  BFQ already had support for hierarchical scheduling, so taking those patches
+  was easier.
+
+- BFQ was designed to provide tighter bounds/delay w.r.t. the service provided
+  to a queue. Delay/jitter with BFQ is O(1).
+
+  Note: BFQ originally used the amount of IO done (number of sectors) as the
+  notion of service provided. IOW, it tried to provide fairness in terms of
+  actual IO done and not in terms of the actual time the disk was given to a
+  queue.
+
+  This patchset modified BFQ to provide fairness in the time domain because
+  that's what CFQ does. So the idea was to try not to deviate too much from
+  CFQ's behavior initially.
+
+  Providing fairness in the time domain makes accounting tricky because,
+  due to command queueing, at any one time there might be multiple requests
+  from different queues and there is no easy way to find out how much
+  disk time was actually consumed by the requests of a particular
+  queue. More about this in the comments in the source code.
+
+We have taken the BFQ code as a starting point for providing fairness among
+groups because it already contained many of the features we required to
+implement hierarchical IO scheduling. With this patch set, I am not trying to
+ensure O(1) delay, as my goal is to provide fairness among groups. Most likely
+that will mean that latencies are not worse than what cfq currently provides
+(if not improved). Once fairness is ensured, one can look further into
+ensuring O(1) latencies.
+
+From a data structure point of view, one can think of a tree per device from
+which io groups and io queues hang and are scheduled using the B-WF2Q+
+algorithm. An io_queue is the end queue where requests are actually stored
+and dispatched from (like a cfqq).
+
+These io queues are primarily created and managed by the end io schedulers
+depending on their semantics. For example, the noop, deadline and AS
+ioschedulers keep one io queue per cgroup, and cfq keeps one io queue per
+io_context in a cgroup (apart from async queues).
+
+A request is mapped to an io group by the elevator layer, and which io queue
+it is mapped to within the group depends on the ioscheduler. Currently the
+"current" task is used to determine the cgroup (hence the io group) of the
+request. Down the line we need to make use of the bio-cgroup patches to map
+delayed writes to the right group.
+
+Going back to old behavior
+==========================
+In the new scheme of things we are essentially creating hierarchical fair
+queuing logic in the elevator layer and changing the IO schedulers to make
+use of that logic, so that the end IO schedulers start supporting hierarchical
+scheduling.
+
+The elevator layer continues to support the old interfaces. So even if fair
+queuing is enabled at the elevator layer, one can have both the new
+hierarchical scheduler and the old non-hierarchical scheduler operating.
+
+Also, noop, deadline and AS have the option of enabling hierarchical
+scheduling. If it is selected, fair queuing is done in a hierarchical manner.
+If hierarchical scheduling is disabled, noop, deadline and AS should retain
+their existing behavior.
+
+CFQ is the only exception where one cannot disable fair queuing, as it is
+needed for providing fairness among various threads even in non-hierarchical
+mode.
+
+Various user visible config options
+===================================
+CONFIG_IOSCHED_NOOP_HIER
+	- Enables hierarchical fair queuing in noop. Not selecting this option
+	  leads to the old behavior of noop.
+
+CONFIG_IOSCHED_DEADLINE_HIER
+	- Enables hierarchical fair queuing in deadline. Not selecting this
+	  option leads to the old behavior of deadline.
+
+CONFIG_IOSCHED_AS_HIER
+	- Enables hierarchical fair queuing in AS. Not selecting this option
+	  leads to the old behavior of AS.
+
+CONFIG_IOSCHED_CFQ_HIER
+	- Enables hierarchical fair queuing in CFQ. Not selecting this option
+	  still does fair queuing among the various queues, but it is flat and
+	  not hierarchical.
+
+CGROUP_BLKIO
+	- This option enables the blkio-cgroup controller for IO tracking
+	  purposes. That means that with this controller one can attribute a
+	  write to its original cgroup and not assume that it belongs to the
+	  submitting thread.
+
+CONFIG_TRACK_ASYNC_CONTEXT
+	- Currently CFQ attributes writes to the submitting thread and
+	  caches the async queue pointer in the io context of the process.
+	  If this option is set, it tells cfq and the elevator fair queuing
+	  logic to make use of the IO tracking patches for async writes and
+	  to attribute writes to the original cgroup rather than to the
+	  submitting thread.
+
+CONFIG_DEBUG_GROUP_IOSCHED
+	- Emits extra debug messages in the blktrace output, helpful for
+	  debugging in a hierarchical setup.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+
+CONFIG_ELV_FAIR_QUEUING
+	- Enables/Disables the fair queuing logic at elevator layer.
+
+CONFIG_GROUP_IOSCHED
+	- Enables/Disables hierarchical queuing and associated cgroup bits.
+
+TODO
+====
+- Lots of code cleanups, testing, bug fixing, optimizations,
+  benchmarking etc...
+
+- Debug and fix some of the areas where higher weight cgroup async writes
+  are stuck behind lower weight cgroup async writes.
+
+- Anticipatory code will need more work. It is not working properly currently
+  and needs more thought.
+
+- Once things start working, I plan to look into the core algorithm. It looks
+  complicated and maintains lots of data structures. I need to spend some time
+  to see if it can be simplified.
+
+- Currently a cgroup setting is global, that is, it applies to all the block
+  devices in the system. It will probably make more sense to make it a
+  per-cgroup, per-device setting so that a cgroup can have different weights
+  on different devices, etc.
+
+HOWTO
+=====
+So far I have done very simple testing of running two dd threads in two
+different cgroups. Here is what you can do.
+
+- Enable hierarchical scheduling in the io scheduler of your choice (say cfq).
+	CONFIG_IOSCHED_CFQ_HIER=y
+
+- Enable IO tracking for async writes.
+	CONFIG_TRACK_ASYNC_CONTEXT=y
+
+  (This will automatically select CGROUP_BLKIO)
+
+- Compile and boot into the kernel, and mount the IO controller and the blkio
+  io tracking controller.
+
+	mount -t cgroup -o io,blkio none /cgroup
+
+- Create two cgroups
+	mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+	echo 1000 > /cgroup/test1/io.ioprio
+	echo 500 > /cgroup/test2/io.ioprio
+
+- Create two files of the same size (say 512MB each) on the same disk (file1,
+  file2) and launch two dd threads in different cgroups to read those files.
+  Make sure the right io scheduler is being used for the block device where
+  the files reside (the one you compiled in hierarchical mode).
+
+	echo 1 > /proc/sys/vm/drop_caches
+
+	dd if=/mnt/lv0/zerofile1 of=/dev/null &
+	echo $! > /cgroup/test1/tasks
+	cat /cgroup/test1/tasks
+
+	dd if=/mnt/lv0/zerofile2 of=/dev/null &
+	echo $! > /cgroup/test2/tasks
+	cat /cgroup/test2/tasks
+
+- At the macro level, the first dd should finish first. To get more precise
+  data, keep looking (with the help of a script) at the io.disk_time and
+  io.disk_sectors files of both the test1 and test2 groups. These will tell
+  how much disk time (in milliseconds) each group got and how many sectors
+  each group dispatched to the disk. We provide fairness in terms of disk
+  time, so ideally io.disk_time of the cgroups should be in proportion to
+  their weights. (It is hard to achieve though :-)).
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 02/18] io-controller: Common flat fair queuing code in elevator layer
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
@ 2009-05-05 19:58     ` Vivek Goyal
  2009-05-05 19:58 ` Vivek Goyal
                       ` (36 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

This is the common fair queuing code in the elevator layer, controlled by the
config option CONFIG_ELV_FAIR_QUEUING. This patch initially only introduces
flat fair queuing support, where there is only one group, the "root group",
and all tasks belong to it.

These elevator layer changes are backward compatible. That means any
ioscheduler using the old interfaces will continue to work.

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched    |   13 +
 block/Makefile           |    1 +
 block/blk-sysfs.c        |   25 +
 block/elevator-fq.c      | 2076 ++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h      |  488 +++++++++++
 block/elevator.c         |   46 +-
 include/linux/blkdev.h   |    5 +
 include/linux/elevator.h |   51 ++
 8 files changed, 2694 insertions(+), 11 deletions(-)
 create mode 100644 block/elevator-fq.c
 create mode 100644 block/elevator-fq.h

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
 
 menu "IO Schedulers"
 
+config ELV_FAIR_QUEUING
+	bool "Elevator Fair Queuing Support"
+	default n
+	---help---
+	  Traditionally only cfq had the notion of multiple queues and did
+	  fair queuing on its own. With cgroups and the need to control IO,
+	  now even the simple io schedulers like noop, deadline and as will
+	  have one queue per cgroup and will need hierarchical fair queuing.
+	  Instead of every io scheduler implementing its own fair queuing
+	  logic, this option enables fair queuing in the elevator layer so
+	  that other ioschedulers can make use of it.
+	  If unsure, say N.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/Makefile b/block/Makefile
index e9fa4dd..94bfc6e 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING)	+= elevator-fq.o
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3ff9bba..082a273 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -276,6 +276,26 @@ static struct queue_sysfs_entry queue_iostats_entry = {
 	.store = queue_iostats_store,
 };
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+static struct queue_sysfs_entry queue_slice_idle_entry = {
+	.attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_slice_idle_show,
+	.store = elv_slice_idle_store,
+};
+
+static struct queue_sysfs_entry queue_slice_sync_entry = {
+	.attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_slice_sync_show,
+	.store = elv_slice_sync_store,
+};
+
+static struct queue_sysfs_entry queue_slice_async_entry = {
+	.attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_slice_async_show,
+	.store = elv_slice_async_store,
+};
+#endif
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -287,6 +307,11 @@ static struct attribute *default_attrs[] = {
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
 	&queue_iostats_entry.attr,
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	&queue_slice_idle_entry.attr,
+	&queue_slice_sync_entry.attr,
+	&queue_slice_async_entry.attr,
+#endif
 	NULL,
 };
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..9aea899
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,2076 @@
+/*
+ * BFQ: Hierarchical B-WF2Q+ scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+#include <linux/blktrace_api.h>
+
+/* Values taken from cfq */
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+int elv_slice_idle = HZ / 125;
+static struct kmem_cache *elv_ioq_pool;
+
+#define ELV_SLICE_SCALE		(5)
+#define ELV_HW_QUEUE_MIN	(5)
+#define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+					struct io_queue *ioq, int probe);
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract);
+
+static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
+					unsigned short prio)
+{
+	const int base_slice = efqd->elv_slice[sync];
+
+	WARN_ON(prio >= IOPRIO_BE_NR);
+
+	return base_slice + (base_slice/ELV_SLICE_SCALE * (4 - prio));
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	return elv_prio_slice(efqd, elv_ioq_sync(ioq), ioq->entity.ioprio);
+}
+
+/* Mainly the BFQ scheduling code follows */
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(bfq_timestamp_t a, bfq_timestamp_t b)
+{
+	return (s64)(a - b) > 0;
+}
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor.
+ */
+static inline bfq_timestamp_t bfq_delta(bfq_service_t service,
+					bfq_weight_t weight)
+{
+	bfq_timestamp_t d = (bfq_timestamp_t)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct io_entity *entity,
+				   bfq_service_t service)
+{
+	BUG_ON(entity->weight == 0);
+
+	entity->finish = entity->start + bfq_delta(service, entity->weight);
+}
+
+static inline struct io_queue *io_entity_to_ioq(struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data == NULL)
+		ioq = container_of(entity, struct io_queue, entity);
+	return ioq;
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct io_entity *bfq_entity_of(struct rb_node *node)
+{
+	struct io_entity *entity = NULL;
+
+	if (node != NULL)
+		entity = rb_entry(node, struct io_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_extract(struct rb_root *root, struct io_entity *entity)
+{
+	BUG_ON(entity->tree != root);
+
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *next;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	BUG_ON(entity->tree != &st->idle);
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = bfq_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = bfq_entity_of(next);
+	}
+
+	bfq_extract(&st->idle, entity);
+
+	/* Delete queue from idle list */
+	if (ioq)
+		list_del(&ioq->queue_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct io_entity *entity)
+{
+	struct io_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	BUG_ON(entity->tree != NULL);
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct io_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of a entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function  assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct io_entity *entity,
+					struct rb_node *node)
+{
+	struct io_entity *child;
+
+	if (node != NULL) {
+		child = rb_entry(node, struct io_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value.  The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+	struct io_entity *entity = rb_entry(node, struct io_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (parent == NULL)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right != NULL)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left != NULL)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * per each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
+	if (node->rb_left != NULL)
+		node = node->rb_left;
+	else if (node->rb_right != NULL)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
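+	/*
+	 * With IOPRIO_BE_NR == 8 this maps ioprio 0 (highest) to weight 8
+	 * and ioprio 7 (lowest) to weight 1, i.e. an 8:1 service ratio
+	 * between the extreme best-effort priority levels.
+	 */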
+	return IOPRIO_BE_NR - ioprio;
+}
+
+void bfq_get_entity(struct io_entity *entity)
+{
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (ioq)
+		elv_get_ioq(ioq);
+}
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (node->rb_right == NULL && node->rb_left == NULL)
+		deepest = rb_parent(node);
+	else if (node->rb_right == NULL)
+		deepest = node->rb_left;
+	else if (node->rb_left == NULL)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right != NULL)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_extract(&st->active, entity);
+
+	if (node != NULL)
+		bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+
+	/* Add this queue to idle list */
+	if (ioq)
+		list_add(&ioq->queue_list, &ioq->efqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(!entity->on_st);
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	ioq = io_entity_to_ioq(entity);
+	if (!ioq)
+		return;
+	elv_put_ioq(ioq);
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+void bfq_put_idle_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	bfq_idle_extract(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+void bfq_forget_idle(struct io_service_tree *st)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * The active tree is empty. Pull vtime back to the finish
+		 * time of the last idle entity on the idle tree.
+		 * The rationale seems to be that this reduces the chance of
+		 * vtime wraparound (i.e., of bfq_gt(V, F) misordering).
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+
+static struct io_service_tree *
+__bfq_entity_update_prio(struct io_service_tree *old_st,
+				struct io_entity *entity)
+{
+	struct io_service_tree *new_st = old_st;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (entity->ioprio_changed) {
+		entity->ioprio = entity->new_ioprio;
+		entity->ioprio_class = entity->new_ioprio_class;
+		entity->ioprio_changed = 0;
+
+		/*
+		 * Also update the scaled budget for ioq. Group will get the
+		 * updated budget once ioq is selected to run next.
+		 */
+		if (ioq) {
+			struct elv_fq_data *efqd = ioq->efqd;
+			entity->budget = elv_prio_to_slice(efqd, ioq);
+		}
+
+		old_st->wsum -= entity->weight;
+		entity->weight = bfq_ioprio_to_weight(entity->ioprio);
+
+		/*
+		 * NOTE: here we may be changing the weight too early,
+		 * this will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = io_entity_service_tree(entity);
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the current budget of the entity (and the
+ * service received if @entity is active) of the queue to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct io_entity *entity, int add_front)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	if (entity == sd->active_entity) {
+		BUG_ON(entity->tree != NULL);
+		/*
+		 * If we are requeueing the current entity we have
+		 * to take care of not charging to it service it has
+		 * not received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_active entity below it.  We reuse the old
+		 * start time.
+		 */
+		bfq_active_extract(st, entity);
+	} else if (entity->tree == &st->idle) {
+		/*
+		 * Must be on the idle tree, bfq_idle_extract() will
+		 * check for that.
+		 */
+		bfq_idle_extract(st, entity);
+		entity->start = bfq_gt(st->vtime, entity->finish) ?
+				       st->vtime : entity->finish;
+	} else {
+		/*
+		 * The finish time of the entity may be invalid, and
+		 * it is in the past for sure, otherwise the queue
+		 * would have been on the idle tree.
+		 */
+		entity->start = st->vtime;
+		st->wsum += entity->weight;
+		bfq_get_entity(entity);
+
+		BUG_ON(entity->on_st);
+		entity->on_st = 1;
+	}
+
+	st = __bfq_entity_update_prio(st, entity);
+	/*
+	 * This is to emulate cfq-like functionality where preemption can
+	 * happen within the same class, e.g. a sync queue preempting an
+	 * async queue. Maybe this is not a very good idea from a fairness
+	 * point of view, as the preempting queue gains share. Keeping it
+	 * for now.
+	 */
+	if (add_front) {
+		struct io_entity *next_entity;
+
+		/*
+		 * Determine the entity which will be dispatched next.
+		 * Use sd->next_active once the hierarchical patch is applied.
+		 */
+		next_entity = bfq_lookup_next_entity(sd, 0);
+
+		if (next_entity && next_entity != entity) {
+			struct io_service_tree *new_st;
+			bfq_timestamp_t delta;
+
+			new_st = io_entity_service_tree(next_entity);
+
+			/*
+			 * At this point, both entities should belong to the
+			 * same service tree, as cross-service-tree preemption
+			 * is automatically taken care of by the algorithm.
+			 */
+			BUG_ON(new_st != st);
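+			/*
+			 * Backdate the timestamps: make this entity's finish
+			 * time sort just before the would-be next entity so
+			 * that it goes to the front of the service tree.
+			 */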
+			entity->finish = next_entity->finish - 1;
+			delta = bfq_delta(entity->budget, entity->weight);
+			entity->start = entity->finish - delta;
+			if (bfq_gt(entity->start, st->vtime))
+				entity->start = st->vtime;
+		}
+	} else {
+		bfq_calc_finish(entity, entity->budget);
+	}
+	bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity.
+ * @entity: the entity to activate.
+ */
+void bfq_activate_entity(struct io_entity *entity, int add_front)
+{
+	__bfq_activate_entity(entity, add_front);
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently from its previous state.  If the
+ * entity was not on a service tree just return, otherwise if it is on
+ * any scheduler tree, extract it from that tree; then, if the caller
+ * specified @requeue and the entity's finish time is still in the future,
+ * put it on the idle tree, otherwise forget it.
+ *
+ */
+int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+	int was_active = entity == sd->active_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	BUG_ON(was_active && entity->tree != NULL);
+
+	if (was_active) {
+		bfq_calc_finish(entity, entity->service);
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_extract(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_extract(st, entity);
+	else if (entity->tree != NULL)
+		BUG();
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	BUG_ON(sd->active_entity == entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+void bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	__bfq_deactivate_entity(entity, requeue);
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated tasks getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct io_service_tree *st)
+{
+	struct io_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
+	entry = rb_entry(node, struct io_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active - find the eligible entity with the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches the first schedulable entity, starting from the
+ * root of the tree and going on the left every time on this side there is
+ * a subtree with at least one eligible (start <= vtime) entity.  The path
+ * on the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct io_entity *bfq_first_active_entity(struct io_service_tree *st)
+{
+	struct io_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
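+	/*
+	 * The min_start annotation lets us prune the search: a subtree
+	 * whose min_start is after the vtime cannot contain an eligible
+	 * entity, so we never descend into it and the lookup stays
+	 * logarithmic.
+	 */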
+	while (node != NULL) {
+		entry = rb_entry(node, struct io_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		BUG_ON(bfq_gt(entry->min_start, st->vtime));
+
+		if (node->rb_left != NULL) {
+			entry = rb_entry(node->rb_left,
+					 struct io_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first != NULL)
+			break;
+		node = node->rb_right;
+	}
+
+	BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
+	return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct io_entity *__bfq_lookup_next_entity(struct io_service_tree *st)
+{
+	struct io_entity *entity;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+	BUG_ON(bfq_gt(entity->start, st->vtime));
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_active entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort just returning the cached next_active value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract)
+{
+	struct io_service_tree *st = sd->service_tree;
+	struct io_entity *entity;
+	int i;
+
+	/*
+	 * One can check which entity will be selected next without
+	 * expiring the current one.
+	 */
+	BUG_ON(extract && sd->active_entity != NULL);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+		entity = __bfq_lookup_next_entity(st);
+		if (entity != NULL) {
+			if (extract) {
+				bfq_active_extract(st, entity);
+				sd->active_entity = entity;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+void entity_served(struct io_entity *entity, bfq_service_t served)
+{
+	struct io_service_tree *st;
+
+	st = io_entity_service_tree(entity);
+	entity->service += served;
+	BUG_ON(st->wsum == 0);
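+	/*
+	 * The tree's vtime advances by served/wsum, i.e. the service is
+	 * scaled down by the total weight of the entities on the tree.
+	 */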
+	st->vtime += bfq_delta(served, st->wsum);
+	bfq_forget_idle(st);
+}
+
+/* Elevator fair queuing function */
+struct io_queue *rq_ioq(struct request *rq)
+{
+	return rq->ioq;
+}
+
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+	return e->efqd.active_queue;
+}
+
+void *elv_active_sched_queue(struct elevator_queue *e)
+{
+	return ioq_sched_queue(elv_active_ioq(e));
+}
+EXPORT_SYMBOL(elv_active_sched_queue);
+
+int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_ioq);
+
+int elv_nr_busy_rt_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_rt_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_rt_ioq);
+
+int elv_hw_tag(struct elevator_queue *e)
+{
+	return e->efqd.hw_tag;
+}
+EXPORT_SYMBOL(elv_hw_tag);
+
+/* Helper functions for operating on elevator idle slice timer */
+int elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return mod_timer(&efqd->idle_slice_timer, expires);
+}
+EXPORT_SYMBOL(elv_mod_idle_slice_timer);
+
+int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return del_timer(&efqd->idle_slice_timer);
+}
+EXPORT_SYMBOL(elv_del_idle_slice_timer);
+
+unsigned int elv_get_slice_idle(struct elevator_queue *eq)
+{
+	return eq->efqd.elv_slice_idle;
+}
+EXPORT_SYMBOL(elv_get_slice_idle);
+
+void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
+{
+	entity_served(&ioq->entity, served);
+}
+
+/* Tells whether ioq is queued in root group or not */
+static inline int is_root_group_ioq(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return (ioq->entity.sched_data == &efqd->root_group->sched_data);
+}
+
+/* Functions to show and store elv_idle_slice value through sysfs */
+ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = jiffies_to_msecs(efqd->elv_slice_idle);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data < 0)
+		data = 0;
+	else if (data > INT_MAX)
+		data = INT_MAX;
+
+	data = msecs_to_jiffies(data);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->elv_slice_idle = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
+/* Functions to show and store elv_slice_sync value through sysfs */
+ssize_t elv_slice_sync_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = efqd->elv_slice[1];
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data < 0)
+		data = 0;
+	/* 100ms is the limit for now */
+	else if (data > 100)
+		data = 100;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->elv_slice[1] = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
+/* Functions to show and store elv_slice_async value through sysfs */
+ssize_t elv_slice_async_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = efqd->elv_slice[0];
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data < 0)
+		data = 0;
+	/* 100ms is the limit for now */
+	else if (data > 100)
+		data = 100;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->elv_slice[0] = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (elv_nr_busy_ioq(q->elevator)) {
+		elv_log(efqd, "schedule dispatch");
+		kblockd_schedule_work(efqd->queue, &efqd->unplug_work);
+	}
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+void elv_kick_queue(struct work_struct *work)
+{
+	struct elv_fq_data *efqd =
+		container_of(work, struct elv_fq_data, unplug_work);
+	struct request_queue *q = efqd->queue;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	blk_start_queueing(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+	del_timer_sync(&e->efqd.idle_slice_timer);
+	cancel_work_sync(&e->efqd.unplug_work);
+}
+EXPORT_SYMBOL(elv_shutdown_timer_wq);
+
+void elv_ioq_set_prio_slice(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	ioq->slice_end = jiffies + ioq->entity.budget;
+	elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->entity.budget);
+}
+
+static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	unsigned long elapsed = jiffies - ioq->last_end_request;
+	unsigned long ttime = min(elapsed, 2UL * efqd->elv_slice_idle);
+
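+	/*
+	 * Decaying average of the think time: each new sample gets a 1/8
+	 * weight, and the totals are kept in fixed point (scaled by 256,
+	 * with +128 for rounding) to preserve precision.
+	 */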
+	ioq->ttime_samples = (7*ioq->ttime_samples + 256) / 8;
+	ioq->ttime_total = (7*ioq->ttime_total + 256*ttime) / 8;
+	ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
+}
+
+/*
+ * Disable the idle window if the process thinks for too long.
+ * This idle flag can also be updated by the io scheduler.
+ */
+static void elv_ioq_update_idle_window(struct elevator_queue *eq,
+				struct io_queue *ioq, struct request *rq)
+{
+	int old_idle, enable_idle;
+	struct elv_fq_data *efqd = ioq->efqd;
+
+	/*
+	 * Don't idle for async or idle io prio class
+	 */
+	if (!elv_ioq_sync(ioq) || elv_ioq_class_idle(ioq))
+		return;
+
+	enable_idle = old_idle = elv_ioq_idle_window(ioq);
+
+	if (!efqd->elv_slice_idle)
+		enable_idle = 0;
+	else if (ioq_sample_valid(ioq->ttime_samples)) {
+		if (ioq->ttime_mean > efqd->elv_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+
+	/*
+	 * From a think time perspective, idling should be enabled. Check
+	 * with the io scheduler if it wants to disable idling based on
+	 * additional considerations such as seek pattern.
+	 */
+	if (enable_idle) {
+		if (eq->ops->elevator_update_idle_window_fn)
+			enable_idle = eq->ops->elevator_update_idle_window_fn(
+						eq, ioq->sched_queue, rq);
+		if (!enable_idle)
+			elv_log_ioq(efqd, ioq, "iosched disabled idle");
+	}
+
+	if (old_idle != enable_idle) {
+		elv_log_ioq(efqd, ioq, "idle=%d", enable_idle);
+		if (enable_idle)
+			elv_mark_ioq_idle_window(ioq);
+		else
+			elv_clear_ioq_idle_window(ioq);
+	}
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+	struct io_queue *ioq = NULL;
+
+	ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+	return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+	kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+			void *sched_queue, int ioprio_class, int ioprio,
+			int is_sync)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+	struct io_group *iog = io_lookup_io_group_current(efqd->queue);
+
+	RB_CLEAR_NODE(&ioq->entity.rb_node);
+	atomic_set(&ioq->ref, 0);
+	ioq->efqd = efqd;
+	elv_ioq_set_ioprio_class(ioq, ioprio_class);
+	elv_ioq_set_ioprio(ioq, ioprio);
+	ioq->pid = current->pid;
+	ioq->sched_queue = sched_queue;
+	if (is_sync && !elv_ioq_class_idle(ioq))
+		elv_mark_ioq_idle_window(ioq);
+	bfq_init_entity(&ioq->entity, iog);
+	ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
+	return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
+						efqd);
+
+	BUG_ON(atomic_read(&ioq->ref) <= 0);
+	if (!atomic_dec_and_test(&ioq->ref))
+		return;
+	BUG_ON(ioq->nr_queued);
+	BUG_ON(ioq->entity.tree != NULL);
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(efqd->active_queue == ioq);
+
+	/* Can be called by outgoing elevator. Don't use q */
+	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+
+	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+	elv_log_ioq(efqd, ioq, "put_queue");
+	elv_free_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_put_ioq);
+
+void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+	struct io_queue *ioq = *ioq_ptr;
+
+	if (ioq != NULL) {
+		/* Drop the reference taken by the io group */
+		elv_put_ioq(ioq);
+		*ioq_ptr = NULL;
+	}
+}
+
+/*
+ * Normally the next io queue to be served is selected from the service
+ * tree. This function allows one to choose a specific io queue to run next,
+ * out of order. This is primarily to accommodate the close_cooperator
+ * feature of cfq.
+ *
+ * Currently this is done only at the root level; to begin with, the close
+ * cooperator feature is supported only for the root group, so that default
+ * cfq behavior in a flat hierarchy is not changed.
+ */
+void elv_set_next_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	struct io_sched_data *sd = &efqd->root_group->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	BUG_ON(efqd->active_queue != NULL || sd->active_entity != NULL);
+	BUG_ON(!efqd->busy_queues);
+	BUG_ON(sd != entity->sched_data);
+	BUG_ON(!st);
+
+	bfq_update_vtime(st);
+	bfq_active_extract(st, entity);
+	sd->active_entity = entity;
+	entity->service = 0;
+	elv_log_ioq(efqd, ioq, "set_next_ioq");
+}
+
+/* Get next queue for service. */
+struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = NULL;
+	struct io_queue *ioq = NULL;
+	struct io_sched_data *sd;
+
+	/*
+	 * One can check which queue will be selected next while another
+	 * queue is still active; the preempt logic uses this.
+	 */
+	BUG_ON(extract && efqd->active_queue != NULL);
+
+	if (!efqd->busy_queues)
+		return NULL;
+
+	sd = &efqd->root_group->sched_data;
+	if (extract)
+		entity = bfq_lookup_next_entity(sd, 1);
+	else
+		entity = bfq_lookup_next_entity(sd, 0);
+
+	BUG_ON(!entity);
+	if (extract)
+		entity->service = 0;
+	ioq = io_entity_to_ioq(entity);
+
+	return ioq;
+}
+
+/*
+ * coop indicates that the io scheduler selected a queue for us and that
+ * we did not select the next queue based on fairness.
+ */
+static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int coop)
+{
+	struct request_queue *q = efqd->queue;
+
+	if (ioq) {
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+							efqd->busy_queues);
+		ioq->slice_end = 0;
+
+		elv_clear_ioq_wait_request(ioq);
+		elv_clear_ioq_must_dispatch(ioq);
+		elv_mark_ioq_slice_new(ioq);
+
+		del_timer(&efqd->idle_slice_timer);
+	}
+
+	efqd->active_queue = ioq;
+
+	/* Let iosched know if it wants to take some action */
+	if (ioq) {
+		if (q->elevator->ops->elevator_active_ioq_set_fn)
+			q->elevator->ops->elevator_active_ioq_set_fn(q,
+							ioq->sched_queue, coop);
+	}
+}
+
+/* Get and set a new active queue for service. */
+struct io_queue *elv_set_active_ioq(struct request_queue *q,
+						struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	int coop = 0;
+
+	if (!ioq)
+		ioq = elv_get_next_ioq(q, 1);
+	else {
+		elv_set_next_ioq(q, ioq);
+		/*
+		 * The io scheduler selected the next queue for us. Pass this
+		 * info back to the io scheduler; cfq currently uses it to
+		 * reset the coop flag on the queue.
+		 */
+		coop = 1;
+	}
+	__elv_set_active_ioq(efqd, ioq, coop);
+	return ioq;
+}
+
+void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+	struct request_queue *q = efqd->queue;
+	struct io_queue *ioq = elv_active_ioq(efqd->queue->elevator);
+
+	if (q->elevator->ops->elevator_active_ioq_reset_fn)
+		q->elevator->ops->elevator_active_ioq_reset_fn(q,
+							ioq->sched_queue);
+	efqd->active_queue = NULL;
+	del_timer(&efqd->idle_slice_timer);
+}
+
+void elv_activate_ioq(struct io_queue *ioq, int add_front)
+{
+	bfq_activate_entity(&ioq->entity, add_front);
+}
+
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue)
+{
+	if (ioq == efqd->active_queue)
+		elv_reset_active_ioq(efqd);
+
+	bfq_deactivate_entity(&ioq->entity, requeue);
+}
+
+/* Called when an inactive queue receives a new request. */
+void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(ioq == efqd->active_queue);
+	elv_log_ioq(efqd, ioq, "add to busy");
+	elv_activate_ioq(ioq, 0);
+	elv_mark_ioq_busy(ioq);
+	efqd->busy_queues++;
+	if (elv_ioq_class_rt(ioq))
+		efqd->busy_rt_queues++;
+}
+
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	BUG_ON(!elv_ioq_busy(ioq));
+	BUG_ON(ioq->nr_queued);
+	elv_log_ioq(efqd, ioq, "del from busy");
+	elv_clear_ioq_busy(ioq);
+	BUG_ON(efqd->busy_queues == 0);
+	efqd->busy_queues--;
+	if (elv_ioq_class_rt(ioq))
+		efqd->busy_rt_queues--;
+
+	elv_deactivate_ioq(efqd, ioq, requeue);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * the current queue used and adjust the start and finish times of the
+ * queue and the vtime of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations, especially when the underlying device supports command
+ * queuing and requests from multiple queues can be outstanding at the
+ * same time; then it is not clear which queue consumed how much disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of a queue only
+ * after the first request from the queue has completed. This does not
+ * work very well if we expire the queue before waiting for the first (and
+ * further) requests to finish. For seeky queues, we will expire the queue
+ * after dispatching a few requests without waiting and start dispatching
+ * from the next queue.
+ *
+ * It is not clear how to determine the time consumed by a queue in such
+ * scenarios. Currently, as a crude approximation, we charge 25% of the
+ * time slice for such cases. A better mechanism is needed for accurate
+ * accounting.
+ */
+void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	long slice_unused = 0, slice_used = 0, slice_overshoot = 0;
+
+	assert_spin_locked(q->queue_lock);
+	elv_log_ioq(efqd, ioq, "slice expired");
+
+	if (elv_ioq_wait_request(ioq))
+		del_timer(&efqd->idle_slice_timer);
+
+	elv_clear_ioq_wait_request(ioq);
+
+	/*
+	 * If ioq->slice_end == 0, the queue was expired before its first
+	 * request completed. Of course we are not planning to idle on the
+	 * queue, otherwise we would not have expired it.
+	 *
+	 * Charge 25% of the slice in such cases. This is not the best thing
+	 * to do, but it is not clear what the next best thing would be.
+	 *
+	 * This arises from the fact that we don't have the notion of only
+	 * one queue being operational at a time. The io scheduler can
+	 * dispatch requests from multiple queues in one dispatch round.
+	 * Ideally, for more accurate accounting of the disk time used, one
+	 * should dispatch requests from only one queue and wait for all
+	 * the requests to finish. But this would reduce throughput.
+	 */
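+	/*
+	 * Example: with a 100ms budget, a queue expired before its first
+	 * completion is charged 25ms; one expired 30ms early is charged
+	 * 70ms; one that overran its slice by 20ms is charged 120ms.
+	 */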
+	if (!ioq->slice_end)
+		slice_used = entity->budget/4;
+	else {
+		if (time_after(ioq->slice_end, jiffies)) {
+			slice_unused = ioq->slice_end - jiffies;
+			if (slice_unused == entity->budget) {
+				/*
+				 * The queue got expired immediately after
+				 * completing its first request. Charge 25%
+				 * of the slice.
+				 */
+				slice_used = entity->budget/4;
+			} else
+				slice_used = entity->budget - slice_unused;
+		} else {
+			slice_overshoot = jiffies - ioq->slice_end;
+			slice_used = entity->budget + slice_overshoot;
+		}
+	}
+
+	elv_log_ioq(efqd, ioq, "sl_end=%lx, jiffies=%lx", ioq->slice_end,
+			jiffies);
+	elv_log_ioq(efqd, ioq, "sl_used=%ld, budget=%ld overshoot=%ld",
+				slice_used, entity->budget, slice_overshoot);
+	elv_ioq_served(ioq, slice_used);
+
+	BUG_ON(ioq != efqd->active_queue);
+	elv_reset_active_ioq(efqd);
+
+	if (!ioq->nr_queued)
+		elv_del_ioq_busy(q->elevator, ioq, 1);
+	else
+		elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(__elv_ioq_slice_expired);
+
+/*
+ *  Expire the ioq.
+ */
+void elv_ioq_slice_expired(struct request_queue *q)
+{
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (ioq)
+		__elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no (or if we aren't sure); returning 1 will cause a preemption attempt.
+ */
+int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+			struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elevator_queue *eq = q->elevator;
+
+	ioq = elv_active_ioq(eq);
+
+	if (!ioq)
+		return 0;
+
+	if (elv_ioq_slice_used(ioq))
+		return 1;
+
+	if (elv_ioq_class_idle(new_ioq))
+		return 0;
+
+	if (elv_ioq_class_idle(ioq))
+		return 1;
+
+	/*
+	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+	 */
+	if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
+		return 1;
+
+	/*
+	 * Check with io scheduler if it has additional criterion based on
+	 * which it wants to preempt existing queue.
+	 */
+	if (eq->ops->elevator_should_preempt_fn)
+		return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
+
+	return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
+	elv_ioq_slice_expired(q);
+
+	/*
+	 * Put the new queue at the front of the current list,
+	 * so we know that it will be selected next.
+	 */
+
+	elv_activate_ioq(ioq, 1);
+	elv_ioq_set_slice_end(ioq, 0);
+	elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	BUG_ON(!efqd);
+	BUG_ON(!ioq);
+	efqd->rq_queued++;
+	ioq->nr_queued++;
+
+	if (!elv_ioq_busy(ioq))
+		elv_add_ioq_busy(efqd, ioq);
+
+	elv_ioq_update_io_thinktime(ioq);
+	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+
+	if (ioq == elv_active_ioq(q->elevator)) {
+		/*
+		 * Remember that we saw a request from this process, but
+		 * don't start queuing just yet. Otherwise we risk seeing lots
+		 * of tiny requests, because we disrupt the normal plugging
+		 * and merging. If the request is already larger than a single
+		 * page, let it rip immediately. For that case we assume that
+		 * merging is already done. Ditto for a busy system that
+		 * has other work pending, don't risk delaying until the
+		 * idle timer unplug to continue working.
+		 */
+		if (elv_ioq_wait_request(ioq)) {
+			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+			    efqd->busy_queues > 1) {
+				del_timer(&efqd->idle_slice_timer);
+				blk_start_queueing(q);
+			}
+			elv_mark_ioq_must_dispatch(ioq);
+		}
+	} else if (elv_should_preempt(q, ioq, rq)) {
+		/*
+		 * not the active queue - expire the current slice if it is
+		 * idle and has expired its mean thinktime, or if this new
+		 * queue has some old slice time left and is of higher
+		 * priority, or if this new queue is RT and the current one
+		 * is BE
+		 */
+		elv_preempt_queue(q, ioq);
+		blk_start_queueing(q);
+	}
+}
+
+void elv_idle_slice_timer(unsigned long data)
+{
+	struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+	struct io_queue *ioq;
+	unsigned long flags;
+	struct request_queue *q = efqd->queue;
+
+	elv_log(efqd, "idle timer fired");
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	ioq = efqd->active_queue;
+
+	if (ioq) {
+
+		/*
+		 * We saw a request before the queue expired, let it through
+		 */
+		if (elv_ioq_must_dispatch(ioq))
+			goto out_kick;
+
+		/*
+		 * expired
+		 */
+		if (elv_ioq_slice_used(ioq))
+			goto expire;
+
+		/*
+		 * only expire and reinvoke request handler, if there are
+		 * other queues with pending requests
+		 */
+		if (!elv_nr_busy_ioq(q->elevator))
+			goto out_cont;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (ioq->nr_queued)
+			goto out_kick;
+	}
+expire:
+	elv_ioq_slice_expired(q);
+out_kick:
+	elv_schedule_dispatch(q);
+out_cont:
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	unsigned long sl;
+
+	BUG_ON(!ioq);
+
+	/*
+	 * SSD device without seek penalty, disable idling. But only do so
+	 * for devices that support queuing, otherwise we still have a problem
+	 * with sync vs async workloads.
+	 */
+	if (blk_queue_nonrot(q) && efqd->hw_tag)
+		return;
+
+	/*
+	 * still requests with the driver, don't idle
+	 */
+	if (efqd->rq_in_driver)
+		return;
+
+	/*
+	 * idle is disabled, either manually or by past process history
+	 */
+	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+		return;
+
+	/*
+	 * Maybe the iosched has its own idling logic. In that case the io
+	 * scheduler will take care of arming the timer, if need be.
+	 */
+	if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+		q->elevator->ops->elevator_arm_slice_timer_fn(q,
+						ioq->sched_queue);
+	} else {
+		elv_mark_ioq_wait_request(ioq);
+		sl = efqd->elv_slice_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log(efqd, "arm idle: %lu", sl);
+	}
+}
+
+void elv_free_idle_ioq_list(struct elevator_queue *e)
+{
+	struct io_queue *ioq, *n;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
+		elv_deactivate_ioq(efqd, ioq, 0);
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+
+	if (!elv_nr_busy_ioq(q->elevator))
+		return NULL;
+
+	if (ioq == NULL)
+		goto new_queue;
+
+	/*
+	 * Force dispatch. Continue to dispatch from current queue as long
+	 * as it has requests.
+	 */
+	if (unlikely(force)) {
+		if (ioq->nr_queued)
+			goto keep_queue;
+		else
+			goto expire;
+	}
+
+	/*
+	 * The active queue has run out of time, expire it and select new.
+	 */
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+		goto expire;
+
+	/*
+	 * If we have an RT queue waiting, then we pre-empt the current
+	 * non-RT queue.
+	 */
+	if (!elv_ioq_class_rt(ioq) && efqd->busy_rt_queues) {
+		/*
+		 * We simulate this as if the queue timed out, so that it
+		 * gets to bank the remainder of its time slice.
+		 */
+		elv_log_ioq(efqd, ioq, "preempt");
+		goto expire;
+	}
+
+	/*
+	 * The active queue has requests and isn't expired, allow it to
+	 * dispatch.
+	 */
+
+	if (ioq->nr_queued)
+		goto keep_queue;
+
+	/*
+	 * If another queue has a request waiting within our mean seek
+	 * distance, let it run.  The expire code will check for close
+	 * cooperators and put the close queue at the front of the service
+	 * tree.
+	 */
+	new_ioq = elv_close_cooperator(q, ioq, 0);
+	if (new_ioq)
+		goto expire;
+
+	/*
+	 * No requests pending. If the active queue still has requests in
+	 * flight or is idling for a new request, allow either of these
+	 * conditions to happen (or time out) before selecting a new queue.
+	 */
+
+	if (timer_pending(&efqd->idle_slice_timer) ||
+	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
+expire:
+	elv_ioq_slice_expired(q);
+new_queue:
+	ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+	return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	ioq = rq->ioq;
+	BUG_ON(!ioq);
+	ioq->nr_queued--;
+
+	efqd = ioq->efqd;
+	BUG_ON(!efqd);
+	efqd->rq_queued--;
+
+	if (elv_ioq_busy(ioq) && (elv_active_ioq(e) != ioq) && !ioq->nr_queued)
+		elv_del_ioq_busy(e, ioq, 1);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	BUG_ON(!ioq);
+	elv_ioq_request_dispatched(ioq);
+	elv_ioq_request_removed(e, rq);
+	elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	efqd->rq_in_driver++;
+	elv_log_ioq(efqd, rq_ioq(rq), "activate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+	elv_log_ioq(efqd, rq_ioq(rq), "deactivate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+/*
+ * Update hw_tag based on peak queue depth over 50 samples under
+ * sufficient load.
+ */
+static void elv_update_hw_tag(struct elv_fq_data *efqd)
+{
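+	/*
+	 * Mark the device as command-queuing (hw_tag) if, over a window of
+	 * 50 sufficiently loaded samples, the peak number of requests
+	 * outstanding in the driver reached ELV_HW_QUEUE_MIN (5).
+	 */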
+	if (efqd->rq_in_driver > efqd->rq_in_driver_peak)
+		efqd->rq_in_driver_peak = efqd->rq_in_driver;
+
+	if (efqd->rq_queued <= ELV_HW_QUEUE_MIN &&
+	    efqd->rq_in_driver <= ELV_HW_QUEUE_MIN)
+		return;
+
+	if (efqd->hw_tag_samples++ < 50)
+		return;
+
+	if (efqd->rq_in_driver_peak >= ELV_HW_QUEUE_MIN)
+		efqd->hw_tag = 1;
+	else
+		efqd->hw_tag = 0;
+
+	efqd->hw_tag_samples = 0;
+	efqd->rq_in_driver_peak = 0;
+}
+
+/*
+ * If the io scheduler keeps track of close cooperators, check with it
+ * whether it has a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+					struct io_queue *ioq, int probe)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *new_ioq = NULL;
+
+	/*
+	 * Currently this feature is supported only for a flat hierarchy or
+	 * for root group queues, so that default cfq behavior is not changed.
+	 */
+	if (!is_root_group_ioq(q, ioq))
+		return NULL;
+
+	if (q->elevator->ops->elevator_close_cooperator_fn)
+		new_ioq = e->ops->elevator_close_cooperator_fn(q,
+						ioq->sched_queue, probe);
+
+	/* Only select co-operating queue if it belongs to root group */
+	if (new_ioq && !is_root_group_ioq(q, new_ioq))
+		return NULL;
+
+	return new_ioq;
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+	const int sync = rq_is_sync(rq);
+	struct io_queue *ioq = rq->ioq;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	elv_log_ioq(efqd, ioq, "complete");
+
+	elv_update_hw_tag(efqd);
+
+	WARN_ON(!efqd->rq_in_driver);
+	WARN_ON(!ioq->dispatched);
+	efqd->rq_in_driver--;
+	ioq->dispatched--;
+
+	if (sync)
+		ioq->last_end_request = jiffies;
+
+	/*
+	 * If this is the active queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+
+	if (elv_active_ioq(q->elevator) == ioq) {
+		if (elv_ioq_slice_new(ioq)) {
+			elv_ioq_set_prio_slice(q, ioq);
+			elv_clear_ioq_slice_new(ioq);
+		}
+		/*
+		 * If there are no requests waiting in this queue, and
+		 * there are other queues ready to issue requests, AND
+		 * those other queues are issuing requests within our
+		 * mean seek distance, give them a chance to run instead
+		 * of idling.
+		 */
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+			elv_ioq_slice_expired(q);
+		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+			 && sync && !rq_noidle(rq))
+			elv_ioq_arm_slice_timer(q);
+	}
+
+	if (!efqd->rq_in_driver)
+		elv_schedule_dispatch(q);
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio)
+{
+	struct io_queue *ioq = NULL;
+
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		ioq = iog->async_queue[0][ioprio];
+		break;
+	case IOPRIO_CLASS_BE:
+		ioq = iog->async_queue[1][ioprio];
+		break;
+	case IOPRIO_CLASS_IDLE:
+		ioq = iog->async_idle_queue;
+		break;
+	default:
+		BUG();
+	}
+
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+EXPORT_SYMBOL(io_group_async_queue_prio);
+
+void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		iog->async_queue[0][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_BE:
+		iog->async_queue[1][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		iog->async_idle_queue = ioq;
+		break;
+	default:
+		BUG();
+	}
+
+	/*
+	 * Take the group reference and pin the queue. Group exit will
+	 * clean it up
+	 */
+	elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(io_group_set_async_queue);
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+static void elv_slab_kill(void)
+{
+	/*
+	 * Caller already ensured that pending RCU callbacks are completed,
+	 * so we should have no busy allocations at this point.
+	 */
+	if (elv_ioq_pool)
+		kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+	elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+	if (!elv_ioq_pool)
+		goto fail;
+
+	return 0;
+fail:
+	elv_slab_kill();
+	return -ENOMEM;
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	iog = io_alloc_root_group(q, e, efqd);
+	if (iog == NULL)
+		return 1;
+
+	efqd->root_group = iog;
+	efqd->queue = q;
+
+	init_timer(&efqd->idle_slice_timer);
+	efqd->idle_slice_timer.function = elv_idle_slice_timer;
+	efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+	INIT_LIST_HEAD(&efqd->idle_list);
+
+	efqd->elv_slice[0] = elv_slice_async;
+	efqd->elv_slice[1] = elv_slice_sync;
+	efqd->elv_slice_idle = elv_slice_idle;
+	efqd->hw_tag = 1;
+
+	return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask the elevator to clean up its queues, we do the cleanup here so
+ * that all the group and idle tree references to the ioq are dropped.
+ * Later, during elevator cleanup, the ioc reference will be dropped, which
+ * will lead to removal of the ioscheduler queue as well as the associated
+ * ioq object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+	struct request_queue *q = efqd->queue;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+
+	spin_lock_irq(q->queue_lock);
+	/* This should drop all the idle tree references of ioq */
+	elv_free_idle_ioq_list(e);
+	spin_unlock_irq(q->queue_lock);
+
+	elv_shutdown_timer_wq(e);
+
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+	io_free_root_group(e);
+}
+
+/*
+ * This is called after the io scheduler has cleaned up its data structures.
+ * I don't think that this function is required. Right now just keeping it
+ * because cfq cleans up the timer and work queue again after freeing up
+ * io contexts. To me the io scheduler has already been drained out, and
+ * all the active queues have already been expired, so the timer and work
+ * queue should not be activated during the cleanup process.
+ *
+ * Keeping it here for the time being. Will get rid of it later.
+ */
+void elv_exit_fq_data_post(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+}
+
+
+static int __init elv_fq_init(void)
+{
+	if (elv_slab_setup())
+		return -ENOMEM;
+
+	/* could be 0 on HZ < 1000 setups */
+
+	if (!elv_slice_async)
+		elv_slice_async = 1;
+
+	if (!elv_slice_idle)
+		elv_slice_idle = 1;
+
+	return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..3bea279
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,488 @@
+/*
+ * BFQ: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ */
+
+#include <linux/blkdev.h>
+
+#ifndef _BFQ_SCHED_H
+#define _BFQ_SCHED_H
+
+#define IO_IOPRIO_CLASSES	3
+
+typedef u64 bfq_timestamp_t;
+typedef unsigned long bfq_weight_t;
+typedef unsigned long bfq_service_t;
+struct io_entity;
+struct io_queue;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/**
+ * struct io_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * io_service_tree.  All the fields are protected by the queue lock
+ * of the containing efqd.
+ */
+struct io_service_tree {
+	struct rb_root active;
+	struct rb_root idle;
+
+	struct io_entity *first_idle;
+	struct io_entity *last_idle;
+
+	bfq_timestamp_t vtime;
+	bfq_weight_t wsum;
+};
+
+/**
+ * struct io_sched_data - multi-class scheduler.
+ * @active_entity: entity under service.
+ * @next_active: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * io_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_active points to the active entity of the sched_data service
+ * trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order: IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among queues of the same class,
+ * service is distributed according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io_sched_data {
+	struct io_entity *active_entity;
+	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+/**
+ * struct io_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ *         the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ *             this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
+ *                 associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ *                    ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested an ioprio or
+ *                  ioprio_class change.
+ *
+ * An io_entity is used to represent either an io_queue (leaf node in the
+ * cgroup hierarchy) or an io_group in the upper level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities also have their own sched_data, stored
+ * in @my_sched_data.
+ *
+ * Each entity stores its priority values independently; this would allow
+ * different weights on different devices, but this functionality is not
+ * exported to userspace yet.  Priorities are updated lazily, first
+ * storing the new values into the new_* fields, then setting the
+ * @ioprio_changed flag.  As soon as there is a transition in the entity
+ * state that allows the priority update to take place the effective and
+ * the requested priority values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.  When dealing with ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time to consume their budget and
+ * have true sequential behavior, and when there are no external factors
+ * breaking anticipation) the relative weights at each level of the
+ * cgroups hierarchy should be guaranteed.
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io_entity {
+	struct rb_node rb_node;
+
+	int on_st;
+
+	bfq_timestamp_t finish;
+	bfq_timestamp_t start;
+
+	struct rb_root *tree;
+
+	bfq_timestamp_t min_start;
+
+	bfq_service_t service, budget;
+	bfq_weight_t weight;
+
+	struct io_entity *parent;
+
+	struct io_sched_data *my_sched_data;
+	struct io_sched_data *sched_data;
+
+	unsigned short ioprio, new_ioprio;
+	unsigned short ioprio_class, new_ioprio_class;
+
+	int ioprio_changed;
+};
+
+/*
+ * A common structure embedded by every io scheduler into its respective
+ * queue structure.
+ */
+struct io_queue {
+	struct io_entity entity;
+	atomic_t ref;
+	unsigned int flags;
+
+	/* Pointer to generic elevator data structure */
+	struct elv_fq_data *efqd;
+	struct list_head queue_list;
+	pid_t pid;
+
+	/* Number of requests queued on this io queue */
+	unsigned long nr_queued;
+
+	/* Requests dispatched from this queue */
+	int dispatched;
+
+	/* Keep track of the think time of processes in this queue */
+	unsigned long last_end_request;
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+
+	unsigned long slice_end;
+
+	/* Pointer to io scheduler's queue */
+	void *sched_queue;
+};
+
+struct io_group {
+	struct io_sched_data sched_data;
+
+	/* async_queue and async_idle_queue are used only by cfq */
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+};
+
+struct elv_fq_data {
+	struct io_group *root_group;
+
+	/* List of io queues on idle tree. */
+	struct list_head idle_list;
+
+	struct request_queue *queue;
+	unsigned int busy_queues;
+	/*
+	 * Used to track any pending rt requests so we can preempt the
+	 * current non-RT queue in service when this value is non-zero.
+	 */
+	unsigned int busy_rt_queues;
+
+	/* Number of requests queued */
+	int rq_queued;
+
+	/* Pointer to the ioscheduler queue being served */
+	void *active_queue;
+
+	int rq_in_driver;
+	int hw_tag;
+	int hw_tag_samples;
+	int rq_in_driver_peak;
+
+	/*
+	 * The elevator fair queuing layer can idle on a queue to ensure
+	 * fairness for processes doing dependent reads, for example two
+	 * processes doing synchronous reads in two different cgroups.
+	 * noop and deadline have no notion of anticipation/idling of their
+	 * own, so as of now they are the users of this functionality.
+	 */
+	unsigned int elv_slice_idle;
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
+	unsigned int elv_slice[2];
+};
+
+extern int elv_slice_idle;
+extern int elv_slice_async;
+
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
+				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples)   ((samples) > 80)
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_busy = 0,          /* has requests or is under service */
+	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+	ELV_QUEUE_FLAG_idle_window,	  /* elevator slice idling enabled */
+	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
+	ELV_QUEUE_FLAG_must_dispatch,	  /* must be allowed a dispatch */
+	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
+	ELV_QUEUE_FLAG_NR,
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name)					\
+static inline void elv_mark_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline int elv_ioq_##name(struct io_queue *ioq)         		\
+{                                                                       \
+	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
+}
+
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
+
+static inline struct io_service_tree *
+io_entity_service_tree(struct io_entity *entity)
+{
+	struct io_sched_data *sched_data = entity->sched_data;
+	unsigned int idx = entity->ioprio_class - 1;
+
+	BUG_ON(idx >= IO_IOPRIO_CLASSES);
+	BUG_ON(sched_data == NULL);
+
+	return sched_data->service_tree + idx;
+}
+
+/* A request got dispatched from the io_queue. Do the accounting. */
+static inline void elv_ioq_request_dispatched(struct io_queue *ioq)
+{
+	ioq->dispatched++;
+}
+
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+	if (elv_ioq_slice_new(ioq))
+		return 0;
+	if (time_before(jiffies, ioq->slice_end))
+		return 0;
+
+	return 1;
+}
+
+/* How many requests are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+	return ioq->dispatched;
+}
+
+/* How many requests are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+	return ioq->nr_queued;
+}
+
+static inline pid_t elv_ioq_pid(struct io_queue *ioq)
+{
+	return ioq->pid;
+}
+
+static inline unsigned long elv_ioq_ttime_mean(struct io_queue *ioq)
+{
+	return ioq->ttime_mean;
+}
+
+static inline unsigned long elv_ioq_sample_valid(struct io_queue *ioq)
+{
+	return ioq_sample_valid(ioq->ttime_samples);
+}
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+	atomic_inc(&ioq->ref);
+}
+
+static inline void elv_ioq_set_slice_end(struct io_queue *ioq,
+						unsigned long slice_end)
+{
+	ioq->slice_end = slice_end;
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+						int ioprio_class)
+{
+	ioq->entity.new_ioprio_class = ioprio_class;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+	ioq->entity.new_ioprio = ioprio;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq)
+{
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+
+static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return container_of(ioq->entity.sched_data, struct io_group,
+						sched_data);
+}
+
+/* Functions used by blksysfs.c */
+extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_sync_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+						size_t count);
+
+/* Functions used by elevator.c */
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+extern void elv_exit_fq_data_post(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+					struct request *rq);
+extern void elv_fq_dispatched_request(struct elevator_queue *e,
+					struct request *rq);
+
+extern void elv_fq_activate_rq(struct request_queue *q, struct request *rq);
+extern void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+				struct request *rq);
+
+extern void *elv_fq_select_ioq(struct request_queue *q, int force);
+extern struct io_queue *rq_ioq(struct request *rq);
+
+/* Functions used by io schedulers */
+extern void elv_put_ioq(struct io_queue *ioq);
+extern void __elv_ioq_slice_expired(struct request_queue *q,
+					struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+		void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern int elv_hw_tag(struct elevator_queue *e);
+extern void *elv_active_sched_queue(struct elevator_queue *e);
+extern int elv_mod_idle_slice_timer(struct elevator_queue *eq,
+					unsigned long expires);
+extern int elv_del_idle_slice_timer(struct elevator_queue *eq);
+extern unsigned int elv_get_slice_idle(struct elevator_queue *eq);
+extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio);
+extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq);
+extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern int elv_nr_busy_ioq(struct elevator_queue *e);
+extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_init_fq_data(struct request_queue *q,
+					struct elevator_queue *e)
+{
+	return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+static inline void elv_exit_fq_data_post(struct elevator_queue *e) {}
+
+static inline void elv_fq_activate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_deactivate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_dispatched_request(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_removed(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_add(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_ioq_completed_request(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline struct io_queue *rq_ioq(struct request *rq) { return NULL; }
+static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 7073a90..c2f07f5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -231,6 +231,9 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
 	for (i = 0; i < ELV_HASH_ENTRIES; i++)
 		INIT_HLIST_HEAD(&eq->hash[i]);
 
+	if (elv_init_fq_data(q, eq))
+		goto err;
+
 	return eq;
 err:
 	kfree(eq);
@@ -301,9 +304,11 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
 	e->ops = NULL;
+	elv_exit_fq_data_post(e);
 	mutex_unlock(&e->sysfs_lock);
 
 	kobject_put(&e->kobj);
@@ -314,6 +319,8 @@ static void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_activate_rq(q, rq);
+
 	if (e->ops->elevator_activate_req_fn)
 		e->ops->elevator_activate_req_fn(q, rq);
 }
@@ -322,6 +329,8 @@ static void elv_deactivate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_deactivate_rq(q, rq);
+
 	if (e->ops->elevator_deactivate_req_fn)
 		e->ops->elevator_deactivate_req_fn(q, rq);
 }
@@ -446,6 +455,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	boundary = q->end_sector;
 	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -486,6 +496,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	q->end_sector = rq_end_sector(rq);
 	q->boundary_rq = rq;
@@ -553,6 +564,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	elv_rqhash_del(q, next);
 
 	q->nr_sorted--;
+	elv_ioq_request_removed(e, next);
 	q->last_merge = rq;
 }
 
@@ -657,12 +669,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 				q->last_merge = rq;
 		}
 
-		/*
-		 * Some ioscheds (cfq) run q->request_fn directly, so
-		 * rq cannot be accessed after calling
-		 * elevator_add_req_fn.
-		 */
 		q->elevator->ops->elevator_add_req_fn(q, rq);
+		elv_ioq_request_add(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_REQUEUE:
@@ -872,13 +880,12 @@ void elv_dequeue_request(struct request_queue *q, struct request *rq)
 
 int elv_queue_empty(struct request_queue *q)
 {
-	struct elevator_queue *e = q->elevator;
-
 	if (!list_empty(&q->queue_head))
 		return 0;
 
-	if (e->ops->elevator_queue_empty_fn)
-		return e->ops->elevator_queue_empty_fn(q);
+	/* nr_sorted should cover this; no need to call queue_empty_fn */
+	if (q->nr_sorted)
+		return 0;
 
 	return 1;
 }
@@ -953,8 +960,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight--;
-		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
-			e->ops->elevator_completed_req_fn(q, rq);
+		if (blk_sorted_rq(rq)) {
+			if (e->ops->elevator_completed_req_fn)
+				e->ops->elevator_completed_req_fn(q, rq);
+			elv_ioq_completed_request(q, rq);
+		}
 	}
 
 	/*
@@ -1242,3 +1252,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 	return NULL;
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq */
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+	return ioq_sched_queue(rq_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+	return ioq_sched_queue(elv_fq_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2755d5c..4634949 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -245,6 +245,11 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* io queue request belongs to */
+	struct io_queue *ioq;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c59b769..679c149 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -2,6 +2,7 @@
 #define _LINUX_ELEVATOR_H
 
 #include <linux/percpu.h>
+#include "../../block/elevator-fq.h"
 
 #ifdef CONFIG_BLOCK
 
@@ -29,6 +30,18 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+						struct request*);
+typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
+						struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+						void*, int probe);
+#endif
 
 struct elevator_ops
 {
@@ -56,6 +69,17 @@ struct elevator_ops
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+	elevator_should_preempt_fn *elevator_should_preempt_fn;
+	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
 };
 
 #define ELV_NAME_MAX	(16)
@@ -76,6 +100,9 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	int elevator_features;
+#endif
 };
 
 /*
@@ -89,6 +116,10 @@ struct elevator_queue
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
 	struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* fair queuing data */
+	struct elv_fq_data efqd;
+#endif
 };
 
 /*
@@ -209,5 +240,25 @@ enum {
 	__val;							\
 })
 
+/* iosched can let elevator know their feature set/capability */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fq logic of elevator layer */
+#define	ELV_IOSCHED_NEED_FQ	1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* ELV_IOSCHED_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+#endif /* ELV_IOSCHED_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 02/18] io-controller: Common flat fair queuing code in elevator layer
@ 2009-05-05 19:58     ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda
  Cc: vgoyal, akpm

This is the common fair queuing code in the elevator layer, controlled by the
config option CONFIG_ELV_FAIR_QUEUING. This patch initially introduces only
flat fair queuing support, where there is a single group, the "root group",
and all tasks belong to it.

These elevator layer changes are backward compatible. That means any
ioscheduler using the old interfaces will continue to work.
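
To illustrate the intended usage, below is a minimal sketch (illustrative
only, not part of this patch) of how a simple io scheduler could opt into
the elevator fair queuing layer. The foo_* names are hypothetical;
ELV_IOSCHED_NEED_FQ, elevator_features, elv_select_sched_queue() and
elv_get_sched_queue() are the interfaces introduced by this patch, and the
sketch assumes CONFIG_ELV_FAIR_QUEUING=y.

#include <linux/module.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>

/* Hypothetical per-io_queue data kept by the io scheduler */
struct foo_queue {
	struct list_head fifo;
};

static int foo_dispatch(struct request_queue *q, int force)
{
	/* Ask the elevator fair queuing layer which queue should run now */
	struct foo_queue *fooq = elv_select_sched_queue(q, force);
	struct request *rq;

	if (!fooq || list_empty(&fooq->fifo))
		return 0;

	/* Dispatch the oldest request of the selected queue */
	rq = list_entry(fooq->fifo.next, struct request, queuelist);
	list_del_init(&rq->queuelist);
	elv_dispatch_sort(q, rq);
	return 1;
}

static struct elevator_type iosched_foo = {
	.ops = {
		.elevator_dispatch_fn	= foo_dispatch,
		/* remaining methods as in noop/deadline */
	},
	.elevator_name		= "foo",
	.elevator_owner		= THIS_MODULE,
	/* opt in to the fair queuing logic of the elevator layer */
	.elevator_features	= ELV_IOSCHED_NEED_FQ,
};

The add_req method would queue requests onto the per-queue fifo obtained via
elv_get_sched_queue(q, rq); the common layer then decides which io queue runs
next and for how long, while the io scheduler only orders requests within its
own queue.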

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   13 +
 block/Makefile           |    1 +
 block/blk-sysfs.c        |   25 +
 block/elevator-fq.c      | 2076 ++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h      |  488 +++++++++++
 block/elevator.c         |   46 +-
 include/linux/blkdev.h   |    5 +
 include/linux/elevator.h |   51 ++
 8 files changed, 2694 insertions(+), 11 deletions(-)
 create mode 100644 block/elevator-fq.c
 create mode 100644 block/elevator-fq.h

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
 
 menu "IO Schedulers"
 
+config ELV_FAIR_QUEUING
+	bool "Elevator Fair Queuing Support"
+	default n
+	---help---
+	  Traditionally only cfq had the notion of multiple queues and did
+	  fair queuing on its own. With cgroups and the need to control IO,
+	  even the simple io schedulers like noop, deadline and as will have
+	  one queue per cgroup and will need hierarchical fair queuing.
+	  Instead of every io scheduler implementing its own fair queuing
+	  logic, this option enables fair queuing in the elevator layer so
+	  that other ioschedulers can make use of it.
+	  If unsure, say N.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/Makefile b/block/Makefile
index e9fa4dd..94bfc6e 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING)	+= elevator-fq.o
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3ff9bba..082a273 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -276,6 +276,26 @@ static struct queue_sysfs_entry queue_iostats_entry = {
 	.store = queue_iostats_store,
 };
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+static struct queue_sysfs_entry queue_slice_idle_entry = {
+	.attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_slice_idle_show,
+	.store = elv_slice_idle_store,
+};
+
+static struct queue_sysfs_entry queue_slice_sync_entry = {
+	.attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_slice_sync_show,
+	.store = elv_slice_sync_store,
+};
+
+static struct queue_sysfs_entry queue_slice_async_entry = {
+	.attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_slice_async_show,
+	.store = elv_slice_async_store,
+};
+#endif
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -287,6 +307,11 @@ static struct attribute *default_attrs[] = {
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
 	&queue_iostats_entry.attr,
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	&queue_slice_idle_entry.attr,
+	&queue_slice_sync_entry.attr,
+	&queue_slice_async_entry.attr,
+#endif
 	NULL,
 };
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..9aea899
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,2076 @@
+/*
+ * BFQ: Hierarchical B-WF2Q+ scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+#include <linux/blktrace_api.h>
+
+/* Values taken from cfq */
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+int elv_slice_idle = HZ / 125;
+static struct kmem_cache *elv_ioq_pool;
+
+#define ELV_SLICE_SCALE		(5)
+#define ELV_HW_QUEUE_MIN	(5)
+#define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+					struct io_queue *ioq, int probe);
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract);
+
+static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
+					unsigned short prio)
+{
+	const int base_slice = efqd->elv_slice[sync];
+
+	WARN_ON(prio >= IOPRIO_BE_NR);
+
+	return base_slice + (base_slice/ELV_SLICE_SCALE * (4 - prio));
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	return elv_prio_slice(efqd, elv_ioq_sync(ioq), ioq->entity.ioprio);
+}
+
+/* Mainly the BFQ scheduling code follows */
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(bfq_timestamp_t a, bfq_timestamp_t b)
+{
+	return (s64)(a - b) > 0;
+}
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor.
+ */
+static inline bfq_timestamp_t bfq_delta(bfq_service_t service,
+					bfq_weight_t weight)
+{
+	bfq_timestamp_t d = (bfq_timestamp_t)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct io_entity *entity,
+				   bfq_service_t service)
+{
+	BUG_ON(entity->weight == 0);
+
+	entity->finish = entity->start + bfq_delta(service, entity->weight);
+}
+
+static inline struct io_queue *io_entity_to_ioq(struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data == NULL)
+		ioq = container_of(entity, struct io_queue, entity);
+	return ioq;
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct io_entity *bfq_entity_of(struct rb_node *node)
+{
+	struct io_entity *entity = NULL;
+
+	if (node != NULL)
+		entity = rb_entry(node, struct io_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_extract(struct rb_root *root, struct io_entity *entity)
+{
+	BUG_ON(entity->tree != root);
+
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *next;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	BUG_ON(entity->tree != &st->idle);
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = bfq_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = bfq_entity_of(next);
+	}
+
+	bfq_extract(&st->idle, entity);
+
+	/* Delete queue from idle list */
+	if (ioq)
+		list_del(&ioq->queue_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct io_entity *entity)
+{
+	struct io_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	BUG_ON(entity->tree != NULL);
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct io_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of a entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function  assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct io_entity *entity,
+					struct rb_node *node)
+{
+	struct io_entity *child;
+
+	if (node != NULL) {
+		child = rb_entry(node, struct io_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value.  The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+	struct io_entity *entity = rb_entry(node, struct io_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (parent == NULL)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right != NULL)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left != NULL)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * per each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
+	if (node->rb_left != NULL)
+		node = node->rb_left;
+	else if (node->rb_right != NULL)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	return IOPRIO_BE_NR - ioprio;
+}
+
+void bfq_get_entity(struct io_entity *entity)
+{
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (ioq)
+		elv_get_ioq(ioq);
+}
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (node->rb_right == NULL && node->rb_left == NULL)
+		deepest = rb_parent(node);
+	else if (node->rb_right == NULL)
+		deepest = node->rb_left;
+	else if (node->rb_left == NULL)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right != NULL)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_extract(&st->active, entity);
+
+	if (node != NULL)
+		bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+
+	/* Add this queue to idle list */
+	if (ioq)
+		list_add(&ioq->queue_list, &ioq->efqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(!entity->on_st);
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	ioq = io_entity_to_ioq(entity);
+	if (!ioq)
+		return;
+	elv_put_ioq(ioq);
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+void bfq_put_idle_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	bfq_idle_extract(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+void bfq_forget_idle(struct io_service_tree *st)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * Active tree is empty. Pull back vtime to finish time of
+		 * last idle entity on idle tree.
+		 * The rationale is that this reduces the possibility of
+		 * vtime wraparound (bfq_gt(V-F) < 0).
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+
+static struct io_service_tree *
+__bfq_entity_update_prio(struct io_service_tree *old_st,
+				struct io_entity *entity)
+{
+	struct io_service_tree *new_st = old_st;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (entity->ioprio_changed) {
+		entity->ioprio = entity->new_ioprio;
+		entity->ioprio_class = entity->new_ioprio_class;
+		entity->ioprio_changed = 0;
+
+		/*
+		 * Also update the scaled budget for ioq. Group will get the
+		 * updated budget once ioq is selected to run next.
+		 */
+		if (ioq) {
+			struct elv_fq_data *efqd = ioq->efqd;
+			entity->budget = elv_prio_to_slice(efqd, ioq);
+		}
+
+		old_st->wsum -= entity->weight;
+		entity->weight = bfq_ioprio_to_weight(entity->ioprio);
+
+		/*
+		 * NOTE: here we may be changing the weight too early,
+		 * this will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = io_entity_service_tree(entity);
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the current budget of the entity (and the
+ * service received if @entity is active) of the queue to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct io_entity *entity, int add_front)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	if (entity == sd->active_entity) {
+		BUG_ON(entity->tree != NULL);
+		/*
+		 * If we are requeueing the current entity we have
+		 * to take care of not charging to it service it has
+		 * not received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_active entity below it.  We reuse the old
+		 * start time.
+		 */
+		bfq_active_extract(st, entity);
+	} else if (entity->tree == &st->idle) {
+		/*
+		 * Must be on the idle tree, bfq_idle_extract() will
+		 * check for that.
+		 */
+		bfq_idle_extract(st, entity);
+		entity->start = bfq_gt(st->vtime, entity->finish) ?
+				       st->vtime : entity->finish;
+	} else {
+		/*
+		 * The finish time of the entity may be invalid, and
+		 * it is in the past for sure, otherwise the queue
+		 * would have been on the idle tree.
+		 */
+		entity->start = st->vtime;
+		st->wsum += entity->weight;
+		bfq_get_entity(entity);
+
+		BUG_ON(entity->on_st);
+		entity->on_st = 1;
+	}
+
+	st = __bfq_entity_update_prio(st, entity);
+	/*
+	 * This is to emulate cfq-like functionality where preemption can
+	 * happen within the same class, e.g. a sync queue preempting an async
+	 * queue. This may not be a very good idea from a fairness point of
+	 * view, as the preempting queue gains share. Keeping it for now.
+	 */
+	if (add_front) {
+		struct io_entity *next_entity;
+
+		/*
+		 * Determine the entity which will be dispatched next
+		 * Use sd->next_active once hierarchical patch is applied
+		 */
+		next_entity = bfq_lookup_next_entity(sd, 0);
+
+		if (next_entity && next_entity != entity) {
+			struct io_service_tree *new_st;
+			bfq_timestamp_t delta;
+
+			new_st = io_entity_service_tree(next_entity);
+
+			/*
+			 * At this point, both entities should belong to
+			 * same service tree as cross service tree preemption
+			 * is automatically taken care by algorithm
+			 */
+			BUG_ON(new_st != st);
+			entity->finish = next_entity->finish - 1;
+			delta = bfq_delta(entity->budget, entity->weight);
+			entity->start = entity->finish - delta;
+			if (bfq_gt(entity->start, st->vtime))
+				entity->start = st->vtime;
+		}
+	} else {
+		bfq_calc_finish(entity, entity->budget);
+	}
+	bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity.
+ * @entity: the entity to activate.
+ */
+void bfq_activate_entity(struct io_entity *entity, int add_front)
+{
+	__bfq_activate_entity(entity, add_front);
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently from its previous state.  If the
+ * entity was not on a service tree just return, otherwise if it is on
+ * any scheduler tree, extract it from that tree, and if necessary
+ * and if the caller did not specify @requeue, put it on the idle tree.
+ *
+ */
+int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+	int was_active = entity == sd->active_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	BUG_ON(was_active && entity->tree != NULL);
+
+	if (was_active) {
+		bfq_calc_finish(entity, entity->service);
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_extract(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_extract(st, entity);
+	else if (entity->tree != NULL)
+		BUG();
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	BUG_ON(sd->active_entity == entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+void bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	__bfq_deactivate_entity(entity, requeue);
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated tasks getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct io_service_tree *st)
+{
+	struct io_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
+	entry = rb_entry(node, struct io_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active - find the eligible entity with the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches the first schedulable entity, starting from the
+ * root of the tree and going on the left every time on this side there is
+ * a subtree with at least one eligible (start <= vtime) entity.  The path
+ * on the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct io_entity *bfq_first_active_entity(struct io_service_tree *st)
+{
+	struct io_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
+	while (node != NULL) {
+		entry = rb_entry(node, struct io_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		BUG_ON(bfq_gt(entry->min_start, st->vtime));
+
+		if (node->rb_left != NULL) {
+			entry = rb_entry(node->rb_left,
+					 struct io_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first != NULL)
+			break;
+		node = node->rb_right;
+	}
+
+	BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
+	return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct io_entity *__bfq_lookup_next_entity(struct io_service_tree *st)
+{
+	struct io_entity *entity;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+	BUG_ON(bfq_gt(entity->start, st->vtime));
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_active entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort just returning the cached next_active value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract)
+{
+	struct io_service_tree *st = sd->service_tree;
+	struct io_entity *entity;
+	int i;
+
+	/*
+	 * One can check which entity will be selected next without
+	 * expiring the current one.
+	 */
+	BUG_ON(extract && sd->active_entity != NULL);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+		entity = __bfq_lookup_next_entity(st);
+		if (entity != NULL) {
+			if (extract) {
+				bfq_active_extract(st, entity);
+				sd->active_entity = entity;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+void entity_served(struct io_entity *entity, bfq_service_t served)
+{
+	struct io_service_tree *st;
+
+	st = io_entity_service_tree(entity);
+	entity->service += served;
+	BUG_ON(st->wsum == 0);
+	st->vtime += bfq_delta(served, st->wsum);
+	bfq_forget_idle(st);
+}
+
+/* Elevator fair queuing function */
+struct io_queue *rq_ioq(struct request *rq)
+{
+	return rq->ioq;
+}
+
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+	return e->efqd.active_queue;
+}
+
+void *elv_active_sched_queue(struct elevator_queue *e)
+{
+	return ioq_sched_queue(elv_active_ioq(e));
+}
+EXPORT_SYMBOL(elv_active_sched_queue);
+
+int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_ioq);
+
+int elv_nr_busy_rt_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_rt_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_rt_ioq);
+
+int elv_hw_tag(struct elevator_queue *e)
+{
+	return e->efqd.hw_tag;
+}
+EXPORT_SYMBOL(elv_hw_tag);
+
+/* Helper functions for operating on elevator idle slice timer */
+int elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return mod_timer(&efqd->idle_slice_timer, expires);
+}
+EXPORT_SYMBOL(elv_mod_idle_slice_timer);
+
+int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return del_timer(&efqd->idle_slice_timer);
+}
+EXPORT_SYMBOL(elv_del_idle_slice_timer);
+
+unsigned int elv_get_slice_idle(struct elevator_queue *eq)
+{
+	return eq->efqd.elv_slice_idle;
+}
+EXPORT_SYMBOL(elv_get_slice_idle);
+
+void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
+{
+	entity_served(&ioq->entity, served);
+}
+
+/* Tells whether ioq is queued in root group or not */
+static inline int is_root_group_ioq(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return (ioq->entity.sched_data == &efqd->root_group->sched_data);
+}
+
+/* Functions to show and store elv_slice_idle value through sysfs */
+ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = jiffies_to_msecs(efqd->elv_slice_idle);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data < 0)
+		data = 0;
+	else if (data > INT_MAX)
+		data = INT_MAX;
+
+	data = msecs_to_jiffies(data);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->elv_slice_idle = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
+/* Functions to show and store elv_slice_sync value through sysfs */
+ssize_t elv_slice_sync_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = efqd->elv_slice[1];
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data < 0)
+		data = 0;
+	/* 100ms is the limit for now */
+	else if (data > 100)
+		data = 100;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->elv_slice[1] = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
+/* Functions to show and store elv_slice_async value through sysfs */
+ssize_t elv_slice_async_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = efqd->elv_slice[0];
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data < 0)
+		data = 0;
+	/* 100ms is the limit for now */
+	else if (data > 100)
+		data = 100;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->elv_slice[0] = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (elv_nr_busy_ioq(q->elevator)) {
+		elv_log(efqd, "schedule dispatch");
+		kblockd_schedule_work(efqd->queue, &efqd->unplug_work);
+	}
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+void elv_kick_queue(struct work_struct *work)
+{
+	struct elv_fq_data *efqd =
+		container_of(work, struct elv_fq_data, unplug_work);
+	struct request_queue *q = efqd->queue;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	blk_start_queueing(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+	del_timer_sync(&e->efqd.idle_slice_timer);
+	cancel_work_sync(&e->efqd.unplug_work);
+}
+EXPORT_SYMBOL(elv_shutdown_timer_wq);
+
+void elv_ioq_set_prio_slice(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	ioq->slice_end = jiffies + ioq->entity.budget;
+	elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->entity.budget);
+}
+
+static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	unsigned long elapsed = jiffies - ioq->last_end_request;
+	unsigned long ttime = min(elapsed, 2UL * efqd->elv_slice_idle);
+
+	ioq->ttime_samples = (7*ioq->ttime_samples + 256) / 8;
+	ioq->ttime_total = (7*ioq->ttime_total + 256*ttime) / 8;
+	ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
+}
+
+/*
+ * Disable idle window if the process thinks too long.
+ * This idle flag can also be updated by io scheduler.
+ */
+static void elv_ioq_update_idle_window(struct elevator_queue *eq,
+				struct io_queue *ioq, struct request *rq)
+{
+	int old_idle, enable_idle;
+	struct elv_fq_data *efqd = ioq->efqd;
+
+	/*
+	 * Don't idle for async or idle io prio class
+	 */
+	if (!elv_ioq_sync(ioq) || elv_ioq_class_idle(ioq))
+		return;
+
+	enable_idle = old_idle = elv_ioq_idle_window(ioq);
+
+	if (!efqd->elv_slice_idle)
+		enable_idle = 0;
+	else if (ioq_sample_valid(ioq->ttime_samples)) {
+		if (ioq->ttime_mean > efqd->elv_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+
+	/*
+	 * From a think time perspective, idling should be enabled. Check with
+	 * the io scheduler if it wants to disable idling based on additional
+	 * considerations like seek pattern.
+	 */
+	if (enable_idle) {
+		if (eq->ops->elevator_update_idle_window_fn)
+			enable_idle = eq->ops->elevator_update_idle_window_fn(
+						eq, ioq->sched_queue, rq);
+		if (!enable_idle)
+			elv_log_ioq(efqd, ioq, "iosched disabled idle");
+	}
+
+	if (old_idle != enable_idle) {
+		elv_log_ioq(efqd, ioq, "idle=%d", enable_idle);
+		if (enable_idle)
+			elv_mark_ioq_idle_window(ioq);
+		else
+			elv_clear_ioq_idle_window(ioq);
+	}
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+	struct io_queue *ioq = NULL;
+
+	ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+	return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+	kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+			void *sched_queue, int ioprio_class, int ioprio,
+			int is_sync)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+	struct io_group *iog = io_lookup_io_group_current(efqd->queue);
+
+	RB_CLEAR_NODE(&ioq->entity.rb_node);
+	atomic_set(&ioq->ref, 0);
+	ioq->efqd = efqd;
+	elv_ioq_set_ioprio_class(ioq, ioprio_class);
+	elv_ioq_set_ioprio(ioq, ioprio);
+	ioq->pid = current->pid;
+	ioq->sched_queue = sched_queue;
+	if (is_sync && !elv_ioq_class_idle(ioq))
+		elv_mark_ioq_idle_window(ioq);
+	bfq_init_entity(&ioq->entity, iog);
+	ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
+	return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
+						efqd);
+
+	BUG_ON(atomic_read(&ioq->ref) <= 0);
+	if (!atomic_dec_and_test(&ioq->ref))
+		return;
+	BUG_ON(ioq->nr_queued);
+	BUG_ON(ioq->entity.tree != NULL);
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(efqd->active_queue == ioq);
+
+	/* Can be called by outgoing elevator. Don't use q */
+	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+
+	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+	elv_log_ioq(efqd, ioq, "put_queue");
+	elv_free_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_put_ioq);
+
+void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+	struct io_queue *ioq = *ioq_ptr;
+
+	if (ioq != NULL) {
+		/* Drop the reference taken by the io group */
+		elv_put_ioq(ioq);
+		*ioq_ptr = NULL;
+	}
+}
+
+/*
+ * Normally the next io queue to be served is selected from the service tree.
+ * This function allows one to choose a specific io queue to run next
+ * out of order. This is primarily to accommodate the close_cooperator
+ * feature of cfq.
+ *
+ * Currently this is done only at the root level; to begin with, the close
+ * cooperator feature is supported only for the root group, so that the
+ * default cfq behavior in a flat hierarchy is not changed.
+ */
+void elv_set_next_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	struct io_sched_data *sd = &efqd->root_group->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	BUG_ON(efqd->active_queue != NULL || sd->active_entity != NULL);
+	BUG_ON(!efqd->busy_queues);
+	BUG_ON(sd != entity->sched_data);
+	BUG_ON(!st);
+
+	bfq_update_vtime(st);
+	bfq_active_extract(st, entity);
+	sd->active_entity = entity;
+	entity->service = 0;
+	elv_log_ioq(efqd, ioq, "set_next_ioq");
+}
+
+/* Get next queue for service. */
+struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = NULL;
+	struct io_queue *ioq = NULL;
+	struct io_sched_data *sd;
+
+	/*
+	 * One can check which queue will be selected next while another
+	 * queue is still active. The preempt logic uses this.
+	 */
+	BUG_ON(extract && efqd->active_queue != NULL);
+
+	if (!efqd->busy_queues)
+		return NULL;
+
+	sd = &efqd->root_group->sched_data;
+	if (extract)
+		entity = bfq_lookup_next_entity(sd, 1);
+	else
+		entity = bfq_lookup_next_entity(sd, 0);
+
+	BUG_ON(!entity);
+	if (extract)
+		entity->service = 0;
+	ioq = io_entity_to_ioq(entity);
+
+	return ioq;
+}
+
+/*
+ * coop indicates that the io scheduler selected a queue for us and that we
+ * did not select the next queue based on fairness.
+ */
+static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int coop)
+{
+	struct request_queue *q = efqd->queue;
+
+	if (ioq) {
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+							efqd->busy_queues);
+		ioq->slice_end = 0;
+
+		elv_clear_ioq_wait_request(ioq);
+		elv_clear_ioq_must_dispatch(ioq);
+		elv_mark_ioq_slice_new(ioq);
+
+		del_timer(&efqd->idle_slice_timer);
+	}
+
+	efqd->active_queue = ioq;
+
+	/* Let iosched know if it wants to take some action */
+	if (ioq) {
+		if (q->elevator->ops->elevator_active_ioq_set_fn)
+			q->elevator->ops->elevator_active_ioq_set_fn(q,
+							ioq->sched_queue, coop);
+	}
+}
+
+/* Get and set a new active queue for service. */
+struct io_queue *elv_set_active_ioq(struct request_queue *q,
+						struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	int coop = 0;
+
+	if (!ioq)
+		ioq = elv_get_next_ioq(q, 1);
+	else {
+		elv_set_next_ioq(q, ioq);
+		/*
+		 * The io scheduler selected the next queue for us. Pass this
+		 * info back to the io scheduler. cfq currently uses it to
+		 * reset the coop flag on the queue.
+		 */
+		coop = 1;
+	}
+	__elv_set_active_ioq(efqd, ioq, coop);
+	return ioq;
+}
+
+void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+	struct request_queue *q = efqd->queue;
+	struct io_queue *ioq = elv_active_ioq(efqd->queue->elevator);
+
+	if (q->elevator->ops->elevator_active_ioq_reset_fn)
+		q->elevator->ops->elevator_active_ioq_reset_fn(q,
+							ioq->sched_queue);
+	efqd->active_queue = NULL;
+	del_timer(&efqd->idle_slice_timer);
+}
+
+void elv_activate_ioq(struct io_queue *ioq, int add_front)
+{
+	bfq_activate_entity(&ioq->entity, add_front);
+}
+
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue)
+{
+	if (ioq == efqd->active_queue)
+		elv_reset_active_ioq(efqd);
+
+	bfq_deactivate_entity(&ioq->entity, requeue);
+}
+
+/* Called when an inactive queue receives a new request. */
+void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(ioq == efqd->active_queue);
+	elv_log_ioq(efqd, ioq, "add to busy");
+	elv_activate_ioq(ioq, 0);
+	elv_mark_ioq_busy(ioq);
+	efqd->busy_queues++;
+	if (elv_ioq_class_rt(ioq))
+		efqd->busy_rt_queues++;
+}
+
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	BUG_ON(!elv_ioq_busy(ioq));
+	BUG_ON(ioq->nr_queued);
+	elv_log_ioq(efqd, ioq, "del from busy");
+	elv_clear_ioq_busy(ioq);
+	BUG_ON(efqd->busy_queues == 0);
+	efqd->busy_queues--;
+	if (elv_ioq_class_rt(ioq))
+		efqd->busy_rt_queues--;
+
+	elv_deactivate_ioq(efqd, ioq, requeue);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * current queue used and adjust the start, finish time of queue and vtime
+ * of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations. Especially when the underlying device supports command
+ * queuing and requests from multiple queues can be outstanding at the same
+ * time, it is not clear which queue consumed how much of the disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of a queue only
+ * after the first request from that queue has completed. This does not work
+ * very well if we expire the queue before waiting for the first (and
+ * further) requests from the queue to finish. For seeky queues, we will
+ * expire the queue after dispatching a few requests without waiting, and
+ * start dispatching from the next queue.
+ *
+ * It is not clear how to determine the time consumed by a queue in such
+ * scenarios. Currently, as a crude approximation, we charge 25% of the time
+ * slice for such cases. A better mechanism is needed for accurate accounting.
+ */
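+/*
+ * A rough illustration of the charging above (the numbers are hypothetical
+ * and assume a budget of 100ms): a queue expired with 40ms of its slice left
+ * is charged 100 - 40 = 60ms; a queue expired immediately after its first
+ * completion (slice_unused == budget), or before slice_end was even set, is
+ * charged budget/4 = 25ms; a queue that overshot its slice by 10ms is
+ * charged 100 + 10 = 110ms.
+ */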
+void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	long slice_unused = 0, slice_used = 0, slice_overshoot = 0;
+
+	assert_spin_locked(q->queue_lock);
+	elv_log_ioq(efqd, ioq, "slice expired");
+
+	if (elv_ioq_wait_request(ioq))
+		del_timer(&efqd->idle_slice_timer);
+
+	elv_clear_ioq_wait_request(ioq);
+
+	/*
+	 * If ioq->slice_end == 0, the queue was expired before the first
+	 * request from the queue got completed. Of course we are not
+	 * planning to idle on the queue, otherwise we would not have
+	 * expired it.
+	 *
+	 * Charge for 25% of the slice in such cases. This is not the best
+	 * thing to do, but at the same time it is not clear what the next
+	 * best thing would be.
+	 *
+	 * This arises from the fact that we don't have the notion of only
+	 * one queue being operational at a time. The io scheduler can
+	 * dispatch requests from multiple queues in one dispatch round.
+	 * Ideally, for more accurate accounting of the disk time used, one
+	 * should dispatch requests from only one queue and wait for all of
+	 * those requests to finish. But that would reduce throughput.
+	 */
+	if (!ioq->slice_end)
+		slice_used = entity->budget/4;
+	else {
+		if (time_after(ioq->slice_end, jiffies)) {
+			slice_unused = ioq->slice_end - jiffies;
+			if (slice_unused == entity->budget) {
+				/*
+				 * queue got expired immediately after
+				 * completing first request. Charge 25% of
+				 * slice.
+				 */
+				slice_used = entity->budget/4;
+			} else
+				slice_used = entity->budget - slice_unused;
+		} else {
+			slice_overshoot = jiffies - ioq->slice_end;
+			slice_used = entity->budget + slice_overshoot;
+		}
+	}
+
+	elv_log_ioq(efqd, ioq, "sl_end=%lx, jiffies=%lx", ioq->slice_end,
+			jiffies);
+	elv_log_ioq(efqd, ioq, "sl_used=%ld, budget=%ld overshoot=%ld",
+				slice_used, entity->budget, slice_overshoot);
+	elv_ioq_served(ioq, slice_used);
+
+	BUG_ON(ioq != efqd->active_queue);
+	elv_reset_active_ioq(efqd);
+
+	if (!ioq->nr_queued)
+		elv_del_ioq_busy(q->elevator, ioq, 1);
+	else
+		elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(__elv_ioq_slice_expired);
+
+/*
+ *  Expire the ioq.
+ */
+void elv_ioq_slice_expired(struct request_queue *q)
+{
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (ioq)
+		__elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no, or if we aren't sure; a return of 1 will cause a preemption attempt.
+ */
+int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+			struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elevator_queue *eq = q->elevator;
+
+	ioq = elv_active_ioq(eq);
+
+	if (!ioq)
+		return 0;
+
+	if (elv_ioq_slice_used(ioq))
+		return 1;
+
+	if (elv_ioq_class_idle(new_ioq))
+		return 0;
+
+	if (elv_ioq_class_idle(ioq))
+		return 1;
+
+	/*
+	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+	 */
+	if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
+		return 1;
+
+	/*
+	 * Check with the io scheduler if it has additional criteria based
+	 * on which it wants to preempt the existing queue.
+	 */
+	if (eq->ops->elevator_should_preempt_fn)
+		return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
+
+	return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
+	elv_ioq_slice_expired(q);
+
+	/*
+	 * Put the new queue at the front of the current list,
+	 * so we know that it will be selected next.
+	 */
+
+	elv_activate_ioq(ioq, 1);
+	elv_ioq_set_slice_end(ioq, 0);
+	elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	BUG_ON(!efqd);
+	BUG_ON(!ioq);
+	efqd->rq_queued++;
+	ioq->nr_queued++;
+
+	if (!elv_ioq_busy(ioq))
+		elv_add_ioq_busy(efqd, ioq);
+
+	elv_ioq_update_io_thinktime(ioq);
+	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+
+	if (ioq == elv_active_ioq(q->elevator)) {
+		/*
+		 * Remember that we saw a request from this process, but
+		 * don't start queuing just yet. Otherwise we risk seeing lots
+		 * of tiny requests, because we disrupt the normal plugging
+		 * and merging. If the request is already larger than a single
+		 * page, let it rip immediately. For that case we assume that
+		 * merging is already done. Ditto for a busy system that
+		 * has other work pending, don't risk delaying until the
+		 * idle timer unplug to continue working.
+		 */
+		if (elv_ioq_wait_request(ioq)) {
+			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+			    efqd->busy_queues > 1) {
+				del_timer(&efqd->idle_slice_timer);
+				blk_start_queueing(q);
+			}
+			elv_mark_ioq_must_dispatch(ioq);
+		}
+	} else if (elv_should_preempt(q, ioq, rq)) {
+		/*
+		 * Not the active queue - expire the current slice if it is
+		 * idle and has expired its mean thinktime, or if this new
+		 * queue has some old slice time left and is of higher
+		 * priority, or if this new queue is RT and the current one
+		 * is BE.
+		 */
+		elv_preempt_queue(q, ioq);
+		blk_start_queueing(q);
+	}
+}
+
+void elv_idle_slice_timer(unsigned long data)
+{
+	struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+	struct io_queue *ioq;
+	unsigned long flags;
+	struct request_queue *q = efqd->queue;
+
+	elv_log(efqd, "idle timer fired");
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	ioq = efqd->active_queue;
+
+	if (ioq) {
+
+		/*
+		 * We saw a request before the queue expired, let it through
+		 */
+		if (elv_ioq_must_dispatch(ioq))
+			goto out_kick;
+
+		/*
+		 * expired
+		 */
+		if (elv_ioq_slice_used(ioq))
+			goto expire;
+
+		/*
+		 * only expire and reinvoke the request handler if there are
+		 * other queues with pending requests
+		 */
+		if (!elv_nr_busy_ioq(q->elevator))
+			goto out_cont;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (ioq->nr_queued)
+			goto out_kick;
+	}
+expire:
+	elv_ioq_slice_expired(q);
+out_kick:
+	elv_schedule_dispatch(q);
+out_cont:
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	unsigned long sl;
+
+	BUG_ON(!ioq);
+
+	/*
+	 * SSD device without seek penalty, disable idling. But only do so
+	 * for devices that support queuing, otherwise we still have a problem
+	 * with sync vs async workloads.
+	 */
+	if (blk_queue_nonrot(q) && efqd->hw_tag)
+		return;
+
+	/*
+	 * still requests with the driver, don't idle
+	 */
+	if (efqd->rq_in_driver)
+		return;
+
+	/*
+	 * idle is disabled, either manually or by past process history
+	 */
+	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+		return;
+
+	/*
+	 * Maybe the iosched has its own idling logic. In that case the io
+	 * scheduler will take care of arming the timer, if need be.
+	 */
+	if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+		q->elevator->ops->elevator_arm_slice_timer_fn(q,
+						ioq->sched_queue);
+	} else {
+		elv_mark_ioq_wait_request(ioq);
+		sl = efqd->elv_slice_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log(efqd, "arm idle: %lu", sl);
+	}
+}
+
+void elv_free_idle_ioq_list(struct elevator_queue *e)
+{
+	struct io_queue *ioq, *n;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
+		elv_deactivate_ioq(efqd, ioq, 0);
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+
+	if (!elv_nr_busy_ioq(q->elevator))
+		return NULL;
+
+	if (ioq == NULL)
+		goto new_queue;
+
+	/*
+	 * Force dispatch. Continue to dispatch from current queue as long
+	 * as it has requests.
+	 */
+	if (unlikely(force)) {
+		if (ioq->nr_queued)
+			goto keep_queue;
+		else
+			goto expire;
+	}
+
+	/*
+	 * The active queue has run out of time, expire it and select new.
+	 */
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+		goto expire;
+
+	/*
+	 * If we have an RT queue waiting, then we pre-empt the current
+	 * non-RT queue.
+	 */
+	if (!elv_ioq_class_rt(ioq) && efqd->busy_rt_queues) {
+		/*
+		 * We simulate this as if the queue timed out so that it gets
+		 * to bank the remainder of its time slice.
+		 */
+		elv_log_ioq(efqd, ioq, "preempt");
+		goto expire;
+	}
+
+	/*
+	 * The active queue has requests and isn't expired, allow it to
+	 * dispatch.
+	 */
+
+	if (ioq->nr_queued)
+		goto keep_queue;
+
+	/*
+	 * If another queue has a request waiting within our mean seek
+	 * distance, let it run.  The expire code will check for close
+	 * cooperators and put the close queue at the front of the service
+	 * tree.
+	 */
+	new_ioq = elv_close_cooperator(q, ioq, 0);
+	if (new_ioq)
+		goto expire;
+
+	/*
+	 * No requests pending. If the active queue still has requests in
+	 * flight or is idling for a new request, allow either of these
+	 * conditions to happen (or time out) before selecting a new queue.
+	 */
+
+	if (timer_pending(&efqd->idle_slice_timer) ||
+	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
+expire:
+	elv_ioq_slice_expired(q);
+new_queue:
+	ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+	return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	ioq = rq->ioq;
+	BUG_ON(!ioq);
+	ioq->nr_queued--;
+
+	efqd = ioq->efqd;
+	BUG_ON(!efqd);
+	efqd->rq_queued--;
+
+	if (elv_ioq_busy(ioq) && (elv_active_ioq(e) != ioq) && !ioq->nr_queued)
+		elv_del_ioq_busy(e, ioq, 1);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	BUG_ON(!ioq);
+	elv_ioq_request_dispatched(ioq);
+	elv_ioq_request_removed(e, rq);
+	elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	efqd->rq_in_driver++;
+	elv_log_ioq(efqd, rq_ioq(rq), "activate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+	elv_log_ioq(efqd, rq_ioq(rq), "deactivate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+/*
+ * Update hw_tag based on peak queue depth over 50 samples under
+ * sufficient load.
+ */
+static void elv_update_hw_tag(struct elv_fq_data *efqd)
+{
+	if (efqd->rq_in_driver > efqd->rq_in_driver_peak)
+		efqd->rq_in_driver_peak = efqd->rq_in_driver;
+
+	if (efqd->rq_queued <= ELV_HW_QUEUE_MIN &&
+	    efqd->rq_in_driver <= ELV_HW_QUEUE_MIN)
+		return;
+
+	if (efqd->hw_tag_samples++ < 50)
+		return;
+
+	if (efqd->rq_in_driver_peak >= ELV_HW_QUEUE_MIN)
+		efqd->hw_tag = 1;
+	else
+		efqd->hw_tag = 0;
+
+	efqd->hw_tag_samples = 0;
+	efqd->rq_in_driver_peak = 0;
+}
+
+/*
+ * If the io scheduler keeps track of close cooperators, check with it
+ * whether it has a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+					struct io_queue *ioq, int probe)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *new_ioq = NULL;
+
+	/*
+	 * Currently this feature is supported only for flat hierarchy or
+	 * root group queues so that default cfq behavior is not changed.
+	 */
+	if (!is_root_group_ioq(q, ioq))
+		return NULL;
+
+	if (q->elevator->ops->elevator_close_cooperator_fn)
+		new_ioq = e->ops->elevator_close_cooperator_fn(q,
+						ioq->sched_queue, probe);
+
+	/* Only select co-operating queue if it belongs to root group */
+	if (new_ioq && !is_root_group_ioq(q, new_ioq))
+		return NULL;
+
+	return new_ioq;
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+	const int sync = rq_is_sync(rq);
+	struct io_queue *ioq = rq->ioq;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	elv_log_ioq(efqd, ioq, "complete");
+
+	elv_update_hw_tag(efqd);
+
+	WARN_ON(!efqd->rq_in_driver);
+	WARN_ON(!ioq->dispatched);
+	efqd->rq_in_driver--;
+	ioq->dispatched--;
+
+	if (sync)
+		ioq->last_end_request = jiffies;
+
+	/*
+	 * If this is the active queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+
+	if (elv_active_ioq(q->elevator) == ioq) {
+		if (elv_ioq_slice_new(ioq)) {
+			elv_ioq_set_prio_slice(q, ioq);
+			elv_clear_ioq_slice_new(ioq);
+		}
+		/*
+		 * If there are no requests waiting in this queue, and
+		 * there are other queues ready to issue requests, AND
+		 * those other queues are issuing requests within our
+		 * mean seek distance, give them a chance to run instead
+		 * of idling.
+		 */
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+			elv_ioq_slice_expired(q);
+		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+			 && sync && !rq_noidle(rq))
+			elv_ioq_arm_slice_timer(q);
+	}
+
+	if (!efqd->rq_in_driver)
+		elv_schedule_dispatch(q);
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio)
+{
+	struct io_queue *ioq = NULL;
+
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		ioq = iog->async_queue[0][ioprio];
+		break;
+	case IOPRIO_CLASS_BE:
+		ioq = iog->async_queue[1][ioprio];
+		break;
+	case IOPRIO_CLASS_IDLE:
+		ioq = iog->async_idle_queue;
+		break;
+	default:
+		BUG();
+	}
+
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+EXPORT_SYMBOL(io_group_async_queue_prio);
+
+void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		iog->async_queue[0][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_BE:
+		iog->async_queue[1][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		iog->async_idle_queue = ioq;
+		break;
+	default:
+		BUG();
+	}
+
+	/*
+	 * Take the group reference and pin the queue. Group exit will
+	 * clean it up
+	 */
+	elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(io_group_set_async_queue);
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+static void elv_slab_kill(void)
+{
+	/*
+	 * Caller already ensured that pending RCU callbacks are completed,
+	 * so we should have no busy allocations at this point.
+	 */
+	if (elv_ioq_pool)
+		kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+	elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+	if (!elv_ioq_pool)
+		goto fail;
+
+	return 0;
+fail:
+	elv_slab_kill();
+	return -ENOMEM;
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	iog = io_alloc_root_group(q, e, efqd);
+	if (iog == NULL)
+		return 1;
+
+	efqd->root_group = iog;
+	efqd->queue = q;
+
+	init_timer(&efqd->idle_slice_timer);
+	efqd->idle_slice_timer.function = elv_idle_slice_timer;
+	efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+	INIT_LIST_HEAD(&efqd->idle_list);
+
+	efqd->elv_slice[0] = elv_slice_async;
+	efqd->elv_slice[1] = elv_slice_sync;
+	efqd->elv_slice_idle = elv_slice_idle;
+	efqd->hw_tag = 1;
+
+	return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before we ask
+ * the elevator to clean up its queues, we do the cleanup here so that all
+ * the group and idle tree references to ioq are dropped. Later, during
+ * elevator cleanup, the ioc reference will be dropped, which will lead to
+ * removal of the ioscheduler queue as well as the associated ioq object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+	struct request_queue *q = efqd->queue;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+
+	spin_lock_irq(q->queue_lock);
+	/* This should drop all the idle tree references of ioq */
+	elv_free_idle_ioq_list(e);
+	spin_unlock_irq(q->queue_lock);
+
+	elv_shutdown_timer_wq(e);
+
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+	io_free_root_group(e);
+}
+
+/*
+ * This is called after the io scheduler has cleaned up its data structures.
+ * I don't think this function is required. Right now it is kept only because
+ * cfq cleans up the timer and work queue again after freeing up io contexts.
+ * To me, the io scheduler has already been drained out, and all the active
+ * queues have already been expired, so the timer and work queue should not
+ * have been activated during the cleanup process.
+ *
+ * Keeping it here for the time being. Will get rid of it later.
+ */
+void elv_exit_fq_data_post(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+}
+
+
+static int __init elv_fq_init(void)
+{
+	if (elv_slab_setup())
+		return -ENOMEM;
+
+	/* could be 0 on HZ < 1000 setups */
+
+	if (!elv_slice_async)
+		elv_slice_async = 1;
+
+	if (!elv_slice_idle)
+		elv_slice_idle = 1;
+
+	return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..3bea279
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,488 @@
+/*
+ * BFQ: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#include <linux/blkdev.h>
+
+#ifndef _BFQ_SCHED_H
+#define _BFQ_SCHED_H
+
+#define IO_IOPRIO_CLASSES	3
+
+typedef u64 bfq_timestamp_t;
+typedef unsigned long bfq_weight_t;
+typedef unsigned long bfq_service_t;
+struct io_entity;
+struct io_queue;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/**
+ * struct bfq_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * bfq_service_tree.  All the fields are protected by the queue lock
+ * of the containing efqd.
+ */
+struct io_service_tree {
+	struct rb_root active;
+	struct rb_root idle;
+
+	struct io_entity *first_idle;
+	struct io_entity *last_idle;
+
+	bfq_timestamp_t vtime;
+	bfq_weight_t wsum;
+};
+
+/**
+ * struct bfq_sched_data - multi-class scheduler.
+ * @active_entity: entity under service.
+ * @next_active: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * bfq_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_active points to the active entity of the sched_data service
+ * trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among requests of the same
+ * queue requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct io_sched_data {
+	struct io_entity *active_entity;
+	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+/**
+ * struct bfq_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ *         the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ *             this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
+ *                 associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ *                    ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested an ioprio or
+ *                  ioprio_class change.
+ *
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
+ * cgroup hierarchy) or a bfq_group into the upper level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities have also their own sched_data, stored
+ * in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would allow
+ * different weights on different devices, but this functionality is not
+ * exported to userspace by now.  Priorities are updated lazily, first
+ * storing the new values into the new_* fields, then setting the
+ * @ioprio_changed flag.  As soon as there is a transition in the entity
+ * state that allows the priority update to take place the effective and
+ * the requested priority values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.  When dealing with ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time to consume their budget and
+ * have true sequential behavior, and when there are no external factors
+ * breaking anticipation) the relative weights at each level of the
+ * cgroups hierarchy should be guaranteed.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct io_entity {
+	struct rb_node rb_node;
+
+	int on_st;
+
+	bfq_timestamp_t finish;
+	bfq_timestamp_t start;
+
+	struct rb_root *tree;
+
+	bfq_timestamp_t min_start;
+
+	bfq_service_t service, budget;
+	bfq_weight_t weight;
+
+	struct io_entity *parent;
+
+	struct io_sched_data *my_sched_data;
+	struct io_sched_data *sched_data;
+
+	unsigned short ioprio, new_ioprio;
+	unsigned short ioprio_class, new_ioprio_class;
+
+	int ioprio_changed;
+};
+
+/*
+ * A common structure embedded by every io scheduler into its respective
+ * queue structure.
+ */
+struct io_queue {
+	struct io_entity entity;
+	atomic_t ref;
+	unsigned int flags;
+
+	/* Pointer to generic elevator data structure */
+	struct elv_fq_data *efqd;
+	struct list_head queue_list;
+	pid_t pid;
+
+	/* Number of requests queued on this io queue */
+	unsigned long nr_queued;
+
+	/* Requests dispatched from this queue */
+	int dispatched;
+
+	/* Keep track of the think time of processes in this queue */
+	unsigned long last_end_request;
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+
+	unsigned long slice_end;
+
+	/* Pointer to io scheduler's queue */
+	void *sched_queue;
+};
+
+struct io_group {
+	struct io_sched_data sched_data;
+
+	/* async_queue and idle_queue are used only for cfq */
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+};
+
+struct elv_fq_data {
+	struct io_group *root_group;
+
+	/* List of io queues on idle tree. */
+	struct list_head idle_list;
+
+	struct request_queue *queue;
+	unsigned int busy_queues;
+	/*
+	 * Used to track any pending rt requests so we can pre-empt current
+	 * non-RT cfqq in service when this value is non-zero.
+	 */
+	unsigned int busy_rt_queues;
+
+	/* Number of requests queued */
+	int rq_queued;
+
+	/* Pointer to the ioscheduler queue being served */
+	void *active_queue;
+
+	int rq_in_driver;
+	int hw_tag;
+	int hw_tag_samples;
+	int rq_in_driver_peak;
+
+	/*
+	 * The elevator fair queuing layer has the capability to provide
+	 * idling to ensure fairness for processes doing dependent reads.
+	 * This might be needed to ensure fairness between two processes
+	 * doing synchronous reads in two different cgroups. noop and
+	 * deadline don't have any notion of anticipation/idling; as of now,
+	 * they are the users of this functionality.
+	 */
+	unsigned int elv_slice_idle;
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
+	unsigned int elv_slice[2];
+};
+
+extern int elv_slice_idle;
+extern int elv_slice_async;
+
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
+				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples)   ((samples) > 80)
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_busy = 0,          /* has requests or is under service */
+	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+	ELV_QUEUE_FLAG_idle_window,	  /* elevator slice idling enabled */
+	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
+	ELV_QUEUE_FLAG_must_dispatch,	  /* must be allowed a dispatch */
+	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
+	ELV_QUEUE_FLAG_NR,
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name)					\
+static inline void elv_mark_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline int elv_ioq_##name(struct io_queue *ioq)         		\
+{                                                                       \
+	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
+}
+
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
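+
+/*
+ * For example, ELV_IO_QUEUE_FLAG_FNS(busy) above expands to
+ * elv_mark_ioq_busy(), elv_clear_ioq_busy() and elv_ioq_busy(), which set,
+ * clear and test the ELV_QUEUE_FLAG_busy bit in ioq->flags respectively.
+ */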
+
+static inline struct io_service_tree *
+io_entity_service_tree(struct io_entity *entity)
+{
+	struct io_sched_data *sched_data = entity->sched_data;
+	unsigned int idx = entity->ioprio_class - 1;
+
+	BUG_ON(idx >= IO_IOPRIO_CLASSES);
+	BUG_ON(sched_data == NULL);
+
+	return sched_data->service_tree + idx;
+}
+
+/* A request got dispatched from the io_queue. Do the accounting. */
+static inline void elv_ioq_request_dispatched(struct io_queue *ioq)
+{
+	ioq->dispatched++;
+}
+
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+	if (elv_ioq_slice_new(ioq))
+		return 0;
+	if (time_before(jiffies, ioq->slice_end))
+		return 0;
+
+	return 1;
+}
+
+/* How many requests are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+	return ioq->dispatched;
+}
+
+/* How many requests are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+	return ioq->nr_queued;
+}
+
+static inline pid_t elv_ioq_pid(struct io_queue *ioq)
+{
+	return ioq->pid;
+}
+
+static inline unsigned long elv_ioq_ttime_mean(struct io_queue *ioq)
+{
+	return ioq->ttime_mean;
+}
+
+static inline unsigned long elv_ioq_sample_valid(struct io_queue *ioq)
+{
+	return ioq_sample_valid(ioq->ttime_samples);
+}
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+	atomic_inc(&ioq->ref);
+}
+
+static inline void elv_ioq_set_slice_end(struct io_queue *ioq,
+						unsigned long slice_end)
+{
+	ioq->slice_end = slice_end;
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+						int ioprio_class)
+{
+	ioq->entity.new_ioprio_class = ioprio_class;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+	ioq->entity.new_ioprio = ioprio;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq)
+{
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+
+static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return container_of(ioq->entity.sched_data, struct io_group,
+						sched_data);
+}
+
+/* Functions used by blk-sysfs.c */
+extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_sync_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+						size_t count);
+
+/* Functions used by elevator.c */
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+extern void elv_exit_fq_data_post(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+					struct request *rq);
+extern void elv_fq_dispatched_request(struct elevator_queue *e,
+					struct request *rq);
+
+extern void elv_fq_activate_rq(struct request_queue *q, struct request *rq);
+extern void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+				struct request *rq);
+
+extern void *elv_fq_select_ioq(struct request_queue *q, int force);
+extern struct io_queue *rq_ioq(struct request *rq);
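+
+/*
+ * Rough call flow from elevator.c into this layer: elv_init_fq_data() and
+ * elv_exit_fq_data() at elevator setup/teardown, elv_ioq_request_add() when
+ * a request is inserted, elv_fq_dispatched_request() when it is moved to the
+ * dispatch list, elv_ioq_completed_request() on completion, and
+ * elv_fq_select_ioq() (via elv_select_sched_queue()) to pick the io queue to
+ * dispatch from.
+ */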
+
+/* Functions used by io schedulers */
+extern void elv_put_ioq(struct io_queue *ioq);
+extern void __elv_ioq_slice_expired(struct request_queue *q,
+					struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+		void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern int elv_hw_tag(struct elevator_queue *e);
+extern void *elv_active_sched_queue(struct elevator_queue *e);
+extern int elv_mod_idle_slice_timer(struct elevator_queue *eq,
+					unsigned long expires);
+extern int elv_del_idle_slice_timer(struct elevator_queue *eq);
+extern unsigned int elv_get_slice_idle(struct elevator_queue *eq);
+extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio);
+extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq);
+extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern int elv_nr_busy_ioq(struct elevator_queue *e);
+extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_init_fq_data(struct request_queue *q,
+					struct elevator_queue *e)
+{
+	return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+static inline void elv_exit_fq_data_post(struct elevator_queue *e) {}
+
+static inline void elv_fq_activate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_deactivate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_dispatched_request(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_removed(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_add(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_ioq_completed_request(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline struct io_queue *rq_ioq(struct request *rq) { return NULL; }
+static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 7073a90..c2f07f5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -231,6 +231,9 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
 	for (i = 0; i < ELV_HASH_ENTRIES; i++)
 		INIT_HLIST_HEAD(&eq->hash[i]);
 
+	if (elv_init_fq_data(q, eq))
+		goto err;
+
 	return eq;
 err:
 	kfree(eq);
@@ -301,9 +304,11 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
 	e->ops = NULL;
+	elv_exit_fq_data_post(e);
 	mutex_unlock(&e->sysfs_lock);
 
 	kobject_put(&e->kobj);
@@ -314,6 +319,8 @@ static void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_activate_rq(q, rq);
+
 	if (e->ops->elevator_activate_req_fn)
 		e->ops->elevator_activate_req_fn(q, rq);
 }
@@ -322,6 +329,8 @@ static void elv_deactivate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_deactivate_rq(q, rq);
+
 	if (e->ops->elevator_deactivate_req_fn)
 		e->ops->elevator_deactivate_req_fn(q, rq);
 }
@@ -446,6 +455,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	boundary = q->end_sector;
 	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -486,6 +496,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	q->end_sector = rq_end_sector(rq);
 	q->boundary_rq = rq;
@@ -553,6 +564,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	elv_rqhash_del(q, next);
 
 	q->nr_sorted--;
+	elv_ioq_request_removed(e, next);
 	q->last_merge = rq;
 }
 
@@ -657,12 +669,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 				q->last_merge = rq;
 		}
 
-		/*
-		 * Some ioscheds (cfq) run q->request_fn directly, so
-		 * rq cannot be accessed after calling
-		 * elevator_add_req_fn.
-		 */
 		q->elevator->ops->elevator_add_req_fn(q, rq);
+		elv_ioq_request_add(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_REQUEUE:
@@ -872,13 +880,12 @@ void elv_dequeue_request(struct request_queue *q, struct request *rq)
 
 int elv_queue_empty(struct request_queue *q)
 {
-	struct elevator_queue *e = q->elevator;
-
 	if (!list_empty(&q->queue_head))
 		return 0;
 
-	if (e->ops->elevator_queue_empty_fn)
-		return e->ops->elevator_queue_empty_fn(q);
+	/* Hopefully nr_sorted works and no need to call queue_empty_fn */
+	if (q->nr_sorted)
+		return 0;
 
 	return 1;
 }
@@ -953,8 +960,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight--;
-		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
-			e->ops->elevator_completed_req_fn(q, rq);
+		if (blk_sorted_rq(rq)) {
+			if (e->ops->elevator_completed_req_fn)
+				e->ops->elevator_completed_req_fn(q, rq);
+			elv_ioq_completed_request(q, rq);
+		}
 	}
 
 	/*
@@ -1242,3 +1252,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 	return NULL;
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq */
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+	return ioq_sched_queue(rq_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+	return ioq_sched_queue(elv_fq_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2755d5c..4634949 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -245,6 +245,11 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* io queue request belongs to */
+	struct io_queue *ioq;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c59b769..679c149 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -2,6 +2,7 @@
 #define _LINUX_ELEVATOR_H
 
 #include <linux/percpu.h>
+#include "../../block/elevator-fq.h"
 
 #ifdef CONFIG_BLOCK
 
@@ -29,6 +30,18 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+						struct request*);
+typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
+						struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+						void*, int probe);
+#endif
 
 struct elevator_ops
 {
@@ -56,6 +69,17 @@ struct elevator_ops
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+	elevator_should_preempt_fn *elevator_should_preempt_fn;
+	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
 };
 
 #define ELV_NAME_MAX	(16)
@@ -76,6 +100,9 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	int elevator_features;
+#endif
 };
 
 /*
@@ -89,6 +116,10 @@ struct elevator_queue
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
 	struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* fair queuing data */
+	struct elv_fq_data efqd;
+#endif
 };
 
 /*
@@ -209,5 +240,25 @@ enum {
 	__val;							\
 })
 
+/* iosched can let elevator know their feature set/capability */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fq logic of elevator layer */
+#define	ELV_IOSCHED_NEED_FQ	1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
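+
+/*
+ * As an illustration (the "foo" iosched is hypothetical), a scheduler that
+ * wants to reuse the elevator fair queuing logic would advertise the feature
+ * in its elevator_type, roughly as:
+ *
+ *	static struct elevator_type iosched_foo = {
+ *		.ops			= { ... },
+ *		.elevator_name		= "foo",
+ *		.elevator_features	= ELV_IOSCHED_NEED_FQ,
+ *		.elevator_owner		= THIS_MODULE,
+ *	};
+ *
+ * Schedulers that leave elevator_features at zero keep the old behavior.
+ */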
+
+#else /* ELV_IOSCHED_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+#endif /* ELV_IOSCHED_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 02/18] io-controller: Common flat fair queuing code in elevator layer
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
  2009-05-05 19:58 ` [PATCH 01/18] io-controller: Documentation Vivek Goyal
  2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` [PATCH 03/18] io-controller: Charge for time slice based on average disk rate Vivek Goyal
                   ` (34 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando
  Cc: akpm, vgoyal

This is the common fair queuing code in the elevator layer. It is controlled by
the config option CONFIG_ELV_FAIR_QUEUING. This patch initially introduces only
flat fair queuing support, where there is a single group, the "root group", and
all tasks belong to it.

These elevator layer changes are backward compatible. That means any io
scheduler using the old interfaces will continue to work.
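
The backward compatibility comes from a per-scheduler feature flag: unless an
io scheduler advertises ELV_IOSCHED_NEED_FQ in its elevator_features,
elv_iosched_fair_queuing_enabled() returns 0 and every hook added to the
elevator paths bails out early. A sketch of the guard pattern used throughout
the new code (elv_ioq_request_add() is shown; the other hooks follow the same
shape):

	void elv_ioq_request_add(struct request_queue *q, struct request *rq)
	{
		if (!elv_iosched_fair_queuing_enabled(q->elevator))
			return;
		/* fair queuing accounting and preemption checks follow */
	}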

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   13 +
 block/Makefile           |    1 +
 block/blk-sysfs.c        |   25 +
 block/elevator-fq.c      | 2076 ++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h      |  488 +++++++++++
 block/elevator.c         |   46 +-
 include/linux/blkdev.h   |    5 +
 include/linux/elevator.h |   51 ++
 8 files changed, 2694 insertions(+), 11 deletions(-)
 create mode 100644 block/elevator-fq.c
 create mode 100644 block/elevator-fq.h

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
 
 menu "IO Schedulers"
 
+config ELV_FAIR_QUEUING
+	bool "Elevator Fair Queuing Support"
+	default n
+	---help---
+	  Traditionally only cfq had a notion of multiple queues and it did
+	  fair queuing on its own. With cgroups and the need to control IO,
+	  now even the simple io schedulers like noop, deadline and AS will
+	  have one queue per cgroup and will need hierarchical fair queuing.
+	  Instead of every io scheduler implementing its own fair queuing
+	  logic, this option enables fair queuing in the elevator layer so
+	  that other io schedulers can make use of it.
+	  If unsure, say N.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/Makefile b/block/Makefile
index e9fa4dd..94bfc6e 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING)	+= elevator-fq.o
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3ff9bba..082a273 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -276,6 +276,26 @@ static struct queue_sysfs_entry queue_iostats_entry = {
 	.store = queue_iostats_store,
 };
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+static struct queue_sysfs_entry queue_slice_idle_entry = {
+	.attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_slice_idle_show,
+	.store = elv_slice_idle_store,
+};
+
+static struct queue_sysfs_entry queue_slice_sync_entry = {
+	.attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_slice_sync_show,
+	.store = elv_slice_sync_store,
+};
+
+static struct queue_sysfs_entry queue_slice_async_entry = {
+	.attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_slice_async_show,
+	.store = elv_slice_async_store,
+};
+#endif
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -287,6 +307,11 @@ static struct attribute *default_attrs[] = {
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
 	&queue_iostats_entry.attr,
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	&queue_slice_idle_entry.attr,
+	&queue_slice_sync_entry.attr,
+	&queue_slice_async_entry.attr,
+#endif
 	NULL,
 };
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..9aea899
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,2076 @@
+/*
+ * BFQ: Hierarchical B-WF2Q+ scheduler.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+#include <linux/blktrace_api.h>
+
+/* Values taken from cfq */
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+int elv_slice_idle = HZ / 125;
+static struct kmem_cache *elv_ioq_pool;
+
+#define ELV_SLICE_SCALE		(5)
+#define ELV_HW_QUEUE_MIN	(5)
+#define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+					struct io_queue *ioq, int probe);
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract);
+
+static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
+					unsigned short prio)
+{
+	const int base_slice = efqd->elv_slice[sync];
+
+	WARN_ON(prio >= IOPRIO_BE_NR);
+
+	return base_slice + (base_slice/ELV_SLICE_SCALE * (4 - prio));
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	return elv_prio_slice(efqd, elv_ioq_sync(ioq), ioq->entity.ioprio);
+}
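+
+/*
+ * Worked example (assuming HZ=1000, so the cfq-derived default
+ * elv_slice_sync is 100ms and ELV_SLICE_SCALE is 5): a sync queue at the
+ * default prio 4 gets a 100ms slice, prio 0 gets 100 + 20 * 4 = 180ms and
+ * prio 7 gets 100 - 20 * 3 = 40ms.
+ */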
+
+/* Mainly the BFQ scheduling code follows */
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(bfq_timestamp_t a, bfq_timestamp_t b)
+{
+	return (s64)(a - b) > 0;
+}
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor.
+ */
+static inline bfq_timestamp_t bfq_delta(bfq_service_t service,
+					bfq_weight_t weight)
+{
+	bfq_timestamp_t d = (bfq_timestamp_t)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct io_entity *entity,
+				   bfq_service_t service)
+{
+	BUG_ON(entity->weight == 0);
+
+	entity->finish = entity->start + bfq_delta(service, entity->weight);
+}
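+
+/*
+ * For illustration: with WFQ_SERVICE_SHIFT = 22, a queue of weight 8
+ * (ioprio 0) that receives S units of service sees its finish timestamp
+ * advance by (S << 22) / 8, while a queue of weight 1 (ioprio 7) advances
+ * by (S << 22) / 1, i.e. eight times faster. Since eligible entities are
+ * picked in increasing finish-time order, the weight-8 queue ends up
+ * receiving roughly eight times the service of the weight-1 queue over the
+ * long run.
+ */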
+
+static inline struct io_queue *io_entity_to_ioq(struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data == NULL)
+		ioq = container_of(entity, struct io_queue, entity);
+	return ioq;
+}
+
+/**
+ * bfq_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the corresponding entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct io_entity *bfq_entity_of(struct rb_node *node)
+{
+	struct io_entity *entity = NULL;
+
+	if (node != NULL)
+		entity = rb_entry(node, struct io_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_extract - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_extract(struct rb_root *root, struct io_entity *entity)
+{
+	BUG_ON(entity->tree != root);
+
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_extract - extract an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_extract(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *next;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	BUG_ON(entity->tree != &st->idle);
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = bfq_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = bfq_entity_of(next);
+	}
+
+	bfq_extract(&st->idle, entity);
+
+	/* Delete queue from idle list */
+	if (ioq)
+		list_del(&ioq->queue_list);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct io_entity *entity)
+{
+	struct io_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	BUG_ON(entity->tree != NULL);
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct io_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of a entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function  assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct io_entity *entity,
+					struct rb_node *node)
+{
+	struct io_entity *child;
+
+	if (node != NULL) {
+		child = rb_entry(node, struct io_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value.  The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+	struct io_entity *entity = rb_entry(node, struct io_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (parent == NULL)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right != NULL)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left != NULL)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * for each node, containing the minimum value of the start times of
+ * its children (and of the node itself), so it is possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
+	if (node->rb_left != NULL)
+		node = node->rb_left;
+	else if (node->rb_right != NULL)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	return IOPRIO_BE_NR - ioprio;
+}
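
For reference, with IOPRIO_BE_NR assumed to be 8 (as in the mainline ioprio
definitions) this mapping gives best-effort priority 0 the largest weight (8)
and priority 7 the smallest (1), so two backlogged queues at those two
priorities should see roughly an 8:1 split of service. A trivial stand-alone
check (mine, not part of the patch):

#include <stdio.h>

#define IOPRIO_BE_NR	8	/* assumed, as in include/linux/ioprio.h */

static unsigned long demo_ioprio_to_weight(int ioprio)
{
	return IOPRIO_BE_NR - ioprio;
}

int main(void)
{
	int p;

	for (p = 0; p < IOPRIO_BE_NR; p++)
		printf("ioprio %d -> weight %lu\n", p, demo_ioprio_to_weight(p));
	return 0;
}
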
+
+void bfq_get_entity(struct io_entity *entity)
+{
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (ioq)
+		elv_get_ioq(ioq);
+}
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (node->rb_right == NULL && node->rb_left == NULL)
+		deepest = rb_parent(node);
+	else if (node->rb_right == NULL)
+		deepest = node->rb_left;
+	else if (node->rb_left == NULL)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right != NULL)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_extract - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_extract(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_extract(&st->active, entity);
+
+	if (node != NULL)
+		bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+
+	/* Add this queue to idle list */
+	if (ioq)
+		list_add(&ioq->queue_list, &ioq->efqd->idle_list);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(!entity->on_st);
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	ioq = io_entity_to_ioq(entity);
+	if (!ioq)
+		return;
+	elv_put_ioq(ioq);
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+void bfq_put_idle_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	bfq_idle_extract(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+void bfq_forget_idle(struct io_service_tree *st)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * Active tree is empty. Pull back vtime to the finish time
+		 * of the last idle entity on the idle tree.
+		 * The rationale seems to be that this reduces the possibility
+		 * of vtime wraparound (the signed difference used by
+		 * bfq_gt() overflowing).
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+
+static struct io_service_tree *
+__bfq_entity_update_prio(struct io_service_tree *old_st,
+				struct io_entity *entity)
+{
+	struct io_service_tree *new_st = old_st;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (entity->ioprio_changed) {
+		entity->ioprio = entity->new_ioprio;
+		entity->ioprio_class = entity->new_ioprio_class;
+		entity->ioprio_changed = 0;
+
+		/*
+		 * Also update the scaled budget for ioq. Group will get the
+		 * updated budget once ioq is selected to run next.
+		 */
+		if (ioq) {
+			struct elv_fq_data *efqd = ioq->efqd;
+			entity->budget = elv_prio_to_slice(efqd, ioq);
+		}
+
+		old_st->wsum -= entity->weight;
+		entity->weight = bfq_ioprio_to_weight(entity->ioprio);
+
+		/*
+		 * NOTE: here we may be changing the weight too early,
+		 * this will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = io_entity_service_tree(entity);
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the entity's current budget (and the
+ * service it has received, if @entity is active) to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct io_entity *entity, int add_front)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	if (entity == sd->active_entity) {
+		BUG_ON(entity->tree != NULL);
+		/*
+		 * If we are requeueing the current entity we have
+		 * to take care of not charging to it service it has
+		 * not received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_active entity below it.  We reuse the old
+		 * start time.
+		 */
+		bfq_active_extract(st, entity);
+	} else if (entity->tree == &st->idle) {
+		/*
+		 * Must be on the idle tree, bfq_idle_extract() will
+		 * check for that.
+		 */
+		bfq_idle_extract(st, entity);
+		entity->start = bfq_gt(st->vtime, entity->finish) ?
+				       st->vtime : entity->finish;
+	} else {
+		/*
+		 * The finish time of the entity may be invalid, and
+		 * it is in the past for sure, otherwise the queue
+		 * would have been on the idle tree.
+		 */
+		entity->start = st->vtime;
+		st->wsum += entity->weight;
+		bfq_get_entity(entity);
+
+		BUG_ON(entity->on_st);
+		entity->on_st = 1;
+	}
+
+	st = __bfq_entity_update_prio(st, entity);
+	/*
+	 * This is to emulate cfq-like functionality where preemption can
+	 * happen within the same class, like a sync queue preempting an
+	 * async queue. Maybe this is not a very good idea from a fairness
+	 * point of view, as the preempting queue gains share. Keeping it
+	 * for now.
+	 */
+	if (add_front) {
+		struct io_entity *next_entity;
+
+		/*
+		 * Determine the entity which will be dispatched next.
+		 * Use sd->next_active once the hierarchical patch is applied.
+		 */
+		next_entity = bfq_lookup_next_entity(sd, 0);
+
+		if (next_entity && next_entity != entity) {
+			struct io_service_tree *new_st;
+			bfq_timestamp_t delta;
+
+			new_st = io_entity_service_tree(next_entity);
+
+			/*
+			 * At this point, both entities should belong to the
+			 * same service tree, as cross-service-tree preemption
+			 * is automatically taken care of by the algorithm.
+			 */
+			BUG_ON(new_st != st);
+			entity->finish = next_entity->finish - 1;
+			delta = bfq_delta(entity->budget, entity->weight);
+			entity->start = entity->finish - delta;
+			if (bfq_gt(entity->start, st->vtime))
+				entity->start = st->vtime;
+		}
+	} else {
+		bfq_calc_finish(entity, entity->budget);
+	}
+	bfq_active_insert(st, entity);
+}
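
To make the add_front path above concrete, here is a small stand-alone sketch
(mine, not part of the patch; the demo_* names and the numbers are arbitrary)
of the timestamp back-calculation it performs: the preempting entity gets a
finish time just below that of the would-be next entity, its start time is
derived backwards from its budget, and the start is then clamped to the
current vtime so that the entity is immediately eligible.

#include <stdio.h>
#include <stdint.h>

#define WFQ_SERVICE_SHIFT	22

static uint64_t demo_delta(unsigned long service, unsigned long weight)
{
	return ((uint64_t)service << WFQ_SERVICE_SHIFT) / weight;
}

/* mimic the add_front timestamp fixup done in __bfq_activate_entity() */
static void demo_add_front(uint64_t vtime, uint64_t next_finish,
			   unsigned long budget, unsigned long weight,
			   uint64_t *start, uint64_t *finish)
{
	*finish = next_finish - 1;
	*start = *finish - demo_delta(budget, weight);
	if ((int64_t)(*start - vtime) > 0)	/* bfq_gt(start, vtime) */
		*start = vtime;			/* make it eligible now */
}

int main(void)
{
	uint64_t s, f;

	/* small budget: start would land after vtime, so it gets clamped */
	demo_add_front(1073741824ULL, 4294967296ULL, 100, 4, &s, &f);
	printf("start=%llu finish=%llu\n", (unsigned long long)s,
	       (unsigned long long)f);

	/* large budget: start already falls before vtime, no clamping */
	demo_add_front(1073741824ULL, 4294967296ULL, 2000, 2, &s, &f);
	printf("start=%llu finish=%llu\n", (unsigned long long)s,
	       (unsigned long long)f);
	return 0;
}
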
+
+/**
+ * bfq_activate_entity - activate an entity.
+ * @entity: the entity to activate.
+ */
+void bfq_activate_entity(struct io_entity *entity, int add_front)
+{
+	__bfq_activate_entity(entity, add_front);
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently from its previous state.  If the
+ * entity was not on a service tree just return, otherwise if it is on
+ * any scheduler tree, extract it from that tree, and, if necessary and
+ * if the caller specified @requeue, put it on the idle tree.
+ */
+int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+	int was_active = entity == sd->active_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	BUG_ON(was_active && entity->tree != NULL);
+
+	if (was_active) {
+		bfq_calc_finish(entity, entity->service);
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_extract(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_extract(st, entity);
+	else if (entity->tree != NULL)
+		BUG();
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	BUG_ON(sd->active_entity == entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+void bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	__bfq_deactivate_entity(entity, requeue);
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated tasks getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct io_service_tree *st)
+{
+	struct io_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
+	entry = rb_entry(node, struct io_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active - find the eligible entity with the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches the first schedulable entity, starting from the
+ * root of the tree and going on the left every time on this side there is
+ * a subtree with at least one eligible (start <= vtime) entity.  The path
+ * on the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct io_entity *bfq_first_active_entity(struct io_service_tree *st)
+{
+	struct io_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
+	while (node != NULL) {
+		entry = rb_entry(node, struct io_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		BUG_ON(bfq_gt(entry->min_start, st->vtime));
+
+		if (node->rb_left != NULL) {
+			entry = rb_entry(node->rb_left,
+					 struct io_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first != NULL)
+			break;
+		node = node->rb_right;
+	}
+
+	BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
+	return first;
+}
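
The tree walk above is just an O(log N) way of applying the B-WF2Q+ selection
rule: among the entities whose start time is not after the current vtime (the
eligible ones), pick the one with the smallest finish time. Here is a
brute-force user-space sketch of the same rule (mine, not part of the patch;
it deliberately ignores the min_start augmentation that makes the real lookup
logarithmic):

#include <stdio.h>
#include <stdint.h>

struct demo_entity {
	uint64_t start;
	uint64_t finish;
};

/* index of the eligible (start <= vtime) entity with smallest finish, or -1 */
static int demo_first_active(const struct demo_entity *e, int n,
			     uint64_t vtime)
{
	int i, best = -1;

	for (i = 0; i < n; i++) {
		if ((int64_t)(e[i].start - vtime) > 0)
			continue;	/* not yet eligible */
		if (best < 0 || (int64_t)(e[best].finish - e[i].finish) > 0)
			best = i;
	}
	return best;
}

int main(void)
{
	struct demo_entity e[] = {
		{ .start = 10, .finish = 40 },
		{ .start = 25, .finish = 30 },	/* smaller finish, later start */
		{ .start =  0, .finish = 50 },
	};

	/* at V=20 entity 1 is not yet eligible, so entity 0 wins */
	printf("selected at V=20: %d\n", demo_first_active(e, 3, 20));
	/* at V=25 entity 1 becomes eligible and has the smallest finish */
	printf("selected at V=25: %d\n", demo_first_active(e, 3, 25));
	return 0;
}
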
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct io_entity *__bfq_lookup_next_entity(struct io_service_tree *st)
+{
+	struct io_entity *entity;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+	BUG_ON(bfq_gt(entity->start, st->vtime));
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_active entity at each level of the
+ * hierarchy, the complexity of the lookup could be decreased with
+ * absolutely no effort by just returning the cached next_active value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract)
+{
+	struct io_service_tree *st = sd->service_tree;
+	struct io_entity *entity;
+	int i;
+
+	/*
+	 * One can check which entity will be selected next without
+	 * expiring the current one.
+	 */
+	BUG_ON(extract && sd->active_entity != NULL);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+		entity = __bfq_lookup_next_entity(st);
+		if (entity != NULL) {
+			if (extract) {
+				bfq_active_extract(st, entity);
+				sd->active_entity = entity;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+void entity_served(struct io_entity *entity, bfq_service_t served)
+{
+	struct io_service_tree *st;
+
+	st = io_entity_service_tree(entity);
+	entity->service += served;
+	BUG_ON(st->wsum == 0);
+	st->vtime += bfq_delta(served, st->wsum);
+	bfq_forget_idle(st);
+}
+
+/* Elevator fair queuing function */
+struct io_queue *rq_ioq(struct request *rq)
+{
+	return rq->ioq;
+}
+
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+	return e->efqd.active_queue;
+}
+
+void *elv_active_sched_queue(struct elevator_queue *e)
+{
+	return ioq_sched_queue(elv_active_ioq(e));
+}
+EXPORT_SYMBOL(elv_active_sched_queue);
+
+int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_ioq);
+
+int elv_nr_busy_rt_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_rt_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_rt_ioq);
+
+int elv_hw_tag(struct elevator_queue *e)
+{
+	return e->efqd.hw_tag;
+}
+EXPORT_SYMBOL(elv_hw_tag);
+
+/* Helper functions for operating on elevator idle slice timer */
+int elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return mod_timer(&efqd->idle_slice_timer, expires);
+}
+EXPORT_SYMBOL(elv_mod_idle_slice_timer);
+
+int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return del_timer(&efqd->idle_slice_timer);
+}
+EXPORT_SYMBOL(elv_del_idle_slice_timer);
+
+unsigned int elv_get_slice_idle(struct elevator_queue *eq)
+{
+	return eq->efqd.elv_slice_idle;
+}
+EXPORT_SYMBOL(elv_get_slice_idle);
+
+void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
+{
+	entity_served(&ioq->entity, served);
+}
+
+/* Tells whether ioq is queued in root group or not */
+static inline int is_root_group_ioq(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return (ioq->entity.sched_data == &efqd->root_group->sched_data);
+}
+
+/* Functions to show and store elv_idle_slice value through sysfs */
+ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = jiffies_to_msecs(efqd->elv_slice_idle);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data < 0)
+		data = 0;
+	else if (data > INT_MAX)
+		data = INT_MAX;
+
+	data = msecs_to_jiffies(data);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->elv_slice_idle = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
+/* Functions to show and store elv_slice_sync value through sysfs */
+ssize_t elv_slice_sync_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = efqd->elv_slice[1];
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data < 0)
+		data = 0;
+	/* 100ms is the limit for now */
+	else if (data > 100)
+		data = 100;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->elv_slice[1] = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
+/* Functions to show and store elv_slice_async value through sysfs */
+ssize_t elv_slice_async_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = efqd->elv_slice[0];
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data < 0)
+		data = 0;
+	/* 100ms is the limit for now */
+	else if (data > 100)
+		data = 100;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->elv_slice[0] = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (elv_nr_busy_ioq(q->elevator)) {
+		elv_log(efqd, "schedule dispatch");
+		kblockd_schedule_work(efqd->queue, &efqd->unplug_work);
+	}
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+void elv_kick_queue(struct work_struct *work)
+{
+	struct elv_fq_data *efqd =
+		container_of(work, struct elv_fq_data, unplug_work);
+	struct request_queue *q = efqd->queue;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	blk_start_queueing(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+	del_timer_sync(&e->efqd.idle_slice_timer);
+	cancel_work_sync(&e->efqd.unplug_work);
+}
+EXPORT_SYMBOL(elv_shutdown_timer_wq);
+
+void elv_ioq_set_prio_slice(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	ioq->slice_end = jiffies + ioq->entity.budget;
+	elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->entity.budget);
+}
+
+static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	unsigned long elapsed = jiffies - ioq->last_end_request;
+	unsigned long ttime = min(elapsed, 2UL * efqd->elv_slice_idle);
+
+	ioq->ttime_samples = (7*ioq->ttime_samples + 256) / 8;
+	ioq->ttime_total = (7*ioq->ttime_total + 256*ttime) / 8;
+	ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
+}
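
The three lines above keep a fixed-point exponentially decayed average of the
inter-request "think time", with 256 acting as the fixed-point scale of the
sample count. A stand-alone sketch of the same arithmetic (mine, not part of
the patch; it skips the min(elapsed, 2 * slice_idle) clamp the real code
applies, and the slice_idle value of 8 jiffies is just an example) shows how
the mean tracks a change in process behaviour and how it would compare
against elv_slice_idle in elv_ioq_update_idle_window():

#include <stdio.h>

struct demo_ttime {
	unsigned long samples;
	unsigned long total;
	unsigned long mean;
};

/* same 7/8 decay as elv_ioq_update_io_thinktime() */
static void demo_update(struct demo_ttime *t, unsigned long ttime)
{
	t->samples = (7 * t->samples + 256) / 8;
	t->total = (7 * t->total + 256 * ttime) / 8;
	t->mean = (t->total + 128) / t->samples;
}

int main(void)
{
	struct demo_ttime t = { 0, 0, 0 };
	unsigned long slice_idle = 8;	/* example value, in jiffies */
	int i;

	/* a process that consistently thinks for 2 jiffies between requests */
	for (i = 0; i < 10; i++)
		demo_update(&t, 2);
	printf("mean=%lu -> idling %s\n", t.mean,
	       t.mean > slice_idle ? "disabled" : "kept");

	/* it starts thinking for 20 jiffies; the mean catches up */
	for (i = 0; i < 10; i++)
		demo_update(&t, 20);
	printf("mean=%lu -> idling %s\n", t.mean,
	       t.mean > slice_idle ? "disabled" : "kept");
	return 0;
}
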
+
+/*
+ * Disable the idle window if the process thinks for too long.
+ * This idle flag can also be updated by the io scheduler.
+ */
+static void elv_ioq_update_idle_window(struct elevator_queue *eq,
+				struct io_queue *ioq, struct request *rq)
+{
+	int old_idle, enable_idle;
+	struct elv_fq_data *efqd = ioq->efqd;
+
+	/*
+	 * Don't idle for async or idle io prio class
+	 */
+	if (!elv_ioq_sync(ioq) || elv_ioq_class_idle(ioq))
+		return;
+
+	enable_idle = old_idle = elv_ioq_idle_window(ioq);
+
+	if (!efqd->elv_slice_idle)
+		enable_idle = 0;
+	else if (ioq_sample_valid(ioq->ttime_samples)) {
+		if (ioq->ttime_mean > efqd->elv_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+
+	/*
+	 * From the think time perspective, idling should be enabled. Check
+	 * with the io scheduler if it wants to disable idling based on
+	 * additional considerations like the seek pattern.
+	 */
+	if (enable_idle) {
+		if (eq->ops->elevator_update_idle_window_fn)
+			enable_idle = eq->ops->elevator_update_idle_window_fn(
+						eq, ioq->sched_queue, rq);
+		if (!enable_idle)
+			elv_log_ioq(efqd, ioq, "iosched disabled idle");
+	}
+
+	if (old_idle != enable_idle) {
+		elv_log_ioq(efqd, ioq, "idle=%d", enable_idle);
+		if (enable_idle)
+			elv_mark_ioq_idle_window(ioq);
+		else
+			elv_clear_ioq_idle_window(ioq);
+	}
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+	struct io_queue *ioq = NULL;
+
+	ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+	return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+	kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+			void *sched_queue, int ioprio_class, int ioprio,
+			int is_sync)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+	struct io_group *iog = io_lookup_io_group_current(efqd->queue);
+
+	RB_CLEAR_NODE(&ioq->entity.rb_node);
+	atomic_set(&ioq->ref, 0);
+	ioq->efqd = efqd;
+	elv_ioq_set_ioprio_class(ioq, ioprio_class);
+	elv_ioq_set_ioprio(ioq, ioprio);
+	ioq->pid = current->pid;
+	ioq->sched_queue = sched_queue;
+	if (is_sync && !elv_ioq_class_idle(ioq))
+		elv_mark_ioq_idle_window(ioq);
+	bfq_init_entity(&ioq->entity, iog);
+	ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
+	return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
+						efqd);
+
+	BUG_ON(atomic_read(&ioq->ref) <= 0);
+	if (!atomic_dec_and_test(&ioq->ref))
+		return;
+	BUG_ON(ioq->nr_queued);
+	BUG_ON(ioq->entity.tree != NULL);
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(efqd->active_queue == ioq);
+
+	/* Can be called by outgoing elevator. Don't use q */
+	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+
+	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+	elv_log_ioq(efqd, ioq, "put_queue");
+	elv_free_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_put_ioq);
+
+void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+	struct io_queue *ioq = *ioq_ptr;
+
+	if (ioq != NULL) {
+		/* Drop the reference taken by the io group */
+		elv_put_ioq(ioq);
+		*ioq_ptr = NULL;
+	}
+}
+
+/*
+ * Normally the next io queue to be served is selected from the service tree.
+ * This function allows one to choose a specific io queue to run next,
+ * out of order. This is primarily to accommodate the close_cooperator
+ * feature of cfq.
+ *
+ * Currently this is done only at the root level because, to begin with, the
+ * close cooperator feature is supported only for the root group, to make
+ * sure the default cfq behavior in a flat hierarchy is not changed.
+ */
+void elv_set_next_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	struct io_sched_data *sd = &efqd->root_group->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	BUG_ON(efqd->active_queue != NULL || sd->active_entity != NULL);
+	BUG_ON(!efqd->busy_queues);
+	BUG_ON(sd != entity->sched_data);
+	BUG_ON(!st);
+
+	bfq_update_vtime(st);
+	bfq_active_extract(st, entity);
+	sd->active_entity = entity;
+	entity->service = 0;
+	elv_log_ioq(efqd, ioq, "set_next_ioq");
+}
+
+/* Get next queue for service. */
+struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = NULL;
+	struct io_queue *ioq = NULL;
+	struct io_sched_data *sd;
+
+	/*
+	 * One can check which queue will be selected next while another
+	 * queue is active. The preemption logic uses it.
+	 */
+	BUG_ON(extract && efqd->active_queue != NULL);
+
+	if (!efqd->busy_queues)
+		return NULL;
+
+	sd = &efqd->root_group->sched_data;
+	if (extract)
+		entity = bfq_lookup_next_entity(sd, 1);
+	else
+		entity = bfq_lookup_next_entity(sd, 0);
+
+	BUG_ON(!entity);
+	if (extract)
+		entity->service = 0;
+	ioq = io_entity_to_ioq(entity);
+
+	return ioq;
+}
+
+/*
+ * coop tells that io scheduler selected a queue for us and we did not
+ * select the next queue based on fairness.
+ */
+static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int coop)
+{
+	struct request_queue *q = efqd->queue;
+
+	if (ioq) {
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+							efqd->busy_queues);
+		ioq->slice_end = 0;
+
+		elv_clear_ioq_wait_request(ioq);
+		elv_clear_ioq_must_dispatch(ioq);
+		elv_mark_ioq_slice_new(ioq);
+
+		del_timer(&efqd->idle_slice_timer);
+	}
+
+	efqd->active_queue = ioq;
+
+	/* Let iosched know if it wants to take some action */
+	if (ioq) {
+		if (q->elevator->ops->elevator_active_ioq_set_fn)
+			q->elevator->ops->elevator_active_ioq_set_fn(q,
+							ioq->sched_queue, coop);
+	}
+}
+
+/* Get and set a new active queue for service. */
+struct io_queue *elv_set_active_ioq(struct request_queue *q,
+						struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	int coop = 0;
+
+	if (!ioq)
+		ioq = elv_get_next_ioq(q, 1);
+	else {
+		elv_set_next_ioq(q, ioq);
+		/*
+		 * The io scheduler selected the next queue for us. Pass
+		 * this info back to the io scheduler. cfq currently uses it
+		 * to reset the coop flag on the queue.
+		 */
+		coop = 1;
+	}
+	__elv_set_active_ioq(efqd, ioq, coop);
+	return ioq;
+}
+
+void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+	struct request_queue *q = efqd->queue;
+	struct io_queue *ioq = elv_active_ioq(efqd->queue->elevator);
+
+	if (q->elevator->ops->elevator_active_ioq_reset_fn)
+		q->elevator->ops->elevator_active_ioq_reset_fn(q,
+							ioq->sched_queue);
+	efqd->active_queue = NULL;
+	del_timer(&efqd->idle_slice_timer);
+}
+
+void elv_activate_ioq(struct io_queue *ioq, int add_front)
+{
+	bfq_activate_entity(&ioq->entity, add_front);
+}
+
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue)
+{
+	if (ioq == efqd->active_queue)
+		elv_reset_active_ioq(efqd);
+
+	bfq_deactivate_entity(&ioq->entity, requeue);
+}
+
+/* Called when an inactive queue receives a new request. */
+void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(ioq == efqd->active_queue);
+	elv_log_ioq(efqd, ioq, "add to busy");
+	elv_activate_ioq(ioq, 0);
+	elv_mark_ioq_busy(ioq);
+	efqd->busy_queues++;
+	if (elv_ioq_class_rt(ioq))
+		efqd->busy_rt_queues++;
+}
+
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	BUG_ON(!elv_ioq_busy(ioq));
+	BUG_ON(ioq->nr_queued);
+	elv_log_ioq(efqd, ioq, "del from busy");
+	elv_clear_ioq_busy(ioq);
+	BUG_ON(efqd->busy_queues == 0);
+	efqd->busy_queues--;
+	if (elv_ioq_class_rt(ioq))
+		efqd->busy_rt_queues--;
+
+	elv_deactivate_ioq(efqd, ioq, requeue);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * the current queue used and adjust the start and finish time of the queue
+ * and the vtime of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations. Especially when the underlying device supports command
+ * queuing and requests from multiple queues can be outstanding at the same
+ * time, it is not clear how much of the disk time each queue consumed.
+ *
+ * To mitigate this problem, cfq starts the time slice of the queue only
+ * after the first request from the queue has completed. This does not work
+ * very well if we expire the queue before waiting for the first (and
+ * further) requests from the queue to finish. For seeky queues, we will
+ * expire the queue after dispatching a few requests, without waiting, and
+ * start dispatching from the next queue.
+ *
+ * Not sure how to determine the time consumed by the queue in such
+ * scenarios. Currently, as a crude approximation, we charge 25% of the time
+ * slice in such cases. A better mechanism is needed for accurate accounting.
+ */
+void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	long slice_unused = 0, slice_used = 0, slice_overshoot = 0;
+
+	assert_spin_locked(q->queue_lock);
+	elv_log_ioq(efqd, ioq, "slice expired");
+
+	if (elv_ioq_wait_request(ioq))
+		del_timer(&efqd->idle_slice_timer);
+
+	elv_clear_ioq_wait_request(ioq);
+
+	/*
+	 * If ioq->slice_end == 0, the queue was expired before the first
+	 * request from the queue completed. Of course we were not planning
+	 * to idle on the queue, otherwise we would not have expired it.
+	 *
+	 * Charge 25% of the slice in such cases. This is not the best thing
+	 * to do, but at the same time it is not very clear what the next
+	 * best thing to do is.
+	 *
+	 * This arises from the fact that we don't have the notion of only
+	 * one queue being operational at a time. The io scheduler can
+	 * dispatch requests from multiple queues in one dispatch round.
+	 * Ideally, for more accurate accounting of the exact disk time used,
+	 * one should dispatch requests from only one queue and wait for all
+	 * of those requests to finish. But this would reduce throughput.
+	 */
+	if (!ioq->slice_end)
+		slice_used = entity->budget/4;
+	else {
+		if (time_after(ioq->slice_end, jiffies)) {
+			slice_unused = ioq->slice_end - jiffies;
+			if (slice_unused == entity->budget) {
+				/*
+				 * queue got expired immediately after
+				 * completing first request. Charge 25% of
+				 * slice.
+				 */
+				slice_used = entity->budget/4;
+			} else
+				slice_used = entity->budget - slice_unused;
+		} else {
+			slice_overshoot = jiffies - ioq->slice_end;
+			slice_used = entity->budget + slice_overshoot;
+		}
+	}
+
+	elv_log_ioq(efqd, ioq, "sl_end=%lx, jiffies=%lx", ioq->slice_end,
+			jiffies);
+	elv_log_ioq(efqd, ioq, "sl_used=%ld, budget=%ld overshoot=%ld",
+				slice_used, entity->budget, slice_overshoot);
+	elv_ioq_served(ioq, slice_used);
+
+	BUG_ON(ioq != efqd->active_queue);
+	elv_reset_active_ioq(efqd);
+
+	if (!ioq->nr_queued)
+		elv_del_ioq_busy(q->elevator, ioq, 1);
+	else
+		elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(__elv_ioq_slice_expired);
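
A compact stand-alone rendition of the charging policy described above (mine,
not part of the patch; it uses a plain comparison instead of time_after() and
the numbers are arbitrary) reproduces the three cases: a queue expired before
its first request completed, a queue expired with part of the slice unused,
and a queue that overshot its slice.

#include <stdio.h>

/*
 * Mirror of the charging logic in __elv_ioq_slice_expired():
 * slice_end == 0 or a completely unused slice is charged budget/4,
 * otherwise charge the consumed part plus any overshoot.
 */
static long demo_slice_used(unsigned long slice_end, unsigned long now,
			    long budget)
{
	long unused;

	if (!slice_end)
		return budget / 4;
	if (now < slice_end) {
		unused = slice_end - now;
		return unused == budget ? budget / 4 : budget - unused;
	}
	return budget + (now - slice_end);
}

int main(void)
{
	long budget = 100;	/* slice length in jiffies, for illustration */

	printf("expired before first completion: %ld\n",
	       demo_slice_used(0, 1000, budget));
	printf("expired with 30 jiffies unused:  %ld\n",
	       demo_slice_used(1000, 970, budget));
	printf("overshot by 15 jiffies:          %ld\n",
	       demo_slice_used(1000, 1015, budget));
	return 0;
}
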
+
+/*
+ *  Expire the ioq.
+ */
+void elv_ioq_slice_expired(struct request_queue *q)
+{
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (ioq)
+		__elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_cfqq should preempt the currently active queue. Return 0 for
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ */
+int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+			struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elevator_queue *eq = q->elevator;
+
+	ioq = elv_active_ioq(eq);
+
+	if (!ioq)
+		return 0;
+
+	if (elv_ioq_slice_used(ioq))
+		return 1;
+
+	if (elv_ioq_class_idle(new_ioq))
+		return 0;
+
+	if (elv_ioq_class_idle(ioq))
+		return 1;
+
+	/*
+	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+	 */
+	if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
+		return 1;
+
+	/*
+	 * Check with io scheduler if it has additional criterion based on
+	 * which it wants to preempt existing queue.
+	 */
+	if (eq->ops->elevator_should_preempt_fn)
+		return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
+
+	return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
+	elv_ioq_slice_expired(q);
+
+	/*
+	 * Put the new queue at the front of the current list,
+	 * so we know that it will be selected next.
+	 */
+
+	elv_activate_ioq(ioq, 1);
+	elv_ioq_set_slice_end(ioq, 0);
+	elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	BUG_ON(!efqd);
+	BUG_ON(!ioq);
+	efqd->rq_queued++;
+	ioq->nr_queued++;
+
+	if (!elv_ioq_busy(ioq))
+		elv_add_ioq_busy(efqd, ioq);
+
+	elv_ioq_update_io_thinktime(ioq);
+	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+
+	if (ioq == elv_active_ioq(q->elevator)) {
+		/*
+		 * Remember that we saw a request from this process, but
+		 * don't start queuing just yet. Otherwise we risk seeing lots
+		 * of tiny requests, because we disrupt the normal plugging
+		 * and merging. If the request is already larger than a single
+		 * page, let it rip immediately. For that case we assume that
+		 * merging is already done. Ditto for a busy system that
+		 * has other work pending, don't risk delaying until the
+		 * idle timer unplug to continue working.
+		 */
+		if (elv_ioq_wait_request(ioq)) {
+			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+			    efqd->busy_queues > 1) {
+				del_timer(&efqd->idle_slice_timer);
+				blk_start_queueing(q);
+			}
+			elv_mark_ioq_must_dispatch(ioq);
+		}
+	} else if (elv_should_preempt(q, ioq, rq)) {
+		/*
+		 * not the active queue - expire the current slice if it is
+		 * idle and has exceeded its mean think time, or this new
+		 * queue has some old slice time left and is of higher
+		 * priority, or this new queue is RT and the current one is BE
+		 */
+		elv_preempt_queue(q, ioq);
+		blk_start_queueing(q);
+	}
+}
+
+void elv_idle_slice_timer(unsigned long data)
+{
+	struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+	struct io_queue *ioq;
+	unsigned long flags;
+	struct request_queue *q = efqd->queue;
+
+	elv_log(efqd, "idle timer fired");
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	ioq = efqd->active_queue;
+
+	if (ioq) {
+
+		/*
+		 * We saw a request before the queue expired, let it through
+		 */
+		if (elv_ioq_must_dispatch(ioq))
+			goto out_kick;
+
+		/*
+		 * expired
+		 */
+		if (elv_ioq_slice_used(ioq))
+			goto expire;
+
+		/*
+		 * only expire and reinvoke request handler, if there are
+		 * other queues with pending requests
+		 */
+		if (!elv_nr_busy_ioq(q->elevator))
+			goto out_cont;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (ioq->nr_queued)
+			goto out_kick;
+	}
+expire:
+	elv_ioq_slice_expired(q);
+out_kick:
+	elv_schedule_dispatch(q);
+out_cont:
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	unsigned long sl;
+
+	BUG_ON(!ioq);
+
+	/*
+	 * SSD device without seek penalty, disable idling. But only do so
+	 * for devices that support queuing, otherwise we still have a problem
+	 * with sync vs async workloads.
+	 */
+	if (blk_queue_nonrot(q) && efqd->hw_tag)
+		return;
+
+	/*
+	 * still requests with the driver, don't idle
+	 */
+	if (efqd->rq_in_driver)
+		return;
+
+	/*
+	 * idle is disabled, either manually or by past process history
+	 */
+	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+		return;
+
+	/*
+	 * The io scheduler may have its own idling logic. In that case it
+	 * will take care of arming the timer, if need be.
+	 */
+	if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+		q->elevator->ops->elevator_arm_slice_timer_fn(q,
+						ioq->sched_queue);
+	} else {
+		elv_mark_ioq_wait_request(ioq);
+		sl = efqd->elv_slice_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log(efqd, "arm idle: %lu", sl);
+	}
+}
+
+void elv_free_idle_ioq_list(struct elevator_queue *e)
+{
+	struct io_queue *ioq, *n;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
+		elv_deactivate_ioq(efqd, ioq, 0);
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+
+	if (!elv_nr_busy_ioq(q->elevator))
+		return NULL;
+
+	if (ioq == NULL)
+		goto new_queue;
+
+	/*
+	 * Force dispatch. Continue to dispatch from current queue as long
+	 * as it has requests.
+	 */
+	if (unlikely(force)) {
+		if (ioq->nr_queued)
+			goto keep_queue;
+		else
+			goto expire;
+	}
+
+	/*
+	 * The active queue has run out of time, expire it and select new.
+	 */
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+		goto expire;
+
+	/*
+	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
+	 * cfqq.
+	 */
+	if (!elv_ioq_class_rt(ioq) && efqd->busy_rt_queues) {
+		/*
+		 * We simulate this as cfqq timed out so that it gets to bank
+		 * the remaining of its time slice.
+		 * the remainder of its time slice.
+		elv_log_ioq(efqd, ioq, "preempt");
+		goto expire;
+	}
+
+	/*
+	 * The active queue has requests and isn't expired, allow it to
+	 * dispatch.
+	 */
+
+	if (ioq->nr_queued)
+		goto keep_queue;
+
+	/*
+	 * If another queue has a request waiting within our mean seek
+	 * distance, let it run.  The expire code will check for close
+	 * cooperators and put the close queue at the front of the service
+	 * tree.
+	 */
+	new_ioq = elv_close_cooperator(q, ioq, 0);
+	if (new_ioq)
+		goto expire;
+
+	/*
+	 * No requests pending. If the active queue still has requests in
+	 * flight or is idling for a new request, allow either of these
+	 * conditions to happen (or time out) before selecting a new queue.
+	 */
+
+	if (timer_pending(&efqd->idle_slice_timer) ||
+	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
+expire:
+	elv_ioq_slice_expired(q);
+new_queue:
+	ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+	return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	ioq = rq->ioq;
+	BUG_ON(!ioq);
+	ioq->nr_queued--;
+
+	efqd = ioq->efqd;
+	BUG_ON(!efqd);
+	efqd->rq_queued--;
+
+	if (elv_ioq_busy(ioq) && (elv_active_ioq(e) != ioq) && !ioq->nr_queued)
+		elv_del_ioq_busy(e, ioq, 1);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	BUG_ON(!ioq);
+	elv_ioq_request_dispatched(ioq);
+	elv_ioq_request_removed(e, rq);
+	elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	efqd->rq_in_driver++;
+	elv_log_ioq(efqd, rq_ioq(rq), "activate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+	elv_log_ioq(efqd, rq_ioq(rq), "deactivate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+/*
+ * Update hw_tag based on peak queue depth over 50 samples under
+ * sufficient load.
+ */
+static void elv_update_hw_tag(struct elv_fq_data *efqd)
+{
+	if (efqd->rq_in_driver > efqd->rq_in_driver_peak)
+		efqd->rq_in_driver_peak = efqd->rq_in_driver;
+
+	if (efqd->rq_queued <= ELV_HW_QUEUE_MIN &&
+	    efqd->rq_in_driver <= ELV_HW_QUEUE_MIN)
+		return;
+
+	if (efqd->hw_tag_samples++ < 50)
+		return;
+
+	if (efqd->rq_in_driver_peak >= ELV_HW_QUEUE_MIN)
+		efqd->hw_tag = 1;
+	else
+		efqd->hw_tag = 0;
+
+	efqd->hw_tag_samples = 0;
+	efqd->rq_in_driver_peak = 0;
+}
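
The heuristic above is easy to exercise in isolation. The stand-alone sketch
below (mine, not part of the patch; ELV_HW_QUEUE_MIN is assumed to be 5, like
CFQ_HW_QUEUE_MIN in cfq) feeds it two synthetic queue-depth traces: with
plenty of requests queued but a driver that never holds more than two of
them, hw_tag is cleared after 50 qualifying samples; with a deep driver
queue, it is set again.

#include <stdio.h>

#define DEMO_HW_QUEUE_MIN	5	/* assumed, mirrors CFQ_HW_QUEUE_MIN */

struct demo_efqd {
	int rq_queued, rq_in_driver;
	int rq_in_driver_peak, hw_tag_samples, hw_tag;
};

/* same structure as elv_update_hw_tag() */
static void demo_update_hw_tag(struct demo_efqd *e)
{
	if (e->rq_in_driver > e->rq_in_driver_peak)
		e->rq_in_driver_peak = e->rq_in_driver;

	/* ignore samples taken under insufficient load */
	if (e->rq_queued <= DEMO_HW_QUEUE_MIN &&
	    e->rq_in_driver <= DEMO_HW_QUEUE_MIN)
		return;

	if (e->hw_tag_samples++ < 50)
		return;

	e->hw_tag = e->rq_in_driver_peak >= DEMO_HW_QUEUE_MIN;
	e->hw_tag_samples = 0;
	e->rq_in_driver_peak = 0;
}

int main(void)
{
	struct demo_efqd e = { .hw_tag = 1 };	/* starts optimistic */
	int i;

	for (i = 0; i < 60; i++) {
		e.rq_queued = 10;
		e.rq_in_driver = 2;	/* shallow device queue */
		demo_update_hw_tag(&e);
	}
	printf("hw_tag after shallow-queue trace: %d\n", e.hw_tag);

	for (i = 0; i < 60; i++) {
		e.rq_queued = 10;
		e.rq_in_driver = 16;	/* deep device queue */
		demo_update_hw_tag(&e);
	}
	printf("hw_tag after deep-queue trace:    %d\n", e.hw_tag);
	return 0;
}
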
+
+/*
+ * If the io scheduler has the capability of keeping track of close
+ * cooperators, check with it whether it has a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+					struct io_queue *ioq, int probe)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *new_ioq = NULL;
+
+	/*
+	 * Currently this feature is supported only for flat hierarchy or
+	 * root group queues so that default cfq behavior is not changed.
+	 */
+	if (!is_root_group_ioq(q, ioq))
+		return NULL;
+
+	if (q->elevator->ops->elevator_close_cooperator_fn)
+		new_ioq = e->ops->elevator_close_cooperator_fn(q,
+						ioq->sched_queue, probe);
+
+	/* Only select co-operating queue if it belongs to root group */
+	if (new_ioq && !is_root_group_ioq(q, new_ioq))
+		return NULL;
+
+	return new_ioq;
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+	const int sync = rq_is_sync(rq);
+	struct io_queue *ioq = rq->ioq;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	elv_log_ioq(efqd, ioq, "complete");
+
+	elv_update_hw_tag(efqd);
+
+	WARN_ON(!efqd->rq_in_driver);
+	WARN_ON(!ioq->dispatched);
+	efqd->rq_in_driver--;
+	ioq->dispatched--;
+
+	if (sync)
+		ioq->last_end_request = jiffies;
+
+	/*
+	 * If this is the active queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+
+	if (elv_active_ioq(q->elevator) == ioq) {
+		if (elv_ioq_slice_new(ioq)) {
+			elv_ioq_set_prio_slice(q, ioq);
+			elv_clear_ioq_slice_new(ioq);
+		}
+		/*
+		 * If there are no requests waiting in this queue, and
+		 * there are other queues ready to issue requests, AND
+		 * those other queues are issuing requests within our
+		 * mean seek distance, give them a chance to run instead
+		 * of idling.
+		 */
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+			elv_ioq_slice_expired(q);
+		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+			 && sync && !rq_noidle(rq))
+			elv_ioq_arm_slice_timer(q);
+	}
+
+	if (!efqd->rq_in_driver)
+		elv_schedule_dispatch(q);
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio)
+{
+	struct io_queue *ioq = NULL;
+
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		ioq = iog->async_queue[0][ioprio];
+		break;
+	case IOPRIO_CLASS_BE:
+		ioq = iog->async_queue[1][ioprio];
+		break;
+	case IOPRIO_CLASS_IDLE:
+		ioq = iog->async_idle_queue;
+		break;
+	default:
+		BUG();
+	}
+
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+EXPORT_SYMBOL(io_group_async_queue_prio);
+
+void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		iog->async_queue[0][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_BE:
+		iog->async_queue[1][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		iog->async_idle_queue = ioq;
+		break;
+	default:
+		BUG();
+	}
+
+	/*
+	 * Take the group reference and pin the queue. Group exit will
+	 * clean it up
+	 */
+	elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(io_group_set_async_queue);
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+static void elv_slab_kill(void)
+{
+	/*
+	 * Caller already ensured that pending RCU callbacks are completed,
+	 * so we should have no busy allocations at this point.
+	 */
+	if (elv_ioq_pool)
+		kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+	elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+	if (!elv_ioq_pool)
+		goto fail;
+
+	return 0;
+fail:
+	elv_slab_kill();
+	return -ENOMEM;
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	iog = io_alloc_root_group(q, e, efqd);
+	if (iog == NULL)
+		return 1;
+
+	efqd->root_group = iog;
+	efqd->queue = q;
+
+	init_timer(&efqd->idle_slice_timer);
+	efqd->idle_slice_timer.function = elv_idle_slice_timer;
+	efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+	INIT_LIST_HEAD(&efqd->idle_list);
+
+	efqd->elv_slice[0] = elv_slice_async;
+	efqd->elv_slice[1] = elv_slice_sync;
+	efqd->elv_slice_idle = elv_slice_idle;
+	efqd->hw_tag = 1;
+
+	return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask the elevator to clean up its queues, we do the cleanup here so
+ * that all the group and idle tree references to ioq are dropped. Later,
+ * during elevator cleanup, the ioc reference will be dropped, which will
+ * lead to the removal of the io scheduler queue as well as the associated
+ * ioq object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+	struct request_queue *q = efqd->queue;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+
+	spin_lock_irq(q->queue_lock);
+	/* This should drop all the idle tree references of ioq */
+	elv_free_idle_ioq_list(e);
+	spin_unlock_irq(q->queue_lock);
+
+	elv_shutdown_timer_wq(e);
+
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+	io_free_root_group(e);
+}
+
+/*
+ * This is called after the io scheduler has cleaned up its data structures.
+ * I don't think this function is required. Right now I am keeping it only
+ * because cfq cleans up the timer and work queue again after freeing up
+ * io contexts. To me, the io scheduler has already been drained out, and
+ * all the active queues have already been expired, so the timer and work
+ * queue should not have been activated during the cleanup process.
+ *
+ * Keeping it here for the time being. Will get rid of it later.
+ */
+void elv_exit_fq_data_post(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+}
+
+
+static int __init elv_fq_init(void)
+{
+	if (elv_slab_setup())
+		return -ENOMEM;
+
+	/* could be 0 on HZ < 1000 setups */
+
+	if (!elv_slice_async)
+		elv_slice_async = 1;
+
+	if (!elv_slice_idle)
+		elv_slice_idle = 1;
+
+	return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..3bea279
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,488 @@
+/*
+ * BFQ: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ */
+
+#include <linux/blkdev.h>
+
+#ifndef _BFQ_SCHED_H
+#define _BFQ_SCHED_H
+
+#define IO_IOPRIO_CLASSES	3
+
+typedef u64 bfq_timestamp_t;
+typedef unsigned long bfq_weight_t;
+typedef unsigned long bfq_service_t;
+struct io_entity;
+struct io_queue;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/**
+ * struct bfq_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * bfq_service_tree.  All the fields are protected by the queue lock
+ * of the containing efqd.
+ */
+struct io_service_tree {
+	struct rb_root active;
+	struct rb_root idle;
+
+	struct io_entity *first_idle;
+	struct io_entity *last_idle;
+
+	bfq_timestamp_t vtime;
+	bfq_weight_t wsum;
+};
+
+/**
+ * struct bfq_sched_data - multi-class scheduler.
+ * @active_entity: entity under service.
+ * @next_active: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * bfq_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_active points to the active entity of the sched_data service
+ * trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among requests of the same
+ * queue requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct io_sched_data {
+	struct io_entity *active_entity;
+	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+/**
+ * struct bfq_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ *         the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ *             this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
+ *                 associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ *                    ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested an ioprio or
+ *                  ioprio_class change.
+ *
+ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
+ * cgroup hierarchy) or a bfq_group into the upper level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities have also their own sched_data, stored
+ * in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would allow
+ * different weights on different devices, but this functionality is not
+ * exported to userspace by now.  Priorities are updated lazily, first
+ * storing the new values into the new_* fields, then setting the
+ * @ioprio_changed flag.  As soon as there is a transition in the entity
+ * state that allows the priority update to take place the effective and
+ * the requested priority values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.  When dealing with ``well-behaved'' queues (i.e.,
+ * queues that do not spend too much time to consume their budget and
+ * have true sequential behavior, and when there are no external factors
+ * breaking anticipation) the relative weights at each level of the
+ * cgroups hierarchy should be guaranteed.
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+struct io_entity {
+	struct rb_node rb_node;
+
+	int on_st;
+
+	bfq_timestamp_t finish;
+	bfq_timestamp_t start;
+
+	struct rb_root *tree;
+
+	bfq_timestamp_t min_start;
+
+	bfq_service_t service, budget;
+	bfq_weight_t weight;
+
+	struct io_entity *parent;
+
+	struct io_sched_data *my_sched_data;
+	struct io_sched_data *sched_data;
+
+	unsigned short ioprio, new_ioprio;
+	unsigned short ioprio_class, new_ioprio_class;
+
+	int ioprio_changed;
+};
+
+/*
+ * A common structure embedded by every io scheduler into its respective
+ * queue structure.
+ */
+struct io_queue {
+	struct io_entity entity;
+	atomic_t ref;
+	unsigned int flags;
+
+	/* Pointer to generic elevator data structure */
+	struct elv_fq_data *efqd;
+	struct list_head queue_list;
+	pid_t pid;
+
+	/* Number of requests queued on this io queue */
+	unsigned long nr_queued;
+
+	/* Requests dispatched from this queue */
+	int dispatched;
+
+	/* Keep a track of think time of processes in this queue */
+	unsigned long last_end_request;
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+
+	unsigned long slice_end;
+
+	/* Pointer to io scheduler's queue */
+	void *sched_queue;
+};
+
+struct io_group {
+	struct io_sched_data sched_data;
+
+	/* async_queue and idle_queue are used only for cfq */
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+};
+
+struct elv_fq_data {
+	struct io_group *root_group;
+
+	/* List of io queues on idle tree. */
+	struct list_head idle_list;
+
+	struct request_queue *queue;
+	unsigned int busy_queues;
+	/*
+	 * Used to track any pending rt requests so we can preempt the
+	 * currently serviced non-RT ioq when this value is non-zero.
+	 */
+	unsigned int busy_rt_queues;
+
+	/* Number of requests queued */
+	int rq_queued;
+
+	/* Pointer to the ioscheduler queue being served */
+	void *active_queue;
+
+	int rq_in_driver;
+	int hw_tag;
+	int hw_tag_samples;
+	int rq_in_driver_peak;
+
+	/*
+	 * elevator fair queuing layer has the capability to provide idling
+	 * for ensuring fairness for processes doing dependent reads.
+	 * This might be needed to ensure fairness among two processes doing
+	 * synchronous reads in two different cgroups. noop and deadline don't
+	 * have any notion of anticipation/idling of their own, so as of now
+	 * they are the users of this functionality.
+	 */
+	unsigned int elv_slice_idle;
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
+	unsigned int elv_slice[2];
+};
+
+extern int elv_slice_idle;
+extern int elv_slice_async;
+
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
+				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples)   ((samples) > 80)
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_busy = 0,          /* has requests or is under service */
+	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+	ELV_QUEUE_FLAG_idle_window,	  /* elevator slice idling enabled */
+	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
+	ELV_QUEUE_FLAG_must_dispatch,	  /* must be allowed a dispatch */
+	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
+	ELV_QUEUE_FLAG_NR,
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name)					\
+static inline void elv_mark_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline int elv_ioq_##name(struct io_queue *ioq)         		\
+{                                                                       \
+	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
+}
+
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
+
+static inline struct io_service_tree *
+io_entity_service_tree(struct io_entity *entity)
+{
+	struct io_sched_data *sched_data = entity->sched_data;
+	unsigned int idx = entity->ioprio_class - 1;
+
+	BUG_ON(idx >= IO_IOPRIO_CLASSES);
+	BUG_ON(sched_data == NULL);
+
+	return sched_data->service_tree + idx;
+}
+
+/* A request got dispatched from the io_queue. Do the accounting. */
+static inline void elv_ioq_request_dispatched(struct io_queue *ioq)
+{
+	ioq->dispatched++;
+}
+
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+	if (elv_ioq_slice_new(ioq))
+		return 0;
+	if (time_before(jiffies, ioq->slice_end))
+		return 0;
+
+	return 1;
+}
+
+/* How many requests are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+	return ioq->dispatched;
+}
+
+/* How many requests are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+	return ioq->nr_queued;
+}
+
+static inline pid_t elv_ioq_pid(struct io_queue *ioq)
+{
+	return ioq->pid;
+}
+
+static inline unsigned long elv_ioq_ttime_mean(struct io_queue *ioq)
+{
+	return ioq->ttime_mean;
+}
+
+static inline unsigned long elv_ioq_sample_valid(struct io_queue *ioq)
+{
+	return ioq_sample_valid(ioq->ttime_samples);
+}
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+	atomic_inc(&ioq->ref);
+}
+
+static inline void elv_ioq_set_slice_end(struct io_queue *ioq,
+						unsigned long slice_end)
+{
+	ioq->slice_end = slice_end;
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+						int ioprio_class)
+{
+	ioq->entity.new_ioprio_class = ioprio_class;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+	ioq->entity.new_ioprio = ioprio;
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq)
+{
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+
+static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return container_of(ioq->entity.sched_data, struct io_group,
+						sched_data);
+}
+
+/* Functions used by blksysfs.c */
+extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_sync_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+						size_t count);
+
+/* Functions used by elevator.c */
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+extern void elv_exit_fq_data_post(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+					struct request *rq);
+extern void elv_fq_dispatched_request(struct elevator_queue *e,
+					struct request *rq);
+
+extern void elv_fq_activate_rq(struct request_queue *q, struct request *rq);
+extern void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+				struct request *rq);
+
+extern void *elv_fq_select_ioq(struct request_queue *q, int force);
+extern struct io_queue *rq_ioq(struct request *rq);
+
+/* Functions used by io schedulers */
+extern void elv_put_ioq(struct io_queue *ioq);
+extern void __elv_ioq_slice_expired(struct request_queue *q,
+					struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+		void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern int elv_hw_tag(struct elevator_queue *e);
+extern void *elv_active_sched_queue(struct elevator_queue *e);
+extern int elv_mod_idle_slice_timer(struct elevator_queue *eq,
+					unsigned long expires);
+extern int elv_del_idle_slice_timer(struct elevator_queue *eq);
+extern unsigned int elv_get_slice_idle(struct elevator_queue *eq);
+extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio);
+extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq);
+extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern int elv_nr_busy_ioq(struct elevator_queue *e);
+extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_init_fq_data(struct request_queue *q,
+					struct elevator_queue *e)
+{
+	return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+static inline void elv_exit_fq_data_post(struct elevator_queue *e) {}
+
+static inline void elv_fq_activate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_deactivate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_dispatched_request(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_removed(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_add(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_ioq_completed_request(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline struct io_queue *rq_ioq(struct request *rq) { return NULL; }
+static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+#endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 7073a90..c2f07f5 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -231,6 +231,9 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
 	for (i = 0; i < ELV_HASH_ENTRIES; i++)
 		INIT_HLIST_HEAD(&eq->hash[i]);
 
+	if (elv_init_fq_data(q, eq))
+		goto err;
+
 	return eq;
 err:
 	kfree(eq);
@@ -301,9 +304,11 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
 	e->ops = NULL;
+	elv_exit_fq_data_post(e);
 	mutex_unlock(&e->sysfs_lock);
 
 	kobject_put(&e->kobj);
@@ -314,6 +319,8 @@ static void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_activate_rq(q, rq);
+
 	if (e->ops->elevator_activate_req_fn)
 		e->ops->elevator_activate_req_fn(q, rq);
 }
@@ -322,6 +329,8 @@ static void elv_deactivate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_deactivate_rq(q, rq);
+
 	if (e->ops->elevator_deactivate_req_fn)
 		e->ops->elevator_deactivate_req_fn(q, rq);
 }
@@ -446,6 +455,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	boundary = q->end_sector;
 	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -486,6 +496,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	q->end_sector = rq_end_sector(rq);
 	q->boundary_rq = rq;
@@ -553,6 +564,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	elv_rqhash_del(q, next);
 
 	q->nr_sorted--;
+	elv_ioq_request_removed(e, next);
 	q->last_merge = rq;
 }
 
@@ -657,12 +669,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 				q->last_merge = rq;
 		}
 
-		/*
-		 * Some ioscheds (cfq) run q->request_fn directly, so
-		 * rq cannot be accessed after calling
-		 * elevator_add_req_fn.
-		 */
 		q->elevator->ops->elevator_add_req_fn(q, rq);
+		elv_ioq_request_add(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_REQUEUE:
@@ -872,13 +880,12 @@ void elv_dequeue_request(struct request_queue *q, struct request *rq)
 
 int elv_queue_empty(struct request_queue *q)
 {
-	struct elevator_queue *e = q->elevator;
-
 	if (!list_empty(&q->queue_head))
 		return 0;
 
-	if (e->ops->elevator_queue_empty_fn)
-		return e->ops->elevator_queue_empty_fn(q);
+	/* Hopefully nr_sorted works and no need to call queue_empty_fn */
+	if (q->nr_sorted)
+		return 0;
 
 	return 1;
 }
@@ -953,8 +960,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight--;
-		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
-			e->ops->elevator_completed_req_fn(q, rq);
+		if (blk_sorted_rq(rq)) {
+			if (e->ops->elevator_completed_req_fn)
+				e->ops->elevator_completed_req_fn(q, rq);
+			elv_ioq_completed_request(q, rq);
+		}
 	}
 
 	/*
@@ -1242,3 +1252,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 	return NULL;
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq */
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+	return ioq_sched_queue(rq_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+	return ioq_sched_queue(elv_fq_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2755d5c..4634949 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -245,6 +245,11 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* io queue request belongs to */
+	struct io_queue *ioq;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c59b769..679c149 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -2,6 +2,7 @@
 #define _LINUX_ELEVATOR_H
 
 #include <linux/percpu.h>
+#include "../../block/elevator-fq.h"
 
 #ifdef CONFIG_BLOCK
 
@@ -29,6 +30,18 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+						struct request*);
+typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
+						struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+						void*, int probe);
+#endif
 
 struct elevator_ops
 {
@@ -56,6 +69,17 @@ struct elevator_ops
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+	elevator_should_preempt_fn *elevator_should_preempt_fn;
+	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
 };
 
 #define ELV_NAME_MAX	(16)
@@ -76,6 +100,9 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	int elevator_features;
+#endif
 };
 
 /*
@@ -89,6 +116,10 @@ struct elevator_queue
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
 	struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* fair queuing data */
+	struct elv_fq_data efqd;
+#endif
 };
 
 /*
@@ -209,5 +240,25 @@ enum {
 	__val;							\
 })
 
+/* An iosched can let the elevator know its feature set/capabilities */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fq logic of elevator layer */
+#define	ELV_IOSCHED_NEED_FQ	1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 03/18] io-controller: Charge for time slice based on average disk rate
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (3 preceding siblings ...)
  2009-05-05 19:58 ` [PATCH 03/18] io-controller: Charge for time slice based on average disk rate Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
                   ` (32 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda
  Cc: vgoyal, akpm

o There are situations where a queue gets expired very soon and it looks
  as if the time slice used by that queue is zero. For example, an async
  queue may dispatch a bunch of requests and be expired before the first
  request completes. Another example is a queue that is expired as soon
  as the first request completes and the queue has no more requests
  (sync queues on an SSD).

o Currently we just charge 25% of the slice length in such cases. This patch
  tries to improve on that approximation by keeping track of the average disk
  rate and charging for time as nr_sectors/disk_rate (see the sketch below).

o This is still experimental; I am not yet sure whether it gives a measurable
  improvement.
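
For illustration, here is a minimal userspace sketch of the rate estimate and
of the resulting time charge; it is not the kernel code itself. The structure
and helper names (disk_rate_stats, rate_sample, charge_for_sectors) and the
HZ value are made up for the example, while the arithmetic mirrors
elv_update_io_rate() and elv_disk_time_used() in the diff below.

#include <stdio.h>

#define HZ 1000				/* assumed tick rate for the example */
#define RATE_SAMPLING_WINDOW	(HZ / 10)

struct disk_rate_stats {
	unsigned long rate_sectors;	/* EWMA of sectors completed (scaled) */
	unsigned long rate_time;	/* EWMA of window length (scaled) */
	unsigned long mean_rate;	/* sectors per jiffy */
};

/* Fold one completed sampling window into the exponentially weighted mean. */
static void rate_sample(struct disk_rate_stats *st, unsigned long sectors,
			unsigned long elapsed)
{
	if (!elapsed)
		elapsed = 1;		/* window shorter than one jiffy */

	/* 7/8 of the old value + 1/8 of the new sample, in 256x fixed point */
	st->rate_sectors = (7 * st->rate_sectors + 256 * sectors) / 8;
	st->rate_time = (7 * st->rate_time + 256 * elapsed) / 8;
	st->mean_rate = (st->rate_sectors + st->rate_time / 2) / st->rate_time;
}

/* Charge a queue that was expired "too early": time = sectors / rate. */
static unsigned long charge_for_sectors(struct disk_rate_stats *st,
					unsigned long nr_sectors,
					unsigned long fallback)
{
	unsigned long jiffies_used;

	if (!st->mean_rate)
		return fallback;	/* no estimate yet: charge budget/4 */

	jiffies_used = nr_sectors / st->mean_rate;
	return jiffies_used ? jiffies_used : 1;
}

int main(void)
{
	struct disk_rate_stats st = { 0, 0, 0 };

	/* e.g. 4096 sectors completed in one full sampling window */
	rate_sample(&st, 4096, RATE_SAMPLING_WINDOW);
	printf("mean rate: %lu sectors/jiffy\n", st.mean_rate);

	/* a queue that dispatched 512 sectors before being expired early */
	printf("charge: %lu jiffies\n", charge_for_sectors(&st, 512, 25));
	return 0;
}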

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |   85 +++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h |   11 ++++++
 2 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9aea899..9f1fbb9 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -19,6 +19,9 @@ const int elv_slice_async_rq = 2;
 int elv_slice_idle = HZ / 125;
 static struct kmem_cache *elv_ioq_pool;
 
+/* Maximum Window length for updating average disk rate */
+static int elv_rate_sampling_window = HZ / 10;
+
 #define ELV_SLICE_SCALE		(5)
 #define ELV_HW_QUEUE_MIN	(5)
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
@@ -1022,6 +1025,47 @@ static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
 	ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
 }
 
+static void elv_update_io_rate(struct elv_fq_data *efqd, struct request *rq)
+{
+	long elapsed = jiffies - efqd->rate_sampling_start;
+	unsigned long total;
+
+	/* sampling window is off */
+	if (!efqd->rate_sampling_start)
+		return;
+
+	efqd->rate_sectors_current += rq->nr_sectors;
+
+	if (efqd->rq_in_driver && (elapsed < elv_rate_sampling_window))
+		return;
+
+	efqd->rate_sectors = (7*efqd->rate_sectors +
+				256*efqd->rate_sectors_current) / 8;
+
+	if (!elapsed) {
+		/*
+		 * The rate is being updated before even one jiffy has elapsed.
+		 * This could be a problem with fast queuing/non-queuing hardware;
+		 * should we look at a higher resolution time source?
+		 *
+		 * In case of non-queuing hardware we will probably not try to
+		 * dispatch from multiple queues, and will be able to account for
+		 * the disk time used directly, so we will not need this
+		 * approximation anyway.
+		 */
+		elapsed = 1;
+	}
+
+	efqd->rate_time = (7*efqd->rate_time + 256*elapsed) / 8;
+	total = efqd->rate_sectors + (efqd->rate_time/2);
+	efqd->mean_rate = total/efqd->rate_time;
+
+	elv_log(efqd, "mean_rate=%lu, t=%ld s=%lu", efqd->mean_rate,
+			elapsed, efqd->rate_sectors_current);
+	efqd->rate_sampling_start = 0;
+	efqd->rate_sectors_current = 0;
+}
+
 /*
  * Disable idle window if the process thinks too long.
  * This idle flag can also be updated by io scheduler.
@@ -1312,6 +1356,34 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
 }
 
 /*
+ * Calculate the effective disk time used by the queue, based on how many
+ * sectors the queue has dispatched and the average disk rate.
+ * Returns the disk time in jiffies.
+ */
+static inline unsigned long elv_disk_time_used(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	unsigned long jiffies_used = 0;
+
+	if (!efqd->mean_rate)
+		return entity->budget/4;
+
+	/* Charge the queue based on average disk rate */
+	jiffies_used = ioq->nr_sectors/efqd->mean_rate;
+
+	if (!jiffies_used)
+		jiffies_used = 1;
+
+	elv_log_ioq(efqd, ioq, "disk time=%ums sect=%d rate=%lu",
+				jiffies_to_msecs(jiffies_used),
+				ioq->nr_sectors, efqd->mean_rate);
+
+	return jiffies_used;
+}
+
+/*
  * Do the accounting. Determine how much service (in terms of time slices)
  * current queue used and adjust the start, finish time of queue and vtime
  * of the tree accordingly.
@@ -1363,7 +1435,7 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	 * the requests to finish. But this will reduce throughput.
 	 */
 	if (!ioq->slice_end)
-		slice_used = entity->budget/4;
+		slice_used = elv_disk_time_used(q, ioq);
 	else {
 		if (time_after(ioq->slice_end, jiffies)) {
 			slice_unused = ioq->slice_end - jiffies;
@@ -1373,7 +1445,7 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 				 * completing first request. Charge 25% of
 				 * slice.
 				 */
-				slice_used = entity->budget/4;
+				slice_used = elv_disk_time_used(q, ioq);
 			} else
 				slice_used = entity->budget - slice_unused;
 		} else {
@@ -1391,6 +1463,8 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	BUG_ON(ioq != efqd->active_queue);
 	elv_reset_active_ioq(efqd);
 
+	/* Queue is being expired. Reset number of sectors dispatched */
+	ioq->nr_sectors = 0;
 	if (!ioq->nr_queued)
 		elv_del_ioq_busy(q->elevator, ioq, 1);
 	else
@@ -1725,6 +1799,7 @@ void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
 
 	BUG_ON(!ioq);
 	elv_ioq_request_dispatched(ioq);
+	ioq->nr_sectors += rq->nr_sectors;
 	elv_ioq_request_removed(e, rq);
 	elv_clear_ioq_must_dispatch(ioq);
 }
@@ -1737,6 +1812,10 @@ void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
 		return;
 
 	efqd->rq_in_driver++;
+
+	if (!efqd->rate_sampling_start)
+		efqd->rate_sampling_start = jiffies;
+
 	elv_log_ioq(efqd, rq_ioq(rq), "activate rq, drv=%d",
 						efqd->rq_in_driver);
 }
@@ -1826,6 +1905,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 	efqd->rq_in_driver--;
 	ioq->dispatched--;
 
+	elv_update_io_rate(efqd, rq);
+
 	if (sync)
 		ioq->last_end_request = jiffies;
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 3bea279..ce2d671 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -165,6 +165,9 @@ struct io_queue {
 	/* Requests dispatched from this queue */
 	int dispatched;
 
+	/* Number of sectors dispatched in current dispatch round */
+	int nr_sectors;
+
 	/* Keep a track of think time of processes in this queue */
 	unsigned long last_end_request;
 	unsigned long ttime_total;
@@ -223,6 +226,14 @@ struct elv_fq_data {
 	struct work_struct unplug_work;
 
 	unsigned int elv_slice[2];
+
+	/* Fields for keeping track of average disk rate */
+	unsigned long rate_sectors; /* number of sectors finished */
+	unsigned long rate_time;   /* jiffies elapsed */
+	unsigned long mean_rate; /* sectors per jiffy */
+	unsigned long long rate_sampling_start; /* sampling window start, jiffies */
+	/* number of sectors of io finished during the current sampling window */
+	unsigned long rate_sectors_current;
 };
 
 extern int elv_slice_idle;
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2009-05-05 19:58   ` [PATCH 03/18] io-controller: Charge for time slice based on average disk rate Vivek Goyal
@ 2009-05-05 19:58   ` Vivek Goyal
  2009-05-05 19:58   ` [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevaotor layer Vivek Goyal
                     ` (17 subsequent siblings)
  21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

This patch changes cfq to make use of the fair queuing code from the elevator layer.

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched |    3 +-
 block/cfq-iosched.c   | 1097 ++++++++++---------------------------------------
 2 files changed, 219 insertions(+), 881 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
 menu "IO Schedulers"
 
 config ELV_FAIR_QUEUING
-	bool "Elevator Fair Queuing Support"
+	bool
 	default n
 	---help---
 	  Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
+	select ELV_FAIR_QUEUING
 	default y
 	---help---
 	  The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a55a9bd..f90c534 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,7 +12,6 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
-
 /*
  * tunables
  */
@@ -23,15 +22,7 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 static const int cfq_back_max = 16 * 1024;
 /* penalty of a backwards seek */
 static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
-static int cfq_slice_idle = HZ / 125;
-
-/*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY		(HZ / 5)
 
 /*
  * below this threshold, we consider thinktime immediate
@@ -43,7 +34,7 @@ static int cfq_slice_idle = HZ / 125;
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq)	(struct cfq_queue *) (ioq_sched_queue((rq)->ioq))
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +44,6 @@ static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
-#define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
 
 #define sample_valid(samples)	((samples) > 80)
 
@@ -75,12 +64,6 @@ struct cfq_rb_root {
  */
 struct cfq_data {
 	struct request_queue *queue;
-
-	/*
-	 * rr list of queues with requests and the count of them
-	 */
-	struct cfq_rb_root service_tree;
-
 	/*
 	 * Each priority tree is sorted by next_request position.  These
 	 * trees are used when determining if two or more queues are
@@ -88,39 +71,10 @@ struct cfq_data {
 	 */
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
-	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
-
-	int rq_in_driver;
 	int sync_flight;
 
-	/*
-	 * queue-depth detection
-	 */
-	int rq_queued;
-	int hw_tag;
-	int hw_tag_samples;
-	int rq_in_driver_peak;
-
-	/*
-	 * idle window management
-	 */
-	struct timer_list idle_slice_timer;
-	struct work_struct unplug_work;
-
-	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 	unsigned long last_end_request;
 
@@ -131,9 +85,7 @@ struct cfq_data {
 	unsigned int cfq_fifo_expire[2];
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
-	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
-	unsigned int cfq_slice_idle;
 
 	struct list_head cic_list;
 };
@@ -142,16 +94,11 @@ struct cfq_data {
  * Per process-grouping structure
  */
 struct cfq_queue {
-	/* reference count */
-	atomic_t ref;
+	struct io_queue *ioq;
 	/* various state flags, see below */
 	unsigned int flags;
 	/* parent cfq_data */
 	struct cfq_data *cfqd;
-	/* service_tree member */
-	struct rb_node rb_node;
-	/* service_tree key */
-	unsigned long rb_key;
 	/* prio tree member */
 	struct rb_node p_node;
 	/* prio tree root we belong to, if any */
@@ -167,33 +114,23 @@ struct cfq_queue {
 	/* fifo list of requests in sort_list */
 	struct list_head fifo;
 
-	unsigned long slice_end;
-	long slice_resid;
 	unsigned int slice_dispatch;
 
 	/* pending metadata requests */
 	int meta_pending;
-	/* number of requests that are on the dispatch list or inside driver */
-	int dispatched;
 
 	/* io prio of this group */
-	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class, org_ioprio_class;
+	unsigned short org_ioprio;
+	unsigned short org_ioprio_class;
 
 	pid_t pid;
 };
 
 enum cfqq_state_flags {
-	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
-	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
-	CFQ_CFQQ_FLAG_must_dispatch,	/* must be allowed a dispatch */
 	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
 	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
 	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
-	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
 	CFQ_CFQQ_FLAG_prio_changed,	/* task priority has changed */
-	CFQ_CFQQ_FLAG_slice_new,	/* no requests dispatched in slice */
-	CFQ_CFQQ_FLAG_sync,		/* synchronous queue */
 	CFQ_CFQQ_FLAG_coop,		/* has done a coop jump of the queue */
 };
 
@@ -211,16 +148,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 	return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0;	\
 }
 
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
 CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
 CFQ_CFQQ_FNS(coop);
 #undef CFQ_CFQQ_FNS
 
@@ -259,66 +190,32 @@ static inline int cfq_bio_sync(struct bio *bio)
 	return 0;
 }
 
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
 {
-	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
-	}
+	return ioq_to_io_group(cfqq->ioq);
 }
 
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	return !cfqd->busy_queues;
+	return elv_ioq_class_idle(cfqq->ioq);
 }
 
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
-				 unsigned short prio)
-{
-	const int base_slice = cfqd->cfq_slice[sync];
-
-	WARN_ON(prio >= IOPRIO_BE_NR);
-
-	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
-}
-
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_class_rt(struct cfq_queue *cfqq)
 {
-	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
+	return elv_ioq_class_rt(cfqq->ioq);
 }
 
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
-	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
+	return elv_ioq_sync(cfqq->ioq);
 }
 
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
 {
-	if (cfq_cfqq_slice_new(cfqq))
-		return 0;
-	if (time_before(jiffies, cfqq->slice_end))
-		return 0;
+	struct cfq_data *cfqd = cfqq->cfqd;
+	struct elevator_queue *e = cfqd->queue->elevator;
 
-	return 1;
+	return (elv_active_sched_queue(e) == cfqq);
 }
 
 /*
@@ -417,33 +314,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 }
 
 /*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
-	if (!root->left)
-		root->left = rb_first(&root->rb);
-
-	if (root->left)
-		return rb_entry(root->left, struct cfq_queue, rb_node);
-
-	return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
-	rb_erase(n, root);
-	RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
-	if (root->left == n)
-		root->left = NULL;
-	rb_erase_init(n, &root->rb);
-}
-
-/*
  * would be nice to take fifo expire time into account as well
  */
 static struct request *
@@ -456,10 +326,10 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	BUG_ON(RB_EMPTY_NODE(&last->rb_node));
 
-	if (rbprev)
+	if (rbprev != NULL)
 		prev = rb_entry_rq(rbprev);
 
-	if (rbnext)
+	if (rbnext != NULL)
 		next = rb_entry_rq(rbnext);
 	else {
 		rbnext = rb_first(&cfqq->sort_list);
@@ -470,95 +340,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	return cfq_choose_req(cfqd, next, prev);
 }
 
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
-				      struct cfq_queue *cfqq)
-{
-	/*
-	 * just an approximation, should be ok.
-	 */
-	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
-{
-	struct rb_node **p, *parent;
-	struct cfq_queue *__cfqq;
-	unsigned long rb_key;
-	int left;
-
-	if (cfq_class_idle(cfqq)) {
-		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
-		if (parent && parent != &cfqq->rb_node) {
-			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-			rb_key += __cfqq->rb_key;
-		} else
-			rb_key += jiffies;
-	} else if (!add_front) {
-		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
-		rb_key += cfqq->slice_resid;
-		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
-		/*
-		 * same position, nothing more to do
-		 */
-		if (rb_key == cfqq->rb_key)
-			return;
-
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	}
-
-	left = 1;
-	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
-	while (*p) {
-		struct rb_node **n;
-
-		parent = *p;
-		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
-		/*
-		 * sort RT queues first, we always want to give
-		 * preference to them. IDLE queues goes to the back.
-		 * after that, sort on the next service time.
-		 */
-		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			n = &(*p)->rb_right;
-		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
-			n = &(*p)->rb_right;
-		else if (rb_key < __cfqq->rb_key)
-			n = &(*p)->rb_left;
-		else
-			n = &(*p)->rb_right;
-
-		if (n == &(*p)->rb_right)
-			left = 0;
-
-		p = n;
-	}
-
-	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
-
-	cfqq->rb_key = rb_key;
-	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
 static struct cfq_queue *
 cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
 		     sector_t sector, struct rb_node **ret_parent,
@@ -620,57 +401,34 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 		cfqq->p_root = NULL;
 }
 
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
 {
-	/*
-	 * Resorting requires the cfqq to be on the RR list already.
-	 */
-	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
-		cfq_prio_tree_add(cfqd, cfqq);
-	}
-}
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
-	cfq_mark_cfqq_on_rr(cfqq);
-	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
+	if (cfqd->active_cic) {
+		put_io_context(cfqd->active_cic->ioc);
+		cfqd->active_cic = NULL;
+	}
 
-	cfq_resort_rr_list(cfqd, cfqq);
+	/* Resort the cfqq in prio tree */
+	if (cfqq)
+		cfq_prio_tree_add(cfqd, cfqq);
 }
 
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as the active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
 {
-	cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-	cfq_clear_cfqq_on_rr(cfqq);
+	struct cfq_queue *cfqq = sched_queue;
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	if (cfqq->p_root) {
-		rb_erase(&cfqq->p_node, cfqq->p_root);
-		cfqq->p_root = NULL;
-	}
+	cfqq->slice_dispatch = 0;
 
-	BUG_ON(!cfqd->busy_queues);
-	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
+	cfq_clear_cfqq_must_alloc_slice(cfqq);
+	cfq_clear_cfqq_fifo_expire(cfqq);
+	if (!coop)
+		cfq_clear_cfqq_coop(cfqq);
 }
 
 /*
@@ -679,7 +437,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_del_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
-	struct cfq_data *cfqd = cfqq->cfqd;
 	const int sync = rq_is_sync(rq);
 
 	BUG_ON(!cfqq->queued[sync]);
@@ -687,8 +444,17 @@ static void cfq_del_rq_rb(struct request *rq)
 
 	elv_rb_del(&cfqq->sort_list, rq);
 
-	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
-		cfq_del_cfqq_rr(cfqd, cfqq);
+	/*
+	 * If this was the last request in the queue, remove the queue from
+	 * the prio trees. nr_queued will still be 1 for the last request, as
+	 * the elevator fair queuing layer has not done its accounting yet.
+	 */
+	if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+		if (cfqq->p_root) {
+			rb_erase(&cfqq->p_node, cfqq->p_root);
+			cfqq->p_root = NULL;
+		}
+	}
 }
 
 static void cfq_add_rq_rb(struct request *rq)
@@ -706,9 +472,6 @@ static void cfq_add_rq_rb(struct request *rq)
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
 		cfq_dispatch_insert(cfqd->queue, __alias);
 
-	if (!cfq_cfqq_on_rr(cfqq))
-		cfq_add_cfqq_rr(cfqd, cfqq);
-
 	/*
 	 * check if this request is a better next-serve candidate
 	 */
@@ -756,23 +519,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
-
 	cfqd->last_position = rq->hard_sector + rq->hard_nr_sectors;
 }
 
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
-}
-
 static void cfq_remove_request(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -783,7 +532,6 @@ static void cfq_remove_request(struct request *rq)
 	list_del_init(&rq->queuelist);
 	cfq_del_rq_rb(rq);
 
-	cfqq->cfqd->rq_queued--;
 	if (rq_is_meta(rq)) {
 		WARN_ON(!cfqq->meta_pending);
 		cfqq->meta_pending--;
@@ -857,93 +605,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	return 0;
 }
 
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
-				   struct cfq_queue *cfqq)
-{
-	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active");
-		cfqq->slice_end = 0;
-		cfqq->slice_dispatch = 0;
-
-		cfq_clear_cfqq_wait_request(cfqq);
-		cfq_clear_cfqq_must_dispatch(cfqq);
-		cfq_clear_cfqq_must_alloc_slice(cfqq);
-		cfq_clear_cfqq_fifo_expire(cfqq);
-		cfq_mark_cfqq_slice_new(cfqq);
-
-		del_timer(&cfqd->idle_slice_timer);
-	}
-
-	cfqd->active_queue = cfqq;
-}
-
 /*
  * current cfqq expired its slice (or was too idle), select new one
  */
 static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
-	if (cfq_cfqq_wait_request(cfqq))
-		del_timer(&cfqd->idle_slice_timer);
-
-	cfq_clear_cfqq_wait_request(cfqq);
-
-	/*
-	 * store what was left of this slice, if the queue idled/timed out
-	 */
-	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
-		cfqq->slice_resid = cfqq->slice_end - jiffies;
-		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
-	}
-
-	cfq_resort_rr_list(cfqd, cfqq);
-
-	if (cfqq == cfqd->active_queue)
-		cfqd->active_queue = NULL;
-
-	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
-		cfqd->active_cic = NULL;
-	}
+	__elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);
 
 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
-
-	return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
-					      struct cfq_queue *cfqq)
-{
-	if (!cfqq) {
-		cfqq = cfq_get_next_queue(cfqd);
-		if (cfqq)
-			cfq_clear_cfqq_coop(cfqq);
-	}
-
-	__cfq_set_active_queue(cfqd, cfqq);
-	return cfqq;
+		__cfq_slice_expired(cfqd, cfqq);
 }
 
 static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1020,11 +696,12 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
  * associated with the I/O issued by cur_cfqq.  I'm not sure this is a valid
  * assumption.
  */
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
-					      struct cfq_queue *cur_cfqq,
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+					      void *cur_sched_queue,
 					      int probe)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
 	/*
 	 * A valid cfq_io_context is necessary to compare requests against
@@ -1047,38 +724,18 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 
 	if (!probe)
 		cfq_mark_cfqq_coop(cfqq);
-	return cfqq;
+	return cfqq->ioq;
 }
 
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_io_context *cic;
 	unsigned long sl;
 
-	/*
-	 * SSD device without seek penalty, disable idling. But only do so
-	 * for devices that support queuing, otherwise we still have a problem
-	 * with sync vs async workloads.
-	 */
-	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
-		return;
-
 	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
-	WARN_ON(cfq_cfqq_slice_new(cfqq));
-
-	/*
-	 * idle is disabled, either manually or by past process history
-	 */
-	if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
-		return;
-
-	/*
-	 * still requests with the driver, don't idle
-	 */
-	if (cfqd->rq_in_driver)
-		return;
-
+	WARN_ON(elv_ioq_slice_new(cfqq->ioq));
 	/*
 	 * task has exited, don't wait
 	 */
@@ -1086,18 +743,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
 		return;
 
-	cfq_mark_cfqq_wait_request(cfqq);
 
+	elv_mark_ioq_wait_request(cfqq->ioq);
 	/*
 	 * we don't want to idle for seeks, but we do want to allow
 	 * fair distribution of slice time for a process doing back-to-back
 	 * seeks. so allow a little bit of time for him to submit a new rq
 	 */
-	sl = cfqd->cfq_slice_idle;
+	sl = elv_get_slice_idle(q->elevator);
 	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
 
-	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+	elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
 }
 
@@ -1106,13 +763,12 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
  */
 static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%d", rq->nr_sectors);
 
 	cfq_remove_request(rq);
-	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
 
 	if (cfq_cfqq_sync(cfqq))
@@ -1150,78 +806,11 @@ static inline int
 cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	const int base_rq = cfqd->cfq_slice_async_rq;
+	unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
 
-	WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
-
-	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq, *new_cfqq = NULL;
-
-	cfqq = cfqd->active_queue;
-	if (!cfqq)
-		goto new_queue;
-
-	/*
-	 * The active queue has run out of time, expire it and select new.
-	 */
-	if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
-		goto expire;
-
-	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
-	 * The active queue has requests and isn't expired, allow it to
-	 * dispatch.
-	 */
-	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-		goto keep_queue;
-
-	/*
-	 * If another queue has a request waiting within our mean seek
-	 * distance, let it run.  The expire code will check for close
-	 * cooperators and put the close queue at the front of the service
-	 * tree.
-	 */
-	new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
-	if (new_cfqq)
-		goto expire;
+	WARN_ON(ioprio >= IOPRIO_BE_NR);
 
-	/*
-	 * No requests pending. If the active queue still has requests in
-	 * flight or is idling for a new request, allow either of these
-	 * conditions to happen (or time out) before selecting a new queue.
-	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
-	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
-		cfqq = NULL;
-		goto keep_queue;
-	}
-
-expire:
-	cfq_slice_expired(cfqd, 0);
-new_queue:
-	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
-	return cfqq;
+	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
 }
 
 static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1246,12 +835,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	cfq_slice_expired(cfqd, 0);
+	/* This is probably redundant now. The above loop should make sure
+	 * that all the busy queues have expired. */
+	cfq_slice_expired(cfqd);
 
-	BUG_ON(cfqd->busy_queues);
+	BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));
 
 	cfq_log(cfqd, "forced_dispatch=%d\n", dispatched);
 	return dispatched;
@@ -1297,13 +888,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
 
-	if (!cfqd->busy_queues)
-		return 0;
-
 	if (unlikely(force))
 		return cfq_forced_dispatch(cfqd);
 
-	cfqq = cfq_select_queue(cfqd);
+	cfqq = elv_select_sched_queue(q, 0);
 	if (!cfqq)
 		return 0;
 
@@ -1320,7 +908,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
-	if (cfqq->dispatched >= max_dispatch) {
+	if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
 		/*
 		 * idle queue must always only have a single IO in flight
 		 */
@@ -1330,13 +918,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (elv_nr_busy_ioq(q->elevator) > 1)
 			return 0;
 
 		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
-		if (cfqq->dispatched >= 4 * max_dispatch)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
 			return 0;
 	}
 
@@ -1345,51 +933,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	 */
 	cfq_dispatch_request(cfqd, cfqq);
 	cfqq->slice_dispatch++;
-	cfq_clear_cfqq_must_dispatch(cfqq);
 
 	/*
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+	if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
-		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0);
+		cfq_slice_expired(cfqd);
 	}
 
 	cfq_log(cfqd, "dispatched a request");
 	return 1;
 }
 
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
 {
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
 
-	BUG_ON(atomic_read(&cfqq->ref) <= 0);
+	BUG_ON(!cfqq);
 
-	if (!atomic_dec_and_test(&cfqq->ref))
-		return;
-
-	cfq_log_cfqq(cfqd, cfqq, "put_queue");
+	cfq_log_cfqq(cfqd, cfqq, "free_queue");
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
-	if (unlikely(cfqd->active_queue == cfqq)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq_is_active_queue(cfqq))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+	elv_put_ioq(cfqq->ioq);
+}
+
 /*
  * Must always be called with the rcu_read_lock() held
  */
@@ -1477,9 +1059,9 @@ static void cfq_free_io_context(struct io_context *ioc)
 
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	if (unlikely(cfqq == cfqd->active_queue)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	cfq_put_queue(cfqq);
@@ -1549,9 +1131,10 @@ static struct cfq_io_context *
 cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
 
 	cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
-							cfqd->queue->node);
+							q->node);
 	if (cic) {
 		cic->last_end_request = jiffies;
 		INIT_LIST_HEAD(&cic->queue_list);
@@ -1567,7 +1150,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
-	int ioprio_class;
+	int ioprio_class, ioprio;
 
 	if (!cfq_cfqq_prio_changed(cfqq))
 		return;
@@ -1580,30 +1163,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 		/*
 		 * no prio set, inherit CPU scheduling settings
 		 */
-		cfqq->ioprio = task_nice_ioprio(tsk);
-		cfqq->ioprio_class = task_nice_ioclass(tsk);
+		ioprio = task_nice_ioprio(tsk);
+		ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_RT;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
-		cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
-		cfqq->ioprio = 7;
-		cfq_clear_cfqq_idle_window(cfqq);
+		ioprio_class = IOPRIO_CLASS_IDLE;
+		ioprio = 7;
+		elv_clear_ioq_idle_window(cfqq->ioq);
 		break;
 	}
 
+	elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+	elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
 	/*
 	 * keep track of original prio settings in case we have to temporarily
 	 * elevate the priority of this queue
 	 */
-	cfqq->org_ioprio = cfqq->ioprio;
-	cfqq->org_ioprio_class = cfqq->ioprio_class;
+	cfqq->org_ioprio = ioprio;
+	cfqq->org_ioprio_class = ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -1612,11 +1198,12 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	struct cfq_data *cfqd = cic->key;
 	struct cfq_queue *cfqq;
 	unsigned long flags;
+	struct request_queue *q = cfqd->queue;
 
 	if (unlikely(!cfqd))
 		return;
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
 	if (cfqq) {
@@ -1633,7 +1220,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	if (cfqq)
 		cfq_mark_cfqq_prio_changed(cfqq);
 
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
 static void cfq_ioc_set_ioprio(struct io_context *ioc)
@@ -1644,11 +1231,12 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
-		     struct io_context *ioc, gfp_t gfp_mask)
+				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
-
+	struct request_queue *q = cfqd->queue;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
 retry:
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
@@ -1656,8 +1244,7 @@ retry:
 
 	if (!cfqq) {
 		if (new_cfqq) {
-			cfqq = new_cfqq;
-			new_cfqq = NULL;
+			goto alloc_ioq;
 		} else if (gfp_mask & __GFP_WAIT) {
 			/*
 			 * Inform the allocator of the fact that we will
@@ -1678,22 +1265,52 @@ retry:
 			if (!cfqq)
 				goto out;
 		}
+alloc_ioq:
+		if (new_ioq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			cfqq = new_cfqq;
+			new_cfqq = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q,
+					gfp_mask | __GFP_NOFAIL | __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq) {
+				kmem_cache_free(cfq_pool, cfqq);
+				cfqq = NULL;
+				goto out;
+			}
+		}
 
-		RB_CLEAR_NODE(&cfqq->rb_node);
+		/*
+		 * Both the cfqq and ioq objects have been allocated. Do the
+		 * initializations now.
+		 */
 		RB_CLEAR_NODE(&cfqq->p_node);
 		INIT_LIST_HEAD(&cfqq->fifo);
-
-		atomic_set(&cfqq->ref, 0);
 		cfqq->cfqd = cfqd;
 
 		cfq_mark_cfqq_prio_changed(cfqq);
 
+		cfqq->ioq = ioq;
 		cfq_init_prio_data(cfqq, ioc);
+		elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
+				cfqq->org_ioprio, is_sync);
 
 		if (is_sync) {
 			if (!cfq_class_idle(cfqq))
-				cfq_mark_cfqq_idle_window(cfqq);
-			cfq_mark_cfqq_sync(cfqq);
+				elv_mark_ioq_idle_window(cfqq->ioq);
+			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
@@ -1702,38 +1319,28 @@ retry:
 	if (new_cfqq)
 		kmem_cache_free(cfq_pool, new_cfqq);
 
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
 out:
 	WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
 	return cfqq;
 }
 
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
-	switch (ioprio_class) {
-	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
-	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
-	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
-	default:
-		BUG();
-	}
-}
-
 static struct cfq_queue *
 cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-	      gfp_t gfp_mask)
+					gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
-	struct cfq_queue **async_cfqq = NULL;
+	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct io_group *iog = io_lookup_io_group_current(cfqd->queue);
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
-		cfqq = *async_cfqq;
+		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
+								ioprio);
+		cfqq = async_cfqq;
 	}
 
 	if (!cfqq) {
@@ -1742,15 +1349,11 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 			return NULL;
 	}
 
-	/*
-	 * pin the queue now that it's allocated, scheduler exit will prune it
-	 */
-	if (!is_sync && !(*async_cfqq)) {
-		atomic_inc(&cfqq->ref);
-		*async_cfqq = cfqq;
-	}
+	if (!is_sync && !async_cfqq)
+		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	atomic_inc(&cfqq->ref);
+	/* ioc reference */
+	elv_get_ioq(cfqq->ioq);
 	return cfqq;
 }
 
@@ -1829,6 +1432,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 {
 	unsigned long flags;
 	int ret;
+	struct request_queue *q = cfqd->queue;
 
 	ret = radix_tree_preload(gfp_mask);
 	if (!ret) {
@@ -1845,9 +1449,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 		radix_tree_preload_end();
 
 		if (!ret) {
-			spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+			spin_lock_irqsave(q->queue_lock, flags);
 			list_add(&cic->queue_list, &cfqd->cic_list);
-			spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+			spin_unlock_irqrestore(q->queue_lock, flags);
 		}
 	}
 
@@ -1867,10 +1471,11 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct io_context *ioc = NULL;
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
-	ioc = get_io_context(gfp_mask, cfqd->queue->node);
+	ioc = get_io_context(gfp_mask, q->node);
 	if (!ioc)
 		return NULL;
 
@@ -1889,7 +1494,6 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
-
 	return cic;
 err_free:
 	cfq_cic_free(cic);
@@ -1899,17 +1503,6 @@ err:
 }
 
 static void
-cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
-{
-	unsigned long elapsed = jiffies - cic->last_end_request;
-	unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
-
-	cic->ttime_samples = (7*cic->ttime_samples + 256) / 8;
-	cic->ttime_total = (7*cic->ttime_total + 256*ttime) / 8;
-	cic->ttime_mean = (cic->ttime_total + 128) / cic->ttime_samples;
-}
-
-static void
 cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 		       struct request *rq)
 {
@@ -1940,65 +1533,40 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 }
 
 /*
- * Disable idle window if the process thinks too long or seeks so much that
- * it doesn't matter
+ * Disable idle window if the process seeks so much that it doesn't matter
  */
-static void
-cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		       struct cfq_io_context *cic)
+static int
+cfq_update_idle_window(struct elevator_queue *eq, void *cfqq,
+					struct request *rq)
 {
-	int old_idle, enable_idle;
+	struct cfq_io_context *cic = RQ_CIC(rq);
 
 	/*
-	 * Don't idle for async or idle io prio class
+	 * Enabling/disabling idling based on thinktime has been moved
+	 * to the common layer.
 	 */
-	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
-		return;
-
-	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
-
-	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
-		enable_idle = 0;
-	else if (sample_valid(cic->ttime_samples)) {
-		if (cic->ttime_mean > cfqd->cfq_slice_idle)
-			enable_idle = 0;
-		else
-			enable_idle = 1;
-	}
+	if (!atomic_read(&cic->ioc->nr_tasks) ||
+	    (elv_hw_tag(eq) && CIC_SEEKY(cic)))
+		return 0;
 
-	if (old_idle != enable_idle) {
-		cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
-		if (enable_idle)
-			cfq_mark_cfqq_idle_window(cfqq);
-		else
-			cfq_clear_cfqq_idle_window(cfqq);
-	}
+	return 1;
 }
 
 /*
  * Check if new_cfqq should preempt the currently active queue. Return 0 for
- * no or if we aren't sure, a 1 will cause a preempt.
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ * Some of the preemption logic has been moved to the common layer. Only
+ * the cfq-specific parts are left here.
  */
 static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
-		   struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
 
-	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		return 0;
 
-	if (cfq_slice_used(cfqq))
-		return 1;
-
-	if (cfq_class_idle(new_cfqq))
-		return 0;
-
-	if (cfq_class_idle(cfqq))
-		return 1;
-
 	/*
 	 * if the new request is sync, but the currently running queue is
 	 * not, let the sync request have priority.
@@ -2013,13 +1581,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_meta(rq) && !cfqq->meta_pending)
 		return 1;
 
-	/*
-	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
-	 */
-	if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
-		return 1;
-
-	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+	if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
 		return 0;
 
 	/*
@@ -2033,29 +1595,10 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 }
 
 /*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
-
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
-	cfq_service_tree_add(cfqd, cfqq, 1);
-
-	cfqq->slice_end = 0;
-	cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
  * Called when a new fs request (rq) is added (to cfqq). Check if there's
  * something we should do about it
+ * After enqueuing the request, whether the queue should be preempted or
+ * kicked is decided by the common layer.
  */
 static void
 cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -2063,45 +1606,12 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 {
 	struct cfq_io_context *cic = RQ_CIC(rq);
 
-	cfqd->rq_queued++;
 	if (rq_is_meta(rq))
 		cfqq->meta_pending++;
 
-	cfq_update_io_thinktime(cfqd, cic);
 	cfq_update_io_seektime(cfqd, cic, rq);
-	cfq_update_idle_window(cfqd, cfqq, cic);
 
 	cic->last_request_pos = rq->sector + rq->nr_sectors;
-
-	if (cfqq == cfqd->active_queue) {
-		/*
-		 * Remember that we saw a request from this process, but
-		 * don't start queuing just yet. Otherwise we risk seeing lots
-		 * of tiny requests, because we disrupt the normal plugging
-		 * and merging. If the request is already larger than a single
-		 * page, let it rip immediately. For that case we assume that
-		 * merging is already done. Ditto for a busy system that
-		 * has other work pending, don't risk delaying until the
-		 * idle timer unplug to continue working.
-		 */
-		if (cfq_cfqq_wait_request(cfqq)) {
-			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
-			    cfqd->busy_queues > 1) {
-				del_timer(&cfqd->idle_slice_timer);
-				blk_start_queueing(cfqd->queue);
-			}
-			cfq_mark_cfqq_must_dispatch(cfqq);
-		}
-	} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
-		/*
-		 * not the active queue - expire current slice if it is
-		 * idle and has expired it's mean thinktime or this new queue
-		 * has some old slice time left and is of higher priority or
-		 * this new queue is RT and the current one is BE
-		 */
-		cfq_preempt_queue(cfqd, cfqq);
-		blk_start_queueing(cfqd->queue);
-	}
 }
 
 static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2119,31 +1629,6 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	cfq_rq_enqueued(cfqd, cfqq, rq);
 }
 
-/*
- * Update hw_tag based on peak queue depth over 50 samples under
- * sufficient load.
- */
-static void cfq_update_hw_tag(struct cfq_data *cfqd)
-{
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
-
-	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
-		return;
-
-	if (cfqd->hw_tag_samples++ < 50)
-		return;
-
-	if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
-		cfqd->hw_tag = 1;
-	else
-		cfqd->hw_tag = 0;
-
-	cfqd->hw_tag_samples = 0;
-	cfqd->rq_in_driver_peak = 0;
-}
-
 static void cfq_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -2154,13 +1639,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 	now = jiffies;
 	cfq_log_cfqq(cfqd, cfqq, "complete");
 
-	cfq_update_hw_tag(cfqd);
-
-	WARN_ON(!cfqd->rq_in_driver);
-	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
-	cfqq->dispatched--;
-
 	if (cfq_cfqq_sync(cfqq))
 		cfqd->sync_flight--;
 
@@ -2169,34 +1647,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	if (sync)
 		RQ_CIC(rq)->last_end_request = now;
-
-	/*
-	 * If this is the active queue, check if it needs to be expired,
-	 * or if we want to idle in case it has no pending requests.
-	 */
-	if (cfqd->active_queue == cfqq) {
-		const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
-		if (cfq_cfqq_slice_new(cfqq)) {
-			cfq_set_prio_slice(cfqd, cfqq);
-			cfq_clear_cfqq_slice_new(cfqq);
-		}
-		/*
-		 * If there are no requests waiting in this queue, and
-		 * there are other queues ready to issue requests, AND
-		 * those other queues are issuing requests within our
-		 * mean seek distance, give them a chance to run instead
-		 * of idling.
-		 */
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
-			cfq_slice_expired(cfqd, 1);
-		else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
-			 sync && !rq_noidle(rq))
-			cfq_arm_slice_timer(cfqd);
-	}
-
-	if (!cfqd->rq_in_driver)
-		cfq_schedule_dispatch(cfqd);
 }
 
 /*
@@ -2205,30 +1655,33 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_prio_boost(struct cfq_queue *cfqq)
 {
+	struct io_queue *ioq = cfqq->ioq;
+
 	if (has_fs_excl()) {
 		/*
 		 * boost idle prio on transactions that would lock out other
 		 * users of the filesystem
 		 */
 		if (cfq_class_idle(cfqq))
-			cfqq->ioprio_class = IOPRIO_CLASS_BE;
-		if (cfqq->ioprio > IOPRIO_NORM)
-			cfqq->ioprio = IOPRIO_NORM;
+			elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+		if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+			elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
 	} else {
 		/*
 		 * check if we need to unboost the queue
 		 */
-		if (cfqq->ioprio_class != cfqq->org_ioprio_class)
-			cfqq->ioprio_class = cfqq->org_ioprio_class;
-		if (cfqq->ioprio != cfqq->org_ioprio)
-			cfqq->ioprio = cfqq->org_ioprio;
+		if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+			elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+		if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+			elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
 	}
 }
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
-	    !cfq_cfqq_must_alloc_slice(cfqq)) {
+	if ((elv_ioq_wait_request(cfqq->ioq) ||
+	   cfq_cfqq_must_alloc(cfqq)) && !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
 	}
@@ -2320,119 +1773,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq->allocated[rw]++;
 	cfq_clear_cfqq_must_alloc(cfqq);
-	atomic_inc(&cfqq->ref);
+	elv_get_ioq(cfqq->ioq);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
+	rq->ioq = cfqq->ioq;
 	return 0;
 
 queue_fail:
 	if (cic)
 		put_io_context(cic->ioc);
 
-	cfq_schedule_dispatch(cfqd);
+	elv_schedule_dispatch(cfqd->queue);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	cfq_log(cfqd, "set_request fail");
 	return 1;
 }
 
-static void cfq_kick_queue(struct work_struct *work)
-{
-	struct cfq_data *cfqd =
-		container_of(work, struct cfq_data, unplug_work);
-	struct request_queue *q = cfqd->queue;
-
-	spin_lock_irq(q->queue_lock);
-	blk_start_queueing(q);
-	spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
-	struct cfq_data *cfqd = (struct cfq_data *) data;
-	struct cfq_queue *cfqq;
-	unsigned long flags;
-	int timed_out = 1;
-
-	cfq_log(cfqd, "idle timer fired");
-
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
-	cfqq = cfqd->active_queue;
-	if (cfqq) {
-		timed_out = 0;
-
-		/*
-		 * We saw a request before the queue expired, let it through
-		 */
-		if (cfq_cfqq_must_dispatch(cfqq))
-			goto out_kick;
-
-		/*
-		 * expired
-		 */
-		if (cfq_slice_used(cfqq))
-			goto expire;
-
-		/*
-		 * only expire and reinvoke request handler, if there are
-		 * other queues with pending requests
-		 */
-		if (!cfqd->busy_queues)
-			goto out_cont;
-
-		/*
-		 * not expired and it has a request pending, let it dispatch
-		 */
-		if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-			goto out_kick;
-	}
-expire:
-	cfq_slice_expired(cfqd, timed_out);
-out_kick:
-	cfq_schedule_dispatch(cfqd);
-out_cont:
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
-	del_timer_sync(&cfqd->idle_slice_timer);
-	cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
-	int i;
-
-	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
-	}
-
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
 	struct request_queue *q = cfqd->queue;
 
-	cfq_shutdown_timer_wq(cfqd);
-
 	spin_lock_irq(q->queue_lock);
 
-	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
 							struct cfq_io_context,
@@ -2441,12 +1806,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		__cfq_exit_single_io_context(cfqd, cic);
 	}
 
-	cfq_put_async_queues(cfqd);
-
 	spin_unlock_irq(q->queue_lock);
-
-	cfq_shutdown_timer_wq(cfqd);
-
 	kfree(cfqd);
 }
 
@@ -2459,8 +1819,6 @@ static void *cfq_init_queue(struct request_queue *q)
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
-
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
 	 * zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2473,23 +1831,13 @@ static void *cfq_init_queue(struct request_queue *q)
 
 	cfqd->queue = q;
 
-	init_timer(&cfqd->idle_slice_timer);
-	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
-	cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
-	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
 	cfqd->last_end_request = jiffies;
 	cfqd->cfq_quantum = cfq_quantum;
 	cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
 	cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
-	cfqd->cfq_slice[0] = cfq_slice_async;
-	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
-	cfqd->cfq_slice_idle = cfq_slice_idle;
-	cfqd->hw_tag = 1;
 
 	return cfqd;
 }
@@ -2554,9 +1902,6 @@ SHOW_FUNCTION(cfq_fifo_expire_sync_show, cfqd->cfq_fifo_expire[1], 1);
 SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
-SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 #undef SHOW_FUNCTION
 
@@ -2584,9 +1929,6 @@ STORE_FUNCTION(cfq_fifo_expire_async_store, &cfqd->cfq_fifo_expire[0], 1,
 STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
 		UINT_MAX, 0);
-STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 #undef STORE_FUNCTION
@@ -2600,10 +1942,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(fifo_expire_async),
 	CFQ_ATTR(back_seek_max),
 	CFQ_ATTR(back_seek_penalty),
-	CFQ_ATTR(slice_sync),
-	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
-	CFQ_ATTR(slice_idle),
 	__ATTR_NULL
 };
 
@@ -2616,8 +1955,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_dispatch_fn =		cfq_dispatch_requests,
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
-		.elevator_deactivate_req_fn =	cfq_deactivate_request,
-		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -2627,7 +1964,15 @@ static struct elevator_type iosched_cfq = {
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
 		.trim =				cfq_free_io_context,
+		.elevator_free_sched_queue_fn =	cfq_free_cfq_queue,
+		.elevator_active_ioq_set_fn = 	cfq_active_ioq_set,
+		.elevator_active_ioq_reset_fn =	cfq_active_ioq_reset,
+		.elevator_arm_slice_timer_fn = 	cfq_arm_slice_timer,
+		.elevator_should_preempt_fn = 	cfq_should_preempt,
+		.elevator_update_idle_window_fn = cfq_update_idle_window,
+		.elevator_close_cooperator_fn = cfq_close_cooperator,
 	},
+	.elevator_features =    ELV_IOSCHED_NEED_FQ,
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
 	.elevator_owner =	THIS_MODULE,
@@ -2635,14 +1980,6 @@ static struct elevator_type iosched_cfq = {
 
 static int __init cfq_init(void)
 {
-	/*
-	 * could be 0 on HZ < 1000 setups
-	 */
-	if (!cfq_slice_async)
-		cfq_slice_async = 1;
-	if (!cfq_slice_idle)
-		cfq_slice_idle = 1;
-
 	if (cfq_slab_setup())
 		return -ENOMEM;
 
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (4 preceding siblings ...)
  2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
       [not found]   ` <1241553525-28095-5-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-22  8:54   ` Gui Jianfeng
  2009-05-05 19:58 ` Vivek Goyal
                   ` (31 subsequent siblings)
  37 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda
  Cc: vgoyal, akpm

This patch changes cfq to use the fair queuing code from the elevator layer.
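
To make the division of labour concrete, here is a rough sketch, illustrative
only and not part of the patch, of what the dispatch path of a scheduler built
on top of the common fair queuing layer looks like. The elv_* helpers are the
ones introduced by this series; the foo_* names are hypothetical.

/*
 * Illustrative sketch: a scheduler using ELV_FAIR_QUEUING no longer
 * maintains its own service tree, idle timer or active queue. It asks
 * the common layer which of its queues to service next and only picks
 * the request to send from that queue. foo_* is a made-up scheduler.
 */
static int foo_dispatch_requests(struct request_queue *q, int force)
{
	struct foo_queue *fooq;

	if (unlikely(force)) {
		int dispatched = 0;

		/* Drain every queue the common layer still tracks as busy */
		while ((fooq = elv_select_sched_queue(q, 1)) != NULL)
			dispatched += foo_drain_queue(q, fooq);
		return dispatched;
	}

	/* The fairness decision is taken by the common layer */
	fooq = elv_select_sched_queue(q, 0);
	if (!fooq)
		return 0;

	/* Scheduler-specific part: choose a request from fooq and send it */
	foo_dispatch_one(q, fooq);
	return 1;
}

The actual cfq conversion below follows this shape in cfq_dispatch_requests()
and cfq_forced_dispatch(), and registers the remaining policy hooks (idling,
preemption, active queue set/reset) through the new elevator_*_fn fields.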

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |    3 +-
 block/cfq-iosched.c   | 1097 ++++++++++---------------------------------------
 2 files changed, 219 insertions(+), 881 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
 menu "IO Schedulers"
 
 config ELV_FAIR_QUEUING
-	bool "Elevator Fair Queuing Support"
+	bool
 	default n
 	---help---
 	  Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
+	select ELV_FAIR_QUEUING
 	default y
 	---help---
 	  The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a55a9bd..f90c534 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,7 +12,6 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
-
 /*
  * tunables
  */
@@ -23,15 +22,7 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 static const int cfq_back_max = 16 * 1024;
 /* penalty of a backwards seek */
 static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
-static int cfq_slice_idle = HZ / 125;
-
-/*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY		(HZ / 5)
 
 /*
  * below this threshold, we consider thinktime immediate
@@ -43,7 +34,7 @@ static int cfq_slice_idle = HZ / 125;
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq)	(struct cfq_queue *) (ioq_sched_queue((rq)->ioq))
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +44,6 @@ static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
-#define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
 
 #define sample_valid(samples)	((samples) > 80)
 
@@ -75,12 +64,6 @@ struct cfq_rb_root {
  */
 struct cfq_data {
 	struct request_queue *queue;
-
-	/*
-	 * rr list of queues with requests and the count of them
-	 */
-	struct cfq_rb_root service_tree;
-
 	/*
 	 * Each priority tree is sorted by next_request position.  These
 	 * trees are used when determining if two or more queues are
@@ -88,39 +71,10 @@ struct cfq_data {
 	 */
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
-	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
-
-	int rq_in_driver;
 	int sync_flight;
 
-	/*
-	 * queue-depth detection
-	 */
-	int rq_queued;
-	int hw_tag;
-	int hw_tag_samples;
-	int rq_in_driver_peak;
-
-	/*
-	 * idle window management
-	 */
-	struct timer_list idle_slice_timer;
-	struct work_struct unplug_work;
-
-	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 	unsigned long last_end_request;
 
@@ -131,9 +85,7 @@ struct cfq_data {
 	unsigned int cfq_fifo_expire[2];
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
-	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
-	unsigned int cfq_slice_idle;
 
 	struct list_head cic_list;
 };
@@ -142,16 +94,11 @@ struct cfq_data {
  * Per process-grouping structure
  */
 struct cfq_queue {
-	/* reference count */
-	atomic_t ref;
+	struct io_queue *ioq;
 	/* various state flags, see below */
 	unsigned int flags;
 	/* parent cfq_data */
 	struct cfq_data *cfqd;
-	/* service_tree member */
-	struct rb_node rb_node;
-	/* service_tree key */
-	unsigned long rb_key;
 	/* prio tree member */
 	struct rb_node p_node;
 	/* prio tree root we belong to, if any */
@@ -167,33 +114,23 @@ struct cfq_queue {
 	/* fifo list of requests in sort_list */
 	struct list_head fifo;
 
-	unsigned long slice_end;
-	long slice_resid;
 	unsigned int slice_dispatch;
 
 	/* pending metadata requests */
 	int meta_pending;
-	/* number of requests that are on the dispatch list or inside driver */
-	int dispatched;
 
 	/* io prio of this group */
-	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class, org_ioprio_class;
+	unsigned short org_ioprio;
+	unsigned short org_ioprio_class;
 
 	pid_t pid;
 };
 
 enum cfqq_state_flags {
-	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
-	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
-	CFQ_CFQQ_FLAG_must_dispatch,	/* must be allowed a dispatch */
 	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
 	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
 	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
-	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
 	CFQ_CFQQ_FLAG_prio_changed,	/* task priority has changed */
-	CFQ_CFQQ_FLAG_slice_new,	/* no requests dispatched in slice */
-	CFQ_CFQQ_FLAG_sync,		/* synchronous queue */
 	CFQ_CFQQ_FLAG_coop,		/* has done a coop jump of the queue */
 };
 
@@ -211,16 +148,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 	return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0;	\
 }
 
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
 CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
 CFQ_CFQQ_FNS(coop);
 #undef CFQ_CFQQ_FNS
 
@@ -259,66 +190,32 @@ static inline int cfq_bio_sync(struct bio *bio)
 	return 0;
 }
 
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
 {
-	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
-	}
+	return ioq_to_io_group(cfqq->ioq);
 }
 
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	return !cfqd->busy_queues;
+	return elv_ioq_class_idle(cfqq->ioq);
 }
 
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
-				 unsigned short prio)
-{
-	const int base_slice = cfqd->cfq_slice[sync];
-
-	WARN_ON(prio >= IOPRIO_BE_NR);
-
-	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
-}
-
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_class_rt(struct cfq_queue *cfqq)
 {
-	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
+	return elv_ioq_class_rt(cfqq->ioq);
 }
 
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
-	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
+	return elv_ioq_sync(cfqq->ioq);
 }
 
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
 {
-	if (cfq_cfqq_slice_new(cfqq))
-		return 0;
-	if (time_before(jiffies, cfqq->slice_end))
-		return 0;
+	struct cfq_data *cfqd = cfqq->cfqd;
+	struct elevator_queue *e = cfqd->queue->elevator;
 
-	return 1;
+	return (elv_active_sched_queue(e) == cfqq);
 }
 
 /*
@@ -417,33 +314,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 }
 
 /*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
-	if (!root->left)
-		root->left = rb_first(&root->rb);
-
-	if (root->left)
-		return rb_entry(root->left, struct cfq_queue, rb_node);
-
-	return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
-	rb_erase(n, root);
-	RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
-	if (root->left == n)
-		root->left = NULL;
-	rb_erase_init(n, &root->rb);
-}
-
-/*
  * would be nice to take fifo expire time into account as well
  */
 static struct request *
@@ -456,10 +326,10 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	BUG_ON(RB_EMPTY_NODE(&last->rb_node));
 
-	if (rbprev)
+	if (rbprev != NULL)
 		prev = rb_entry_rq(rbprev);
 
-	if (rbnext)
+	if (rbnext != NULL)
 		next = rb_entry_rq(rbnext);
 	else {
 		rbnext = rb_first(&cfqq->sort_list);
@@ -470,95 +340,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	return cfq_choose_req(cfqd, next, prev);
 }
 
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
-				      struct cfq_queue *cfqq)
-{
-	/*
-	 * just an approximation, should be ok.
-	 */
-	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
-{
-	struct rb_node **p, *parent;
-	struct cfq_queue *__cfqq;
-	unsigned long rb_key;
-	int left;
-
-	if (cfq_class_idle(cfqq)) {
-		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
-		if (parent && parent != &cfqq->rb_node) {
-			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-			rb_key += __cfqq->rb_key;
-		} else
-			rb_key += jiffies;
-	} else if (!add_front) {
-		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
-		rb_key += cfqq->slice_resid;
-		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
-		/*
-		 * same position, nothing more to do
-		 */
-		if (rb_key == cfqq->rb_key)
-			return;
-
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	}
-
-	left = 1;
-	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
-	while (*p) {
-		struct rb_node **n;
-
-		parent = *p;
-		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
-		/*
-		 * sort RT queues first, we always want to give
-		 * preference to them. IDLE queues goes to the back.
-		 * after that, sort on the next service time.
-		 */
-		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			n = &(*p)->rb_right;
-		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
-			n = &(*p)->rb_right;
-		else if (rb_key < __cfqq->rb_key)
-			n = &(*p)->rb_left;
-		else
-			n = &(*p)->rb_right;
-
-		if (n == &(*p)->rb_right)
-			left = 0;
-
-		p = n;
-	}
-
-	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
-
-	cfqq->rb_key = rb_key;
-	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
 static struct cfq_queue *
 cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
 		     sector_t sector, struct rb_node **ret_parent,
@@ -620,57 +401,34 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 		cfqq->p_root = NULL;
 }
 
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
 {
-	/*
-	 * Resorting requires the cfqq to be on the RR list already.
-	 */
-	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
-		cfq_prio_tree_add(cfqd, cfqq);
-	}
-}
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
-	cfq_mark_cfqq_on_rr(cfqq);
-	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
+	if (cfqd->active_cic) {
+		put_io_context(cfqd->active_cic->ioc);
+		cfqd->active_cic = NULL;
+	}
 
-	cfq_resort_rr_list(cfqd, cfqq);
+	/* Resort the cfqq in prio tree */
+	if (cfqq)
+		cfq_prio_tree_add(cfqd, cfqq);
 }
 
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
 {
-	cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-	cfq_clear_cfqq_on_rr(cfqq);
+	struct cfq_queue *cfqq = sched_queue;
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	if (cfqq->p_root) {
-		rb_erase(&cfqq->p_node, cfqq->p_root);
-		cfqq->p_root = NULL;
-	}
+	cfqq->slice_dispatch = 0;
 
-	BUG_ON(!cfqd->busy_queues);
-	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
+	cfq_clear_cfqq_must_alloc_slice(cfqq);
+	cfq_clear_cfqq_fifo_expire(cfqq);
+	if (!coop)
+		cfq_clear_cfqq_coop(cfqq);
 }
 
 /*
@@ -679,7 +437,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_del_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
-	struct cfq_data *cfqd = cfqq->cfqd;
 	const int sync = rq_is_sync(rq);
 
 	BUG_ON(!cfqq->queued[sync]);
@@ -687,8 +444,17 @@ static void cfq_del_rq_rb(struct request *rq)
 
 	elv_rb_del(&cfqq->sort_list, rq);
 
-	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
-		cfq_del_cfqq_rr(cfqd, cfqq);
+	/*
+	 * If this was the last request in the queue, remove the queue from
+	 * the prio trees. nr_queued will still be 1 for the last request, as
+	 * the elevator fair queuing layer has not done its accounting yet.
+	 */
+	if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+		if (cfqq->p_root) {
+			rb_erase(&cfqq->p_node, cfqq->p_root);
+			cfqq->p_root = NULL;
+		}
+	}
 }
 
 static void cfq_add_rq_rb(struct request *rq)
@@ -706,9 +472,6 @@ static void cfq_add_rq_rb(struct request *rq)
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
 		cfq_dispatch_insert(cfqd->queue, __alias);
 
-	if (!cfq_cfqq_on_rr(cfqq))
-		cfq_add_cfqq_rr(cfqd, cfqq);
-
 	/*
 	 * check if this request is a better next-serve candidate
 	 */
@@ -756,23 +519,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
-
 	cfqd->last_position = rq->hard_sector + rq->hard_nr_sectors;
 }
 
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
-}
-
 static void cfq_remove_request(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -783,7 +532,6 @@ static void cfq_remove_request(struct request *rq)
 	list_del_init(&rq->queuelist);
 	cfq_del_rq_rb(rq);
 
-	cfqq->cfqd->rq_queued--;
 	if (rq_is_meta(rq)) {
 		WARN_ON(!cfqq->meta_pending);
 		cfqq->meta_pending--;
@@ -857,93 +605,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	return 0;
 }
 
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
-				   struct cfq_queue *cfqq)
-{
-	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active");
-		cfqq->slice_end = 0;
-		cfqq->slice_dispatch = 0;
-
-		cfq_clear_cfqq_wait_request(cfqq);
-		cfq_clear_cfqq_must_dispatch(cfqq);
-		cfq_clear_cfqq_must_alloc_slice(cfqq);
-		cfq_clear_cfqq_fifo_expire(cfqq);
-		cfq_mark_cfqq_slice_new(cfqq);
-
-		del_timer(&cfqd->idle_slice_timer);
-	}
-
-	cfqd->active_queue = cfqq;
-}
-
 /*
  * current cfqq expired its slice (or was too idle), select new one
  */
 static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
-	if (cfq_cfqq_wait_request(cfqq))
-		del_timer(&cfqd->idle_slice_timer);
-
-	cfq_clear_cfqq_wait_request(cfqq);
-
-	/*
-	 * store what was left of this slice, if the queue idled/timed out
-	 */
-	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
-		cfqq->slice_resid = cfqq->slice_end - jiffies;
-		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
-	}
-
-	cfq_resort_rr_list(cfqd, cfqq);
-
-	if (cfqq == cfqd->active_queue)
-		cfqd->active_queue = NULL;
-
-	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
-		cfqd->active_cic = NULL;
-	}
+	__elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);
 
 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
-
-	return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
-					      struct cfq_queue *cfqq)
-{
-	if (!cfqq) {
-		cfqq = cfq_get_next_queue(cfqd);
-		if (cfqq)
-			cfq_clear_cfqq_coop(cfqq);
-	}
-
-	__cfq_set_active_queue(cfqd, cfqq);
-	return cfqq;
+		__cfq_slice_expired(cfqd, cfqq);
 }
 
 static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1020,11 +696,12 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
  * associated with the I/O issued by cur_cfqq.  I'm not sure this is a valid
  * assumption.
  */
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
-					      struct cfq_queue *cur_cfqq,
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+					      void *cur_sched_queue,
 					      int probe)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
 	/*
 	 * A valid cfq_io_context is necessary to compare requests against
@@ -1047,38 +724,18 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 
 	if (!probe)
 		cfq_mark_cfqq_coop(cfqq);
-	return cfqq;
+	return cfqq->ioq;
 }
 
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_io_context *cic;
 	unsigned long sl;
 
-	/*
-	 * SSD device without seek penalty, disable idling. But only do so
-	 * for devices that support queuing, otherwise we still have a problem
-	 * with sync vs async workloads.
-	 */
-	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
-		return;
-
 	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
-	WARN_ON(cfq_cfqq_slice_new(cfqq));
-
-	/*
-	 * idle is disabled, either manually or by past process history
-	 */
-	if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
-		return;
-
-	/*
-	 * still requests with the driver, don't idle
-	 */
-	if (cfqd->rq_in_driver)
-		return;
-
+	WARN_ON(elv_ioq_slice_new(cfqq->ioq));
 	/*
 	 * task has exited, don't wait
 	 */
@@ -1086,18 +743,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
 		return;
 
-	cfq_mark_cfqq_wait_request(cfqq);
 
+	elv_mark_ioq_wait_request(cfqq->ioq);
 	/*
 	 * we don't want to idle for seeks, but we do want to allow
 	 * fair distribution of slice time for a process doing back-to-back
 	 * seeks. so allow a little bit of time for him to submit a new rq
 	 */
-	sl = cfqd->cfq_slice_idle;
+	sl = elv_get_slice_idle(q->elevator);
 	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
 
-	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+	elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
 }
 
@@ -1106,13 +763,12 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
  */
 static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%d", rq->nr_sectors);
 
 	cfq_remove_request(rq);
-	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
 
 	if (cfq_cfqq_sync(cfqq))
@@ -1150,78 +806,11 @@ static inline int
 cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	const int base_rq = cfqd->cfq_slice_async_rq;
+	unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
 
-	WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
-
-	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq, *new_cfqq = NULL;
-
-	cfqq = cfqd->active_queue;
-	if (!cfqq)
-		goto new_queue;
-
-	/*
-	 * The active queue has run out of time, expire it and select new.
-	 */
-	if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
-		goto expire;
-
-	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
-	 * The active queue has requests and isn't expired, allow it to
-	 * dispatch.
-	 */
-	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-		goto keep_queue;
-
-	/*
-	 * If another queue has a request waiting within our mean seek
-	 * distance, let it run.  The expire code will check for close
-	 * cooperators and put the close queue at the front of the service
-	 * tree.
-	 */
-	new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
-	if (new_cfqq)
-		goto expire;
+	WARN_ON(ioprio >= IOPRIO_BE_NR);
 
-	/*
-	 * No requests pending. If the active queue still has requests in
-	 * flight or is idling for a new request, allow either of these
-	 * conditions to happen (or time out) before selecting a new queue.
-	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
-	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
-		cfqq = NULL;
-		goto keep_queue;
-	}
-
-expire:
-	cfq_slice_expired(cfqd, 0);
-new_queue:
-	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
-	return cfqq;
+	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
 }
 
 static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1246,12 +835,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	cfq_slice_expired(cfqd, 0);
+	/* This is probably redundant now. The above loop should make sure
+	 * that all the busy queues have expired */
+	cfq_slice_expired(cfqd);
 
-	BUG_ON(cfqd->busy_queues);
+	BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));
 
 	cfq_log(cfqd, "forced_dispatch=%d\n", dispatched);
 	return dispatched;
@@ -1297,13 +888,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
 
-	if (!cfqd->busy_queues)
-		return 0;
-
 	if (unlikely(force))
 		return cfq_forced_dispatch(cfqd);
 
-	cfqq = cfq_select_queue(cfqd);
+	cfqq = elv_select_sched_queue(q, 0);
 	if (!cfqq)
 		return 0;
 
@@ -1320,7 +908,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
-	if (cfqq->dispatched >= max_dispatch) {
+	if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
 		/*
 		 * idle queue must always only have a single IO in flight
 		 */
@@ -1330,13 +918,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (elv_nr_busy_ioq(q->elevator) > 1)
 			return 0;
 
 		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
-		if (cfqq->dispatched >= 4 * max_dispatch)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
 			return 0;
 	}
 
@@ -1345,51 +933,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	 */
 	cfq_dispatch_request(cfqd, cfqq);
 	cfqq->slice_dispatch++;
-	cfq_clear_cfqq_must_dispatch(cfqq);
 
 	/*
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+	if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
-		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0);
+		cfq_slice_expired(cfqd);
 	}
 
 	cfq_log(cfqd, "dispatched a request");
 	return 1;
 }
 
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
 {
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
 
-	BUG_ON(atomic_read(&cfqq->ref) <= 0);
+	BUG_ON(!cfqq);
 
-	if (!atomic_dec_and_test(&cfqq->ref))
-		return;
-
-	cfq_log_cfqq(cfqd, cfqq, "put_queue");
+	cfq_log_cfqq(cfqd, cfqq, "free_queue");
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
-	if (unlikely(cfqd->active_queue == cfqq)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq_is_active_queue(cfqq))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+	elv_put_ioq(cfqq->ioq);
+}
+
 /*
  * Must always be called with the rcu_read_lock() held
  */
@@ -1477,9 +1059,9 @@ static void cfq_free_io_context(struct io_context *ioc)
 
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	if (unlikely(cfqq == cfqd->active_queue)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	cfq_put_queue(cfqq);
@@ -1549,9 +1131,10 @@ static struct cfq_io_context *
 cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
 
 	cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
-							cfqd->queue->node);
+							q->node);
 	if (cic) {
 		cic->last_end_request = jiffies;
 		INIT_LIST_HEAD(&cic->queue_list);
@@ -1567,7 +1150,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
-	int ioprio_class;
+	int ioprio_class, ioprio;
 
 	if (!cfq_cfqq_prio_changed(cfqq))
 		return;
@@ -1580,30 +1163,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 		/*
 		 * no prio set, inherit CPU scheduling settings
 		 */
-		cfqq->ioprio = task_nice_ioprio(tsk);
-		cfqq->ioprio_class = task_nice_ioclass(tsk);
+		ioprio = task_nice_ioprio(tsk);
+		ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_RT;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
-		cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
-		cfqq->ioprio = 7;
-		cfq_clear_cfqq_idle_window(cfqq);
+		ioprio_class = IOPRIO_CLASS_IDLE;
+		ioprio = 7;
+		elv_clear_ioq_idle_window(cfqq->ioq);
 		break;
 	}
 
+	elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+	elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
 	/*
 	 * keep track of original prio settings in case we have to temporarily
 	 * elevate the priority of this queue
 	 */
-	cfqq->org_ioprio = cfqq->ioprio;
-	cfqq->org_ioprio_class = cfqq->ioprio_class;
+	cfqq->org_ioprio = ioprio;
+	cfqq->org_ioprio_class = ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -1612,11 +1198,12 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	struct cfq_data *cfqd = cic->key;
 	struct cfq_queue *cfqq;
 	unsigned long flags;
+	struct request_queue *q = cfqd->queue;
 
 	if (unlikely(!cfqd))
 		return;
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
 	if (cfqq) {
@@ -1633,7 +1220,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	if (cfqq)
 		cfq_mark_cfqq_prio_changed(cfqq);
 
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
 static void cfq_ioc_set_ioprio(struct io_context *ioc)
@@ -1644,11 +1231,12 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
-		     struct io_context *ioc, gfp_t gfp_mask)
+				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
-
+	struct request_queue *q = cfqd->queue;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
 retry:
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
@@ -1656,8 +1244,7 @@ retry:
 
 	if (!cfqq) {
 		if (new_cfqq) {
-			cfqq = new_cfqq;
-			new_cfqq = NULL;
+			goto alloc_ioq;
 		} else if (gfp_mask & __GFP_WAIT) {
 			/*
 			 * Inform the allocator of the fact that we will
@@ -1678,22 +1265,52 @@ retry:
 			if (!cfqq)
 				goto out;
 		}
+alloc_ioq:
+		if (new_ioq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			cfqq = new_cfqq;
+			new_cfqq = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q,
+					gfp_mask | __GFP_NOFAIL | __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq) {
+				kmem_cache_free(cfq_pool, cfqq);
+				cfqq = NULL;
+				goto out;
+			}
+		}
 
-		RB_CLEAR_NODE(&cfqq->rb_node);
+		/*
+		 * Both cfqq and ioq objects allocated. Do the initializations
+		 * now.
+		 */
 		RB_CLEAR_NODE(&cfqq->p_node);
 		INIT_LIST_HEAD(&cfqq->fifo);
-
-		atomic_set(&cfqq->ref, 0);
 		cfqq->cfqd = cfqd;
 
 		cfq_mark_cfqq_prio_changed(cfqq);
 
+		cfqq->ioq = ioq;
 		cfq_init_prio_data(cfqq, ioc);
+		elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
+				cfqq->org_ioprio, is_sync);
 
 		if (is_sync) {
 			if (!cfq_class_idle(cfqq))
-				cfq_mark_cfqq_idle_window(cfqq);
-			cfq_mark_cfqq_sync(cfqq);
+				elv_mark_ioq_idle_window(cfqq->ioq);
+			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
@@ -1702,38 +1319,28 @@ retry:
 	if (new_cfqq)
 		kmem_cache_free(cfq_pool, new_cfqq);
 
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
 out:
 	WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
 	return cfqq;
 }
 
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
-	switch (ioprio_class) {
-	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
-	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
-	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
-	default:
-		BUG();
-	}
-}
-
 static struct cfq_queue *
 cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-	      gfp_t gfp_mask)
+					gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
-	struct cfq_queue **async_cfqq = NULL;
+	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct io_group *iog = io_lookup_io_group_current(cfqd->queue);
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
-		cfqq = *async_cfqq;
+		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
+								ioprio);
+		cfqq = async_cfqq;
 	}
 
 	if (!cfqq) {
@@ -1742,15 +1349,11 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 			return NULL;
 	}
 
-	/*
-	 * pin the queue now that it's allocated, scheduler exit will prune it
-	 */
-	if (!is_sync && !(*async_cfqq)) {
-		atomic_inc(&cfqq->ref);
-		*async_cfqq = cfqq;
-	}
+	if (!is_sync && !async_cfqq)
+		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	atomic_inc(&cfqq->ref);
+	/* ioc reference */
+	elv_get_ioq(cfqq->ioq);
 	return cfqq;
 }
 
@@ -1829,6 +1432,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 {
 	unsigned long flags;
 	int ret;
+	struct request_queue *q = cfqd->queue;
 
 	ret = radix_tree_preload(gfp_mask);
 	if (!ret) {
@@ -1845,9 +1449,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 		radix_tree_preload_end();
 
 		if (!ret) {
-			spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+			spin_lock_irqsave(q->queue_lock, flags);
 			list_add(&cic->queue_list, &cfqd->cic_list);
-			spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+			spin_unlock_irqrestore(q->queue_lock, flags);
 		}
 	}
 
@@ -1867,10 +1471,11 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct io_context *ioc = NULL;
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
-	ioc = get_io_context(gfp_mask, cfqd->queue->node);
+	ioc = get_io_context(gfp_mask, q->node);
 	if (!ioc)
 		return NULL;
 
@@ -1889,7 +1494,6 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
-
 	return cic;
 err_free:
 	cfq_cic_free(cic);
@@ -1899,17 +1503,6 @@ err:
 }
 
 static void
-cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
-{
-	unsigned long elapsed = jiffies - cic->last_end_request;
-	unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
-
-	cic->ttime_samples = (7*cic->ttime_samples + 256) / 8;
-	cic->ttime_total = (7*cic->ttime_total + 256*ttime) / 8;
-	cic->ttime_mean = (cic->ttime_total + 128) / cic->ttime_samples;
-}
-
-static void
 cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 		       struct request *rq)
 {
@@ -1940,65 +1533,40 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 }
 
 /*
- * Disable idle window if the process thinks too long or seeks so much that
- * it doesn't matter
+ * Disable idle window if the process seeks so much that it doesn't matter
  */
-static void
-cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		       struct cfq_io_context *cic)
+static int
+cfq_update_idle_window(struct elevator_queue *eq, void *cfqq,
+					struct request *rq)
 {
-	int old_idle, enable_idle;
+	struct cfq_io_context *cic = RQ_CIC(rq);
 
 	/*
-	 * Don't idle for async or idle io prio class
+	 * Enabling/disabling idling based on thinktime has been moved
+	 * into the common layer.
 	 */
-	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
-		return;
-
-	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
-
-	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
-		enable_idle = 0;
-	else if (sample_valid(cic->ttime_samples)) {
-		if (cic->ttime_mean > cfqd->cfq_slice_idle)
-			enable_idle = 0;
-		else
-			enable_idle = 1;
-	}
+	if (!atomic_read(&cic->ioc->nr_tasks) ||
+	    (elv_hw_tag(eq) && CIC_SEEKY(cic)))
+		return 0;
 
-	if (old_idle != enable_idle) {
-		cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
-		if (enable_idle)
-			cfq_mark_cfqq_idle_window(cfqq);
-		else
-			cfq_clear_cfqq_idle_window(cfqq);
-	}
+	return 1;
 }
 
 /*
  * Check if new_cfqq should preempt the currently active queue. Return 0 for
- * no or if we aren't sure, a 1 will cause a preempt.
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ * Some of the preemption logic has been moved to common layer. Only cfq
+ * specific parts are left here.
  */
 static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
-		   struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
 
-	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		return 0;
 
-	if (cfq_slice_used(cfqq))
-		return 1;
-
-	if (cfq_class_idle(new_cfqq))
-		return 0;
-
-	if (cfq_class_idle(cfqq))
-		return 1;
-
 	/*
 	 * if the new request is sync, but the currently running queue is
 	 * not, let the sync request have priority.
@@ -2013,13 +1581,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_meta(rq) && !cfqq->meta_pending)
 		return 1;
 
-	/*
-	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
-	 */
-	if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
-		return 1;
-
-	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+	if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
 		return 0;
 
 	/*
@@ -2033,29 +1595,10 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 }
 
 /*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
-
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
-	cfq_service_tree_add(cfqd, cfqq, 1);
-
-	cfqq->slice_end = 0;
-	cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
  * Called when a new fs request (rq) is added (to cfqq). Check if there's
  * something we should do about it
+ * After enqueuing the request, the decision whether the queue should be
+ * preempted or kicked is taken by the common layer.
  */
 static void
 cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -2063,45 +1606,12 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 {
 	struct cfq_io_context *cic = RQ_CIC(rq);
 
-	cfqd->rq_queued++;
 	if (rq_is_meta(rq))
 		cfqq->meta_pending++;
 
-	cfq_update_io_thinktime(cfqd, cic);
 	cfq_update_io_seektime(cfqd, cic, rq);
-	cfq_update_idle_window(cfqd, cfqq, cic);
 
 	cic->last_request_pos = rq->sector + rq->nr_sectors;
-
-	if (cfqq == cfqd->active_queue) {
-		/*
-		 * Remember that we saw a request from this process, but
-		 * don't start queuing just yet. Otherwise we risk seeing lots
-		 * of tiny requests, because we disrupt the normal plugging
-		 * and merging. If the request is already larger than a single
-		 * page, let it rip immediately. For that case we assume that
-		 * merging is already done. Ditto for a busy system that
-		 * has other work pending, don't risk delaying until the
-		 * idle timer unplug to continue working.
-		 */
-		if (cfq_cfqq_wait_request(cfqq)) {
-			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
-			    cfqd->busy_queues > 1) {
-				del_timer(&cfqd->idle_slice_timer);
-				blk_start_queueing(cfqd->queue);
-			}
-			cfq_mark_cfqq_must_dispatch(cfqq);
-		}
-	} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
-		/*
-		 * not the active queue - expire current slice if it is
-		 * idle and has expired it's mean thinktime or this new queue
-		 * has some old slice time left and is of higher priority or
-		 * this new queue is RT and the current one is BE
-		 */
-		cfq_preempt_queue(cfqd, cfqq);
-		blk_start_queueing(cfqd->queue);
-	}
 }
 
 static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2119,31 +1629,6 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	cfq_rq_enqueued(cfqd, cfqq, rq);
 }
 
-/*
- * Update hw_tag based on peak queue depth over 50 samples under
- * sufficient load.
- */
-static void cfq_update_hw_tag(struct cfq_data *cfqd)
-{
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
-
-	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
-		return;
-
-	if (cfqd->hw_tag_samples++ < 50)
-		return;
-
-	if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
-		cfqd->hw_tag = 1;
-	else
-		cfqd->hw_tag = 0;
-
-	cfqd->hw_tag_samples = 0;
-	cfqd->rq_in_driver_peak = 0;
-}
-
 static void cfq_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -2154,13 +1639,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 	now = jiffies;
 	cfq_log_cfqq(cfqd, cfqq, "complete");
 
-	cfq_update_hw_tag(cfqd);
-
-	WARN_ON(!cfqd->rq_in_driver);
-	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
-	cfqq->dispatched--;
-
 	if (cfq_cfqq_sync(cfqq))
 		cfqd->sync_flight--;
 
@@ -2169,34 +1647,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	if (sync)
 		RQ_CIC(rq)->last_end_request = now;
-
-	/*
-	 * If this is the active queue, check if it needs to be expired,
-	 * or if we want to idle in case it has no pending requests.
-	 */
-	if (cfqd->active_queue == cfqq) {
-		const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
-		if (cfq_cfqq_slice_new(cfqq)) {
-			cfq_set_prio_slice(cfqd, cfqq);
-			cfq_clear_cfqq_slice_new(cfqq);
-		}
-		/*
-		 * If there are no requests waiting in this queue, and
-		 * there are other queues ready to issue requests, AND
-		 * those other queues are issuing requests within our
-		 * mean seek distance, give them a chance to run instead
-		 * of idling.
-		 */
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
-			cfq_slice_expired(cfqd, 1);
-		else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
-			 sync && !rq_noidle(rq))
-			cfq_arm_slice_timer(cfqd);
-	}
-
-	if (!cfqd->rq_in_driver)
-		cfq_schedule_dispatch(cfqd);
 }
 
 /*
@@ -2205,30 +1655,33 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_prio_boost(struct cfq_queue *cfqq)
 {
+	struct io_queue *ioq = cfqq->ioq;
+
 	if (has_fs_excl()) {
 		/*
 		 * boost idle prio on transactions that would lock out other
 		 * users of the filesystem
 		 */
 		if (cfq_class_idle(cfqq))
-			cfqq->ioprio_class = IOPRIO_CLASS_BE;
-		if (cfqq->ioprio > IOPRIO_NORM)
-			cfqq->ioprio = IOPRIO_NORM;
+			elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+		if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+			elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
 	} else {
 		/*
 		 * check if we need to unboost the queue
 		 */
-		if (cfqq->ioprio_class != cfqq->org_ioprio_class)
-			cfqq->ioprio_class = cfqq->org_ioprio_class;
-		if (cfqq->ioprio != cfqq->org_ioprio)
-			cfqq->ioprio = cfqq->org_ioprio;
+		if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+			elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+		if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+			elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
 	}
 }
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
-	    !cfq_cfqq_must_alloc_slice(cfqq)) {
+	if ((elv_ioq_wait_request(cfqq->ioq) ||
+	   cfq_cfqq_must_alloc(cfqq)) && !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
 	}
@@ -2320,119 +1773,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq->allocated[rw]++;
 	cfq_clear_cfqq_must_alloc(cfqq);
-	atomic_inc(&cfqq->ref);
+	elv_get_ioq(cfqq->ioq);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
+	rq->ioq = cfqq->ioq;
 	return 0;
 
 queue_fail:
 	if (cic)
 		put_io_context(cic->ioc);
 
-	cfq_schedule_dispatch(cfqd);
+	elv_schedule_dispatch(cfqd->queue);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	cfq_log(cfqd, "set_request fail");
 	return 1;
 }
 
-static void cfq_kick_queue(struct work_struct *work)
-{
-	struct cfq_data *cfqd =
-		container_of(work, struct cfq_data, unplug_work);
-	struct request_queue *q = cfqd->queue;
-
-	spin_lock_irq(q->queue_lock);
-	blk_start_queueing(q);
-	spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
-	struct cfq_data *cfqd = (struct cfq_data *) data;
-	struct cfq_queue *cfqq;
-	unsigned long flags;
-	int timed_out = 1;
-
-	cfq_log(cfqd, "idle timer fired");
-
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
-	cfqq = cfqd->active_queue;
-	if (cfqq) {
-		timed_out = 0;
-
-		/*
-		 * We saw a request before the queue expired, let it through
-		 */
-		if (cfq_cfqq_must_dispatch(cfqq))
-			goto out_kick;
-
-		/*
-		 * expired
-		 */
-		if (cfq_slice_used(cfqq))
-			goto expire;
-
-		/*
-		 * only expire and reinvoke request handler, if there are
-		 * other queues with pending requests
-		 */
-		if (!cfqd->busy_queues)
-			goto out_cont;
-
-		/*
-		 * not expired and it has a request pending, let it dispatch
-		 */
-		if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-			goto out_kick;
-	}
-expire:
-	cfq_slice_expired(cfqd, timed_out);
-out_kick:
-	cfq_schedule_dispatch(cfqd);
-out_cont:
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
-	del_timer_sync(&cfqd->idle_slice_timer);
-	cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
-	int i;
-
-	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
-	}
-
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
 	struct request_queue *q = cfqd->queue;
 
-	cfq_shutdown_timer_wq(cfqd);
-
 	spin_lock_irq(q->queue_lock);
 
-	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
 							struct cfq_io_context,
@@ -2441,12 +1806,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		__cfq_exit_single_io_context(cfqd, cic);
 	}
 
-	cfq_put_async_queues(cfqd);
-
 	spin_unlock_irq(q->queue_lock);
-
-	cfq_shutdown_timer_wq(cfqd);
-
 	kfree(cfqd);
 }
 
@@ -2459,8 +1819,6 @@ static void *cfq_init_queue(struct request_queue *q)
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
-
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
 	 * zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2473,23 +1831,13 @@ static void *cfq_init_queue(struct request_queue *q)
 
 	cfqd->queue = q;
 
-	init_timer(&cfqd->idle_slice_timer);
-	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
-	cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
-	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
 	cfqd->last_end_request = jiffies;
 	cfqd->cfq_quantum = cfq_quantum;
 	cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
 	cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
-	cfqd->cfq_slice[0] = cfq_slice_async;
-	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
-	cfqd->cfq_slice_idle = cfq_slice_idle;
-	cfqd->hw_tag = 1;
 
 	return cfqd;
 }
@@ -2554,9 +1902,6 @@ SHOW_FUNCTION(cfq_fifo_expire_sync_show, cfqd->cfq_fifo_expire[1], 1);
 SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
-SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 #undef SHOW_FUNCTION
 
@@ -2584,9 +1929,6 @@ STORE_FUNCTION(cfq_fifo_expire_async_store, &cfqd->cfq_fifo_expire[0], 1,
 STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
 		UINT_MAX, 0);
-STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 #undef STORE_FUNCTION
@@ -2600,10 +1942,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(fifo_expire_async),
 	CFQ_ATTR(back_seek_max),
 	CFQ_ATTR(back_seek_penalty),
-	CFQ_ATTR(slice_sync),
-	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
-	CFQ_ATTR(slice_idle),
 	__ATTR_NULL
 };
 
@@ -2616,8 +1955,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_dispatch_fn =		cfq_dispatch_requests,
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
-		.elevator_deactivate_req_fn =	cfq_deactivate_request,
-		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -2627,7 +1964,15 @@ static struct elevator_type iosched_cfq = {
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
 		.trim =				cfq_free_io_context,
+		.elevator_free_sched_queue_fn =	cfq_free_cfq_queue,
+		.elevator_active_ioq_set_fn = 	cfq_active_ioq_set,
+		.elevator_active_ioq_reset_fn =	cfq_active_ioq_reset,
+		.elevator_arm_slice_timer_fn = 	cfq_arm_slice_timer,
+		.elevator_should_preempt_fn = 	cfq_should_preempt,
+		.elevator_update_idle_window_fn = cfq_update_idle_window,
+		.elevator_close_cooperator_fn = cfq_close_cooperator,
 	},
+	.elevator_features =    ELV_IOSCHED_NEED_FQ,
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
 	.elevator_owner =	THIS_MODULE,
@@ -2635,14 +1980,6 @@ static struct elevator_type iosched_cfq = {
 
 static int __init cfq_init(void)
 {
-	/*
-	 * could be 0 on HZ < 1000 setups
-	 */
-	if (!cfq_slice_async)
-		cfq_slice_async = 1;
-	if (!cfq_slice_idle)
-		cfq_slice_idle = 1;
-
 	if (cfq_slab_setup())
 		return -ENOMEM;
 
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (5 preceding siblings ...)
  2009-05-05 19:58 ` [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevaotor layer Vivek Goyal
                   ` (30 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando
  Cc: akpm, vgoyal

This patch changes cfq to use the fair queuing code from the elevator layer.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |    3 +-
 block/cfq-iosched.c   | 1097 ++++++++++---------------------------------------
 2 files changed, 219 insertions(+), 881 deletions(-)
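
[Not part of the patch -- a rough sketch for review context. The overall
direction is that the per-queue bookkeeping cfq used to keep in cfq_data and
cfq_queue (busy_queues, dispatched, ioprio, the idle slice timer) is now read
through the io_queue owned by the common elevator fair queuing layer. The
helper below is hypothetical; it only strings together accessors the diff
itself uses, such as elv_nr_busy_ioq(), elv_ioq_nr_queued() and
elv_ioq_nr_dispatched().]

static int cfqq_has_pending_work(struct request_queue *q,
				 struct cfq_queue *cfqq)
{
	/* busy queue accounting now lives in the elevator layer */
	if (!elv_nr_busy_ioq(q->elevator))
		return 0;

	/* per-queue counters are read through the embedded io_queue */
	return elv_ioq_nr_queued(cfqq->ioq) ||
	       elv_ioq_nr_dispatched(cfqq->ioq);
}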

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
 menu "IO Schedulers"
 
 config ELV_FAIR_QUEUING
-	bool "Elevator Fair Queuing Support"
+	bool
 	default n
 	---help---
 	  Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
+	select ELV_FAIR_QUEUING
 	default y
 	---help---
 	  The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a55a9bd..f90c534 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,7 +12,6 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
-
 /*
  * tunables
  */
@@ -23,15 +22,7 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 static const int cfq_back_max = 16 * 1024;
 /* penalty of a backwards seek */
 static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
-static int cfq_slice_idle = HZ / 125;
-
-/*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY		(HZ / 5)
 
 /*
  * below this threshold, we consider thinktime immediate
@@ -43,7 +34,7 @@ static int cfq_slice_idle = HZ / 125;
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq)	(struct cfq_queue *) (ioq_sched_queue((rq)->ioq))
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +44,6 @@ static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
-#define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
 
 #define sample_valid(samples)	((samples) > 80)
 
@@ -75,12 +64,6 @@ struct cfq_rb_root {
  */
 struct cfq_data {
 	struct request_queue *queue;
-
-	/*
-	 * rr list of queues with requests and the count of them
-	 */
-	struct cfq_rb_root service_tree;
-
 	/*
 	 * Each priority tree is sorted by next_request position.  These
 	 * trees are used when determining if two or more queues are
@@ -88,39 +71,10 @@ struct cfq_data {
 	 */
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
-	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
-
-	int rq_in_driver;
 	int sync_flight;
 
-	/*
-	 * queue-depth detection
-	 */
-	int rq_queued;
-	int hw_tag;
-	int hw_tag_samples;
-	int rq_in_driver_peak;
-
-	/*
-	 * idle window management
-	 */
-	struct timer_list idle_slice_timer;
-	struct work_struct unplug_work;
-
-	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 	unsigned long last_end_request;
 
@@ -131,9 +85,7 @@ struct cfq_data {
 	unsigned int cfq_fifo_expire[2];
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
-	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
-	unsigned int cfq_slice_idle;
 
 	struct list_head cic_list;
 };
@@ -142,16 +94,11 @@ struct cfq_data {
  * Per process-grouping structure
  */
 struct cfq_queue {
-	/* reference count */
-	atomic_t ref;
+	struct io_queue *ioq;
 	/* various state flags, see below */
 	unsigned int flags;
 	/* parent cfq_data */
 	struct cfq_data *cfqd;
-	/* service_tree member */
-	struct rb_node rb_node;
-	/* service_tree key */
-	unsigned long rb_key;
 	/* prio tree member */
 	struct rb_node p_node;
 	/* prio tree root we belong to, if any */
@@ -167,33 +114,23 @@ struct cfq_queue {
 	/* fifo list of requests in sort_list */
 	struct list_head fifo;
 
-	unsigned long slice_end;
-	long slice_resid;
 	unsigned int slice_dispatch;
 
 	/* pending metadata requests */
 	int meta_pending;
-	/* number of requests that are on the dispatch list or inside driver */
-	int dispatched;
 
 	/* io prio of this group */
-	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class, org_ioprio_class;
+	unsigned short org_ioprio;
+	unsigned short org_ioprio_class;
 
 	pid_t pid;
 };
 
 enum cfqq_state_flags {
-	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
-	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
-	CFQ_CFQQ_FLAG_must_dispatch,	/* must be allowed a dispatch */
 	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
 	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
 	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
-	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
 	CFQ_CFQQ_FLAG_prio_changed,	/* task priority has changed */
-	CFQ_CFQQ_FLAG_slice_new,	/* no requests dispatched in slice */
-	CFQ_CFQQ_FLAG_sync,		/* synchronous queue */
 	CFQ_CFQQ_FLAG_coop,		/* has done a coop jump of the queue */
 };
 
@@ -211,16 +148,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 	return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0;	\
 }
 
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
 CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
 CFQ_CFQQ_FNS(coop);
 #undef CFQ_CFQQ_FNS
 
@@ -259,66 +190,32 @@ static inline int cfq_bio_sync(struct bio *bio)
 	return 0;
 }
 
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
 {
-	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
-	}
+	return ioq_to_io_group(cfqq->ioq);
 }
 
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	return !cfqd->busy_queues;
+	return elv_ioq_class_idle(cfqq->ioq);
 }
 
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
-				 unsigned short prio)
-{
-	const int base_slice = cfqd->cfq_slice[sync];
-
-	WARN_ON(prio >= IOPRIO_BE_NR);
-
-	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
-}
-
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_class_rt(struct cfq_queue *cfqq)
 {
-	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
+	return elv_ioq_class_rt(cfqq->ioq);
 }
 
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
-	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
+	return elv_ioq_sync(cfqq->ioq);
 }
 
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
 {
-	if (cfq_cfqq_slice_new(cfqq))
-		return 0;
-	if (time_before(jiffies, cfqq->slice_end))
-		return 0;
+	struct cfq_data *cfqd = cfqq->cfqd;
+	struct elevator_queue *e = cfqd->queue->elevator;
 
-	return 1;
+	return (elv_active_sched_queue(e) == cfqq);
 }
 
 /*
@@ -417,33 +314,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 }
 
 /*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
-	if (!root->left)
-		root->left = rb_first(&root->rb);
-
-	if (root->left)
-		return rb_entry(root->left, struct cfq_queue, rb_node);
-
-	return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
-	rb_erase(n, root);
-	RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
-	if (root->left == n)
-		root->left = NULL;
-	rb_erase_init(n, &root->rb);
-}
-
-/*
  * would be nice to take fifo expire time into account as well
  */
 static struct request *
@@ -456,10 +326,10 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	BUG_ON(RB_EMPTY_NODE(&last->rb_node));
 
-	if (rbprev)
+	if (rbprev != NULL)
 		prev = rb_entry_rq(rbprev);
 
-	if (rbnext)
+	if (rbnext != NULL)
 		next = rb_entry_rq(rbnext);
 	else {
 		rbnext = rb_first(&cfqq->sort_list);
@@ -470,95 +340,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	return cfq_choose_req(cfqd, next, prev);
 }
 
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
-				      struct cfq_queue *cfqq)
-{
-	/*
-	 * just an approximation, should be ok.
-	 */
-	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
-{
-	struct rb_node **p, *parent;
-	struct cfq_queue *__cfqq;
-	unsigned long rb_key;
-	int left;
-
-	if (cfq_class_idle(cfqq)) {
-		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
-		if (parent && parent != &cfqq->rb_node) {
-			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-			rb_key += __cfqq->rb_key;
-		} else
-			rb_key += jiffies;
-	} else if (!add_front) {
-		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
-		rb_key += cfqq->slice_resid;
-		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
-		/*
-		 * same position, nothing more to do
-		 */
-		if (rb_key == cfqq->rb_key)
-			return;
-
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	}
-
-	left = 1;
-	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
-	while (*p) {
-		struct rb_node **n;
-
-		parent = *p;
-		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
-		/*
-		 * sort RT queues first, we always want to give
-		 * preference to them. IDLE queues goes to the back.
-		 * after that, sort on the next service time.
-		 */
-		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			n = &(*p)->rb_right;
-		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
-			n = &(*p)->rb_right;
-		else if (rb_key < __cfqq->rb_key)
-			n = &(*p)->rb_left;
-		else
-			n = &(*p)->rb_right;
-
-		if (n == &(*p)->rb_right)
-			left = 0;
-
-		p = n;
-	}
-
-	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
-
-	cfqq->rb_key = rb_key;
-	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
 static struct cfq_queue *
 cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
 		     sector_t sector, struct rb_node **ret_parent,
@@ -620,57 +401,34 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 		cfqq->p_root = NULL;
 }
 
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
 {
-	/*
-	 * Resorting requires the cfqq to be on the RR list already.
-	 */
-	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
-		cfq_prio_tree_add(cfqd, cfqq);
-	}
-}
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
-	cfq_mark_cfqq_on_rr(cfqq);
-	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
+	if (cfqd->active_cic) {
+		put_io_context(cfqd->active_cic->ioc);
+		cfqd->active_cic = NULL;
+	}
 
-	cfq_resort_rr_list(cfqd, cfqq);
+	/* Resort the cfqq in prio tree */
+	if (cfqq)
+		cfq_prio_tree_add(cfqd, cfqq);
 }
 
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
 {
-	cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-	cfq_clear_cfqq_on_rr(cfqq);
+	struct cfq_queue *cfqq = sched_queue;
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	if (cfqq->p_root) {
-		rb_erase(&cfqq->p_node, cfqq->p_root);
-		cfqq->p_root = NULL;
-	}
+	cfqq->slice_dispatch = 0;
 
-	BUG_ON(!cfqd->busy_queues);
-	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
+	cfq_clear_cfqq_must_alloc_slice(cfqq);
+	cfq_clear_cfqq_fifo_expire(cfqq);
+	if (!coop)
+		cfq_clear_cfqq_coop(cfqq);
 }
 
 /*
@@ -679,7 +437,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_del_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
-	struct cfq_data *cfqd = cfqq->cfqd;
 	const int sync = rq_is_sync(rq);
 
 	BUG_ON(!cfqq->queued[sync]);
@@ -687,8 +444,17 @@ static void cfq_del_rq_rb(struct request *rq)
 
 	elv_rb_del(&cfqq->sort_list, rq);
 
-	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
-		cfq_del_cfqq_rr(cfqd, cfqq);
+	/*
+	 * If this was the last request in the queue, remove this queue from
+	 * the prio trees. For the last request, nr_queued will still be 1 as
+	 * the elevator fair queuing layer is yet to do the accounting.
+	 */
+	if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+		if (cfqq->p_root) {
+			rb_erase(&cfqq->p_node, cfqq->p_root);
+			cfqq->p_root = NULL;
+		}
+	}
 }
 
 static void cfq_add_rq_rb(struct request *rq)
@@ -706,9 +472,6 @@ static void cfq_add_rq_rb(struct request *rq)
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
 		cfq_dispatch_insert(cfqd->queue, __alias);
 
-	if (!cfq_cfqq_on_rr(cfqq))
-		cfq_add_cfqq_rr(cfqd, cfqq);
-
 	/*
 	 * check if this request is a better next-serve candidate
 	 */
@@ -756,23 +519,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
-
 	cfqd->last_position = rq->hard_sector + rq->hard_nr_sectors;
 }
 
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
-}
-
 static void cfq_remove_request(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -783,7 +532,6 @@ static void cfq_remove_request(struct request *rq)
 	list_del_init(&rq->queuelist);
 	cfq_del_rq_rb(rq);
 
-	cfqq->cfqd->rq_queued--;
 	if (rq_is_meta(rq)) {
 		WARN_ON(!cfqq->meta_pending);
 		cfqq->meta_pending--;
@@ -857,93 +605,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	return 0;
 }
 
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
-				   struct cfq_queue *cfqq)
-{
-	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active");
-		cfqq->slice_end = 0;
-		cfqq->slice_dispatch = 0;
-
-		cfq_clear_cfqq_wait_request(cfqq);
-		cfq_clear_cfqq_must_dispatch(cfqq);
-		cfq_clear_cfqq_must_alloc_slice(cfqq);
-		cfq_clear_cfqq_fifo_expire(cfqq);
-		cfq_mark_cfqq_slice_new(cfqq);
-
-		del_timer(&cfqd->idle_slice_timer);
-	}
-
-	cfqd->active_queue = cfqq;
-}
-
 /*
  * current cfqq expired its slice (or was too idle), select new one
  */
 static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
-	if (cfq_cfqq_wait_request(cfqq))
-		del_timer(&cfqd->idle_slice_timer);
-
-	cfq_clear_cfqq_wait_request(cfqq);
-
-	/*
-	 * store what was left of this slice, if the queue idled/timed out
-	 */
-	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
-		cfqq->slice_resid = cfqq->slice_end - jiffies;
-		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
-	}
-
-	cfq_resort_rr_list(cfqd, cfqq);
-
-	if (cfqq == cfqd->active_queue)
-		cfqd->active_queue = NULL;
-
-	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
-		cfqd->active_cic = NULL;
-	}
+	__elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);
 
 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
-
-	return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
-					      struct cfq_queue *cfqq)
-{
-	if (!cfqq) {
-		cfqq = cfq_get_next_queue(cfqd);
-		if (cfqq)
-			cfq_clear_cfqq_coop(cfqq);
-	}
-
-	__cfq_set_active_queue(cfqd, cfqq);
-	return cfqq;
+		__cfq_slice_expired(cfqd, cfqq);
 }
 
 static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1020,11 +696,12 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
  * associated with the I/O issued by cur_cfqq.  I'm not sure this is a valid
  * assumption.
  */
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
-					      struct cfq_queue *cur_cfqq,
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+					      void *cur_sched_queue,
 					      int probe)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
 	/*
 	 * A valid cfq_io_context is necessary to compare requests against
@@ -1047,38 +724,18 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 
 	if (!probe)
 		cfq_mark_cfqq_coop(cfqq);
-	return cfqq;
+	return cfqq->ioq;
 }
 
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_io_context *cic;
 	unsigned long sl;
 
-	/*
-	 * SSD device without seek penalty, disable idling. But only do so
-	 * for devices that support queuing, otherwise we still have a problem
-	 * with sync vs async workloads.
-	 */
-	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
-		return;
-
 	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
-	WARN_ON(cfq_cfqq_slice_new(cfqq));
-
-	/*
-	 * idle is disabled, either manually or by past process history
-	 */
-	if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
-		return;
-
-	/*
-	 * still requests with the driver, don't idle
-	 */
-	if (cfqd->rq_in_driver)
-		return;
-
+	WARN_ON(elv_ioq_slice_new(cfqq->ioq));
 	/*
 	 * task has exited, don't wait
 	 */
@@ -1086,18 +743,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
 		return;
 
-	cfq_mark_cfqq_wait_request(cfqq);
 
+	elv_mark_ioq_wait_request(cfqq->ioq);
 	/*
 	 * we don't want to idle for seeks, but we do want to allow
 	 * fair distribution of slice time for a process doing back-to-back
 	 * seeks. so allow a little bit of time for him to submit a new rq
 	 */
-	sl = cfqd->cfq_slice_idle;
+	sl = elv_get_slice_idle(q->elevator);
 	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
 
-	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+	elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
 }
 
@@ -1106,13 +763,12 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
  */
 static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%d", rq->nr_sectors);
 
 	cfq_remove_request(rq);
-	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
 
 	if (cfq_cfqq_sync(cfqq))
@@ -1150,78 +806,11 @@ static inline int
 cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	const int base_rq = cfqd->cfq_slice_async_rq;
+	unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
 
-	WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
-
-	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq, *new_cfqq = NULL;
-
-	cfqq = cfqd->active_queue;
-	if (!cfqq)
-		goto new_queue;
-
-	/*
-	 * The active queue has run out of time, expire it and select new.
-	 */
-	if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
-		goto expire;
-
-	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
-	 * The active queue has requests and isn't expired, allow it to
-	 * dispatch.
-	 */
-	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-		goto keep_queue;
-
-	/*
-	 * If another queue has a request waiting within our mean seek
-	 * distance, let it run.  The expire code will check for close
-	 * cooperators and put the close queue at the front of the service
-	 * tree.
-	 */
-	new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
-	if (new_cfqq)
-		goto expire;
+	WARN_ON(ioprio >= IOPRIO_BE_NR);
 
-	/*
-	 * No requests pending. If the active queue still has requests in
-	 * flight or is idling for a new request, allow either of these
-	 * conditions to happen (or time out) before selecting a new queue.
-	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
-	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
-		cfqq = NULL;
-		goto keep_queue;
-	}
-
-expire:
-	cfq_slice_expired(cfqd, 0);
-new_queue:
-	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
-	return cfqq;
+	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
 }
 
 static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1246,12 +835,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	cfq_slice_expired(cfqd, 0);
+	/* This is probably redundant now; the above loop should make sure
+	 * that all the busy queues have expired. */
+	cfq_slice_expired(cfqd);
 
-	BUG_ON(cfqd->busy_queues);
+	BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));
 
 	cfq_log(cfqd, "forced_dispatch=%d\n", dispatched);
 	return dispatched;
@@ -1297,13 +888,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
 
-	if (!cfqd->busy_queues)
-		return 0;
-
 	if (unlikely(force))
 		return cfq_forced_dispatch(cfqd);
 
-	cfqq = cfq_select_queue(cfqd);
+	cfqq = elv_select_sched_queue(q, 0);
 	if (!cfqq)
 		return 0;
 
@@ -1320,7 +908,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
-	if (cfqq->dispatched >= max_dispatch) {
+	if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
 		/*
 		 * idle queue must always only have a single IO in flight
 		 */
@@ -1330,13 +918,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (elv_nr_busy_ioq(q->elevator) > 1)
 			return 0;
 
 		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
-		if (cfqq->dispatched >= 4 * max_dispatch)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
 			return 0;
 	}
 
@@ -1345,51 +933,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	 */
 	cfq_dispatch_request(cfqd, cfqq);
 	cfqq->slice_dispatch++;
-	cfq_clear_cfqq_must_dispatch(cfqq);
 
 	/*
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+	if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
-		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0);
+		cfq_slice_expired(cfqd);
 	}
 
 	cfq_log(cfqd, "dispatched a request");
 	return 1;
 }
 
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
 {
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
 
-	BUG_ON(atomic_read(&cfqq->ref) <= 0);
+	BUG_ON(!cfqq);
 
-	if (!atomic_dec_and_test(&cfqq->ref))
-		return;
-
-	cfq_log_cfqq(cfqd, cfqq, "put_queue");
+	cfq_log_cfqq(cfqd, cfqq, "free_queue");
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
-	if (unlikely(cfqd->active_queue == cfqq)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq_is_active_queue(cfqq))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+	elv_put_ioq(cfqq->ioq);
+}
+
 /*
  * Must always be called with the rcu_read_lock() held
  */
@@ -1477,9 +1059,9 @@ static void cfq_free_io_context(struct io_context *ioc)
 
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	if (unlikely(cfqq == cfqd->active_queue)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	cfq_put_queue(cfqq);
@@ -1549,9 +1131,10 @@ static struct cfq_io_context *
 cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
 
 	cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
-							cfqd->queue->node);
+							q->node);
 	if (cic) {
 		cic->last_end_request = jiffies;
 		INIT_LIST_HEAD(&cic->queue_list);
@@ -1567,7 +1150,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
-	int ioprio_class;
+	int ioprio_class, ioprio;
 
 	if (!cfq_cfqq_prio_changed(cfqq))
 		return;
@@ -1580,30 +1163,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 		/*
 		 * no prio set, inherit CPU scheduling settings
 		 */
-		cfqq->ioprio = task_nice_ioprio(tsk);
-		cfqq->ioprio_class = task_nice_ioclass(tsk);
+		ioprio = task_nice_ioprio(tsk);
+		ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_RT;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
-		cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
-		cfqq->ioprio = 7;
-		cfq_clear_cfqq_idle_window(cfqq);
+		ioprio_class = IOPRIO_CLASS_IDLE;
+		ioprio = 7;
+		elv_clear_ioq_idle_window(cfqq->ioq);
 		break;
 	}
 
+	elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+	elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
 	/*
 	 * keep track of original prio settings in case we have to temporarily
 	 * elevate the priority of this queue
 	 */
-	cfqq->org_ioprio = cfqq->ioprio;
-	cfqq->org_ioprio_class = cfqq->ioprio_class;
+	cfqq->org_ioprio = ioprio;
+	cfqq->org_ioprio_class = ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -1612,11 +1198,12 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	struct cfq_data *cfqd = cic->key;
 	struct cfq_queue *cfqq;
 	unsigned long flags;
+	struct request_queue *q = cfqd->queue;
 
 	if (unlikely(!cfqd))
 		return;
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
 	if (cfqq) {
@@ -1633,7 +1220,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	if (cfqq)
 		cfq_mark_cfqq_prio_changed(cfqq);
 
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
 static void cfq_ioc_set_ioprio(struct io_context *ioc)
@@ -1644,11 +1231,12 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
-		     struct io_context *ioc, gfp_t gfp_mask)
+				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
-
+	struct request_queue *q = cfqd->queue;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
 retry:
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
@@ -1656,8 +1244,7 @@ retry:
 
 	if (!cfqq) {
 		if (new_cfqq) {
-			cfqq = new_cfqq;
-			new_cfqq = NULL;
+			goto alloc_ioq;
 		} else if (gfp_mask & __GFP_WAIT) {
 			/*
 			 * Inform the allocator of the fact that we will
@@ -1678,22 +1265,52 @@ retry:
 			if (!cfqq)
 				goto out;
 		}
+alloc_ioq:
+		if (new_ioq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			cfqq = new_cfqq;
+			new_cfqq = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q,
+					gfp_mask | __GFP_NOFAIL | __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq) {
+				kmem_cache_free(cfq_pool, cfqq);
+				cfqq = NULL;
+				goto out;
+			}
+		}
 
-		RB_CLEAR_NODE(&cfqq->rb_node);
+		/*
+		 * Both cfqq and ioq objects allocated. Do the initializations
+		 * now.
+		 */
 		RB_CLEAR_NODE(&cfqq->p_node);
 		INIT_LIST_HEAD(&cfqq->fifo);
-
-		atomic_set(&cfqq->ref, 0);
 		cfqq->cfqd = cfqd;
 
 		cfq_mark_cfqq_prio_changed(cfqq);
 
+		cfqq->ioq = ioq;
 		cfq_init_prio_data(cfqq, ioc);
+		elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
+				cfqq->org_ioprio, is_sync);
 
 		if (is_sync) {
 			if (!cfq_class_idle(cfqq))
-				cfq_mark_cfqq_idle_window(cfqq);
-			cfq_mark_cfqq_sync(cfqq);
+				elv_mark_ioq_idle_window(cfqq->ioq);
+			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
@@ -1702,38 +1319,28 @@ retry:
 	if (new_cfqq)
 		kmem_cache_free(cfq_pool, new_cfqq);
 
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
 out:
 	WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
 	return cfqq;
 }
 
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
-	switch (ioprio_class) {
-	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
-	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
-	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
-	default:
-		BUG();
-	}
-}
-
 static struct cfq_queue *
 cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-	      gfp_t gfp_mask)
+					gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
-	struct cfq_queue **async_cfqq = NULL;
+	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct io_group *iog = io_lookup_io_group_current(cfqd->queue);
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
-		cfqq = *async_cfqq;
+		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
+								ioprio);
+		cfqq = async_cfqq;
 	}
 
 	if (!cfqq) {
@@ -1742,15 +1349,11 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 			return NULL;
 	}
 
-	/*
-	 * pin the queue now that it's allocated, scheduler exit will prune it
-	 */
-	if (!is_sync && !(*async_cfqq)) {
-		atomic_inc(&cfqq->ref);
-		*async_cfqq = cfqq;
-	}
+	if (!is_sync && !async_cfqq)
+		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	atomic_inc(&cfqq->ref);
+	/* ioc reference */
+	elv_get_ioq(cfqq->ioq);
 	return cfqq;
 }
 
@@ -1829,6 +1432,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 {
 	unsigned long flags;
 	int ret;
+	struct request_queue *q = cfqd->queue;
 
 	ret = radix_tree_preload(gfp_mask);
 	if (!ret) {
@@ -1845,9 +1449,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 		radix_tree_preload_end();
 
 		if (!ret) {
-			spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+			spin_lock_irqsave(q->queue_lock, flags);
 			list_add(&cic->queue_list, &cfqd->cic_list);
-			spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+			spin_unlock_irqrestore(q->queue_lock, flags);
 		}
 	}
 
@@ -1867,10 +1471,11 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct io_context *ioc = NULL;
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
-	ioc = get_io_context(gfp_mask, cfqd->queue->node);
+	ioc = get_io_context(gfp_mask, q->node);
 	if (!ioc)
 		return NULL;
 
@@ -1889,7 +1494,6 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
-
 	return cic;
 err_free:
 	cfq_cic_free(cic);
@@ -1899,17 +1503,6 @@ err:
 }
 
 static void
-cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
-{
-	unsigned long elapsed = jiffies - cic->last_end_request;
-	unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
-
-	cic->ttime_samples = (7*cic->ttime_samples + 256) / 8;
-	cic->ttime_total = (7*cic->ttime_total + 256*ttime) / 8;
-	cic->ttime_mean = (cic->ttime_total + 128) / cic->ttime_samples;
-}
-
-static void
 cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 		       struct request *rq)
 {
@@ -1940,65 +1533,40 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 }
 
 /*
- * Disable idle window if the process thinks too long or seeks so much that
- * it doesn't matter
+ * Disable idle window if the process seeks so much that it doesn't matter
  */
-static void
-cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		       struct cfq_io_context *cic)
+static int
+cfq_update_idle_window(struct elevator_queue *eq, void *cfqq,
+					struct request *rq)
 {
-	int old_idle, enable_idle;
+	struct cfq_io_context *cic = RQ_CIC(rq);
 
 	/*
-	 * Don't idle for async or idle io prio class
+	 * Enabling/disabling idling based on thinktime has been moved
+	 * to the common layer.
 	 */
-	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
-		return;
-
-	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
-
-	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
-		enable_idle = 0;
-	else if (sample_valid(cic->ttime_samples)) {
-		if (cic->ttime_mean > cfqd->cfq_slice_idle)
-			enable_idle = 0;
-		else
-			enable_idle = 1;
-	}
+	if (!atomic_read(&cic->ioc->nr_tasks) ||
+	    (elv_hw_tag(eq) && CIC_SEEKY(cic)))
+		return 0;
 
-	if (old_idle != enable_idle) {
-		cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
-		if (enable_idle)
-			cfq_mark_cfqq_idle_window(cfqq);
-		else
-			cfq_clear_cfqq_idle_window(cfqq);
-	}
+	return 1;
 }
 
 /*
  * Check if new_cfqq should preempt the currently active queue. Return 0 for
- * no or if we aren't sure, a 1 will cause a preempt.
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ * Some of the preemption logic has been moved to the common layer. Only
+ * the cfq-specific parts are left here.
  */
 static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
-		   struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
 
-	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		return 0;
 
-	if (cfq_slice_used(cfqq))
-		return 1;
-
-	if (cfq_class_idle(new_cfqq))
-		return 0;
-
-	if (cfq_class_idle(cfqq))
-		return 1;
-
 	/*
 	 * if the new request is sync, but the currently running queue is
 	 * not, let the sync request have priority.
@@ -2013,13 +1581,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_meta(rq) && !cfqq->meta_pending)
 		return 1;
 
-	/*
-	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
-	 */
-	if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
-		return 1;
-
-	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+	if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
 		return 0;
 
 	/*
@@ -2033,29 +1595,10 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 }
 
 /*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
-
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
-	cfq_service_tree_add(cfqd, cfqq, 1);
-
-	cfqq->slice_end = 0;
-	cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
  * Called when a new fs request (rq) is added (to cfqq). Check if there's
  * something we should do about it
+ * After enqueuing the request, the decision whether the queue should be
+ * preempted or kicked is taken by the common layer.
  */
 static void
 cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -2063,45 +1606,12 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 {
 	struct cfq_io_context *cic = RQ_CIC(rq);
 
-	cfqd->rq_queued++;
 	if (rq_is_meta(rq))
 		cfqq->meta_pending++;
 
-	cfq_update_io_thinktime(cfqd, cic);
 	cfq_update_io_seektime(cfqd, cic, rq);
-	cfq_update_idle_window(cfqd, cfqq, cic);
 
 	cic->last_request_pos = rq->sector + rq->nr_sectors;
-
-	if (cfqq == cfqd->active_queue) {
-		/*
-		 * Remember that we saw a request from this process, but
-		 * don't start queuing just yet. Otherwise we risk seeing lots
-		 * of tiny requests, because we disrupt the normal plugging
-		 * and merging. If the request is already larger than a single
-		 * page, let it rip immediately. For that case we assume that
-		 * merging is already done. Ditto for a busy system that
-		 * has other work pending, don't risk delaying until the
-		 * idle timer unplug to continue working.
-		 */
-		if (cfq_cfqq_wait_request(cfqq)) {
-			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
-			    cfqd->busy_queues > 1) {
-				del_timer(&cfqd->idle_slice_timer);
-				blk_start_queueing(cfqd->queue);
-			}
-			cfq_mark_cfqq_must_dispatch(cfqq);
-		}
-	} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
-		/*
-		 * not the active queue - expire current slice if it is
-		 * idle and has expired it's mean thinktime or this new queue
-		 * has some old slice time left and is of higher priority or
-		 * this new queue is RT and the current one is BE
-		 */
-		cfq_preempt_queue(cfqd, cfqq);
-		blk_start_queueing(cfqd->queue);
-	}
 }
 
 static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2119,31 +1629,6 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	cfq_rq_enqueued(cfqd, cfqq, rq);
 }
 
-/*
- * Update hw_tag based on peak queue depth over 50 samples under
- * sufficient load.
- */
-static void cfq_update_hw_tag(struct cfq_data *cfqd)
-{
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
-
-	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
-		return;
-
-	if (cfqd->hw_tag_samples++ < 50)
-		return;
-
-	if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
-		cfqd->hw_tag = 1;
-	else
-		cfqd->hw_tag = 0;
-
-	cfqd->hw_tag_samples = 0;
-	cfqd->rq_in_driver_peak = 0;
-}
-
 static void cfq_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -2154,13 +1639,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 	now = jiffies;
 	cfq_log_cfqq(cfqd, cfqq, "complete");
 
-	cfq_update_hw_tag(cfqd);
-
-	WARN_ON(!cfqd->rq_in_driver);
-	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
-	cfqq->dispatched--;
-
 	if (cfq_cfqq_sync(cfqq))
 		cfqd->sync_flight--;
 
@@ -2169,34 +1647,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	if (sync)
 		RQ_CIC(rq)->last_end_request = now;
-
-	/*
-	 * If this is the active queue, check if it needs to be expired,
-	 * or if we want to idle in case it has no pending requests.
-	 */
-	if (cfqd->active_queue == cfqq) {
-		const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
-		if (cfq_cfqq_slice_new(cfqq)) {
-			cfq_set_prio_slice(cfqd, cfqq);
-			cfq_clear_cfqq_slice_new(cfqq);
-		}
-		/*
-		 * If there are no requests waiting in this queue, and
-		 * there are other queues ready to issue requests, AND
-		 * those other queues are issuing requests within our
-		 * mean seek distance, give them a chance to run instead
-		 * of idling.
-		 */
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
-			cfq_slice_expired(cfqd, 1);
-		else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
-			 sync && !rq_noidle(rq))
-			cfq_arm_slice_timer(cfqd);
-	}
-
-	if (!cfqd->rq_in_driver)
-		cfq_schedule_dispatch(cfqd);
 }
 
 /*
@@ -2205,30 +1655,33 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_prio_boost(struct cfq_queue *cfqq)
 {
+	struct io_queue *ioq = cfqq->ioq;
+
 	if (has_fs_excl()) {
 		/*
 		 * boost idle prio on transactions that would lock out other
 		 * users of the filesystem
 		 */
 		if (cfq_class_idle(cfqq))
-			cfqq->ioprio_class = IOPRIO_CLASS_BE;
-		if (cfqq->ioprio > IOPRIO_NORM)
-			cfqq->ioprio = IOPRIO_NORM;
+			elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+		if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+			elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
 	} else {
 		/*
 		 * check if we need to unboost the queue
 		 */
-		if (cfqq->ioprio_class != cfqq->org_ioprio_class)
-			cfqq->ioprio_class = cfqq->org_ioprio_class;
-		if (cfqq->ioprio != cfqq->org_ioprio)
-			cfqq->ioprio = cfqq->org_ioprio;
+		if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+			elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+		if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+			elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
 	}
 }
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
-	    !cfq_cfqq_must_alloc_slice(cfqq)) {
+	if ((elv_ioq_wait_request(cfqq->ioq) ||
+	   cfq_cfqq_must_alloc(cfqq)) && !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
 	}
@@ -2320,119 +1773,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq->allocated[rw]++;
 	cfq_clear_cfqq_must_alloc(cfqq);
-	atomic_inc(&cfqq->ref);
+	elv_get_ioq(cfqq->ioq);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
+	rq->ioq = cfqq->ioq;
 	return 0;
 
 queue_fail:
 	if (cic)
 		put_io_context(cic->ioc);
 
-	cfq_schedule_dispatch(cfqd);
+	elv_schedule_dispatch(cfqd->queue);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	cfq_log(cfqd, "set_request fail");
 	return 1;
 }
 
-static void cfq_kick_queue(struct work_struct *work)
-{
-	struct cfq_data *cfqd =
-		container_of(work, struct cfq_data, unplug_work);
-	struct request_queue *q = cfqd->queue;
-
-	spin_lock_irq(q->queue_lock);
-	blk_start_queueing(q);
-	spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
-	struct cfq_data *cfqd = (struct cfq_data *) data;
-	struct cfq_queue *cfqq;
-	unsigned long flags;
-	int timed_out = 1;
-
-	cfq_log(cfqd, "idle timer fired");
-
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
-	cfqq = cfqd->active_queue;
-	if (cfqq) {
-		timed_out = 0;
-
-		/*
-		 * We saw a request before the queue expired, let it through
-		 */
-		if (cfq_cfqq_must_dispatch(cfqq))
-			goto out_kick;
-
-		/*
-		 * expired
-		 */
-		if (cfq_slice_used(cfqq))
-			goto expire;
-
-		/*
-		 * only expire and reinvoke request handler, if there are
-		 * other queues with pending requests
-		 */
-		if (!cfqd->busy_queues)
-			goto out_cont;
-
-		/*
-		 * not expired and it has a request pending, let it dispatch
-		 */
-		if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-			goto out_kick;
-	}
-expire:
-	cfq_slice_expired(cfqd, timed_out);
-out_kick:
-	cfq_schedule_dispatch(cfqd);
-out_cont:
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
-	del_timer_sync(&cfqd->idle_slice_timer);
-	cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
-	int i;
-
-	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
-	}
-
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
 	struct request_queue *q = cfqd->queue;
 
-	cfq_shutdown_timer_wq(cfqd);
-
 	spin_lock_irq(q->queue_lock);
 
-	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
 							struct cfq_io_context,
@@ -2441,12 +1806,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		__cfq_exit_single_io_context(cfqd, cic);
 	}
 
-	cfq_put_async_queues(cfqd);
-
 	spin_unlock_irq(q->queue_lock);
-
-	cfq_shutdown_timer_wq(cfqd);
-
 	kfree(cfqd);
 }
 
@@ -2459,8 +1819,6 @@ static void *cfq_init_queue(struct request_queue *q)
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
-
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
 	 * zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2473,23 +1831,13 @@ static void *cfq_init_queue(struct request_queue *q)
 
 	cfqd->queue = q;
 
-	init_timer(&cfqd->idle_slice_timer);
-	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
-	cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
-	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
 	cfqd->last_end_request = jiffies;
 	cfqd->cfq_quantum = cfq_quantum;
 	cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
 	cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
-	cfqd->cfq_slice[0] = cfq_slice_async;
-	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
-	cfqd->cfq_slice_idle = cfq_slice_idle;
-	cfqd->hw_tag = 1;
 
 	return cfqd;
 }
@@ -2554,9 +1902,6 @@ SHOW_FUNCTION(cfq_fifo_expire_sync_show, cfqd->cfq_fifo_expire[1], 1);
 SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
-SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 #undef SHOW_FUNCTION
 
@@ -2584,9 +1929,6 @@ STORE_FUNCTION(cfq_fifo_expire_async_store, &cfqd->cfq_fifo_expire[0], 1,
 STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
 		UINT_MAX, 0);
-STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 #undef STORE_FUNCTION
@@ -2600,10 +1942,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(fifo_expire_async),
 	CFQ_ATTR(back_seek_max),
 	CFQ_ATTR(back_seek_penalty),
-	CFQ_ATTR(slice_sync),
-	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
-	CFQ_ATTR(slice_idle),
 	__ATTR_NULL
 };
 
@@ -2616,8 +1955,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_dispatch_fn =		cfq_dispatch_requests,
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
-		.elevator_deactivate_req_fn =	cfq_deactivate_request,
-		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -2627,7 +1964,15 @@ static struct elevator_type iosched_cfq = {
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
 		.trim =				cfq_free_io_context,
+		.elevator_free_sched_queue_fn =	cfq_free_cfq_queue,
+		.elevator_active_ioq_set_fn = 	cfq_active_ioq_set,
+		.elevator_active_ioq_reset_fn =	cfq_active_ioq_reset,
+		.elevator_arm_slice_timer_fn = 	cfq_arm_slice_timer,
+		.elevator_should_preempt_fn = 	cfq_should_preempt,
+		.elevator_update_idle_window_fn = cfq_update_idle_window,
+		.elevator_close_cooperator_fn = cfq_close_cooperator,
 	},
+	.elevator_features =    ELV_IOSCHED_NEED_FQ,
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
 	.elevator_owner =	THIS_MODULE,
@@ -2635,14 +1980,6 @@ static struct elevator_type iosched_cfq = {
 
 static int __init cfq_init(void)
 {
-	/*
-	 * could be 0 on HZ < 1000 setups
-	 */
-	if (!cfq_slice_async)
-		cfq_slice_async = 1;
-	if (!cfq_slice_idle)
-		cfq_slice_idle = 1;
-
 	if (cfq_slab_setup())
 		return -ENOMEM;
 
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2009-05-05 19:58   ` [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
@ 2009-05-05 19:58   ` Vivek Goyal
  2009-05-05 19:58   ` [PATCH 06/18] io-controller: cfq changes to use " Vivek Goyal
                     ` (16 subsequent siblings)
  21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

This patch enables hierarchical fair queuing in the common elevator layer. It
is controlled by the config option CONFIG_GROUP_IOSCHED.
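
For illustration, here is a minimal userspace sketch (not part of this patch)
of how a group's parameters could be set through the two per-cgroup files
("weight" and "ioprio_class") introduced below in bfqio_files. The mount
point, the group directory and the "io." file prefix are assumptions; the
exact paths depend on how the cgroup subsystem is registered and mounted.

	#include <stdio.h>

	/* hypothetical helper: write a numeric value to a cgroup control file */
	static int write_cgroup_file(const char *path, unsigned long val)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return -1;
		fprintf(f, "%lu\n", val);
		return fclose(f);
	}

	int main(void)
	{
		/* assumed layout: controller mounted at /cgroup, group "test1" */
		if (write_cgroup_file("/cgroup/test1/io.weight", 700))
			return 1;
		/* 2 == IOPRIO_CLASS_BE */
		if (write_cgroup_file("/cgroup/test1/io.ioprio_class", 2))
			return 1;
		return 0;
	}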

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-ioc.c               |    3 +
 block/elevator-fq.c           | 1037 +++++++++++++++++++++++++++++++++++++----
 block/elevator-fq.h           |  149 ++++++-
 block/elevator.c              |    6 +
 include/linux/blkdev.h        |    7 +-
 include/linux/cgroup_subsys.h |    7 +
 include/linux/iocontext.h     |    5 +
 init/Kconfig                  |    8 +
 8 files changed, 1127 insertions(+), 95 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 012f065..8f0f6cf 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -95,6 +95,9 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 		spin_lock_init(&ret->lock);
 		ret->ioprio_changed = 0;
 		ret->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+		ret->cgroup_changed = 0;
+#endif
 		ret->last_waited = jiffies; /* doesn't matter... */
 		ret->nr_batch_requests = 0; /* because this is 0 */
 		ret->aic = NULL;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9f1fbb9..cdaa46f 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -24,6 +24,10 @@ static int elv_rate_sampling_window = HZ / 10;
 
 #define ELV_SLICE_SCALE		(5)
 #define ELV_HW_QUEUE_MIN	(5)
+
+#define IO_DEFAULT_GRP_WEIGHT  500
+#define IO_DEFAULT_GRP_CLASS   IOPRIO_CLASS_BE
+
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
@@ -31,6 +35,7 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
 					struct io_queue *ioq, int probe);
 struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 						 int extract);
+void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
 
 static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
 					unsigned short prio)
@@ -49,6 +54,73 @@ elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
 }
 
 /* Mainly the BFQ scheduling code Follows */
+#ifdef CONFIG_GROUP_IOSCHED
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract);
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue);
+void elv_activate_ioq(struct io_queue *ioq, int add_front);
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue);
+
+static int bfq_update_next_active(struct io_sched_data *sd)
+{
+	struct io_group *iog;
+	struct io_entity *entity, *next_active;
+
+	if (sd->active_entity != NULL)
+		/* will update/requeue at the end of service */
+		return 0;
+
+	/*
+	 * NOTE: this can be improved in many ways, such as returning
+	 * 1 (and thus propagating upwards the update) only when the
+	 * budget changes, or caching the bfqq that will be scheduled
+	 * next from this subtree.  For now we worry more about
+	 * correctness than about performance...
+	 */
+	next_active = bfq_lookup_next_entity(sd, 0);
+	sd->next_active = next_active;
+
+	if (next_active != NULL) {
+		iog = container_of(sd, struct io_group, sched_data);
+		entity = iog->my_entity;
+		if (entity != NULL)
+			entity->budget = next_active->budget;
+	}
+
+	return 1;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+					 struct io_entity *entity)
+{
+	BUG_ON(sd->next_active != entity);
+}
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+	for (parent = NULL; entity != NULL; entity = parent)
+
+static inline int bfq_update_next_active(struct io_sched_data *sd)
+{
+	return 0;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+					 struct io_entity *entity)
+{
+}
+#endif
 
 /*
  * Shift for timestamp calculations.  This actually limits the maximum
@@ -295,16 +367,6 @@ static void bfq_active_insert(struct io_service_tree *st,
 	bfq_update_active_tree(node);
 }
 
-/**
- * bfq_ioprio_to_weight - calc a weight from an ioprio.
- * @ioprio: the ioprio value to convert.
- */
-static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
-{
-	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
-	return IOPRIO_BE_NR - ioprio;
-}
-
 void bfq_get_entity(struct io_entity *entity)
 {
 	struct io_queue *ioq = io_entity_to_ioq(entity);
@@ -313,13 +375,6 @@ void bfq_get_entity(struct io_entity *entity)
 		elv_get_ioq(ioq);
 }
 
-void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
-{
-	entity->ioprio = entity->new_ioprio;
-	entity->ioprio_class = entity->new_ioprio_class;
-	entity->sched_data = &iog->sched_data;
-}
-
 /**
  * bfq_find_deepest - find the deepest node that an extraction can modify.
  * @node: the node being removed.
@@ -462,8 +517,10 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
 	struct io_queue *ioq = io_entity_to_ioq(entity);
 
 	if (entity->ioprio_changed) {
+		old_st->wsum -= entity->weight;
 		entity->ioprio = entity->new_ioprio;
 		entity->ioprio_class = entity->new_ioprio_class;
+		entity->weight = entity->new_weight;
 		entity->ioprio_changed = 0;
 
 		/*
@@ -475,9 +532,6 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
 			entity->budget = elv_prio_to_slice(efqd, ioq);
 		}
 
-		old_st->wsum -= entity->weight;
-		entity->weight = bfq_ioprio_to_weight(entity->ioprio);
-
 		/*
 		 * NOTE: here we may be changing the weight too early,
 		 * this will cause unfairness.  The correct approach
@@ -559,11 +613,8 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 	if (add_front) {
 		struct io_entity *next_entity;
 
-		/*
-		 * Determine the entity which will be dispatched next
-		 * Use sd->next_active once hierarchical patch is applied
-		 */
-		next_entity = bfq_lookup_next_entity(sd, 0);
+		/* Determine the entity which will be dispatched next */
+		next_entity = sd->next_active;
 
 		if (next_entity && next_entity != entity) {
 			struct io_service_tree *new_st;
@@ -590,12 +641,27 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 }
 
 /**
- * bfq_activate_entity - activate an entity.
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
  * @entity: the entity to activate.
+ * Activate @entity and all the entities on the path from it to the root.
  */
 void bfq_activate_entity(struct io_entity *entity, int add_front)
 {
-	__bfq_activate_entity(entity, add_front);
+	struct io_sched_data *sd;
+
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity, add_front);
+
+		add_front = 0;
+		sd = entity->sched_data;
+		if (!bfq_update_next_active(sd))
+			/*
+			 * No need to propagate the activation to the
+			 * upper entities, as they will be updated when
+			 * the active entity is rescheduled.
+			 */
+			break;
+	}
 }
 
 /**
@@ -631,12 +697,16 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 	else if (entity->tree != NULL)
 		BUG();
 
+	if (was_active || sd->next_active == entity)
+		ret = bfq_update_next_active(sd);
+
 	if (!requeue || !bfq_gt(entity->finish, st->vtime))
 		bfq_forget_entity(st, entity);
 	else
 		bfq_idle_insert(st, entity);
 
 	BUG_ON(sd->active_entity == entity);
+	BUG_ON(sd->next_active == entity);
 
 	return ret;
 }
@@ -648,7 +718,46 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
  */
 void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 {
-	__bfq_deactivate_entity(entity, requeue);
+	struct io_sched_data *sd;
+	struct io_entity *parent;
+
+	for_each_entity_safe(entity, parent) {
+		sd = entity->sched_data;
+
+		if (!__bfq_deactivate_entity(entity, requeue))
+			/*
+			 * The parent entity is still backlogged, and
+			 * we don't need to update it as it is still
+			 * under service.
+			 */
+			break;
+
+		if (sd->next_active != NULL)
+			/*
+			 * The parent entity is still backlogged and
+			 * the budgets on the path towards the root
+			 * need to be updated.
+			 */
+			goto update;
+
+		/*
+		 * If we reach here the parent is no longer backlogged and
+		 * we want to propagate the dequeue upwards.
+		 */
+		requeue = 1;
+	}
+
+	return;
+
+update:
+	entity = parent;
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity, 0);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_active(sd))
+			break;
+	}
 }
 
 /**
@@ -765,8 +874,10 @@ struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 		entity = __bfq_lookup_next_entity(st);
 		if (entity != NULL) {
 			if (extract) {
+				bfq_check_next_active(sd, entity);
 				bfq_active_extract(st, entity);
 				sd->active_entity = entity;
+				sd->next_active = NULL;
 			}
 			break;
 		}
@@ -779,13 +890,768 @@ void entity_served(struct io_entity *entity, bfq_service_t served)
 {
 	struct io_service_tree *st;
 
-	st = io_entity_service_tree(entity);
-	entity->service += served;
-	BUG_ON(st->wsum == 0);
-	st->vtime += bfq_delta(served, st->wsum);
-	bfq_forget_idle(st);
+	for_each_entity(entity) {
+		st = io_entity_service_tree(entity);
+		entity->service += served;
+		BUG_ON(st->wsum == 0);
+		st->vtime += bfq_delta(served, st->wsum);
+		bfq_forget_idle(st);
+	}
+}
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+
+/* Mainly hierarchical grouping code */
+#ifdef CONFIG_GROUP_IOSCHED
+
+struct io_cgroup io_root_cgroup = {
+	.weight = IO_DEFAULT_GRP_WEIGHT,
+	.ioprio_class = IO_DEFAULT_GRP_CLASS,
+};
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->weight = entity->new_weight;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+}
+
+struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
+{
+	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
+			    struct io_cgroup, css);
+}
+
+/*
+ * Search the hash table (for now only a list) of @iocg for the io group
+ * associated with @key.  Must be called under rcu_read_lock().
+ */
+struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	void *__key;
+
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		__key = rcu_dereference(iog->key);
+		if (__key == key)
+			return iog;
+	}
+
+	return NULL;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	struct cgroup *cgroup;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	cgroup = task_cgroup(current, io_subsys_id);
+	iocg = cgroup_to_io_cgroup(cgroup);
+	iog = io_cgroup_lookup_group(iocg, efqd);
+	return iog;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct io_entity *entity = &iog->entity;
+
+	entity->weight = entity->new_weight = iocg->weight;
+	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
+	entity->ioprio_changed = 1;
+	entity->my_sched_data = &iog->sched_data;
+}
+
+void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+	struct io_entity *entity;
+
+	BUG_ON(parent == NULL);
+	BUG_ON(iog == NULL);
+
+	entity = &iog->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+void io_flush_idle_tree(struct io_service_tree *st)
+{
+	struct io_entity *entity = st->first_idle;
+
+	for (; entity != NULL; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
+
+#define SHOW_FUNCTION(__VAR)						\
+static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
+				       struct cftype *cftype)		\
+{									\
+	struct io_cgroup *iocg;					\
+	u64 ret;							\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+	spin_lock_irq(&iocg->lock);					\
+	ret = iocg->__VAR;						\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return ret;							\
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
+static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
+					struct cftype *cftype,		\
+					u64 val)			\
+{									\
+	struct io_cgroup *iocg;					\
+	struct io_group *iog;						\
+	struct hlist_node *n;						\
+									\
+	if (val < (__MIN) || val > (__MAX))				\
+		return -EINVAL;						\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+									\
+	spin_lock_irq(&iocg->lock);					\
+	iocg->__VAR = (unsigned long)val;				\
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		iog->entity.new_##__VAR = (unsigned long)val;		\
+		smp_wmb();						\
+		iog->entity.ioprio_changed = 1;				\
+	}								\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return 0;							\
+}
+
+STORE_FUNCTION(weight, 0, WEIGHT_MAX);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+/**
+ * bfq_group_chain_alloc - allocate a chain of groups.
+ * @bfqd: queue descriptor.
+ * @cgroup: the leaf cgroup this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup.  Stop if a cgroup on the chain
+ * to the root has already an allocated group on @bfqd.
+ */
+struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
+					struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *leaf = NULL, *prev = NULL;
+	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+
+	for (; cgroup != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		if (iog != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a bfq_group for bfqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		iog = kzalloc_node(sizeof(*iog), flags, q->node);
+		if (!iog)
+			goto cleanup;
+
+		io_group_init_entity(iocg, iog);
+		iog->my_entity = &iog->entity;
+
+		if (leaf == NULL) {
+			leaf = iog;
+			prev = leaf;
+		} else {
+			io_group_set_parent(prev, iog);
+			/*
+			 * Build a list of allocated nodes using the key
+			 * field, which is still unused and will be
+			 * initialized only after the node is connected.
+			 */
+			prev->key = iog;
+			prev = iog;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->key;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+/**
+ * bfq_group_chain_link - link an allocated group chain to a cgroup hierarchy.
+ * @bfqd: the queue descriptor.
+ * @cgroup: the leaf cgroup to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @bfqd, all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the bfqio_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+void io_group_chain_link(struct request_queue *q, void *key,
+				struct cgroup *cgroup,
+				struct io_group *leaf,
+				struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(q->queue_lock);
+
+	for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		next = leaf->key;
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		BUG_ON(iog != NULL);
+
+		spin_lock_irqsave(&iocg->lock, flags);
+
+		rcu_assign_pointer(leaf->key, key);
+		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+		spin_unlock_irqrestore(&iocg->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	BUG_ON(cgroup == NULL && leaf != NULL);
+
+	if (cgroup != NULL && prev != NULL) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		iog = io_cgroup_lookup_group(iocg, key);
+		io_group_set_parent(prev, iog);
+	}
+}
+
+/**
+ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
+ * @bfqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ * @create: if set to 1, create the io group if it has not been created yet.
+ *
+ * Return a group associated to @bfqd in @cgroup, allocating one if
+ * necessary.  When a group is returned all the cgroups in the path
+ * to the root have a group associated to @bfqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback.  If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+struct io_group *io_find_alloc_group(struct request_queue *q,
+			struct cgroup *cgroup, struct elv_fq_data *efqd,
+			int create)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog = NULL;
+	/* Note: Use efqd as key */
+	void *key = efqd;
+
+	iog = io_cgroup_lookup_group(iocg, key);
+	if (iog != NULL || !create)
+		return iog;
+
+	iog = io_group_chain_alloc(q, key, cgroup);
+	if (iog != NULL)
+		io_group_chain_link(q, key, cgroup, iog, efqd);
+
+	return iog;
+}
+
+/*
+ * Search for the io group current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ */
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+	struct cgroup *cgroup;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	rcu_read_lock();
+	cgroup = task_cgroup(current, io_subsys_id);
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			/*
+			 * bio merge functions doing lookup don't want to
+			 * map bio to root group by default
+			 */
+			iog = NULL;
+	}
+	rcu_read_unlock();
+	return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_cgroup *iocg = &io_root_cgroup;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_group *iog = efqd->root_group;
+
+	BUG_ON(!iog);
+	spin_lock_irq(&iocg->lock);
+	hlist_del_rcu(&iog->group_node);
+	spin_unlock_irq(&iocg->lock);
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	iog->entity.parent = NULL;
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	iocg = &io_root_cgroup;
+	spin_lock_irq(&iocg->lock);
+	rcu_assign_pointer(iog->key, key);
+	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+	spin_unlock_irq(&iocg->lock);
+
+	return iog;
+}
+
+struct cftype bfqio_files[] = {
+	{
+		.name = "weight",
+		.read_u64 = io_cgroup_weight_read,
+		.write_u64 = io_cgroup_weight_write,
+	},
+	{
+		.name = "ioprio_class",
+		.read_u64 = io_cgroup_ioprio_class_read,
+		.write_u64 = io_cgroup_ioprio_class_write,
+	},
+};
+
+int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	return cgroup_add_files(cgroup, subsys, bfqio_files,
+				ARRAY_SIZE(bfqio_files));
+}
+
+struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
+						struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+
+	if (cgroup->parent != NULL) {
+		iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
+		if (iocg == NULL)
+			return ERR_PTR(-ENOMEM);
+	} else
+		iocg = &io_root_cgroup;
+
+	spin_lock_init(&iocg->lock);
+	INIT_HLIST_HEAD(&iocg->group_data);
+	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
+	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+
+	return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic/bfqq data structures.  For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+int iocg_can_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			    struct task_struct *tsk)
+{
+	struct io_context *ioc;
+	int ret = 0;
+
+	/* task_lock() is needed to avoid races with exit_io_context() */
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+		/*
+		 * ioc == NULL means that the task is either too young or
+		 * exiting: if it still has no ioc the ioc can't be shared,
+		 * if the task is exiting the attach will fail anyway, no
+		 * matter what we return here.
+		 */
+		ret = -EINVAL;
+	task_unlock(tsk);
+
+	return ret;
+}
+
+void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			 struct cgroup *prev, struct task_struct *tsk)
+{
+	struct io_context *ioc;
+
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL)
+		ioc->cgroup_changed = 1;
+	task_unlock(tsk);
+}
+
+/*
+ * Move the queue to the root group if it is active. This is needed when
+ * a cgroup is being deleted and all the IO is not done yet. This is not
+ * very good scheme as a user might get unfair share. This needs to be
+ * fixed.
+ */
+void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+				struct io_group *iog)
+{
+	int busy, resume;
+	struct io_entity *entity = &ioq->entity;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	busy = elv_ioq_busy(ioq);
+	resume = !!ioq->nr_queued;
+
+	BUG_ON(resume && !entity->on_st);
+	BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+
+	/*
+	 * We could be moving a queue which is on the idle tree of the previous
+	 * group. What to do? Anyway, this queue does not have any requests;
+	 * just forget the entity and free it up from the idle tree.
+	 *
+	 * This needs cleanup. Hackish.
+	 */
+	if (entity->tree == &st->idle) {
+		BUG_ON(atomic_read(&ioq->ref) < 2);
+		bfq_put_idle_entity(st, entity);
+	}
+
+	if (busy) {
+		BUG_ON(atomic_read(&ioq->ref) < 2);
+
+		if (!resume)
+			elv_del_ioq_busy(e, ioq, 0);
+		else
+			elv_deactivate_ioq(efqd, ioq, 0);
+	}
+
+	/*
+	 * Here we use a reference to bfqg.  We don't need a refcounter
+	 * as the cgroup reference will not be dropped, so that its
+	 * destroy() callback will not be invoked.
+	 */
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+
+	if (busy && resume)
+		elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(io_ioq_move);
+
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	struct elevator_queue *eq;
+	struct io_entity *entity = iog->my_entity;
+	struct io_service_tree *st;
+	int i;
+
+	eq = container_of(efqd, struct elevator_queue, efqd);
+	hlist_del(&iog->elv_data_node);
+	__bfq_deactivate_entity(entity, 0);
+	io_put_io_group_queues(eq, iog);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		/*
+		 * The idle tree may still contain bfq_queues belonging
+		 * to exited tasks because they never migrated to a different
+		 * cgroup from the one being destroyed now.  No one else
+		 * can access them so it's safe to act without any lock.
+		 */
+		io_flush_idle_tree(st);
+
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+	}
+
+	BUG_ON(iog->sched_data.next_active != NULL);
+	BUG_ON(iog->sched_data.active_entity != NULL);
+	BUG_ON(entity->tree != NULL);
+}
+
+/**
+ * bfq_destroy_group - destroy @bfqg.
+ * @bgrp: the bfqio_cgroup containing @bfqg.
+ * @bfqg: the group being destroyed.
+ *
+ * Destroy @bfqg, making sure that it is not referenced from its parent.
+ */
+static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct elv_fq_data *efqd = NULL;
+	unsigned long uninitialized_var(flags);
+
+	/* Remove io group from cgroup list */
+	hlist_del(&iog->group_node);
+
+	/*
+	 * io groups are linked in two lists. One list is maintained
+	 * in the elevator (efqd->group_list) and the other is maintained
+	 * in the per-cgroup structure (iocg->group_data).
+	 *
+	 * While a cgroup is being deleted, the elevator also might be
+	 * exiting and both might try to clean up the same io group,
+	 * so we need to be a little careful.
+	 *
+	 * The following code first accesses efqd under RCU to make sure
+	 * iog->key is pointing to a valid efqd and then takes the
+	 * associated queue lock. After getting the queue lock it
+	 * again checks whether the elevator exit path has already got
+	 * hold of the io group (iog->key == NULL). If yes, it does not
+	 * try to free up async queues again or flush the idle tree.
+	 */
+
+	rcu_read_lock();
+	efqd = rcu_dereference(iog->key);
+	if (efqd != NULL) {
+		spin_lock_irqsave(efqd->queue->queue_lock, flags);
+		if (iog->key == efqd)
+			__io_destroy_group(efqd, iog);
+		spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+	}
+	rcu_read_unlock();
+
+	/*
+	 * No need to defer the kfree() to the end of the RCU grace
+	 * period: we are called from the destroy() callback of our
+	 * cgroup, so we can be sure that no one is a) still using
+	 * this cgroup or b) doing lookups in it.
+	 */
+	kfree(iog);
+}
+
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct hlist_node *n, *tmp;
+	struct io_group *iog;
+
+	/*
+	 * Since we are destroying the cgroup, there are no more tasks
+	 * referencing it, and all the RCU grace periods that may have
+	 * referenced it are ended (as the destruction of the parent
+	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
+	 * anything else and we don't need any synchronization.
+	 */
+	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
+		io_destroy_group(iocg, iog);
+
+	BUG_ON(!hlist_empty(&iocg->group_data));
+
+	kfree(iocg);
+}
+
+void io_disconnect_groups(struct elevator_queue *e)
+{
+	struct hlist_node *pos, *n;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+					elv_data_node) {
+		hlist_del(&iog->elv_data_node);
+
+		__bfq_deactivate_entity(iog->my_entity, 0);
+
+		/*
+		 * Don't remove from the group hash, just set an
+		 * invalid key.  No lookups can race with the
+		 * assignment as bfqd is being destroyed; this
+		 * implies also that new elements cannot be added
+		 * to the list.
+		 */
+		rcu_assign_pointer(iog->key, NULL);
+		io_put_io_group_queues(e, iog);
+	}
+}
+
+struct cgroup_subsys io_subsys = {
+	.name = "io",
+	.create = iocg_create,
+	.can_attach = iocg_can_attach,
+	.attach = iocg_attach,
+	.destroy = iocg_destroy,
+	.populate = iocg_populate,
+	.subsys_id = io_subsys_id,
+};
+
+/*
+ * If the bio submitting task and rq don't belong to the same io_group,
+ * they can't be merged.
+ */
+int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	struct request_queue *q = rq->q;
+	struct io_queue *ioq = rq->ioq;
+	struct io_group *iog, *__iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return 1;
+
+	/* Determine the io group of the bio submitting task */
+	iog = io_get_io_group(q, 0);
+	if (!iog) {
+		/* Maybe the task belongs to a different cgroup for which the
+		 * io group has not been set up yet. */
+		return 0;
+	}
+
+	/* Determine the io group of the ioq that rq belongs to */
+	__iog = ioq_to_io_group(ioq);
+
+	return (iog == __iog);
+}
+
+/* find/create the io group request belongs to and put that info in rq */
+void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+	struct io_group *iog;
+	unsigned long flags;
+
+	/* Make sure the io group hierarchy has been set up and also set the
+	 * io group to which rq belongs. Later we should make use of
+	 * bio cgroup patches to determine the io group */
+	spin_lock_irqsave(q->queue_lock, flags);
+	iog = io_get_io_group(q, 1);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	BUG_ON(!iog);
+
+	/* Store iog in rq. TODO: take care of referencing */
+	rq->iog = iog;
 }
 
+#else /* GROUP_IOSCHED */
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->weight = entity->new_weight;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+	return q->elevator->efqd.root_group;
+}
+
+#endif /* CONFIG_GROUP_IOSCHED*/
+
 /* Elevator fair queuing function */
 struct io_queue *rq_ioq(struct request *rq)
 {
@@ -1177,9 +2043,11 @@ EXPORT_SYMBOL(elv_put_ioq);
 
 void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
 {
+	struct io_group *root_group = e->efqd.root_group;
 	struct io_queue *ioq = *ioq_ptr;
 
 	if (ioq != NULL) {
+		io_ioq_move(e, ioq, root_group);
 		/* Drop the reference taken by the io group */
 		elv_put_ioq(ioq);
 		*ioq_ptr = NULL;
@@ -1233,14 +2101,27 @@ struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
 		return NULL;
 
 	sd = &efqd->root_group->sched_data;
-	if (extract)
-		entity = bfq_lookup_next_entity(sd, 1);
-	else
-		entity = bfq_lookup_next_entity(sd, 0);
+	for (; sd != NULL; sd = entity->my_sched_data) {
+		if (extract)
+			entity = bfq_lookup_next_entity(sd, 1);
+		else
+			entity = bfq_lookup_next_entity(sd, 0);
+
+		/*
+		 * entity can be NULL despite the fact that there are busy
+		 * queues, if all the busy queues are under a group which is
+		 * currently under service.
+		 * So if we are just looking for the next ioq while something
+		 * is being served, a NULL entity is not an error.
+		 */
+		BUG_ON(!entity && extract);
+
+		if (extract)
+			entity->service = 0;
 
-	BUG_ON(!entity);
-	if (extract)
-		entity->service = 0;
+		if (!entity)
+			return NULL;
+	}
 	ioq = io_entity_to_ioq(entity);
 
 	return ioq;
@@ -1256,8 +2137,12 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 	struct request_queue *q = efqd->queue;
 
 	if (ioq) {
-		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
-							efqd->busy_queues);
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
+				" weight=%ld group_weight=%ld",
+				efqd->busy_queues,
+				ioq->entity.ioprio, ioq->entity.weight,
+				iog_weight(iog));
 		ioq->slice_end = 0;
 
 		elv_clear_ioq_wait_request(ioq);
@@ -1492,6 +2377,7 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 {
 	struct io_queue *ioq;
 	struct elevator_queue *eq = q->elevator;
+	struct io_group *iog = NULL, *new_iog = NULL;
 
 	ioq = elv_active_ioq(eq);
 
@@ -1509,14 +2395,26 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 
 	/*
 	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+	 *
+	 * TODO: In a hierarchical setup, one needs to traverse up the
+	 * hierarchy until both queues are children of the same parent to
+	 * decide whether to do the preemption or not. Something like what
+	 * cfs has done for the cpu scheduler. Will do it a little later.
 	 */
 	if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
 		return 1;
 
+	iog = ioq_to_io_group(ioq);
+	new_iog = ioq_to_io_group(new_ioq);
+
 	/*
-	 * Check with io scheduler if it has additional criterion based on
-	 * which it wants to preempt existing queue.
+	 * If both the queues belong to same group, check with io scheduler
+	 * if it has additional criterion based on which it wants to
+	 * preempt existing queue.
 	 */
+	if (iog != new_iog)
+		return 0;
+
 	if (eq->ops->elevator_should_preempt_fn)
 		return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
 
@@ -1938,14 +2836,6 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		elv_schedule_dispatch(q);
 }
 
-struct io_group *io_lookup_io_group_current(struct request_queue *q)
-{
-	struct elv_fq_data *efqd = &q->elevator->efqd;
-
-	return efqd->root_group;
-}
-EXPORT_SYMBOL(io_lookup_io_group_current);
-
 void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio)
 {
@@ -1996,44 +2886,6 @@ void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 }
 EXPORT_SYMBOL(io_group_set_async_queue);
 
-/*
- * Release all the io group references to its async queues.
- */
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
-{
-	int i, j;
-
-	for (i = 0; i < 2; i++)
-		for (j = 0; j < IOPRIO_BE_NR; j++)
-			elv_release_ioq(e, &iog->async_queue[i][j]);
-
-	/* Free up async idle queue */
-	elv_release_ioq(e, &iog->async_idle_queue);
-}
-
-struct io_group *io_alloc_root_group(struct request_queue *q,
-					struct elevator_queue *e, void *key)
-{
-	struct io_group *iog;
-	int i;
-
-	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
-	if (iog == NULL)
-		return NULL;
-
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
-		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
-
-	return iog;
-}
-
-void io_free_root_group(struct elevator_queue *e)
-{
-	struct io_group *iog = e->efqd.root_group;
-	io_put_io_group_queues(e, iog);
-	kfree(iog);
-}
-
 static void elv_slab_kill(void)
 {
 	/*
@@ -2079,6 +2931,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
 
 	INIT_LIST_HEAD(&efqd->idle_list);
+	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
@@ -2108,10 +2961,14 @@ void elv_exit_fq_data(struct elevator_queue *e)
 	spin_lock_irq(q->queue_lock);
 	/* This should drop all the idle tree references of ioq */
 	elv_free_idle_ioq_list(e);
+	/* This should drop all the io group references of async queues */
+	io_disconnect_groups(e);
 	spin_unlock_irq(q->queue_lock);
 
 	elv_shutdown_timer_wq(e);
 
+	/* Wait for iog->key accessors to exit their grace periods. */
+	synchronize_rcu();
 	BUG_ON(timer_pending(&efqd->idle_slice_timer));
 	io_free_root_group(e);
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index ce2d671..8c60cf7 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -9,11 +9,13 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/cgroup.h>
 
 #ifndef _BFQ_SCHED_H
 #define _BFQ_SCHED_H
 
 #define IO_IOPRIO_CLASSES	3
+#define WEIGHT_MAX 		1000
 
 typedef u64 bfq_timestamp_t;
 typedef unsigned long bfq_weight_t;
@@ -69,6 +71,7 @@ struct io_service_tree {
  */
 struct io_sched_data {
 	struct io_entity *active_entity;
+	struct io_entity *next_active;
 	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
 };
 
@@ -84,13 +87,12 @@ struct io_sched_data {
  *             this entity; used for O(log N) lookups into active trees.
  * @service: service received during the last round of service.
  * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
- * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
  * @parent: parent entity, for hierarchical scheduling.
  * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
  *                 associated scheduler queue, %NULL on leaf nodes.
  * @sched_data: the scheduler queue this entity belongs to.
- * @ioprio: the ioprio in use.
- * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @weight: the weight in use.
+ * @new_weight: when a weight change is requested, the new weight value
  * @ioprio_class: the ioprio_class in use.
  * @new_ioprio_class: when an ioprio_class change is requested, the new
  *                    ioprio_class value.
@@ -132,13 +134,13 @@ struct io_entity {
 	bfq_timestamp_t min_start;
 
 	bfq_service_t service, budget;
-	bfq_weight_t weight;
 
 	struct io_entity *parent;
 
 	struct io_sched_data *my_sched_data;
 	struct io_sched_data *sched_data;
 
+	bfq_weight_t weight, new_weight;
 	unsigned short ioprio, new_ioprio;
 	unsigned short ioprio_class, new_ioprio_class;
 
@@ -180,6 +182,75 @@ struct io_queue {
 	void *sched_queue;
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both bfq_queues and bfq_groups).
+ * @group_node: node to be inserted into the bfqio_cgroup->group_data
+ *              list of the containing cgroup's bfqio_cgroup.
+ * @bfqd_node: node to be inserted into the @bfqd->group_list list
+ *             of the groups active on the same device; used for cleanup.
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/migration.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @group_node is protected by the bfqio_cgroup lock, and is accessed
+ *      via RCU from its readers.
+ *    o @bfqd is protected by the queue lock, RCU is used to access it
+ *      from the readers.
+ *    o All the other fields are protected by the @bfqd queue lock.
+ */
+struct io_group {
+	struct io_entity entity;
+	struct hlist_node elv_data_node;
+	struct hlist_node group_node;
+	struct io_sched_data sched_data;
+
+	struct io_entity *my_entity;
+
+	/*
+	 * A cgroup has multiple io_groups, one for each request queue.
+	 * To find the io group belonging to a particular queue, the elv_fq_data
+	 * pointer is stored as a key.
+	 */
+	void *key;
+
+	/* async_queue and idle_queue are used only for cfq */
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+};
+
+/**
+ * struct bfqio_cgroup - bfq cgroup data structure.
+ * @css: subsystem state for bfq in the containing cgroup.
+ * @weight: cgroup weight.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @weight, @ioprio_class and @group_data.
+ * @group_data: list containing the bfq_group belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates,
+ * @weight and @ioprio_class are protected by @lock.
+ */
+struct io_cgroup {
+	struct cgroup_subsys_state css;
+
+	unsigned long weight, ioprio_class;
+
+	spinlock_t lock;
+	struct hlist_head group_data;
+};
+#else
 struct io_group {
 	struct io_sched_data sched_data;
 
@@ -187,10 +258,14 @@ struct io_group {
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
 };
+#endif
 
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	/* List of io groups hanging on this elevator */
+	struct hlist_head group_list;
+
 	/* List of io queues on idle tree. */
 	struct list_head idle_list;
 
@@ -375,9 +450,20 @@ static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
 	ioq->entity.ioprio_changed = 1;
 }
 
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
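+ *
+ * For example, with IOPRIO_BE_NR == 8 and WEIGHT_MAX == 1000, the default
+ * ioprio of 4 maps to ((8 - 4) * 1000) / 8 == 500, the default group weight.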
+ */
+static inline bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX)/IOPRIO_BE_NR;
+}
+
 static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
 {
 	ioq->entity.new_ioprio = ioprio;
+	ioq->entity.new_weight = bfq_ioprio_to_weight(ioprio);
 	ioq->entity.ioprio_changed = 1;
 }
 
@@ -394,6 +480,50 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 						sched_data);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+					struct io_group *iog);
+extern void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq);
+static inline bfq_weight_t iog_weight(struct io_group *iog)
+{
+	return iog->entity.weight;
+}
+
+#else /* !GROUP_IOSCHED */
+/*
+ * No ioq movement is needed in case of a flat setup. The root io group gets
+ * cleaned up upon elevator exit, and before that it is made sure that both
+ * the active and idle trees are empty.
+ */
+static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+					struct io_group *iog)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
+/*
+ * Currently the root group is not part of the elevator group list and is
+ * freed separately. Hence, in a non-hierarchical setup there is nothing to do.
+ */
+static inline void io_disconnect_groups(struct elevator_queue *e) {}
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline bfq_weight_t iog_weight(struct io_group *iog)
+{
+	/* Only the root group is present and its weight is immaterial. */
+	return 0;
+}
+
+#endif /* GROUP_IOSCHED */
+
 /* Functions used by blksysfs.c */
 extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
 extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
@@ -495,5 +625,16 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
 {
 	return NULL;
 }
+
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index c2f07f5..4321169 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -105,6 +105,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (bio_integrity(bio) != blk_integrity_rq(rq))
 		return 0;
 
+	/* If rq and bio belong to different groups, don't allow merging */
+	if (!io_group_allow_merge(rq, bio))
+		return 0;
+
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;
 
@@ -913,6 +917,8 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_set_request_io_group(q, rq);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4634949..9c209a0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -249,7 +249,12 @@ struct request {
 #ifdef CONFIG_ELV_FAIR_QUEUING
 	/* io queue request belongs to */
 	struct io_queue *ioq;
-#endif
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* io group request belongs to */
+	struct io_group *iog;
+#endif /* GROUP_IOSCHED */
+#endif /* ELV_FAIR_QUEUING */
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..68ea6bd 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,10 @@ SUBSYS(net_cls)
 #endif
 
 /* */
+
+#ifdef CONFIG_GROUP_IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
+
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 08b987b..51664bb 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -73,6 +73,11 @@ struct io_context {
 	unsigned short ioprio;
 	unsigned short ioprio_changed;
 
+#ifdef CONFIG_GROUP_IOSCHED
+	/* If the task changes its cgroup, the elevator processes it asynchronously */
+	unsigned short cgroup_changed;
+#endif
+
 	/*
 	 * For request batching
 	 */
diff --git a/init/Kconfig b/init/Kconfig
index 7be4d38..ab76477 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -606,6 +606,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
 	  size is 4096bytes, 512k per 1Gbytes of swap.
 
+config GROUP_IOSCHED
+	bool "Group IO Scheduler"
+	depends on CGROUPS && ELV_FAIR_QUEUING
+	default n
+	---help---
+	  This feature lets the IO scheduler recognize task groups and control
+	  disk bandwidth allocation to such task groups.
+
 endif # CGROUPS
 
 config MM_OWNER
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (7 preceding siblings ...)
  2009-05-05 19:58 ` [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-07  7:42   ` Gui Jianfeng
                     ` (2 more replies)
  2009-05-05 19:58 ` [PATCH 06/18] io-controller: cfq changes to use " Vivek Goyal
                   ` (28 subsequent siblings)
  37 siblings, 3 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda
  Cc: vgoyal, akpm

This patch enables hierarchical fair queuing in the common elevator layer.
It is controlled by the config option CONFIG_GROUP_IOSCHED.

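For readers new to the hierarchical bits: the patch charges service done by a
leaf io queue to every level of the group tree by walking entity->parent (see
for_each_entity() and entity_served() below). A minimal userspace sketch of
that walk (illustrative only; the struct, names and numbers here are made up
and this is not the kernel code):

#include <stdio.h>

struct entity {
	const char *name;
	unsigned long weight;	/* share relative to siblings */
	unsigned long service;	/* service received so far */
	struct entity *parent;	/* NULL for the root group */
};

/* Charge 'served' to the entity and every ancestor up to the root. */
static void charge_service(struct entity *e, unsigned long served)
{
	for (; e != NULL; e = e->parent)
		e->service += served;
}

int main(void)
{
	struct entity root  = { "root",  1000, 0, NULL  };
	struct entity grp   = { "grpA",   500, 0, &root };
	struct entity queue = { "ioq1",   250, 0, &grp  };

	charge_service(&queue, 4096);	/* 4KB dispatched from ioq1 */

	printf("%s=%lu %s=%lu %s=%lu\n",
	       queue.name, queue.service,
	       grp.name, grp.service,
	       root.name, root.service);
	return 0;
}
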
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-ioc.c               |    3 +
 block/elevator-fq.c           | 1037 +++++++++++++++++++++++++++++++++++++----
 block/elevator-fq.h           |  149 ++++++-
 block/elevator.c              |    6 +
 include/linux/blkdev.h        |    7 +-
 include/linux/cgroup_subsys.h |    7 +
 include/linux/iocontext.h     |    5 +
 init/Kconfig                  |    8 +
 8 files changed, 1127 insertions(+), 95 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 012f065..8f0f6cf 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -95,6 +95,9 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 		spin_lock_init(&ret->lock);
 		ret->ioprio_changed = 0;
 		ret->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+		ret->cgroup_changed = 0;
+#endif
 		ret->last_waited = jiffies; /* doesn't matter... */
 		ret->nr_batch_requests = 0; /* because this is 0 */
 		ret->aic = NULL;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9f1fbb9..cdaa46f 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -24,6 +24,10 @@ static int elv_rate_sampling_window = HZ / 10;
 
 #define ELV_SLICE_SCALE		(5)
 #define ELV_HW_QUEUE_MIN	(5)
+
+#define IO_DEFAULT_GRP_WEIGHT  500
+#define IO_DEFAULT_GRP_CLASS   IOPRIO_CLASS_BE
+
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
@@ -31,6 +35,7 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
 					struct io_queue *ioq, int probe);
 struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 						 int extract);
+void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
 
 static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
 					unsigned short prio)
@@ -49,6 +54,73 @@ elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
 }
 
 /* Mainly the BFQ scheduling code Follows */
+#ifdef CONFIG_GROUP_IOSCHED
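+/*
+ * Walk an entity and all its ancestors up to the root group. The _safe
+ * variant caches the parent before the loop body runs, so the current
+ * entity may be deactivated (and its tree position changed) inside the
+ * loop.
+ */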
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract);
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue);
+void elv_activate_ioq(struct io_queue *ioq, int add_front);
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue);
+
+static int bfq_update_next_active(struct io_sched_data *sd)
+{
+	struct io_group *iog;
+	struct io_entity *entity, *next_active;
+
+	if (sd->active_entity != NULL)
+		/* will update/requeue at the end of service */
+		return 0;
+
+	/*
+	 * NOTE: this can be improved in many ways, such as returning
+	 * 1 (and thus propagating upwards the update) only when the
+	 * budget changes, or caching the bfqq that will be scheduled
+	 * next from this subtree.  For now we worry more about
+	 * correctness than about performance...
+	 */
+	next_active = bfq_lookup_next_entity(sd, 0);
+	sd->next_active = next_active;
+
+	if (next_active != NULL) {
+		iog = container_of(sd, struct io_group, sched_data);
+		entity = iog->my_entity;
+		if (entity != NULL)
+			entity->budget = next_active->budget;
+	}
+
+	return 1;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+					 struct io_entity *entity)
+{
+	BUG_ON(sd->next_active != entity);
+}
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+	for (parent = NULL; entity != NULL; entity = parent)
+
+static inline int bfq_update_next_active(struct io_sched_data *sd)
+{
+	return 0;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+					 struct io_entity *entity)
+{
+}
+#endif
 
 /*
  * Shift for timestamp calculations.  This actually limits the maximum
@@ -295,16 +367,6 @@ static void bfq_active_insert(struct io_service_tree *st,
 	bfq_update_active_tree(node);
 }
 
-/**
- * bfq_ioprio_to_weight - calc a weight from an ioprio.
- * @ioprio: the ioprio value to convert.
- */
-static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
-{
-	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
-	return IOPRIO_BE_NR - ioprio;
-}
-
 void bfq_get_entity(struct io_entity *entity)
 {
 	struct io_queue *ioq = io_entity_to_ioq(entity);
@@ -313,13 +375,6 @@ void bfq_get_entity(struct io_entity *entity)
 		elv_get_ioq(ioq);
 }
 
-void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
-{
-	entity->ioprio = entity->new_ioprio;
-	entity->ioprio_class = entity->new_ioprio_class;
-	entity->sched_data = &iog->sched_data;
-}
-
 /**
  * bfq_find_deepest - find the deepest node that an extraction can modify.
  * @node: the node being removed.
@@ -462,8 +517,10 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
 	struct io_queue *ioq = io_entity_to_ioq(entity);
 
 	if (entity->ioprio_changed) {
+		old_st->wsum -= entity->weight;
 		entity->ioprio = entity->new_ioprio;
 		entity->ioprio_class = entity->new_ioprio_class;
+		entity->weight = entity->new_weight;
 		entity->ioprio_changed = 0;
 
 		/*
@@ -475,9 +532,6 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
 			entity->budget = elv_prio_to_slice(efqd, ioq);
 		}
 
-		old_st->wsum -= entity->weight;
-		entity->weight = bfq_ioprio_to_weight(entity->ioprio);
-
 		/*
 		 * NOTE: here we may be changing the weight too early,
 		 * this will cause unfairness.  The correct approach
@@ -559,11 +613,8 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 	if (add_front) {
 		struct io_entity *next_entity;
 
-		/*
-		 * Determine the entity which will be dispatched next
-		 * Use sd->next_active once hierarchical patch is applied
-		 */
-		next_entity = bfq_lookup_next_entity(sd, 0);
+		/* Determine the entity which will be dispatched next */
+		next_entity = sd->next_active;
 
 		if (next_entity && next_entity != entity) {
 			struct io_service_tree *new_st;
@@ -590,12 +641,27 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 }
 
 /**
- * bfq_activate_entity - activate an entity.
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
  * @entity: the entity to activate.
+ * Activate @entity and all the entities on the path from it to the root.
  */
 void bfq_activate_entity(struct io_entity *entity, int add_front)
 {
-	__bfq_activate_entity(entity, add_front);
+	struct io_sched_data *sd;
+
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity, add_front);
+
+		add_front = 0;
+		sd = entity->sched_data;
+		if (!bfq_update_next_active(sd))
+			/*
+			 * No need to propagate the activation to the
+			 * upper entities, as they will be updated when
+			 * the active entity is rescheduled.
+			 */
+			break;
+	}
 }
 
 /**
@@ -631,12 +697,16 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 	else if (entity->tree != NULL)
 		BUG();
 
+	if (was_active || sd->next_active == entity)
+		ret = bfq_update_next_active(sd);
+
 	if (!requeue || !bfq_gt(entity->finish, st->vtime))
 		bfq_forget_entity(st, entity);
 	else
 		bfq_idle_insert(st, entity);
 
 	BUG_ON(sd->active_entity == entity);
+	BUG_ON(sd->next_active == entity);
 
 	return ret;
 }
@@ -648,7 +718,46 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
  */
 void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 {
-	__bfq_deactivate_entity(entity, requeue);
+	struct io_sched_data *sd;
+	struct io_entity *parent;
+
+	for_each_entity_safe(entity, parent) {
+		sd = entity->sched_data;
+
+		if (!__bfq_deactivate_entity(entity, requeue))
+			/*
+			 * The parent entity is still backlogged, and
+			 * we don't need to update it as it is still
+			 * under service.
+			 */
+			break;
+
+		if (sd->next_active != NULL)
+			/*
+			 * The parent entity is still backlogged and
+			 * the budgets on the path towards the root
+			 * need to be updated.
+			 */
+			goto update;
+
+		/*
+		 * If we reach this point the parent is no longer backlogged and
+		 * we want to propagate the dequeue upwards.
+		 */
+		requeue = 1;
+	}
+
+	return;
+
+update:
+	entity = parent;
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity, 0);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_active(sd))
+			break;
+	}
 }
 
 /**
@@ -765,8 +874,10 @@ struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 		entity = __bfq_lookup_next_entity(st);
 		if (entity != NULL) {
 			if (extract) {
+				bfq_check_next_active(sd, entity);
 				bfq_active_extract(st, entity);
 				sd->active_entity = entity;
+				sd->next_active = NULL;
 			}
 			break;
 		}
@@ -779,13 +890,768 @@ void entity_served(struct io_entity *entity, bfq_service_t served)
 {
 	struct io_service_tree *st;
 
-	st = io_entity_service_tree(entity);
-	entity->service += served;
-	BUG_ON(st->wsum == 0);
-	st->vtime += bfq_delta(served, st->wsum);
-	bfq_forget_idle(st);
+	for_each_entity(entity) {
+		st = io_entity_service_tree(entity);
+		entity->service += served;
+		BUG_ON(st->wsum == 0);
+		st->vtime += bfq_delta(served, st->wsum);
+		bfq_forget_idle(st);
+	}
+}
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+
+/* Mainly hierarchical grouping code */
+#ifdef CONFIG_GROUP_IOSCHED
+
+struct io_cgroup io_root_cgroup = {
+	.weight = IO_DEFAULT_GRP_WEIGHT,
+	.ioprio_class = IO_DEFAULT_GRP_CLASS,
+};
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->weight = entity->new_weight;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+}
+
+struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
+{
+	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
+			    struct io_cgroup, css);
+}
+
+/*
+ * Search for the bfq_group associated with bfqd in the hash table (for now
+ * only a list) of bgrp.  Must be called under rcu_read_lock().
+ */
+struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	void *__key;
+
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		__key = rcu_dereference(iog->key);
+		if (__key == key)
+			return iog;
+	}
+
+	return NULL;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	struct cgroup *cgroup;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	cgroup = task_cgroup(current, io_subsys_id);
+	iocg = cgroup_to_io_cgroup(cgroup);
+	iog = io_cgroup_lookup_group(iocg, efqd);
+	return iog;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct io_entity *entity = &iog->entity;
+
+	entity->weight = entity->new_weight = iocg->weight;
+	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
+	entity->ioprio_changed = 1;
+	entity->my_sched_data = &iog->sched_data;
+}
+
+void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+	struct io_entity *entity;
+
+	BUG_ON(parent == NULL);
+	BUG_ON(iog == NULL);
+
+	entity = &iog->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+void io_flush_idle_tree(struct io_service_tree *st)
+{
+	struct io_entity *entity = st->first_idle;
+
+	for (; entity != NULL; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
+
+#define SHOW_FUNCTION(__VAR)						\
+static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
+				       struct cftype *cftype)		\
+{									\
+	struct io_cgroup *iocg;					\
+	u64 ret;							\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+	spin_lock_irq(&iocg->lock);					\
+	ret = iocg->__VAR;						\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return ret;							\
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
+static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
+					struct cftype *cftype,		\
+					u64 val)			\
+{									\
+	struct io_cgroup *iocg;					\
+	struct io_group *iog;						\
+	struct hlist_node *n;						\
+									\
+	if (val < (__MIN) || val > (__MAX))				\
+		return -EINVAL;						\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+									\
+	spin_lock_irq(&iocg->lock);					\
+	iocg->__VAR = (unsigned long)val;				\
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		iog->entity.new_##__VAR = (unsigned long)val;		\
+		smp_wmb();						\
+		iog->entity.ioprio_changed = 1;				\
+	}								\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return 0;							\
+}
+
+STORE_FUNCTION(weight, 0, WEIGHT_MAX);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+/**
+ * bfq_group_chain_alloc - allocate a chain of groups.
+ * @bfqd: queue descriptor.
+ * @cgroup: the leaf cgroup this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup.  Stop if a cgroup on the chain
+ * to the root already has an allocated group on @bfqd.
+ */
+struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
+					struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *leaf = NULL, *prev = NULL;
+	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+
+	for (; cgroup != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		if (iog != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a bfq_group for bfqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		iog = kzalloc_node(sizeof(*iog), flags, q->node);
+		if (!iog)
+			goto cleanup;
+
+		io_group_init_entity(iocg, iog);
+		iog->my_entity = &iog->entity;
+
+		if (leaf == NULL) {
+			leaf = iog;
+			prev = leaf;
+		} else {
+			io_group_set_parent(prev, iog);
+			/*
+			 * Build a list of allocated nodes using the bfqd
+			 * field, which is still unused and will be initialized
+			 * only after the node is connected.
+			 */
+			prev->key = iog;
+			prev = iog;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->key;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+/**
+ * bfq_group_chain_link - link an allocated group chain to a cgroup hierarchy.
+ * @bfqd: the queue descriptor.
+ * @cgroup: the leaf cgroup to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @bfqd, all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the bfqio_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+void io_group_chain_link(struct request_queue *q, void *key,
+				struct cgroup *cgroup,
+				struct io_group *leaf,
+				struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(q->queue_lock);
+
+	for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		next = leaf->key;
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		BUG_ON(iog != NULL);
+
+		spin_lock_irqsave(&iocg->lock, flags);
+
+		rcu_assign_pointer(leaf->key, key);
+		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+		spin_unlock_irqrestore(&iocg->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	BUG_ON(cgroup == NULL && leaf != NULL);
+
+	if (cgroup != NULL && prev != NULL) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		iog = io_cgroup_lookup_group(iocg, key);
+		io_group_set_parent(prev, iog);
+	}
+}
+
+/**
+ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
+ * @bfqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ * @create: if set to 1, create the io group if it has not been created yet.
+ *
+ * Return a group associated to @bfqd in @cgroup, allocating one if
+ * necessary.  When a group is returned all the cgroups in the path
+ * to the root have a group associated to @bfqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback.  If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+struct io_group *io_find_alloc_group(struct request_queue *q,
+			struct cgroup *cgroup, struct elv_fq_data *efqd,
+			int create)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog = NULL;
+	/* Note: Use efqd as key */
+	void *key = efqd;
+
+	iog = io_cgroup_lookup_group(iocg, key);
+	if (iog != NULL || !create)
+		return iog;
+
+	iog = io_group_chain_alloc(q, key, cgroup);
+	if (iog != NULL)
+		io_group_chain_link(q, key, cgroup, iog, efqd);
+
+	return iog;
+}
+
+/*
+ * Search for the io group the current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ */
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+	struct cgroup *cgroup;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	rcu_read_lock();
+	cgroup = task_cgroup(current, io_subsys_id);
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			/*
+			 * bio merge functions doing lookup don't want to
+			 * map bio to root group by default
+			 */
+			iog = NULL;
+	}
+	rcu_read_unlock();
+	return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_cgroup *iocg = &io_root_cgroup;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_group *iog = efqd->root_group;
+
+	BUG_ON(!iog);
+	spin_lock_irq(&iocg->lock);
+	hlist_del_rcu(&iog->group_node);
+	spin_unlock_irq(&iocg->lock);
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	iog->entity.parent = NULL;
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	iocg = &io_root_cgroup;
+	spin_lock_irq(&iocg->lock);
+	rcu_assign_pointer(iog->key, key);
+	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+	spin_unlock_irq(&iocg->lock);
+
+	return iog;
+}
+
+struct cftype bfqio_files[] = {
+	{
+		.name = "weight",
+		.read_u64 = io_cgroup_weight_read,
+		.write_u64 = io_cgroup_weight_write,
+	},
+	{
+		.name = "ioprio_class",
+		.read_u64 = io_cgroup_ioprio_class_read,
+		.write_u64 = io_cgroup_ioprio_class_write,
+	},
+};
+
+int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	return cgroup_add_files(cgroup, subsys, bfqio_files,
+				ARRAY_SIZE(bfqio_files));
+}
+
+struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
+						struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+
+	if (cgroup->parent != NULL) {
+		iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
+		if (iocg == NULL)
+			return ERR_PTR(-ENOMEM);
+	} else
+		iocg = &io_root_cgroup;
+
+	spin_lock_init(&iocg->lock);
+	INIT_HLIST_HEAD(&iocg->group_data);
+	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
+	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+
+	return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic/bfqq data structures.  For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+int iocg_can_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			    struct task_struct *tsk)
+{
+	struct io_context *ioc;
+	int ret = 0;
+
+	/* task_lock() is needed to avoid races with exit_io_context() */
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+		/*
+		 * ioc == NULL means that the task is either too young or
+		 * exiting: if it still has no ioc, the ioc can't be shared;
+		 * if the task is exiting, the attach will fail anyway, no
+		 * matter what we return here.
+		 */
+		ret = -EINVAL;
+	task_unlock(tsk);
+
+	return ret;
+}
+
+void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			 struct cgroup *prev, struct task_struct *tsk)
+{
+	struct io_context *ioc;
+
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL)
+		ioc->cgroup_changed = 1;
+	task_unlock(tsk);
+}
+
+/*
+ * Move the queue to the root group if it is active. This is needed when
+ * a cgroup is being deleted and not all of its IO is done yet. This is not
+ * a very good scheme, as a user might get an unfair share. This needs to be
+ * fixed.
+ */
+void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+				struct io_group *iog)
+{
+	int busy, resume;
+	struct io_entity *entity = &ioq->entity;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	busy = elv_ioq_busy(ioq);
+	resume = !!ioq->nr_queued;
+
+	BUG_ON(resume && !entity->on_st);
+	BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+
+	/*
+	 * We could be moving a queue which is on the idle tree of the previous
+	 * group. What to do? Anyway, this queue does not have any requests;
+	 * just forget the entity and free it up from the idle tree.
+	 *
+	 * This needs cleanup. Hackish.
+	 */
+	if (entity->tree == &st->idle) {
+		BUG_ON(atomic_read(&ioq->ref) < 2);
+		bfq_put_idle_entity(st, entity);
+	}
+
+	if (busy) {
+		BUG_ON(atomic_read(&ioq->ref) < 2);
+
+		if (!resume)
+			elv_del_ioq_busy(e, ioq, 0);
+		else
+			elv_deactivate_ioq(efqd, ioq, 0);
+	}
+
+	/*
+	 * Here we use a reference to bfqg.  We don't need a refcounter
+	 * as the cgroup reference will not be dropped, so that its
+	 * destroy() callback will not be invoked.
+	 */
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+
+	if (busy && resume)
+		elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(io_ioq_move);
+
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	struct elevator_queue *eq;
+	struct io_entity *entity = iog->my_entity;
+	struct io_service_tree *st;
+	int i;
+
+	eq = container_of(efqd, struct elevator_queue, efqd);
+	hlist_del(&iog->elv_data_node);
+	__bfq_deactivate_entity(entity, 0);
+	io_put_io_group_queues(eq, iog);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		/*
+		 * The idle tree may still contain bfq_queues belonging
+		 * to exited tasks because they never migrated to a different
+		 * cgroup from the one being destroyed now.  No one else
+		 * can access them so it's safe to act without any lock.
+		 */
+		io_flush_idle_tree(st);
+
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+	}
+
+	BUG_ON(iog->sched_data.next_active != NULL);
+	BUG_ON(iog->sched_data.active_entity != NULL);
+	BUG_ON(entity->tree != NULL);
+}
+
+/**
+ * bfq_destroy_group - destroy @bfqg.
+ * @bgrp: the bfqio_cgroup containing @bfqg.
+ * @bfqg: the group being destroyed.
+ *
+ * Destroy @bfqg, making sure that it is not referenced from its parent.
+ */
+static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct elv_fq_data *efqd = NULL;
+	unsigned long uninitialized_var(flags);
+
+	/* Remove io group from cgroup list */
+	hlist_del(&iog->group_node);
+
+	/*
+	 * io groups are linked in two lists. One list is maintained
+	 * in the elevator (efqd->group_list) and the other is maintained
+	 * in the per-cgroup structure (iocg->group_data).
+	 *
+	 * While a cgroup is being deleted, the elevator also might be
+	 * exiting and both might try to clean up the same io group,
+	 * so we need to be a little careful.
+	 *
+	 * The following code first accesses efqd under RCU to make sure
+	 * iog->key is pointing to a valid efqd and then takes the
+	 * associated queue lock. After getting the queue lock it
+	 * again checks whether the elevator exit path has already got
+	 * hold of the io group (iog->key == NULL). If yes, it does not
+	 * try to free up async queues again or flush the idle tree.
+	 */
+
+	rcu_read_lock();
+	efqd = rcu_dereference(iog->key);
+	if (efqd != NULL) {
+		spin_lock_irqsave(efqd->queue->queue_lock, flags);
+		if (iog->key == efqd)
+			__io_destroy_group(efqd, iog);
+		spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+	}
+	rcu_read_unlock();
+
+	/*
+	 * No need to defer the kfree() to the end of the RCU grace
+	 * period: we are called from the destroy() callback of our
+	 * cgroup, so we can be sure that no one is a) still using
+	 * this cgroup or b) doing lookups in it.
+	 */
+	kfree(iog);
+}
+
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct hlist_node *n, *tmp;
+	struct io_group *iog;
+
+	/*
+	 * Since we are destroying the cgroup, there are no more tasks
+	 * referencing it, and all the RCU grace periods that may have
+	 * referenced it are ended (as the destruction of the parent
+	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
+	 * anything else and we don't need any synchronization.
+	 */
+	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
+		io_destroy_group(iocg, iog);
+
+	BUG_ON(!hlist_empty(&iocg->group_data));
+
+	kfree(iocg);
+}
+
+void io_disconnect_groups(struct elevator_queue *e)
+{
+	struct hlist_node *pos, *n;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+					elv_data_node) {
+		hlist_del(&iog->elv_data_node);
+
+		__bfq_deactivate_entity(iog->my_entity, 0);
+
+		/*
+		 * Don't remove from the group hash, just set an
+		 * invalid key.  No lookups can race with the
+		 * assignment as bfqd is being destroyed; this
+		 * implies also that new elements cannot be added
+		 * to the list.
+		 */
+		rcu_assign_pointer(iog->key, NULL);
+		io_put_io_group_queues(e, iog);
+	}
+}
+
+struct cgroup_subsys io_subsys = {
+	.name = "io",
+	.create = iocg_create,
+	.can_attach = iocg_can_attach,
+	.attach = iocg_attach,
+	.destroy = iocg_destroy,
+	.populate = iocg_populate,
+	.subsys_id = io_subsys_id,
+};
+
+/*
+ * If the bio submitting task and rq don't belong to the same io_group,
+ * they can't be merged.
+ */
+int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	struct request_queue *q = rq->q;
+	struct io_queue *ioq = rq->ioq;
+	struct io_group *iog, *__iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return 1;
+
+	/* Determine the io group of the bio submitting task */
+	iog = io_get_io_group(q, 0);
+	if (!iog) {
+		/* Maybe the task belongs to a different cgroup for which the
+		 * io group has not been set up yet. */
+		return 0;
+	}
+
+	/* Determine the io group of the ioq that rq belongs to */
+	__iog = ioq_to_io_group(ioq);
+
+	return (iog == __iog);
+}
+
+/* find/create the io group request belongs to and put that info in rq */
+void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+	struct io_group *iog;
+	unsigned long flags;
+
+	/* Make sure the io group hierarchy has been set up and also set the
+	 * io group to which rq belongs. Later we should make use of
+	 * bio cgroup patches to determine the io group */
+	spin_lock_irqsave(q->queue_lock, flags);
+	iog = io_get_io_group(q, 1);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	BUG_ON(!iog);
+
+	/* Store iog in rq. TODO: take care of referencing */
+	rq->iog = iog;
 }
 
+#else /* GROUP_IOSCHED */
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->weight = entity->new_weight;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+	return q->elevator->efqd.root_group;
+}
+
+#endif /* CONFIG_GROUP_IOSCHED*/
+
 /* Elevator fair queuing function */
 struct io_queue *rq_ioq(struct request *rq)
 {
@@ -1177,9 +2043,11 @@ EXPORT_SYMBOL(elv_put_ioq);
 
 void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
 {
+	struct io_group *root_group = e->efqd.root_group;
 	struct io_queue *ioq = *ioq_ptr;
 
 	if (ioq != NULL) {
+		io_ioq_move(e, ioq, root_group);
 		/* Drop the reference taken by the io group */
 		elv_put_ioq(ioq);
 		*ioq_ptr = NULL;
@@ -1233,14 +2101,27 @@ struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
 		return NULL;
 
 	sd = &efqd->root_group->sched_data;
-	if (extract)
-		entity = bfq_lookup_next_entity(sd, 1);
-	else
-		entity = bfq_lookup_next_entity(sd, 0);
+	for (; sd != NULL; sd = entity->my_sched_data) {
+		if (extract)
+			entity = bfq_lookup_next_entity(sd, 1);
+		else
+			entity = bfq_lookup_next_entity(sd, 0);
+
+		/*
+		 * entity can be NULL despite the fact that there are busy
+		 * queues, if all the busy queues are under a group which is
+		 * currently under service.
+		 * So if we are just looking for the next ioq while something
+		 * is being served, a NULL entity is not an error.
+		 */
+		BUG_ON(!entity && extract);
+
+		if (extract)
+			entity->service = 0;
 
-	BUG_ON(!entity);
-	if (extract)
-		entity->service = 0;
+		if (!entity)
+			return NULL;
+	}
 	ioq = io_entity_to_ioq(entity);
 
 	return ioq;
@@ -1256,8 +2137,12 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 	struct request_queue *q = efqd->queue;
 
 	if (ioq) {
-		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
-							efqd->busy_queues);
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
+				" weight=%ld group_weight=%ld",
+				efqd->busy_queues,
+				ioq->entity.ioprio, ioq->entity.weight,
+				iog_weight(iog));
 		ioq->slice_end = 0;
 
 		elv_clear_ioq_wait_request(ioq);
@@ -1492,6 +2377,7 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 {
 	struct io_queue *ioq;
 	struct elevator_queue *eq = q->elevator;
+	struct io_group *iog = NULL, *new_iog = NULL;
 
 	ioq = elv_active_ioq(eq);
 
@@ -1509,14 +2395,26 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 
 	/*
 	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+	 *
+	 * TODO: In a hierarchical setup, one needs to traverse up the
+	 * hierarchy till both the queues are children of the same parent
+	 * to decide whether to do the preemption or not. Something like
+	 * what cfs has done for the cpu scheduler. Will do it a little later.
 	 */
 	if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
 		return 1;
 
+	iog = ioq_to_io_group(ioq);
+	new_iog = ioq_to_io_group(new_ioq);
+
 	/*
-	 * Check with io scheduler if it has additional criterion based on
-	 * which it wants to preempt existing queue.
+	 * If both the queues belong to same group, check with io scheduler
+	 * if it has additional criterion based on which it wants to
+	 * preempt existing queue.
 	 */
+	if (iog != new_iog)
+		return 0;
+
 	if (eq->ops->elevator_should_preempt_fn)
 		return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
 
@@ -1938,14 +2836,6 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		elv_schedule_dispatch(q);
 }
 
-struct io_group *io_lookup_io_group_current(struct request_queue *q)
-{
-	struct elv_fq_data *efqd = &q->elevator->efqd;
-
-	return efqd->root_group;
-}
-EXPORT_SYMBOL(io_lookup_io_group_current);
-
 void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio)
 {
@@ -1996,44 +2886,6 @@ void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 }
 EXPORT_SYMBOL(io_group_set_async_queue);
 
-/*
- * Release all the io group references to its async queues.
- */
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
-{
-	int i, j;
-
-	for (i = 0; i < 2; i++)
-		for (j = 0; j < IOPRIO_BE_NR; j++)
-			elv_release_ioq(e, &iog->async_queue[i][j]);
-
-	/* Free up async idle queue */
-	elv_release_ioq(e, &iog->async_idle_queue);
-}
-
-struct io_group *io_alloc_root_group(struct request_queue *q,
-					struct elevator_queue *e, void *key)
-{
-	struct io_group *iog;
-	int i;
-
-	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
-	if (iog == NULL)
-		return NULL;
-
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
-		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
-
-	return iog;
-}
-
-void io_free_root_group(struct elevator_queue *e)
-{
-	struct io_group *iog = e->efqd.root_group;
-	io_put_io_group_queues(e, iog);
-	kfree(iog);
-}
-
 static void elv_slab_kill(void)
 {
 	/*
@@ -2079,6 +2931,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
 
 	INIT_LIST_HEAD(&efqd->idle_list);
+	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
@@ -2108,10 +2961,14 @@ void elv_exit_fq_data(struct elevator_queue *e)
 	spin_lock_irq(q->queue_lock);
 	/* This should drop all the idle tree references of ioq */
 	elv_free_idle_ioq_list(e);
+	/* This should drop all the io group references of async queues */
+	io_disconnect_groups(e);
 	spin_unlock_irq(q->queue_lock);
 
 	elv_shutdown_timer_wq(e);
 
+	/* Wait for iog->key accessors to exit their grace periods. */
+	synchronize_rcu();
 	BUG_ON(timer_pending(&efqd->idle_slice_timer));
 	io_free_root_group(e);
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index ce2d671..8c60cf7 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -9,11 +9,13 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/cgroup.h>
 
 #ifndef _BFQ_SCHED_H
 #define _BFQ_SCHED_H
 
 #define IO_IOPRIO_CLASSES	3
+#define WEIGHT_MAX 		1000
 
 typedef u64 bfq_timestamp_t;
 typedef unsigned long bfq_weight_t;
@@ -69,6 +71,7 @@ struct io_service_tree {
  */
 struct io_sched_data {
 	struct io_entity *active_entity;
+	struct io_entity *next_active;
 	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
 };
 
@@ -84,13 +87,12 @@ struct io_sched_data {
  *             this entity; used for O(log N) lookups into active trees.
  * @service: service received during the last round of service.
  * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
- * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
  * @parent: parent entity, for hierarchical scheduling.
  * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
  *                 associated scheduler queue, %NULL on leaf nodes.
  * @sched_data: the scheduler queue this entity belongs to.
- * @ioprio: the ioprio in use.
- * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @weight: the weight in use.
+ * @new_weight: when a weight change is requested, the new weight value
  * @ioprio_class: the ioprio_class in use.
  * @new_ioprio_class: when an ioprio_class change is requested, the new
  *                    ioprio_class value.
@@ -132,13 +134,13 @@ struct io_entity {
 	bfq_timestamp_t min_start;
 
 	bfq_service_t service, budget;
-	bfq_weight_t weight;
 
 	struct io_entity *parent;
 
 	struct io_sched_data *my_sched_data;
 	struct io_sched_data *sched_data;
 
+	bfq_weight_t weight, new_weight;
 	unsigned short ioprio, new_ioprio;
 	unsigned short ioprio_class, new_ioprio_class;
 
@@ -180,6 +182,75 @@ struct io_queue {
 	void *sched_queue;
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both bfq_queues and bfq_groups).
+ * @group_node: node to be inserted into the bfqio_cgroup->group_data
+ *              list of the containing cgroup's bfqio_cgroup.
+ * @bfqd_node: node to be inserted into the @bfqd->group_list list
+ *             of the groups active on the same device; used for cleanup.
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/migration.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @group_node is protected by the bfqio_cgroup lock, and is accessed
+ *      via RCU from its readers.
+ *    o @bfqd is protected by the queue lock, RCU is used to access it
+ *      from the readers.
+ *    o All the other fields are protected by the @bfqd queue lock.
+ */
+struct io_group {
+	struct io_entity entity;
+	struct hlist_node elv_data_node;
+	struct hlist_node group_node;
+	struct io_sched_data sched_data;
+
+	struct io_entity *my_entity;
+
+	/*
+	 * A cgroup has multiple io_groups, one for each request queue.
+	 * To find the io group belonging to a particular queue, the
+	 * elv_fq_data pointer is stored as a key.
+	 */
+	void *key;
+
+	/* async_queue and idle_queue are used only for cfq */
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+};
+
+/**
+ * struct bfqio_cgroup - bfq cgroup data structure.
+ * @css: subsystem state for bfq in the containing cgroup.
+ * @weight: cgroup weight.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @weight, @ioprio_class and @group_data.
+ * @group_data: list containing the bfq_group belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates,
+ * @weight and @ioprio_class are protected by @lock.
+ */
+struct io_cgroup {
+	struct cgroup_subsys_state css;
+
+	unsigned long weight, ioprio_class;
+
+	spinlock_t lock;
+	struct hlist_head group_data;
+};
+#else
 struct io_group {
 	struct io_sched_data sched_data;
 
@@ -187,10 +258,14 @@ struct io_group {
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
 };
+#endif
 
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	/* List of io groups hanging on this elevator */
+	struct hlist_head group_list;
+
 	/* List of io queues on idle tree. */
 	struct list_head idle_list;
 
@@ -375,9 +450,20 @@ static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
 	ioq->entity.ioprio_changed = 1;
 }
 
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static inline bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX)/IOPRIO_BE_NR;
+}
+
 static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
 {
 	ioq->entity.new_ioprio = ioprio;
+	ioq->entity.new_weight = bfq_ioprio_to_weight(ioprio);
 	ioq->entity.ioprio_changed = 1;
 }
 
@@ -394,6 +480,50 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 						sched_data);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+					struct io_group *iog);
+extern void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq);
+static inline bfq_weight_t iog_weight(struct io_group *iog)
+{
+	return iog->entity.weight;
+}
+
+#else /* !GROUP_IOSCHED */
+/*
+ * No ioq movement is needed in case of a flat setup. The root io group gets
+ * cleaned up upon elevator exit, and before that it has been made sure that
+ * both the active and idle trees are empty.
+ */
+static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+					struct io_group *iog)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
+/*
+ * Currently the root group is not part of the elevator group list and is
+ * freed separately. Hence in case of a non-hierarchical setup, nothing to do.
+ */
+static inline void io_disconnect_groups(struct elevator_queue *e) {}
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline bfq_weight_t iog_weight(struct io_group *iog)
+{
+	/* Only the root group is present and its weight is immaterial. */
+	return 0;
+}
+
+#endif /* GROUP_IOSCHED */
+
 /* Functions used by blksysfs.c */
 extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
 extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
@@ -495,5 +625,16 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
 {
 	return NULL;
 }
+
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+
+{
+	return 1;
+}
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index c2f07f5..4321169 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -105,6 +105,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (bio_integrity(bio) != blk_integrity_rq(rq))
 		return 0;
 
+	/* If rq and bio belong to different groups, don't allow merging */
+	if (!io_group_allow_merge(rq, bio))
+		return 0;
+
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;
 
@@ -913,6 +917,8 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_set_request_io_group(q, rq);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4634949..9c209a0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -249,7 +249,12 @@ struct request {
 #ifdef CONFIG_ELV_FAIR_QUEUING
 	/* io queue request belongs to */
 	struct io_queue *ioq;
-#endif
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* io group request belongs to */
+	struct io_group *iog;
+#endif /* GROUP_IOSCHED */
+#endif /* ELV_FAIR_QUEUING */
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..68ea6bd 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,10 @@ SUBSYS(net_cls)
 #endif
 
 /* */
+
+#ifdef CONFIG_GROUP_IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
+
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 08b987b..51664bb 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -73,6 +73,11 @@ struct io_context {
 	unsigned short ioprio;
 	unsigned short ioprio_changed;
 
+#ifdef CONFIG_GROUP_IOSCHED
+	/* If task changes the cgroup, elevator processes it asynchronously */
+	unsigned short cgroup_changed;
+#endif
+
 	/*
 	 * For request batching
 	 */
diff --git a/init/Kconfig b/init/Kconfig
index 7be4d38..ab76477 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -606,6 +606,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
 	  size is 4096bytes, 512k per 1Gbytes of swap.
 
+config GROUP_IOSCHED
+	bool "Group IO Scheduler"
+	depends on CGROUPS && ELV_FAIR_QUEUING
+	default n
+	---help---
+	  This feature lets the IO scheduler recognize task groups and control
+	  disk bandwidth allocation to such task groups.
+
 endif # CGROUPS
 
 config MM_OWNER
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (6 preceding siblings ...)
  2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` Vivek Goyal
                   ` (29 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando
  Cc: akpm, vgoyal

This patch enables hierarchical fair queuing in the common elevator layer.
It is controlled by the config option CONFIG_GROUP_IOSCHED.
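
As a rough usage sketch (illustrative only, not part of the patch itself):
with CONFIG_GROUP_IOSCHED enabled the controller shows up as the "io" cgroup
subsystem, exposing per-group "weight" and "ioprio_class" files. The mount
point and group name below are assumptions made for the example; the valid
weight range (0..WEIGHT_MAX, i.e. 1000, default 500) comes from the patch.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	int fd;

	/* Assumes: mount -t cgroup -o io none /cgroup (path is illustrative) */
	if (mkdir("/cgroup/group1", 0755) < 0 && errno != EEXIST) {
		perror("mkdir");
		return 1;
	}

	/* Raise this group's weight to the maximum (the default is 500) */
	fd = open("/cgroup/group1/io.weight", O_WRONLY);
	if (fd < 0) {
		perror("open io.weight");
		return 1;
	}
	if (write(fd, "1000", 4) != 4)
		perror("write io.weight");
	close(fd);
	return 0;
}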

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-ioc.c               |    3 +
 block/elevator-fq.c           | 1037 +++++++++++++++++++++++++++++++++++++----
 block/elevator-fq.h           |  149 ++++++-
 block/elevator.c              |    6 +
 include/linux/blkdev.h        |    7 +-
 include/linux/cgroup_subsys.h |    7 +
 include/linux/iocontext.h     |    5 +
 init/Kconfig                  |    8 +
 8 files changed, 1127 insertions(+), 95 deletions(-)
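
One detail worth calling out before the diff: queue weights derived from
ioprio are no longer simply IOPRIO_BE_NR - ioprio; bfq_ioprio_to_weight()
now scales them into the 0..WEIGHT_MAX range, the same scale used for group
weights. A minimal user-space sketch of the new mapping, for reference only
and assuming IOPRIO_BE_NR is 8 as in mainline:

#include <stdio.h>

#define IOPRIO_BE_NR	8	/* assumed value, as in mainline */
#define WEIGHT_MAX	1000

/* mirrors the bfq_ioprio_to_weight() helper moved into elevator-fq.h below */
static unsigned long ioprio_to_weight(int ioprio)
{
	return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX) / IOPRIO_BE_NR;
}

int main(void)
{
	int ioprio;

	/* ioprio 0 (highest) -> 1000, ioprio 4 -> 500, ioprio 7 -> 125 */
	for (ioprio = 0; ioprio < IOPRIO_BE_NR; ioprio++)
		printf("ioprio %d -> weight %lu\n", ioprio,
		       ioprio_to_weight(ioprio));
	return 0;
}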

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 012f065..8f0f6cf 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -95,6 +95,9 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 		spin_lock_init(&ret->lock);
 		ret->ioprio_changed = 0;
 		ret->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+		ret->cgroup_changed = 0;
+#endif
 		ret->last_waited = jiffies; /* doesn't matter... */
 		ret->nr_batch_requests = 0; /* because this is 0 */
 		ret->aic = NULL;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9f1fbb9..cdaa46f 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -24,6 +24,10 @@ static int elv_rate_sampling_window = HZ / 10;
 
 #define ELV_SLICE_SCALE		(5)
 #define ELV_HW_QUEUE_MIN	(5)
+
+#define IO_DEFAULT_GRP_WEIGHT  500
+#define IO_DEFAULT_GRP_CLASS   IOPRIO_CLASS_BE
+
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
@@ -31,6 +35,7 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
 					struct io_queue *ioq, int probe);
 struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 						 int extract);
+void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
 
 static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
 					unsigned short prio)
@@ -49,6 +54,73 @@ elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
 }
 
 /* Mainly the BFQ scheduling code Follows */
+#ifdef CONFIG_GROUP_IOSCHED
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+
+struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract);
+void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue);
+void elv_activate_ioq(struct io_queue *ioq, int add_front);
+void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue);
+
+static int bfq_update_next_active(struct io_sched_data *sd)
+{
+	struct io_group *iog;
+	struct io_entity *entity, *next_active;
+
+	if (sd->active_entity != NULL)
+		/* will update/requeue at the end of service */
+		return 0;
+
+	/*
+	 * NOTE: this can be improved in many ways, such as returning
+	 * 1 (and thus propagating upwards the update) only when the
+	 * budget changes, or caching the bfqq that will be scheduled
+	 * next from this subtree.  For now we worry more about
+	 * correctness than about performance...
+	 */
+	next_active = bfq_lookup_next_entity(sd, 0);
+	sd->next_active = next_active;
+
+	if (next_active != NULL) {
+		iog = container_of(sd, struct io_group, sched_data);
+		entity = iog->my_entity;
+		if (entity != NULL)
+			entity->budget = next_active->budget;
+	}
+
+	return 1;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+					 struct io_entity *entity)
+{
+	BUG_ON(sd->next_active != entity);
+}
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+	for (parent = NULL; entity != NULL; entity = parent)
+
+static inline int bfq_update_next_active(struct io_sched_data *sd)
+{
+	return 0;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+					 struct io_entity *entity)
+{
+}
+#endif
 
 /*
  * Shift for timestamp calculations.  This actually limits the maximum
@@ -295,16 +367,6 @@ static void bfq_active_insert(struct io_service_tree *st,
 	bfq_update_active_tree(node);
 }
 
-/**
- * bfq_ioprio_to_weight - calc a weight from an ioprio.
- * @ioprio: the ioprio value to convert.
- */
-static bfq_weight_t bfq_ioprio_to_weight(int ioprio)
-{
-	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
-	return IOPRIO_BE_NR - ioprio;
-}
-
 void bfq_get_entity(struct io_entity *entity)
 {
 	struct io_queue *ioq = io_entity_to_ioq(entity);
@@ -313,13 +375,6 @@ void bfq_get_entity(struct io_entity *entity)
 		elv_get_ioq(ioq);
 }
 
-void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
-{
-	entity->ioprio = entity->new_ioprio;
-	entity->ioprio_class = entity->new_ioprio_class;
-	entity->sched_data = &iog->sched_data;
-}
-
 /**
  * bfq_find_deepest - find the deepest node that an extraction can modify.
  * @node: the node being removed.
@@ -462,8 +517,10 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
 	struct io_queue *ioq = io_entity_to_ioq(entity);
 
 	if (entity->ioprio_changed) {
+		old_st->wsum -= entity->weight;
 		entity->ioprio = entity->new_ioprio;
 		entity->ioprio_class = entity->new_ioprio_class;
+		entity->weight = entity->new_weight;
 		entity->ioprio_changed = 0;
 
 		/*
@@ -475,9 +532,6 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
 			entity->budget = elv_prio_to_slice(efqd, ioq);
 		}
 
-		old_st->wsum -= entity->weight;
-		entity->weight = bfq_ioprio_to_weight(entity->ioprio);
-
 		/*
 		 * NOTE: here we may be changing the weight too early,
 		 * this will cause unfairness.  The correct approach
@@ -559,11 +613,8 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 	if (add_front) {
 		struct io_entity *next_entity;
 
-		/*
-		 * Determine the entity which will be dispatched next
-		 * Use sd->next_active once hierarchical patch is applied
-		 */
-		next_entity = bfq_lookup_next_entity(sd, 0);
+		/* Determine the entity which will be dispatched next */
+		next_entity = sd->next_active;
 
 		if (next_entity && next_entity != entity) {
 			struct io_service_tree *new_st;
@@ -590,12 +641,27 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 }
 
 /**
- * bfq_activate_entity - activate an entity.
+ * bfq_activate_entity - activate an entity and its ancestors if necessary.
  * @entity: the entity to activate.
+ * Activate @entity and all the entities on the path from it to the root.
  */
 void bfq_activate_entity(struct io_entity *entity, int add_front)
 {
-	__bfq_activate_entity(entity, add_front);
+	struct io_sched_data *sd;
+
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity, add_front);
+
+		add_front = 0;
+		sd = entity->sched_data;
+		if (!bfq_update_next_active(sd))
+			/*
+			 * No need to propagate the activation to the
+			 * upper entities, as they will be updated when
+			 * the active entity is rescheduled.
+			 */
+			break;
+	}
 }
 
 /**
@@ -631,12 +697,16 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 	else if (entity->tree != NULL)
 		BUG();
 
+	if (was_active || sd->next_active == entity)
+		ret = bfq_update_next_active(sd);
+
 	if (!requeue || !bfq_gt(entity->finish, st->vtime))
 		bfq_forget_entity(st, entity);
 	else
 		bfq_idle_insert(st, entity);
 
 	BUG_ON(sd->active_entity == entity);
+	BUG_ON(sd->next_active == entity);
 
 	return ret;
 }
@@ -648,7 +718,46 @@ int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
  */
 void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 {
-	__bfq_deactivate_entity(entity, requeue);
+	struct io_sched_data *sd;
+	struct io_entity *parent;
+
+	for_each_entity_safe(entity, parent) {
+		sd = entity->sched_data;
+
+		if (!__bfq_deactivate_entity(entity, requeue))
+			/*
+			 * The parent entity is still backlogged, and
+			 * we don't need to update it as it is still
+			 * under service.
+			 */
+			break;
+
+		if (sd->next_active != NULL)
+			/*
+			 * The parent entity is still backlogged and
+			 * the budgets on the path towards the root
+			 * need to be updated.
+			 */
+			goto update;
+
+		/*
+		 * If we reach here, the parent is no longer backlogged and
+		 * we want to propagate the dequeue upwards.
+		 */
+		requeue = 1;
+	}
+
+	return;
+
+update:
+	entity = parent;
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity, 0);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_active(sd))
+			break;
+	}
 }
 
 /**
@@ -765,8 +874,10 @@ struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 		entity = __bfq_lookup_next_entity(st);
 		if (entity != NULL) {
 			if (extract) {
+				bfq_check_next_active(sd, entity);
 				bfq_active_extract(st, entity);
 				sd->active_entity = entity;
+				sd->next_active = NULL;
 			}
 			break;
 		}
@@ -779,13 +890,768 @@ void entity_served(struct io_entity *entity, bfq_service_t served)
 {
 	struct io_service_tree *st;
 
-	st = io_entity_service_tree(entity);
-	entity->service += served;
-	BUG_ON(st->wsum == 0);
-	st->vtime += bfq_delta(served, st->wsum);
-	bfq_forget_idle(st);
+	for_each_entity(entity) {
+		st = io_entity_service_tree(entity);
+		entity->service += served;
+		BUG_ON(st->wsum == 0);
+		st->vtime += bfq_delta(served, st->wsum);
+		bfq_forget_idle(st);
+	}
+}
+
+/*
+ * Release all the io group references to its async queues.
+ */
+void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+
+/* Mainly hierarchical grouping code */
+#ifdef CONFIG_GROUP_IOSCHED
+
+struct io_cgroup io_root_cgroup = {
+	.weight = IO_DEFAULT_GRP_WEIGHT,
+	.ioprio_class = IO_DEFAULT_GRP_CLASS,
+};
+
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->weight = entity->new_weight;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+}
+
+struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
+{
+	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
+			    struct io_cgroup, css);
+}
+
+/*
+ * Search for the bfq_group associated with bfqd in the hash table (for now
+ * only a list) of bgrp.  Must be called under rcu_read_lock().
+ */
+struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	void *__key;
+
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		__key = rcu_dereference(iog->key);
+		if (__key == key)
+			return iog;
+	}
+
+	return NULL;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	struct cgroup *cgroup;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	cgroup = task_cgroup(current, io_subsys_id);
+	iocg = cgroup_to_io_cgroup(cgroup);
+	iog = io_cgroup_lookup_group(iocg, efqd);
+	return iog;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct io_entity *entity = &iog->entity;
+
+	entity->weight = entity->new_weight = iocg->weight;
+	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
+	entity->ioprio_changed = 1;
+	entity->my_sched_data = &iog->sched_data;
+}
+
+void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+	struct io_entity *entity;
+
+	BUG_ON(parent == NULL);
+	BUG_ON(iog == NULL);
+
+	entity = &iog->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+}
+
+/**
+ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+void io_flush_idle_tree(struct io_service_tree *st)
+{
+	struct io_entity *entity = st->first_idle;
+
+	for (; entity != NULL; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
+
+#define SHOW_FUNCTION(__VAR)						\
+static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
+				       struct cftype *cftype)		\
+{									\
+	struct io_cgroup *iocg;					\
+	u64 ret;							\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+	spin_lock_irq(&iocg->lock);					\
+	ret = iocg->__VAR;						\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return ret;							\
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
+static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
+					struct cftype *cftype,		\
+					u64 val)			\
+{									\
+	struct io_cgroup *iocg;					\
+	struct io_group *iog;						\
+	struct hlist_node *n;						\
+									\
+	if (val < (__MIN) || val > (__MAX))				\
+		return -EINVAL;						\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+									\
+	spin_lock_irq(&iocg->lock);					\
+	iocg->__VAR = (unsigned long)val;				\
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		iog->entity.new_##__VAR = (unsigned long)val;		\
+		smp_wmb();						\
+		iog->entity.ioprio_changed = 1;				\
+	}								\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return 0;							\
+}
+
+STORE_FUNCTION(weight, 0, WEIGHT_MAX);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+/**
+ * bfq_group_chain_alloc - allocate a chain of groups.
+ * @bfqd: queue descriptor.
+ * @cgroup: the leaf cgroup this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup.  Stop if a cgroup on the chain
+ * to the root has already an allocated group on @bfqd.
+ */
+struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
+					struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *leaf = NULL, *prev = NULL;
+	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+
+	for (; cgroup != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		if (iog != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a bfq_group for bfqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		iog = kzalloc_node(sizeof(*iog), flags, q->node);
+		if (!iog)
+			goto cleanup;
+
+		io_group_init_entity(iocg, iog);
+		iog->my_entity = &iog->entity;
+
+		if (leaf == NULL) {
+			leaf = iog;
+			prev = leaf;
+		} else {
+			io_group_set_parent(prev, iog);
+			/*
+			 * Build a list of allocated nodes using the bfqd
+			 * field, that is still unused and will be initialized
+			 * only after the node is connected.
+			 */
+			prev->key = iog;
+			prev = iog;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->key;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+/**
+ * bfq_group_chain_link - link an allocated group chain to a cgroup hierarchy.
+ * @bfqd: the queue descriptor.
+ * @cgroup: the leaf cgroup to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated with @bfqd all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the bfqio_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+void io_group_chain_link(struct request_queue *q, void *key,
+				struct cgroup *cgroup,
+				struct io_group *leaf,
+				struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(q->queue_lock);
+
+	for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		next = leaf->key;
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		BUG_ON(iog != NULL);
+
+		spin_lock_irqsave(&iocg->lock, flags);
+
+		rcu_assign_pointer(leaf->key, key);
+		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+		spin_unlock_irqrestore(&iocg->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	BUG_ON(cgroup == NULL && leaf != NULL);
+
+	if (cgroup != NULL && prev != NULL) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		iog = io_cgroup_lookup_group(iocg, key);
+		io_group_set_parent(prev, iog);
+	}
+}
+
+/**
+ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
+ * @bfqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ * @create: if set to 1, create the io group if it has not been created yet.
+ *
+ * Return a group associated to @bfqd in @cgroup, allocating one if
+ * necessary.  When a group is returned all the cgroups in the path
+ * to the root have a group associated to @bfqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback.  If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+struct io_group *io_find_alloc_group(struct request_queue *q,
+			struct cgroup *cgroup, struct elv_fq_data *efqd,
+			int create)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog = NULL;
+	/* Note: Use efqd as key */
+	void *key = efqd;
+
+	iog = io_cgroup_lookup_group(iocg, key);
+	if (iog != NULL || !create)
+		return iog;
+
+	iog = io_group_chain_alloc(q, key, cgroup);
+	if (iog != NULL)
+		io_group_chain_link(q, key, cgroup, iog, efqd);
+
+	return iog;
+}
+
+/*
+ * Search for the io group the current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ */
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+	struct cgroup *cgroup;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	rcu_read_lock();
+	cgroup = task_cgroup(current, io_subsys_id);
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			/*
+			 * bio merge functions doing lookup don't want to
+			 * map bio to root group by default
+			 */
+			iog = NULL;
+	}
+	rcu_read_unlock();
+	return iog;
+}
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_cgroup *iocg = &io_root_cgroup;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_group *iog = efqd->root_group;
+
+	BUG_ON(!iog);
+	spin_lock_irq(&iocg->lock);
+	hlist_del_rcu(&iog->group_node);
+	spin_unlock_irq(&iocg->lock);
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	iog->entity.parent = NULL;
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	iocg = &io_root_cgroup;
+	spin_lock_irq(&iocg->lock);
+	rcu_assign_pointer(iog->key, key);
+	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+	spin_unlock_irq(&iocg->lock);
+
+	return iog;
+}
+
+struct cftype bfqio_files[] = {
+	{
+		.name = "weight",
+		.read_u64 = io_cgroup_weight_read,
+		.write_u64 = io_cgroup_weight_write,
+	},
+	{
+		.name = "ioprio_class",
+		.read_u64 = io_cgroup_ioprio_class_read,
+		.write_u64 = io_cgroup_ioprio_class_write,
+	},
+};
+
+int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	return cgroup_add_files(cgroup, subsys, bfqio_files,
+				ARRAY_SIZE(bfqio_files));
+}
+
+struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
+						struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+
+	if (cgroup->parent != NULL) {
+		iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
+		if (iocg == NULL)
+			return ERR_PTR(-ENOMEM);
+	} else
+		iocg = &io_root_cgroup;
+
+	spin_lock_init(&iocg->lock);
+	INIT_HLIST_HEAD(&iocg->group_data);
+	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
+	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+
+	return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic/bfqq data structures.  For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+int iocg_can_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			    struct task_struct *tsk)
+{
+	struct io_context *ioc;
+	int ret = 0;
+
+	/* task_lock() is needed to avoid races with exit_io_context() */
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+		/*
+		 * ioc == NULL means that the task is either too young or
+		 * exiting: if it still has no ioc the ioc can't be shared,
+		 * if the task is exiting the attach will fail anyway, no
+		 * matter what we return here.
+		 */
+		ret = -EINVAL;
+	task_unlock(tsk);
+
+	return ret;
+}
+
+void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			 struct cgroup *prev, struct task_struct *tsk)
+{
+	struct io_context *ioc;
+
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL)
+		ioc->cgroup_changed = 1;
+	task_unlock(tsk);
+}
+
+/*
+ * Move the queue to the root group if it is active. This is needed when
+ * a cgroup is being deleted and all the IO is not done yet. This is not a
+ * very good scheme as a user might get an unfair share. This needs to be
+ * fixed.
+ */
+void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+				struct io_group *iog)
+{
+	int busy, resume;
+	struct io_entity *entity = &ioq->entity;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	busy = elv_ioq_busy(ioq);
+	resume = !!ioq->nr_queued;
+
+	BUG_ON(resume && !entity->on_st);
+	BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+
+	/*
+	 * We could be moving a queue which is on the idle tree of the previous
+	 * group. What to do? I guess anyway this queue does not have any
+	 * requests. Just forget the entity and free it up from the idle tree.
+	 *
+	 * This needs cleanup. Hackish.
+	 */
+	if (entity->tree == &st->idle) {
+		BUG_ON(atomic_read(&ioq->ref) < 2);
+		bfq_put_idle_entity(st, entity);
+	}
+
+	if (busy) {
+		BUG_ON(atomic_read(&ioq->ref) < 2);
+
+		if (!resume)
+			elv_del_ioq_busy(e, ioq, 0);
+		else
+			elv_deactivate_ioq(efqd, ioq, 0);
+	}
+
+	/*
+	 * Here we use a reference to bfqg.  We don't need a refcounter
+	 * as the cgroup reference will not be dropped, so that its
+	 * destroy() callback will not be invoked.
+	 */
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+
+	if (busy && resume)
+		elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(io_ioq_move);
+
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	struct elevator_queue *eq;
+	struct io_entity *entity = iog->my_entity;
+	struct io_service_tree *st;
+	int i;
+
+	eq = container_of(efqd, struct elevator_queue, efqd);
+	hlist_del(&iog->elv_data_node);
+	__bfq_deactivate_entity(entity, 0);
+	io_put_io_group_queues(eq, iog);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		/*
+		 * The idle tree may still contain bfq_queues belonging
+		 * to exited tasks because they never migrated to a different
+		 * cgroup from the one being destroyed now.  No one else
+		 * can access them so it's safe to act without any lock.
+		 */
+		io_flush_idle_tree(st);
+
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+	}
+
+	BUG_ON(iog->sched_data.next_active != NULL);
+	BUG_ON(iog->sched_data.active_entity != NULL);
+	BUG_ON(entity->tree != NULL);
+}
+
+/**
+ * bfq_destroy_group - destroy @bfqg.
+ * @bgrp: the bfqio_cgroup containing @bfqg.
+ * @bfqg: the group being destroyed.
+ *
+ * Destroy @bfqg, making sure that it is not referenced from its parent.
+ */
+static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct elv_fq_data *efqd = NULL;
+	unsigned long uninitialized_var(flags);
+
+	/* Remove io group from cgroup list */
+	hlist_del(&iog->group_node);
+
+	/*
+	 * io groups are linked in two lists. One list is maintained
+	 * in the elevator (efqd->group_list) and the other is maintained
+	 * per cgroup structure (iocg->group_data).
+	 *
+	 * While a cgroup is being deleted, the elevator might also be
+	 * exiting and both might try to clean up the same io group,
+	 * so we need to be a little careful.
+	 *
+	 * The following code first accesses efqd under RCU to make sure
+	 * iog->key is pointing to a valid efqd and then takes the
+	 * associated queue lock. After getting the queue lock it
+	 * again checks whether the elevator exit path has already got
+	 * hold of the io group (iog->key == NULL). If yes, it does not
+	 * try to free up the async queues again or flush the idle tree.
+	 */
+
+	rcu_read_lock();
+	efqd = rcu_dereference(iog->key);
+	if (efqd != NULL) {
+		spin_lock_irqsave(efqd->queue->queue_lock, flags);
+		if (iog->key == efqd)
+			__io_destroy_group(efqd, iog);
+		spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+	}
+	rcu_read_unlock();
+
+	/*
+	 * No need to defer the kfree() to the end of the RCU grace
+	 * period: we are called from the destroy() callback of our
+ * cgroup, so we can be sure that no one is a) still using
+	 * this cgroup or b) doing lookups in it.
+	 */
+	kfree(iog);
+}
+
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct hlist_node *n, *tmp;
+	struct io_group *iog;
+
+	/*
+	 * Since we are destroying the cgroup, there are no more tasks
+	 * referencing it, and all the RCU grace periods that may have
+	 * referenced it are ended (as the destruction of the parent
+	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
+	 * anything else and we don't need any synchronization.
+	 */
+	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
+		io_destroy_group(iocg, iog);
+
+	BUG_ON(!hlist_empty(&iocg->group_data));
+
+	kfree(iocg);
+}
+
+void io_disconnect_groups(struct elevator_queue *e)
+{
+	struct hlist_node *pos, *n;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+					elv_data_node) {
+		hlist_del(&iog->elv_data_node);
+
+		__bfq_deactivate_entity(iog->my_entity, 0);
+
+		/*
+		 * Don't remove from the group hash, just set an
+		 * invalid key.  No lookups can race with the
+		 * assignment as bfqd is being destroyed; this
+		 * implies also that new elements cannot be added
+		 * to the list.
+		 */
+		rcu_assign_pointer(iog->key, NULL);
+		io_put_io_group_queues(e, iog);
+	}
+}
+
+struct cgroup_subsys io_subsys = {
+	.name = "io",
+	.create = iocg_create,
+	.can_attach = iocg_can_attach,
+	.attach = iocg_attach,
+	.destroy = iocg_destroy,
+	.populate = iocg_populate,
+	.subsys_id = io_subsys_id,
+};
+
+/*
+ * If the bio submitting task and the rq don't belong to the same io_group,
+ * they can't be merged
+ */
+int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	struct request_queue *q = rq->q;
+	struct io_queue *ioq = rq->ioq;
+	struct io_group *iog, *__iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return 1;
+
+	/* Determine the io group of the bio submitting task */
+	iog = io_get_io_group(q, 0);
+	if (!iog) {
+		/* Maybe the task belongs to a different cgroup for which the
+		 * io group has not been set up yet. */
+		return 0;
+	}
+
+	/* Determine the io group of the ioq, rq belongs to*/
+	__iog = ioq_to_io_group(ioq);
+
+	return (iog == __iog);
+}
+
+/* find/create the io group request belongs to and put that info in rq */
+void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+	struct io_group *iog;
+	unsigned long flags;
+
+	/* Make sure io group hierarchy has been set up and also set the
+	 * io group to which rq belongs. Later we should make use of
+	 * bio cgroup patches to determine the io group */
+	spin_lock_irqsave(q->queue_lock, flags);
+	iog = io_get_io_group(q, 1);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	BUG_ON(!iog);
+
+	/* Store iog in rq. TODO: take care of referencing */
+	rq->iog = iog;
 }
 
+#else /* GROUP_IOSCHED */
+void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->weight = entity->new_weight;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+struct io_group *io_lookup_io_group_current(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_lookup_io_group_current);
+
+void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+	return q->elevator->efqd.root_group;
+}
+
+#endif /* CONFIG_GROUP_IOSCHED*/
+
 /* Elevator fair queuing function */
 struct io_queue *rq_ioq(struct request *rq)
 {
@@ -1177,9 +2043,11 @@ EXPORT_SYMBOL(elv_put_ioq);
 
 void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
 {
+	struct io_group *root_group = e->efqd.root_group;
 	struct io_queue *ioq = *ioq_ptr;
 
 	if (ioq != NULL) {
+		io_ioq_move(e, ioq, root_group);
 		/* Drop the reference taken by the io group */
 		elv_put_ioq(ioq);
 		*ioq_ptr = NULL;
@@ -1233,14 +2101,27 @@ struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
 		return NULL;
 
 	sd = &efqd->root_group->sched_data;
-	if (extract)
-		entity = bfq_lookup_next_entity(sd, 1);
-	else
-		entity = bfq_lookup_next_entity(sd, 0);
+	for (; sd != NULL; sd = entity->my_sched_data) {
+		if (extract)
+			entity = bfq_lookup_next_entity(sd, 1);
+		else
+			entity = bfq_lookup_next_entity(sd, 0);
+
+		/*
+		 * entity can be NULL despite the fact that there are busy
+		 * queues, if all the busy queues are under a group which is
+		 * currently under service.
+		 * So if we are just looking for the next ioq while something
+		 * is being served, a NULL entity is not an error.
+		 */
+		BUG_ON(!entity && extract);
+
+		if (extract)
+			entity->service = 0;
 
-	BUG_ON(!entity);
-	if (extract)
-		entity->service = 0;
+		if (!entity)
+			return NULL;
+	}
 	ioq = io_entity_to_ioq(entity);
 
 	return ioq;
@@ -1256,8 +2137,12 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 	struct request_queue *q = efqd->queue;
 
 	if (ioq) {
-		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
-							efqd->busy_queues);
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
+				" weight=%ld group_weight=%ld",
+				efqd->busy_queues,
+				ioq->entity.ioprio, ioq->entity.weight,
+				iog_weight(iog));
 		ioq->slice_end = 0;
 
 		elv_clear_ioq_wait_request(ioq);
@@ -1492,6 +2377,7 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 {
 	struct io_queue *ioq;
 	struct elevator_queue *eq = q->elevator;
+	struct io_group *iog = NULL, *new_iog = NULL;
 
 	ioq = elv_active_ioq(eq);
 
@@ -1509,14 +2395,26 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 
 	/*
 	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+	 *
+	 * TODO: In a hierarchical setup, one needs to traverse up the
+	 * hierarchy till both the queues are children of the same parent
+	 * to decide whether to do the preemption or not. Something like
+	 * what cfs has done for the cpu scheduler. Will do it a little later.
 	 */
 	if (elv_ioq_class_rt(new_ioq) && !elv_ioq_class_rt(ioq))
 		return 1;
 
+	iog = ioq_to_io_group(ioq);
+	new_iog = ioq_to_io_group(new_ioq);
+
 	/*
-	 * Check with io scheduler if it has additional criterion based on
-	 * which it wants to preempt existing queue.
+	 * If both the queues belong to same group, check with io scheduler
+	 * if it has additional criterion based on which it wants to
+	 * preempt existing queue.
 	 */
+	if (iog != new_iog)
+		return 0;
+
 	if (eq->ops->elevator_should_preempt_fn)
 		return eq->ops->elevator_should_preempt_fn(q, new_ioq, rq);
 
@@ -1938,14 +2836,6 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		elv_schedule_dispatch(q);
 }
 
-struct io_group *io_lookup_io_group_current(struct request_queue *q)
-{
-	struct elv_fq_data *efqd = &q->elevator->efqd;
-
-	return efqd->root_group;
-}
-EXPORT_SYMBOL(io_lookup_io_group_current);
-
 void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio)
 {
@@ -1996,44 +2886,6 @@ void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 }
 EXPORT_SYMBOL(io_group_set_async_queue);
 
-/*
- * Release all the io group references to its async queues.
- */
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
-{
-	int i, j;
-
-	for (i = 0; i < 2; i++)
-		for (j = 0; j < IOPRIO_BE_NR; j++)
-			elv_release_ioq(e, &iog->async_queue[i][j]);
-
-	/* Free up async idle queue */
-	elv_release_ioq(e, &iog->async_idle_queue);
-}
-
-struct io_group *io_alloc_root_group(struct request_queue *q,
-					struct elevator_queue *e, void *key)
-{
-	struct io_group *iog;
-	int i;
-
-	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
-	if (iog == NULL)
-		return NULL;
-
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
-		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
-
-	return iog;
-}
-
-void io_free_root_group(struct elevator_queue *e)
-{
-	struct io_group *iog = e->efqd.root_group;
-	io_put_io_group_queues(e, iog);
-	kfree(iog);
-}
-
 static void elv_slab_kill(void)
 {
 	/*
@@ -2079,6 +2931,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
 
 	INIT_LIST_HEAD(&efqd->idle_list);
+	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
@@ -2108,10 +2961,14 @@ void elv_exit_fq_data(struct elevator_queue *e)
 	spin_lock_irq(q->queue_lock);
 	/* This should drop all the idle tree references of ioq */
 	elv_free_idle_ioq_list(e);
+	/* This should drop all the io group references of async queues */
+	io_disconnect_groups(e);
 	spin_unlock_irq(q->queue_lock);
 
 	elv_shutdown_timer_wq(e);
 
+	/* Wait for iog->key accessors to exit their grace periods. */
+	synchronize_rcu();
 	BUG_ON(timer_pending(&efqd->idle_slice_timer));
 	io_free_root_group(e);
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index ce2d671..8c60cf7 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -9,11 +9,13 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/cgroup.h>
 
 #ifndef _BFQ_SCHED_H
 #define _BFQ_SCHED_H
 
 #define IO_IOPRIO_CLASSES	3
+#define WEIGHT_MAX 		1000
 
 typedef u64 bfq_timestamp_t;
 typedef unsigned long bfq_weight_t;
@@ -69,6 +71,7 @@ struct io_service_tree {
  */
 struct io_sched_data {
 	struct io_entity *active_entity;
+	struct io_entity *next_active;
 	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
 };
 
@@ -84,13 +87,12 @@ struct io_sched_data {
  *             this entity; used for O(log N) lookups into active trees.
  * @service: service received during the last round of service.
  * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
- * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
  * @parent: parent entity, for hierarchical scheduling.
  * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
  *                 associated scheduler queue, %NULL on leaf nodes.
  * @sched_data: the scheduler queue this entity belongs to.
- * @ioprio: the ioprio in use.
- * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @weight: the weight in use.
+ * @new_weight: when a weight change is requested, the new weight value
  * @ioprio_class: the ioprio_class in use.
  * @new_ioprio_class: when an ioprio_class change is requested, the new
  *                    ioprio_class value.
@@ -132,13 +134,13 @@ struct io_entity {
 	bfq_timestamp_t min_start;
 
 	bfq_service_t service, budget;
-	bfq_weight_t weight;
 
 	struct io_entity *parent;
 
 	struct io_sched_data *my_sched_data;
 	struct io_sched_data *sched_data;
 
+	bfq_weight_t weight, new_weight;
 	unsigned short ioprio, new_ioprio;
 	unsigned short ioprio_class, new_ioprio_class;
 
@@ -180,6 +182,75 @@ struct io_queue {
 	void *sched_queue;
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+/**
+ * struct bfq_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both bfq_queues and bfq_groups).
+ * @group_node: node to be inserted into the bfqio_cgroup->group_data
+ *              list of the containing cgroup's bfqio_cgroup.
+ * @bfqd_node: node to be inserted into the @bfqd->group_list list
+ *             of the groups active on the same device; used for cleanup.
+ * @bfqd: the bfq_data for the device this group acts upon.
+ * @async_bfqq: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/migration.
+ *
+ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
+ * there is a set of bfq_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @group_node is protected by the bfqio_cgroup lock, and is accessed
+ *      via RCU from its readers.
+ *    o @bfqd is protected by the queue lock, RCU is used to access it
+ *      from the readers.
+ *    o All the other fields are protected by the @bfqd queue lock.
+ */
+struct io_group {
+	struct io_entity entity;
+	struct hlist_node elv_data_node;
+	struct hlist_node group_node;
+	struct io_sched_data sched_data;
+
+	struct io_entity *my_entity;
+
+	/*
+	 * A cgroup has multiple io_groups, one for each request queue.
+	 * To find the io group belonging to a particular queue, the
+	 * elv_fq_data pointer is stored as a key.
+	 */
+	void *key;
+
+	/* async_queue and idle_queue are used only for cfq */
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+};
+
+/**
+ * struct bfqio_cgroup - bfq cgroup data structure.
+ * @css: subsystem state for bfq in the containing cgroup.
+ * @weight: cgroup weight.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @weight, @ioprio_class and @group_data.
+ * @group_data: list containing the bfq_group belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates,
+ * @weight and @ioprio_class are protected by @lock.
+ */
+struct io_cgroup {
+	struct cgroup_subsys_state css;
+
+	unsigned long weight, ioprio_class;
+
+	spinlock_t lock;
+	struct hlist_head group_data;
+};
+#else
 struct io_group {
 	struct io_sched_data sched_data;
 
@@ -187,10 +258,14 @@ struct io_group {
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
 };
+#endif
 
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	/* List of io groups hanging on this elevator */
+	struct hlist_head group_list;
+
 	/* List of io queues on idle tree. */
 	struct list_head idle_list;
 
@@ -375,9 +450,20 @@ static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
 	ioq->entity.ioprio_changed = 1;
 }
 
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static inline bfq_weight_t bfq_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX)/IOPRIO_BE_NR;
+}
+
 static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
 {
 	ioq->entity.new_ioprio = ioprio;
+	ioq->entity.new_weight = bfq_ioprio_to_weight(ioprio);
 	ioq->entity.ioprio_changed = 1;
 }
 
@@ -394,6 +480,50 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 						sched_data);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+					struct io_group *iog);
+extern void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq);
+static inline bfq_weight_t iog_weight(struct io_group *iog)
+{
+	return iog->entity.weight;
+}
+
+#else /* !GROUP_IOSCHED */
+/*
+ * No ioq movement is needed in case of flat setup. The root io group gets
+ * cleaned up upon elevator exit, and before that it is made sure that both
+ * the active and idle trees are empty.
+ */
+static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
+					struct io_group *iog)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
+/*
+ * Currently the root group is not part of the elevator group list and is
+ * freed separately. Hence, in a non-hierarchical setup there is nothing to do.
+ */
+static inline void io_disconnect_groups(struct elevator_queue *e) {}
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline bfq_weight_t iog_weight(struct io_group *iog)
+{
+	/* Only the root group is present and its weight is immaterial. */
+	return 0;
+}
+
+#endif /* GROUP_IOSCHED */
+
 /* Functions used by blksysfs.c */
 extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
 extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
@@ -495,5 +625,16 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
 {
 	return NULL;
 }
+
+static inline void elv_fq_set_request_io_group(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index c2f07f5..4321169 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -105,6 +105,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (bio_integrity(bio) != blk_integrity_rq(rq))
 		return 0;
 
+	/* If rq and bio belong to different groups, don't allow merging */
+	if (!io_group_allow_merge(rq, bio))
+		return 0;
+
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;
 
@@ -913,6 +917,8 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_set_request_io_group(q, rq);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4634949..9c209a0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -249,7 +249,12 @@ struct request {
 #ifdef CONFIG_ELV_FAIR_QUEUING
 	/* io queue request belongs to */
 	struct io_queue *ioq;
-#endif
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* io group request belongs to */
+	struct io_group *iog;
+#endif /* GROUP_IOSCHED */
+#endif /* ELV_FAIR_QUEUING */
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..68ea6bd 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,10 @@ SUBSYS(net_cls)
 #endif
 
 /* */
+
+#ifdef CONFIG_GROUP_IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
+
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 08b987b..51664bb 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -73,6 +73,11 @@ struct io_context {
 	unsigned short ioprio;
 	unsigned short ioprio_changed;
 
+#ifdef CONFIG_GROUP_IOSCHED
+	/* If a task changes cgroup, the elevator processes it asynchronously */
+	unsigned short cgroup_changed;
+#endif
+
 	/*
 	 * For request batching
 	 */
diff --git a/init/Kconfig b/init/Kconfig
index 7be4d38..ab76477 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -606,6 +606,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
 	  size is 4096bytes, 512k per 1Gbytes of swap.
 
+config GROUP_IOSCHED
+	bool "Group IO Scheduler"
+	depends on CGROUPS && ELV_FAIR_QUEUING
+	default n
+	---help---
+	  This feature lets the IO scheduler recognize task groups and control
+	  disk bandwidth allocation to those task groups.
+
 endif # CGROUPS
 
 config MM_OWNER
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread
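
As a quick sanity check of the bfq_ioprio_to_weight() mapping introduced in the
patch above, the conversion can be reproduced with a standalone user-space
sketch. The WEIGHT_MAX value of 1000 below is assumed purely for illustration;
the real constant is defined elsewhere in the series:

	#include <stdio.h>

	#define IOPRIO_BE_NR	8	/* as in include/linux/ioprio.h */
	#define WEIGHT_MAX	1000	/* assumed value, illustration only */

	/* mirrors bfq_ioprio_to_weight(): lower ioprio => higher weight */
	static unsigned long ioprio_to_weight(int ioprio)
	{
		return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX) / IOPRIO_BE_NR;
	}

	int main(void)
	{
		int p;

		for (p = 0; p < IOPRIO_BE_NR; p++)
			printf("ioprio %d -> weight %lu\n", p,
			       ioprio_to_weight(p));
		return 0;
	}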

* [PATCH 06/18] io-controller: cfq changes to use hierarchical fair queuing code in elevator layer
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2009-05-05 19:58   ` [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevator layer Vivek Goyal
@ 2009-05-05 19:58   ` Vivek Goyal
  2009-05-05 19:58   ` [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups Vivek Goyal
                     ` (15 subsequent siblings)
  21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Make cfq hierarchical.
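
For illustration, the changed_cgroup() path added below can be exercised by
simply migrating a task to another io cgroup. A minimal, untested sketch; the
/cgroup/test2 path is an assumption about where the io cgroup hierarchy is
mounted:

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* move ourselves into another io cgroup (assumed path) */
		FILE *f = fopen("/cgroup/test2/tasks", "w");

		if (!f) {
			perror("tasks");
			return 1;
		}
		fprintf(f, "%d\n", (int)getpid());
		fclose(f);

		/*
		 * The elevator reacts lazily: ioc->cgroup_changed is checked
		 * on the next cic lookup, and the sync queue is then moved to
		 * the new group via io_ioq_move().
		 */
		return 0;
	}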

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched |    8 ++++++++
 block/cfq-iosched.c   |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 init/Kconfig          |    2 +-
 3 files changed, 57 insertions(+), 1 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index dd5224d..a91a807 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -54,6 +54,14 @@ config IOSCHED_CFQ
 	  working environment, suitable for desktop systems.
 	  This is the default I/O scheduler.
 
+config IOSCHED_CFQ_HIER
+	bool "CFQ Hierarchical Scheduling support"
+	depends on IOSCHED_CFQ && CGROUPS
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in cfq.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f90c534..1e9dd5b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1229,6 +1229,50 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 	ioc->ioprio_changed = 0;
 }
 
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+{
+	struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
+	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+	struct cfq_data *cfqd = cic->key;
+	struct io_group *iog, *__iog;
+	unsigned long flags;
+	struct request_queue *q;
+
+	if (unlikely(!cfqd))
+		return;
+
+	q = cfqd->queue;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	iog = io_lookup_io_group_current(q);
+
+	if (async_cfqq != NULL) {
+		__iog = cfqq_to_io_group(async_cfqq);
+
+		if (iog != __iog) {
+			cic_set_cfqq(cic, NULL, 0);
+			cfq_put_queue(async_cfqq);
+		}
+	}
+
+	if (sync_cfqq != NULL) {
+		__iog = cfqq_to_io_group(sync_cfqq);
+		if (iog != __iog)
+			io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+	}
+
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void cfq_ioc_set_cgroup(struct io_context *ioc)
+{
+	call_for_each_cic(ioc, changed_cgroup);
+	ioc->cgroup_changed = 0;
+}
+#endif  /* CONFIG_IOSCHED_CFQ_HIER */
+
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 				struct io_context *ioc, gfp_t gfp_mask)
@@ -1494,6 +1538,10 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+	if (unlikely(ioc->cgroup_changed))
+		cfq_ioc_set_cgroup(ioc);
+#endif
 	return cic;
 err_free:
 	cfq_cic_free(cic);
diff --git a/init/Kconfig b/init/Kconfig
index ab76477..1a4686d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -607,7 +607,7 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  size is 4096bytes, 512k per 1Gbytes of swap.
 
 config GROUP_IOSCHED
-	bool "Group IO Scheduler"
+	bool
 	depends on CGROUPS && ELV_FAIR_QUEUING
 	default n
 	---help---
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2009-05-05 19:58   ` [PATCH 06/18] io-controller: cfq changes to use " Vivek Goyal
@ 2009-05-05 19:58   ` Vivek Goyal
  2009-05-05 19:58   ` [PATCH 08/18] io-controller: idle for some time on sync queue before expiring it Vivek Goyal
                     ` (14 subsequent siblings)
  21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o This patch exports some statistics through the cgroup interface. Two of the
  statistics currently exported are the actual disk time assigned to the
  cgroup and the actual number of sectors dispatched to disk on its behalf.

o Currently these numbers are aggregate, i.e. they cover all the tasks in that
  cgroup across all the disks. Later it may be useful to provide per-disk
  statistics as well.
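
o For illustration only, the two new files could be read from user space
  roughly as below. This is an untested sketch: the /cgroup/test1 path is an
  assumption about where the io cgroup hierarchy is mounted, and the
  io.disk_time / io.disk_sectors names follow from the cftype names added
  here plus the "io" subsystem prefix.

	#include <stdio.h>

	/* assumed path of one io cgroup; adjust to the local setup */
	#define GRP	"/cgroup/test1"

	static unsigned long long read_u64(const char *path)
	{
		unsigned long long v = 0;
		FILE *f = fopen(path, "r");

		if (f) {
			if (fscanf(f, "%llu", &v) != 1)
				v = 0;
			fclose(f);
		}
		return v;
	}

	int main(void)
	{
		/* disk_time is reported in ms, disk_sectors in sectors */
		printf("disk_time   : %llu\n", read_u64(GRP "/io.disk_time"));
		printf("disk_sectors: %llu\n", read_u64(GRP "/io.disk_sectors"));
		return 0;
	}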

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c |  101 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h |    7 ++++
 2 files changed, 106 insertions(+), 2 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index cdaa46f..b8dbc8b 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -886,13 +886,16 @@ struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 	return entity;
 }
 
-void entity_served(struct io_entity *entity, bfq_service_t served)
+void entity_served(struct io_entity *entity, bfq_service_t served,
+					bfq_service_t nr_sectors)
 {
 	struct io_service_tree *st;
 
 	for_each_entity(entity) {
 		st = io_entity_service_tree(entity);
 		entity->service += served;
+		entity->total_service += served;
+		entity->total_sector_service += nr_sectors;
 		BUG_ON(st->wsum == 0);
 		st->vtime += bfq_delta(served, st->wsum);
 		bfq_forget_idle(st);
@@ -1064,6 +1067,92 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
 STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
 #undef STORE_FUNCTION
 
+/*
+ * Traverse all the io_groups associated with this cgroup and calculate the
+ * aggregate disk time received by these groups on their respective disks.
+ */
+static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	u64 disk_time = 0;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are no longer functional
+		 * and are waiting to be reclaimed upon cgroup deletion.
+		 */
+		if (rcu_dereference(iog->key))
+			disk_time += iog->entity.total_service;
+	}
+	rcu_read_unlock();
+
+	return disk_time;
+}
+
+static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
+					struct cftype *cftype)
+{
+	struct io_cgroup *iocg;
+	u64 ret;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+	spin_lock_irq(&iocg->lock);
+	ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+	return ret;
+}
+
+/*
+ * Traverse all the io_groups associated with this cgroup and calculate the
+ * aggregate number of sectors transferred by these groups on their disks.
+ */
+static u64 calculate_aggr_disk_sectors(struct io_cgroup *iocg)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	u64 disk_sectors = 0;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are no longer functional
+		 * and are waiting to be reclaimed upon cgroup deletion.
+		 */
+		if (rcu_dereference(iog->key))
+			disk_sectors += iog->entity.total_sector_service;
+	}
+	rcu_read_unlock();
+
+	return disk_sectors;
+}
+
+static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+					struct cftype *cftype)
+{
+	struct io_cgroup *iocg;
+	u64 ret;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+	spin_lock_irq(&iocg->lock);
+	ret = calculate_aggr_disk_sectors(iocg);
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+	return ret;
+}
+
 /**
  * bfq_group_chain_alloc - allocate a chain of groups.
  * @bfqd: queue descriptor.
@@ -1297,6 +1386,14 @@ struct cftype bfqio_files[] = {
 		.read_u64 = io_cgroup_ioprio_class_read,
 		.write_u64 = io_cgroup_ioprio_class_write,
 	},
+	{
+		.name = "disk_time",
+		.read_u64 = io_cgroup_disk_time_read,
+	},
+	{
+		.name = "disk_sectors",
+		.read_u64 = io_cgroup_disk_sectors_read,
+	},
 };
 
 int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1712,7 +1809,7 @@ EXPORT_SYMBOL(elv_get_slice_idle);
 
 void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
 {
-	entity_served(&ioq->entity, served);
+	entity_served(&ioq->entity, served, ioq->nr_sectors);
 }
 
 /* Tells whether ioq is queued in root group or not */
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 8c60cf7..f4c6361 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -145,6 +145,13 @@ struct io_entity {
 	unsigned short ioprio_class, new_ioprio_class;
 
 	int ioprio_changed;
+
+	/*
+	 * Keep track of total service received by this entity. Keep the
+	 * stats both for time slices and number of sectors dispatched
+	 */
+	unsigned long total_service;
+	unsigned long total_sector_service;
 };
 
 /*
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (12 preceding siblings ...)
  2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` Vivek Goyal
                   ` (23 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando
  Cc: akpm, vgoyal

o When a sync queue expires, in many cases it might be empty and will then
  be deleted from the active tree. This leads to a scenario where, out of
  two competing queues, only one is on the tree; when a new queue is
  selected, a vtime jump takes place and we don't see service provided in
  proportion to weight.

o In general this is a fundamental problem with fairness for sync queues
  that are not continuously backlogged. Idling looks like the only solution
  to make sure such queues get a decent amount of disk bandwidth in the
  face of competition from continuously backlogged queues. But excessive
  idling has the potential to reduce performance on SSDs and disks with
  command queuing.

o This patch experiments with waiting for the next request to come before a
  queue is expired once it has consumed its time slice. This can ensure
  more accurate fairness numbers in some cases.

o Introduced a tunable "fairness". If set, the io-controller will put more
  focus on getting fairness right than on getting throughput right. (A toy
  model of the trade-off follows below.)
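
For illustration, here is a toy user-space model of the trade-off (this is
not kernel code; the request cost, think time, slice length and the
service-based selection rule are assumptions of the sketch, not values taken
from the patch). Queue A is continuously backlogged while queue B issues one
request at a time and thinks briefly before the next one; without wait-busy
idling B is expired the moment it goes empty, so A keeps consuming full
slices.

#include <stdio.h>

#define SLICE      10	/* ms of disk time per allocated slice */
#define REQ_COST    2	/* ms of disk service per request      */
#define THINK_TIME  1	/* ms B waits before its next request  */
#define TOTAL     600	/* ms of simulated wall-clock time     */

int main(void)
{
	int idling;

	for (idling = 0; idling <= 1; idling++) {
		long svc_a = 0, svc_b = 0, now = 0;

		while (now < TOTAL) {
			if (svc_b <= svc_a) {
				/* B has received less service, select it */
				if (idling) {
					/*
					 * wait-busy: keep B's slice and idle
					 * through its think time until the
					 * slice is consumed
					 */
					int used = 0;

					while (used + REQ_COST <= SLICE) {
						svc_b += REQ_COST;
						used += REQ_COST + THINK_TIME;
					}
					now += SLICE;
				} else {
					/*
					 * no idling: B goes empty after one
					 * request and is expired right away
					 */
					svc_b += REQ_COST;
					now += REQ_COST;
				}
			} else {
				/* A is backlogged and uses its full slice */
				svc_a += SLICE;
				now += SLICE;
			}
		}
		printf("%s idling: A got %ldms of service, B got %ldms\n",
		       idling ? "with" : "without", svc_a, svc_b);
	}
	return 0;
}

With these made-up numbers B ends up with roughly a sixth of the service
without idling and over a third of it with idling, at the cost of some idle
disk time -- which is the throughput-vs-fairness trade-off the "fairness"
tunable exposes.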

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-sysfs.c   |    7 +++
 block/elevator-fq.c |  117 +++++++++++++++++++++++++++++++++++++++++++++-----
 block/elevator-fq.h |   12 +++++
 3 files changed, 124 insertions(+), 12 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 082a273..c942ddc 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -294,6 +294,12 @@ static struct queue_sysfs_entry queue_slice_async_entry = {
 	.show = elv_slice_async_show,
 	.store = elv_slice_async_store,
 };
+
+static struct queue_sysfs_entry queue_fairness_entry = {
+	.attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
+	.show = elv_fairness_show,
+	.store = elv_fairness_store,
+};
 #endif
 
 static struct attribute *default_attrs[] = {
@@ -311,6 +317,7 @@ static struct attribute *default_attrs[] = {
 	&queue_slice_idle_entry.attr,
 	&queue_slice_sync_entry.attr,
 	&queue_slice_async_entry.attr,
+	&queue_fairness_entry.attr,
 #endif
 	NULL,
 };
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index b8dbc8b..ec01273 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1821,6 +1821,44 @@ static inline int is_root_group_ioq(struct request_queue *q,
 	return (ioq->entity.sched_data == &efqd->root_group->sched_data);
 }
 
+/* Functions to show and store fairness value through sysfs */
+ssize_t elv_fairness_show(struct request_queue *q, char *name)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	data = efqd->fairness;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return sprintf(name, "%d\n", data);
+}
+
+ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+			  size_t count)
+{
+	struct elv_fq_data *efqd;
+	unsigned int data;
+	unsigned long flags;
+
+	char *p = (char *)name;
+
+	data = simple_strtoul(p, &p, 10);
+
+	if (data < 0)
+		data = 0;
+	else if (data > INT_MAX)
+		data = INT_MAX;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	efqd = &q->elevator->efqd;
+	efqd->fairness = data;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return count;
+}
+
 /* Functions to show and store elv_idle_slice value through sysfs */
 ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
 {
@@ -2061,7 +2099,7 @@ static void elv_ioq_update_idle_window(struct elevator_queue *eq,
 	 * io scheduler if it wants to disable idling based on additional
 	 * considerations like seek pattern.
 	 */
-	if (enable_idle) {
+	if (enable_idle && !efqd->fairness) {
 		if (eq->ops->elevator_update_idle_window_fn)
 			enable_idle = eq->ops->elevator_update_idle_window_fn(
 						eq, ioq->sched_queue, rq);
@@ -2395,10 +2433,11 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	assert_spin_locked(q->queue_lock);
 	elv_log_ioq(efqd, ioq, "slice expired");
 
-	if (elv_ioq_wait_request(ioq))
+	if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
 		del_timer(&efqd->idle_slice_timer);
 
 	elv_clear_ioq_wait_request(ioq);
+	elv_clear_ioq_wait_busy(ioq);
 
 	/*
 	 * if ioq->slice_end = 0, that means a queue was expired before first
@@ -2563,7 +2602,7 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 		 * has other work pending, don't risk delaying until the
 		 * idle timer unplug to continue working.
 		 */
-		if (elv_ioq_wait_request(ioq)) {
+		if (elv_ioq_wait_request(ioq) && !elv_ioq_wait_busy(ioq)) {
 			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
 			    efqd->busy_queues > 1) {
 				del_timer(&efqd->idle_slice_timer);
@@ -2571,6 +2610,17 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 			}
 			elv_mark_ioq_must_dispatch(ioq);
 		}
+
+		/*
+		 * If we were waiting for a request on this queue, wait is
+		 * done. Schedule the next dispatch
+		 */
+		if (elv_ioq_wait_busy(ioq)) {
+			del_timer(&efqd->idle_slice_timer);
+			elv_clear_ioq_wait_busy(ioq);
+			elv_clear_ioq_must_dispatch(ioq);
+			elv_schedule_dispatch(q);
+		}
 	} else if (elv_should_preempt(q, ioq, rq)) {
 		/*
 		 * not the active queue - expire current slice if it is
@@ -2598,6 +2648,9 @@ void elv_idle_slice_timer(unsigned long data)
 
 	if (ioq) {
 
+		if (elv_ioq_wait_busy(ioq))
+			goto expire;
+
 		/*
 		 * We saw a request before the queue expired, let it through
 		 */
@@ -2631,7 +2684,7 @@ out_cont:
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
-void elv_ioq_arm_slice_timer(struct request_queue *q)
+void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
 {
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 	struct io_queue *ioq = elv_active_ioq(q->elevator);
@@ -2644,26 +2697,38 @@ void elv_ioq_arm_slice_timer(struct request_queue *q)
 	 * for devices that support queuing, otherwise we still have a problem
 	 * with sync vs async workloads.
 	 */
-	if (blk_queue_nonrot(q) && efqd->hw_tag)
+	if (blk_queue_nonrot(q) && efqd->hw_tag && !efqd->fairness)
 		return;
 
 	/*
-	 * still requests with the driver, don't idle
+	 * idle is disabled, either manually or by past process history
 	 */
-	if (efqd->rq_in_driver)
+	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
 		return;
 
 	/*
-	 * idle is disabled, either manually or by past process history
+	 * This queue has consumed its time slice. We are waiting only for
+	 * it to become busy before we select next queue for dispatch.
 	 */
-	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+	if (efqd->fairness && wait_for_busy && !ioq->dispatched) {
+		elv_mark_ioq_wait_busy(ioq);
+		sl = efqd->elv_slice_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log(efqd, "arm idle: %lu wait busy=1", sl);
+		return;
+	}
+
+	/*
+	 * still requests with the driver, don't idle
+	 */
+	if (efqd->rq_in_driver && !efqd->fairness)
 		return;
 
 	/*
 	 * maybe iosched got its own idling logic. In that case io
 	 * scheduler will take care of arming the timer, if need be.
 	 */
-	if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+	if (q->elevator->ops->elevator_arm_slice_timer_fn && !efqd->fairness) {
 		q->elevator->ops->elevator_arm_slice_timer_fn(q,
 						ioq->sched_queue);
 	} else {
@@ -2706,6 +2771,12 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/* We are waiting for this queue to become busy before it expires. */
+	if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
 	/*
 	 * The active queue has run out of time, expire it and select new.
 	 */
@@ -2915,6 +2986,25 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 			elv_ioq_set_prio_slice(q, ioq);
 			elv_clear_ioq_slice_new(ioq);
 		}
+
+		if (elv_ioq_class_idle(ioq)) {
+			elv_ioq_slice_expired(q);
+			goto done;
+		}
+
+		if (efqd->fairness && sync && !ioq->nr_queued) {
+			/*
+			 * If fairness is enabled, wait for one extra idle
+			 * period in the hope that this queue will get
+			 * backlogged again
+			 */
+			if (elv_ioq_slice_used(ioq))
+				elv_ioq_arm_slice_timer(q, 1);
+			else
+				elv_ioq_arm_slice_timer(q, 0);
+			goto done;
+		}
+
 		/*
 		 * If there are no requests waiting in this queue, and
 		 * there are other queues ready to issue requests, AND
@@ -2922,13 +3012,14 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		 * mean seek distance, give them a chance to run instead
 		 * of idling.
 		 */
-		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+		if (elv_ioq_slice_used(ioq))
 			elv_ioq_slice_expired(q);
 		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
 			 && sync && !rq_noidle(rq))
-			elv_ioq_arm_slice_timer(q);
+			elv_ioq_arm_slice_timer(q, 0);
 	}
 
+done:
 	if (!efqd->rq_in_driver)
 		elv_schedule_dispatch(q);
 }
@@ -3035,6 +3126,8 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	efqd->elv_slice_idle = elv_slice_idle;
 	efqd->hw_tag = 1;
 
+	/* For the time being keep fairness enabled by default */
+	efqd->fairness = 1;
 	return 0;
 }
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index f4c6361..7d3434b 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -316,6 +316,13 @@ struct elv_fq_data {
 	unsigned long long rate_sampling_start; /*sampling window start jiffies*/
 	/* number of sectors finished io during current sampling window */
 	unsigned long rate_sectors_current;
+
+	/*
+	 * If set to 1, will disable many optimizations done to boost
+	 * throughput and focus more on providing fairness for sync
+	 * queues.
+	 */
+	int fairness;
 };
 
 extern int elv_slice_idle;
@@ -340,6 +347,7 @@ enum elv_queue_state_flags {
 	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
 	ELV_QUEUE_FLAG_must_dispatch,	  /* must be allowed a dispatch */
 	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
+	ELV_QUEUE_FLAG_wait_busy,	  /* wait for this queue to get busy */
 	ELV_QUEUE_FLAG_NR,
 };
 
@@ -363,6 +371,7 @@ ELV_IO_QUEUE_FLAG_FNS(wait_request)
 ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
 ELV_IO_QUEUE_FLAG_FNS(idle_window)
 ELV_IO_QUEUE_FLAG_FNS(slice_new)
+ELV_IO_QUEUE_FLAG_FNS(wait_busy)
 
 static inline struct io_service_tree *
 io_entity_service_tree(struct io_entity *entity)
@@ -541,6 +550,9 @@ extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
 extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
 extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
 						size_t count);
+extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
+extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+						size_t count);
 
 /* Functions used by elevator.c */
 extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 09/18] io-controller: Separate out queue and data
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2009-05-05 19:58   ` [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it Vivek Goyal
@ 2009-05-05 19:58   ` Vivek Goyal
  2009-05-05 19:58     ` Vivek Goyal
                     ` (12 subsequent siblings)
  21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o So far noop, deadline and AS had one common structure called *_data which
  contained both the queue information where requests are queued and the
  common data used for scheduling. This patch breaks that common structure
  into two parts, *_queue and *_data. This is along the lines of cfq, where
  all the requests are queued in the queue and the common data and tunables
  are part of the data. (A toy illustration of the split follows below.)

o It does not change the functionality, but this re-organization helps once
  noop, deadline and AS are changed to use hierarchical fair queuing.

o Looks like the queue_empty function is not required; we can check
  q->nr_sorted at the elevator layer to see whether the io scheduler queues
  are empty or not.
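
To make the shape of the split concrete, here is a small user-space sketch
(the toy_* names are invented for illustration and are not the structures in
the patch): per-queue state, which can later exist once per group, is kept
apart from the per-device scheduler data that stays single-instance.

#include <stdio.h>
#include <stdlib.h>

struct toy_queue {			/* per-queue: where requests live    */
	int nr_queued;			/* stand-in for fifo/sort lists      */
};

struct toy_data {			/* per-device: tunables/global state */
	int fifo_expire;
	int nr_queues;
};

static struct toy_queue *toy_alloc_queue(struct toy_data *td)
{
	struct toy_queue *tq = calloc(1, sizeof(*tq));

	if (tq)
		td->nr_queues++;	/* one data instance, many queues */
	return tq;
}

int main(void)
{
	struct toy_data td = { .fifo_expire = 250, .nr_queues = 0 };
	struct toy_queue *root = toy_alloc_queue(&td);
	struct toy_queue *grp = toy_alloc_queue(&td);

	if (!root || !grp)
		return 1;
	printf("one toy_data (fifo_expire=%d) serving %d toy_queues\n",
	       td.fifo_expire, td.nr_queues);
	free(root);
	free(grp);
	return 0;
}

This mirrors why queue allocation moves into the new
elevator_alloc_sched_queue_fn hook in this patch: once hierarchical fair
queuing is enabled, the elevator layer can hand out one such queue per group
while the scheduler keeps a single data instance.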

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/as-iosched.c       |  208 ++++++++++++++++++++++++++--------------------
 block/deadline-iosched.c |  117 ++++++++++++++++----------
 block/elevator.c         |  111 +++++++++++++++++++++----
 block/noop-iosched.c     |   59 ++++++-------
 include/linux/elevator.h |    8 ++-
 5 files changed, 319 insertions(+), 184 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index c48fa67..7158e13 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
 				 * or timed out */
 };
 
-struct as_data {
-	/*
-	 * run time data
-	 */
-
-	struct request_queue *q;	/* the "owner" queue */
-
+struct as_queue {
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -90,6 +84,14 @@ struct as_data {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+	unsigned long last_check_fifo[2];
+	int write_batch_count;		/* max # of reqs in a write batch */
+	int current_write_count;	/* how many requests left this batch */
+	int write_batch_idled;		/* has the write batch gone idle? */
+};
+
+struct as_data {
+	struct request_queue *q;	/* the "owner" queue */
 	sector_t last_sector[2];	/* last SYNC & ASYNC sectors */
 
 	unsigned long exit_prob;	/* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
 	sector_t new_seek_mean;
 
 	unsigned long current_batch_expires;
-	unsigned long last_check_fifo[2];
 	int changed_batch;		/* 1: waiting for old batch to end */
 	int new_batch;			/* 1: waiting on first read complete */
-	int batch_data_dir;		/* current batch SYNC / ASYNC */
-	int write_batch_count;		/* max # of reqs in a write batch */
-	int current_write_count;	/* how many requests left this batch */
-	int write_batch_idled;		/* has the write batch gone idle? */
 
 	enum anticipation_status antic_status;
 	unsigned long antic_start;	/* jiffies: when it started */
 	struct timer_list antic_timer;	/* anticipatory scheduling timer */
-	struct work_struct antic_work;	/* Deferred unplugging */
+	struct work_struct antic_work;  /* Deferred unplugging */
 	struct io_context *io_context;	/* Identify the expected process */
 	int ioc_finished; /* IO associated with io_context is finished */
 	int nr_dispatched;
+	int batch_data_dir;		/* current batch SYNC / ASYNC */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
 /*
  * rb tree support functions
  */
-#define RQ_RB_ROOT(ad, rq)	(&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq)	(&(asq)->sort_list[rq_is_sync((rq))])
 
 static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 {
 	struct request *alias;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
-	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
 		as_move_to_dispatch(ad, alias);
 		as_antic_stop(ad);
 	}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 
 static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
 {
-	elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+	elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
 }
 
 /*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
  * what request to process next. Anticipation works on top of this.
  */
 static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
 {
 	struct rb_node *rbnext = rb_next(&last->rb_node);
 	struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
 	else {
 		const int data_dir = rq_is_sync(last);
 
-		rbnext = rb_first(&ad->sort_list[data_dir]);
+		rbnext = rb_first(&asq->sort_list[data_dir]);
 		if (rbnext && rbnext != &last->rb_node)
 			next = rb_entry_rq(rbnext);
 	}
@@ -787,9 +788,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
 static void as_update_rq(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	/* keep the next_rq cache up to date */
-	ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+	asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
 
 	/*
 	 * have we been anticipating this request?
@@ -810,25 +812,26 @@ static void update_write_batch(struct as_data *ad)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
-	if (write_time > batch && !ad->write_batch_idled) {
+	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
-			ad->write_batch_count /= 2;
+			asq->write_batch_count /= 2;
 		else
-			ad->write_batch_count--;
-	} else if (write_time < batch && ad->current_write_count == 0) {
+			asq->write_batch_count--;
+	} else if (write_time < batch && asq->current_write_count == 0) {
 		if (batch > write_time * 3)
-			ad->write_batch_count *= 2;
+			asq->write_batch_count *= 2;
 		else
-			ad->write_batch_count++;
+			asq->write_batch_count++;
 	}
 
-	if (ad->write_batch_count < 1)
-		ad->write_batch_count = 1;
+	if (asq->write_batch_count < 1)
+		asq->write_batch_count = 1;
 }
 
 /*
@@ -899,6 +902,7 @@ static void as_remove_queued_request(struct request_queue *q,
 	const int data_dir = rq_is_sync(rq);
 	struct as_data *ad = q->elevator->elevator_data;
 	struct io_context *ioc;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
@@ -912,8 +916,8 @@ static void as_remove_queued_request(struct request_queue *q,
 	 * Update the "next_rq" cache if we are about to remove its
 	 * entry
 	 */
-	if (ad->next_rq[data_dir] == rq)
-		ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	if (asq->next_rq[data_dir] == rq)
+		asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	rq_fifo_clear(rq);
 	as_del_rq_rb(ad, rq);
@@ -927,23 +931,23 @@ static void as_remove_queued_request(struct request_queue *q,
  *
  * See as_antic_expired comment.
  */
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
 {
 	struct request *rq;
 	long delta_jif;
 
-	delta_jif = jiffies - ad->last_check_fifo[adir];
+	delta_jif = jiffies - asq->last_check_fifo[adir];
 	if (unlikely(delta_jif < 0))
 		delta_jif = -delta_jif;
 	if (delta_jif < ad->fifo_expire[adir])
 		return 0;
 
-	ad->last_check_fifo[adir] = jiffies;
+	asq->last_check_fifo[adir] = jiffies;
 
-	if (list_empty(&ad->fifo_list[adir]))
+	if (list_empty(&asq->fifo_list[adir]))
 		return 0;
 
-	rq = rq_entry_fifo(ad->fifo_list[adir].next);
+	rq = rq_entry_fifo(asq->fifo_list[adir].next);
 
 	return time_after(jiffies, rq_fifo_time(rq));
 }
@@ -952,7 +956,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
  * as_batch_expired returns true if the current batch has expired. A batch
  * is a set of reads or a set of writes.
  */
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
 {
 	if (ad->changed_batch || ad->new_batch)
 		return 0;
@@ -962,7 +966,7 @@ static inline int as_batch_expired(struct as_data *ad)
 		return time_after(jiffies, ad->current_batch_expires);
 
 	return time_after(jiffies, ad->current_batch_expires)
-		|| ad->current_write_count == 0;
+		|| asq->current_write_count == 0;
 }
 
 /*
@@ -971,6 +975,7 @@ static inline int as_batch_expired(struct as_data *ad)
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
 
@@ -993,12 +998,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 			ad->io_context = NULL;
 		}
 
-		if (ad->current_write_count != 0)
-			ad->current_write_count--;
+		if (asq->current_write_count != 0)
+			asq->current_write_count--;
 	}
 	ad->ioc_finished = 0;
 
-	ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	/*
 	 * take it off the sort and fifo list, add to dispatch queue
@@ -1022,9 +1027,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 static int as_dispatch_request(struct request_queue *q, int force)
 {
 	struct as_data *ad = q->elevator->elevator_data;
-	const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-	const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
 	struct request *rq;
+	struct as_queue *asq = elv_select_sched_queue(q, force);
+	int reads, writes;
+
+	if (!asq)
+		return 0;
+
+	reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+	writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+
 
 	if (unlikely(force)) {
 		/*
@@ -1040,25 +1052,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		ad->changed_batch = 0;
 		ad->new_batch = 0;
 
-		while (ad->next_rq[BLK_RW_SYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+		while (asq->next_rq[BLK_RW_SYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_SYNC] = jiffies;
 
-		while (ad->next_rq[BLK_RW_ASYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+		while (asq->next_rq[BLK_RW_ASYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
 		return dispatched;
 	}
 
 	/* Signal that the write batch was uncontended, so we can't time it */
 	if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
-		if (ad->current_write_count == 0 || !writes)
-			ad->write_batch_idled = 1;
+		if (asq->current_write_count == 0 || !writes)
+			asq->write_batch_idled = 1;
 	}
 
 	if (!(reads || writes)
@@ -1067,14 +1079,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->changed_batch)
 		return 0;
 
-	if (!(reads && writes && as_batch_expired(ad))) {
+	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
 		 * batch is still running or no reads or no writes
 		 */
-		rq = ad->next_rq[ad->batch_data_dir];
+		rq = asq->next_rq[ad->batch_data_dir];
 
 		if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
-			if (as_fifo_expired(ad, BLK_RW_SYNC))
+			if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
@@ -1098,7 +1110,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
 		if (writes && ad->batch_data_dir == BLK_RW_SYNC)
 			/*
@@ -1111,8 +1123,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
-		ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1122,7 +1134,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));
 
 		if (ad->batch_data_dir == BLK_RW_SYNC) {
 			ad->changed_batch = 1;
@@ -1135,10 +1147,10 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
-		ad->current_write_count = ad->write_batch_count;
-		ad->write_batch_idled = 0;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->current_write_count = asq->write_batch_count;
+		asq->write_batch_idled = 0;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1150,9 +1162,9 @@ dispatch_request:
 	 * If a request has expired, service it.
 	 */
 
-	if (as_fifo_expired(ad, ad->batch_data_dir)) {
+	if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
 fifo_expired:
-		rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+		rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
 	}
 
 	if (ad->changed_batch) {
@@ -1185,6 +1197,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
 	int data_dir;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	RQ_SET_STATE(rq, AS_RQ_NEW);
 
@@ -1203,7 +1216,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
 
 	as_update_rq(ad, rq); /* keep state machine up to date */
 	RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1225,31 +1238,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 }
 
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
-	struct as_data *ad = q->elevator->elevator_data;
-
-	return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
-		&& list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
 static int
 as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
-	struct as_data *ad = q->elevator->elevator_data;
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
+	struct as_queue *asq = elv_get_sched_queue_current(q);
+
+	if (!asq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
 	 */
-	__rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+	__rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_FRONT_MERGE;
@@ -1336,6 +1338,41 @@ static int as_may_queue(struct request_queue *q, int rw)
 	return ret;
 }
 
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
+{
+	struct as_queue *asq;
+	struct as_data *ad = eq->elevator_data;
+
+	asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+	if (asq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+	asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+	asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+	if (ad)
+		asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+	else
+		asq->write_batch_count = default_write_batch_expire / 10;
+
+	if (asq->write_batch_count < 2)
+		asq->write_batch_count = 2;
+out:
+	return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+	kfree(asq);
+}
+
 static void as_exit_queue(struct elevator_queue *e)
 {
 	struct as_data *ad = e->elevator_data;
@@ -1343,9 +1380,6 @@ static void as_exit_queue(struct elevator_queue *e)
 	del_timer_sync(&ad->antic_timer);
 	cancel_work_sync(&ad->antic_work);
 
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
 	put_io_context(ad->io_context);
 	kfree(ad);
 }
@@ -1369,10 +1403,6 @@ static void *as_init_queue(struct request_queue *q)
 	init_timer(&ad->antic_timer);
 	INIT_WORK(&ad->antic_work, as_work_handler);
 
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
-	ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
-	ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
 	ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
 	ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
 	ad->antic_expire = default_antic_expire;
@@ -1380,9 +1410,6 @@ static void *as_init_queue(struct request_queue *q)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
-	ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
-	if (ad->write_batch_count < 2)
-		ad->write_batch_count = 2;
 
 	return ad;
 }
@@ -1480,7 +1507,6 @@ static struct elevator_type iosched_as = {
 		.elevator_add_req_fn =		as_add_request,
 		.elevator_activate_req_fn =	as_activate_request,
 		.elevator_deactivate_req_fn = 	as_deactivate_request,
-		.elevator_queue_empty_fn =	as_queue_empty,
 		.elevator_completed_req_fn =	as_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -1488,6 +1514,8 @@ static struct elevator_type iosched_as = {
 		.elevator_init_fn =		as_init_queue,
 		.elevator_exit_fn =		as_exit_queue,
 		.trim =				as_trim,
+		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+		.elevator_free_sched_queue_fn = as_free_as_queue,
 	},
 
 	.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index c4d991d..5e65041 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
-struct deadline_data {
-	/*
-	 * run time data
-	 */
-
+struct deadline_queue {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];	
+	struct rb_root sort_list[2];
 	struct list_head fifo_list[2];
-
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
 	struct request *next_rq[2];
 	unsigned int batching;		/* number of sequential requests made */
-	sector_t last_sector;		/* head position */
 	unsigned int starved;		/* times reads have starved writes */
+};
 
+struct deadline_data {
+	struct request_queue *q;
+	sector_t last_sector;		/* head position */
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
-	return &dd->sort_list[rq_data_dir(rq)];
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+	return &dq->sort_list[rq_data_dir(rq)];
 }
 
 /*
@@ -87,9 +87,10 @@ static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	if (dd->next_rq[data_dir] == rq)
-		dd->next_rq[data_dir] = deadline_latter_request(rq);
+	if (dq->next_rq[data_dir] == rq)
+		dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	elv_rb_del(deadline_rb_root(dd, rq), rq);
 }
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(q, rq);
 
 	deadline_add_rq_rb(dd, rq);
 
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
 }
 
 /*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	struct request *__rq;
 	int ret;
+	struct deadline_queue *dq;
+
+	dq = elv_get_sched_queue_current(q);
+	if (!dq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	if (dd->front_merges) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
-		__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+		__rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
 		if (__rq) {
 			BUG_ON(sector != __rq->sector);
 
@@ -207,10 +214,11 @@ static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
-	dd->next_rq[data_dir] = deadline_latter_request(rq);
+	dq->next_rq[READ] = NULL;
+	dq->next_rq[WRITE] = NULL;
+	dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	dd->last_sector = rq_end_sector(rq);
 
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
 
 	/*
 	 * rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
 static int deadline_dispatch_requests(struct request_queue *q, int force)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int reads = !list_empty(&dd->fifo_list[READ]);
-	const int writes = !list_empty(&dd->fifo_list[WRITE]);
+	struct deadline_queue *dq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 	int data_dir;
 
+	if (!dq)
+		return 0;
+
+	reads = !list_empty(&dq->fifo_list[READ]);
+	writes = !list_empty(&dq->fifo_list[WRITE]);
+
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	if (dd->next_rq[WRITE])
-		rq = dd->next_rq[WRITE];
+	if (dq->next_rq[WRITE])
+		rq = dq->next_rq[WRITE];
 	else
-		rq = dd->next_rq[READ];
+		rq = dq->next_rq[READ];
 
-	if (rq && dd->batching < dd->fifo_batch)
+	if (rq && dq->batching < dd->fifo_batch)
 		/* we have a next request and are still entitled to batch */
 		goto dispatch_request;
 
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
 
-		if (writes && (dd->starved++ >= dd->writes_starved))
+		if (writes && (dq->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
 		data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
 
-		dd->starved = 0;
+		dq->starved = 0;
 
 		data_dir = WRITE;
 
@@ -299,48 +313,62 @@ dispatch_find_request:
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+	if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+		rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
 		 * sort order. No expired requests so continue on from here.
 		 */
-		rq = dd->next_rq[data_dir];
+		rq = dq->next_rq[data_dir];
 	}
 
-	dd->batching = 0;
+	dq->batching = 0;
 
 dispatch_request:
 	/*
 	 * rq is the selected appropriate request.
 	 */
-	dd->batching++;
+	dq->batching++;
 	deadline_move_request(dd, rq);
 
 	return 1;
 }
 
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_queue *dq;
 
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
+	dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+	if (dq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&dq->fifo_list[READ]);
+	INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+	dq->sort_list[READ] = RB_ROOT;
+	dq->sort_list[WRITE] = RB_ROOT;
+out:
+	return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+						void *sched_queue)
+{
+	struct deadline_queue *dq = sched_queue;
+
+	kfree(dq);
 }
 
 static void deadline_exit_queue(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
 	kfree(dd);
 }
 
@@ -355,10 +383,7 @@ static void *deadline_init_queue(struct request_queue *q)
 	if (!dd)
 		return NULL;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
+	dd->q = q;
 	dd->fifo_expire[READ] = read_expire;
 	dd->fifo_expire[WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
@@ -445,13 +470,13 @@ static struct elevator_type iosched_deadline = {
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
-		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_init_fn =		deadline_init_queue,
 		.elevator_exit_fn =		deadline_exit_queue,
+		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
-
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index 4321169..f6725f2 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -180,17 +180,54 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static void *elevator_init_queue(struct request_queue *q,
-				 struct elevator_queue *eq)
+static void *elevator_init_data(struct request_queue *q,
+					struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q);
+	void *data = NULL;
+
+	if (eq->ops->elevator_init_fn) {
+		data = eq->ops->elevator_init_fn(q);
+		if (data)
+			return data;
+		else
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* IO scheduler does not instantiate data (noop), it is not an error */
+	return NULL;
+}
+
+static void elevator_free_sched_queue(struct elevator_queue *eq,
+						void *sched_queue)
+{
+	/* Not all io schedulers (cfq) store sched_queue */
+	if (!sched_queue)
+		return;
+	eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *elevator_alloc_sched_queue(struct request_queue *q,
+					struct elevator_queue *eq)
+{
+	void *sched_queue = NULL;
+
+	if (eq->ops->elevator_alloc_sched_queue_fn) {
+		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+								GFP_KERNEL);
+		if (!sched_queue)
+			return ERR_PTR(-ENOMEM);
+
+	}
+
+	return sched_queue;
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
-			   void *data)
+			   void *data, void *sched_queue)
 {
 	q->elevator = eq;
 	eq->elevator_data = data;
+	eq->sched_queue = sched_queue;
 }
 
 static char chosen_elevator[16];
@@ -260,7 +297,7 @@ int elevator_init(struct request_queue *q, char *name)
 	struct elevator_type *e = NULL;
 	struct elevator_queue *eq;
 	int ret = 0;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	INIT_LIST_HEAD(&q->queue_head);
 	q->last_merge = NULL;
@@ -294,13 +331,21 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!eq)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, eq);
-	if (!data) {
+	data = elevator_init_data(q, eq);
+
+	if (IS_ERR(data)) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, eq);
+
+	if (IS_ERR(sched_queue)) {
 		kobject_put(&eq->kobj);
 		return -ENOMEM;
 	}
 
-	elevator_attach(q, eq, data);
+	elevator_attach(q, eq, data, sched_queue);
 	return ret;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -308,6 +353,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elevator_free_sched_queue(e, e->sched_queue);
 	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
@@ -1123,7 +1169,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old_elevator, *e;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	/*
 	 * Allocate new elevator
@@ -1132,10 +1178,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (!e)
 		return 0;
 
-	data = elevator_init_queue(q, e);
-	if (!data) {
+	data = elevator_init_data(q, e);
+
+	if (IS_ERR(data)) {
 		kobject_put(&e->kobj);
-		return 0;
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, e);
+
+	if (IS_ERR(sched_queue)) {
+		kobject_put(&e->kobj);
+		return -ENOMEM;
 	}
 
 	/*
@@ -1152,7 +1206,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * attach and start new elevator
 	 */
-	elevator_attach(q, e, data);
+	elevator_attach(q, e, data, sched_queue);
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -1259,16 +1313,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
 
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
 void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
 {
-	return ioq_sched_queue(rq_ioq(rq));
+	/*
+	 * io scheduler is not using fair queuing. Return sched_queue
+	 * pointer stored in elevator_queue. It will be null if io
+	 * scheduler never stored anything there to begin with (cfq)
+	 */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	/*
+	 * IO scheduler is using the fair queuing infrastructure. If the io
+	 * scheduler has passed a non-null rq, retrieve the sched_queue
+	 * pointer from there. */
+	if (rq)
+		return ioq_sched_queue(rq_ioq(rq));
+
+	return NULL;
 }
 EXPORT_SYMBOL(elv_get_sched_queue);
 
 /* Select an ioscheduler queue to dispatch request from. */
 void *elv_select_sched_queue(struct request_queue *q, int force)
 {
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
 	return ioq_sched_queue(elv_fq_select_ioq(q, force));
 }
 EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+	return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 
-struct noop_data {
+struct noop_queue {
 	struct list_head queue;
 };
 
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
 
 static int noop_dispatch(struct request_queue *q, int force)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_select_sched_queue(q, force);
 
-	if (!list_empty(&nd->queue)) {
+	if (!nq)
+		return 0;
+
+	if (!list_empty(&nq->queue)) {
 		struct request *rq;
-		rq = list_entry(nd->queue.next, struct request, queuelist);
+		rq = list_entry(nq->queue.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		elv_dispatch_sort(q, rq);
 		return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
 
 static void noop_add_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
-	struct noop_data *nd = q->elevator->elevator_data;
-
-	return list_empty(&nd->queue);
+	list_add_tail(&rq->queuelist, &nq->queue);
 }
 
 static struct request *
 noop_former_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.prev == &nd->queue)
+	if (rq->queuelist.prev == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.prev, struct request, queuelist);
 }
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
 static struct request *
 noop_latter_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.next == &nd->queue)
+	if (rq->queuelist.next == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct noop_data *nd;
+	struct noop_queue *nq;
 
-	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
-	if (!nd)
-		return NULL;
-	INIT_LIST_HEAD(&nd->queue);
-	return nd;
+	nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+	if (nq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&nq->queue);
+out:
+	return nq;
 }
 
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 {
-	struct noop_data *nd = e->elevator_data;
+	struct noop_queue *nq = sched_queue;
 
-	BUG_ON(!list_empty(&nd->queue));
-	kfree(nd);
+	kfree(nq);
 }
 
 static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
-		.elevator_queue_empty_fn	= noop_queue_empty,
 		.elevator_former_req_fn		= noop_former_request,
 		.elevator_latter_req_fn		= noop_latter_request,
-		.elevator_init_fn		= noop_init_queue,
-		.elevator_exit_fn		= noop_exit_queue,
+		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
+		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 679c149..3729a2f 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
 typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -70,8 +71,9 @@ struct elevator_ops
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
 
-#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
 	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
 	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
 	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
 
@@ -112,6 +114,7 @@ struct elevator_queue
 {
 	struct elevator_ops *ops;
 	void *elevator_data;
+	void *sched_queue;
 	struct kobject kobj;
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
@@ -260,5 +263,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 09/18] io-controller: Separate out queue and data
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (15 preceding siblings ...)
  2009-05-05 19:58 ` [PATCH 09/18] io-controller: Separate out queue and data Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` [PATCH 10/18] io-conroller: Prepare elevator layer for single queue schedulers Vivek Goyal
                   ` (20 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda
  Cc: vgoyal, akpm

o So far noop, deadline and AS had one common structure called *_data which
  contained both the queue information where requests are queued and the
  common data used for scheduling. This patch breaks that common structure
  into two parts, *_queue and *_data. This is along the lines of cfq, where
  all the requests are queued in the queue and the common data and tunables
  are part of the data.

o It does not change the functionality, but this re-organization helps once
  noop, deadline and AS are changed to use hierarchical fair queuing. The
  elevator init path now also distinguishes "no private data" from an
  allocation failure (a toy rendering of that convention follows below).

o Looks like the queue_empty function is not required; we can check
  q->nr_sorted at the elevator layer to see whether the io scheduler queues
  are empty or not.
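
The elevator init path in this patch separates "the scheduler keeps no
private data" (a NULL return, as for noop) from "allocation failed" (an
ERR_PTR return). A user-space rendering of that convention follows;
TOY_ERR_PTR/TOY_IS_ERR stand in for the kernel's ERR_PTR()/IS_ERR() macros,
and init_data() with its two flags is invented for the example.

#include <stdio.h>
#include <errno.h>
#include <stdint.h>

#define TOY_ERR_PTR(err)	((void *)(intptr_t)(err))
#define TOY_IS_ERR(ptr)		((uintptr_t)(ptr) >= (uintptr_t)-4095)

static void *init_data(int has_private_data, int alloc_fails)
{
	if (!has_private_data)
		return NULL;			/* noop-like: nothing to allocate, not an error */
	if (alloc_fails)
		return TOY_ERR_PTR(-ENOMEM);	/* genuine failure */
	return "scheduler private data";
}

int main(void)
{
	void *d = init_data(0, 0);

	printf("noop-like scheduler: %s\n",
	       d ? "got data" : "no private data (fine)");

	d = init_data(1, 1);
	if (TOY_IS_ERR(d))
		printf("failed init: error %d\n", (int)(intptr_t)d);
	return 0;
}

elevator_init() and elevator_switch() apply the same rule in this patch: a
NULL return from elevator_init_data() is accepted, and only an ERR_PTR value
makes them drop the new elevator and bail out.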

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/as-iosched.c       |  208 ++++++++++++++++++++++++++--------------------
 block/deadline-iosched.c |  117 ++++++++++++++++----------
 block/elevator.c         |  111 +++++++++++++++++++++----
 block/noop-iosched.c     |   59 ++++++-------
 include/linux/elevator.h |    8 ++-
 5 files changed, 319 insertions(+), 184 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index c48fa67..7158e13 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
 				 * or timed out */
 };
 
-struct as_data {
-	/*
-	 * run time data
-	 */
-
-	struct request_queue *q;	/* the "owner" queue */
-
+struct as_queue {
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -90,6 +84,14 @@ struct as_data {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+	unsigned long last_check_fifo[2];
+	int write_batch_count;		/* max # of reqs in a write batch */
+	int current_write_count;	/* how many requests left this batch */
+	int write_batch_idled;		/* has the write batch gone idle? */
+};
+
+struct as_data {
+	struct request_queue *q;	/* the "owner" queue */
 	sector_t last_sector[2];	/* last SYNC & ASYNC sectors */
 
 	unsigned long exit_prob;	/* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
 	sector_t new_seek_mean;
 
 	unsigned long current_batch_expires;
-	unsigned long last_check_fifo[2];
 	int changed_batch;		/* 1: waiting for old batch to end */
 	int new_batch;			/* 1: waiting on first read complete */
-	int batch_data_dir;		/* current batch SYNC / ASYNC */
-	int write_batch_count;		/* max # of reqs in a write batch */
-	int current_write_count;	/* how many requests left this batch */
-	int write_batch_idled;		/* has the write batch gone idle? */
 
 	enum anticipation_status antic_status;
 	unsigned long antic_start;	/* jiffies: when it started */
 	struct timer_list antic_timer;	/* anticipatory scheduling timer */
-	struct work_struct antic_work;	/* Deferred unplugging */
+	struct work_struct antic_work;  /* Deferred unplugging */
 	struct io_context *io_context;	/* Identify the expected process */
 	int ioc_finished; /* IO associated with io_context is finished */
 	int nr_dispatched;
+	int batch_data_dir;		/* current batch SYNC / ASYNC */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
 /*
  * rb tree support functions
  */
-#define RQ_RB_ROOT(ad, rq)	(&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq)	(&(asq)->sort_list[rq_is_sync((rq))])
 
 static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 {
 	struct request *alias;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
-	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
 		as_move_to_dispatch(ad, alias);
 		as_antic_stop(ad);
 	}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 
 static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
 {
-	elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+	elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
 }
 
 /*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
  * what request to process next. Anticipation works on top of this.
  */
 static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
 {
 	struct rb_node *rbnext = rb_next(&last->rb_node);
 	struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
 	else {
 		const int data_dir = rq_is_sync(last);
 
-		rbnext = rb_first(&ad->sort_list[data_dir]);
+		rbnext = rb_first(&asq->sort_list[data_dir]);
 		if (rbnext && rbnext != &last->rb_node)
 			next = rb_entry_rq(rbnext);
 	}
@@ -787,9 +788,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
 static void as_update_rq(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	/* keep the next_rq cache up to date */
-	ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+	asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
 
 	/*
 	 * have we been anticipating this request?
@@ -810,25 +812,26 @@ static void update_write_batch(struct as_data *ad)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
-	if (write_time > batch && !ad->write_batch_idled) {
+	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
-			ad->write_batch_count /= 2;
+			asq->write_batch_count /= 2;
 		else
-			ad->write_batch_count--;
-	} else if (write_time < batch && ad->current_write_count == 0) {
+			asq->write_batch_count--;
+	} else if (write_time < batch && asq->current_write_count == 0) {
 		if (batch > write_time * 3)
-			ad->write_batch_count *= 2;
+			asq->write_batch_count *= 2;
 		else
-			ad->write_batch_count++;
+			asq->write_batch_count++;
 	}
 
-	if (ad->write_batch_count < 1)
-		ad->write_batch_count = 1;
+	if (asq->write_batch_count < 1)
+		asq->write_batch_count = 1;
 }
 
 /*
@@ -899,6 +902,7 @@ static void as_remove_queued_request(struct request_queue *q,
 	const int data_dir = rq_is_sync(rq);
 	struct as_data *ad = q->elevator->elevator_data;
 	struct io_context *ioc;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
@@ -912,8 +916,8 @@ static void as_remove_queued_request(struct request_queue *q,
 	 * Update the "next_rq" cache if we are about to remove its
 	 * entry
 	 */
-	if (ad->next_rq[data_dir] == rq)
-		ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	if (asq->next_rq[data_dir] == rq)
+		asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	rq_fifo_clear(rq);
 	as_del_rq_rb(ad, rq);
@@ -927,23 +931,23 @@ static void as_remove_queued_request(struct request_queue *q,
  *
  * See as_antic_expired comment.
  */
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
 {
 	struct request *rq;
 	long delta_jif;
 
-	delta_jif = jiffies - ad->last_check_fifo[adir];
+	delta_jif = jiffies - asq->last_check_fifo[adir];
 	if (unlikely(delta_jif < 0))
 		delta_jif = -delta_jif;
 	if (delta_jif < ad->fifo_expire[adir])
 		return 0;
 
-	ad->last_check_fifo[adir] = jiffies;
+	asq->last_check_fifo[adir] = jiffies;
 
-	if (list_empty(&ad->fifo_list[adir]))
+	if (list_empty(&asq->fifo_list[adir]))
 		return 0;
 
-	rq = rq_entry_fifo(ad->fifo_list[adir].next);
+	rq = rq_entry_fifo(asq->fifo_list[adir].next);
 
 	return time_after(jiffies, rq_fifo_time(rq));
 }
@@ -952,7 +956,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
  * as_batch_expired returns true if the current batch has expired. A batch
  * is a set of reads or a set of writes.
  */
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
 {
 	if (ad->changed_batch || ad->new_batch)
 		return 0;
@@ -962,7 +966,7 @@ static inline int as_batch_expired(struct as_data *ad)
 		return time_after(jiffies, ad->current_batch_expires);
 
 	return time_after(jiffies, ad->current_batch_expires)
-		|| ad->current_write_count == 0;
+		|| asq->current_write_count == 0;
 }
 
 /*
@@ -971,6 +975,7 @@ static inline int as_batch_expired(struct as_data *ad)
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
 
@@ -993,12 +998,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 			ad->io_context = NULL;
 		}
 
-		if (ad->current_write_count != 0)
-			ad->current_write_count--;
+		if (asq->current_write_count != 0)
+			asq->current_write_count--;
 	}
 	ad->ioc_finished = 0;
 
-	ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	/*
 	 * take it off the sort and fifo list, add to dispatch queue
@@ -1022,9 +1027,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 static int as_dispatch_request(struct request_queue *q, int force)
 {
 	struct as_data *ad = q->elevator->elevator_data;
-	const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-	const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
 	struct request *rq;
+	struct as_queue *asq = elv_select_sched_queue(q, force);
+	int reads, writes;
+
+	if (!asq)
+		return 0;
+
+	reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+	writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+
 
 	if (unlikely(force)) {
 		/*
@@ -1040,25 +1052,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		ad->changed_batch = 0;
 		ad->new_batch = 0;
 
-		while (ad->next_rq[BLK_RW_SYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+		while (asq->next_rq[BLK_RW_SYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_SYNC] = jiffies;
 
-		while (ad->next_rq[BLK_RW_ASYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+		while (asq->next_rq[BLK_RW_ASYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
 		return dispatched;
 	}
 
 	/* Signal that the write batch was uncontended, so we can't time it */
 	if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
-		if (ad->current_write_count == 0 || !writes)
-			ad->write_batch_idled = 1;
+		if (asq->current_write_count == 0 || !writes)
+			asq->write_batch_idled = 1;
 	}
 
 	if (!(reads || writes)
@@ -1067,14 +1079,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->changed_batch)
 		return 0;
 
-	if (!(reads && writes && as_batch_expired(ad))) {
+	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
 		 * batch is still running or no reads or no writes
 		 */
-		rq = ad->next_rq[ad->batch_data_dir];
+		rq = asq->next_rq[ad->batch_data_dir];
 
 		if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
-			if (as_fifo_expired(ad, BLK_RW_SYNC))
+			if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
@@ -1098,7 +1110,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
 		if (writes && ad->batch_data_dir == BLK_RW_SYNC)
 			/*
@@ -1111,8 +1123,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
-		ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1122,7 +1134,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));
 
 		if (ad->batch_data_dir == BLK_RW_SYNC) {
 			ad->changed_batch = 1;
@@ -1135,10 +1147,10 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
-		ad->current_write_count = ad->write_batch_count;
-		ad->write_batch_idled = 0;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->current_write_count = asq->write_batch_count;
+		asq->write_batch_idled = 0;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1150,9 +1162,9 @@ dispatch_request:
 	 * If a request has expired, service it.
 	 */
 
-	if (as_fifo_expired(ad, ad->batch_data_dir)) {
+	if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
 fifo_expired:
-		rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+		rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
 	}
 
 	if (ad->changed_batch) {
@@ -1185,6 +1197,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
 	int data_dir;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	RQ_SET_STATE(rq, AS_RQ_NEW);
 
@@ -1203,7 +1216,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
 
 	as_update_rq(ad, rq); /* keep state machine up to date */
 	RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1225,31 +1238,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 }
 
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
-	struct as_data *ad = q->elevator->elevator_data;
-
-	return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
-		&& list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
 static int
 as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
-	struct as_data *ad = q->elevator->elevator_data;
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
+	struct as_queue *asq = elv_get_sched_queue_current(q);
+
+	if (!asq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
 	 */
-	__rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+	__rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_FRONT_MERGE;
@@ -1336,6 +1338,41 @@ static int as_may_queue(struct request_queue *q, int rw)
 	return ret;
 }
 
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
+{
+	struct as_queue *asq;
+	struct as_data *ad = eq->elevator_data;
+
+	asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+	if (asq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+	asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+	asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+	if (ad)
+		asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+	else
+		asq->write_batch_count = default_write_batch_expire / 10;
+
+	if (asq->write_batch_count < 2)
+		asq->write_batch_count = 2;
+out:
+	return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+	kfree(asq);
+}
+
 static void as_exit_queue(struct elevator_queue *e)
 {
 	struct as_data *ad = e->elevator_data;
@@ -1343,9 +1380,6 @@ static void as_exit_queue(struct elevator_queue *e)
 	del_timer_sync(&ad->antic_timer);
 	cancel_work_sync(&ad->antic_work);
 
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
 	put_io_context(ad->io_context);
 	kfree(ad);
 }
@@ -1369,10 +1403,6 @@ static void *as_init_queue(struct request_queue *q)
 	init_timer(&ad->antic_timer);
 	INIT_WORK(&ad->antic_work, as_work_handler);
 
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
-	ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
-	ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
 	ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
 	ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
 	ad->antic_expire = default_antic_expire;
@@ -1380,9 +1410,6 @@ static void *as_init_queue(struct request_queue *q)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
-	ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
-	if (ad->write_batch_count < 2)
-		ad->write_batch_count = 2;
 
 	return ad;
 }
@@ -1480,7 +1507,6 @@ static struct elevator_type iosched_as = {
 		.elevator_add_req_fn =		as_add_request,
 		.elevator_activate_req_fn =	as_activate_request,
 		.elevator_deactivate_req_fn = 	as_deactivate_request,
-		.elevator_queue_empty_fn =	as_queue_empty,
 		.elevator_completed_req_fn =	as_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -1488,6 +1514,8 @@ static struct elevator_type iosched_as = {
 		.elevator_init_fn =		as_init_queue,
 		.elevator_exit_fn =		as_exit_queue,
 		.trim =				as_trim,
+		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+		.elevator_free_sched_queue_fn = as_free_as_queue,
 	},
 
 	.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index c4d991d..5e65041 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
-struct deadline_data {
-	/*
-	 * run time data
-	 */
-
+struct deadline_queue {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];	
+	struct rb_root sort_list[2];
 	struct list_head fifo_list[2];
-
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
 	struct request *next_rq[2];
 	unsigned int batching;		/* number of sequential requests made */
-	sector_t last_sector;		/* head position */
 	unsigned int starved;		/* times reads have starved writes */
+};
 
+struct deadline_data {
+	struct request_queue *q;
+	sector_t last_sector;		/* head position */
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
-	return &dd->sort_list[rq_data_dir(rq)];
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+	return &dq->sort_list[rq_data_dir(rq)];
 }
 
 /*
@@ -87,9 +87,10 @@ static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	if (dd->next_rq[data_dir] == rq)
-		dd->next_rq[data_dir] = deadline_latter_request(rq);
+	if (dq->next_rq[data_dir] == rq)
+		dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	elv_rb_del(deadline_rb_root(dd, rq), rq);
 }
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(q, rq);
 
 	deadline_add_rq_rb(dd, rq);
 
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
 }
 
 /*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	struct request *__rq;
 	int ret;
+	struct deadline_queue *dq;
+
+	dq = elv_get_sched_queue_current(q);
+	if (!dq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	if (dd->front_merges) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
-		__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+		__rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
 		if (__rq) {
 			BUG_ON(sector != __rq->sector);
 
@@ -207,10 +214,11 @@ static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
-	dd->next_rq[data_dir] = deadline_latter_request(rq);
+	dq->next_rq[READ] = NULL;
+	dq->next_rq[WRITE] = NULL;
+	dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	dd->last_sector = rq_end_sector(rq);
 
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
 
 	/*
 	 * rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
 static int deadline_dispatch_requests(struct request_queue *q, int force)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int reads = !list_empty(&dd->fifo_list[READ]);
-	const int writes = !list_empty(&dd->fifo_list[WRITE]);
+	struct deadline_queue *dq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 	int data_dir;
 
+	if (!dq)
+		return 0;
+
+	reads = !list_empty(&dq->fifo_list[READ]);
+	writes = !list_empty(&dq->fifo_list[WRITE]);
+
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	if (dd->next_rq[WRITE])
-		rq = dd->next_rq[WRITE];
+	if (dq->next_rq[WRITE])
+		rq = dq->next_rq[WRITE];
 	else
-		rq = dd->next_rq[READ];
+		rq = dq->next_rq[READ];
 
-	if (rq && dd->batching < dd->fifo_batch)
+	if (rq && dq->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
 		goto dispatch_request;
 
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
 
-		if (writes && (dd->starved++ >= dd->writes_starved))
+		if (writes && (dq->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
 		data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
 
-		dd->starved = 0;
+		dq->starved = 0;
 
 		data_dir = WRITE;
 
@@ -299,48 +313,62 @@ dispatch_find_request:
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+	if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+		rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
 		 * sort order. No expired requests so continue on from here.
 		 */
-		rq = dd->next_rq[data_dir];
+		rq = dq->next_rq[data_dir];
 	}
 
-	dd->batching = 0;
+	dq->batching = 0;
 
 dispatch_request:
 	/*
 	 * rq is the selected appropriate request.
 	 */
-	dd->batching++;
+	dq->batching++;
 	deadline_move_request(dd, rq);
 
 	return 1;
 }
 
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_queue *dq;
 
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
+	dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+	if (dq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&dq->fifo_list[READ]);
+	INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+	dq->sort_list[READ] = RB_ROOT;
+	dq->sort_list[WRITE] = RB_ROOT;
+out:
+	return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+						void *sched_queue)
+{
+	struct deadline_queue *dq = sched_queue;
+
+	kfree(dq);
 }
 
 static void deadline_exit_queue(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
 	kfree(dd);
 }
 
@@ -355,10 +383,7 @@ static void *deadline_init_queue(struct request_queue *q)
 	if (!dd)
 		return NULL;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
+	dd->q = q;
 	dd->fifo_expire[READ] = read_expire;
 	dd->fifo_expire[WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
@@ -445,13 +470,13 @@ static struct elevator_type iosched_deadline = {
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
-		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_init_fn =		deadline_init_queue,
 		.elevator_exit_fn =		deadline_exit_queue,
+		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
-
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index 4321169..f6725f2 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -180,17 +180,54 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static void *elevator_init_queue(struct request_queue *q,
-				 struct elevator_queue *eq)
+static void *elevator_init_data(struct request_queue *q,
+					struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q);
+	void *data = NULL;
+
+	if (eq->ops->elevator_init_fn) {
+		data = eq->ops->elevator_init_fn(q);
+		if (data)
+			return data;
+		else
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* IO scheduler does not instantiate data (noop), it is not an error */
+	return NULL;
+}
+
+static void elevator_free_sched_queue(struct elevator_queue *eq,
+						void *sched_queue)
+{
+	/* Not all io schedulers (cfq) store sched_queue */
+	if (!sched_queue)
+		return;
+	eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *elevator_alloc_sched_queue(struct request_queue *q,
+					struct elevator_queue *eq)
+{
+	void *sched_queue = NULL;
+
+	if (eq->ops->elevator_alloc_sched_queue_fn) {
+		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+								GFP_KERNEL);
+		if (!sched_queue)
+			return ERR_PTR(-ENOMEM);
+
+	}
+
+	return sched_queue;
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
-			   void *data)
+			   void *data, void *sched_queue)
 {
 	q->elevator = eq;
 	eq->elevator_data = data;
+	eq->sched_queue = sched_queue;
 }
 
 static char chosen_elevator[16];
@@ -260,7 +297,7 @@ int elevator_init(struct request_queue *q, char *name)
 	struct elevator_type *e = NULL;
 	struct elevator_queue *eq;
 	int ret = 0;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	INIT_LIST_HEAD(&q->queue_head);
 	q->last_merge = NULL;
@@ -294,13 +331,21 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!eq)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, eq);
-	if (!data) {
+	data = elevator_init_data(q, eq);
+
+	if (IS_ERR(data)) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, eq);
+
+	if (IS_ERR(sched_queue)) {
 		kobject_put(&eq->kobj);
 		return -ENOMEM;
 	}
 
-	elevator_attach(q, eq, data);
+	elevator_attach(q, eq, data, sched_queue);
 	return ret;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -308,6 +353,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elevator_free_sched_queue(e, e->sched_queue);
 	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
@@ -1123,7 +1169,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old_elevator, *e;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	/*
 	 * Allocate new elevator
@@ -1132,10 +1178,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (!e)
 		return 0;
 
-	data = elevator_init_queue(q, e);
-	if (!data) {
+	data = elevator_init_data(q, e);
+
+	if (IS_ERR(data)) {
 		kobject_put(&e->kobj);
-		return 0;
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, e);
+
+	if (IS_ERR(sched_queue)) {
+		kobject_put(&e->kobj);
+		return -ENOMEM;
 	}
 
 	/*
@@ -1152,7 +1206,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * attach and start new elevator
 	 */
-	elevator_attach(q, e, data);
+	elevator_attach(q, e, data, sched_queue);
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -1259,16 +1313,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
 
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
 void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
 {
-	return ioq_sched_queue(rq_ioq(rq));
+	/*
+	 * io scheduler is not using fair queuing. Return sched_queue
+	 * pointer stored in elevator_queue. It will be null if io
+	 * scheduler never stored anything there to begin with (cfq)
+	 */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	/*
+	 * IO scheduler is using fair queuing infrastructure. If io scheduler
+	 * has passed a non null rq, retrieve sched_queue pointer from
+	 * there. */
+	if (rq)
+		return ioq_sched_queue(rq_ioq(rq));
+
+	return NULL;
 }
 EXPORT_SYMBOL(elv_get_sched_queue);
 
 /* Select an ioscheduler queue to dispatch request from. */
 void *elv_select_sched_queue(struct request_queue *q, int force)
 {
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
 	return ioq_sched_queue(elv_fq_select_ioq(q, force));
 }
 EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+	return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 
-struct noop_data {
+struct noop_queue {
 	struct list_head queue;
 };
 
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
 
 static int noop_dispatch(struct request_queue *q, int force)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_select_sched_queue(q, force);
 
-	if (!list_empty(&nd->queue)) {
+	if (!nq)
+		return 0;
+
+	if (!list_empty(&nq->queue)) {
 		struct request *rq;
-		rq = list_entry(nd->queue.next, struct request, queuelist);
+		rq = list_entry(nq->queue.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		elv_dispatch_sort(q, rq);
 		return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
 
 static void noop_add_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
-	struct noop_data *nd = q->elevator->elevator_data;
-
-	return list_empty(&nd->queue);
+	list_add_tail(&rq->queuelist, &nq->queue);
 }
 
 static struct request *
 noop_former_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.prev == &nd->queue)
+	if (rq->queuelist.prev == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.prev, struct request, queuelist);
 }
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
 static struct request *
 noop_latter_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.next == &nd->queue)
+	if (rq->queuelist.next == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct noop_data *nd;
+	struct noop_queue *nq;
 
-	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
-	if (!nd)
-		return NULL;
-	INIT_LIST_HEAD(&nd->queue);
-	return nd;
+	nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+	if (nq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&nq->queue);
+out:
+	return nq;
 }
 
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 {
-	struct noop_data *nd = e->elevator_data;
+	struct noop_queue *nq = sched_queue;
 
-	BUG_ON(!list_empty(&nd->queue));
-	kfree(nd);
+	kfree(nq);
 }
 
 static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
-		.elevator_queue_empty_fn	= noop_queue_empty,
 		.elevator_former_req_fn		= noop_former_request,
 		.elevator_latter_req_fn		= noop_latter_request,
-		.elevator_init_fn		= noop_init_queue,
-		.elevator_exit_fn		= noop_exit_queue,
+		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
+		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 679c149..3729a2f 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
 typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -70,8 +71,9 @@ struct elevator_ops
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
 
-#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
 	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
 	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
 	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
 
@@ -112,6 +114,7 @@ struct elevator_queue
 {
 	struct elevator_ops *ops;
 	void *elevator_data;
+	void *sched_queue;
 	struct kobject kobj;
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
@@ -260,5 +263,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 09/18] io-controller: Separate out queue and data
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (14 preceding siblings ...)
  2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` Vivek Goyal
                   ` (21 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando
  Cc: akpm, vgoyal

o So far noop, deadline and AS each had one common structure called *_data,
  which contained both the queue information (where requests are queued) and
  the common data used for scheduling. This patch breaks that structure down
  into two parts, *_queue and *_data, along the lines of cfq, where all the
  requests are queued in the queue while the common data and tunables are
  part of the data. A condensed sketch of the resulting layout follows this
  list.

o It does not change the functionality, but this re-organization helps once
  noop, deadline and AS are changed to use hierarchical fair queuing.

o It looks like the queue_empty function is not required; we can check
  q->nr_sorted in the elevator layer to see whether the io scheduler queues
  are empty or not.
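
For reference, here is a condensed sketch of the deadline case after the
split (abridged from the hunks below; the field names match the patch, the
unchanged tunables are only hinted at in a comment, and the usual kernel
headers providing struct rb_root, struct list_head, struct request_queue
and sector_t are assumed):

	/*
	 * Per-queue state: where the requests actually sit. With
	 * hierarchical fair queuing there can be one such queue per group.
	 */
	struct deadline_queue {
		struct rb_root sort_list[2];	/* sorted by sector, per direction */
		struct list_head fifo_list[2];	/* arrival order, per direction */
		struct request *next_rq[2];	/* next in sort order */
		unsigned int batching;		/* sequential requests dispatched */
		unsigned int starved;		/* times reads have starved writes */
	};

	/*
	 * Per-device state: bookkeeping and tunables shared by all queues.
	 */
	struct deadline_data {
		struct request_queue *q;	/* back pointer to the owning queue */
		sector_t last_sector;		/* disk head position */
		/* tunables: fifo_expire[2], fifo_batch, writes_starved,
		 * front_merges (unchanged by this patch) */
	};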

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/as-iosched.c       |  208 ++++++++++++++++++++++++++--------------------
 block/deadline-iosched.c |  117 ++++++++++++++++----------
 block/elevator.c         |  111 +++++++++++++++++++++----
 block/noop-iosched.c     |   59 ++++++-------
 include/linux/elevator.h |    8 ++-
 5 files changed, 319 insertions(+), 184 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index c48fa67..7158e13 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
 				 * or timed out */
 };
 
-struct as_data {
-	/*
-	 * run time data
-	 */
-
-	struct request_queue *q;	/* the "owner" queue */
-
+struct as_queue {
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -90,6 +84,14 @@ struct as_data {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+	unsigned long last_check_fifo[2];
+	int write_batch_count;		/* max # of reqs in a write batch */
+	int current_write_count;	/* how many requests left this batch */
+	int write_batch_idled;		/* has the write batch gone idle? */
+};
+
+struct as_data {
+	struct request_queue *q;	/* the "owner" queue */
 	sector_t last_sector[2];	/* last SYNC & ASYNC sectors */
 
 	unsigned long exit_prob;	/* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
 	sector_t new_seek_mean;
 
 	unsigned long current_batch_expires;
-	unsigned long last_check_fifo[2];
 	int changed_batch;		/* 1: waiting for old batch to end */
 	int new_batch;			/* 1: waiting on first read complete */
-	int batch_data_dir;		/* current batch SYNC / ASYNC */
-	int write_batch_count;		/* max # of reqs in a write batch */
-	int current_write_count;	/* how many requests left this batch */
-	int write_batch_idled;		/* has the write batch gone idle? */
 
 	enum anticipation_status antic_status;
 	unsigned long antic_start;	/* jiffies: when it started */
 	struct timer_list antic_timer;	/* anticipatory scheduling timer */
-	struct work_struct antic_work;	/* Deferred unplugging */
+	struct work_struct antic_work;  /* Deferred unplugging */
 	struct io_context *io_context;	/* Identify the expected process */
 	int ioc_finished; /* IO associated with io_context is finished */
 	int nr_dispatched;
+	int batch_data_dir;		/* current batch SYNC / ASYNC */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
 /*
  * rb tree support functions
  */
-#define RQ_RB_ROOT(ad, rq)	(&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq)	(&(asq)->sort_list[rq_is_sync((rq))])
 
 static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 {
 	struct request *alias;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
-	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
 		as_move_to_dispatch(ad, alias);
 		as_antic_stop(ad);
 	}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 
 static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
 {
-	elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+	elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
 }
 
 /*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
  * what request to process next. Anticipation works on top of this.
  */
 static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
 {
 	struct rb_node *rbnext = rb_next(&last->rb_node);
 	struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
 	else {
 		const int data_dir = rq_is_sync(last);
 
-		rbnext = rb_first(&ad->sort_list[data_dir]);
+		rbnext = rb_first(&asq->sort_list[data_dir]);
 		if (rbnext && rbnext != &last->rb_node)
 			next = rb_entry_rq(rbnext);
 	}
@@ -787,9 +788,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
 static void as_update_rq(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	/* keep the next_rq cache up to date */
-	ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+	asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
 
 	/*
 	 * have we been anticipating this request?
@@ -810,25 +812,26 @@ static void update_write_batch(struct as_data *ad)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
-	if (write_time > batch && !ad->write_batch_idled) {
+	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
-			ad->write_batch_count /= 2;
+			asq->write_batch_count /= 2;
 		else
-			ad->write_batch_count--;
-	} else if (write_time < batch && ad->current_write_count == 0) {
+			asq->write_batch_count--;
+	} else if (write_time < batch && asq->current_write_count == 0) {
 		if (batch > write_time * 3)
-			ad->write_batch_count *= 2;
+			asq->write_batch_count *= 2;
 		else
-			ad->write_batch_count++;
+			asq->write_batch_count++;
 	}
 
-	if (ad->write_batch_count < 1)
-		ad->write_batch_count = 1;
+	if (asq->write_batch_count < 1)
+		asq->write_batch_count = 1;
 }
 
 /*
@@ -899,6 +902,7 @@ static void as_remove_queued_request(struct request_queue *q,
 	const int data_dir = rq_is_sync(rq);
 	struct as_data *ad = q->elevator->elevator_data;
 	struct io_context *ioc;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
@@ -912,8 +916,8 @@ static void as_remove_queued_request(struct request_queue *q,
 	 * Update the "next_rq" cache if we are about to remove its
 	 * entry
 	 */
-	if (ad->next_rq[data_dir] == rq)
-		ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	if (asq->next_rq[data_dir] == rq)
+		asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	rq_fifo_clear(rq);
 	as_del_rq_rb(ad, rq);
@@ -927,23 +931,23 @@ static void as_remove_queued_request(struct request_queue *q,
  *
  * See as_antic_expired comment.
  */
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
 {
 	struct request *rq;
 	long delta_jif;
 
-	delta_jif = jiffies - ad->last_check_fifo[adir];
+	delta_jif = jiffies - asq->last_check_fifo[adir];
 	if (unlikely(delta_jif < 0))
 		delta_jif = -delta_jif;
 	if (delta_jif < ad->fifo_expire[adir])
 		return 0;
 
-	ad->last_check_fifo[adir] = jiffies;
+	asq->last_check_fifo[adir] = jiffies;
 
-	if (list_empty(&ad->fifo_list[adir]))
+	if (list_empty(&asq->fifo_list[adir]))
 		return 0;
 
-	rq = rq_entry_fifo(ad->fifo_list[adir].next);
+	rq = rq_entry_fifo(asq->fifo_list[adir].next);
 
 	return time_after(jiffies, rq_fifo_time(rq));
 }
@@ -952,7 +956,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
  * as_batch_expired returns true if the current batch has expired. A batch
  * is a set of reads or a set of writes.
  */
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
 {
 	if (ad->changed_batch || ad->new_batch)
 		return 0;
@@ -962,7 +966,7 @@ static inline int as_batch_expired(struct as_data *ad)
 		return time_after(jiffies, ad->current_batch_expires);
 
 	return time_after(jiffies, ad->current_batch_expires)
-		|| ad->current_write_count == 0;
+		|| asq->current_write_count == 0;
 }
 
 /*
@@ -971,6 +975,7 @@ static inline int as_batch_expired(struct as_data *ad)
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
 
@@ -993,12 +998,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 			ad->io_context = NULL;
 		}
 
-		if (ad->current_write_count != 0)
-			ad->current_write_count--;
+		if (asq->current_write_count != 0)
+			asq->current_write_count--;
 	}
 	ad->ioc_finished = 0;
 
-	ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	/*
 	 * take it off the sort and fifo list, add to dispatch queue
@@ -1022,9 +1027,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 static int as_dispatch_request(struct request_queue *q, int force)
 {
 	struct as_data *ad = q->elevator->elevator_data;
-	const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-	const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
 	struct request *rq;
+	struct as_queue *asq = elv_select_sched_queue(q, force);
+	int reads, writes;
+
+	if (!asq)
+		return 0;
+
+	reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+	writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+
 
 	if (unlikely(force)) {
 		/*
@@ -1040,25 +1052,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		ad->changed_batch = 0;
 		ad->new_batch = 0;
 
-		while (ad->next_rq[BLK_RW_SYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+		while (asq->next_rq[BLK_RW_SYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_SYNC] = jiffies;
 
-		while (ad->next_rq[BLK_RW_ASYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+		while (asq->next_rq[BLK_RW_ASYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
 		return dispatched;
 	}
 
 	/* Signal that the write batch was uncontended, so we can't time it */
 	if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
-		if (ad->current_write_count == 0 || !writes)
-			ad->write_batch_idled = 1;
+		if (asq->current_write_count == 0 || !writes)
+			asq->write_batch_idled = 1;
 	}
 
 	if (!(reads || writes)
@@ -1067,14 +1079,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->changed_batch)
 		return 0;
 
-	if (!(reads && writes && as_batch_expired(ad))) {
+	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
 		 * batch is still running or no reads or no writes
 		 */
-		rq = ad->next_rq[ad->batch_data_dir];
+		rq = asq->next_rq[ad->batch_data_dir];
 
 		if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
-			if (as_fifo_expired(ad, BLK_RW_SYNC))
+			if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
@@ -1098,7 +1110,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
 		if (writes && ad->batch_data_dir == BLK_RW_SYNC)
 			/*
@@ -1111,8 +1123,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
-		ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1122,7 +1134,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));
 
 		if (ad->batch_data_dir == BLK_RW_SYNC) {
 			ad->changed_batch = 1;
@@ -1135,10 +1147,10 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
-		ad->current_write_count = ad->write_batch_count;
-		ad->write_batch_idled = 0;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->current_write_count = asq->write_batch_count;
+		asq->write_batch_idled = 0;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1150,9 +1162,9 @@ dispatch_request:
 	 * If a request has expired, service it.
 	 */
 
-	if (as_fifo_expired(ad, ad->batch_data_dir)) {
+	if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
 fifo_expired:
-		rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+		rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
 	}
 
 	if (ad->changed_batch) {
@@ -1185,6 +1197,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
 	int data_dir;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	RQ_SET_STATE(rq, AS_RQ_NEW);
 
@@ -1203,7 +1216,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
 
 	as_update_rq(ad, rq); /* keep state machine up to date */
 	RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1225,31 +1238,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 }
 
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
-	struct as_data *ad = q->elevator->elevator_data;
-
-	return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
-		&& list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
 static int
 as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
-	struct as_data *ad = q->elevator->elevator_data;
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
+	struct as_queue *asq = elv_get_sched_queue_current(q);
+
+	if (!asq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
 	 */
-	__rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+	__rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_FRONT_MERGE;
@@ -1336,6 +1338,41 @@ static int as_may_queue(struct request_queue *q, int rw)
 	return ret;
 }
 
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
+{
+	struct as_queue *asq;
+	struct as_data *ad = eq->elevator_data;
+
+	asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+	if (asq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+	asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+	asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+	if (ad)
+		asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+	else
+		asq->write_batch_count = default_write_batch_expire / 10;
+
+	if (asq->write_batch_count < 2)
+		asq->write_batch_count = 2;
+out:
+	return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+	kfree(asq);
+}
+
 static void as_exit_queue(struct elevator_queue *e)
 {
 	struct as_data *ad = e->elevator_data;
@@ -1343,9 +1380,6 @@ static void as_exit_queue(struct elevator_queue *e)
 	del_timer_sync(&ad->antic_timer);
 	cancel_work_sync(&ad->antic_work);
 
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
 	put_io_context(ad->io_context);
 	kfree(ad);
 }
@@ -1369,10 +1403,6 @@ static void *as_init_queue(struct request_queue *q)
 	init_timer(&ad->antic_timer);
 	INIT_WORK(&ad->antic_work, as_work_handler);
 
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
-	ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
-	ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
 	ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
 	ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
 	ad->antic_expire = default_antic_expire;
@@ -1380,9 +1410,6 @@ static void *as_init_queue(struct request_queue *q)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
-	ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
-	if (ad->write_batch_count < 2)
-		ad->write_batch_count = 2;
 
 	return ad;
 }
@@ -1480,7 +1507,6 @@ static struct elevator_type iosched_as = {
 		.elevator_add_req_fn =		as_add_request,
 		.elevator_activate_req_fn =	as_activate_request,
 		.elevator_deactivate_req_fn = 	as_deactivate_request,
-		.elevator_queue_empty_fn =	as_queue_empty,
 		.elevator_completed_req_fn =	as_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -1488,6 +1514,8 @@ static struct elevator_type iosched_as = {
 		.elevator_init_fn =		as_init_queue,
 		.elevator_exit_fn =		as_exit_queue,
 		.trim =				as_trim,
+		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+		.elevator_free_sched_queue_fn = as_free_as_queue,
 	},
 
 	.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index c4d991d..5e65041 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
-struct deadline_data {
-	/*
-	 * run time data
-	 */
-
+struct deadline_queue {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];	
+	struct rb_root sort_list[2];
 	struct list_head fifo_list[2];
-
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
 	struct request *next_rq[2];
 	unsigned int batching;		/* number of sequential requests made */
-	sector_t last_sector;		/* head position */
 	unsigned int starved;		/* times reads have starved writes */
+};
 
+struct deadline_data {
+	struct request_queue *q;
+	sector_t last_sector;		/* head position */
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
-	return &dd->sort_list[rq_data_dir(rq)];
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+	return &dq->sort_list[rq_data_dir(rq)];
 }
 
 /*
@@ -87,9 +87,10 @@ static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	if (dd->next_rq[data_dir] == rq)
-		dd->next_rq[data_dir] = deadline_latter_request(rq);
+	if (dq->next_rq[data_dir] == rq)
+		dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	elv_rb_del(deadline_rb_root(dd, rq), rq);
 }
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(q, rq);
 
 	deadline_add_rq_rb(dd, rq);
 
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
 }
 
 /*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	struct request *__rq;
 	int ret;
+	struct deadline_queue *dq;
+
+	dq = elv_get_sched_queue_current(q);
+	if (!dq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	if (dd->front_merges) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
-		__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+		__rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
 		if (__rq) {
 			BUG_ON(sector != __rq->sector);
 
@@ -207,10 +214,11 @@ static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
-	dd->next_rq[data_dir] = deadline_latter_request(rq);
+	dq->next_rq[READ] = NULL;
+	dq->next_rq[WRITE] = NULL;
+	dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	dd->last_sector = rq_end_sector(rq);
 
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
 
 	/*
 	 * rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
 static int deadline_dispatch_requests(struct request_queue *q, int force)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int reads = !list_empty(&dd->fifo_list[READ]);
-	const int writes = !list_empty(&dd->fifo_list[WRITE]);
+	struct deadline_queue *dq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 	int data_dir;
 
+	if (!dq)
+		return 0;
+
+	reads = !list_empty(&dq->fifo_list[READ]);
+	writes = !list_empty(&dq->fifo_list[WRITE]);
+
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	if (dd->next_rq[WRITE])
-		rq = dd->next_rq[WRITE];
+	if (dq->next_rq[WRITE])
+		rq = dq->next_rq[WRITE];
 	else
-		rq = dd->next_rq[READ];
+		rq = dq->next_rq[READ];
 
-	if (rq && dd->batching < dd->fifo_batch)
+	if (rq && dq->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
 		goto dispatch_request;
 
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
 
-		if (writes && (dd->starved++ >= dd->writes_starved))
+		if (writes && (dq->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
 		data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
 
-		dd->starved = 0;
+		dq->starved = 0;
 
 		data_dir = WRITE;
 
@@ -299,48 +313,62 @@ dispatch_find_request:
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+	if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+		rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
 		 * sort order. No expired requests so continue on from here.
 		 */
-		rq = dd->next_rq[data_dir];
+		rq = dq->next_rq[data_dir];
 	}
 
-	dd->batching = 0;
+	dq->batching = 0;
 
 dispatch_request:
 	/*
 	 * rq is the selected appropriate request.
 	 */
-	dd->batching++;
+	dq->batching++;
 	deadline_move_request(dd, rq);
 
 	return 1;
 }
 
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_queue *dq;
 
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
+	dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+	if (dq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&dq->fifo_list[READ]);
+	INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+	dq->sort_list[READ] = RB_ROOT;
+	dq->sort_list[WRITE] = RB_ROOT;
+out:
+	return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+						void *sched_queue)
+{
+	struct deadline_queue *dq = sched_queue;
+
+	kfree(dq);
 }
 
 static void deadline_exit_queue(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
 	kfree(dd);
 }
 
@@ -355,10 +383,7 @@ static void *deadline_init_queue(struct request_queue *q)
 	if (!dd)
 		return NULL;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
+	dd->q = q;
 	dd->fifo_expire[READ] = read_expire;
 	dd->fifo_expire[WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
@@ -445,13 +470,13 @@ static struct elevator_type iosched_deadline = {
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
-		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_init_fn =		deadline_init_queue,
 		.elevator_exit_fn =		deadline_exit_queue,
+		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
-
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index 4321169..f6725f2 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -180,17 +180,54 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static void *elevator_init_queue(struct request_queue *q,
-				 struct elevator_queue *eq)
+static void *elevator_init_data(struct request_queue *q,
+					struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q);
+	void *data = NULL;
+
+	if (eq->ops->elevator_init_fn) {
+		data = eq->ops->elevator_init_fn(q);
+		if (data)
+			return data;
+		else
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* IO scheduler does not instantiate data (noop), it is not an error */
+	return NULL;
+}
+
+static void elevator_free_sched_queue(struct elevator_queue *eq,
+						void *sched_queue)
+{
+	/* Not all io schedulers store sched_queue (cfq does not) */
+	if (!sched_queue)
+		return;
+	eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *elevator_alloc_sched_queue(struct request_queue *q,
+					struct elevator_queue *eq)
+{
+	void *sched_queue = NULL;
+
+	if (eq->ops->elevator_alloc_sched_queue_fn) {
+		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+								GFP_KERNEL);
+		if (!sched_queue)
+			return ERR_PTR(-ENOMEM);
+
+	}
+
+	return sched_queue;
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
-			   void *data)
+			   void *data, void *sched_queue)
 {
 	q->elevator = eq;
 	eq->elevator_data = data;
+	eq->sched_queue = sched_queue;
 }
 
 static char chosen_elevator[16];
@@ -260,7 +297,7 @@ int elevator_init(struct request_queue *q, char *name)
 	struct elevator_type *e = NULL;
 	struct elevator_queue *eq;
 	int ret = 0;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	INIT_LIST_HEAD(&q->queue_head);
 	q->last_merge = NULL;
@@ -294,13 +331,21 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!eq)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, eq);
-	if (!data) {
+	data = elevator_init_data(q, eq);
+
+	if (IS_ERR(data)) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, eq);
+
+	if (IS_ERR(sched_queue)) {
 		kobject_put(&eq->kobj);
 		return -ENOMEM;
 	}
 
-	elevator_attach(q, eq, data);
+	elevator_attach(q, eq, data, sched_queue);
 	return ret;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -308,6 +353,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elevator_free_sched_queue(e, e->sched_queue);
 	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
@@ -1123,7 +1169,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old_elevator, *e;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	/*
 	 * Allocate new elevator
@@ -1132,10 +1178,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (!e)
 		return 0;
 
-	data = elevator_init_queue(q, e);
-	if (!data) {
+	data = elevator_init_data(q, e);
+
+	if (IS_ERR(data)) {
 		kobject_put(&e->kobj);
-		return 0;
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, e);
+
+	if (IS_ERR(sched_queue)) {
+		kobject_put(&e->kobj);
+		return -ENOMEM;
 	}
 
 	/*
@@ -1152,7 +1206,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * attach and start new elevator
 	 */
-	elevator_attach(q, e, data);
+	elevator_attach(q, e, data, sched_queue);
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -1259,16 +1313,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
 
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
 void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
 {
-	return ioq_sched_queue(rq_ioq(rq));
+	/*
+	 * io scheduler is not using fair queuing. Return sched_queue
+	 * pointer stored in elevator_queue. It will be null if io
+	 * scheduler never stored anything there to begin with (cfq)
+	 */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	/*
+	 * IO scheduler is using the fair queuing infrastructure. If the io
+	 * scheduler has passed a non-null rq, retrieve the sched_queue
+	 * pointer from there. */
+	if (rq)
+		return ioq_sched_queue(rq_ioq(rq));
+
+	return NULL;
 }
 EXPORT_SYMBOL(elv_get_sched_queue);
 
 /* Select an ioscheduler queue to dispatch request from. */
 void *elv_select_sched_queue(struct request_queue *q, int force)
 {
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
 	return ioq_sched_queue(elv_fq_select_ioq(q, force));
 }
 EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+	return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 
-struct noop_data {
+struct noop_queue {
 	struct list_head queue;
 };
 
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
 
 static int noop_dispatch(struct request_queue *q, int force)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_select_sched_queue(q, force);
 
-	if (!list_empty(&nd->queue)) {
+	if (!nq)
+		return 0;
+
+	if (!list_empty(&nq->queue)) {
 		struct request *rq;
-		rq = list_entry(nd->queue.next, struct request, queuelist);
+		rq = list_entry(nq->queue.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		elv_dispatch_sort(q, rq);
 		return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
 
 static void noop_add_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
-	struct noop_data *nd = q->elevator->elevator_data;
-
-	return list_empty(&nd->queue);
+	list_add_tail(&rq->queuelist, &nq->queue);
 }
 
 static struct request *
 noop_former_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.prev == &nd->queue)
+	if (rq->queuelist.prev == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.prev, struct request, queuelist);
 }
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
 static struct request *
 noop_latter_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.next == &nd->queue)
+	if (rq->queuelist.next == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct noop_data *nd;
+	struct noop_queue *nq;
 
-	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
-	if (!nd)
-		return NULL;
-	INIT_LIST_HEAD(&nd->queue);
-	return nd;
+	nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+	if (nq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&nq->queue);
+out:
+	return nq;
 }
 
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 {
-	struct noop_data *nd = e->elevator_data;
+	struct noop_queue *nq = sched_queue;
 
-	BUG_ON(!list_empty(&nd->queue));
-	kfree(nd);
+	kfree(nq);
 }
 
 static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
-		.elevator_queue_empty_fn	= noop_queue_empty,
 		.elevator_former_req_fn		= noop_former_request,
 		.elevator_latter_req_fn		= noop_latter_request,
-		.elevator_init_fn		= noop_init_queue,
-		.elevator_exit_fn		= noop_exit_queue,
+		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
+		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 679c149..3729a2f 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
 typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -70,8 +71,9 @@ struct elevator_ops
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
 
-#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
 	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
 	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
 	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
 
@@ -112,6 +114,7 @@ struct elevator_queue
 {
 	struct elevator_ops *ops;
 	void *elevator_data;
+	void *sched_queue;
 	struct kobject kobj;
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
@@ -260,5 +263,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 10/18] io-controller: Prepare elevator layer for single queue schedulers
@ 2009-05-05 19:58     ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda
  Cc: vgoyal, akpm

Elevator layer now has support for hierarchical fair queuing. cfq has
been migrated to make use of it, and now it is time to do the groundwork
for noop, deadline and AS.

noop, deadline and AS don't maintain separate queues for different
processes; there is only a single queue. Effectively, in a hierarchical
setup there will be one queue per cgroup, where requests from all the
processes in the cgroup are queued.

Generally the io scheduler takes care of creating queues. Because there is
only one queue here, we have modified the common layer to take care of
queue creation and some other functionality. This special casing helps keep
the changes to noop, deadline and AS to a minimum.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
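For illustration (not part of the patch): the sketch below shows roughly what
a converted single-queue io scheduler ends up providing once this groundwork
is in place. The my_* names and CONFIG_IOSCHED_MY_HIER are made-up
placeholders; the real conversions for noop, deadline and AS follow in the
next patches.

#include <linux/blkdev.h>
#include <linux/elevator.h>
#include <linux/module.h>
#include <linux/slab.h>

struct my_queue {
	struct list_head queue;
};

/* Called by the common layer to create the (per-cgroup) scheduler queue */
static void *my_alloc_sched_queue(struct request_queue *q,
				struct elevator_queue *eq, gfp_t gfp_mask)
{
	struct my_queue *mq;

	mq = kmalloc_node(sizeof(*mq), gfp_mask | __GFP_ZERO, q->node);
	if (mq)
		INIT_LIST_HEAD(&mq->queue);
	return mq;
}

/* Called by the common layer when the queue (and its io group) goes away */
static void my_free_sched_queue(struct elevator_queue *e, void *sched_queue)
{
	kfree(sched_queue);
}

static struct elevator_type elevator_my_iosched = {
	.ops = {
		/*
		 * add/dispatch/merge hooks omitted; they locate their queue
		 * via elv_get_sched_queue()/elv_select_sched_queue() instead
		 * of using a single global per-device queue.
		 */
		.elevator_alloc_sched_queue_fn	= my_alloc_sched_queue,
		.elevator_free_sched_queue_fn	= my_free_sched_queue,
	},
#ifdef CONFIG_IOSCHED_MY_HIER
	/* one ioq per io group, arbitrated by the elevator fair queuing code */
	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
#endif
	.elevator_name = "my-iosched",
	.elevator_owner = THIS_MODULE,
};
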
 block/elevator-fq.c      |  160 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h      |   67 +++++++++++++++++++
 block/elevator.c         |   35 ++++++++++-
 include/linux/elevator.h |   14 ++++
 4 files changed, 274 insertions(+), 2 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index ec01273..f2805e6 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -915,6 +915,12 @@ void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
 
 	/* Free up async idle queue */
 	elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* Optimization for io schedulers having single ioq */
+	if (elv_iosched_single_ioq(e))
+		elv_release_ioq(e, &iog->ioq);
+#endif
 }
 
 
@@ -1702,6 +1708,153 @@ void elv_fq_set_request_io_group(struct request_queue *q,
 	rq->iog = iog;
 }
 
+/*
+ * Find/create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain only a single
+ * io queue per cgroup. In this case the common layer can just maintain a
+ * pointer in the group data structure and keep track of it.
+ *
+ * For io schedulers like cfq, which maintain multiple io queues per
+ * cgroup and decide the io queue of a request based on the process, this
+ * function is not invoked.
+ */
+int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask)
+{
+	struct elevator_queue *e = q->elevator;
+	unsigned long flags;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog;
+	void *sched_q = NULL, *new_sched_q = NULL;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	/* Determine the io group request belongs to */
+	iog = rq->iog;
+	BUG_ON(!iog);
+
+retry:
+	/* Get the iosched queue */
+	ioq = io_group_ioq(iog);
+	if (!ioq) {
+		/* io queue and sched_queue needs to be allocated */
+		BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+		if (new_sched_q) {
+			goto alloc_ioq;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			/* Call io scheduler to create the scheduler queue */
+			new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+					e, gfp_mask | __GFP_NOFAIL
+					| __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+						gfp_mask | __GFP_ZERO);
+			if (!sched_q)
+				goto queue_fail;
+		}
+
+alloc_ioq:
+		if (new_ioq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			sched_q = new_sched_q;
+			new_sched_q = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+							| __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq) {
+				e->ops->elevator_free_sched_queue_fn(e,
+							sched_q);
+				sched_q = NULL;
+				goto queue_fail;
+			}
+		}
+
+		elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
+		io_group_set_ioq(iog, ioq);
+		elv_mark_ioq_sync(ioq);
+	}
+
+	if (new_sched_q)
+		e->ops->elevator_free_sched_queue_fn(q->elevator, sched_q);
+
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
+	/* Request reference */
+	elv_get_ioq(ioq);
+	rq->ioq = ioq;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 0;
+
+queue_fail:
+	WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+	elv_schedule_dispatch(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 1;
+}
+
+/*
+ * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	struct io_group *iog;
+
+	/* Determine the io group and io queue of the bio submitting task */
+	iog = io_lookup_io_group_current(q);
+	if (!iog) {
+		/* Maybe the task belongs to a cgroup for which the io group
+		 * has not been set up yet. */
+		return NULL;
+	}
+	return io_group_ioq(iog);
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	if (ioq) {
+		rq->ioq = NULL;
+		elv_put_ioq(ioq);
+	}
+}
+
 #else /* GROUP_IOSCHED */
 void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
 {
@@ -2143,7 +2296,12 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
 	ioq->efqd = efqd;
 	elv_ioq_set_ioprio_class(ioq, ioprio_class);
 	elv_ioq_set_ioprio(ioq, ioprio);
-	ioq->pid = current->pid;
+
+	if (elv_iosched_single_ioq(eq))
+		ioq->pid = 0;
+	else
+		ioq->pid = current->pid;
+
 	ioq->sched_queue = sched_queue;
 	if (is_sync && !elv_ioq_class_idle(ioq))
 		elv_mark_ioq_idle_window(ioq);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 7d3434b..5a15329 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -236,6 +236,9 @@ struct io_group {
 	/* async_queue and idle_queue are used only for cfq */
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
+
+	/* Single ioq per group, used for noop, deadline, anticipatory */
+	struct io_queue *ioq;
 };
 
 /**
@@ -507,6 +510,28 @@ static inline bfq_weight_t iog_weight(struct io_group *iog)
 	return iog->entity.weight;
 }
 
+extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask);
+extern void elv_fq_unset_request_ioq(struct request_queue *q,
+					struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
+/* Returns single ioq associated with the io group. */
+static inline struct io_queue *io_group_ioq(struct io_group *iog)
+{
+	BUG_ON(!iog);
+	return iog->ioq;
+}
+
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+	BUG_ON(!iog);
+	/* io group reference. Will be dropped when group is destroyed. */
+	elv_get_ioq(ioq);
+	iog->ioq = ioq;
+}
+
 #else /* !GROUP_IOSCHED */
 /*
  * No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -538,6 +563,32 @@ static inline bfq_weight_t iog_weight(struct io_group *iog)
 	return 0;
 }
 
+/* Returns single ioq associated with the io group. */
+static inline struct io_queue *io_group_ioq(struct io_group *iog)
+{
+	return NULL;
+}
+
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+}
+
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+					struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* GROUP_IOSCHED */
 
 /* Functions used by blksysfs.c */
@@ -655,5 +706,21 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 {
 	return 1;
 }
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+					struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index f6725f2..e634a2f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -211,6 +211,14 @@ static void *elevator_alloc_sched_queue(struct request_queue *q,
 {
 	void *sched_queue = NULL;
 
+	/*
+	 * If fair queuing is enabled, then queue allocation takes place
+	 * in the set_request() path, when a request actually comes
+	 * in.
+	 */
+	if (elv_iosched_fair_queuing_enabled(eq))
+		return NULL;
+
 	if (eq->ops->elevator_alloc_sched_queue_fn) {
 		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
 								GFP_KERNEL);
@@ -965,6 +973,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	elv_fq_set_request_io_group(q, rq);
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e))
+		return elv_fq_set_request_ioq(q, rq, gfp_mask);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
@@ -976,6 +991,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e)) {
+		elv_fq_unset_request_ioq(q, rq);
+		return;
+	}
+
 	if (e->ops->elevator_put_req_fn)
 		e->ops->elevator_put_req_fn(rq);
 }
@@ -1347,9 +1371,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
  * Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of the task and retrieve
+ * the ioq pointer from that. This is used only by single-queue ioschedulers
+ * to retrieve the queue associated with the group and decide whether the
+ * new bio can do a front merge or not.
  */
 void *elv_get_sched_queue_current(struct request_queue *q)
 {
-	return q->elevator->sched_queue;
+	/* Fair queuing is not enabled. There is only one queue. */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	return ioq_sched_queue(elv_lookup_ioq_current(q));
 }
 EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 3729a2f..ee38d08 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -249,17 +249,31 @@ enum {
 /* iosched wants to use fq logic of elevator layer */
 #define	ELV_IOSCHED_NEED_FQ	1
 
+/* iosched maintains only a single ioq per group. */
+#define ELV_IOSCHED_SINGLE_IOQ        2
+
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
 }
 
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
 #else /* ELV_IOSCHED_FAIR_QUEUING */
 
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return 0;
 }
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return 0;
+}
+
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 11/18] io-controller: noop changes for hierarchical fair queuing
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (18 preceding siblings ...)
  2009-05-05 19:58 ` [PATCH 11/18] io-controller: noop changes for hierarchical fair queuing Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` [PATCH 12/18] io-controller: deadline " Vivek Goyal
                   ` (17 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda
  Cc: vgoyal, akpm

This patch changes noop to use the queue scheduling code from the elevator
layer. One can go back to the old noop behaviour by deselecting
CONFIG_IOSCHED_NOOP_HIER.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
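For illustration (not part of the patch), a rough way to exercise this; the
device name and cgroup mount point below are just examples, and the
cgroup-side files come from the earlier patches in this series:

# Build with CONFIG_IOSCHED_NOOP_HIER=y, then switch the disk to noop:
echo noop > /sys/block/sdb/queue/scheduler

# Create a couple of cgroups and run IO from tasks in each of them; the
# elevator fair queuing code then arbitrates between the per-cgroup noop
# queues instead of serving a single global FIFO.
mkdir /cgroup/test1 /cgroup/test2
echo $$ > /cgroup/test1/tasks
dd if=/dev/sdb of=/dev/null bs=1M count=512 &
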
 block/Kconfig.iosched |   11 +++++++++++
 block/noop-iosched.c  |    3 +++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a91a807..9da6657 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,17 @@ config IOSCHED_NOOP
 	  that do their own scheduling and require only minimal assistance from
 	  the kernel.
 
+config IOSCHED_NOOP_HIER
+	bool "Noop Hierarchical Scheduling support"
+	depends on IOSCHED_NOOP && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarhical scheduling in noop. In this mode noop keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_AS
 	tristate "Anticipatory I/O scheduler"
 	default y
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..73e571d 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -92,6 +92,9 @@ static struct elevator_type elevator_noop = {
 		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
 		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
 };
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 11/18] io-controller: noop changes for hierarchical fair queuing
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (17 preceding siblings ...)
  2009-05-05 19:58 ` [PATCH 10/18] io-conroller: Prepare elevator layer for single queue schedulers Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` Vivek Goyal
                   ` (18 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando
  Cc: akpm, vgoyal

This patch changes noop to use queue scheduling code from elevator layer.
One can go back to old noop by deselecting CONFIG_IOSCHED_NOOP_HIER.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |   11 +++++++++++
 block/noop-iosched.c  |    3 +++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a91a807..9da6657 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,17 @@ config IOSCHED_NOOP
 	  that do their own scheduling and require only minimal assistance from
 	  the kernel.
 
+config IOSCHED_NOOP_HIER
+	bool "Noop Hierarchical Scheduling support"
+	depends on IOSCHED_NOOP && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarhical scheduling in noop. In this mode noop keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_AS
 	tristate "Anticipatory I/O scheduler"
 	default y
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..73e571d 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -92,6 +92,9 @@ static struct elevator_type elevator_noop = {
 		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
 		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
 };
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 12/18] io-controller: deadline changes for hierarchical fair queuing
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (10 preceding siblings ...)
  2009-05-05 19:58   ` [PATCH 11/18] io-controller: noop changes for hierarchical fair queuing Vivek Goyal
@ 2009-05-05 19:58   ` Vivek Goyal
  2009-05-05 19:58   ` [PATCH 13/18] io-controller: anticipatory " Vivek Goyal
                     ` (9 subsequent siblings)
  21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

This patch changes deadline to use the queue scheduling code from the elevator
layer. One can go back to the old deadline behaviour by deselecting
CONFIG_IOSCHED_DEADLINE_HIER.

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched    |   11 +++++++++++
 block/deadline-iosched.c |    3 +++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 9da6657..3a9e7d7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
 	  a disk at any one time, its behaviour is almost identical to the
 	  anticipatory I/O scheduler and so is a good choice.
 
+config IOSCHED_DEADLINE_HIER
+	bool "Deadline Hierarchical Scheduling support"
+	depends on IOSCHED_DEADLINE && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in deadline. In this mode deadline keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
 	select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5e65041..27b77b9 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -477,6 +477,9 @@ static struct elevator_type iosched_deadline = {
 		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
 		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 13/18] io-controller: anticipatory changes for hierarchical fair queuing
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (21 preceding siblings ...)
  2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` Vivek Goyal
                   ` (14 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando
  Cc: akpm, vgoyal

This patch changes the anticipatory scheduler (AS) to use the queue scheduling
code from the elevator layer. One can go back to the old AS behaviour by
deselecting CONFIG_IOSCHED_AS_HIER.
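
For orientation, here is a minimal standalone sketch (not kernel code) of the
decision as_expire_ioq() makes below when the common fair queuing layer asks,
via elevator_expire_ioq_fn, to expire the active queue. The struct and field
names are simplified stand-ins for the real AS state.

#include <stdio.h>

/* Simplified stand-in for the bits of AS state that drive the decision. */
struct as_expire_state {
	int changed_batch;	/* batch changeover signalled, old batch draining */
	int nr_dispatched;	/* requests still outstanding at the drive */
	int anticipating;	/* AS is idling, waiting for a nearby request */
};

/*
 * Return 1 to let the common layer expire the active queue (after the batch
 * context has been saved), 0 to keep the current queue for now.
 */
static int expire_decision(const struct as_expire_state *s,
			   int slice_expired, int force)
{
	if (force)
		return 1;	/* forced drain: no choice */
	if (s->changed_batch || s->nr_dispatched)
		return 0;	/* let the current batch finish first */
	if (s->anticipating && !slice_expired)
		return 0;	/* keep anticipating within the slice */
	return 1;
}

int main(void)
{
	struct as_expire_state busy = { .nr_dispatched = 3 };

	/* Slice expired, but requests are still in flight: keep the queue. */
	printf("expire? %d\n", expire_decision(&busy, 1, 0));
	return 0;
}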

TODO/Issues
===========
- The AS anticipation logic does not seem to be sufficient to provide a BW
  difference when two "dd" processes run in two different cgroups. Needs to be
  looked into.

- AS adjusts the number of requests in a write batch upon every W->R batch
  direction switch. This automatic adjustment depends on how much time a read
  takes after a W->R switch.

  This does not gel very well when hierarchical scheduling is enabled and
  every io group can have its own read/write batch. If io group switching
  takes place in that situation, it creates issues (see the batch-state
  sketch after this list).

  Currently I have disabled write batch length adjustment in hierarchical
  mode.

- Currently performance seems to be very bad in hierarchical mode. Needs
  to be looked into.

- I think the whole idea of the common layer doing time slice switching
  between queues while each queue in turn runs timed batches is not very good.
  Maybe AS can maintain two queues (one for READs and the other for WRITEs)
  and let the common layer do the time slice switching between these two
  queues.
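
Similarly, here is a minimal standalone sketch (not kernel code) of the
per-queue batch state that has to be saved when a queue is switched out in the
middle of a batch and restored when it is scheduled back in. The real fields
are current_batch_time_left and saved_batch_data_dir in the as_queue changes
below; the jiffies arithmetic here is simplified.

#include <stdio.h>

/* Stand-ins for the new as_queue fields and the as_data fields they mirror. */
struct asq_sketch {
	long current_batch_time_left;	/* remaining batch time, in jiffies */
	int saved_batch_data_dir;	/* direction of the saved batch */
};

struct ad_sketch {
	unsigned long current_batch_expires;	/* absolute expiry, in jiffies */
	int batch_data_dir;
};

/* Called when the queue is switched out before its batch has expired. */
static void save_batch_context(struct ad_sketch *ad, struct asq_sketch *asq,
			       unsigned long now)
{
	long left = (long)(ad->current_batch_expires - now);

	asq->saved_batch_data_dir = ad->batch_data_dir;
	asq->current_batch_time_left = left > 0 ? left : 0;
}

/* Called when the queue is scheduled back in. */
static void restore_batch_context(struct ad_sketch *ad, struct asq_sketch *asq,
				  unsigned long now)
{
	if (asq->current_batch_time_left)
		ad->current_batch_expires = now + asq->current_batch_time_left;
	ad->batch_data_dir = asq->saved_batch_data_dir;
}

int main(void)
{
	struct ad_sketch ad = { .current_batch_expires = 1100, .batch_data_dir = 1 };
	struct asq_sketch asq;

	save_batch_context(&ad, &asq, 1000);	/* 100 jiffies of batch left */
	restore_batch_context(&ad, &asq, 2000);	/* resume with 100 jiffies left */
	printf("batch now expires at %lu\n", ad.current_batch_expires); /* 2100 */
	return 0;
}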

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   12 +++
 block/as-iosched.c       |  177 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.c      |   76 ++++++++++++++++----
 include/linux/elevator.h |   16 ++++
 4 files changed, 266 insertions(+), 15 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3a9e7d7..77fc786 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,6 +45,18 @@ config IOSCHED_AS
 	  deadline I/O scheduler, it can also be slower in some cases
 	  especially some database loads.
 
+config IOSCHED_AS_HIER
+	bool "Anticipatory Hierarchical Scheduling support"
+	depends on IOSCHED_AS && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in anticipatory. In this mode
+	  anticipatory keeps one IO queue per cgroup instead of a global
+	  queue. Elevator fair queuing logic ensures fairness among various
+	  queues.
+
 config IOSCHED_DEADLINE
 	tristate "Deadline I/O scheduler"
 	default y
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 7158e13..12aea88 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -84,6 +84,19 @@ struct as_queue {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+
+	/*
+	 * If an as_queue is switched while a batch is running, then we
+	 * store the time left before current batch will expire
+	 */
+	long current_batch_time_left;
+
+	/*
+	 * batch data dir when queue was scheduled out. This will be used
+	 * to setup ad->batch_data_dir when queue is scheduled in.
+	 */
+	int saved_batch_data_dir;
+
 	unsigned long last_check_fifo[2];
 	int write_batch_count;		/* max # of reqs in a write batch */
 	int current_write_count;	/* how many requests left this batch */
@@ -150,6 +163,141 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
 
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq);
 static void as_antic_stop(struct as_data *ad);
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
+
+#ifdef CONFIG_IOSCHED_AS_HIER
+static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Save batch data dir */
+	asq->saved_batch_data_dir = ad->batch_data_dir;
+
+	if (ad->changed_batch) {
+		/*
+		 * In case of a forced expiry, we come here. Batch changeover
+		 * has been signalled but we are waiting for all the requests
+		 * from the previous batch to finish before starting the new
+		 * batch. We can't wait now. Mark that the full batch time
+		 * needs to be allocated when this queue is scheduled again.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->changed_batch = 0;
+		return;
+	}
+
+	if (ad->new_batch) {
+		/*
+		 * We should come here only when new_batch has been set
+		 * but no read request has been issued or if it is a forced
+		 * expiry.
+		 *
+		 * In both the cases, new batch has not started yet so
+		 * allocate full batch length for next scheduling opportunity.
+		 * We don't do write batch size adjustment in hierarchical
+		 * AS so that should not be an issue.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->new_batch = 0;
+		return;
+	}
+
+	/* Save how much time is left before current batch expires */
+	if (as_batch_expired(ad, asq))
+		asq->current_batch_time_left = 0;
+	else {
+		asq->current_batch_time_left = ad->current_batch_expires
+							- jiffies;
+		BUG_ON((asq->current_batch_time_left) < 0);
+	}
+}
+
+/*
+ * FIXME: In the original AS, a read batch's time accounting started only
+ * after the first request had completed (if the last batch was a write
+ * batch). But here we might be rescheduling a read batch right away,
+ * irrespective of the disk cache state.
+ */
+static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Adjust the batch expire time */
+	if (asq->current_batch_time_left)
+		ad->current_batch_expires = jiffies +
+						asq->current_batch_time_left;
+	/* restore asq batch_data_dir info */
+	ad->batch_data_dir = asq->saved_batch_data_dir;
+}
+
+/* ioq has been set. */
+static void as_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
+{
+	struct as_queue *asq = sched_queue;
+	struct as_data *ad = q->elevator->elevator_data;
+
+	as_restore_batch_context(ad, asq);
+}
+
+/*
+ * This is a notification from the common layer that it wishes to expire this
+ * io queue. AS decides whether the queue can be expired; if yes, it also
+ * saves the batch context.
+ */
+static int as_expire_ioq(struct request_queue *q, void *sched_queue,
+				int slice_expired, int force)
+{
+	struct as_data *ad = q->elevator->elevator_data;
+	int status = ad->antic_status;
+	struct as_queue *asq = sched_queue;
+
+	/* Forced expiry. We don't have a choice */
+	if (force) {
+		as_antic_stop(ad);
+		as_save_batch_context(ad, asq);
+		return 1;
+	}
+
+	/*
+	 * We are waiting for requests to finish from last
+	 * batch. Don't expire the queue now
+	 */
+	if (ad->changed_batch)
+		goto keep_queue;
+
+	/*
+	 * Wait for all requests from the existing batch to finish before we
+	 * switch the queue. The new queue might change the batch direction,
+	 * and this is to be consistent with the AS philosophy of not
+	 * dispatching new requests to the underlying drive till requests
+	 * from the previous batch are completed.
+	 */
+	if (ad->nr_dispatched)
+		goto keep_queue;
+
+	/*
+	 * If AS anticipation is ON, stop it if slice expired, otherwise
+	 * keep the queue.
+	 */
+	if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
+		if (slice_expired)
+			as_antic_stop(ad);
+		else
+			/*
+			 * We are anticipating and the time slice has not
+			 * expired, so we would rather wait than break the
+			 * anticipation and expire the queue.
+			 */
+			goto keep_queue;
+	}
+
+	/* We are good to expire the queue. Save batch context */
+	as_save_batch_context(ad, asq);
+	return 1;
+
+keep_queue:
+	return 0;
+}
+#endif
 
 /*
  * IO Context helper functions
@@ -805,6 +953,7 @@ static void as_update_rq(struct as_data *ad, struct request *rq)
 	}
 }
 
+#ifndef CONFIG_IOSCHED_AS_HIER
 /*
  * Gathers timings and resizes the write batch automatically
  */
@@ -833,6 +982,7 @@ static void update_write_batch(struct as_data *ad)
 	if (asq->write_batch_count < 1)
 		asq->write_batch_count = 1;
 }
+#endif /* !CONFIG_IOSCHED_AS_HIER */
 
 /*
  * as_completed_request is to be called when a request has completed and
@@ -867,7 +1017,26 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	 * and writeback caches
 	 */
 	if (ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)) {
+#ifndef CONFIG_IOSCHED_AS_HIER
+		/*
+		 * Dynamic update of the write batch length is disabled
+		 * for hierarchical scheduling. It is difficult to do
+		 * accurate accounting when a queue switch can take place
+		 * in the middle of a batch.
+		 *
+		 * Say, A, B are two groups. Following is the sequence of
+		 * events.
+		 *
+		 * Servicing Write batch of A.
+		 * Queue switch takes place and write batch of B starts.
+		 * Batch switch takes place and read batch of B starts.
+		 *
+		 * In the above scenario, writes issued in the write batch
+		 * of A might impact the write batch length of B, which is
+		 * not good.
+		 */
 		update_write_batch(ad);
+#endif
 		ad->current_batch_expires = jiffies +
 				ad->batch_expire[BLK_RW_SYNC];
 		ad->new_batch = 0;
@@ -1516,8 +1685,14 @@ static struct elevator_type iosched_as = {
 		.trim =				as_trim,
 		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
 		.elevator_free_sched_queue_fn = as_free_as_queue,
+#ifdef CONFIG_IOSCHED_AS_HIER
+		.elevator_expire_ioq_fn =       as_expire_ioq,
+		.elevator_active_ioq_set_fn =   as_active_ioq_set,
 	},
-
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ | ELV_IOSCHED_DONT_IDLE,
+#else
+	},
+#endif
 	.elevator_attrs = as_attrs,
 	.elevator_name = "anticipatory",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index f2805e6..02c27ac 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -36,6 +36,8 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
 struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 						 int extract);
 void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+					int force);
 
 static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
 					unsigned short prio)
@@ -2230,6 +2232,9 @@ static void elv_ioq_update_idle_window(struct elevator_queue *eq,
 	int old_idle, enable_idle;
 	struct elv_fq_data *efqd = ioq->efqd;
 
+	/* If idling is disabled from ioscheduler, return */
+	if (!elv_gen_idling_enabled(eq))
+		return;
 	/*
 	 * Don't idle for async or idle io prio class
 	 */
@@ -2303,7 +2308,7 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
 		ioq->pid = current->pid;
 
 	ioq->sched_queue = sched_queue;
-	if (is_sync && !elv_ioq_class_idle(ioq))
+	if (elv_gen_idling_enabled(eq) && is_sync && !elv_ioq_class_idle(ioq))
 		elv_mark_ioq_idle_window(ioq);
 	bfq_init_entity(&ioq->entity, iog);
 	ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
@@ -2718,16 +2723,18 @@ int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
 {
 	elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
-	elv_ioq_slice_expired(q);
+	if (elv_iosched_expire_ioq(q, 0, 1)) {
+		elv_ioq_slice_expired(q);
 
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
+		/*
+		 * Put the new queue at the front of the current list,
+		 * so we know that it will be selected next.
+		 */
 
-	elv_activate_ioq(ioq, 1);
-	elv_ioq_set_slice_end(ioq, 0);
-	elv_mark_ioq_slice_new(ioq);
+		elv_activate_ioq(ioq, 1);
+		elv_ioq_set_slice_end(ioq, 0);
+		elv_mark_ioq_slice_new(ioq);
+	}
 }
 
 void elv_ioq_request_add(struct request_queue *q, struct request *rq)
@@ -2906,11 +2913,44 @@ void elv_free_idle_ioq_list(struct elevator_queue *e)
 		elv_deactivate_ioq(efqd, ioq, 0);
 }
 
+/*
+ * Tell the iosched that the elevator wants to expire the queue. This gives
+ * an iosched like AS a chance to say no (if it is in the middle of a batch
+ * changeover or is anticipating). It also lets the iosched do housekeeping.
+ *
+ * force--> this is a forced dispatch and the iosched must clean up its
+ *	     state. This is useful when the elevator wants to drain the
+ *	     iosched and expire the current active queue.
+ *
+ * slice_expired--> if 1, the ioq slice has expired and the elevator fair
+ *		    queuing logic wants to switch queues. The iosched should
+ *		    allow that unless strictly necessary. Currently AS can
+ *		    deny the switch if in the middle of a batch switch.
+ *
+ *		    if 0, the time slice is still remaining. It is up to the
+ *		    iosched whether it wants to keep waiting on this queue or
+ *		    expire it and move on to the next queue.
+ *
+ */
+int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+					int force)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (e->ops->elevator_expire_ioq_fn)
+		return e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
+							slice_expired, force);
+
+	return 1;
+}
+
 /* Common layer function to select the next queue to dispatch from */
 void *elv_fq_select_ioq(struct request_queue *q, int force)
 {
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+	int slice_expired = 1;
 
 	if (!elv_nr_busy_ioq(q->elevator))
 		return NULL;
@@ -2984,8 +3024,14 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 		goto keep_queue;
 	}
 
+	slice_expired = 0;
 expire:
-	elv_ioq_slice_expired(q);
+	if (elv_iosched_expire_ioq(q, slice_expired, force))
+		elv_ioq_slice_expired(q);
+	else {
+		ioq = NULL;
+		goto keep_queue;
+	}
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
@@ -3146,7 +3192,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		}
 
 		if (elv_ioq_class_idle(ioq)) {
-			elv_ioq_slice_expired(q);
+			if (elv_iosched_expire_ioq(q, 1, 0))
+				elv_ioq_slice_expired(q);
 			goto done;
 		}
 
@@ -3170,9 +3217,10 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		 * mean seek distance, give them a chance to run instead
 		 * of idling.
 		 */
-		if (elv_ioq_slice_used(ioq))
-			elv_ioq_slice_expired(q);
-		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+		if (elv_ioq_slice_used(ioq)) {
+			if (elv_iosched_expire_ioq(q, 1, 0))
+				elv_ioq_slice_expired(q);
+		} else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
 			 && sync && !rq_noidle(rq))
 			elv_ioq_arm_slice_timer(q, 0);
 	}
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index ee38d08..cbfce0b 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -42,6 +42,7 @@ typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
 						struct request*);
 typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
 						void*, int probe);
+typedef int (elevator_expire_ioq_fn) (struct request_queue*, void *, int, int);
 #endif
 
 struct elevator_ops
@@ -81,6 +82,7 @@ struct elevator_ops
 	elevator_should_preempt_fn *elevator_should_preempt_fn;
 	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
 	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+	elevator_expire_ioq_fn  *elevator_expire_ioq_fn;
 #endif
 };
 
@@ -252,6 +254,9 @@ enum {
 /* iosched maintains only single ioq per group.*/
 #define ELV_IOSCHED_SINGLE_IOQ        2
 
+/* iosched does not need anticipation/idling logic support from common layer */
+#define ELV_IOSCHED_DONT_IDLE	4
+
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
@@ -262,6 +267,12 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
 }
 
+/* returns 1 if elevator layer should enable its idling logic, 0 otherwise */
+static inline int elv_gen_idling_enabled(struct elevator_queue *e)
+{
+	return !((e->elevator_type->elevator_features) & ELV_IOSCHED_DONT_IDLE);
+}
+
 #else /* ELV_IOSCHED_FAIR_QUEUING */
 
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
@@ -274,6 +285,11 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
 	return 0;
 }
 
+static inline int elv_gen_idling_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-- 
1.6.0.1
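
To make the new elevator_expire_ioq_fn hook concrete, here is a minimal,
hypothetical sketch of an implementation (illustration only, not part of
the patch). It follows the contract described in the comment above
elv_iosched_expire_ioq(): return 1 to let the fair queuing layer expire
the active queue, return 0 to veto the switch and keep it.

#include <linux/blkdev.h>
#include <linux/elevator.h>

/*
 * Hypothetical scheduler hook (illustration only). The signature matches
 * elevator_expire_ioq_fn: (request_queue, sched_queue, slice_expired, force).
 */
static int example_expire_ioq(struct request_queue *q, void *sched_queue,
			      int slice_expired, int force)
{
	/* Forced dispatch (elevator drain): always give up the queue. */
	if (force)
		return 1;

	/* Our time slice is over: let the fair queuing layer move on. */
	if (slice_expired)
		return 1;

	/*
	 * Slice still running: a scheduler with per-queue state (an
	 * unfinished batch or pending anticipation, as in AS above) could
	 * return 0 here to keep the queue. This trivial example simply
	 * allows the switch.
	 */
	return 1;
}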

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 14/18] blkio_cgroup patches from Ryo to track async bios.
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (12 preceding siblings ...)
  2009-05-05 19:58   ` [PATCH 13/18] io-controller: anticipatory " Vivek Goyal
@ 2009-05-05 19:58   ` Vivek Goyal
  2009-05-05 19:58   ` [PATCH 15/18] io-controller: map async requests to appropriate cgroup Vivek Goyal
                     ` (7 subsequent siblings)
  21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o blkio_cgroup patches from Ryo to track async bios.

o Fernando is also working on another IO tracking mechanism. We are not
  tied to any particular IO tracking mechanism; this patchset can make use
  of whichever mechanism makes it upstream. For the time being we are
  making use of Ryo's posting.
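
As a rough usage sketch (illustration only, not part of this patch; the
example_* function below is hypothetical), a block layer consumer such as
an IO scheduler or a dm driver could map an async bio back to its owning
cgroup with the helpers added here:

#include <linux/bio.h>
#include <linux/biotrack.h>
#include <linux/iocontext.h>
#include <linux/kernel.h>

/*
 * Hypothetical consumer: attribute an async bio to the cgroup that owns
 * the page it carries.
 */
static void example_attribute_bio(struct bio *bio)
{
	/* 0 means the page belongs to the default blkio cgroup. */
	unsigned long id = get_blkio_cgroup_id(bio);

	/* Per-cgroup io_context; the helper takes a reference for us. */
	struct io_context *ioc = get_blkio_cgroup_iocontext(bio);

	pr_debug("bio belongs to blkio cgroup id %lu\n", id);

	/*
	 * A real consumer would look up or create its per-group queue
	 * keyed by "id" or by "ioc" here.
	 */

	put_io_context(ioc);	/* drop the reference taken above */
}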

Based on 2.6.30-rc3-git3
Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
---
 block/blk-ioc.c               |   37 +++---
 fs/buffer.c                   |    2 +
 fs/direct-io.c                |    2 +
 include/linux/biotrack.h      |   97 +++++++++++++
 include/linux/cgroup_subsys.h |    6 +
 include/linux/iocontext.h     |    1 +
 include/linux/memcontrol.h    |    6 +
 include/linux/mmzone.h        |    4 +-
 include/linux/page_cgroup.h   |   31 ++++-
 init/Kconfig                  |   15 ++
 mm/Makefile                   |    4 +-
 mm/biotrack.c                 |  300 +++++++++++++++++++++++++++++++++++++++++
 mm/bounce.c                   |    2 +
 mm/filemap.c                  |    2 +
 mm/memcontrol.c               |    6 +
 mm/memory.c                   |    5 +
 mm/page-writeback.c           |    2 +
 mm/page_cgroup.c              |   17 ++-
 mm/swap_state.c               |    2 +
 19 files changed, 511 insertions(+), 30 deletions(-)
 create mode 100644 include/linux/biotrack.h
 create mode 100644 mm/biotrack.c

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 8f0f6cf..ccde40e 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -84,27 +84,32 @@ void exit_io_context(void)
 	}
 }
 
+void init_io_context(struct io_context *ioc)
+{
+	atomic_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+	ioc->cgroup_changed = 0;
+#endif
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
 	struct io_context *ret;
 
 	ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-#ifdef CONFIG_GROUP_IOSCHED
-		ret->cgroup_changed = 0;
-#endif
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
+	if (ret)
+		init_io_context(ret);
 
 	return ret;
 }
diff --git a/fs/buffer.c b/fs/buffer.c
index b3e5be7..79118d4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
 #include <linux/buffer_head.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/bio.h>
+#include <linux/biotrack.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
+		blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 05763bb..60b1a99 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/biotrack.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
 			ret = PTR_ERR(page);
 			goto out;
 		}
+		blkio_cgroup_reset_owner(page, current->mm);
 
 		while (block_in_page < blocks_per_page) {
 			unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
new file mode 100644
index 0000000..741a8b5
--- /dev/null
+++ b/include/linux/biotrack.h
@@ -0,0 +1,97 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef	CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+	struct cgroup_subsys_state css;
+	struct io_context *io_context;	/* default io_context */
+/*	struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc:		page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, 0);
+	unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_disabled - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+	if (blkio_cgroup_subsys.disabled)
+		return true;
+	return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						 struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *blkio_cgroup_lookup(int id);
+
+#else	/* CONFIG_CGROUP_BLKIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+	return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	return 0;
+}
+
+#endif	/* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 68ea6bd..f214e6e 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
 
 /* */
 
+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
 #ifdef CONFIG_CGROUP_DEVICE
 SUBSYS(devices)
 #endif
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 51664bb..ed52a1f 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -109,6 +109,7 @@ int put_io_context(struct io_context *ioc);
 void exit_io_context(void);
 struct io_context *get_io_context(gfp_t gfp_flags, int node);
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
 void copy_io_context(struct io_context **pdst, struct io_context **psrc);
 #else
 static inline void exit_io_context(void)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a9e3b76..e80e335 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
  * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
  */
 
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
 extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 /* for swap handling */
@@ -120,6 +122,10 @@ extern bool mem_cgroup_oom_called(struct task_struct *task);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
 static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 186ec6a..47a6f55 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -607,7 +607,7 @@ typedef struct pglist_data {
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	struct page_cgroup *node_page_cgroup;
 #endif
 #endif
@@ -958,7 +958,7 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	/*
 	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7339c7b..dd7f71c 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
 #ifndef __LINUX_PAGE_CGROUP_H
 #define __LINUX_PAGE_CGROUP_H
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 #include <linux/bit_spinlock.h>
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -12,9 +12,11 @@
  */
 struct page_cgroup {
 	unsigned long flags;
-	struct mem_cgroup *mem_cgroup;
 	struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	struct mem_cgroup *mem_cgroup;
 	struct list_head lru;		/* per cgroup LRU list */
+#endif
 };
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -71,7 +73,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 struct page_cgroup;
 
 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
@@ -122,4 +124,27 @@ static inline void swap_cgroup_swapoff(int type)
 }
 
 #endif
+
+#ifdef CONFIG_CGROUP_BLKIO
+/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PCG_TRACKING_ID_SHIFT	(16)
+#define PCG_TRACKING_ID_BITS \
+	(8 * sizeof(unsigned long) - PCG_TRACKING_ID_SHIFT)
+
+/* NOTE: must be called with the page_cgroup lock held */
+static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
+{
+	return pc->flags >> PCG_TRACKING_ID_SHIFT;
+}
+
+/* NOTE: must be called with the page_cgroup lock held */
+static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
+{
+	WARN_ON(id >= (1UL << PCG_TRACKING_ID_BITS));
+	pc->flags &= (1UL << PCG_TRACKING_ID_SHIFT) - 1;
+	pc->flags |= (unsigned long)(id << PCG_TRACKING_ID_SHIFT);
+}
+#endif
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index 1a4686d..ee16d6f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -616,6 +616,21 @@ config GROUP_IOSCHED
 
 endif # CGROUPS
 
+config CGROUP_BLKIO
+	bool "Block I/O cgroup subsystem"
+	depends on CGROUPS && BLOCK
+	select MM_OWNER
+	help
+	  Provides a resource controller which makes it possible to track
+	  the owner of every block I/O request.
+	  The information this subsystem provides can be used by any
+	  kind of module, such as the dm-ioband device mapper module or
+	  the CFQ scheduler.
+
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
+
 config MM_OWNER
 	bool
 
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..76c3436 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,4 +37,6 @@ else
 obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..2baf1f0
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,300 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
+ * Use part of page_cgroup->flags to store blkio-cgroup ID.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request,
+ * because every I/O request has a target page and the owner of the page
+ * can easily be determined within that framework.
+ */
+
+/* Return the blkio_cgroup that associates with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup that associates with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+	.io_context	= &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *biog;
+	struct page_cgroup *pc;
+	unsigned long id;
+
+	if (blkio_cgroup_disabled())
+		return;
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, 0);	/* 0: default blkio_cgroup id */
+	unlock_page_cgroup(pc);
+	if (!mm)
+		return;
+
+	rcu_read_lock();
+	biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!biog)) {
+		rcu_read_unlock();
+		return;
+	}
+	/*
+	 * css_get(&biog->css) isn't called to increment the reference
+	 * count of this blkio_cgroup "biog", so the css_id might become
+	 * invalid even if this page is still active.
+	 * This approach is chosen to minimize the overhead.
+	 */
+	id = css_id(&biog->css);
+	rcu_read_unlock();
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, id);
+	unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+	blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+	if (!page_is_file_cache(page))
+		return;
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage:	the page where we want to copy the owner
+ * @opage:	the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+	struct page_cgroup *npc, *opc;
+	unsigned long id;
+
+	if (blkio_cgroup_disabled())
+		return;
+	npc = lookup_page_cgroup(npage);
+	if (unlikely(!npc))
+		return;
+	opc = lookup_page_cgroup(opage);
+	if (unlikely(!opc))
+		return;
+
+	lock_page_cgroup(opc);
+	lock_page_cgroup(npc);
+	id = page_cgroup_get_id(opc);
+	page_cgroup_set_id(npc, id);
+	unlock_page_cgroup(npc);
+	unlock_page_cgroup(opc);
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+
+	if (!cgrp->parent) {
+		biog = &default_blkio_cgroup;
+		init_io_context(biog->io_context);
+		/* Increment the reference count so that it is never released. */
+		atomic_inc(&biog->io_context->refcount);
+		return &biog->css;
+	}
+
+	biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+	if (!biog)
+		return ERR_PTR(-ENOMEM);
+	ioc = alloc_io_context(GFP_KERNEL, -1);
+	if (!ioc) {
+		kfree(biog);
+		return ERR_PTR(-ENOMEM);
+	}
+	biog->io_context = ioc;
+	return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	put_io_context(biog->io_context);
+	free_css_id(&blkio_cgroup_subsys, &biog->css);
+	kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value of zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	struct page_cgroup *pc;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc) {
+		lock_page_cgroup(pc);
+		id = page_cgroup_get_id(pc);
+		unlock_page_cgroup(pc);
+	}
+	return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	struct cgroup_subsys_state *css;
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+	unsigned long id;
+
+	id = get_blkio_cgroup_id(bio);
+	rcu_read_lock();
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (css)
+		biog = container_of(css, struct blkio_cgroup, css);
+	else
+		biog = &default_blkio_cgroup;
+	ioc = biog->io_context;	/* default io_context for this cgroup */
+	atomic_inc(&ioc->refcount);
+	rcu_read_unlock();
+	return ioc;
+}
+
+/**
+ * blkio_cgroup_lookup() - lookup a cgroup by blkio-cgroup ID
+ * @id:		blkio-cgroup ID
+ *
+ * Returns the cgroup associated with the specified ID, or NULL if lookup
+ * fails.
+ *
+ * Note:
+ * This function should be called under rcu_read_lock().
+ */
+struct cgroup *blkio_cgroup_lookup(int id)
+{
+	struct cgroup *cgrp;
+	struct cgroup_subsys_state *css;
+
+	if (blkio_cgroup_disabled())
+		return NULL;
+
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (!css)
+		return NULL;
+	cgrp = css->cgroup;
+	return cgrp;
+}
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(blkio_cgroup_lookup);
+
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+	unsigned long id;
+
+	rcu_read_lock();
+	id = css_id(&biog->css);
+	rcu_read_unlock();
+	return (u64)id;
+}
+
+
+static struct cftype blkio_files[] = {
+	{
+		.name = "id",
+		.read_u64 = blkio_id_read,
+	},
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return cgroup_add_files(cgrp, ss, blkio_files,
+					ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+	.name		= "blkio",
+	.create		= blkio_cgroup_create,
+	.destroy	= blkio_cgroup_destroy,
+	.populate	= blkio_cgroup_populate,
+	.subsys_id	= blkio_cgroup_subsys_id,
+	.use_id		= 1,
+};
diff --git a/mm/bounce.c b/mm/bounce.c
index e590272..875380c 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
 #include <linux/hash.h>
 #include <linux/highmem.h>
 #include <linux/blktrace_api.h>
+#include <linux/biotrack.h>
 #include <trace/block.h>
 #include <asm/tlbflush.h>
 
@@ -212,6 +213,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 		to->bv_len = from->bv_len;
 		to->bv_offset = from->bv_offset;
 		inc_zone_page_state(to->bv_page, NR_BOUNCE);
+		blkio_cgroup_copy_owner(to->bv_page, page);
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..cee1438 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
+	blkio_cgroup_set_owner(page, current->mm);
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e44fb0f..eeefee3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -128,6 +128,12 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+	pc->mem_cgroup = NULL;
+	INIT_LIST_HEAD(&pc->lru);
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..194bda7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mmu_notifier.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
@@ -2053,6 +2054,7 @@ gotten:
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		blkio_cgroup_set_owner(new_page, mm);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		if (old_page) {
@@ -2497,6 +2499,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	blkio_cgroup_reset_owner(page, mm);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -2560,6 +2563,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	page_add_new_anon_rmap(page, vma, address);
+	blkio_cgroup_set_owner(page, mm);
 	set_pte_at(mm, address, page_table, entry);
 
 	/* No need to invalidate - it was non-present before */
@@ -2712,6 +2716,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			page_add_new_anon_rmap(page, vma, address);
+			blkio_cgroup_set_owner(page, mm);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 30351f0..f0b6d12 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1243,6 +1244,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
+			blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 791905c..e143d04 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,14 +9,15 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <linux/biotrack.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
-	INIT_LIST_HEAD(&pc->lru);
+	__init_mem_page_cgroup(pc);
+	__init_blkio_page_cgroup(pc);
 }
 static unsigned long total_usage;
 
@@ -74,7 +75,7 @@ void __init page_cgroup_init(void)
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -83,12 +84,12 @@ void __init page_cgroup_init(void)
 			goto fail;
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try cgroup_disable=memory option if you"
+	printk(KERN_INFO "please try cgroup_disable=memory,blkio option if you"
 	" don't want\n");
 	return;
 fail:
 	printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
-	printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
+	printk(KERN_CRIT "please try cgroup_disable=memory,blkio boot options\n");
 	panic("Out of memory");
 }
 
@@ -248,7 +249,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -263,8 +264,8 @@ void __init page_cgroup_init(void)
 		hotplug_memory_notifier(page_cgroup_callback, 0);
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
-	" want\n");
+	printk(KERN_INFO "please try cgroup_disable=memory,blkio option"
+	" if you don't want\n");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..a6a40e9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
 
 #include <asm/pgtable.h>
 
@@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

+	printk(KERN_CRIT "please try cgroup_disable=memory,blkio boot options\n");
 	panic("Out of memory");
 }
 
@@ -248,7 +249,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -263,8 +264,8 @@ void __init page_cgroup_init(void)
 		hotplug_memory_notifier(page_cgroup_callback, 0);
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
-	" want\n");
+	printk(KERN_INFO "please try cgroup_disable=memory,blkio option"
+	" if you don't want\n");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..a6a40e9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
 
 #include <asm/pgtable.h>
 
@@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 14/18] blkio_cgroup patches from Ryo to track async bios.
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (23 preceding siblings ...)
  2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` Vivek Goyal
                   ` (12 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando
  Cc: akpm, vgoyal

o blkio_cgroup patches from Ryo to track async bios.

o Fernando is also working on another IO tracking mechanism. We are not
  particular about any specific IO tracking mechanism; this patchset can make
  use of whichever mechanism makes it upstream. For the time being we are
  using Ryo's posting. A short usage sketch of the tracking hooks follows
  below.
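
To make the intended use of the tracking hooks concrete, here is a minimal,
illustrative sketch (not part of the patch) of how a consumer such as an io
scheduler or dm-ioband can resolve the cgroup that owns a bio. The function
names come from include/linux/biotrack.h added below; the helper name itself
is made up for illustration, and the caller is assumed to hold
rcu_read_lock(), as blkio_cgroup_lookup() requires.

/* Illustrative only, not part of this patch. Caller holds rcu_read_lock(). */
static struct cgroup *owner_cgroup_of_bio(struct bio *bio)
{
	/* blkio-cgroup id recorded in the page_cgroup of the bio's page */
	unsigned long id = get_blkio_cgroup_id(bio);

	/* 0 means the page belongs to the default blkio-cgroup */
	if (!id)
		return NULL;

	/* may still return NULL if the cgroup has since gone away */
	return blkio_cgroup_lookup(id);
}

Patch 15/18 uses this pattern in its get_cgroup_from_bio() helper for the
async case.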

Based on 2.6.30-rc3-git3
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
---
 block/blk-ioc.c               |   37 +++---
 fs/buffer.c                   |    2 +
 fs/direct-io.c                |    2 +
 include/linux/biotrack.h      |   97 +++++++++++++
 include/linux/cgroup_subsys.h |    6 +
 include/linux/iocontext.h     |    1 +
 include/linux/memcontrol.h    |    6 +
 include/linux/mmzone.h        |    4 +-
 include/linux/page_cgroup.h   |   31 ++++-
 init/Kconfig                  |   15 ++
 mm/Makefile                   |    4 +-
 mm/biotrack.c                 |  300 +++++++++++++++++++++++++++++++++++++++++
 mm/bounce.c                   |    2 +
 mm/filemap.c                  |    2 +
 mm/memcontrol.c               |    6 +
 mm/memory.c                   |    5 +
 mm/page-writeback.c           |    2 +
 mm/page_cgroup.c              |   17 ++-
 mm/swap_state.c               |    2 +
 19 files changed, 511 insertions(+), 30 deletions(-)
 create mode 100644 include/linux/biotrack.h
 create mode 100644 mm/biotrack.c

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 8f0f6cf..ccde40e 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -84,27 +84,32 @@ void exit_io_context(void)
 	}
 }
 
+void init_io_context(struct io_context *ioc)
+{
+	atomic_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+	ioc->cgroup_changed = 0;
+#endif
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
 	struct io_context *ret;
 
 	ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-#ifdef CONFIG_GROUP_IOSCHED
-		ret->cgroup_changed = 0;
-#endif
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
+	if (ret)
+		init_io_context(ret);
 
 	return ret;
 }
diff --git a/fs/buffer.c b/fs/buffer.c
index b3e5be7..79118d4 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
 #include <linux/buffer_head.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/bio.h>
+#include <linux/biotrack.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
+		blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 05763bb..60b1a99 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/biotrack.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
 			ret = PTR_ERR(page);
 			goto out;
 		}
+		blkio_cgroup_reset_owner(page, current->mm);
 
 		while (block_in_page < blocks_per_page) {
 			unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
new file mode 100644
index 0000000..741a8b5
--- /dev/null
+++ b/include/linux/biotrack.h
@@ -0,0 +1,97 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef	CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+	struct cgroup_subsys_state css;
+	struct io_context *io_context;	/* default io_context */
+/*	struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc:		page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, 0);
+	unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_disabled - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+	if (blkio_cgroup_subsys.disabled)
+		return true;
+	return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						 struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern struct cgroup *blkio_cgroup_lookup(int id);
+
+#else	/* CONFIG_CGROUP_BLKIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+	return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	return 0;
+}
+
+#endif	/* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 68ea6bd..f214e6e 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
 
 /* */
 
+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
 #ifdef CONFIG_CGROUP_DEVICE
 SUBSYS(devices)
 #endif
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 51664bb..ed52a1f 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -109,6 +109,7 @@ int put_io_context(struct io_context *ioc);
 void exit_io_context(void);
 struct io_context *get_io_context(gfp_t gfp_flags, int node);
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
 void copy_io_context(struct io_context **pdst, struct io_context **psrc);
 #else
 static inline void exit_io_context(void)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a9e3b76..e80e335 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
  * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
  */
 
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
 extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 /* for swap handling */
@@ -120,6 +122,10 @@ extern bool mem_cgroup_oom_called(struct task_struct *task);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
 static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 186ec6a..47a6f55 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -607,7 +607,7 @@ typedef struct pglist_data {
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	struct page_cgroup *node_page_cgroup;
 #endif
 #endif
@@ -958,7 +958,7 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	/*
 	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7339c7b..dd7f71c 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
 #ifndef __LINUX_PAGE_CGROUP_H
 #define __LINUX_PAGE_CGROUP_H
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 #include <linux/bit_spinlock.h>
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -12,9 +12,11 @@
  */
 struct page_cgroup {
 	unsigned long flags;
-	struct mem_cgroup *mem_cgroup;
 	struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	struct mem_cgroup *mem_cgroup;
 	struct list_head lru;		/* per cgroup LRU list */
+#endif
 };
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -71,7 +73,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 struct page_cgroup;
 
 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
@@ -122,4 +124,27 @@ static inline void swap_cgroup_swapoff(int type)
 }
 
 #endif
+
+#ifdef CONFIG_CGROUP_BLKIO
+/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PCG_TRACKING_ID_SHIFT	(16)
+#define PCG_TRACKING_ID_BITS \
+	(8 * sizeof(unsigned long) - PCG_TRACKING_ID_SHIFT)
+
+/* NOTE: must be called with lock_page_cgroup() held */
+static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
+{
+	return pc->flags >> PCG_TRACKING_ID_SHIFT;
+}
+
+/* NOTE: must be called with lock_page_cgroup() held */
+static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
+{
+	WARN_ON(id >= (1UL << PCG_TRACKING_ID_BITS));
+	pc->flags &= (1UL << PCG_TRACKING_ID_SHIFT) - 1;
+	pc->flags |= (unsigned long)(id << PCG_TRACKING_ID_SHIFT);
+}
+#endif
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index 1a4686d..ee16d6f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -616,6 +616,21 @@ config GROUP_IOSCHED
 
 endif # CGROUPS
 
+config CGROUP_BLKIO
+	bool "Block I/O cgroup subsystem"
+	depends on CGROUPS && BLOCK
+	select MM_OWNER
+	help
+	  Provides a resource controller which enables tracking of the owner
+	  of every block I/O request.
+	  The information this subsystem provides can be used by any kind of
+	  module, such as the dm-ioband device-mapper module or the
+	  cfq scheduler.
+
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
+
 config MM_OWNER
 	bool
 
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..76c3436 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -37,4 +37,6 @@ else
 obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..2baf1f0
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,300 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka@valinux.co.jp>
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea@gmail.com>
+ * Use part of page_cgroup->flags to store blkio-cgroup ID.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request,
+ * because every I/O request has a target page and the owner of the page
+ * can easily be determined through that framework.
+ */
+
+/* Return the blkio_cgroup that associates with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup that associates with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+	.io_context	= &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *biog;
+	struct page_cgroup *pc;
+	unsigned long id;
+
+	if (blkio_cgroup_disabled())
+		return;
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, 0);	/* 0: default blkio_cgroup id */
+	unlock_page_cgroup(pc);
+	if (!mm)
+		return;
+
+	rcu_read_lock();
+	biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!biog)) {
+		rcu_read_unlock();
+		return;
+	}
+	/*
+	 * css_get(&biog->css) is not called to increment the reference
+	 * count of this blkio_cgroup "biog", so the css_id may become
+	 * invalid even while this page is still active.
+	 * This approach is chosen to minimize the overhead.
+	 */
+	id = css_id(&biog->css);
+	rcu_read_unlock();
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, id);
+	unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+	blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+	if (!page_is_file_cache(page))
+		return;
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage:	the page where we want to copy the owner
+ * @opage:	the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+	struct page_cgroup *npc, *opc;
+	unsigned long id;
+
+	if (blkio_cgroup_disabled())
+		return;
+	npc = lookup_page_cgroup(npage);
+	if (unlikely(!npc))
+		return;
+	opc = lookup_page_cgroup(opage);
+	if (unlikely(!opc))
+		return;
+
+	lock_page_cgroup(opc);
+	lock_page_cgroup(npc);
+	id = page_cgroup_get_id(opc);
+	page_cgroup_set_id(npc, id);
+	unlock_page_cgroup(npc);
+	unlock_page_cgroup(opc);
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+
+	if (!cgrp->parent) {
+		biog = &default_blkio_cgroup;
+		init_io_context(biog->io_context);
+		/* Increment the reference count so that it is never released. */
+		atomic_inc(&biog->io_context->refcount);
+		return &biog->css;
+	}
+
+	biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+	if (!biog)
+		return ERR_PTR(-ENOMEM);
+	ioc = alloc_io_context(GFP_KERNEL, -1);
+	if (!ioc) {
+		kfree(biog);
+		return ERR_PTR(-ENOMEM);
+	}
+	biog->io_context = ioc;
+	return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	put_io_context(biog->io_context);
+	free_css_id(&blkio_cgroup_subsys, &biog->css);
+	kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value of zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	struct page_cgroup *pc;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc) {
+		lock_page_cgroup(pc);
+		id = page_cgroup_get_id(pc);
+		unlock_page_cgroup(pc);
+	}
+	return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the io context of the blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	struct cgroup_subsys_state *css;
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+	unsigned long id;
+
+	id = get_blkio_cgroup_id(bio);
+	rcu_read_lock();
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (css)
+		biog = container_of(css, struct blkio_cgroup, css);
+	else
+		biog = &default_blkio_cgroup;
+	ioc = biog->io_context;	/* default io_context for this cgroup */
+	atomic_inc(&ioc->refcount);
+	rcu_read_unlock();
+	return ioc;
+}
+
+/**
+ * blkio_cgroup_lookup() - lookup a cgroup by blkio-cgroup ID
+ * @id:		blkio-cgroup ID
+ *
+ * Returns the cgroup associated with the specified ID, or NULL if lookup
+ * fails.
+ *
+ * Note:
+ * This function should be called under rcu_read_lock().
+ */
+struct cgroup *blkio_cgroup_lookup(int id)
+{
+	struct cgroup *cgrp;
+	struct cgroup_subsys_state *css;
+
+	if (blkio_cgroup_disabled())
+		return NULL;
+
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (!css)
+		return NULL;
+	cgrp = css->cgroup;
+	return cgrp;
+}
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(blkio_cgroup_lookup);
+
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+	unsigned long id;
+
+	rcu_read_lock();
+	id = css_id(&biog->css);
+	rcu_read_unlock();
+	return (u64)id;
+}
+
+
+static struct cftype blkio_files[] = {
+	{
+		.name = "id",
+		.read_u64 = blkio_id_read,
+	},
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return cgroup_add_files(cgrp, ss, blkio_files,
+					ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+	.name		= "blkio",
+	.create		= blkio_cgroup_create,
+	.destroy	= blkio_cgroup_destroy,
+	.populate	= blkio_cgroup_populate,
+	.subsys_id	= blkio_cgroup_subsys_id,
+	.use_id		= 1,
+};
diff --git a/mm/bounce.c b/mm/bounce.c
index e590272..875380c 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
 #include <linux/hash.h>
 #include <linux/highmem.h>
 #include <linux/blktrace_api.h>
+#include <linux/biotrack.h>
 #include <trace/block.h>
 #include <asm/tlbflush.h>
 
@@ -212,6 +213,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 		to->bv_len = from->bv_len;
 		to->bv_offset = from->bv_offset;
 		inc_zone_page_state(to->bv_page, NR_BOUNCE);
+		blkio_cgroup_copy_owner(to->bv_page, page);
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..cee1438 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
+	blkio_cgroup_set_owner(page, current->mm);
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e44fb0f..eeefee3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -128,6 +128,12 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+	pc->mem_cgroup = NULL;
+	INIT_LIST_HEAD(&pc->lru);
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..194bda7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mmu_notifier.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
@@ -2053,6 +2054,7 @@ gotten:
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		blkio_cgroup_set_owner(new_page, mm);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		if (old_page) {
@@ -2497,6 +2499,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	blkio_cgroup_reset_owner(page, mm);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -2560,6 +2563,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	page_add_new_anon_rmap(page, vma, address);
+	blkio_cgroup_set_owner(page, mm);
 	set_pte_at(mm, address, page_table, entry);
 
 	/* No need to invalidate - it was non-present before */
@@ -2712,6 +2716,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			page_add_new_anon_rmap(page, vma, address);
+			blkio_cgroup_set_owner(page, mm);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 30351f0..f0b6d12 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1243,6 +1244,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
+			blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 791905c..e143d04 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,14 +9,15 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <linux/biotrack.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
-	INIT_LIST_HEAD(&pc->lru);
+	__init_mem_page_cgroup(pc);
+	__init_blkio_page_cgroup(pc);
 }
 static unsigned long total_usage;
 
@@ -74,7 +75,7 @@ void __init page_cgroup_init(void)
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -83,12 +84,12 @@ void __init page_cgroup_init(void)
 			goto fail;
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try cgroup_disable=memory option if you"
+	printk(KERN_INFO "please try cgroup_disable=memory,blkio option if you"
 	" don't want\n");
 	return;
 fail:
 	printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
-	printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
+	printk(KERN_CRIT "please try cgroup_disable=memory,blkio boot options\n");
 	panic("Out of memory");
 }
 
@@ -248,7 +249,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -263,8 +264,8 @@ void __init page_cgroup_init(void)
 		hotplug_memory_notifier(page_cgroup_callback, 0);
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
-	" want\n");
+	printk(KERN_INFO "please try cgroup_disable=memory,blkio option"
+	" if you don't want\n");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..a6a40e9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
 
 #include <asm/pgtable.h>
 
@@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 15/18] io-controller: map async requests to appropriate cgroup
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (13 preceding siblings ...)
  2009-05-05 19:58   ` [PATCH 14/18] blkio_cgroup patches from Ryo to track async bios Vivek Goyal
@ 2009-05-05 19:58   ` Vivek Goyal
  2009-05-05 19:58   ` [PATCH 16/18] io-controller: Per cgroup request descriptor support Vivek Goyal
                     ` (6 subsequent siblings)
  21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o So far we assumed that a bio/rq belongs to the task submitting it. This
  does not hold for async writes. This patch makes use of the blkio_cgroup
  patches to attribute async writes to the right group instead of to the
  task submitting the bio.

o For sync requests, we continue to assume that the io belongs to the task
  submitting it. Only for async requests do we make use of the io tracking
  patches to track the owner cgroup.

o So far cfq always cached the async queue pointer. With async requests no
  longer necessarily tied to the submitting task's io context, caching the
  pointer does not help for async queues. This patch introduces a new config
  option, CONFIG_TRACK_ASYNC_CONTEXT. If this option is not set, cfq retains
  the old behavior where the async queue pointer is cached in the task
  context. If it is set, the async queue pointer is not cached and we use the
  bio tracking patches to determine the group a bio belongs to and then map
  it to the async queue of that group. A condensed lookup sketch follows
  below.
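
For reference, here is a condensed sketch (not part of the patch) of the
lookup flow implemented by the hunks below when CONFIG_TRACK_ASYNC_CONTEXT is
enabled. It mirrors get_cgroup_from_bio() in block/elevator-fq.c; the helper
name is made up, and the rcu locking done by the caller is omitted.

/* Condensed from get_cgroup_from_bio() below; NULL means the root group. */
static struct cgroup *sketch_cgroup_for_bio(struct bio *bio)
{
	unsigned long id;

	/* blk_get_request() can get here without a bio; barriers go to root */
	if (!bio || bio_barrier(bio))
		return NULL;

	/* sync io: charge the cgroup of the submitting task */
	if (elv_bio_sync(bio))
		return task_cgroup(current, io_subsys_id);

	/* async io: use the blkio-cgroup id recorded in the page */
	id = get_blkio_cgroup_id(bio);
	return id ? blkio_cgroup_lookup(id) : NULL;
}

When this returns NULL, io_get_io_group_bio() falls back to the root io group
for request allocation and returns NULL for merge-time lookups, as in the
block/elevator-fq.c hunk below.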

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched    |   16 +++++
 block/as-iosched.c       |    2 +-
 block/blk-core.c         |    7 +-
 block/cfq-iosched.c      |  149 ++++++++++++++++++++++++++++++++++++----------
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |  131 ++++++++++++++++++++++++++++++++++-------
 block/elevator-fq.h      |   34 +++++++++-
 block/elevator.c         |   13 ++--
 include/linux/elevator.h |   19 +++++-
 9 files changed, 304 insertions(+), 69 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 77fc786..0677099 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -124,6 +124,22 @@ config DEFAULT_IOSCHED
 	default "cfq" if DEFAULT_CFQ
 	default "noop" if DEFAULT_NOOP
 
+config TRACK_ASYNC_CONTEXT
+	bool "Determine async request context from bio"
+	depends on GROUP_IOSCHED
+	select CGROUP_BLKIO
+	default n
+	---help---
+	  Normally an async request is attributed to the task submitting the
+	  request. With group io scheduling, accurate accounting of async
+	  writes requires mapping the request to the task/cgroup that
+	  originated it, not to the submitter of the request.
+
+	  Generic io tracking patches provide the facility to map a bio to
+	  its original owner. If this option is set, the original owner of the
+	  bio of an async request is determined using the io tracking patches;
+	  otherwise the request continues to be attributed to the submitting
+	  thread.
 endmenu
 
 endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 12aea88..afa554a 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1412,7 +1412,7 @@ as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
-	struct as_queue *asq = elv_get_sched_queue_current(q);
+	struct as_queue *asq = elv_get_sched_queue_bio(q, bio);
 
 	if (!asq)
 		return ELEVATOR_NO_MERGE;
diff --git a/block/blk-core.c b/block/blk-core.c
index 2998fe3..b19510a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -643,7 +643,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 }
 
 static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+					gfp_t gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 
@@ -655,7 +656,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
 	rq->cmd_flags = flags | REQ_ALLOCED;
 
 	if (priv) {
-		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
 			mempool_free(rq, q->rq.rq_pool);
 			return NULL;
 		}
@@ -796,7 +797,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1e9dd5b..ea71239 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -161,8 +161,8 @@ CFQ_CFQQ_FNS(coop);
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
-				       struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct io_group *iog,
+					int, struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 
@@ -172,22 +172,56 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 	return cic->cfqq[!!is_sync];
 }
 
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
-				struct cfq_queue *cfqq, int is_sync)
-{
-	cic->cfqq[!!is_sync] = cfqq;
-}
-
 /*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
+ * Determine the cfq queue a bio should go into. This is primarily used by
+ * the front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and ioprio_class from the task
+ * submitting the async bio. Later, save the task information in the
+ * page_cgroup and retrieve the task's ioprio and class from there.
  */
-static inline int cfq_bio_sync(struct bio *bio)
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+		struct cfq_io_context *cic, struct bio *bio, int is_sync)
 {
-	if (bio_data_dir(bio) == READ || bio_sync(bio))
-		return 1;
+	struct cfq_queue *cfqq = NULL;
 
-	return 0;
+	cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+		struct io_group *iog;
+		/*
+		 * async bio tracking is enabled and we are not caching
+		 * async queue pointer in cic.
+		 */
+		iog = io_get_io_group_bio(cfqd->queue, bio, 0);
+		if (!iog) {
+			/*
+			 * Maybe this is the first rq/bio and the io group
+			 * has not been set up yet.
+			 */
+			return NULL;
+		}
+		return io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
+	return cfqq;
+}
+
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+				struct cfq_queue *cfqq, int is_sync)
+{
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * Don't cache the async queue pointer, as one io context might now
+	 * be submitting async io to many different async queues.
+	 */
+	if (!is_sync)
+		return;
+#endif
+	cic->cfqq[!!is_sync] = cfqq;
 }
 
 static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
@@ -505,7 +539,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 	if (!cic)
 		return NULL;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
@@ -587,7 +621,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	/*
 	 * Disallow merge of a sync bio into an async request.
 	 */
-	if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+	if (elv_bio_sync(bio) && !rq_is_sync(rq))
 		return 0;
 
 	/*
@@ -598,7 +632,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	if (!cic)
 		return 0;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq == RQ_CFQQ(rq))
 		return 1;
 
@@ -1206,14 +1240,29 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
+
 	if (cfqq) {
+		struct io_group *iog = io_lookup_io_group_current(q);
 		struct cfq_queue *new_cfqq;
-		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+
+		/*
+		 * Drop the reference to the old queue unconditionally. Don't
+		 * worry about whether the new async prio queue has been
+		 * allocated or not.
+		 */
+		cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+		cfq_put_queue(cfqq);
+
+		/*
+		 * Why allocate a new queue now? Won't it be allocated
+		 * automatically whenever another async request from the same
+		 * context arrives? Keeping this for now because the existing
+		 * cfq code allocates the new queue immediately upon prio change.
+		 */
+		new_cfqq = cfq_get_queue(cfqd, iog, BLK_RW_ASYNC, cic->ioc,
 						GFP_ATOMIC);
-		if (new_cfqq) {
-			cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
-			cfq_put_queue(cfqq);
-		}
+		if (new_cfqq)
+			cic_set_cfqq(cic, new_cfqq, BLK_RW_ASYNC);
 	}
 
 	cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -1274,7 +1323,7 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
 #endif  /* CONFIG_IOSCHED_CFQ_HIER */
 
 static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct io_group *iog, int is_sync,
 				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
@@ -1286,6 +1335,21 @@ retry:
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
 
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+
+		/*
+		 * We have not cached the async queue pointer because bio
+		 * tracking is enabled. Look in the group's async queue array,
+		 * using the ioc class and prio, to see if somebody has
+		 * already allocated the queue.
+		 */
+
+		cfqq = io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
 	if (!cfqq) {
 		if (new_cfqq) {
 			goto alloc_ioq;
@@ -1348,8 +1412,9 @@ alloc_ioq:
 
 		cfqq->ioq = ioq;
 		cfq_init_prio_data(cfqq, ioc);
-		elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
-				cfqq->org_ioprio, is_sync);
+		elv_init_ioq(q->elevator, ioq, iog, cfqq,
+				cfqq->org_ioprio_class, cfqq->org_ioprio,
+				is_sync);
 
 		if (is_sync) {
 			if (!cfq_class_idle(cfqq))
@@ -1372,14 +1437,13 @@ out:
 }
 
 static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-					gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct io_group *iog, int is_sync,
+			struct io_context *ioc, gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
-	struct io_group *iog = io_lookup_io_group_current(cfqd->queue);
 
 	if (!is_sync) {
 		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
@@ -1388,7 +1452,7 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	}
 
 	if (!cfqq) {
-		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, iog, is_sync, ioc, gfp_mask);
 		if (!cfqq)
 			return NULL;
 	}
@@ -1396,8 +1460,30 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	if (!is_sync && !async_cfqq)
 		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	/* ioc reference */
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * ioc reference. If the async request queue/group is determined from
+	 * the original task/cgroup and not from the submitting task, the io
+	 * context cannot cache the pointer to the async queue; every time a
+	 * request comes, the queue is determined by going through the async
+	 * queue array.
+	 *
+	 * This comes from the fact that we might get async requests belonging
+	 * to a cgroup altogether different from the one the iocontext belongs
+	 * to, and this thread might be submitting bios from various cgroups.
+	 * So the async queue differs each time, based on the cgroup of the
+	 * bio/rq, and the async cfqq pointer can't be cached in the cic.
+	 */
+	if (is_sync)
+		elv_get_ioq(cfqq->ioq);
+#else
+	/*
+	 * async requests are attributed to the task submitting them,
+	 * hence the cic can cache the async cfqq pointer. Take the
+	 * queue reference even for the async queue.
+	 */
 	elv_get_ioq(cfqq->ioq);
+#endif
 	return cfqq;
 }
 
@@ -1811,7 +1897,8 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq) {
-		cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, rq_iog(q, rq), is_sync, cic->ioc,
+						gfp_mask);
 
 		if (!cfqq)
 			goto queue_fail;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 27b77b9..87a46c2 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -133,7 +133,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	int ret;
 	struct deadline_queue *dq;
 
-	dq = elv_get_sched_queue_current(q);
+	dq = elv_get_sched_queue_bio(q, bio);
 	if (!dq)
 		return ELEVATOR_NO_MERGE;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 02c27ac..69eaee4 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -11,6 +11,7 @@
 #include <linux/blkdev.h>
 #include "elevator-fq.h"
 #include <linux/blktrace_api.h>
+#include <linux/biotrack.h>
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -71,6 +72,7 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
 void elv_activate_ioq(struct io_queue *ioq, int add_front);
 void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 					int requeue);
+struct io_cgroup *get_iocg_from_bio(struct bio *bio);
 
 static int bfq_update_next_active(struct io_sched_data *sd)
 {
@@ -945,6 +947,9 @@ void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
 
 struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 {
+	if (!cgroup)
+		return &io_root_cgroup;
+
 	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
 			    struct io_cgroup, css);
 }
@@ -968,6 +973,7 @@ struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
 	return NULL;
 }
 
+/* Lookup the io group of the current task */
 struct io_group *io_lookup_io_group_current(struct request_queue *q)
 {
 	struct io_group *iog;
@@ -1318,32 +1324,99 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
 	return iog;
 }
 
+/* Map a bio to its cgroup. A NULL return means: map it to the root cgroup */
+static inline struct cgroup *get_cgroup_from_bio(struct bio *bio)
+{
+	unsigned long bio_cgroup_id;
+	struct cgroup *cgroup;
+
+	/* blk_get_request can reach here without passing a bio */
+	if (!bio)
+		return NULL;
+
+	if (bio_barrier(bio)) {
+		/*
+		 * Map barrier requests to the root group. Maybe more special
+		 * bio cases should be handled here.
+		 */
+		return NULL;
+	}
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (elv_bio_sync(bio)) {
+		/* sync io. Determine cgroup from submitting task context. */
+		cgroup = task_cgroup(current, io_subsys_id);
+		return cgroup;
+	}
+
+	/* Async io. Determine the cgroup from the cgroup id stored in the page */
+	bio_cgroup_id = get_blkio_cgroup_id(bio);
+
+	if (!bio_cgroup_id)
+		return NULL;
+
+	cgroup = blkio_cgroup_lookup(bio_cgroup_id);
+#else
+	cgroup = task_cgroup(current, io_subsys_id);
+#endif
+	return cgroup;
+}
+
+/* Determine the io cgroup of a bio */
+struct io_cgroup *get_iocg_from_bio(struct bio *bio)
+{
+	struct cgroup *cgrp;
+	struct io_cgroup *iocg = NULL;
+
+	cgrp = get_cgroup_from_bio(bio);
+	if (!cgrp)
+		return &io_root_cgroup;
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+	if (!iocg)
+		return &io_root_cgroup;
+
+	return iocg;
+}
+
 /*
- * Search for the io group current task belongs to. If create=1, then also
- * create the io group if it is not already there.
+ * Find the io group a bio belongs to.
+ * If "create" is set, io group is created if it is not already present.
  */
-struct io_group *io_get_io_group(struct request_queue *q, int create)
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+					int create)
 {
 	struct cgroup *cgroup;
 	struct io_group *iog;
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 
 	rcu_read_lock();
-	cgroup = task_cgroup(current, io_subsys_id);
-	iog = io_find_alloc_group(q, cgroup, efqd, create);
-	if (!iog) {
+	cgroup = get_cgroup_from_bio(bio);
+	if (!cgroup) {
 		if (create)
 			iog = efqd->root_group;
-		else
+		else {
 			/*
 			 * bio merge functions doing lookup don't want to
 			 * map bio to root group by default
 			 */
 			iog = NULL;
+		}
+		goto out;
+	}
+
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			iog = NULL;
 	}
+out:
 	rcu_read_unlock();
 	return iog;
 }
+EXPORT_SYMBOL(io_get_io_group_bio);
 
 void io_free_root_group(struct elevator_queue *e)
 {
@@ -1678,7 +1751,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
 		return 1;
 
 	/* Determine the io group of the bio submitting task */
-	iog = io_get_io_group(q, 0);
+	iog = io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
 		/* May be task belongs to a differet cgroup for which io
 		 * group has not been setup yet. */
@@ -1692,8 +1765,8 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
 }
 
 /* find/create the io group request belongs to and put that info in rq */
-void elv_fq_set_request_io_group(struct request_queue *q,
-						struct request *rq)
+void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
+					struct bio *bio)
 {
 	struct io_group *iog;
 	unsigned long flags;
@@ -1702,7 +1775,7 @@ void elv_fq_set_request_io_group(struct request_queue *q,
 	 * io group to which rq belongs. Later we should make use of
 	 * bio cgroup patches to determine the io group */
 	spin_lock_irqsave(q->queue_lock, flags);
-	iog = io_get_io_group(q, 1);
+	iog = io_get_io_group_bio(q, bio, 1);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	BUG_ON(!iog);
 
@@ -1797,7 +1870,7 @@ alloc_ioq:
 			}
 		}
 
-		elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
+		elv_init_ioq(e, ioq, rq->iog, sched_q, IOPRIO_CLASS_BE, 4, 1);
 		io_group_set_ioq(iog, ioq);
 		elv_mark_ioq_sync(ioq);
 	}
@@ -1822,17 +1895,17 @@ queue_fail:
 }
 
 /*
- * Find out the io queue of current task. Optimization for single ioq
+ * Find out the io queue a bio belongs to. Optimization for single ioq
  * per io group io schedulers.
  */
-struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 {
 	struct io_group *iog;
 
-	/* Determine the io group and io queue of the bio submitting task */
-	iog = io_lookup_io_group_current(q);
+	/* Look up the io group and io queue the bio belongs to */
+	iog = io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
-		/* May be task belongs to a cgroup for which io group has
+		/* Maybe the bio belongs to a cgroup for which the io group has
 		 * not been setup yet. */
 		return NULL;
 	}
@@ -1890,6 +1963,13 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
 }
 EXPORT_SYMBOL(io_lookup_io_group_current);
 
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+					int create)
+{
+	return q->elevator->efqd.root_group;
+}
+EXPORT_SYMBOL(io_get_io_group_bio);
+
 void io_free_root_group(struct elevator_queue *e)
 {
 	struct io_group *iog = e->efqd.root_group;
@@ -1902,6 +1982,11 @@ struct io_group *io_get_io_group(struct request_queue *q, int create)
 	return q->elevator->efqd.root_group;
 }
 
+struct io_group *rq_iog(struct request_queue *q, struct request *rq)
+{
+	return q->elevator->efqd.root_group;
+}
+
 #endif /* CONFIG_GROUP_IOSCHED*/
 
 /* Elevator fair queuing function */
@@ -2290,11 +2375,10 @@ void elv_free_ioq(struct io_queue *ioq)
 EXPORT_SYMBOL(elv_free_ioq);
 
 int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
-			void *sched_queue, int ioprio_class, int ioprio,
-			int is_sync)
+		struct io_group *iog, void *sched_queue, int ioprio_class,
+		int ioprio, int is_sync)
 {
 	struct elv_fq_data *efqd = &eq->efqd;
-	struct io_group *iog = io_lookup_io_group_current(efqd->queue);
 
 	RB_CLEAR_NODE(&ioq->entity.rb_node);
 	atomic_set(&ioq->ref, 0);
@@ -3035,6 +3119,10 @@ expire:
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
+	if (ioq)
+		elv_log_ioq(efqd, ioq, "select busy=%d qued=%d disp=%d",
+				elv_nr_busy_ioq(q->elevator), ioq->nr_queued,
+				elv_ioq_nr_dispatched(ioq));
 	return ioq;
 }
 
@@ -3166,7 +3254,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return;
 
-	elv_log_ioq(efqd, ioq, "complete");
+	elv_log_ioq(efqd, ioq, "complete drv=%d disp=%d", efqd->rq_in_driver,
+						elv_ioq_nr_dispatched(ioq));
 
 	elv_update_hw_tag(efqd);
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 5a15329..5fc7d48 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -504,7 +504,7 @@ extern int io_group_allow_merge(struct request *rq, struct bio *bio);
 extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
 					struct io_group *iog);
 extern void elv_fq_set_request_io_group(struct request_queue *q,
-						struct request *rq);
+					struct request *rq, struct bio *bio);
 static inline bfq_weight_t iog_weight(struct io_group *iog)
 {
 	return iog->entity.weight;
@@ -515,6 +515,8 @@ extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
 extern void elv_fq_unset_request_ioq(struct request_queue *q,
 					struct request *rq);
 extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio);
 
 /* Returns single ioq associated with the io group. */
 static inline struct io_queue *io_group_ioq(struct io_group *iog)
@@ -532,6 +534,12 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
 	iog->ioq = ioq;
 }
 
+static inline struct io_group *rq_iog(struct request_queue *q,
+					struct request *rq)
+{
+	return rq->iog;
+}
+
 #else /* !GROUP_IOSCHED */
 /*
  * No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -553,7 +561,7 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
  */
 static inline void io_disconnect_groups(struct elevator_queue *e) {}
 static inline void elv_fq_set_request_io_group(struct request_queue *q,
-						struct request *rq)
+					struct request *rq, struct bio *bio)
 {
 }
 
@@ -589,6 +597,15 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
 	return NULL;
 }
 
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio)
+{
+	return NULL;
+}
+
+
+extern struct io_group *rq_iog(struct request_queue *q, struct request *rq);
+
 #endif /* GROUP_IOSCHED */
 
 /* Functions used by blksysfs.c */
@@ -630,7 +647,8 @@ extern void elv_put_ioq(struct io_queue *ioq);
 extern void __elv_ioq_slice_expired(struct request_queue *q,
 					struct io_queue *ioq);
 extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
-		void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+		struct io_group *iog, void *sched_queue, int ioprio_class,
+		int ioprio, int is_sync);
 extern void elv_schedule_dispatch(struct request_queue *q);
 extern int elv_hw_tag(struct elevator_queue *e);
 extern void *elv_active_sched_queue(struct elevator_queue *e);
@@ -643,6 +661,8 @@ extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 					int ioprio, struct io_queue *ioq);
 extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern struct io_group *io_get_io_group_bio(struct request_queue *q,
+						struct bio *bio, int create);
 extern int elv_nr_busy_ioq(struct elevator_queue *e);
 extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
 extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
@@ -697,7 +717,7 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
 }
 
 static inline void elv_fq_set_request_io_group(struct request_queue *q,
-						struct request *rq)
+					struct request *rq, struct bio *bio)
 {
 }
 
@@ -722,5 +742,11 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
 	return NULL;
 }
 
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index e634a2f..3b83b2f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -967,11 +967,12 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 	return NULL;
 }
 
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+			struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
-	elv_fq_set_request_io_group(q, rq);
+	elv_fq_set_request_io_group(q, rq, bio);
 
 	/*
 	 * Optimization for noop, deadline and AS which maintain only single
@@ -1370,19 +1371,19 @@ void *elv_select_sched_queue(struct request_queue *q, int force)
 EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
- * Get the io scheduler queue pointer for current task.
+ * Get the io scheduler queue pointer for the group the bio belongs to.
  *
  * If fair queuing is enabled, determine the io group of task and retrieve
  * the ioq pointer from that. This is used by only single queue ioschedulers
  * for retrieving the queue associated with the group to decide whether the
  * new bio can do a front merge or not.
  */
-void *elv_get_sched_queue_current(struct request_queue *q)
+void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio)
 {
 	/* Fair queuing is not enabled. There is only one queue. */
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return q->elevator->sched_queue;
 
-	return ioq_sched_queue(elv_lookup_ioq_current(q));
+	return ioq_sched_queue(elv_lookup_ioq_bio(q, bio));
 }
-EXPORT_SYMBOL(elv_get_sched_queue_current);
+EXPORT_SYMBOL(elv_get_sched_queue_bio);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index cbfce0b..3e70d24 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -150,7 +150,8 @@ extern void elv_unregister_queue(struct request_queue *q);
 extern int elv_may_queue(struct request_queue *, int);
 extern void elv_abort_queue(struct request_queue *);
 extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+					struct bio *bio, gfp_t);
 extern void elv_put_request(struct request_queue *, struct request *);
 extern void elv_drain_elevator(struct request_queue *);
 
@@ -293,6 +294,20 @@ static inline int elv_gen_idling_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-extern void *elv_get_sched_queue_current(struct request_queue *q);
+extern void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio);
+
+/*
+ * This is the equivalent of the rq_is_sync()/cfq_bio_sync() functions, which
+ * determine whether an rq/bio is sync or not. There are cases, like during
+ * merging and during request allocation, where we don't have the rq but only
+ * the bio and need to find out whether this bio will be considered sync or
+ * async by the elevator/iosched. This function is useful in such cases.
+ */
+static inline int elv_bio_sync(struct bio *bio)
+{
+	if ((bio_data_dir(bio) == READ) || bio_sync(bio))
+		return 1;
+	return 0;
+}
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.1
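
As a rough usage illustration of the helpers this patch ends with, here is a
minimal sketch of how a single-ioq scheduler's merge hook is expected to look
after this change. It is modeled on the deadline/AS conversion done elsewhere
in this patch; my_merge() and struct my_sched_queue are hypothetical names
used only for illustration, not code from this series.

/*
 * Sketch only: with fair queuing enabled, the per-group scheduler queue is
 * looked up from the bio rather than from "current", so an async bio is
 * matched against the queue of its original owner's group rather than the
 * submitter's.
 */
static int my_merge(struct request_queue *q, struct request **req,
			struct bio *bio)
{
	struct my_sched_queue *sq = elv_get_sched_queue_bio(q, bio);

	if (!sq)
		/* io group for this bio's cgroup is not set up yet */
		return ELEVATOR_NO_MERGE;

	/*
	 * elv_bio_sync(bio) can additionally be used here, as in
	 * cfq_allow_merge(), to avoid merging a sync bio into an async
	 * request. A real scheduler would now scan sq for a suitable
	 * merge candidate.
	 */
	return ELEVATOR_NO_MERGE;
}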

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 15/18] io-controller: map async requests to appropriate cgroup
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (26 preceding siblings ...)
  2009-05-05 19:58 ` [PATCH 15/18] io-controller: map async requests to appropriate cgroup Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` [PATCH 16/18] io-controller: Per cgroup request descriptor support Vivek Goyal
                   ` (9 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando
  Cc: akpm, vgoyal

o So far we were assuming that a bio/rq belongs to the task submitting it.
  That does not hold good in the case of async writes. This patch makes use
  of the blkio_cgroup patches to attribute async writes to the right group
  instead of to the task submitting the bio.

o For sync requests, we continue to assume that the io belongs to the task
  submitting it. Only for async requests do we make use of the io tracking
  patches to track the owner cgroup (see the sketch below).

o So far cfq always caches the async queue pointer. With async requests no
  longer necessarily tied to the submitting task's io context, caching the
  pointer does not help for async queues. This patch introduces a new config
  option CONFIG_TRACK_ASYNC_CONTEXT. If this option is not set, cfq retains
  the old behavior where the async queue pointer is cached in the task
  context. If it is set, the async queue pointer is not cached and we take
  the help of the bio tracking patches to determine the group a bio belongs
  to and then map it to the async queue of that group.
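
To make the attribution rule above concrete, here is a minimal sketch of the
decision this patch implements in get_cgroup_from_bio() in block/elevator-fq.c;
attribute_bio() is a hypothetical wrapper name used only for illustration.

/*
 * Sketch only (assumes CONFIG_TRACK_ASYNC_CONTEXT=y): sync io is charged to
 * the submitting task's cgroup, async io to the cgroup whose id the io
 * tracking (blkio_cgroup) patches stored in the page. A NULL return means
 * "fall back to the root group".
 */
static struct cgroup *attribute_bio(struct bio *bio)
{
	unsigned long id;

	/* barrier requests, and requests without a bio, go to the root group */
	if (!bio || bio_barrier(bio))
		return NULL;

	/* reads and SYNC writes: use the submitting task's cgroup */
	if (elv_bio_sync(bio))
		return task_cgroup(current, io_subsys_id);

	/* async write: look up the cgroup id recorded in the page */
	id = get_blkio_cgroup_id(bio);
	return id ? blkio_cgroup_lookup(id) : NULL;
}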

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   16 +++++
 block/as-iosched.c       |    2 +-
 block/blk-core.c         |    7 +-
 block/cfq-iosched.c      |  149 ++++++++++++++++++++++++++++++++++++----------
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |  131 ++++++++++++++++++++++++++++++++++-------
 block/elevator-fq.h      |   34 +++++++++-
 block/elevator.c         |   13 ++--
 include/linux/elevator.h |   19 +++++-
 9 files changed, 304 insertions(+), 69 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 77fc786..0677099 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -124,6 +124,22 @@ config DEFAULT_IOSCHED
 	default "cfq" if DEFAULT_CFQ
 	default "noop" if DEFAULT_NOOP
 
+config TRACK_ASYNC_CONTEXT
+	bool "Determine async request context from bio"
+	depends on GROUP_IOSCHED
+	select CGROUP_BLKIO
+	default n
+	---help---
+	  Normally an async request is attributed to the task submitting the
+	  request. With group ioscheduling, for accurate accounting of async
+	  writes, one needs to map the request to the task/cgroup which
+	  originated it, not to the submitter of the request.
+
+	  Currently there are generic io tracking patches that provide the
+	  facility to map a bio to its original owner. If this option is set,
+	  the original owner of an async bio is determined using the io
+	  tracking patches; otherwise we continue to attribute the request to
+	  the submitting thread.
 endmenu
 
 endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 12aea88..afa554a 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1412,7 +1412,7 @@ as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
-	struct as_queue *asq = elv_get_sched_queue_current(q);
+	struct as_queue *asq = elv_get_sched_queue_bio(q, bio);
 
 	if (!asq)
 		return ELEVATOR_NO_MERGE;
diff --git a/block/blk-core.c b/block/blk-core.c
index 2998fe3..b19510a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -643,7 +643,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 }
 
 static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+					gfp_t gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 
@@ -655,7 +656,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
 	rq->cmd_flags = flags | REQ_ALLOCED;
 
 	if (priv) {
-		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
 			mempool_free(rq, q->rq.rq_pool);
 			return NULL;
 		}
@@ -796,7 +797,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1e9dd5b..ea71239 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -161,8 +161,8 @@ CFQ_CFQQ_FNS(coop);
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
-				       struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct io_group *iog,
+					int, struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 
@@ -172,22 +172,56 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 	return cic->cfqq[!!is_sync];
 }
 
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
-				struct cfq_queue *cfqq, int is_sync)
-{
-	cic->cfqq[!!is_sync] = cfqq;
-}
-
 /*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
+ * Determine the cfq queue a bio should go in. This is primarily used by
+ * the front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and ioprio_class from the task
+ * submitting the async bio. Later, save the task information in the
+ * page_cgroup and retrieve the task's ioprio and class from there.
  */
-static inline int cfq_bio_sync(struct bio *bio)
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+		struct cfq_io_context *cic, struct bio *bio, int is_sync)
 {
-	if (bio_data_dir(bio) == READ || bio_sync(bio))
-		return 1;
+	struct cfq_queue *cfqq = NULL;
 
-	return 0;
+	cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+		struct io_group *iog;
+		/*
+		 * async bio tracking is enabled and we are not caching
+		 * async queue pointer in cic.
+		 */
+		iog = io_get_io_group_bio(cfqd->queue, bio, 0);
+		if (!iog) {
+			/*
+			 * Maybe this is the first rq/bio and the io group
+			 * has not been set up yet.
+			 */
+			return NULL;
+		}
+		return io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
+	return cfqq;
+}
+
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+				struct cfq_queue *cfqq, int is_sync)
+{
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * Don't cache async queue pointer as now one io context might
+	 * be submitting async io for various different async queues
+	 */
+	if (!is_sync)
+		return;
+#endif
+	cic->cfqq[!!is_sync] = cfqq;
 }
 
 static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
@@ -505,7 +539,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 	if (!cic)
 		return NULL;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
@@ -587,7 +621,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	/*
 	 * Disallow merge of a sync bio into an async request.
 	 */
-	if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+	if (elv_bio_sync(bio) && !rq_is_sync(rq))
 		return 0;
 
 	/*
@@ -598,7 +632,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	if (!cic)
 		return 0;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq == RQ_CFQQ(rq))
 		return 1;
 
@@ -1206,14 +1240,29 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
+
 	if (cfqq) {
+		struct io_group *iog = io_lookup_io_group_current(q);
 		struct cfq_queue *new_cfqq;
-		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+
+		/*
+		 * Drop the reference to old queue unconditionally. Don't
+		 * worry whether new async prio queue has been allocated
+		 * or not.
+		 */
+		cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+		cfq_put_queue(cfqq);
+
+		/*
+		 * Why allocate a new queue now? Won't it be allocated
+		 * automatically when another async request from the same
+		 * context arrives? Keeping it for now because the existing
+		 * cfq code allocates the new queue immediately on prio change
+		 */
+		new_cfqq = cfq_get_queue(cfqd, iog, BLK_RW_ASYNC, cic->ioc,
 						GFP_ATOMIC);
-		if (new_cfqq) {
-			cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
-			cfq_put_queue(cfqq);
-		}
+		if (new_cfqq)
+			cic_set_cfqq(cic, new_cfqq, BLK_RW_ASYNC);
 	}
 
 	cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -1274,7 +1323,7 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
 #endif  /* CONFIG_IOSCHED_CFQ_HIER */
 
 static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct io_group *iog, int is_sync,
 				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
@@ -1286,6 +1335,21 @@ retry:
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
 
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+
+		/*
+		 * We have not cached async queue pointer as bio tracking
+		 * is enabled. Look into group async queue array using ioc
+		 * class and prio to see if somebody already allocated the
+		 * queue.
+		 */
+
+		cfqq = io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
 	if (!cfqq) {
 		if (new_cfqq) {
 			goto alloc_ioq;
@@ -1348,8 +1412,9 @@ alloc_ioq:
 
 		cfqq->ioq = ioq;
 		cfq_init_prio_data(cfqq, ioc);
-		elv_init_ioq(q->elevator, ioq, cfqq, cfqq->org_ioprio_class,
-				cfqq->org_ioprio, is_sync);
+		elv_init_ioq(q->elevator, ioq, iog, cfqq,
+				cfqq->org_ioprio_class, cfqq->org_ioprio,
+				is_sync);
 
 		if (is_sync) {
 			if (!cfq_class_idle(cfqq))
@@ -1372,14 +1437,13 @@ out:
 }
 
 static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-					gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct io_group *iog, int is_sync,
+			struct io_context *ioc, gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
-	struct io_group *iog = io_lookup_io_group_current(cfqd->queue);
 
 	if (!is_sync) {
 		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
@@ -1388,7 +1452,7 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	}
 
 	if (!cfqq) {
-		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, iog, is_sync, ioc, gfp_mask);
 		if (!cfqq)
 			return NULL;
 	}
@@ -1396,8 +1460,30 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	if (!is_sync && !async_cfqq)
 		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	/* ioc reference */
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * ioc reference. If the async queue/group is determined from the
+	 * original task/cgroup and not from the submitting task, the io
+	 * context cannot cache the pointer to the async queue; every time a
+	 * request comes in, the queue is determined by going through the
+	 * async queue array.
+	 *
+	 * This comes from the fact that we might be getting async requests
+	 * which belong to a cgroup altogether different from the cgroup the
+	 * io context belongs to, and this thread might be submitting bios
+	 * from various cgroups. So the async queue will differ based on the
+	 * cgroup of the bio/rq. Can't cache the async cfqq pointer in cic.
+	 */
+	if (is_sync)
+		elv_get_ioq(cfqq->ioq);
+#else
+	/*
+	 * async requests are being attributed to task submitting
+	 * it, hence cic can cache async cfqq pointer. Take the
+	 * queue reference even for async queue.
+	 */
 	elv_get_ioq(cfqq->ioq);
+#endif
 	return cfqq;
 }
 
@@ -1811,7 +1897,8 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq) {
-		cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, rq_iog(q, rq), is_sync, cic->ioc,
+						gfp_mask);
 
 		if (!cfqq)
 			goto queue_fail;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 27b77b9..87a46c2 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -133,7 +133,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	int ret;
 	struct deadline_queue *dq;
 
-	dq = elv_get_sched_queue_current(q);
+	dq = elv_get_sched_queue_bio(q, bio);
 	if (!dq)
 		return ELEVATOR_NO_MERGE;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 02c27ac..69eaee4 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -11,6 +11,7 @@
 #include <linux/blkdev.h>
 #include "elevator-fq.h"
 #include <linux/blktrace_api.h>
+#include <linux/biotrack.h>
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -71,6 +72,7 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
 void elv_activate_ioq(struct io_queue *ioq, int add_front);
 void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 					int requeue);
+struct io_cgroup *get_iocg_from_bio(struct bio *bio);
 
 static int bfq_update_next_active(struct io_sched_data *sd)
 {
@@ -945,6 +947,9 @@ void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
 
 struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 {
+	if (!cgroup)
+		return &io_root_cgroup;
+
 	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
 			    struct io_cgroup, css);
 }
@@ -968,6 +973,7 @@ struct io_group *io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
 	return NULL;
 }
 
+/* Lookup the io group of the current task */
 struct io_group *io_lookup_io_group_current(struct request_queue *q)
 {
 	struct io_group *iog;
@@ -1318,32 +1324,99 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
 	return iog;
 }
 
+/* Map a bio to its cgroup. A NULL return means: map it to the root cgroup */
+static inline struct cgroup *get_cgroup_from_bio(struct bio *bio)
+{
+	unsigned long bio_cgroup_id;
+	struct cgroup *cgroup;
+
+	/* blk_get_request can reach here without passing a bio */
+	if (!bio)
+		return NULL;
+
+	if (bio_barrier(bio)) {
+		/*
+		 * Map barrier requests to the root group. Maybe more special
+		 * bio cases should be handled here.
+		 */
+		return NULL;
+	}
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (elv_bio_sync(bio)) {
+		/* sync io. Determine cgroup from submitting task context. */
+		cgroup = task_cgroup(current, io_subsys_id);
+		return cgroup;
+	}
+
+	/* Async io. Determine the cgroup from the cgroup id stored in the page */
+	bio_cgroup_id = get_blkio_cgroup_id(bio);
+
+	if (!bio_cgroup_id)
+		return NULL;
+
+	cgroup = blkio_cgroup_lookup(bio_cgroup_id);
+#else
+	cgroup = task_cgroup(current, io_subsys_id);
+#endif
+	return cgroup;
+}
+
+/* Determine the io cgroup of a bio */
+struct io_cgroup *get_iocg_from_bio(struct bio *bio)
+{
+	struct cgroup *cgrp;
+	struct io_cgroup *iocg = NULL;
+
+	cgrp = get_cgroup_from_bio(bio);
+	if (!cgrp)
+		return &io_root_cgroup;
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+	if (!iocg)
+		return &io_root_cgroup;
+
+	return iocg;
+}
+
 /*
- * Search for the io group current task belongs to. If create=1, then also
- * create the io group if it is not already there.
+ * Find the io group the bio belongs to.
+ * If "create" is set, io group is created if it is not already present.
  */
-struct io_group *io_get_io_group(struct request_queue *q, int create)
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+					int create)
 {
 	struct cgroup *cgroup;
 	struct io_group *iog;
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 
 	rcu_read_lock();
-	cgroup = task_cgroup(current, io_subsys_id);
-	iog = io_find_alloc_group(q, cgroup, efqd, create);
-	if (!iog) {
+	cgroup = get_cgroup_from_bio(bio);
+	if (!cgroup) {
 		if (create)
 			iog = efqd->root_group;
-		else
+		else {
 			/*
 			 * bio merge functions doing lookup don't want to
 			 * map bio to root group by default
 			 */
 			iog = NULL;
+		}
+		goto out;
+	}
+
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			iog = NULL;
 	}
+out:
 	rcu_read_unlock();
 	return iog;
 }
+EXPORT_SYMBOL(io_get_io_group_bio);
 
 void io_free_root_group(struct elevator_queue *e)
 {
@@ -1678,7 +1751,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
 		return 1;
 
 	/* Determine the io group of the bio submitting task */
-	iog = io_get_io_group(q, 0);
+	iog = io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
 		/* May be task belongs to a differet cgroup for which io
 		 * group has not been setup yet. */
@@ -1692,8 +1765,8 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
 }
 
 /* find/create the io group request belongs to and put that info in rq */
-void elv_fq_set_request_io_group(struct request_queue *q,
-						struct request *rq)
+void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
+					struct bio *bio)
 {
 	struct io_group *iog;
 	unsigned long flags;
@@ -1702,7 +1775,7 @@ void elv_fq_set_request_io_group(struct request_queue *q,
 	 * io group to which rq belongs. Later we should make use of
 	 * bio cgroup patches to determine the io group */
 	spin_lock_irqsave(q->queue_lock, flags);
-	iog = io_get_io_group(q, 1);
+	iog = io_get_io_group_bio(q, bio, 1);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	BUG_ON(!iog);
 
@@ -1797,7 +1870,7 @@ alloc_ioq:
 			}
 		}
 
-		elv_init_ioq(e, ioq, sched_q, IOPRIO_CLASS_BE, 4, 1);
+		elv_init_ioq(e, ioq, rq->iog, sched_q, IOPRIO_CLASS_BE, 4, 1);
 		io_group_set_ioq(iog, ioq);
 		elv_mark_ioq_sync(ioq);
 	}
@@ -1822,17 +1895,17 @@ queue_fail:
 }
 
 /*
- * Find out the io queue of current task. Optimization for single ioq
+ * Find out the io queue the bio belongs to. Optimization for single ioq
  * per io group io schedulers.
  */
-struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 {
 	struct io_group *iog;
 
-	/* Determine the io group and io queue of the bio submitting task */
-	iog = io_lookup_io_group_current(q);
+	/* Look up the io group and io queue the bio belongs to */
+	iog = io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
-		/* May be task belongs to a cgroup for which io group has
+		/* Maybe the bio belongs to a cgroup for which the io group has
 		 * not been setup yet. */
 		return NULL;
 	}
@@ -1890,6 +1963,13 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
 }
 EXPORT_SYMBOL(io_lookup_io_group_current);
 
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+					int create)
+{
+	return q->elevator->efqd.root_group;
+}
+EXPORT_SYMBOL(io_get_io_group_bio);
+
 void io_free_root_group(struct elevator_queue *e)
 {
 	struct io_group *iog = e->efqd.root_group;
@@ -1902,6 +1982,11 @@ struct io_group *io_get_io_group(struct request_queue *q, int create)
 	return q->elevator->efqd.root_group;
 }
 
+struct io_group *rq_iog(struct request_queue *q, struct request *rq)
+{
+	return q->elevator->efqd.root_group;
+}
+
 #endif /* CONFIG_GROUP_IOSCHED*/
 
 /* Elevator fair queuing function */
@@ -2290,11 +2375,10 @@ void elv_free_ioq(struct io_queue *ioq)
 EXPORT_SYMBOL(elv_free_ioq);
 
 int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
-			void *sched_queue, int ioprio_class, int ioprio,
-			int is_sync)
+		struct io_group *iog, void *sched_queue, int ioprio_class,
+		int ioprio, int is_sync)
 {
 	struct elv_fq_data *efqd = &eq->efqd;
-	struct io_group *iog = io_lookup_io_group_current(efqd->queue);
 
 	RB_CLEAR_NODE(&ioq->entity.rb_node);
 	atomic_set(&ioq->ref, 0);
@@ -3035,6 +3119,10 @@ expire:
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
+	if (ioq)
+		elv_log_ioq(efqd, ioq, "select busy=%d qued=%d disp=%d",
+				elv_nr_busy_ioq(q->elevator), ioq->nr_queued,
+				elv_ioq_nr_dispatched(ioq));
 	return ioq;
 }
 
@@ -3166,7 +3254,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return;
 
-	elv_log_ioq(efqd, ioq, "complete");
+	elv_log_ioq(efqd, ioq, "complete drv=%d disp=%d", efqd->rq_in_driver,
+						elv_ioq_nr_dispatched(ioq));
 
 	elv_update_hw_tag(efqd);
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 5a15329..5fc7d48 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -504,7 +504,7 @@ extern int io_group_allow_merge(struct request *rq, struct bio *bio);
 extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
 					struct io_group *iog);
 extern void elv_fq_set_request_io_group(struct request_queue *q,
-						struct request *rq);
+					struct request *rq, struct bio *bio);
 static inline bfq_weight_t iog_weight(struct io_group *iog)
 {
 	return iog->entity.weight;
@@ -515,6 +515,8 @@ extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
 extern void elv_fq_unset_request_ioq(struct request_queue *q,
 					struct request *rq);
 extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio);
 
 /* Returns single ioq associated with the io group. */
 static inline struct io_queue *io_group_ioq(struct io_group *iog)
@@ -532,6 +534,12 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
 	iog->ioq = ioq;
 }
 
+static inline struct io_group *rq_iog(struct request_queue *q,
+					struct request *rq)
+{
+	return rq->iog;
+}
+
 #else /* !GROUP_IOSCHED */
 /*
  * No ioq movement is needed in case of flat setup. root io group gets cleaned
@@ -553,7 +561,7 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
  */
 static inline void io_disconnect_groups(struct elevator_queue *e) {}
 static inline void elv_fq_set_request_io_group(struct request_queue *q,
-						struct request *rq)
+					struct request *rq, struct bio *bio)
 {
 }
 
@@ -589,6 +597,15 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
 	return NULL;
 }
 
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio)
+{
+	return NULL;
+}
+
+
+extern struct io_group *rq_iog(struct request_queue *q, struct request *rq);
+
 #endif /* GROUP_IOSCHED */
 
 /* Functions used by blksysfs.c */
@@ -630,7 +647,8 @@ extern void elv_put_ioq(struct io_queue *ioq);
 extern void __elv_ioq_slice_expired(struct request_queue *q,
 					struct io_queue *ioq);
 extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
-		void *sched_queue, int ioprio_class, int ioprio, int is_sync);
+		struct io_group *iog, void *sched_queue, int ioprio_class,
+		int ioprio, int is_sync);
 extern void elv_schedule_dispatch(struct request_queue *q);
 extern int elv_hw_tag(struct elevator_queue *e);
 extern void *elv_active_sched_queue(struct elevator_queue *e);
@@ -643,6 +661,8 @@ extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 					int ioprio, struct io_queue *ioq);
 extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
+extern struct io_group *io_get_io_group_bio(struct request_queue *q,
+						struct bio *bio, int create);
 extern int elv_nr_busy_ioq(struct elevator_queue *e);
 extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
 extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
@@ -697,7 +717,7 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
 }
 
 static inline void elv_fq_set_request_io_group(struct request_queue *q,
-						struct request *rq)
+					struct request *rq, struct bio *bio)
 {
 }
 
@@ -722,5 +742,11 @@ static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
 	return NULL;
 }
 
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index e634a2f..3b83b2f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -967,11 +967,12 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 	return NULL;
 }
 
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+			struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
-	elv_fq_set_request_io_group(q, rq);
+	elv_fq_set_request_io_group(q, rq, bio);
 
 	/*
 	 * Optimization for noop, deadline and AS which maintain only single
@@ -1370,19 +1371,19 @@ void *elv_select_sched_queue(struct request_queue *q, int force)
 EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
- * Get the io scheduler queue pointer for current task.
+ * Get the io scheduler queue pointer for the group the bio belongs to.
  *
  * If fair queuing is enabled, determine the io group of task and retrieve
  * the ioq pointer from that. This is used by only single queue ioschedulers
  * for retrieving the queue associated with the group to decide whether the
  * new bio can do a front merge or not.
  */
-void *elv_get_sched_queue_current(struct request_queue *q)
+void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio)
 {
 	/* Fair queuing is not enabled. There is only one queue. */
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return q->elevator->sched_queue;
 
-	return ioq_sched_queue(elv_lookup_ioq_current(q));
+	return ioq_sched_queue(elv_lookup_ioq_bio(q, bio));
 }
-EXPORT_SYMBOL(elv_get_sched_queue_current);
+EXPORT_SYMBOL(elv_get_sched_queue_bio);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index cbfce0b..3e70d24 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -150,7 +150,8 @@ extern void elv_unregister_queue(struct request_queue *q);
 extern int elv_may_queue(struct request_queue *, int);
 extern void elv_abort_queue(struct request_queue *);
 extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+					struct bio *bio, gfp_t);
 extern void elv_put_request(struct request_queue *, struct request *);
 extern void elv_drain_elevator(struct request_queue *);
 
@@ -293,6 +294,20 @@ static inline int elv_gen_idling_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-extern void *elv_get_sched_queue_current(struct request_queue *q);
+extern void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio);
+
+/*
+ * This is the equivalent of the rq_is_sync()/cfq_bio_sync() functions, which
+ * determine whether an rq/bio is sync or not. There are cases, such as during
+ * merging and during request allocation, where we don't have an rq but only a
+ * bio and need to find out whether this bio will be considered sync or async
+ * by the elevator/iosched. This function is useful in such cases.
+ */
+static inline int elv_bio_sync(struct bio *bio)
+{
+	if ((bio_data_dir(bio) == READ) || bio_sync(bio))
+		return 1;
+	return 0;
+}
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 16/18] io-controller: Per cgroup request descriptor support
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (14 preceding siblings ...)
  2009-05-05 19:58   ` [PATCH 15/18] io-controller: map async requests to appropriate cgroup Vivek Goyal
@ 2009-05-05 19:58   ` Vivek Goyal
  2009-05-05 19:58   ` [PATCH 17/18] io-controller: IO group refcounting support Vivek Goyal
                     ` (5 subsequent siblings)
  21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o Currently a request queue has a fixed number of request descriptors for
  sync and async requests. Once the request descriptors are consumed, new
  processes are put to sleep and they effectively become serialized. Because
  sync and async queues are separate, async requests don't impact sync ones,
  but if one is looking for fairness between async requests, that is not
  achievable if request queue descriptors become the bottleneck.

o Make request descriptors per io group so that if there is lots of IO
  going on in one cgroup, it does not impact the IO of other groups.

o This is just one relatively simple way of doing things. This patch will
  probably change after the feedback. Folks have raised concerns that in a
  hierarchical setup, a child's request descriptors should be capped by the
  parent's request descriptors. Maybe we need to have per cgroup, per device
  files in cgroups where one can specify the upper limit of request
  descriptors, and whenever a cgroup is created one needs to assign a request
  descriptor limit, making sure the total sum of the children's request
  descriptors is not more than that of the parent.

  I guess something like the memory controller. Anyway, that would be the
  next step. For the time being, we have implemented something simpler as
  follows.

o This patch implements the per cgroup request descriptors. The request pool
  per queue is still common, but every group will have its own wait list and
  its own count of request descriptors allocated to that group for sync and
  async queues. So effectively request_list becomes a per io group property
  and not a global request queue feature.

o Currently one can define q->nr_requests to limit the request descriptors
  allocated for the queue. Now there is another tunable, q->nr_group_requests,
  which controls the request descriptor limit per group. q->nr_requests
  supersedes q->nr_group_requests to make sure that if there are lots of
  groups present, we don't end up allocating too many request descriptors on
  the queue (a simplified sketch of this two-level check follows the list
  below).

o Issues: Currently the notion of congestion is per queue. With per group
  request descriptors it is possible that the queue is not congested but the
  group the bio will go into is congested.
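
  Purely for illustration (not part of this patch): below is a minimal
  user-space C sketch of the two-level admission check that get_request()
  implements in the diff further down. The names toy_queue, toy_request_list
  and may_allocate() are made up for this sketch; only the nr_requests /
  nr_group_requests limits and the 3/2 slack factor mirror the patch.

	/* Illustrative sketch only -- not kernel code. */
	#include <stdbool.h>
	#include <stdio.h>

	struct toy_request_list {		/* simplified per-group accounting */
		int count[2];			/* indexed by sync/async direction */
	};

	struct toy_queue {			/* simplified per-queue accounting */
		int count[2];
		unsigned long nr_requests;		/* queue-wide limit */
		unsigned long nr_group_requests;	/* per-group limit */
	};

	/* May one more request in direction 'sync' be allocated for this group? */
	static bool may_allocate(const struct toy_queue *q,
				 const struct toy_request_list *rl, int sync)
	{
		/* Queue-wide hard cap: never exceed 1.5 * nr_requests. */
		if (q->count[sync] >= 3 * q->nr_requests / 2)
			return false;
		/* Per-group cap: each group is held to 1.5 * nr_group_requests. */
		if (rl->count[sync] >= 3 * q->nr_group_requests / 2)
			return false;
		return true;
	}

	int main(void)
	{
		struct toy_queue q = { .count = { 120, 0 },
				       .nr_requests = 256, .nr_group_requests = 64 };
		struct toy_request_list busy_group = { .count = { 100, 0 } };

		/* Queue is under its cap (120 < 384) but this group is not (100 >= 96). */
		printf("allowed: %d\n", may_allocate(&q, &busy_group, 0));
		return 0;
	}

  The real get_request() below additionally handles congestion and "queue
  full" marking and the ioc batching exception, which the sketch omits.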

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-core.c       |  216 ++++++++++++++++++++++++++++++++++--------------
 block/blk-settings.c   |    3 +
 block/blk-sysfs.c      |   57 ++++++++++---
 block/elevator-fq.c    |   14 +++
 block/elevator-fq.h    |    5 +
 block/elevator.c       |    6 +-
 include/linux/blkdev.h |   62 +++++++++++++-
 7 files changed, 283 insertions(+), 80 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b19510a..9226cdd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -480,20 +480,31 @@ void blk_cleanup_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
 
 	rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
-	rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
-	rl->elvpriv = 0;
 	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
 	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
 
-	rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
-				mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+#ifndef CONFIG_GROUP_IOSCHED
+	struct request_list *rl = blk_get_request_list(q, NULL);
+
+	/*
+	 * In case of group scheduling, the request list is inside the
+	 * associated group, and when that group is instantiated, it takes
+	 * care of initializing the request list as well.
+	 */
+	blk_init_request_list(rl);
+#endif
+	q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+				mempool_alloc_slab, mempool_free_slab,
+				request_cachep, q->node);
 
-	if (!rl->rq_pool)
+	if (!q->rq_data.rq_pool)
 		return -ENOMEM;
 
 	return 0;
@@ -590,6 +601,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 		return NULL;
 	}
 
+	/* init starved waiter wait queue */
+	init_waitqueue_head(&q->rq_data.starved_wait);
+
 	/*
 	 * if caller didn't supply a lock, they get per-queue locking with
 	 * our embedded lock
@@ -639,14 +653,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 {
 	if (rq->cmd_flags & REQ_ELVPRIV)
 		elv_put_request(q, rq);
-	mempool_free(rq, q->rq.rq_pool);
+	mempool_free(rq, q->rq_data.rq_pool);
 }
 
 static struct request *
 blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 					gfp_t gfp_mask)
 {
-	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
 
 	if (!rq)
 		return NULL;
@@ -657,7 +671,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 
 	if (priv) {
 		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
+			mempool_free(rq, q->rq_data.rq_pool);
 			return NULL;
 		}
 		rq->cmd_flags |= REQ_ELVPRIV;
@@ -700,18 +714,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
 	ioc->last_waited = jiffies;
 }
 
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
-	if (rl->count[sync] < queue_congestion_off_threshold(q))
+	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
-	if (rl->count[sync] + 1 <= q->nr_requests) {
+	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+		blk_clear_queue_full(q, sync);
+
+	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
-
-		blk_clear_queue_full(q, sync);
 	}
 }
 
@@ -719,18 +733,29 @@ static void __freed_request(struct request_queue *q, int sync)
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
+	BUG_ON(!rl->count[sync]);
 	rl->count[sync]--;
+
+	BUG_ON(!q->rq_data.count[sync]);
+	q->rq_data.count[sync]--;
+
 	if (priv)
-		rl->elvpriv--;
+		q->rq_data.elvpriv--;
 
-	__freed_request(q, sync);
+	__freed_request(q, sync, rl);
 
 	if (unlikely(rl->starved[sync ^ 1]))
-		__freed_request(q, sync ^ 1);
+		__freed_request(q, sync ^ 1, rl);
+
+	/* Wake up the starved process on global list, if any */
+	if (unlikely(q->rq_data.starved)) {
+		if (waitqueue_active(&q->rq_data.starved_wait))
+			wake_up(&q->rq_data.starved_wait);
+		q->rq_data.starved--;
+	}
 }
 
 /*
@@ -739,10 +764,9 @@ static void freed_request(struct request_queue *q, int sync, int priv)
  * Returns !NULL on success, with queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, int rw_flags,
-				   struct bio *bio, gfp_t gfp_mask)
+		   struct bio *bio, gfp_t gfp_mask, struct request_list *rl)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
@@ -751,31 +775,38 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
-	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
-		if (rl->count[is_sync]+1 >= q->nr_requests) {
-			ioc = current_io_context(GFP_ATOMIC, q->node);
-			/*
-			 * The queue will fill after this allocation, so set
-			 * it as full, and mark this process as "batching".
-			 * This process will be allowed to complete a batch of
-			 * requests, others will be blocked.
-			 */
-			if (!blk_queue_full(q, is_sync)) {
-				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
-			} else {
-				if (may_queue != ELV_MQUEUE_MUST
-						&& !ioc_batching(q, ioc)) {
-					/*
-					 * The queue is full and the allocating
-					 * process is not a "batcher", and not
-					 * exempted by the IO scheduler
-					 */
-					goto out;
-				}
+	if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+		blk_set_queue_congested(q, is_sync);
+
+	/*
+	 * Looks like there is no user of queue full now.
+	 * Keeping it for the time being.
+	 */
+	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+		blk_set_queue_full(q, is_sync);
+
+	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+		ioc = current_io_context(GFP_ATOMIC, q->node);
+		/*
+		 * The group's request descriptor list will fill after this
+		 * allocation, so set it as full, and mark this process as
+		 * "batching".
+		 * This process will be allowed to complete a batch of
+		 * requests, others will be blocked.
+		 */
+		if (rl->count[is_sync] <= q->nr_group_requests)
+			ioc_set_batching(q, ioc);
+		else {
+			if (may_queue != ELV_MQUEUE_MUST
+					&& !ioc_batching(q, ioc)) {
+				/*
+				 * The queue is full and the allocating
+				 * process is not a "batcher", and not
+				 * exempted by the IO scheduler
+				 */
+				goto out;
 			}
 		}
-		blk_set_queue_congested(q, is_sync);
 	}
 
 	/*
@@ -783,21 +814,43 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	 * limit of requests, otherwise we could have thousands of requests
 	 * allocated with any setting of ->nr_requests
 	 */
-	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
+		goto out;
+
+	/*
+	 * Allocation of request is allowed from queue perspective. Now check
+	 * from per group request list
+	 */
+
+	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
 		goto out;
 
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
+	q->rq_data.count[is_sync]++;
+
 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
 	if (priv)
-		rl->elvpriv++;
+		q->rq_data.elvpriv++;
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
 	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	if (rq) {
+		/*
+		 * TODO. Implement group reference counting and take the
+		 * reference to the group to make sure group hence request
+		 * list does not go away till rq finishes.
+		 */
+		rq->rl = rl;
+	}
+#endif
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
@@ -807,7 +860,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 
 		/*
 		 * in the very unlikely event that allocation failed and no
@@ -817,10 +870,26 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * rq mempool into READ and WRITE
 		 */
 rq_starved:
-		if (unlikely(rl->count[is_sync] == 0))
-			rl->starved[is_sync] = 1;
-
-		goto out;
+		if (unlikely(rl->count[is_sync] == 0)) {
+			/*
+			 * If there is a request pending in other direction
+			 * in same io group, then set the starved flag of
+			 * the group request list. Otherwise, we need to
+			 * make this process sleep in global starved list
+			 * to make sure it will not sleep indefinitely.
+			 */
+			if (rl->count[is_sync ^ 1] != 0) {
+				rl->starved[is_sync] = 1;
+				goto out;
+			} else {
+				/*
+				 * It indicates to calling function to put
+				 * task on global starved list. Not the best
+				 * way
+				 */
+				return ERR_PTR(-ENOMEM);
+			}
+		}
 	}
 
 	/*
@@ -848,15 +917,29 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, bio);
 
-	rq = get_request(q, rw_flags, bio, GFP_NOIO);
-	while (!rq) {
+	rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
+	while (!rq || (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM)) {
 		DEFINE_WAIT(wait);
 		struct io_context *ioc;
-		struct request_list *rl = &q->rq;
 
-		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
-				TASK_UNINTERRUPTIBLE);
+		if (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM) {
+			/*
+			 * Task failed allocation and needs to wait and
+			 * try again. There are no requests pending from
+			 * the io group hence need to sleep on global
+			 * wait queue. Most likely the allocation failed
+			 * because of memory issues.
+			 */
+
+			q->rq_data.starved++;
+			prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+					&wait, TASK_UNINTERRUPTIBLE);
+		} else {
+			prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+					TASK_UNINTERRUPTIBLE);
+		}
 
 		trace_block_sleeprq(q, bio, rw_flags & 1);
 
@@ -876,7 +959,12 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		spin_lock_irq(q->queue_lock);
 		finish_wait(&rl->wait[is_sync], &wait);
 
-		rq = get_request(q, rw_flags, bio, GFP_NOIO);
+		/*
+		 * After the sleep, check the rl again in case the cgroup the
+		 * bio belonged to is gone and the bio is now mapped to the
+		 * root group.
+		 */
+		rl = blk_get_request_list(q, bio);
+		rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
 	};
 
 	return rq;
@@ -885,6 +973,7 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, NULL);
 
 	BUG_ON(rw != READ && rw != WRITE);
 
@@ -892,7 +981,7 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 	if (gfp_mask & __GFP_WAIT) {
 		rq = get_request_wait(q, rw, NULL);
 	} else {
-		rq = get_request(q, rw, NULL, gfp_mask);
+		rq = get_request(q, rw, NULL, gfp_mask, rl);
 		if (!rq)
 			spin_unlock_irq(q->queue_lock);
 	}
@@ -1075,12 +1164,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	if (req->cmd_flags & REQ_ALLOCED) {
 		int is_sync = rq_is_sync(req) != 0;
 		int priv = req->cmd_flags & REQ_ELVPRIV;
+		struct request_list *rl = rq_rl(q, req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
 		blk_free_request(q, req);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 57af728..8733192 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -123,6 +123,9 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	 * set defaults
 	 */
 	q->nr_requests = BLKDEV_MAX_RQ;
+#ifdef CONFIG_GROUP_IOSCHED
+	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
+#endif
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index c942ddc..b60b76e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -38,7 +38,7 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
-	struct request_list *rl = &q->rq;
+	struct request_list *rl = blk_get_request_list(q, NULL);
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
 	if (nr < BLKDEV_MIN_RQ)
@@ -48,32 +48,55 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
 
-	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_SYNC);
-	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_SYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_SYNC);
 
-	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_ASYNC);
-	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_ASYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
-	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_SYNC);
-	} else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_SYNC);
 		wake_up(&rl->wait[BLK_RW_SYNC]);
 	}
 
-	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_ASYNC);
-	} else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_ASYNC);
 		wake_up(&rl->wait[BLK_RW_ASYNC]);
 	}
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+					size_t count)
+{
+	unsigned long nr;
+	int ret = queue_var_store(&nr, page, count);
+	if (nr < BLKDEV_MIN_RQ)
+		nr = BLKDEV_MIN_RQ;
+
+	spin_lock_irq(q->queue_lock);
+	q->nr_group_requests = nr;
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+#endif
 
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
 {
@@ -224,6 +247,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
 	.store = queue_requests_store,
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_group_requests_show,
+	.store = queue_group_requests_store,
+};
+#endif
+
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -304,6 +335,9 @@ static struct queue_sysfs_entry queue_fairness_entry = {
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+	&queue_group_requests_entry.attr,
+#endif
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
@@ -385,12 +419,11 @@ static void blk_release_queue(struct kobject *kobj)
 {
 	struct request_queue *q =
 		container_of(kobj, struct request_queue, kobj);
-	struct request_list *rl = &q->rq;
 
 	blk_sync_queue(q);
 
-	if (rl->rq_pool)
-		mempool_destroy(rl->rq_pool);
+	if (q->rq_data.rq_pool)
+		mempool_destroy(q->rq_data.rq_pool);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69eaee4..bd98317 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -954,6 +954,16 @@ struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+	struct io_group *iog;
+
+	iog = io_get_io_group_bio(q, bio, 1);
+	BUG_ON(!iog);
+	return &iog->rl;
+}
+
 /*
  * Search the bfq_group for bfqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1203,6 +1213,8 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
 
+		blk_init_request_list(&iog->rl);
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -1447,6 +1459,8 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
 
+	blk_init_request_list(&iog->rl);
+
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 5fc7d48..58543ec 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -239,6 +239,9 @@ struct io_group {
 
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
+
+	/* request list associated with the group */
+	struct request_list rl;
 };
 
 /**
@@ -517,6 +520,8 @@ extern void elv_fq_unset_request_ioq(struct request_queue *q,
 extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
 extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 						struct bio *bio);
+extern struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio);
 
 /* Returns single ioq associated with the io group. */
 static inline struct io_queue *io_group_ioq(struct io_group *iog)
diff --git a/block/elevator.c b/block/elevator.c
index 3b83b2f..44c9fad 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -668,7 +668,7 @@ void elv_quiesce_start(struct request_queue *q)
 	 * make sure we don't have any requests in flight
 	 */
 	elv_drain_elevator(q);
-	while (q->rq.elvpriv) {
+	while (q->rq_data.elvpriv) {
 		blk_start_queueing(q);
 		spin_unlock_irq(q->queue_lock);
 		msleep(10);
@@ -768,8 +768,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 	}
 
 	if (unplug_it && blk_queue_plugged(q)) {
-		int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
-			- q->in_flight;
+		int nrq = q->rq_data.count[BLK_RW_SYNC] +
+				q->rq_data.count[BLK_RW_ASYNC] - q->in_flight;
 
 		if (nrq >= q->unplug_thresh)
 			__generic_unplug_device(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9c209a0..07aca2f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,21 +32,51 @@ struct request;
 struct sg_io_hdr;
 
 #define BLKDEV_MIN_RQ	4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ	256	/* Default maximum */
+#define BLKDEV_MAX_GROUP_RQ    64      /* Default maximum */
+#else
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+/*
+ * This is equivalent to the case of only one group (the root group) being
+ * present. Let it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
+#endif
 
 struct request;
 typedef void (rq_end_io_fn)(struct request *, int);
 
 struct request_list {
 	/*
-	 * count[], starved[], and wait[] are indexed by
+	 * count[], starved[] and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
 	 */
 	int count[2];
 	int starved[2];
+	wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of the mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+	/*
+	 * Per queue request descriptor count. This is in addition to per
+	 * cgroup count
+	 */
+	int count[2];
 	int elvpriv;
 	mempool_t *rq_pool;
-	wait_queue_head_t wait[2];
+	int starved;
+	/*
+	 * Global list for starved tasks. A task will be queued here if
+	 * it could not allocate request descriptor and the associated
+	 * group request list does not have any requests pending.
+	 */
+	wait_queue_head_t starved_wait;
 };
 
 /*
@@ -253,6 +283,7 @@ struct request {
 #ifdef CONFIG_GROUP_IOSCHED
 	/* io group request belongs to */
 	struct io_group *iog;
+	struct request_list *rl;
 #endif /* GROUP_IOSCHED */
 #endif /* ELV_FAIR_QUEUING */
 };
@@ -342,6 +373,9 @@ struct request_queue
 	 */
 	struct request_list	rq;
 
+	/* Contains request pool and other data like starved data */
+	struct request_data	rq_data;
+
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
 	prep_rq_fn		*prep_rq_fn;
@@ -404,6 +438,8 @@ struct request_queue
 	 * queue settings
 	 */
 	unsigned long		nr_requests;	/* Max # of requests */
+	/* Max # of per io group requests */
+	unsigned long		nr_group_requests;
 	unsigned int		nr_congestion_on;
 	unsigned int		nr_congestion_off;
 	unsigned int		nr_batching;
@@ -776,6 +812,28 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern void blk_init_request_list(struct request_list *rl);
+
+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	return io_group_get_request_list(q, bio);
+#else
+	return &q->rq;
+#endif
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+						struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	return rq->rl;
+#else
+	return blk_get_request_list(q, NULL);
+#endif
+}
+
 /*
  * Temporary export, until SCSI gets fixed up.
  */
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread


* [PATCH 16/18] io-controller: Per cgroup request descriptor support
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (28 preceding siblings ...)
  2009-05-05 19:58 ` [PATCH 16/18] io-controller: Per cgroup request descriptor support Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-05 19:58 ` [PATCH 17/18] io-controller: IO group refcounting support Vivek Goyal
                   ` (7 subsequent siblings)
  37 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando
  Cc: akpm, vgoyal

o Currently a request queue has a fixed number of request descriptors for
  sync and async requests. Once the request descriptors are consumed, new
  processes are put to sleep and they effectively become serialized. Because
  sync and async queues are separate, async requests don't impact sync ones,
  but if one is looking for fairness between async requests, that is not
  achievable if the request queue descriptors become the bottleneck.

o Make request descriptors per io group so that if there is a lot of IO
  going on in one cgroup, it does not impact the IO of other groups.

o This is just one relatively simple way of doing things. This patch will
  probably change after the feedback. Folks have raised concerns that in a
  hierarchical setup, a child's request descriptors should be capped by the
  parent's request descriptors. Maybe we need per cgroup, per device files
  in cgroups where one can specify the upper limit of request descriptors,
  and whenever a cgroup is created one needs to assign a request descriptor
  limit, making sure the total sum of the children's request descriptors is
  not more than that of the parent.

  I guess something like the memory controller. Anyway, that would be the
  next step. For the time being, we have implemented something simpler as
  follows.

o This patch implements the per cgroup request descriptors. The request pool
  per queue is still common, but every group will have its own wait list and
  its own count of request descriptors allocated to that group for sync and
  async queues. So effectively request_list becomes a per io group property
  and not a global request queue feature.

o Currently one can set q->nr_requests to limit the request descriptors
  allocated for the queue. Now there is another tunable, q->nr_group_requests,
  which controls the request descriptor limit per group. q->nr_requests
  supersedes q->nr_group_requests to make sure that if there are lots of
  groups present, we don't end up allocating too many request descriptors on
  the queue (a small usage sketch follows this changelog).

o Issues: Currently the notion of congestion is per queue. With per group
  request descriptors it is possible that the queue is not congested but the
  group the bio will go into is congested.
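
For illustration, here is a minimal userspace sketch of how the new
nr_group_requests tunable could be exercised once this patch is applied.
Only a sketch: the sysfs attribute name and the defaults come from this
patch; the device name (sda) and the new value are assumptions.

#include <stdio.h>
#include <stdlib.h>

/*
 * Read the per-group request descriptor limit for one disk and bump it.
 * Values below BLKDEV_MIN_RQ are clamped by the sysfs store function.
 */
int main(void)
{
	const char *path = "/sys/block/sda/queue/nr_group_requests";
	char buf[32];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror("nr_group_requests");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("current per-group limit: %ld\n", strtol(buf, NULL, 10));
	fclose(f);

	f = fopen(path, "w");
	if (!f) {
		perror("nr_group_requests");
		return 1;
	}
	/* Give each group a larger share; still capped by nr_requests. */
	fprintf(f, "%d\n", 128);
	fclose(f);
	return 0;
}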

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c       |  216 ++++++++++++++++++++++++++++++++++--------------
 block/blk-settings.c   |    3 +
 block/blk-sysfs.c      |   57 ++++++++++---
 block/elevator-fq.c    |   14 +++
 block/elevator-fq.h    |    5 +
 block/elevator.c       |    6 +-
 include/linux/blkdev.h |   62 +++++++++++++-
 7 files changed, 283 insertions(+), 80 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b19510a..9226cdd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -480,20 +480,31 @@ void blk_cleanup_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
 
 	rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
-	rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
-	rl->elvpriv = 0;
 	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
 	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
 
-	rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
-				mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+#ifndef CONFIG_GROUP_IOSCHED
+	struct request_list *rl = blk_get_request_list(q, NULL);
+
+	/*
+	 * In case of group scheduling, request list is inside the associated
+	 * group and when that group is instantiated, it takes care of
+	 * initializing the request list also.
+	 */
+	blk_init_request_list(rl);
+#endif
+	q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+				mempool_alloc_slab, mempool_free_slab,
+				request_cachep, q->node);
 
-	if (!rl->rq_pool)
+	if (!q->rq_data.rq_pool)
 		return -ENOMEM;
 
 	return 0;
@@ -590,6 +601,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 		return NULL;
 	}
 
+	/* init starved waiter wait queue */
+	init_waitqueue_head(&q->rq_data.starved_wait);
+
 	/*
 	 * if caller didn't supply a lock, they get per-queue locking with
 	 * our embedded lock
@@ -639,14 +653,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 {
 	if (rq->cmd_flags & REQ_ELVPRIV)
 		elv_put_request(q, rq);
-	mempool_free(rq, q->rq.rq_pool);
+	mempool_free(rq, q->rq_data.rq_pool);
 }
 
 static struct request *
 blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 					gfp_t gfp_mask)
 {
-	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
 
 	if (!rq)
 		return NULL;
@@ -657,7 +671,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 
 	if (priv) {
 		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
+			mempool_free(rq, q->rq_data.rq_pool);
 			return NULL;
 		}
 		rq->cmd_flags |= REQ_ELVPRIV;
@@ -700,18 +714,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
 	ioc->last_waited = jiffies;
 }
 
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
-	if (rl->count[sync] < queue_congestion_off_threshold(q))
+	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
-	if (rl->count[sync] + 1 <= q->nr_requests) {
+	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+		blk_clear_queue_full(q, sync);
+
+	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
-
-		blk_clear_queue_full(q, sync);
 	}
 }
 
@@ -719,18 +733,29 @@ static void __freed_request(struct request_queue *q, int sync)
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, int sync, int priv,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
+	BUG_ON(!rl->count[sync]);
 	rl->count[sync]--;
+
+	BUG_ON(!q->rq_data.count[sync]);
+	q->rq_data.count[sync]--;
+
 	if (priv)
-		rl->elvpriv--;
+		q->rq_data.elvpriv--;
 
-	__freed_request(q, sync);
+	__freed_request(q, sync, rl);
 
 	if (unlikely(rl->starved[sync ^ 1]))
-		__freed_request(q, sync ^ 1);
+		__freed_request(q, sync ^ 1, rl);
+
+	/* Wake up the starved process on global list, if any */
+	if (unlikely(q->rq_data.starved)) {
+		if (waitqueue_active(&q->rq_data.starved_wait))
+			wake_up(&q->rq_data.starved_wait);
+		q->rq_data.starved--;
+	}
 }
 
 /*
@@ -739,10 +764,9 @@ static void freed_request(struct request_queue *q, int sync, int priv)
  * Returns !NULL on success, with queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, int rw_flags,
-				   struct bio *bio, gfp_t gfp_mask)
+		   struct bio *bio, gfp_t gfp_mask, struct request_list *rl)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
@@ -751,31 +775,38 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
-	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
-		if (rl->count[is_sync]+1 >= q->nr_requests) {
-			ioc = current_io_context(GFP_ATOMIC, q->node);
-			/*
-			 * The queue will fill after this allocation, so set
-			 * it as full, and mark this process as "batching".
-			 * This process will be allowed to complete a batch of
-			 * requests, others will be blocked.
-			 */
-			if (!blk_queue_full(q, is_sync)) {
-				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
-			} else {
-				if (may_queue != ELV_MQUEUE_MUST
-						&& !ioc_batching(q, ioc)) {
-					/*
-					 * The queue is full and the allocating
-					 * process is not a "batcher", and not
-					 * exempted by the IO scheduler
-					 */
-					goto out;
-				}
+	if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+		blk_set_queue_congested(q, is_sync);
+
+	/*
+	 * Looks like there is no user of queue full now.
+	 * Keeping it for time being.
+	 */
+	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+		blk_set_queue_full(q, is_sync);
+
+	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+		ioc = current_io_context(GFP_ATOMIC, q->node);
+		/*
+		 * The queue request descriptor group will fill after this
+		 * allocation, so set
+		 * it as full, and mark this process as "batching".
+		 * This process will be allowed to complete a batch of
+		 * requests, others will be blocked.
+		 */
+		if (rl->count[is_sync] <= q->nr_group_requests)
+			ioc_set_batching(q, ioc);
+		else {
+			if (may_queue != ELV_MQUEUE_MUST
+					&& !ioc_batching(q, ioc)) {
+				/*
+				 * The queue is full and the allocating
+				 * process is not a "batcher", and not
+				 * exempted by the IO scheduler
+				 */
+				goto out;
 			}
 		}
-		blk_set_queue_congested(q, is_sync);
 	}
 
 	/*
@@ -783,21 +814,43 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	 * limit of requests, otherwise we could have thousands of requests
 	 * allocated with any setting of ->nr_requests
 	 */
-	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2))
+		goto out;
+
+	/*
+	 * Allocation of request is allowed from queue perspective. Now check
+	 * from per group request list
+	 */
+
+	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
 		goto out;
 
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
+	q->rq_data.count[is_sync]++;
+
 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
 	if (priv)
-		rl->elvpriv++;
+		q->rq_data.elvpriv++;
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
 	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	if (rq) {
+		/*
+		 * TODO. Implement group reference counting and take the
+		 * reference to the group to make sure group hence request
+		 * list does not go away till rq finishes.
+		 */
+		rq->rl = rl;
+	}
+#endif
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
@@ -807,7 +860,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 
 		/*
 		 * in the very unlikely event that allocation failed and no
@@ -817,10 +870,26 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * rq mempool into READ and WRITE
 		 */
 rq_starved:
-		if (unlikely(rl->count[is_sync] == 0))
-			rl->starved[is_sync] = 1;
-
-		goto out;
+		if (unlikely(rl->count[is_sync] == 0)) {
+			/*
+			 * If there is a request pending in other direction
+			 * in same io group, then set the starved flag of
+			 * the group request list. Otherwise, we need to
+			 * make this process sleep in global starved list
+			 * to make sure it will not sleep indefinitely.
+			 */
+			if (rl->count[is_sync ^ 1] != 0) {
+				rl->starved[is_sync] = 1;
+				goto out;
+			} else {
+				/*
+				 * It indicates to calling function to put
+				 * task on global starved list. Not the best
+				 * way
+				 */
+				return ERR_PTR(-ENOMEM);
+			}
+		}
 	}
 
 	/*
@@ -848,15 +917,29 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, bio);
 
-	rq = get_request(q, rw_flags, bio, GFP_NOIO);
-	while (!rq) {
+	rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
+	while (!rq || (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM)) {
 		DEFINE_WAIT(wait);
 		struct io_context *ioc;
-		struct request_list *rl = &q->rq;
 
-		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
-				TASK_UNINTERRUPTIBLE);
+		if (IS_ERR(rq) && PTR_ERR(rq) == -ENOMEM) {
+			/*
+			 * Task failed allocation and needs to wait and
+			 * try again. There are no requests pending from
+			 * the io group hence need to sleep on global
+			 * wait queue. Most likely the allocation failed
+			 * because of memory issues.
+			 */
+
+			q->rq_data.starved++;
+			prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+					&wait, TASK_UNINTERRUPTIBLE);
+		} else {
+			prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+					TASK_UNINTERRUPTIBLE);
+		}
 
 		trace_block_sleeprq(q, bio, rw_flags & 1);
 
@@ -876,7 +959,12 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		spin_lock_irq(q->queue_lock);
 		finish_wait(&rl->wait[is_sync], &wait);
 
-		rq = get_request(q, rw_flags, bio, GFP_NOIO);
+		/*
+		 * After the sleep, check the rl again in case the cgroup the
+		 * bio belonged to is gone and it is mapped to the root group now
+		 */
+		rl = blk_get_request_list(q, bio);
+		rq = get_request(q, rw_flags, bio, GFP_NOIO, rl);
 	};
 
 	return rq;
@@ -885,6 +973,7 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, NULL);
 
 	BUG_ON(rw != READ && rw != WRITE);
 
@@ -892,7 +981,7 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 	if (gfp_mask & __GFP_WAIT) {
 		rq = get_request_wait(q, rw, NULL);
 	} else {
-		rq = get_request(q, rw, NULL, gfp_mask);
+		rq = get_request(q, rw, NULL, gfp_mask, rl);
 		if (!rq)
 			spin_unlock_irq(q->queue_lock);
 	}
@@ -1075,12 +1164,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	if (req->cmd_flags & REQ_ALLOCED) {
 		int is_sync = rq_is_sync(req) != 0;
 		int priv = req->cmd_flags & REQ_ELVPRIV;
+		struct request_list *rl = rq_rl(q, req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
 		blk_free_request(q, req);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 57af728..8733192 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -123,6 +123,9 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	 * set defaults
 	 */
 	q->nr_requests = BLKDEV_MAX_RQ;
+#ifdef CONFIG_GROUP_IOSCHED
+	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
+#endif
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index c942ddc..b60b76e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -38,7 +38,7 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
-	struct request_list *rl = &q->rq;
+	struct request_list *rl = blk_get_request_list(q, NULL);
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
 	if (nr < BLKDEV_MIN_RQ)
@@ -48,32 +48,55 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
 
-	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_SYNC);
-	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_SYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_SYNC);
 
-	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_ASYNC);
-	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_ASYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
-	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_SYNC);
-	} else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_SYNC);
 		wake_up(&rl->wait[BLK_RW_SYNC]);
 	}
 
-	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_ASYNC);
-	} else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_ASYNC);
 		wake_up(&rl->wait[BLK_RW_ASYNC]);
 	}
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+					size_t count)
+{
+	unsigned long nr;
+	int ret = queue_var_store(&nr, page, count);
+	if (nr < BLKDEV_MIN_RQ)
+		nr = BLKDEV_MIN_RQ;
+
+	spin_lock_irq(q->queue_lock);
+	q->nr_group_requests = nr;
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+#endif
 
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
 {
@@ -224,6 +247,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
 	.store = queue_requests_store,
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_group_requests_show,
+	.store = queue_group_requests_store,
+};
+#endif
+
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -304,6 +335,9 @@ static struct queue_sysfs_entry queue_fairness_entry = {
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+	&queue_group_requests_entry.attr,
+#endif
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
@@ -385,12 +419,11 @@ static void blk_release_queue(struct kobject *kobj)
 {
 	struct request_queue *q =
 		container_of(kobj, struct request_queue, kobj);
-	struct request_list *rl = &q->rq;
 
 	blk_sync_queue(q);
 
-	if (rl->rq_pool)
-		mempool_destroy(rl->rq_pool);
+	if (q->rq_data.rq_pool)
+		mempool_destroy(q->rq_data.rq_pool);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69eaee4..bd98317 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -954,6 +954,16 @@ struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+	struct io_group *iog;
+
+	iog = io_get_io_group_bio(q, bio, 1);
+	BUG_ON(!iog);
+	return &iog->rl;
+}
+
 /*
  * Search the bfq_group for bfqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1203,6 +1213,8 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
 
+		blk_init_request_list(&iog->rl);
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -1447,6 +1459,8 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
 
+	blk_init_request_list(&iog->rl);
+
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 5fc7d48..58543ec 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -239,6 +239,9 @@ struct io_group {
 
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
+
+	/* request list associated with the group */
+	struct request_list rl;
 };
 
 /**
@@ -517,6 +520,8 @@ extern void elv_fq_unset_request_ioq(struct request_queue *q,
 extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
 extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 						struct bio *bio);
+extern struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio);
 
 /* Returns single ioq associated with the io group. */
 static inline struct io_queue *io_group_ioq(struct io_group *iog)
diff --git a/block/elevator.c b/block/elevator.c
index 3b83b2f..44c9fad 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -668,7 +668,7 @@ void elv_quiesce_start(struct request_queue *q)
 	 * make sure we don't have any requests in flight
 	 */
 	elv_drain_elevator(q);
-	while (q->rq.elvpriv) {
+	while (q->rq_data.elvpriv) {
 		blk_start_queueing(q);
 		spin_unlock_irq(q->queue_lock);
 		msleep(10);
@@ -768,8 +768,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 	}
 
 	if (unplug_it && blk_queue_plugged(q)) {
-		int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
-			- q->in_flight;
+		int nrq = q->rq_data.count[BLK_RW_SYNC] +
+				q->rq_data.count[BLK_RW_ASYNC] - q->in_flight;
 
 		if (nrq >= q->unplug_thresh)
 			__generic_unplug_device(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9c209a0..07aca2f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,21 +32,51 @@ struct request;
 struct sg_io_hdr;
 
 #define BLKDEV_MIN_RQ	4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ	256	/* Default maximum */
+#define BLKDEV_MAX_GROUP_RQ    64      /* Default maximum */
+#else
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+/*
+ * This is equivalent to the case of only one group present (root group). Let
+ * it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
+#endif
 
 struct request;
 typedef void (rq_end_io_fn)(struct request *, int);
 
 struct request_list {
 	/*
-	 * count[], starved[], and wait[] are indexed by
+	 * count[], starved and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
 	 */
 	int count[2];
 	int starved[2];
+	wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of the mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+	/*
+	 * Per queue request descriptor count. This is in addition to per
+	 * cgroup count
+	 */
+	int count[2];
 	int elvpriv;
 	mempool_t *rq_pool;
-	wait_queue_head_t wait[2];
+	int starved;
+	/*
+	 * Global list for starved tasks. A task will be queued here if
+	 * it could not allocate request descriptor and the associated
+	 * group request list does not have any requests pending.
+	 */
+	wait_queue_head_t starved_wait;
 };
 
 /*
@@ -253,6 +283,7 @@ struct request {
 #ifdef CONFIG_GROUP_IOSCHED
 	/* io group request belongs to */
 	struct io_group *iog;
+	struct request_list *rl;
 #endif /* GROUP_IOSCHED */
 #endif /* ELV_FAIR_QUEUING */
 };
@@ -342,6 +373,9 @@ struct request_queue
 	 */
 	struct request_list	rq;
 
+	/* Contains request pool and other data like starved data */
+	struct request_data	rq_data;
+
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
 	prep_rq_fn		*prep_rq_fn;
@@ -404,6 +438,8 @@ struct request_queue
 	 * queue settings
 	 */
 	unsigned long		nr_requests;	/* Max # of requests */
+	/* Max # of per io group requests */
+	unsigned long		nr_group_requests;
 	unsigned int		nr_congestion_on;
 	unsigned int		nr_congestion_off;
 	unsigned int		nr_batching;
@@ -776,6 +812,28 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern void blk_init_request_list(struct request_list *rl);
+
+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	return io_group_get_request_list(q, bio);
+#else
+	return &q->rq;
+#endif
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+						struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	return rq->rl;
+#else
+	return blk_get_request_list(q, NULL);
+#endif
+}
+
 /*
  * Temporary export, until SCSI gets fixed up.
  */
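
To summarize the allocation path changed above, here is a small standalone
model of the two-level admission check in get_request() (a sketch with
made-up counters, not kernel code): the queue-wide count is checked against
nr_requests and the group's own count against nr_group_requests, both with
the same 3/2 hard cutoff, so one busy group blocks on its own limit long
before it can drain the queue-wide pool. The global starved list only comes
into play when the mempool allocation itself fails and the group has nothing
in flight that could wake the task up later.

#include <stdbool.h>
#include <stdio.h>

struct rq_counts {
	int count[2];			/* [0] = async, [1] = sync */
};

/*
 * Model of the two-level check: queue-wide count vs nr_requests and
 * per-group count vs nr_group_requests, using the 3/2 hard cutoff.
 */
static bool may_allocate(struct rq_counts *q, int nr_requests,
			 struct rq_counts *grp, int nr_group_requests,
			 int sync)
{
	if (q->count[sync] >= 3 * nr_requests / 2)
		return false;		/* queue-wide limit reached */
	if (grp->count[sync] >= 3 * nr_group_requests / 2)
		return false;		/* this group used up its share */
	q->count[sync]++;
	grp->count[sync]++;
	return true;
}

int main(void)
{
	struct rq_counts q = { {0, 0} }, grp = { {0, 0} };
	int granted = 0;

	/* Defaults from this patch: nr_requests=256, nr_group_requests=64 */
	while (may_allocate(&q, 256, &grp, 64, 1))
		granted++;

	printf("one group got %d sync descriptors before blocking\n", granted);
	return 0;
}
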
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 17/18] io-controller: IO group refcounting support
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (15 preceding siblings ...)
  2009-05-05 19:58   ` [PATCH 16/18] io-controller: Per cgroup request descriptor support Vivek Goyal
@ 2009-05-05 19:58   ` Vivek Goyal
  2009-05-05 19:58   ` [PATCH 18/18] io-controller: Debug hierarchical IO scheduling Vivek Goyal
                     ` (4 subsequent siblings)
  21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o In the original BFQ patch, once a cgroup is deleted it will clean up the
  associated io groups immediately, and if there are any active io queues
  in that group, these will be moved to the root group. This movement of
  queues is not good from a fairness perspective, as one can then create a
  cgroup, dump lots of IO, delete the cgroup, and potentially get a higher
  share. Apart from that there are more issues, hence it was felt that we
  need an io group refcounting mechanism as well, so that io groups can be
  reclaimed asynchronously (a small model of the scheme is sketched below,
  after this changelog).

o This is a crude patch to implement io group refcounting. This is still
  work in progress and Nauman and Divyesh are playing with more ideas.

o I can do basic cgroup creation, deletion, and task movement operations and
  there are no crashes (as were reported with V1 by Gui), though I have not
  verified that io groups are actually being freed. Will do that next.

o There are a couple of hard-to-hit race conditions I am aware of. Will fix
  those in upcoming versions (RCU lookup when a group might be going away
  during cgroup deletion).
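
For illustration, a minimal userspace model of the refcounting scheme (a
sketch only; the struct and names below are made up and merely mirror
elv_get_iog()/elv_put_iog()): every holder, i.e. the io_cgroup's group
list, the elevator's group list, each child group and each busy ioq, takes
a reference, and the group is freed only when the last put drops the count
to zero, which is what lets reclaim happen after the cgroup itself is gone.

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for struct io_group; only the refcount matters here. */
struct iog_model {
	atomic_int ref;
	const char *name;
};

static void iog_get(struct iog_model *iog)
{
	atomic_fetch_add(&iog->ref, 1);
}

static void iog_put(struct iog_model *iog)
{
	/* Last reference dropped: safe to reclaim, even after cgroup rmdir. */
	if (atomic_fetch_sub(&iog->ref, 1) == 1) {
		printf("freeing group %s\n", iog->name);
		free(iog);
	}
}

int main(void)
{
	struct iog_model *iog = calloc(1, sizeof(*iog));

	atomic_init(&iog->ref, 0);
	iog->name = "test_group";
	iog_get(iog);		/* io_cgroup's group_data list */
	iog_get(iog);		/* elevator's group_list */
	iog_get(iog);		/* a busy ioq inside the group */

	iog_put(iog);		/* cgroup deleted */
	iog_put(iog);		/* elevator switch or exit */
	iog_put(iog);		/* last ioq drains; group is freed only now */
	return 0;
}

The asynchronous reclaim the changelog talks about is visible in the last
put: the group outlives both the cgroup and the elevator list until its
queues have drained.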

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/cfq-iosched.c |   16 ++-
 block/elevator-fq.c |  441 ++++++++++++++++++++++++++++++++++-----------------
 block/elevator-fq.h |   26 ++--
 3 files changed, 320 insertions(+), 163 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ea71239..cf9d258 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1308,8 +1308,17 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	if (sync_cfqq != NULL) {
 		__iog = cfqq_to_io_group(sync_cfqq);
-		if (iog != __iog)
-			io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+		/*
+		 * Drop the reference to the sync queue. A new sync queue will
+		 * be assigned in the new group upon arrival of a fresh request.
+		 * If the old queue has got requests, those requests will be
+		 * dispatched over a period of time and the queue will be freed
+		 * automatically.
+		 */
+		if (iog != __iog) {
+			cic_set_cfqq(cic, NULL, 1);
+			cfq_put_queue(sync_cfqq);
+		}
 	}
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
@@ -1422,6 +1431,9 @@ alloc_ioq:
 			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
+
+		/* ioq reference on iog */
+		elv_get_iog(iog);
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
 	}
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index bd98317..1dd0bb3 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -36,7 +36,7 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
 					struct io_queue *ioq, int probe);
 struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 						 int extract);
-void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+void elv_release_ioq(struct io_queue **ioq_ptr);
 int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
 					int force);
 
@@ -108,6 +108,16 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
 {
 	BUG_ON(sd->next_active != entity);
 }
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+	struct io_group *iog = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data != NULL)
+		iog = container_of(entity, struct io_group, entity);
+	return iog;
+}
 #else /* GROUP_IOSCHED */
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
@@ -124,6 +134,11 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
 					 struct io_entity *entity)
 {
 }
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+	return NULL;
+}
 #endif
 
 /*
@@ -224,7 +239,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
 				struct io_entity *entity)
 {
 	struct rb_node *next;
-	struct io_queue *ioq = io_entity_to_ioq(entity);
 
 	BUG_ON(entity->tree != &st->idle);
 
@@ -239,10 +253,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
 	}
 
 	bfq_extract(&st->idle, entity);
-
-	/* Delete queue from idle list */
-	if (ioq)
-		list_del(&ioq->queue_list);
 }
 
 /**
@@ -374,9 +384,12 @@ static void bfq_active_insert(struct io_service_tree *st,
 void bfq_get_entity(struct io_entity *entity)
 {
 	struct io_queue *ioq = io_entity_to_ioq(entity);
+	struct io_group *iog = io_entity_to_iog(entity);
 
 	if (ioq)
 		elv_get_ioq(ioq);
+	else
+		elv_get_iog(iog);
 }
 
 /**
@@ -436,7 +449,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
 {
 	struct io_entity *first_idle = st->first_idle;
 	struct io_entity *last_idle = st->last_idle;
-	struct io_queue *ioq = io_entity_to_ioq(entity);
 
 	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
 		st->first_idle = entity;
@@ -444,10 +456,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
 		st->last_idle = entity;
 
 	bfq_insert(&st->idle, entity);
-
-	/* Add this queue to idle list */
-	if (ioq)
-		list_add(&ioq->queue_list, &ioq->efqd->idle_list);
 }
 
 /**
@@ -463,14 +471,21 @@ static void bfq_forget_entity(struct io_service_tree *st,
 				struct io_entity *entity)
 {
 	struct io_queue *ioq = NULL;
+	struct io_group *iog = NULL;
 
 	BUG_ON(!entity->on_st);
 	entity->on_st = 0;
 	st->wsum -= entity->weight;
+
 	ioq = io_entity_to_ioq(entity);
-	if (!ioq)
+	if (ioq) {
+		elv_put_ioq(ioq);
 		return;
-	elv_put_ioq(ioq);
+	}
+
+	iog = io_entity_to_iog(entity);
+	if (iog)
+		elv_put_iog(iog);
 }
 
 /**
@@ -909,21 +924,21 @@ void entity_served(struct io_entity *entity, bfq_service_t served,
 /*
  * Release all the io group references to its async queues.
  */
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+void io_put_io_group_queues(struct io_group *iog)
 {
 	int i, j;
 
 	for (i = 0; i < 2; i++)
 		for (j = 0; j < IOPRIO_BE_NR; j++)
-			elv_release_ioq(e, &iog->async_queue[i][j]);
+			elv_release_ioq(&iog->async_queue[i][j]);
 
 	/* Free up async idle queue */
-	elv_release_ioq(e, &iog->async_idle_queue);
+	elv_release_ioq(&iog->async_idle_queue);
 
 #ifdef CONFIG_GROUP_IOSCHED
 	/* Optimization for io schedulers having single ioq */
-	if (elv_iosched_single_ioq(e))
-		elv_release_ioq(e, &iog->ioq);
+	if (iog->ioq)
+		elv_release_ioq(&iog->ioq);
 #endif
 }
 
@@ -1018,6 +1033,9 @@ void io_group_set_parent(struct io_group *iog, struct io_group *parent)
 	entity = &iog->entity;
 	entity->parent = parent->my_entity;
 	entity->sched_data = &parent->sched_data;
+	if (entity->parent)
+		/* Child group reference on parent group */
+		elv_get_iog(parent);
 }
 
 /**
@@ -1210,6 +1228,9 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 		if (!iog)
 			goto cleanup;
 
+		atomic_set(&iog->ref, 0);
+		iog->deleting = 0;
+
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
 
@@ -1279,7 +1300,12 @@ void io_group_chain_link(struct request_queue *q, void *key,
 
 		rcu_assign_pointer(leaf->key, key);
 		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		/* io_cgroup reference on io group */
+		elv_get_iog(leaf);
+
 		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+		/* elevator reference on io group */
+		elv_get_iog(leaf);
 
 		spin_unlock_irqrestore(&iocg->lock, flags);
 
@@ -1388,12 +1414,23 @@ struct io_cgroup *get_iocg_from_bio(struct bio *bio)
 	if (!iocg)
 		return &io_root_cgroup;
 
+	/*
+	 * If this cgroup io_cgroup is being deleted, map the bio to
+	 * root cgroup
+	 */
+	if (css_is_removed(&iocg->css))
+		return &io_root_cgroup;
+
 	return iocg;
 }
 
 /*
  * Find the io group bio belongs to.
  * If "create" is set, io group is created if it is not already present.
+ *
+ * Note: There is a narrow window of race where a group is being freed
+ * by cgroup deletion path and some rq has slipped through in this group.
+ * Fix it.
  */
 struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
 					int create)
@@ -1440,8 +1477,8 @@ void io_free_root_group(struct elevator_queue *e)
 	spin_lock_irq(&iocg->lock);
 	hlist_del_rcu(&iog->group_node);
 	spin_unlock_irq(&iocg->lock);
-	io_put_io_group_queues(e, iog);
-	kfree(iog);
+	io_put_io_group_queues(iog);
+	elv_put_iog(iog);
 }
 
 struct io_group *io_alloc_root_group(struct request_queue *q,
@@ -1459,11 +1496,15 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
 
+	atomic_set(&iog->ref, 0);
+
 	blk_init_request_list(&iog->rl);
 
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
+	/* elevator reference. */
+	elv_get_iog(iog);
 	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
 	spin_unlock_irq(&iocg->lock);
 
@@ -1560,105 +1601,109 @@ void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
 }
 
 /*
- * Move the queue to the root group if it is active. This is needed when
- * a cgroup is being deleted and all the IO is not done yet. This is not
- * very good scheme as a user might get unfair share. This needs to be
- * fixed.
+ * check whether a given group has got any active entities on any of the
+ * service tree.
  */
-void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
-				struct io_group *iog)
+static inline int io_group_has_active_entities(struct io_group *iog)
 {
-	int busy, resume;
-	struct io_entity *entity = &ioq->entity;
-	struct elv_fq_data *efqd = &e->efqd;
-	struct io_service_tree *st = io_entity_service_tree(entity);
+	int i;
+	struct io_service_tree *st;
 
-	busy = elv_ioq_busy(ioq);
-	resume = !!ioq->nr_queued;
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		if (!RB_EMPTY_ROOT(&st->active))
+			return 1;
+	}
 
-	BUG_ON(resume && !entity->on_st);
-	BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+	return 0;
+}
+
+/*
+ * Should be called with both iocg->lock as well as queue lock held (if
+ * group is still connected on elevator list)
+ */
+void __iocg_destroy(struct io_cgroup *iocg, struct io_group *iog,
+				int queue_lock_held)
+{
+	int i;
+	struct io_service_tree *st;
 
 	/*
-	 * We could be moving an queue which is on idle tree of previous group
-	 * What to do? I guess anyway this queue does not have any requests.
-	 * just forget the entity and free up from idle tree.
-	 *
-	 * This needs cleanup. Hackish.
+	 * If we are here then we got the queue lock if group was still on
+	 * elevator list. If group had already been disconnected from elevator
+	 * list, then we don't need the queue lock.
 	 */
-	if (entity->tree == &st->idle) {
-		BUG_ON(atomic_read(&ioq->ref) < 2);
-		bfq_put_idle_entity(st, entity);
-	}
 
-	if (busy) {
-		BUG_ON(atomic_read(&ioq->ref) < 2);
-
-		if (!resume)
-			elv_del_ioq_busy(e, ioq, 0);
-		else
-			elv_deactivate_ioq(efqd, ioq, 0);
-	}
+	/* Remove io group from cgroup list */
+	hlist_del(&iog->group_node);
 
 	/*
-	 * Here we use a reference to bfqg.  We don't need a refcounter
-	 * as the cgroup reference will not be dropped, so that its
-	 * destroy() callback will not be invoked.
+	 * Mark io group for deletion so that no new entry goes in
+	 * idle tree. Any active queue will be removed from active
+	 * tree and not put in to idle tree.
 	 */
-	entity->parent = iog->my_entity;
-	entity->sched_data = &iog->sched_data;
+	iog->deleting = 1;
 
-	if (busy && resume)
-		elv_activate_ioq(ioq, 0);
-}
-EXPORT_SYMBOL(io_ioq_move);
+	/* Flush idle tree.  */
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
 
-static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
-{
-	struct elevator_queue *eq;
-	struct io_entity *entity = iog->my_entity;
-	struct io_service_tree *st;
-	int i;
+	/*
+	 * Drop io group reference on all async queues. This group is
+	 * going away so once these queues are empty, free those up
+	 * instead of keeping these around in the hope that new IO
+	 * will come.
+	 *
+	 * Note: If this group is disconnected from elevator, elevator
+	 * switch must have already done it.
+	 */
 
-	eq = container_of(efqd, struct elevator_queue, efqd);
-	hlist_del(&iog->elv_data_node);
-	__bfq_deactivate_entity(entity, 0);
-	io_put_io_group_queues(eq, iog);
+	io_put_io_group_queues(iog);
 
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
-		st = iog->sched_data.service_tree + i;
+	if (!io_group_has_active_entities(iog)) {
+		/*
+		 * io group does not have any active entities. Because this
+		 * group has been decoupled from io_cgroup list and this
+		 * cgroup is being deleted, this group should not receive
+		 * any new IO. Hence it should be safe to deactivate this
+		 * io group and remove from the scheduling tree.
+		 */
+		__bfq_deactivate_entity(iog->my_entity, 0);
 
 		/*
-		 * The idle tree may still contain bfq_queues belonging
-		 * to exited task because they never migrated to a different
-		 * cgroup from the one being destroyed now.  Noone else
-		 * can access them so it's safe to act without any lock.
+		 * Because this io group does not have any active entities,
+		 * it should be safe to remove it from elevator list and
+		 * drop elevator reference so that upon dropping io_cgroup
+		 * reference, this io group should be freed and we don't
+		 * wait for elevator switch to happen to free the group
+		 * up.
 		 */
-		io_flush_idle_tree(st);
+		if (queue_lock_held) {
+			hlist_del(&iog->elv_data_node);
+			rcu_assign_pointer(iog->key, NULL);
+			/*
+			 * Drop iog reference taken by elevator
+			 * (efqd->group_list)
+			 */
+			elv_put_iog(iog);
+		}
 
-		BUG_ON(!RB_EMPTY_ROOT(&st->active));
-		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
 	}
 
-	BUG_ON(iog->sched_data.next_active != NULL);
-	BUG_ON(iog->sched_data.active_entity != NULL);
-	BUG_ON(entity->tree != NULL);
+	/* Drop iocg reference on io group */
+	elv_put_iog(iog);
 }
 
-/**
- * bfq_destroy_group - destroy @bfqg.
- * @bgrp: the bfqio_cgroup containing @bfqg.
- * @bfqg: the group being destroyed.
- *
- * Destroy @bfqg, making sure that it is not referenced from its parent.
- */
-static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 {
-	struct elv_fq_data *efqd = NULL;
-	unsigned long uninitialized_var(flags);
-
-	/* Remove io group from cgroup list */
-	hlist_del(&iog->group_node);
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct hlist_node *n, *tmp;
+	struct io_group *iog;
+	unsigned long flags;
+	int queue_lock_held = 0;
+	struct elv_fq_data *efqd;
 
 	/*
 	 * io groups are linked in two lists. One list is maintained
@@ -1677,58 +1722,93 @@ static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
 	 * try to free up async queues again or flush the idle tree.
 	 */
 
-	rcu_read_lock();
-	efqd = rcu_dereference(iog->key);
-	if (efqd != NULL) {
-		spin_lock_irqsave(efqd->queue->queue_lock, flags);
-		if (iog->key == efqd)
-			__io_destroy_group(efqd, iog);
-		spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
-	}
-	rcu_read_unlock();
-
-	/*
-	 * No need to defer the kfree() to the end of the RCU grace
-	 * period: we are called from the destroy() callback of our
-	 * cgroup, so we can be sure that noone is a) still using
-	 * this cgroup or b) doing lookups in it.
-	 */
-	kfree(iog);
-}
+retry:
+	spin_lock_irqsave(&iocg->lock, flags);
+	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node) {
+		/* Take the group queue lock */
+		rcu_read_lock();
+		efqd = rcu_dereference(iog->key);
+		if (efqd != NULL) {
+			if (spin_trylock_irq(efqd->queue->queue_lock)) {
+				if (iog->key == efqd) {
+					queue_lock_held = 1;
+					rcu_read_unlock();
+					goto locked;
+				}
 
-void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
-	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
-	struct hlist_node *n, *tmp;
-	struct io_group *iog;
+				/*
+				 * After acquiring the queue lock, we found
+				 * iog->key==NULL, that means elevator switch
+				 * completed, group is no longer connected on
+				 * elevator hence we can proceed safely without
+				 * queue lock.
+				 */
+				spin_unlock_irq(efqd->queue->queue_lock);
+			} else {
+				/*
+				 * Did not get the queue lock while trying.
+				 * Backout. Drop iocg->lock and try again
+				 */
+				rcu_read_unlock();
+				spin_unlock_irqrestore(&iocg->lock, flags);
+				udelay(100);
+				goto retry;
 
-	/*
-	 * Since we are destroying the cgroup, there are no more tasks
-	 * referencing it, and all the RCU grace periods that may have
-	 * referenced it are ended (as the destruction of the parent
-	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
-	 * anything else and we don't need any synchronization.
-	 */
-	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
-		io_destroy_group(iocg, iog);
+			}
+		}
+		/*
+		 * We come here when iog->key==NULL, that means elevator switch
+		 * has already taken place and now this group is no more
+		 * connected on elevator and one does not have to have a
+		 * queue lock to do the cleanup.
+		 */
+		rcu_read_unlock();
+locked:
+		__iocg_destroy(iocg, iog, queue_lock_held);
+		if (queue_lock_held) {
+			spin_unlock_irq(efqd->queue->queue_lock);
+			queue_lock_held = 0;
+		}
+	}
+	spin_unlock_irqrestore(&iocg->lock, flags);
 
 	BUG_ON(!hlist_empty(&iocg->group_data));
 
 	kfree(iocg);
 }
 
+/* Should be called with queue lock held */
 void io_disconnect_groups(struct elevator_queue *e)
 {
 	struct hlist_node *pos, *n;
 	struct io_group *iog;
 	struct elv_fq_data *efqd = &e->efqd;
+	int i;
+	struct io_service_tree *st;
 
 	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
 					elv_data_node) {
-		hlist_del(&iog->elv_data_node);
-
+		/*
+		 * At this point of time group should be on idle tree. This
+		 * would extract the group from idle tree.
+		 */
 		__bfq_deactivate_entity(iog->my_entity, 0);
 
+		/* Flush all the idle trees of the group */
+		for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+			st = iog->sched_data.service_tree + i;
+			io_flush_idle_tree(st);
+		}
+
+		/*
+		 * This has to be here also apart from cgroup cleanup path
+		 * and the reason being that if async queue reference of the
+		 * group are not dropped, then async ioq as well as associated
+		 * queue will not be reclaimed. Apart from that async cfqq
+		 * has to be cleaned up before elevator goes away.
+		 */
+		io_put_io_group_queues(iog);
+
 		/*
 		 * Don't remove from the group hash, just set an
 		 * invalid key.  No lookups can race with the
@@ -1736,11 +1816,68 @@ void io_disconnect_groups(struct elevator_queue *e)
 		 * implies also that new elements cannot be added
 		 * to the list.
 		 */
+		hlist_del(&iog->elv_data_node);
 		rcu_assign_pointer(iog->key, NULL);
-		io_put_io_group_queues(e, iog);
+		/* Drop iog reference taken by elevator (efqd->group_list)*/
+		elv_put_iog(iog);
 	}
 }
 
+/*
+ * This cleanup function does the last bit of things to destroy the cgroup.
+ * It should only get called after io_destroy_group has been invoked.
+ */
+void io_group_cleanup(struct io_group *iog)
+{
+	struct io_service_tree *st;
+	struct io_entity *entity = iog->my_entity;
+	int i;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+		BUG_ON(st->wsum != 0);
+	}
+
+	BUG_ON(iog->sched_data.next_active != NULL);
+	BUG_ON(iog->sched_data.active_entity != NULL);
+	BUG_ON(entity != NULL && entity->tree != NULL);
+
+	kfree(iog);
+}
+
+/*
+ * Should be called with queue lock held. The only case it can be called
+ * without queue lock held is when elevator has gone away leaving behind
+ * dead io groups which are hanging there to be reclaimed when cgroup is
+ * deleted. In case of cgroup deletion, I think there is only one thread
+ * doing deletion and rest of the threads should have been taken care by
+ * cgroup stuff.
+ */
+void elv_put_iog(struct io_group *iog)
+{
+	struct io_group *parent = NULL;
+
+	BUG_ON(!iog);
+
+	BUG_ON(atomic_read(&iog->ref) <= 0);
+	if (!atomic_dec_and_test(&iog->ref))
+		return;
+
+	BUG_ON(iog->entity.on_st);
+
+	if (iog->my_entity)
+		parent = container_of(iog->my_entity->parent,
+				      struct io_group, entity);
+	io_group_cleanup(iog);
+
+	if (parent)
+		elv_put_iog(parent);
+}
+EXPORT_SYMBOL(elv_put_iog);
+
 struct cgroup_subsys io_subsys = {
 	.name = "io",
 	.create = iocg_create,
@@ -1887,6 +2024,8 @@ alloc_ioq:
 		elv_init_ioq(e, ioq, rq->iog, sched_q, IOPRIO_CLASS_BE, 4, 1);
 		io_group_set_ioq(iog, ioq);
 		elv_mark_ioq_sync(ioq);
+		/* ioq reference on iog */
+		elv_get_iog(iog);
 	}
 
 	if (new_sched_q)
@@ -1987,7 +2126,7 @@ EXPORT_SYMBOL(io_get_io_group_bio);
 void io_free_root_group(struct elevator_queue *e)
 {
 	struct io_group *iog = e->efqd.root_group;
-	io_put_io_group_queues(e, iog);
+	io_put_io_group_queues(iog);
 	kfree(iog);
 }
 
@@ -2437,13 +2576,11 @@ void elv_put_ioq(struct io_queue *ioq)
 }
 EXPORT_SYMBOL(elv_put_ioq);
 
-void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+void elv_release_ioq(struct io_queue **ioq_ptr)
 {
-	struct io_group *root_group = e->efqd.root_group;
 	struct io_queue *ioq = *ioq_ptr;
 
 	if (ioq != NULL) {
-		io_ioq_move(e, ioq, root_group);
 		/* Drop the reference taken by the io group */
 		elv_put_ioq(ioq);
 		*ioq_ptr = NULL;
@@ -2600,9 +2737,19 @@ void elv_activate_ioq(struct io_queue *ioq, int add_front)
 void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 					int requeue)
 {
+	struct io_group *iog = ioq_to_io_group(ioq);
+
 	if (ioq == efqd->active_queue)
 		elv_reset_active_ioq(efqd);
 
+	/*
+	 * The io group ioq belongs to is going away. Don't requeue the
+	 * ioq on idle tree. Free it.
+	 */
+#ifdef CONFIG_GROUP_IOSCHED
+	if (iog->deleting == 1)
+		requeue = 0;
+#endif
 	bfq_deactivate_entity(&ioq->entity, requeue);
 }
 
@@ -3002,15 +3149,6 @@ void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
 	}
 }
 
-void elv_free_idle_ioq_list(struct elevator_queue *e)
-{
-	struct io_queue *ioq, *n;
-	struct elv_fq_data *efqd = &e->efqd;
-
-	list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
-		elv_deactivate_ioq(efqd, ioq, 0);
-}
-
 /*
  * Call iosched to let that elevator wants to expire the queue. This gives
  * iosched like AS to say no (if it is in the middle of batch changeover or
@@ -3427,7 +3565,6 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
 
-	INIT_LIST_HEAD(&efqd->idle_list);
 	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
@@ -3458,9 +3595,19 @@ void elv_exit_fq_data(struct elevator_queue *e)
 	elv_shutdown_timer_wq(e);
 
 	spin_lock_irq(q->queue_lock);
-	/* This should drop all the idle tree references of ioq */
-	elv_free_idle_ioq_list(e);
-	/* This should drop all the io group references of async queues */
+	/*
+	 * This should drop all the references of async queues taken by
+	 * io group.
+	 *
+	 * Also should deactivate the group and extract it from the
+	 * idle tree. (group can not be on active tree now after the
+	 * elevator has been drained).
+	 *
+	 * Should flush idle tree of the group which inturn will drop
+	 * ioq reference taken by active/idle tree.
+	 *
+	 * Drop the iog reference taken by elevator.
+	 */
 	io_disconnect_groups(e);
 	spin_unlock_irq(q->queue_lock);
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 58543ec..42e3777 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -165,7 +165,6 @@ struct io_queue {
 
 	/* Pointer to generic elevator data structure */
 	struct elv_fq_data *efqd;
-	struct list_head queue_list;
 	pid_t pid;
 
 	/* Number of requests queued on this io queue */
@@ -219,6 +218,7 @@ struct io_queue {
  *    o All the other fields are protected by the @bfqd queue lock.
  */
 struct io_group {
+	atomic_t ref;
 	struct io_entity entity;
 	struct hlist_node elv_data_node;
 	struct hlist_node group_node;
@@ -242,6 +242,9 @@ struct io_group {
 
 	/* request list associated with the group */
 	struct request_list rl;
+
+	/* io group is going away */
+	int deleting;
 };
 
 /**
@@ -279,9 +282,6 @@ struct elv_fq_data {
 	/* List of io groups hanging on this elevator */
 	struct hlist_head group_list;
 
-	/* List of io queues on idle tree. */
-	struct list_head idle_list;
-
 	struct request_queue *queue;
 	unsigned int busy_queues;
 	/*
@@ -504,8 +504,6 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 
 #ifdef CONFIG_GROUP_IOSCHED
 extern int io_group_allow_merge(struct request *rq, struct bio *bio);
-extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
-					struct io_group *iog);
 extern void elv_fq_set_request_io_group(struct request_queue *q,
 					struct request *rq, struct bio *bio);
 static inline bfq_weight_t iog_weight(struct io_group *iog)
@@ -523,6 +521,8 @@ extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 extern struct request_list *io_group_get_request_list(struct request_queue *q,
 						struct bio *bio);
 
+extern void elv_put_iog(struct io_group *iog);
+
 /* Returns single ioq associated with the io group. */
 static inline struct io_queue *io_group_ioq(struct io_group *iog)
 {
@@ -545,17 +545,12 @@ static inline struct io_group *rq_iog(struct request_queue *q,
 	return rq->iog;
 }
 
-#else /* !GROUP_IOSCHED */
-/*
- * No ioq movement is needed in case of flat setup. root io group gets cleaned
- * up upon elevator exit and before that it has been made sure that both
- * active and idle tree are empty.
- */
-static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
-					struct io_group *iog)
+static inline void elv_get_iog(struct io_group *iog)
 {
+	atomic_inc(&iog->ref);
 }
 
+#else /* !GROUP_IOSCHED */
 static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 {
 	return 1;
@@ -608,6 +603,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 	return NULL;
 }
 
+static inline void elv_get_iog(struct io_group *iog) { }
+
+static inline void elv_put_iog(struct io_group *iog) { }
 
 extern struct io_group *rq_iog(struct request_queue *q, struct request *rq);
 
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 17/18] io-controller: IO group refcounting support
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (30 preceding siblings ...)
  2009-05-05 19:58 ` [PATCH 17/18] io-controller: IO group refcounting support Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
       [not found]   ` <1241553525-28095-18-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-05 19:58 ` [PATCH 18/18] io-controller: Debug hierarchical IO scheduling Vivek Goyal
                   ` (5 subsequent siblings)
  37 siblings, 1 reply; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda
  Cc: vgoyal, akpm

o In the original BFQ patch, once a cgroup is deleted it will clean up the
  associated io groups immediately, and if there are any active io queues
  in that group, these will be moved to the root group. This movement of
  queues is not good from a fairness perspective, as one can then create a
  cgroup, dump lots of IO, delete the cgroup, and potentially get a higher
  share. Apart from that there are more issues, hence it was felt that we
  need an io group refcounting mechanism as well, so that io groups can be
  reclaimed asynchronously (a small model of the scheme is sketched below,
  after this changelog).

o This is a crude patch to implement io group refcounting. This is still
  work in progress and Nauman and Divyesh are playing with more ideas.

o I can do basic cgroup creation, deletion, and task movement operations and
  there are no crashes (as were reported with V1 by Gui), though I have not
  verified that io groups are actually being freed. Will do that next.

o There are a couple of hard-to-hit race conditions I am aware of. Will fix
  those in upcoming versions (RCU lookup when a group might be going away
  during cgroup deletion).
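
As a companion sketch for the changelog above, here is a small userspace
model of the parent chaining this patch adds (hypothetical code; only the
idea mirrors elv_get_iog()/elv_put_iog() and io_group_set_parent()): a
child group pins its parent when it is linked, and dropping the last
reference on the child releases the parent reference as well, so an idle
hierarchy unwinds from the leaves upward.

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct iog_model {
	atomic_int ref;
	struct iog_model *parent;	/* reference held on the parent */
	const char *name;
};

static void iog_get(struct iog_model *iog)
{
	atomic_fetch_add(&iog->ref, 1);
}

static void iog_put(struct iog_model *iog)
{
	struct iog_model *parent;

	if (atomic_fetch_sub(&iog->ref, 1) != 1)
		return;
	parent = iog->parent;
	printf("freeing %s\n", iog->name);
	free(iog);
	if (parent)
		iog_put(parent);	/* drop the child's reference on its parent */
}

static struct iog_model *iog_alloc(const char *name, struct iog_model *parent)
{
	struct iog_model *iog = calloc(1, sizeof(*iog));

	atomic_init(&iog->ref, 0);
	iog->name = name;
	iog->parent = parent;
	iog_get(iog);			/* creator's reference */
	if (parent)
		iog_get(parent);	/* mirrors io_group_set_parent() */
	return iog;
}

int main(void)
{
	struct iog_model *root = iog_alloc("root", NULL);
	struct iog_model *child = iog_alloc("grp_a", root);

	iog_put(child);	/* frees the child and drops its reference on root */
	iog_put(root);	/* creator's reference on root; root is freed here */
	return 0;
}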

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |   16 ++-
 block/elevator-fq.c |  441 ++++++++++++++++++++++++++++++++++-----------------
 block/elevator-fq.h |   26 ++--
 3 files changed, 320 insertions(+), 163 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ea71239..cf9d258 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1308,8 +1308,17 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	if (sync_cfqq != NULL) {
 		__iog = cfqq_to_io_group(sync_cfqq);
-		if (iog != __iog)
-			io_ioq_move(q->elevator, sync_cfqq->ioq, iog);
+		 * Drop the reference to the sync queue. A new sync queue will
+		 * be assigned in the new group upon arrival of a fresh request.
+		 * If the old queue has got requests, those requests will be
+		 * dispatched over a period of time and the queue will be freed
+		 * automatically.
+		 * automatically.
+		 */
+		if (iog != __iog) {
+			cic_set_cfqq(cic, NULL, 1);
+			cfq_put_queue(sync_cfqq);
+		}
 	}
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
@@ -1422,6 +1431,9 @@ alloc_ioq:
 			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
+
+		/* ioq reference on iog */
+		elv_get_iog(iog);
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
 	}
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index bd98317..1dd0bb3 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -36,7 +36,7 @@ static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
 					struct io_queue *ioq, int probe);
 struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 						 int extract);
-void elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+void elv_release_ioq(struct io_queue **ioq_ptr);
 int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
 					int force);
 
@@ -108,6 +108,16 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
 {
 	BUG_ON(sd->next_active != entity);
 }
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+	struct io_group *iog = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data != NULL)
+		iog = container_of(entity, struct io_group, entity);
+	return iog;
+}
 #else /* GROUP_IOSCHED */
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
@@ -124,6 +134,11 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
 					 struct io_entity *entity)
 {
 }
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+	return NULL;
+}
 #endif
 
 /*
@@ -224,7 +239,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
 				struct io_entity *entity)
 {
 	struct rb_node *next;
-	struct io_queue *ioq = io_entity_to_ioq(entity);
 
 	BUG_ON(entity->tree != &st->idle);
 
@@ -239,10 +253,6 @@ static void bfq_idle_extract(struct io_service_tree *st,
 	}
 
 	bfq_extract(&st->idle, entity);
-
-	/* Delete queue from idle list */
-	if (ioq)
-		list_del(&ioq->queue_list);
 }
 
 /**
@@ -374,9 +384,12 @@ static void bfq_active_insert(struct io_service_tree *st,
 void bfq_get_entity(struct io_entity *entity)
 {
 	struct io_queue *ioq = io_entity_to_ioq(entity);
+	struct io_group *iog = io_entity_to_iog(entity);
 
 	if (ioq)
 		elv_get_ioq(ioq);
+	else
+		elv_get_iog(iog);
 }
 
 /**
@@ -436,7 +449,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
 {
 	struct io_entity *first_idle = st->first_idle;
 	struct io_entity *last_idle = st->last_idle;
-	struct io_queue *ioq = io_entity_to_ioq(entity);
 
 	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
 		st->first_idle = entity;
@@ -444,10 +456,6 @@ static void bfq_idle_insert(struct io_service_tree *st,
 		st->last_idle = entity;
 
 	bfq_insert(&st->idle, entity);
-
-	/* Add this queue to idle list */
-	if (ioq)
-		list_add(&ioq->queue_list, &ioq->efqd->idle_list);
 }
 
 /**
@@ -463,14 +471,21 @@ static void bfq_forget_entity(struct io_service_tree *st,
 				struct io_entity *entity)
 {
 	struct io_queue *ioq = NULL;
+	struct io_group *iog = NULL;
 
 	BUG_ON(!entity->on_st);
 	entity->on_st = 0;
 	st->wsum -= entity->weight;
+
 	ioq = io_entity_to_ioq(entity);
-	if (!ioq)
+	if (ioq) {
+		elv_put_ioq(ioq);
 		return;
-	elv_put_ioq(ioq);
+	}
+
+	iog = io_entity_to_iog(entity);
+	if (iog)
+		elv_put_iog(iog);
 }
 
 /**
@@ -909,21 +924,21 @@ void entity_served(struct io_entity *entity, bfq_service_t served,
 /*
  * Release all the io group references to its async queues.
  */
-void io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+void io_put_io_group_queues(struct io_group *iog)
 {
 	int i, j;
 
 	for (i = 0; i < 2; i++)
 		for (j = 0; j < IOPRIO_BE_NR; j++)
-			elv_release_ioq(e, &iog->async_queue[i][j]);
+			elv_release_ioq(&iog->async_queue[i][j]);
 
 	/* Free up async idle queue */
-	elv_release_ioq(e, &iog->async_idle_queue);
+	elv_release_ioq(&iog->async_idle_queue);
 
 #ifdef CONFIG_GROUP_IOSCHED
 	/* Optimization for io schedulers having single ioq */
-	if (elv_iosched_single_ioq(e))
-		elv_release_ioq(e, &iog->ioq);
+	if (iog->ioq)
+		elv_release_ioq(&iog->ioq);
 #endif
 }
 
@@ -1018,6 +1033,9 @@ void io_group_set_parent(struct io_group *iog, struct io_group *parent)
 	entity = &iog->entity;
 	entity->parent = parent->my_entity;
 	entity->sched_data = &parent->sched_data;
+	if (entity->parent)
+		/* Child group reference on parent group */
+		elv_get_iog(parent);
 }
 
 /**
@@ -1210,6 +1228,9 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 		if (!iog)
 			goto cleanup;
 
+		atomic_set(&iog->ref, 0);
+		iog->deleting = 0;
+
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
 
@@ -1279,7 +1300,12 @@ void io_group_chain_link(struct request_queue *q, void *key,
 
 		rcu_assign_pointer(leaf->key, key);
 		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		/* io_cgroup reference on io group */
+		elv_get_iog(leaf);
+
 		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+		/* elevator reference on io group */
+		elv_get_iog(leaf);
 
 		spin_unlock_irqrestore(&iocg->lock, flags);
 
@@ -1388,12 +1414,23 @@ struct io_cgroup *get_iocg_from_bio(struct bio *bio)
 	if (!iocg)
 		return &io_root_cgroup;
 
+	/*
+	 * If this io_cgroup is being deleted, map the bio to the
+	 * root cgroup.
+	 */
+	if (css_is_removed(&iocg->css))
+		return &io_root_cgroup;
+
 	return iocg;
 }
 
 /*
  * Find the io group bio belongs to.
  * If "create" is set, io group is created if it is not already present.
+ *
+ * Note: There is a narrow race window where a group is being freed by
+ * the cgroup deletion path while some rq has slipped through into this
+ * group. Fix it.
  */
 struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
 					int create)
@@ -1440,8 +1477,8 @@ void io_free_root_group(struct elevator_queue *e)
 	spin_lock_irq(&iocg->lock);
 	hlist_del_rcu(&iog->group_node);
 	spin_unlock_irq(&iocg->lock);
-	io_put_io_group_queues(e, iog);
-	kfree(iog);
+	io_put_io_group_queues(iog);
+	elv_put_iog(iog);
 }
 
 struct io_group *io_alloc_root_group(struct request_queue *q,
@@ -1459,11 +1496,15 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
 
+	atomic_set(&iog->ref, 0);
+
 	blk_init_request_list(&iog->rl);
 
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
+	/* elevator reference. */
+	elv_get_iog(iog);
 	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
 	spin_unlock_irq(&iocg->lock);
 
@@ -1560,105 +1601,109 @@ void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
 }
 
 /*
- * Move the queue to the root group if it is active. This is needed when
- * a cgroup is being deleted and all the IO is not done yet. This is not
- * very good scheme as a user might get unfair share. This needs to be
- * fixed.
+ * Check whether a given group has got any active entities on any of the
+ * service trees.
  */
-void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
-				struct io_group *iog)
+static inline int io_group_has_active_entities(struct io_group *iog)
 {
-	int busy, resume;
-	struct io_entity *entity = &ioq->entity;
-	struct elv_fq_data *efqd = &e->efqd;
-	struct io_service_tree *st = io_entity_service_tree(entity);
+	int i;
+	struct io_service_tree *st;
 
-	busy = elv_ioq_busy(ioq);
-	resume = !!ioq->nr_queued;
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		if (!RB_EMPTY_ROOT(&st->active))
+			return 1;
+	}
 
-	BUG_ON(resume && !entity->on_st);
-	BUG_ON(busy && !resume && entity->on_st && ioq != efqd->active_queue);
+	return 0;
+}
+
+/*
+ * Should be called with both iocg->lock and the queue lock held (if the
+ * group is still connected to the elevator list).
+ */
+void __iocg_destroy(struct io_cgroup *iocg, struct io_group *iog,
+				int queue_lock_held)
+{
+	int i;
+	struct io_service_tree *st;
 
 	/*
-	 * We could be moving an queue which is on idle tree of previous group
-	 * What to do? I guess anyway this queue does not have any requests.
-	 * just forget the entity and free up from idle tree.
-	 *
-	 * This needs cleanup. Hackish.
+	 * If we are here then we got the queue lock if group was still on
+	 * elevator list. If group had already been disconnected from elevator
+	 * list, then we don't need the queue lock.
 	 */
-	if (entity->tree == &st->idle) {
-		BUG_ON(atomic_read(&ioq->ref) < 2);
-		bfq_put_idle_entity(st, entity);
-	}
 
-	if (busy) {
-		BUG_ON(atomic_read(&ioq->ref) < 2);
-
-		if (!resume)
-			elv_del_ioq_busy(e, ioq, 0);
-		else
-			elv_deactivate_ioq(efqd, ioq, 0);
-	}
+	/* Remove io group from cgroup list */
+	hlist_del(&iog->group_node);
 
 	/*
-	 * Here we use a reference to bfqg.  We don't need a refcounter
-	 * as the cgroup reference will not be dropped, so that its
-	 * destroy() callback will not be invoked.
+	 * Mark the io group for deletion so that no new entry goes on
+	 * the idle tree. Any active queue will be removed from the
+	 * active tree and not put onto the idle tree.
 	 */
-	entity->parent = iog->my_entity;
-	entity->sched_data = &iog->sched_data;
+	iog->deleting = 1;
 
-	if (busy && resume)
-		elv_activate_ioq(ioq, 0);
-}
-EXPORT_SYMBOL(io_ioq_move);
+	/* Flush idle tree.  */
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
 
-static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
-{
-	struct elevator_queue *eq;
-	struct io_entity *entity = iog->my_entity;
-	struct io_service_tree *st;
-	int i;
+	/*
+	 * Drop the io group reference on all async queues. This group
+	 * is going away, so once these queues are empty, free them up
+	 * instead of keeping them around in the hope that new IO
+	 * will come.
+	 *
+	 * Note: If this group is already disconnected from the elevator,
+	 * the elevator switch must have done this already.
+	 */
 
-	eq = container_of(efqd, struct elevator_queue, efqd);
-	hlist_del(&iog->elv_data_node);
-	__bfq_deactivate_entity(entity, 0);
-	io_put_io_group_queues(eq, iog);
+	io_put_io_group_queues(iog);
 
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
-		st = iog->sched_data.service_tree + i;
+	if (!io_group_has_active_entities(iog)) {
+		/*
+		 * The io group does not have any active entities. Because
+		 * this group has been decoupled from the io_cgroup list and
+		 * this cgroup is being deleted, this group should not
+		 * receive any new IO. Hence it should be safe to deactivate
+		 * this io group and remove it from the scheduling tree.
+		 */
+		__bfq_deactivate_entity(iog->my_entity, 0);
 
 		/*
-		 * The idle tree may still contain bfq_queues belonging
-		 * to exited task because they never migrated to a different
-		 * cgroup from the one being destroyed now.  Noone else
-		 * can access them so it's safe to act without any lock.
+		 * Because this io group does not have any active entities,
+		 * it should be safe to remove it from the elevator list and
+		 * drop the elevator reference, so that upon dropping the
+		 * io_cgroup reference this io group is freed and we don't
+		 * have to wait for an elevator switch to free the group
+		 * up.
 		 */
-		io_flush_idle_tree(st);
+		if (queue_lock_held) {
+			hlist_del(&iog->elv_data_node);
+			rcu_assign_pointer(iog->key, NULL);
+			/*
+			 * Drop iog reference taken by elevator
+			 * (efqd->group_list)
+			 */
+			elv_put_iog(iog);
+		}
 
-		BUG_ON(!RB_EMPTY_ROOT(&st->active));
-		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
 	}
 
-	BUG_ON(iog->sched_data.next_active != NULL);
-	BUG_ON(iog->sched_data.active_entity != NULL);
-	BUG_ON(entity->tree != NULL);
+	/* Drop iocg reference on io group */
+	elv_put_iog(iog);
 }
 
-/**
- * bfq_destroy_group - destroy @bfqg.
- * @bgrp: the bfqio_cgroup containing @bfqg.
- * @bfqg: the group being destroyed.
- *
- * Destroy @bfqg, making sure that it is not referenced from its parent.
- */
-static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
+void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 {
-	struct elv_fq_data *efqd = NULL;
-	unsigned long uninitialized_var(flags);
-
-	/* Remove io group from cgroup list */
-	hlist_del(&iog->group_node);
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct hlist_node *n, *tmp;
+	struct io_group *iog;
+	unsigned long flags;
+	int queue_lock_held = 0;
+	struct elv_fq_data *efqd;
 
 	/*
 	 * io groups are linked in two lists. One list is maintained
@@ -1677,58 +1722,93 @@ static void io_destroy_group(struct io_cgroup *iocg, struct io_group *iog)
 	 * try to free up async queues again or flush the idle tree.
 	 */
 
-	rcu_read_lock();
-	efqd = rcu_dereference(iog->key);
-	if (efqd != NULL) {
-		spin_lock_irqsave(efqd->queue->queue_lock, flags);
-		if (iog->key == efqd)
-			__io_destroy_group(efqd, iog);
-		spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
-	}
-	rcu_read_unlock();
-
-	/*
-	 * No need to defer the kfree() to the end of the RCU grace
-	 * period: we are called from the destroy() callback of our
-	 * cgroup, so we can be sure that noone is a) still using
-	 * this cgroup or b) doing lookups in it.
-	 */
-	kfree(iog);
-}
+retry:
+	spin_lock_irqsave(&iocg->lock, flags);
+	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node) {
+		/* Take the group queue lock */
+		rcu_read_lock();
+		efqd = rcu_dereference(iog->key);
+		if (efqd != NULL) {
+			if (spin_trylock_irq(efqd->queue->queue_lock)) {
+				if (iog->key == efqd) {
+					queue_lock_held = 1;
+					rcu_read_unlock();
+					goto locked;
+				}
 
-void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
-	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
-	struct hlist_node *n, *tmp;
-	struct io_group *iog;
+				/*
+				 * After acquiring the queue lock, we found
+				 * iog->key==NULL, that means elevator switch
+				 * completed, group is no longer connected on
+				 * elevator hence we can proceed safely without
+				 * queue lock.
+				 */
+				spin_unlock_irq(efqd->queue->queue_lock);
+			} else {
+				/*
+				 * Did not get the queue lock while trying.
+				 * Backout. Drop iocg->lock and try again
+				 */
+				rcu_read_unlock();
+				spin_unlock_irqrestore(&iocg->lock, flags);
+				udelay(100);
+				goto retry;
 
-	/*
-	 * Since we are destroying the cgroup, there are no more tasks
-	 * referencing it, and all the RCU grace periods that may have
-	 * referenced it are ended (as the destruction of the parent
-	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
-	 * anything else and we don't need any synchronization.
-	 */
-	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
-		io_destroy_group(iocg, iog);
+			}
+		}
+		/*
+		 * We come here when iog->key==NULL, which means the elevator
+		 * switch has already taken place, this group is no longer
+		 * connected to the elevator, and one does not need the
+		 * queue lock to do the cleanup.
+		 */
+		rcu_read_unlock();
+locked:
+		__iocg_destroy(iocg, iog, queue_lock_held);
+		if (queue_lock_held) {
+			spin_unlock_irq(efqd->queue->queue_lock);
+			queue_lock_held = 0;
+		}
+	}
+	spin_unlock_irqrestore(&iocg->lock, flags);
 
 	BUG_ON(!hlist_empty(&iocg->group_data));
 
 	kfree(iocg);
 }
 
+/* Should be called with queue lock held */
 void io_disconnect_groups(struct elevator_queue *e)
 {
 	struct hlist_node *pos, *n;
 	struct io_group *iog;
 	struct elv_fq_data *efqd = &e->efqd;
+	int i;
+	struct io_service_tree *st;
 
 	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
 					elv_data_node) {
-		hlist_del(&iog->elv_data_node);
-
+		/*
+		 * At this point of time group should be on idle tree. This
+		 * would extract the group from idle tree.
+		 */
 		__bfq_deactivate_entity(iog->my_entity, 0);
 
+		/* Flush all the idle trees of the group */
+		for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+			st = iog->sched_data.service_tree + i;
+			io_flush_idle_tree(st);
+		}
+
+		/*
+		 * This has to be done here as well as in the cgroup cleanup
+		 * path, because if the async queue references of the group
+		 * are not dropped, neither the async ioq nor the associated
+		 * queue will be reclaimed. Apart from that, the async cfqq
+		 * has to be cleaned up before the elevator goes away.
+		 */
+		io_put_io_group_queues(iog);
+
 		/*
 		 * Don't remove from the group hash, just set an
 		 * invalid key.  No lookups can race with the
@@ -1736,11 +1816,68 @@ void io_disconnect_groups(struct elevator_queue *e)
 		 * implies also that new elements cannot be added
 		 * to the list.
 		 */
+		hlist_del(&iog->elv_data_node);
 		rcu_assign_pointer(iog->key, NULL);
-		io_put_io_group_queues(e, iog);
+		/* Drop iog reference taken by elevator (efqd->group_list)*/
+		elv_put_iog(iog);
 	}
 }
 
+/*
+ * This cleanup function does the last bit of work needed to destroy the
+ * group. It should only get called after io_destroy_group has been invoked.
+ */
+void io_group_cleanup(struct io_group *iog)
+{
+	struct io_service_tree *st;
+	struct io_entity *entity = iog->my_entity;
+	int i;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+		BUG_ON(st->wsum != 0);
+	}
+
+	BUG_ON(iog->sched_data.next_active != NULL);
+	BUG_ON(iog->sched_data.active_entity != NULL);
+	BUG_ON(entity != NULL && entity->tree != NULL);
+
+	kfree(iog);
+}
+
+/*
+ * Should be called with the queue lock held. The only case it can be
+ * called without the queue lock is when the elevator has gone away,
+ * leaving behind dead io groups that are hanging around to be reclaimed
+ * when the cgroup is deleted. In the cgroup deletion case, I think there
+ * is only one thread doing the deletion and the rest of the threads
+ * should have been taken care of by the cgroup code.
+ */
+void elv_put_iog(struct io_group *iog)
+{
+	struct io_group *parent = NULL;
+
+	BUG_ON(!iog);
+
+	BUG_ON(atomic_read(&iog->ref) <= 0);
+	if (!atomic_dec_and_test(&iog->ref))
+		return;
+
+	BUG_ON(iog->entity.on_st);
+
+	if (iog->my_entity)
+		parent = container_of(iog->my_entity->parent,
+				      struct io_group, entity);
+	io_group_cleanup(iog);
+
+	if (parent)
+		elv_put_iog(parent);
+}
+EXPORT_SYMBOL(elv_put_iog);
+
 struct cgroup_subsys io_subsys = {
 	.name = "io",
 	.create = iocg_create,
@@ -1887,6 +2024,8 @@ alloc_ioq:
 		elv_init_ioq(e, ioq, rq->iog, sched_q, IOPRIO_CLASS_BE, 4, 1);
 		io_group_set_ioq(iog, ioq);
 		elv_mark_ioq_sync(ioq);
+		/* ioq reference on iog */
+		elv_get_iog(iog);
 	}
 
 	if (new_sched_q)
@@ -1987,7 +2126,7 @@ EXPORT_SYMBOL(io_get_io_group_bio);
 void io_free_root_group(struct elevator_queue *e)
 {
 	struct io_group *iog = e->efqd.root_group;
-	io_put_io_group_queues(e, iog);
+	io_put_io_group_queues(iog);
 	kfree(iog);
 }
 
@@ -2437,13 +2576,11 @@ void elv_put_ioq(struct io_queue *ioq)
 }
 EXPORT_SYMBOL(elv_put_ioq);
 
-void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+void elv_release_ioq(struct io_queue **ioq_ptr)
 {
-	struct io_group *root_group = e->efqd.root_group;
 	struct io_queue *ioq = *ioq_ptr;
 
 	if (ioq != NULL) {
-		io_ioq_move(e, ioq, root_group);
 		/* Drop the reference taken by the io group */
 		elv_put_ioq(ioq);
 		*ioq_ptr = NULL;
@@ -2600,9 +2737,19 @@ void elv_activate_ioq(struct io_queue *ioq, int add_front)
 void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 					int requeue)
 {
+	struct io_group *iog = ioq_to_io_group(ioq);
+
 	if (ioq == efqd->active_queue)
 		elv_reset_active_ioq(efqd);
 
+	/*
+	 * The io group ioq belongs to is going away. Don't requeue the
+	 * ioq on idle tree. Free it.
+	 */
+#ifdef CONFIG_GROUP_IOSCHED
+	if (iog->deleting == 1)
+		requeue = 0;
+#endif
 	bfq_deactivate_entity(&ioq->entity, requeue);
 }
 
@@ -3002,15 +3149,6 @@ void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
 	}
 }
 
-void elv_free_idle_ioq_list(struct elevator_queue *e)
-{
-	struct io_queue *ioq, *n;
-	struct elv_fq_data *efqd = &e->efqd;
-
-	list_for_each_entry_safe(ioq, n, &efqd->idle_list, queue_list)
-		elv_deactivate_ioq(efqd, ioq, 0);
-}
-
 /*
  * Call iosched to let that elevator wants to expire the queue. This gives
  * iosched like AS to say no (if it is in the middle of batch changeover or
@@ -3427,7 +3565,6 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
 
-	INIT_LIST_HEAD(&efqd->idle_list);
 	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
@@ -3458,9 +3595,19 @@ void elv_exit_fq_data(struct elevator_queue *e)
 	elv_shutdown_timer_wq(e);
 
 	spin_lock_irq(q->queue_lock);
-	/* This should drop all the idle tree references of ioq */
-	elv_free_idle_ioq_list(e);
-	/* This should drop all the io group references of async queues */
+	/*
+	 * This should drop all the references to async queues taken by
+	 * the io group.
+	 *
+	 * It should also deactivate the group and extract it from the
+	 * idle tree (the group cannot be on the active tree now that the
+	 * elevator has been drained).
+	 *
+	 * It should flush the idle tree of the group, which in turn will
+	 * drop the ioq references taken by the active/idle tree.
+	 *
+	 * Finally, drop the iog reference taken by the elevator.
+	 */
 	io_disconnect_groups(e);
 	spin_unlock_irq(q->queue_lock);
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 58543ec..42e3777 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -165,7 +165,6 @@ struct io_queue {
 
 	/* Pointer to generic elevator data structure */
 	struct elv_fq_data *efqd;
-	struct list_head queue_list;
 	pid_t pid;
 
 	/* Number of requests queued on this io queue */
@@ -219,6 +218,7 @@ struct io_queue {
  *    o All the other fields are protected by the @bfqd queue lock.
  */
 struct io_group {
+	atomic_t ref;
 	struct io_entity entity;
 	struct hlist_node elv_data_node;
 	struct hlist_node group_node;
@@ -242,6 +242,9 @@ struct io_group {
 
 	/* request list associated with the group */
 	struct request_list rl;
+
+	/* io group is going away */
+	int deleting;
 };
 
 /**
@@ -279,9 +282,6 @@ struct elv_fq_data {
 	/* List of io groups hanging on this elevator */
 	struct hlist_head group_list;
 
-	/* List of io queues on idle tree. */
-	struct list_head idle_list;
-
 	struct request_queue *queue;
 	unsigned int busy_queues;
 	/*
@@ -504,8 +504,6 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 
 #ifdef CONFIG_GROUP_IOSCHED
 extern int io_group_allow_merge(struct request *rq, struct bio *bio);
-extern void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
-					struct io_group *iog);
 extern void elv_fq_set_request_io_group(struct request_queue *q,
 					struct request *rq, struct bio *bio);
 static inline bfq_weight_t iog_weight(struct io_group *iog)
@@ -523,6 +521,8 @@ extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 extern struct request_list *io_group_get_request_list(struct request_queue *q,
 						struct bio *bio);
 
+extern void elv_put_iog(struct io_group *iog);
+
 /* Returns single ioq associated with the io group. */
 static inline struct io_queue *io_group_ioq(struct io_group *iog)
 {
@@ -545,17 +545,12 @@ static inline struct io_group *rq_iog(struct request_queue *q,
 	return rq->iog;
 }
 
-#else /* !GROUP_IOSCHED */
-/*
- * No ioq movement is needed in case of flat setup. root io group gets cleaned
- * up upon elevator exit and before that it has been made sure that both
- * active and idle tree are empty.
- */
-static inline void io_ioq_move(struct elevator_queue *e, struct io_queue *ioq,
-					struct io_group *iog)
+static inline void elv_get_iog(struct io_group *iog)
 {
+	atomic_inc(&iog->ref);
 }
 
+#else /* !GROUP_IOSCHED */
 static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 {
 	return 1;
@@ -608,6 +603,9 @@ static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 	return NULL;
 }
 
+static inline void elv_get_iog(struct io_group *iog) { }
+
+static inline void elv_put_iog(struct io_group *iog) { }
 
 extern struct io_group *rq_iog(struct request_queue *q, struct request *rq);
 
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 18/18] io-controller: Debug hierarchical IO scheduling
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (16 preceding siblings ...)
  2009-05-05 19:58   ` [PATCH 17/18] io-controller: IO group refcounting support Vivek Goyal
@ 2009-05-05 19:58   ` Vivek Goyal
  2009-05-05 20:24     ` Andrew Morton
                     ` (3 subsequent siblings)
  21 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

o Little debugging aid for hierarchical IO scheduling.

o Enabled under CONFIG_DEBUG_GROUP_IOSCHED

o Currently it outputs more debug messages in the blktrace output, which
  helps a great deal when debugging a hierarchical setup.
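
o As a rough illustration of the pattern (a userspace stand-in, not the
  patch itself; the grp_debug() macro and the DEBUG_GROUP_IOSCHED_MODEL
  define are made-up names), the extra trace detail is compiled in only
  when the debug option is set and costs nothing otherwise:

#include <stdio.h>

#ifdef DEBUG_GROUP_IOSCHED_MODEL
#define grp_debug(fmt, ...) \
	fprintf(stderr, "iosched: " fmt "\n", ##__VA_ARGS__)
#else
#define grp_debug(fmt, ...) do { } while (0)
#endif

int main(void)
{
	/* Prints only when built with -DDEBUG_GROUP_IOSCHED_MODEL,
	 * mirroring how the patch guards its extra elv_log_ioq()
	 * output with CONFIG_DEBUG_GROUP_IOSCHED. */
	grp_debug("set_active: grp=%s rq_queued=%d", "/test1", 3);
	return 0;
}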

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched |   10 +++-
 block/elevator-fq.c   |  131 +++++++++++++++++++++++++++++++++++++++++++++++--
 block/elevator-fq.h   |    6 ++
 3 files changed, 141 insertions(+), 6 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 0677099..79f188c 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -140,6 +140,14 @@ config TRACK_ASYNC_CONTEXT
 	  request, original owner of the bio is decided by using io tracking
 	  patches otherwise we continue to attribute the request to the
 	  submitting thread.
-endmenu
 
+config DEBUG_GROUP_IOSCHED
+	bool "Debug Hierarchical Scheduling support"
+	depends on CGROUPS && GROUP_IOSCHED
+	default n
+	---help---
+	  Enable some debugging hooks for hierarchical scheduling support.
+	  Currently it just outputs more information in blktrace output.
+
+endmenu
 endif
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 1dd0bb3..9500619 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -30,7 +30,7 @@ static int elv_rate_sampling_window = HZ / 10;
 #define IO_DEFAULT_GRP_CLASS   IOPRIO_CLASS_BE
 
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
-				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+				{ RB_ROOT, RB_ROOT, 0, NULL, NULL, 0, 0 })
 
 static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
 					struct io_queue *ioq, int probe);
@@ -118,6 +118,37 @@ static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
 		iog = container_of(entity, struct io_group, entity);
 	return iog;
 }
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static void io_group_path(struct io_group *iog, char *buf, int buflen)
+{
+	unsigned short id = iog->iocg_id;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+
+	if (!id)
+		goto out;
+
+	css = css_lookup(&io_subsys, id);
+	if (!css)
+		goto out;
+
+	if (!css_tryget(css))
+		goto out;
+
+	cgroup_path(css->cgroup, buf, buflen);
+
+	css_put(css);
+
+	rcu_read_unlock();
+	return;
+out:
+	rcu_read_unlock();
+	buf[0] = '\0';
+	return;
+}
+#endif
 #else /* GROUP_IOSCHED */
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
@@ -372,7 +403,7 @@ static void bfq_active_insert(struct io_service_tree *st,
 	struct rb_node *node = &entity->rb_node;
 
 	bfq_insert(&st->active, entity);
-
+	st->nr_active++;
 	if (node->rb_left != NULL)
 		node = node->rb_left;
 	else if (node->rb_right != NULL)
@@ -434,7 +465,7 @@ static void bfq_active_extract(struct io_service_tree *st,
 
 	node = bfq_find_deepest(&entity->rb_node);
 	bfq_extract(&st->active, entity);
-
+	st->nr_active--;
 	if (node != NULL)
 		bfq_update_active_tree(node);
 }
@@ -1233,6 +1264,9 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		iog->iocg_id = css_id(&iocg->css);
+#endif
 
 		blk_init_request_list(&iog->rl);
 
@@ -1506,6 +1540,9 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	/* elevator reference. */
 	elv_get_iog(iog);
 	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	iog->iocg_id = css_id(&iocg->css);
+#endif
 	spin_unlock_irq(&iocg->lock);
 
 	return iog;
@@ -1886,6 +1923,7 @@ struct cgroup_subsys io_subsys = {
 	.destroy = iocg_destroy,
 	.populate = iocg_populate,
 	.subsys_id = io_subsys_id,
+	.use_id = 1,
 };
 
 /*
@@ -2203,6 +2241,25 @@ EXPORT_SYMBOL(elv_get_slice_idle);
 void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
 {
 	entity_served(&ioq->entity, served, ioq->nr_sectors);
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		{
+			struct elv_fq_data *efqd = ioq->efqd;
+			char path[128];
+			struct io_group *iog = ioq_to_io_group(ioq);
+			io_group_path(iog, path, sizeof(path));
+			elv_log_ioq(efqd, ioq, "ioq served: QSt=0x%lx QSs=0x%lx"
+				" QTt=0x%lx QTs=0x%lx grp=%s GTt=0x%lx "
+				" GTs=0x%lx rq_queued=%d",
+				served, ioq->nr_sectors,
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				path,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+		}
+#endif
 }
 
 /* Tells whether ioq is queued in root group or not */
@@ -2671,11 +2728,34 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 
 	if (ioq) {
 		struct io_group *iog = ioq_to_io_group(ioq);
+
 		elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
-				" weight=%ld group_weight=%ld",
+				" weight=%ld rq_queued=%d group_weight=%ld",
 				efqd->busy_queues,
 				ioq->entity.ioprio, ioq->entity.weight,
-				iog_weight(iog));
+				ioq->nr_queued, iog_weight(iog));
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+			{
+				char path[128];
+				struct io_service_tree *grpst;
+				int nr_active = 0;
+				if (iog != efqd->root_group) {
+					grpst = io_entity_service_tree(
+								&iog->entity);
+					nr_active = grpst->nr_active;
+				}
+				io_group_path(iog, path, sizeof(path));
+				elv_log_ioq(efqd, ioq, "set_active, ioq grp=%s"
+				" nrgrps=%d QTt=0x%lx QTs=0x%lx GTt=0x%lx "
+				" GTs=0x%lx rq_queued=%d", path, nr_active,
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+			}
+#endif
 		ioq->slice_end = 0;
 
 		elv_clear_ioq_wait_request(ioq);
@@ -2764,6 +2844,22 @@ void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
 	efqd->busy_queues++;
 	if (elv_ioq_class_rt(ioq))
 		efqd->busy_rt_queues++;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		{
+			char path[128];
+			struct io_group *iog = ioq_to_io_group(ioq);
+			io_group_path(iog, path, sizeof(path));
+			elv_log(efqd, "add to busy: QTt=0x%lx QTs=0x%lx "
+				"ioq grp=%s GTt=0x%lx GTs=0x%lx rq_queued=%d",
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				path,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+		}
+#endif
 }
 
 void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
@@ -2773,7 +2869,24 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
 
 	BUG_ON(!elv_ioq_busy(ioq));
 	BUG_ON(ioq->nr_queued);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		{
+			char path[128];
+			struct io_group *iog = ioq_to_io_group(ioq);
+			io_group_path(iog, path, sizeof(path));
+			elv_log_ioq(efqd, ioq, "del from busy: QTt=0x%lx "
+				"QTs=0x%lx ioq grp=%s GTt=0x%lx GTs=0x%lx "
+				"rq_queued=%d",
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				path,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+		}
+#else
 	elv_log_ioq(efqd, ioq, "del from busy");
+#endif
 	elv_clear_ioq_busy(ioq);
 	BUG_ON(efqd->busy_queues == 0);
 	efqd->busy_queues--;
@@ -3000,6 +3113,14 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 
 	elv_ioq_update_io_thinktime(ioq);
 	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		{
+			char path[128];
+			io_group_path(rq_iog(q, rq), path, sizeof(path));
+			elv_log_ioq(efqd, ioq, "add rq: group path=%s "
+					"rq_queued=%d", path, ioq->nr_queued);
+		}
+#endif
 
 	if (ioq == elv_active_ioq(q->elevator)) {
 		/*
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 42e3777..db3a347 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -43,6 +43,8 @@ struct io_service_tree {
 	struct rb_root active;
 	struct rb_root idle;
 
+	int nr_active;
+
 	struct io_entity *first_idle;
 	struct io_entity *last_idle;
 
@@ -245,6 +247,10 @@ struct io_group {
 
 	/* io group is going away */
 	int deleting;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	unsigned short iocg_id;
+#endif
 };
 
 /**
-- 
1.6.0.1

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH 18/18] io-controller: Debug hierarchical IO scheduling
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (32 preceding siblings ...)
  2009-05-05 19:58 ` [PATCH 18/18] io-controller: Debug hierarchical IO scheduling Vivek Goyal
@ 2009-05-05 19:58 ` Vivek Goyal
  2009-05-06 21:40   ` IKEDA, Munehiro
       [not found]   ` <1241553525-28095-19-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                   ` (3 subsequent siblings)
  37 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda
  Cc: vgoyal, akpm

o Little debugging aid for hierarchical IO scheduling.

o Enabled under CONFIG_DEBUG_GROUP_IOSCHED

o Currently it outputs more debug messages in the blktrace output, which
  helps a great deal when debugging a hierarchical setup; an example
  line is shown below.
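
o As an illustration, the "add to busy" message added below would show up
  in the trace roughly like this (the numbers are made up, only the
  format is the one introduced by the patch):

	add to busy: QTt=0x2700 QTs=0xc800 ioq grp=/test1 GTt=0x4e00 GTs=0x19000 rq_queued=2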

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |   10 +++-
 block/elevator-fq.c   |  131 +++++++++++++++++++++++++++++++++++++++++++++++--
 block/elevator-fq.h   |    6 ++
 3 files changed, 141 insertions(+), 6 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 0677099..79f188c 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -140,6 +140,14 @@ config TRACK_ASYNC_CONTEXT
 	  request, original owner of the bio is decided by using io tracking
 	  patches otherwise we continue to attribute the request to the
 	  submitting thread.
-endmenu
 
+config DEBUG_GROUP_IOSCHED
+	bool "Debug Hierarchical Scheduling support"
+	depends on CGROUPS && GROUP_IOSCHED
+	default n
+	---help---
+	  Enable some debugging hooks for hierarchical scheduling support.
+	  Currently it just outputs more information in blktrace output.
+
+endmenu
 endif
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 1dd0bb3..9500619 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -30,7 +30,7 @@ static int elv_rate_sampling_window = HZ / 10;
 #define IO_DEFAULT_GRP_CLASS   IOPRIO_CLASS_BE
 
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
-				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+				{ RB_ROOT, RB_ROOT, 0, NULL, NULL, 0, 0 })
 
 static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
 					struct io_queue *ioq, int probe);
@@ -118,6 +118,37 @@ static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
 		iog = container_of(entity, struct io_group, entity);
 	return iog;
 }
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static void io_group_path(struct io_group *iog, char *buf, int buflen)
+{
+	unsigned short id = iog->iocg_id;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+
+	if (!id)
+		goto out;
+
+	css = css_lookup(&io_subsys, id);
+	if (!css)
+		goto out;
+
+	if (!css_tryget(css))
+		goto out;
+
+	cgroup_path(css->cgroup, buf, buflen);
+
+	css_put(css);
+
+	rcu_read_unlock();
+	return;
+out:
+	rcu_read_unlock();
+	buf[0] = '\0';
+	return;
+}
+#endif
 #else /* GROUP_IOSCHED */
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
@@ -372,7 +403,7 @@ static void bfq_active_insert(struct io_service_tree *st,
 	struct rb_node *node = &entity->rb_node;
 
 	bfq_insert(&st->active, entity);
-
+	st->nr_active++;
 	if (node->rb_left != NULL)
 		node = node->rb_left;
 	else if (node->rb_right != NULL)
@@ -434,7 +465,7 @@ static void bfq_active_extract(struct io_service_tree *st,
 
 	node = bfq_find_deepest(&entity->rb_node);
 	bfq_extract(&st->active, entity);
-
+	st->nr_active--;
 	if (node != NULL)
 		bfq_update_active_tree(node);
 }
@@ -1233,6 +1264,9 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		iog->iocg_id = css_id(&iocg->css);
+#endif
 
 		blk_init_request_list(&iog->rl);
 
@@ -1506,6 +1540,9 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	/* elevator reference. */
 	elv_get_iog(iog);
 	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	iog->iocg_id = css_id(&iocg->css);
+#endif
 	spin_unlock_irq(&iocg->lock);
 
 	return iog;
@@ -1886,6 +1923,7 @@ struct cgroup_subsys io_subsys = {
 	.destroy = iocg_destroy,
 	.populate = iocg_populate,
 	.subsys_id = io_subsys_id,
+	.use_id = 1,
 };
 
 /*
@@ -2203,6 +2241,25 @@ EXPORT_SYMBOL(elv_get_slice_idle);
 void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
 {
 	entity_served(&ioq->entity, served, ioq->nr_sectors);
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		{
+			struct elv_fq_data *efqd = ioq->efqd;
+			char path[128];
+			struct io_group *iog = ioq_to_io_group(ioq);
+			io_group_path(iog, path, sizeof(path));
+			elv_log_ioq(efqd, ioq, "ioq served: QSt=0x%lx QSs=0x%lx"
+				" QTt=0x%lx QTs=0x%lx grp=%s GTt=0x%lx "
+				" GTs=0x%lx rq_queued=%d",
+				served, ioq->nr_sectors,
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				path,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+		}
+#endif
 }
 
 /* Tells whether ioq is queued in root group or not */
@@ -2671,11 +2728,34 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 
 	if (ioq) {
 		struct io_group *iog = ioq_to_io_group(ioq);
+
 		elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
-				" weight=%ld group_weight=%ld",
+				" weight=%ld rq_queued=%d group_weight=%ld",
 				efqd->busy_queues,
 				ioq->entity.ioprio, ioq->entity.weight,
-				iog_weight(iog));
+				ioq->nr_queued, iog_weight(iog));
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+			{
+				char path[128];
+				struct io_service_tree *grpst;
+				int nr_active = 0;
+				if (iog != efqd->root_group) {
+					grpst = io_entity_service_tree(
+								&iog->entity);
+					nr_active = grpst->nr_active;
+				}
+				io_group_path(iog, path, sizeof(path));
+				elv_log_ioq(efqd, ioq, "set_active, ioq grp=%s"
+				" nrgrps=%d QTt=0x%lx QTs=0x%lx GTt=0x%lx "
+				" GTs=0x%lx rq_queued=%d", path, nr_active,
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+			}
+#endif
 		ioq->slice_end = 0;
 
 		elv_clear_ioq_wait_request(ioq);
@@ -2764,6 +2844,22 @@ void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
 	efqd->busy_queues++;
 	if (elv_ioq_class_rt(ioq))
 		efqd->busy_rt_queues++;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		{
+			char path[128];
+			struct io_group *iog = ioq_to_io_group(ioq);
+			io_group_path(iog, path, sizeof(path));
+			elv_log(efqd, "add to busy: QTt=0x%lx QTs=0x%lx "
+				"ioq grp=%s GTt=0x%lx GTs=0x%lx rq_queued=%d",
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				path,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+		}
+#endif
 }
 
 void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
@@ -2773,7 +2869,24 @@ void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
 
 	BUG_ON(!elv_ioq_busy(ioq));
 	BUG_ON(ioq->nr_queued);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		{
+			char path[128];
+			struct io_group *iog = ioq_to_io_group(ioq);
+			io_group_path(iog, path, sizeof(path));
+			elv_log_ioq(efqd, ioq, "del from busy: QTt=0x%lx "
+				"QTs=0x%lx ioq grp=%s GTt=0x%lx GTs=0x%lx "
+				"rq_queued=%d",
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				path,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+		}
+#else
 	elv_log_ioq(efqd, ioq, "del from busy");
+#endif
 	elv_clear_ioq_busy(ioq);
 	BUG_ON(efqd->busy_queues == 0);
 	efqd->busy_queues--;
@@ -3000,6 +3113,14 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 
 	elv_ioq_update_io_thinktime(ioq);
 	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		{
+			char path[128];
+			io_group_path(rq_iog(q, rq), path, sizeof(path));
+			elv_log_ioq(efqd, ioq, "add rq: group path=%s "
+					"rq_queued=%d", path, ioq->nr_queued);
+		}
+#endif
 
 	if (ioq == elv_active_ioq(q->elevator)) {
 		/*
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 42e3777..db3a347 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -43,6 +43,8 @@ struct io_service_tree {
 	struct rb_root active;
 	struct rb_root idle;
 
+	int nr_active;
+
 	struct io_entity *first_idle;
 	struct io_entity *last_idle;
 
@@ -245,6 +247,10 @@ struct io_group {
 
 	/* io group is going away */
 	int deleting;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	unsigned short iocg_id;
+#endif
 };
 
 /**
-- 
1.6.0.1


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
@ 2009-05-05 20:24     ` Andrew Morton
  0 siblings, 0 replies; 297+ messages in thread
From: Andrew Morton @ 2009-05-05 20:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda, vgoyal

On Tue,  5 May 2009 15:58:27 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:

> 
> Hi All,
> 
> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> ...
> Currently primarily two other IO controller proposals are out there.
> 
> dm-ioband
> ---------
> This patch set is from Ryo Tsuruta from valinux.
> ...
> IO-throttling
> -------------
> This patch set is from Andrea Righi provides max bandwidth controller.

I'm thinking we need to lock you guys in a room and come back in 15 minutes.

Seriously, how are we to resolve this?  We could lock me in a room and
come back in 15 days, but there's no reason to believe that I'd emerge
with the best answer.

I tend to think that a cgroup-based controller is the way to go. 
Anything else will need to be wired up to cgroups _anyway_, and that
might end up messy.


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-05 20:24     ` Andrew Morton
  (?)
@ 2009-05-05 22:20     ` Peter Zijlstra
  2009-05-06  3:42       ` Balbir Singh
  2009-05-06  3:42       ` Balbir Singh
  -1 siblings, 2 replies; 297+ messages in thread
From: Peter Zijlstra @ 2009-05-05 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vivek Goyal, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	righi.andrea, agk, dm-devel, snitzer, m-ikeda

On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> On Tue,  5 May 2009 15:58:27 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > 
> > Hi All,
> > 
> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > ...
> > Currently primarily two other IO controller proposals are out there.
> > 
> > dm-ioband
> > ---------
> > This patch set is from Ryo Tsuruta from valinux.
> > ...
> > IO-throttling
> > -------------
> > This patch set is from Andrea Righi provides max bandwidth controller.
> 
> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> 
> Seriously, how are we to resolve this?  We could lock me in a room and
> come back in 15 days, but there's no reason to believe that I'd emerge
> with the best answer.
> 
> I tend to think that a cgroup-based controller is the way to go. 
> Anything else will need to be wired up to cgroups _anyway_, and that
> might end up messy.

FWIW I subscribe to the io-scheduler faith as opposed to the
device-mapper cult ;-)

Also, I don't think a simple throttle will be very useful, a more mature
solution should cater to more use cases.



^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-05 20:24     ` Andrew Morton
  (?)
  (?)
@ 2009-05-06  2:33     ` Vivek Goyal
  2009-05-06 17:59       ` Nauman Rafique
                         ` (4 more replies)
  -1 siblings, 5 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06  2:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda, peterz

On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> On Tue,  5 May 2009 15:58:27 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > 
> > Hi All,
> > 
> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > ...
> > Currently primarily two other IO controller proposals are out there.
> > 
> > dm-ioband
> > ---------
> > This patch set is from Ryo Tsuruta from valinux.
> > ...
> > IO-throttling
> > -------------
> > This patch set is from Andrea Righi provides max bandwidth controller.
> 
> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> 
> Seriously, how are we to resolve this?  We could lock me in a room and
> come back in 15 days, but there's no reason to believe that I'd emerge
> with the best answer.
> 
> I tend to think that a cgroup-based controller is the way to go. 
> Anything else will need to be wired up to cgroups _anyway_, and that
> might end up messy.

Hi Andrew,

Sorry, I did not get what you mean by a cgroup-based controller. If you
mean using cgroups for grouping tasks for controlling IO, then both the
IO scheduler based controller and the io-throttling proposal do that.
dm-ioband also supports that to some extent, but it requires an extra
step of transferring the cgroup grouping information to the dm-ioband
device using dm-tools.

But if you meant the io-throttle patches, then I think they solve only
part of the problem, namely max bw control. They do not offer the
minimum BW/minimum disk share guarantees that proportional BW control
offers.

IOW, they support upper limit control but not a work-conserving IO
controller which lets a group use the whole BW if competing groups are
not present. IMHO, proportional BW control is an important feature which
we will need, and IIUC the io-throttle patches can't easily be extended
to support proportional BW control. OTOH, one should be able to extend
the IO scheduler based proportional weight controller to also support
max bw control.

Andrea, last time you were planning to have a look at my patches and
see if a max bw controller can be implemented there. I get the feeling
that it should not be too difficult. We already have the hierarchical
tree of io queues and groups in the elevator layer, and we run the BFQ
(WF2Q+) algorithm to select the next queue to dispatch IO from. It is
just a matter of also keeping track of the IO rate per queue/group, and
we should easily be able to delay the dispatch of IO from a queue if its
group has crossed the specified max bw.
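
To make the idea a bit more concrete, here is a rough kernel-style
sketch of the kind of per-group accounting I have in mind (the struct,
field and function names below are invented only for illustration, they
are not from the posted patches):

/* hypothetical per-group state for max bw enforcement */
struct io_group_bw {
        unsigned long max_bw;           /* sectors per second, 0 = no limit */
        unsigned long window_start;     /* jiffies when current window began */
        unsigned long window_sectors;   /* sectors dispatched in this window */
};

/* charge nr_sectors and say whether the group should be held back */
static int iog_bw_exceeded(struct io_group_bw *bw, unsigned long nr_sectors)
{
        unsigned long now = jiffies;

        if (!bw->max_bw)
                return 0;

        /* open a fresh one-second accounting window when the old one expires */
        if (time_after_eq(now, bw->window_start + HZ)) {
                bw->window_start = now;
                bw->window_sectors = 0;
        }

        bw->window_sectors += nr_sectors;

        /* group has used up its quota for this window, hold it back */
        return bw->window_sectors > bw->max_bw;
}

elv_fq_select_ioq() could then simply skip (or arm a timer for) queues
whose group has gone over its quota instead of dispatching from them, so
all the queueing infrastructure stays in the elevator layer.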

This should lead to less code and reduced complexity (compared with the
case where we do max bw control with the io-throttling patches and
proportional BW control with the IO scheduler based control patches).

So do you think that it would make sense to do max BW control along with
the proportional weight IO controller at the IO scheduler level? If yes,
then we can work together and continue to develop this patchset to also
support max bw control, meet your requirements, and drop the
io-throttling patches.

The only thing which concerns me is the fact that the IO scheduler does
not have a view of the higher level logical device. So if somebody has
set up a software RAID and wants to put a max BW limit on the software
RAID device, this solution will not work. One will have to live with max
bw limits on the individual disks (where the io scheduler is actually
running). Do your patches allow putting a limit on software RAID devices
also?

Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
of the FIFO dispatch of buffered bios. Apart from that, it tries to
provide fairness in terms of actual IO done, which would mean a seeky
workload can use the disk for much longer to get an equivalent amount of
IO done and thus slow down other applications. Implementing the IO
controller at the IO scheduler level gives us tighter control. Will it
not meet your requirements? If you have specific concerns with the IO
scheduler based control patches, please highlight them and we will see
how they can be addressed.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 01/18] io-controller: Documentation
  2009-05-05 19:58 ` [PATCH 01/18] io-controller: Documentation Vivek Goyal
@ 2009-05-06  3:16   ` Gui Jianfeng
       [not found]     ` <4A0100F4.4040400-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-05-06 13:31     ` Vivek Goyal
       [not found]   ` <1241553525-28095-2-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-06  3:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
...
> +	mount -t cgroup -o io,blkio none /cgroup
> +
> +- Create two cgroups
> +	mkdir -p /cgroup/test1/ /cgroup/test2
> +
> +- Set weights of group test1 and test2
> +	echo 1000 > /cgroup/test1/io.ioprio
> +	echo 500 > /cgroup/test2/io.ioprio

  It seems this should be /cgroup/test2/io.weight

> +
> +- Create two same size files (say 512MB each) on same disk (file1, file2) and
> +  launch two dd threads in different cgroup to read those files. Make sure
> +  right io scheduler is being used for the block device where files are
> +  present (the one you compiled in hierarchical mode).
> +
> +	echo 1 > /proc/sys/vm/drop_caches
> +
> +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> +	echo $! > /cgroup/test1/tasks
> +	cat /cgroup/test1/tasks
> +
> +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> +	echo $! > /cgroup/test2/tasks
> +	cat /cgroup/test2/tasks
> +
> +- At macro level, first dd should finish first. To get more precise data, keep
> +  on looking at (with the help of script), at io.disk_time and io.disk_sectors
> +  files of both test1 and test2 groups. This will tell how much disk time
> +  (in milliseconds) each group got and how many sectors each group
> +  dispatched to the disk. We provide fairness in terms of disk time, so
> +  ideally io.disk_time of cgroups should be in proportion to the weight.
> +  (It is hard to achieve though :-)).

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-05 20:24     ` Andrew Morton
                       ` (3 preceding siblings ...)
  (?)
@ 2009-05-06  3:41     ` Balbir Singh
  2009-05-06 13:28         ` Vivek Goyal
       [not found]       ` <20090506034118.GC4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
  -1 siblings, 2 replies; 297+ messages in thread
From: Balbir Singh @ 2009-05-06  3:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vivek Goyal, dhaval, snitzer, dm-devel, jens.axboe, agk,
	paolo.valente, fernando, jmoyer, fchecconi, containers,
	linux-kernel, righi.andrea

* Andrew Morton <akpm@linux-foundation.org> [2009-05-05 13:24:41]:

> On Tue,  5 May 2009 15:58:27 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > 
> > Hi All,
> > 
> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > ...
> > Currently primarily two other IO controller proposals are out there.
> > 
> > dm-ioband
> > ---------
> > This patch set is from Ryo Tsuruta from valinux.
> > ...
> > IO-throttling
> > -------------
> > This patch set is from Andrea Righi provides max bandwidth controller.
> 
> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> 
> Seriously, how are we to resolve this?  We could lock me in a room and
> come back in 15 days, but there's no reason to believe that I'd emerge
> with the best answer.
>

We are planning an IO mini-summit prior to the kernel summit
(hopefully we'll all be able to attend and decide).
 
-- 
	Balbir

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-05 22:20     ` Peter Zijlstra
  2009-05-06  3:42       ` Balbir Singh
@ 2009-05-06  3:42       ` Balbir Singh
  2009-05-06 10:20         ` Fabio Checconi
                           ` (3 more replies)
  1 sibling, 4 replies; 297+ messages in thread
From: Balbir Singh @ 2009-05-06  3:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Vivek Goyal, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, jens.axboe, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, linux-kernel, containers,
	righi.andrea, agk, dm-devel, snitzer, m-ikeda

* Peter Zijlstra <peterz@infradead.org> [2009-05-06 00:20:49]:

> On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > On Tue,  5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> > 
> > > 
> > > Hi All,
> > > 
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > > 
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> > 
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > 
> > Seriously, how are we to resolve this?  We could lock me in a room and
> > come back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> > 
> > I tend to think that a cgroup-based controller is the way to go. 
> > Anything else will need to be wired up to cgroups _anyway_, and that
> > might end up messy.
> 
> FWIW I subscribe to the io-scheduler faith as opposed to the
> device-mapper cult ;-)
> 
> Also, I don't think a simple throttle will be very useful, a more mature
> solution should cater to more use cases.
>

I tend to agree, unless Andrea can prove us wrong. I don't think
throttling a task (not letting it consume CPU, memory when its IO
quota is exceeded) is a good idea. I've asked that question to Andrea
a few times, but got no response.
 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (34 preceding siblings ...)
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-06  8:11 ` Gui Jianfeng
       [not found]   ` <4A014619.1040000-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-05-08  9:45 ` [PATCH] io-controller: Add io group reference handling for request Gui Jianfeng
  2009-05-13  2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
  37 siblings, 1 reply; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-06  8:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
> Hi All,
> 
> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> First version of the patches was posted here.

Hi Vivek,

I did some simple tests with V2 and triggered a kernel panic.
The following script can reproduce this bug. It seems that the cgroup
is already removed, but the IO Controller still tries to access it.

#!/bin/sh
echo 1 > /proc/sys/vm/drop_caches
mkdir /cgroup 2> /dev/null
mount -t cgroup -o io,blkio io /cgroup
mkdir /cgroup/test1
mkdir /cgroup/test2
echo 100 > /cgroup/test1/io.weight
echo 500 > /cgroup/test2/io.weight

./rwio -w -f 2000M.1 &  # do async write
pid1=$!
echo $pid1 > /cgroup/test1/tasks

./rwio -w -f 2000M.2 &
pid2=$!
echo $pid2 > /cgroup/test2/tasks

sleep 10
kill -9 $pid1
kill -9 $pid2
sleep 1

echo ======
cat /cgroup/test1/io.disk_time
cat /cgroup/test2/io.disk_time

echo ======
cat /cgroup/test1/io.disk_sectors
cat /cgroup/test2/io.disk_sectors

rmdir /cgroup/test1
rmdir /cgroup/test2
umount /cgroup
rmdir /cgroup


BUG: unable to handle kernel NULL pointer dereference
IP: [<c0448c24>] cgroup_path+0xc/0x97
*pde = 64d2d067
Oops: 0000 [#1] SMP
last sysfs file: /sys/block/md0/range
Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbd
Pid: 132, comm: kblockd/0 Not tainted (2.6.30-rc4-Vivek-V2 #1) Veriton M460
EIP: 0060:[<c0448c24>] EFLAGS: 00010086 CPU: 0
EIP is at cgroup_path+0xc/0x97
EAX: 00000100 EBX: f60adca0 ECX: 00000080 EDX: f709fe28
ESI: f60adca8 EDI: f709fe28 EBP: 00000100 ESP: f709fdf0
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process kblockd/0 (pid: 132, ti=f709f000 task=f70a8f60 task.ti=f709f000)
Stack:
 f709fe28 f68c5698 f60adca0 f60adca8 f709fe28 f68de801 c04f5389 00000080
 f68de800 f7094d0c f6a29118 f68bde00 00000016 c04f5e8d c04f5340 00000080
 c0579fec f68c5e94 00000082 c042edb4 f68c5fd4 f68c5fd4 c080b520 00000082
Call Trace:
 [<c04f5389>] ? io_group_path+0x6d/0x89
 [<c04f5e8d>] ? elv_ioq_served+0x2a/0x7a
 [<c04f5340>] ? io_group_path+0x24/0x89
 [<c0579fec>] ? ide_build_dmatable+0xda/0x130
 [<c042edb4>] ? lock_timer_base+0x19/0x35
 [<c042ef0c>] ? mod_timer+0x9f/0xa8
 [<c04fdee6>] ? __delay+0x6/0x7
 [<c057364f>] ? ide_execute_command+0x5d/0x71
 [<c0579d4f>] ? ide_dma_intr+0x0/0x99
 [<c0576496>] ? do_rw_taskfile+0x201/0x213
 [<c04f6daa>] ? __elv_ioq_slice_expired+0x212/0x25e
 [<c04f7e15>] ? elv_fq_select_ioq+0x121/0x184
 [<c04e8a2f>] ? elv_select_sched_queue+0x1e/0x2e
 [<c04f439c>] ? cfq_dispatch_requests+0xaa/0x238
 [<c04e7e67>] ? elv_next_request+0x152/0x15f
 [<c04240c2>] ? dequeue_task_fair+0x16/0x2d
 [<c0572f49>] ? do_ide_request+0x10f/0x4c8
 [<c0642d44>] ? __schedule+0x845/0x893
 [<c042edb4>] ? lock_timer_base+0x19/0x35
 [<c042f1be>] ? del_timer+0x41/0x47
 [<c04ea5c6>] ? __generic_unplug_device+0x23/0x25
 [<c04f530d>] ? elv_kick_queue+0x19/0x28
 [<c0434b77>] ? worker_thread+0x11f/0x19e
 [<c04f52f4>] ? elv_kick_queue+0x0/0x28
 [<c0436ffc>] ? autoremove_wake_function+0x0/0x2d
 [<c0434a58>] ? worker_thread+0x0/0x19e
 [<c0436f3b>] ? kthread+0x42/0x67
 [<c0436ef9>] ? kthread+0x0/0x67
 [<c040326f>] ? kernel_thread_helper+0x7/0x10
Code: c0 84 c0 74 0e 89 d8 e8 7c e9 fd ff eb 05 bf fd ff ff ff e8 c0 ea ff ff 8
EIP: [<c0448c24>] cgroup_path+0xc/0x97 SS:ESP 0068:f709fdf0
CR2: 000000000000011c
---[ end trace 2d4bc25a2c33e394 ]---

-- 
Regards
Gui Jianfeng



^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]         ` <20090506034254.GD4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
@ 2009-05-06 10:20           ` Fabio Checconi
  2009-05-06 18:47           ` Divyesh Shah
  2009-05-06 20:42           ` Andrea Righi
  2 siblings, 0 replies; 297+ messages in thread
From: Fabio Checconi @ 2009-05-06 10:20 UTC (permalink / raw)
  To: Balbir Singh
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Hi,

> From: Balbir Singh <balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> Date: Wed, May 06, 2009 09:12:54AM +0530
>
> * Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> [2009-05-06 00:20:49]:
> 
> > On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > > On Tue,  5 May 2009 15:58:27 -0400
> > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > 
> > > > 
> > > > Hi All,
> > > > 
> > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > ...
> > > > Currently primarily two other IO controller proposals are out there.
> > > > 
> > > > dm-ioband
> > > > ---------
> > > > This patch set is from Ryo Tsuruta from valinux.
> > > > ...
> > > > IO-throttling
> > > > -------------
> > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > > 
> > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > > 
> > > Seriously, how are we to resolve this?  We could lock me in a room and
> > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > with the best answer.
> > > 
> > > I tend to think that a cgroup-based controller is the way to go. 
> > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > might end up messy.
> > 
> > FWIW I subscribe to the io-scheduler faith as opposed to the
> > device-mapper cult ;-)
> > 
> > Also, I don't think a simple throttle will be very useful, a more mature
> > solution should cater to more use cases.
> >
> 
> I tend to agree, unless Andrea can prove us wrong. I don't think
> throttling a task (not letting it consume CPU, memory when its IO
> quota is exceeded) is a good idea. I've asked that question to Andrea
> a few times, but got no response.
>  

  from what I can see, the principle used by io-throttling is not too
different from what happens when bandwidth differentiation for synchronous
access patterns is achieved using idling at the io scheduler level.

When an io scheduler anticipates requests from a task/cgroup, all the
other tasks with pending (synchronous) requests are in fact blocked, and
the fact that the task being anticipated is allowed to submit additional
io while they remain blocked is what creates the bandwidth differentiation
among them.

Of course there are many differences, in particular in the latencies
introduced by the two mechanisms, in the granularity they use to allocate
disk service, and in what throttling and proportional-share io scheduling
can or cannot guarantee, but as far as I know both of them rely on
blocking tasks to create bandwidth differentiation.
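
To make the comparison above a little more concrete, here is a minimal,
purely illustrative sketch of the idling decision being described; the
structure and helper names below are invented for this example and are
not taken from CFQ, BFQ or the posted patches:

struct io_queue {
	int nr_queued;		/* requests queued for this task/cgroup now */
	int recent_sync_io;	/* has it been issuing synchronous io? */
};

/*
 * Decide whether to keep the disk idle, waiting for more io from the
 * active queue, instead of switching to another backlogged queue.
 */
static int keep_idling_for(struct io_queue *active, int time_used,
			   int allocated_slice)
{
	/* nothing to wait for once the queue has used up its slice */
	if (time_used >= allocated_slice)
		return 0;

	/* if the queue still has requests queued, just dispatch them */
	if (active->nr_queued)
		return 0;

	/*
	 * The queue is momentarily empty but has been doing synchronous
	 * io, so idle and wait for its next request.  Every other queue
	 * with pending (synchronous) requests stays blocked meanwhile,
	 * which is what produces the bandwidth differentiation discussed
	 * above.
	 */
	return active->recent_sync_io;
}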

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06  3:42       ` Balbir Singh
@ 2009-05-06 10:20         ` Fabio Checconi
  2009-05-06 17:10             ` Balbir Singh
       [not found]           ` <20090506102030.GB20544-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
  2009-05-06 18:47         ` Divyesh Shah
                           ` (2 subsequent siblings)
  3 siblings, 2 replies; 297+ messages in thread
From: Fabio Checconi @ 2009-05-06 10:20 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Peter Zijlstra, Andrew Morton, Vivek Goyal, nauman, dpshah, lizf,
	mikew, paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, linux-kernel, containers,
	righi.andrea, agk, dm-devel, snitzer, m-ikeda

Hi,

> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> Date: Wed, May 06, 2009 09:12:54AM +0530
>
> * Peter Zijlstra <peterz@infradead.org> [2009-05-06 00:20:49]:
> 
> > On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > > On Tue,  5 May 2009 15:58:27 -0400
> > > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > 
> > > > 
> > > > Hi All,
> > > > 
> > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > ...
> > > > Currently primarily two other IO controller proposals are out there.
> > > > 
> > > > dm-ioband
> > > > ---------
> > > > This patch set is from Ryo Tsuruta from valinux.
> > > > ...
> > > > IO-throttling
> > > > -------------
> > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > > 
> > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > > 
> > > Seriously, how are we to resolve this?  We could lock me in a room and
> > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > with the best answer.
> > > 
> > > I tend to think that a cgroup-based controller is the way to go. 
> > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > might end up messy.
> > 
> > FWIW I subscribe to the io-scheduler faith as opposed to the
> > device-mapper cult ;-)
> > 
> > Also, I don't think a simple throttle will be very useful, a more mature
> > solution should cater to more use cases.
> >
> 
> I tend to agree, unless Andrea can prove us wrong. I don't think
> throttling a task (not letting it consume CPU, memory when its IO
> quota is exceeded) is a good idea. I've asked that question to Andrea
> a few times, but got no response.
>  

  from what I can see, the principle used by io-throttling is not too
different from what happens when bandwidth differentiation for synchronous
access patterns is achieved using idling at the io scheduler level.

When an io scheduler anticipates requests from a task/cgroup, all the
other tasks with pending (synchronous) requests are in fact blocked, and
the fact that the task being anticipated is allowed to submit additional
io while they remain blocked is what creates the bandwidth differentiation
among them.

Of course there are many differences, in particular in the latencies
introduced by the two mechanisms, in the granularity they use to allocate
disk service, and in what throttling and proportional-share io scheduling
can or cannot guarantee, but as far as I know both of them rely on
blocking tasks to create bandwidth differentiation.

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]       ` <20090506034118.GC4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
@ 2009-05-06 13:28         ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 13:28 UTC (permalink / raw)
  To: Balbir Singh
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	agk-H+wXaHxf7aLQT0dZR+AlfA, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, May 06, 2009 at 09:11:18AM +0530, Balbir Singh wrote:
> * Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> [2009-05-05 13:24:41]:
> 
> > On Tue,  5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > 
> > > 
> > > Hi All,
> > > 
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > > 
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> > 
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > 
> > Seriously, how are we to resolve this?  We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> >
> 
> We are planning an IO mini-summit prior to the kernel summit
> (hopefully we'll all be able to attend and decide).

Hi Balbir,

The mini-summit is still a few months away. I think a better idea would be
to thrash out the details here on lkml and try to reach some conclusion.

It's a complicated problem and there are no simple and easy answers. If we
can't reach a conclusion here, I am skeptical that the mini-summit will
serve that purpose.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06  3:41     ` Balbir Singh
@ 2009-05-06 13:28         ` Vivek Goyal
       [not found]       ` <20090506034118.GC4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 13:28 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, dhaval, snitzer, dm-devel, jens.axboe, agk,
	paolo.valente, fernando, jmoyer, fchecconi, containers,
	linux-kernel, righi.andrea

On Wed, May 06, 2009 at 09:11:18AM +0530, Balbir Singh wrote:
> * Andrew Morton <akpm@linux-foundation.org> [2009-05-05 13:24:41]:
> 
> > On Tue,  5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> > 
> > > 
> > > Hi All,
> > > 
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > > 
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> > 
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > 
> > Seriously, how are we to resolve this?  We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> >
> 
> We are planning an IO mini-summit prior to the kernel summit
> (hopefully we'll all be able to attend and decide).

Hi Balbir,

The mini-summit is still a few months away. I think a better idea would be
to thrash out the details here on lkml and try to reach some conclusion.

It's a complicated problem and there are no simple and easy answers. If we
can't reach a conclusion here, I am skeptical that the mini-summit will
serve that purpose.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
@ 2009-05-06 13:28         ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 13:28 UTC (permalink / raw)
  To: Balbir Singh
  Cc: paolo.valente, dhaval, snitzer, fernando, jmoyer, linux-kernel,
	fchecconi, dm-devel, jens.axboe, Andrew Morton, containers, agk,
	righi.andrea

On Wed, May 06, 2009 at 09:11:18AM +0530, Balbir Singh wrote:
> * Andrew Morton <akpm@linux-foundation.org> [2009-05-05 13:24:41]:
> 
> > On Tue,  5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> > 
> > > 
> > > Hi All,
> > > 
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > > 
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> > 
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > 
> > Seriously, how are we to resolve this?  We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> >
> 
> We are planning an IO mini-summit prior to the kernel summit
> (hopefully we'll all be able to attend and decide).

Hi Balbir,

The mini-summit is still a few months away. I think a better idea would be
to thrash out the details here on lkml and try to reach some conclusion.

It's a complicated problem and there are no simple and easy answers. If we
can't reach a conclusion here, I am skeptical that the mini-summit will
serve that purpose.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 01/18] io-controller: Documentation
       [not found]     ` <4A0100F4.4040400-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-06 13:31       ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 13:31 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, May 06, 2009 at 11:16:04AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > +	mount -t cgroup -o io,blkio none /cgroup
> > +
> > +- Create two cgroups
> > +	mkdir -p /cgroup/test1/ /cgroup/test2
> > +
> > +- Set weights of group test1 and test2
> > +	echo 1000 > /cgroup/test1/io.ioprio
> > +	echo 500 > /cgroup/test2/io.ioprio
> 
>   Here seems should be /cgroup/test2/io.weight
> 

I forgot to update these lines while switching the groups from the notion
of ioprio to weight. I'll fix that in the next posting.
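
For reference, the corrected lines in the quoted documentation would
presumably read as follows (assuming both groups use the io.weight file,
as Gui points out):

	echo 1000 > /cgroup/test1/io.weight
	echo 500 > /cgroup/test2/io.weight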

Thanks
Vivek

> > +
> > +- Create two same size files (say 512MB each) on same disk (file1, file2) and
> > +  launch two dd threads in different cgroup to read those files. Make sure
> > +  right io scheduler is being used for the block device where files are
> > +  present (the one you compiled in hierarchical mode).
> > +
> > +	echo 1 > /proc/sys/vm/drop_caches
> > +
> > +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> > +	echo $! > /cgroup/test1/tasks
> > +	cat /cgroup/test1/tasks
> > +
> > +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> > +	echo $! > /cgroup/test2/tasks
> > +	cat /cgroup/test2/tasks
> > +
> > +- At macro level, first dd should finish first. To get more precise data, keep
> > +  on looking at (with the help of script), at io.disk_time and io.disk_sectors
> > +  files of both test1 and test2 groups. This will tell how much disk time
> > +  (in milli seconds), each group got and how many secotors each group
> > +  dispatched to the disk. We provide fairness in terms of disk time, so
> > +  ideally io.disk_time of cgroups should be in proportion to the weight.
> > +  (It is hard to achieve though :-)).
> 
> -- 
> Regards
> Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 01/18] io-controller: Documentation
  2009-05-06  3:16   ` Gui Jianfeng
       [not found]     ` <4A0100F4.4040400-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-06 13:31     ` Vivek Goyal
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 13:31 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Wed, May 06, 2009 at 11:16:04AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > +	mount -t cgroup -o io,blkio none /cgroup
> > +
> > +- Create two cgroups
> > +	mkdir -p /cgroup/test1/ /cgroup/test2
> > +
> > +- Set weights of group test1 and test2
> > +	echo 1000 > /cgroup/test1/io.ioprio
> > +	echo 500 > /cgroup/test2/io.ioprio
> 
>   Here seems should be /cgroup/test2/io.weight
> 

I forgot to update these lines while switching the groups from the notion
of ioprio to weight. I'll fix that in the next posting.

Thanks
Vivek

> > +
> > +- Create two same size files (say 512MB each) on same disk (file1, file2) and
> > +  launch two dd threads in different cgroup to read those files. Make sure
> > +  right io scheduler is being used for the block device where files are
> > +  present (the one you compiled in hierarchical mode).
> > +
> > +	echo 1 > /proc/sys/vm/drop_caches
> > +
> > +	dd if=/mnt/lv0/zerofile1 of=/dev/null &
> > +	echo $! > /cgroup/test1/tasks
> > +	cat /cgroup/test1/tasks
> > +
> > +	dd if=/mnt/lv0/zerofile2 of=/dev/null &
> > +	echo $! > /cgroup/test2/tasks
> > +	cat /cgroup/test2/tasks
> > +
> > +- At macro level, first dd should finish first. To get more precise data, keep
> > +  on looking at (with the help of script), at io.disk_time and io.disk_sectors
> > +  files of both test1 and test2 groups. This will tell how much disk time
> > +  (in milli seconds), each group got and how many secotors each group
> > +  dispatched to the disk. We provide fairness in terms of disk time, so
> > +  ideally io.disk_time of cgroups should be in proportion to the weight.
> > +  (It is hard to achieve though :-)).
> 
> -- 
> Regards
> Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06  8:11 ` IO scheduler based IO Controller V2 Gui Jianfeng
@ 2009-05-06 16:10       ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 16:10 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, May 06, 2009 at 04:11:05PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > First version of the patches was posted here.
> 
> Hi Vivek,
> 
> I did some simple test for V2, and triggered an kernel panic.
> The following script can reproduce this bug. It seems that the cgroup
> is already removed, but IO Controller still try to access into it.
> 

Hi Gui,

Thanks for the report. I use cgroup_path() for debugging. I guess that
cgroup_path() was passed a NULL cgrp pointer and that's why it crashed.

If so, it is still strange: I call cgroup_path() only after grabbing a
reference to the css object. (I am assuming that if I hold a valid
reference to the css object then css->cgroup can't be NULL.)

Anyway, can you please try out the following patch and see if it fixes
your crash?

---
 block/elevator-fq.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux11/block/elevator-fq.c
===================================================================
--- linux11.orig/block/elevator-fq.c	2009-05-05 15:38:06.000000000 -0400
+++ linux11/block/elevator-fq.c	2009-05-06 11:55:47.000000000 -0400
@@ -125,6 +125,9 @@ static void io_group_path(struct io_grou
 	unsigned short id = iog->iocg_id;
 	struct cgroup_subsys_state *css;
 
+	/* For error case */
+	buf[0] = '\0';
+
 	rcu_read_lock();
 
 	if (!id)
@@ -137,15 +140,12 @@ static void io_group_path(struct io_grou
 	if (!css_tryget(css))
 		goto out;
 
-	cgroup_path(css->cgroup, buf, buflen);
+	if (css->cgroup)
+		cgroup_path(css->cgroup, buf, buflen);
 
 	css_put(css);
-
-	rcu_read_unlock();
-	return;
 out:
 	rcu_read_unlock();
-	buf[0] = '\0';
 	return;
 }
 #endif

BTW, I tried the following equivalent script and I can't reproduce the
crash on my system. Are you able to hit it regularly?

Instead of killing the tasks I also tried moving them into the root cgroup
and then deleting the test1 and test2 groups; that did not produce any
crash either. (I did hit a different bug after 5-6 attempts, though. :-)

As I mentioned in the patchset, we currently do have issues with group
refcounting and with cgroups/groups going away. Hopefully they will all
be fixed up in the next version. But still, it is nice to hear back...


#!/bin/sh

../mount-cgroups.sh

# Mount disk
mount /dev/sdd1 /mnt/sdd1
mount /dev/sdd2 /mnt/sdd2

echo 1 > /proc/sys/vm/drop_caches

dd if=/dev/zero of=/mnt/sdd1/testzerofile1 bs=4K count=524288 &
pid1=$!
echo $pid1 > /cgroup/bfqio/test1/tasks
echo "Launched $pid1"

dd if=/dev/zero of=/mnt/sdd2/testzerofile1 bs=4K count=524288 &
pid2=$!
echo $pid2 > /cgroup/bfqio/test2/tasks
echo "Launched $pid2"

#echo "sleeping for 10 seconds"
#sleep 10
#echo "Killing pid $pid1"
#kill -9 $pid1
#echo "Killing pid $pid2"
#kill -9 $pid2
#sleep 5

echo "sleeping for 10 seconds"
sleep 10

echo "moving pid $pid1 to root"
echo $pid1 > /cgroup/bfqio/tasks
echo "moving pid $pid2 to root"
echo $pid2 > /cgroup/bfqio/tasks

echo ======
cat /cgroup/bfqio/test1/io.disk_time
cat /cgroup/bfqio/test2/io.disk_time

echo ======
cat /cgroup/bfqio/test1/io.disk_sectors
cat /cgroup/bfqio/test2/io.disk_sectors

echo "Removing test1"
rmdir /cgroup/bfqio/test1
echo "Removing test2"
rmdir /cgroup/bfqio/test2

echo "Unmounting /cgroup"
umount /cgroup/bfqio
echo "Done"
#rmdir /cgroup



> #!/bin/sh
> echo 1 > /proc/sys/vm/drop_caches
> mkdir /cgroup 2> /dev/null
> mount -t cgroup -o io,blkio io /cgroup
> mkdir /cgroup/test1
> mkdir /cgroup/test2
> echo 100 > /cgroup/test1/io.weight
> echo 500 > /cgroup/test2/io.weight
> 
> ./rwio -w -f 2000M.1 &  //do async write
> pid1=$!
> echo $pid1 > /cgroup/test1/tasks
> 
> ./rwio -w -f 2000M.2 &
> pid2=$!
> echo $pid2 > /cgroup/test2/tasks
> 
> sleep 10
> kill -9 $pid1
> kill -9 $pid2
> sleep 1
> 
> echo ======
> cat /cgroup/test1/io.disk_time
> cat /cgroup/test2/io.disk_time
> 
> echo ======
> cat /cgroup/test1/io.disk_sectors
> cat /cgroup/test2/io.disk_sectors
> 
> rmdir /cgroup/test1
> rmdir /cgroup/test2
> umount /cgroup
> rmdir /cgroup
> 
> 
> BUG: unable to handle kernel NULL pointer dereferec
> IP: [<c0448c24>] cgroup_path+0xc/0x97
> *pde = 64d2d067
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/block/md0/range
> Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbd
> Pid: 132, comm: kblockd/0 Not tainted (2.6.30-rc4-Vivek-V2 #1) Veriton M460
> EIP: 0060:[<c0448c24>] EFLAGS: 00010086 CPU: 0
> EIP is at cgroup_path+0xc/0x97
> EAX: 00000100 EBX: f60adca0 ECX: 00000080 EDX: f709fe28
> ESI: f60adca8 EDI: f709fe28 EBP: 00000100 ESP: f709fdf0
>  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> Process kblockd/0 (pid: 132, ti=f709f000 task=f70a8f60 task.ti=f709f000)
> Stack:
>  f709fe28 f68c5698 f60adca0 f60adca8 f709fe28 f68de801 c04f5389 00000080
>  f68de800 f7094d0c f6a29118 f68bde00 00000016 c04f5e8d c04f5340 00000080
>  c0579fec f68c5e94 00000082 c042edb4 f68c5fd4 f68c5fd4 c080b520 00000082
> Call Trace:
>  [<c04f5389>] ? io_group_path+0x6d/0x89
>  [<c04f5e8d>] ? elv_ioq_served+0x2a/0x7a
>  [<c04f5340>] ? io_group_path+0x24/0x89
>  [<c0579fec>] ? ide_build_dmatable+0xda/0x130
>  [<c042edb4>] ? lock_timer_base+0x19/0x35
>  [<c042ef0c>] ? mod_timer+0x9f/0xa8
>  [<c04fdee6>] ? __delay+0x6/0x7
>  [<c057364f>] ? ide_execute_command+0x5d/0x71
>  [<c0579d4f>] ? ide_dma_intr+0x0/0x99
>  [<c0576496>] ? do_rw_taskfile+0x201/0x213
>  [<c04f6daa>] ? __elv_ioq_slice_expired+0x212/0x25e
>  [<c04f7e15>] ? elv_fq_select_ioq+0x121/0x184
>  [<c04e8a2f>] ? elv_select_sched_queue+0x1e/0x2e
>  [<c04f439c>] ? cfq_dispatch_requests+0xaa/0x238
>  [<c04e7e67>] ? elv_next_request+0x152/0x15f
>  [<c04240c2>] ? dequeue_task_fair+0x16/0x2d
>  [<c0572f49>] ? do_ide_request+0x10f/0x4c8
>  [<c0642d44>] ? __schedule+0x845/0x893
>  [<c042edb4>] ? lock_timer_base+0x19/0x35
>  [<c042f1be>] ? del_timer+0x41/0x47
>  [<c04ea5c6>] ? __generic_unplug_device+0x23/0x25
>  [<c04f530d>] ? elv_kick_queue+0x19/0x28
>  [<c0434b77>] ? worker_thread+0x11f/0x19e
>  [<c04f52f4>] ? elv_kick_queue+0x0/0x28
>  [<c0436ffc>] ? autoremove_wake_function+0x0/0x2d
>  [<c0434a58>] ? worker_thread+0x0/0x19e
>  [<c0436f3b>] ? kthread+0x42/0x67
>  [<c0436ef9>] ? kthread+0x0/0x67
>  [<c040326f>] ? kernel_thread_helper+0x7/0x10
> Code: c0 84 c0 74 0e 89 d8 e8 7c e9 fd ff eb 05 bf fd ff ff ff e8 c0 ea ff ff 8
> EIP: [<c0448c24>] cgroup_path+0xc/0x97 SS:ESP 0068:f709fdf0
> CR2: 000000000000011c
> ---[ end trace 2d4bc25a2c33e394 ]---
> 
> -- 
> Regards
> Gui Jianfeng
> 

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
@ 2009-05-06 16:10       ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 16:10 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Wed, May 06, 2009 at 04:11:05PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi All,
> > 
> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > First version of the patches was posted here.
> 
> Hi Vivek,
> 
> I did some simple test for V2, and triggered an kernel panic.
> The following script can reproduce this bug. It seems that the cgroup
> is already removed, but IO Controller still try to access into it.
> 

Hi Gui,

Thanks for the report. I use cgroup_path() for debugging. I guess that
cgroup_path() was passed a NULL cgrp pointer and that's why it crashed.

If so, it is still strange: I call cgroup_path() only after grabbing a
reference to the css object. (I am assuming that if I hold a valid
reference to the css object then css->cgroup can't be NULL.)

Anyway, can you please try out the following patch and see if it fixes
your crash?

---
 block/elevator-fq.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux11/block/elevator-fq.c
===================================================================
--- linux11.orig/block/elevator-fq.c	2009-05-05 15:38:06.000000000 -0400
+++ linux11/block/elevator-fq.c	2009-05-06 11:55:47.000000000 -0400
@@ -125,6 +125,9 @@ static void io_group_path(struct io_grou
 	unsigned short id = iog->iocg_id;
 	struct cgroup_subsys_state *css;
 
+	/* For error case */
+	buf[0] = '\0';
+
 	rcu_read_lock();
 
 	if (!id)
@@ -137,15 +140,12 @@ static void io_group_path(struct io_grou
 	if (!css_tryget(css))
 		goto out;
 
-	cgroup_path(css->cgroup, buf, buflen);
+	if (css->cgroup)
+		cgroup_path(css->cgroup, buf, buflen);
 
 	css_put(css);
-
-	rcu_read_unlock();
-	return;
 out:
 	rcu_read_unlock();
-	buf[0] = '\0';
 	return;
 }
 #endif

BTW, I tried the following equivalent script and I can't reproduce the
crash on my system. Are you able to hit it regularly?

Instead of killing the tasks I also tried moving them into the root cgroup
and then deleting the test1 and test2 groups; that did not produce any
crash either. (I did hit a different bug after 5-6 attempts, though. :-)

As I mentioned in the patchset, we currently do have issues with group
refcounting and with cgroups/groups going away. Hopefully they will all
be fixed up in the next version. But still, it is nice to hear back...


#!/bin/sh

../mount-cgroups.sh

# Mount disk
mount /dev/sdd1 /mnt/sdd1
mount /dev/sdd2 /mnt/sdd2

echo 1 > /proc/sys/vm/drop_caches

dd if=/dev/zero of=/mnt/sdd1/testzerofile1 bs=4K count=524288 &
pid1=$!
echo $pid1 > /cgroup/bfqio/test1/tasks
echo "Launched $pid1"

dd if=/dev/zero of=/mnt/sdd2/testzerofile1 bs=4K count=524288 &
pid2=$!
echo $pid2 > /cgroup/bfqio/test2/tasks
echo "Launched $pid2"

#echo "sleeping for 10 seconds"
#sleep 10
#echo "Killing pid $pid1"
#kill -9 $pid1
#echo "Killing pid $pid2"
#kill -9 $pid2
#sleep 5

echo "sleeping for 10 seconds"
sleep 10

echo "moving pid $pid1 to root"
echo $pid1 > /cgroup/bfqio/tasks
echo "moving pid $pid2 to root"
echo $pid2 > /cgroup/bfqio/tasks

echo ======
cat /cgroup/bfqio/test1/io.disk_time
cat /cgroup/bfqio/test2/io.disk_time

echo ======
cat /cgroup/bfqio/test1/io.disk_sectors
cat /cgroup/bfqio/test2/io.disk_sectors

echo "Removing test1"
rmdir /cgroup/bfqio/test1
echo "Removing test2"
rmdir /cgroup/bfqio/test2

echo "Unmounting /cgroup"
umount /cgroup/bfqio
echo "Done"
#rmdir /cgroup



> #!/bin/sh
> echo 1 > /proc/sys/vm/drop_caches
> mkdir /cgroup 2> /dev/null
> mount -t cgroup -o io,blkio io /cgroup
> mkdir /cgroup/test1
> mkdir /cgroup/test2
> echo 100 > /cgroup/test1/io.weight
> echo 500 > /cgroup/test2/io.weight
> 
> ./rwio -w -f 2000M.1 &  //do async write
> pid1=$!
> echo $pid1 > /cgroup/test1/tasks
> 
> ./rwio -w -f 2000M.2 &
> pid2=$!
> echo $pid2 > /cgroup/test2/tasks
> 
> sleep 10
> kill -9 $pid1
> kill -9 $pid2
> sleep 1
> 
> echo ======
> cat /cgroup/test1/io.disk_time
> cat /cgroup/test2/io.disk_time
> 
> echo ======
> cat /cgroup/test1/io.disk_sectors
> cat /cgroup/test2/io.disk_sectors
> 
> rmdir /cgroup/test1
> rmdir /cgroup/test2
> umount /cgroup
> rmdir /cgroup
> 
> 
> BUG: unable to handle kernel NULL pointer dereferec
> IP: [<c0448c24>] cgroup_path+0xc/0x97
> *pde = 64d2d067
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/block/md0/range
> Modules linked in: ipv6 cpufreq_ondemand acpi_cpufreq dm_mirror dm_multipath sbd
> Pid: 132, comm: kblockd/0 Not tainted (2.6.30-rc4-Vivek-V2 #1) Veriton M460
> EIP: 0060:[<c0448c24>] EFLAGS: 00010086 CPU: 0
> EIP is at cgroup_path+0xc/0x97
> EAX: 00000100 EBX: f60adca0 ECX: 00000080 EDX: f709fe28
> ESI: f60adca8 EDI: f709fe28 EBP: 00000100 ESP: f709fdf0
>  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> Process kblockd/0 (pid: 132, ti=f709f000 task=f70a8f60 task.ti=f709f000)
> Stack:
>  f709fe28 f68c5698 f60adca0 f60adca8 f709fe28 f68de801 c04f5389 00000080
>  f68de800 f7094d0c f6a29118 f68bde00 00000016 c04f5e8d c04f5340 00000080
>  c0579fec f68c5e94 00000082 c042edb4 f68c5fd4 f68c5fd4 c080b520 00000082
> Call Trace:
>  [<c04f5389>] ? io_group_path+0x6d/0x89
>  [<c04f5e8d>] ? elv_ioq_served+0x2a/0x7a
>  [<c04f5340>] ? io_group_path+0x24/0x89
>  [<c0579fec>] ? ide_build_dmatable+0xda/0x130
>  [<c042edb4>] ? lock_timer_base+0x19/0x35
>  [<c042ef0c>] ? mod_timer+0x9f/0xa8
>  [<c04fdee6>] ? __delay+0x6/0x7
>  [<c057364f>] ? ide_execute_command+0x5d/0x71
>  [<c0579d4f>] ? ide_dma_intr+0x0/0x99
>  [<c0576496>] ? do_rw_taskfile+0x201/0x213
>  [<c04f6daa>] ? __elv_ioq_slice_expired+0x212/0x25e
>  [<c04f7e15>] ? elv_fq_select_ioq+0x121/0x184
>  [<c04e8a2f>] ? elv_select_sched_queue+0x1e/0x2e
>  [<c04f439c>] ? cfq_dispatch_requests+0xaa/0x238
>  [<c04e7e67>] ? elv_next_request+0x152/0x15f
>  [<c04240c2>] ? dequeue_task_fair+0x16/0x2d
>  [<c0572f49>] ? do_ide_request+0x10f/0x4c8
>  [<c0642d44>] ? __schedule+0x845/0x893
>  [<c042edb4>] ? lock_timer_base+0x19/0x35
>  [<c042f1be>] ? del_timer+0x41/0x47
>  [<c04ea5c6>] ? __generic_unplug_device+0x23/0x25
>  [<c04f530d>] ? elv_kick_queue+0x19/0x28
>  [<c0434b77>] ? worker_thread+0x11f/0x19e
>  [<c04f52f4>] ? elv_kick_queue+0x0/0x28
>  [<c0436ffc>] ? autoremove_wake_function+0x0/0x2d
>  [<c0434a58>] ? worker_thread+0x0/0x19e
>  [<c0436f3b>] ? kthread+0x42/0x67
>  [<c0436ef9>] ? kthread+0x0/0x67
>  [<c040326f>] ? kernel_thread_helper+0x7/0x10
> Code: c0 84 c0 74 0e 89 d8 e8 7c e9 fd ff eb 05 bf fd ff ff ff e8 c0 ea ff ff 8
> EIP: [<c0448c24>] cgroup_path+0xc/0x97 SS:ESP 0068:f709fdf0
> CR2: 000000000000011c
> ---[ end trace 2d4bc25a2c33e394 ]---
> 
> -- 
> Regards
> Gui Jianfeng
> 

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]           ` <20090506102030.GB20544-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
@ 2009-05-06 17:10             ` Balbir Singh
  0 siblings, 0 replies; 297+ messages in thread
From: Balbir Singh @ 2009-05-06 17:10 UTC (permalink / raw)
  To: Fabio Checconi
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	agk-H+wXaHxf7aLQT0dZR+AlfA, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

* Fabio Checconi <fchecconi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> [2009-05-06 12:20:30]:

> Hi,
> 
> > From: Balbir Singh <balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> > Date: Wed, May 06, 2009 09:12:54AM +0530
> >
> > * Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> [2009-05-06 00:20:49]:
> > 
> > > On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > > > On Tue,  5 May 2009 15:58:27 -0400
> > > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > 
> > > > > 
> > > > > Hi All,
> > > > > 
> > > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > > ...
> > > > > Currently primarily two other IO controller proposals are out there.
> > > > > 
> > > > > dm-ioband
> > > > > ---------
> > > > > This patch set is from Ryo Tsuruta from valinux.
> > > > > ...
> > > > > IO-throttling
> > > > > -------------
> > > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > > > 
> > > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > > > 
> > > > Seriously, how are we to resolve this?  We could lock me in a room and
> > > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > > with the best answer.
> > > > 
> > > > I tend to think that a cgroup-based controller is the way to go. 
> > > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > > might end up messy.
> > > 
> > > FWIW I subscribe to the io-scheduler faith as opposed to the
> > > device-mapper cult ;-)
> > > 
> > > Also, I don't think a simple throttle will be very useful, a more mature
> > > solution should cater to more use cases.
> > >
> > 
> > I tend to agree, unless Andrea can prove us wrong. I don't think
> > throttling a task (not letting it consume CPU, memory when its IO
> > quota is exceeded) is a good idea. I've asked that question to Andrea
> > a few times, but got no response.
> >  
> 
>   from what I can see, the principle used by io-throttling is not too
> different to what happens when bandwidth differentiation with synchronous
> access patterns is achieved using idling at the io scheduler level.
> 
> When an io scheduler anticipates requests from a task/cgroup, all the
> other tasks with pending (synchronous) requests are in fact blocked, and
> the fact that the task being anticipated is allowed to submit additional
> io while they remain blocked is what creates the bandwidth differentiation
> among them.
> 
> Of course there are many differences, in particular related to the
> latencies introduced by the two mechanisms, the granularity they use to
> allocate disk service, and to what throttling and proportional share io
> scheduling can or cannot guarantee, but FWIK both of them rely on
> blocking tasks to create bandwidth differentiation.

My concern stems from the fact that in this case we might throttle all
the tasks in the group, no? I'll take a closer look.


-- 
	Balbir

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06 10:20         ` Fabio Checconi
@ 2009-05-06 17:10             ` Balbir Singh
       [not found]           ` <20090506102030.GB20544-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
  1 sibling, 0 replies; 297+ messages in thread
From: Balbir Singh @ 2009-05-06 17:10 UTC (permalink / raw)
  To: Fabio Checconi
  Cc: dhaval, snitzer, dm-devel, jens.axboe, agk, paolo.valente,
	fernando, jmoyer, righi.andrea, containers, linux-kernel,
	Andrew Morton

* Fabio Checconi <fchecconi@gmail.com> [2009-05-06 12:20:30]:

> Hi,
> 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > Date: Wed, May 06, 2009 09:12:54AM +0530
> >
> > * Peter Zijlstra <peterz@infradead.org> [2009-05-06 00:20:49]:
> > 
> > > On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > > > On Tue,  5 May 2009 15:58:27 -0400
> > > > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > 
> > > > > 
> > > > > Hi All,
> > > > > 
> > > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > > ...
> > > > > Currently primarily two other IO controller proposals are out there.
> > > > > 
> > > > > dm-ioband
> > > > > ---------
> > > > > This patch set is from Ryo Tsuruta from valinux.
> > > > > ...
> > > > > IO-throttling
> > > > > -------------
> > > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > > > 
> > > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > > > 
> > > > Seriously, how are we to resolve this?  We could lock me in a room and
> > > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > > with the best answer.
> > > > 
> > > > I tend to think that a cgroup-based controller is the way to go. 
> > > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > > might end up messy.
> > > 
> > > FWIW I subscribe to the io-scheduler faith as opposed to the
> > > device-mapper cult ;-)
> > > 
> > > Also, I don't think a simple throttle will be very useful, a more mature
> > > solution should cater to more use cases.
> > >
> > 
> > I tend to agree, unless Andrea can prove us wrong. I don't think
> > throttling a task (not letting it consume CPU, memory when its IO
> > quota is exceeded) is a good idea. I've asked that question to Andrea
> > a few times, but got no response.
> >  
> 
>   from what I can see, the principle used by io-throttling is not too
> different to what happens when bandwidth differentiation with synchronous
> access patterns is achieved using idling at the io scheduler level.
> 
> When an io scheduler anticipates requests from a task/cgroup, all the
> other tasks with pending (synchronous) requests are in fact blocked, and
> the fact that the task being anticipated is allowed to submit additional
> io while they remain blocked is what creates the bandwidth differentiation
> among them.
> 
> Of course there are many differences, in particular related to the
> latencies introduced by the two mechanisms, the granularity they use to
> allocate disk service, and to what throttling and proportional share io
> scheduling can or cannot guarantee, but FWIK both of them rely on
> blocking tasks to create bandwidth differentiation.

My concern stems from the fact that in this case we might throttle all
the tasks in the group, no? I'll take a closer look.


-- 
	Balbir

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
@ 2009-05-06 17:10             ` Balbir Singh
  0 siblings, 0 replies; 297+ messages in thread
From: Balbir Singh @ 2009-05-06 17:10 UTC (permalink / raw)
  To: Fabio Checconi
  Cc: paolo.valente, dhaval, snitzer, fernando, jmoyer, linux-kernel,
	dm-devel, jens.axboe, Andrew Morton, containers, agk,
	righi.andrea

* Fabio Checconi <fchecconi@gmail.com> [2009-05-06 12:20:30]:

> Hi,
> 
> > From: Balbir Singh <balbir@linux.vnet.ibm.com>
> > Date: Wed, May 06, 2009 09:12:54AM +0530
> >
> > * Peter Zijlstra <peterz@infradead.org> [2009-05-06 00:20:49]:
> > 
> > > On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > > > On Tue,  5 May 2009 15:58:27 -0400
> > > > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > 
> > > > > 
> > > > > Hi All,
> > > > > 
> > > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > > ...
> > > > > Currently primarily two other IO controller proposals are out there.
> > > > > 
> > > > > dm-ioband
> > > > > ---------
> > > > > This patch set is from Ryo Tsuruta from valinux.
> > > > > ...
> > > > > IO-throttling
> > > > > -------------
> > > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > > > 
> > > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > > > 
> > > > Seriously, how are we to resolve this?  We could lock me in a room and
> > > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > > with the best answer.
> > > > 
> > > > I tend to think that a cgroup-based controller is the way to go. 
> > > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > > might end up messy.
> > > 
> > > FWIW I subscribe to the io-scheduler faith as opposed to the
> > > device-mapper cult ;-)
> > > 
> > > Also, I don't think a simple throttle will be very useful, a more mature
> > > solution should cater to more use cases.
> > >
> > 
> > I tend to agree, unless Andrea can prove us wrong. I don't think
> > throttling a task (not letting it consume CPU, memory when its IO
> > quota is exceeded) is a good idea. I've asked that question to Andrea
> > a few times, but got no response.
> >  
> 
>   from what I can see, the principle used by io-throttling is not too
> different to what happens when bandwidth differentiation with synchronous
> access patterns is achieved using idling at the io scheduler level.
> 
> When an io scheduler anticipates requests from a task/cgroup, all the
> other tasks with pending (synchronous) requests are in fact blocked, and
> the fact that the task being anticipated is allowed to submit additional
> io while they remain blocked is what creates the bandwidth differentiation
> among them.
> 
> Of course there are many differences, in particular related to the
> latencies introduced by the two mechanisms, the granularity they use to
> allocate disk service, and to what throttling and proportional share io
> scheduling can or cannot guarantee, but FWIK both of them rely on
> blocking tasks to create bandwidth differentiation.

My concern stems from the fact that in this case we might throttle all
the tasks in the group, no? I'll take a closer look.


-- 
	Balbir

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]       ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-06 17:59         ` Nauman Rafique
  2009-05-06 20:07         ` Andrea Righi
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 297+ messages in thread
From: Nauman Rafique @ 2009-05-06 17:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Tue, May 5, 2009 at 7:33 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
>> On Tue,  5 May 2009 15:58:27 -0400
>> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>
>> >
>> > Hi All,
>> >
>> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
>> > ...
>> > Currently primarily two other IO controller proposals are out there.
>> >
>> > dm-ioband
>> > ---------
>> > This patch set is from Ryo Tsuruta from valinux.
>> > ...
>> > IO-throttling
>> > -------------
>> > This patch set is from Andrea Righi provides max bandwidth controller.
>>
>> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>>
>> Seriously, how are we to resolve this?  We could lock me in a room and
>> cmoe back in 15 days, but there's no reason to believe that I'd emerge
>> with the best answer.
>>
>> I tend to think that a cgroup-based controller is the way to go.
>> Anything else will need to be wired up to cgroups _anyway_, and that
>> might end up messy.
>
> Hi Andrew,
>
> Sorry, did not get what do you mean by cgroup based controller? If you
> mean that we use cgroups for grouping tasks for controlling IO, then both
> IO scheduler based controller as well as io throttling proposal do that.
> dm-ioband also supports that up to some extent but it requires extra step of
> transferring cgroup grouping information to dm-ioband device using dm-tools.
>
> But if you meant that io-throttle patches, then I think it solves only
> part of the problem and that is max bw control. It does not offer minimum
> BW/minimum disk share gurantees as offered by proportional BW control.
>
> IOW, it supports upper limit control and does not support a work conserving
> IO controller which lets a group use the whole BW if competing groups are
> not present. IMHO, proportional BW control is an important feature which
> we will need and IIUC, io-throttle patches can't be easily extended to support
> proportional BW control, OTOH, one should be able to extend IO scheduler
> based proportional weight controller to also support max bw control.
>
> Andrea, last time you were planning to have a look at my patches and see
> if max bw controller can be implemented there. I got a feeling that it
> should not be too difficult to implement it there. We already have the
> hierarchical tree of io queues and groups in elevator layer and we run
> BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> just a matter of also keeping track of IO rate per queue/group and we should
> be easily be able to delay the dispatch of IO from a queue if its group has
> crossed the specified max bw.
>
> This should lead to less code and reduced complextiy (compared with the
> case where we do max bw control with io-throttling patches and proportional
> BW control using IO scheduler based control patches).
>
> So do you think that it would make sense to do max BW control along with
> proportional weight IO controller at IO scheduler? If yes, then we can
> work together and continue to develop this patchset to also support max
> bw control and meet your requirements and drop the io-throttling patches.
>
> The only thing which concerns me is the fact that IO scheduler does not
> have the view of higher level logical device. So if somebody has setup a
> software RAID and wants to put max BW limit on software raid device, this
> solution will not work. One shall have to live with max bw limits on
> individual disks (where io scheduler is actually running). Do your patches
> allow to put limit on software RAID devices also?
>
> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> fairness in terms of actual IO done and that would mean a seeky workload
> will can use disk for much longer to get equivalent IO done and slow down
> other applications. Implementing IO controller at IO scheduler level gives
> us tigher control. Will it not meet your requirements? If you got specific
> concerns with IO scheduler based contol patches, please highlight these and
> we will see how these can be addressed.

In my opinion, IO throttling and dm-ioband are probably simpler, but
incomplete, solutions to the problem. For a solution to be complete, it
would have to be at the IO scheduler layer, so it can do things like
taking an IO as soon as it arrives and sticking it at the front of all
the queues so that it goes to the disk right away. This patch set is
big, but it takes us in the right direction. Our ultimate goal should be
to reach the level of control that we have over CPU and network
resources, and I don't think the IO throttling and dm-ioband approaches
take us in that direction.
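
As an illustration only of the kind of control being argued for here
(the names below are invented; this is not code from the posted patches),
a controller living inside the IO scheduler still sees every per-group
queue, so a newly arrived request for a latency-sensitive group can be
picked for dispatch ahead of everything that is already queued:

struct toy_group_queue {
	int urgent;		/* e.g. an RT / latency-sensitive group */
	int nr_queued;		/* requests currently queued for this group */
};

/* Pick the next group to dispatch from: urgent groups jump the line. */
static int pick_next_group(struct toy_group_queue *groups, int nr_groups)
{
	int i;

	/* serve an urgent group with pending work before anyone else */
	for (i = 0; i < nr_groups; i++)
		if (groups[i].urgent && groups[i].nr_queued)
			return i;

	/* otherwise fall back to whatever fairness policy is in use */
	for (i = 0; i < nr_groups; i++)
		if (groups[i].nr_queued)
			return i;

	return -1;	/* nothing to dispatch */
}

A second-level driver, by contrast, dispatches its buffered bios in FIFO
order, so by the time they reach the underlying scheduler the per-group
distinction needed for this kind of decision is already gone.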

>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06  2:33     ` Vivek Goyal
@ 2009-05-06 17:59       ` Nauman Rafique
  2009-05-06 20:07       ` Andrea Righi
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 297+ messages in thread
From: Nauman Rafique @ 2009-05-06 17:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda, peterz

On Tue, May 5, 2009 at 7:33 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
>> On Tue,  5 May 2009 15:58:27 -0400
>> Vivek Goyal <vgoyal@redhat.com> wrote:
>>
>> >
>> > Hi All,
>> >
>> > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
>> > ...
>> > Currently primarily two other IO controller proposals are out there.
>> >
>> > dm-ioband
>> > ---------
>> > This patch set is from Ryo Tsuruta from valinux.
>> > ...
>> > IO-throttling
>> > -------------
>> > This patch set is from Andrea Righi provides max bandwidth controller.
>>
>> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>>
>> Seriously, how are we to resolve this?  We could lock me in a room and
>> cmoe back in 15 days, but there's no reason to believe that I'd emerge
>> with the best answer.
>>
>> I tend to think that a cgroup-based controller is the way to go.
>> Anything else will need to be wired up to cgroups _anyway_, and that
>> might end up messy.
>
> Hi Andrew,
>
> Sorry, did not get what do you mean by cgroup based controller? If you
> mean that we use cgroups for grouping tasks for controlling IO, then both
> IO scheduler based controller as well as io throttling proposal do that.
> dm-ioband also supports that up to some extent but it requires extra step of
> transferring cgroup grouping information to dm-ioband device using dm-tools.
>
> But if you meant that io-throttle patches, then I think it solves only
> part of the problem and that is max bw control. It does not offer minimum
> BW/minimum disk share gurantees as offered by proportional BW control.
>
> IOW, it supports upper limit control and does not support a work conserving
> IO controller which lets a group use the whole BW if competing groups are
> not present. IMHO, proportional BW control is an important feature which
> we will need and IIUC, io-throttle patches can't be easily extended to support
> proportional BW control, OTOH, one should be able to extend IO scheduler
> based proportional weight controller to also support max bw control.
>
> Andrea, last time you were planning to have a look at my patches and see
> if max bw controller can be implemented there. I got a feeling that it
> should not be too difficult to implement it there. We already have the
> hierarchical tree of io queues and groups in elevator layer and we run
> BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> just a matter of also keeping track of IO rate per queue/group and we should
> be easily be able to delay the dispatch of IO from a queue if its group has
> crossed the specified max bw.
>
> This should lead to less code and reduced complextiy (compared with the
> case where we do max bw control with io-throttling patches and proportional
> BW control using IO scheduler based control patches).
>
> So do you think that it would make sense to do max BW control along with
> proportional weight IO controller at IO scheduler? If yes, then we can
> work together and continue to develop this patchset to also support max
> bw control and meet your requirements and drop the io-throttling patches.
>
> The only thing which concerns me is the fact that IO scheduler does not
> have the view of higher level logical device. So if somebody has setup a
> software RAID and wants to put max BW limit on software raid device, this
> solution will not work. One shall have to live with max bw limits on
> individual disks (where io scheduler is actually running). Do your patches
> allow to put limit on software RAID devices also?
>
> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> fairness in terms of actual IO done and that would mean a seeky workload
> will can use disk for much longer to get equivalent IO done and slow down
> other applications. Implementing IO controller at IO scheduler level gives
> us tigher control. Will it not meet your requirements? If you got specific
> concerns with IO scheduler based contol patches, please highlight these and
> we will see how these can be addressed.

In my opinion, IO throttling and dm-ioband are probably simpler, but
incomplete, solutions to the problem. For a solution to be complete, it
would have to be at the IO scheduler layer, so it can do things like
taking an IO as soon as it arrives and sticking it at the front of all
the queues so that it goes to the disk right away. This patch set is
big, but it takes us in the right direction. Our ultimate goal should be
to reach the level of control that we have over CPU and network
resources, and I don't think the IO throttling and dm-ioband approaches
take us in that direction.

>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]         ` <20090506034254.GD4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
  2009-05-06 10:20           ` Fabio Checconi
@ 2009-05-06 18:47           ` Divyesh Shah
  2009-05-06 20:42           ` Andrea Righi
  2 siblings, 0 replies; 297+ messages in thread
From: Divyesh Shah @ 2009-05-06 18:47 UTC (permalink / raw)
  To: balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

Balbir Singh wrote:
> * Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> [2009-05-06 00:20:49]:
> 
>> On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
>>> On Tue,  5 May 2009 15:58:27 -0400
>>> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
>>>> ...
>>>> Currently primarily two other IO controller proposals are out there.
>>>>
>>>> dm-ioband
>>>> ---------
>>>> This patch set is from Ryo Tsuruta from valinux.
>>>> ...
>>>> IO-throttling
>>>> -------------
>>>> This patch set is from Andrea Righi provides max bandwidth controller.
>>> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>>>
>>> Seriously, how are we to resolve this?  We could lock me in a room and
>>> cmoe back in 15 days, but there's no reason to believe that I'd emerge
>>> with the best answer.
>>>
>>> I tend to think that a cgroup-based controller is the way to go. 
>>> Anything else will need to be wired up to cgroups _anyway_, and that
>>> might end up messy.
>> FWIW I subscribe to the io-scheduler faith as opposed to the
>> device-mapper cult ;-)
>>
>> Also, I don't think a simple throttle will be very useful, a more mature
>> solution should cater to more use cases.
>>
> 
> I tend to agree, unless Andrea can prove us wrong. I don't think
> throttling a task (not letting it consume CPU, memory when its IO
> quota is exceeded) is a good idea. I've asked that question to Andrea
> a few times, but got no response.

I agree with what Balbir said about the effects of throttling on memory and cpu usage of that task.
Nauman and I have been working on Vivek's set of patches (which also includes some patches by Nauman) and have been testing and developing on top of it. I've found this to be the approach that takes us closest to a complete solution. It works well under the assumption that the queues are backlogged, and in the limited testing we've done so far it doesn't fare that badly when they are not backlogged (though there is definitely room to improve there).
With buffered writes, when the queues are not backlogged, I think it might be useful to look into the vm space and see if we can do something there without any impact on the task's memory or CPU usage. I don't have any brilliant ideas on this right now but want to get people thinking about it.

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06  3:42       ` Balbir Singh
  2009-05-06 10:20         ` Fabio Checconi
@ 2009-05-06 18:47         ` Divyesh Shah
       [not found]         ` <20090506034254.GD4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
  2009-05-06 20:42         ` Andrea Righi
  3 siblings, 0 replies; 297+ messages in thread
From: Divyesh Shah @ 2009-05-06 18:47 UTC (permalink / raw)
  To: balbir
  Cc: Peter Zijlstra, Andrew Morton, Vivek Goyal, nauman, lizf, mikew,
	fchecconi, paolo.valente, jens.axboe, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, linux-kernel, containers,
	righi.andrea, agk, dm-devel, snitzer, m-ikeda

Balbir Singh wrote:
> * Peter Zijlstra <peterz@infradead.org> [2009-05-06 00:20:49]:
> 
>> On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
>>> On Tue,  5 May 2009 15:58:27 -0400
>>> Vivek Goyal <vgoyal@redhat.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
>>>> ...
>>>> Currently primarily two other IO controller proposals are out there.
>>>>
>>>> dm-ioband
>>>> ---------
>>>> This patch set is from Ryo Tsuruta from valinux.
>>>> ...
>>>> IO-throttling
>>>> -------------
>>>> This patch set is from Andrea Righi provides max bandwidth controller.
>>> I'm thinking we need to lock you guys in a room and come back in 15 minutes.
>>>
>>> Seriously, how are we to resolve this?  We could lock me in a room and
>>> cmoe back in 15 days, but there's no reason to believe that I'd emerge
>>> with the best answer.
>>>
>>> I tend to think that a cgroup-based controller is the way to go. 
>>> Anything else will need to be wired up to cgroups _anyway_, and that
>>> might end up messy.
>> FWIW I subscribe to the io-scheduler faith as opposed to the
>> device-mapper cult ;-)
>>
>> Also, I don't think a simple throttle will be very useful, a more mature
>> solution should cater to more use cases.
>>
> 
> I tend to agree, unless Andrea can prove us wrong. I don't think
> throttling a task (not letting it consume CPU, memory when its IO
> quota is exceeded) is a good idea. I've asked that question to Andrea
> a few times, but got no response.

I agree with what Balbir said about the effects of throttling on memory and cpu usage of that task.
Nauman and I have been working on Vivek's set of patches (which also includes some patches by Nauman) and have been testing and developing on top of it. I've found this to be the approach that takes us closest to a complete solution. It works well under the assumption that the queues are backlogged, and in the limited testing we've done so far it doesn't fare that badly when they are not backlogged (though there is definitely room to improve there).
With buffered writes, when the queues are not backlogged, I think it might be useful to look into the vm space and see if we can do something there without any impact on the task's memory or CPU usage. I don't have any brilliant ideas on this right now but want to get people thinking about it.


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]       ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-06 17:59         ` Nauman Rafique
@ 2009-05-06 20:07         ` Andrea Righi
  2009-05-06 20:32         ` Vivek Goyal
  2009-05-07  0:18         ` Ryo Tsuruta
  3 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 20:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Tue, May 05, 2009 at 10:33:32PM -0400, Vivek Goyal wrote:
> On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> > On Tue,  5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > 
> > > 
> > > Hi All,
> > > 
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > > 
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> > 
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > 
> > Seriously, how are we to resolve this?  We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> > 
> > I tend to think that a cgroup-based controller is the way to go. 
> > Anything else will need to be wired up to cgroups _anyway_, and that
> > might end up messy.
> 
> Hi Andrew,
> 
> Sorry, did not get what do you mean by cgroup based controller? If you
> mean that we use cgroups for grouping tasks for controlling IO, then both
> IO scheduler based controller as well as io throttling proposal do that.
> dm-ioband also supports that up to some extent but it requires extra step of
> transferring cgroup grouping information to dm-ioband device using dm-tools.
> 
> But if you meant that io-throttle patches, then I think it solves only
> part of the problem and that is max bw control. It does not offer minimum
> BW/minimum disk share gurantees as offered by proportional BW control.
> 
> IOW, it supports upper limit control and does not support a work conserving
> IO controller which lets a group use the whole BW if competing groups are
> not present. IMHO, proportional BW control is an important feature which
> we will need and IIUC, io-throttle patches can't be easily extended to support
> proportional BW control, OTOH, one should be able to extend IO scheduler
> based proportional weight controller to also support max bw control. 

Well, IMHO the big concern is at which level we want to implement the
logic of control: at the IO scheduler, when the IO requests are already
submitted and need to be dispatched, or at a higher level, when the
applications generate IO requests (or maybe both).

And, as pointed out by Andrew, do everything via a cgroup-based controller.

The other features (proportional BW, throttling, taking the current ioprio
model into account, etc.) are implementation details, and any of the
proposed solutions can be extended to support all of them. I mean,
io-throttle can be extended to support proportional BW (from a certain
perspective it is already provided by the throttling water mark in v16),
just as the IO scheduler based controller can be extended to support
absolute BW limits. The same goes for dm-ioband. I don't think there are
huge obstacles to merging the functionalities in this sense.

> 
> Andrea, last time you were planning to have a look at my patches and see
> if max bw controller can be implemented there. I got a feeling that it
> should not be too difficult to implement it there. We already have the
> hierarchical tree of io queues and groups in elevator layer and we run
> BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> just a matter of also keeping track of IO rate per queue/group and we should
> be easily be able to delay the dispatch of IO from a queue if its group has
> crossed the specified max bw.

Yes, sorry for my late reply. I quickly tested your patchset, but I still
need to understand many details of your solution. In the next days I'll
re-read everything carefully and try to do a detailed review of your
patchset (I'm just re-building the kernel with your patchset applied).

> 
> This should lead to less code and reduced complextiy (compared with the
> case where we do max bw control with io-throttling patches and proportional
> BW control using IO scheduler based control patches).

mmmh... changing the logic at the elevator and in all the IO schedulers
doesn't sound like reduced complexity and less code changed. With
io-throttle we just need to place the cgroup_io_throttle() hook in the
right functions where we want to apply throttling. This is quite an easy
approach to extend IO control also to logical devices (more generally,
devices that use their own make_request_fn) or even network-attached
devices, as well as network filesystems, etc.

But I may be wrong. As I said, I still need to review your solution in
detail.
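
To make the hook idea concrete, here is a purely illustrative sketch
(hypothetical names; the real cgroup_io_throttle() interface may look
different) of how such a hook sits in a make_request path:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Illustrative stand-in for the io-throttle hook: charge the bio to the
 * submitting task's cgroup for the target device and block (or defer)
 * the task if the cgroup is over its configured max bandwidth.
 */
static void io_controller_hook(struct bio *bio)
{
	/* accounting + throttling decision would live here */
}

/* Any make_request_fn (plain disk, MD/DM, network block device) could
 * call the hook before passing the bio down. */
static int throttled_make_request(struct request_queue *q, struct bio *bio)
{
	io_controller_hook(bio);	/* account and possibly throttle */

	/*
	 * ...then remap and pass the bio on as the driver normally would,
	 * e.g. via generic_make_request() on the lower device.
	 */
	generic_make_request(bio);
	return 0;
}

Because the hook only needs the bio and the submitting context, the same
call can be placed in stacked drivers, which is what makes this approach
easy to extend to logical devices.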

>  
> So do you think that it would make sense to do max BW control along with
> proportional weight IO controller at IO scheduler? If yes, then we can
> work together and continue to develop this patchset to also support max
> bw control and meet your requirements and drop the io-throttling patches.

It is surely worth exploring. Honestly, I don't know if it would be
a better solution or not. Probably comparing some results with different
IO workloads is the best way to proceed and decide which is the right
way to go. IMHO this is necessary before totally dropping one solution
or another.

> 
> The only thing which concerns me is the fact that IO scheduler does not
> have the view of higher level logical device. So if somebody has setup a
> software RAID and wants to put max BW limit on software raid device, this
> solution will not work. One shall have to live with max bw limits on 
> individual disks (where io scheduler is actually running). Do your patches
> allow to put limit on software RAID devices also? 

No, but as said above my patchset provides the interfaces to apply the
IO control and accounting wherever we want. At the moment there's just
one interface, cgroup_io_throttle().

-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06  2:33     ` Vivek Goyal
  2009-05-06 17:59       ` Nauman Rafique
@ 2009-05-06 20:07       ` Andrea Righi
  2009-05-06 21:21         ` Vivek Goyal
  2009-05-06 21:21         ` Vivek Goyal
       [not found]       ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                         ` (2 subsequent siblings)
  4 siblings, 2 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 20:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Tue, May 05, 2009 at 10:33:32PM -0400, Vivek Goyal wrote:
> On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> > On Tue,  5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> > 
> > > 
> > > Hi All,
> > > 
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > > 
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> > 
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > 
> > Seriously, how are we to resolve this?  We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> > 
> > I tend to think that a cgroup-based controller is the way to go. 
> > Anything else will need to be wired up to cgroups _anyway_, and that
> > might end up messy.
> 
> Hi Andrew,
> 
> Sorry, did not get what do you mean by cgroup based controller? If you
> mean that we use cgroups for grouping tasks for controlling IO, then both
> IO scheduler based controller as well as io throttling proposal do that.
> dm-ioband also supports that up to some extent but it requires extra step of
> transferring cgroup grouping information to dm-ioband device using dm-tools.
> 
> But if you meant that io-throttle patches, then I think it solves only
> part of the problem and that is max bw control. It does not offer minimum
> BW/minimum disk share gurantees as offered by proportional BW control.
> 
> IOW, it supports upper limit control and does not support a work conserving
> IO controller which lets a group use the whole BW if competing groups are
> not present. IMHO, proportional BW control is an important feature which
> we will need and IIUC, io-throttle patches can't be easily extended to support
> proportional BW control, OTOH, one should be able to extend IO scheduler
> based proportional weight controller to also support max bw control. 

Well, IMHO the big concern is at which level we want to implement the
logic of control: at the IO scheduler, when the IO requests are already
submitted and need to be dispatched, or at a higher level, when the
applications generate IO requests (or maybe both).

And, as pointed out by Andrew, do everything via a cgroup-based controller.

The other features (proportional BW, throttling, taking the current ioprio
model into account, etc.) are implementation details, and any of the
proposed solutions can be extended to support all of them. I mean,
io-throttle can be extended to support proportional BW (from a certain
perspective it is already provided by the throttling water mark in v16),
just as the IO scheduler based controller can be extended to support
absolute BW limits. The same goes for dm-ioband. I don't think there are
huge obstacles to merging the functionalities in this sense.

> 
> Andrea, last time you were planning to have a look at my patches and see
> if max bw controller can be implemented there. I got a feeling that it
> should not be too difficult to implement it there. We already have the
> hierarchical tree of io queues and groups in elevator layer and we run
> BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> just a matter of also keeping track of IO rate per queue/group and we should
> be easily be able to delay the dispatch of IO from a queue if its group has
> crossed the specified max bw.

Yes, sorry for my late reply. I quickly tested your patchset, but I still
need to understand many details of your solution. In the next days I'll
re-read everything carefully and try to do a detailed review of your
patchset (I'm just re-building the kernel with your patchset applied).

> 
> This should lead to less code and reduced complextiy (compared with the
> case where we do max bw control with io-throttling patches and proportional
> BW control using IO scheduler based control patches).

mmmh... changing the logic at the elevator and in all the IO schedulers
doesn't sound like reduced complexity and less code changed. With
io-throttle we just need to place the cgroup_io_throttle() hook in the
right functions where we want to apply throttling. This is quite an easy
approach to extend IO control also to logical devices (more generally,
devices that use their own make_request_fn) or even network-attached
devices, as well as network filesystems, etc.

But I may be wrong. As I said, I still need to review your solution in
detail.

>  
> So do you think that it would make sense to do max BW control along with
> proportional weight IO controller at IO scheduler? If yes, then we can
> work together and continue to develop this patchset to also support max
> bw control and meet your requirements and drop the io-throttling patches.

It is surely worth exploring. Honestly, I don't know if it would be
a better solution or not. Probably comparing some results with different
IO workloads is the best way to proceed and decide which is the right
way to go. IMHO this is necessary before totally dropping one solution
or another.

> 
> The only thing which concerns me is the fact that IO scheduler does not
> have the view of higher level logical device. So if somebody has setup a
> software RAID and wants to put max BW limit on software raid device, this
> solution will not work. One shall have to live with max bw limits on 
> individual disks (where io scheduler is actually running). Do your patches
> allow to put limit on software RAID devices also? 

No, but as said above my patchset provides the interfaces to apply the
IO control and accounting wherever we want. At the moment there's just
one interface, cgroup_io_throttle().

-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]       ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-06 17:59         ` Nauman Rafique
  2009-05-06 20:07         ` Andrea Righi
@ 2009-05-06 20:32         ` Vivek Goyal
  2009-05-07  0:18         ` Ryo Tsuruta
  3 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 20:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue, May 05, 2009 at 10:33:32PM -0400, Vivek Goyal wrote:
> On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> > On Tue,  5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > 
> > > 
> > > Hi All,
> > > 
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > > 
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> > 
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > 
> > Seriously, how are we to resolve this?  We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> > 
> > I tend to think that a cgroup-based controller is the way to go. 
> > Anything else will need to be wired up to cgroups _anyway_, and that
> > might end up messy.
> 
> Hi Andrew,
> 
> Sorry, did not get what do you mean by cgroup based controller? If you
> mean that we use cgroups for grouping tasks for controlling IO, then both
> IO scheduler based controller as well as io throttling proposal do that.
> dm-ioband also supports that up to some extent but it requires extra step of
> transferring cgroup grouping information to dm-ioband device using dm-tools.
> 
> But if you meant that io-throttle patches, then I think it solves only
> part of the problem and that is max bw control. It does not offer minimum
> BW/minimum disk share gurantees as offered by proportional BW control.
> 
> IOW, it supports upper limit control and does not support a work conserving
> IO controller which lets a group use the whole BW if competing groups are
> not present. IMHO, proportional BW control is an important feature which
> we will need and IIUC, io-throttle patches can't be easily extended to support
> proportional BW control, OTOH, one should be able to extend IO scheduler
> based proportional weight controller to also support max bw control. 
> 
> Andrea, last time you were planning to have a look at my patches and see
> if max bw controller can be implemented there. I got a feeling that it
> should not be too difficult to implement it there. We already have the
> hierarchical tree of io queues and groups in elevator layer and we run
> BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> just a matter of also keeping track of IO rate per queue/group and we should
> be easily be able to delay the dispatch of IO from a queue if its group has
> crossed the specified max bw.
> 
> This should lead to less code and reduced complextiy (compared with the
> case where we do max bw control with io-throttling patches and proportional
> BW control using IO scheduler based control patches).
>  
> So do you think that it would make sense to do max BW control along with
> proportional weight IO controller at IO scheduler? If yes, then we can
> work together and continue to develop this patchset to also support max
> bw control and meet your requirements and drop the io-throttling patches.
> 

Hi Andrea and others,

I have always had this doubt in mind that any kind of 2nd level controller
will have no idea about the underlying IO scheduler's queues/semantics. So
while it can implement a particular cgroup policy (max bw like io-throttle
or proportional bw like dm-ioband), there are high chances that it will
break the IO scheduler's semantics in one way or another.

I had already sent out the results for dm-ioband in a separate thread.

http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html

Here are some basic results with io-throttle. Andrea, please let me know
if you think this is a procedural problem; I am playing with the
io-throttle patches for the first time.

I took V16 of your patches and am trying them out on 2.6.30-rc4 with the
CFQ scheduler.

I have got one SATA drive with one partition on it.

I am trying to create one cgroup, assign an 8MB/s limit to it, launch
one RT prio 0 task and one BE prio 7 task, and see how this 8MB/s is
divided between these tasks. Following are the results.

Following is my test script.

*******************************************************************
#!/bin/bash

mount /dev/sdb1 /mnt/sdb

mount -t cgroup -o blockio blockio /cgroup/iot/
mkdir -p /cgroup/iot/test1 /cgroup/iot/test2

# Set bw limit of 8 MB/s on sdb
echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" > \
	/cgroup/iot/test1/blockio.bandwidth-max

sync
echo 3 > /proc/sys/vm/drop_caches

echo $$ > /cgroup/iot/test1/tasks

# Launch a normal prio reader.
ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
pid1=$!
echo $pid1

# Launch an RT reader  
ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
pid2=$!
echo $pid2

wait $pid2
echo "RT task finished"
**********************************************************************

Test1
=====
Test two readers (one RT class and one BE class) and see how BW is
allocated within the cgroup

With io-throttle patches
------------------------
- Two readers, first BE prio 7, second RT prio 0

234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
RT task finished

Note: See, there is no difference in the performance of RT or BE task.
Looks like these got throttled equally.


Without io-throttle patches
----------------------------
- Two readers, first BE prio 7, second RT prio 0

234179072 bytes (234 MB) copied, 2.81801 s, 83.1 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.28238 s, 44.3 MB/s

Note: Because I can't limit the BW without the io-throttle patches, don't
      worry about the increased BW. The important point is that the RT task
      gets much more BW than the BE prio 7 task.

Test2
====
- Test 2 readers (One BE prio 0 and one BE prio 7) and see how BW is
distributed among these.

With io-throttle patches
------------------------
- Two readers, first BE prio 7, second BE prio 0

234179072 bytes (234 MB) copied, 55.8604 s, 4.2 MB/s
234179072 bytes (234 MB) copied, 55.8918 s, 4.2 MB/s
High prio reader finished

Without io-throttle patches
---------------------------
- Two readers, first BE prio 7, second BE prio 0

234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
High prio reader finished
234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s

Note: There is no service differentiation between prio 0 and prio 7 task
      with io-throttle patches.

Test 3
======
- Run one RT reader and one BE reader in the root cgroup without any
  limitations. I guess this should mean unlimited BW, and the behavior
  should be the same as with CFQ without the io-throttling patches.

With io-throttle patches
=========================
Ran the test 4 times because I was getting different results in different
runs.

- Two readers, one RT prio 0  other BE prio 7

234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
RT task finished

234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s

234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s

234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
RT task finished

Note: Out of 4 runs, it looks like twice there was complete priority
      inversion and the RT task finished after the BE task. In the other
      two runs, the difference in BW between the RT and BE tasks was much
      smaller than without the patches; in fact, once it was almost the same.

Without io-throttle patches.
===========================
- Two readers, one RT prio 0  other BE prio 7 (4 runs)

234179072 bytes (234 MB) copied, 2.80988 s, 83.3 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.28228 s, 44.3 MB/s

234179072 bytes (234 MB) copied, 2.80659 s, 83.4 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.27874 s, 44.4 MB/s

234179072 bytes (234 MB) copied, 2.79601 s, 83.8 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.2542 s, 44.6 MB/s

234179072 bytes (234 MB) copied, 2.78764 s, 84.0 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.26009 s, 44.5 MB/s

Note how consistent the behavior is without the io-throttle patches.

In summary, I think a 2nd level solution can enforce one policy on cgroups,
but it will break other semantics/properties of the IO scheduler within a
cgroup, as a 2nd level solution has no idea at run time which IO scheduler
is running underneath and what properties it has.

Andrea, please try it on your setup and see whether you get similar results
or not. Hopefully it is not a configuration or test procedure issue on my
side.

Thanks
Vivek

> The only thing which concerns me is the fact that IO scheduler does not
> have the view of higher level logical device. So if somebody has setup a
> software RAID and wants to put max BW limit on software raid device, this
> solution will not work. One shall have to live with max bw limits on 
> individual disks (where io scheduler is actually running). Do your patches
> allow to put limit on software RAID devices also? 
> 
> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> fairness in terms of actual IO done and that would mean a seeky workload
> will can use disk for much longer to get equivalent IO done and slow down
> other applications. Implementing IO controller at IO scheduler level gives
> us tigher control. Will it not meet your requirements? If you got specific
> concerns with IO scheduler based contol patches, please highlight these and
> we will see how these can be addressed.
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06  2:33     ` Vivek Goyal
                         ` (2 preceding siblings ...)
       [not found]       ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-06 20:32       ` Vivek Goyal
       [not found]         ` <20090506203228.GH8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-06 21:34         ` Andrea Righi
  2009-05-07  0:18       ` Ryo Tsuruta
  4 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 20:32 UTC (permalink / raw)
  To: Andrew Morton, Andrea Righi
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, agk, dm-devel, snitzer,
	m-ikeda, peterz

On Tue, May 05, 2009 at 10:33:32PM -0400, Vivek Goyal wrote:
> On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> > On Tue,  5 May 2009 15:58:27 -0400
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> > 
> > > 
> > > Hi All,
> > > 
> > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > ...
> > > Currently primarily two other IO controller proposals are out there.
> > > 
> > > dm-ioband
> > > ---------
> > > This patch set is from Ryo Tsuruta from valinux.
> > > ...
> > > IO-throttling
> > > -------------
> > > This patch set is from Andrea Righi provides max bandwidth controller.
> > 
> > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > 
> > Seriously, how are we to resolve this?  We could lock me in a room and
> > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > with the best answer.
> > 
> > I tend to think that a cgroup-based controller is the way to go. 
> > Anything else will need to be wired up to cgroups _anyway_, and that
> > might end up messy.
> 
> Hi Andrew,
> 
> Sorry, did not get what do you mean by cgroup based controller? If you
> mean that we use cgroups for grouping tasks for controlling IO, then both
> IO scheduler based controller as well as io throttling proposal do that.
> dm-ioband also supports that up to some extent but it requires extra step of
> transferring cgroup grouping information to dm-ioband device using dm-tools.
> 
> But if you meant that io-throttle patches, then I think it solves only
> part of the problem and that is max bw control. It does not offer minimum
> BW/minimum disk share gurantees as offered by proportional BW control.
> 
> IOW, it supports upper limit control and does not support a work conserving
> IO controller which lets a group use the whole BW if competing groups are
> not present. IMHO, proportional BW control is an important feature which
> we will need and IIUC, io-throttle patches can't be easily extended to support
> proportional BW control, OTOH, one should be able to extend IO scheduler
> based proportional weight controller to also support max bw control. 
> 
> Andrea, last time you were planning to have a look at my patches and see
> if max bw controller can be implemented there. I got a feeling that it
> should not be too difficult to implement it there. We already have the
> hierarchical tree of io queues and groups in elevator layer and we run
> BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> just a matter of also keeping track of IO rate per queue/group and we should
> be easily be able to delay the dispatch of IO from a queue if its group has
> crossed the specified max bw.
> 
> This should lead to less code and reduced complextiy (compared with the
> case where we do max bw control with io-throttling patches and proportional
> BW control using IO scheduler based control patches).
>  
> So do you think that it would make sense to do max BW control along with
> proportional weight IO controller at IO scheduler? If yes, then we can
> work together and continue to develop this patchset to also support max
> bw control and meet your requirements and drop the io-throttling patches.
> 

Hi Andrea and others,

I have always had this doubt in mind that any kind of 2nd level controller
will have no idea about the underlying IO scheduler's queues/semantics. So
while it can implement a particular cgroup policy (max bw like io-throttle
or proportional bw like dm-ioband), there are high chances that it will
break the IO scheduler's semantics in one way or another.

I had already sent out the results for dm-ioband in a separate thread.

http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html

Here are some basic results with io-throttle. Andrea, please let me know
if you think this is a procedural problem; I am playing with the
io-throttle patches for the first time.

I took V16 of your patches and am trying them out on 2.6.30-rc4 with the
CFQ scheduler.

I have got one SATA drive with one partition on it.

I am trying to create one cgroup, assign an 8MB/s limit to it, launch
one RT prio 0 task and one BE prio 7 task, and see how this 8MB/s is
divided between these tasks. Following are the results.

Following is my test script.

*******************************************************************
#!/bin/bash

mount /dev/sdb1 /mnt/sdb

mount -t cgroup -o blockio blockio /cgroup/iot/
mkdir -p /cgroup/iot/test1 /cgroup/iot/test2

# Set bw limit of 8 MB/s on sdb
echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" > \
	/cgroup/iot/test1/blockio.bandwidth-max

sync
echo 3 > /proc/sys/vm/drop_caches

echo $$ > /cgroup/iot/test1/tasks

# Launch a normal prio reader.
ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
pid1=$!
echo $pid1

# Launch an RT reader  
ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
pid2=$!
echo $pid2

wait $pid2
echo "RT task finished"
**********************************************************************

Test1
=====
Test two readers (one RT class and one BE class) and see how BW is
allocated within the cgroup

With io-throttle patches
------------------------
- Two readers, first BE prio 7, second RT prio 0

234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
RT task finished

Note: See, there is no difference in the performance of RT or BE task.
Looks like these got throttled equally.


Without io-throttle patches
----------------------------
- Two readers, first BE prio 7, second RT prio 0

234179072 bytes (234 MB) copied, 2.81801 s, 83.1 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.28238 s, 44.3 MB/s

Note: Because I can't limit the BW without the io-throttle patches, don't
      worry about the increased BW. The important point is that the RT task
      gets much more BW than the BE prio 7 task.

Test2
====
- Test 2 readers (One BE prio 0 and one BE prio 7) and see how BW is
distributed among these.

With io-throttle patches
------------------------
- Two readers, first BE prio 7, second BE prio 0

234179072 bytes (234 MB) copied, 55.8604 s, 4.2 MB/s
234179072 bytes (234 MB) copied, 55.8918 s, 4.2 MB/s
High prio reader finished

Without io-throttle patches
---------------------------
- Two readers, first BE prio 7, second BE prio 0

234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
High prio reader finished
234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s

Note: There is no service differentiation between prio 0 and prio 7 task
      with io-throttle patches.

Test 3
======
- Run one RT reader and one BE reader in the root cgroup without any
  limitations. I guess this should mean unlimited BW, and the behavior
  should be the same as with CFQ without the io-throttling patches.

With io-throttle patches
=========================
Ran the test 4 times because I was getting different results in different
runs.

- Two readers, one RT prio 0  other BE prio 7

234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
RT task finished

234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s

234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s

234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
RT task finished

Note: Out of 4 runs, it looks like twice there was complete priority
      inversion and the RT task finished after the BE task. In the other
      two runs, the difference in BW between the RT and BE tasks was much
      smaller than without the patches; in fact, once it was almost the same.

Without io-throttle patches.
===========================
- Two readers, one RT prio 0  other BE prio 7 (4 runs)

234179072 bytes (234 MB) copied, 2.80988 s, 83.3 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.28228 s, 44.3 MB/s

234179072 bytes (234 MB) copied, 2.80659 s, 83.4 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.27874 s, 44.4 MB/s

234179072 bytes (234 MB) copied, 2.79601 s, 83.8 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.2542 s, 44.6 MB/s

234179072 bytes (234 MB) copied, 2.78764 s, 84.0 MB/s
RT task finished
234179072 bytes (234 MB) copied, 5.26009 s, 44.5 MB/s

Note how consistent the behavior is without the io-throttle patches.

In summary, I think a 2nd level solution can enforce one policy on cgroups,
but it will break other semantics/properties of the IO scheduler within a
cgroup, as a 2nd level solution has no idea at run time which IO scheduler
is running underneath and what properties it has.

Andrea, please try it on your setup and see whether you get similar results
or not. Hopefully it is not a configuration or test procedure issue on my
side.

Thanks
Vivek

> The only thing which concerns me is the fact that IO scheduler does not
> have the view of higher level logical device. So if somebody has setup a
> software RAID and wants to put max BW limit on software raid device, this
> solution will not work. One shall have to live with max bw limits on 
> individual disks (where io scheduler is actually running). Do your patches
> allow to put limit on software RAID devices also? 
> 
> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> fairness in terms of actual IO done and that would mean a seeky workload
> will can use disk for much longer to get equivalent IO done and slow down
> other applications. Implementing IO controller at IO scheduler level gives
> us tigher control. Will it not meet your requirements? If you got specific
> concerns with IO scheduler based contol patches, please highlight these and
> we will see how these can be addressed.
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]         ` <20090506034254.GD4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
  2009-05-06 10:20           ` Fabio Checconi
  2009-05-06 18:47           ` Divyesh Shah
@ 2009-05-06 20:42           ` Andrea Righi
  2 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 20:42 UTC (permalink / raw)
  To: Balbir Singh
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Wed, May 06, 2009 at 09:12:54AM +0530, Balbir Singh wrote:
> * Peter Zijlstra <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> [2009-05-06 00:20:49]:
> 
> > On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > > On Tue,  5 May 2009 15:58:27 -0400
> > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > 
> > > > 
> > > > Hi All,
> > > > 
> > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > ...
> > > > Currently primarily two other IO controller proposals are out there.
> > > > 
> > > > dm-ioband
> > > > ---------
> > > > This patch set is from Ryo Tsuruta from valinux.
> > > > ...
> > > > IO-throttling
> > > > -------------
> > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > > 
> > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > > 
> > > Seriously, how are we to resolve this?  We could lock me in a room and
> > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > with the best answer.
> > > 
> > > I tend to think that a cgroup-based controller is the way to go. 
> > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > might end up messy.
> > 
> > FWIW I subscribe to the io-scheduler faith as opposed to the
> > device-mapper cult ;-)
> > 
> > Also, I don't think a simple throttle will be very useful, a more mature
> > solution should cater to more use cases.
> >
> 
> I tend to agree, unless Andrea can prove us wrong. I don't think
> throttling a task (not letting it consume CPU, memory when its IO
> quota is exceeded) is a good idea. I've asked that question to Andrea
> a few times, but got no response.

Sorry Balbir, I probably missed your question. Or replied in a different
thread maybe...

Actually we could allow an offending cgroup to continue to submit IO
requests without throttling it directly. But if we don't want to waste
the memory with pending IO requests or pending writeback pages, we need
to block it sooner or later.

Instead of directly throttling the offending applications, we could block
them when we hit a max limit of requests or dirty pages, i.e. something
like congestion_wait(), but that's the same, no? The difference is that
in this case the throttling is asynchronous. Or am I oversimplifying it?

As an example, with writeback IO, io-throttle doesn't throttle the IO
requests directly; each request instead receives a deadline (depending
on the BW limit) and is added into an rbtree. Then all the requests are
dispatched asynchronously by a kernel thread (kiothrottled) only when
the deadline has expired.
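
A rough sketch of that mechanism (hypothetical types and names, not the
actual io-throttle code): compute a deadline proportional to how far the
cgroup is over its limit, park the request in an rbtree sorted by
deadline, and let a kiothrottled-like thread dispatch expired entries.

#include <linux/bio.h>
#include <linux/jiffies.h>
#include <linux/rbtree.h>

struct throttled_bio {
	struct rb_node	node;
	unsigned long	deadline;	/* jiffies when it may be dispatched */
	struct bio	*bio;
};

/* Queue a bio whose cgroup is 'excess' bytes over a 'bw_limit' B/s limit. */
static void queue_throttled_bio(struct rb_root *root, struct throttled_bio *tb,
				unsigned long excess, unsigned long bw_limit)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;

	/* Delay proportional to how far the cgroup exceeded its limit. */
	tb->deadline = jiffies + msecs_to_jiffies(excess * 1000 / bw_limit);

	while (*p) {
		struct throttled_bio *entry;

		parent = *p;
		entry = rb_entry(parent, struct throttled_bio, node);
		if (time_before(tb->deadline, entry->deadline))
			p = &(*p)->rb_left;
		else
			p = &(*p)->rb_right;
	}
	rb_link_node(&tb->node, parent, p);
	rb_insert_color(&tb->node, root);
}

The dispatcher thread would then repeatedly pop the leftmost node and
submit its bio once jiffies passes the stored deadline.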

OK, there's a lot of room for improvement: provide more kernel threads
per block device, multiple queues/rbtrees, etc., but this is actually a
way to apply throttling asynchronously. The fact is that if I don't
apply the throttling in balance_dirty_pages() as well (and I did so in
the last io-throttle version) or add a max limit of requests, the rbtree
grows indefinitely...

That should be very similar to the proportional BW solution allocating a
quota of nr_requests per block device and per cgroup.

-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06  3:42       ` Balbir Singh
                           ` (2 preceding siblings ...)
       [not found]         ` <20090506034254.GD4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
@ 2009-05-06 20:42         ` Andrea Righi
  3 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 20:42 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Peter Zijlstra, Andrew Morton, Vivek Goyal, nauman, dpshah, lizf,
	mikew, fchecconi, paolo.valente, jens.axboe, ryov, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, linux-kernel,
	containers, agk, dm-devel, snitzer, m-ikeda

On Wed, May 06, 2009 at 09:12:54AM +0530, Balbir Singh wrote:
> * Peter Zijlstra <peterz@infradead.org> [2009-05-06 00:20:49]:
> 
> > On Tue, 2009-05-05 at 13:24 -0700, Andrew Morton wrote:
> > > On Tue,  5 May 2009 15:58:27 -0400
> > > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > 
> > > > 
> > > > Hi All,
> > > > 
> > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > ...
> > > > Currently primarily two other IO controller proposals are out there.
> > > > 
> > > > dm-ioband
> > > > ---------
> > > > This patch set is from Ryo Tsuruta from valinux.
> > > > ...
> > > > IO-throttling
> > > > -------------
> > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > > 
> > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > > 
> > > Seriously, how are we to resolve this?  We could lock me in a room and
> > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > with the best answer.
> > > 
> > > I tend to think that a cgroup-based controller is the way to go. 
> > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > might end up messy.
> > 
> > FWIW I subscribe to the io-scheduler faith as opposed to the
> > device-mapper cult ;-)
> > 
> > Also, I don't think a simple throttle will be very useful, a more mature
> > solution should cater to more use cases.
> >
> 
> I tend to agree, unless Andrea can prove us wrong. I don't think
> throttling a task (not letting it consume CPU, memory when its IO
> quota is exceeded) is a good idea. I've asked that question to Andrea
> a few times, but got no response.

Sorry Balbir, I probably missed your question. Or replied in a different
thread maybe...

Actually we could allow an offending cgroup to continue to submit IO
requests without throttling it directly. But if we don't want to waste
the memory with pending IO requests or pending writeback pages, we need
to block it sooner or later.

Instead of directly throttling the offending applications, we could block
them when we hit a max limit of requests or dirty pages, i.e. something
like congestion_wait(), but that's the same, no? The difference is that
in this case the throttling is asynchronous. Or am I oversimplifying it?

As an example, with writeback IO, io-throttle doesn't throttle the IO
requests directly; each request instead receives a deadline (depending
on the BW limit) and is added into an rbtree. Then all the requests are
dispatched asynchronously by a kernel thread (kiothrottled) only when
the deadline has expired.

OK, there's a lot of room for improvement: provide more kernel threads
per block device, multiple queues/rbtrees, etc., but this is actually a
way to apply throttling asynchronously. The fact is that if I don't
apply the throttling in balance_dirty_pages() as well (and I did so in
the last io-throttle version) or add a max limit of requests, the rbtree
grows indefinitely...

That should be very similar to the proportional BW solution allocating a
quota of nr_requests per block device and per cgroup.

-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06 20:07       ` Andrea Righi
@ 2009-05-06 21:21         ` Vivek Goyal
  2009-05-06 21:21         ` Vivek Goyal
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 21:21 UTC (permalink / raw)
  To: Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Wed, May 06, 2009 at 10:07:53PM +0200, Andrea Righi wrote:
> On Tue, May 05, 2009 at 10:33:32PM -0400, Vivek Goyal wrote:
> > On Tue, May 05, 2009 at 01:24:41PM -0700, Andrew Morton wrote:
> > > On Tue,  5 May 2009 15:58:27 -0400
> > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > 
> > > > 
> > > > Hi All,
> > > > 
> > > > Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> > > > ...
> > > > Currently primarily two other IO controller proposals are out there.
> > > > 
> > > > dm-ioband
> > > > ---------
> > > > This patch set is from Ryo Tsuruta from valinux.
> > > > ...
> > > > IO-throttling
> > > > -------------
> > > > This patch set is from Andrea Righi provides max bandwidth controller.
> > > 
> > > I'm thinking we need to lock you guys in a room and come back in 15 minutes.
> > > 
> > > Seriously, how are we to resolve this?  We could lock me in a room and
> > > cmoe back in 15 days, but there's no reason to believe that I'd emerge
> > > with the best answer.
> > > 
> > > I tend to think that a cgroup-based controller is the way to go. 
> > > Anything else will need to be wired up to cgroups _anyway_, and that
> > > might end up messy.
> > 
> > Hi Andrew,
> > 
> > Sorry, did not get what do you mean by cgroup based controller? If you
> > mean that we use cgroups for grouping tasks for controlling IO, then both
> > IO scheduler based controller as well as io throttling proposal do that.
> > dm-ioband also supports that up to some extent but it requires extra step of
> > transferring cgroup grouping information to dm-ioband device using dm-tools.
> > 
> > But if you meant that io-throttle patches, then I think it solves only
> > part of the problem and that is max bw control. It does not offer minimum
> > BW/minimum disk share gurantees as offered by proportional BW control.
> > 
> > IOW, it supports upper limit control and does not support a work conserving
> > IO controller which lets a group use the whole BW if competing groups are
> > not present. IMHO, proportional BW control is an important feature which
> > we will need and IIUC, io-throttle patches can't be easily extended to support
> > proportional BW control, OTOH, one should be able to extend IO scheduler
> > based proportional weight controller to also support max bw control. 
> 
> Well, IMHO the big concern is at which level we want to implement the
> logic of control: IO scheduler, when the IO requests are already
> submitted and need to be dispatched, or at high level when the
> applications generates IO requests (or maybe both).
> 
> And, as pointed by Andrew, do everything by a cgroup-based controller.

I am not sure what the rationale behind that is. Why do it at a higher
layer? Doing it at the IO scheduler layer makes sure that one does not
break the IO scheduler's properties within a cgroup. (See my other mail
with some io-throttling test results.)

The advantage of a higher layer mechanism is that it can also cover
software RAID devices well.

> 
> The other features, proportional BW, throttling, take the current ioprio
> model in account, etc. are implementation details and any of the
> proposed solutions can be extended to support all these features. I
> mean, io-throttle can be extended to support proportional BW (for a
> certain perspective it is already provided by the throttling water mark
> in v16), as well as the IO scheduler based controller can be extended to
> support absolute BW limits. The same for dm-ioband. I don't think
> there're huge obstacle to merge the functionalities in this sense.

Yes, from a technical point of view, one can implement a proportional BW
controller at a higher layer too. But that would practically mean almost
re-implementing the CFQ logic at the higher layer. Why get into all
that complexity? Why not simply make CFQ hierarchical so it also handles
the groups?

Secondly, think of the following odd scenarios if we implement a higher
level proportional BW controller which can offer the same features as CFQ
and can also handle group scheduling.

Case1:
======	 
           (Higher level proportional BW controller)
			/dev/sda (CFQ)

So if somebody wants group scheduling, we will be doing the same IO control
in two places (within the group): once at the higher level and a second time
at the CFQ level. That does not sound too logical to me.

Case2:
======

           (Higher level proportional BW controller)
			/dev/sda (NOOP)
	
This is the other extreme. The lower level IO scheduler does not offer any
notion of class or priority within a class, yet the higher level scheduler
will still be maintaining all that infrastructure unnecessarily.

That's why I come back to this simple question again: why not extend the
IO schedulers to handle group scheduling and do both proportional BW and
max bw control there?

> 
> > 
> > Andrea, last time you were planning to have a look at my patches and see
> > if max bw controller can be implemented there. I got a feeling that it
> > should not be too difficult to implement it there. We already have the
> > hierarchical tree of io queues and groups in elevator layer and we run
> > BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> > just a matter of also keeping track of IO rate per queue/group and we should
> > be easily be able to delay the dispatch of IO from a queue if its group has
> > crossed the specified max bw.
> 
> Yes, sorry for my late, I quickly tested your patchset, but I still need
> to understand many details of your solution. In the next days I'll
> re-read everything carefully and I'll try to do a detailed review of
> your patchset (just re-building the kernel with your patchset applied).
> 

Sure. My patchset is still in its infancy, so don't expect great
results. But it does highlight the idea and the design very well.

> > 
> > This should lead to less code and reduced complextiy (compared with the
> > case where we do max bw control with io-throttling patches and proportional
> > BW control using IO scheduler based control patches).
> 
> mmmh... changing the logic at the elevator and all IO schedulers doesn't
> sound like reduced complexity and less code changed. With io-throttle we
> just need to place the cgroup_io_throttle() hook in the right functions
> where we want to apply throttling. This is a quite easy approach to
> extend the IO control also to logical devices (more in general devices
> that use their own make_request_fn) or even network-attached devices, as
> well as networking filesystems, etc.
> 
> But I may be wrong. As I said I still need to review in the details your
> solution.

Well, I meant reduced code in the sense that we implement both max bw and
proportional bw at the IO scheduler level, instead of proportional BW at
the IO scheduler and max bw at a higher level.

I agree that doing max bw control at a higher level has the advantage that
it covers all kinds of devices (including higher level logical devices),
which an IO scheduler level solution does not do. But this comes at the
price of broken IO scheduler properties within a cgroup.

Maybe we can then implement both: a higher-level max BW controller, and a
max BW feature implemented alongside the proportional BW controller at the
IO scheduler level. Folks who use hardware RAID or single-disk devices can
use the IO scheduler's max BW control, and those using software RAID
devices can use the higher-level max BW controller.
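
As a rough illustration of what I mean by a max BW feature sitting next
to the proportional scheduler (this is NOT code from the patches; the
structure and names below are made up, and a real implementation would
hook into the group/queue selection logic), the group would simply
account dispatched bytes per time window, and dispatch from that group
would be delayed once the window's budget is exhausted:

#include <stdbool.h>

/* Per-group accounting for a max BW cap (illustrative only). */
struct grp_bw {
	unsigned long long max_bps;      /* bytes per second; 0 = no limit */
	unsigned long long bytes_in_win; /* bytes dispatched in this window */
	unsigned long long win_start_ns; /* start of the current window */
};

#define WIN_NS 100000000ULL              /* 100ms accounting window */

bool group_may_dispatch(struct grp_bw *g, unsigned long long now_ns,
			unsigned long long req_bytes)
{
	unsigned long long budget;

	if (!g->max_bps)
		return true;                     /* unlimited group */

	if (now_ns - g->win_start_ns >= WIN_NS) {
		g->win_start_ns = now_ns;        /* start a new window */
		g->bytes_in_win = 0;
	}

	budget = g->max_bps * WIN_NS / 1000000000ULL;
	if (g->bytes_in_win + req_bytes > budget)
		return false;   /* over the cap: delay dispatch from group */

	g->bytes_in_win += req_bytes;
	return true;
}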

> 
> >  
> > So do you think that it would make sense to do max BW control along with
> > proportional weight IO controller at IO scheduler? If yes, then we can
> > work together and continue to develop this patchset to also support max
> > bw control and meet your requirements and drop the io-throttling patches.
> 
> It is surely worth to be explored. Honestly, I don't know if it would be
> a better solution or not. Probably comparing some results with different
> IO workloads is the best way to proceed and decide which is the right
> way to go. This is necessary IMHO, before totally dropping one solution
> or another.

Sure. My patches have started giving some basic results, but there is a
lot of work remaining before a fair comparison can be done on the basis
of performance under various workloads. So there is some more time to go
before we can do a fair comparison based on numbers.
 
> 
> > 
> > The only thing which concerns me is the fact that IO scheduler does not
> > have the view of higher level logical device. So if somebody has setup a
> > software RAID and wants to put max BW limit on software raid device, this
> > solution will not work. One shall have to live with max bw limits on 
> > individual disks (where io scheduler is actually running). Do your patches
> > allow to put limit on software RAID devices also? 
> 
> No, but as said above my patchset provides the interfaces to apply the
> IO control and accounting wherever we want. At the moment there's just
> one interface, cgroup_io_throttle().

Sorry, I did not get it clearly. I guess I did not ask the question right.
So let's say I have a setup where there are two physical devices, /dev/sda
and /dev/sdb, and I create a logical device (say using device mapper
facilities) on top of these two physical disks. Some application is
generating the IO for the logical device lv0.

				Appl
				 |
				lv0
			       /  \
			    sda	   sdb


Where should I put the bandwidth limiting rules now for io-throttle? Do I
specify them for the lv0 device, or for the sda and sdb devices?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06 20:32       ` Vivek Goyal
       [not found]         ` <20090506203228.GH8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-06 21:34         ` Andrea Righi
  2009-05-06 21:52             ` Vivek Goyal
  1 sibling, 1 reply; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 21:34 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> Hi Andrea and others,
> 
> I always had this doubt in mind that any kind of 2nd level controller will
> have no idea about underlying IO scheduler queues/semantics. So while it
> can implement a particular cgroup policy (max bw like io-throttle or
> proportional bw like dm-ioband) but there are high chances that it will
> break IO scheduler's semantics in one way or other.
> 
> I had already sent out the results for dm-ioband in a separate thread.
> 
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
> 
> Here are some basic results with io-throttle. Andrea, please let me know
> if you think this is procedural problem. Playing with io-throttle patches
> for the first time.
> 
> I took V16 of your patches and trying it out with 2.6.30-rc4 with CFQ
> scheduler.
> 
> I have got one SATA drive with one partition on it.
> 
> I am trying to create one cgroup and assignn 8MB/s limit to it and launch
> on RT prio 0 task and one BE prio 7 task and see how this 8MB/s is divided
> between these tasks. Following are the results.
> 
> Following is my test script.
> 
> *******************************************************************
> #!/bin/bash
> 
> mount /dev/sdb1 /mnt/sdb
> 
> mount -t cgroup -o blockio blockio /cgroup/iot/
> mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
> 
> # Set bw limit of 8 MB/ps on sdb
> echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" >
> /cgroup/iot/test1/blockio.bandwidth-max
> 
> sync
> echo 3 > /proc/sys/vm/drop_caches
> 
> echo $$ > /cgroup/iot/test1/tasks
> 
> # Launch a normal prio reader.
> ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> pid1=$!
> echo $pid1
> 
> # Launch an RT reader  
> ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> pid2=$!
> echo $pid2
> 
> wait $pid2
> echo "RT task finished"
> **********************************************************************
> 
> Test1
> =====
> Test two readers (one RT class and one BE class) and see how BW is
> allocated with-in cgroup
> 
> With io-throttle patches
> ------------------------
> - Two readers, first BE prio 7, second RT prio 0
> 
> 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> RT task finished
> 
> Note: See, there is no difference in the performance of RT or BE task.
> Looks like these got throttled equally.

OK, this is coherent with the current io-throttle implementation. IO
requests are throttled without taking the ioprio model into account.

We could try to distribute the throttling as a function of each task's
ioprio, but OK, the obvious drawback is that it totally breaks the logic
used by the underlying layers.
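
Just to make the idea concrete, a purely hypothetical sketch (nothing
like this exists in io-throttle today, and the names below are invented)
of weighting a cgroup's bandwidth budget by the eight best-effort ioprio
levels:

/* Hypothetical helper: prio 0 (highest) gets weight 8, prio 7 gets 1. */
#define BE_PRIO_LEVELS 8

unsigned int ioprio_weight(int ioprio)
{
	return BE_PRIO_LEVELS - ioprio;
}

/* One task's share of the cgroup budget, given the sum of the weights
 * of all tasks currently doing IO in that cgroup. */
unsigned long long task_bw_share(unsigned long long cgroup_bw,
				 int ioprio, unsigned int weight_sum)
{
	return cgroup_bw * ioprio_weight(ioprio) / weight_sum;
}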

BTW, I'm wondering, is it a very critical issue? I would say, why not
move the RT task to a different cgroup with unlimited BW? Or limited BW,
but with the other tasks running at the same IO priority... Could the
cgroup subsystem be a more flexible and customizable framework with
respect to the current ioprio model?

I'm not saying we have to ignore the problem, just trying to evaluate
the impact and alternatives. And I'm still convinced that also providing
per-cgroup ioprio would be an important feature.

> 
> 
> Without io-throttle patches
> ----------------------------
> - Two readers, first BE prio 7, second RT prio 0
> 
> 234179072 bytes (234 MB) copied, 2.81801 s, 83.1 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.28238 s, 44.3 MB/s
> 
> Note: Because I can't limit the BW without io-throttle patches, so don't
>       worry about increased BW. But the important point is that RT task
>       gets much more BW than a BE prio 7 task.
> 
> Test2
> ====
> - Test 2 readers (One BE prio 0 and one BE prio 7) and see how BW is
> distributed among these.
> 
> With io-throttle patches
> ------------------------
> - Two readers, first BE prio 7, second BE prio 0
> 
> 234179072 bytes (234 MB) copied, 55.8604 s, 4.2 MB/s
> 234179072 bytes (234 MB) copied, 55.8918 s, 4.2 MB/s
> High prio reader finished

Ditto.

> 
> Without io-throttle patches
> ---------------------------
> - Two readers, first BE prio 7, second BE prio 0
> 
> 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> High prio reader finished
> 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> 
> Note: There is no service differentiation between prio 0 and prio 7 task
>       with io-throttle patches.
> 
> Test 3
> ======
> - Run the one RT reader and one BE reader in root cgroup without any
>   limitations. I guess this should mean unlimited BW and behavior should
>   be same as with CFQ without io-throttling patches.
> 
> With io-throttle patches
> =========================
> Ran the test 4 times because I was getting different results in different
> runs.
> 
> - Two readers, one RT prio 0  other BE prio 7
> 
> 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> RT task finished
> 
> 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> 
> 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> 
> 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> RT task finished
> 
> Note: Out of 4 runs, looks like twice it is complete priority inversion
>       and RT task finished after BE task. Rest of the two times, the
>       difference between BW of RT and BE task is much less as compared to
>       without patches. In fact once it was almost same.

This is strange. If you don't set any limit there shouldn't be any
difference with respect to the other case (without the io-throttle
patches).

At worst there would be a small overhead from task_to_iothrottle(), under
rcu_read_lock(). I'll repeat this test ASAP and see if I'm able to
reproduce this strange behaviour.

> 
> Without io-throttle patches.
> ===========================
> - Two readers, one RT prio 0  other BE prio 7 (4 runs)
> 
> 234179072 bytes (234 MB) copied, 2.80988 s, 83.3 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.28228 s, 44.3 MB/s
> 
> 234179072 bytes (234 MB) copied, 2.80659 s, 83.4 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.27874 s, 44.4 MB/s
> 
> 234179072 bytes (234 MB) copied, 2.79601 s, 83.8 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.2542 s, 44.6 MB/s
> 
> 234179072 bytes (234 MB) copied, 2.78764 s, 84.0 MB/s
> RT task finished
> 234179072 bytes (234 MB) copied, 5.26009 s, 44.5 MB/s
> 
> Note, How consistent the behavior is without io-throttle patches.
> 
> In summary, I think a 2nd level solution can ensure one policy on cgroups but
> it will break other semantics/properties of IO scheduler with-in cgroup as
> 2nd level solution has no idea at run time what is the IO scheduler running
> underneath and what kind of properties it has.
> 
> Andrea, please try it on your setup and see if you get similar results
> on or. Hopefully it is not a configuration or test procedure issue on my
> side.
> 
> Thanks
> Vivek
> 
> > The only thing which concerns me is the fact that IO scheduler does not
> > have the view of higher level logical device. So if somebody has setup a
> > software RAID and wants to put max BW limit on software raid device, this
> > solution will not work. One shall have to live with max bw limits on 
> > individual disks (where io scheduler is actually running). Do your patches
> > allow to put limit on software RAID devices also? 
> > 
> > Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> > of FIFO dispatch of buffered bios. Apart from that it tries to provide
> > fairness in terms of actual IO done and that would mean a seeky workload
> > will can use disk for much longer to get equivalent IO done and slow down
> > other applications. Implementing IO controller at IO scheduler level gives
> > us tigher control. Will it not meet your requirements? If you got specific
> > concerns with IO scheduler based contol patches, please highlight these and
> > we will see how these can be addressed.
> > 
> > Thanks
> > Vivek

-Andrea


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 18/18] io-controller: Debug hierarchical IO scheduling
  2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-06 21:40   ` IKEDA, Munehiro
       [not found]     ` <4A0203DB.1090809-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
       [not found]   ` <1241553525-28095-19-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 297+ messages in thread
From: IKEDA, Munehiro @ 2009-05-06 21:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, akpm

Hi Vivek,

Patching and compilation errors occurred with the 18/18 patch. I know
this is a patch for debugging, but I am reporting them just in case.


Vivek Goyal wrote:
> @@ -2203,6 +2241,25 @@ EXPORT_SYMBOL(elv_get_slice_idle);
>  void elv_ioq_served(struct io_queue *ioq, bfq_service_t served)
>  {
>  	entity_served(&ioq->entity, served, ioq->nr_sectors);

Patch failed due to this line.  I guess this should be

	entity_served(&ioq->entity, served);


> +
> +#ifdef CONFIG_DEBUG_GROUP_IOSCHED
> +		{
> +			struct elv_fq_data *efqd = ioq->efqd;
> +			char path[128];
> +			struct io_group *iog = ioq_to_io_group(ioq);
> +			io_group_path(iog, path, sizeof(path));
> +			elv_log_ioq(efqd, ioq, "ioq served: QSt=0x%lx QSs=0x%lx"
> +				" QTt=0x%lx QTs=0x%lx grp=%s GTt=0x%lx "
> +				" GTs=0x%lx rq_queued=%d",
> +				served, ioq->nr_sectors,
> +				ioq->entity.total_service,
> +				ioq->entity.total_sector_service,
> +				path,
> +				iog->entity.total_service,
> +				iog->entity.total_sector_service,
> +				ioq->nr_queued);
> +		}
> +#endif
>  }

Because
  io_entity::total_service
and
  io_entity::total_sector_service
are not defined, compilation fails here if CONFIG_DEBUG_GROUP_IOSCHED=y
(and everywhere else that references entity.total_service or
entity.total_sector_service). They would need to be defined like:

diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 1ea4ff3..6d0a735 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -147,6 +147,10 @@ struct io_entity {
        unsigned short ioprio_class, new_ioprio_class;
 
        int ioprio_changed;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+       unsigned long total_service, total_sector_service;
+#endif
 };
 
 /*

Unfortunately I couldn't figure out where and how the members
should be calculated, sorry.


-- 
IKEDA, Munehiro
  NEC Corporation of America
    m-ikeda@ds.jp.nec.com


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
@ 2009-05-06 21:52             ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 21:52 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Wed, May 06, 2009 at 11:34:54PM +0200, Andrea Righi wrote:
> On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> > Hi Andrea and others,
> > 
> > I always had this doubt in mind that any kind of 2nd level controller will
> > have no idea about underlying IO scheduler queues/semantics. So while it
> > can implement a particular cgroup policy (max bw like io-throttle or
> > proportional bw like dm-ioband) but there are high chances that it will
> > break IO scheduler's semantics in one way or other.
> > 
> > I had already sent out the results for dm-ioband in a separate thread.
> > 
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
> > 
> > Here are some basic results with io-throttle. Andrea, please let me know
> > if you think this is procedural problem. Playing with io-throttle patches
> > for the first time.
> > 
> > I took V16 of your patches and trying it out with 2.6.30-rc4 with CFQ
> > scheduler.
> > 
> > I have got one SATA drive with one partition on it.
> > 
> > I am trying to create one cgroup and assignn 8MB/s limit to it and launch
> > on RT prio 0 task and one BE prio 7 task and see how this 8MB/s is divided
> > between these tasks. Following are the results.
> > 
> > Following is my test script.
> > 
> > *******************************************************************
> > #!/bin/bash
> > 
> > mount /dev/sdb1 /mnt/sdb
> > 
> > mount -t cgroup -o blockio blockio /cgroup/iot/
> > mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
> > 
> > # Set bw limit of 8 MB/ps on sdb
> > echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" >
> > /cgroup/iot/test1/blockio.bandwidth-max
> > 
> > sync
> > echo 3 > /proc/sys/vm/drop_caches
> > 
> > echo $$ > /cgroup/iot/test1/tasks
> > 
> > # Launch a normal prio reader.
> > ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> > pid1=$!
> > echo $pid1
> > 
> > # Launch an RT reader  
> > ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> > pid2=$!
> > echo $pid2
> > 
> > wait $pid2
> > echo "RT task finished"
> > **********************************************************************
> > 
> > Test1
> > =====
> > Test two readers (one RT class and one BE class) and see how BW is
> > allocated with-in cgroup
> > 
> > With io-throttle patches
> > ------------------------
> > - Two readers, first BE prio 7, second RT prio 0
> > 
> > 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> > 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> > RT task finished
> > 
> > Note: See, there is no difference in the performance of RT or BE task.
> > Looks like these got throttled equally.
> 
> OK, this is coherent with the current io-throttle implementation. IO
> requests are throttled without the concept of the ioprio model.
> 
> We could try to distribute the throttle using a function of each task's
> ioprio, but ok, the obvious drawback is that it totally breaks the logic
> used by the underlying layers.
> 
> BTW, I'm wondering, is it a very critical issue? I would say why not to
> move the RT task to a different cgroup with unlimited BW? or limited BW
> but with other tasks running at the same IO priority...

So one hypothetical use case could be the following. Somebody is running
a hosted server, and customers get their applications running in a
particular cgroup with a limit on max BW.

			root
		  /      |      \
	     cust1      cust2   cust3
	   (20 MB/s)  (40MB/s)  (30MB/s)

Now all three customers will run their own applications/virtual machines
in their respective groups with upper limits. Will we tell them that all
their tasks will be considered to be of the same class and priority level?

Assume cust1 is running a hypothetical application which creates multiple
threads and assigns these threads different priorities based on its needs
at run time. How would we handle that?

You can't collect all the RT tasks from all the customers and move them
to a single cgroup, or ask customers to separate out their tasks based on
priority level and give them multiple groups of different priorities.

> could the cgroup
> subsystem be a more flexible and customizable framework respect to the
> current ioprio model?
> 
> I'm not saying we have to ignore the problem, just trying to evaluate
> the impact and alternatives. And I'm still convinced that also providing
> per-cgroup ioprio would be an important feature.
> 
> > 
> > 
> > Without io-throttle patches
> > ----------------------------
> > - Two readers, first BE prio 7, second RT prio 0
> > 
> > 234179072 bytes (234 MB) copied, 2.81801 s, 83.1 MB/s
> > RT task finished
> > 234179072 bytes (234 MB) copied, 5.28238 s, 44.3 MB/s
> > 
> > Note: Because I can't limit the BW without io-throttle patches, so don't
> >       worry about increased BW. But the important point is that RT task
> >       gets much more BW than a BE prio 7 task.
> > 
> > Test2
> > ====
> > - Test 2 readers (One BE prio 0 and one BE prio 7) and see how BW is
> > distributed among these.
> > 
> > With io-throttle patches
> > ------------------------
> > - Two readers, first BE prio 7, second BE prio 0
> > 
> > 234179072 bytes (234 MB) copied, 55.8604 s, 4.2 MB/s
> > 234179072 bytes (234 MB) copied, 55.8918 s, 4.2 MB/s
> > High prio reader finished
> 
> Ditto.
> 
> > 
> > Without io-throttle patches
> > ---------------------------
> > - Two readers, first BE prio 7, second BE prio 0
> > 
> > 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> > High prio reader finished
> > 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> > 
> > Note: There is no service differentiation between prio 0 and prio 7 task
> >       with io-throttle patches.
> > 
> > Test 3
> > ======
> > - Run the one RT reader and one BE reader in root cgroup without any
> >   limitations. I guess this should mean unlimited BW and behavior should
> >   be same as with CFQ without io-throttling patches.
> > 
> > With io-throttle patches
> > =========================
> > Ran the test 4 times because I was getting different results in different
> > runs.
> > 
> > - Two readers, one RT prio 0  other BE prio 7
> > 
> > 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> > 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> > RT task finished
> > 
> > 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> > RT task finished
> > 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> > 
> > 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> > RT task finished
> > 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> > 
> > 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> > 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> > RT task finished
> > 
> > Note: Out of 4 runs, looks like twice it is complete priority inversion
> >       and RT task finished after BE task. Rest of the two times, the
> >       difference between BW of RT and BE task is much less as compared to
> >       without patches. In fact once it was almost same.
> 
> This is strange. If you don't set any limit there shouldn't be any
> difference respect to the other case (without io-throttle patches).
> 
> At worst a small overhead given by the task_to_iothrottle(), under
> rcu_read_lock(). I'll repeat this test ASAP and see if I'll be able to
> reproduce this strange behaviour.

Yeah, I also found this strange. At least in the root group there should
not be any behavior change (at most one might expect a little drop in
throughput because of the extra code).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06 21:21         ` Vivek Goyal
@ 2009-05-06 22:02               ` Andrea Righi
  0 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 22:02 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Wed, May 06, 2009 at 05:21:21PM -0400, Vivek Goyal wrote:
> > Well, IMHO the big concern is at which level we want to implement the
> > logic of control: IO scheduler, when the IO requests are already
> > submitted and need to be dispatched, or at high level when the
> > applications generates IO requests (or maybe both).
> > 
> > And, as pointed by Andrew, do everything by a cgroup-based controller.
> 
> I am not sure what's the rationale behind that. Why to do it at higher
> layer? Doing it at IO scheduler layer will make sure that one does not
> breaks the IO scheduler's properties with-in cgroup. (See my other mail
> with some io-throttling test results).
> 
> The advantage of higher layer mechanism is that it can also cover software
> RAID devices well. 
> 
> > 
> > The other features, proportional BW, throttling, take the current ioprio
> > model in account, etc. are implementation details and any of the
> > proposed solutions can be extended to support all these features. I
> > mean, io-throttle can be extended to support proportional BW (for a
> > certain perspective it is already provided by the throttling water mark
> > in v16), as well as the IO scheduler based controller can be extended to
> > support absolute BW limits. The same for dm-ioband. I don't think
> > there're huge obstacle to merge the functionalities in this sense.
> 
> Yes, from technical point of view, one can implement a proportional BW
> controller at higher layer also. But that would practically mean almost
> re-implementing the CFQ logic at higher layer. Now why to get into all
> that complexity. Why not simply make CFQ hiearchical to also handle the
> groups?

Making CFQ aware of cgroups is very important too. I could be wrong, but
I don't think we need to re-implement the exact same CFQ logic at higher
layers. CFQ dispatches IO requests; at higher layers applications submit IO
requests. We're talking about different things, and applying different logic
doesn't sound too strange IMHO. I mean, at least we should consider/test this
different approach as well before deciding to drop it.

This solution also guarantees no changes in the IO schedulers for those
who are not interested in using the cgroup IO controller. What is the
impact of the IO scheduler based controller on those users?

> 
> Secondly, think of following odd scenarios if we implement a higher level
> proportional BW controller which can offer the same feature as CFQ and
> also can handle group scheduling.
> 
> Case1:
> ======	 
>            (Higher level proportional BW controller)
> 			/dev/sda (CFQ)
> 
> So if somebody wants a group scheduling, we will be doing same IO control
> at two places (with-in group). Once at higher level and second time at CFQ
> level. Does not sound too logical to me.
> 
> Case2:
> ======
> 
>            (Higher level proportional BW controller)
> 			/dev/sda (NOOP)
> 	
> This is other extrememt. Lower level IO scheduler does not offer any kind
> of notion of class or prio with-in class and higher level scheduler will
> still be maintaining all the infrastructure unnecessarily.
> 
> That's why I get back to this simple question again, why not extend the
> IO schedulers to handle group scheduling and do both proportional BW and
> max bw control there.
> 
> > 
> > > 
> > > Andrea, last time you were planning to have a look at my patches and see
> > > if max bw controller can be implemented there. I got a feeling that it
> > > should not be too difficult to implement it there. We already have the
> > > hierarchical tree of io queues and groups in elevator layer and we run
> > > BFQ (WF2Q+) algorithm to select next queue to dispatch the IO from. It is
> > > just a matter of also keeping track of IO rate per queue/group and we should
> > > be easily be able to delay the dispatch of IO from a queue if its group has
> > > crossed the specified max bw.
> > 
> > Yes, sorry for my late, I quickly tested your patchset, but I still need
> > to understand many details of your solution. In the next days I'll
> > re-read everything carefully and I'll try to do a detailed review of
> > your patchset (just re-building the kernel with your patchset applied).
> > 
> 
> Sure. My patchset is still in the infancy stage. So don't expect great
> results. But it does highlight the idea and design very well.
> 
> > > 
> > > This should lead to less code and reduced complextiy (compared with the
> > > case where we do max bw control with io-throttling patches and proportional
> > > BW control using IO scheduler based control patches).
> > 
> > mmmh... changing the logic at the elevator and all IO schedulers doesn't
> > sound like reduced complexity and less code changed. With io-throttle we
> > just need to place the cgroup_io_throttle() hook in the right functions
> > where we want to apply throttling. This is a quite easy approach to
> > extend the IO control also to logical devices (more in general devices
> > that use their own make_request_fn) or even network-attached devices, as
> > well as networking filesystems, etc.
> > 
> > But I may be wrong. As I said I still need to review in the details your
> > solution.
> 
> Well I meant reduced code in the sense if we implement both max bw and
> proportional bw at IO scheduler level instead of proportional BW at
> IO scheduler and max bw at higher level.

OK.

> 
> I agree that doing max bw control at higher level has this advantage that
> it covers all the kind of deivces (higher level logical devices) and IO
> scheduler level solution does not do that. But this comes at the price
> of broken IO scheduler properties with-in cgroup.
> 
> Maybe we can then implement both. A higher level max bw controller and a
> max bw feature implemented along side proportional BW controller at IO
> scheduler level. Folks who use hardware RAID, or single disk devices can
> use max bw control of IO scheduler and those using software RAID devices
> can use higher level max bw controller.

OK, maybe.

> 
> > 
> > >  
> > > So do you think that it would make sense to do max BW control along with
> > > proportional weight IO controller at IO scheduler? If yes, then we can
> > > work together and continue to develop this patchset to also support max
> > > bw control and meet your requirements and drop the io-throttling patches.
> > 
> > It is surely worth to be explored. Honestly, I don't know if it would be
> > a better solution or not. Probably comparing some results with different
> > IO workloads is the best way to proceed and decide which is the right
> > way to go. This is necessary IMHO, before totally dropping one solution
> > or another.
> 
> Sure. My patches have started giving some basic results but because there
> is lot of work remaining before a fair comparison can be done on the
> basis of performance under various work loads. So some more time to
> go before we can do a fair comparison based on numbers.
>  
> > 
> > > 
> > > The only thing which concerns me is the fact that IO scheduler does not
> > > have the view of higher level logical device. So if somebody has setup a
> > > software RAID and wants to put max BW limit on software raid device, this
> > > solution will not work. One shall have to live with max bw limits on 
> > > individual disks (where io scheduler is actually running). Do your patches
> > > allow to put limit on software RAID devices also? 
> > 
> > No, but as said above my patchset provides the interfaces to apply the
> > IO control and accounting wherever we want. At the moment there's just
> > one interface, cgroup_io_throttle().
> 
> Sorry, I did not get it clearly. I guess I did not ask the question right.
> So lets say I got a setup where there are two phyical devices /dev/sda and
> /dev/sdb and I create a logical device (say using device mapper facilities)
> on top of these two physical disks. And some application is generating
> the IO for logical device lv0.
> 
> 				Appl
> 				 |
> 				lv0
> 			       /  \
> 			    sda	   sdb
> 
> 
> Where should I put the bandwidth limiting rules now for io-throtle. I 
> specify these for lv0 device or for sda and sdb devices?

The BW limiting rules would be applied inside the make_request_fn provided
by the lv0 device or, if it doesn't provide one, just before calling
generic_make_request(). A problem could be that, at that point, the code doing
the throttling must be aware it is dealing with the particular lv0 device.
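
As a purely illustrative sketch of that hook placement (the
cgroup_io_throttle() signature and the remapping helper are assumptions, not
the actual io-throttle code), the lv0 case above could look roughly like:

        /* Hypothetical make_request_fn of the lv0 logical device. */
        static int lv0_make_request(struct request_queue *q, struct bio *bio)
        {
                /* charge/throttle the bio against lv0's cgroup limits */
                cgroup_io_throttle(bio);

                /* remap to one of the underlying disks and resubmit */
                bio->bi_bdev = lv0_pick_lower_bdev(bio);   /* invented helper */
                generic_make_request(bio);
                return 0;
        }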

> 
> Thanks
> Vivek

OK. I definitely need to look at your patchset before offering any further
opinion... :)

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06 22:02               ` Andrea Righi
@ 2009-05-06 22:17                 ` Vivek Goyal
  -1 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 22:17 UTC (permalink / raw)
  To: Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Thu, May 07, 2009 at 12:02:51AM +0200, Andrea Righi wrote:
> On Wed, May 06, 2009 at 05:21:21PM -0400, Vivek Goyal wrote:
> > > Well, IMHO the big concern is at which level we want to implement the
> > > logic of control: IO scheduler, when the IO requests are already
> > > submitted and need to be dispatched, or at high level when the
> > > applications generates IO requests (or maybe both).
> > > 
> > > And, as pointed by Andrew, do everything by a cgroup-based controller.
> > 
> > I am not sure what's the rationale behind that. Why to do it at higher
> > layer? Doing it at IO scheduler layer will make sure that one does not
> > breaks the IO scheduler's properties with-in cgroup. (See my other mail
> > with some io-throttling test results).
> > 
> > The advantage of higher layer mechanism is that it can also cover software
> > RAID devices well. 
> > 
> > > 
> > > The other features, proportional BW, throttling, take the current ioprio
> > > model in account, etc. are implementation details and any of the
> > > proposed solutions can be extended to support all these features. I
> > > mean, io-throttle can be extended to support proportional BW (for a
> > > certain perspective it is already provided by the throttling water mark
> > > in v16), as well as the IO scheduler based controller can be extended to
> > > support absolute BW limits. The same for dm-ioband. I don't think
> > > there're huge obstacle to merge the functionalities in this sense.
> > 
> > Yes, from technical point of view, one can implement a proportional BW
> > controller at higher layer also. But that would practically mean almost
> > re-implementing the CFQ logic at higher layer. Now why to get into all
> > that complexity. Why not simply make CFQ hiearchical to also handle the
> > groups?
> 
> Making CFQ aware of cgroups is very important too. I could be wrong, but
> I don't think we need to re-implement the exact same CFQ logic at higher
> layers. CFQ dispatches IO requests; at higher layers applications submit IO
> requests. We're talking about different things, and applying different logic
> doesn't sound too strange IMHO. I mean, at least we should consider/test this
> different approach as well before deciding to drop it.
> 

A lot of CFQ code is about maintaining per-io-context queues for different
classes and different priority levels, about anticipation for reads, etc.
Anybody who wants to get classes and ioprio right within a cgroup will end up
duplicating all that logic (to cover all the cases). So I did not mean that
you would end up copying the whole code, but logically a lot of it.

Secondly, there will be a mismatch in the anticipation logic. CFQ gives
preference to reads, and for dependent readers it idles and waits for the next
request to come. Higher level throttling can interfere with an application's
IO pattern and lead CFQ to think that the application's average think time is
high, disabling anticipation for that application. That would result in high
latencies for simple commands like "ls" in the presence of competing
applications.

> This solution also guarantees no changes in the IO schedulers for those
> who are not interested in using the cgroup IO controller. What is the
> impact of the IO scheduler based controller on those users?
> 

The IO scheduler based solution is highly customizable. First of all, there
are compile time switches to either completely remove the fair queuing code
(for noop, deadline and AS only) or to disable group scheduling only. In that
case one would expect the same behavior as the old scheduler.

Secondly, even if everything is compiled in and the customer is not using
cgroups, I would expect almost the same behavior (because we will have only
the root group). There will be extra code in the path, and we will need some
optimizations to detect that there is only one group and bypass as much code
as possible, bringing the overhead of the new code down to a minimum.

So if a customer is not using the IO controller, he should get the same
behavior as the old system. I can't prove it right now because my patches are
not that mature yet, but there are no fundamental design limitations.
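
A purely illustrative sketch of the kind of fast path meant here (the helper
names and the nr_groups field are placeholders, not from the patchset; only
the struct names follow the code quoted earlier):

        /* When only the root group exists, skip the hierarchy walk. */
        static struct io_queue *select_next_ioq(struct elv_fq_data *efqd)
        {
                if (efqd->nr_groups == 1)               /* only the root group */
                        return select_ioq_flat(efqd);   /* old-style selection */

                return select_ioq_hierarchical(efqd);   /* WF2Q+ over groups */
        }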

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 18/18] io-controller: Debug hierarchical IO scheduling
  2009-05-06 21:58         ` Vivek Goyal
@ 2009-05-06 22:19             ` IKEDA, Munehiro
  -1 siblings, 0 replies; 297+ messages in thread
From: IKEDA, Munehiro @ 2009-05-06 22:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi,

Vivek Goyal wrote:
> Hi Ikeda,
> 
> I think there is some issue with how the patches were applied. It looks like
> you have forgotten to apply the following patch, and that is why you are
> seeing all these issues.
> 
> "io-controller: Export disk time used and nr sectors dipatched through
> cgroups"
> 
> This patch changes elv_ioq_served() and at the same time introduces the
> additional fields (total_service, total_sector_service, etc.).
> 
> Thanks
> Vivek

Oh!  you are right.  I missed it because it is out of the
thread...
Thanks and please forgive my pointless reply.
  

-- 
IKEDA, Munehiro
  NEC Corporation of America
    m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 18/18] io-controller: Debug hierarchical IO scheduling
  2009-05-06 22:19             ` IKEDA, Munehiro
@ 2009-05-06 22:24                 ` Vivek Goyal
  -1 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-06 22:24 UTC (permalink / raw)
  To: IKEDA, Munehiro
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, May 06, 2009 at 06:19:01PM -0400, IKEDA, Munehiro wrote:
> Hi,
>
> Vivek Goyal wrote:
>> Hi Ikeda,
>>
>> I think there is some issue with how the patches were applied. It looks like
>> you have forgotten to apply the following patch, and that is why you are
>> seeing all these issues.
>>
>> "io-controller: Export disk time used and nr sectors dipatched through
>> cgroups"
>>
>> This patch changes elv_ioq_served() and at the same time introduces the
>> additional fields (total_service, total_sector_service, etc.).
>>
>> Thanks
>> Vivek
>
> Oh!  you are right.  I missed it because it is out of the
> thread...

That's strange. In "mutt" I see this patch (patch number 7) as part of the
thread. Which mail client are you using? I am not sure if it is a mail client
specific thing or some issue with the way I am using "git-send-email".

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]             ` <20090506215235.GJ8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-06 22:35               ` Andrea Righi
  2009-05-07  9:04               ` Andrea Righi
  1 sibling, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-06 22:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> On Wed, May 06, 2009 at 11:34:54PM +0200, Andrea Righi wrote:
> > On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> > > Hi Andrea and others,
> > > 
> > > I always had this doubt in mind that any kind of 2nd level controller will
> > > have no idea about underlying IO scheduler queues/semantics. So while it
> > > can implement a particular cgroup policy (max bw like io-throttle or
> > > proportional bw like dm-ioband) but there are high chances that it will
> > > break IO scheduler's semantics in one way or other.
> > > 
> > > I had already sent out the results for dm-ioband in a separate thread.
> > > 
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
> > > 
> > > Here are some basic results with io-throttle. Andrea, please let me know
> > > if you think this is procedural problem. Playing with io-throttle patches
> > > for the first time.
> > > 
> > > I took V16 of your patches and trying it out with 2.6.30-rc4 with CFQ
> > > scheduler.
> > > 
> > > I have got one SATA drive with one partition on it.
> > > 
> > > I am trying to create one cgroup and assignn 8MB/s limit to it and launch
> > > on RT prio 0 task and one BE prio 7 task and see how this 8MB/s is divided
> > > between these tasks. Following are the results.
> > > 
> > > Following is my test script.
> > > 
> > > *******************************************************************
> > > #!/bin/bash
> > > 
> > > mount /dev/sdb1 /mnt/sdb
> > > 
> > > mount -t cgroup -o blockio blockio /cgroup/iot/
> > > mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
> > > 
> > > # Set bw limit of 8 MB/ps on sdb
> > > echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" >
> > > /cgroup/iot/test1/blockio.bandwidth-max
> > > 
> > > sync
> > > echo 3 > /proc/sys/vm/drop_caches
> > > 
> > > echo $$ > /cgroup/iot/test1/tasks
> > > 
> > > # Launch a normal prio reader.
> > > ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> > > pid1=$!
> > > echo $pid1
> > > 
> > > # Launch an RT reader  
> > > ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> > > pid2=$!
> > > echo $pid2
> > > 
> > > wait $pid2
> > > echo "RT task finished"
> > > **********************************************************************
> > > 
> > > Test1
> > > =====
> > > Test two readers (one RT class and one BE class) and see how BW is
> > > allocated with-in cgroup
> > > 
> > > With io-throttle patches
> > > ------------------------
> > > - Two readers, first BE prio 7, second RT prio 0
> > > 
> > > 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> > > 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> > > RT task finished
> > > 
> > > Note: See, there is no difference in the performance of RT or BE task.
> > > Looks like these got throttled equally.
> > 
> > OK, this is coherent with the current io-throttle implementation. IO
> > requests are throttled without the concept of the ioprio model.
> > 
> > We could try to distribute the throttle using a function of each task's
> > ioprio, but ok, the obvious drawback is that it totally breaks the logic
> > used by the underlying layers.
> > 
> > BTW, I'm wondering, is it a very critical issue? I would say why not to
> > move the RT task to a different cgroup with unlimited BW? or limited BW
> > but with other tasks running at the same IO priority...
> 
> So one of hypothetical use case probably  could be following. Somebody
> is having a hosted server and customers are going to get there
> applications running in a particular cgroup with a limit on max bw.
> 
> 			root
> 		  /      |      \
> 	     cust1      cust2   cust3
> 	   (20 MB/s)  (40MB/s)  (30MB/s)
> 
> Now all three customers will run their own applications/virtual machines
> in their respective groups with upper limits. Will we say to these that
> all your tasks will be considered as same class and same prio level.
> 
> Assume cust1 is running a hypothetical application which creates multiple
> threads and assigns these threads different priorities based on its needs
> at run time. How would we handle this thing?
> 
> You can't collect all the RT tasks from all customers and move these to a
> single cgroup. Or ask customers to separate out their tasks based on
> priority level and give them multiple groups of different priority.

Clear.

Unfortunately, I think that with absolute BW limits, at some point, if we hit
the limit, we need to block the IO request. That is the same whether we do it
when the request is dispatched or when it is submitted. And the risk is to
break the logic of the IO priorities and fall into the classic priority
inversion problem.

The difference is that working at the CFQ level probably gives better control,
so we can handle these cases appropriately and avoid the priority inversion
problems.
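
For illustration, a toy sketch of why a flat per-cgroup limiter produces the
result seen in the test quoted above (all names and the helper are invented,
this is not io-throttle code): the delay is charged to whichever task submits
the bio, regardless of its IO class.

        static void throttle_bio(struct iot_cgroup *iot, struct bio *bio)
        {
                long sleep;

                spin_lock_irq(&iot->lock);
                iot->bytes += bio->bi_size;          /* charge the bio  */
                sleep = over_limit_delay(iot);       /* invented helper */
                spin_unlock_irq(&iot->lock);

                if (sleep > 0)
                        /* the RT reader sleeps here just like the BE one */
                        schedule_timeout_killable(sleep);
        }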

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 18/18] io-controller: Debug hierarchical IO scheduling
  2009-05-06 22:24                 ` Vivek Goyal
@ 2009-05-06 23:01                     ` IKEDA, Munehiro
  -1 siblings, 0 replies; 297+ messages in thread
From: IKEDA, Munehiro @ 2009-05-06 23:01 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote:
> On Wed, May 06, 2009 at 06:19:01PM -0400, IKEDA, Munehiro wrote:
>> Hi,
>>
>> Vivek Goyal wrote:
>>> Hi Ikeda,
>>>
>>> I think there is some issue with how the patches were applied. It looks
>>> like you have forgotten to apply the following patch, and that is why you
>>> are seeing all these issues.
>>>
>>> "io-controller: Export disk time used and nr sectors dipatched through
>>> cgroups"
>>>
>>> This patch changes elv_ioq_served() and at the same time introduces the
>>> additional fields (total_service, total_sector_service, etc.).
>>>
>>> Thanks
>>> Vivek
>> Oh!  you are right.  I missed it because it is out of the
>> thread...
> 
> That's strange. In "mutt" I see this patch (patch number 7) as part of the
> thread. Which mail client are you using? I am not sure if it is a mail client
> specific thing or some issue with the way I am using "git-send-email".
> 
> Thanks
> Vivek

I'm using a somewhat old Thunderbird (version 2.0.0).
I believe you are doing everything right, because the In-Reply-To
header of the patch #7 mail correctly points to the first mail.

The patch #7 mail seems to have a slightly earlier timestamp than
the first mail. My wild guess is that poor Thunderbird gets
confused by that.

Sorry for confusing you.


-- 
IKEDA, Munehiro
  NEC Corporation of America
    m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]       ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                           ` (2 preceding siblings ...)
  2009-05-06 20:32         ` Vivek Goyal
@ 2009-05-07  0:18         ` Ryo Tsuruta
  3 siblings, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-07  0:18 UTC (permalink / raw)
  To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Hi Vivek,

> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> of FIFO dispatch of buffered bios. Apart from that it tries to provide
> fairness in terms of actual IO done and that would mean a seeky workload
> will can use disk for much longer to get equivalent IO done and slow down
> other applications. Implementing IO controller at IO scheduler level gives
> us tigher control. Will it not meet your requirements? If you got specific
> concerns with IO scheduler based contol patches, please highlight these and
> we will see how these can be addressed.

I'd like to avoid complicating the existing IO schedulers and other kernel
code, and to give users a choice of whether or not to use it. I know that you
chose an approach that uses compile time options to get the same behavior as
the old system, but device-mapper drivers can be added, removed and replaced
while the system is running.
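
For reference, a hedged sketch of the runtime workflow being referred to here,
using generic dmsetup commands (the dm-ioband table lines are placeholders,
not the real syntax; see the dm-ioband documentation for the exact format):

        # add a band device on top of /dev/sda1
        echo "<dm-ioband table for /dev/sda1>" | dmsetup create ioband1

        # replace its table while the system keeps running
        dmsetup suspend ioband1
        echo "<new dm-ioband table>" | dmsetup reload ioband1
        dmsetup resume ioband1

        # remove it again
        dmsetup remove ioband1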

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-07  0:18       ` Ryo Tsuruta
@ 2009-05-07  1:25             ` Vivek Goyal
  2009-05-08 14:24         ` Rik van Riel
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-07  1:25 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Thu, May 07, 2009 at 09:18:58AM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> > Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> > of FIFO dispatch of buffered bios. Apart from that it tries to provide
> > fairness in terms of actual IO done and that would mean a seeky workload
> > will can use disk for much longer to get equivalent IO done and slow down
> > other applications. Implementing IO controller at IO scheduler level gives
> > us tigher control. Will it not meet your requirements? If you got specific
> > concerns with IO scheduler based contol patches, please highlight these and
> > we will see how these can be addressed.
> 
> I'd like to avoid complicating the existing IO schedulers and other kernel
> code, and to give users a choice of whether or not to use it. I know that you
> chose an approach that uses compile time options to get the same behavior as
> the old system, but device-mapper drivers can be added, removed and replaced
> while the system is running.
> 

The same is possible with the IO scheduler based controller. If you don't
want the cgroup stuff, don't create any cgroups. By default everything will
be in the root group and you will get the old behavior.

If you want the IO controller functionality, just create a cgroup, assign it
a weight and move tasks there. So what more choices do you want that are
missing here?
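
In other words, something as simple as this (a minimal sketch; the group
name is arbitrary, and io.weight/tasks are the files used by the test
scripts in this thread):

  # Mount the controller. With no child groups, everything stays in the
  # root group and behaves as before.
  mount -t cgroup -o io,blkio io /cgroup

  # Opt in to IO control: create a group, give it a weight, move a task in.
  mkdir /cgroup/test1
  echo 500 > /cgroup/test1/io.weight
  echo $$ > /cgroup/test1/tasks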

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06 22:35             ` Andrea Righi
@ 2009-05-07  1:48               ` Ryo Tsuruta
  2009-05-07  1:48               ` Ryo Tsuruta
  1 sibling, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-07  1:48 UTC (permalink / raw)
  To: righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

From: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Subject: Re: IO scheduler based IO Controller V2
Date: Thu, 7 May 2009 00:35:13 +0200

> On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > On Wed, May 06, 2009 at 11:34:54PM +0200, Andrea Righi wrote:
> > > On Wed, May 06, 2009 at 04:32:28PM -0400, Vivek Goyal wrote:
> > > > Hi Andrea and others,
> > > > 
> > > > I always had this doubt in mind that any kind of 2nd level controller
> > > > will have no idea about the underlying IO scheduler's queues/semantics.
> > > > So while it can implement a particular cgroup policy (max bw like
> > > > io-throttle or proportional bw like dm-ioband), there is a high chance
> > > > that it will break the IO scheduler's semantics in one way or another.
> > > > 
> > > > I had already sent out the results for dm-ioband in a separate thread.
> > > > 
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07258.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg07573.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08177.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08345.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg08355.html
> > > > 
> > > > Here are some basic results with io-throttle. Andrea, please let me know
> > > > if you think this is a procedural problem. I am playing with the
> > > > io-throttle patches for the first time.
> > > > 
> > > > I took V16 of your patches and am trying it out with 2.6.30-rc4 with the
> > > > CFQ scheduler.
> > > > 
> > > > I have got one SATA drive with one partition on it.
> > > > 
> > > > I am trying to create one cgroup, assign an 8MB/s limit to it, and launch
> > > > one RT prio 0 task and one BE prio 7 task to see how this 8MB/s is divided
> > > > between these tasks. Following are the results.
> > > > 
> > > > Following is my test script.
> > > > 
> > > > *******************************************************************
> > > > #!/bin/bash
> > > > 
> > > > mount /dev/sdb1 /mnt/sdb
> > > > 
> > > > mount -t cgroup -o blockio blockio /cgroup/iot/
> > > > mkdir -p /cgroup/iot/test1 /cgroup/iot/test2
> > > > 
> > > > # Set bw limit of 8 MB/s on sdb
> > > > echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" >
> > > > /cgroup/iot/test1/blockio.bandwidth-max
> > > > 
> > > > sync
> > > > echo 3 > /proc/sys/vm/drop_caches
> > > > 
> > > > echo $$ > /cgroup/iot/test1/tasks
> > > > 
> > > > # Launch a normal prio reader.
> > > > ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile1 of=/dev/zero &
> > > > pid1=$!
> > > > echo $pid1
> > > > 
> > > > # Launch an RT reader  
> > > > ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> > > > pid2=$!
> > > > echo $pid2
> > > > 
> > > > wait $pid2
> > > > echo "RT task finished"
> > > > **********************************************************************
> > > > 
> > > > Test1
> > > > =====
> > > > Test two readers (one RT class and one BE class) and see how BW is
> > > > allocated within the cgroup.
> > > > 
> > > > With io-throttle patches
> > > > ------------------------
> > > > - Two readers, first BE prio 7, second RT prio 0
> > > > 
> > > > 234179072 bytes (234 MB) copied, 55.8482 s, 4.2 MB/s
> > > > 234179072 bytes (234 MB) copied, 55.8975 s, 4.2 MB/s
> > > > RT task finished
> > > > 
> > > > Note: See, there is no difference in the performance of the RT and BE
> > > > tasks. Looks like they got throttled equally.
> > > 
> > > OK, this is coherent with the current io-throttle implementation. IO
> > > requests are throttled without the concept of the ioprio model.
> > > 
> > > We could try to distribute the throttle using a function of each task's
> > > ioprio, but ok, the obvious drawback is that it totally breaks the logic
> > > used by the underlying layers.
> > > 
> > > BTW, I'm wondering, is it a very critical issue? I would say, why not
> > > move the RT task to a different cgroup with unlimited BW? Or limited BW
> > > but with other tasks running at the same IO priority...
> > 
> > So one hypothetical use case could be the following. Somebody is running
> > a hosted server, and customers get their applications running in a
> > particular cgroup with a limit on max bw.
> > 
> > 			root
> > 		  /      |      \
> > 	     cust1      cust2   cust3
> > 	   (20 MB/s)  (40MB/s)  (30MB/s)
> > 
> > Now all three customers will run their own applications/virtual machines
> > in their respective groups with upper limits. Will we tell them that all
> > their tasks will be considered the same class and the same prio level?
> > 
> > Assume cust1 is running a hypothetical application which creates multiple
> > threads and assigns these threads different priorities based on its needs
> > at run time. How would we handle this?
> > 
> > You can't collect all the RT tasks from all customers and move them to a
> > single cgroup, or ask customers to separate out their tasks based on
> > priority level and give them multiple groups of different priorities.
> 
> Clear.
> 
> Unfortunately, I think, with absolute BW limits, at a certain point, if
> we hit the limit, we need to block the IO request. That's the same
> whether we do it when we dispatch or when we submit the request. And the
> risk is that we break the logic of the IO priorities and fall into the
> classic priority inversion problem.
> 
> The difference is that working at the CFQ level probably gives better
> control, so we can handle these cases appropriately and avoid the
> priority inversion problems.
> 
> Thanks,
> -Andrea

If the RT tasks in cust1 issue IOs intensively, are the IOs issued by the
BE tasks running in cust2 and cust3 suppressed, so that cust1 can use the
whole bandwidth?
I think that CFQ's classes and priorities should be preserved within the
bandwidth given to each cgroup.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]       ` <20090506161012.GC8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-07  5:36         ` Li Zefan
  2009-05-07  5:47         ` Gui Jianfeng
  1 sibling, 0 replies; 297+ messages in thread
From: Li Zefan @ 2009-05-07  5:36 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

[-- Attachment #1: Type: text/plain, Size: 2886 bytes --]

Vivek Goyal wrote:
> On Wed, May 06, 2009 at 04:11:05PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> Hi All,
>>>
>>> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
>>> First version of the patches was posted here.
>> Hi Vivek,
>>
>> I did some simple tests for V2 and triggered a kernel panic.
>> The following script can reproduce this bug. It seems that the cgroup
>> is already removed, but the IO controller still tries to access it.
>>
> 
> Hi Gui,
> 
> Thanks for the report. I use cgroup_path() for debugging. I guess that
> cgroup_path() was passed a null cgrp pointer and that's why it crashed.
> 
> If yes, then it is strange though. I call cgroup_path() only after
> grabbing a reference to the css object. (I am assuming that if I have a
> valid reference to a css object then css->cgrp can't be null.)
> 

Yes, css->cgrp shouldn't be NULL. I suspect we hit a bug in cgroup here.
The code dealing with css refcounting and cgroup rmdir has changed quite a
lot, and is much more complex than it used to be.

> Anyway, can you please try out the following patch and see if it fixes
> your crash.
...
> BTW, I tried the following equivalent script and I can't see the crash on
> my system. Are you able to hit it regularly?
> 

I modified the script like this:

======================
#!/bin/sh
echo 1 > /proc/sys/vm/drop_caches
mkdir /cgroup 2> /dev/null
mount -t cgroup -o io,blkio io /cgroup
mkdir /cgroup/test1
mkdir /cgroup/test2
echo 100 > /cgroup/test1/io.weight
echo 500 > /cgroup/test2/io.weight

dd if=/dev/zero bs=4096 count=128000 of=500M.1 &
pid1=$!
echo $pid1 > /cgroup/test1/tasks

dd if=/dev/zero bs=4096 count=128000 of=500M.2 &
pid2=$!
echo $pid2 > /cgroup/test2/tasks

sleep 5
kill -9 $pid1
kill -9 $pid2

for ((;count != 2;))
{
        rmdir /cgroup/test1 > /dev/null 2>&1
        if [ $? -eq 0 ]; then
                count=$(( $count + 1 ))
        fi

        rmdir /cgroup/test2 > /dev/null 2>&1
        if [ $? -eq 0 ]; then
                count=$(( $count + 1 ))
        fi
}

umount /cgroup
rmdir /cgroup
======================

I ran this script and got a lockdep BUG. The full log and my config are attached.

Actually this can be triggered with the following steps on my box:
# mount -t cgroup -o blkio,io xxx /mnt
# mkdir /mnt/0
# echo $$ > /mnt/0/tasks
# echo 3 > /proc/sys/vm/drop_caches
# echo $$ > /mnt/tasks
# rmdir /mnt/0

And when I ran the script a second time, my box froze and I had to
reset it.

> Instead of killing the tasks, I also tried moving the tasks into the root
> cgroup and then deleting the test1 and test2 groups; that also did not
> produce any crash. (Hit a different bug though after 5-6 attempts :-)
> 
> As I mentioned in the patchset, currently we do have issues with group
> refcounting and cgroups/groups going away. Hopefully in the next version
> they should all be fixed up. But still, it is nice to hear back...
> 

[-- Attachment #2: myconfig --]
[-- Type: text/plain, Size: 64514 bytes --]

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.30-rc4
# Thu May  7 09:11:29 2009
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA=y
# CONFIG_HAVE_CPUMASK_OF_CPU_MAP is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_X86_32_LAZY_GS=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_AUDIT is not set

#
# RCU Subsystem
#
# CONFIG_CLASSIC_RCU is not set
# CONFIG_TREE_RCU is not set
CONFIG_PREEMPT_RCU=y
CONFIG_RCU_TRACE=y
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_PREEMPT_RCU_TRACE=y
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUPS=y
CONFIG_CGROUP_DEBUG=y
CONFIG_CGROUP_NS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
CONFIG_CGROUP_MEM_RES_CTLR=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
CONFIG_GROUP_IOSCHED=y
CONFIG_CGROUP_BLKIO=y
CONFIG_CGROUP_PAGE=y
CONFIG_MM_OWNER=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
CONFIG_USER_NS=y
CONFIG_PID_NS=y
# CONFIG_NET_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
# CONFIG_STRIP_ASM_SYMS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
CONFIG_COMPAT_BRK=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_MARKERS=y
CONFIG_OPROFILE=m
# CONFIG_OPROFILE_IBS is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_API_DEBUG=y
# CONFIG_SLOW_WORK is not set
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_LBD=y
CONFIG_BLK_DEV_BSG=y
# CONFIG_BLK_DEV_INTEGRITY is not set

#
# IO Schedulers
#
CONFIG_ELV_FAIR_QUEUING=y
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_NOOP_HIER=y
CONFIG_IOSCHED_AS=m
CONFIG_IOSCHED_AS_HIER=y
CONFIG_IOSCHED_DEADLINE=m
CONFIG_IOSCHED_DEADLINE_HIER=y
CONFIG_IOSCHED_CFQ=y
CONFIG_IOSCHED_CFQ_HIER=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_TRACK_ASYNC_CONTEXT=y
CONFIG_DEBUG_GROUP_IOSCHED=y
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
# CONFIG_SPARSE_IRQ is not set
CONFIG_X86_MPPARSE=y
# CONFIG_X86_BIGSMP is not set
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_RDC321X is not set
# CONFIG_X86_32_NON_STANDARD is not set
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_PARAVIRT_GUEST is not set
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
CONFIG_M686=y
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_GENERIC=y
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_X86_XADD=y
CONFIG_X86_PPRO_FENCE=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=4
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_CYRIX_32=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_TRANSMETA_32=y
CONFIG_CPU_SUP_UMC_32=y
# CONFIG_X86_DS is not set
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
# CONFIG_IOMMU_HELPER is not set
# CONFIG_IOMMU_API is not set
CONFIG_NR_CPUS=8
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_NONFATAL is not set
# CONFIG_X86_MCE_P4THERMAL is not set
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_X86_REBOOTFIXUPS is not set
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
# CONFIG_X86_CPU_DEBUG is not set
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_PAGE_OFFSET=0xC0000000
CONFIG_HIGHMEM=y
# CONFIG_ARCH_PHYS_ADDR_T_64BIT is not set
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_SPARSEMEM_STATIC=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_PHYS_ADDR_T_64BIT is not set
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_UNEVICTABLE_LRU=y
CONFIG_HAVE_MLOCK=y
CONFIG_HAVE_MLOCKED_PAGE_BIT=y
CONFIG_HIGHPTE=y
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW_64K=y
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
# CONFIG_X86_PAT is not set
CONFIG_EFI=y
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x400000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management and ACPI options
#
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_VERBOSE is not set
CONFIG_CAN_PM_TRACE=y
# CONFIG_PM_TRACE_RTC is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_HIBERNATION is not set
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
# CONFIG_ACPI_PROCFS is not set
# CONFIG_ACPI_PROCFS_POWER is not set
CONFIG_ACPI_SYSFS_POWER=y
# CONFIG_ACPI_PROC_EVENT is not set
CONFIG_ACPI_AC=m
# CONFIG_ACPI_BATTERY is not set
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=1999
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_SBS is not set
CONFIG_X86_APM_BOOT=y
CONFIG_APM=y
# CONFIG_APM_IGNORE_USER_SUSPEND is not set
# CONFIG_APM_DO_ENABLE is not set
CONFIG_APM_CPU_IDLE=y
# CONFIG_APM_DISPLAY_BLANK is not set
# CONFIG_APM_ALLOW_INTS is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_DEBUG=y
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

#
# CPUFreq processor drivers
#
# CONFIG_X86_ACPI_CPUFREQ is not set
# CONFIG_X86_POWERNOW_K6 is not set
# CONFIG_X86_POWERNOW_K7 is not set
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_GX_SUSPMOD is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_SPEEDSTEP_ICH=y
CONFIG_X86_SPEEDSTEP_SMI=y
# CONFIG_X86_P4_CLOCKMOD is not set
# CONFIG_X86_CPUFREQ_NFORCE2 is not set
# CONFIG_X86_LONGRUN is not set
# CONFIG_X86_LONGHAUL is not set
# CONFIG_X86_E_POWERSAVER is not set

#
# shared options
#
CONFIG_X86_SPEEDSTEP_LIB=y
# CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
# CONFIG_PCI_GODIRECT is not set
# CONFIG_PCI_GOOLPC is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=m
CONFIG_PCIEAER=y
# CONFIG_PCIEASPM is not set
CONFIG_ARCH_SUPPORTS_MSI=y
# CONFIG_PCI_MSI is not set
CONFIG_PCI_LEGACY=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_STUB is not set
CONFIG_HT_IRQ=y
# CONFIG_PCI_IOV is not set
CONFIG_ISA_DMA_API=y
CONFIG_ISA=y
# CONFIG_EISA is not set
# CONFIG_MCA is not set
# CONFIG_SCx200 is not set
# CONFIG_OLPC is not set
CONFIG_PCCARD=y
# CONFIG_PCMCIA_DEBUG is not set
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
# CONFIG_PCMCIA_IOCTL is not set
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
# CONFIG_I82365 is not set
# CONFIG_TCIC is not set
CONFIG_PCMCIA_PROBE=y
CONFIG_PCCARD_NONSTATIC=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_FAKE=m
# CONFIG_HOTPLUG_PCI_COMPAQ is not set
# CONFIG_HOTPLUG_PCI_IBM is not set
CONFIG_HOTPLUG_PCI_ACPI=m
CONFIG_HOTPLUG_PCI_ACPI_IBM=m
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_HAVE_AOUT=y
# CONFIG_BINFMT_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_HAVE_ATOMIC_IOMAP=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=m
# CONFIG_NET_IPGRE is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
CONFIG_INET_TUNNEL=m
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
CONFIG_INET_LRO=m
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=m
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
CONFIG_TCP_CONG_HSTCP=m
CONFIG_TCP_CONG_HYBLA=m
# CONFIG_TCP_CONG_VEGAS is not set
CONFIG_TCP_CONG_SCALABLE=m
CONFIG_TCP_CONG_LP=m
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
CONFIG_TCP_CONG_ILLINOIS=m
# CONFIG_DEFAULT_BIC is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
# CONFIG_IPV6 is not set
# CONFIG_NETWORK_SECMARK is not set
# CONFIG_NETFILTER is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
CONFIG_STP=m
CONFIG_BRIDGE=m
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_PHONET is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_CLS_CGROUP=y
# CONFIG_NET_EMATCH is not set
# CONFIG_NET_CLS_ACT is not set
CONFIG_NET_SCH_FIFO=y
# CONFIG_DCB is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
# CONFIG_NET_DROP_MONITOR is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
# CONFIG_WIRELESS is not set
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set
# CONFIG_NET_9P is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_CONNECTOR is not set
# CONFIG_MTD is not set
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_SERIAL=m
# CONFIG_PARPORT_PC_FIFO is not set
# CONFIG_PARPORT_PC_SUPERIO is not set
CONFIG_PARPORT_PC_PCMCIA=m
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
CONFIG_PARPORT_1284=y
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_ISAPNP=y
# CONFIG_PNPBIOS is not set
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_XD is not set
CONFIG_PARIDE=m

#
# Parallel IDE high-level drivers
#
CONFIG_PARIDE_PD=m
CONFIG_PARIDE_PCD=m
CONFIG_PARIDE_PF=m
# CONFIG_PARIDE_PT is not set
CONFIG_PARIDE_PG=m

#
# Parallel IDE protocol modules
#
# CONFIG_PARIDE_ATEN is not set
# CONFIG_PARIDE_BPCK is not set
# CONFIG_PARIDE_BPCK6 is not set
# CONFIG_PARIDE_COMM is not set
# CONFIG_PARIDE_DSTR is not set
# CONFIG_PARIDE_FIT2 is not set
# CONFIG_PARIDE_FIT3 is not set
# CONFIG_PARIDE_EPAT is not set
# CONFIG_PARIDE_EPIA is not set
# CONFIG_PARIDE_FRIQ is not set
# CONFIG_PARIDE_FRPW is not set
# CONFIG_PARIDE_KBIC is not set
# CONFIG_PARIDE_KTTI is not set
# CONFIG_PARIDE_ON20 is not set
# CONFIG_PARIDE_ON26 is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
CONFIG_BLK_DEV_NBD=m
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_BLK_DEV_HD is not set
CONFIG_MISC_DEVICES=y
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_ISL29003 is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
CONFIG_EEPROM_93CX6=m
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=m
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=m
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=m
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
CONFIG_CHR_DEV_SCH=m

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
# CONFIG_SCSI_CONSTANTS is not set
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
# CONFIG_SCSI_FC_TGT_ATTRS is not set
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_ATA=y
CONFIG_SCSI_SAS_HOST_SMP=y
# CONFIG_SCSI_SAS_LIBSAS_DEBUG is not set
CONFIG_SCSI_SRP_ATTRS=m
# CONFIG_SCSI_SRP_TGT_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_7000FASST is not set
CONFIG_SCSI_ACARD=m
# CONFIG_SCSI_AHA152X is not set
# CONFIG_SCSI_AHA1542 is not set
# CONFIG_SCSI_AACRAID is not set
CONFIG_SCSI_AIC7XXX=m
CONFIG_AIC7XXX_CMDS_PER_DEVICE=4
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
# CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC7XXX_OLD=m
CONFIG_SCSI_AIC79XX=m
CONFIG_AIC79XX_CMDS_PER_DEVICE=4
CONFIG_AIC79XX_RESET_DELAY_MS=15000
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC94XX=m
# CONFIG_AIC94XX_DEBUG is not set
# CONFIG_SCSI_DPT_I2O is not set
CONFIG_SCSI_ADVANSYS=m
# CONFIG_SCSI_IN2000 is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_MPT2SAS is not set
# CONFIG_SCSI_HPTIOP is not set
CONFIG_SCSI_BUSLOGIC=m
# CONFIG_SCSI_FLASHPOINT is not set
# CONFIG_LIBFC is not set
# CONFIG_LIBFCOE is not set
# CONFIG_FCOE is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_DTC3280 is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
CONFIG_SCSI_GDTH=m
# CONFIG_SCSI_GENERIC_NCR5380 is not set
# CONFIG_SCSI_GENERIC_NCR5380_MMIO is not set
CONFIG_SCSI_IPS=m
CONFIG_SCSI_INITIO=m
CONFIG_SCSI_INIA100=m
CONFIG_SCSI_PPA=m
CONFIG_SCSI_IMM=m
# CONFIG_SCSI_IZIP_EPP16 is not set
# CONFIG_SCSI_IZIP_SLOW_CTR is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_NCR53C406A is not set
# CONFIG_SCSI_STEX is not set
CONFIG_SCSI_SYM53C8XX_2=m
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
CONFIG_SCSI_SYM53C8XX_MMIO=y
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_PAS16 is not set
# CONFIG_SCSI_QLOGIC_FAS is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_SYM53C416 is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_T128 is not set
# CONFIG_SCSI_U14_34F is not set
# CONFIG_SCSI_ULTRASTOR is not set
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_SRP is not set
CONFIG_SCSI_LOWLEVEL_PCMCIA=y
# CONFIG_PCMCIA_AHA152X is not set
# CONFIG_PCMCIA_FDOMAIN is not set
# CONFIG_PCMCIA_NINJA_SCSI is not set
CONFIG_PCMCIA_QLOGIC=m
# CONFIG_PCMCIA_SYM53C500 is not set
# CONFIG_SCSI_DH is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=m
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=m
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=m
# CONFIG_SATA_MV is not set
CONFIG_SATA_NV=m
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
CONFIG_SATA_SIS=m
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_PATA_ACPI is not set
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
CONFIG_PATA_ATIIXP=m
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5535 is not set
# CONFIG_PATA_CS5536 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_ISAPNP is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_LEGACY is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
CONFIG_PATA_MPIIX=m
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
CONFIG_PATA_PCMCIA=m
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_QDI is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
CONFIG_PATA_SIS=m
CONFIG_PATA_VIA=m
# CONFIG_PATA_WINBOND is not set
# CONFIG_PATA_WINBOND_VLB is not set
# CONFIG_PATA_SCH is not set
# CONFIG_MD is not set
CONFIG_FUSION=y
CONFIG_FUSION_SPI=m
CONFIG_FUSION_FC=m
# CONFIG_FUSION_SAS is not set
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_CTL=m
CONFIG_FUSION_LAN=m
CONFIG_FUSION_LOGGING=y

#
# IEEE 1394 (FireWire) support
#

#
# Enable only one of the two stacks, unless you know what you are doing
#
CONFIG_FIREWIRE=m
CONFIG_FIREWIRE_OHCI=m
CONFIG_FIREWIRE_OHCI_DEBUG=y
CONFIG_FIREWIRE_SBP2=m
# CONFIG_IEEE1394 is not set
CONFIG_I2O=m
# CONFIG_I2O_LCT_NOTIFY_ON_CHANGES is not set
CONFIG_I2O_EXT_ADAPTEC=y
CONFIG_I2O_CONFIG=m
CONFIG_I2O_CONFIG_OLD_IOCTL=y
CONFIG_I2O_BUS=m
CONFIG_I2O_BLOCK=m
CONFIG_I2O_SCSI=m
CONFIG_I2O_PROC=m
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_COMPAT_NET_DEV_OPS=y
CONFIG_DUMMY=m
CONFIG_BONDING=m
# CONFIG_MACVLAN is not set
# CONFIG_EQUALIZER is not set
CONFIG_TUN=m
# CONFIG_VETH is not set
# CONFIG_NET_SB1000 is not set
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=m

#
# MII PHY device drivers
#
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_QSEMI_PHY is not set
CONFIG_LXT_PHY=m
# CONFIG_CICADA_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=m
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
CONFIG_NET_VENDOR_3COM=y
# CONFIG_EL1 is not set
# CONFIG_EL2 is not set
# CONFIG_ELPLUS is not set
# CONFIG_EL16 is not set
CONFIG_EL3=m
# CONFIG_3C515 is not set
CONFIG_VORTEX=m
CONFIG_TYPHOON=m
# CONFIG_LANCE is not set
CONFIG_NET_VENDOR_SMC=y
# CONFIG_WD80x3 is not set
# CONFIG_ULTRA is not set
# CONFIG_SMC9194 is not set
# CONFIG_ETHOC is not set
# CONFIG_NET_VENDOR_RACAL is not set
# CONFIG_DNET is not set
CONFIG_NET_TULIP=y
CONFIG_DE2104X=m
CONFIG_TULIP=m
# CONFIG_TULIP_MWI is not set
CONFIG_TULIP_MMIO=y
# CONFIG_TULIP_NAPI is not set
CONFIG_DE4X5=m
CONFIG_WINBOND_840=m
CONFIG_DM9102=m
CONFIG_ULI526X=m
CONFIG_PCMCIA_XIRCOM=m
# CONFIG_AT1700 is not set
# CONFIG_DEPCA is not set
# CONFIG_HP100 is not set
CONFIG_NET_ISA=y
# CONFIG_E2100 is not set
# CONFIG_EWRK3 is not set
# CONFIG_EEXPRESS is not set
# CONFIG_EEXPRESS_PRO is not set
# CONFIG_HPLAN_PLUS is not set
# CONFIG_HPLAN is not set
# CONFIG_LP486E is not set
# CONFIG_ETH16I is not set
CONFIG_NE2000=m
# CONFIG_ZNET is not set
# CONFIG_SEEQ8005 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
CONFIG_PCNET32=m
CONFIG_AMD8111_ETH=m
CONFIG_ADAPTEC_STARFIRE=m
# CONFIG_AC3200 is not set
# CONFIG_APRICOT is not set
CONFIG_B44=m
CONFIG_B44_PCI_AUTOSELECT=y
CONFIG_B44_PCICORE_AUTOSELECT=y
CONFIG_B44_PCI=y
CONFIG_FORCEDETH=m
CONFIG_FORCEDETH_NAPI=y
# CONFIG_CS89x0 is not set
CONFIG_E100=m
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
CONFIG_NE2K_PCI=m
# CONFIG_8139CP is not set
CONFIG_8139TOO=m
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_R6040 is not set
CONFIG_SIS900=m
# CONFIG_EPIC100 is not set
# CONFIG_SMSC9420 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
CONFIG_VIA_RHINE=m
CONFIG_VIA_RHINE_MMIO=y
# CONFIG_SC92031 is not set
CONFIG_NET_POCKET=y
CONFIG_ATP=m
CONFIG_DE600=m
CONFIG_DE620=m
# CONFIG_ATL2 is not set
CONFIG_NETDEV_1000=y
CONFIG_ACENIC=m
# CONFIG_ACENIC_OMIT_TIGON_I is not set
# CONFIG_DL2K is not set
CONFIG_E1000=m
CONFIG_E1000E=m
# CONFIG_IP1000 is not set
# CONFIG_IGB is not set
# CONFIG_IGBVF is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_R8169=m
# CONFIG_SIS190 is not set
CONFIG_SKGE=m
# CONFIG_SKGE_DEBUG is not set
CONFIG_SKY2=m
# CONFIG_SKY2_DEBUG is not set
CONFIG_VIA_VELOCITY=m
# CONFIG_TIGON3 is not set
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_JME is not set
# CONFIG_NETDEV_10000 is not set
# CONFIG_TR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
# CONFIG_WLAN_80211 is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
CONFIG_USB_USBNET=m
CONFIG_USB_NET_AX8817X=m
CONFIG_USB_NET_CDCETHER=m
CONFIG_USB_NET_DM9601=m
# CONFIG_USB_NET_SMSC95XX is not set
CONFIG_USB_NET_GL620A=m
CONFIG_USB_NET_NET1080=m
# CONFIG_USB_NET_PLUSB is not set
# CONFIG_USB_NET_MCS7830 is not set
# CONFIG_USB_NET_RNDIS_HOST is not set
CONFIG_USB_NET_CDC_SUBSET=m
CONFIG_USB_ALI_M5632=y
CONFIG_USB_AN2720=y
CONFIG_USB_BELKIN=y
CONFIG_USB_ARMLINUX=y
CONFIG_USB_EPSON2888=y
CONFIG_USB_KC2190=y
# CONFIG_USB_NET_ZAURUS is not set
CONFIG_NET_PCMCIA=y
# CONFIG_PCMCIA_3C589 is not set
# CONFIG_PCMCIA_3C574 is not set
# CONFIG_PCMCIA_FMVJ18X is not set
CONFIG_PCMCIA_PCNET=m
CONFIG_PCMCIA_NMCLAN=m
CONFIG_PCMCIA_SMC91C92=m
# CONFIG_PCMCIA_XIRC2PS is not set
# CONFIG_PCMCIA_AXNET is not set
# CONFIG_WAN is not set
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
# CONFIG_SKFP is not set
# CONFIG_HIPPI is not set
CONFIG_PLIP=m
CONFIG_PPP=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
# CONFIG_PPP_BSDCOMP is not set
# CONFIG_PPP_MPPE is not set
CONFIG_PPPOE=m
# CONFIG_PPPOL2TP is not set
CONFIG_SLIP=m
CONFIG_SLIP_COMPRESSED=y
CONFIG_SLHC=m
CONFIG_SLIP_SMART=y
# CONFIG_SLIP_MODE_SLIP6 is not set
CONFIG_NET_FC=y
CONFIG_NETCONSOLE=m
# CONFIG_NETCONSOLE_DYNAMIC is not set
CONFIG_NETPOLL=y
CONFIG_NETPOLL_TRAP=y
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=m

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_SERIAL=m
CONFIG_MOUSE_APPLETOUCH=m
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_INPORT is not set
# CONFIG_MOUSE_LOGIBM is not set
# CONFIG_MOUSE_PC110PAD is not set
CONFIG_MOUSE_VSXXXAA=m
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_WISTRON_BTNS is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=m

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_DEVKMEM=y
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_COMPUTONE is not set
CONFIG_ROCKETPORT=m
CONFIG_CYCLADES=m
# CONFIG_CYZ_INTR is not set
# CONFIG_DIGIEPCA is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_ISI is not set
# CONFIG_SYNCLINK is not set
CONFIG_SYNCLINKMP=m
CONFIG_SYNCLINK_GT=m
# CONFIG_N_HDLC is not set
# CONFIG_RISCOM8 is not set
# CONFIG_SPECIALIX is not set
# CONFIG_SX is not set
# CONFIG_RIO is not set
# CONFIG_STALDRV is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CS=m
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
# CONFIG_SERIAL_8250_FOURPORT is not set
# CONFIG_SERIAL_8250_ACCENT is not set
# CONFIG_SERIAL_8250_BOCA is not set
# CONFIG_SERIAL_8250_EXAR_ST16C554 is not set
# CONFIG_SERIAL_8250_HUB6 is not set
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_JSM=m
CONFIG_UNIX98_PTYS=y
# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set
# CONFIG_LEGACY_PTYS is not set
CONFIG_PRINTER=m
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=m
CONFIG_IPMI_HANDLER=m
# CONFIG_IPMI_PANIC_EVENT is not set
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_HW_RANDOM_GEODE=m
CONFIG_HW_RANDOM_VIA=m
CONFIG_NVRAM=y
CONFIG_RTC=y
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
# CONFIG_IPWIRELESS is not set
CONFIG_MWAVE=m
# CONFIG_PC8736x_GPIO is not set
# CONFIG_NSC_GPIO is not set
# CONFIG_CS5535_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
CONFIG_HANGCHECK_TIMER=m
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCA=m

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
CONFIG_I2C_ALI1535=m
CONFIG_I2C_ALI1563=m
CONFIG_I2C_ALI15X3=m
CONFIG_I2C_AMD756=m
CONFIG_I2C_AMD756_S4882=m
# CONFIG_I2C_AMD8111 is not set
CONFIG_I2C_I801=m
# CONFIG_I2C_ISCH is not set
CONFIG_I2C_PIIX4=m
CONFIG_I2C_NFORCE2=m
# CONFIG_I2C_NFORCE2_S4985 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
CONFIG_I2C_VIA=m
CONFIG_I2C_VIAPRO=m

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_OCORES is not set
CONFIG_I2C_SIMTEC=m

#
# External I2C/SMBus adapter drivers
#
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Graphics adapter I2C/DDC channel drivers
#
CONFIG_I2C_VOODOO3=m

#
# Other I2C/SMBus bus drivers
#
CONFIG_I2C_PCA_ISA=m
# CONFIG_I2C_PCA_PLATFORM is not set
CONFIG_I2C_STUB=m
# CONFIG_SCx200_ACB is not set

#
# Miscellaneous I2C Chip support
#
# CONFIG_DS1682 is not set
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_PCF8575 is not set
# CONFIG_SENSORS_PCA9539 is not set
CONFIG_SENSORS_MAX6875=m
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
# CONFIG_SPI is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_BQ27x00 is not set
CONFIG_HWMON=m
CONFIG_HWMON_VID=m
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
CONFIG_SENSORS_AD7418=m
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7473 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATK0110 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHER is not set
# CONFIG_SENSORS_FSCPOS is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
CONFIG_SENSORS_CORETEMP=m
# CONFIG_SENSORS_IBMAEM is not set
# CONFIG_SENSORS_IBMPEX is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_PCF8591 is not set
CONFIG_SENSORS_SIS5595=m
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_THMC50 is not set
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_VT1211=m
CONFIG_SENSORS_VT8231=m
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
CONFIG_SENSORS_HDAPS=m
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_HWMON_DEBUG_CHIP is not set
CONFIG_THERMAL=y
# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
CONFIG_SSB=m
CONFIG_SSB_SPROM=y
CONFIG_SSB_PCIHOST_POSSIBLE=y
CONFIG_SSB_PCIHOST=y
# CONFIG_SSB_B43_PCI_BRIDGE is not set
CONFIG_SSB_PCMCIAHOST_POSSIBLE=y
CONFIG_SSB_PCMCIAHOST=y
# CONFIG_SSB_DEBUG is not set
CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
CONFIG_SSB_DRIVER_PCICORE=y

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_REGULATOR is not set

#
# Multimedia devices
#

#
# Multimedia core support
#
CONFIG_VIDEO_DEV=m
CONFIG_VIDEO_V4L2_COMMON=m
CONFIG_VIDEO_ALLOW_V4L1=y
CONFIG_VIDEO_V4L1_COMPAT=y
# CONFIG_DVB_CORE is not set
CONFIG_VIDEO_MEDIA=m

#
# Multimedia drivers
#
# CONFIG_MEDIA_ATTACH is not set
CONFIG_MEDIA_TUNER=m
# CONFIG_MEDIA_TUNER_CUSTOMISE is not set
CONFIG_MEDIA_TUNER_SIMPLE=m
CONFIG_MEDIA_TUNER_TDA8290=m
CONFIG_MEDIA_TUNER_TDA9887=m
CONFIG_MEDIA_TUNER_TEA5761=m
CONFIG_MEDIA_TUNER_TEA5767=m
CONFIG_MEDIA_TUNER_MT20XX=m
CONFIG_MEDIA_TUNER_XC2028=m
CONFIG_MEDIA_TUNER_XC5000=m
CONFIG_MEDIA_TUNER_MC44S803=m
CONFIG_VIDEO_V4L2=m
CONFIG_VIDEO_V4L1=m
CONFIG_VIDEOBUF_GEN=m
CONFIG_VIDEOBUF_DMA_SG=m
CONFIG_VIDEO_BTCX=m
CONFIG_VIDEO_IR=m
CONFIG_VIDEO_TVEEPROM=m
CONFIG_VIDEO_TUNER=m
CONFIG_VIDEO_CAPTURE_DRIVERS=y
# CONFIG_VIDEO_ADV_DEBUG is not set
# CONFIG_VIDEO_FIXED_MINOR_RANGES is not set
# CONFIG_VIDEO_HELPER_CHIPS_AUTO is not set
CONFIG_VIDEO_IR_I2C=m

#
# Encoders/decoders and other helper chips
#

#
# Audio decoders
#
CONFIG_VIDEO_TVAUDIO=m
CONFIG_VIDEO_TDA7432=m
CONFIG_VIDEO_TDA9840=m
CONFIG_VIDEO_TDA9875=m
CONFIG_VIDEO_TEA6415C=m
CONFIG_VIDEO_TEA6420=m
CONFIG_VIDEO_MSP3400=m
# CONFIG_VIDEO_CS5345 is not set
CONFIG_VIDEO_CS53L32A=m
CONFIG_VIDEO_M52790=m
CONFIG_VIDEO_TLV320AIC23B=m
CONFIG_VIDEO_WM8775=m
CONFIG_VIDEO_WM8739=m
CONFIG_VIDEO_VP27SMPX=m

#
# RDS decoders
#
# CONFIG_VIDEO_SAA6588 is not set

#
# Video decoders
#
CONFIG_VIDEO_BT819=m
CONFIG_VIDEO_BT856=m
CONFIG_VIDEO_BT866=m
CONFIG_VIDEO_KS0127=m
CONFIG_VIDEO_OV7670=m
# CONFIG_VIDEO_TCM825X is not set
CONFIG_VIDEO_SAA7110=m
CONFIG_VIDEO_SAA711X=m
CONFIG_VIDEO_SAA717X=m
CONFIG_VIDEO_SAA7191=m
# CONFIG_VIDEO_TVP514X is not set
CONFIG_VIDEO_TVP5150=m
CONFIG_VIDEO_VPX3220=m

#
# Video and audio decoders
#
CONFIG_VIDEO_CX25840=m

#
# MPEG video encoders
#
CONFIG_VIDEO_CX2341X=m

#
# Video encoders
#
CONFIG_VIDEO_SAA7127=m
CONFIG_VIDEO_SAA7185=m
CONFIG_VIDEO_ADV7170=m
CONFIG_VIDEO_ADV7175=m

#
# Video improvement chips
#
CONFIG_VIDEO_UPD64031A=m
CONFIG_VIDEO_UPD64083=m
# CONFIG_VIDEO_VIVI is not set
CONFIG_VIDEO_BT848=m
# CONFIG_VIDEO_PMS is not set
# CONFIG_VIDEO_BWQCAM is not set
# CONFIG_VIDEO_CQCAM is not set
# CONFIG_VIDEO_W9966 is not set
CONFIG_VIDEO_CPIA=m
CONFIG_VIDEO_CPIA_PP=m
CONFIG_VIDEO_CPIA_USB=m
CONFIG_VIDEO_CPIA2=m
# CONFIG_VIDEO_SAA5246A is not set
# CONFIG_VIDEO_SAA5249 is not set
# CONFIG_VIDEO_STRADIS is not set
CONFIG_VIDEO_ZORAN=m
# CONFIG_VIDEO_ZORAN_DC30 is not set
CONFIG_VIDEO_ZORAN_ZR36060=m
CONFIG_VIDEO_ZORAN_BUZ=m
# CONFIG_VIDEO_ZORAN_DC10 is not set
CONFIG_VIDEO_ZORAN_LML33=m
# CONFIG_VIDEO_ZORAN_LML33R10 is not set
# CONFIG_VIDEO_ZORAN_AVS6EYES is not set
# CONFIG_VIDEO_SAA7134 is not set
# CONFIG_VIDEO_MXB is not set
# CONFIG_VIDEO_HEXIUM_ORION is not set
# CONFIG_VIDEO_HEXIUM_GEMINI is not set
# CONFIG_VIDEO_CX88 is not set
CONFIG_VIDEO_IVTV=m
# CONFIG_VIDEO_FB_IVTV is not set
# CONFIG_VIDEO_CAFE_CCIC is not set
# CONFIG_SOC_CAMERA is not set
# CONFIG_V4L_USB_DRIVERS is not set
CONFIG_RADIO_ADAPTERS=y
# CONFIG_RADIO_CADET is not set
# CONFIG_RADIO_RTRACK is not set
# CONFIG_RADIO_RTRACK2 is not set
# CONFIG_RADIO_AZTECH is not set
# CONFIG_RADIO_GEMTEK is not set
# CONFIG_RADIO_GEMTEK_PCI is not set
CONFIG_RADIO_MAXIRADIO=m
CONFIG_RADIO_MAESTRO=m
# CONFIG_RADIO_SF16FMI is not set
# CONFIG_RADIO_SF16FMR2 is not set
# CONFIG_RADIO_TERRATEC is not set
# CONFIG_RADIO_TRUST is not set
# CONFIG_RADIO_TYPHOON is not set
# CONFIG_RADIO_ZOLTRIX is not set
CONFIG_USB_DSBR=m
# CONFIG_USB_SI470X is not set
# CONFIG_USB_MR800 is not set
# CONFIG_RADIO_TEA5764 is not set
CONFIG_DAB=y
CONFIG_USB_DABUSB=m

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_ALI=y
CONFIG_AGP_ATI=y
# CONFIG_AGP_AMD is not set
# CONFIG_AGP_AMD64 is not set
CONFIG_AGP_INTEL=y
CONFIG_AGP_NVIDIA=y
CONFIG_AGP_SIS=y
# CONFIG_AGP_SWORKS is not set
CONFIG_AGP_VIA=y
CONFIG_AGP_EFFICEON=y
CONFIG_DRM=m
CONFIG_DRM_TDFX=m
CONFIG_DRM_R128=m
CONFIG_DRM_RADEON=m
CONFIG_DRM_I810=m
CONFIG_DRM_I830=m
CONFIG_DRM_I915=m
# CONFIG_DRM_I915_KMS is not set
# CONFIG_DRM_MGA is not set
CONFIG_DRM_SIS=m
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
CONFIG_VGASTATE=m
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DDC=m
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
CONFIG_FB_SVGALIB=m
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
CONFIG_FB_VESA=y
# CONFIG_FB_EFI is not set
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
# CONFIG_FB_NVIDIA_DEBUG is not set
CONFIG_FB_NVIDIA_BACKLIGHT=y
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I810 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_INTEL is not set
# CONFIG_FB_MATROX is not set
CONFIG_FB_RADEON=m
CONFIG_FB_RADEON_I2C=y
CONFIG_FB_RADEON_BACKLIGHT=y
# CONFIG_FB_RADEON_DEBUG is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
CONFIG_FB_S3=m
CONFIG_FB_SAVAGE=m
CONFIG_FB_SAVAGE_I2C=y
CONFIG_FB_SAVAGE_ACCEL=y
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
CONFIG_FB_TRIDENT=m
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_LCD_CLASS_DEVICE=m
# CONFIG_LCD_ILI9320 is not set
# CONFIG_LCD_PLATFORM is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_BACKLIGHT_GENERIC=y
CONFIG_BACKLIGHT_PROGEAR=m
# CONFIG_BACKLIGHT_MBP_NVIDIA is not set
# CONFIG_BACKLIGHT_SAHARA is not set

#
# Display device support
#
CONFIG_DISPLAY_SUPPORT=m

#
# Display hardware drivers
#

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
# CONFIG_MDA_CONSOLE is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_SOUND is not set
# CONFIG_HID_SUPPORT is not set
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
# CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_DEVICE_CLASS is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
CONFIG_USB_SUSPEND=y
# CONFIG_USB_OTG is not set
# CONFIG_USB_MON is not set
# CONFIG_USB_WUSB is not set
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_EHCI_HCD=m
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
CONFIG_USB_OHCI_HCD=m
# CONFIG_USB_OHCI_HCD_SSB is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=m
# CONFIG_USB_U132_HCD is not set
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_WHCI_HCD is not set
# CONFIG_USB_HWA_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
CONFIG_USB_STORAGE_DATAFAB=m
CONFIG_USB_STORAGE_FREECOM=m
# CONFIG_USB_STORAGE_ISD200 is not set
CONFIG_USB_STORAGE_USBAT=m
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
# CONFIG_USB_LIBUSUAL is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
# CONFIG_USB_USS720 is not set
CONFIG_USB_SERIAL=m
CONFIG_USB_EZUSB=y
CONFIG_USB_SERIAL_GENERIC=y
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_CH341 is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP210X is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
CONFIG_USB_SERIAL_EMPEG=m
# CONFIG_USB_SERIAL_FTDI_SIO is not set
# CONFIG_USB_SERIAL_FUNSOFT is not set
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_IUU is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
CONFIG_USB_SERIAL_KEYSPAN=m
# CONFIG_USB_SERIAL_KEYSPAN_MPR is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA28 is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA28X is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA28XA is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA28XB is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA19 is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA18X is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA19W is not set
CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
CONFIG_USB_SERIAL_KEYSPAN_USA49W=y
CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_MOTOROLA is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
# CONFIG_USB_SERIAL_PL2303 is not set
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_QUALCOMM is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
# CONFIG_USB_SERIAL_HP4X is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIEMENS_MPI is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_SYMBOL is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_XIRCOM is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_OPTICON is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_BERRY_CHARGE is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
CONFIG_USB_FTDI_ELAN=m
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_VST is not set
# CONFIG_USB_GADGET is not set

#
# OTG and related infrastructure
#
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
# CONFIG_LEDS_ALIX2 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_LP5521 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_BD2802 is not set

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
CONFIG_LEDS_TRIGGER_TIMER=m
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
# CONFIG_EDAC is not set
# CONFIG_RTC_CLASS is not set
# CONFIG_DMADEVICES is not set
# CONFIG_AUXDISPLAY is not set
CONFIG_UIO=m
# CONFIG_UIO_CIF is not set
# CONFIG_UIO_PDRV is not set
# CONFIG_UIO_PDRV_GENIRQ is not set
# CONFIG_UIO_SMX is not set
# CONFIG_UIO_AEC is not set
# CONFIG_UIO_SERCOS3 is not set
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_TC1100_WMI is not set
# CONFIG_MSI_LAPTOP is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ACPI_WMI is not set
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_TOSHIBA is not set

#
# Firmware Drivers
#
CONFIG_EDD=m
# CONFIG_EDD_OFF is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_EFI_VARS=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
# CONFIG_ISCSI_IBFT_FIND is not set

#
# File systems
#
CONFIG_EXT2_FS=m
# CONFIG_EXT2_FS_XATTR is not set
CONFIG_EXT2_FS_XIP=y
CONFIG_EXT3_FS=m
# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=m
CONFIG_EXT4DEV_COMPAT=y
CONFIG_EXT4_FS_XATTR=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
CONFIG_FS_XIP=y
CONFIG_JBD=m
# CONFIG_JBD_DEBUG is not set
CONFIG_JBD2=m
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=m
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m
CONFIG_GENERIC_ACL=y

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="ascii"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=m
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_CRAMFS=m
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
CONFIG_ROMFS_FS=m
CONFIG_ROMFS_BACKED_BY_BLOCK=y
# CONFIG_ROMFS_BACKED_BY_MTD is not set
# CONFIG_ROMFS_BACKED_BY_BOTH is not set
CONFIG_ROMFS_ON_BLOCK=y
# CONFIG_SYSV_FS is not set
CONFIG_UFS_FS=m
# CONFIG_UFS_FS_WRITE is not set
# CONFIG_UFS_DEBUG is not set
# CONFIG_NILFS2_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFSD is not set
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_RPCSEC_GSS_KRB5=m
# CONFIG_RPCSEC_GSS_SPKM3 is not set
# CONFIG_SMB_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
# CONFIG_LDM_PARTITION is not set
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
# CONFIG_KARMA_PARTITION is not set
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
CONFIG_NLS_CODEPAGE_850=m
CONFIG_NLS_CODEPAGE_852=m
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
CONFIG_NLS_CODEPAGE_863=m
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_CODEPAGE_932=m
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=m
CONFIG_NLS_CODEPAGE_1251=m
CONFIG_NLS_ASCII=y
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=m
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
# CONFIG_ENABLE_WARN_DEPRECATED is not set
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_FRAME_WARN=1024
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
CONFIG_HEADERS_CHECK=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SHIRQ=y
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_DETECT_HUNG_TASK=y
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_TIMER_STATS=y
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
CONFIG_DEBUG_PREEMPT=y
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
# CONFIG_LOCK_STAT is not set
CONFIG_DEBUG_LOCKDEP=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_HIGHMEM=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_LIST=y
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
CONFIG_SYSCTL_SYSCALL_CHECK=y
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FTRACE_NMI_ENTER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_FTRACE_SYSCALLS=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_RING_BUFFER=y
CONFIG_FTRACE_NMI_ENTER=y
CONFIG_TRACING=y
CONFIG_TRACING_SUPPORT=y

#
# Tracers
#
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_PREEMPT_TRACER=y
CONFIG_SYSPROF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_EVENT_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_BOOT_TRACER=y
# CONFIG_TRACE_BRANCH_PROFILING is not set
CONFIG_POWER_TRACER=y
CONFIG_STACK_TRACER=y
# CONFIG_KMEMTRACE is not set
CONFIG_WORKQUEUE_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
CONFIG_MMIOTRACE=y
CONFIG_MMIOTRACE_TEST=m
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_FIREWIRE_OHCI_REMOTE_DMA is not set
# CONFIG_BUILD_DOCSRC is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_DMA_API_DEBUG is not set
CONFIG_SAMPLES=y
# CONFIG_SAMPLE_MARKERS is not set
# CONFIG_SAMPLE_TRACEPOINTS is not set
CONFIG_SAMPLE_KOBJECT=m
CONFIG_SAMPLE_KPROBES=m
CONFIG_SAMPLE_KRETPROBES=m
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
# CONFIG_STRICT_DEVMEM is not set
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_X86_PTDUMP is not set
CONFIG_DEBUG_RODATA=y
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_DEBUG_NX_TEST is not set
CONFIG_4KSTACKS=y
CONFIG_DOUBLEFAULT=y
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
# CONFIG_OPTIMIZE_INLINING is not set

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set
# CONFIG_SECURITYFS is not set
# CONFIG_SECURITY_FILE_CAPABILITIES is not set
# CONFIG_IMA is not set
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
# CONFIG_CRYPTO_FIPS is not set
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=m
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_GF128MUL is not set
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_WORKQUEUE=y
# CONFIG_CRYPTO_CRYPTD is not set
# CONFIG_CRYPTO_AUTHENC is not set
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=m
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
# CONFIG_CRYPTO_HMAC is not set
# CONFIG_CRYPTO_XCBC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32C_INTEL is not set
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=m
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=m
# CONFIG_CRYPTO_AES_586 is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
CONFIG_CRYPTO_DES=m
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_586 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_586 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_ZLIB is not set
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_HW is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
# CONFIG_VIRTUALIZATION is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
# CONFIG_CRC_T10DIF is not set
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
# CONFIG_CRC7 is not set
# CONFIG_LIBCRC32C is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=m
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_NLATTR=y

[-- Attachment #3: dmesg.txt --]
[-- Type: text/plain, Size: 90566 bytes --]

Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.30-rc4-io (root-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org) (gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)) #6 SMP PREEMPT Thu May 7 11:07:49 CST 2009
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  NSC Geode by NSC
  Cyrix CyrixInstead
  Centaur CentaurHauls
  Transmeta GenuineTMx86
  Transmeta TransmetaCPU
  UMC UMC UMC UMC
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
 BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003bff0000 (usable)
 BIOS-e820: 000000003bff0000 - 000000003bff3000 (ACPI NVS)
 BIOS-e820: 000000003bff3000 - 000000003c000000 (ACPI data)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
DMI 2.3 present.
Phoenix BIOS detected: BIOS may corrupt low RAM, working around it.
e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
last_pfn = 0x3bff0 max_arch_pfn = 0x100000
MTRR default type: uncachable
MTRR fixed ranges enabled:
  00000-9FFFF write-back
  A0000-BFFFF uncachable
  C0000-C7FFF write-protect
  C8000-FFFFF uncachable
MTRR variable ranges enabled:
  0 base 000000000 mask FC0000000 write-back
  1 base 03C000000 mask FFC000000 uncachable
  2 base 0D0000000 mask FF8000000 write-combining
  3 disabled
  4 disabled
  5 disabled
  6 disabled
  7 disabled
init_memory_mapping: 0000000000000000-00000000377fe000
 0000000000 - 0000400000 page 4k
 0000400000 - 0037400000 page 2M
 0037400000 - 00377fe000 page 4k
kernel direct mapping tables up to 377fe000 @ 10000-15000
RAMDISK: 37d0d000 - 37fefd69
Allocated new RAMDISK: 00100000 - 003e2d69
Move RAMDISK from 0000000037d0d000 - 0000000037fefd68 to 00100000 - 003e2d68
ACPI: RSDP 000f7560 00014 (v00 AWARD )
ACPI: RSDT 3bff3040 0002C (v01 AWARD  AWRDACPI 42302E31 AWRD 00000000)
ACPI: FACP 3bff30c0 00074 (v01 AWARD  AWRDACPI 42302E31 AWRD 00000000)
ACPI: DSDT 3bff3180 03ABC (v01 AWARD  AWRDACPI 00001000 MSFT 0100000E)
ACPI: FACS 3bff0000 00040
ACPI: APIC 3bff6c80 00084 (v01 AWARD  AWRDACPI 42302E31 AWRD 00000000)
ACPI: Local APIC address 0xfee00000
71MB HIGHMEM available.
887MB LOWMEM available.
  mapped low ram: 0 - 377fe000
  low ram: 0 - 377fe000
  node 0 low ram: 00000000 - 377fe000
  node 0 bootmap 00011000 - 00017f00
(9 early reservations) ==> bootmem [0000000000 - 00377fe000]
  #0 [0000000000 - 0000001000]   BIOS data page ==> [0000000000 - 0000001000]
  #1 [0000001000 - 0000002000]    EX TRAMPOLINE ==> [0000001000 - 0000002000]
  #2 [0000006000 - 0000007000]       TRAMPOLINE ==> [0000006000 - 0000007000]
  #3 [0000400000 - 0000c6bd1c]    TEXT DATA BSS ==> [0000400000 - 0000c6bd1c]
  #4 [000009f400 - 0000100000]    BIOS reserved ==> [000009f400 - 0000100000]
  #5 [0000c6c000 - 0000c700ed]              BRK ==> [0000c6c000 - 0000c700ed]
  #6 [0000010000 - 0000011000]          PGTABLE ==> [0000010000 - 0000011000]
  #7 [0000100000 - 00003e2d69]      NEW RAMDISK ==> [0000100000 - 00003e2d69]
  #8 [0000011000 - 0000018000]          BOOTMAP ==> [0000011000 - 0000018000]
found SMP MP-table at [c00f5ad0] f5ad0
Zone PFN ranges:
  DMA      0x00000010 -> 0x00001000
  Normal   0x00001000 -> 0x000377fe
  HighMem  0x000377fe -> 0x0003bff0
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
    0: 0x00000010 -> 0x0000009f
    0: 0x00000100 -> 0x0003bff0
On node 0 totalpages: 245631
free_area_init_node: node 0, pgdat c0778f80, node_mem_map c1000340
  DMA zone: 52 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 3931 pages, LIFO batch:0
  Normal zone: 2834 pages used for memmap
  Normal zone: 220396 pages, LIFO batch:31
  HighMem zone: 234 pages used for memmap
  HighMem zone: 18184 pages, LIFO batch:3
Using APIC driver default
ACPI: PM-Timer IO Port: 0x1008
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] disabled)
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 4, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 dfl dfl)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
SMP: Allowing 4 CPUs, 2 hotplug CPUs
nr_irqs_gsi: 24
Allocating PCI resources starting at 40000000 (gap: 3c000000:c2c00000)
NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:4 nr_node_ids:1
PERCPU: Embedded 13 pages at c1c3b000, static data 32756 bytes
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 242511
Kernel command line: ro root=LABEL=/ rhgb quiet
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
Preemptible RCU implementation.
NR_IRQS:512
CPU 0 irqstacks, hard=c1c3b000 soft=c1c3c000
PID hash table entries: 4096 (order: 12, 16384 bytes)
Fast TSC calibration using PIT
Detected 2800.222 MHz processor.
Console: colour VGA+ 80x25
console [tty0] enabled
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES:  8
... MAX_LOCK_DEPTH:          48
... MAX_LOCKDEP_KEYS:        8191
... CLASSHASH_SIZE:          4096
... MAX_LOCKDEP_ENTRIES:     8192
... MAX_LOCKDEP_CHAINS:      16384
... CHAINHASH_SIZE:          8192
 memory used by lock dependency info: 2847 kB
 per task-struct memory footprint: 1152 bytes
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
allocated 4914560 bytes of page_cgroup
please try cgroup_disable=memory,blkio option if you don't want
Initializing HighMem for node 0 (000377fe:0003bff0)
Memory: 952284k/982976k available (2258k kernel code, 30016k reserved, 1424k data, 320k init, 73672k highmem)
virtual kernel memory layout:
    fixmap  : 0xffedf000 - 0xfffff000   (1152 kB)
    pkmap   : 0xff800000 - 0xffc00000   (4096 kB)
    vmalloc : 0xf7ffe000 - 0xff7fe000   ( 120 MB)
    lowmem  : 0xc0000000 - 0xf77fe000   ( 887 MB)
      .init : 0xc079d000 - 0xc07ed000   ( 320 kB)
      .data : 0xc06349ab - 0xc0798cb8   (1424 kB)
      .text : 0xc0400000 - 0xc06349ab   (2258 kB)
Checking if this processor honours the WP bit even in supervisor mode...Ok.
SLUB: Genslabs=13, HWalign=128, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
Calibrating delay loop (skipped), value calculated using timer frequency.. 5600.44 BogoMIPS (lpj=2800222)
Mount-cache hash table entries: 512
Initializing cgroup subsys debug
Initializing cgroup subsys ns
Initializing cgroup subsys cpuacct
Initializing cgroup subsys memory
Initializing cgroup subsys blkio
Initializing cgroup subsys devices
Initializing cgroup subsys freezer
Initializing cgroup subsys net_cls
Initializing cgroup subsys io
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU0: Intel P4/Xeon Extended MCE MSRs (24) available
using mwait in idle threads.
Checking 'hlt' instruction... OK.
ACPI: Core revision 20090320
ftrace: converting mcount calls to 0f 1f 44 00 00
ftrace: allocating 12136 entries in 24 pages
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel(R) Pentium(R) D CPU 2.80GHz stepping 04
lockdep: fixing up alternatives.
CPU 1 irqstacks, hard=c1c4b000 soft=c1c4c000
Booting processor 1 APIC 0x1 ip 0x6000
Initializing CPU#1
Calibrating delay using timer specific routine.. 5599.23 BogoMIPS (lpj=2799617)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel P4/Xeon Extended MCE MSRs (24) available
CPU1: Intel(R) Pentium(R) D CPU 2.80GHz stepping 04
checking TSC synchronization [CPU#0 -> CPU#1]: passed.
Brought up 2 CPUs
Total of 2 processors activated (11199.67 BogoMIPS).
CPU0 attaching sched-domain:
 domain 0: span 0-1 level CPU
  groups: 0 1
CPU1 attaching sched-domain:
 domain 0: span 0-1 level CPU
  groups: 1 0
net_namespace: 436 bytes
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xfbda0, last bus=1
PCI: Using configuration type 1 for base access
mtrr: your CPUs had inconsistent fixed MTRR settings
mtrr: probably your BIOS does not setup all CPUs.
mtrr: corrected configuration.
bio: create slab <bio-0> at 0
ACPI: EC: Look up EC in DSDT
ACPI: Interpreter enabled
ACPI: (supports S0 S3 S5)
ACPI: Using IOAPIC for interrupt routing
ACPI: No dock devices found.
ACPI: PCI Root Bridge [PCI0] (0000:00)
pci 0000:00:00.0: reg 10 32bit mmio: [0xd0000000-0xd7ffffff]
pci 0000:00:02.5: reg 10 io port: [0x1f0-0x1f7]
pci 0000:00:02.5: reg 14 io port: [0x3f4-0x3f7]
pci 0000:00:02.5: reg 18 io port: [0x170-0x177]
pci 0000:00:02.5: reg 1c io port: [0x374-0x377]
pci 0000:00:02.5: reg 20 io port: [0x4000-0x400f]
pci 0000:00:02.5: PME# supported from D3cold
pci 0000:00:02.5: PME# disabled
pci 0000:00:02.7: reg 10 io port: [0xd000-0xd0ff]
pci 0000:00:02.7: reg 14 io port: [0xd400-0xd47f]
pci 0000:00:02.7: supports D1 D2
pci 0000:00:02.7: PME# supported from D3hot D3cold
pci 0000:00:02.7: PME# disabled
pci 0000:00:03.0: reg 10 32bit mmio: [0xe1104000-0xe1104fff]
pci 0000:00:03.1: reg 10 32bit mmio: [0xe1100000-0xe1100fff]
pci 0000:00:03.2: reg 10 32bit mmio: [0xe1101000-0xe1101fff]
pci 0000:00:03.3: reg 10 32bit mmio: [0xe1102000-0xe1102fff]
pci 0000:00:03.3: PME# supported from D0 D3hot D3cold
pci 0000:00:03.3: PME# disabled
pci 0000:00:05.0: reg 10 io port: [0xd800-0xd807]
pci 0000:00:05.0: reg 14 io port: [0xdc00-0xdc03]
pci 0000:00:05.0: reg 18 io port: [0xe000-0xe007]
pci 0000:00:05.0: reg 1c io port: [0xe400-0xe403]
pci 0000:00:05.0: reg 20 io port: [0xe800-0xe80f]
pci 0000:00:05.0: PME# supported from D3cold
pci 0000:00:05.0: PME# disabled
pci 0000:00:0e.0: reg 10 io port: [0xec00-0xecff]
pci 0000:00:0e.0: reg 14 32bit mmio: [0xe1103000-0xe11030ff]
pci 0000:00:0e.0: reg 30 32bit mmio: [0x000000-0x01ffff]
pci 0000:00:0e.0: supports D1 D2
pci 0000:00:0e.0: PME# supported from D1 D2 D3hot D3cold
pci 0000:00:0e.0: PME# disabled
pci 0000:01:00.0: reg 10 32bit mmio: [0xd8000000-0xdfffffff]
pci 0000:01:00.0: reg 14 32bit mmio: [0xe1000000-0xe101ffff]
pci 0000:01:00.0: reg 18 io port: [0xc000-0xc07f]
pci 0000:01:00.0: supports D1 D2
pci 0000:00:01.0: bridge io port: [0xc000-0xcfff]
pci 0000:00:01.0: bridge 32bit mmio: [0xe1000000-0xe10fffff]
pci 0000:00:01.0: bridge 32bit mmio pref: [0xd8000000-0xdfffffff]
pci_bus 0000:00: on NUMA node 0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 10 *11 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 *10 11 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 9 10 *11 14 15)
ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 *6 7 9 10 11 14 15)
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 *9 10 11 14 15)
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 *5 6 7 9 10 11 14 15)
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
PCI: Using ACPI for IRQ routing
pnp: PnP ACPI init
ACPI: bus type pnp registered
pnp: PnP ACPI: found 12 devices
ACPI: ACPI bus type pnp unregistered
system 00:00: iomem range 0xc8000-0xcbfff has been reserved
system 00:00: iomem range 0xf0000-0xf7fff could not be reserved
system 00:00: iomem range 0xf8000-0xfbfff could not be reserved
system 00:00: iomem range 0xfc000-0xfffff could not be reserved
system 00:00: iomem range 0x3bff0000-0x3bffffff could not be reserved
system 00:00: iomem range 0xffff0000-0xffffffff has been reserved
system 00:00: iomem range 0x0-0x9ffff could not be reserved
system 00:00: iomem range 0x100000-0x3bfeffff could not be reserved
system 00:00: iomem range 0xffee0000-0xffefffff has been reserved
system 00:00: iomem range 0xfffe0000-0xfffeffff has been reserved
system 00:00: iomem range 0xfec00000-0xfecfffff has been reserved
system 00:00: iomem range 0xfee00000-0xfeefffff has been reserved
system 00:02: ioport range 0x4d0-0x4d1 has been reserved
system 00:02: ioport range 0x800-0x805 has been reserved
system 00:02: ioport range 0x290-0x297 has been reserved
system 00:02: ioport range 0x880-0x88f has been reserved
pci 0000:00:01.0: PCI bridge, secondary bus 0000:01
pci 0000:00:01.0:   IO window: 0xc000-0xcfff
pci 0000:00:01.0:   MEM window: 0xe1000000-0xe10fffff
pci 0000:00:01.0:   PREFETCH window: 0x000000d8000000-0x000000dfffffff
pci_bus 0000:00: resource 0 io:  [0x00-0xffff]
pci_bus 0000:00: resource 1 mem: [0x000000-0xffffffff]
pci_bus 0000:01: resource 0 io:  [0xc000-0xcfff]
pci_bus 0000:01: resource 1 mem: [0xe1000000-0xe10fffff]
pci_bus 0000:01: resource 2 pref mem [0xd8000000-0xdfffffff]
NET: Registered protocol family 2
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
TCP bind hash table entries: 65536 (order: 9, 2097152 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
NET: Registered protocol family 1
checking if image is initramfs...
rootfs image is initramfs; unpacking...
Freeing initrd memory: 2955k freed
apm: BIOS version 1.2 Flags 0x07 (Driver version 1.16ac)
apm: disabled - APM is not SMP safe.
highmem bounce pool size: 64 pages
HugeTLB registered 4 MB page size, pre-allocated 0 pages
msgmni has been set to 1722
alg: No test for stdrng (krng)
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254)
io scheduler noop registered
io scheduler cfq registered (default)
pci 0000:01:00.0: Boot video device
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
fan PNP0C0B:00: registered as cooling_device0
ACPI: Fan [FAN] (on)
processor ACPI_CPU:00: registered as cooling_device1
processor ACPI_CPU:01: registered as cooling_device2
thermal LNXTHERM:01: registered as thermal_zone0
ACPI: Thermal Zone [THRM] (62 C)
isapnp: Scanning for PnP cards...
Switched to high resolution mode on CPU 1
Switched to high resolution mode on CPU 0
isapnp: No Plug & Play device found
Real Time Clock Driver v1.12b
Non-volatile memory driver v1.3
Linux agpgart interface v0.103
agpgart-sis 0000:00:00.0: SiS chipset [1039/0661]
agpgart-sis 0000:00:00.0: AGP aperture is 128M @ 0xd0000000
Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
00:07: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
00:08: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
brd: module loaded
PNP: PS/2 Controller [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
cpuidle: using governor ladder
cpuidle: using governor menu
TCP cubic registered
NET: Registered protocol family 17
Using IPI No-Shortcut mode
registered taskstats version 1
Freeing unused kernel memory: 320k freed
Write protecting the kernel text: 2260k
Write protecting the kernel read-only data: 1120k
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
ehci_hcd 0000:00:03.3: PCI INT D -> GSI 23 (level, low) -> IRQ 23
ehci_hcd 0000:00:03.3: EHCI Host Controller
ehci_hcd 0000:00:03.3: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:03.3: cache line size of 128 is not supported
ehci_hcd 0000:00:03.3: irq 23, io mem 0xe1102000
ehci_hcd 0000:00:03.3: USB 2.0 started, EHCI 1.00
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 8 ports detected
ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
ohci_hcd 0000:00:03.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
ohci_hcd 0000:00:03.0: OHCI Host Controller
ohci_hcd 0000:00:03.0: new USB bus registered, assigned bus number 2
ohci_hcd 0000:00:03.0: irq 20, io mem 0xe1104000
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 3 ports detected
ohci_hcd 0000:00:03.1: PCI INT B -> GSI 21 (level, low) -> IRQ 21
ohci_hcd 0000:00:03.1: OHCI Host Controller
ohci_hcd 0000:00:03.1: new USB bus registered, assigned bus number 3
ohci_hcd 0000:00:03.1: irq 21, io mem 0xe1100000
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 3 ports detected
ohci_hcd 0000:00:03.2: PCI INT C -> GSI 22 (level, low) -> IRQ 22
ohci_hcd 0000:00:03.2: OHCI Host Controller
ohci_hcd 0000:00:03.2: new USB bus registered, assigned bus number 4
ohci_hcd 0000:00:03.2: irq 22, io mem 0xe1101000
usb usb4: configuration #1 chosen from 1 choice
hub 4-0:1.0: USB hub found
hub 4-0:1.0: 2 ports detected
uhci_hcd: USB Universal Host Controller Interface driver
SCSI subsystem initialized
Driver 'sd' needs updating - please use bus_type methods
libata version 3.00 loaded.
pata_sis 0000:00:02.5: version 0.5.2
pata_sis 0000:00:02.5: PCI INT A -> GSI 16 (level, low) -> IRQ 16
scsi0 : pata_sis
scsi1 : pata_sis
ata1: PATA max UDMA/133 cmd 0x1f0 ctl 0x3f6 bmdma 0x4000 irq 14
ata2: PATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0x4008 irq 15
input: ImPS/2 Logitech Wheel Mouse as /class/input/input0
input: AT Translated Set 2 keyboard as /class/input/input1
sata_sis 0000:00:05.0: version 1.0
sata_sis 0000:00:05.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
sata_sis 0000:00:05.0: Detected SiS 180/181/964 chipset in SATA mode
scsi2 : sata_sis
scsi3 : sata_sis
ata3: SATA max UDMA/133 cmd 0xd800 ctl 0xdc00 bmdma 0xe800 irq 17
ata4: SATA max UDMA/133 cmd 0xe000 ctl 0xe400 bmdma 0xe808 irq 17
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: ATA-7: ST3808110AS, 3.AAE, max UDMA/133
ata3.00: 156301488 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata3.00: configured for UDMA/133
scsi 2:0:0:0: Direct-Access     ATA      ST3808110AS      3.AA PQ: 0 ANSI: 5
sd 2:0:0:0: [sda] 156301488 512-byte hardware sectors: (80.0 GB/74.5 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 >
sd 2:0:0:0: [sda] Attached SCSI disk
ata4: SATA link down (SStatus 0 SControl 300)
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: sda8: orphan cleanup on readonly fs
ext3_orphan_cleanup: deleting unreferenced inode 3725366
ext3_orphan_cleanup: deleting unreferenced inode 3725365
ext3_orphan_cleanup: deleting unreferenced inode 3725364
EXT3-fs: sda8: 3 orphan inodes deleted
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with writeback data mode.
r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
r8169 0000:00:0e.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
r8169 0000:00:0e.0: no PCI Express capability
eth0: RTL8110s at 0xf8236000, 00:16:ec:2e:b7:e0, XID 04000000 IRQ 18
sd 2:0:0:0: Attached scsi generic sg0 type 0
parport_pc 00:09: reported by Plug and Play ACPI
parport0: PC-style at 0x378 (0x778), irq 7 [PCSPP,TRISTATE]
input: Power Button as /class/input/input2
ACPI: Power Button [PWRF]
input: Power Button as /class/input/input3
ACPI: Power Button [PWRB]
input: Sleep Button as /class/input/input4
ACPI: Sleep Button [FUTS]
ramfs: bad mount option: maxsize=512
EXT3 FS on sda8, internal journal
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda7, internal journal
EXT3-fs: mounted filesystem with writeback data mode.
Adding 1052216k swap on /dev/sda6.  Priority:-1 extents:1 across:1052216k 
warning: process `kudzu' used the deprecated sysctl system call with 1.23.
kudzu[1133] general protection ip:8056968 sp:bffe9e90 error:0
r8169: eth0: link up
r8169: eth0: link up
warning: `dbus-daemon' uses 32-bit capabilities (legacy support in use)
CPU0 attaching NULL sched-domain.
CPU1 attaching NULL sched-domain.
CPU0 attaching sched-domain:
 domain 0: span 0-1 level CPU
  groups: 0 1
CPU1 attaching sched-domain:
 domain 0: span 0-1 level CPU
  groups: 1 0

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.30-rc4-io #6
---------------------------------------------------------
rmdir/2186 just changed the state of lock:
 (&iocg->lock){+.+...}, at: [<c0513b18>] iocg_destroy+0x2a/0x118
but this lock was taken by another, SOFTIRQ-safe lock in the past:
 (&q->__queue_lock){..-...}

and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
3 locks held by rmdir/2186:
 #0:  (&sb->s_type->i_mutex_key#10/1){+.+.+.}, at: [<c04ae1e8>] do_rmdir+0x5c/0xc8
 #1:  (cgroup_mutex){+.+.+.}, at: [<c045a15b>] cgroup_diput+0x3c/0xa7
 #2:  (&iocg->lock){+.+...}, at: [<c0513b18>] iocg_destroy+0x2a/0x118

the first lock's dependencies:
-> (&iocg->lock){+.+...} ops: 3 {
   HARDIRQ-ON-W at:
                        [<c044b840>] mark_held_locks+0x3d/0x58
                        [<c044b963>] trace_hardirqs_on_caller+0x108/0x14c
                        [<c044b9b2>] trace_hardirqs_on+0xb/0xd
                        [<c0630883>] _spin_unlock_irq+0x27/0x47
                        [<c0513baa>] iocg_destroy+0xbc/0x118
                        [<c045a16a>] cgroup_diput+0x4b/0xa7
                        [<c04b1dbb>] dentry_iput+0x78/0x9c
                        [<c04b1e82>] d_kill+0x21/0x3b
                        [<c04b2f2a>] dput+0xf3/0xfc
                        [<c04ae226>] do_rmdir+0x9a/0xc8
                        [<c04ae29d>] sys_rmdir+0x15/0x17
                        [<c0402a68>] sysenter_do_call+0x12/0x36
                        [<ffffffff>] 0xffffffff
   SOFTIRQ-ON-W at:
                        [<c044b840>] mark_held_locks+0x3d/0x58
                        [<c044b97c>] trace_hardirqs_on_caller+0x121/0x14c
                        [<c044b9b2>] trace_hardirqs_on+0xb/0xd
                        [<c0630883>] _spin_unlock_irq+0x27/0x47
                        [<c0513baa>] iocg_destroy+0xbc/0x118
                        [<c045a16a>] cgroup_diput+0x4b/0xa7
                        [<c04b1dbb>] dentry_iput+0x78/0x9c
                        [<c04b1e82>] d_kill+0x21/0x3b
                        [<c04b2f2a>] dput+0xf3/0xfc
                        [<c04ae226>] do_rmdir+0x9a/0xc8
                        [<c04ae29d>] sys_rmdir+0x15/0x17
                        [<c0402a68>] sysenter_do_call+0x12/0x36
                        [<ffffffff>] 0xffffffff
   INITIAL USE at:
                       [<c044dad5>] __lock_acquire+0x58c/0x73e
                       [<c044dd36>] lock_acquire+0xaf/0xcc
                       [<c06304ea>] _spin_lock_irq+0x30/0x3f
                       [<c05119bd>] io_alloc_root_group+0x104/0x155
                       [<c05133cb>] elv_init_fq_data+0x32/0xe0
                       [<c0504317>] elevator_alloc+0x150/0x170
                       [<c0505393>] elevator_init+0x9d/0x100
                       [<c0507088>] blk_init_queue_node+0xc4/0xf7
                       [<c05070cb>] blk_init_queue+0x10/0x12
                       [<f81060fd>] __scsi_alloc_queue+0x1c/0xba [scsi_mod]
                       [<f81061b0>] scsi_alloc_queue+0x15/0x4e [scsi_mod]
                       [<f810803d>] scsi_alloc_sdev+0x154/0x1f5 [scsi_mod]
                       [<f8108387>] scsi_probe_and_add_lun+0x123/0xb5b [scsi_mod]
                       [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
                       [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
                       [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
                       [<c044341f>] async_thread+0xe9/0x1c9
                       [<c043e204>] kthread+0x4a/0x72
                       [<c04034e7>] kernel_thread_helper+0x7/0x10
                       [<ffffffff>] 0xffffffff
 }
 ... key      at: [<c0c5ebd8>] __key.29462+0x0/0x8

the second lock's dependencies:
-> (&q->__queue_lock){..-...} ops: 162810 {
   IN-SOFTIRQ-W at:
                        [<c044da08>] __lock_acquire+0x4bf/0x73e
                        [<c044dd36>] lock_acquire+0xaf/0xcc
                        [<c0630340>] _spin_lock+0x2a/0x39
                        [<f810672c>] scsi_device_unbusy+0x78/0x92 [scsi_mod]
                        [<f8101483>] scsi_finish_command+0x22/0xd4 [scsi_mod]
                        [<f8106fdb>] scsi_softirq_done+0xf9/0x101 [scsi_mod]
                        [<c050a936>] blk_done_softirq+0x5e/0x70
                        [<c0431379>] __do_softirq+0xb8/0x180
                        [<ffffffff>] 0xffffffff
   INITIAL USE at:
                       [<c044dad5>] __lock_acquire+0x58c/0x73e
                       [<c044dd36>] lock_acquire+0xaf/0xcc
                       [<c063056b>] _spin_lock_irqsave+0x33/0x43
                       [<f8101337>] scsi_adjust_queue_depth+0x2a/0xc9 [scsi_mod]
                       [<f8108079>] scsi_alloc_sdev+0x190/0x1f5 [scsi_mod]
                       [<f8108387>] scsi_probe_and_add_lun+0x123/0xb5b [scsi_mod]
                       [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
                       [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
                       [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
                       [<c044341f>] async_thread+0xe9/0x1c9
                       [<c043e204>] kthread+0x4a/0x72
                       [<c04034e7>] kernel_thread_helper+0x7/0x10
                       [<ffffffff>] 0xffffffff
 }
 ... key      at: [<c0c5e698>] __key.29749+0x0/0x8
 -> (&ioc->lock){..-...} ops: 1032 {
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c063056b>] _spin_lock_irqsave+0x33/0x43
                          [<c050f0f0>] cic_free_func+0x26/0x64
                          [<c050ea90>] __call_for_each_cic+0x23/0x2e
                          [<c050eaad>] cfq_free_io_context+0x12/0x14
                          [<c050978c>] put_io_context+0x4b/0x66
                          [<c050f2a2>] cfq_put_request+0x42/0x5b
                          [<c0504629>] elv_put_request+0x30/0x33
                          [<c050678d>] __blk_put_request+0x8b/0xb8
                          [<c0506953>] end_that_request_last+0x199/0x1a1
                          [<c0506a0d>] blk_end_io+0x51/0x6f
                          [<c0506a64>] blk_end_request+0x11/0x13
                          [<f8106c9c>] scsi_io_completion+0x1d9/0x41f [scsi_mod]
                          [<f810152d>] scsi_finish_command+0xcc/0xd4 [scsi_mod]
                          [<f8106fdb>] scsi_softirq_done+0xf9/0x101 [scsi_mod]
                          [<c050a936>] blk_done_softirq+0x5e/0x70
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c063056b>] _spin_lock_irqsave+0x33/0x43
                         [<c050f9bf>] cfq_set_request+0x123/0x33d
                         [<c05052e6>] elv_set_request+0x43/0x53
                         [<c0506d44>] get_request+0x22e/0x33f
                         [<c0507498>] get_request_wait+0x137/0x15d
                         [<c0507501>] blk_get_request+0x43/0x73
                         [<f8106854>] scsi_execute+0x24/0x11c [scsi_mod]
                         [<f81069ff>] scsi_execute_req+0xb3/0x104 [scsi_mod]
                         [<f81084f8>] scsi_probe_and_add_lun+0x294/0xb5b [scsi_mod]
                         [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
                         [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
                         [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
                         [<c044341f>] async_thread+0xe9/0x1c9
                         [<c043e204>] kthread+0x4a/0x72
                         [<c04034e7>] kernel_thread_helper+0x7/0x10
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c0c5e6ec>] __key.27747+0x0/0x8
  -> (&rdp->lock){-.-...} ops: 168014 {
     IN-HARDIRQ-W at:
                            [<c044d9e4>] __lock_acquire+0x49b/0x73e
                            [<c044dd36>] lock_acquire+0xaf/0xcc
                            [<c063056b>] _spin_lock_irqsave+0x33/0x43
                            [<c0461b2a>] rcu_check_callbacks+0x6a/0xa3
                            [<c043549a>] update_process_times+0x3d/0x53
                            [<c0447fe0>] tick_periodic+0x6b/0x77
                            [<c0448009>] tick_handle_periodic+0x1d/0x60
                            [<c063406e>] smp_apic_timer_interrupt+0x6e/0x81
                            [<c04033c7>] apic_timer_interrupt+0x2f/0x34
                            [<c042fbd7>] do_exit+0x53e/0x5b3
                            [<c043a9d8>] __request_module+0x0/0x100
                            [<c04034e7>] kernel_thread_helper+0x7/0x10
                            [<ffffffff>] 0xffffffff
     IN-SOFTIRQ-W at:
                            [<c044da08>] __lock_acquire+0x4bf/0x73e
                            [<c044dd36>] lock_acquire+0xaf/0xcc
                            [<c0630340>] _spin_lock+0x2a/0x39
                            [<c04619db>] rcu_process_callbacks+0x2b/0x86
                            [<c0431379>] __do_softirq+0xb8/0x180
                            [<ffffffff>] 0xffffffff
     INITIAL USE at:
                           [<c044dad5>] __lock_acquire+0x58c/0x73e
                           [<c044dd36>] lock_acquire+0xaf/0xcc
                           [<c063056b>] _spin_lock_irqsave+0x33/0x43
                           [<c062c8ca>] rcu_online_cpu+0x3d/0x51
                           [<c062c910>] rcu_cpu_notify+0x32/0x43
                           [<c07b097f>] __rcu_init+0xf0/0x120
                           [<c07af027>] rcu_init+0x8/0x14
                           [<c079d6e1>] start_kernel+0x187/0x2fc
                           [<c079d06a>] __init_begin+0x6a/0x6f
                           [<ffffffff>] 0xffffffff
   }
   ... key      at: [<c0c2e52c>] __key.17543+0x0/0x8
  ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c046143d>] call_rcu+0x36/0x5b
   [<c0517b45>] radix_tree_delete+0xe7/0x176
   [<c050f0fe>] cic_free_func+0x34/0x64
   [<c050ea90>] __call_for_each_cic+0x23/0x2e
   [<c050eaad>] cfq_free_io_context+0x12/0x14
   [<c050978c>] put_io_context+0x4b/0x66
   [<c050984c>] exit_io_context+0x77/0x7b
   [<c042fc24>] do_exit+0x58b/0x5b3
   [<c04034ed>] kernel_thread_helper+0xd/0x10
   [<ffffffff>] 0xffffffff

 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c050f4a3>] cfq_cic_lookup+0xd9/0xef
   [<c050f674>] cfq_get_queue+0x92/0x2ba
   [<c050fb01>] cfq_set_request+0x265/0x33d
   [<c05052e6>] elv_set_request+0x43/0x53
   [<c0506d44>] get_request+0x22e/0x33f
   [<c0507498>] get_request_wait+0x137/0x15d
   [<c0507501>] blk_get_request+0x43/0x73
   [<f8106854>] scsi_execute+0x24/0x11c [scsi_mod]
   [<f81069ff>] scsi_execute_req+0xb3/0x104 [scsi_mod]
   [<f81084f8>] scsi_probe_and_add_lun+0x294/0xb5b [scsi_mod]
   [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
   [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
   [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
   [<c044341f>] async_thread+0xe9/0x1c9
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

 -> (&base->lock){..-...} ops: 348073 {
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c06304ea>] _spin_lock_irq+0x30/0x3f
                          [<c0434b8b>] run_timer_softirq+0x3c/0x1d1
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c063056b>] _spin_lock_irqsave+0x33/0x43
                         [<c0434e84>] lock_timer_base+0x24/0x43
                         [<c0434f3d>] mod_timer+0x46/0xcc
                         [<c07bd97a>] con_init+0xa4/0x20e
                         [<c07bd3b2>] console_init+0x12/0x20
                         [<c079d735>] start_kernel+0x1db/0x2fc
                         [<c079d06a>] __init_begin+0x6a/0x6f
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c082304c>] __key.23401+0x0/0x8
 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c0434e84>] lock_timer_base+0x24/0x43
   [<c0434f3d>] mod_timer+0x46/0xcc
   [<c05075cb>] blk_plug_device+0x9a/0xdf
   [<c05049e1>] __elv_add_request+0x86/0x96
   [<c0509d52>] blk_execute_rq_nowait+0x5d/0x86
   [<c0509e2e>] blk_execute_rq+0xb3/0xd5
   [<f81068f5>] scsi_execute+0xc5/0x11c [scsi_mod]
   [<f81069ff>] scsi_execute_req+0xb3/0x104 [scsi_mod]
   [<f81084f8>] scsi_probe_and_add_lun+0x294/0xb5b [scsi_mod]
   [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
   [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
   [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
   [<c044341f>] async_thread+0xe9/0x1c9
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

 -> (&sdev->list_lock){..-...} ops: 27612 {
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c063056b>] _spin_lock_irqsave+0x33/0x43
                          [<f8101cb4>] scsi_put_command+0x17/0x57 [scsi_mod]
                          [<f810620f>] scsi_next_command+0x26/0x39 [scsi_mod]
                          [<f8106d02>] scsi_io_completion+0x23f/0x41f [scsi_mod]
                          [<f810152d>] scsi_finish_command+0xcc/0xd4 [scsi_mod]
                          [<f8106fdb>] scsi_softirq_done+0xf9/0x101 [scsi_mod]
                          [<c050a936>] blk_done_softirq+0x5e/0x70
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c063056b>] _spin_lock_irqsave+0x33/0x43
                         [<f8101c64>] scsi_get_command+0x5c/0x95 [scsi_mod]
                         [<f81062b6>] scsi_get_cmd_from_req+0x26/0x50 [scsi_mod]
                         [<f8106594>] scsi_setup_blk_pc_cmnd+0x2b/0xd7 [scsi_mod]
                         [<f8106664>] scsi_prep_fn+0x24/0x33 [scsi_mod]
                         [<c0504712>] elv_next_request+0xe6/0x18d
                         [<f810704c>] scsi_request_fn+0x69/0x431 [scsi_mod]
                         [<c05072af>] __generic_unplug_device+0x2e/0x31
                         [<c0509d59>] blk_execute_rq_nowait+0x64/0x86
                         [<c0509e2e>] blk_execute_rq+0xb3/0xd5
                         [<f81068f5>] scsi_execute+0xc5/0x11c [scsi_mod]
                         [<f81069ff>] scsi_execute_req+0xb3/0x104 [scsi_mod]
                         [<f81084f8>] scsi_probe_and_add_lun+0x294/0xb5b [scsi_mod]
                         [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
                         [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
                         [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
                         [<c044341f>] async_thread+0xe9/0x1c9
                         [<c043e204>] kthread+0x4a/0x72
                         [<c04034e7>] kernel_thread_helper+0x7/0x10
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<f811916c>] __key.29786+0x0/0xffff2ebf [scsi_mod]
 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<f8101c64>] scsi_get_command+0x5c/0x95 [scsi_mod]
   [<f81062b6>] scsi_get_cmd_from_req+0x26/0x50 [scsi_mod]
   [<f8106594>] scsi_setup_blk_pc_cmnd+0x2b/0xd7 [scsi_mod]
   [<f8106664>] scsi_prep_fn+0x24/0x33 [scsi_mod]
   [<c0504712>] elv_next_request+0xe6/0x18d
   [<f810704c>] scsi_request_fn+0x69/0x431 [scsi_mod]
   [<c05072af>] __generic_unplug_device+0x2e/0x31
   [<c0509d59>] blk_execute_rq_nowait+0x64/0x86
   [<c0509e2e>] blk_execute_rq+0xb3/0xd5
   [<f81068f5>] scsi_execute+0xc5/0x11c [scsi_mod]
   [<f81069ff>] scsi_execute_req+0xb3/0x104 [scsi_mod]
   [<f81084f8>] scsi_probe_and_add_lun+0x294/0xb5b [scsi_mod]
   [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
   [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
   [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
   [<c044341f>] async_thread+0xe9/0x1c9
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

 -> (&q->lock){-.-.-.} ops: 2105038 {
    IN-HARDIRQ-W at:
                          [<c044d9e4>] __lock_acquire+0x49b/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c063056b>] _spin_lock_irqsave+0x33/0x43
                          [<c041ec0d>] complete+0x17/0x43
                          [<c062609b>] i8042_aux_test_irq+0x4c/0x65
                          [<c045e922>] handle_IRQ_event+0xa4/0x169
                          [<c04602ea>] handle_edge_irq+0xc9/0x10a
                          [<ffffffff>] 0xffffffff
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c063056b>] _spin_lock_irqsave+0x33/0x43
                          [<c041ec0d>] complete+0x17/0x43
                          [<c043c336>] wakeme_after_rcu+0x10/0x12
                          [<c0461a12>] rcu_process_callbacks+0x62/0x86
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    IN-RECLAIM_FS-W at:
                             [<c044dabd>] __lock_acquire+0x574/0x73e
                             [<c044dd36>] lock_acquire+0xaf/0xcc
                             [<c063056b>] _spin_lock_irqsave+0x33/0x43
                             [<c043e47b>] prepare_to_wait+0x1c/0x4a
                             [<c0485d3e>] kswapd+0xa7/0x51b
                             [<c043e204>] kthread+0x4a/0x72
                             [<c04034e7>] kernel_thread_helper+0x7/0x10
                             [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c06304ea>] _spin_lock_irq+0x30/0x3f
                         [<c062d811>] wait_for_common+0x2f/0xeb
                         [<c062d968>] wait_for_completion+0x17/0x19
                         [<c043e161>] kthread_create+0x6e/0xc7
                         [<c062b7eb>] migration_call+0x39/0x444
                         [<c07ae112>] migration_init+0x1d/0x4b
                         [<c040115c>] do_one_initcall+0x6a/0x16e
                         [<c079d44d>] kernel_init+0x4d/0x15a
                         [<c04034e7>] kernel_thread_helper+0x7/0x10
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c0823490>] __key.17681+0x0/0x8
  -> (&rq->lock){-.-.-.} ops: 854341 {
     IN-HARDIRQ-W at:
                            [<c044d9e4>] __lock_acquire+0x49b/0x73e
                            [<c044dd36>] lock_acquire+0xaf/0xcc
                            [<c0630340>] _spin_lock+0x2a/0x39
                            [<c0429f89>] scheduler_tick+0x39/0x19b
                            [<c04354a4>] update_process_times+0x47/0x53
                            [<c0447fe0>] tick_periodic+0x6b/0x77
                            [<c0448009>] tick_handle_periodic+0x1d/0x60
                            [<c0404ace>] timer_interrupt+0x3e/0x45
                            [<c045e922>] handle_IRQ_event+0xa4/0x169
                            [<c04603a3>] handle_level_irq+0x78/0xc1
                            [<ffffffff>] 0xffffffff
     IN-SOFTIRQ-W at:
                            [<c044da08>] __lock_acquire+0x4bf/0x73e
                            [<c044dd36>] lock_acquire+0xaf/0xcc
                            [<c0630340>] _spin_lock+0x2a/0x39
                            [<c041ede7>] task_rq_lock+0x3b/0x62
                            [<c0426e41>] try_to_wake_up+0x75/0x2d4
                            [<c04270d7>] wake_up_process+0x14/0x16
                            [<c043507c>] process_timeout+0xd/0xf
                            [<c0434caa>] run_timer_softirq+0x15b/0x1d1
                            [<c0431379>] __do_softirq+0xb8/0x180
                            [<ffffffff>] 0xffffffff
     IN-RECLAIM_FS-W at:
                               [<c044dabd>] __lock_acquire+0x574/0x73e
                               [<c044dd36>] lock_acquire+0xaf/0xcc
                               [<c0630340>] _spin_lock+0x2a/0x39
                               [<c041ede7>] task_rq_lock+0x3b/0x62
                               [<c0427515>] set_cpus_allowed_ptr+0x1a/0xdd
                               [<c0485cf8>] kswapd+0x61/0x51b
                               [<c043e204>] kthread+0x4a/0x72
                               [<c04034e7>] kernel_thread_helper+0x7/0x10
                               [<ffffffff>] 0xffffffff
     INITIAL USE at:
                           [<c044dad5>] __lock_acquire+0x58c/0x73e
                           [<c044dd36>] lock_acquire+0xaf/0xcc
                           [<c063056b>] _spin_lock_irqsave+0x33/0x43
                           [<c042398e>] rq_attach_root+0x17/0xa7
                           [<c07ae52c>] sched_init+0x240/0x33e
                           [<c079d661>] start_kernel+0x107/0x2fc
                           [<c079d06a>] __init_begin+0x6a/0x6f
                           [<ffffffff>] 0xffffffff
   }
   ... key      at: [<c0800518>] __key.46938+0x0/0x8
   -> (&vec->lock){-.-...} ops: 34058 {
      IN-HARDIRQ-W at:
                              [<c044d9e4>] __lock_acquire+0x49b/0x73e
                              [<c044dd36>] lock_acquire+0xaf/0xcc
                              [<c063056b>] _spin_lock_irqsave+0x33/0x43
                              [<c047ad3b>] cpupri_set+0x51/0xba
                              [<c04219ee>] __enqueue_rt_entity+0xe2/0x1c8
                              [<c0421e18>] enqueue_rt_entity+0x19/0x23
                              [<c0428a52>] enqueue_task_rt+0x24/0x51
                              [<c041e03b>] enqueue_task+0x64/0x70
                              [<c041e06b>] activate_task+0x24/0x2a
                              [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
                              [<c04270d7>] wake_up_process+0x14/0x16
                              [<c04408b6>] hrtimer_wakeup+0x1d/0x21
                              [<c0440922>] __run_hrtimer+0x68/0x98
                              [<c04411ca>] hrtimer_interrupt+0x101/0x153
                              [<c063406e>] smp_apic_timer_interrupt+0x6e/0x81
                              [<c04033c7>] apic_timer_interrupt+0x2f/0x34
                              [<c0401c4f>] cpu_idle+0x53/0x85
                              [<c061fc80>] rest_init+0x6c/0x6e
                              [<c079d851>] start_kernel+0x2f7/0x2fc
                              [<c079d06a>] __init_begin+0x6a/0x6f
                              [<ffffffff>] 0xffffffff
      IN-SOFTIRQ-W at:
                              [<c044da08>] __lock_acquire+0x4bf/0x73e
                              [<c044dd36>] lock_acquire+0xaf/0xcc
                              [<c063056b>] _spin_lock_irqsave+0x33/0x43
                              [<c047ad3b>] cpupri_set+0x51/0xba
                              [<c04219ee>] __enqueue_rt_entity+0xe2/0x1c8
                              [<c0421e18>] enqueue_rt_entity+0x19/0x23
                              [<c0428a52>] enqueue_task_rt+0x24/0x51
                              [<c041e03b>] enqueue_task+0x64/0x70
                              [<c041e06b>] activate_task+0x24/0x2a
                              [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
                              [<c04270d7>] wake_up_process+0x14/0x16
                              [<c042737c>] rebalance_domains+0x2a3/0x3ac
                              [<c0429a06>] run_rebalance_domains+0x32/0xaa
                              [<c0431379>] __do_softirq+0xb8/0x180
                              [<ffffffff>] 0xffffffff
      INITIAL USE at:
                             [<c044dad5>] __lock_acquire+0x58c/0x73e
                             [<c044dd36>] lock_acquire+0xaf/0xcc
                             [<c063056b>] _spin_lock_irqsave+0x33/0x43
                             [<c047ad74>] cpupri_set+0x8a/0xba
                             [<c04216f2>] rq_online_rt+0x5e/0x61
                             [<c041dd3a>] set_rq_online+0x40/0x4a
                             [<c04239fb>] rq_attach_root+0x84/0xa7
                             [<c07ae52c>] sched_init+0x240/0x33e
                             [<c079d661>] start_kernel+0x107/0x2fc
                             [<c079d06a>] __init_begin+0x6a/0x6f
                             [<ffffffff>] 0xffffffff
    }
    ... key      at: [<c0c525d0>] __key.14261+0x0/0x10
   ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c047ad74>] cpupri_set+0x8a/0xba
   [<c04216f2>] rq_online_rt+0x5e/0x61
   [<c041dd3a>] set_rq_online+0x40/0x4a
   [<c04239fb>] rq_attach_root+0x84/0xa7
   [<c07ae52c>] sched_init+0x240/0x33e
   [<c079d661>] start_kernel+0x107/0x2fc
   [<c079d06a>] __init_begin+0x6a/0x6f
   [<ffffffff>] 0xffffffff

   -> (&rt_b->rt_runtime_lock){-.-...} ops: 336 {
      IN-HARDIRQ-W at:
                              [<c044d9e4>] __lock_acquire+0x49b/0x73e
                              [<c044dd36>] lock_acquire+0xaf/0xcc
                              [<c0630340>] _spin_lock+0x2a/0x39
                              [<c0421a75>] __enqueue_rt_entity+0x169/0x1c8
                              [<c0421e18>] enqueue_rt_entity+0x19/0x23
                              [<c0428a52>] enqueue_task_rt+0x24/0x51
                              [<c041e03b>] enqueue_task+0x64/0x70
                              [<c041e06b>] activate_task+0x24/0x2a
                              [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
                              [<c04270d7>] wake_up_process+0x14/0x16
                              [<c04408b6>] hrtimer_wakeup+0x1d/0x21
                              [<c0440922>] __run_hrtimer+0x68/0x98
                              [<c04411ca>] hrtimer_interrupt+0x101/0x153
                              [<c063406e>] smp_apic_timer_interrupt+0x6e/0x81
                              [<c04033c7>] apic_timer_interrupt+0x2f/0x34
                              [<c0401c4f>] cpu_idle+0x53/0x85
                              [<c061fc80>] rest_init+0x6c/0x6e
                              [<c079d851>] start_kernel+0x2f7/0x2fc
                              [<c079d06a>] __init_begin+0x6a/0x6f
                              [<ffffffff>] 0xffffffff
      IN-SOFTIRQ-W at:
                              [<c044da08>] __lock_acquire+0x4bf/0x73e
                              [<c044dd36>] lock_acquire+0xaf/0xcc
                              [<c0630340>] _spin_lock+0x2a/0x39
                              [<c0421a75>] __enqueue_rt_entity+0x169/0x1c8
                              [<c0421e18>] enqueue_rt_entity+0x19/0x23
                              [<c0428a52>] enqueue_task_rt+0x24/0x51
                              [<c041e03b>] enqueue_task+0x64/0x70
                              [<c041e06b>] activate_task+0x24/0x2a
                              [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
                              [<c04270d7>] wake_up_process+0x14/0x16
                              [<c042737c>] rebalance_domains+0x2a3/0x3ac
                              [<c0429a06>] run_rebalance_domains+0x32/0xaa
                              [<c0431379>] __do_softirq+0xb8/0x180
                              [<ffffffff>] 0xffffffff
      INITIAL USE at:
                             [<c044dad5>] __lock_acquire+0x58c/0x73e
                             [<c044dd36>] lock_acquire+0xaf/0xcc
                             [<c0630340>] _spin_lock+0x2a/0x39
                             [<c0421a75>] __enqueue_rt_entity+0x169/0x1c8
                             [<c0421e18>] enqueue_rt_entity+0x19/0x23
                             [<c0428a52>] enqueue_task_rt+0x24/0x51
                             [<c041e03b>] enqueue_task+0x64/0x70
                             [<c041e06b>] activate_task+0x24/0x2a
                             [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
                             [<c04270d7>] wake_up_process+0x14/0x16
                             [<c062b86b>] migration_call+0xb9/0x444
                             [<c07ae130>] migration_init+0x3b/0x4b
                             [<c040115c>] do_one_initcall+0x6a/0x16e
                             [<c079d44d>] kernel_init+0x4d/0x15a
                             [<c04034e7>] kernel_thread_helper+0x7/0x10
                             [<ffffffff>] 0xffffffff
    }
    ... key      at: [<c0800504>] __key.37924+0x0/0x8
    -> (&cpu_base->lock){-.-...} ops: 950512 {
       IN-HARDIRQ-W at:
                                [<c044d9e4>] __lock_acquire+0x49b/0x73e
                                [<c044dd36>] lock_acquire+0xaf/0xcc
                                [<c0630340>] _spin_lock+0x2a/0x39
                                [<c0440a3a>] hrtimer_run_queues+0xe8/0x131
                                [<c0435151>] run_local_timers+0xd/0x1e
                                [<c0435486>] update_process_times+0x29/0x53
                                [<c0447fe0>] tick_periodic+0x6b/0x77
                                [<c0448009>] tick_handle_periodic+0x1d/0x60
                                [<c063406e>] smp_apic_timer_interrupt+0x6e/0x81
                                [<c04033c7>] apic_timer_interrupt+0x2f/0x34
                                [<c04082c7>] arch_dup_task_struct+0x19/0x81
                                [<c042ac1c>] copy_process+0xab/0x115f
                                [<c042be78>] do_fork+0x129/0x2c5
                                [<c0401698>] kernel_thread+0x7f/0x87
                                [<c043e0b3>] kthreadd+0xa3/0xe3
                                [<c04034e7>] kernel_thread_helper+0x7/0x10
                                [<ffffffff>] 0xffffffff
       IN-SOFTIRQ-W at:
                                [<c044da08>] __lock_acquire+0x4bf/0x73e
                                [<c044dd36>] lock_acquire+0xaf/0xcc
                                [<c063056b>] _spin_lock_irqsave+0x33/0x43
                                [<c0440b98>] lock_hrtimer_base+0x1d/0x38
                                [<c0440ca9>] __hrtimer_start_range_ns+0x1f/0x232
                                [<c0440ee7>] hrtimer_start_range_ns+0x15/0x17
                                [<c0448ef1>] tick_setup_sched_timer+0xf6/0x124
                                [<c0441558>] hrtimer_run_pending+0xb0/0xe8
                                [<c0434b76>] run_timer_softirq+0x27/0x1d1
                                [<c0431379>] __do_softirq+0xb8/0x180
                                [<ffffffff>] 0xffffffff
       INITIAL USE at:
                               [<c044dad5>] __lock_acquire+0x58c/0x73e
                               [<c044dd36>] lock_acquire+0xaf/0xcc
                               [<c063056b>] _spin_lock_irqsave+0x33/0x43
                               [<c0440b98>] lock_hrtimer_base+0x1d/0x38
                               [<c0440ca9>] __hrtimer_start_range_ns+0x1f/0x232
                               [<c0421ab1>] __enqueue_rt_entity+0x1a5/0x1c8
                               [<c0421e18>] enqueue_rt_entity+0x19/0x23
                               [<c0428a52>] enqueue_task_rt+0x24/0x51
                               [<c041e03b>] enqueue_task+0x64/0x70
                               [<c041e06b>] activate_task+0x24/0x2a
                               [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
                               [<c04270d7>] wake_up_process+0x14/0x16
                               [<c062b86b>] migration_call+0xb9/0x444
                               [<c07ae130>] migration_init+0x3b/0x4b
                               [<c040115c>] do_one_initcall+0x6a/0x16e
                               [<c079d44d>] kernel_init+0x4d/0x15a
                               [<c04034e7>] kernel_thread_helper+0x7/0x10
                               [<ffffffff>] 0xffffffff
     }
     ... key      at: [<c08234b8>] __key.20063+0x0/0x8
    ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c0440b98>] lock_hrtimer_base+0x1d/0x38
   [<c0440ca9>] __hrtimer_start_range_ns+0x1f/0x232
   [<c0421ab1>] __enqueue_rt_entity+0x1a5/0x1c8
   [<c0421e18>] enqueue_rt_entity+0x19/0x23
   [<c0428a52>] enqueue_task_rt+0x24/0x51
   [<c041e03b>] enqueue_task+0x64/0x70
   [<c041e06b>] activate_task+0x24/0x2a
   [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
   [<c04270d7>] wake_up_process+0x14/0x16
   [<c062b86b>] migration_call+0xb9/0x444
   [<c07ae130>] migration_init+0x3b/0x4b
   [<c040115c>] do_one_initcall+0x6a/0x16e
   [<c079d44d>] kernel_init+0x4d/0x15a
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

    -> (&rt_rq->rt_runtime_lock){-.....} ops: 17587 {
       IN-HARDIRQ-W at:
                                [<c044d9e4>] __lock_acquire+0x49b/0x73e
                                [<c044dd36>] lock_acquire+0xaf/0xcc
                                [<c0630340>] _spin_lock+0x2a/0x39
                                [<c0421efc>] sched_rt_period_timer+0xda/0x24e
                                [<c0440922>] __run_hrtimer+0x68/0x98
                                [<c04411ca>] hrtimer_interrupt+0x101/0x153
                                [<c063406e>] smp_apic_timer_interrupt+0x6e/0x81
                                [<c04033c7>] apic_timer_interrupt+0x2f/0x34
                                [<c0452203>] each_symbol_in_section+0x27/0x57
                                [<c045225a>] each_symbol+0x27/0x113
                                [<c0452373>] find_symbol+0x2d/0x51
                                [<c0454a7a>] load_module+0xaec/0x10eb
                                [<c04550bf>] sys_init_module+0x46/0x19b
                                [<c0402a68>] sysenter_do_call+0x12/0x36
                                [<ffffffff>] 0xffffffff
       INITIAL USE at:
                               [<c044dad5>] __lock_acquire+0x58c/0x73e
                               [<c044dd36>] lock_acquire+0xaf/0xcc
                               [<c0630340>] _spin_lock+0x2a/0x39
                               [<c0421c41>] update_curr_rt+0x13a/0x20d
                               [<c0421dd8>] dequeue_task_rt+0x13/0x3a
                               [<c041df9e>] dequeue_task+0xff/0x10e
                               [<c041dfd1>] deactivate_task+0x24/0x2a
                               [<c062db54>] __schedule+0x162/0x991
                               [<c062e39a>] schedule+0x17/0x30
                               [<c0426c54>] migration_thread+0x175/0x203
                               [<c043e204>] kthread+0x4a/0x72
                               [<c04034e7>] kernel_thread_helper+0x7/0x10
                               [<ffffffff>] 0xffffffff
     }
     ... key      at: [<c080050c>] __key.46863+0x0/0x8
    ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c041ee73>] __enable_runtime+0x43/0xb3
   [<c04216d8>] rq_online_rt+0x44/0x61
   [<c041dd3a>] set_rq_online+0x40/0x4a
   [<c062b8a5>] migration_call+0xf3/0x444
   [<c063291c>] notifier_call_chain+0x2b/0x4a
   [<c0441e22>] __raw_notifier_call_chain+0x13/0x15
   [<c0441e35>] raw_notifier_call_chain+0x11/0x13
   [<c062bd2f>] _cpu_up+0xc3/0xf6
   [<c062bdac>] cpu_up+0x4a/0x5a
   [<c079d49a>] kernel_init+0x9a/0x15a
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

   ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c0421a75>] __enqueue_rt_entity+0x169/0x1c8
   [<c0421e18>] enqueue_rt_entity+0x19/0x23
   [<c0428a52>] enqueue_task_rt+0x24/0x51
   [<c041e03b>] enqueue_task+0x64/0x70
   [<c041e06b>] activate_task+0x24/0x2a
   [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
   [<c04270d7>] wake_up_process+0x14/0x16
   [<c062b86b>] migration_call+0xb9/0x444
   [<c07ae130>] migration_init+0x3b/0x4b
   [<c040115c>] do_one_initcall+0x6a/0x16e
   [<c079d44d>] kernel_init+0x4d/0x15a
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

   ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c0421c41>] update_curr_rt+0x13a/0x20d
   [<c0421dd8>] dequeue_task_rt+0x13/0x3a
   [<c041df9e>] dequeue_task+0xff/0x10e
   [<c041dfd1>] deactivate_task+0x24/0x2a
   [<c062db54>] __schedule+0x162/0x991
   [<c062e39a>] schedule+0x17/0x30
   [<c0426c54>] migration_thread+0x175/0x203
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

   -> (&sig->cputimer.lock){......} ops: 1949 {
      INITIAL USE at:
                             [<c044dad5>] __lock_acquire+0x58c/0x73e
                             [<c044dd36>] lock_acquire+0xaf/0xcc
                             [<c063056b>] _spin_lock_irqsave+0x33/0x43
                             [<c043f03e>] thread_group_cputimer+0x29/0x90
                             [<c044004c>] posix_cpu_timers_exit_group+0x16/0x39
                             [<c042e5f0>] release_task+0xa2/0x376
                             [<c042fbe1>] do_exit+0x548/0x5b3
                             [<c043a9d8>] __request_module+0x0/0x100
                             [<c04034e7>] kernel_thread_helper+0x7/0x10
                             [<ffffffff>] 0xffffffff
    }
    ... key      at: [<c08014ac>] __key.15480+0x0/0x8
   ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c041f43a>] update_curr+0xef/0x107
   [<c042131b>] enqueue_entity+0x1a/0x1c6
   [<c0421535>] enqueue_task_fair+0x24/0x3e
   [<c041e03b>] enqueue_task+0x64/0x70
   [<c041e06b>] activate_task+0x24/0x2a
   [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
   [<c04270b0>] default_wake_function+0x10/0x12
   [<c041d785>] __wake_up_common+0x34/0x5f
   [<c041ec26>] complete+0x30/0x43
   [<c043e1e8>] kthread+0x2e/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

   -> (&rq->lock/1){..-...} ops: 3217 {
      IN-SOFTIRQ-W at:
                              [<c044da08>] __lock_acquire+0x4bf/0x73e
                              [<c044dd36>] lock_acquire+0xaf/0xcc
                              [<c0630305>] _spin_lock_nested+0x2d/0x3e
                              [<c0422cb4>] double_rq_lock+0x4b/0x7d
                              [<c0427274>] rebalance_domains+0x19b/0x3ac
                              [<c0429a06>] run_rebalance_domains+0x32/0xaa
                              [<c0431379>] __do_softirq+0xb8/0x180
                              [<ffffffff>] 0xffffffff
      INITIAL USE at:
                             [<c044dad5>] __lock_acquire+0x58c/0x73e
                             [<c044dd36>] lock_acquire+0xaf/0xcc
                             [<c0630305>] _spin_lock_nested+0x2d/0x3e
                             [<c0422cb4>] double_rq_lock+0x4b/0x7d
                             [<c0427274>] rebalance_domains+0x19b/0x3ac
                             [<c0429a06>] run_rebalance_domains+0x32/0xaa
                             [<c0431379>] __do_softirq+0xb8/0x180
                             [<ffffffff>] 0xffffffff
    }
    ... key      at: [<c0800519>] __key.46938+0x1/0x8
    ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c0421c41>] update_curr_rt+0x13a/0x20d
   [<c0421dd8>] dequeue_task_rt+0x13/0x3a
   [<c041df9e>] dequeue_task+0xff/0x10e
   [<c041dfd1>] deactivate_task+0x24/0x2a
   [<c0427b1b>] push_rt_task+0x189/0x1f7
   [<c0427b9b>] push_rt_tasks+0x12/0x19
   [<c0427bb9>] post_schedule_rt+0x17/0x21
   [<c0425a68>] finish_task_switch+0x83/0xc0
   [<c062e339>] __schedule+0x947/0x991
   [<c062e39a>] schedule+0x17/0x30
   [<c0426c54>] migration_thread+0x175/0x203
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

    ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c047ad3b>] cpupri_set+0x51/0xba
   [<c04219ee>] __enqueue_rt_entity+0xe2/0x1c8
   [<c0421e18>] enqueue_rt_entity+0x19/0x23
   [<c0428a52>] enqueue_task_rt+0x24/0x51
   [<c041e03b>] enqueue_task+0x64/0x70
   [<c041e06b>] activate_task+0x24/0x2a
   [<c0427b33>] push_rt_task+0x1a1/0x1f7
   [<c0427b9b>] push_rt_tasks+0x12/0x19
   [<c0427bb9>] post_schedule_rt+0x17/0x21
   [<c0425a68>] finish_task_switch+0x83/0xc0
   [<c062e339>] __schedule+0x947/0x991
   [<c062e39a>] schedule+0x17/0x30
   [<c0426c54>] migration_thread+0x175/0x203
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

   ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630305>] _spin_lock_nested+0x2d/0x3e
   [<c0422cb4>] double_rq_lock+0x4b/0x7d
   [<c0427274>] rebalance_domains+0x19b/0x3ac
   [<c0429a06>] run_rebalance_domains+0x32/0xaa
   [<c0431379>] __do_softirq+0xb8/0x180
   [<ffffffff>] 0xffffffff

  ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c041ede7>] task_rq_lock+0x3b/0x62
   [<c0426e41>] try_to_wake_up+0x75/0x2d4
   [<c04270b0>] default_wake_function+0x10/0x12
   [<c041d785>] __wake_up_common+0x34/0x5f
   [<c041ec26>] complete+0x30/0x43
   [<c043e0cc>] kthreadd+0xbc/0xe3
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

  -> (&ep->lock){......} ops: 110 {
     INITIAL USE at:
                           [<c044dad5>] __lock_acquire+0x58c/0x73e
                           [<c044dd36>] lock_acquire+0xaf/0xcc
                           [<c063056b>] _spin_lock_irqsave+0x33/0x43
                           [<c04ca381>] sys_epoll_ctl+0x232/0x3f6
                           [<c0402a68>] sysenter_do_call+0x12/0x36
                           [<ffffffff>] 0xffffffff
   }
   ... key      at: [<c0c5be90>] __key.22301+0x0/0x10
   ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c041ede7>] task_rq_lock+0x3b/0x62
   [<c0426e41>] try_to_wake_up+0x75/0x2d4
   [<c04270b0>] default_wake_function+0x10/0x12
   [<c041d785>] __wake_up_common+0x34/0x5f
   [<c041d7c6>] __wake_up_locked+0x16/0x1a
   [<c04ca7f5>] ep_poll_callback+0x7c/0xb6
   [<c041d785>] __wake_up_common+0x34/0x5f
   [<c041ec70>] __wake_up_sync_key+0x37/0x4a
   [<c05cbefa>] sock_def_readable+0x42/0x71
   [<c061c8b1>] unix_stream_connect+0x2f3/0x368
   [<c05c830a>] sys_connect+0x59/0x76
   [<c05c963f>] sys_socketcall+0x76/0x172
   [<c0402a68>] sysenter_do_call+0x12/0x36
   [<ffffffff>] 0xffffffff

  ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c04ca797>] ep_poll_callback+0x1e/0xb6
   [<c041d785>] __wake_up_common+0x34/0x5f
   [<c041ec70>] __wake_up_sync_key+0x37/0x4a
   [<c05cbefa>] sock_def_readable+0x42/0x71
   [<c061c8b1>] unix_stream_connect+0x2f3/0x368
   [<c05c830a>] sys_connect+0x59/0x76
   [<c05c963f>] sys_socketcall+0x76/0x172
   [<c0402a68>] sysenter_do_call+0x12/0x36
   [<ffffffff>] 0xffffffff

 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c041ec0d>] complete+0x17/0x43
   [<c0509cf2>] blk_end_sync_rq+0x2a/0x2d
   [<c0506935>] end_that_request_last+0x17b/0x1a1
   [<c0506a0d>] blk_end_io+0x51/0x6f
   [<c0506a64>] blk_end_request+0x11/0x13
   [<f8106c9c>] scsi_io_completion+0x1d9/0x41f [scsi_mod]
   [<f810152d>] scsi_finish_command+0xcc/0xd4 [scsi_mod]
   [<f8106fdb>] scsi_softirq_done+0xf9/0x101 [scsi_mod]
   [<c050a936>] blk_done_softirq+0x5e/0x70
   [<c0431379>] __do_softirq+0xb8/0x180
   [<ffffffff>] 0xffffffff

 -> (&n->list_lock){..-...} ops: 49241 {
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c0630340>] _spin_lock+0x2a/0x39
                          [<c049bd18>] add_partial+0x16/0x40
                          [<c049d0d4>] __slab_free+0x96/0x28f
                          [<c049df5c>] kmem_cache_free+0x8c/0xf2
                          [<c04a5ce9>] file_free_rcu+0x35/0x38
                          [<c0461a12>] rcu_process_callbacks+0x62/0x86
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c0630340>] _spin_lock+0x2a/0x39
                         [<c049bd18>] add_partial+0x16/0x40
                         [<c049d0d4>] __slab_free+0x96/0x28f
                         [<c049df5c>] kmem_cache_free+0x8c/0xf2
                         [<c0514eda>] ida_get_new_above+0x13b/0x155
                         [<c0514f00>] ida_get_new+0xc/0xe
                         [<c04a628b>] set_anon_super+0x39/0xa3
                         [<c04a68c6>] sget+0x2f3/0x386
                         [<c04a7365>] get_sb_single+0x24/0x8f
                         [<c04e034c>] sysfs_get_sb+0x18/0x1a
                         [<c04a6dd1>] vfs_kern_mount+0x40/0x7b
                         [<c04a6e21>] kern_mount_data+0x15/0x17
                         [<c07b5ff6>] sysfs_init+0x50/0x9c
                         [<c07b4ac9>] mnt_init+0x8c/0x1e4
                         [<c07b4737>] vfs_caches_init+0xd8/0xea
                         [<c079d815>] start_kernel+0x2bb/0x2fc
                         [<c079d06a>] __init_begin+0x6a/0x6f
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c0c5a424>] __key.25358+0x0/0x8
 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c049cc45>] __slab_alloc+0xf6/0x4ef
   [<c049d333>] kmem_cache_alloc+0x66/0x11f
   [<f810189b>] scsi_pool_alloc_command+0x20/0x4c [scsi_mod]
   [<f81018de>] scsi_host_alloc_command+0x17/0x4f [scsi_mod]
   [<f810192b>] __scsi_get_command+0x15/0x71 [scsi_mod]
   [<f8101c41>] scsi_get_command+0x39/0x95 [scsi_mod]
   [<f81062b6>] scsi_get_cmd_from_req+0x26/0x50 [scsi_mod]
   [<f8106594>] scsi_setup_blk_pc_cmnd+0x2b/0xd7 [scsi_mod]
   [<f8106664>] scsi_prep_fn+0x24/0x33 [scsi_mod]
   [<c0504712>] elv_next_request+0xe6/0x18d
   [<f810704c>] scsi_request_fn+0x69/0x431 [scsi_mod]
   [<c05072af>] __generic_unplug_device+0x2e/0x31
   [<c0509d59>] blk_execute_rq_nowait+0x64/0x86
   [<c0509e2e>] blk_execute_rq+0xb3/0xd5
   [<f81068f5>] scsi_execute+0xc5/0x11c [scsi_mod]
   [<f81069ff>] scsi_execute_req+0xb3/0x104 [scsi_mod]
   [<f812b40d>] sd_revalidate_disk+0x1a3/0xf64 [sd_mod]
   [<f812d52f>] sd_probe_async+0x146/0x22d [sd_mod]
   [<c044341f>] async_thread+0xe9/0x1c9
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

 -> (&cwq->lock){-.-...} ops: 30335 {
    IN-HARDIRQ-W at:
                          [<c044d9e4>] __lock_acquire+0x49b/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c063056b>] _spin_lock_irqsave+0x33/0x43
                          [<c043b54b>] __queue_work+0x14/0x30
                          [<c043b5ce>] queue_work_on+0x3a/0x46
                          [<c043b617>] queue_work+0x26/0x4a
                          [<c043b64f>] schedule_work+0x14/0x16
                          [<c057a367>] schedule_console_callback+0x12/0x14
                          [<c05788ed>] kbd_event+0x595/0x600
                          [<c05b3d15>] input_pass_event+0x56/0x7e
                          [<c05b4702>] input_handle_event+0x314/0x334
                          [<c05b4f1e>] input_event+0x50/0x63
                          [<c05b9bd4>] atkbd_interrupt+0x209/0x4e9
                          [<c05b1793>] serio_interrupt+0x38/0x6e
                          [<c05b24e8>] i8042_interrupt+0x1db/0x1ec
                          [<c045e922>] handle_IRQ_event+0xa4/0x169
                          [<c04602ea>] handle_edge_irq+0xc9/0x10a
                          [<ffffffff>] 0xffffffff
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c063056b>] _spin_lock_irqsave+0x33/0x43
                          [<c043b54b>] __queue_work+0x14/0x30
                          [<c043b590>] delayed_work_timer_fn+0x29/0x2d
                          [<c0434caa>] run_timer_softirq+0x15b/0x1d1
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c063056b>] _spin_lock_irqsave+0x33/0x43
                         [<c043b54b>] __queue_work+0x14/0x30
                         [<c043b5ce>] queue_work_on+0x3a/0x46
                         [<c043b617>] queue_work+0x26/0x4a
                         [<c043a7b3>] call_usermodehelper_exec+0x83/0xd0
                         [<c051631a>] kobject_uevent_env+0x351/0x385
                         [<c0516358>] kobject_uevent+0xa/0xc
                         [<c0515a0e>] kset_register+0x2e/0x34
                         [<c0590f18>] bus_register+0xed/0x23d
                         [<c07bea09>] platform_bus_init+0x23/0x38
                         [<c07beb77>] driver_init+0x1c/0x28
                         [<c079d4f6>] kernel_init+0xf6/0x15a
                         [<c04034e7>] kernel_thread_helper+0x7/0x10
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c08230a8>] __key.23814+0x0/0x8
  -> (&workqueue_cpu_stat(cpu)->lock){-.-...} ops: 20397 {
     IN-HARDIRQ-W at:
                            [<c044d9e4>] __lock_acquire+0x49b/0x73e
                            [<c044dd36>] lock_acquire+0xaf/0xcc
                            [<c063056b>] _spin_lock_irqsave+0x33/0x43
                            [<c0474909>] probe_workqueue_insertion+0x33/0x81
                            [<c043acf3>] insert_work+0x3f/0x9b
                            [<c043b559>] __queue_work+0x22/0x30
                            [<c043b5ce>] queue_work_on+0x3a/0x46
                            [<c043b617>] queue_work+0x26/0x4a
                            [<c043b64f>] schedule_work+0x14/0x16
                            [<c057a367>] schedule_console_callback+0x12/0x14
                            [<c05788ed>] kbd_event+0x595/0x600
                            [<c05b3d15>] input_pass_event+0x56/0x7e
                            [<c05b4702>] input_handle_event+0x314/0x334
                            [<c05b4f1e>] input_event+0x50/0x63
                            [<c05b9bd4>] atkbd_interrupt+0x209/0x4e9
                            [<c05b1793>] serio_interrupt+0x38/0x6e
                            [<c05b24e8>] i8042_interrupt+0x1db/0x1ec
                            [<c045e922>] handle_IRQ_event+0xa4/0x169
                            [<c04602ea>] handle_edge_irq+0xc9/0x10a
                            [<ffffffff>] 0xffffffff
     IN-SOFTIRQ-W at:
                            [<c044da08>] __lock_acquire+0x4bf/0x73e
                            [<c044dd36>] lock_acquire+0xaf/0xcc
                            [<c063056b>] _spin_lock_irqsave+0x33/0x43
                            [<c0474909>] probe_workqueue_insertion+0x33/0x81
                            [<c043acf3>] insert_work+0x3f/0x9b
                            [<c043b559>] __queue_work+0x22/0x30
                            [<c043b590>] delayed_work_timer_fn+0x29/0x2d
                            [<c0434caa>] run_timer_softirq+0x15b/0x1d1
                            [<c0431379>] __do_softirq+0xb8/0x180
                            [<ffffffff>] 0xffffffff
     INITIAL USE at:
                           [<c044dad5>] __lock_acquire+0x58c/0x73e
                           [<c044dd36>] lock_acquire+0xaf/0xcc
                           [<c063056b>] _spin_lock_irqsave+0x33/0x43
                           [<c04747eb>] probe_workqueue_creation+0xc9/0x10a
                           [<c043abcb>] create_workqueue_thread+0x87/0xb0
                           [<c043b12f>] __create_workqueue_key+0x16d/0x1b2
                           [<c07aeedb>] init_workqueues+0x61/0x73
                           [<c079d4e7>] kernel_init+0xe7/0x15a
                           [<c04034e7>] kernel_thread_helper+0x7/0x10
                           [<ffffffff>] 0xffffffff
   }
   ... key      at: [<c0c52574>] __key.23424+0x0/0x8
  ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c0474909>] probe_workqueue_insertion+0x33/0x81
   [<c043acf3>] insert_work+0x3f/0x9b
   [<c043b559>] __queue_work+0x22/0x30
   [<c043b5ce>] queue_work_on+0x3a/0x46
   [<c043b617>] queue_work+0x26/0x4a
   [<c043a7b3>] call_usermodehelper_exec+0x83/0xd0
   [<c051631a>] kobject_uevent_env+0x351/0x385
   [<c0516358>] kobject_uevent+0xa/0xc
   [<c0515a0e>] kset_register+0x2e/0x34
   [<c0590f18>] bus_register+0xed/0x23d
   [<c07bea09>] platform_bus_init+0x23/0x38
   [<c07beb77>] driver_init+0x1c/0x28
   [<c079d4f6>] kernel_init+0xf6/0x15a
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

  ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c041ecaf>] __wake_up+0x1a/0x40
   [<c043ad46>] insert_work+0x92/0x9b
   [<c043b559>] __queue_work+0x22/0x30
   [<c043b5ce>] queue_work_on+0x3a/0x46
   [<c043b617>] queue_work+0x26/0x4a
   [<c043a7b3>] call_usermodehelper_exec+0x83/0xd0
   [<c051631a>] kobject_uevent_env+0x351/0x385
   [<c0516358>] kobject_uevent+0xa/0xc
   [<c0515a0e>] kset_register+0x2e/0x34
   [<c0590f18>] bus_register+0xed/0x23d
   [<c07bea09>] platform_bus_init+0x23/0x38
   [<c07beb77>] driver_init+0x1c/0x28
   [<c079d4f6>] kernel_init+0xf6/0x15a
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c043b54b>] __queue_work+0x14/0x30
   [<c043b5ce>] queue_work_on+0x3a/0x46
   [<c043b617>] queue_work+0x26/0x4a
   [<c0505679>] kblockd_schedule_work+0x12/0x14
   [<c05113bb>] elv_schedule_dispatch+0x41/0x48
   [<c0513377>] elv_ioq_completed_request+0x2dc/0x2fe
   [<c05045aa>] elv_completed_request+0x48/0x97
   [<c0506738>] __blk_put_request+0x36/0xb8
   [<c0506953>] end_that_request_last+0x199/0x1a1
   [<c0506a0d>] blk_end_io+0x51/0x6f
   [<c0506a64>] blk_end_request+0x11/0x13
   [<f8106c9c>] scsi_io_completion+0x1d9/0x41f [scsi_mod]
   [<f810152d>] scsi_finish_command+0xcc/0xd4 [scsi_mod]
   [<f8106fdb>] scsi_softirq_done+0xf9/0x101 [scsi_mod]
   [<c050a936>] blk_done_softirq+0x5e/0x70
   [<c0431379>] __do_softirq+0xb8/0x180
   [<ffffffff>] 0xffffffff

 -> (&zone->lock){..-...} ops: 80266 {
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c0630340>] _spin_lock+0x2a/0x39
                          [<c047fc71>] __free_pages_ok+0x167/0x321
                          [<c04800ce>] __free_pages+0x29/0x2b
                          [<c049c7c1>] __free_slab+0xb2/0xba
                          [<c049c800>] discard_slab+0x37/0x39
                          [<c049d15c>] __slab_free+0x11e/0x28f
                          [<c049df5c>] kmem_cache_free+0x8c/0xf2
                          [<c042ab6e>] free_task+0x31/0x34
                          [<c042c37b>] __put_task_struct+0xd3/0xd8
                          [<c042e072>] delayed_put_task_struct+0x60/0x64
                          [<c0461a12>] rcu_process_callbacks+0x62/0x86
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c0630340>] _spin_lock+0x2a/0x39
                         [<c047f7b6>] free_pages_bulk+0x21/0x1a1
                         [<c047ffcf>] free_hot_cold_page+0x181/0x20f
                         [<c04800a3>] free_hot_page+0xf/0x11
                         [<c04800c5>] __free_pages+0x20/0x2b
                         [<c07c4d96>] __free_pages_bootmem+0x6d/0x71
                         [<c07b2244>] free_all_bootmem_core+0xd2/0x177
                         [<c07b22f6>] free_all_bootmem+0xd/0xf
                         [<c07ad21a>] mem_init+0x28/0x28c
                         [<c079d7b1>] start_kernel+0x257/0x2fc
                         [<c079d06a>] __init_begin+0x6a/0x6f
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c0c52628>] __key.30749+0x0/0x8
 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c048035e>] get_page_from_freelist+0x236/0x3e3
   [<c04805f4>] __alloc_pages_internal+0xce/0x371
   [<c049cce6>] __slab_alloc+0x197/0x4ef
   [<c049d333>] kmem_cache_alloc+0x66/0x11f
   [<c047d96b>] mempool_alloc_slab+0x13/0x15
   [<c047da5c>] mempool_alloc+0x3a/0xd5
   [<f81063cc>] scsi_sg_alloc+0x47/0x4a [scsi_mod]
   [<c051cd02>] __sg_alloc_table+0x48/0xc7
   [<f8106325>] scsi_init_sgtable+0x2c/0x8c [scsi_mod]
   [<f81064e7>] scsi_init_io+0x19/0x9b [scsi_mod]
   [<f8106abf>] scsi_setup_fs_cmnd+0x6f/0x73 [scsi_mod]
   [<f812ca73>] sd_prep_fn+0x6a/0x7d4 [sd_mod]
   [<c0504712>] elv_next_request+0xe6/0x18d
   [<f810704c>] scsi_request_fn+0x69/0x431 [scsi_mod]
   [<c05072af>] __generic_unplug_device+0x2e/0x31
   [<c05072db>] blk_start_queueing+0x29/0x2b
   [<c05137b8>] elv_ioq_request_add+0x2be/0x393
   [<c05048cd>] elv_insert+0x114/0x1a2
   [<c05049ec>] __elv_add_request+0x91/0x96
   [<c0507a00>] __make_request+0x365/0x397
   [<c050635a>] generic_make_request+0x342/0x3ce
   [<c0507b21>] submit_bio+0xef/0xfa
   [<c04c6c4e>] mpage_bio_submit+0x21/0x26
   [<c04c7b7f>] mpage_readpages+0xa3/0xad
   [<f80c1ea8>] ext3_readpages+0x19/0x1b [ext3]
   [<c048275e>] __do_page_cache_readahead+0xfd/0x166
   [<c0482b42>] do_page_cache_readahead+0x44/0x52
   [<c047d665>] filemap_fault+0x197/0x3ae
   [<c048b9ea>] __do_fault+0x40/0x37b
   [<c048d43f>] handle_mm_fault+0x2bb/0x646
   [<c063273c>] do_page_fault+0x29c/0x2fd
   [<c0630b4a>] error_code+0x72/0x78
   [<ffffffff>] 0xffffffff

 -> (&page_address_htable[i].lock){......} ops: 6802 {
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c063056b>] _spin_lock_irqsave+0x33/0x43
                         [<c048af69>] page_address+0x50/0xa6
                         [<c048b0e7>] kmap_high+0x21/0x175
                         [<c041b7ef>] kmap+0x4e/0x5b
                         [<c04abb36>] page_getlink+0x37/0x59
                         [<c04abb75>] page_follow_link_light+0x1d/0x2b
                         [<c04ad4d0>] __link_path_walk+0x3d1/0xa71
                         [<c04adbae>] path_walk+0x3e/0x77
                         [<c04add0e>] do_path_lookup+0xeb/0x105
                         [<c04ae6f2>] path_lookup_open+0x48/0x7a
                         [<c04a8e96>] open_exec+0x25/0xf4
                         [<c04a9c2d>] do_execve+0xfa/0x2cc
                         [<c04015c0>] sys_execve+0x2b/0x54
                         [<c0402ae9>] syscall_call+0x7/0xb
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c0c5288c>] __key.28547+0x0/0x14
 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c048af69>] page_address+0x50/0xa6
   [<c05078a1>] __make_request+0x206/0x397
   [<c050635a>] generic_make_request+0x342/0x3ce
   [<c0507b21>] submit_bio+0xef/0xfa
   [<c04c6c4e>] mpage_bio_submit+0x21/0x26
   [<c04c78b8>] do_mpage_readpage+0x471/0x5e5
   [<c04c7b55>] mpage_readpages+0x79/0xad
   [<f80c1ea8>] ext3_readpages+0x19/0x1b [ext3]
   [<c048275e>] __do_page_cache_readahead+0xfd/0x166
   [<c0482b42>] do_page_cache_readahead+0x44/0x52
   [<c047d665>] filemap_fault+0x197/0x3ae
   [<c048b9ea>] __do_fault+0x40/0x37b
   [<c048d43f>] handle_mm_fault+0x2bb/0x646
   [<c063273c>] do_page_fault+0x29c/0x2fd
   [<c0630b4a>] error_code+0x72/0x78
   [<ffffffff>] 0xffffffff

 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c046143d>] call_rcu+0x36/0x5b
   [<c050f0c8>] cfq_cic_free+0x15/0x17
   [<c050f128>] cic_free_func+0x5e/0x64
   [<c050ea90>] __call_for_each_cic+0x23/0x2e
   [<c050eaad>] cfq_free_io_context+0x12/0x14
   [<c050978c>] put_io_context+0x4b/0x66
   [<c050f00a>] cfq_active_ioq_reset+0x21/0x39
   [<c0511044>] elv_reset_active_ioq+0x2b/0x3e
   [<c0512ecf>] __elv_ioq_slice_expired+0x238/0x26a
   [<c0512f1f>] elv_ioq_slice_expired+0x1e/0x20
   [<c0513860>] elv_ioq_request_add+0x366/0x393
   [<c05048cd>] elv_insert+0x114/0x1a2
   [<c05049ec>] __elv_add_request+0x91/0x96
   [<c0507a00>] __make_request+0x365/0x397
   [<c050635a>] generic_make_request+0x342/0x3ce
   [<c0507b21>] submit_bio+0xef/0xfa
   [<c04bf495>] submit_bh+0xe3/0x102
   [<c04c04b0>] ll_rw_block+0xbe/0xf7
   [<f80c35ba>] ext3_bread+0x39/0x79 [ext3]
   [<f80c5643>] dx_probe+0x2f/0x298 [ext3]
   [<f80c5956>] ext3_find_entry+0xaa/0x573 [ext3]
   [<f80c739e>] ext3_lookup+0x31/0xbe [ext3]
   [<c04abf7c>] do_lookup+0xbc/0x159
   [<c04ad7e8>] __link_path_walk+0x6e9/0xa71
   [<c04adbae>] path_walk+0x3e/0x77
   [<c04add0e>] do_path_lookup+0xeb/0x105
   [<c04ae584>] user_path_at+0x41/0x6c
   [<c04a8301>] vfs_fstatat+0x32/0x59
   [<c04a8417>] vfs_stat+0x18/0x1a
   [<c04a8432>] sys_stat64+0x19/0x2d
   [<c0402a68>] sysenter_do_call+0x12/0x36
   [<ffffffff>] 0xffffffff

 -> (&iocg->lock){+.+...} ops: 3 {
    HARDIRQ-ON-W at:
                          [<c044b840>] mark_held_locks+0x3d/0x58
                          [<c044b963>] trace_hardirqs_on_caller+0x108/0x14c
                          [<c044b9b2>] trace_hardirqs_on+0xb/0xd
                          [<c0630883>] _spin_unlock_irq+0x27/0x47
                          [<c0513baa>] iocg_destroy+0xbc/0x118
                          [<c045a16a>] cgroup_diput+0x4b/0xa7
                          [<c04b1dbb>] dentry_iput+0x78/0x9c
                          [<c04b1e82>] d_kill+0x21/0x3b
                          [<c04b2f2a>] dput+0xf3/0xfc
                          [<c04ae226>] do_rmdir+0x9a/0xc8
                          [<c04ae29d>] sys_rmdir+0x15/0x17
                          [<c0402a68>] sysenter_do_call+0x12/0x36
                          [<ffffffff>] 0xffffffff
    SOFTIRQ-ON-W at:
                          [<c044b840>] mark_held_locks+0x3d/0x58
                          [<c044b97c>] trace_hardirqs_on_caller+0x121/0x14c
                          [<c044b9b2>] trace_hardirqs_on+0xb/0xd
                          [<c0630883>] _spin_unlock_irq+0x27/0x47
                          [<c0513baa>] iocg_destroy+0xbc/0x118
                          [<c045a16a>] cgroup_diput+0x4b/0xa7
                          [<c04b1dbb>] dentry_iput+0x78/0x9c
                          [<c04b1e82>] d_kill+0x21/0x3b
                          [<c04b2f2a>] dput+0xf3/0xfc
                          [<c04ae226>] do_rmdir+0x9a/0xc8
                          [<c04ae29d>] sys_rmdir+0x15/0x17
                          [<c0402a68>] sysenter_do_call+0x12/0x36
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c06304ea>] _spin_lock_irq+0x30/0x3f
                         [<c05119bd>] io_alloc_root_group+0x104/0x155
                         [<c05133cb>] elv_init_fq_data+0x32/0xe0
                         [<c0504317>] elevator_alloc+0x150/0x170
                         [<c0505393>] elevator_init+0x9d/0x100
                         [<c0507088>] blk_init_queue_node+0xc4/0xf7
                         [<c05070cb>] blk_init_queue+0x10/0x12
                         [<f81060fd>] __scsi_alloc_queue+0x1c/0xba [scsi_mod]
                         [<f81061b0>] scsi_alloc_queue+0x15/0x4e [scsi_mod]
                         [<f810803d>] scsi_alloc_sdev+0x154/0x1f5 [scsi_mod]
                         [<f8108387>] scsi_probe_and_add_lun+0x123/0xb5b [scsi_mod]
                         [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
                         [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
                         [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
                         [<c044341f>] async_thread+0xe9/0x1c9
                         [<c043e204>] kthread+0x4a/0x72
                         [<c04034e7>] kernel_thread_helper+0x7/0x10
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c0c5ebd8>] __key.29462+0x0/0x8
 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c0510f6f>] io_group_chain_link+0x5c/0x106
   [<c0511ba7>] io_find_alloc_group+0x54/0x60
   [<c0511c11>] io_get_io_group_bio+0x5e/0x89
   [<c0511cc3>] io_group_get_request_list+0x12/0x21
   [<c0507485>] get_request_wait+0x124/0x15d
   [<c050797e>] __make_request+0x2e3/0x397
   [<c050635a>] generic_make_request+0x342/0x3ce
   [<c0507b21>] submit_bio+0xef/0xfa
   [<c04c6c4e>] mpage_bio_submit+0x21/0x26
   [<c04c7b7f>] mpage_readpages+0xa3/0xad
   [<f80c1ea8>] ext3_readpages+0x19/0x1b [ext3]
   [<c048275e>] __do_page_cache_readahead+0xfd/0x166
   [<c048294a>] ondemand_readahead+0x10a/0x118
   [<c04829db>] page_cache_sync_readahead+0x1b/0x20
   [<c047cf37>] generic_file_aio_read+0x226/0x545
   [<c04a4cf6>] do_sync_read+0xb0/0xee
   [<c04a54b0>] vfs_read+0x8f/0x136
   [<c04a8d7c>] kernel_read+0x39/0x4b
   [<c04a8e69>] prepare_binprm+0xdb/0xe3
   [<c04a9ca8>] do_execve+0x175/0x2cc
   [<c04015c0>] sys_execve+0x2b/0x54
   [<c0402a68>] sysenter_do_call+0x12/0x36
   [<ffffffff>] 0xffffffff


stack backtrace:
Pid: 2186, comm: rmdir Not tainted 2.6.30-rc4-io #6
Call Trace:
 [<c044b1ac>] print_irq_inversion_bug+0x13b/0x147
 [<c044c3e5>] check_usage_backwards+0x7d/0x86
 [<c044b5ec>] mark_lock+0x2d3/0x4ea
 [<c044c368>] ? check_usage_backwards+0x0/0x86
 [<c044b840>] mark_held_locks+0x3d/0x58
 [<c0630883>] ? _spin_unlock_irq+0x27/0x47
 [<c044b97c>] trace_hardirqs_on_caller+0x121/0x14c
 [<c044b9b2>] trace_hardirqs_on+0xb/0xd
 [<c0630883>] _spin_unlock_irq+0x27/0x47
 [<c0513baa>] iocg_destroy+0xbc/0x118
 [<c045a16a>] cgroup_diput+0x4b/0xa7
 [<c04b1dbb>] dentry_iput+0x78/0x9c
 [<c04b1e82>] d_kill+0x21/0x3b
 [<c04b2f2a>] dput+0xf3/0xfc
 [<c04ae226>] do_rmdir+0x9a/0xc8
 [<c04029b1>] ? resume_userspace+0x11/0x28
 [<c051aa14>] ? trace_hardirqs_on_thunk+0xc/0x10
 [<c0402b34>] ? restore_nocheck_notrace+0x0/0xe
 [<c06324a0>] ? do_page_fault+0x0/0x2fd
 [<c044b97c>] ? trace_hardirqs_on_caller+0x121/0x14c
 [<c04ae29d>] sys_rmdir+0x15/0x17
 [<c0402a68>] sysenter_do_call+0x12/0x36

[-- Attachment #4: Type: text/plain, Size: 206 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06 16:10       ` Vivek Goyal
  (?)
@ 2009-05-07  5:36       ` Li Zefan
       [not found]         ` <4A027348.6000808-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  -1 siblings, 1 reply; 297+ messages in thread
From: Li Zefan @ 2009-05-07  5:36 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Gui Jianfeng, nauman, dpshah, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

[-- Attachment #1: Type: text/plain, Size: 2886 bytes --]

Vivek Goyal wrote:
> On Wed, May 06, 2009 at 04:11:05PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> Hi All,
>>>
>>> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
>>> First version of the patches was posted here.
>> Hi Vivek,
>>
>> I did some simple tests for V2 and triggered a kernel panic.
>> The following script can reproduce this bug. It seems that the cgroup
>> is already removed, but the IO Controller still tries to access it.
>>
> 
> Hi Gui,
> 
> Thanks for the report. I use cgroup_path() for debugging. I guess that
> cgroup_path() was passed a null cgrp pointer; that's why it crashed.
> 
> If yes, then it is strange though. I call cgroup_path() only after
> grabbing a reference to the css object. (I am assuming that if I have a
> valid reference to the css object then css->cgrp can't be null.)
> 

Yes, css->cgrp shouldn't be NULL. I suspect we hit a bug in cgroup here.
The code dealing with css refcounting and cgroup rmdir has changed quite a
lot, and is much more complex than it used to be.
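
(For reference, a minimal sketch of the reference-then-path pattern being
discussed -- illustrative only, not code from the patchset, and the helper
name is made up: pin the css before dereferencing css->cgroup for
cgroup_path(), and drop the reference afterwards.)

#include <linux/cgroup.h>

/* Pin the css so the cgroup cannot be torn down while we print its path. */
static void print_group_path(struct cgroup_subsys_state *css,
                             char *buf, int buflen)
{
        css_get(css);                   /* hold a reference on the css */
        rcu_read_lock();
        if (css->cgroup)                /* expected non-NULL while ref held */
                cgroup_path(css->cgroup, buf, buflen);
        rcu_read_unlock();
        css_put(css);                   /* drop the reference */
}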

> Anyway, can you please try out the following patch and see if it fixes
> your crash.
...
> BTW, I tried the following equivalent script and I can't reproduce the
> crash on my system. Are you able to hit it regularly?
> 

I modified the script like this:

======================
#!/bin/sh
echo 1 > /proc/sys/vm/drop_caches
mkdir /cgroup 2> /dev/null
mount -t cgroup -o io,blkio io /cgroup
mkdir /cgroup/test1
mkdir /cgroup/test2
echo 100 > /cgroup/test1/io.weight
echo 500 > /cgroup/test2/io.weight

dd if=/dev/zero bs=4096 count=128000 of=500M.1 &
pid1=$!
echo $pid1 > /cgroup/test1/tasks

dd if=/dev/zero bs=4096 count=128000 of=500M.2 &
pid2=$!
echo $pid2 > /cgroup/test2/tasks

sleep 5
kill -9 $pid1
kill -9 $pid2

# keep retrying rmdir until both groups have been removed
count=0
for (( ; count != 2; ))
do
        rmdir /cgroup/test1 > /dev/null 2>&1
        if [ $? -eq 0 ]; then
                count=$(( $count + 1 ))
        fi

        rmdir /cgroup/test2 > /dev/null 2>&1
        if [ $? -eq 0 ]; then
                count=$(( $count + 1 ))
        fi
done

umount /cgroup
rmdir /cgroup
======================

I ran this script and got a lockdep BUG. The full log and my config are attached.

Actually, this can be triggered with the following steps on my box:
# mount -t cgroup -o blkio,io xxx /mnt
# mkdir /mnt/0
# echo $$ > /mnt/0/tasks
# echo 3 > /proc/sys/vm/drop_caches
# echo $$ > /mnt/tasks
# rmdir /mnt/0

And when I ran the script a second time, my box froze and I had
to reset it.

> Instead of killing the tasks, I also tried moving the tasks into the root
> cgroup and then deleting the test1 and test2 groups; that also did not
> produce any crash. (Hit a different bug though after 5-6 attempts :-)
> 
> As I mentioned in the patchset, we currently do have issues with group
> refcounting and the cgroup/group going away. Hopefully in the next version
> they should all be fixed up. But still, it is nice to hear back...
> 

[-- Attachment #2: myconfig --]
[-- Type: text/plain, Size: 64514 bytes --]

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.30-rc4
# Thu May  7 09:11:29 2009
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA=y
# CONFIG_HAVE_CPUMASK_OF_CPU_MAP is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_X86_32_LAZY_GS=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_AUDIT is not set

#
# RCU Subsystem
#
# CONFIG_CLASSIC_RCU is not set
# CONFIG_TREE_RCU is not set
CONFIG_PREEMPT_RCU=y
CONFIG_RCU_TRACE=y
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_PREEMPT_RCU_TRACE=y
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUPS=y
CONFIG_CGROUP_DEBUG=y
CONFIG_CGROUP_NS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
CONFIG_CGROUP_MEM_RES_CTLR=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
CONFIG_GROUP_IOSCHED=y
CONFIG_CGROUP_BLKIO=y
CONFIG_CGROUP_PAGE=y
CONFIG_MM_OWNER=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
CONFIG_USER_NS=y
CONFIG_PID_NS=y
# CONFIG_NET_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
# CONFIG_STRIP_ASM_SYMS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
CONFIG_COMPAT_BRK=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_MARKERS=y
CONFIG_OPROFILE=m
# CONFIG_OPROFILE_IBS is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_API_DEBUG=y
# CONFIG_SLOW_WORK is not set
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_LBD=y
CONFIG_BLK_DEV_BSG=y
# CONFIG_BLK_DEV_INTEGRITY is not set

#
# IO Schedulers
#
CONFIG_ELV_FAIR_QUEUING=y
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_NOOP_HIER=y
CONFIG_IOSCHED_AS=m
CONFIG_IOSCHED_AS_HIER=y
CONFIG_IOSCHED_DEADLINE=m
CONFIG_IOSCHED_DEADLINE_HIER=y
CONFIG_IOSCHED_CFQ=y
CONFIG_IOSCHED_CFQ_HIER=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_TRACK_ASYNC_CONTEXT=y
CONFIG_DEBUG_GROUP_IOSCHED=y
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
# CONFIG_SPARSE_IRQ is not set
CONFIG_X86_MPPARSE=y
# CONFIG_X86_BIGSMP is not set
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_RDC321X is not set
# CONFIG_X86_32_NON_STANDARD is not set
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_PARAVIRT_GUEST is not set
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
CONFIG_M686=y
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_GENERIC=y
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_X86_XADD=y
CONFIG_X86_PPRO_FENCE=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=4
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_CYRIX_32=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_TRANSMETA_32=y
CONFIG_CPU_SUP_UMC_32=y
# CONFIG_X86_DS is not set
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
# CONFIG_IOMMU_HELPER is not set
# CONFIG_IOMMU_API is not set
CONFIG_NR_CPUS=8
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_NONFATAL is not set
# CONFIG_X86_MCE_P4THERMAL is not set
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_X86_REBOOTFIXUPS is not set
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
# CONFIG_X86_CPU_DEBUG is not set
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_PAGE_OFFSET=0xC0000000
CONFIG_HIGHMEM=y
# CONFIG_ARCH_PHYS_ADDR_T_64BIT is not set
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_SPARSEMEM_STATIC=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_PHYS_ADDR_T_64BIT is not set
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_UNEVICTABLE_LRU=y
CONFIG_HAVE_MLOCK=y
CONFIG_HAVE_MLOCKED_PAGE_BIT=y
CONFIG_HIGHPTE=y
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW_64K=y
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
# CONFIG_X86_PAT is not set
CONFIG_EFI=y
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x400000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management and ACPI options
#
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_VERBOSE is not set
CONFIG_CAN_PM_TRACE=y
# CONFIG_PM_TRACE_RTC is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
# CONFIG_HIBERNATION is not set
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
# CONFIG_ACPI_PROCFS is not set
# CONFIG_ACPI_PROCFS_POWER is not set
CONFIG_ACPI_SYSFS_POWER=y
# CONFIG_ACPI_PROC_EVENT is not set
CONFIG_ACPI_AC=m
# CONFIG_ACPI_BATTERY is not set
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=1999
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_SBS is not set
CONFIG_X86_APM_BOOT=y
CONFIG_APM=y
# CONFIG_APM_IGNORE_USER_SUSPEND is not set
# CONFIG_APM_DO_ENABLE is not set
CONFIG_APM_CPU_IDLE=y
# CONFIG_APM_DISPLAY_BLANK is not set
# CONFIG_APM_ALLOW_INTS is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_DEBUG=y
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

#
# CPUFreq processor drivers
#
# CONFIG_X86_ACPI_CPUFREQ is not set
# CONFIG_X86_POWERNOW_K6 is not set
# CONFIG_X86_POWERNOW_K7 is not set
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_GX_SUSPMOD is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_SPEEDSTEP_ICH=y
CONFIG_X86_SPEEDSTEP_SMI=y
# CONFIG_X86_P4_CLOCKMOD is not set
# CONFIG_X86_CPUFREQ_NFORCE2 is not set
# CONFIG_X86_LONGRUN is not set
# CONFIG_X86_LONGHAUL is not set
# CONFIG_X86_E_POWERSAVER is not set

#
# shared options
#
CONFIG_X86_SPEEDSTEP_LIB=y
# CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
# CONFIG_PCI_GODIRECT is not set
# CONFIG_PCI_GOOLPC is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=m
CONFIG_PCIEAER=y
# CONFIG_PCIEASPM is not set
CONFIG_ARCH_SUPPORTS_MSI=y
# CONFIG_PCI_MSI is not set
CONFIG_PCI_LEGACY=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_STUB is not set
CONFIG_HT_IRQ=y
# CONFIG_PCI_IOV is not set
CONFIG_ISA_DMA_API=y
CONFIG_ISA=y
# CONFIG_EISA is not set
# CONFIG_MCA is not set
# CONFIG_SCx200 is not set
# CONFIG_OLPC is not set
CONFIG_PCCARD=y
# CONFIG_PCMCIA_DEBUG is not set
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
# CONFIG_PCMCIA_IOCTL is not set
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
# CONFIG_I82365 is not set
# CONFIG_TCIC is not set
CONFIG_PCMCIA_PROBE=y
CONFIG_PCCARD_NONSTATIC=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_FAKE=m
# CONFIG_HOTPLUG_PCI_COMPAQ is not set
# CONFIG_HOTPLUG_PCI_IBM is not set
CONFIG_HOTPLUG_PCI_ACPI=m
CONFIG_HOTPLUG_PCI_ACPI_IBM=m
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_HAVE_AOUT=y
# CONFIG_BINFMT_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_HAVE_ATOMIC_IOMAP=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=m
# CONFIG_NET_IPGRE is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
CONFIG_INET_TUNNEL=m
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
CONFIG_INET_LRO=m
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=m
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
CONFIG_TCP_CONG_HSTCP=m
CONFIG_TCP_CONG_HYBLA=m
# CONFIG_TCP_CONG_VEGAS is not set
CONFIG_TCP_CONG_SCALABLE=m
CONFIG_TCP_CONG_LP=m
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
CONFIG_TCP_CONG_ILLINOIS=m
# CONFIG_DEFAULT_BIC is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
# CONFIG_IPV6 is not set
# CONFIG_NETWORK_SECMARK is not set
# CONFIG_NETFILTER is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
CONFIG_STP=m
CONFIG_BRIDGE=m
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_PHONET is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_CLS_CGROUP=y
# CONFIG_NET_EMATCH is not set
# CONFIG_NET_CLS_ACT is not set
CONFIG_NET_SCH_FIFO=y
# CONFIG_DCB is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
# CONFIG_NET_DROP_MONITOR is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
# CONFIG_WIRELESS is not set
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set
# CONFIG_NET_9P is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_CONNECTOR is not set
# CONFIG_MTD is not set
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_SERIAL=m
# CONFIG_PARPORT_PC_FIFO is not set
# CONFIG_PARPORT_PC_SUPERIO is not set
CONFIG_PARPORT_PC_PCMCIA=m
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
CONFIG_PARPORT_1284=y
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_ISAPNP=y
# CONFIG_PNPBIOS is not set
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_XD is not set
CONFIG_PARIDE=m

#
# Parallel IDE high-level drivers
#
CONFIG_PARIDE_PD=m
CONFIG_PARIDE_PCD=m
CONFIG_PARIDE_PF=m
# CONFIG_PARIDE_PT is not set
CONFIG_PARIDE_PG=m

#
# Parallel IDE protocol modules
#
# CONFIG_PARIDE_ATEN is not set
# CONFIG_PARIDE_BPCK is not set
# CONFIG_PARIDE_BPCK6 is not set
# CONFIG_PARIDE_COMM is not set
# CONFIG_PARIDE_DSTR is not set
# CONFIG_PARIDE_FIT2 is not set
# CONFIG_PARIDE_FIT3 is not set
# CONFIG_PARIDE_EPAT is not set
# CONFIG_PARIDE_EPIA is not set
# CONFIG_PARIDE_FRIQ is not set
# CONFIG_PARIDE_FRPW is not set
# CONFIG_PARIDE_KBIC is not set
# CONFIG_PARIDE_KTTI is not set
# CONFIG_PARIDE_ON20 is not set
# CONFIG_PARIDE_ON26 is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
CONFIG_BLK_DEV_NBD=m
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_BLK_DEV_HD is not set
CONFIG_MISC_DEVICES=y
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_ISL29003 is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
CONFIG_EEPROM_93CX6=m
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=m
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=m
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=m
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
CONFIG_CHR_DEV_SCH=m

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
# CONFIG_SCSI_CONSTANTS is not set
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
# CONFIG_SCSI_FC_TGT_ATTRS is not set
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_ATA=y
CONFIG_SCSI_SAS_HOST_SMP=y
# CONFIG_SCSI_SAS_LIBSAS_DEBUG is not set
CONFIG_SCSI_SRP_ATTRS=m
# CONFIG_SCSI_SRP_TGT_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_7000FASST is not set
CONFIG_SCSI_ACARD=m
# CONFIG_SCSI_AHA152X is not set
# CONFIG_SCSI_AHA1542 is not set
# CONFIG_SCSI_AACRAID is not set
CONFIG_SCSI_AIC7XXX=m
CONFIG_AIC7XXX_CMDS_PER_DEVICE=4
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
# CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC7XXX_OLD=m
CONFIG_SCSI_AIC79XX=m
CONFIG_AIC79XX_CMDS_PER_DEVICE=4
CONFIG_AIC79XX_RESET_DELAY_MS=15000
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC94XX=m
# CONFIG_AIC94XX_DEBUG is not set
# CONFIG_SCSI_DPT_I2O is not set
CONFIG_SCSI_ADVANSYS=m
# CONFIG_SCSI_IN2000 is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_MPT2SAS is not set
# CONFIG_SCSI_HPTIOP is not set
CONFIG_SCSI_BUSLOGIC=m
# CONFIG_SCSI_FLASHPOINT is not set
# CONFIG_LIBFC is not set
# CONFIG_LIBFCOE is not set
# CONFIG_FCOE is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_DTC3280 is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
CONFIG_SCSI_GDTH=m
# CONFIG_SCSI_GENERIC_NCR5380 is not set
# CONFIG_SCSI_GENERIC_NCR5380_MMIO is not set
CONFIG_SCSI_IPS=m
CONFIG_SCSI_INITIO=m
CONFIG_SCSI_INIA100=m
CONFIG_SCSI_PPA=m
CONFIG_SCSI_IMM=m
# CONFIG_SCSI_IZIP_EPP16 is not set
# CONFIG_SCSI_IZIP_SLOW_CTR is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_NCR53C406A is not set
# CONFIG_SCSI_STEX is not set
CONFIG_SCSI_SYM53C8XX_2=m
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
CONFIG_SCSI_SYM53C8XX_MMIO=y
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_PAS16 is not set
# CONFIG_SCSI_QLOGIC_FAS is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_SYM53C416 is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_T128 is not set
# CONFIG_SCSI_U14_34F is not set
# CONFIG_SCSI_ULTRASTOR is not set
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_SRP is not set
CONFIG_SCSI_LOWLEVEL_PCMCIA=y
# CONFIG_PCMCIA_AHA152X is not set
# CONFIG_PCMCIA_FDOMAIN is not set
# CONFIG_PCMCIA_NINJA_SCSI is not set
CONFIG_PCMCIA_QLOGIC=m
# CONFIG_PCMCIA_SYM53C500 is not set
# CONFIG_SCSI_DH is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=m
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=m
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=m
# CONFIG_SATA_MV is not set
CONFIG_SATA_NV=m
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
CONFIG_SATA_SIS=m
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_PATA_ACPI is not set
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
CONFIG_PATA_ATIIXP=m
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5535 is not set
# CONFIG_PATA_CS5536 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_ISAPNP is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_LEGACY is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
CONFIG_PATA_MPIIX=m
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
CONFIG_PATA_PCMCIA=m
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_QDI is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
CONFIG_PATA_SIS=m
CONFIG_PATA_VIA=m
# CONFIG_PATA_WINBOND is not set
# CONFIG_PATA_WINBOND_VLB is not set
# CONFIG_PATA_SCH is not set
# CONFIG_MD is not set
CONFIG_FUSION=y
CONFIG_FUSION_SPI=m
CONFIG_FUSION_FC=m
# CONFIG_FUSION_SAS is not set
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_CTL=m
CONFIG_FUSION_LAN=m
CONFIG_FUSION_LOGGING=y

#
# IEEE 1394 (FireWire) support
#

#
# Enable only one of the two stacks, unless you know what you are doing
#
CONFIG_FIREWIRE=m
CONFIG_FIREWIRE_OHCI=m
CONFIG_FIREWIRE_OHCI_DEBUG=y
CONFIG_FIREWIRE_SBP2=m
# CONFIG_IEEE1394 is not set
CONFIG_I2O=m
# CONFIG_I2O_LCT_NOTIFY_ON_CHANGES is not set
CONFIG_I2O_EXT_ADAPTEC=y
CONFIG_I2O_CONFIG=m
CONFIG_I2O_CONFIG_OLD_IOCTL=y
CONFIG_I2O_BUS=m
CONFIG_I2O_BLOCK=m
CONFIG_I2O_SCSI=m
CONFIG_I2O_PROC=m
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_COMPAT_NET_DEV_OPS=y
CONFIG_DUMMY=m
CONFIG_BONDING=m
# CONFIG_MACVLAN is not set
# CONFIG_EQUALIZER is not set
CONFIG_TUN=m
# CONFIG_VETH is not set
# CONFIG_NET_SB1000 is not set
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=m

#
# MII PHY device drivers
#
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_QSEMI_PHY is not set
CONFIG_LXT_PHY=m
# CONFIG_CICADA_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=m
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
CONFIG_NET_VENDOR_3COM=y
# CONFIG_EL1 is not set
# CONFIG_EL2 is not set
# CONFIG_ELPLUS is not set
# CONFIG_EL16 is not set
CONFIG_EL3=m
# CONFIG_3C515 is not set
CONFIG_VORTEX=m
CONFIG_TYPHOON=m
# CONFIG_LANCE is not set
CONFIG_NET_VENDOR_SMC=y
# CONFIG_WD80x3 is not set
# CONFIG_ULTRA is not set
# CONFIG_SMC9194 is not set
# CONFIG_ETHOC is not set
# CONFIG_NET_VENDOR_RACAL is not set
# CONFIG_DNET is not set
CONFIG_NET_TULIP=y
CONFIG_DE2104X=m
CONFIG_TULIP=m
# CONFIG_TULIP_MWI is not set
CONFIG_TULIP_MMIO=y
# CONFIG_TULIP_NAPI is not set
CONFIG_DE4X5=m
CONFIG_WINBOND_840=m
CONFIG_DM9102=m
CONFIG_ULI526X=m
CONFIG_PCMCIA_XIRCOM=m
# CONFIG_AT1700 is not set
# CONFIG_DEPCA is not set
# CONFIG_HP100 is not set
CONFIG_NET_ISA=y
# CONFIG_E2100 is not set
# CONFIG_EWRK3 is not set
# CONFIG_EEXPRESS is not set
# CONFIG_EEXPRESS_PRO is not set
# CONFIG_HPLAN_PLUS is not set
# CONFIG_HPLAN is not set
# CONFIG_LP486E is not set
# CONFIG_ETH16I is not set
CONFIG_NE2000=m
# CONFIG_ZNET is not set
# CONFIG_SEEQ8005 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
CONFIG_PCNET32=m
CONFIG_AMD8111_ETH=m
CONFIG_ADAPTEC_STARFIRE=m
# CONFIG_AC3200 is not set
# CONFIG_APRICOT is not set
CONFIG_B44=m
CONFIG_B44_PCI_AUTOSELECT=y
CONFIG_B44_PCICORE_AUTOSELECT=y
CONFIG_B44_PCI=y
CONFIG_FORCEDETH=m
CONFIG_FORCEDETH_NAPI=y
# CONFIG_CS89x0 is not set
CONFIG_E100=m
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
CONFIG_NE2K_PCI=m
# CONFIG_8139CP is not set
CONFIG_8139TOO=m
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_R6040 is not set
CONFIG_SIS900=m
# CONFIG_EPIC100 is not set
# CONFIG_SMSC9420 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
CONFIG_VIA_RHINE=m
CONFIG_VIA_RHINE_MMIO=y
# CONFIG_SC92031 is not set
CONFIG_NET_POCKET=y
CONFIG_ATP=m
CONFIG_DE600=m
CONFIG_DE620=m
# CONFIG_ATL2 is not set
CONFIG_NETDEV_1000=y
CONFIG_ACENIC=m
# CONFIG_ACENIC_OMIT_TIGON_I is not set
# CONFIG_DL2K is not set
CONFIG_E1000=m
CONFIG_E1000E=m
# CONFIG_IP1000 is not set
# CONFIG_IGB is not set
# CONFIG_IGBVF is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_R8169=m
# CONFIG_SIS190 is not set
CONFIG_SKGE=m
# CONFIG_SKGE_DEBUG is not set
CONFIG_SKY2=m
# CONFIG_SKY2_DEBUG is not set
CONFIG_VIA_VELOCITY=m
# CONFIG_TIGON3 is not set
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_JME is not set
# CONFIG_NETDEV_10000 is not set
# CONFIG_TR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
# CONFIG_WLAN_80211 is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
CONFIG_USB_USBNET=m
CONFIG_USB_NET_AX8817X=m
CONFIG_USB_NET_CDCETHER=m
CONFIG_USB_NET_DM9601=m
# CONFIG_USB_NET_SMSC95XX is not set
CONFIG_USB_NET_GL620A=m
CONFIG_USB_NET_NET1080=m
# CONFIG_USB_NET_PLUSB is not set
# CONFIG_USB_NET_MCS7830 is not set
# CONFIG_USB_NET_RNDIS_HOST is not set
CONFIG_USB_NET_CDC_SUBSET=m
CONFIG_USB_ALI_M5632=y
CONFIG_USB_AN2720=y
CONFIG_USB_BELKIN=y
CONFIG_USB_ARMLINUX=y
CONFIG_USB_EPSON2888=y
CONFIG_USB_KC2190=y
# CONFIG_USB_NET_ZAURUS is not set
CONFIG_NET_PCMCIA=y
# CONFIG_PCMCIA_3C589 is not set
# CONFIG_PCMCIA_3C574 is not set
# CONFIG_PCMCIA_FMVJ18X is not set
CONFIG_PCMCIA_PCNET=m
CONFIG_PCMCIA_NMCLAN=m
CONFIG_PCMCIA_SMC91C92=m
# CONFIG_PCMCIA_XIRC2PS is not set
# CONFIG_PCMCIA_AXNET is not set
# CONFIG_WAN is not set
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
# CONFIG_SKFP is not set
# CONFIG_HIPPI is not set
CONFIG_PLIP=m
CONFIG_PPP=m
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
# CONFIG_PPP_BSDCOMP is not set
# CONFIG_PPP_MPPE is not set
CONFIG_PPPOE=m
# CONFIG_PPPOL2TP is not set
CONFIG_SLIP=m
CONFIG_SLIP_COMPRESSED=y
CONFIG_SLHC=m
CONFIG_SLIP_SMART=y
# CONFIG_SLIP_MODE_SLIP6 is not set
CONFIG_NET_FC=y
CONFIG_NETCONSOLE=m
# CONFIG_NETCONSOLE_DYNAMIC is not set
CONFIG_NETPOLL=y
CONFIG_NETPOLL_TRAP=y
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=m

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_SERIAL=m
CONFIG_MOUSE_APPLETOUCH=m
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_INPORT is not set
# CONFIG_MOUSE_LOGIBM is not set
# CONFIG_MOUSE_PC110PAD is not set
CONFIG_MOUSE_VSXXXAA=m
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_WISTRON_BTNS is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=m

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_DEVKMEM=y
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_COMPUTONE is not set
CONFIG_ROCKETPORT=m
CONFIG_CYCLADES=m
# CONFIG_CYZ_INTR is not set
# CONFIG_DIGIEPCA is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_ISI is not set
# CONFIG_SYNCLINK is not set
CONFIG_SYNCLINKMP=m
CONFIG_SYNCLINK_GT=m
# CONFIG_N_HDLC is not set
# CONFIG_RISCOM8 is not set
# CONFIG_SPECIALIX is not set
# CONFIG_SX is not set
# CONFIG_RIO is not set
# CONFIG_STALDRV is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CS=m
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
# CONFIG_SERIAL_8250_FOURPORT is not set
# CONFIG_SERIAL_8250_ACCENT is not set
# CONFIG_SERIAL_8250_BOCA is not set
# CONFIG_SERIAL_8250_EXAR_ST16C554 is not set
# CONFIG_SERIAL_8250_HUB6 is not set
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_JSM=m
CONFIG_UNIX98_PTYS=y
# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set
# CONFIG_LEGACY_PTYS is not set
CONFIG_PRINTER=m
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=m
CONFIG_IPMI_HANDLER=m
# CONFIG_IPMI_PANIC_EVENT is not set
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_HW_RANDOM_GEODE=m
CONFIG_HW_RANDOM_VIA=m
CONFIG_NVRAM=y
CONFIG_RTC=y
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
# CONFIG_IPWIRELESS is not set
CONFIG_MWAVE=m
# CONFIG_PC8736x_GPIO is not set
# CONFIG_NSC_GPIO is not set
# CONFIG_CS5535_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
CONFIG_HANGCHECK_TIMER=m
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCA=m

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
CONFIG_I2C_ALI1535=m
CONFIG_I2C_ALI1563=m
CONFIG_I2C_ALI15X3=m
CONFIG_I2C_AMD756=m
CONFIG_I2C_AMD756_S4882=m
# CONFIG_I2C_AMD8111 is not set
CONFIG_I2C_I801=m
# CONFIG_I2C_ISCH is not set
CONFIG_I2C_PIIX4=m
CONFIG_I2C_NFORCE2=m
# CONFIG_I2C_NFORCE2_S4985 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
CONFIG_I2C_VIA=m
CONFIG_I2C_VIAPRO=m

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_OCORES is not set
CONFIG_I2C_SIMTEC=m

#
# External I2C/SMBus adapter drivers
#
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Graphics adapter I2C/DDC channel drivers
#
CONFIG_I2C_VOODOO3=m

#
# Other I2C/SMBus bus drivers
#
CONFIG_I2C_PCA_ISA=m
# CONFIG_I2C_PCA_PLATFORM is not set
CONFIG_I2C_STUB=m
# CONFIG_SCx200_ACB is not set

#
# Miscellaneous I2C Chip support
#
# CONFIG_DS1682 is not set
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_PCF8575 is not set
# CONFIG_SENSORS_PCA9539 is not set
CONFIG_SENSORS_MAX6875=m
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
# CONFIG_SPI is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_BQ27x00 is not set
CONFIG_HWMON=m
CONFIG_HWMON_VID=m
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
CONFIG_SENSORS_AD7418=m
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7473 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATK0110 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHER is not set
# CONFIG_SENSORS_FSCPOS is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
CONFIG_SENSORS_CORETEMP=m
# CONFIG_SENSORS_IBMAEM is not set
# CONFIG_SENSORS_IBMPEX is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_PCF8591 is not set
CONFIG_SENSORS_SIS5595=m
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_THMC50 is not set
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_VT1211=m
CONFIG_SENSORS_VT8231=m
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
CONFIG_SENSORS_HDAPS=m
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_HWMON_DEBUG_CHIP is not set
CONFIG_THERMAL=y
# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
CONFIG_SSB=m
CONFIG_SSB_SPROM=y
CONFIG_SSB_PCIHOST_POSSIBLE=y
CONFIG_SSB_PCIHOST=y
# CONFIG_SSB_B43_PCI_BRIDGE is not set
CONFIG_SSB_PCMCIAHOST_POSSIBLE=y
CONFIG_SSB_PCMCIAHOST=y
# CONFIG_SSB_DEBUG is not set
CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
CONFIG_SSB_DRIVER_PCICORE=y

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_REGULATOR is not set

#
# Multimedia devices
#

#
# Multimedia core support
#
CONFIG_VIDEO_DEV=m
CONFIG_VIDEO_V4L2_COMMON=m
CONFIG_VIDEO_ALLOW_V4L1=y
CONFIG_VIDEO_V4L1_COMPAT=y
# CONFIG_DVB_CORE is not set
CONFIG_VIDEO_MEDIA=m

#
# Multimedia drivers
#
# CONFIG_MEDIA_ATTACH is not set
CONFIG_MEDIA_TUNER=m
# CONFIG_MEDIA_TUNER_CUSTOMISE is not set
CONFIG_MEDIA_TUNER_SIMPLE=m
CONFIG_MEDIA_TUNER_TDA8290=m
CONFIG_MEDIA_TUNER_TDA9887=m
CONFIG_MEDIA_TUNER_TEA5761=m
CONFIG_MEDIA_TUNER_TEA5767=m
CONFIG_MEDIA_TUNER_MT20XX=m
CONFIG_MEDIA_TUNER_XC2028=m
CONFIG_MEDIA_TUNER_XC5000=m
CONFIG_MEDIA_TUNER_MC44S803=m
CONFIG_VIDEO_V4L2=m
CONFIG_VIDEO_V4L1=m
CONFIG_VIDEOBUF_GEN=m
CONFIG_VIDEOBUF_DMA_SG=m
CONFIG_VIDEO_BTCX=m
CONFIG_VIDEO_IR=m
CONFIG_VIDEO_TVEEPROM=m
CONFIG_VIDEO_TUNER=m
CONFIG_VIDEO_CAPTURE_DRIVERS=y
# CONFIG_VIDEO_ADV_DEBUG is not set
# CONFIG_VIDEO_FIXED_MINOR_RANGES is not set
# CONFIG_VIDEO_HELPER_CHIPS_AUTO is not set
CONFIG_VIDEO_IR_I2C=m

#
# Encoders/decoders and other helper chips
#

#
# Audio decoders
#
CONFIG_VIDEO_TVAUDIO=m
CONFIG_VIDEO_TDA7432=m
CONFIG_VIDEO_TDA9840=m
CONFIG_VIDEO_TDA9875=m
CONFIG_VIDEO_TEA6415C=m
CONFIG_VIDEO_TEA6420=m
CONFIG_VIDEO_MSP3400=m
# CONFIG_VIDEO_CS5345 is not set
CONFIG_VIDEO_CS53L32A=m
CONFIG_VIDEO_M52790=m
CONFIG_VIDEO_TLV320AIC23B=m
CONFIG_VIDEO_WM8775=m
CONFIG_VIDEO_WM8739=m
CONFIG_VIDEO_VP27SMPX=m

#
# RDS decoders
#
# CONFIG_VIDEO_SAA6588 is not set

#
# Video decoders
#
CONFIG_VIDEO_BT819=m
CONFIG_VIDEO_BT856=m
CONFIG_VIDEO_BT866=m
CONFIG_VIDEO_KS0127=m
CONFIG_VIDEO_OV7670=m
# CONFIG_VIDEO_TCM825X is not set
CONFIG_VIDEO_SAA7110=m
CONFIG_VIDEO_SAA711X=m
CONFIG_VIDEO_SAA717X=m
CONFIG_VIDEO_SAA7191=m
# CONFIG_VIDEO_TVP514X is not set
CONFIG_VIDEO_TVP5150=m
CONFIG_VIDEO_VPX3220=m

#
# Video and audio decoders
#
CONFIG_VIDEO_CX25840=m

#
# MPEG video encoders
#
CONFIG_VIDEO_CX2341X=m

#
# Video encoders
#
CONFIG_VIDEO_SAA7127=m
CONFIG_VIDEO_SAA7185=m
CONFIG_VIDEO_ADV7170=m
CONFIG_VIDEO_ADV7175=m

#
# Video improvement chips
#
CONFIG_VIDEO_UPD64031A=m
CONFIG_VIDEO_UPD64083=m
# CONFIG_VIDEO_VIVI is not set
CONFIG_VIDEO_BT848=m
# CONFIG_VIDEO_PMS is not set
# CONFIG_VIDEO_BWQCAM is not set
# CONFIG_VIDEO_CQCAM is not set
# CONFIG_VIDEO_W9966 is not set
CONFIG_VIDEO_CPIA=m
CONFIG_VIDEO_CPIA_PP=m
CONFIG_VIDEO_CPIA_USB=m
CONFIG_VIDEO_CPIA2=m
# CONFIG_VIDEO_SAA5246A is not set
# CONFIG_VIDEO_SAA5249 is not set
# CONFIG_VIDEO_STRADIS is not set
CONFIG_VIDEO_ZORAN=m
# CONFIG_VIDEO_ZORAN_DC30 is not set
CONFIG_VIDEO_ZORAN_ZR36060=m
CONFIG_VIDEO_ZORAN_BUZ=m
# CONFIG_VIDEO_ZORAN_DC10 is not set
CONFIG_VIDEO_ZORAN_LML33=m
# CONFIG_VIDEO_ZORAN_LML33R10 is not set
# CONFIG_VIDEO_ZORAN_AVS6EYES is not set
# CONFIG_VIDEO_SAA7134 is not set
# CONFIG_VIDEO_MXB is not set
# CONFIG_VIDEO_HEXIUM_ORION is not set
# CONFIG_VIDEO_HEXIUM_GEMINI is not set
# CONFIG_VIDEO_CX88 is not set
CONFIG_VIDEO_IVTV=m
# CONFIG_VIDEO_FB_IVTV is not set
# CONFIG_VIDEO_CAFE_CCIC is not set
# CONFIG_SOC_CAMERA is not set
# CONFIG_V4L_USB_DRIVERS is not set
CONFIG_RADIO_ADAPTERS=y
# CONFIG_RADIO_CADET is not set
# CONFIG_RADIO_RTRACK is not set
# CONFIG_RADIO_RTRACK2 is not set
# CONFIG_RADIO_AZTECH is not set
# CONFIG_RADIO_GEMTEK is not set
# CONFIG_RADIO_GEMTEK_PCI is not set
CONFIG_RADIO_MAXIRADIO=m
CONFIG_RADIO_MAESTRO=m
# CONFIG_RADIO_SF16FMI is not set
# CONFIG_RADIO_SF16FMR2 is not set
# CONFIG_RADIO_TERRATEC is not set
# CONFIG_RADIO_TRUST is not set
# CONFIG_RADIO_TYPHOON is not set
# CONFIG_RADIO_ZOLTRIX is not set
CONFIG_USB_DSBR=m
# CONFIG_USB_SI470X is not set
# CONFIG_USB_MR800 is not set
# CONFIG_RADIO_TEA5764 is not set
CONFIG_DAB=y
CONFIG_USB_DABUSB=m

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_ALI=y
CONFIG_AGP_ATI=y
# CONFIG_AGP_AMD is not set
# CONFIG_AGP_AMD64 is not set
CONFIG_AGP_INTEL=y
CONFIG_AGP_NVIDIA=y
CONFIG_AGP_SIS=y
# CONFIG_AGP_SWORKS is not set
CONFIG_AGP_VIA=y
CONFIG_AGP_EFFICEON=y
CONFIG_DRM=m
CONFIG_DRM_TDFX=m
CONFIG_DRM_R128=m
CONFIG_DRM_RADEON=m
CONFIG_DRM_I810=m
CONFIG_DRM_I830=m
CONFIG_DRM_I915=m
# CONFIG_DRM_I915_KMS is not set
# CONFIG_DRM_MGA is not set
CONFIG_DRM_SIS=m
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
CONFIG_VGASTATE=m
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DDC=m
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
CONFIG_FB_SVGALIB=m
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
CONFIG_FB_VESA=y
# CONFIG_FB_EFI is not set
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
# CONFIG_FB_NVIDIA_DEBUG is not set
CONFIG_FB_NVIDIA_BACKLIGHT=y
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I810 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_INTEL is not set
# CONFIG_FB_MATROX is not set
CONFIG_FB_RADEON=m
CONFIG_FB_RADEON_I2C=y
CONFIG_FB_RADEON_BACKLIGHT=y
# CONFIG_FB_RADEON_DEBUG is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
CONFIG_FB_S3=m
CONFIG_FB_SAVAGE=m
CONFIG_FB_SAVAGE_I2C=y
CONFIG_FB_SAVAGE_ACCEL=y
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
CONFIG_FB_TRIDENT=m
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_LCD_CLASS_DEVICE=m
# CONFIG_LCD_ILI9320 is not set
# CONFIG_LCD_PLATFORM is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_BACKLIGHT_GENERIC=y
CONFIG_BACKLIGHT_PROGEAR=m
# CONFIG_BACKLIGHT_MBP_NVIDIA is not set
# CONFIG_BACKLIGHT_SAHARA is not set

#
# Display device support
#
CONFIG_DISPLAY_SUPPORT=m

#
# Display hardware drivers
#

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
# CONFIG_MDA_CONSOLE is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_SOUND is not set
# CONFIG_HID_SUPPORT is not set
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
# CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_DEVICE_CLASS is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
CONFIG_USB_SUSPEND=y
# CONFIG_USB_OTG is not set
# CONFIG_USB_MON is not set
# CONFIG_USB_WUSB is not set
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_EHCI_HCD=m
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
CONFIG_USB_OHCI_HCD=m
# CONFIG_USB_OHCI_HCD_SSB is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=m
# CONFIG_USB_U132_HCD is not set
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_WHCI_HCD is not set
# CONFIG_USB_HWA_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
CONFIG_USB_STORAGE_DATAFAB=m
CONFIG_USB_STORAGE_FREECOM=m
# CONFIG_USB_STORAGE_ISD200 is not set
CONFIG_USB_STORAGE_USBAT=m
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
# CONFIG_USB_LIBUSUAL is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
# CONFIG_USB_USS720 is not set
CONFIG_USB_SERIAL=m
CONFIG_USB_EZUSB=y
CONFIG_USB_SERIAL_GENERIC=y
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_CH341 is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP210X is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
CONFIG_USB_SERIAL_EMPEG=m
# CONFIG_USB_SERIAL_FTDI_SIO is not set
# CONFIG_USB_SERIAL_FUNSOFT is not set
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_IUU is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
CONFIG_USB_SERIAL_KEYSPAN=m
# CONFIG_USB_SERIAL_KEYSPAN_MPR is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA28 is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA28X is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA28XA is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA28XB is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA19 is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA18X is not set
# CONFIG_USB_SERIAL_KEYSPAN_USA19W is not set
CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
CONFIG_USB_SERIAL_KEYSPAN_USA49W=y
CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_MOTOROLA is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
# CONFIG_USB_SERIAL_PL2303 is not set
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_QUALCOMM is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
# CONFIG_USB_SERIAL_HP4X is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIEMENS_MPI is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_SYMBOL is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_XIRCOM is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_OPTICON is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_BERRY_CHARGE is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
CONFIG_USB_FTDI_ELAN=m
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_VST is not set
# CONFIG_USB_GADGET is not set

#
# OTG and related infrastructure
#
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
# CONFIG_LEDS_ALIX2 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_LP5521 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_BD2802 is not set

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
CONFIG_LEDS_TRIGGER_TIMER=m
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
# CONFIG_EDAC is not set
# CONFIG_RTC_CLASS is not set
# CONFIG_DMADEVICES is not set
# CONFIG_AUXDISPLAY is not set
CONFIG_UIO=m
# CONFIG_UIO_CIF is not set
# CONFIG_UIO_PDRV is not set
# CONFIG_UIO_PDRV_GENIRQ is not set
# CONFIG_UIO_SMX is not set
# CONFIG_UIO_AEC is not set
# CONFIG_UIO_SERCOS3 is not set
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_TC1100_WMI is not set
# CONFIG_MSI_LAPTOP is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ACPI_WMI is not set
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_TOSHIBA is not set

#
# Firmware Drivers
#
CONFIG_EDD=m
# CONFIG_EDD_OFF is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_EFI_VARS=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
# CONFIG_ISCSI_IBFT_FIND is not set

#
# File systems
#
CONFIG_EXT2_FS=m
# CONFIG_EXT2_FS_XATTR is not set
CONFIG_EXT2_FS_XIP=y
CONFIG_EXT3_FS=m
# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=m
CONFIG_EXT4DEV_COMPAT=y
CONFIG_EXT4_FS_XATTR=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
CONFIG_FS_XIP=y
CONFIG_JBD=m
# CONFIG_JBD_DEBUG is not set
CONFIG_JBD2=m
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=m
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m
CONFIG_GENERIC_ACL=y

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="ascii"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=m
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_CRAMFS=m
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
CONFIG_ROMFS_FS=m
CONFIG_ROMFS_BACKED_BY_BLOCK=y
# CONFIG_ROMFS_BACKED_BY_MTD is not set
# CONFIG_ROMFS_BACKED_BY_BOTH is not set
CONFIG_ROMFS_ON_BLOCK=y
# CONFIG_SYSV_FS is not set
CONFIG_UFS_FS=m
# CONFIG_UFS_FS_WRITE is not set
# CONFIG_UFS_DEBUG is not set
# CONFIG_NILFS2_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFSD is not set
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_RPCSEC_GSS_KRB5=m
# CONFIG_RPCSEC_GSS_SPKM3 is not set
# CONFIG_SMB_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
# CONFIG_LDM_PARTITION is not set
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
# CONFIG_KARMA_PARTITION is not set
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
CONFIG_NLS_CODEPAGE_850=m
CONFIG_NLS_CODEPAGE_852=m
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
CONFIG_NLS_CODEPAGE_863=m
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
CONFIG_NLS_CODEPAGE_932=m
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
CONFIG_NLS_ISO8859_8=m
CONFIG_NLS_CODEPAGE_1250=m
CONFIG_NLS_CODEPAGE_1251=m
CONFIG_NLS_ASCII=y
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=m
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
# CONFIG_ENABLE_WARN_DEPRECATED is not set
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_FRAME_WARN=1024
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
CONFIG_HEADERS_CHECK=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SHIRQ=y
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_DETECT_HUNG_TASK=y
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_TIMER_STATS=y
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
CONFIG_DEBUG_PREEMPT=y
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
# CONFIG_LOCK_STAT is not set
CONFIG_DEBUG_LOCKDEP=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_HIGHMEM=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_LIST=y
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
CONFIG_SYSCTL_SYSCALL_CHECK=y
# CONFIG_DEBUG_PAGEALLOC is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FTRACE_NMI_ENTER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_FTRACE_SYSCALLS=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_RING_BUFFER=y
CONFIG_FTRACE_NMI_ENTER=y
CONFIG_TRACING=y
CONFIG_TRACING_SUPPORT=y

#
# Tracers
#
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_PREEMPT_TRACER=y
CONFIG_SYSPROF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_EVENT_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_BOOT_TRACER=y
# CONFIG_TRACE_BRANCH_PROFILING is not set
CONFIG_POWER_TRACER=y
CONFIG_STACK_TRACER=y
# CONFIG_KMEMTRACE is not set
CONFIG_WORKQUEUE_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
CONFIG_MMIOTRACE=y
CONFIG_MMIOTRACE_TEST=m
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_FIREWIRE_OHCI_REMOTE_DMA is not set
# CONFIG_BUILD_DOCSRC is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_DMA_API_DEBUG is not set
CONFIG_SAMPLES=y
# CONFIG_SAMPLE_MARKERS is not set
# CONFIG_SAMPLE_TRACEPOINTS is not set
CONFIG_SAMPLE_KOBJECT=m
CONFIG_SAMPLE_KPROBES=m
CONFIG_SAMPLE_KRETPROBES=m
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
# CONFIG_STRICT_DEVMEM is not set
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_X86_PTDUMP is not set
CONFIG_DEBUG_RODATA=y
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_DEBUG_NX_TEST is not set
CONFIG_4KSTACKS=y
CONFIG_DOUBLEFAULT=y
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
# CONFIG_OPTIMIZE_INLINING is not set

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set
# CONFIG_SECURITYFS is not set
# CONFIG_SECURITY_FILE_CAPABILITIES is not set
# CONFIG_IMA is not set
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
# CONFIG_CRYPTO_FIPS is not set
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=m
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_GF128MUL is not set
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_WORKQUEUE=y
# CONFIG_CRYPTO_CRYPTD is not set
# CONFIG_CRYPTO_AUTHENC is not set
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=m
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
# CONFIG_CRYPTO_HMAC is not set
# CONFIG_CRYPTO_XCBC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32C_INTEL is not set
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=m
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=m
# CONFIG_CRYPTO_AES_586 is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
CONFIG_CRYPTO_DES=m
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_586 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_586 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_ZLIB is not set
# CONFIG_CRYPTO_LZO is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_HW is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
# CONFIG_VIRTUALIZATION is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
# CONFIG_CRC_T10DIF is not set
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
# CONFIG_CRC7 is not set
# CONFIG_LIBCRC32C is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=m
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_NLATTR=y

[-- Attachment #3: dmesg.txt --]
[-- Type: text/plain, Size: 90538 bytes --]

Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.30-rc4-io (root@localhost.localdomain) (gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)) #6 SMP PREEMPT Thu May 7 11:07:49 CST 2009
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  NSC Geode by NSC
  Cyrix CyrixInstead
  Centaur CentaurHauls
  Transmeta GenuineTMx86
  Transmeta TransmetaCPU
  UMC UMC UMC UMC
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
 BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003bff0000 (usable)
 BIOS-e820: 000000003bff0000 - 000000003bff3000 (ACPI NVS)
 BIOS-e820: 000000003bff3000 - 000000003c000000 (ACPI data)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
DMI 2.3 present.
Phoenix BIOS detected: BIOS may corrupt low RAM, working around it.
e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
last_pfn = 0x3bff0 max_arch_pfn = 0x100000
MTRR default type: uncachable
MTRR fixed ranges enabled:
  00000-9FFFF write-back
  A0000-BFFFF uncachable
  C0000-C7FFF write-protect
  C8000-FFFFF uncachable
MTRR variable ranges enabled:
  0 base 000000000 mask FC0000000 write-back
  1 base 03C000000 mask FFC000000 uncachable
  2 base 0D0000000 mask FF8000000 write-combining
  3 disabled
  4 disabled
  5 disabled
  6 disabled
  7 disabled
init_memory_mapping: 0000000000000000-00000000377fe000
 0000000000 - 0000400000 page 4k
 0000400000 - 0037400000 page 2M
 0037400000 - 00377fe000 page 4k
kernel direct mapping tables up to 377fe000 @ 10000-15000
RAMDISK: 37d0d000 - 37fefd69
Allocated new RAMDISK: 00100000 - 003e2d69
Move RAMDISK from 0000000037d0d000 - 0000000037fefd68 to 00100000 - 003e2d68
ACPI: RSDP 000f7560 00014 (v00 AWARD )
ACPI: RSDT 3bff3040 0002C (v01 AWARD  AWRDACPI 42302E31 AWRD 00000000)
ACPI: FACP 3bff30c0 00074 (v01 AWARD  AWRDACPI 42302E31 AWRD 00000000)
ACPI: DSDT 3bff3180 03ABC (v01 AWARD  AWRDACPI 00001000 MSFT 0100000E)
ACPI: FACS 3bff0000 00040
ACPI: APIC 3bff6c80 00084 (v01 AWARD  AWRDACPI 42302E31 AWRD 00000000)
ACPI: Local APIC address 0xfee00000
71MB HIGHMEM available.
887MB LOWMEM available.
  mapped low ram: 0 - 377fe000
  low ram: 0 - 377fe000
  node 0 low ram: 00000000 - 377fe000
  node 0 bootmap 00011000 - 00017f00
(9 early reservations) ==> bootmem [0000000000 - 00377fe000]
  #0 [0000000000 - 0000001000]   BIOS data page ==> [0000000000 - 0000001000]
  #1 [0000001000 - 0000002000]    EX TRAMPOLINE ==> [0000001000 - 0000002000]
  #2 [0000006000 - 0000007000]       TRAMPOLINE ==> [0000006000 - 0000007000]
  #3 [0000400000 - 0000c6bd1c]    TEXT DATA BSS ==> [0000400000 - 0000c6bd1c]
  #4 [000009f400 - 0000100000]    BIOS reserved ==> [000009f400 - 0000100000]
  #5 [0000c6c000 - 0000c700ed]              BRK ==> [0000c6c000 - 0000c700ed]
  #6 [0000010000 - 0000011000]          PGTABLE ==> [0000010000 - 0000011000]
  #7 [0000100000 - 00003e2d69]      NEW RAMDISK ==> [0000100000 - 00003e2d69]
  #8 [0000011000 - 0000018000]          BOOTMAP ==> [0000011000 - 0000018000]
found SMP MP-table at [c00f5ad0] f5ad0
Zone PFN ranges:
  DMA      0x00000010 -> 0x00001000
  Normal   0x00001000 -> 0x000377fe
  HighMem  0x000377fe -> 0x0003bff0
Movable zone start PFN for each node
early_node_map[2] active PFN ranges
    0: 0x00000010 -> 0x0000009f
    0: 0x00000100 -> 0x0003bff0
On node 0 totalpages: 245631
free_area_init_node: node 0, pgdat c0778f80, node_mem_map c1000340
  DMA zone: 52 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 3931 pages, LIFO batch:0
  Normal zone: 2834 pages used for memmap
  Normal zone: 220396 pages, LIFO batch:31
  HighMem zone: 234 pages used for memmap
  HighMem zone: 18184 pages, LIFO batch:3
Using APIC driver default
ACPI: PM-Timer IO Port: 0x1008
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] disabled)
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] disabled)
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 4, version 17, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 dfl dfl)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
SMP: Allowing 4 CPUs, 2 hotplug CPUs
nr_irqs_gsi: 24
Allocating PCI resources starting at 40000000 (gap: 3c000000:c2c00000)
NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:4 nr_node_ids:1
PERCPU: Embedded 13 pages at c1c3b000, static data 32756 bytes
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 242511
Kernel command line: ro root=LABEL=/ rhgb quiet
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
Preemptible RCU implementation.
NR_IRQS:512
CPU 0 irqstacks, hard=c1c3b000 soft=c1c3c000
PID hash table entries: 4096 (order: 12, 16384 bytes)
Fast TSC calibration using PIT
Detected 2800.222 MHz processor.
Console: colour VGA+ 80x25
console [tty0] enabled
Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
... MAX_LOCKDEP_SUBCLASSES:  8
... MAX_LOCK_DEPTH:          48
... MAX_LOCKDEP_KEYS:        8191
... CLASSHASH_SIZE:          4096
... MAX_LOCKDEP_ENTRIES:     8192
... MAX_LOCKDEP_CHAINS:      16384
... CHAINHASH_SIZE:          8192
 memory used by lock dependency info: 2847 kB
 per task-struct memory footprint: 1152 bytes
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
allocated 4914560 bytes of page_cgroup
please try cgroup_disable=memory,blkio option if you don't want
Initializing HighMem for node 0 (000377fe:0003bff0)
Memory: 952284k/982976k available (2258k kernel code, 30016k reserved, 1424k data, 320k init, 73672k highmem)
virtual kernel memory layout:
    fixmap  : 0xffedf000 - 0xfffff000   (1152 kB)
    pkmap   : 0xff800000 - 0xffc00000   (4096 kB)
    vmalloc : 0xf7ffe000 - 0xff7fe000   ( 120 MB)
    lowmem  : 0xc0000000 - 0xf77fe000   ( 887 MB)
      .init : 0xc079d000 - 0xc07ed000   ( 320 kB)
      .data : 0xc06349ab - 0xc0798cb8   (1424 kB)
      .text : 0xc0400000 - 0xc06349ab   (2258 kB)
Checking if this processor honours the WP bit even in supervisor mode...Ok.
SLUB: Genslabs=13, HWalign=128, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
Calibrating delay loop (skipped), value calculated using timer frequency.. 5600.44 BogoMIPS (lpj=2800222)
Mount-cache hash table entries: 512
Initializing cgroup subsys debug
Initializing cgroup subsys ns
Initializing cgroup subsys cpuacct
Initializing cgroup subsys memory
Initializing cgroup subsys blkio
Initializing cgroup subsys devices
Initializing cgroup subsys freezer
Initializing cgroup subsys net_cls
Initializing cgroup subsys io
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU0: Intel P4/Xeon Extended MCE MSRs (24) available
using mwait in idle threads.
Checking 'hlt' instruction... OK.
ACPI: Core revision 20090320
ftrace: converting mcount calls to 0f 1f 44 00 00
ftrace: allocating 12136 entries in 24 pages
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel(R) Pentium(R) D CPU 2.80GHz stepping 04
lockdep: fixing up alternatives.
CPU 1 irqstacks, hard=c1c4b000 soft=c1c4c000
Booting processor 1 APIC 0x1 ip 0x6000
Initializing CPU#1
Calibrating delay using timer specific routine.. 5599.23 BogoMIPS (lpj=2799617)
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel P4/Xeon Extended MCE MSRs (24) available
CPU1: Intel(R) Pentium(R) D CPU 2.80GHz stepping 04
checking TSC synchronization [CPU#0 -> CPU#1]: passed.
Brought up 2 CPUs
Total of 2 processors activated (11199.67 BogoMIPS).
CPU0 attaching sched-domain:
 domain 0: span 0-1 level CPU
  groups: 0 1
CPU1 attaching sched-domain:
 domain 0: span 0-1 level CPU
  groups: 1 0
net_namespace: 436 bytes
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xfbda0, last bus=1
PCI: Using configuration type 1 for base access
mtrr: your CPUs had inconsistent fixed MTRR settings
mtrr: probably your BIOS does not setup all CPUs.
mtrr: corrected configuration.
bio: create slab <bio-0> at 0
ACPI: EC: Look up EC in DSDT
ACPI: Interpreter enabled
ACPI: (supports S0 S3 S5)
ACPI: Using IOAPIC for interrupt routing
ACPI: No dock devices found.
ACPI: PCI Root Bridge [PCI0] (0000:00)
pci 0000:00:00.0: reg 10 32bit mmio: [0xd0000000-0xd7ffffff]
pci 0000:00:02.5: reg 10 io port: [0x1f0-0x1f7]
pci 0000:00:02.5: reg 14 io port: [0x3f4-0x3f7]
pci 0000:00:02.5: reg 18 io port: [0x170-0x177]
pci 0000:00:02.5: reg 1c io port: [0x374-0x377]
pci 0000:00:02.5: reg 20 io port: [0x4000-0x400f]
pci 0000:00:02.5: PME# supported from D3cold
pci 0000:00:02.5: PME# disabled
pci 0000:00:02.7: reg 10 io port: [0xd000-0xd0ff]
pci 0000:00:02.7: reg 14 io port: [0xd400-0xd47f]
pci 0000:00:02.7: supports D1 D2
pci 0000:00:02.7: PME# supported from D3hot D3cold
pci 0000:00:02.7: PME# disabled
pci 0000:00:03.0: reg 10 32bit mmio: [0xe1104000-0xe1104fff]
pci 0000:00:03.1: reg 10 32bit mmio: [0xe1100000-0xe1100fff]
pci 0000:00:03.2: reg 10 32bit mmio: [0xe1101000-0xe1101fff]
pci 0000:00:03.3: reg 10 32bit mmio: [0xe1102000-0xe1102fff]
pci 0000:00:03.3: PME# supported from D0 D3hot D3cold
pci 0000:00:03.3: PME# disabled
pci 0000:00:05.0: reg 10 io port: [0xd800-0xd807]
pci 0000:00:05.0: reg 14 io port: [0xdc00-0xdc03]
pci 0000:00:05.0: reg 18 io port: [0xe000-0xe007]
pci 0000:00:05.0: reg 1c io port: [0xe400-0xe403]
pci 0000:00:05.0: reg 20 io port: [0xe800-0xe80f]
pci 0000:00:05.0: PME# supported from D3cold
pci 0000:00:05.0: PME# disabled
pci 0000:00:0e.0: reg 10 io port: [0xec00-0xecff]
pci 0000:00:0e.0: reg 14 32bit mmio: [0xe1103000-0xe11030ff]
pci 0000:00:0e.0: reg 30 32bit mmio: [0x000000-0x01ffff]
pci 0000:00:0e.0: supports D1 D2
pci 0000:00:0e.0: PME# supported from D1 D2 D3hot D3cold
pci 0000:00:0e.0: PME# disabled
pci 0000:01:00.0: reg 10 32bit mmio: [0xd8000000-0xdfffffff]
pci 0000:01:00.0: reg 14 32bit mmio: [0xe1000000-0xe101ffff]
pci 0000:01:00.0: reg 18 io port: [0xc000-0xc07f]
pci 0000:01:00.0: supports D1 D2
pci 0000:00:01.0: bridge io port: [0xc000-0xcfff]
pci 0000:00:01.0: bridge 32bit mmio: [0xe1000000-0xe10fffff]
pci 0000:00:01.0: bridge 32bit mmio pref: [0xd8000000-0xdfffffff]
pci_bus 0000:00: on NUMA node 0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 10 *11 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 *10 11 14 15)
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 11 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 9 10 *11 14 15)
ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 *6 7 9 10 11 14 15)
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 *9 10 11 14 15)
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 *5 6 7 9 10 11 14 15)
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
PCI: Using ACPI for IRQ routing
pnp: PnP ACPI init
ACPI: bus type pnp registered
pnp: PnP ACPI: found 12 devices
ACPI: ACPI bus type pnp unregistered
system 00:00: iomem range 0xc8000-0xcbfff has been reserved
system 00:00: iomem range 0xf0000-0xf7fff could not be reserved
system 00:00: iomem range 0xf8000-0xfbfff could not be reserved
system 00:00: iomem range 0xfc000-0xfffff could not be reserved
system 00:00: iomem range 0x3bff0000-0x3bffffff could not be reserved
system 00:00: iomem range 0xffff0000-0xffffffff has been reserved
system 00:00: iomem range 0x0-0x9ffff could not be reserved
system 00:00: iomem range 0x100000-0x3bfeffff could not be reserved
system 00:00: iomem range 0xffee0000-0xffefffff has been reserved
system 00:00: iomem range 0xfffe0000-0xfffeffff has been reserved
system 00:00: iomem range 0xfec00000-0xfecfffff has been reserved
system 00:00: iomem range 0xfee00000-0xfeefffff has been reserved
system 00:02: ioport range 0x4d0-0x4d1 has been reserved
system 00:02: ioport range 0x800-0x805 has been reserved
system 00:02: ioport range 0x290-0x297 has been reserved
system 00:02: ioport range 0x880-0x88f has been reserved
pci 0000:00:01.0: PCI bridge, secondary bus 0000:01
pci 0000:00:01.0:   IO window: 0xc000-0xcfff
pci 0000:00:01.0:   MEM window: 0xe1000000-0xe10fffff
pci 0000:00:01.0:   PREFETCH window: 0x000000d8000000-0x000000dfffffff
pci_bus 0000:00: resource 0 io:  [0x00-0xffff]
pci_bus 0000:00: resource 1 mem: [0x000000-0xffffffff]
pci_bus 0000:01: resource 0 io:  [0xc000-0xcfff]
pci_bus 0000:01: resource 1 mem: [0xe1000000-0xe10fffff]
pci_bus 0000:01: resource 2 pref mem [0xd8000000-0xdfffffff]
NET: Registered protocol family 2
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
TCP bind hash table entries: 65536 (order: 9, 2097152 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
NET: Registered protocol family 1
checking if image is initramfs...
rootfs image is initramfs; unpacking...
Freeing initrd memory: 2955k freed
apm: BIOS version 1.2 Flags 0x07 (Driver version 1.16ac)
apm: disabled - APM is not SMP safe.
highmem bounce pool size: 64 pages
HugeTLB registered 4 MB page size, pre-allocated 0 pages
msgmni has been set to 1722
alg: No test for stdrng (krng)
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254)
io scheduler noop registered
io scheduler cfq registered (default)
pci 0000:01:00.0: Boot video device
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
fan PNP0C0B:00: registered as cooling_device0
ACPI: Fan [FAN] (on)
processor ACPI_CPU:00: registered as cooling_device1
processor ACPI_CPU:01: registered as cooling_device2
thermal LNXTHERM:01: registered as thermal_zone0
ACPI: Thermal Zone [THRM] (62 C)
isapnp: Scanning for PnP cards...
Switched to high resolution mode on CPU 1
Switched to high resolution mode on CPU 0
isapnp: No Plug & Play device found
Real Time Clock Driver v1.12b
Non-volatile memory driver v1.3
Linux agpgart interface v0.103
agpgart-sis 0000:00:00.0: SiS chipset [1039/0661]
agpgart-sis 0000:00:00.0: AGP aperture is 128M @ 0xd0000000
Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
00:07: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
00:08: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
brd: module loaded
PNP: PS/2 Controller [PNP0303:PS2K,PNP0f13:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
cpuidle: using governor ladder
cpuidle: using governor menu
TCP cubic registered
NET: Registered protocol family 17
Using IPI No-Shortcut mode
registered taskstats version 1
Freeing unused kernel memory: 320k freed
Write protecting the kernel text: 2260k
Write protecting the kernel read-only data: 1120k
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
ehci_hcd 0000:00:03.3: PCI INT D -> GSI 23 (level, low) -> IRQ 23
ehci_hcd 0000:00:03.3: EHCI Host Controller
ehci_hcd 0000:00:03.3: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:03.3: cache line size of 128 is not supported
ehci_hcd 0000:00:03.3: irq 23, io mem 0xe1102000
ehci_hcd 0000:00:03.3: USB 2.0 started, EHCI 1.00
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 8 ports detected
ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
ohci_hcd 0000:00:03.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
ohci_hcd 0000:00:03.0: OHCI Host Controller
ohci_hcd 0000:00:03.0: new USB bus registered, assigned bus number 2
ohci_hcd 0000:00:03.0: irq 20, io mem 0xe1104000
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 3 ports detected
ohci_hcd 0000:00:03.1: PCI INT B -> GSI 21 (level, low) -> IRQ 21
ohci_hcd 0000:00:03.1: OHCI Host Controller
ohci_hcd 0000:00:03.1: new USB bus registered, assigned bus number 3
ohci_hcd 0000:00:03.1: irq 21, io mem 0xe1100000
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 3 ports detected
ohci_hcd 0000:00:03.2: PCI INT C -> GSI 22 (level, low) -> IRQ 22
ohci_hcd 0000:00:03.2: OHCI Host Controller
ohci_hcd 0000:00:03.2: new USB bus registered, assigned bus number 4
ohci_hcd 0000:00:03.2: irq 22, io mem 0xe1101000
usb usb4: configuration #1 chosen from 1 choice
hub 4-0:1.0: USB hub found
hub 4-0:1.0: 2 ports detected
uhci_hcd: USB Universal Host Controller Interface driver
SCSI subsystem initialized
Driver 'sd' needs updating - please use bus_type methods
libata version 3.00 loaded.
pata_sis 0000:00:02.5: version 0.5.2
pata_sis 0000:00:02.5: PCI INT A -> GSI 16 (level, low) -> IRQ 16
scsi0 : pata_sis
scsi1 : pata_sis
ata1: PATA max UDMA/133 cmd 0x1f0 ctl 0x3f6 bmdma 0x4000 irq 14
ata2: PATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0x4008 irq 15
input: ImPS/2 Logitech Wheel Mouse as /class/input/input0
input: AT Translated Set 2 keyboard as /class/input/input1
sata_sis 0000:00:05.0: version 1.0
sata_sis 0000:00:05.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
sata_sis 0000:00:05.0: Detected SiS 180/181/964 chipset in SATA mode
scsi2 : sata_sis
scsi3 : sata_sis
ata3: SATA max UDMA/133 cmd 0xd800 ctl 0xdc00 bmdma 0xe800 irq 17
ata4: SATA max UDMA/133 cmd 0xe000 ctl 0xe400 bmdma 0xe808 irq 17
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: ATA-7: ST3808110AS, 3.AAE, max UDMA/133
ata3.00: 156301488 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata3.00: configured for UDMA/133
scsi 2:0:0:0: Direct-Access     ATA      ST3808110AS      3.AA PQ: 0 ANSI: 5
sd 2:0:0:0: [sda] 156301488 512-byte hardware sectors: (80.0 GB/74.5 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 >
sd 2:0:0:0: [sda] Attached SCSI disk
ata4: SATA link down (SStatus 0 SControl 300)
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: sda8: orphan cleanup on readonly fs
ext3_orphan_cleanup: deleting unreferenced inode 3725366
ext3_orphan_cleanup: deleting unreferenced inode 3725365
ext3_orphan_cleanup: deleting unreferenced inode 3725364
EXT3-fs: sda8: 3 orphan inodes deleted
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with writeback data mode.
r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
r8169 0000:00:0e.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
r8169 0000:00:0e.0: no PCI Express capability
eth0: RTL8110s at 0xf8236000, 00:16:ec:2e:b7:e0, XID 04000000 IRQ 18
sd 2:0:0:0: Attached scsi generic sg0 type 0
parport_pc 00:09: reported by Plug and Play ACPI
parport0: PC-style at 0x378 (0x778), irq 7 [PCSPP,TRISTATE]
input: Power Button as /class/input/input2
ACPI: Power Button [PWRF]
input: Power Button as /class/input/input3
ACPI: Power Button [PWRB]
input: Sleep Button as /class/input/input4
ACPI: Sleep Button [FUTS]
ramfs: bad mount option: maxsize=512
EXT3 FS on sda8, internal journal
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda7, internal journal
EXT3-fs: mounted filesystem with writeback data mode.
Adding 1052216k swap on /dev/sda6.  Priority:-1 extents:1 across:1052216k 
warning: process `kudzu' used the deprecated sysctl system call with 1.23.
kudzu[1133] general protection ip:8056968 sp:bffe9e90 error:0
r8169: eth0: link up
r8169: eth0: link up
warning: `dbus-daemon' uses 32-bit capabilities (legacy support in use)
CPU0 attaching NULL sched-domain.
CPU1 attaching NULL sched-domain.
CPU0 attaching sched-domain:
 domain 0: span 0-1 level CPU
  groups: 0 1
CPU1 attaching sched-domain:
 domain 0: span 0-1 level CPU
  groups: 1 0

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.30-rc4-io #6
---------------------------------------------------------
rmdir/2186 just changed the state of lock:
 (&iocg->lock){+.+...}, at: [<c0513b18>] iocg_destroy+0x2a/0x118
but this lock was taken by another, SOFTIRQ-safe lock in the past:
 (&q->__queue_lock){..-...}

and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
3 locks held by rmdir/2186:
 #0:  (&sb->s_type->i_mutex_key#10/1){+.+.+.}, at: [<c04ae1e8>] do_rmdir+0x5c/0xc8
 #1:  (cgroup_mutex){+.+.+.}, at: [<c045a15b>] cgroup_diput+0x3c/0xa7
 #2:  (&iocg->lock){+.+...}, at: [<c0513b18>] iocg_destroy+0x2a/0x118

the first lock's dependencies:
-> (&iocg->lock){+.+...} ops: 3 {
   HARDIRQ-ON-W at:
                        [<c044b840>] mark_held_locks+0x3d/0x58
                        [<c044b963>] trace_hardirqs_on_caller+0x108/0x14c
                        [<c044b9b2>] trace_hardirqs_on+0xb/0xd
                        [<c0630883>] _spin_unlock_irq+0x27/0x47
                        [<c0513baa>] iocg_destroy+0xbc/0x118
                        [<c045a16a>] cgroup_diput+0x4b/0xa7
                        [<c04b1dbb>] dentry_iput+0x78/0x9c
                        [<c04b1e82>] d_kill+0x21/0x3b
                        [<c04b2f2a>] dput+0xf3/0xfc
                        [<c04ae226>] do_rmdir+0x9a/0xc8
                        [<c04ae29d>] sys_rmdir+0x15/0x17
                        [<c0402a68>] sysenter_do_call+0x12/0x36
                        [<ffffffff>] 0xffffffff
   SOFTIRQ-ON-W at:
                        [<c044b840>] mark_held_locks+0x3d/0x58
                        [<c044b97c>] trace_hardirqs_on_caller+0x121/0x14c
                        [<c044b9b2>] trace_hardirqs_on+0xb/0xd
                        [<c0630883>] _spin_unlock_irq+0x27/0x47
                        [<c0513baa>] iocg_destroy+0xbc/0x118
                        [<c045a16a>] cgroup_diput+0x4b/0xa7
                        [<c04b1dbb>] dentry_iput+0x78/0x9c
                        [<c04b1e82>] d_kill+0x21/0x3b
                        [<c04b2f2a>] dput+0xf3/0xfc
                        [<c04ae226>] do_rmdir+0x9a/0xc8
                        [<c04ae29d>] sys_rmdir+0x15/0x17
                        [<c0402a68>] sysenter_do_call+0x12/0x36
                        [<ffffffff>] 0xffffffff
   INITIAL USE at:
                       [<c044dad5>] __lock_acquire+0x58c/0x73e
                       [<c044dd36>] lock_acquire+0xaf/0xcc
                       [<c06304ea>] _spin_lock_irq+0x30/0x3f
                       [<c05119bd>] io_alloc_root_group+0x104/0x155
                       [<c05133cb>] elv_init_fq_data+0x32/0xe0
                       [<c0504317>] elevator_alloc+0x150/0x170
                       [<c0505393>] elevator_init+0x9d/0x100
                       [<c0507088>] blk_init_queue_node+0xc4/0xf7
                       [<c05070cb>] blk_init_queue+0x10/0x12
                       [<f81060fd>] __scsi_alloc_queue+0x1c/0xba [scsi_mod]
                       [<f81061b0>] scsi_alloc_queue+0x15/0x4e [scsi_mod]
                       [<f810803d>] scsi_alloc_sdev+0x154/0x1f5 [scsi_mod]
                       [<f8108387>] scsi_probe_and_add_lun+0x123/0xb5b [scsi_mod]
                       [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
                       [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
                       [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
                       [<c044341f>] async_thread+0xe9/0x1c9
                       [<c043e204>] kthread+0x4a/0x72
                       [<c04034e7>] kernel_thread_helper+0x7/0x10
                       [<ffffffff>] 0xffffffff
 }
 ... key      at: [<c0c5ebd8>] __key.29462+0x0/0x8

the second lock's dependencies:
-> (&q->__queue_lock){..-...} ops: 162810 {
   IN-SOFTIRQ-W at:
                        [<c044da08>] __lock_acquire+0x4bf/0x73e
                        [<c044dd36>] lock_acquire+0xaf/0xcc
                        [<c0630340>] _spin_lock+0x2a/0x39
                        [<f810672c>] scsi_device_unbusy+0x78/0x92 [scsi_mod]
                        [<f8101483>] scsi_finish_command+0x22/0xd4 [scsi_mod]
                        [<f8106fdb>] scsi_softirq_done+0xf9/0x101 [scsi_mod]
                        [<c050a936>] blk_done_softirq+0x5e/0x70
                        [<c0431379>] __do_softirq+0xb8/0x180
                        [<ffffffff>] 0xffffffff
   INITIAL USE at:
                       [<c044dad5>] __lock_acquire+0x58c/0x73e
                       [<c044dd36>] lock_acquire+0xaf/0xcc
                       [<c063056b>] _spin_lock_irqsave+0x33/0x43
                       [<f8101337>] scsi_adjust_queue_depth+0x2a/0xc9 [scsi_mod]
                       [<f8108079>] scsi_alloc_sdev+0x190/0x1f5 [scsi_mod]
                       [<f8108387>] scsi_probe_and_add_lun+0x123/0xb5b [scsi_mod]
                       [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
                       [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
                       [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
                       [<c044341f>] async_thread+0xe9/0x1c9
                       [<c043e204>] kthread+0x4a/0x72
                       [<c04034e7>] kernel_thread_helper+0x7/0x10
                       [<ffffffff>] 0xffffffff
 }
 ... key      at: [<c0c5e698>] __key.29749+0x0/0x8
 -> (&ioc->lock){..-...} ops: 1032 {
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c063056b>] _spin_lock_irqsave+0x33/0x43
                          [<c050f0f0>] cic_free_func+0x26/0x64
                          [<c050ea90>] __call_for_each_cic+0x23/0x2e
                          [<c050eaad>] cfq_free_io_context+0x12/0x14
                          [<c050978c>] put_io_context+0x4b/0x66
                          [<c050f2a2>] cfq_put_request+0x42/0x5b
                          [<c0504629>] elv_put_request+0x30/0x33
                          [<c050678d>] __blk_put_request+0x8b/0xb8
                          [<c0506953>] end_that_request_last+0x199/0x1a1
                          [<c0506a0d>] blk_end_io+0x51/0x6f
                          [<c0506a64>] blk_end_request+0x11/0x13
                          [<f8106c9c>] scsi_io_completion+0x1d9/0x41f [scsi_mod]
                          [<f810152d>] scsi_finish_command+0xcc/0xd4 [scsi_mod]
                          [<f8106fdb>] scsi_softirq_done+0xf9/0x101 [scsi_mod]
                          [<c050a936>] blk_done_softirq+0x5e/0x70
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c063056b>] _spin_lock_irqsave+0x33/0x43
                         [<c050f9bf>] cfq_set_request+0x123/0x33d
                         [<c05052e6>] elv_set_request+0x43/0x53
                         [<c0506d44>] get_request+0x22e/0x33f
                         [<c0507498>] get_request_wait+0x137/0x15d
                         [<c0507501>] blk_get_request+0x43/0x73
                         [<f8106854>] scsi_execute+0x24/0x11c [scsi_mod]
                         [<f81069ff>] scsi_execute_req+0xb3/0x104 [scsi_mod]
                         [<f81084f8>] scsi_probe_and_add_lun+0x294/0xb5b [scsi_mod]
                         [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
                         [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
                         [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
                         [<c044341f>] async_thread+0xe9/0x1c9
                         [<c043e204>] kthread+0x4a/0x72
                         [<c04034e7>] kernel_thread_helper+0x7/0x10
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c0c5e6ec>] __key.27747+0x0/0x8
  -> (&rdp->lock){-.-...} ops: 168014 {
     IN-HARDIRQ-W at:
                            [<c044d9e4>] __lock_acquire+0x49b/0x73e
                            [<c044dd36>] lock_acquire+0xaf/0xcc
                            [<c063056b>] _spin_lock_irqsave+0x33/0x43
                            [<c0461b2a>] rcu_check_callbacks+0x6a/0xa3
                            [<c043549a>] update_process_times+0x3d/0x53
                            [<c0447fe0>] tick_periodic+0x6b/0x77
                            [<c0448009>] tick_handle_periodic+0x1d/0x60
                            [<c063406e>] smp_apic_timer_interrupt+0x6e/0x81
                            [<c04033c7>] apic_timer_interrupt+0x2f/0x34
                            [<c042fbd7>] do_exit+0x53e/0x5b3
                            [<c043a9d8>] __request_module+0x0/0x100
                            [<c04034e7>] kernel_thread_helper+0x7/0x10
                            [<ffffffff>] 0xffffffff
     IN-SOFTIRQ-W at:
                            [<c044da08>] __lock_acquire+0x4bf/0x73e
                            [<c044dd36>] lock_acquire+0xaf/0xcc
                            [<c0630340>] _spin_lock+0x2a/0x39
                            [<c04619db>] rcu_process_callbacks+0x2b/0x86
                            [<c0431379>] __do_softirq+0xb8/0x180
                            [<ffffffff>] 0xffffffff
     INITIAL USE at:
                           [<c044dad5>] __lock_acquire+0x58c/0x73e
                           [<c044dd36>] lock_acquire+0xaf/0xcc
                           [<c063056b>] _spin_lock_irqsave+0x33/0x43
                           [<c062c8ca>] rcu_online_cpu+0x3d/0x51
                           [<c062c910>] rcu_cpu_notify+0x32/0x43
                           [<c07b097f>] __rcu_init+0xf0/0x120
                           [<c07af027>] rcu_init+0x8/0x14
                           [<c079d6e1>] start_kernel+0x187/0x2fc
                           [<c079d06a>] __init_begin+0x6a/0x6f
                           [<ffffffff>] 0xffffffff
   }
   ... key      at: [<c0c2e52c>] __key.17543+0x0/0x8
  ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c046143d>] call_rcu+0x36/0x5b
   [<c0517b45>] radix_tree_delete+0xe7/0x176
   [<c050f0fe>] cic_free_func+0x34/0x64
   [<c050ea90>] __call_for_each_cic+0x23/0x2e
   [<c050eaad>] cfq_free_io_context+0x12/0x14
   [<c050978c>] put_io_context+0x4b/0x66
   [<c050984c>] exit_io_context+0x77/0x7b
   [<c042fc24>] do_exit+0x58b/0x5b3
   [<c04034ed>] kernel_thread_helper+0xd/0x10
   [<ffffffff>] 0xffffffff

 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c050f4a3>] cfq_cic_lookup+0xd9/0xef
   [<c050f674>] cfq_get_queue+0x92/0x2ba
   [<c050fb01>] cfq_set_request+0x265/0x33d
   [<c05052e6>] elv_set_request+0x43/0x53
   [<c0506d44>] get_request+0x22e/0x33f
   [<c0507498>] get_request_wait+0x137/0x15d
   [<c0507501>] blk_get_request+0x43/0x73
   [<f8106854>] scsi_execute+0x24/0x11c [scsi_mod]
   [<f81069ff>] scsi_execute_req+0xb3/0x104 [scsi_mod]
   [<f81084f8>] scsi_probe_and_add_lun+0x294/0xb5b [scsi_mod]
   [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
   [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
   [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
   [<c044341f>] async_thread+0xe9/0x1c9
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

 -> (&base->lock){..-...} ops: 348073 {
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c06304ea>] _spin_lock_irq+0x30/0x3f
                          [<c0434b8b>] run_timer_softirq+0x3c/0x1d1
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c063056b>] _spin_lock_irqsave+0x33/0x43
                         [<c0434e84>] lock_timer_base+0x24/0x43
                         [<c0434f3d>] mod_timer+0x46/0xcc
                         [<c07bd97a>] con_init+0xa4/0x20e
                         [<c07bd3b2>] console_init+0x12/0x20
                         [<c079d735>] start_kernel+0x1db/0x2fc
                         [<c079d06a>] __init_begin+0x6a/0x6f
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c082304c>] __key.23401+0x0/0x8
 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c0434e84>] lock_timer_base+0x24/0x43
   [<c0434f3d>] mod_timer+0x46/0xcc
   [<c05075cb>] blk_plug_device+0x9a/0xdf
   [<c05049e1>] __elv_add_request+0x86/0x96
   [<c0509d52>] blk_execute_rq_nowait+0x5d/0x86
   [<c0509e2e>] blk_execute_rq+0xb3/0xd5
   [<f81068f5>] scsi_execute+0xc5/0x11c [scsi_mod]
   [<f81069ff>] scsi_execute_req+0xb3/0x104 [scsi_mod]
   [<f81084f8>] scsi_probe_and_add_lun+0x294/0xb5b [scsi_mod]
   [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
   [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
   [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
   [<c044341f>] async_thread+0xe9/0x1c9
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

 -> (&sdev->list_lock){..-...} ops: 27612 {
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c063056b>] _spin_lock_irqsave+0x33/0x43
                          [<f8101cb4>] scsi_put_command+0x17/0x57 [scsi_mod]
                          [<f810620f>] scsi_next_command+0x26/0x39 [scsi_mod]
                          [<f8106d02>] scsi_io_completion+0x23f/0x41f [scsi_mod]
                          [<f810152d>] scsi_finish_command+0xcc/0xd4 [scsi_mod]
                          [<f8106fdb>] scsi_softirq_done+0xf9/0x101 [scsi_mod]
                          [<c050a936>] blk_done_softirq+0x5e/0x70
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c063056b>] _spin_lock_irqsave+0x33/0x43
                         [<f8101c64>] scsi_get_command+0x5c/0x95 [scsi_mod]
                         [<f81062b6>] scsi_get_cmd_from_req+0x26/0x50 [scsi_mod]
                         [<f8106594>] scsi_setup_blk_pc_cmnd+0x2b/0xd7 [scsi_mod]
                         [<f8106664>] scsi_prep_fn+0x24/0x33 [scsi_mod]
                         [<c0504712>] elv_next_request+0xe6/0x18d
                         [<f810704c>] scsi_request_fn+0x69/0x431 [scsi_mod]
                         [<c05072af>] __generic_unplug_device+0x2e/0x31
                         [<c0509d59>] blk_execute_rq_nowait+0x64/0x86
                         [<c0509e2e>] blk_execute_rq+0xb3/0xd5
                         [<f81068f5>] scsi_execute+0xc5/0x11c [scsi_mod]
                         [<f81069ff>] scsi_execute_req+0xb3/0x104 [scsi_mod]
                         [<f81084f8>] scsi_probe_and_add_lun+0x294/0xb5b [scsi_mod]
                         [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
                         [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
                         [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
                         [<c044341f>] async_thread+0xe9/0x1c9
                         [<c043e204>] kthread+0x4a/0x72
                         [<c04034e7>] kernel_thread_helper+0x7/0x10
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<f811916c>] __key.29786+0x0/0xffff2ebf [scsi_mod]
 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<f8101c64>] scsi_get_command+0x5c/0x95 [scsi_mod]
   [<f81062b6>] scsi_get_cmd_from_req+0x26/0x50 [scsi_mod]
   [<f8106594>] scsi_setup_blk_pc_cmnd+0x2b/0xd7 [scsi_mod]
   [<f8106664>] scsi_prep_fn+0x24/0x33 [scsi_mod]
   [<c0504712>] elv_next_request+0xe6/0x18d
   [<f810704c>] scsi_request_fn+0x69/0x431 [scsi_mod]
   [<c05072af>] __generic_unplug_device+0x2e/0x31
   [<c0509d59>] blk_execute_rq_nowait+0x64/0x86
   [<c0509e2e>] blk_execute_rq+0xb3/0xd5
   [<f81068f5>] scsi_execute+0xc5/0x11c [scsi_mod]
   [<f81069ff>] scsi_execute_req+0xb3/0x104 [scsi_mod]
   [<f81084f8>] scsi_probe_and_add_lun+0x294/0xb5b [scsi_mod]
   [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
   [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
   [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
   [<c044341f>] async_thread+0xe9/0x1c9
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

 -> (&q->lock){-.-.-.} ops: 2105038 {
    IN-HARDIRQ-W at:
                          [<c044d9e4>] __lock_acquire+0x49b/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c063056b>] _spin_lock_irqsave+0x33/0x43
                          [<c041ec0d>] complete+0x17/0x43
                          [<c062609b>] i8042_aux_test_irq+0x4c/0x65
                          [<c045e922>] handle_IRQ_event+0xa4/0x169
                          [<c04602ea>] handle_edge_irq+0xc9/0x10a
                          [<ffffffff>] 0xffffffff
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c063056b>] _spin_lock_irqsave+0x33/0x43
                          [<c041ec0d>] complete+0x17/0x43
                          [<c043c336>] wakeme_after_rcu+0x10/0x12
                          [<c0461a12>] rcu_process_callbacks+0x62/0x86
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    IN-RECLAIM_FS-W at:
                             [<c044dabd>] __lock_acquire+0x574/0x73e
                             [<c044dd36>] lock_acquire+0xaf/0xcc
                             [<c063056b>] _spin_lock_irqsave+0x33/0x43
                             [<c043e47b>] prepare_to_wait+0x1c/0x4a
                             [<c0485d3e>] kswapd+0xa7/0x51b
                             [<c043e204>] kthread+0x4a/0x72
                             [<c04034e7>] kernel_thread_helper+0x7/0x10
                             [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c06304ea>] _spin_lock_irq+0x30/0x3f
                         [<c062d811>] wait_for_common+0x2f/0xeb
                         [<c062d968>] wait_for_completion+0x17/0x19
                         [<c043e161>] kthread_create+0x6e/0xc7
                         [<c062b7eb>] migration_call+0x39/0x444
                         [<c07ae112>] migration_init+0x1d/0x4b
                         [<c040115c>] do_one_initcall+0x6a/0x16e
                         [<c079d44d>] kernel_init+0x4d/0x15a
                         [<c04034e7>] kernel_thread_helper+0x7/0x10
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c0823490>] __key.17681+0x0/0x8
  -> (&rq->lock){-.-.-.} ops: 854341 {
     IN-HARDIRQ-W at:
                            [<c044d9e4>] __lock_acquire+0x49b/0x73e
                            [<c044dd36>] lock_acquire+0xaf/0xcc
                            [<c0630340>] _spin_lock+0x2a/0x39
                            [<c0429f89>] scheduler_tick+0x39/0x19b
                            [<c04354a4>] update_process_times+0x47/0x53
                            [<c0447fe0>] tick_periodic+0x6b/0x77
                            [<c0448009>] tick_handle_periodic+0x1d/0x60
                            [<c0404ace>] timer_interrupt+0x3e/0x45
                            [<c045e922>] handle_IRQ_event+0xa4/0x169
                            [<c04603a3>] handle_level_irq+0x78/0xc1
                            [<ffffffff>] 0xffffffff
     IN-SOFTIRQ-W at:
                            [<c044da08>] __lock_acquire+0x4bf/0x73e
                            [<c044dd36>] lock_acquire+0xaf/0xcc
                            [<c0630340>] _spin_lock+0x2a/0x39
                            [<c041ede7>] task_rq_lock+0x3b/0x62
                            [<c0426e41>] try_to_wake_up+0x75/0x2d4
                            [<c04270d7>] wake_up_process+0x14/0x16
                            [<c043507c>] process_timeout+0xd/0xf
                            [<c0434caa>] run_timer_softirq+0x15b/0x1d1
                            [<c0431379>] __do_softirq+0xb8/0x180
                            [<ffffffff>] 0xffffffff
     IN-RECLAIM_FS-W at:
                               [<c044dabd>] __lock_acquire+0x574/0x73e
                               [<c044dd36>] lock_acquire+0xaf/0xcc
                               [<c0630340>] _spin_lock+0x2a/0x39
                               [<c041ede7>] task_rq_lock+0x3b/0x62
                               [<c0427515>] set_cpus_allowed_ptr+0x1a/0xdd
                               [<c0485cf8>] kswapd+0x61/0x51b
                               [<c043e204>] kthread+0x4a/0x72
                               [<c04034e7>] kernel_thread_helper+0x7/0x10
                               [<ffffffff>] 0xffffffff
     INITIAL USE at:
                           [<c044dad5>] __lock_acquire+0x58c/0x73e
                           [<c044dd36>] lock_acquire+0xaf/0xcc
                           [<c063056b>] _spin_lock_irqsave+0x33/0x43
                           [<c042398e>] rq_attach_root+0x17/0xa7
                           [<c07ae52c>] sched_init+0x240/0x33e
                           [<c079d661>] start_kernel+0x107/0x2fc
                           [<c079d06a>] __init_begin+0x6a/0x6f
                           [<ffffffff>] 0xffffffff
   }
   ... key      at: [<c0800518>] __key.46938+0x0/0x8
   -> (&vec->lock){-.-...} ops: 34058 {
      IN-HARDIRQ-W at:
                              [<c044d9e4>] __lock_acquire+0x49b/0x73e
                              [<c044dd36>] lock_acquire+0xaf/0xcc
                              [<c063056b>] _spin_lock_irqsave+0x33/0x43
                              [<c047ad3b>] cpupri_set+0x51/0xba
                              [<c04219ee>] __enqueue_rt_entity+0xe2/0x1c8
                              [<c0421e18>] enqueue_rt_entity+0x19/0x23
                              [<c0428a52>] enqueue_task_rt+0x24/0x51
                              [<c041e03b>] enqueue_task+0x64/0x70
                              [<c041e06b>] activate_task+0x24/0x2a
                              [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
                              [<c04270d7>] wake_up_process+0x14/0x16
                              [<c04408b6>] hrtimer_wakeup+0x1d/0x21
                              [<c0440922>] __run_hrtimer+0x68/0x98
                              [<c04411ca>] hrtimer_interrupt+0x101/0x153
                              [<c063406e>] smp_apic_timer_interrupt+0x6e/0x81
                              [<c04033c7>] apic_timer_interrupt+0x2f/0x34
                              [<c0401c4f>] cpu_idle+0x53/0x85
                              [<c061fc80>] rest_init+0x6c/0x6e
                              [<c079d851>] start_kernel+0x2f7/0x2fc
                              [<c079d06a>] __init_begin+0x6a/0x6f
                              [<ffffffff>] 0xffffffff
      IN-SOFTIRQ-W at:
                              [<c044da08>] __lock_acquire+0x4bf/0x73e
                              [<c044dd36>] lock_acquire+0xaf/0xcc
                              [<c063056b>] _spin_lock_irqsave+0x33/0x43
                              [<c047ad3b>] cpupri_set+0x51/0xba
                              [<c04219ee>] __enqueue_rt_entity+0xe2/0x1c8
                              [<c0421e18>] enqueue_rt_entity+0x19/0x23
                              [<c0428a52>] enqueue_task_rt+0x24/0x51
                              [<c041e03b>] enqueue_task+0x64/0x70
                              [<c041e06b>] activate_task+0x24/0x2a
                              [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
                              [<c04270d7>] wake_up_process+0x14/0x16
                              [<c042737c>] rebalance_domains+0x2a3/0x3ac
                              [<c0429a06>] run_rebalance_domains+0x32/0xaa
                              [<c0431379>] __do_softirq+0xb8/0x180
                              [<ffffffff>] 0xffffffff
      INITIAL USE at:
                             [<c044dad5>] __lock_acquire+0x58c/0x73e
                             [<c044dd36>] lock_acquire+0xaf/0xcc
                             [<c063056b>] _spin_lock_irqsave+0x33/0x43
                             [<c047ad74>] cpupri_set+0x8a/0xba
                             [<c04216f2>] rq_online_rt+0x5e/0x61
                             [<c041dd3a>] set_rq_online+0x40/0x4a
                             [<c04239fb>] rq_attach_root+0x84/0xa7
                             [<c07ae52c>] sched_init+0x240/0x33e
                             [<c079d661>] start_kernel+0x107/0x2fc
                             [<c079d06a>] __init_begin+0x6a/0x6f
                             [<ffffffff>] 0xffffffff
    }
    ... key      at: [<c0c525d0>] __key.14261+0x0/0x10
   ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c047ad74>] cpupri_set+0x8a/0xba
   [<c04216f2>] rq_online_rt+0x5e/0x61
   [<c041dd3a>] set_rq_online+0x40/0x4a
   [<c04239fb>] rq_attach_root+0x84/0xa7
   [<c07ae52c>] sched_init+0x240/0x33e
   [<c079d661>] start_kernel+0x107/0x2fc
   [<c079d06a>] __init_begin+0x6a/0x6f
   [<ffffffff>] 0xffffffff

   -> (&rt_b->rt_runtime_lock){-.-...} ops: 336 {
      IN-HARDIRQ-W at:
                              [<c044d9e4>] __lock_acquire+0x49b/0x73e
                              [<c044dd36>] lock_acquire+0xaf/0xcc
                              [<c0630340>] _spin_lock+0x2a/0x39
                              [<c0421a75>] __enqueue_rt_entity+0x169/0x1c8
                              [<c0421e18>] enqueue_rt_entity+0x19/0x23
                              [<c0428a52>] enqueue_task_rt+0x24/0x51
                              [<c041e03b>] enqueue_task+0x64/0x70
                              [<c041e06b>] activate_task+0x24/0x2a
                              [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
                              [<c04270d7>] wake_up_process+0x14/0x16
                              [<c04408b6>] hrtimer_wakeup+0x1d/0x21
                              [<c0440922>] __run_hrtimer+0x68/0x98
                              [<c04411ca>] hrtimer_interrupt+0x101/0x153
                              [<c063406e>] smp_apic_timer_interrupt+0x6e/0x81
                              [<c04033c7>] apic_timer_interrupt+0x2f/0x34
                              [<c0401c4f>] cpu_idle+0x53/0x85
                              [<c061fc80>] rest_init+0x6c/0x6e
                              [<c079d851>] start_kernel+0x2f7/0x2fc
                              [<c079d06a>] __init_begin+0x6a/0x6f
                              [<ffffffff>] 0xffffffff
      IN-SOFTIRQ-W at:
                              [<c044da08>] __lock_acquire+0x4bf/0x73e
                              [<c044dd36>] lock_acquire+0xaf/0xcc
                              [<c0630340>] _spin_lock+0x2a/0x39
                              [<c0421a75>] __enqueue_rt_entity+0x169/0x1c8
                              [<c0421e18>] enqueue_rt_entity+0x19/0x23
                              [<c0428a52>] enqueue_task_rt+0x24/0x51
                              [<c041e03b>] enqueue_task+0x64/0x70
                              [<c041e06b>] activate_task+0x24/0x2a
                              [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
                              [<c04270d7>] wake_up_process+0x14/0x16
                              [<c042737c>] rebalance_domains+0x2a3/0x3ac
                              [<c0429a06>] run_rebalance_domains+0x32/0xaa
                              [<c0431379>] __do_softirq+0xb8/0x180
                              [<ffffffff>] 0xffffffff
      INITIAL USE at:
                             [<c044dad5>] __lock_acquire+0x58c/0x73e
                             [<c044dd36>] lock_acquire+0xaf/0xcc
                             [<c0630340>] _spin_lock+0x2a/0x39
                             [<c0421a75>] __enqueue_rt_entity+0x169/0x1c8
                             [<c0421e18>] enqueue_rt_entity+0x19/0x23
                             [<c0428a52>] enqueue_task_rt+0x24/0x51
                             [<c041e03b>] enqueue_task+0x64/0x70
                             [<c041e06b>] activate_task+0x24/0x2a
                             [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
                             [<c04270d7>] wake_up_process+0x14/0x16
                             [<c062b86b>] migration_call+0xb9/0x444
                             [<c07ae130>] migration_init+0x3b/0x4b
                             [<c040115c>] do_one_initcall+0x6a/0x16e
                             [<c079d44d>] kernel_init+0x4d/0x15a
                             [<c04034e7>] kernel_thread_helper+0x7/0x10
                             [<ffffffff>] 0xffffffff
    }
    ... key      at: [<c0800504>] __key.37924+0x0/0x8
    -> (&cpu_base->lock){-.-...} ops: 950512 {
       IN-HARDIRQ-W at:
                                [<c044d9e4>] __lock_acquire+0x49b/0x73e
                                [<c044dd36>] lock_acquire+0xaf/0xcc
                                [<c0630340>] _spin_lock+0x2a/0x39
                                [<c0440a3a>] hrtimer_run_queues+0xe8/0x131
                                [<c0435151>] run_local_timers+0xd/0x1e
                                [<c0435486>] update_process_times+0x29/0x53
                                [<c0447fe0>] tick_periodic+0x6b/0x77
                                [<c0448009>] tick_handle_periodic+0x1d/0x60
                                [<c063406e>] smp_apic_timer_interrupt+0x6e/0x81
                                [<c04033c7>] apic_timer_interrupt+0x2f/0x34
                                [<c04082c7>] arch_dup_task_struct+0x19/0x81
                                [<c042ac1c>] copy_process+0xab/0x115f
                                [<c042be78>] do_fork+0x129/0x2c5
                                [<c0401698>] kernel_thread+0x7f/0x87
                                [<c043e0b3>] kthreadd+0xa3/0xe3
                                [<c04034e7>] kernel_thread_helper+0x7/0x10
                                [<ffffffff>] 0xffffffff
       IN-SOFTIRQ-W at:
                                [<c044da08>] __lock_acquire+0x4bf/0x73e
                                [<c044dd36>] lock_acquire+0xaf/0xcc
                                [<c063056b>] _spin_lock_irqsave+0x33/0x43
                                [<c0440b98>] lock_hrtimer_base+0x1d/0x38
                                [<c0440ca9>] __hrtimer_start_range_ns+0x1f/0x232
                                [<c0440ee7>] hrtimer_start_range_ns+0x15/0x17
                                [<c0448ef1>] tick_setup_sched_timer+0xf6/0x124
                                [<c0441558>] hrtimer_run_pending+0xb0/0xe8
                                [<c0434b76>] run_timer_softirq+0x27/0x1d1
                                [<c0431379>] __do_softirq+0xb8/0x180
                                [<ffffffff>] 0xffffffff
       INITIAL USE at:
                               [<c044dad5>] __lock_acquire+0x58c/0x73e
                               [<c044dd36>] lock_acquire+0xaf/0xcc
                               [<c063056b>] _spin_lock_irqsave+0x33/0x43
                               [<c0440b98>] lock_hrtimer_base+0x1d/0x38
                               [<c0440ca9>] __hrtimer_start_range_ns+0x1f/0x232
                               [<c0421ab1>] __enqueue_rt_entity+0x1a5/0x1c8
                               [<c0421e18>] enqueue_rt_entity+0x19/0x23
                               [<c0428a52>] enqueue_task_rt+0x24/0x51
                               [<c041e03b>] enqueue_task+0x64/0x70
                               [<c041e06b>] activate_task+0x24/0x2a
                               [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
                               [<c04270d7>] wake_up_process+0x14/0x16
                               [<c062b86b>] migration_call+0xb9/0x444
                               [<c07ae130>] migration_init+0x3b/0x4b
                               [<c040115c>] do_one_initcall+0x6a/0x16e
                               [<c079d44d>] kernel_init+0x4d/0x15a
                               [<c04034e7>] kernel_thread_helper+0x7/0x10
                               [<ffffffff>] 0xffffffff
     }
     ... key      at: [<c08234b8>] __key.20063+0x0/0x8
    ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c0440b98>] lock_hrtimer_base+0x1d/0x38
   [<c0440ca9>] __hrtimer_start_range_ns+0x1f/0x232
   [<c0421ab1>] __enqueue_rt_entity+0x1a5/0x1c8
   [<c0421e18>] enqueue_rt_entity+0x19/0x23
   [<c0428a52>] enqueue_task_rt+0x24/0x51
   [<c041e03b>] enqueue_task+0x64/0x70
   [<c041e06b>] activate_task+0x24/0x2a
   [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
   [<c04270d7>] wake_up_process+0x14/0x16
   [<c062b86b>] migration_call+0xb9/0x444
   [<c07ae130>] migration_init+0x3b/0x4b
   [<c040115c>] do_one_initcall+0x6a/0x16e
   [<c079d44d>] kernel_init+0x4d/0x15a
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

    -> (&rt_rq->rt_runtime_lock){-.....} ops: 17587 {
       IN-HARDIRQ-W at:
                                [<c044d9e4>] __lock_acquire+0x49b/0x73e
                                [<c044dd36>] lock_acquire+0xaf/0xcc
                                [<c0630340>] _spin_lock+0x2a/0x39
                                [<c0421efc>] sched_rt_period_timer+0xda/0x24e
                                [<c0440922>] __run_hrtimer+0x68/0x98
                                [<c04411ca>] hrtimer_interrupt+0x101/0x153
                                [<c063406e>] smp_apic_timer_interrupt+0x6e/0x81
                                [<c04033c7>] apic_timer_interrupt+0x2f/0x34
                                [<c0452203>] each_symbol_in_section+0x27/0x57
                                [<c045225a>] each_symbol+0x27/0x113
                                [<c0452373>] find_symbol+0x2d/0x51
                                [<c0454a7a>] load_module+0xaec/0x10eb
                                [<c04550bf>] sys_init_module+0x46/0x19b
                                [<c0402a68>] sysenter_do_call+0x12/0x36
                                [<ffffffff>] 0xffffffff
       INITIAL USE at:
                               [<c044dad5>] __lock_acquire+0x58c/0x73e
                               [<c044dd36>] lock_acquire+0xaf/0xcc
                               [<c0630340>] _spin_lock+0x2a/0x39
                               [<c0421c41>] update_curr_rt+0x13a/0x20d
                               [<c0421dd8>] dequeue_task_rt+0x13/0x3a
                               [<c041df9e>] dequeue_task+0xff/0x10e
                               [<c041dfd1>] deactivate_task+0x24/0x2a
                               [<c062db54>] __schedule+0x162/0x991
                               [<c062e39a>] schedule+0x17/0x30
                               [<c0426c54>] migration_thread+0x175/0x203
                               [<c043e204>] kthread+0x4a/0x72
                               [<c04034e7>] kernel_thread_helper+0x7/0x10
                               [<ffffffff>] 0xffffffff
     }
     ... key      at: [<c080050c>] __key.46863+0x0/0x8
    ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c041ee73>] __enable_runtime+0x43/0xb3
   [<c04216d8>] rq_online_rt+0x44/0x61
   [<c041dd3a>] set_rq_online+0x40/0x4a
   [<c062b8a5>] migration_call+0xf3/0x444
   [<c063291c>] notifier_call_chain+0x2b/0x4a
   [<c0441e22>] __raw_notifier_call_chain+0x13/0x15
   [<c0441e35>] raw_notifier_call_chain+0x11/0x13
   [<c062bd2f>] _cpu_up+0xc3/0xf6
   [<c062bdac>] cpu_up+0x4a/0x5a
   [<c079d49a>] kernel_init+0x9a/0x15a
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

   ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c0421a75>] __enqueue_rt_entity+0x169/0x1c8
   [<c0421e18>] enqueue_rt_entity+0x19/0x23
   [<c0428a52>] enqueue_task_rt+0x24/0x51
   [<c041e03b>] enqueue_task+0x64/0x70
   [<c041e06b>] activate_task+0x24/0x2a
   [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
   [<c04270d7>] wake_up_process+0x14/0x16
   [<c062b86b>] migration_call+0xb9/0x444
   [<c07ae130>] migration_init+0x3b/0x4b
   [<c040115c>] do_one_initcall+0x6a/0x16e
   [<c079d44d>] kernel_init+0x4d/0x15a
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

   ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c0421c41>] update_curr_rt+0x13a/0x20d
   [<c0421dd8>] dequeue_task_rt+0x13/0x3a
   [<c041df9e>] dequeue_task+0xff/0x10e
   [<c041dfd1>] deactivate_task+0x24/0x2a
   [<c062db54>] __schedule+0x162/0x991
   [<c062e39a>] schedule+0x17/0x30
   [<c0426c54>] migration_thread+0x175/0x203
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

   -> (&sig->cputimer.lock){......} ops: 1949 {
      INITIAL USE at:
                             [<c044dad5>] __lock_acquire+0x58c/0x73e
                             [<c044dd36>] lock_acquire+0xaf/0xcc
                             [<c063056b>] _spin_lock_irqsave+0x33/0x43
                             [<c043f03e>] thread_group_cputimer+0x29/0x90
                             [<c044004c>] posix_cpu_timers_exit_group+0x16/0x39
                             [<c042e5f0>] release_task+0xa2/0x376
                             [<c042fbe1>] do_exit+0x548/0x5b3
                             [<c043a9d8>] __request_module+0x0/0x100
                             [<c04034e7>] kernel_thread_helper+0x7/0x10
                             [<ffffffff>] 0xffffffff
    }
    ... key      at: [<c08014ac>] __key.15480+0x0/0x8
   ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c041f43a>] update_curr+0xef/0x107
   [<c042131b>] enqueue_entity+0x1a/0x1c6
   [<c0421535>] enqueue_task_fair+0x24/0x3e
   [<c041e03b>] enqueue_task+0x64/0x70
   [<c041e06b>] activate_task+0x24/0x2a
   [<c0426f9e>] try_to_wake_up+0x1d2/0x2d4
   [<c04270b0>] default_wake_function+0x10/0x12
   [<c041d785>] __wake_up_common+0x34/0x5f
   [<c041ec26>] complete+0x30/0x43
   [<c043e1e8>] kthread+0x2e/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

   -> (&rq->lock/1){..-...} ops: 3217 {
      IN-SOFTIRQ-W at:
                              [<c044da08>] __lock_acquire+0x4bf/0x73e
                              [<c044dd36>] lock_acquire+0xaf/0xcc
                              [<c0630305>] _spin_lock_nested+0x2d/0x3e
                              [<c0422cb4>] double_rq_lock+0x4b/0x7d
                              [<c0427274>] rebalance_domains+0x19b/0x3ac
                              [<c0429a06>] run_rebalance_domains+0x32/0xaa
                              [<c0431379>] __do_softirq+0xb8/0x180
                              [<ffffffff>] 0xffffffff
      INITIAL USE at:
                             [<c044dad5>] __lock_acquire+0x58c/0x73e
                             [<c044dd36>] lock_acquire+0xaf/0xcc
                             [<c0630305>] _spin_lock_nested+0x2d/0x3e
                             [<c0422cb4>] double_rq_lock+0x4b/0x7d
                             [<c0427274>] rebalance_domains+0x19b/0x3ac
                             [<c0429a06>] run_rebalance_domains+0x32/0xaa
                             [<c0431379>] __do_softirq+0xb8/0x180
                             [<ffffffff>] 0xffffffff
    }
    ... key      at: [<c0800519>] __key.46938+0x1/0x8
    ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c0421c41>] update_curr_rt+0x13a/0x20d
   [<c0421dd8>] dequeue_task_rt+0x13/0x3a
   [<c041df9e>] dequeue_task+0xff/0x10e
   [<c041dfd1>] deactivate_task+0x24/0x2a
   [<c0427b1b>] push_rt_task+0x189/0x1f7
   [<c0427b9b>] push_rt_tasks+0x12/0x19
   [<c0427bb9>] post_schedule_rt+0x17/0x21
   [<c0425a68>] finish_task_switch+0x83/0xc0
   [<c062e339>] __schedule+0x947/0x991
   [<c062e39a>] schedule+0x17/0x30
   [<c0426c54>] migration_thread+0x175/0x203
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

    ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c047ad3b>] cpupri_set+0x51/0xba
   [<c04219ee>] __enqueue_rt_entity+0xe2/0x1c8
   [<c0421e18>] enqueue_rt_entity+0x19/0x23
   [<c0428a52>] enqueue_task_rt+0x24/0x51
   [<c041e03b>] enqueue_task+0x64/0x70
   [<c041e06b>] activate_task+0x24/0x2a
   [<c0427b33>] push_rt_task+0x1a1/0x1f7
   [<c0427b9b>] push_rt_tasks+0x12/0x19
   [<c0427bb9>] post_schedule_rt+0x17/0x21
   [<c0425a68>] finish_task_switch+0x83/0xc0
   [<c062e339>] __schedule+0x947/0x991
   [<c062e39a>] schedule+0x17/0x30
   [<c0426c54>] migration_thread+0x175/0x203
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

   ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630305>] _spin_lock_nested+0x2d/0x3e
   [<c0422cb4>] double_rq_lock+0x4b/0x7d
   [<c0427274>] rebalance_domains+0x19b/0x3ac
   [<c0429a06>] run_rebalance_domains+0x32/0xaa
   [<c0431379>] __do_softirq+0xb8/0x180
   [<ffffffff>] 0xffffffff

  ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c041ede7>] task_rq_lock+0x3b/0x62
   [<c0426e41>] try_to_wake_up+0x75/0x2d4
   [<c04270b0>] default_wake_function+0x10/0x12
   [<c041d785>] __wake_up_common+0x34/0x5f
   [<c041ec26>] complete+0x30/0x43
   [<c043e0cc>] kthreadd+0xbc/0xe3
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

  -> (&ep->lock){......} ops: 110 {
     INITIAL USE at:
                           [<c044dad5>] __lock_acquire+0x58c/0x73e
                           [<c044dd36>] lock_acquire+0xaf/0xcc
                           [<c063056b>] _spin_lock_irqsave+0x33/0x43
                           [<c04ca381>] sys_epoll_ctl+0x232/0x3f6
                           [<c0402a68>] sysenter_do_call+0x12/0x36
                           [<ffffffff>] 0xffffffff
   }
   ... key      at: [<c0c5be90>] __key.22301+0x0/0x10
   ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c041ede7>] task_rq_lock+0x3b/0x62
   [<c0426e41>] try_to_wake_up+0x75/0x2d4
   [<c04270b0>] default_wake_function+0x10/0x12
   [<c041d785>] __wake_up_common+0x34/0x5f
   [<c041d7c6>] __wake_up_locked+0x16/0x1a
   [<c04ca7f5>] ep_poll_callback+0x7c/0xb6
   [<c041d785>] __wake_up_common+0x34/0x5f
   [<c041ec70>] __wake_up_sync_key+0x37/0x4a
   [<c05cbefa>] sock_def_readable+0x42/0x71
   [<c061c8b1>] unix_stream_connect+0x2f3/0x368
   [<c05c830a>] sys_connect+0x59/0x76
   [<c05c963f>] sys_socketcall+0x76/0x172
   [<c0402a68>] sysenter_do_call+0x12/0x36
   [<ffffffff>] 0xffffffff

  ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c04ca797>] ep_poll_callback+0x1e/0xb6
   [<c041d785>] __wake_up_common+0x34/0x5f
   [<c041ec70>] __wake_up_sync_key+0x37/0x4a
   [<c05cbefa>] sock_def_readable+0x42/0x71
   [<c061c8b1>] unix_stream_connect+0x2f3/0x368
   [<c05c830a>] sys_connect+0x59/0x76
   [<c05c963f>] sys_socketcall+0x76/0x172
   [<c0402a68>] sysenter_do_call+0x12/0x36
   [<ffffffff>] 0xffffffff

 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c041ec0d>] complete+0x17/0x43
   [<c0509cf2>] blk_end_sync_rq+0x2a/0x2d
   [<c0506935>] end_that_request_last+0x17b/0x1a1
   [<c0506a0d>] blk_end_io+0x51/0x6f
   [<c0506a64>] blk_end_request+0x11/0x13
   [<f8106c9c>] scsi_io_completion+0x1d9/0x41f [scsi_mod]
   [<f810152d>] scsi_finish_command+0xcc/0xd4 [scsi_mod]
   [<f8106fdb>] scsi_softirq_done+0xf9/0x101 [scsi_mod]
   [<c050a936>] blk_done_softirq+0x5e/0x70
   [<c0431379>] __do_softirq+0xb8/0x180
   [<ffffffff>] 0xffffffff

 -> (&n->list_lock){..-...} ops: 49241 {
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c0630340>] _spin_lock+0x2a/0x39
                          [<c049bd18>] add_partial+0x16/0x40
                          [<c049d0d4>] __slab_free+0x96/0x28f
                          [<c049df5c>] kmem_cache_free+0x8c/0xf2
                          [<c04a5ce9>] file_free_rcu+0x35/0x38
                          [<c0461a12>] rcu_process_callbacks+0x62/0x86
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c0630340>] _spin_lock+0x2a/0x39
                         [<c049bd18>] add_partial+0x16/0x40
                         [<c049d0d4>] __slab_free+0x96/0x28f
                         [<c049df5c>] kmem_cache_free+0x8c/0xf2
                         [<c0514eda>] ida_get_new_above+0x13b/0x155
                         [<c0514f00>] ida_get_new+0xc/0xe
                         [<c04a628b>] set_anon_super+0x39/0xa3
                         [<c04a68c6>] sget+0x2f3/0x386
                         [<c04a7365>] get_sb_single+0x24/0x8f
                         [<c04e034c>] sysfs_get_sb+0x18/0x1a
                         [<c04a6dd1>] vfs_kern_mount+0x40/0x7b
                         [<c04a6e21>] kern_mount_data+0x15/0x17
                         [<c07b5ff6>] sysfs_init+0x50/0x9c
                         [<c07b4ac9>] mnt_init+0x8c/0x1e4
                         [<c07b4737>] vfs_caches_init+0xd8/0xea
                         [<c079d815>] start_kernel+0x2bb/0x2fc
                         [<c079d06a>] __init_begin+0x6a/0x6f
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c0c5a424>] __key.25358+0x0/0x8
 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c049cc45>] __slab_alloc+0xf6/0x4ef
   [<c049d333>] kmem_cache_alloc+0x66/0x11f
   [<f810189b>] scsi_pool_alloc_command+0x20/0x4c [scsi_mod]
   [<f81018de>] scsi_host_alloc_command+0x17/0x4f [scsi_mod]
   [<f810192b>] __scsi_get_command+0x15/0x71 [scsi_mod]
   [<f8101c41>] scsi_get_command+0x39/0x95 [scsi_mod]
   [<f81062b6>] scsi_get_cmd_from_req+0x26/0x50 [scsi_mod]
   [<f8106594>] scsi_setup_blk_pc_cmnd+0x2b/0xd7 [scsi_mod]
   [<f8106664>] scsi_prep_fn+0x24/0x33 [scsi_mod]
   [<c0504712>] elv_next_request+0xe6/0x18d
   [<f810704c>] scsi_request_fn+0x69/0x431 [scsi_mod]
   [<c05072af>] __generic_unplug_device+0x2e/0x31
   [<c0509d59>] blk_execute_rq_nowait+0x64/0x86
   [<c0509e2e>] blk_execute_rq+0xb3/0xd5
   [<f81068f5>] scsi_execute+0xc5/0x11c [scsi_mod]
   [<f81069ff>] scsi_execute_req+0xb3/0x104 [scsi_mod]
   [<f812b40d>] sd_revalidate_disk+0x1a3/0xf64 [sd_mod]
   [<f812d52f>] sd_probe_async+0x146/0x22d [sd_mod]
   [<c044341f>] async_thread+0xe9/0x1c9
   [<c043e204>] kthread+0x4a/0x72
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

 -> (&cwq->lock){-.-...} ops: 30335 {
    IN-HARDIRQ-W at:
                          [<c044d9e4>] __lock_acquire+0x49b/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c063056b>] _spin_lock_irqsave+0x33/0x43
                          [<c043b54b>] __queue_work+0x14/0x30
                          [<c043b5ce>] queue_work_on+0x3a/0x46
                          [<c043b617>] queue_work+0x26/0x4a
                          [<c043b64f>] schedule_work+0x14/0x16
                          [<c057a367>] schedule_console_callback+0x12/0x14
                          [<c05788ed>] kbd_event+0x595/0x600
                          [<c05b3d15>] input_pass_event+0x56/0x7e
                          [<c05b4702>] input_handle_event+0x314/0x334
                          [<c05b4f1e>] input_event+0x50/0x63
                          [<c05b9bd4>] atkbd_interrupt+0x209/0x4e9
                          [<c05b1793>] serio_interrupt+0x38/0x6e
                          [<c05b24e8>] i8042_interrupt+0x1db/0x1ec
                          [<c045e922>] handle_IRQ_event+0xa4/0x169
                          [<c04602ea>] handle_edge_irq+0xc9/0x10a
                          [<ffffffff>] 0xffffffff
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c063056b>] _spin_lock_irqsave+0x33/0x43
                          [<c043b54b>] __queue_work+0x14/0x30
                          [<c043b590>] delayed_work_timer_fn+0x29/0x2d
                          [<c0434caa>] run_timer_softirq+0x15b/0x1d1
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c063056b>] _spin_lock_irqsave+0x33/0x43
                         [<c043b54b>] __queue_work+0x14/0x30
                         [<c043b5ce>] queue_work_on+0x3a/0x46
                         [<c043b617>] queue_work+0x26/0x4a
                         [<c043a7b3>] call_usermodehelper_exec+0x83/0xd0
                         [<c051631a>] kobject_uevent_env+0x351/0x385
                         [<c0516358>] kobject_uevent+0xa/0xc
                         [<c0515a0e>] kset_register+0x2e/0x34
                         [<c0590f18>] bus_register+0xed/0x23d
                         [<c07bea09>] platform_bus_init+0x23/0x38
                         [<c07beb77>] driver_init+0x1c/0x28
                         [<c079d4f6>] kernel_init+0xf6/0x15a
                         [<c04034e7>] kernel_thread_helper+0x7/0x10
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c08230a8>] __key.23814+0x0/0x8
  -> (&workqueue_cpu_stat(cpu)->lock){-.-...} ops: 20397 {
     IN-HARDIRQ-W at:
                            [<c044d9e4>] __lock_acquire+0x49b/0x73e
                            [<c044dd36>] lock_acquire+0xaf/0xcc
                            [<c063056b>] _spin_lock_irqsave+0x33/0x43
                            [<c0474909>] probe_workqueue_insertion+0x33/0x81
                            [<c043acf3>] insert_work+0x3f/0x9b
                            [<c043b559>] __queue_work+0x22/0x30
                            [<c043b5ce>] queue_work_on+0x3a/0x46
                            [<c043b617>] queue_work+0x26/0x4a
                            [<c043b64f>] schedule_work+0x14/0x16
                            [<c057a367>] schedule_console_callback+0x12/0x14
                            [<c05788ed>] kbd_event+0x595/0x600
                            [<c05b3d15>] input_pass_event+0x56/0x7e
                            [<c05b4702>] input_handle_event+0x314/0x334
                            [<c05b4f1e>] input_event+0x50/0x63
                            [<c05b9bd4>] atkbd_interrupt+0x209/0x4e9
                            [<c05b1793>] serio_interrupt+0x38/0x6e
                            [<c05b24e8>] i8042_interrupt+0x1db/0x1ec
                            [<c045e922>] handle_IRQ_event+0xa4/0x169
                            [<c04602ea>] handle_edge_irq+0xc9/0x10a
                            [<ffffffff>] 0xffffffff
     IN-SOFTIRQ-W at:
                            [<c044da08>] __lock_acquire+0x4bf/0x73e
                            [<c044dd36>] lock_acquire+0xaf/0xcc
                            [<c063056b>] _spin_lock_irqsave+0x33/0x43
                            [<c0474909>] probe_workqueue_insertion+0x33/0x81
                            [<c043acf3>] insert_work+0x3f/0x9b
                            [<c043b559>] __queue_work+0x22/0x30
                            [<c043b590>] delayed_work_timer_fn+0x29/0x2d
                            [<c0434caa>] run_timer_softirq+0x15b/0x1d1
                            [<c0431379>] __do_softirq+0xb8/0x180
                            [<ffffffff>] 0xffffffff
     INITIAL USE at:
                           [<c044dad5>] __lock_acquire+0x58c/0x73e
                           [<c044dd36>] lock_acquire+0xaf/0xcc
                           [<c063056b>] _spin_lock_irqsave+0x33/0x43
                           [<c04747eb>] probe_workqueue_creation+0xc9/0x10a
                           [<c043abcb>] create_workqueue_thread+0x87/0xb0
                           [<c043b12f>] __create_workqueue_key+0x16d/0x1b2
                           [<c07aeedb>] init_workqueues+0x61/0x73
                           [<c079d4e7>] kernel_init+0xe7/0x15a
                           [<c04034e7>] kernel_thread_helper+0x7/0x10
                           [<ffffffff>] 0xffffffff
   }
   ... key      at: [<c0c52574>] __key.23424+0x0/0x8
  ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c0474909>] probe_workqueue_insertion+0x33/0x81
   [<c043acf3>] insert_work+0x3f/0x9b
   [<c043b559>] __queue_work+0x22/0x30
   [<c043b5ce>] queue_work_on+0x3a/0x46
   [<c043b617>] queue_work+0x26/0x4a
   [<c043a7b3>] call_usermodehelper_exec+0x83/0xd0
   [<c051631a>] kobject_uevent_env+0x351/0x385
   [<c0516358>] kobject_uevent+0xa/0xc
   [<c0515a0e>] kset_register+0x2e/0x34
   [<c0590f18>] bus_register+0xed/0x23d
   [<c07bea09>] platform_bus_init+0x23/0x38
   [<c07beb77>] driver_init+0x1c/0x28
   [<c079d4f6>] kernel_init+0xf6/0x15a
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

  ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c041ecaf>] __wake_up+0x1a/0x40
   [<c043ad46>] insert_work+0x92/0x9b
   [<c043b559>] __queue_work+0x22/0x30
   [<c043b5ce>] queue_work_on+0x3a/0x46
   [<c043b617>] queue_work+0x26/0x4a
   [<c043a7b3>] call_usermodehelper_exec+0x83/0xd0
   [<c051631a>] kobject_uevent_env+0x351/0x385
   [<c0516358>] kobject_uevent+0xa/0xc
   [<c0515a0e>] kset_register+0x2e/0x34
   [<c0590f18>] bus_register+0xed/0x23d
   [<c07bea09>] platform_bus_init+0x23/0x38
   [<c07beb77>] driver_init+0x1c/0x28
   [<c079d4f6>] kernel_init+0xf6/0x15a
   [<c04034e7>] kernel_thread_helper+0x7/0x10
   [<ffffffff>] 0xffffffff

 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c043b54b>] __queue_work+0x14/0x30
   [<c043b5ce>] queue_work_on+0x3a/0x46
   [<c043b617>] queue_work+0x26/0x4a
   [<c0505679>] kblockd_schedule_work+0x12/0x14
   [<c05113bb>] elv_schedule_dispatch+0x41/0x48
   [<c0513377>] elv_ioq_completed_request+0x2dc/0x2fe
   [<c05045aa>] elv_completed_request+0x48/0x97
   [<c0506738>] __blk_put_request+0x36/0xb8
   [<c0506953>] end_that_request_last+0x199/0x1a1
   [<c0506a0d>] blk_end_io+0x51/0x6f
   [<c0506a64>] blk_end_request+0x11/0x13
   [<f8106c9c>] scsi_io_completion+0x1d9/0x41f [scsi_mod]
   [<f810152d>] scsi_finish_command+0xcc/0xd4 [scsi_mod]
   [<f8106fdb>] scsi_softirq_done+0xf9/0x101 [scsi_mod]
   [<c050a936>] blk_done_softirq+0x5e/0x70
   [<c0431379>] __do_softirq+0xb8/0x180
   [<ffffffff>] 0xffffffff

 -> (&zone->lock){..-...} ops: 80266 {
    IN-SOFTIRQ-W at:
                          [<c044da08>] __lock_acquire+0x4bf/0x73e
                          [<c044dd36>] lock_acquire+0xaf/0xcc
                          [<c0630340>] _spin_lock+0x2a/0x39
                          [<c047fc71>] __free_pages_ok+0x167/0x321
                          [<c04800ce>] __free_pages+0x29/0x2b
                          [<c049c7c1>] __free_slab+0xb2/0xba
                          [<c049c800>] discard_slab+0x37/0x39
                          [<c049d15c>] __slab_free+0x11e/0x28f
                          [<c049df5c>] kmem_cache_free+0x8c/0xf2
                          [<c042ab6e>] free_task+0x31/0x34
                          [<c042c37b>] __put_task_struct+0xd3/0xd8
                          [<c042e072>] delayed_put_task_struct+0x60/0x64
                          [<c0461a12>] rcu_process_callbacks+0x62/0x86
                          [<c0431379>] __do_softirq+0xb8/0x180
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c0630340>] _spin_lock+0x2a/0x39
                         [<c047f7b6>] free_pages_bulk+0x21/0x1a1
                         [<c047ffcf>] free_hot_cold_page+0x181/0x20f
                         [<c04800a3>] free_hot_page+0xf/0x11
                         [<c04800c5>] __free_pages+0x20/0x2b
                         [<c07c4d96>] __free_pages_bootmem+0x6d/0x71
                         [<c07b2244>] free_all_bootmem_core+0xd2/0x177
                         [<c07b22f6>] free_all_bootmem+0xd/0xf
                         [<c07ad21a>] mem_init+0x28/0x28c
                         [<c079d7b1>] start_kernel+0x257/0x2fc
                         [<c079d06a>] __init_begin+0x6a/0x6f
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c0c52628>] __key.30749+0x0/0x8
 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c048035e>] get_page_from_freelist+0x236/0x3e3
   [<c04805f4>] __alloc_pages_internal+0xce/0x371
   [<c049cce6>] __slab_alloc+0x197/0x4ef
   [<c049d333>] kmem_cache_alloc+0x66/0x11f
   [<c047d96b>] mempool_alloc_slab+0x13/0x15
   [<c047da5c>] mempool_alloc+0x3a/0xd5
   [<f81063cc>] scsi_sg_alloc+0x47/0x4a [scsi_mod]
   [<c051cd02>] __sg_alloc_table+0x48/0xc7
   [<f8106325>] scsi_init_sgtable+0x2c/0x8c [scsi_mod]
   [<f81064e7>] scsi_init_io+0x19/0x9b [scsi_mod]
   [<f8106abf>] scsi_setup_fs_cmnd+0x6f/0x73 [scsi_mod]
   [<f812ca73>] sd_prep_fn+0x6a/0x7d4 [sd_mod]
   [<c0504712>] elv_next_request+0xe6/0x18d
   [<f810704c>] scsi_request_fn+0x69/0x431 [scsi_mod]
   [<c05072af>] __generic_unplug_device+0x2e/0x31
   [<c05072db>] blk_start_queueing+0x29/0x2b
   [<c05137b8>] elv_ioq_request_add+0x2be/0x393
   [<c05048cd>] elv_insert+0x114/0x1a2
   [<c05049ec>] __elv_add_request+0x91/0x96
   [<c0507a00>] __make_request+0x365/0x397
   [<c050635a>] generic_make_request+0x342/0x3ce
   [<c0507b21>] submit_bio+0xef/0xfa
   [<c04c6c4e>] mpage_bio_submit+0x21/0x26
   [<c04c7b7f>] mpage_readpages+0xa3/0xad
   [<f80c1ea8>] ext3_readpages+0x19/0x1b [ext3]
   [<c048275e>] __do_page_cache_readahead+0xfd/0x166
   [<c0482b42>] do_page_cache_readahead+0x44/0x52
   [<c047d665>] filemap_fault+0x197/0x3ae
   [<c048b9ea>] __do_fault+0x40/0x37b
   [<c048d43f>] handle_mm_fault+0x2bb/0x646
   [<c063273c>] do_page_fault+0x29c/0x2fd
   [<c0630b4a>] error_code+0x72/0x78
   [<ffffffff>] 0xffffffff

 -> (&page_address_htable[i].lock){......} ops: 6802 {
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c063056b>] _spin_lock_irqsave+0x33/0x43
                         [<c048af69>] page_address+0x50/0xa6
                         [<c048b0e7>] kmap_high+0x21/0x175
                         [<c041b7ef>] kmap+0x4e/0x5b
                         [<c04abb36>] page_getlink+0x37/0x59
                         [<c04abb75>] page_follow_link_light+0x1d/0x2b
                         [<c04ad4d0>] __link_path_walk+0x3d1/0xa71
                         [<c04adbae>] path_walk+0x3e/0x77
                         [<c04add0e>] do_path_lookup+0xeb/0x105
                         [<c04ae6f2>] path_lookup_open+0x48/0x7a
                         [<c04a8e96>] open_exec+0x25/0xf4
                         [<c04a9c2d>] do_execve+0xfa/0x2cc
                         [<c04015c0>] sys_execve+0x2b/0x54
                         [<c0402ae9>] syscall_call+0x7/0xb
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c0c5288c>] __key.28547+0x0/0x14
 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c048af69>] page_address+0x50/0xa6
   [<c05078a1>] __make_request+0x206/0x397
   [<c050635a>] generic_make_request+0x342/0x3ce
   [<c0507b21>] submit_bio+0xef/0xfa
   [<c04c6c4e>] mpage_bio_submit+0x21/0x26
   [<c04c78b8>] do_mpage_readpage+0x471/0x5e5
   [<c04c7b55>] mpage_readpages+0x79/0xad
   [<f80c1ea8>] ext3_readpages+0x19/0x1b [ext3]
   [<c048275e>] __do_page_cache_readahead+0xfd/0x166
   [<c0482b42>] do_page_cache_readahead+0x44/0x52
   [<c047d665>] filemap_fault+0x197/0x3ae
   [<c048b9ea>] __do_fault+0x40/0x37b
   [<c048d43f>] handle_mm_fault+0x2bb/0x646
   [<c063273c>] do_page_fault+0x29c/0x2fd
   [<c0630b4a>] error_code+0x72/0x78
   [<ffffffff>] 0xffffffff

 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c0630340>] _spin_lock+0x2a/0x39
   [<c046143d>] call_rcu+0x36/0x5b
   [<c050f0c8>] cfq_cic_free+0x15/0x17
   [<c050f128>] cic_free_func+0x5e/0x64
   [<c050ea90>] __call_for_each_cic+0x23/0x2e
   [<c050eaad>] cfq_free_io_context+0x12/0x14
   [<c050978c>] put_io_context+0x4b/0x66
   [<c050f00a>] cfq_active_ioq_reset+0x21/0x39
   [<c0511044>] elv_reset_active_ioq+0x2b/0x3e
   [<c0512ecf>] __elv_ioq_slice_expired+0x238/0x26a
   [<c0512f1f>] elv_ioq_slice_expired+0x1e/0x20
   [<c0513860>] elv_ioq_request_add+0x366/0x393
   [<c05048cd>] elv_insert+0x114/0x1a2
   [<c05049ec>] __elv_add_request+0x91/0x96
   [<c0507a00>] __make_request+0x365/0x397
   [<c050635a>] generic_make_request+0x342/0x3ce
   [<c0507b21>] submit_bio+0xef/0xfa
   [<c04bf495>] submit_bh+0xe3/0x102
   [<c04c04b0>] ll_rw_block+0xbe/0xf7
   [<f80c35ba>] ext3_bread+0x39/0x79 [ext3]
   [<f80c5643>] dx_probe+0x2f/0x298 [ext3]
   [<f80c5956>] ext3_find_entry+0xaa/0x573 [ext3]
   [<f80c739e>] ext3_lookup+0x31/0xbe [ext3]
   [<c04abf7c>] do_lookup+0xbc/0x159
   [<c04ad7e8>] __link_path_walk+0x6e9/0xa71
   [<c04adbae>] path_walk+0x3e/0x77
   [<c04add0e>] do_path_lookup+0xeb/0x105
   [<c04ae584>] user_path_at+0x41/0x6c
   [<c04a8301>] vfs_fstatat+0x32/0x59
   [<c04a8417>] vfs_stat+0x18/0x1a
   [<c04a8432>] sys_stat64+0x19/0x2d
   [<c0402a68>] sysenter_do_call+0x12/0x36
   [<ffffffff>] 0xffffffff

 -> (&iocg->lock){+.+...} ops: 3 {
    HARDIRQ-ON-W at:
                          [<c044b840>] mark_held_locks+0x3d/0x58
                          [<c044b963>] trace_hardirqs_on_caller+0x108/0x14c
                          [<c044b9b2>] trace_hardirqs_on+0xb/0xd
                          [<c0630883>] _spin_unlock_irq+0x27/0x47
                          [<c0513baa>] iocg_destroy+0xbc/0x118
                          [<c045a16a>] cgroup_diput+0x4b/0xa7
                          [<c04b1dbb>] dentry_iput+0x78/0x9c
                          [<c04b1e82>] d_kill+0x21/0x3b
                          [<c04b2f2a>] dput+0xf3/0xfc
                          [<c04ae226>] do_rmdir+0x9a/0xc8
                          [<c04ae29d>] sys_rmdir+0x15/0x17
                          [<c0402a68>] sysenter_do_call+0x12/0x36
                          [<ffffffff>] 0xffffffff
    SOFTIRQ-ON-W at:
                          [<c044b840>] mark_held_locks+0x3d/0x58
                          [<c044b97c>] trace_hardirqs_on_caller+0x121/0x14c
                          [<c044b9b2>] trace_hardirqs_on+0xb/0xd
                          [<c0630883>] _spin_unlock_irq+0x27/0x47
                          [<c0513baa>] iocg_destroy+0xbc/0x118
                          [<c045a16a>] cgroup_diput+0x4b/0xa7
                          [<c04b1dbb>] dentry_iput+0x78/0x9c
                          [<c04b1e82>] d_kill+0x21/0x3b
                          [<c04b2f2a>] dput+0xf3/0xfc
                          [<c04ae226>] do_rmdir+0x9a/0xc8
                          [<c04ae29d>] sys_rmdir+0x15/0x17
                          [<c0402a68>] sysenter_do_call+0x12/0x36
                          [<ffffffff>] 0xffffffff
    INITIAL USE at:
                         [<c044dad5>] __lock_acquire+0x58c/0x73e
                         [<c044dd36>] lock_acquire+0xaf/0xcc
                         [<c06304ea>] _spin_lock_irq+0x30/0x3f
                         [<c05119bd>] io_alloc_root_group+0x104/0x155
                         [<c05133cb>] elv_init_fq_data+0x32/0xe0
                         [<c0504317>] elevator_alloc+0x150/0x170
                         [<c0505393>] elevator_init+0x9d/0x100
                         [<c0507088>] blk_init_queue_node+0xc4/0xf7
                         [<c05070cb>] blk_init_queue+0x10/0x12
                         [<f81060fd>] __scsi_alloc_queue+0x1c/0xba [scsi_mod]
                         [<f81061b0>] scsi_alloc_queue+0x15/0x4e [scsi_mod]
                         [<f810803d>] scsi_alloc_sdev+0x154/0x1f5 [scsi_mod]
                         [<f8108387>] scsi_probe_and_add_lun+0x123/0xb5b [scsi_mod]
                         [<f8109847>] __scsi_add_device+0x8a/0xb0 [scsi_mod]
                         [<f816ad14>] ata_scsi_scan_host+0x77/0x141 [libata]
                         [<f816903f>] async_port_probe+0xa0/0xa9 [libata]
                         [<c044341f>] async_thread+0xe9/0x1c9
                         [<c043e204>] kthread+0x4a/0x72
                         [<c04034e7>] kernel_thread_helper+0x7/0x10
                         [<ffffffff>] 0xffffffff
  }
  ... key      at: [<c0c5ebd8>] __key.29462+0x0/0x8
 ... acquired at:
   [<c044d243>] validate_chain+0x8a8/0xbae
   [<c044dbfd>] __lock_acquire+0x6b4/0x73e
   [<c044dd36>] lock_acquire+0xaf/0xcc
   [<c063056b>] _spin_lock_irqsave+0x33/0x43
   [<c0510f6f>] io_group_chain_link+0x5c/0x106
   [<c0511ba7>] io_find_alloc_group+0x54/0x60
   [<c0511c11>] io_get_io_group_bio+0x5e/0x89
   [<c0511cc3>] io_group_get_request_list+0x12/0x21
   [<c0507485>] get_request_wait+0x124/0x15d
   [<c050797e>] __make_request+0x2e3/0x397
   [<c050635a>] generic_make_request+0x342/0x3ce
   [<c0507b21>] submit_bio+0xef/0xfa
   [<c04c6c4e>] mpage_bio_submit+0x21/0x26
   [<c04c7b7f>] mpage_readpages+0xa3/0xad
   [<f80c1ea8>] ext3_readpages+0x19/0x1b [ext3]
   [<c048275e>] __do_page_cache_readahead+0xfd/0x166
   [<c048294a>] ondemand_readahead+0x10a/0x118
   [<c04829db>] page_cache_sync_readahead+0x1b/0x20
   [<c047cf37>] generic_file_aio_read+0x226/0x545
   [<c04a4cf6>] do_sync_read+0xb0/0xee
   [<c04a54b0>] vfs_read+0x8f/0x136
   [<c04a8d7c>] kernel_read+0x39/0x4b
   [<c04a8e69>] prepare_binprm+0xdb/0xe3
   [<c04a9ca8>] do_execve+0x175/0x2cc
   [<c04015c0>] sys_execve+0x2b/0x54
   [<c0402a68>] sysenter_do_call+0x12/0x36
   [<ffffffff>] 0xffffffff


stack backtrace:
Pid: 2186, comm: rmdir Not tainted 2.6.30-rc4-io #6
Call Trace:
 [<c044b1ac>] print_irq_inversion_bug+0x13b/0x147
 [<c044c3e5>] check_usage_backwards+0x7d/0x86
 [<c044b5ec>] mark_lock+0x2d3/0x4ea
 [<c044c368>] ? check_usage_backwards+0x0/0x86
 [<c044b840>] mark_held_locks+0x3d/0x58
 [<c0630883>] ? _spin_unlock_irq+0x27/0x47
 [<c044b97c>] trace_hardirqs_on_caller+0x121/0x14c
 [<c044b9b2>] trace_hardirqs_on+0xb/0xd
 [<c0630883>] _spin_unlock_irq+0x27/0x47
 [<c0513baa>] iocg_destroy+0xbc/0x118
 [<c045a16a>] cgroup_diput+0x4b/0xa7
 [<c04b1dbb>] dentry_iput+0x78/0x9c
 [<c04b1e82>] d_kill+0x21/0x3b
 [<c04b2f2a>] dput+0xf3/0xfc
 [<c04ae226>] do_rmdir+0x9a/0xc8
 [<c04029b1>] ? resume_userspace+0x11/0x28
 [<c051aa14>] ? trace_hardirqs_on_thunk+0xc/0x10
 [<c0402b34>] ? restore_nocheck_notrace+0x0/0xe
 [<c06324a0>] ? do_page_fault+0x0/0x2fd
 [<c044b97c>] ? trace_hardirqs_on_caller+0x121/0x14c
 [<c04ae29d>] sys_rmdir+0x15/0x17
 [<c0402a68>] sysenter_do_call+0x12/0x36

^ permalink raw reply	[flat|nested] 297+ messages in thread


* Re: IO scheduler based IO Controller V2
  2009-05-06 16:10       ` Vivek Goyal
  (?)
  (?)
@ 2009-05-07  5:47       ` Gui Jianfeng
  1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-07  5:47 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

[-- Attachment #1: Type: text/plain, Size: 2218 bytes --]

Vivek Goyal wrote:
> Hi Gui,
> 
> Thanks for the report. I use cgroup_path() for debugging. I guess that
> cgroup_path() was passed a null cgrp pointer; that's why it crashed.
> 
> If yes, then it is strange though. I call cgroup_path() only after
> grabbing a reference to the css object. (I am assuming that if I have a
> valid reference to the css object then css->cgrp can't be null.)

  I think so too...

> 
> Anyway, can you please try out following patch and see if it fixes your
> crash.
> 
> ---
>  block/elevator-fq.c |   10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> Index: linux11/block/elevator-fq.c
> ===================================================================
> --- linux11.orig/block/elevator-fq.c	2009-05-05 15:38:06.000000000 -0400
> +++ linux11/block/elevator-fq.c	2009-05-06 11:55:47.000000000 -0400
> @@ -125,6 +125,9 @@ static void io_group_path(struct io_grou
>  	unsigned short id = iog->iocg_id;
>  	struct cgroup_subsys_state *css;
>  
> +	/* For error case */
> +	buf[0] = '\0';
> +
>  	rcu_read_lock();
>  
>  	if (!id)
> @@ -137,15 +140,12 @@ static void io_group_path(struct io_grou
>  	if (!css_tryget(css))
>  		goto out;
>  
> -	cgroup_path(css->cgroup, buf, buflen);
> +	if (css->cgroup)

  According to CR2, css->cgroup was 0x00000100 when the kernel crashed,
  i.e. a non-NULL but invalid pointer, so I guess this patch won't fix
  this issue.

> +		cgroup_path(css->cgroup, buf, buflen);
>  
>  	css_put(css);
> -
> -	rcu_read_unlock();
> -	return;
>  out:
>  	rcu_read_unlock();
> -	buf[0] = '\0';
>  	return;
>  }
>  #endif
> 
> BTW, I tried the following equivalent script and I can't see the crash
> on my system. Are you able to hit it regularly?

  Yes, there's about a 50% chance that I can reproduce it.
  I've attached the rwio source code.

> 
> Instead of killing the tasks, I also tried moving the tasks into the root
> cgroup and then deleting the test1 and test2 groups; that also did not
> produce any crash. (I hit a different bug though after 5-6 attempts :-)
> 
> As I mentioned in the patchset, we currently do have issues with group
> refcounting and the cgroup/group going away. Hopefully in the next version
> they will all be fixed up. But still, it is nice to hear back...
> 
> 

-- 
Regards
Gui Jianfeng

[-- Attachment #2: rwio.c --]
[-- Type: image/x-xbitmap, Size: 1613 bytes --]

^ permalink raw reply	[flat|nested] 297+ messages in thread
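
The exchange above boils down to one lookup pattern: resolve a cgroup path
from a css id under RCU, default the output buffer up front so every error
path returns an empty string, and pin the css with css_tryget() before
dereferencing it. A minimal sketch of that pattern follows; it is
illustrative only, not the patchset's io_group_path(), and the css_lookup()
step is an assumption, since that part of the function is not visible in
the quoted hunks.

#include <linux/cgroup.h>
#include <linux/rcupdate.h>

/*
 * Sketch of the lookup pattern discussed above (illustrative only, not
 * the patchset's io_group_path(); css_lookup() is assumed, as that part
 * of the function is not visible in the quoted hunks).
 */
static void sketch_cgroup_path_from_id(struct cgroup_subsys *ss,
				       unsigned short id,
				       char *buf, int buflen)
{
	struct cgroup_subsys_state *css;

	/* Default for every error path, set before anything can fail. */
	buf[0] = '\0';

	rcu_read_lock();
	if (!id)
		goto out;

	css = css_lookup(ss, id);	/* may be NULL if the group is gone */
	if (!css || !css_tryget(css))	/* do not touch a css we cannot pin */
		goto out;

	if (css->cgroup)		/* defensive; should hold once pinned */
		cgroup_path(css->cgroup, buf, buflen);

	css_put(css);
out:
	rcu_read_unlock();
}

Note that the css->cgroup check only guards against a NULL pointer; as the
report above points out, it cannot help when the css has already been freed
and the pointer is stale (0x00000100), which points back at the group
refcounting issues acknowledged in the thread.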


* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevaotor layer
  2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-07  7:42   ` Gui Jianfeng
  2009-05-07  8:05     ` Li Zefan
                       ` (2 more replies)
       [not found]   ` <1241553525-28095-6-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-08 21:09   ` Andrea Righi
  2 siblings, 3 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-07  7:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
> This patch enables hierarchical fair queuing in common layer. It is
> controlled by config option CONFIG_GROUP_IOSCHED.
...
> +}
> +
> +void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> +{
> +	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
> +	struct hlist_node *n, *tmp;
> +	struct io_group *iog;
> +
> +	/*
> +	 * Since we are destroying the cgroup, there are no more tasks
> +	 * referencing it, and all the RCU grace periods that may have
> +	 * referenced it are ended (as the destruction of the parent
> +	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
> +	 * anything else and we don't need any synchronization.
> +	 */
> +	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
> +		io_destroy_group(iocg, iog);
> +
> +	BUG_ON(!hlist_empty(&iocg->group_data));
> +

    Hi Vivek,

    IMHO, free_css_id() needs to be called here.

> +	kfree(iocg);
> +}
> +
> +void io_disconnect_groups(struct elevator_queue *e)
> +{
> +	struct hlist_node *pos, *n;
> +	struct io_group *iog;
> +	struct elv_fq_data *efqd = &e->efqd;
> +
> +	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
> +					elv_data_node) {
> +		hlist_del(&iog->elv_data_node);
> +

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevaotor layer
  2009-05-07  7:42   ` Gui Jianfeng
@ 2009-05-07  8:05     ` Li Zefan
       [not found]     ` <4A0290ED.7080506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-05-08 12:45     ` Vivek Goyal
  2 siblings, 0 replies; 297+ messages in thread
From: Li Zefan @ 2009-05-07  8:05 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: Vivek Goyal, nauman, dpshah, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Gui Jianfeng wrote:
> Vivek Goyal wrote:
>> This patch enables hierarchical fair queuing in common layer. It is
>> controlled by config option CONFIG_GROUP_IOSCHED.
> ...
>> +}
>> +
>> +void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>> +{
>> +	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
>> +	struct hlist_node *n, *tmp;
>> +	struct io_group *iog;
>> +
>> +	/*
>> +	 * Since we are destroying the cgroup, there are no more tasks
>> +	 * referencing it, and all the RCU grace periods that may have
>> +	 * referenced it are ended (as the destruction of the parent
>> +	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
>> +	 * anything else and we don't need any synchronization.
>> +	 */
>> +	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
>> +		io_destroy_group(iocg, iog);
>> +
>> +	BUG_ON(!hlist_empty(&iocg->group_data));
>> +
> 
>     Hi Vivek,
> 
>     IMHO, free_css_id() needs to be called here.
> 

Right.

Though alloc_css_id() is called by the cgroup core in cgroup_create(),
free_css_id() should be called by the subsystem itself.

This is a bit strange, but it's required by the memory cgroup. Normally,
free_css_id() is called in the destroy() handler, but memcg calls it
when a mem_cgroup's refcnt drops to 0. When a cgroup is destroyed,
the mem_cgroup won't be destroyed (refcnt > 0) if it still has records
in swap entries.
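
Concretely, the suggestion boils down to something like the hunk below.
This is only a sketch, not a tested patch: it assumes the subsystem object
is named io_subsys and that struct io_cgroup embeds its
cgroup_subsys_state as ->css, neither of which is visible in the quoted
code.

 	BUG_ON(!hlist_empty(&iocg->group_data));
 
+	/* css id is allocated by the cgroup core, but freed by the subsystem */
+	free_css_id(&io_subsys, &iocg->css);
+
 	kfree(iocg);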

>> +	kfree(iocg);
>> +}
>> +
>> +void io_disconnect_groups(struct elevator_queue *e)
>> +{
>> +	struct hlist_node *pos, *n;
>> +	struct io_group *iog;
>> +	struct elv_fq_data *efqd = &e->efqd;
>> +
>> +	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
>> +					elv_data_node) {
>> +		hlist_del(&iog->elv_data_node);
>> +
> 


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-06 21:52             ` Vivek Goyal
                               ` (2 preceding siblings ...)
  (?)
@ 2009-05-07  9:04             ` Andrea Righi
  2009-05-07 12:22               ` Andrea Righi
                                 ` (3 more replies)
  -1 siblings, 4 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-07  9:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > > Without io-throttle patches
> > > ---------------------------
> > > - Two readers, first BE prio 7, second BE prio 0
> > > 
> > > 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> > > High prio reader finished
> > > 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> > > 
> > > Note: There is no service differentiation between prio 0 and prio 7 task
> > >       with io-throttle patches.
> > > 
> > > Test 3
> > > ======
> > > - Run the one RT reader and one BE reader in root cgroup without any
> > >   limitations. I guess this should mean unlimited BW and behavior should
> > >   be same as with CFQ without io-throttling patches.
> > > 
> > > With io-throttle patches
> > > =========================
> > > Ran the test 4 times because I was getting different results in different
> > > runs.
> > > 
> > > - Two readers, one RT prio 0  other BE prio 7
> > > 
> > > 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> > > 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> > > RT task finished
> > > 
> > > 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> > > RT task finished
> > > 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> > > 
> > > 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> > > RT task finished
> > > 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> > > 
> > > 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> > > 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> > > RT task finished
> > > 
> > > Note: Out of 4 runs, looks like twice it is complete priority inversion
> > >       and RT task finished after BE task. Rest of the two times, the
> > >       difference between BW of RT and BE task is much less as compared to
> > >       without patches. In fact once it was almost same.
> > 
> > This is strange. If you don't set any limit there shouldn't be any
> > difference respect to the other case (without io-throttle patches).
> > 
> > At worst a small overhead given by the task_to_iothrottle(), under
> > rcu_read_lock(). I'll repeat this test ASAP and see if I'll be able to
> > reproduce this strange behaviour.
> 
> Ya, I also found this strange. At least in root group there should not be
> any behavior change (at max one might expect little drop in throughput
> because of extra code).

Hi Vivek,

I'm not able to reproduce the strange behaviour above.

Which commands are you running exactly? Is the system isolated (stupid
question), with no cron jobs or background tasks doing IO during the tests?

Here is the script I've used:

$ cat test.sh
#!/bin/sh
echo 3 > /proc/sys/vm/drop_caches
ionice -c 1 -n 0 dd if=bigfile1 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/RT: \1/" &
cat /proc/$!/cgroup | sed "s/\(.*\)/RT: \1/"
ionice -c 2 -n 7 dd if=bigfile2 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/BE: \1/" &
cat /proc/$!/cgroup | sed "s/\(.*\)/BE: \1/"
for i in 1 2; do
	wait
done

And the results on my PC:

2.6.30-rc4
~~~~~~~~~~
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 21.3406 s, 11.5 MB/s
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.989 s, 20.5 MB/s
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.4436 s, 10.5 MB/s
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.9555 s, 20.5 MB/s
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 21.622 s, 11.3 MB/s
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.9856 s, 20.5 MB/s
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 21.5664 s, 11.4 MB/s
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.8522 s, 20.7 MB/s

2.6.30-rc4 + io-throttle, no BW limit, both tasks in the root cgroup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ sudo sh ./test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.6739 s, 10.4 MB/s
BE: cgroup 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 12.2853 s, 20.0 MB/s
RT: 4:blockio:/
$ sudo sh ./test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.7483 s, 10.3 MB/s
BE: cgroup 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 12.3597 s, 19.9 MB/s
RT: 4:blockio:/
$ sudo sh ./test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.6843 s, 10.4 MB/s
BE: cgroup 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 12.4886 s, 19.6 MB/s
RT: 4:blockio:/
$ sudo sh ./test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 23.8621 s, 10.3 MB/s
BE: cgroup 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 12.6737 s, 19.4 MB/s
RT: 4:blockio:/

The difference seems to be just the expected overhead.

-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-07  9:04             ` Andrea Righi
@ 2009-05-07 12:22               ` Andrea Righi
  2009-05-07 12:22               ` Andrea Righi
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-07 12:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Thu, May 07, 2009 at 11:04:50AM +0200, Andrea Righi wrote:
> On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > > > Without io-throttle patches
> > > > ---------------------------
> > > > - Two readers, first BE prio 7, second BE prio 0
> > > > 
> > > > 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> > > > High prio reader finished
> > > > 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> > > > 
> > > > Note: There is no service differentiation between prio 0 and prio 7 task
> > > >       with io-throttle patches.
> > > > 
> > > > Test 3
> > > > ======
> > > > - Run the one RT reader and one BE reader in root cgroup without any
> > > >   limitations. I guess this should mean unlimited BW and behavior should
> > > >   be same as with CFQ without io-throttling patches.
> > > > 
> > > > With io-throttle patches
> > > > =========================
> > > > Ran the test 4 times because I was getting different results in different
> > > > runs.
> > > > 
> > > > - Two readers, one RT prio 0  other BE prio 7
> > > > 
> > > > 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> > > > 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> > > > RT task finished
> > > > 
> > > > 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> > > > RT task finished
> > > > 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> > > > 
> > > > 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> > > > RT task finished
> > > > 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> > > > 
> > > > 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> > > > 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> > > > RT task finished
> > > > 
> > > > Note: Out of 4 runs, looks like twice it is complete priority inversion
> > > >       and RT task finished after BE task. Rest of the two times, the
> > > >       difference between BW of RT and BE task is much less as compared to
> > > >       without patches. In fact once it was almost same.
> > > 
> > > This is strange. If you don't set any limit there shouldn't be any
> > > difference respect to the other case (without io-throttle patches).
> > > 
> > > At worst a small overhead given by the task_to_iothrottle(), under
> > > rcu_read_lock(). I'll repeat this test ASAP and see if I'll be able to
> > > reproduce this strange behaviour.
> > 
> > Ya, I also found this strange. At least in root group there should not be
> > any behavior change (at max one might expect little drop in throughput
> > because of extra code).
> 
> Hi Vivek,
> 
> I'm not able to reproduce the strange behaviour above.
> 
> Which commands are you running exactly? Is the system isolated (stupid
> question), with no cron jobs or background tasks doing IO during the tests?
> 
> Here is the script I've used:
> 
> $ cat test.sh
> #!/bin/sh
> echo 3 > /proc/sys/vm/drop_caches
> ionice -c 1 -n 0 dd if=bigfile1 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/RT: \1/" &
> cat /proc/$!/cgroup | sed "s/\(.*\)/RT: \1/"
> ionice -c 2 -n 7 dd if=bigfile2 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/BE: \1/" &
> cat /proc/$!/cgroup | sed "s/\(.*\)/BE: \1/"
> for i in 1 2; do
> 	wait
> done
> 
> And the results on my PC:
> 
> 2.6.30-rc4
> ~~~~~~~~~~
> $ sudo sh test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 21.3406 s, 11.5 MB/s
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 11.989 s, 20.5 MB/s
> $ sudo sh test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.4436 s, 10.5 MB/s
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 11.9555 s, 20.5 MB/s
> $ sudo sh test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 21.622 s, 11.3 MB/s
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 11.9856 s, 20.5 MB/s
> $ sudo sh test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 21.5664 s, 11.4 MB/s
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 11.8522 s, 20.7 MB/s
> 
> 2.6.30-rc4 + io-throttle, no BW limit, both tasks in the root cgroup
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> $ sudo sh ./test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.6739 s, 10.4 MB/s
> BE: cgroup 4:blockio:/
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 12.2853 s, 20.0 MB/s
> RT: 4:blockio:/
> $ sudo sh ./test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.7483 s, 10.3 MB/s
> BE: cgroup 4:blockio:/
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 12.3597 s, 19.9 MB/s
> RT: 4:blockio:/
> $ sudo sh ./test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.6843 s, 10.4 MB/s
> BE: cgroup 4:blockio:/
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 12.4886 s, 19.6 MB/s
> RT: 4:blockio:/
> $ sudo sh ./test.sh | sort
> BE: 234+0 records in
> BE: 234+0 records out
> BE: 245366784 bytes (245 MB) copied, 23.8621 s, 10.3 MB/s
> BE: cgroup 4:blockio:/
> RT: 234+0 records in
> RT: 234+0 records out
> RT: 245366784 bytes (245 MB) copied, 12.6737 s, 19.4 MB/s
> RT: 4:blockio:/
> 
> The difference seems to be just the expected overhead.

BTW, it is possible to reduce the io-throttle overhead even further for
non-io-throttle users (even when CONFIG_CGROUP_IO_THROTTLE is enabled) by
using the trick below.

2.6.30-rc4 + io-throttle + following patch, no BW limit, tasks in root cgroup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 17.462 s, 14.1 MB/s
BE: 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.7865 s, 20.8 MB/s
RT: 4:blockio:/
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 18.8375 s, 13.0 MB/s
BE: 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.9148 s, 20.6 MB/s
RT: 4:blockio:/
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 19.6826 s, 12.5 MB/s
BE: 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.8715 s, 20.7 MB/s
RT: 4:blockio:/
$ sudo sh test.sh | sort
BE: 234+0 records in
BE: 234+0 records out
BE: 245366784 bytes (245 MB) copied, 18.9152 s, 13.0 MB/s
BE: 4:blockio:/
RT: 234+0 records in
RT: 234+0 records out
RT: 245366784 bytes (245 MB) copied, 11.8925 s, 20.6 MB/s
RT: 4:blockio:/

[ To be applied on top of io-throttle v16 ]

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
 block/blk-io-throttle.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
index e2dfd24..8b45c71 100644
--- a/block/blk-io-throttle.c
+++ b/block/blk-io-throttle.c
@@ -131,6 +131,14 @@ struct iothrottle_node {
 	struct iothrottle_stat stat;
 };
 
+/*
+ * This is a trick to reduce the unneeded overhead when io-throttle is not used
+ * at all. We use a counter of the io-throttle rules; if the counter is zero,
+ * we immediately return from the io-throttle hooks, without accounting IO and
+ * without checking if we need to apply some limiting rules.
+ */
+static atomic_t iothrottle_node_count __read_mostly;
+
 /**
  * struct iothrottle - throttling rules for a cgroup
  * @css: pointer to the cgroup state
@@ -193,6 +201,7 @@ static void iothrottle_insert_node(struct iothrottle *iot,
 {
 	WARN_ON_ONCE(!cgroup_is_locked());
 	list_add_rcu(&n->node, &iot->list);
+	atomic_inc(&iothrottle_node_count);
 }
 
 /*
@@ -214,6 +223,7 @@ iothrottle_delete_node(struct iothrottle *iot, struct iothrottle_node *n)
 {
 	WARN_ON_ONCE(!cgroup_is_locked());
 	list_del_rcu(&n->node);
+	atomic_dec(&iothrottle_node_count);
 }
 
 /*
@@ -250,8 +260,10 @@ static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 	 * reference to the list.
 	 */
 	if (!list_empty(&iot->list))
-		list_for_each_entry_safe(n, p, &iot->list, node)
+		list_for_each_entry_safe(n, p, &iot->list, node) {
 			kfree(n);
+			atomic_dec(&iothrottle_node_count);
+		}
 	kfree(iot);
 }
 
@@ -836,7 +848,7 @@ cgroup_io_throttle(struct bio *bio, struct block_device *bdev, ssize_t bytes)
 	unsigned long long sleep;
 	int type, can_sleep = 1;
 
-	if (iothrottle_disabled())
+	if (iothrottle_disabled() || !atomic_read(&iothrottle_node_count))
 		return 0;
 	if (unlikely(!bdev))
 		return 0;
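
One note on the idea (an observation, not part of the patch): the counter
only has to stay loosely in sync with the per-cgroup rule lists. A
transiently stale read either sends one request through the normal slow
path or lets one request slip past a rule that was added an instant
earlier, both of which are harmless for throttling purposes, so a bare
atomic_t with no extra locking around the read looks sufficient. Spelled
out as a hypothetical helper (not in the patch):

static inline int iothrottle_has_rules(void)
{
	/* Non-zero only if at least one io-throttle rule exists system-wide */
	return atomic_read(&iothrottle_node_count) != 0;
}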

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-07  9:04             ` Andrea Righi
                                 ` (2 preceding siblings ...)
  2009-05-07 14:11               ` Vivek Goyal
@ 2009-05-07 14:11               ` Vivek Goyal
       [not found]                 ` <20090507141126.GA9463-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  3 siblings, 1 reply; 297+ messages in thread
From: Vivek Goyal @ 2009-05-07 14:11 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Thu, May 07, 2009 at 11:04:50AM +0200, Andrea Righi wrote:
> On Wed, May 06, 2009 at 05:52:35PM -0400, Vivek Goyal wrote:
> > > > Without io-throttle patches
> > > > ---------------------------
> > > > - Two readers, first BE prio 7, second BE prio 0
> > > > 
> > > > 234179072 bytes (234 MB) copied, 4.12074 s, 56.8 MB/s
> > > > High prio reader finished
> > > > 234179072 bytes (234 MB) copied, 5.36023 s, 43.7 MB/s
> > > > 
> > > > Note: There is no service differentiation between prio 0 and prio 7 task
> > > >       with io-throttle patches.
> > > > 
> > > > Test 3
> > > > ======
> > > > - Run the one RT reader and one BE reader in root cgroup without any
> > > >   limitations. I guess this should mean unlimited BW and behavior should
> > > >   be same as with CFQ without io-throttling patches.
> > > > 
> > > > With io-throttle patches
> > > > =========================
> > > > Ran the test 4 times because I was getting different results in different
> > > > runs.
> > > > 
> > > > - Two readers, one RT prio 0  other BE prio 7
> > > > 
> > > > 234179072 bytes (234 MB) copied, 2.74604 s, 85.3 MB/s
> > > > 234179072 bytes (234 MB) copied, 5.20995 s, 44.9 MB/s
> > > > RT task finished
> > > > 
> > > > 234179072 bytes (234 MB) copied, 4.54417 s, 51.5 MB/s
> > > > RT task finished
> > > > 234179072 bytes (234 MB) copied, 5.23396 s, 44.7 MB/s
> > > > 
> > > > 234179072 bytes (234 MB) copied, 5.17727 s, 45.2 MB/s
> > > > RT task finished
> > > > 234179072 bytes (234 MB) copied, 5.25894 s, 44.5 MB/s
> > > > 
> > > > 234179072 bytes (234 MB) copied, 2.74141 s, 85.4 MB/s
> > > > 234179072 bytes (234 MB) copied, 5.20536 s, 45.0 MB/s
> > > > RT task finished
> > > > 
> > > > Note: Out of 4 runs, looks like twice it is complete priority inversion
> > > >       and RT task finished after BE task. Rest of the two times, the
> > > >       difference between BW of RT and BE task is much less as compared to
> > > >       without patches. In fact once it was almost same.
> > > 
> > > This is strange. If you don't set any limit there shouldn't be any
> > > difference respect to the other case (without io-throttle patches).
> > > 
> > > At worst a small overhead given by the task_to_iothrottle(), under
> > > rcu_read_lock(). I'll repeat this test ASAP and see if I'll be able to
> > > reproduce this strange behaviour.
> > 
> > Ya, I also found this strange. At least in root group there should not be
> > any behavior change (at max one might expect little drop in throughput
> > because of extra code).
> 
> Hi Vivek,
> 
> I'm not able to reproduce the strange behaviour above.
> 
> Which commands are you running exactly? Is the system isolated (stupid
> question), with no cron jobs or background tasks doing IO during the tests?
> 
> Here is the script I've used:
> 
> $ cat test.sh
> #!/bin/sh
> echo 3 > /proc/sys/vm/drop_caches
> ionice -c 1 -n 0 dd if=bigfile1 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/RT: \1/" &
> cat /proc/$!/cgroup | sed "s/\(.*\)/RT: \1/"
> ionice -c 2 -n 7 dd if=bigfile2 of=/dev/null bs=1M 2>&1 | sed "s/\(.*\)/BE: \1/" &
> cat /proc/$!/cgroup | sed "s/\(.*\)/BE: \1/"
> for i in 1 2; do
> 	wait
> done
> 
> And the results on my PC:
> 

[..]

> The difference seems to be just the expected overhead.

Hmm, something is really amiss here. I took your script and ran it on
my system, and I still see the issue. There is nothing else running on the
system, and it is isolated.

2.6.30-rc4 + io-throttle patches V16
===================================
It is a freshly booted system with nothing extra running on it. This is a
4-core system.

Disk1
=====
This is a fast disk that supports a queue depth of 31.

The following is the output picked from dmesg for the device's properties:
[    3.016099] sd 2:0:0:0: [sdb] 488397168 512-byte hardware sectors: (250 GB/232 GiB)
[    3.016188] sd 2:0:0:0: Attached scsi generic sg2 type 0

The following are the results of 4 runs of your script. (I just changed the
script to read the right file on my system: if=/mnt/sdb/zerofile1.)

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 4.38435 s, 53.4 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 5.20706 s, 45.0 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.12953 s, 45.7 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 5.23573 s, 44.7 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 3.54644 s, 66.0 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 5.19406 s, 45.1 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 5.21908 s, 44.9 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.23802 s, 44.7 MB/s

Disk2
=====
This is a relatively slower disk with no command queuing.

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 7.06471 s, 33.1 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 8.01571 s, 29.2 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 7.89043 s, 29.7 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 8.03428 s, 29.1 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.38942 s, 31.7 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 8.01146 s, 29.2 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.78351 s, 30.1 MB/s
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 8.06292 s, 29.0 MB/s

Disk3
=====
This is an Intel SSD.

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.993735 s, 236 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.98772 s, 118 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 1.8616 s, 126 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.98499 s, 118 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 1.01174 s, 231 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.99143 s, 118 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 1.96132 s, 119 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.97746 s, 118 MB/s

Results without io-throttle patches (vanilla 2.6.30-rc4)
========================================================

Disk 1
======
This is the relatively faster SATA drive, with command queuing enabled.

RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 2.84065 s, 82.4 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.30087 s, 44.2 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 2.69688 s, 86.8 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.18175 s, 45.2 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 2.73279 s, 85.7 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.21803 s, 44.9 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 2.69304 s, 87.0 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 5.17821 s, 45.2 MB/s

Disk 2
======
Slower disk with no command queuing.

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 4.29453 s, 54.5 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 8.04978 s, 29.1 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 3.96924 s, 59.0 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.74984 s, 30.2 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 4.11254 s, 56.9 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.8678 s, 29.8 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 3.95979 s, 59.1 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 7.73976 s, 30.3 MB/s

Disk3
=====
Intel SSD

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.996762 s, 235 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.93268 s, 121 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.98511 s, 238 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.92481 s, 122 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.986981 s, 237 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.9312 s, 121 MB/s

[root@chilli io-throttle-tests]# ./andrea-test-script.sh 
RT: 223+1 records in
RT: 223+1 records out
RT: 234179072 bytes (234 MB) copied, 0.988448 s, 237 MB/s
BE: 223+1 records in
BE: 223+1 records out
BE: 234179072 bytes (234 MB) copied, 1.93885 s, 121 MB/s

So I am still seeing the issue with different kinds of disks also. At this
point I am really not sure why I am seeing such results.

I have the following patches applied on 2.6.30-rc4 (V16).

3954-vivek.goyal2008-res_counter-introduce-ratelimiting-attributes.patch
3955-vivek.goyal2008-page_cgroup-provide-a-generic-page-tracking-infrastructure.patch
3956-vivek.goyal2008-io-throttle-controller-infrastructure.patch
3957-vivek.goyal2008-kiothrottled-throttle-buffered-io.patch
3958-vivek.goyal2008-io-throttle-instrumentation.patch
3959-vivek.goyal2008-io-throttle-export-per-task-statistics-to-userspace.patch

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-07 14:11               ` Vivek Goyal
@ 2009-05-07 14:45                     ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-07 14:45 UTC (permalink / raw)
  To: Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Thu, May 07, 2009 at 10:11:26AM -0400, Vivek Goyal wrote:

[..]
> [root@chilli io-throttle-tests]# ./andrea-test-script.sh 
> RT: 223+1 records in
> RT: 223+1 records out
> RT: 234179072 bytes (234 MB) copied, 0.988448 s, 237 MB/s
> BE: 223+1 records in
> BE: 223+1 records out
> BE: 234179072 bytes (234 MB) copied, 1.93885 s, 121 MB/s
> 
> So I am still seeing the issue with differnt kind of disks also. At this point
> of time I am really not sure why I am seeing such results.

Hold on. I think I found the culprit here. I was wondering what the
difference between the two setups could be and realized that with the
vanilla kernel I had done "make defconfig" while with the io-throttle
kernel I had used an old config of mine and did "make oldconfig". So
basically the config files were different.

I have now used the same config file and the issue seems to have gone away.
I will look into why an old config file can cause such issues.

So now we are left with the issue of losing the notion of priority and
class within a cgroup. In fact, on bigger systems we will probably run into
kiothrottled scalability issues, as a single thread is trying to cater to
all the disks.

If we do max bw control at the IO scheduler level, then I think we should
be able to control max bw while maintaining the notion of priority and
class within a cgroup. Also, there are multiple pdflush threads, and Jens
seems to be pushing per-bdi flusher threads, which will help us achieve
greater scalability, and we don't have to replicate that infrastructure
for kiothrottled either.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
@ 2009-05-07 14:45                     ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-07 14:45 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Thu, May 07, 2009 at 10:11:26AM -0400, Vivek Goyal wrote:

[..]
> [root@chilli io-throttle-tests]# ./andrea-test-script.sh 
> RT: 223+1 records in
> RT: 223+1 records out
> RT: 234179072 bytes (234 MB) copied, 0.988448 s, 237 MB/s
> BE: 223+1 records in
> BE: 223+1 records out
> BE: 234179072 bytes (234 MB) copied, 1.93885 s, 121 MB/s
> 
> So I am still seeing the issue with differnt kind of disks also. At this point
> of time I am really not sure why I am seeing such results.

Hold on. I think I found the culprit here. I was wondering what the
difference between the two setups could be and realized that with the
vanilla kernel I had done "make defconfig" while with the io-throttle
kernel I had used an old config of mine and did "make oldconfig". So
basically the config files were different.

I have now used the same config file and the issue seems to have gone away.
I will look into why an old config file can cause such issues.

So now we are left with the issue of losing the notion of priority and
class within a cgroup. In fact, on bigger systems we will probably run into
kiothrottled scalability issues, as a single thread is trying to cater to
all the disks.

If we do max bw control at the IO scheduler level, then I think we should
be able to control max bw while maintaining the notion of priority and
class within a cgroup. Also, there are multiple pdflush threads, and Jens
seems to be pushing per-bdi flusher threads, which will help us achieve
greater scalability, and we don't have to replicate that infrastructure
for kiothrottled either.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-07 14:45                     ` Vivek Goyal
@ 2009-05-07 15:36                         ` Vivek Goyal
  -1 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-07 15:36 UTC (permalink / raw)
  To: Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Thu, May 07, 2009 at 10:45:01AM -0400, Vivek Goyal wrote:
> On Thu, May 07, 2009 at 10:11:26AM -0400, Vivek Goyal wrote:
> 
> [..]
> > [root@chilli io-throttle-tests]# ./andrea-test-script.sh 
> > RT: 223+1 records in
> > RT: 223+1 records out
> > RT: 234179072 bytes (234 MB) copied, 0.988448 s, 237 MB/s
> > BE: 223+1 records in
> > BE: 223+1 records out
> > BE: 234179072 bytes (234 MB) copied, 1.93885 s, 121 MB/s
> > 
> > So I am still seeing the issue with differnt kind of disks also. At this point
> > of time I am really not sure why I am seeing such results.
> 
> Hold on. I think I found the culprit here. I was thinking that what is
> the difference between two setups and realized that with vanilla kernels
> I had done "make defconfig" and with io-throttle kernels I had used an
> old config of my and did "make oldconfig". So basically config files
> were differnt.
> 
> I now used the same config file and issues seems to have gone away. I
> will look into why an old config file can force such kind of issues.
> 

Hmm.., my old config had "AS" as the default scheduler; that's why I was
seeing the strange issue of the RT task finishing after the BE one. My
apologies for that. I somehow assumed that CFQ was the default scheduler
in my config.

So I have re-run the tests to see if we are still seeing the issue of
losing priority and class within a cgroup. And we still do..

2.6.30-rc4 with io-throttle patches
===================================
Test1
=====
- Two readers, one BE prio 0 and the other BE prio 7, in a cgroup limited
  to 8MB/s BW.

234179072 bytes (234 MB) copied, 55.8448 s, 4.2 MB/s
prio 0 task finished
234179072 bytes (234 MB) copied, 55.8878 s, 4.2 MB/s

Test2
=====
- Two readers, one RT prio 0 and the other BE prio 7, in a cgroup limited
  to 8MB/s BW (a command sketch for these two tests follows the results
  below).

234179072 bytes (234 MB) copied, 55.8876 s, 4.2 MB/s
234179072 bytes (234 MB) copied, 55.8984 s, 4.2 MB/s
RT task finished
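
A sketch of how these two-reader runs can be launched (assuming the same
blockio cgroup setup as in my reader-writer script posted later in this
thread; the file names are just placeholders):

echo "/dev/sdb:$((8 * 1024 * 1024)):0:0" > /cgroup/iot/test1/blockio.bandwidth-max
echo $$ > /cgroup/iot/test1/tasks
sync
echo 3 > /proc/sys/vm/drop_caches

# Test1: BE prio 0 reader vs BE prio 7 reader
ionice -c 2 -n 0 dd if=/mnt/sdb/zerofile1 of=/dev/null bs=1M &
ionice -c 2 -n 7 dd if=/mnt/sdb/zerofile2 of=/dev/null bs=1M &
wait

# Test2 is the same, with the first reader made RT:
# ionice -c 1 -n 0 dd if=/mnt/sdb/zerofile1 of=/dev/null bs=1M &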

Test3
=====
- Reader Starvation
- I created a cgroup with a BW limit of 64MB/s. First I ran the reader
  alone, and then I ran the reader along with 4 writers, 4 times.

Reader alone
234179072 bytes (234 MB) copied, 3.71796 s, 63.0 MB/s

Reader with 4 writers
---------------------
First run
234179072 bytes (234 MB) copied, 30.394 s, 7.7 MB/s 

Second run
234179072 bytes (234 MB) copied, 26.9607 s, 8.7 MB/s

Third run
234179072 bytes (234 MB) copied, 37.3515 s, 6.3 MB/s

Fourth run
234179072 bytes (234 MB) copied, 36.817 s, 6.4 MB/s

Note that out of the 64MB/s limit of this cgroup, the reader does not get
even 1/5 of the BW. On normal systems readers are advantaged, and a reader
gets its job done much faster even in the presence of multiple writers.

Vanilla 2.6.30-rc4
==================

Test3
=====
Reader alone
234179072 bytes (234 MB) copied, 2.52195 s, 92.9 MB/s

Reader with 4 writers
---------------------
First run
234179072 bytes (234 MB) copied, 4.39929 s, 53.2 MB/s

Second run
234179072 bytes (234 MB) copied, 4.55929 s, 51.4 MB/s

Third run
234179072 bytes (234 MB) copied, 4.79855 s, 48.8 MB/s

Fourth run
234179072 bytes (234 MB) copied, 4.5069 s, 52.0 MB/s

Notice that without any writers we seem to be getting a BW of 92MB/s, and
more than 50% of that BW is still assigned to the reader in the presence
of writers. Compare this with the io-throttle cgroup of 64MB/s, where the
reader struggles to get 10-15% of the BW.

So any 2nd-level control will break the notions and assumptions of the
underlying IO scheduler. We should probably do the control at the IO
scheduler level to make sure we don't run into such issues while getting
hierarchical fair share for groups.

Thanks
Vivek

> So now we are left with the issue of loosing the notion of priority and
> class with-in cgroup. In fact on bigger systems we will probably run into
> issues of kiothrottled scalability as single thread is trying to cater to
> all the disks.
> 
> If we do max bw control at IO scheduler level, then I think we should be able
> to control max bw while maintaining the notion of priority and class with-in
> cgroup. Also there are multiple pdflush threads and jens seems to be pushing
> flusher threads per bdi which will help us achieve greater scalability and
> don't have to replicate that infrastructure for kiothrottled also.
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
@ 2009-05-07 15:36                         ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-07 15:36 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Thu, May 07, 2009 at 10:45:01AM -0400, Vivek Goyal wrote:
> On Thu, May 07, 2009 at 10:11:26AM -0400, Vivek Goyal wrote:
> 
> [..]
> > [root@chilli io-throttle-tests]# ./andrea-test-script.sh 
> > RT: 223+1 records in
> > RT: 223+1 records out
> > RT: 234179072 bytes (234 MB) copied, 0.988448 s, 237 MB/s
> > BE: 223+1 records in
> > BE: 223+1 records out
> > BE: 234179072 bytes (234 MB) copied, 1.93885 s, 121 MB/s
> > 
> > So I am still seeing the issue with differnt kind of disks also. At this point
> > of time I am really not sure why I am seeing such results.
> 
> Hold on. I think I found the culprit here. I was thinking that what is
> the difference between two setups and realized that with vanilla kernels
> I had done "make defconfig" and with io-throttle kernels I had used an
> old config of my and did "make oldconfig". So basically config files
> were differnt.
> 
> I now used the same config file and issues seems to have gone away. I
> will look into why an old config file can force such kind of issues.
> 

Hmm.., my old config had "AS" as the default scheduler; that's why I was
seeing the strange issue of the RT task finishing after the BE one. My
apologies for that. I somehow assumed that CFQ was the default scheduler
in my config.

So I have re-run the tests to see if we are still seeing the issue of
losing priority and class within a cgroup. And we still do..

2.6.30-rc4 with io-throttle patches
===================================
Test1
=====
- Two readers, one BE prio 0 and the other BE prio 7, in a cgroup limited
  to 8MB/s BW.

234179072 bytes (234 MB) copied, 55.8448 s, 4.2 MB/s
prio 0 task finished
234179072 bytes (234 MB) copied, 55.8878 s, 4.2 MB/s

Test2
=====
- Two readers, one RT prio 0 and the other BE prio 7, in a cgroup limited
  to 8MB/s BW.

234179072 bytes (234 MB) copied, 55.8876 s, 4.2 MB/s
234179072 bytes (234 MB) copied, 55.8984 s, 4.2 MB/s
RT task finished

Test3
=====
- Reader Starvation
- I created a cgroup with a BW limit of 64MB/s. First I ran the reader
  alone, and then I ran the reader along with 4 writers, 4 times.

Reader alone
234179072 bytes (234 MB) copied, 3.71796 s, 63.0 MB/s

Reader with 4 writers
---------------------
First run
234179072 bytes (234 MB) copied, 30.394 s, 7.7 MB/s 

Second run
234179072 bytes (234 MB) copied, 26.9607 s, 8.7 MB/s

Third run
234179072 bytes (234 MB) copied, 37.3515 s, 6.3 MB/s

Fourth run
234179072 bytes (234 MB) copied, 36.817 s, 6.4 MB/s

Note that out of the 64MB/s limit of this cgroup, the reader does not get
even 1/5 of the BW. On normal systems readers are advantaged, and a reader
gets its job done much faster even in the presence of multiple writers.

Vanilla 2.6.30-rc4
==================

Test3
=====
Reader alone
234179072 bytes (234 MB) copied, 2.52195 s, 92.9 MB/s

Reader with 4 writers
---------------------
First run
234179072 bytes (234 MB) copied, 4.39929 s, 53.2 MB/s

Second run
234179072 bytes (234 MB) copied, 4.55929 s, 51.4 MB/s

Third run
234179072 bytes (234 MB) copied, 4.79855 s, 48.8 MB/s

Fourth run
234179072 bytes (234 MB) copied, 4.5069 s, 52.0 MB/s

Notice that without any writers we seem to be getting a BW of 92MB/s, and
more than 50% of that BW is still assigned to the reader in the presence
of writers. Compare this with the io-throttle cgroup of 64MB/s, where the
reader struggles to get 10-15% of the BW.

So any 2nd-level control will break the notions and assumptions of the
underlying IO scheduler. We should probably do the control at the IO
scheduler level to make sure we don't run into such issues while getting
hierarchical fair share for groups.

Thanks
Vivek

> So now we are left with the issue of loosing the notion of priority and
> class with-in cgroup. In fact on bigger systems we will probably run into
> issues of kiothrottled scalability as single thread is trying to cater to
> all the disks.
> 
> If we do max bw control at IO scheduler level, then I think we should be able
> to control max bw while maintaining the notion of priority and class with-in
> cgroup. Also there are multiple pdflush threads and jens seems to be pushing
> flusher threads per bdi which will help us achieve greater scalability and
> don't have to replicate that infrastructure for kiothrottled also.
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-07 15:36                         ` Vivek Goyal
@ 2009-05-07 15:42                             ` Vivek Goyal
  -1 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-07 15:42 UTC (permalink / raw)
  To: Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Thu, May 07, 2009 at 11:36:42AM -0400, Vivek Goyal wrote:
> On Thu, May 07, 2009 at 10:45:01AM -0400, Vivek Goyal wrote:
> > On Thu, May 07, 2009 at 10:11:26AM -0400, Vivek Goyal wrote:
> > 
> > [..]
> > > [root@chilli io-throttle-tests]# ./andrea-test-script.sh 
> > > RT: 223+1 records in
> > > RT: 223+1 records out
> > > RT: 234179072 bytes (234 MB) copied, 0.988448 s, 237 MB/s
> > > BE: 223+1 records in
> > > BE: 223+1 records out
> > > BE: 234179072 bytes (234 MB) copied, 1.93885 s, 121 MB/s
> > > 
> > > So I am still seeing the issue with differnt kind of disks also. At this point
> > > of time I am really not sure why I am seeing such results.
> > 
> > Hold on. I think I found the culprit here. I was thinking that what is
> > the difference between two setups and realized that with vanilla kernels
> > I had done "make defconfig" and with io-throttle kernels I had used an
> > old config of my and did "make oldconfig". So basically config files
> > were differnt.
> > 
> > I now used the same config file and issues seems to have gone away. I
> > will look into why an old config file can force such kind of issues.
> > 
> 
> Hmm.., my old config had "AS" as default scheduler that's why I was seeing
> the strange issue of RT task finishing after BE. My apologies for that. I
> somehow assumed that CFQ is default scheduler in my config.
> 
> So I have re-run the test to see if we are still seeing the issue of
> loosing priority and class with-in cgroup. And we still do..
> 
> 2.6.30-rc4 with io-throttle patches
> ===================================
> Test1
> =====
> - Two readers, one BE prio 0 and other BE prio 7 in a cgroup limited with
>   8MB/s BW.
> 
> 234179072 bytes (234 MB) copied, 55.8448 s, 4.2 MB/s
> prio 0 task finished
> 234179072 bytes (234 MB) copied, 55.8878 s, 4.2 MB/s
> 
> Test2
> =====
> - Two readers, one RT prio 0 and other BE prio 7 in a cgroup limited with
>   8MB/s BW.
> 
> 234179072 bytes (234 MB) copied, 55.8876 s, 4.2 MB/s
> 234179072 bytes (234 MB) copied, 55.8984 s, 4.2 MB/s
> RT task finished
> 
> Test3
> =====
> - Reader Starvation
> - I created a cgroup with BW limit of 64MB/s. First I just run the reader
>   alone and then I run reader along with 4 writers 4 times. 
> 
> Reader alone
> 234179072 bytes (234 MB) copied, 3.71796 s, 63.0 MB/s
> 
> Reader with 4 writers
> ---------------------
> First run
> 234179072 bytes (234 MB) copied, 30.394 s, 7.7 MB/s 
> 
> Second run
> 234179072 bytes (234 MB) copied, 26.9607 s, 8.7 MB/s
> 
> Third run
> 234179072 bytes (234 MB) copied, 37.3515 s, 6.3 MB/s
> 
> Fourth run
> 234179072 bytes (234 MB) copied, 36.817 s, 6.4 MB/s
> 
> Note that out of 64MB/s limit of this cgroup, reader does not get even
> 1/5 of the BW. In normal systems, readers are advantaged and reader gets
> its job done much faster even in presence of multiple writers.   
> 
> Vanilla 2.6.30-rc4
> ==================
> 
> Test3
> =====
> Reader alone
> 234179072 bytes (234 MB) copied, 2.52195 s, 92.9 MB/s
> 
> Reader with 4 writers
> ---------------------
> First run
> 234179072 bytes (234 MB) copied, 4.39929 s, 53.2 MB/s
> 
> Second run
> 234179072 bytes (234 MB) copied, 4.55929 s, 51.4 MB/s
> 
> Third run
> 234179072 bytes (234 MB) copied, 4.79855 s, 48.8 MB/s
> 
> Fourth run
> 234179072 bytes (234 MB) copied, 4.5069 s, 52.0 MB/s
> 
> Notice, that without any writers we seem to be having BW of 92MB/s and
> more than 50% of that BW is still assigned to reader in presence of
> writers. Compare this with io-throttle cgroup of 64MB/s where reader
> struggles to get 10-15% of BW. 
> 
> So any 2nd level control will break the notion and assumptions of
> underlying IO scheduler. We should probably do control at IO scheduler
> level to make sure we don't run into such issues while getting
> hierarchical fair share for groups.
> 

Forgot to attach my reader-writer script last time. Here it is.


***************************************************************
#!/bin/bash

mount /dev/sdb1 /mnt/sdb

mount -t cgroup -o blockio blockio /cgroup/iot/
mkdir -p /cgroup/iot/test1 /cgroup/iot/test2

# Set bw limit of 64 MB/s on sdb
echo "/dev/sdb:$((64 * 1024 * 1024)):0:0" > /cgroup/iot/test1/blockio.bandwidth-max

sync
echo 3 > /proc/sys/vm/drop_caches

echo $$ > /cgroup/iot/test1/tasks

ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile1 bs=4K count=524288 & 
echo $!

ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile2 bs=4K count=524288 & 
echo $!

ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile3 bs=4K count=524288 & 
echo $!

ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile4 bs=4K count=524288 & 
echo $!

sleep 5
echo "Launching reader"

ionice -c 2 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
pid2=$!
echo $pid2

wait $pid2
echo "Reader Finished"
killall dd
**********************************************************************
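
For the "Reader alone" numbers in Test3, the baseline run is essentially
just the reader part of the above in the same cgroup, something along
these lines (a sketch):

sync
echo 3 > /proc/sys/vm/drop_caches
echo $$ > /cgroup/iot/test1/tasks
ionice -c 2 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/null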

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
@ 2009-05-07 15:42                             ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-07 15:42 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Thu, May 07, 2009 at 11:36:42AM -0400, Vivek Goyal wrote:
> On Thu, May 07, 2009 at 10:45:01AM -0400, Vivek Goyal wrote:
> > On Thu, May 07, 2009 at 10:11:26AM -0400, Vivek Goyal wrote:
> > 
> > [..]
> > > [root@chilli io-throttle-tests]# ./andrea-test-script.sh 
> > > RT: 223+1 records in
> > > RT: 223+1 records out
> > > RT: 234179072 bytes (234 MB) copied, 0.988448 s, 237 MB/s
> > > BE: 223+1 records in
> > > BE: 223+1 records out
> > > BE: 234179072 bytes (234 MB) copied, 1.93885 s, 121 MB/s
> > > 
> > > So I am still seeing the issue with differnt kind of disks also. At this point
> > > of time I am really not sure why I am seeing such results.
> > 
> > Hold on. I think I found the culprit here. I was thinking that what is
> > the difference between two setups and realized that with vanilla kernels
> > I had done "make defconfig" and with io-throttle kernels I had used an
> > old config of my and did "make oldconfig". So basically config files
> > were differnt.
> > 
> > I now used the same config file and issues seems to have gone away. I
> > will look into why an old config file can force such kind of issues.
> > 
> 
> Hmm.., my old config had "AS" as default scheduler that's why I was seeing
> the strange issue of RT task finishing after BE. My apologies for that. I
> somehow assumed that CFQ is default scheduler in my config.
> 
> So I have re-run the test to see if we are still seeing the issue of
> loosing priority and class with-in cgroup. And we still do..
> 
> 2.6.30-rc4 with io-throttle patches
> ===================================
> Test1
> =====
> - Two readers, one BE prio 0 and other BE prio 7 in a cgroup limited with
>   8MB/s BW.
> 
> 234179072 bytes (234 MB) copied, 55.8448 s, 4.2 MB/s
> prio 0 task finished
> 234179072 bytes (234 MB) copied, 55.8878 s, 4.2 MB/s
> 
> Test2
> =====
> - Two readers, one RT prio 0 and other BE prio 7 in a cgroup limited with
>   8MB/s BW.
> 
> 234179072 bytes (234 MB) copied, 55.8876 s, 4.2 MB/s
> 234179072 bytes (234 MB) copied, 55.8984 s, 4.2 MB/s
> RT task finished
> 
> Test3
> =====
> - Reader Starvation
> - I created a cgroup with BW limit of 64MB/s. First I just run the reader
>   alone and then I run reader along with 4 writers 4 times. 
> 
> Reader alone
> 234179072 bytes (234 MB) copied, 3.71796 s, 63.0 MB/s
> 
> Reader with 4 writers
> ---------------------
> First run
> 234179072 bytes (234 MB) copied, 30.394 s, 7.7 MB/s 
> 
> Second run
> 234179072 bytes (234 MB) copied, 26.9607 s, 8.7 MB/s
> 
> Third run
> 234179072 bytes (234 MB) copied, 37.3515 s, 6.3 MB/s
> 
> Fourth run
> 234179072 bytes (234 MB) copied, 36.817 s, 6.4 MB/s
> 
> Note that out of 64MB/s limit of this cgroup, reader does not get even
> 1/5 of the BW. In normal systems, readers are advantaged and reader gets
> its job done much faster even in presence of multiple writers.   
> 
> Vanilla 2.6.30-rc4
> ==================
> 
> Test3
> =====
> Reader alone
> 234179072 bytes (234 MB) copied, 2.52195 s, 92.9 MB/s
> 
> Reader with 4 writers
> ---------------------
> First run
> 234179072 bytes (234 MB) copied, 4.39929 s, 53.2 MB/s
> 
> Second run
> 234179072 bytes (234 MB) copied, 4.55929 s, 51.4 MB/s
> 
> Third run
> 234179072 bytes (234 MB) copied, 4.79855 s, 48.8 MB/s
> 
> Fourth run
> 234179072 bytes (234 MB) copied, 4.5069 s, 52.0 MB/s
> 
> Notice, that without any writers we seem to be having BW of 92MB/s and
> more than 50% of that BW is still assigned to reader in presence of
> writers. Compare this with io-throttle cgroup of 64MB/s where reader
> struggles to get 10-15% of BW. 
> 
> So any 2nd level control will break the notion and assumptions of
> underlying IO scheduler. We should probably do control at IO scheduler
> level to make sure we don't run into such issues while getting
> hierarchical fair share for groups.
> 

Forgot to attach my reader-writer script last time. Here it is.


***************************************************************
#!/bin/bash

mount /dev/sdb1 /mnt/sdb

mount -t cgroup -o blockio blockio /cgroup/iot/
mkdir -p /cgroup/iot/test1 /cgroup/iot/test2

# Set bw limit of 64 MB/s on sdb
echo "/dev/sdb:$((64 * 1024 * 1024)):0:0" > /cgroup/iot/test1/blockio.bandwidth-max

sync
echo 3 > /proc/sys/vm/drop_caches

echo $$ > /cgroup/iot/test1/tasks

ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile1 bs=4K count=524288 & 
echo $!

ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile2 bs=4K count=524288 & 
echo $!

ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile3 bs=4K count=524288 & 
echo $!

ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile4 bs=4K count=524288 & 
echo $!

sleep 5
echo "Launching reader"

ionice -c 2 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
pid2=$!
echo $pid2

wait $pid2
echo "Reader Finished"
killall dd
**********************************************************************

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]                         ` <20090507153642.GC9463-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-07 15:42                             ` Vivek Goyal
@ 2009-05-07 22:19                           ` Andrea Righi
  1 sibling, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-07 22:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Thu, May 07, 2009 at 11:36:42AM -0400, Vivek Goyal wrote:
> Hmm.., my old config had "AS" as default scheduler that's why I was seeing
> the strange issue of RT task finishing after BE. My apologies for that. I
> somehow assumed that CFQ is default scheduler in my config.

ok.

> 
> So I have re-run the test to see if we are still seeing the issue of
> loosing priority and class with-in cgroup. And we still do..
> 
> 2.6.30-rc4 with io-throttle patches
> ===================================
> Test1
> =====
> - Two readers, one BE prio 0 and other BE prio 7 in a cgroup limited with
>   8MB/s BW.
> 
> 234179072 bytes (234 MB) copied, 55.8448 s, 4.2 MB/s
> prio 0 task finished
> 234179072 bytes (234 MB) copied, 55.8878 s, 4.2 MB/s
> 
> Test2
> =====
> - Two readers, one RT prio 0 and other BE prio 7 in a cgroup limited with
>   8MB/s BW.
> 
> 234179072 bytes (234 MB) copied, 55.8876 s, 4.2 MB/s
> 234179072 bytes (234 MB) copied, 55.8984 s, 4.2 MB/s
> RT task finished

ok, coherent with the current io-throttle implementation.

> 
> Test3
> =====
> - Reader Starvation
> - I created a cgroup with BW limit of 64MB/s. First I just run the reader
>   alone and then I run reader along with 4 writers 4 times. 
> 
> Reader alone
> 234179072 bytes (234 MB) copied, 3.71796 s, 63.0 MB/s
> 
> Reader with 4 writers
> ---------------------
> First run
> 234179072 bytes (234 MB) copied, 30.394 s, 7.7 MB/s 
> 
> Second run
> 234179072 bytes (234 MB) copied, 26.9607 s, 8.7 MB/s
> 
> Third run
> 234179072 bytes (234 MB) copied, 37.3515 s, 6.3 MB/s
> 
> Fourth run
> 234179072 bytes (234 MB) copied, 36.817 s, 6.4 MB/s
> 
> Note that out of 64MB/s limit of this cgroup, reader does not get even
> 1/5 of the BW. In normal systems, readers are advantaged and reader gets
> its job done much faster even in presence of multiple writers.   

And this is also coherent. The throttling is equally probable for reads
and writes. But this shouldn't happen if we saturate the physical disk BW
(doing proportional BW control, or using a watermark close to 100 in
io-throttle). In that case the IO scheduler logic shouldn't be totally
broken.

Doing a very quick test with io-throttle, using a 10MB/s BW limit and
blockio.watermark=90:

Launching reader
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 32.2798 s, 8.3 MB/s

In the same time the writers wrote ~190MB, so the single reader got
about 1/3 of the total BW.

182M testzerofile4
198M testzerofile1
188M testzerofile3
189M testzerofile2

Things are probably better with many cgroups, many readers and writers,
and in general with the disk BW more saturated.

The proportional BW approach wins in this case, because if you always use
the whole disk BW, the logic of the IO scheduler is still valid.
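
For reference, the limits for the quick test above were set along these
lines (a sketch; the exact blockio file names are assumptions based on the
naming used here and in your reader-writer script):

echo "/dev/sdb:$((10 * 1024 * 1024)):0:0" > /cgroup/iot/test1/blockio.bandwidth-max
echo 90 > /cgroup/iot/test1/blockio.watermark
echo $$ > /cgroup/iot/test1/tasks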

> 
> Vanilla 2.6.30-rc4
> ==================
> 
> Test3
> =====
> Reader alone
> 234179072 bytes (234 MB) copied, 2.52195 s, 92.9 MB/s
> 
> Reader with 4 writers
> ---------------------
> First run
> 234179072 bytes (234 MB) copied, 4.39929 s, 53.2 MB/s
> 
> Second run
> 234179072 bytes (234 MB) copied, 4.55929 s, 51.4 MB/s
> 
> Third run
> 234179072 bytes (234 MB) copied, 4.79855 s, 48.8 MB/s
> 
> Fourth run
> 234179072 bytes (234 MB) copied, 4.5069 s, 52.0 MB/s
> 
> Notice, that without any writers we seem to be having BW of 92MB/s and
> more than 50% of that BW is still assigned to reader in presence of
> writers. Compare this with io-throttle cgroup of 64MB/s where reader
> struggles to get 10-15% of BW. 
> 
> So any 2nd level control will break the notion and assumptions of
> underlying IO scheduler. We should probably do control at IO scheduler
> level to make sure we don't run into such issues while getting
> hierarchical fair share for groups.
> 
> Thanks
> Vivek
> 

What are the results with your IO scheduler controller (if you already
have them; otherwise I'll repeat this test on my system)? It seems a very
interesting test for comparing the advantages of the IO scheduler
solution with respect to the io-throttle approach.

Thanks,
-Andrea

> > So now we are left with the issue of loosing the notion of priority and
> > class with-in cgroup. In fact on bigger systems we will probably run into
> > issues of kiothrottled scalability as single thread is trying to cater to
> > all the disks.
> > 
> > If we do max bw control at IO scheduler level, then I think we should be able
> > to control max bw while maintaining the notion of priority and class with-in
> > cgroup. Also there are multiple pdflush threads and jens seems to be pushing
> > flusher threads per bdi which will help us achieve greater scalability and
> > don't have to replicate that infrastructure for kiothrottled also.
> > 
> > Thanks
> > Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-07 15:36                         ` Vivek Goyal
  (?)
  (?)
@ 2009-05-07 22:19                         ` Andrea Righi
  2009-05-08 18:09                           ` Vivek Goyal
  2009-05-08 18:09                           ` Vivek Goyal
  -1 siblings, 2 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-07 22:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Thu, May 07, 2009 at 11:36:42AM -0400, Vivek Goyal wrote:
> Hmm.., my old config had "AS" as default scheduler that's why I was seeing
> the strange issue of RT task finishing after BE. My apologies for that. I
> somehow assumed that CFQ is default scheduler in my config.

ok.

> 
> So I have re-run the test to see if we are still seeing the issue of
> loosing priority and class with-in cgroup. And we still do..
> 
> 2.6.30-rc4 with io-throttle patches
> ===================================
> Test1
> =====
> - Two readers, one BE prio 0 and other BE prio 7 in a cgroup limited with
>   8MB/s BW.
> 
> 234179072 bytes (234 MB) copied, 55.8448 s, 4.2 MB/s
> prio 0 task finished
> 234179072 bytes (234 MB) copied, 55.8878 s, 4.2 MB/s
> 
> Test2
> =====
> - Two readers, one RT prio 0 and other BE prio 7 in a cgroup limited with
>   8MB/s BW.
> 
> 234179072 bytes (234 MB) copied, 55.8876 s, 4.2 MB/s
> 234179072 bytes (234 MB) copied, 55.8984 s, 4.2 MB/s
> RT task finished

ok, coherent with the current io-throttle implementation.

> 
> Test3
> =====
> - Reader Starvation
> - I created a cgroup with BW limit of 64MB/s. First I just run the reader
>   alone and then I run reader along with 4 writers 4 times. 
> 
> Reader alone
> 234179072 bytes (234 MB) copied, 3.71796 s, 63.0 MB/s
> 
> Reader with 4 writers
> ---------------------
> First run
> 234179072 bytes (234 MB) copied, 30.394 s, 7.7 MB/s 
> 
> Second run
> 234179072 bytes (234 MB) copied, 26.9607 s, 8.7 MB/s
> 
> Third run
> 234179072 bytes (234 MB) copied, 37.3515 s, 6.3 MB/s
> 
> Fourth run
> 234179072 bytes (234 MB) copied, 36.817 s, 6.4 MB/s
> 
> Note that out of 64MB/s limit of this cgroup, reader does not get even
> 1/5 of the BW. In normal systems, readers are advantaged and reader gets
> its job done much faster even in presence of multiple writers.   

And this is also coherent. The throttling is equally probable for reads
and writes. But this shouldn't happen if we saturate the physical disk BW
(doing proportional BW control, or using a watermark close to 100 in
io-throttle). In that case the IO scheduler logic shouldn't be totally
broken.

Doing a very quick test with io-throttle, using a 10MB/s BW limit and
blockio.watermark=90:

Launching reader
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 32.2798 s, 8.3 MB/s

In the same time the writers wrote ~190MB, so the single reader got
about 1/3 of the total BW.

182M testzerofile4
198M testzerofile1
188M testzerofile3
189M testzerofile2

Things are probably better with many cgroups, many readers and writers,
and in general with the disk BW more saturated.

The proportional BW approach wins in this case, because if you always use
the whole disk BW, the logic of the IO scheduler is still valid.

> 
> Vanilla 2.6.30-rc4
> ==================
> 
> Test3
> =====
> Reader alone
> 234179072 bytes (234 MB) copied, 2.52195 s, 92.9 MB/s
> 
> Reader with 4 writers
> ---------------------
> First run
> 234179072 bytes (234 MB) copied, 4.39929 s, 53.2 MB/s
> 
> Second run
> 234179072 bytes (234 MB) copied, 4.55929 s, 51.4 MB/s
> 
> Third run
> 234179072 bytes (234 MB) copied, 4.79855 s, 48.8 MB/s
> 
> Fourth run
> 234179072 bytes (234 MB) copied, 4.5069 s, 52.0 MB/s
> 
> Notice, that without any writers we seem to be having BW of 92MB/s and
> more than 50% of that BW is still assigned to reader in presence of
> writers. Compare this with io-throttle cgroup of 64MB/s where reader
> struggles to get 10-15% of BW. 
> 
> So any 2nd level control will break the notion and assumptions of
> underlying IO scheduler. We should probably do control at IO scheduler
> level to make sure we don't run into such issues while getting
> hierarchical fair share for groups.
> 
> Thanks
> Vivek
> 

What are the results with your IO scheduler controller (if you already
have them; otherwise I'll repeat this test on my system)? It seems a very
interesting test for comparing the advantages of the IO scheduler
solution with respect to the io-throttle approach.

Thanks,
-Andrea

> > So now we are left with the issue of loosing the notion of priority and
> > class with-in cgroup. In fact on bigger systems we will probably run into
> > issues of kiothrottled scalability as single thread is trying to cater to
> > all the disks.
> > 
> > If we do max bw control at IO scheduler level, then I think we should be able
> > to control max bw while maintaining the notion of priority and class with-in
> > cgroup. Also there are multiple pdflush threads and jens seems to be pushing
> > flusher threads per bdi which will help us achieve greater scalability and
> > don't have to replicate that infrastructure for kiothrottled also.
> > 
> > Thanks
> > Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]                     ` <20090507144501.GB9463-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-07 15:36                         ` Vivek Goyal
@ 2009-05-07 22:40                       ` Andrea Righi
  1 sibling, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-07 22:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Thu, May 07, 2009 at 10:45:01AM -0400, Vivek Goyal wrote:
> So now we are left with the issue of loosing the notion of priority and
> class with-in cgroup. In fact on bigger systems we will probably run into
> issues of kiothrottled scalability as single thread is trying to cater to
> all the disks.
> 
> If we do max bw control at IO scheduler level, then I think we should be able
> to control max bw while maintaining the notion of priority and class with-in
> cgroup. Also there are multiple pdflush threads and jens seems to be pushing
> flusher threads per bdi which will help us achieve greater scalability and
> don't have to replicate that infrastructure for kiothrottled also.

There's a lot of room for improvements and optimizations in the
kiothrottled part; obviously the single-threaded approach is not a
definitive solution.

Flusher threads are probably a good solution. But I don't think we need
to replicate the pdflush replacement infrastructure for throttled
writeback IO. Instead, it could just be integrated with the flusher
threads, i.e. activate the flusher threads only when a request needs to be
written to disk according to the dirty memory limit and the IO BW limits.

I mean, I don't see any critical problem for this part.

Instead, preserving the IO priority and IO scheduler logic inside
cgroups seems a more critical issue to me. And I'm quite convinced that
the right approach for this is to operate at the IO scheduler level, but
I'm still a little bit skeptical that operating only at the IO scheduler
level would resolve all our problems.

-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-07 14:45                     ` Vivek Goyal
  (?)
  (?)
@ 2009-05-07 22:40                     ` Andrea Righi
  -1 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-07 22:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Thu, May 07, 2009 at 10:45:01AM -0400, Vivek Goyal wrote:
> So now we are left with the issue of loosing the notion of priority and
> class with-in cgroup. In fact on bigger systems we will probably run into
> issues of kiothrottled scalability as single thread is trying to cater to
> all the disks.
> 
> If we do max bw control at IO scheduler level, then I think we should be able
> to control max bw while maintaining the notion of priority and class with-in
> cgroup. Also there are multiple pdflush threads and jens seems to be pushing
> flusher threads per bdi which will help us achieve greater scalability and
> don't have to replicate that infrastructure for kiothrottled also.

There's a lot of room for improvements and optimizations in the
kiothrottled part; obviously the single-threaded approach is not a
definitive solution.

Flusher threads are probably a good solution. But I don't think we need
to replicate the pdflush replacement infrastructure for throttled
writeback IO. Instead, it could just be integrated with the flusher
threads, i.e. activate the flusher threads only when a request needs to be
written to disk according to the dirty memory limit and the IO BW limits.

I mean, I don't see any critical problem for this part.

Instead, preserving the IO priority and IO scheduler logic inside
cgroups seems a more critical issue to me. And I'm quite convinced that
the right approach for this is to operate at the IO scheduler level, but
I'm still a little bit skeptical that operating only at the IO scheduler
level would resolve all our problems.

-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 17/18] io-controller: IO group refcounting support
  2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-08  2:59       ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-08  2:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
...
>  
> -		BUG_ON(!RB_EMPTY_ROOT(&st->active));
> -		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
>  	}
>  
> -	BUG_ON(iog->sched_data.next_active != NULL);
> -	BUG_ON(iog->sched_data.active_entity != NULL);
> -	BUG_ON(entity->tree != NULL);
> +	/* Drop iocg reference on io group */

  We'd better reset iog->iocg_id here since the cgroup is going to be destroyed.

> +	elv_put_iog(iog);
>  }
>  
> -/**
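
Something like the following, just a sketch against the quoted hunk (zero
is assumed here to be a suitable "no cgroup" value for iocg_id):

	/* The cgroup is going away; clear the stale id before dropping
	 * the group reference. */
	iog->iocg_id = 0;

	/* Drop iocg reference on io group */
	elv_put_iog(iog);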


-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 17/18] io-controller: IO group refcounting support
@ 2009-05-08  2:59       ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-08  2:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
...
>  
> -		BUG_ON(!RB_EMPTY_ROOT(&st->active));
> -		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
>  	}
>  
> -	BUG_ON(iog->sched_data.next_active != NULL);
> -	BUG_ON(iog->sched_data.active_entity != NULL);
> -	BUG_ON(entity->tree != NULL);
> +	/* Drop iocg reference on io group */

  We'd better reset iog->iocg_id here since the cgroup is going to be destroyed.

> +	elv_put_iog(iog);
>  }
>  
> -/**


-- 
Regards
Gui Jianfeng



^ permalink raw reply	[flat|nested] 297+ messages in thread

* [PATCH] io-controller: Add io group reference handling for request
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (19 preceding siblings ...)
  2009-05-06  8:11   ` Gui Jianfeng
@ 2009-05-08  9:45   ` Gui Jianfeng
  2009-05-13  2:00   ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
  21 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-08  9:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Hi Vivek,

This patch adds io group reference handling when allocating
and removing a request.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 elevator-fq.c |   15 ++++++++++++++-
 elevator-fq.h |    5 +++++
 elevator.c    |    2 ++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9500619..e6d6712 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1968,11 +1968,24 @@ void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	BUG_ON(!iog);
 
-	/* Store iog in rq. TODO: take care of referencing */
+	elv_get_iog(iog);
 	rq->iog = iog;
 }
 
 /*
+ * This request has been serviced. Clean up iog info and drop the reference.
+ */
+void elv_fq_unset_request_io_group(struct request *rq)
+{
+	struct io_group *iog = rq->iog;
+
+	if (iog) {
+		rq->iog = NULL;
+		elv_put_iog(iog);
+	}
+}
+
+/*
  * Find/Create the io queue the rq should go in. This is an optimization
  * for the io schedulers (noop, deadline and AS) which maintain only single
  * io queue per cgroup. In this case common layer can just maintain a
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index db3a347..96a28e9 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -512,6 +512,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 extern int io_group_allow_merge(struct request *rq, struct bio *bio);
 extern void elv_fq_set_request_io_group(struct request_queue *q,
 					struct request *rq, struct bio *bio);
+extern void elv_fq_unset_request_io_group(struct request *rq);
 static inline bfq_weight_t iog_weight(struct io_group *iog)
 {
 	return iog->entity.weight;
@@ -571,6 +572,10 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
 {
 }
 
+static inline void elv_fq_unset_request_io_group(struct request *rq)
+{
+}
+
 static inline bfq_weight_t iog_weight(struct io_group *iog)
 {
 	/* Just root group is present and weight is immaterial. */
diff --git a/block/elevator.c b/block/elevator.c
index 44c9fad..d75eec7 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -992,6 +992,8 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_unset_request_io_group(rq);
+
 	/*
 	 * Optimization for noop, deadline and AS which maintain only single
 	 * ioq per io group

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH] io-controller: Add io group reference handling for request
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (35 preceding siblings ...)
  2009-05-06  8:11 ` IO scheduler based IO Controller V2 Gui Jianfeng
@ 2009-05-08  9:45 ` Gui Jianfeng
       [not found]   ` <4A03FF3C.4020506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-05-13  2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
  37 siblings, 1 reply; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-08  9:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Hi Vivek,

This patch adds io group reference handling when allocating
and removing a request.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 elevator-fq.c |   15 ++++++++++++++-
 elevator-fq.h |    5 +++++
 elevator.c    |    2 ++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9500619..e6d6712 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1968,11 +1968,24 @@ void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	BUG_ON(!iog);
 
-	/* Store iog in rq. TODO: take care of referencing */
+	elv_get_iog(iog);
 	rq->iog = iog;
 }
 
 /*
+ * This request has been serviced. Clean up iog info and drop the reference.
+ */
+void elv_fq_unset_request_io_group(struct request *rq)
+{
+	struct io_group *iog = rq->iog;
+
+	if (iog) {
+		rq->iog = NULL;
+		elv_put_iog(iog);
+	}
+}
+
+/*
  * Find/Create the io queue the rq should go in. This is an optimization
  * for the io schedulers (noop, deadline and AS) which maintain only single
  * io queue per cgroup. In this case common layer can just maintain a
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index db3a347..96a28e9 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -512,6 +512,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 extern int io_group_allow_merge(struct request *rq, struct bio *bio);
 extern void elv_fq_set_request_io_group(struct request_queue *q,
 					struct request *rq, struct bio *bio);
+extern void elv_fq_unset_request_io_group(struct request *rq);
 static inline bfq_weight_t iog_weight(struct io_group *iog)
 {
 	return iog->entity.weight;
@@ -571,6 +572,10 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
 {
 }
 
+static inline void elv_fq_unset_request_io_group(struct request *rq)
+{
+}
+
 static inline bfq_weight_t iog_weight(struct io_group *iog)
 {
 	/* Just root group is present and weight is immaterial. */
diff --git a/block/elevator.c b/block/elevator.c
index 44c9fad..d75eec7 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -992,6 +992,8 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_unset_request_io_group(rq);
+
 	/*
 	 * Optimization for noop, deadline and AS which maintain only single
 	 * ioq per io group



^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH 17/18] io-controller: IO group refcounting support
       [not found]       ` <4A03A013.9000405-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-08 12:44         ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 12:44 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, May 08, 2009 at 10:59:31AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> >  
> > -		BUG_ON(!RB_EMPTY_ROOT(&st->active));
> > -		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
> >  	}
> >  
> > -	BUG_ON(iog->sched_data.next_active != NULL);
> > -	BUG_ON(iog->sched_data.active_entity != NULL);
> > -	BUG_ON(entity->tree != NULL);
> > +	/* Drop iocg reference on io group */
> 
>   We'd better reset iog->iocg_id here since the cgroup is going to be destroyed.
> 

Hmm, that does no harm. Will do in the next version.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 17/18] io-controller: IO group refcounting support
  2009-05-08  2:59       ` Gui Jianfeng
  (?)
@ 2009-05-08 12:44       ` Vivek Goyal
  -1 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 12:44 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Fri, May 08, 2009 at 10:59:31AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> >  
> > -		BUG_ON(!RB_EMPTY_ROOT(&st->active));
> > -		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
> >  	}
> >  
> > -	BUG_ON(iog->sched_data.next_active != NULL);
> > -	BUG_ON(iog->sched_data.active_entity != NULL);
> > -	BUG_ON(entity->tree != NULL);
> > +	/* Drop iocg reference on io group */
> 
>   We'd better reset iog->iocg_id here since the cgroup is going to be destroyed.
> 

Hmm, that does no harm. Will do in the next version.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevaotor layer
       [not found]     ` <4A0290ED.7080506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-05-07  8:05       ` Li Zefan
@ 2009-05-08 12:45       ` Vivek Goyal
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 12:45 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Thu, May 07, 2009 at 03:42:37PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > This patch enables hierarchical fair queuing in common layer. It is
> > controlled by config option CONFIG_GROUP_IOSCHED.
> ...
> > +}
> > +
> > +void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> > +{
> > +	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
> > +	struct hlist_node *n, *tmp;
> > +	struct io_group *iog;
> > +
> > +	/*
> > +	 * Since we are destroying the cgroup, there are no more tasks
> > +	 * referencing it, and all the RCU grace periods that may have
> > +	 * referenced it are ended (as the destruction of the parent
> > +	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
> > +	 * anything else and we don't need any synchronization.
> > +	 */
> > +	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
> > +		io_destroy_group(iocg, iog);
> > +
> > +	BUG_ON(!hlist_empty(&iocg->group_data));
> > +
> 
>     Hi Vivek,
> 
>     IMHO, free_css_id() needs to be called here.
> 

Thanks. Sure, will do in next version.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevaotor layer
  2009-05-07  7:42   ` Gui Jianfeng
  2009-05-07  8:05     ` Li Zefan
       [not found]     ` <4A0290ED.7080506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-08 12:45     ` Vivek Goyal
  2 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 12:45 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Thu, May 07, 2009 at 03:42:37PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > This patch enables hierarchical fair queuing in common layer. It is
> > controlled by config option CONFIG_GROUP_IOSCHED.
> ...
> > +}
> > +
> > +void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> > +{
> > +	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
> > +	struct hlist_node *n, *tmp;
> > +	struct io_group *iog;
> > +
> > +	/*
> > +	 * Since we are destroying the cgroup, there are no more tasks
> > +	 * referencing it, and all the RCU grace periods that may have
> > +	 * referenced it are ended (as the destruction of the parent
> > +	 * cgroup is RCU-safe); bgrp->group_data will not be accessed by
> > +	 * anything else and we don't need any synchronization.
> > +	 */
> > +	hlist_for_each_entry_safe(iog, n, tmp, &iocg->group_data, group_node)
> > +		io_destroy_group(iocg, iog);
> > +
> > +	BUG_ON(!hlist_empty(&iocg->group_data));
> > +
> 
>     Hi Vivek,
> 
>     IMHO, free_css_id() needs to be called here.
> 

Thanks. Sure, will do in next version.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-07  5:36       ` Li Zefan
@ 2009-05-08 13:37             ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 13:37 UTC (permalink / raw)
  To: Li Zefan
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Thu, May 07, 2009 at 01:36:08PM +0800, Li Zefan wrote:
> Vivek Goyal wrote:
> > On Wed, May 06, 2009 at 04:11:05PM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >>> Hi All,
> >>>
> >>> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> >>> First version of the patches was posted here.
> >> Hi Vivek,
> >>
> >> I did some simple tests for V2, and triggered a kernel panic.
> >> The following script can reproduce this bug. It seems that the cgroup
> >> is already removed, but the IO Controller still tries to access it.
> >>
> > 
> > Hi Gui,
> > 
> > Thanks for the report. I use cgroup_path() for debugging. I guess that
> > cgroup_path() was passed a null cgrp pointer and that's why it crashed.
> > 
> > If yes, then it is strange though. I call cgroup_path() only after
> > grabbing a reference to the css object. (I am assuming that if I have a valid
> > reference to the css object then css->cgrp can't be null.)
> > 
> 
> Yes, css->cgrp shouldn't be NULL.. I doubt we hit a bug in cgroup here.
> The code dealing with css refcnt and cgroup rmdir has changed quite a lot,
> and is much more complex than it was.
> 
> > Anyway, can you please try out following patch and see if it fixes your
> > crash.
> ...
> > BTW, I tried the following equivalent script and I can't reproduce the crash
> > on my system. Are you able to hit it regularly?
> > 
> 
> I modified the script like this:
> 
> ======================
> #!/bin/sh
> echo 1 > /proc/sys/vm/drop_caches
> mkdir /cgroup 2> /dev/null
> mount -t cgroup -o io,blkio io /cgroup
> mkdir /cgroup/test1
> mkdir /cgroup/test2
> echo 100 > /cgroup/test1/io.weight
> echo 500 > /cgroup/test2/io.weight
> 
> dd if=/dev/zero bs=4096 count=128000 of=500M.1 &
> pid1=$!
> echo $pid1 > /cgroup/test1/tasks
> 
> dd if=/dev/zero bs=4096 count=128000 of=500M.2 &
> pid2=$!
> echo $pid2 > /cgroup/test2/tasks
> 
> sleep 5
> kill -9 $pid1
> kill -9 $pid2
> 
> for ((;count != 2;))
> {
>         rmdir /cgroup/test1 > /dev/null 2>&1
>         if [ $? -eq 0 ]; then
>                 count=$(( $count + 1 ))
>         fi
> 
>         rmdir /cgroup/test2 > /dev/null 2>&1
>         if [ $? -eq 0 ]; then
>                 count=$(( $count + 1 ))
>         fi
> }
> 
> umount /cgroup
> rmdir /cgroup
> ======================
> 
> I ran this script and got lockdep BUG. Full log and my config are attached.
> 
> Actually this can be triggered with the following steps on my box:
> # mount -t cgroup -o blkio,io xxx /mnt
> # mkdir /mnt/0
> # echo $$ > /mnt/0/tasks
> # echo 3 > /proc/sys/vm/drop_cache
> # echo $$ > /mnt/tasks
> # rmdir /mnt/0
> 
> And when I ran the script for the second time, my box was freezed
> and I had to reset it.
> 

Thanks Li and Gui for pointing out the problem. With your script, I could
also reproduce the lock validator warning as well as the system freeze. I could
identify at least two trouble spots. With the following patch things seem
to be fine on my system. Can you please give it a try?


---
 block/elevator-fq.c |   20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

Index: linux11/block/elevator-fq.c
===================================================================
--- linux11.orig/block/elevator-fq.c	2009-05-08 08:47:45.000000000 -0400
+++ linux11/block/elevator-fq.c	2009-05-08 09:27:37.000000000 -0400
@@ -942,6 +942,7 @@ void entity_served(struct io_entity *ent
 	struct io_service_tree *st;
 
 	for_each_entity(entity) {
+		BUG_ON(!entity->on_st);
 		st = io_entity_service_tree(entity);
 		entity->service += served;
 		entity->total_service += served;
@@ -1652,6 +1653,14 @@ static inline int io_group_has_active_en
 			return 1;
 	}
 
+	/*
+	 * Also check there are no active entities being served which are
+	 * not on active tree
+	 */
+
+	if (iog->sched_data.active_entity)
+		return 1;
+
 	return 0;
 }
 
@@ -1738,7 +1747,7 @@ void iocg_destroy(struct cgroup_subsys *
 	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
 	struct hlist_node *n, *tmp;
 	struct io_group *iog;
-	unsigned long flags;
+	unsigned long flags, flags1;
 	int queue_lock_held = 0;
 	struct elv_fq_data *efqd;
 
@@ -1766,7 +1775,8 @@ retry:
 		rcu_read_lock();
 		efqd = rcu_dereference(iog->key);
 		if (efqd != NULL) {
-			if (spin_trylock_irq(efqd->queue->queue_lock)) {
+			if (spin_trylock_irqsave(efqd->queue->queue_lock,
+						flags1)) {
 				if (iog->key == efqd) {
 					queue_lock_held = 1;
 					rcu_read_unlock();
@@ -1780,7 +1790,8 @@ retry:
 				 * elevator hence we can proceed safely without
 				 * queue lock.
 				 */
-				spin_unlock_irq(efqd->queue->queue_lock);
+				spin_unlock_irqrestore(efqd->queue->queue_lock,
+							flags1);
 			} else {
 				/*
 				 * Did not get the queue lock while trying.
@@ -1803,7 +1814,7 @@ retry:
 locked:
 		__iocg_destroy(iocg, iog, queue_lock_held);
 		if (queue_lock_held) {
-			spin_unlock_irq(efqd->queue->queue_lock);
+			spin_unlock_irqrestore(efqd->queue->queue_lock, flags1);
 			queue_lock_held = 0;
 		}
 	}
@@ -1811,6 +1822,7 @@ locked:
 
 	BUG_ON(!hlist_empty(&iocg->group_data));
 
+	free_css_id(&io_subsys, &iocg->css);
 	kfree(iocg);
 }

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
@ 2009-05-08 13:37             ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 13:37 UTC (permalink / raw)
  To: Li Zefan
  Cc: Gui Jianfeng, nauman, dpshah, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Thu, May 07, 2009 at 01:36:08PM +0800, Li Zefan wrote:
> Vivek Goyal wrote:
> > On Wed, May 06, 2009 at 04:11:05PM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >>> Hi All,
> >>>
> >>> Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
> >>> First version of the patches was posted here.
> >> Hi Vivek,
> >>
> >> I did some simple tests for V2, and triggered a kernel panic.
> >> The following script can reproduce this bug. It seems that the cgroup
> >> is already removed, but the IO Controller still tries to access it.
> >>
> > 
> > Hi Gui,
> > 
> > Thanks for the report. I use cgroup_path() for debugging. I guess that
> > cgroup_path() was passed a null cgrp pointer and that's why it crashed.
> > 
> > If yes, then it is strange though. I call cgroup_path() only after
> > grabbing a reference to the css object. (I am assuming that if I have a valid
> > reference to the css object then css->cgrp can't be null.)
> > 
> 
> Yes, css->cgrp shouldn't be NULL.. I doubt we hit a bug in cgroup here.
> The code dealing with css refcnt and cgroup rmdir has changed quite a lot,
> and is much more complex than it was.
> 
> > Anyway, can you please try out following patch and see if it fixes your
> > crash.
> ...
> > BTW, I tried the following equivalent script and I can't reproduce the crash
> > on my system. Are you able to hit it regularly?
> > 
> 
> I modified the script like this:
> 
> ======================
> #!/bin/sh
> echo 1 > /proc/sys/vm/drop_caches
> mkdir /cgroup 2> /dev/null
> mount -t cgroup -o io,blkio io /cgroup
> mkdir /cgroup/test1
> mkdir /cgroup/test2
> echo 100 > /cgroup/test1/io.weight
> echo 500 > /cgroup/test2/io.weight
> 
> dd if=/dev/zero bs=4096 count=128000 of=500M.1 &
> pid1=$!
> echo $pid1 > /cgroup/test1/tasks
> 
> dd if=/dev/zero bs=4096 count=128000 of=500M.2 &
> pid2=$!
> echo $pid2 > /cgroup/test2/tasks
> 
> sleep 5
> kill -9 $pid1
> kill -9 $pid2
> 
> for ((;count != 2;))
> {
>         rmdir /cgroup/test1 > /dev/null 2>&1
>         if [ $? -eq 0 ]; then
>                 count=$(( $count + 1 ))
>         fi
> 
>         rmdir /cgroup/test2 > /dev/null 2>&1
>         if [ $? -eq 0 ]; then
>                 count=$(( $count + 1 ))
>         fi
> }
> 
> umount /cgroup
> rmdir /cgroup
> ======================
> 
> I ran this script and got lockdep BUG. Full log and my config are attached.
> 
> Actually this can be triggered with the following steps on my box:
> # mount -t cgroup -o blkio,io xxx /mnt
> # mkdir /mnt/0
> # echo $$ > /mnt/0/tasks
> # echo 3 > /proc/sys/vm/drop_cache
> # echo $$ > /mnt/tasks
> # rmdir /mnt/0
> 
> And when I ran the script for the second time, my box was freezed
> and I had to reset it.
> 

Thanks Li and Gui for pointing out the problem. With your script, I could
also reproduce the lock validator warning as well as the system freeze. I could
identify at least two trouble spots. With the following patch things seem
to be fine on my system. Can you please give it a try?


---
 block/elevator-fq.c |   20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

Index: linux11/block/elevator-fq.c
===================================================================
--- linux11.orig/block/elevator-fq.c	2009-05-08 08:47:45.000000000 -0400
+++ linux11/block/elevator-fq.c	2009-05-08 09:27:37.000000000 -0400
@@ -942,6 +942,7 @@ void entity_served(struct io_entity *ent
 	struct io_service_tree *st;
 
 	for_each_entity(entity) {
+		BUG_ON(!entity->on_st);
 		st = io_entity_service_tree(entity);
 		entity->service += served;
 		entity->total_service += served;
@@ -1652,6 +1653,14 @@ static inline int io_group_has_active_en
 			return 1;
 	}
 
+	/*
+	 * Also check there are no active entities being served which are
+	 * not on active tree
+	 */
+
+	if (iog->sched_data.active_entity)
+		return 1;
+
 	return 0;
 }
 
@@ -1738,7 +1747,7 @@ void iocg_destroy(struct cgroup_subsys *
 	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
 	struct hlist_node *n, *tmp;
 	struct io_group *iog;
-	unsigned long flags;
+	unsigned long flags, flags1;
 	int queue_lock_held = 0;
 	struct elv_fq_data *efqd;
 
@@ -1766,7 +1775,8 @@ retry:
 		rcu_read_lock();
 		efqd = rcu_dereference(iog->key);
 		if (efqd != NULL) {
-			if (spin_trylock_irq(efqd->queue->queue_lock)) {
+			if (spin_trylock_irqsave(efqd->queue->queue_lock,
+						flags1)) {
 				if (iog->key == efqd) {
 					queue_lock_held = 1;
 					rcu_read_unlock();
@@ -1780,7 +1790,8 @@ retry:
 				 * elevator hence we can proceed safely without
 				 * queue lock.
 				 */
-				spin_unlock_irq(efqd->queue->queue_lock);
+				spin_unlock_irqrestore(efqd->queue->queue_lock,
+							flags1);
 			} else {
 				/*
 				 * Did not get the queue lock while trying.
@@ -1803,7 +1814,7 @@ retry:
 locked:
 		__iocg_destroy(iocg, iog, queue_lock_held);
 		if (queue_lock_held) {
-			spin_unlock_irq(efqd->queue->queue_lock);
+			spin_unlock_irqrestore(efqd->queue->queue_lock, flags1);
 			queue_lock_held = 0;
 		}
 	}
@@ -1811,6 +1822,7 @@ locked:
 
 	BUG_ON(!hlist_empty(&iocg->group_data));
 
+	free_css_id(&io_subsys, &iocg->css);
 	kfree(iocg);
 }
 

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-08  9:45 ` [PATCH] io-controller: Add io group reference handling for request Gui Jianfeng
@ 2009-05-08 13:57       ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 13:57 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, May 08, 2009 at 05:45:32PM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch adds io group reference handling when allocating
> and removing a request.
> 

Hi Gui,

Thanks for the patch. We were thinking that requests can take a reference
on io queues and io queues can take a reference on io groups. That should
make sure that io groups don't go away as long as active requests are
present.

But there seems to be a small window while allocating a new request
where the request gets allocated from a group first and only later gets
mapped to a group and has its queue created. IOW, in get_request_wait(),
we allocate a request from a particular group and set rq->rl, then
drop the queue lock and later call elv_set_request(), which maps the
request to a group again, saves rq->iog and creates a new queue. This window
is troublesome because the request can be mapped to one group at the
time of allocation and during set_request() it can go to a different
group, as the queue lock was dropped and the original group might have
disappeared.

In this case it probably makes sense for the request to also take a
reference on the group. At the same time it looks like too much for a request
to hold a reference on both the queue and the group object. Ideas are welcome
on how to handle it...
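
To make that window concrete, here is a toy userspace sketch (illustrative
only, not the actual block layer code): a plain pthread mutex stands in for
q->queue_lock, and the two functions below are simplified stand-ins for the
get_request_wait()/elv_set_request() steps described above.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int task_group = 1;		/* group the task currently maps to */

struct request { int alloc_group; int iog; };

static void get_request_wait(struct request *rq)
{
	pthread_mutex_lock(&queue_lock);
	rq->alloc_group = task_group;	/* request list picked here (rq->rl) */
	pthread_mutex_unlock(&queue_lock);	/* the window opens here */
}

static void elv_set_request(struct request *rq)
{
	pthread_mutex_lock(&queue_lock);
	rq->iog = task_group;		/* group looked up again (rq->iog) */
	pthread_mutex_unlock(&queue_lock);
}

static void *move_task(void *unused)
{
	(void)unused;
	pthread_mutex_lock(&queue_lock);
	task_group = 2;			/* task moved / old group went away */
	pthread_mutex_unlock(&queue_lock);
	return NULL;
}

int main(void)
{
	struct request rq;
	pthread_t t;

	get_request_wait(&rq);
	/* another context changes the mapping while the lock is dropped */
	pthread_create(&t, NULL, move_task, NULL);
	pthread_join(t, NULL);
	elv_set_request(&rq);
	printf("allocated from group %d, mapped to group %d\n",
	       rq.alloc_group, rq.iog);
	return 0;
}

The printout shows the request accounted to group 1 but mapped to group 2,
which is exactly the mismatch the reference handling has to cope with.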

Thanks
Vivek
 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
>  elevator-fq.c |   15 ++++++++++++++-
>  elevator-fq.h |    5 +++++
>  elevator.c    |    2 ++
>  3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 9500619..e6d6712 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1968,11 +1968,24 @@ void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
>  	spin_unlock_irqrestore(q->queue_lock, flags);
>  	BUG_ON(!iog);
>  
> -	/* Store iog in rq. TODO: take care of referencing */
> +	elv_get_iog(iog);
>  	rq->iog = iog;
>  }
>  
>  /*
> + * This request has been serviced. Clean up iog info and drop the reference.
> + */
> +void elv_fq_unset_request_io_group(struct request *rq)
> +{
> +	struct io_group *iog = rq->iog;
> +
> +	if (iog) {
> +		rq->iog = NULL;
> +		elv_put_iog(iog);
> +	}
> +}
> +
> +/*
>   * Find/Create the io queue the rq should go in. This is an optimization
>   * for the io schedulers (noop, deadline and AS) which maintain only single
>   * io queue per cgroup. In this case common layer can just maintain a
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..96a28e9 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -512,6 +512,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
>  extern int io_group_allow_merge(struct request *rq, struct bio *bio);
>  extern void elv_fq_set_request_io_group(struct request_queue *q,
>  					struct request *rq, struct bio *bio);
> +extern void elv_fq_unset_request_io_group(struct request *rq);
>  static inline bfq_weight_t iog_weight(struct io_group *iog)
>  {
>  	return iog->entity.weight;
> @@ -571,6 +572,10 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
>  {
>  }
>  
> +static inline void elv_fq_unset_request_io_group(struct request *rq)
> +{
> +}
> +
>  static inline bfq_weight_t iog_weight(struct io_group *iog)
>  {
>  	/* Just root group is present and weight is immaterial. */
> diff --git a/block/elevator.c b/block/elevator.c
> index 44c9fad..d75eec7 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -992,6 +992,8 @@ void elv_put_request(struct request_queue *q, struct request *rq)
>  {
>  	struct elevator_queue *e = q->elevator;
>  
> +	elv_fq_unset_request_io_group(rq);
> +
>  	/*
>  	 * Optimization for noop, deadline and AS which maintain only single
>  	 * ioq per io group
> 

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
@ 2009-05-08 13:57       ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 13:57 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Fri, May 08, 2009 at 05:45:32PM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch adds io group reference handling when allocating
> and removing a request.
> 

Hi Gui,

Thanks for the patch. We were thinking that requests can take a reference
on io queues and io queues can take a reference on io groups. That should
make sure that io groups don't go away as long as active requests are
present.

But there seems to be a small window while allocating a new request
where the request gets allocated from a group first and only later gets
mapped to a group and has its queue created. IOW, in get_request_wait(),
we allocate a request from a particular group and set rq->rl, then
drop the queue lock and later call elv_set_request(), which maps the
request to a group again, saves rq->iog and creates a new queue. This window
is troublesome because the request can be mapped to one group at the
time of allocation and during set_request() it can go to a different
group, as the queue lock was dropped and the original group might have
disappeared.

In this case it probably makes sense for the request to also take a
reference on the group. At the same time it looks like too much for a request
to hold a reference on both the queue and the group object. Ideas are welcome
on how to handle it...

Thanks
Vivek
 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  elevator-fq.c |   15 ++++++++++++++-
>  elevator-fq.h |    5 +++++
>  elevator.c    |    2 ++
>  3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 9500619..e6d6712 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1968,11 +1968,24 @@ void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
>  	spin_unlock_irqrestore(q->queue_lock, flags);
>  	BUG_ON(!iog);
>  
> -	/* Store iog in rq. TODO: take care of referencing */
> +	elv_get_iog(iog);
>  	rq->iog = iog;
>  }
>  
>  /*
> + * This request has been serviced. Clean up iog info and drop the reference.
> + */
> +void elv_fq_unset_request_io_group(struct request *rq)
> +{
> +	struct io_group *iog = rq->iog;
> +
> +	if (iog) {
> +		rq->iog = NULL;
> +		elv_put_iog(iog);
> +	}
> +}
> +
> +/*
>   * Find/Create the io queue the rq should go in. This is an optimization
>   * for the io schedulers (noop, deadline and AS) which maintain only single
>   * io queue per cgroup. In this case common layer can just maintain a
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..96a28e9 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -512,6 +512,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
>  extern int io_group_allow_merge(struct request *rq, struct bio *bio);
>  extern void elv_fq_set_request_io_group(struct request_queue *q,
>  					struct request *rq, struct bio *bio);
> +extern void elv_fq_unset_request_io_group(struct request *rq);
>  static inline bfq_weight_t iog_weight(struct io_group *iog)
>  {
>  	return iog->entity.weight;
> @@ -571,6 +572,10 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
>  {
>  }
>  
> +static inline void elv_fq_unset_request_io_group(struct request *rq)
> +{
> +}
> +
>  static inline bfq_weight_t iog_weight(struct io_group *iog)
>  {
>  	/* Just root group is present and weight is immaterial. */
> diff --git a/block/elevator.c b/block/elevator.c
> index 44c9fad..d75eec7 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -992,6 +992,8 @@ void elv_put_request(struct request_queue *q, struct request *rq)
>  {
>  	struct elevator_queue *e = q->elevator;
>  
> +	elv_fq_unset_request_io_group(rq);
> +
>  	/*
>  	 * Optimization for noop, deadline and AS which maintain only single
>  	 * ioq per io group
> 

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]         ` <20090507.091858.226775723.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2009-05-07  1:25             ` Vivek Goyal
@ 2009-05-08 14:24           ` Rik van Riel
  1 sibling, 0 replies; 297+ messages in thread
From: Rik van Riel @ 2009-05-08 14:24 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Ryo Tsuruta wrote:
> Hi Vivek,
> 
>> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
>> of FIFO dispatch of buffered bios. Apart from that, it tries to provide
>> fairness in terms of actual IO done, and that would mean a seeky workload
>> can use the disk for much longer to get equivalent IO done and slow down
>> other applications. Implementing the IO controller at the IO scheduler level
>> gives us tighter control. Will it not meet your requirements? If you have
>> specific concerns with the IO scheduler based control patches, please
>> highlight them and we will see how they can be addressed.
> 
> I'd like to avoid complicating the existing IO schedulers and other
> kernel code, and to give users a choice whether or not to use it.
> I know that you chose an approach of using compile-time options to
> get the same behavior as the old system, but device-mapper drivers can be
> added, removed and replaced while the system is running.

I do not believe that every use of cgroups will end up with
a separate logical volume for each group.

In fact, if you look at group-per-UID usage, which could be
quite common on shared web servers and shell servers, I would
expect all the groups to share the same filesystem.

I do not believe dm-ioband would be useful in that configuration,
while the IO scheduler based IO controller will just work.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-07  0:18       ` Ryo Tsuruta
       [not found]         ` <20090507.091858.226775723.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-05-08 14:24         ` Rik van Riel
       [not found]           ` <4A0440B2.7040300-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-11 10:11           ` Ryo Tsuruta
  1 sibling, 2 replies; 297+ messages in thread
From: Rik van Riel @ 2009-05-08 14:24 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: vgoyal, akpm, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, linux-kernel, containers, righi.andrea,
	agk, dm-devel, snitzer, m-ikeda, peterz

Ryo Tsuruta wrote:
> Hi Vivek,
> 
>> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
>> of FIFO dispatch of buffered bios. Apart from that, it tries to provide
>> fairness in terms of actual IO done, and that would mean a seeky workload
>> can use the disk for much longer to get equivalent IO done and slow down
>> other applications. Implementing the IO controller at the IO scheduler level
>> gives us tighter control. Will it not meet your requirements? If you have
>> specific concerns with the IO scheduler based control patches, please
>> highlight them and we will see how they can be addressed.
> 
> I'd like to avoid complicating the existing IO schedulers and other
> kernel code, and to give users a choice whether or not to use it.
> I know that you chose an approach of using compile-time options to
> get the same behavior as the old system, but device-mapper drivers can be
> added, removed and replaced while the system is running.

I do not believe that every use of cgroups will end up with
a separate logical volume for each group.

In fact, if you look at group-per-UID usage, which could be
quite common on shared web servers and shell servers, I would
expect all the groups to share the same filesystem.

I do not believe dm-ioband would be useful in that configuration,
while the IO scheduler based IO controller will just work.

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]       ` <20090508135724.GE7293-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-08 17:41         ` Nauman Rafique
  0 siblings, 0 replies; 297+ messages in thread
From: Nauman Rafique @ 2009-05-08 17:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, May 8, 2009 at 6:57 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, May 08, 2009 at 05:45:32PM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch adds io group reference handling when allocating
>> and removing a request.
>>
>
> Hi Gui,
>
> Thanks for the patch. We were thinking that requests can take a reference
> on io queues and io queues can take a reference on io groups. That should
> make sure that io groups don't go away as long as active requests are
> present.
>
> But there seems to be a small window while allocating the new request
> where request gets allocated from a group first and then later it is
> mapped to that group and queue is created. IOW, in get_request_wait(),
> we allocate a request from a particular group and set rq->rl, then
> drop the queue lock and later call elv_set_request() which again maps
> the request to the group saves rq->iog and creates new queue. This window
> is troublesome because request can be mapped to a particular group at the
> time of allocation and during set_request() it can go to a different
> group as queue lock was dropped and group might have disappeared.
>
> In this case probably it might make sense that request also takes a
> reference on groups. At the same time it looks too much that request takes
> a reference on queue as well as group object. Ideas are welcome on how
> to handle it...

IMHO a request being allocated on the wrong cgroup should not be a big
problem as such. All it means is that the request descriptor was
accounted to the wrong cgroup in this particular corner case. Please
correct me if I am wrong.

We can also get rid of the rq->iog pointer. What that means is that the
request is associated with an ioq (rq->ioq), and we can use the
ioq_to_io_group() function to get the io_group. So the request would
only be indirectly associated with an io_group, i.e. the request is
associated with an io_queue and the io_group for the request is the
io_group associated with that io_queue. Do you see any problems with that
approach?
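
Roughly, in userspace-sketch form (illustrative only; the names mirror the
ones discussed in this thread, not the patchset's actual API): the request
pins only its io_queue, the io_queue pins its io_group, and the group is
always reached via ioq_to_io_group().

#include <stdatomic.h>
#include <stdlib.h>

struct io_group { atomic_int ref; };
struct io_queue { atomic_int ref; struct io_group *iog; };
struct request  { struct io_queue *ioq; };

/* group is reached through the queue; no separate rq->iog needed */
static struct io_group *ioq_to_io_group(struct io_queue *ioq)
{
	return ioq->iog;
}

static struct io_queue *ioq_create(struct io_group *iog)
{
	struct io_queue *ioq = calloc(1, sizeof(*ioq));

	atomic_store(&ioq->ref, 1);
	atomic_fetch_add(&iog->ref, 1);	/* the queue pins its group */
	ioq->iog = iog;
	return ioq;
}

static void ioq_put(struct io_queue *ioq)
{
	struct io_group *iog = ioq->iog;

	if (atomic_fetch_sub(&ioq->ref, 1) == 1) {
		free(ioq);
		/* last queue reference gone: drop the queue's group ref */
		if (atomic_fetch_sub(&iog->ref, 1) == 1)
			free(iog);
	}
}

/* a request only pins its queue; the group stays alive transitively */
static void rq_set_ioq(struct request *rq, struct io_queue *ioq)
{
	atomic_fetch_add(&ioq->ref, 1);
	rq->ioq = ioq;
}

static void rq_put_ioq(struct request *rq)
{
	struct io_queue *ioq = rq->ioq;

	rq->ioq = NULL;
	ioq_put(ioq);
}

int main(void)
{
	struct io_group *iog = calloc(1, sizeof(*iog));
	struct io_queue *ioq;
	struct request rq;

	atomic_store(&iog->ref, 1);		/* creator's group reference */
	ioq = ioq_create(iog);
	rq_set_ioq(&rq, ioq);

	(void)ioq_to_io_group(rq.ioq);		/* the lookup suggested above */

	rq_put_ioq(&rq);			/* request done */
	ioq_put(ioq);				/* creator drops the queue */
	if (atomic_fetch_sub(&iog->ref, 1) == 1)
		free(iog);			/* creator drops the group */
	return 0;
}

With this scheme a group cannot disappear while any request still holds a
reference on one of its queues, so no separate per-request group reference
is needed.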

Thanks.
--
Nauman


>
> Thanks
> Vivek
>
>> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> ---
>>  elevator-fq.c |   15 ++++++++++++++-
>>  elevator-fq.h |    5 +++++
>>  elevator.c    |    2 ++
>>  3 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>> index 9500619..e6d6712 100644
>> --- a/block/elevator-fq.c
>> +++ b/block/elevator-fq.c
>> @@ -1968,11 +1968,24 @@ void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
>>       spin_unlock_irqrestore(q->queue_lock, flags);
>>       BUG_ON(!iog);
>>
>> -     /* Store iog in rq. TODO: take care of referencing */
>> +     elv_get_iog(iog);
>>       rq->iog = iog;
>>  }
>>
>>  /*
>> + * This request has been serviced. Clean up iog info and drop the reference.
>> + */
>> +void elv_fq_unset_request_io_group(struct request *rq)
>> +{
>> +     struct io_group *iog = rq->iog;
>> +
>> +     if (iog) {
>> +             rq->iog = NULL;
>> +             elv_put_iog(iog);
>> +     }
>> +}
>> +
>> +/*
>>   * Find/Create the io queue the rq should go in. This is an optimization
>>   * for the io schedulers (noop, deadline and AS) which maintain only single
>>   * io queue per cgroup. In this case common layer can just maintain a
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index db3a347..96a28e9 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -512,6 +512,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
>>  extern int io_group_allow_merge(struct request *rq, struct bio *bio);
>>  extern void elv_fq_set_request_io_group(struct request_queue *q,
>>                                       struct request *rq, struct bio *bio);
>> +extern void elv_fq_unset_request_io_group(struct request *rq);
>>  static inline bfq_weight_t iog_weight(struct io_group *iog)
>>  {
>>       return iog->entity.weight;
>> @@ -571,6 +572,10 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
>>  {
>>  }
>>
>> +static inline void elv_fq_unset_request_io_group(struct request *rq)
>> +{
>> +}
>> +
>>  static inline bfq_weight_t iog_weight(struct io_group *iog)
>>  {
>>       /* Just root group is present and weight is immaterial. */
>> diff --git a/block/elevator.c b/block/elevator.c
>> index 44c9fad..d75eec7 100644
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -992,6 +992,8 @@ void elv_put_request(struct request_queue *q, struct request *rq)
>>  {
>>       struct elevator_queue *e = q->elevator;
>>
>> +     elv_fq_unset_request_io_group(rq);
>> +
>>       /*
>>        * Optimization for noop, deadline and AS which maintain only single
>>        * ioq per io group
>>
>

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for  request
  2009-05-08 13:57       ` Vivek Goyal
@ 2009-05-08 17:41         ` Nauman Rafique
  -1 siblings, 0 replies; 297+ messages in thread
From: Nauman Rafique @ 2009-05-08 17:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Gui Jianfeng, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Fri, May 8, 2009 at 6:57 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, May 08, 2009 at 05:45:32PM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch adds io group reference handling when allocating
>> and removing a request.
>>
>
> Hi Gui,
>
> Thanks for the patch. We were thinking that requests can take a reference
> on io queues and io queues can take a reference on io groups. That should
> make sure that io groups don't go away as long as active requests are
> present.
>
> But there seems to be a small window while allocating the new request
> where request gets allocated from a group first and then later it is
> mapped to that group and queue is created. IOW, in get_request_wait(),
> we allocate a request from a particular group and set rq->rl, then
> drop the queue lock and later call elv_set_request() which again maps
> the request to the group saves rq->iog and creates new queue. This window
> is troublesome because request can be mapped to a particular group at the
> time of allocation and during set_request() it can go to a different
> group as queue lock was dropped and group might have disappeared.
>
> In this case probably it might make sense that request also takes a
> reference on groups. At the same time it looks too much that request takes
> a reference on queue as well as group object. Ideas are welcome on how
> to handle it...

IMHO a request being allocated on the wrong cgroup should not be a big
problem as such. All it means is that the request descriptor was
accounted to the wrong cgroup in this particular corner case. Please
correct me if I am wrong.

We can also get rid of the rq->iog pointer. What that means is that the
request is associated with an ioq (rq->ioq), and we can use the
ioq_to_io_group() function to get the io_group. So the request would
only be indirectly associated with an io_group, i.e. the request is
associated with an io_queue and the io_group for the request is the
io_group associated with that io_queue. Do you see any problems with that
approach?

Thanks.
--
Nauman


>
> Thanks
> Vivek
>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>>  elevator-fq.c |   15 ++++++++++++++-
>>  elevator-fq.h |    5 +++++
>>  elevator.c    |    2 ++
>>  3 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>> index 9500619..e6d6712 100644
>> --- a/block/elevator-fq.c
>> +++ b/block/elevator-fq.c
>> @@ -1968,11 +1968,24 @@ void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
>>       spin_unlock_irqrestore(q->queue_lock, flags);
>>       BUG_ON(!iog);
>>
>> -     /* Store iog in rq. TODO: take care of referencing */
>> +     elv_get_iog(iog);
>>       rq->iog = iog;
>>  }
>>
>>  /*
>> + * This request has been serviced. Clean up iog info and drop the reference.
>> + */
>> +void elv_fq_unset_request_io_group(struct request *rq)
>> +{
>> +     struct io_group *iog = rq->iog;
>> +
>> +     if (iog) {
>> +             rq->iog = NULL;
>> +             elv_put_iog(iog);
>> +     }
>> +}
>> +
>> +/*
>>   * Find/Create the io queue the rq should go in. This is an optimization
>>   * for the io schedulers (noop, deadline and AS) which maintain only single
>>   * io queue per cgroup. In this case common layer can just maintain a
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index db3a347..96a28e9 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -512,6 +512,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
>>  extern int io_group_allow_merge(struct request *rq, struct bio *bio);
>>  extern void elv_fq_set_request_io_group(struct request_queue *q,
>>                                       struct request *rq, struct bio *bio);
>> +extern void elv_fq_unset_request_io_group(struct request *rq);
>>  static inline bfq_weight_t iog_weight(struct io_group *iog)
>>  {
>>       return iog->entity.weight;
>> @@ -571,6 +572,10 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
>>  {
>>  }
>>
>> +static inline void elv_fq_unset_request_io_group(struct request *rq)
>> +{
>> +}
>> +
>>  static inline bfq_weight_t iog_weight(struct io_group *iog)
>>  {
>>       /* Just root group is present and weight is immaterial. */
>> diff --git a/block/elevator.c b/block/elevator.c
>> index 44c9fad..d75eec7 100644
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -992,6 +992,8 @@ void elv_put_request(struct request_queue *q, struct request *rq)
>>  {
>>       struct elevator_queue *e = q->elevator;
>>
>> +     elv_fq_unset_request_io_group(rq);
>> +
>>       /*
>>        * Optimization for noop, deadline and AS which maintain only single
>>        * ioq per io group
>>
>

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
@ 2009-05-08 17:41         ` Nauman Rafique
  0 siblings, 0 replies; 297+ messages in thread
From: Nauman Rafique @ 2009-05-08 17:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Gui Jianfeng, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Fri, May 8, 2009 at 6:57 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, May 08, 2009 at 05:45:32PM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch adds io group reference handling when allocating
>> and removing a request.
>>
>
> Hi Gui,
>
> Thanks for the patch. We were thinking that requests can take a reference
> on io queues and io queues can take a reference on io groups. That should
> make sure that io groups don't go away as long as active requests are
> present.
>
> But there seems to be a small window while allocating the new request
> where request gets allocated from a group first and then later it is
> mapped to that group and queue is created. IOW, in get_request_wait(),
> we allocate a request from a particular group and set rq->rl, then
> drop the queue lock and later call elv_set_request() which again maps
> the request to the group saves rq->iog and creates new queue. This window
> is troublesome because request can be mapped to a particular group at the
> time of allocation and during set_request() it can go to a different
> group as queue lock was dropped and group might have disappeared.
>
> In this case probably it might make sense that request also takes a
> reference on groups. At the same time it looks too much that request takes
> a reference on queue as well as group object. Ideas are welcome on how
> to handle it...

IMHO a request being allocated on the wrong cgroup should not be a big
problem as such. All it means is that the request descriptor was
accounted to the wrong cgroup in this particular corner case. Please
correct me if I am wrong.

We can also get rid of the rq->iog pointer. What that means is that the
request is associated with an ioq (rq->ioq), and we can use the
ioq_to_io_group() function to get the io_group. So the request would
only be indirectly associated with an io_group, i.e. the request is
associated with an io_queue and the io_group for the request is the
io_group associated with that io_queue. Do you see any problems with that
approach?

Thanks.
--
Nauman


>
> Thanks
> Vivek
>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>>  elevator-fq.c |   15 ++++++++++++++-
>>  elevator-fq.h |    5 +++++
>>  elevator.c    |    2 ++
>>  3 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>> index 9500619..e6d6712 100644
>> --- a/block/elevator-fq.c
>> +++ b/block/elevator-fq.c
>> @@ -1968,11 +1968,24 @@ void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
>>       spin_unlock_irqrestore(q->queue_lock, flags);
>>       BUG_ON(!iog);
>>
>> -     /* Store iog in rq. TODO: take care of referencing */
>> +     elv_get_iog(iog);
>>       rq->iog = iog;
>>  }
>>
>>  /*
>> + * This request has been serviced. Clean up iog info and drop the reference.
>> + */
>> +void elv_fq_unset_request_io_group(struct request *rq)
>> +{
>> +     struct io_group *iog = rq->iog;
>> +
>> +     if (iog) {
>> +             rq->iog = NULL;
>> +             elv_put_iog(iog);
>> +     }
>> +}
>> +
>> +/*
>>   * Find/Create the io queue the rq should go in. This is an optimization
>>   * for the io schedulers (noop, deadline and AS) which maintain only single
>>   * io queue per cgroup. In this case common layer can just maintain a
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index db3a347..96a28e9 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -512,6 +512,7 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
>>  extern int io_group_allow_merge(struct request *rq, struct bio *bio);
>>  extern void elv_fq_set_request_io_group(struct request_queue *q,
>>                                       struct request *rq, struct bio *bio);
>> +extern void elv_fq_unset_request_io_group(struct request *rq);
>>  static inline bfq_weight_t iog_weight(struct io_group *iog)
>>  {
>>       return iog->entity.weight;
>> @@ -571,6 +572,10 @@ static inline void elv_fq_set_request_io_group(struct request_queue *q,
>>  {
>>  }
>>
>> +static inline void elv_fq_unset_request_io_group(struct request *rq)
>> +{
>> +}
>> +
>>  static inline bfq_weight_t iog_weight(struct io_group *iog)
>>  {
>>       /* Just root group is present and weight is immaterial. */
>> diff --git a/block/elevator.c b/block/elevator.c
>> index 44c9fad..d75eec7 100644
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -992,6 +992,8 @@ void elv_put_request(struct request_queue *q, struct request *rq)
>>  {
>>       struct elevator_queue *e = q->elevator;
>>
>> +     elv_fq_unset_request_io_group(rq);
>> +
>>       /*
>>        * Optimization for noop, deadline and AS which maintain only single
>>        * ioq per io group
>>
>

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-07 22:19                         ` Andrea Righi
  2009-05-08 18:09                           ` Vivek Goyal
@ 2009-05-08 18:09                           ` Vivek Goyal
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 18:09 UTC (permalink / raw)
  To: Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Fri, May 08, 2009 at 12:19:01AM +0200, Andrea Righi wrote:
> On Thu, May 07, 2009 at 11:36:42AM -0400, Vivek Goyal wrote:
> > Hmm, my old config had "AS" as the default scheduler; that's why I was seeing
> > the strange issue of the RT task finishing after BE. My apologies for that. I
> > somehow assumed that CFQ was the default scheduler in my config.
> 
> ok.
> 
> > 
> > So I have re-run the test to see if we are still seeing the issue of
> > losing priority and class within a cgroup. And we still do.
> > 
> > 2.6.30-rc4 with io-throttle patches
> > ===================================
> > Test1
> > =====
> > - Two readers, one BE prio 0 and other BE prio 7 in a cgroup limited with
> >   8MB/s BW.
> > 
> > 234179072 bytes (234 MB) copied, 55.8448 s, 4.2 MB/s
> > prio 0 task finished
> > 234179072 bytes (234 MB) copied, 55.8878 s, 4.2 MB/s
> > 
> > Test2
> > =====
> > - Two readers, one RT prio 0 and other BE prio 7 in a cgroup limited with
> >   8MB/s BW.
> > 
> > 234179072 bytes (234 MB) copied, 55.8876 s, 4.2 MB/s
> > 234179072 bytes (234 MB) copied, 55.8984 s, 4.2 MB/s
> > RT task finished
> 
> ok, coherent with the current io-throttle implementation.
> 
> > 
> > Test3
> > =====
> > - Reader Starvation
> > - I created a cgroup with a BW limit of 64MB/s. First I just ran the reader
> >   alone and then I ran the reader along with 4 writers, 4 times.
> > 
> > Reader alone
> > 234179072 bytes (234 MB) copied, 3.71796 s, 63.0 MB/s
> > 
> > Reader with 4 writers
> > ---------------------
> > First run
> > 234179072 bytes (234 MB) copied, 30.394 s, 7.7 MB/s 
> > 
> > Second run
> > 234179072 bytes (234 MB) copied, 26.9607 s, 8.7 MB/s
> > 
> > Third run
> > 234179072 bytes (234 MB) copied, 37.3515 s, 6.3 MB/s
> > 
> > Fourth run
> > 234179072 bytes (234 MB) copied, 36.817 s, 6.4 MB/s
> > 
> > Note that out of the 64MB/s limit of this cgroup, the reader does not get even
> > 1/5 of the BW. On normal systems, readers are advantaged and a reader gets
> > its job done much faster even in the presence of multiple writers.
> 
> And this is also coherent. The throttling is equally probable for read
> and write. But this shouldn't happen if we saturate the physical disk BW
> (doing proportional BW control or using a watermark close to 100 in
> io-throttle). In this case IO scheduler logic shouldn't be totally
> broken.
>

Can you please explain the watermark a bit more? So blockio.watermark=90
means 90% of what? The total disk BW? But disk BW varies based on the workload.

> Doing a very quick test with io-throttle, using a 10MB/s BW limit and
> blockio.watermark=90:
> 
> Launching reader
> 256+0 records in
> 256+0 records out
> 268435456 bytes (268 MB) copied, 32.2798 s, 8.3 MB/s
> 
> In the same time the writers wrote ~190MB, so the single reader got
> about 1/3 of the total BW.
> 
> 182M testzerofile4
> 198M testzerofile1
> 188M testzerofile3
> 189M testzerofile2
> 

But then it's not really a max BW controller at all any more, is it? I seem to
be getting a total BW of (268+182+198+188+189)/32 = 32MB/s while you set the
limit to 10MB/s?
 

[..]
> What are the results with your IO scheduler controller (if you already
> have them, otherwise I'll repeat this test on my system)? It seems a
> very interesting test to compare the advantages of the IO scheduler
> solution with respect to the io-throttle approach.
> 

I had not done any reader/writer testing so far, but you forced me to run
some now. :-) Here are the results.

Because one is a max BW controller and the other is a proportional BW
controller, an exact comparison is hard. Still...

Test1
=====
Run lots of writers (50 random writers using fio and 4 sequential
writers with dd if=/dev/zero) and one single reader, either in the root group
or within one cgroup, to show that readers are not starved by writers,
as opposed to the io-throttle controller.

Run test1 with vanilla kernel with CFQ
=====================================
Launched 50 fio random writers, 4 sequential writers and 1 reader in the root
group and noted how long it takes the reader to finish. Also noted the
per-second output from iostat -d 1 -m /dev/sdb1 to monitor how the disk
throughput varies.

***********************************************************************
# launch 50 writers fio job

fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"
fio $fio_args --name=test2 --directory=/mnt/sdb/fio2/ --output=/mnt/sdb/fio2/test2.log > /dev/null  &

#launch 4 sequential writers
ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile1 bs=4K count=524288 &
ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile2 bs=4K count=524288 &
ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile3 bs=4K count=524288 &
ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile4 bs=4K count=524288 &

echo "Sleeping for 5 seconds"
sleep 5
echo "Launching reader"

ionice -c 2 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
wait $!
echo "Reader Finished"
***************************************************************************

Results
-------
234179072 bytes (234 MB) copied, 4.55047 s, 51.5 MB/s

Reader finished in 4.5 seconds. Following are a few lines from the iostat output:

***********************************************************************
Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            151.00         0.04        48.33          0         48

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            120.00         1.78        31.23          1         31

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            504.95        56.75         7.51         57          7

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            547.47        62.71         4.47         62          4

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            441.00        49.80         7.82         49          7

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            441.41        48.28        13.84         47         13

*************************************************************************

Note how write throughput picks up first, and then suddenly the reader comes in
and CFQ allocates a huge chunk of BW to the reader to give it the advantage.

Run Test1 with IO scheduler based io controller patch
=====================================================

234179072 bytes (234 MB) copied, 5.23141 s, 44.8 MB/s 

Reader finishes in 5.23 seconds. Why does it take more time than with CFQ?
Because it looks like the current algorithm is not punishing writers that hard.
This can be fixed and is not an issue.

Following is some output from iostat.

**********************************************************************
Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            139.60         0.04        43.83          0         44

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            227.72        16.88        29.05         17         29

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            349.00        35.04        16.06         35         16

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            339.00        34.16        21.07         34         21

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            343.56        36.68        12.54         37         12

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            378.00        38.68        19.47         38         19

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            532.00        59.06        10.00         59         10

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            125.00         2.62        38.82          2         38
************************************************************************

Note how read throughput goes up when the reader comes in. Also note that the
writers are still getting some decent IO done, and that's why the reader took a
little bit more time compared to CFQ.


Run Test1 with IO throttle patches
==================================

Now the same test is run with the io-throttle patches. The only difference is
that I ran the test in a cgroup with a max limit of 32MB/s. That should mean
that effectively we got a disk which can support at most a 32MB/s IO rate.
If we look at the CFQ and io controller results above, it looks like with the
above load we touched a peak of 70MB/s. So one can think of the same test
being run on a disk roughly half the speed of the original disk.

234179072 bytes (234 MB) copied, 144.207 s, 1.6 MB/s

The reader got a disk rate of 1.6MB/s (5%) out of the 32MB/s capacity, as
opposed to the CFQ and io scheduler controller cases, where the reader got
around 70-80% of the disk BW under a similar workload.

Test2
=====
Run test2 with io scheduler based io controller
===============================================
Now run almost the same test with a little difference. This time I create two
cgroups of the same weight, 1000. I run the 50 fio random writers in one cgroup
and the 4 sequential writers and 1 reader in the second group. This test is more
to show that the proportional BW IO controller is working: group2's IO is not
killed by the 50 fio writers in group1 (providing isolation), and secondly, the
reader still gets preference over the writers which are in its own group.

				root
			     /       \		
			  group1     group2
		  (50 fio writers)   ( 4 writers and one reader)

234179072 bytes (234 MB) copied, 12.8546 s, 18.2 MB/s

The reader finished in almost 13 seconds and got around 18MB/s. Remember, when
everything was in the root group the reader got around 45MB/s. This accounts
for the fact that half of the disk is now being shared with the other cgroup,
which is running the 50 fio writers, and the reader can't steal the disk from them.

Following is some portion of iostat output when reader became active
*********************************************************************
Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            103.92         0.03        40.21          0         41

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            240.00        15.78        37.40         15         37

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            206.93        13.17        28.50         13         28

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            224.75        15.39        27.89         15         28

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            270.71        16.85        25.95         16         25

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            215.84         8.81        32.40          8         32

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            216.16        19.11        20.75         18         20

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            211.11        14.67        35.77         14         35

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            208.91        15.04        26.95         15         27

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            277.23        24.30        28.53         24         28

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            202.97        12.29        34.79         12         35
**********************************************************************

Total disk throughput is varying a lot; on average it looks like it
is getting 45MB/s. Let's say 50% of that is going to cgroup1 (fio writers),
then out of the remaining ~22MB/s the reader seems to get 18MB/s. These are
highly approximate numbers. I think I need to come up with some kind of
tool to measure per-cgroup throughput (like we have for per-partition
stats) for a more accurate comparison.

But the point is that the second cgroup got isolation and the read got
preference within the same cgroup. This is the expected behavior.

Run test2 with io-throttle
==========================
Same setup of two groups. The only difference is that I set up the two groups
with a 16MB/s limit each, so the previous 32MB/s limit got divided between the
two cgroups, 50% each.

- 234179072 bytes (234 MB) copied, 90.8055 s, 2.6 MB/s

The reader took 90 seconds to finish. It seems to have got around 16% of the
disk BW (16MB/s) available to its cgroup.

iostat output is long. Will just paste one section.

************************************************************************
[..]

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            141.58        10.16        16.12         10         16

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            174.75         8.06        12.31          7         12

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1             47.52         0.12         6.16          0          6

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1             82.00         0.00        31.85          0         31

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1            141.00         0.00        48.07          0         48

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb1             72.73         0.00        26.52          0         26
 

***************************************************************************

Conclusion
==========
It just reaffirms that with max BW control we are not doing a fair job
of throttling, and hence no longer preserve the IO scheduler properties within a
cgroup.

With a proportional BW controller implemented at the IO scheduler level, one
can integrate the controller very tightly with the IO scheduler and hence retain
IO scheduler behavior within a cgroup.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]         ` <e98e18940905081041r386e52a5q5a2b1f13f1e8c634-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-05-08 18:56           ` Vivek Goyal
  2009-05-11  1:33           ` Gui Jianfeng
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 18:56 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, May 08, 2009 at 10:41:01AM -0700, Nauman Rafique wrote:
> On Fri, May 8, 2009 at 6:57 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Fri, May 08, 2009 at 05:45:32PM +0800, Gui Jianfeng wrote:
> >> Hi Vivek,
> >>
> >> This patch adds io group reference handling when allocating
> >> and removing a request.
> >>
> >
> > Hi Gui,
> >
> > Thanks for the patch. We were thinking that requests can take a reference
> > on io queues and io queues can take a reference on io groups. That should
> > make sure that io groups don't go away as long as active requests are
> > present.
> >
> > But there seems to be a small window while allocating the new request
> > where request gets allocated from a group first and then later it is
> > mapped to that group and queue is created. IOW, in get_request_wait(),
> > we allocate a request from a particular group and set rq->rl, then
> > drop the queue lock and later call elv_set_request() which again maps
> > the request to the group saves rq->iog and creates new queue. This window
> > is troublesome because request can be mapped to a particular group at the
> > time of allocation and during set_request() it can go to a different
> > group as queue lock was dropped and group might have disappeared.
> >
> > In this case probably it might make sense that request also takes a
> > reference on groups. At the same time it looks too much that request takes
> > a reference on queue as well as group object. Ideas are welcome on how
> > to handle it...
> 
> IMHO a request being allocated on the wrong cgroup should not be a big
> problem as such. All it means is that the request descriptor was
> accounted to the wrong cgroup in this particular corner case. Please
> correct me if I am wrong.

I think you are right. We just need to be a little careful while freeing
the request, as the associated request list (rq->rl) might have gone away.

Or we can probably think of getting rid of rq->rl also, and while
freeing the request determine the io queue and group from rq->ioq. But somehow
I remember that I had to introduce rq->rl because otherwise I was running into
issues of the request being mapped to different groups at different points
in time, hence different request lists etc. Will check again.
> 
> We can also get rid of rq->iog pointer too. What that means is that
> request is associated with ioq (rq->ioq), and we can use
> ioq_to_io_group() function to get the io_group. So the request would
> only be indirectly associated with an io_group i.e. the request is
> associated with an io_queue and the io_group for the request is the
> io_group associated with io_queue. Do you see any problems with that
> approach?

Looks like this is also doable. Good idea. Can't think of why we can't
get rid of rq->iog and manage with rq->ioq. There are only 1-2
places where the ioq is not set up yet but the rq has already been mapped to
the group. There we shall have to carry the group information, or carry the
bio and map it again to get the group info.

Will try to implement it and see how it goes.
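
For illustration only, here is a standalone userspace toy model of that scheme
(this is not the patch code; the structs and helpers below are simplified
stand-ins for the kernel objects): the request pins only its io queue, the io
queue pins its io group, and the group is always reached through an
ioq_to_io_group() style helper instead of a separate rq->iog pointer.

#include <stdio.h>
#include <stdlib.h>

struct io_group {
	int ref;                        /* elevated while an io queue points here */
	const char *cgroup_name;
};

struct io_queue {
	int ref;                        /* elevated while a request points here */
	struct io_group *iog;           /* the ioq holds one reference on its group */
};

struct request {
	struct io_queue *ioq;           /* the only group linkage the request keeps */
};

static struct io_group *ioq_to_io_group(struct io_queue *ioq)
{
	return ioq->iog;                /* group reached indirectly, no rq->iog */
}

static struct io_queue *ioq_get(struct io_queue *ioq)
{
	ioq->ref++;
	return ioq;
}

static void ioq_put(struct io_queue *ioq)
{
	if (--ioq->ref)
		return;
	if (--ioq->iog->ref == 0)       /* last queue gone: drop the group too */
		free(ioq->iog);
	free(ioq);
}

int main(void)
{
	struct io_group *iog = calloc(1, sizeof(*iog));
	struct io_queue *ioq = calloc(1, sizeof(*ioq));
	struct request rq;

	iog->cgroup_name = "group2";
	iog->ref = 1;                   /* reference held by the ioq below */
	ioq->iog = iog;

	rq.ioq = ioq_get(ioq);          /* set_request: the request pins the ioq */
	printf("request belongs to %s\n", ioq_to_io_group(rq.ioq)->cgroup_name);
	ioq_put(rq.ioq);                /* completion drops the ioq, then the group */
	return 0;
}

In the kernel the reference counts would of course be atomic and tied into the
elevator's request lifetime; the point here is just the ownership chain.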

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]           ` <20090508185644.GH7293-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-08 19:06             ` Nauman Rafique
  0 siblings, 0 replies; 297+ messages in thread
From: Nauman Rafique @ 2009-05-08 19:06 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, May 8, 2009 at 11:56 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, May 08, 2009 at 10:41:01AM -0700, Nauman Rafique wrote:
>> On Fri, May 8, 2009 at 6:57 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> > On Fri, May 08, 2009 at 05:45:32PM +0800, Gui Jianfeng wrote:
>> >> Hi Vivek,
>> >>
>> >> This patch adds io group reference handling when allocating
>> >> and removing a request.
>> >>
>> >
>> > Hi Gui,
>> >
>> > Thanks for the patch. We were thinking that requests can take a reference
>> > on io queues and io queues can take a reference on io groups. That should
>> > make sure that io groups don't go away as long as active requests are
>> > present.
>> >
>> > But there seems to be a small window while allocating the new request
>> > where request gets allocated from a group first and then later it is
>> > mapped to that group and queue is created. IOW, in get_request_wait(),
>> > we allocate a request from a particular group and set rq->rl, then
>> > drop the queue lock and later call elv_set_request() which again maps
>> > the request to the group saves rq->iog and creates new queue. This window
>> > is troublesome because request can be mapped to a particular group at the
>> > time of allocation and during set_request() it can go to a different
>> > group as queue lock was dropped and group might have disappeared.
>> >
>> > In this case probably it might make sense that request also takes a
>> > reference on groups. At the same time it looks too much that request takes
>> > a reference on queue as well as group object. Ideas are welcome on how
>> > to handle it...
>>
>> IMHO a request being allocated on the wrong cgroup should not be a big
>> problem as such. All it means is that the request descriptor was
>> accounted to the wrong cgroup in this particular corner case. Please
>> correct me if I am wrong.
>
> I think you are right. We just need to be little careful while freeing
> the request that associated request list might have gone away (rq->rl).
>
> Or we probably can think of getting rid of (rq->rl) also and while
> freeing request determine io queue and group from rq->ioq. But somehow
> I remember that I had to introduce rq->rl otherwise I was running into
> issues of request being mapped to different groups at different point
> of time hence different request list etc. Will check again..
>>
>> We can also get rid of rq->iog pointer too. What that means is that
>> request is associated with ioq (rq->ioq), and we can use
>> ioq_to_io_group() function to get the io_group. So the request would
>> only be indirectly associated with an io_group i.e. the request is
>> associated with an io_queue and the io_group for the request is the
>> io_group associated with io_queue. Do you see any problems with that
>> approach?
>
> Looks like this is also doable. Good idea. Can't think of why can't
> we get rid of rq->iog and manage with rq->ioq. There are only 1-2
> places where ioq is not setup yet and rq has been mapped to the group.
> There we shall have to carry group information or carry bio information
> and map it again to get group info.
>
> Will try to implement it and see how does it go.

I tried it, and it seems to work. I passed the io_group around as function
arguments and return values before the ioq was set.

>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]                             ` <20090508180951.GG7293-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-08 20:05                               ` Andrea Righi
  0 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-08 20:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Fri, May 08, 2009 at 02:09:51PM -0400, Vivek Goyal wrote:
> On Fri, May 08, 2009 at 12:19:01AM +0200, Andrea Righi wrote:
> > On Thu, May 07, 2009 at 11:36:42AM -0400, Vivek Goyal wrote:
> > > Hmm.., my old config had "AS" as default scheduler that's why I was seeing
> > > the strange issue of RT task finishing after BE. My apologies for that. I
> > > somehow assumed that CFQ is default scheduler in my config.
> > 
> > ok.
> > 
> > > 
> > > So I have re-run the test to see if we are still seeing the issue of
> > > loosing priority and class with-in cgroup. And we still do..
> > > 
> > > 2.6.30-rc4 with io-throttle patches
> > > ===================================
> > > Test1
> > > =====
> > > - Two readers, one BE prio 0 and other BE prio 7 in a cgroup limited with
> > >   8MB/s BW.
> > > 
> > > 234179072 bytes (234 MB) copied, 55.8448 s, 4.2 MB/s
> > > prio 0 task finished
> > > 234179072 bytes (234 MB) copied, 55.8878 s, 4.2 MB/s
> > > 
> > > Test2
> > > =====
> > > - Two readers, one RT prio 0 and other BE prio 7 in a cgroup limited with
> > >   8MB/s BW.
> > > 
> > > 234179072 bytes (234 MB) copied, 55.8876 s, 4.2 MB/s
> > > 234179072 bytes (234 MB) copied, 55.8984 s, 4.2 MB/s
> > > RT task finished
> > 
> > ok, coherent with the current io-throttle implementation.
> > 
> > > 
> > > Test3
> > > =====
> > > - Reader Starvation
> > > - I created a cgroup with BW limit of 64MB/s. First I just run the reader
> > >   alone and then I run reader along with 4 writers 4 times. 
> > > 
> > > Reader alone
> > > 234179072 bytes (234 MB) copied, 3.71796 s, 63.0 MB/s
> > > 
> > > Reader with 4 writers
> > > ---------------------
> > > First run
> > > 234179072 bytes (234 MB) copied, 30.394 s, 7.7 MB/s 
> > > 
> > > Second run
> > > 234179072 bytes (234 MB) copied, 26.9607 s, 8.7 MB/s
> > > 
> > > Third run
> > > 234179072 bytes (234 MB) copied, 37.3515 s, 6.3 MB/s
> > > 
> > > Fourth run
> > > 234179072 bytes (234 MB) copied, 36.817 s, 6.4 MB/s
> > > 
> > > Note that out of 64MB/s limit of this cgroup, reader does not get even
> > > 1/5 of the BW. In normal systems, readers are advantaged and reader gets
> > > its job done much faster even in presence of multiple writers.   
> > 
> > And this is also coherent. The throttling is equally probable for read
> > and write. But this shouldn't happen if we saturate the physical disk BW
> > (doing proportional BW control or using a watermark close to 100 in
> > io-throttle). In this case IO scheduler logic shouldn't be totally
> > broken.
> >
> 
> Can you please explain the watermark a bit more? So blockio.watermark=90
> mean 90% of what? total disk BW? But disk BW varies based on work load?

The controller starts to apply throttling rules only when the total disk
BW utilization is greater than 90%.

The consumed BW is evaluated as (cpu_ticks / io_ticks * 100), where
cpu_ticks are the ticks (in jiffies) since the last i/o request and
io_ticks is the difference of ticks accounted to a particular block
device, retrieved by:

part_stat_read(bdev->bd_part, io_ticks)

BTW it's the same metric (%util) used by iostat.
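
A rough userspace approximation of the same metric, not the io-throttle code
itself; it just samples the io_ticks column of /proc/diskstats the way iostat
does, and the field position is assumed from the usual diskstats layout:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* io_ticks is the 10th field after the device name in /proc/diskstats:
 * milliseconds spent doing I/O, i.e. the counter behind iostat's %util. */
static long read_io_ticks(const char *dev)
{
	char line[256], name[32];
	unsigned long long v[10];
	long ret = -1;
	FILE *f = fopen("/proc/diskstats", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%*d %*d %31s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
			   name, &v[0], &v[1], &v[2], &v[3], &v[4], &v[5],
			   &v[6], &v[7], &v[8], &v[9]) == 11 &&
		    strcmp(name, dev) == 0) {
			ret = (long)v[9];
			break;
		}
	}
	fclose(f);
	return ret;
}

int main(void)
{
	const char *dev = "sdb1";       /* device used in the tests above */
	long t1, t2;

	t1 = read_io_ticks(dev);
	sleep(1);
	t2 = read_io_ticks(dev);
	if (t1 >= 0 && t2 >= 0)
		printf("%%util over 1s: %.1f%%\n", (t2 - t1) / 10.0);
	return 0;
}

Running it while one of the workloads above is in flight should show
utilization pegged near 100%, which is exactly the region where a watermark of
90 lets the throttling rules kick in.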

> 
> > Doing a very quick test with io-throttle, using a 10MB/s BW limit and
> > blockio.watermark=90:
> > 
> > Launching reader
> > 256+0 records in
> > 256+0 records out
> > 268435456 bytes (268 MB) copied, 32.2798 s, 8.3 MB/s
> > 
> > In the same time the writers wrote ~190MB, so the single reader got
> > about 1/3 of the total BW.
> > 
> > 182M testzerofile4
> > 198M testzerofile1
> > 188M testzerofile3
> > 189M testzerofile2
> > 
> 
> But its now more a max bw controller at all now? I seem to be getting the
> total BW of (268+182+198+188+189)/32 = 32MB/s and you set the limit to
> 10MB/s?
>  

The limit of 10MB/s is applied only when the consumed disk BW hits 90%.

If the disk is not fully saturated, no limit is applied. It's nothing
more than soft limiting, to avoid wasting the unused disk BW that we
would waste with hard limits. This is similar to the proportional approach
from a certain point of view.

But OK, this only reduces the number of times that we block the IO
requests. The fact is that when we do apply throttling, the probability of
blocking a read or a write is the same in this case as well.
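
Roughly, the decision being described could be sketched like this (illustrative
pseudologic with made-up names, not the actual io-throttle code, and the
proportional sleep at the end is just an assumption for the example):

#include <stdio.h>

struct blockio_rule {
	unsigned int watermark;         /* e.g. 90 (percent disk utilization) */
	unsigned long long max_bw;      /* configured limit, bytes/sec */
};

/* How many milliseconds should the caller sleep before issuing more I/O?
 * Zero while the disk is below the watermark or the cgroup is under its
 * limit; proportional to the excess otherwise. */
static unsigned long throttle_delay(const struct blockio_rule *rule,
				    unsigned int disk_util_pct,
				    unsigned long long cgroup_bw)
{
	if (disk_util_pct < rule->watermark)
		return 0;               /* disk not saturated: soft limit stays idle */
	if (cgroup_bw <= rule->max_bw)
		return 0;               /* within the configured max BW */
	return (unsigned long)((cgroup_bw - rule->max_bw) * 1000 / rule->max_bw);
}

int main(void)
{
	struct blockio_rule rule = { .watermark = 90, .max_bw = 10 << 20 };

	/* disk 95% busy, cgroup doing 20MB/s against a 10MB/s limit */
	printf("sleep %lums\n", throttle_delay(&rule, 95, 20 << 20));
	/* disk only 50% busy: no throttling even though over the limit */
	printf("sleep %lums\n", throttle_delay(&rule, 50, 20 << 20));
	return 0;
}

The point the sketch tries to make explicit is that once the watermark is
exceeded the delay applies to reads and writes alike, which is why the reader
loses its usual CFQ advantage under contention.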

> 
> [..]
> > What are the results with your IO scheduler controller (if you already
> > have them, otherwise I'll repeat this test in my system)? It seems a
> > very interesting test to compare the advantages of the IO scheduler
> > solution respect to the io-throttle approach.
> > 
> 
> I had not done any reader writer testing so far. But you forced me to run
> some now. :-) Here are the results. 

Good! :)

> 
> Because one is max BW controller and other is proportional BW controller
> doing exact comparison is hard. Still....
> 
> Test1
> =====
> Try to run lots of writers (50 random writers using fio and 4 sequential
> writers with dd if=/dev/zero) and one single reader either in root group
> or with in one cgroup to show that readers are not starved by writers
> as opposed to io-throttle controller.
> 
> Run test1 with vanilla kernel with CFQ
> =====================================
> Launched 50 fio random writers, 4 sequential writers and 1 reader in root
> and noted how long it takes reader to finish. Also noted the per second output
> from iostat -d 1 -m /dev/sdb1 to monitor how disk throughput varies.
> 
> ***********************************************************************
> # launch 50 writers fio job
> 
> fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"
> fio $fio_args --name=test2 --directory=/mnt/sdb/fio2/ --output=/mnt/sdb/fio2/test2.log > /dev/null  &
> 
> #launch 4 sequential writers
> ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile1 bs=4K count=524288 &
> ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile2 bs=4K count=524288 &
> ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile3 bs=4K count=524288 &
> ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile4 bs=4K count=524288 &
> 
> echo "Sleeping for 5 seconds"
> sleep 5
> echo "Launching reader"
> 
> ionice -c 2 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> wait $!
> echo "Reader Finished"
> ***************************************************************************
> 
> Results
> -------
> 234179072 bytes (234 MB) copied, 4.55047 s, 51.5 MB/s
> 
> Reader finished in 4.5 seconds. Following are few lines from iostat output
> 
> ***********************************************************************
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            151.00         0.04        48.33          0         48
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            120.00         1.78        31.23          1         31
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            504.95        56.75         7.51         57          7
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            547.47        62.71         4.47         62          4
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            441.00        49.80         7.82         49          7
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            441.41        48.28        13.84         47         13
> 
> *************************************************************************
> 
> Note how, first write picks up and then suddenly reader comes in and CFQ
> allocates a huge chunk of BW to reader to give it the advantage.
> 
> Run Test1 with IO scheduler based io controller patch
> =====================================================
> 
> 234179072 bytes (234 MB) copied, 5.23141 s, 44.8 MB/s 
> 
> Reader finishes in 5.23 seconds. Why does it take more time than CFQ,
> because looks like current algorithm is not punishing writers that hard.
> This can be fixed and not an issue.
> 
> Following is some output from iostat.
> 
> **********************************************************************
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            139.60         0.04        43.83          0         44
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            227.72        16.88        29.05         17         29
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            349.00        35.04        16.06         35         16
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            339.00        34.16        21.07         34         21
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            343.56        36.68        12.54         37         12
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            378.00        38.68        19.47         38         19
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            532.00        59.06        10.00         59         10
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            125.00         2.62        38.82          2         38
> ************************************************************************
> 
> Note how read throughput goes up when reader comes in. Also note that
> writer is still getting some decent IO done and that's why reader took
> little bit more time as compared to CFQ.
> 
> 
> Run Test1 with IO throttle patches
> ==================================
> 
> Now same test is run with io-throttle patches. The only difference is that
> it run the test in a cgroup with max limit of 32MB/s. That should mean 
> that effectvily we got a disk which can support at max 32MB/s of IO rate.
> If we look at above CFQ and io controller results, it looks like with
> above load we touched a peak of 70MB/s.  So one can think of same test
> being run on a disk roughly half the speed of original disk.
> 
> 234179072 bytes (234 MB) copied, 144.207 s, 1.6 MB/s
> 
> Reader got a disk rate of 1.6MB/s (5 %) out of 32MB/s capacity, as opposed to
> the case CFQ and io scheduler controller where reader got around 70-80% of
> disk BW under similar work load.
> 
> Test2
> =====
> Run test2 with io scheduler based io controller
> ===============================================
> Now run almost same test with a little difference. This time I create two
> cgroups of same weight 1000. I run the 50 fio random writer in one cgroup
> and 4 sequential writers and 1 reader in second group. This test is more
> to show that proportional BW IO controller is working and because of
> reader in group1, group2 writes are not killed (providing isolation) and
> secondly, reader still gets preference over the writers which are in same
> group.
> 
> 				root
> 			     /       \		
> 			  group1     group2
> 		  (50 fio writers)   ( 4 writers and one reader)
> 
> 234179072 bytes (234 MB) copied, 12.8546 s, 18.2 MB/s
> 
> Reader finished in almost 13 seconds and got around 18MB/s. Remember when
> everything was in root group reader got around 45MB/s. This is to account
> for the fact that half of the disk is now being shared by other cgroup
> which are running 50 fio writes and reader can't steal the disk from them.
> 
> Following is some portion of iostat output when reader became active
> *********************************************************************
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            103.92         0.03        40.21          0         41
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            240.00        15.78        37.40         15         37
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            206.93        13.17        28.50         13         28
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            224.75        15.39        27.89         15         28
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            270.71        16.85        25.95         16         25
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            215.84         8.81        32.40          8         32
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            216.16        19.11        20.75         18         20
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            211.11        14.67        35.77         14         35
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            208.91        15.04        26.95         15         27
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            277.23        24.30        28.53         24         28
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            202.97        12.29        34.79         12         35
> **********************************************************************
> 
> Total disk throughput is varying a lot, on an average it looks like it
> is getting 45MB/s. Lets say 50% of that is going to cgroup1 (fio writers),
> then out of rest of 22 MB/s reader seems to have to 18MB/s. These are
> highly approximate numbers. I think I need to come up with some kind of 
> tool to measure per cgroup throughput (like we have for per partition
> stat) for more accurate comparision.
> 
> But the point is that second cgroup got the isolation and read got
> preference with-in same cgroup. The expected behavior.
> 
> Run test2 with io-throttle
> ==========================
> Same setup of two groups. The only difference is that I setup two groups
> with (16MB) limit. So previous 32MB limit got divided between two cgroups
> 50% each.
> 
> - 234179072 bytes (234 MB) copied, 90.8055 s, 2.6 MB/s
> 
> Reader took 90 seconds to finish.  It seems to have got around 16% of
> available disk BW (16MB) to it.
> 
> iostat output is long. Will just paste one section.
> 
> ************************************************************************
> [..]
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            141.58        10.16        16.12         10         16
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            174.75         8.06        12.31          7         12
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1             47.52         0.12         6.16          0          6
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1             82.00         0.00        31.85          0         31
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            141.00         0.00        48.07          0         48
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1             72.73         0.00        26.52          0         26
>  
> 
> ***************************************************************************
> 
> Conclusion
> ==========
> It just reaffirms that with max BW control, we are not doing a fair job
> of throttling hence no more hold the IO scheduler properties with-in
> cgroup.
> 
> With proportional BW controller implemented at IO scheduler level, one
> can do very tight integration with IO controller and hence retain 
> IO scheduler behavior with-in cgroup.

It was worth bugging you, I would say :). The results are definitely
interesting. I'll check if it's possible to merge part of the io-throttle
max BW control into this controller, and who knows, maybe we'll finally be
able to converge on a common proposal...

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-08 18:09                           ` Vivek Goyal
@ 2009-05-08 20:05                             ` Andrea Righi
  2009-05-08 21:56                                 ` Vivek Goyal
       [not found]                             ` <20090508180951.GG7293-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 297+ messages in thread
From: Andrea Righi @ 2009-05-08 20:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Fri, May 08, 2009 at 02:09:51PM -0400, Vivek Goyal wrote:
> On Fri, May 08, 2009 at 12:19:01AM +0200, Andrea Righi wrote:
> > On Thu, May 07, 2009 at 11:36:42AM -0400, Vivek Goyal wrote:
> > > Hmm.., my old config had "AS" as the default scheduler; that's why I was seeing
> > > the strange issue of the RT task finishing after BE. My apologies for that. I
> > > had somehow assumed that CFQ was the default scheduler in my config.
> > 
> > ok.
> > 
> > > 
> > > So I have re-run the test to see if we are still seeing the issue of
> > > losing priority and class within the cgroup. And we still do..
> > > 
> > > 2.6.30-rc4 with io-throttle patches
> > > ===================================
> > > Test1
> > > =====
> > > - Two readers, one BE prio 0 and other BE prio 7 in a cgroup limited with
> > >   8MB/s BW.
> > > 
> > > 234179072 bytes (234 MB) copied, 55.8448 s, 4.2 MB/s
> > > prio 0 task finished
> > > 234179072 bytes (234 MB) copied, 55.8878 s, 4.2 MB/s
> > > 
> > > Test2
> > > =====
> > > - Two readers, one RT prio 0 and other BE prio 7 in a cgroup limited with
> > >   8MB/s BW.
> > > 
> > > 234179072 bytes (234 MB) copied, 55.8876 s, 4.2 MB/s
> > > 234179072 bytes (234 MB) copied, 55.8984 s, 4.2 MB/s
> > > RT task finished
> > 
> > ok, coherent with the current io-throttle implementation.
> > 
> > > 
> > > Test3
> > > =====
> > > - Reader Starvation
> > > - I created a cgroup with BW limit of 64MB/s. First I just run the reader
> > >   alone and then I run reader along with 4 writers 4 times. 
> > > 
> > > Reader alone
> > > 234179072 bytes (234 MB) copied, 3.71796 s, 63.0 MB/s
> > > 
> > > Reader with 4 writers
> > > ---------------------
> > > First run
> > > 234179072 bytes (234 MB) copied, 30.394 s, 7.7 MB/s 
> > > 
> > > Second run
> > > 234179072 bytes (234 MB) copied, 26.9607 s, 8.7 MB/s
> > > 
> > > Third run
> > > 234179072 bytes (234 MB) copied, 37.3515 s, 6.3 MB/s
> > > 
> > > Fourth run
> > > 234179072 bytes (234 MB) copied, 36.817 s, 6.4 MB/s
> > > 
> > > Note that out of the 64MB/s limit of this cgroup, the reader does not get even
> > > 1/5 of the BW. On normal systems, readers are advantaged and a reader gets
> > > its job done much faster even in the presence of multiple writers.
> > 
> > And this is also coherent. The throttling is equally probable for read
> > and write. But this shouldn't happen if we saturate the physical disk BW
> > (doing proportional BW control or using a watermark close to 100 in
> > io-throttle). In this case IO scheduler logic shouldn't be totally
> > broken.
> >
> 
> Can you please explain the watermark a bit more? So blockio.watermark=90
> means 90% of what? Total disk BW? But disk BW varies based on the workload?

The controller starts to apply throttling rules only when the total disk
BW utilization is greater than 90%.

The consumed BW is evaluated as (io_ticks / cpu_ticks * 100), where
cpu_ticks are the ticks (in jiffies) elapsed since the last i/o request and
io_ticks is the difference in the ticks accounted to a particular block
device, retrieved by:

part_stat_read(bdev->bd_part, io_ticks)

BTW it's the same metric (%util) used by iostat.
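
As a rough illustration of the idea (a sketch only, not the actual io-throttle
code: the sampling state and the above_watermark() helper are hypothetical,
while part_stat_read() and the io_ticks counter are the ones mentioned above):

static bool above_watermark(struct block_device *bdev,
			    unsigned long *last_jiffies,
			    unsigned long *last_io_ticks,
			    unsigned int watermark)
{
	unsigned long now = jiffies;
	unsigned long busy_now = part_stat_read(bdev->bd_part, io_ticks);
	unsigned long elapsed = now - *last_jiffies;
	unsigned long busy = busy_now - *last_io_ticks;
	unsigned int util;

	*last_jiffies = now;
	*last_io_ticks = busy_now;

	if (!elapsed)
		return false;

	/* Same idea as iostat's %util: time the device was busy / elapsed time. */
	util = busy * 100 / elapsed;

	/* Throttling rules are applied only above the watermark (e.g. 90). */
	return util >= watermark;
}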

> 
> > Doing a very quick test with io-throttle, using a 10MB/s BW limit and
> > blockio.watermark=90:
> > 
> > Launching reader
> > 256+0 records in
> > 256+0 records out
> > 268435456 bytes (268 MB) copied, 32.2798 s, 8.3 MB/s
> > 
> > In the same time the writers wrote ~190MB, so the single reader got
> > about 1/3 of the total BW.
> > 
> > 182M testzerofile4
> > 198M testzerofile1
> > 188M testzerofile3
> > 189M testzerofile2
> > 
> 
> But it's not really a max bw controller at all now, is it? I seem to be getting a
> total BW of (268+182+198+188+189)/32 = ~32MB/s and you set the limit to
> 10MB/s?
>  

The limit of 10MB/s is applied only when the consumed disk BW hits 90%.

If the disk is not fully saturated, no limit is applied. It's nothing
more than soft limiting, to avoid wasting the unused disk BW as we would
with hard limits. From a certain point of view this is similar to the
proportional approach.

But ok, this only reduces the number of times that we block the IO
requests. The fact is that when we do apply throttling, the probability
of blocking a read or a write is still the same in this case too.

> 
> [..]
> > What are the results with your IO scheduler controller (if you already
> > have them, otherwise I'll repeat this test in my system)? It seems a
> > very interesting test to compare the advantages of the IO scheduler
> > solution respect to the io-throttle approach.
> > 
> 
> I had not done any reader writer testing so far. But you forced me to run
> some now. :-) Here are the results. 

Good! :)

> 
> Because one is a max BW controller and the other is a proportional BW controller,
> an exact comparison is hard. Still....
> 
> Test1
> =====
> Try to run lots of writers (50 random writers using fio and 4 sequential
> writers with dd if=/dev/zero) and one single reader, either in the root group
> or within one cgroup, to show that readers are not starved by writers,
> as opposed to the io-throttle controller.
> 
> Run test1 with vanilla kernel with CFQ
> =====================================
> Launched 50 fio random writers, 4 sequential writers and 1 reader in root
> and noted how long it takes reader to finish. Also noted the per second output
> from iostat -d 1 -m /dev/sdb1 to monitor how disk throughput varies.
> 
> ***********************************************************************
> # launch 50 writers fio job
> 
> fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"
> fio $fio_args --name=test2 --directory=/mnt/sdb/fio2/ --output=/mnt/sdb/fio2/test2.log > /dev/null  &
> 
> #launch 4 sequential writers
> ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile1 bs=4K count=524288 &
> ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile2 bs=4K count=524288 &
> ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile3 bs=4K count=524288 &
> ionice -c 2 -n 7 dd if=/dev/zero of=/mnt/sdb/testzerofile4 bs=4K count=524288 &
> 
> echo "Sleeping for 5 seconds"
> sleep 5
> echo "Launching reader"
> 
> ionice -c 2 -n 0 dd if=/mnt/sdb/zerofile2 of=/dev/zero &
> wait $!
> echo "Reader Finished"
> ***************************************************************************
> 
> Results
> -------
> 234179072 bytes (234 MB) copied, 4.55047 s, 51.5 MB/s
> 
> Reader finished in 4.5 seconds. Following are few lines from iostat output
> 
> ***********************************************************************
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            151.00         0.04        48.33          0         48
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            120.00         1.78        31.23          1         31
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            504.95        56.75         7.51         57          7
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            547.47        62.71         4.47         62          4
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            441.00        49.80         7.82         49          7
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            441.41        48.28        13.84         47         13
> 
> *************************************************************************
> 
> Note how write throughput picks up first, and then suddenly the reader comes in
> and CFQ allocates a huge chunk of BW to the reader to give it the advantage.
> 
> Run Test1 with IO scheduler based io controller patch
> =====================================================
> 
> 234179072 bytes (234 MB) copied, 5.23141 s, 44.8 MB/s 
> 
> The reader finishes in 5.23 seconds. Why does it take more time than with CFQ?
> Because it looks like the current algorithm is not punishing writers as hard.
> This can be fixed and is not an issue.
> 
> Following is some output from iostat.
> 
> **********************************************************************
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            139.60         0.04        43.83          0         44
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            227.72        16.88        29.05         17         29
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            349.00        35.04        16.06         35         16
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            339.00        34.16        21.07         34         21
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            343.56        36.68        12.54         37         12
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            378.00        38.68        19.47         38         19
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            532.00        59.06        10.00         59         10
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            125.00         2.62        38.82          2         38
> ************************************************************************
> 
> Note how read throughput goes up when the reader comes in. Also note that
> the writers are still getting some decent IO done, and that's why the reader
> took a little bit more time as compared to CFQ.
> 
> 
> Run Test1 with IO throttle patches
> ==================================
> 
> Now the same test is run with the io-throttle patches. The only difference is
> that I ran the test in a cgroup with a max limit of 32MB/s. That should mean
> that effectively we got a disk which can support at most a 32MB/s IO rate.
> If we look at the above CFQ and io controller results, it looks like with
> the above load we touched a peak of 70MB/s. So one can think of the same test
> being run on a disk roughly half the speed of the original disk.
> 
> 234179072 bytes (234 MB) copied, 144.207 s, 1.6 MB/s
> 
> The reader got a disk rate of 1.6MB/s (5%) out of the 32MB/s capacity, as opposed
> to the CFQ and io scheduler controller cases where the reader got around 70-80%
> of the disk BW under a similar workload.
> 
> Test2
> =====
> Run test2 with io scheduler based io controller
> ===============================================
> Now run almost the same test with a little difference. This time I create two
> cgroups of the same weight, 1000. I run the 50 fio random writers in one cgroup
> and the 4 sequential writers and 1 reader in the second group. This test is more
> to show that the proportional BW IO controller is working: because of the
> reader in group2, group1 writes are not killed (providing isolation), and
> secondly, the reader still gets preference over the writers which are in the
> same group.
> 
> 				root
> 			     /       \		
> 			  group1     group2
> 		  (50 fio writers)   ( 4 writers and one reader)
> 
> 234179072 bytes (234 MB) copied, 12.8546 s, 18.2 MB/s
> 
> The reader finished in almost 13 seconds and got around 18MB/s. Remember, when
> everything was in the root group the reader got around 45MB/s. This accounts
> for the fact that half of the disk is now being used by the other cgroup,
> which is running the 50 fio writers, and the reader can't steal the disk from it.
> 
> Following is some portion of iostat output when reader became active
> *********************************************************************
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            103.92         0.03        40.21          0         41
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            240.00        15.78        37.40         15         37
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            206.93        13.17        28.50         13         28
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            224.75        15.39        27.89         15         28
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            270.71        16.85        25.95         16         25
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            215.84         8.81        32.40          8         32
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            216.16        19.11        20.75         18         20
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            211.11        14.67        35.77         14         35
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            208.91        15.04        26.95         15         27
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            277.23        24.30        28.53         24         28
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            202.97        12.29        34.79         12         35
> **********************************************************************
> 
> Total disk throughput is varying a lot; on average it looks like it
> is getting 45MB/s. Let's say 50% of that is going to cgroup1 (fio writers);
> then out of the remaining ~22 MB/s the reader seems to have got 18MB/s. These are
> highly approximate numbers. I think I need to come up with some kind of
> tool to measure per cgroup throughput (like we have for per partition
> stats) for a more accurate comparison.
> 
> But the point is that second cgroup got the isolation and read got
> preference with-in same cgroup. The expected behavior.
> 
> Run test2 with io-throttle
> ==========================
> Same setup of two groups. The only difference is that I setup two groups
> with (16MB) limit. So previous 32MB limit got divided between two cgroups
> 50% each.
> 
> - 234179072 bytes (234 MB) copied, 90.8055 s, 2.6 MB/s
> 
> The reader took 90 seconds to finish. It seems to have got around 16%
> (2.6MB/s) of the 16MB/s disk BW available to its cgroup.
> 
> iostat output is long. Will just paste one section.
> 
> ************************************************************************
> [..]
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            141.58        10.16        16.12         10         16
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            174.75         8.06        12.31          7         12
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1             47.52         0.12         6.16          0          6
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1             82.00         0.00        31.85          0         31
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1            141.00         0.00        48.07          0         48
> 
> Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sdb1             72.73         0.00        26.52          0         26
>  
> 
> ***************************************************************************
> 
> Conclusion
> ==========
> It just reaffirms that with max BW control we are not doing a fair job
> of throttling, hence we no longer retain the IO scheduler properties within
> the cgroup.
> 
> With a proportional BW controller implemented at the IO scheduler level, one
> can integrate very tightly with the IO scheduler and hence retain
> IO scheduler behavior within the cgroup.

It was worth bugging you, I would say :). The results are definitely
interesting. I'll check whether it's possible to merge part of the io-throttle
max BW control into this controller, and who knows, maybe we'll finally be able
to converge on a common proposal...

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevaotor layer
       [not found]   ` <1241553525-28095-6-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-07  7:42     ` Gui Jianfeng
@ 2009-05-08 21:09     ` Andrea Righi
  1 sibling, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-08 21:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue, May 05, 2009 at 03:58:32PM -0400, Vivek Goyal wrote:
> +#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
> +static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
> +					struct cftype *cftype,		\
> +					u64 val)			\
> +{									\
> +	struct io_cgroup *iocg;					\
> +	struct io_group *iog;						\
> +	struct hlist_node *n;						\
> +									\
> +	if (val < (__MIN) || val > (__MAX))				\
> +		return -EINVAL;						\
> +									\
> +	if (!cgroup_lock_live_group(cgroup))				\
> +		return -ENODEV;						\
> +									\
> +	iocg = cgroup_to_io_cgroup(cgroup);				\
> +									\
> +	spin_lock_irq(&iocg->lock);					\
> +	iocg->__VAR = (unsigned long)val;				\
> +	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
> +		iog->entity.new_##__VAR = (unsigned long)val;		\
> +		smp_wmb();						\
> +		iog->entity.ioprio_changed = 1;				\
> +	}								\
> +	spin_unlock_irq(&iocg->lock);					\
> +									\
> +	cgroup_unlock();						\
> +									\
> +	return 0;							\
> +}
> +
> +STORE_FUNCTION(weight, 0, WEIGHT_MAX);

A small fix: io.weight should be strictly greater than 0 if we don't
want to automatically trigger the BUG_ON(entity->weight == 0) in
bfq_calc_finish().

Signed-off-by: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 block/elevator-fq.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9500619..de25f44 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1136,7 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	return 0;							\
 }
 
-STORE_FUNCTION(weight, 0, WEIGHT_MAX);
+STORE_FUNCTION(weight, 1, WEIGHT_MAX);
 STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
 #undef STORE_FUNCTION
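
To see why a zero weight is fatal here: in WF2Q+-style fair queuing the
virtual finish time of an entity is its start time plus the service it
received scaled by 1/weight, so weight == 0 means a division by zero. A
rough sketch of the idea (simplified, not the patchset's exact
bfq_calc_finish() code; the scaling shift is illustrative):

u64 calc_finish(u64 start, unsigned long service, unsigned long weight)
{
	/* weight == 0 would divide by zero here, hence the BUG_ON in
	 * bfq_calc_finish() and the new minimum of 1 for io.weight. */
	return start + div64_u64((u64)service << WFQ_SERVICE_SHIFT, weight);
}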

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevaotor layer
  2009-05-05 19:58 ` Vivek Goyal
  2009-05-07  7:42   ` Gui Jianfeng
       [not found]   ` <1241553525-28095-6-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-08 21:09   ` Andrea Righi
  2009-05-08 21:17     ` Vivek Goyal
  2009-05-08 21:17     ` Vivek Goyal
  2 siblings, 2 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-08 21:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, agk, dm-devel, snitzer,
	m-ikeda, akpm

On Tue, May 05, 2009 at 03:58:32PM -0400, Vivek Goyal wrote:
> +#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
> +static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
> +					struct cftype *cftype,		\
> +					u64 val)			\
> +{									\
> +	struct io_cgroup *iocg;					\
> +	struct io_group *iog;						\
> +	struct hlist_node *n;						\
> +									\
> +	if (val < (__MIN) || val > (__MAX))				\
> +		return -EINVAL;						\
> +									\
> +	if (!cgroup_lock_live_group(cgroup))				\
> +		return -ENODEV;						\
> +									\
> +	iocg = cgroup_to_io_cgroup(cgroup);				\
> +									\
> +	spin_lock_irq(&iocg->lock);					\
> +	iocg->__VAR = (unsigned long)val;				\
> +	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
> +		iog->entity.new_##__VAR = (unsigned long)val;		\
> +		smp_wmb();						\
> +		iog->entity.ioprio_changed = 1;				\
> +	}								\
> +	spin_unlock_irq(&iocg->lock);					\
> +									\
> +	cgroup_unlock();						\
> +									\
> +	return 0;							\
> +}
> +
> +STORE_FUNCTION(weight, 0, WEIGHT_MAX);

A small fix: io.weight should be strictly greater than 0 if we don't
want to automatically trigger the BUG_ON(entity->weight == 0) in
bfq_calc_finish().

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
 block/elevator-fq.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9500619..de25f44 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1136,7 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	return 0;							\
 }
 
-STORE_FUNCTION(weight, 0, WEIGHT_MAX);
+STORE_FUNCTION(weight, 1, WEIGHT_MAX);
 STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
 #undef STORE_FUNCTION
 

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevaotor layer
  2009-05-08 21:09   ` Andrea Righi
  2009-05-08 21:17     ` Vivek Goyal
@ 2009-05-08 21:17     ` Vivek Goyal
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 21:17 UTC (permalink / raw)
  To: Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, May 08, 2009 at 11:09:37PM +0200, Andrea Righi wrote:
> On Tue, May 05, 2009 at 03:58:32PM -0400, Vivek Goyal wrote:
> > +#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
> > +static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
> > +					struct cftype *cftype,		\
> > +					u64 val)			\
> > +{									\
> > +	struct io_cgroup *iocg;					\
> > +	struct io_group *iog;						\
> > +	struct hlist_node *n;						\
> > +									\
> > +	if (val < (__MIN) || val > (__MAX))				\
> > +		return -EINVAL;						\
> > +									\
> > +	if (!cgroup_lock_live_group(cgroup))				\
> > +		return -ENODEV;						\
> > +									\
> > +	iocg = cgroup_to_io_cgroup(cgroup);				\
> > +									\
> > +	spin_lock_irq(&iocg->lock);					\
> > +	iocg->__VAR = (unsigned long)val;				\
> > +	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
> > +		iog->entity.new_##__VAR = (unsigned long)val;		\
> > +		smp_wmb();						\
> > +		iog->entity.ioprio_changed = 1;				\
> > +	}								\
> > +	spin_unlock_irq(&iocg->lock);					\
> > +									\
> > +	cgroup_unlock();						\
> > +									\
> > +	return 0;							\
> > +}
> > +
> > +STORE_FUNCTION(weight, 0, WEIGHT_MAX);
> 
> A small fix: io.weight should be strictly greater than 0 if we don't
> want to automatically trigger the BUG_ON(entity->weight == 0) in
> bfq_calc_finish().
> 
> Signed-off-by: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Thanks Andrea. It worked previously because in the previous version the
interface was io.ioprio; prio 0 was allowed and we calculated the weights
from the priority.
Will include the fix in next version.

Thanks
Vivek

> ---
>  block/elevator-fq.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 9500619..de25f44 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1136,7 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
>  	return 0;							\
>  }
>  
> -STORE_FUNCTION(weight, 0, WEIGHT_MAX);
> +STORE_FUNCTION(weight, 1, WEIGHT_MAX);
>  STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
>  #undef STORE_FUNCTION
>  

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevaotor layer
  2009-05-08 21:09   ` Andrea Righi
@ 2009-05-08 21:17     ` Vivek Goyal
  2009-05-08 21:17     ` Vivek Goyal
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 21:17 UTC (permalink / raw)
  To: Andrea Righi
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, agk, dm-devel, snitzer,
	m-ikeda, akpm

On Fri, May 08, 2009 at 11:09:37PM +0200, Andrea Righi wrote:
> On Tue, May 05, 2009 at 03:58:32PM -0400, Vivek Goyal wrote:
> > +#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
> > +static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
> > +					struct cftype *cftype,		\
> > +					u64 val)			\
> > +{									\
> > +	struct io_cgroup *iocg;					\
> > +	struct io_group *iog;						\
> > +	struct hlist_node *n;						\
> > +									\
> > +	if (val < (__MIN) || val > (__MAX))				\
> > +		return -EINVAL;						\
> > +									\
> > +	if (!cgroup_lock_live_group(cgroup))				\
> > +		return -ENODEV;						\
> > +									\
> > +	iocg = cgroup_to_io_cgroup(cgroup);				\
> > +									\
> > +	spin_lock_irq(&iocg->lock);					\
> > +	iocg->__VAR = (unsigned long)val;				\
> > +	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
> > +		iog->entity.new_##__VAR = (unsigned long)val;		\
> > +		smp_wmb();						\
> > +		iog->entity.ioprio_changed = 1;				\
> > +	}								\
> > +	spin_unlock_irq(&iocg->lock);					\
> > +									\
> > +	cgroup_unlock();						\
> > +									\
> > +	return 0;							\
> > +}
> > +
> > +STORE_FUNCTION(weight, 0, WEIGHT_MAX);
> 
> A small fix: io.weight should be strictly greater than 0 if we don't
> want to automatically trigger the BUG_ON(entity->weight == 0) in
> bfq_calc_finish().
> 
> Signed-off-by: Andrea Righi <righi.andrea@gmail.com>

Thanks Andrea. It worked previously because in the previous version the
interface was io.ioprio; prio 0 was allowed and we calculated the weights
from the priority.
Will include the fix in next version.

Thanks
Vivek

> ---
>  block/elevator-fq.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 9500619..de25f44 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1136,7 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
>  	return 0;							\
>  }
>  
> -STORE_FUNCTION(weight, 0, WEIGHT_MAX);
> +STORE_FUNCTION(weight, 1, WEIGHT_MAX);
>  STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
>  #undef STORE_FUNCTION
>  

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-08 20:05                             ` Andrea Righi
@ 2009-05-08 21:56                                 ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 21:56 UTC (permalink / raw)
  To: Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Fri, May 08, 2009 at 10:05:01PM +0200, Andrea Righi wrote:

[..]
> > Conclusion
> > ==========
> > It just reaffirms that with max BW control we are not doing a fair job
> > of throttling, hence we no longer retain the IO scheduler properties within
> > the cgroup.
> > 
> > With a proportional BW controller implemented at the IO scheduler level, one
> > can integrate very tightly with the IO scheduler and hence retain
> > IO scheduler behavior within the cgroup.
> 
> It was worth bugging you, I would say :). The results are definitely
> interesting. I'll check whether it's possible to merge part of the io-throttle
> max BW control into this controller, and who knows, maybe we'll finally be able
> to converge on a common proposal...

Great. A few thoughts, though.

- What are your requirements? Do you strictly need max bw control, or
  will proportional BW control satisfy your needs? Or do you need both?

- With the current algorithm, BFQ (modified WF2Q+), we should be able
  to do proportional BW division while maintaining the properties of
  the IO scheduler within the cgroup in a hierarchical manner.
 
  I think it can be simply enhanced to do max bw control also. That is,
  whenever a queue is selected for dispatch (from the fairness point of view),
  also check the IO rate of that group; if the rate has been exceeded, expire
  the queue immediately and pretend the queue consumed its time slice, which
  would be equivalent to throttling (see the sketch after this list).

  But in this simple scheme, I think throttling is still unfair within
  the class. What I mean is the following.

  If an RT task and a BE task are in the same cgroup and the cgroup exceeds
  its max BW, the RT task is next to be dispatched from the fairness point of
  view and it will end up being throttled. This is still fine, because until
  the RT task is finished the BE task will never get to run in that cgroup,
  so at some point the cgroup rate will come down and the RT task will get
  its IO done while meeting both fairness and max bw constraints.

  But this simple scheme does not work within the same class. Say prio 0
  and prio 7 BE class readers: we will end up throttling whoever is
  scheduled to go next, and there is no mechanism to ensure that the prio 0
  and prio 7 tasks are throttled in a proportionate manner.

  So we shall have to come up with something better. I think Dhaval was
  implementing an upper limit for the cpu controller. Maybe PeterZ and Dhaval
  can give us some pointers on how they managed to implement both proportional
  and max bw control with the help of a single tree while maintaining the
  notion of prio within the cgroup.

PeterZ/Dhaval  ^^^^^^^^

- We should be able to get rid of the reader-writer issue even with the above
  simple throttling mechanism for schedulers like deadline and AS, because at
  the elevator we see a single queue (for both reads and writes) and we will
  throttle this queue. Dispatch within the queue is taken care of by the io
  scheduler. So as long as IO has been queued in the queue, the scheduler
  will take care of giving the advantage to readers even if throttling is
  taking place on the queue.
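
  A minimal sketch of the expiry idea mentioned above (pseudo-kernel code;
  the helpers iog_update_io_rate(), iog_rate_exceeded() and
  elv_ioq_slice_expired() are hypothetical names, not functions from the
  patchset):

	/* Called when the fair-queuing layer has picked ioq for dispatch. */
	static int iog_check_max_bw(struct io_group *iog, struct io_queue *ioq)
	{
		/* Account the service this group has received recently. */
		iog_update_io_rate(iog);

		if (!iog_rate_exceeded(iog))
			return 0;	/* under its max bw, dispatch normally */

		/*
		 * Over the configured max bw: expire the queue right away and
		 * charge it a full slice, so from the fairness point of view
		 * it looks as if the queue used up its time slice.
		 */
		elv_ioq_slice_expired(ioq, 1 /* charge full slice */);
		return 1;
	}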

Why am I thinking out loud? So that we know what we are trying to achieve at
the end of the day. So at this point, what are the advantages/disadvantages
of doing max bw control along with proportional bw control?

Advantages
==========
- With a combined code base, the total amount of code should be less than if
  both of them were implemented separately.

- There can be a few advantages in terms of maintaining the notion of the IO
  scheduler within the cgroup (like RT tasks always going first in the presence
  of BE and IDLE tasks, etc. But a simple throttling scheme will not take
  care of fair throttling within a class; we need a better algorithm to
  achieve that goal).

- We will probably get rid of the reader-writer issue for single-queue
  schedulers like deadline and AS. (Need to run tests and see.)

Disadvantages
=============
- Implementation at the IO scheduler/elevator layer does not cover higher
  level logical devices. So one can do max bw control only at the leaf nodes
  where the IO scheduler is running, and not at intermediate logical nodes.

I personally think that proportional BW control will meet more people's
needs as compared to max bw control.

So far nobody has come up with a solution where a single proposal covers
all the cases without breaking things. So personally, I want to make
things work at least at the IO scheduler level and cover as much ground as
possible without breaking things (hardware RAID, all the directly attached
devices, etc.) and then worry about higher level software devices.

Thoughts?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
@ 2009-05-08 21:56                                 ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-08 21:56 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Andrew Morton, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, linux-kernel, containers,
	agk, dm-devel, snitzer, m-ikeda, peterz

On Fri, May 08, 2009 at 10:05:01PM +0200, Andrea Righi wrote:

[..]
> > Conclusion
> > ==========
> > It just reaffirms that with max BW control we are not doing a fair job
> > of throttling, hence we no longer retain the IO scheduler properties within
> > the cgroup.
> > 
> > With a proportional BW controller implemented at the IO scheduler level, one
> > can integrate very tightly with the IO scheduler and hence retain
> > IO scheduler behavior within the cgroup.
> 
> It was worth bugging you, I would say :). The results are definitely
> interesting. I'll check whether it's possible to merge part of the io-throttle
> max BW control into this controller, and who knows, maybe we'll finally be able
> to converge on a common proposal...

Great. A few thoughts, though.

- What are your requirements? Do you strictly need max bw control, or
  will proportional BW control satisfy your needs? Or do you need both?

- With the current algorithm, BFQ (modified WF2Q+), we should be able
  to do proportional BW division while maintaining the properties of
  the IO scheduler within the cgroup in a hierarchical manner.
 
  I think it can be simply enhanced to do max bw control also. That is,
  whenever a queue is selected for dispatch (from the fairness point of view),
  also check the IO rate of that group; if the rate has been exceeded, expire
  the queue immediately and pretend the queue consumed its time slice, which
  would be equivalent to throttling.

  But in this simple scheme, I think throttling is still unfair within
  the class. What I mean is the following.

  If an RT task and a BE task are in the same cgroup and the cgroup exceeds
  its max BW, the RT task is next to be dispatched from the fairness point of
  view and it will end up being throttled. This is still fine, because until
  the RT task is finished the BE task will never get to run in that cgroup,
  so at some point the cgroup rate will come down and the RT task will get
  its IO done while meeting both fairness and max bw constraints.

  But this simple scheme does not work within the same class. Say prio 0
  and prio 7 BE class readers: we will end up throttling whoever is
  scheduled to go next, and there is no mechanism to ensure that the prio 0
  and prio 7 tasks are throttled in a proportionate manner.

  So we shall have to come up with something better. I think Dhaval was
  implementing an upper limit for the cpu controller. Maybe PeterZ and Dhaval
  can give us some pointers on how they managed to implement both proportional
  and max bw control with the help of a single tree while maintaining the
  notion of prio within the cgroup.

PeterZ/Dhaval  ^^^^^^^^

- We should be able to get rid of the reader-writer issue even with the above
  simple throttling mechanism for schedulers like deadline and AS, because at
  the elevator we see a single queue (for both reads and writes) and we will
  throttle this queue. Dispatch within the queue is taken care of by the io
  scheduler. So as long as IO has been queued in the queue, the scheduler
  will take care of giving the advantage to readers even if throttling is
  taking place on the queue.

Why am I thinking out loud? So that we know what we are trying to achieve at
the end of the day. So at this point, what are the advantages/disadvantages
of doing max bw control along with proportional bw control?

Advantages
==========
- With a combined code base, the total amount of code should be less than if
  both of them were implemented separately.

- There can be a few advantages in terms of maintaining the notion of the IO
  scheduler within the cgroup (like RT tasks always going first in the presence
  of BE and IDLE tasks, etc. But a simple throttling scheme will not take
  care of fair throttling within a class; we need a better algorithm to
  achieve that goal).

- We will probably get rid of the reader-writer issue for single-queue
  schedulers like deadline and AS. (Need to run tests and see.)

Disadvantages
=============
- Implementation at the IO scheduler/elevator layer does not cover higher
  level logical devices. So one can do max bw control only at the leaf nodes
  where the IO scheduler is running, and not at intermediate logical nodes.

I personally think that proportional BW control will meet more people's
needs as compared to max bw control.

So far nobody has come up with a solution where a single proposal covers
all the cases without breaking things. So personally, I want to make
things work at least at the IO scheduler level and cover as much ground as
possible without breaking things (hardware RAID, all the directly attached
devices, etc.) and then worry about higher level software devices.

Thoughts?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]                                 ` <20090508215618.GJ7293-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-09  9:22                                   ` Peter Zijlstra
  2009-05-14 10:31                                   ` Andrea Righi
  2009-05-14 16:43                                     ` Dhaval Giani
  2 siblings, 0 replies; 297+ messages in thread
From: Peter Zijlstra @ 2009-05-09  9:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	Andrea Righi, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Fri, 2009-05-08 at 17:56 -0400, Vivek Goyal wrote:
>   So, we shall have to come up with something better, I think Dhaval was
>   implementing upper limit for cpu controller. May be PeterZ and Dhaval can
>   give us some pointers how did they manage to implement both proportional
>   and max bw control with the help of a single tree while maintaining the
>   notion of prio with-in cgroup.

We don't do max bandwidth control in the SCHED_OTHER bits, as I am opposed
to making it non-work-conserving.

SCHED_FIFO/RR do constant bandwidth things and are always scheduled in
favour of SCHED_OTHER.

That is, we provide a minimum bandwidth for real-time tasks, but since a
maximum higher than the minimum is useless (one cannot rely on it, as it is
non-deterministic) we put max = min.

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-08 21:56                                 ` Vivek Goyal
  (?)
@ 2009-05-09  9:22                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 297+ messages in thread
From: Peter Zijlstra @ 2009-05-09  9:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Andrea Righi, Andrew Morton, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, jens.axboe, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, linux-kernel,
	containers, agk, dm-devel, snitzer, m-ikeda

On Fri, 2009-05-08 at 17:56 -0400, Vivek Goyal wrote:
>   So, we shall have to come up with something better, I think Dhaval was
>   implementing upper limit for cpu controller. May be PeterZ and Dhaval can
>   give us some pointers how did they manage to implement both proportional
>   and max bw control with the help of a single tree while maintaining the
>   notion of prio with-in cgroup.

We don't do max bandwidth control in the SCHED_OTHER bits, as I am opposed
to making it non-work-conserving.

SCHED_FIFO/RR do constant bandwidth things and are always scheduled in
favour of SCHED_OTHER.

That is, we provide a minimum bandwidth for real-time tasks, but since a
maximum higher than the minimum is useless (one cannot rely on it, as it is
non-deterministic) we put max = min.


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]         ` <e98e18940905081041r386e52a5q5a2b1f13f1e8c634-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-05-08 18:56           ` Vivek Goyal
@ 2009-05-11  1:33           ` Gui Jianfeng
  1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-11  1:33 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Nauman Rafique wrote:
> On Fri, May 8, 2009 at 6:57 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On Fri, May 08, 2009 at 05:45:32PM +0800, Gui Jianfeng wrote:
>>> Hi Vivek,
>>>
>>> This patch adds io group reference handling when allocating
>>> and removing a request.
>>>
>> Hi Gui,
>>
>> Thanks for the patch. We were thinking that requests can take a reference
>> on io queues and io queues can take a reference on io groups. That should
>> make sure that io groups don't go away as long as active requests are
>> present.
>>
>> But there seems to be a small window while allocating the new request
>> where request gets allocated from a group first and then later it is
>> mapped to that group and queue is created. IOW, in get_request_wait(),
>> we allocate a request from a particular group and set rq->rl, then
>> drop the queue lock and later call elv_set_request() which again maps
>> the request to the group saves rq->iog and creates new queue. This window
>> is troublesome because request can be mapped to a particular group at the
>> time of allocation and during set_request() it can go to a different
>> group as queue lock was dropped and group might have disappeared.
>>
>> In this case probably it might make sense that request also takes a
>> reference on groups. At the same time it looks too much that request takes
>> a reference on queue as well as group object. Ideas are welcome on how
>> to handle it...
> 
> IMHO a request being allocated on the wrong cgroup should not be a big
> problem as such. All it means is that the request descriptor was
> accounted to the wrong cgroup in this particular corner case. Please
> correct me if I am wrong.
> 
> We can also get rid of rq->iog pointer too. What that means is that
> request is associated with ioq (rq->ioq), and we can use
> ioq_to_io_group() function to get the io_group. So the request would
> only be indirectly associated with an io_group i.e. the request is
> associated with an io_queue and the io_group for the request is the
> io_group associated with io_queue. Do you see any problems with that
> approach?

  That sounds reasonable to get rid of rq->iog, and rq->rl is also dead.
  Hope to see the patch soon. ;)
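
A minimal sketch of the indirection Nauman describes (assuming rq->ioq and
ioq_to_io_group() from the patchset; the helper name rq_to_io_group() here
is hypothetical):

	/*
	 * The request no longer carries rq->iog; its group is reached
	 * through the io_queue it was queued on.
	 */
	static inline struct io_group *rq_to_io_group(struct request *rq)
	{
		return ioq_to_io_group(rq->ioq);
	}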

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for  request
  2009-05-08 17:41         ` Nauman Rafique
  (?)
  (?)
@ 2009-05-11  1:33         ` Gui Jianfeng
  2009-05-11 15:41           ` Vivek Goyal
       [not found]           ` <4A078051.5060702-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  -1 siblings, 2 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-11  1:33 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: Vivek Goyal, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Nauman Rafique wrote:
> On Fri, May 8, 2009 at 6:57 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> On Fri, May 08, 2009 at 05:45:32PM +0800, Gui Jianfeng wrote:
>>> Hi Vivek,
>>>
>>> This patch adds io group reference handling when allocating
>>> and removing a request.
>>>
>> Hi Gui,
>>
>> Thanks for the patch. We were thinking that requests can take a reference
>> on io queues and io queues can take a reference on io groups. That should
>> make sure that io groups don't go away as long as active requests are
>> present.
>>
>> But there seems to be a small window while allocating the new request
>> where request gets allocated from a group first and then later it is
>> mapped to that group and queue is created. IOW, in get_request_wait(),
>> we allocate a request from a particular group and set rq->rl, then
>> drop the queue lock and later call elv_set_request() which again maps
>> the request to the group saves rq->iog and creates new queue. This window
>> is troublesome because request can be mapped to a particular group at the
>> time of allocation and during set_request() it can go to a different
>> group as queue lock was dropped and group might have disappeared.
>>
>> In this case probably it might make sense that request also takes a
>> reference on groups. At the same time it looks too much that request takes
>> a reference on queue as well as group object. Ideas are welcome on how
>> to handle it...
> 
> IMHO a request being allocated on the wrong cgroup should not be a big
> problem as such. All it means is that the request descriptor was
> accounted to the wrong cgroup in this particular corner case. Please
> correct me if I am wrong.
> 
> We can also get rid of rq->iog pointer too. What that means is that
> request is associated with ioq (rq->ioq), and we can use
> ioq_to_io_group() function to get the io_group. So the request would
> only be indirectly associated with an io_group i.e. the request is
> associated with an io_queue and the io_group for the request is the
> io_group associated with io_queue. Do you see any problems with that
> approach?

  That sounds reasonable to get rid of rq->iog, and rq->rl is also dead.
  Hope to see the patch soon. ;)

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]             ` <20090508133740.GD7293-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-11  2:59               ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-11  2:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
...
>>
> 
> Thanks Li and Gui for pointing out the problem. With your script, I could
> also reproduce the lock validator warning as well as the system freeze. I could
> identify at least two trouble spots. With the following patch things seem
> to be fine on my system. Can you please give it a try?

  Hi Vivek,

  I've tried this patch, and it seems the problem is addressed. Thanks.

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-08 13:37             ` Vivek Goyal
  (?)
@ 2009-05-11  2:59             ` Gui Jianfeng
  -1 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-11  2:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Li Zefan, nauman, dpshah, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
...
>>
> 
> Thanks Li and Gui for pointing out the problem. With your script, I could
> also reproduce the lock validator warning as well as the system freeze. I could
> identify at least two trouble spots. With the following patch things seem
> to be fine on my system. Can you please give it a try?

  Hi Vivek,

  I've tried this patch, and it seems the problem is addressed. Thanks.

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]           ` <4A0440B2.7040300-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-11 10:11             ` Ryo Tsuruta
  0 siblings, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-11 10:11 UTC (permalink / raw)
  To: riel-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Rik,

From: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: IO scheduler based IO Controller V2
Date: Fri, 08 May 2009 10:24:50 -0400

> Ryo Tsuruta wrote:
> > Hi Vivek,
> > 
> >> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> >> of FIFO dispatch of buffered bios. Apart from that, it tries to provide
> >> fairness in terms of actual IO done, and that would mean a seeky workload
> >> can use the disk for much longer to get equivalent IO done and slow down
> >> other applications. Implementing the IO controller at the IO scheduler level
> >> gives us tighter control. Will it not meet your requirements? If you have
> >> specific concerns with the IO scheduler based control patches, please
> >> highlight these and we will see how they can be addressed.
> > I'd like to avoid complicating the existing IO schedulers and other
> > kernel code, and to give users a choice whether or not to use it.
> > I know that you chose an approach of using compile-time options to
> > get the same behavior as the old system, but device-mapper drivers can be
> > added, removed and replaced while the system is running.
> 
> I do not believe that every use of cgroups will end up with
> a separate logical volume for each group.
> 
> In fact, if you look at group-per-UID usage, which could be
> quite common on shared web servers and shell servers, I would
> expect all the groups to share the same filesystem.
> 
> I do not believe dm-ioband would be useful in that configuration,
> while the IO scheduler based IO controller will just work.

dm-ioband can control bandwidth on a per-cgroup basis, the same as
Vivek's IO controller. Could you explain what you want to do and
how you would configure the IO scheduler based IO controller in that case?

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-08 14:24         ` Rik van Riel
       [not found]           ` <4A0440B2.7040300-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-11 10:11           ` Ryo Tsuruta
  1 sibling, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-11 10:11 UTC (permalink / raw)
  To: riel
  Cc: vgoyal, akpm, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, linux-kernel, containers, righi.andrea,
	agk, dm-devel, snitzer, m-ikeda, peterz

Hi Rik,

From: Rik van Riel <riel@redhat.com>
Subject: Re: IO scheduler based IO Controller V2
Date: Fri, 08 May 2009 10:24:50 -0400

> Ryo Tsuruta wrote:
> > Hi Vivek,
> > 
> >> Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> >> of FIFO dispatch of buffered bios. Apart from that, it tries to provide
> >> fairness in terms of actual IO done, and that would mean a seeky workload
> >> can use the disk for much longer to get equivalent IO done and slow down
> >> other applications. Implementing the IO controller at the IO scheduler level
> >> gives us tighter control. Will it not meet your requirements? If you have
> >> specific concerns with the IO scheduler based control patches, please
> >> highlight these and we will see how they can be addressed.
> > I'd like to avoid complicating the existing IO schedulers and other
> > kernel code, and to give users a choice whether or not to use it.
> > I know that you chose an approach of using compile-time options to
> > get the same behavior as the old system, but device-mapper drivers can be
> > added, removed and replaced while the system is running.
> 
> I do not believe that every use of cgroups will end up with
> a separate logical volume for each group.
> 
> In fact, if you look at group-per-UID usage, which could be
> quite common on shared web servers and shell servers, I would
> expect all the groups to share the same filesystem.
> 
> I do not believe dm-ioband would be useful in that configuration,
> while the IO scheduler based IO controller will just work.

dm-ioband can control bandwidth on a per-cgroup basis, the same as
Vivek's IO controller. Could you explain what you want to do and
how you would configure the IO scheduler based IO controller in that case?

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]             ` <20090507012559.GC4187-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-11 11:23               ` Ryo Tsuruta
  0 siblings, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-11 11:23 UTC (permalink / raw)
  To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Hi Vivek,

From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: IO scheduler based IO Controller V2
Date: Wed, 6 May 2009 21:25:59 -0400

> On Thu, May 07, 2009 at 09:18:58AM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> > 
> > > Ryo, dm-ioband breaks the notion of classes and priority of CFQ because
> > > of FIFO dispatch of buffered bios. Apart from that, it tries to provide
> > > fairness in terms of actual IO done, and that would mean a seeky workload
> > > can use the disk for much longer to get equivalent IO done and slow down
> > > other applications. Implementing the IO controller at the IO scheduler level
> > > gives us tighter control. Will it not meet your requirements? If you have
> > > specific concerns with the IO scheduler based control patches, please
> > > highlight these and we will see how they can be addressed.
> > 
> > I'd like to avoid complicating the existing IO schedulers and other
> > kernel code, and to give users a choice whether or not to use it.
> > I know that you chose an approach of using compile-time options to
> > get the same behavior as the old system, but device-mapper drivers can be
> > added, removed and replaced while the system is running.
> > 
> 
> The same is possible with the IO scheduler based controller. If you don't
> want the cgroup stuff, don't create any cgroups. By default everything will
> be in the root group and you will get the old behavior.
> 
> If you want the IO controller functionality, just create a cgroup, assign a
> weight and move tasks there. So what additional choices do you want that are
> missing here?

What I mean to say is that the device-mapper drivers can be completely
removed from the kernel if they are not used.

I know that dm-ioband has some issues which can be addressed by your
IO controller, but I'm not sure how well your controller works. So I
would like to see some benchmark results for it.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-11 11:23             ` Ryo Tsuruta
@ 2009-05-11 12:49                   ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-11 12:49 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Mon, May 11, 2009 at 08:23:09PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Subject: Re: IO scheduler based IO Controller V2
> Date: Wed, 6 May 2009 21:25:59 -0400
> 
> > On Thu, May 07, 2009 at 09:18:58AM +0900, Ryo Tsuruta wrote:
> > > Hi Vivek,
> > > 
> > > > Ryo, dm-ioband breaks the notion of classes and priorities of CFQ because
> > > > of the FIFO dispatch of buffered bios. Apart from that, it tries to provide
> > > > fairness in terms of actual IO done, and that would mean a seeky workload
> > > > can use the disk for much longer to get equivalent IO done and slow down
> > > > other applications. Implementing the IO controller at the IO scheduler level
> > > > gives us tighter control. Will it not meet your requirements? If you have
> > > > specific concerns with the IO scheduler based control patches, please
> > > > highlight them and we will see how they can be addressed.
> > > 
> > > I'd like to avoid complicating the existing IO schedulers and other
> > > kernel code, and to give users a choice of whether or not to use it.
> > > I know that you chose an approach that uses compile-time options to
> > > get the same behavior as the old system, but device-mapper drivers can
> > > be added, removed and replaced while the system is running.
> > > 
> > 
> > The same is possible with the IO scheduler based controller. If you don't
> > want the cgroup stuff, don't create any cgroups. By default everything will
> > be in the root group and you will get the old behavior.
> > 
> > If you want the IO controller functionality, just create a cgroup, assign a
> > weight and move tasks there. So what additional choices do you want that are
> > missing here?
> 
> What I mean to say is that the device-mapper drivers can be completely
> removed from the kernel if they are not used.
> 
> I know that dm-ioband has some issues which can be addressed by your
> IO controller, but I'm not sure how well your controller works. So I
> would like to see some benchmark results for it.
> 

Fair enough. The IO scheduler based IO controller is still a work in progress
and we have only just started to get the basics right. I think that after 3-4
more iterations the patches will be stable and functional enough that I should
be able to post some benchmark numbers as well.

Currently I am posting intermediate snapshots of my tree to lkml to get
design feedback, so that if there are fundamental design issues we can sort
them out.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]           ` <4A078051.5060702-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-11 15:41             ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-11 15:41 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Mon, May 11, 2009 at 09:33:05AM +0800, Gui Jianfeng wrote:
> Nauman Rafique wrote:
> > On Fri, May 8, 2009 at 6:57 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >> On Fri, May 08, 2009 at 05:45:32PM +0800, Gui Jianfeng wrote:
> >>> Hi Vivek,
> >>>
> >>> This patch adds io group reference handling when allocating
> >>> and removing a request.
> >>>
> >> Hi Gui,
> >>
> >> Thanks for the patch. We were thinking that requests can take a reference
> >> on io queues and io queues can take a reference on io groups. That should
> >> make sure that io groups don't go away as long as active requests are
> >> present.
> >>
> >> But there seems to be a small window while allocating a new request
> >> where the request gets allocated from a group first and only later is
> >> mapped to that group and the queue created. IOW, in get_request_wait(),
> >> we allocate a request from a particular group and set rq->rl, then
> >> drop the queue lock and later call elv_set_request(), which again maps
> >> the request to a group, saves rq->iog and creates a new queue. This window
> >> is troublesome because the request can be mapped to a particular group at
> >> the time of allocation, and during set_request() it can go to a different
> >> group, as the queue lock was dropped and the group might have disappeared.
> >>
> >> In this case it probably makes sense for the request to also take a
> >> reference on the group. At the same time it seems like too much for a
> >> request to take a reference on both the queue and the group object.
> >> Ideas are welcome on how to handle it...
> > 
> > IMHO a request being allocated on the wrong cgroup should not be a big
> > problem as such. All it means is that the request descriptor was
> > accounted to the wrong cgroup in this particular corner case. Please
> > correct me if I am wrong.
> > 
> > We can also get rid of the rq->iog pointer. What that means is that the
> > request is associated with an ioq (rq->ioq), and we can use the
> > ioq_to_io_group() function to get the io_group. So the request would
> > only be indirectly associated with an io_group, i.e. the request is
> > associated with an io_queue and the io_group for the request is the
> > io_group associated with that io_queue. Do you see any problems with
> > that approach?
> 
>   That sounds reasonable to get rid of rq->iog, and rq->rl is also dead.
>   Hope to see the patch soon. ;)
>

OK, here is the patch which gets rid of the rq->iog and rq->rl fields. It is
good to see some code and data structure trimming. It seems to be working
fine for me.


o Get rid of the rq->iog and rq->rl fields. The request descriptor stores a
  pointer to the io queue it belongs to (rq->ioq), and from the io queue one
  can determine the group the queue, and hence the request, belongs to. Thanks
  to Nauman for the idea (a short illustrative sketch follows this changelog).

o There are a couple of places where the rq->ioq information is not present
  yet, because the request and queue are still being set up. In those places
  "bio" is passed around as a function argument to determine the group the rq
  will go into. I did not pass "iog" as a function argument because when
  memory is scarce, we can release the queue lock and sleep to wait for memory
  to become available, and once we wake up it is possible that the io group is
  gone. Passing the bio around means one simply remaps the bio to the right
  group after waking up.

o Got rid of the io_lookup_io_group_current() function and merged it into
  io_get_io_group(), which now also takes care of looking up the group using
  the current task's info instead of the bio.
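
To make the two points above concrete, here is a rough sketch in code form.
It is illustrative only and not part of the patch below; rq_to_io_group() and
remap_bio_to_group() are made-up helper names, while ioq_to_io_group() and
io_get_io_group() (with its create/curr arguments) are the helpers this
series provides.

	/* 1. With rq->iog gone, the group is reached through the io queue. */
	static inline struct io_group *rq_to_io_group(struct request *rq)
	{
		BUG_ON(!rq->ioq);
		return ioq_to_io_group(rq->ioq);
	}

	/*
	 * 2. Allocation paths that may sleep look up the group from the bio
	 *    again after re-taking the queue lock, since the original group
	 *    may have been deleted while the lock was dropped.
	 */
	static struct io_group *remap_bio_to_group(struct request_queue *q,
						   struct bio *bio)
	{
		/* create=1: set up the group if it is not there yet;
		 * curr=0: derive the group from the bio, not the current task */
		return io_get_io_group(q, bio, 1, 0);
	}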

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-core.c         |   28 +++++++++-------
 block/cfq-iosched.c      |   40 ++++++++++++++++--------
 block/elevator-fq.c      |   78 ++++++++++++++++++-----------------------------
 block/elevator-fq.h      |   29 +++--------------
 block/elevator.c         |    6 +--
 include/linux/blkdev.h   |   16 ++++-----
 include/linux/elevator.h |    2 -
 7 files changed, 91 insertions(+), 108 deletions(-)

Index: linux14/include/linux/blkdev.h
===================================================================
--- linux14.orig/include/linux/blkdev.h	2009-05-11 10:51:33.000000000 -0400
+++ linux14/include/linux/blkdev.h	2009-05-11 11:35:27.000000000 -0400
@@ -279,12 +279,6 @@ struct request {
 #ifdef CONFIG_ELV_FAIR_QUEUING
 	/* io queue request belongs to */
 	struct io_queue *ioq;
-
-#ifdef CONFIG_GROUP_IOSCHED
-	/* io group request belongs to */
-	struct io_group *iog;
-	struct request_list *rl;
-#endif /* GROUP_IOSCHED */
 #endif /* ELV_FAIR_QUEUING */
 };
 
@@ -828,9 +822,15 @@ static inline struct request_list *rq_rl
 						struct request *rq)
 {
 #ifdef CONFIG_GROUP_IOSCHED
-	return rq->rl;
+	struct io_group *iog;
+
+	BUG_ON(!rq->ioq);
+	iog = ioq_to_io_group(rq->ioq);
+	BUG_ON(!iog);
+
+	return &iog->rl;
 #else
-	return blk_get_request_list(q, NULL);
+	return &q->rq;
 #endif
 }
 
Index: linux14/block/elevator-fq.c
===================================================================
--- linux14.orig/block/elevator-fq.c	2009-05-11 10:52:49.000000000 -0400
+++ linux14/block/elevator-fq.c	2009-05-11 11:28:19.000000000 -0400
@@ -1006,7 +1006,7 @@ struct request_list *io_group_get_reques
 {
 	struct io_group *iog;
 
-	iog = io_get_io_group_bio(q, bio, 1);
+	iog = io_get_io_group(q, bio, 1, 0);
 	BUG_ON(!iog);
 	return &iog->rl;
 }
@@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
 /*
  * Find the io group bio belongs to.
  * If "create" is set, io group is created if it is not already present.
+ * If "curr" is set, the io group information is looked up for the current
+ * task rather than with the help of the bio.
+ *
+ * FIXME: Can we assume that a NULL bio means "look up the group for the
+ * current task", and so avoid the extra function parameter?
  *
- * Note: There is a narrow window of race where a group is being freed
- * by cgroup deletion path and some rq has slipped through in this group.
- * Fix it.
  */
-struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
-					int create)
+struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
+					int create, int curr)
 {
 	struct cgroup *cgroup;
 	struct io_group *iog;
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 
 	rcu_read_lock();
-	cgroup = get_cgroup_from_bio(bio);
+
+	if (curr)
+		cgroup = task_cgroup(current, io_subsys_id);
+	else
+		cgroup = get_cgroup_from_bio(bio);
+
 	if (!cgroup) {
 		if (create)
 			iog = efqd->root_group;
@@ -1500,7 +1507,7 @@ out:
 	rcu_read_unlock();
 	return iog;
 }
-EXPORT_SYMBOL(io_get_io_group_bio);
+EXPORT_SYMBOL(io_get_io_group);
 
 void io_free_root_group(struct elevator_queue *e)
 {
@@ -1952,7 +1959,7 @@ int io_group_allow_merge(struct request 
 		return 1;
 
 	/* Determine the io group of the bio submitting task */
-	iog = io_get_io_group_bio(q, bio, 0);
+	iog = io_get_io_group(q, bio, 0, 0);
 	if (!iog) {
 		/* May be task belongs to a differet cgroup for which io
 		 * group has not been setup yet. */
@@ -1965,25 +1972,6 @@ int io_group_allow_merge(struct request 
 	return (iog == __iog);
 }
 
-/* find/create the io group request belongs to and put that info in rq */
-void elv_fq_set_request_io_group(struct request_queue *q, struct request *rq,
-					struct bio *bio)
-{
-	struct io_group *iog;
-	unsigned long flags;
-
-	/* Make sure io group hierarchy has been setup and also set the
-	 * io group to which rq belongs. Later we should make use of
-	 * bio cgroup patches to determine the io group */
-	spin_lock_irqsave(q->queue_lock, flags);
-	iog = io_get_io_group_bio(q, bio, 1);
-	spin_unlock_irqrestore(q->queue_lock, flags);
-	BUG_ON(!iog);
-
-	/* Store iog in rq. TODO: take care of referencing */
-	rq->iog = iog;
-}
-
 /*
  * Find/Create the io queue the rq should go in. This is an optimization
  * for the io schedulers (noop, deadline and AS) which maintain only single
@@ -1995,7 +1983,7 @@ void elv_fq_set_request_io_group(struct 
  * function is not invoked.
  */
 int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
-					gfp_t gfp_mask)
+					struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 	unsigned long flags;
@@ -2009,11 +1997,15 @@ int elv_fq_set_request_ioq(struct reques
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 	spin_lock_irqsave(q->queue_lock, flags);
 
+retry:
 	/* Determine the io group request belongs to */
-	iog = rq->iog;
+	if (bio)
+		iog = io_get_io_group(q, bio, 1, 0);
+	else
+		iog = io_get_io_group(q, bio, 1, 1);
+
 	BUG_ON(!iog);
 
-retry:
 	/* Get the iosched queue */
 	ioq = io_group_ioq(iog);
 	if (!ioq) {
@@ -2071,7 +2063,7 @@ alloc_ioq:
 			}
 		}
 
-		elv_init_ioq(e, ioq, rq->iog, sched_q, IOPRIO_CLASS_BE, 4, 1);
+		elv_init_ioq(e, ioq, iog, sched_q, IOPRIO_CLASS_BE, 4, 1);
 		io_group_set_ioq(iog, ioq);
 		elv_mark_ioq_sync(ioq);
 		/* ioq reference on iog */
@@ -2106,7 +2098,7 @@ struct io_queue *elv_lookup_ioq_bio(stru
 	struct io_group *iog;
 
 	/* lookup the io group and io queue of the bio submitting task */
-	iog = io_get_io_group_bio(q, bio, 0);
+	iog = io_get_io_group(q, bio, 0, 0);
 	if (!iog) {
 		/* May be bio belongs to a cgroup for which io group has
 		 * not been setup yet. */
@@ -2166,12 +2158,12 @@ struct io_group *io_lookup_io_group_curr
 }
 EXPORT_SYMBOL(io_lookup_io_group_current);
 
-struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
-					int create)
+struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
+					int create, int curr)
 {
 	return q->elevator->efqd.root_group;
 }
-EXPORT_SYMBOL(io_get_io_group_bio);
+EXPORT_SYMBOL(io_get_io_group);
 
 void io_free_root_group(struct elevator_queue *e)
 {
@@ -2180,16 +2172,6 @@ void io_free_root_group(struct elevator_
 	kfree(iog);
 }
 
-struct io_group *io_get_io_group(struct request_queue *q, int create)
-{
-	return q->elevator->efqd.root_group;
-}
-
-struct io_group *rq_iog(struct request_queue *q, struct request *rq)
-{
-	return q->elevator->efqd.root_group;
-}
-
 #endif /* CONFIG_GROUP_IOSCHED*/
 
 /* Elevator fair queuing function */
@@ -3128,7 +3110,9 @@ void elv_ioq_request_add(struct request_
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 		{
 			char path[128];
-			io_group_path(rq_iog(q, rq), path, sizeof(path));
+			struct io_group *iog = ioq_to_io_group(ioq);
+
+			io_group_path(iog, path, sizeof(path));
 			elv_log_ioq(efqd, ioq, "add rq: group path=%s "
 					"rq_queued=%d", path, ioq->nr_queued);
 		}
Index: linux14/include/linux/elevator.h
===================================================================
--- linux14.orig/include/linux/elevator.h	2009-05-11 10:51:33.000000000 -0400
+++ linux14/include/linux/elevator.h	2009-05-11 10:52:51.000000000 -0400
@@ -23,7 +23,7 @@ typedef struct request *(elevator_reques
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_may_queue_fn) (struct request_queue *, int);
 
-typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
+typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, struct bio *bio, gfp_t);
 typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
Index: linux14/block/cfq-iosched.c
===================================================================
--- linux14.orig/block/cfq-iosched.c	2009-05-11 10:52:47.000000000 -0400
+++ linux14/block/cfq-iosched.c	2009-05-11 10:52:51.000000000 -0400
@@ -161,7 +161,7 @@ CFQ_CFQQ_FNS(coop);
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct io_group *iog,
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio *bio,
 					int, struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
@@ -196,7 +196,7 @@ static struct cfq_queue *cic_bio_to_cfqq
 		 * async bio tracking is enabled and we are not caching
 		 * async queue pointer in cic.
 		 */
-		iog = io_get_io_group_bio(cfqd->queue, bio, 0);
+		iog = io_get_io_group(cfqd->queue, bio, 0, 0);
 		if (!iog) {
 			/*
 			 * May be this is first rq/bio and io group has not
@@ -1242,7 +1242,6 @@ static void changed_ioprio(struct io_con
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
 
 	if (cfqq) {
-		struct io_group *iog = io_lookup_io_group_current(q);
 		struct cfq_queue *new_cfqq;
 
 		/*
@@ -1259,7 +1258,7 @@ static void changed_ioprio(struct io_con
 		 * comes? Keeping it for the time being because existing cfq
 		 * code allocates the new queue immediately upon prio change
 		 */
-		new_cfqq = cfq_get_queue(cfqd, iog, BLK_RW_ASYNC, cic->ioc,
+		new_cfqq = cfq_get_queue(cfqd, NULL, BLK_RW_ASYNC, cic->ioc,
 						GFP_ATOMIC);
 		if (new_cfqq)
 			cic_set_cfqq(cic, new_cfqq, BLK_RW_ASYNC);
@@ -1295,7 +1294,7 @@ static void changed_cgroup(struct io_con
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
-	iog = io_lookup_io_group_current(q);
+	iog = io_get_io_group(q, NULL, 0, 1);
 
 	if (async_cfqq != NULL) {
 		__iog = cfqq_to_io_group(async_cfqq);
@@ -1332,14 +1331,25 @@ static void cfq_ioc_set_cgroup(struct io
 #endif  /* CONFIG_IOSCHED_CFQ_HIER */
 
 static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, struct io_group *iog, int is_sync,
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
 				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
 	struct request_queue *q = cfqd->queue;
 	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog = NULL;
 retry:
+	/*
+	 * Note: Finding the io group again in case io group disappeared
+	 * during the time we dropped the queue lock and acquired it
+	 * back.
+	 */
+	if (bio)
+		iog = io_get_io_group(q, bio, 1, 0);
+	else
+		iog = io_get_io_group(q, NULL, 1, 1);
+
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
@@ -1449,13 +1459,19 @@ out:
 }
 
 static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, struct io_group *iog, int is_sync,
-			struct io_context *ioc, gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
+				struct io_context *ioc, gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct io_group *iog = NULL;
+
+	if (bio)
+		iog = io_get_io_group(cfqd->queue, bio, 1, 0);
+	else
+		iog = io_get_io_group(cfqd->queue, NULL, 1, 1);
 
 	if (!is_sync) {
 		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
@@ -1464,7 +1480,7 @@ cfq_get_queue(struct cfq_data *cfqd, str
 	}
 
 	if (!cfqq) {
-		cfqq = cfq_find_alloc_queue(cfqd, iog, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, bio, is_sync, ioc, gfp_mask);
 		if (!cfqq)
 			return NULL;
 	}
@@ -1889,7 +1905,8 @@ static void cfq_put_request(struct reque
  * Allocate cfq data structures associated with this request.
  */
 static int
-cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
+				gfp_t gfp_mask)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_io_context *cic;
@@ -1909,8 +1926,7 @@ cfq_set_request(struct request_queue *q,
 
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq) {
-		cfqq = cfq_get_queue(cfqd, rq_iog(q, rq), is_sync, cic->ioc,
-						gfp_mask);
+		cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc, gfp_mask);
 
 		if (!cfqq)
 			goto queue_fail;
Index: linux14/block/elevator.c
===================================================================
--- linux14.orig/block/elevator.c	2009-05-11 10:51:33.000000000 -0400
+++ linux14/block/elevator.c	2009-05-11 10:52:51.000000000 -0400
@@ -972,17 +972,15 @@ int elv_set_request(struct request_queue
 {
 	struct elevator_queue *e = q->elevator;
 
-	elv_fq_set_request_io_group(q, rq, bio);
-
 	/*
 	 * Optimization for noop, deadline and AS which maintain only single
 	 * ioq per io group
 	 */
 	if (elv_iosched_single_ioq(e))
-		return elv_fq_set_request_ioq(q, rq, gfp_mask);
+		return elv_fq_set_request_ioq(q, rq, bio, gfp_mask);
 
 	if (e->ops->elevator_set_req_fn)
-		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
+		return e->ops->elevator_set_req_fn(q, rq, bio, gfp_mask);
 
 	rq->elevator_private = NULL;
 	return 0;
Index: linux14/block/elevator-fq.h
===================================================================
--- linux14.orig/block/elevator-fq.h	2009-05-11 10:52:48.000000000 -0400
+++ linux14/block/elevator-fq.h	2009-05-11 11:25:03.000000000 -0400
@@ -510,15 +510,13 @@ static inline struct io_group *ioq_to_io
 
 #ifdef CONFIG_GROUP_IOSCHED
 extern int io_group_allow_merge(struct request *rq, struct bio *bio);
-extern void elv_fq_set_request_io_group(struct request_queue *q,
-					struct request *rq, struct bio *bio);
 static inline bfq_weight_t iog_weight(struct io_group *iog)
 {
 	return iog->entity.weight;
 }
 
 extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
-					gfp_t gfp_mask);
+					struct bio *bio, gfp_t gfp_mask);
 extern void elv_fq_unset_request_ioq(struct request_queue *q,
 					struct request *rq);
 extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
@@ -545,12 +543,6 @@ static inline void io_group_set_ioq(stru
 	iog->ioq = ioq;
 }
 
-static inline struct io_group *rq_iog(struct request_queue *q,
-					struct request *rq)
-{
-	return rq->iog;
-}
-
 static inline void elv_get_iog(struct io_group *iog)
 {
 	atomic_inc(&iog->ref);
@@ -566,10 +558,6 @@ static inline int io_group_allow_merge(s
  * separately. Hence in case of non-hierarchical setup, nothing todo.
  */
 static inline void io_disconnect_groups(struct elevator_queue *e) {}
-static inline void elv_fq_set_request_io_group(struct request_queue *q,
-					struct request *rq, struct bio *bio)
-{
-}
 
 static inline bfq_weight_t iog_weight(struct io_group *iog)
 {
@@ -588,7 +576,7 @@ static inline void io_group_set_ioq(stru
 }
 
 static inline int elv_fq_set_request_ioq(struct request_queue *q,
-					struct request *rq, gfp_t gfp_mask)
+			struct request *rq, struct bio *bio, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -613,8 +601,6 @@ static inline void elv_get_iog(struct io
 
 static inline void elv_put_iog(struct io_group *iog) { }
 
-extern struct io_group *rq_iog(struct request_queue *q, struct request *rq);
-
 #endif /* GROUP_IOSCHED */
 
 /* Functions used by blksysfs.c */
@@ -670,8 +656,8 @@ extern void *io_group_async_queue_prio(s
 extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 					int ioprio, struct io_queue *ioq);
 extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
-extern struct io_group *io_get_io_group_bio(struct request_queue *q,
-						struct bio *bio, int create);
+extern struct io_group *io_get_io_group(struct request_queue *q,
+				struct bio *bio, int create, int curr);
 extern int elv_nr_busy_ioq(struct elevator_queue *e);
 extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
 extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
@@ -725,18 +711,13 @@ static inline void *elv_fq_select_ioq(st
 	return NULL;
 }
 
-static inline void elv_fq_set_request_io_group(struct request_queue *q,
-					struct request *rq, struct bio *bio)
-{
-}
-
 static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 
 {
 	return 1;
 }
 static inline int elv_fq_set_request_ioq(struct request_queue *q,
-					struct request *rq, gfp_t gfp_mask)
+			struct request *rq, struct bio *bio, gfp_t gfp_mask)
 {
 	return 0;
 }
Index: linux14/block/blk-core.c
===================================================================
--- linux14.orig/block/blk-core.c	2009-05-11 11:35:20.000000000 -0400
+++ linux14/block/blk-core.c	2009-05-11 11:35:27.000000000 -0400
@@ -736,8 +736,22 @@ static void __freed_request(struct reque
 static void freed_request(struct request_queue *q, int sync, int priv,
 					struct request_list *rl)
 {
-	BUG_ON(!rl->count[sync]);
-	rl->count[sync]--;
+	/* There is a window during request allocation where request is
+	 * mapped to one group but by the time a queue for the group is
+	 * allocated, it is possible that original cgroup/io group has been
+	 * deleted and now io queue is allocated in a different group (root)
+	 * altogether.
+	 *
+	 * One solution to the problem is that rq should take io group
+	 * reference. But that looks like too much to do to solve this issue.
+	 * The only side effect of this hard-to-hit issue seems to be that
+	 * we will try to decrement the rl->count for a request list which
+	 * did not allocate that request. Check for rl->count going below
+	 * zero and do not decrement it if that's the case.
+	 */
+
+	if (rl->count[sync] > 0)
+		rl->count[sync]--;
 
 	BUG_ON(!q->rq_data.count[sync]);
 	q->rq_data.count[sync]--;
@@ -841,16 +855,6 @@ static struct request *get_request(struc
 
 	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
 
-#ifdef CONFIG_GROUP_IOSCHED
-	if (rq) {
-		/*
-		 * TODO. Implement group reference counting and take the
-		 * reference to the group to make sure group hence request
-		 * list does not go away till rq finishes.
-		 */
-		rq->rl = rl;
-	}
-#endif
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything

^ permalink raw reply	[flat|nested] 297+ messages in thread

* [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (20 preceding siblings ...)
  2009-05-08  9:45   ` [PATCH] io-controller: Add io group reference handling for request Gui Jianfeng
@ 2009-05-13  2:00   ` Gui Jianfeng
  21 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-13  2:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Hi Vivek,

This patch enables per-cgroup, per-device weight and ioprio_class handling.
A new cgroup interface, "policy", is introduced. You can use this file to
configure the weight and ioprio_class for each device in a given cgroup.
The original "weight" and "ioprio_class" files are still available. If you
don't do any special configuration for a particular device, "weight" and
"ioprio_class" are used as the default values for that device (a short
sketch of this lookup order follows the examples below).

You can use the following format to play with the new interface.
# echo DEV:weight:ioprio_class > /path/to/cgroup/io.policy
Setting weight=0 removes the policy for DEV.

Examples:
Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
# echo /dev/hdb:300:2 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2

Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
# echo /dev/hda:500:1 > io.policy
# cat io.policy
dev weight class
/dev/hda 500 1
/dev/hdb 300 2

Remove the policy for /dev/hda in this cgroup
# echo /dev/hda:0:1 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2
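
In code form, the lookup order this implies is roughly the following. This is
a sketch only; iog_apply_policy() is a made-up name, and the real logic lives
in io_group_init_entity() and policy_search_node() in the patch below, which
additionally update the new_* fields and take iocg->lock.

	static void iog_apply_policy(struct io_cgroup *iocg,
				     struct io_group *iog, void *key)
	{
		struct policy_node *pn = policy_search_node(iocg, key);

		/* A per-device policy wins; otherwise fall back to the
		 * cgroup-wide "weight" and "ioprio_class" values. */
		iog->entity.weight       = pn ? pn->weight : iocg->weight;
		iog->entity.ioprio_class = pn ? pn->ioprio_class
					      : iocg->ioprio_class;
	}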

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 block/elevator-fq.c |  239 +++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h |   11 +++
 2 files changed, 245 insertions(+), 5 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69435ab..7c95d55 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -12,6 +12,9 @@
 #include "elevator-fq.h"
 #include <linux/blktrace_api.h>
 #include <linux/biotrack.h>
+#include <linux/seq_file.h>
+#include <linux/genhd.h>
+
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
 }
 EXPORT_SYMBOL(io_lookup_io_group_current);
 
-void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
+					      void *key);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
+			  void *key)
 {
 	struct io_entity *entity = &iog->entity;
+	struct policy_node *pn;
+
+	spin_lock_irq(&iocg->lock);
+	pn = policy_search_node(iocg, key);
+	if (pn) {
+		entity->weight = pn->weight;
+		entity->new_weight = pn->weight;
+		entity->ioprio_class = pn->ioprio_class;
+		entity->new_ioprio_class = pn->ioprio_class;
+	} else {
+		entity->weight = iocg->weight;
+		entity->new_weight = iocg->weight;
+		entity->ioprio_class = iocg->ioprio_class;
+		entity->new_ioprio_class = iocg->ioprio_class;
+	}
+	spin_unlock_irq(&iocg->lock);
 
-	entity->weight = entity->new_weight = iocg->weight;
-	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
 	entity->ioprio_changed = 1;
 	entity->my_sched_data = &iog->sched_data;
 }
@@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 		atomic_set(&iog->ref, 0);
 		iog->deleting = 0;
 
-		io_group_init_entity(iocg, iog);
+		io_group_init_entity(iocg, iog, key);
 		iog->my_entity = &iog->entity;
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 		iog->iocg_id = css_id(&iocg->css);
@@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	return iog;
 }
 
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+				  struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct policy_node *pn;
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+
+	if (list_empty(&iocg->list))
+		goto out;
+
+	seq_printf(m, "dev weight class\n");
+
+	spin_lock_irq(&iocg->lock);
+	list_for_each_entry(pn, &iocg->list, node) {
+		seq_printf(m, "%s %lu %lu\n", pn->dev_name,
+			   pn->weight, pn->ioprio_class);
+	}
+	spin_unlock_irq(&iocg->lock);
+out:
+	return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+					  struct policy_node *pn)
+{
+	list_add(&pn->node, &iocg->list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct policy_node *pn)
+{
+	list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
+					      void *key)
+{
+	struct policy_node *pn;
+
+	if (list_empty(&iocg->list))
+		return NULL;
+
+	list_for_each_entry(pn, &iocg->list, node) {
+		if (pn->key == key)
+			return pn;
+	}
+
+	return NULL;
+}
+
+static void *devname_to_efqd(const char *buf)
+{
+	struct block_device *bdev;
+	void *key = NULL;
+	struct gendisk *disk;
+	int part;
+
+	bdev = lookup_bdev(buf);
+	if (IS_ERR(bdev))
+		return NULL;
+
+	disk = get_gendisk(bdev->bd_dev, &part);
+	key = (void *)&disk->queue->elevator->efqd;
+	bdput(bdev);
+
+	return key;
+}
+
+static int policy_parse_and_set(char *buf, struct policy_node *newpn)
+{
+	char *s[3];
+	char *p;
+	int ret;
+	int i = 0;
+
+	memset(s, 0, sizeof(s));
+	while (i < ARRAY_SIZE(s)) {
+		p = strsep(&buf, ":");
+		if (!p)
+			break;
+		if (!*p)
+			continue;
+		s[i++] = p;
+	}
+
+	newpn->key = devname_to_efqd(s[0]);
+	if (!newpn->key)
+		return -EINVAL;
+
+	strcpy(newpn->dev_name, s[0]);
+
+	ret = strict_strtoul(s[1], 10, &newpn->weight);
+	if (ret || newpn->weight > WEIGHT_MAX)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
+	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
+	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+			    const char *buffer)
+{
+	struct io_cgroup *iocg;
+	struct policy_node *newpn, *pn;
+	char *buf;
+	int ret = 0;
+	int keep_newpn = 0;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	buf = kstrdup(buffer, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+	if (!newpn) {
+		ret = -ENOMEM;
+		goto free_buf;
+	}
+
+	ret = policy_parse_and_set(buf, newpn);
+	if (ret)
+		goto free_newpn;
+
+	if (!cgroup_lock_live_group(cgrp)) {
+		ret = -ENODEV;
+		goto free_newpn;
+	}
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+	spin_lock_irq(&iocg->lock);
+
+	pn = policy_search_node(iocg, newpn->key);
+	if (!pn) {
+		if (newpn->weight != 0) {
+			policy_insert_node(iocg, newpn);
+			keep_newpn = 1;
+		}
+		goto update_io_group;
+	}
+
+	if (newpn->weight == 0) {
+		/* weight == 0 means deleting a policy */
+		policy_delete_node(pn);
+		goto update_io_group;
+	}
+
+	pn->weight = newpn->weight;
+	pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+		if (iog->key == newpn->key) {
+			if (newpn->weight) {
+				iog->entity.new_weight = newpn->weight;
+				iog->entity.new_ioprio_class =
+					newpn->ioprio_class;
+				/*
+				 * iog weight and ioprio_class updating
+				 * actually happens if ioprio_changed is set.
+				 * So ensure ioprio_changed is not set until
+				 * new weight and new ioprio_class are updated.
+				 */
+				smp_wmb();
+				iog->entity.ioprio_changed = 1;
+			} else {
+				iog->entity.new_weight = iocg->weight;
+				iog->entity.new_ioprio_class =
+					iocg->ioprio_class;
+
+				/* The same as above */
+				smp_wmb();
+				iog->entity.ioprio_changed = 1;
+			}
+		}
+	}
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+free_newpn:
+	if (!keep_newpn)
+		kfree(newpn);
+free_buf:
+	kfree(buf);
+	return ret;
+}
+
 struct cftype bfqio_files[] = {
 	{
+		.name = "policy",
+		.read_seq_string = io_cgroup_policy_read,
+		.write_string = io_cgroup_policy_write,
+		.max_write_len = 256,
+	},
+	{
 		.name = "weight",
 		.read_u64 = io_cgroup_weight_read,
 		.write_u64 = io_cgroup_weight_write,
@@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
 	INIT_HLIST_HEAD(&iocg->group_data);
 	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
 	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+	INIT_LIST_HEAD(&iocg->list);
 
 	return &iocg->css;
 }
@@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 	unsigned long flags, flags1;
 	int queue_lock_held = 0;
 	struct elv_fq_data *efqd;
+	struct policy_node *pn, *pntmp;
 
 	/*
 	 * io groups are linked in two lists. One list is maintained
@@ -1823,6 +2046,12 @@ locked:
 	BUG_ON(!hlist_empty(&iocg->group_data));
 
 	free_css_id(&io_subsys, &iocg->css);
+
+	list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
+		policy_delete_node(pn);
+		kfree(pn);
+	}
+
 	kfree(iocg);
 }
 
@@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
 void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
 {
 	entity->ioprio = entity->new_ioprio;
-	entity->weight = entity->new_weight;
+	entity->weight = entity->new_weight;
 	entity->ioprio_class = entity->new_ioprio_class;
 	entity->sched_data = &iog->sched_data;
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index db3a347..0407633 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -253,6 +253,14 @@ struct io_group {
 #endif
 };
 
+struct policy_node {
+	struct list_head node;
+	char dev_name[32];
+	void *key;
+	unsigned long weight;
+	unsigned long ioprio_class;
+};
+
 /**
  * struct bfqio_cgroup - bfq cgroup data structure.
  * @css: subsystem state for bfq in the containing cgroup.
@@ -269,6 +277,9 @@ struct io_cgroup {
 
 	unsigned long weight, ioprio_class;
 
+	/* list of policy_node */
+	struct list_head list;
+
 	spinlock_t lock;
 	struct hlist_head group_data;
 };
-- 
1.5.4.rc3
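
A rough sketch of the reader side that the smp_wmb() in io_cgroup_policy_write()
above is meant to pair with is shown below. The helper name and its placement are
hypothetical; the actual update path in elevator-fq.c may differ.

/*
 * Sketch only: pairs with the smp_wmb() done before setting
 * ioprio_changed in io_cgroup_policy_write(). Once ioprio_changed is
 * observed, the smp_rmb() ensures that new_weight and new_ioprio_class
 * read below are the freshly written values.
 */
static void io_entity_apply_new_prio(struct io_entity *entity)
{
	if (!entity->ioprio_changed)
		return;

	/* pairs with smp_wmb() in io_cgroup_policy_write() */
	smp_rmb();
	entity->weight = entity->new_weight;
	entity->ioprio_class = entity->new_ioprio_class;
	entity->ioprio_changed = 0;
}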

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
                   ` (36 preceding siblings ...)
  2009-05-08  9:45 ` [PATCH] io-controller: Add io group reference handling for request Gui Jianfeng
@ 2009-05-13  2:00 ` Gui Jianfeng
  2009-05-13 14:44   ` Vivek Goyal
                     ` (5 more replies)
  37 siblings, 6 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-13  2:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Hi Vivek,

This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this 
file to configure weight and ioprio_class for each device in a given cgroup.
The original "weight" and "ioprio_class" files are still available. If you
don't do any special configuration for a particular device, "weight" and
"ioprio_class" are used as the default values for that device.

You can use the following format to play with the new interface.
# echo DEV:weight:ioprio_class > /path/to/cgroup/policy
weight=0 means removing the policy for DEV.

Examples:
Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
# echo /dev/hdb:300:2 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2

Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
# echo /dev/hda:500:1 > io.policy
# cat io.policy
dev weight class
/dev/hda 500 1
/dev/hdb 300 2

Remove the policy for /dev/hda in this cgroup
# echo /dev/hda:0:1 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/elevator-fq.c |  239 +++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h |   11 +++
 2 files changed, 245 insertions(+), 5 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69435ab..7c95d55 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -12,6 +12,9 @@
 #include "elevator-fq.h"
 #include <linux/blktrace_api.h>
 #include <linux/biotrack.h>
+#include <linux/seq_file.h>
+#include <linux/genhd.h>
+
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
 }
 EXPORT_SYMBOL(io_lookup_io_group_current);
 
-void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
+					      void *key);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
+			  void *key)
 {
 	struct io_entity *entity = &iog->entity;
+	struct policy_node *pn;
+
+	spin_lock_irq(&iocg->lock);
+	pn = policy_search_node(iocg, key);
+	if (pn) {
+		entity->weight = pn->weight;
+		entity->new_weight = pn->weight;
+		entity->ioprio_class = pn->ioprio_class;
+		entity->new_ioprio_class = pn->ioprio_class;
+	} else {
+		entity->weight = iocg->weight;
+		entity->new_weight = iocg->weight;
+		entity->ioprio_class = iocg->ioprio_class;
+		entity->new_ioprio_class = iocg->ioprio_class;
+	}
+	spin_unlock_irq(&iocg->lock);
 
-	entity->weight = entity->new_weight = iocg->weight;
-	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
 	entity->ioprio_changed = 1;
 	entity->my_sched_data = &iog->sched_data;
 }
@@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 		atomic_set(&iog->ref, 0);
 		iog->deleting = 0;
 
-		io_group_init_entity(iocg, iog);
+		io_group_init_entity(iocg, iog, key);
 		iog->my_entity = &iog->entity;
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 		iog->iocg_id = css_id(&iocg->css);
@@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	return iog;
 }
 
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+				  struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct policy_node *pn;
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+
+	if (list_empty(&iocg->list))
+		goto out;
+
+	seq_printf(m, "dev weight class\n");
+
+	spin_lock_irq(&iocg->lock);
+	list_for_each_entry(pn, &iocg->list, node) {
+		seq_printf(m, "%s %lu %lu\n", pn->dev_name,
+			   pn->weight, pn->ioprio_class);
+	}
+	spin_unlock_irq(&iocg->lock);
+out:
+	return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+					  struct policy_node *pn)
+{
+	list_add(&pn->node, &iocg->list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct policy_node *pn)
+{
+	list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
+					      void *key)
+{
+	struct policy_node *pn;
+
+	if (list_empty(&iocg->list))
+		return NULL;
+
+	list_for_each_entry(pn, &iocg->list, node) {
+		if (pn->key == key)
+			return pn;
+	}
+
+	return NULL;
+}
+
+static void *devname_to_efqd(const char *buf)
+{
+	struct block_device *bdev;
+	void *key = NULL;
+	struct gendisk *disk;
+	int part;
+
+	bdev = lookup_bdev(buf);
+	if (IS_ERR(bdev))
+		return NULL;
+
+	disk = get_gendisk(bdev->bd_dev, &part);
+	key = (void *)&disk->queue->elevator->efqd;
+	bdput(bdev);
+
+	return key;
+}
+
+static int policy_parse_and_set(char *buf, struct policy_node *newpn)
+{
+	char *s[3];
+	char *p;
+	int ret;
+	int i = 0;
+
+	memset(s, 0, sizeof(s));
+	while (i < ARRAY_SIZE(s)) {
+		p = strsep(&buf, ":");
+		if (!p)
+			break;
+		if (!*p)
+			continue;
+		s[i++] = p;
+	}
+
+	newpn->key = devname_to_efqd(s[0]);
+	if (!newpn->key)
+		return -EINVAL;
+
+	strcpy(newpn->dev_name, s[0]);
+
+	ret = strict_strtoul(s[1], 10, &newpn->weight);
+	if (ret || newpn->weight > WEIGHT_MAX)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
+	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
+	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+			    const char *buffer)
+{
+	struct io_cgroup *iocg;
+	struct policy_node *newpn, *pn;
+	char *buf;
+	int ret = 0;
+	int keep_newpn = 0;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	buf = kstrdup(buffer, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+	if (!newpn) {
+		ret = -ENOMEM;
+		goto free_buf;
+	}
+
+	ret = policy_parse_and_set(buf, newpn);
+	if (ret)
+		goto free_newpn;
+
+	if (!cgroup_lock_live_group(cgrp)) {
+		ret = -ENODEV;
+		goto free_newpn;
+	}
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+	spin_lock_irq(&iocg->lock);
+
+	pn = policy_search_node(iocg, newpn->key);
+	if (!pn) {
+		if (newpn->weight != 0) {
+			policy_insert_node(iocg, newpn);
+			keep_newpn = 1;
+		}
+		goto update_io_group;
+	}
+
+	if (newpn->weight == 0) {
+		/* weight == 0 means deleting a policy */
+		policy_delete_node(pn);
+		goto update_io_group;
+	}
+
+	pn->weight = newpn->weight;
+	pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+		if (iog->key == newpn->key) {
+			if (newpn->weight) {
+				iog->entity.new_weight = newpn->weight;
+				iog->entity.new_ioprio_class =
+					newpn->ioprio_class;
+				/*
+				 * iog weight and ioprio_class updating
+				 * actually happens if ioprio_changed is set.
+				 * So ensure ioprio_changed is not set until
+				 * new weight and new ioprio_class are updated.
+				 */
+				smp_wmb();
+				iog->entity.ioprio_changed = 1;
+			} else {
+				iog->entity.new_weight = iocg->weight;
+				iog->entity.new_ioprio_class =
+					iocg->ioprio_class;
+
+				/* The same as above */
+				smp_wmb();
+				iog->entity.ioprio_changed = 1;
+			}
+		}
+	}
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+free_newpn:
+	if (!keep_newpn)
+		kfree(newpn);
+free_buf:
+	kfree(buf);
+	return ret;
+}
+
 struct cftype bfqio_files[] = {
 	{
+		.name = "policy",
+		.read_seq_string = io_cgroup_policy_read,
+		.write_string = io_cgroup_policy_write,
+		.max_write_len = 256,
+	},
+	{
 		.name = "weight",
 		.read_u64 = io_cgroup_weight_read,
 		.write_u64 = io_cgroup_weight_write,
@@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
 	INIT_HLIST_HEAD(&iocg->group_data);
 	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
 	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+	INIT_LIST_HEAD(&iocg->list);
 
 	return &iocg->css;
 }
@@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 	unsigned long flags, flags1;
 	int queue_lock_held = 0;
 	struct elv_fq_data *efqd;
+	struct policy_node *pn, *pntmp;
 
 	/*
 	 * io groups are linked in two lists. One list is maintained
@@ -1823,6 +2046,12 @@ locked:
 	BUG_ON(!hlist_empty(&iocg->group_data));
 
 	free_css_id(&io_subsys, &iocg->css);
+
+	list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
+		policy_delete_node(pn);
+		kfree(pn);
+	}
+
 	kfree(iocg);
 }
 
@@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
 void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
 {
 	entity->ioprio = entity->new_ioprio;
-	entity->weight = entity->new_weight;
+	entity->weight = entity->new_weight;
 	entity->ioprio_class = entity->new_ioprio_class;
 	entity->sched_data = &iog->sched_data;
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index db3a347..0407633 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -253,6 +253,14 @@ struct io_group {
 #endif
 };
 
+struct policy_node {
+	struct list_head node;
+	char dev_name[32];
+	void *key;
+	unsigned long weight;
+	unsigned long ioprio_class;
+};
+
 /**
  * struct bfqio_cgroup - bfq cgroup data structure.
  * @css: subsystem state for bfq in the containing cgroup.
@@ -269,6 +277,9 @@ struct io_cgroup {
 
 	unsigned long weight, ioprio_class;
 
+	/* list of policy_node */
+	struct list_head list;
+
 	spinlock_t lock;
 	struct hlist_head group_data;
 };
-- 
1.5.4.rc3




^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
       [not found]   ` <1241553525-28095-8-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-13  2:39     ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-13  2:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
...
>  
> +/*
> + * traverse through all the io_groups associated with this cgroup and calculate
> + * the aggr disk time received by all the groups on respective disks.
> + */
> +static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
> +{
> +	struct io_group *iog;
> +	struct hlist_node *n;
> +	u64 disk_time = 0;
> +
> +	rcu_read_lock();

  This function is in the slow path, so there is no need to call rcu_read_lock(); we just
  need to ensure that the caller already holds the iocg->lock.

> +	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
> +		/*
> +		 * There might be groups which are not functional and
> +		 * waiting to be reclaimed upon cgroup deletion.
> +		 */
> +		if (rcu_dereference(iog->key))
> +			disk_time += iog->entity.total_service;
> +	}
> +	rcu_read_unlock();
> +
> +	return disk_time;
> +}
> +
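
A minimal sketch of the variant suggested above, with the rcu_read_lock()/rcu_read_unlock()
pair dropped and the caller required to hold iocg->lock. The lock assertion and exact shape
are illustrative only, not taken from the posted patch.

static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
{
	struct io_group *iog;
	struct hlist_node *n;
	u64 disk_time = 0;

	/* caller must already hold iocg->lock */
	assert_spin_locked(&iocg->lock);

	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
		/* skip groups waiting to be reclaimed upon cgroup deletion */
		if (iog->key)
			disk_time += iog->entity.total_service;
	}

	return disk_time;
}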

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
  2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-13  2:39   ` Gui Jianfeng
       [not found]     ` <4A0A32CB.4020609-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-05-13 14:51     ` Vivek Goyal
       [not found]   ` <1241553525-28095-8-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-13  2:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
...
>  
> +/*
> + * traverse through all the io_groups associated with this cgroup and calculate
> + * the aggr disk time received by all the groups on respective disks.
> + */
> +static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
> +{
> +	struct io_group *iog;
> +	struct hlist_node *n;
> +	u64 disk_time = 0;
> +
> +	rcu_read_lock();

  This function is in the slow path, so there is no need to call rcu_read_lock(); we just
  need to ensure that the caller already holds the iocg->lock.

> +	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
> +		/*
> +		 * There might be groups which are not functional and
> +		 * waiting to be reclaimed upon cgroup deletion.
> +		 */
> +		if (rcu_dereference(iog->key))
> +			disk_time += iog->entity.total_service;
> +	}
> +	rcu_read_unlock();
> +
> +	return disk_time;
> +}
> +

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]   ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-13 14:44     ` Vivek Goyal
  2009-05-13 15:29     ` Vivek Goyal
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 14:44 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this 
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do any special configuration for a particular device, "weight" and
> "ioprio_class" are used as the default values for that device.
> 
> You can use the following format to play with the new interface.
> # echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
> 

Thanks for the patch Gui. I will test it out and let you know how
it goes.

Thanks
Vivek

> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
> 
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
>  block/elevator-fq.c |  239 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  block/elevator-fq.h |   11 +++
>  2 files changed, 245 insertions(+), 5 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
>  #include "elevator-fq.h"
>  #include <linux/blktrace_api.h>
>  #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>  
>  /* Values taken from cfq */
>  const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(io_lookup_io_group_current);
>  
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> +					      void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> +			  void *key)
>  {
>  	struct io_entity *entity = &iog->entity;
> +	struct policy_node *pn;
> +
> +	spin_lock_irq(&iocg->lock);
> +	pn = policy_search_node(iocg, key);
> +	if (pn) {
> +		entity->weight = pn->weight;
> +		entity->new_weight = pn->weight;
> +		entity->ioprio_class = pn->ioprio_class;
> +		entity->new_ioprio_class = pn->ioprio_class;
> +	} else {
> +		entity->weight = iocg->weight;
> +		entity->new_weight = iocg->weight;
> +		entity->ioprio_class = iocg->ioprio_class;
> +		entity->new_ioprio_class = iocg->ioprio_class;
> +	}
> +	spin_unlock_irq(&iocg->lock);
>  
> -	entity->weight = entity->new_weight = iocg->weight;
> -	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
>  	entity->ioprio_changed = 1;
>  	entity->my_sched_data = &iog->sched_data;
>  }
> @@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
>  		atomic_set(&iog->ref, 0);
>  		iog->deleting = 0;
>  
> -		io_group_init_entity(iocg, iog);
> +		io_group_init_entity(iocg, iog, key);
>  		iog->my_entity = &iog->entity;
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>  		iog->iocg_id = css_id(&iocg->css);
> @@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
>  	return iog;
>  }
>  
> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
> +				  struct seq_file *m)
> +{
> +	struct io_cgroup *iocg;
> +	struct policy_node *pn;
> +
> +	iocg = cgroup_to_io_cgroup(cgrp);
> +
> +	if (list_empty(&iocg->list))
> +		goto out;
> +
> +	seq_printf(m, "dev weight class\n");
> +
> +	spin_lock_irq(&iocg->lock);
> +	list_for_each_entry(pn, &iocg->list, node) {
> +		seq_printf(m, "%s %lu %lu\n", pn->dev_name,
> +			   pn->weight, pn->ioprio_class);
> +	}
> +	spin_unlock_irq(&iocg->lock);
> +out:
> +	return 0;
> +}
> +
> +static inline void policy_insert_node(struct io_cgroup *iocg,
> +					  struct policy_node *pn)
> +{
> +	list_add(&pn->node, &iocg->list);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static inline void policy_delete_node(struct policy_node *pn)
> +{
> +	list_del(&pn->node);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> +					      void *key)
> +{
> +	struct policy_node *pn;
> +
> +	if (list_empty(&iocg->list))
> +		return NULL;
> +
> +	list_for_each_entry(pn, &iocg->list, node) {
> +		if (pn->key == key)
> +			return pn;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void *devname_to_efqd(const char *buf)
> +{
> +	struct block_device *bdev;
> +	void *key = NULL;
> +	struct gendisk *disk;
> +	int part;
> +
> +	bdev = lookup_bdev(buf);
> +	if (IS_ERR(bdev))
> +		return NULL;
> +
> +	disk = get_gendisk(bdev->bd_dev, &part);
> +	key = (void *)&disk->queue->elevator->efqd;
> +	bdput(bdev);
> +
> +	return key;
> +}
> +
> +static int policy_parse_and_set(char *buf, struct policy_node *newpn)
> +{
> +	char *s[3];
> +	char *p;
> +	int ret;
> +	int i = 0;
> +
> +	memset(s, 0, sizeof(s));
> +	while (i < ARRAY_SIZE(s)) {
> +		p = strsep(&buf, ":");
> +		if (!p)
> +			break;
> +		if (!*p)
> +			continue;
> +		s[i++] = p;
> +	}
> +
> +	newpn->key = devname_to_efqd(s[0]);
> +	if (!newpn->key)
> +		return -EINVAL;
> +
> +	strcpy(newpn->dev_name, s[0]);
> +
> +	ret = strict_strtoul(s[1], 10, &newpn->weight);
> +	if (ret || newpn->weight > WEIGHT_MAX)
> +		return -EINVAL;
> +
> +	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
> +	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
> +	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
> +			    const char *buffer)
> +{
> +	struct io_cgroup *iocg;
> +	struct policy_node *newpn, *pn;
> +	char *buf;
> +	int ret = 0;
> +	int keep_newpn = 0;
> +	struct hlist_node *n;
> +	struct io_group *iog;
> +
> +	buf = kstrdup(buffer, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
> +	if (!newpn) {
> +		ret = -ENOMEM;
> +		goto free_buf;
> +	}
> +
> +	ret = policy_parse_and_set(buf, newpn);
> +	if (ret)
> +		goto free_newpn;
> +
> +	if (!cgroup_lock_live_group(cgrp)) {
> +		ret = -ENODEV;
> +		goto free_newpn;
> +	}
> +
> +	iocg = cgroup_to_io_cgroup(cgrp);
> +	spin_lock_irq(&iocg->lock);
> +
> +	pn = policy_search_node(iocg, newpn->key);
> +	if (!pn) {
> +		if (newpn->weight != 0) {
> +			policy_insert_node(iocg, newpn);
> +			keep_newpn = 1;
> +		}
> +		goto update_io_group;
> +	}
> +
> +	if (newpn->weight == 0) {
> +		/* weight == 0 means deleting a policy */
> +		policy_delete_node(pn);
> +		goto update_io_group;
> +	}
> +
> +	pn->weight = newpn->weight;
> +	pn->ioprio_class = newpn->ioprio_class;
> +
> +update_io_group:
> +	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
> +		if (iog->key == newpn->key) {
> +			if (newpn->weight) {
> +				iog->entity.new_weight = newpn->weight;
> +				iog->entity.new_ioprio_class =
> +					newpn->ioprio_class;
> +				/*
> +				 * iog weight and ioprio_class updating
> +				 * actually happens if ioprio_changed is set.
> +				 * So ensure ioprio_changed is not set until
> +				 * new weight and new ioprio_class are updated.
> +				 */
> +				smp_wmb();
> +				iog->entity.ioprio_changed = 1;
> +			} else {
> +				iog->entity.new_weight = iocg->weight;
> +				iog->entity.new_ioprio_class =
> +					iocg->ioprio_class;
> +
> +				/* The same as above */
> +				smp_wmb();
> +				iog->entity.ioprio_changed = 1;
> +			}
> +		}
> +	}
> +	spin_unlock_irq(&iocg->lock);
> +
> +	cgroup_unlock();
> +
> +free_newpn:
> +	if (!keep_newpn)
> +		kfree(newpn);
> +free_buf:
> +	kfree(buf);
> +	return ret;
> +}
> +
>  struct cftype bfqio_files[] = {
>  	{
> +		.name = "policy",
> +		.read_seq_string = io_cgroup_policy_read,
> +		.write_string = io_cgroup_policy_write,
> +		.max_write_len = 256,
> +	},
> +	{
>  		.name = "weight",
>  		.read_u64 = io_cgroup_weight_read,
>  		.write_u64 = io_cgroup_weight_write,
> @@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>  	INIT_HLIST_HEAD(&iocg->group_data);
>  	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>  	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
> +	INIT_LIST_HEAD(&iocg->list);
>  
>  	return &iocg->css;
>  }
> @@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>  	unsigned long flags, flags1;
>  	int queue_lock_held = 0;
>  	struct elv_fq_data *efqd;
> +	struct policy_node *pn, *pntmp;
>  
>  	/*
>  	 * io groups are linked in two lists. One list is maintained
> @@ -1823,6 +2046,12 @@ locked:
>  	BUG_ON(!hlist_empty(&iocg->group_data));
>  
>  	free_css_id(&io_subsys, &iocg->css);
> +
> +	list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
> +		policy_delete_node(pn);
> +		kfree(pn);
> +	}
> +
>  	kfree(iocg);
>  }
>  
> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>  void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>  {
>  	entity->ioprio = entity->new_ioprio;
> -	entity->weight = entity->new_weight;
> +	entity->weight = entity->new_weight;
>  	entity->ioprio_class = entity->new_ioprio_class;
>  	entity->sched_data = &iog->sched_data;
>  }
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..0407633 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -253,6 +253,14 @@ struct io_group {
>  #endif
>  };
>  
> +struct policy_node {
> +	struct list_head node;
> +	char dev_name[32];
> +	void *key;
> +	unsigned long weight;
> +	unsigned long ioprio_class;
> +};
> +
>  /**
>   * struct bfqio_cgroup - bfq cgroup data structure.
>   * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +277,9 @@ struct io_cgroup {
>  
>  	unsigned long weight, ioprio_class;
>  
> +	/* list of policy_node */
> +	struct list_head list;
> +
>  	spinlock_t lock;
>  	struct hlist_head group_data;
>  };
> -- 
> 1.5.4.rc3
> 
> 

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-13  2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
@ 2009-05-13 14:44   ` Vivek Goyal
       [not found]     ` <20090513144432.GA7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-14  0:59     ` Gui Jianfeng
  2009-05-13 15:29   ` Vivek Goyal
                     ` (4 subsequent siblings)
  5 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 14:44 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this 
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do any special configuration for a particular device, "weight" and
> "ioprio_class" are used as the default values for that device.
> 
> You can use the following format to play with the new interface.
> # echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
> 

Thanks for the patch Gui. I will test it out and let you know how
it goes.

Thanks
Vivek

> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
> 
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/elevator-fq.c |  239 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  block/elevator-fq.h |   11 +++
>  2 files changed, 245 insertions(+), 5 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
>  #include "elevator-fq.h"
>  #include <linux/blktrace_api.h>
>  #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>  
>  /* Values taken from cfq */
>  const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(io_lookup_io_group_current);
>  
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> +					      void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> +			  void *key)
>  {
>  	struct io_entity *entity = &iog->entity;
> +	struct policy_node *pn;
> +
> +	spin_lock_irq(&iocg->lock);
> +	pn = policy_search_node(iocg, key);
> +	if (pn) {
> +		entity->weight = pn->weight;
> +		entity->new_weight = pn->weight;
> +		entity->ioprio_class = pn->ioprio_class;
> +		entity->new_ioprio_class = pn->ioprio_class;
> +	} else {
> +		entity->weight = iocg->weight;
> +		entity->new_weight = iocg->weight;
> +		entity->ioprio_class = iocg->ioprio_class;
> +		entity->new_ioprio_class = iocg->ioprio_class;
> +	}
> +	spin_unlock_irq(&iocg->lock);
>  
> -	entity->weight = entity->new_weight = iocg->weight;
> -	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
>  	entity->ioprio_changed = 1;
>  	entity->my_sched_data = &iog->sched_data;
>  }
> @@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
>  		atomic_set(&iog->ref, 0);
>  		iog->deleting = 0;
>  
> -		io_group_init_entity(iocg, iog);
> +		io_group_init_entity(iocg, iog, key);
>  		iog->my_entity = &iog->entity;
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>  		iog->iocg_id = css_id(&iocg->css);
> @@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
>  	return iog;
>  }
>  
> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
> +				  struct seq_file *m)
> +{
> +	struct io_cgroup *iocg;
> +	struct policy_node *pn;
> +
> +	iocg = cgroup_to_io_cgroup(cgrp);
> +
> +	if (list_empty(&iocg->list))
> +		goto out;
> +
> +	seq_printf(m, "dev weight class\n");
> +
> +	spin_lock_irq(&iocg->lock);
> +	list_for_each_entry(pn, &iocg->list, node) {
> +		seq_printf(m, "%s %lu %lu\n", pn->dev_name,
> +			   pn->weight, pn->ioprio_class);
> +	}
> +	spin_unlock_irq(&iocg->lock);
> +out:
> +	return 0;
> +}
> +
> +static inline void policy_insert_node(struct io_cgroup *iocg,
> +					  struct policy_node *pn)
> +{
> +	list_add(&pn->node, &iocg->list);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static inline void policy_delete_node(struct policy_node *pn)
> +{
> +	list_del(&pn->node);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> +					      void *key)
> +{
> +	struct policy_node *pn;
> +
> +	if (list_empty(&iocg->list))
> +		return NULL;
> +
> +	list_for_each_entry(pn, &iocg->list, node) {
> +		if (pn->key == key)
> +			return pn;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void *devname_to_efqd(const char *buf)
> +{
> +	struct block_device *bdev;
> +	void *key = NULL;
> +	struct gendisk *disk;
> +	int part;
> +
> +	bdev = lookup_bdev(buf);
> +	if (IS_ERR(bdev))
> +		return NULL;
> +
> +	disk = get_gendisk(bdev->bd_dev, &part);
> +	key = (void *)&disk->queue->elevator->efqd;
> +	bdput(bdev);
> +
> +	return key;
> +}
> +
> +static int policy_parse_and_set(char *buf, struct policy_node *newpn)
> +{
> +	char *s[3];
> +	char *p;
> +	int ret;
> +	int i = 0;
> +
> +	memset(s, 0, sizeof(s));
> +	while (i < ARRAY_SIZE(s)) {
> +		p = strsep(&buf, ":");
> +		if (!p)
> +			break;
> +		if (!*p)
> +			continue;
> +		s[i++] = p;
> +	}
> +
> +	newpn->key = devname_to_efqd(s[0]);
> +	if (!newpn->key)
> +		return -EINVAL;
> +
> +	strcpy(newpn->dev_name, s[0]);
> +
> +	ret = strict_strtoul(s[1], 10, &newpn->weight);
> +	if (ret || newpn->weight > WEIGHT_MAX)
> +		return -EINVAL;
> +
> +	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
> +	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
> +	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
> +			    const char *buffer)
> +{
> +	struct io_cgroup *iocg;
> +	struct policy_node *newpn, *pn;
> +	char *buf;
> +	int ret = 0;
> +	int keep_newpn = 0;
> +	struct hlist_node *n;
> +	struct io_group *iog;
> +
> +	buf = kstrdup(buffer, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
> +	if (!newpn) {
> +		ret = -ENOMEM;
> +		goto free_buf;
> +	}
> +
> +	ret = policy_parse_and_set(buf, newpn);
> +	if (ret)
> +		goto free_newpn;
> +
> +	if (!cgroup_lock_live_group(cgrp)) {
> +		ret = -ENODEV;
> +		goto free_newpn;
> +	}
> +
> +	iocg = cgroup_to_io_cgroup(cgrp);
> +	spin_lock_irq(&iocg->lock);
> +
> +	pn = policy_search_node(iocg, newpn->key);
> +	if (!pn) {
> +		if (newpn->weight != 0) {
> +			policy_insert_node(iocg, newpn);
> +			keep_newpn = 1;
> +		}
> +		goto update_io_group;
> +	}
> +
> +	if (newpn->weight == 0) {
> +		/* weight == 0 means deleting a policy */
> +		policy_delete_node(pn);
> +		goto update_io_group;
> +	}
> +
> +	pn->weight = newpn->weight;
> +	pn->ioprio_class = newpn->ioprio_class;
> +
> +update_io_group:
> +	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
> +		if (iog->key == newpn->key) {
> +			if (newpn->weight) {
> +				iog->entity.new_weight = newpn->weight;
> +				iog->entity.new_ioprio_class =
> +					newpn->ioprio_class;
> +				/*
> +				 * iog weight and ioprio_class updating
> +				 * actually happens if ioprio_changed is set.
> +				 * So ensure ioprio_changed is not set until
> +				 * new weight and new ioprio_class are updated.
> +				 */
> +				smp_wmb();
> +				iog->entity.ioprio_changed = 1;
> +			} else {
> +				iog->entity.new_weight = iocg->weight;
> +				iog->entity.new_ioprio_class =
> +					iocg->ioprio_class;
> +
> +				/* The same as above */
> +				smp_wmb();
> +				iog->entity.ioprio_changed = 1;
> +			}
> +		}
> +	}
> +	spin_unlock_irq(&iocg->lock);
> +
> +	cgroup_unlock();
> +
> +free_newpn:
> +	if (!keep_newpn)
> +		kfree(newpn);
> +free_buf:
> +	kfree(buf);
> +	return ret;
> +}
> +
>  struct cftype bfqio_files[] = {
>  	{
> +		.name = "policy",
> +		.read_seq_string = io_cgroup_policy_read,
> +		.write_string = io_cgroup_policy_write,
> +		.max_write_len = 256,
> +	},
> +	{
>  		.name = "weight",
>  		.read_u64 = io_cgroup_weight_read,
>  		.write_u64 = io_cgroup_weight_write,
> @@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>  	INIT_HLIST_HEAD(&iocg->group_data);
>  	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>  	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
> +	INIT_LIST_HEAD(&iocg->list);
>  
>  	return &iocg->css;
>  }
> @@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>  	unsigned long flags, flags1;
>  	int queue_lock_held = 0;
>  	struct elv_fq_data *efqd;
> +	struct policy_node *pn, *pntmp;
>  
>  	/*
>  	 * io groups are linked in two lists. One list is maintained
> @@ -1823,6 +2046,12 @@ locked:
>  	BUG_ON(!hlist_empty(&iocg->group_data));
>  
>  	free_css_id(&io_subsys, &iocg->css);
> +
> +	list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
> +		policy_delete_node(pn);
> +		kfree(pn);
> +	}
> +
>  	kfree(iocg);
>  }
>  
> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>  void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>  {
>  	entity->ioprio = entity->new_ioprio;
> -	entity->weight = entity->new_weight;
> +	entity->weight = entity->new_weight;
>  	entity->ioprio_class = entity->new_ioprio_class;
>  	entity->sched_data = &iog->sched_data;
>  }
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..0407633 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -253,6 +253,14 @@ struct io_group {
>  #endif
>  };
>  
> +struct policy_node {
> +	struct list_head node;
> +	char dev_name[32];
> +	void *key;
> +	unsigned long weight;
> +	unsigned long ioprio_class;
> +};
> +
>  /**
>   * struct bfqio_cgroup - bfq cgroup data structure.
>   * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +277,9 @@ struct io_cgroup {
>  
>  	unsigned long weight, ioprio_class;
>  
> +	/* list of policy_node */
> +	struct list_head list;
> +
>  	spinlock_t lock;
>  	struct hlist_head group_data;
>  };
> -- 
> 1.5.4.rc3
> 
> 

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
       [not found]     ` <4A0A32CB.4020609-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-13 14:51       ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 14:51 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, May 13, 2009 at 10:39:07AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> >  
> > +/*
> > + * traverse through all the io_groups associated with this cgroup and calculate
> > + * the aggr disk time received by all the groups on respective disks.
> > + */
> > +static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
> > +{
> > +	struct io_group *iog;
> > +	struct hlist_node *n;
> > +	u64 disk_time = 0;
> > +
> > +	rcu_read_lock();
> 
>   This function is in the slow path, so there is no need to call rcu_read_lock(); we just
>   need to ensure that the caller already holds the iocg->lock.
> 

Or can we get rid of the requirement for iocg_lock here and just read the io
group data under the rcu read lock? Actually, I am wondering why we require
an iocg_lock here at all. We are not modifying the rcu protected list; we are
just traversing through it and reading the data.

Thanks
Vivek

> > +	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
> > +		/*
> > +		 * There might be groups which are not functional and
> > +		 * waiting to be reclaimed upon cgroup deletion.
> > +		 */
> > +		if (rcu_dereference(iog->key))
> > +			disk_time += iog->entity.total_service;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	return disk_time;
> > +}
> > +
> 
> -- 
> Regards
> Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 07/18] io-controller: Export disk time used and nr sectors dispatched through cgroups
  2009-05-13  2:39   ` Gui Jianfeng
       [not found]     ` <4A0A32CB.4020609-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-13 14:51     ` Vivek Goyal
  2009-05-14  7:53       ` Gui Jianfeng
       [not found]       ` <20090513145127.GB7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 14:51 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Wed, May 13, 2009 at 10:39:07AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> >  
> > +/*
> > + * traverse through all the io_groups associated with this cgroup and calculate
> > + * the aggr disk time received by all the groups on respective disks.
> > + */
> > +static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
> > +{
> > +	struct io_group *iog;
> > +	struct hlist_node *n;
> > +	u64 disk_time = 0;
> > +
> > +	rcu_read_lock();
> 
>   This function is in the slow path, so there is no need to call rcu_read_lock(); we just
>   need to ensure that the caller already holds the iocg->lock.
> 

Or can we get rid of the requirement for iocg_lock here and just read the io
group data under the rcu read lock? Actually, I am wondering why we require
an iocg_lock here at all. We are not modifying the rcu protected list; we are
just traversing through it and reading the data.

Thanks
Vivek

> > +	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
> > +		/*
> > +		 * There might be groups which are not functional and
> > +		 * waiting to be reclaimed upon cgroup deletion.
> > +		 */
> > +		if (rcu_dereference(iog->key))
> > +			disk_time += iog->entity.total_service;
> > +	}
> > +	rcu_read_unlock();
> > +
> > +	return disk_time;
> > +}
> > +
> 
> -- 
> Regards
> Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
       [not found]   ` <1241553525-28095-9-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-13 15:00     ` Vivek Goyal
  2009-06-09  7:56     ` Gui Jianfeng
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:00 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue, May 05, 2009 at 03:58:35PM -0400, Vivek Goyal wrote:
> o When a sync queue expires, in many cases it might be empty and then
>   it will be deleted from the active tree. This will lead to a scenario
>   where out of two competing queues, only one is on the tree and when a
>   new queue is selected, vtime jump takes place and we don't see services
>   provided in proportion to weight.
> 
> o In general this is a fundamental problem with fairness of sync queues
>   where queues are not continuously backlogged. Looks like idling is the
>   only solution to make sure such kinds of queues can get some decent amount
>   of disk bandwidth in the face of competition from continuously backlogged
>   queues. But excessive idling has the potential to reduce performance on SSDs
>   and disks with command queuing.
> 
> o This patch experiments with waiting for next request to come before a
>   queue is expired after it has consumed its time slice. This can ensure
>   more accurate fairness numbers in some cases.
> 
> o Introduced a tunable "fairness". If set, io-controller will put more
>   focus on getting fairness right than getting throughput right.
> 
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---

Following is a fix which should go here. This patch helps me get much 
better fairness numbers for sync queues.


o Fix a window where a queue can be expired without doing busy wait for
  next request. This fix allows better fairness numbers for sync queues.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

Index: linux14/block/elevator-fq.c
===================================================================
--- linux14.orig/block/elevator-fq.c	2009-05-13 10:55:44.000000000 -0400
+++ linux14/block/elevator-fq.c	2009-05-13 10:55:50.000000000 -0400
@@ -3368,8 +3368,22 @@ void *elv_fq_select_ioq(struct request_q
 	/*
 	 * The active queue has run out of time, expire it and select new.
 	 */
-	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
-		goto expire;
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq)) {
+		/*
+		 * Queue has used up its slice. Wait busy is not on, otherwise
+		 * we wouldn't have been here. There is a chance that after
+		 * slice expiry no request from the queue completed, hence the
+		 * wait busy timer could not be turned on. If that's the case,
+		 * don't expire the queue yet. The next request completion from
+		 * the queue will arm the wait busy timer.
+		 */
+		if (efqd->fairness && !ioq->nr_queued
+		    && elv_ioq_nr_dispatched(ioq)) {
+			ioq = NULL;
+			goto keep_queue;
+		} else
+			goto expire;
+	}
 
 	/*
 	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
  2009-05-05 19:58 ` Vivek Goyal
  2009-05-13 15:00   ` Vivek Goyal
@ 2009-05-13 15:00   ` Vivek Goyal
  2009-06-09  7:56   ` Gui Jianfeng
       [not found]   ` <1241553525-28095-9-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  3 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:00 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, balbir, linux-kernel, containers, righi.andrea, agk,
	dm-devel, snitzer, m-ikeda
  Cc: akpm

On Tue, May 05, 2009 at 03:58:35PM -0400, Vivek Goyal wrote:
> o When a sync queue expires, in many cases it might be empty and then
>   it will be deleted from the active tree. This will lead to a scenario
>   where out of two competing queues, only one is on the tree and when a
>   new queue is selected, vtime jump takes place and we don't see services
>   provided in proportion to weight.
> 
> o In general this is a fundamental problem with fairness of sync queues
>   where queues are not continuously backlogged. Looks like idling is the
>   only solution to make sure such kinds of queues can get some decent amount
>   of disk bandwidth in the face of competition from continuously backlogged
>   queues. But excessive idling has the potential to reduce performance on SSDs
>   and disks with command queuing.
> 
> o This patch experiments with waiting for next request to come before a
>   queue is expired after it has consumed its time slice. This can ensure
>   more accurate fairness numbers in some cases.
> 
> o Introduced a tunable "fairness". If set, io-controller will put more
>   focus on getting fairness right than getting throughput right.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---

Following is a fix which should go here. This patch helps me get much 
better fairness numbers for sync queues.


o Fix a window where a queue can be expired without doing busy wait for
  next request. This fix allows better fairness numbers for sync queues.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

Index: linux14/block/elevator-fq.c
===================================================================
--- linux14.orig/block/elevator-fq.c	2009-05-13 10:55:44.000000000 -0400
+++ linux14/block/elevator-fq.c	2009-05-13 10:55:50.000000000 -0400
@@ -3368,8 +3368,22 @@ void *elv_fq_select_ioq(struct request_q
 	/*
 	 * The active queue has run out of time, expire it and select new.
 	 */
-	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
-		goto expire;
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq)) {
+		/*
+		 * Queue has used up its slice. Wait busy is not on, otherwise
+		 * we wouldn't have been here. There is a chance that after
+		 * slice expiry no request from the queue completed, hence the
+		 * wait busy timer could not be turned on. If that's the case,
+		 * don't expire the queue yet. The next request completion from
+		 * the queue will arm the wait busy timer.
+		 */
+		if (efqd->fairness && !ioq->nr_queued
+		    && elv_ioq_nr_dispatched(ioq)) {
+			ioq = NULL;
+			goto keep_queue;
+		} else
+			goto expire;
+	}
 
 	/*
 	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
  2009-05-05 19:58 ` Vivek Goyal
@ 2009-05-13 15:00   ` Vivek Goyal
  2009-05-13 15:00   ` Vivek Goyal
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:00 UTC (permalink / raw)
  To: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando
  Cc: akpm

On Tue, May 05, 2009 at 03:58:35PM -0400, Vivek Goyal wrote:
> o When a sync queue expires, in many cases it might be empty and then
>   it will be deleted from the active tree. This will lead to a scenario
>   where out of two competing queues, only one is on the tree and when a
>   new queue is selected, vtime jump takes place and we don't see services
>   provided in proportion to weight.
> 
> o In general this is a fundamental problem with fairness of sync queues
>   where queues are not continuously backlogged. Looks like idling is the
>   only solution to make sure such kinds of queues can get some decent amount
>   of disk bandwidth in the face of competition from continuously backlogged
>   queues. But excessive idling has the potential to reduce performance on SSDs
>   and disks with command queuing.
> 
> o This patch experiments with waiting for next request to come before a
>   queue is expired after it has consumed its time slice. This can ensure
>   more accurate fairness numbers in some cases.
> 
> o Introduced a tunable "fairness". If set, io-controller will put more
>   focus on getting fairness right than getting throughput right.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---

Following is a fix which should go here. This patch helps me get much 
better fairness numbers for sync queues.


o Fix a window where a queue can be expired without busy-waiting for the
  next request. This fix allows better fairness numbers for sync queues.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

Index: linux14/block/elevator-fq.c
===================================================================
--- linux14.orig/block/elevator-fq.c	2009-05-13 10:55:44.000000000 -0400
+++ linux14/block/elevator-fq.c	2009-05-13 10:55:50.000000000 -0400
@@ -3368,8 +3368,22 @@ void *elv_fq_select_ioq(struct request_q
 	/*
 	 * The active queue has run out of time, expire it and select new.
 	 */
-	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
-		goto expire;
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq)) {
+		/*
+		 * Queue has used up its slice. Wait busy is not on otherwise
+		 * we wouldn't have been here. There is a chance that after
+		 * slice expiry no request from the queue completed hence
+		 * wait busy timer could not be turned on. If that's the case
+		 * don't expire the queue yet. Next request completion from
+		 * the queue will arm the wait busy timer.
+		 */
+		if (efqd->fairness && !ioq->nr_queued
+		    && elv_ioq_nr_dispatched(ioq)) {
+			ioq = NULL;
+			goto keep_queue;
+		} else
+			goto expire;
+	}
 
 	/*
 	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]   ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-05-13 14:44     ` Vivek Goyal
@ 2009-05-13 15:29     ` Vivek Goyal
  2009-05-13 15:59     ` Vivek Goyal
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:29 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:

[..]
> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>  void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>  {
>  	entity->ioprio = entity->new_ioprio;
> -	entity->weight = entity->new_weight;
> +	entity->weight = entity->new_weigh;
>  	entity->ioprio_class = entity->new_ioprio_class;
>  	entity->sched_data = &iog->sched_data;
>  }
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..0407633 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -253,6 +253,14 @@ struct io_group {
>  #endif
>  };
>  
> +struct policy_node {

Would "io_policy_node" be better?

> +	struct list_head node;
> +	char dev_name[32];
> +	void *key;
> +	unsigned long weight;
> +	unsigned long ioprio_class;
> +};
> +
>  /**
>   * struct bfqio_cgroup - bfq cgroup data structure.
>   * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +277,9 @@ struct io_cgroup {
>  
>  	unsigned long weight, ioprio_class;
>  
> +	/* list of policy_node */
> +	struct list_head list;
> +

How about "struct list_head policy_list" or "struct list_head io_policy"?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-13  2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
  2009-05-13 14:44   ` Vivek Goyal
@ 2009-05-13 15:29   ` Vivek Goyal
  2009-05-14  1:02     ` Gui Jianfeng
       [not found]     ` <20090513152909.GD7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-13 15:59   ` Vivek Goyal
                     ` (3 subsequent siblings)
  5 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:29 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:

[..]
> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>  void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>  {
>  	entity->ioprio = entity->new_ioprio;
> -	entity->weight = entity->new_weight;
> +	entity->weight = entity->new_weigh;
>  	entity->ioprio_class = entity->new_ioprio_class;
>  	entity->sched_data = &iog->sched_data;
>  }
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..0407633 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -253,6 +253,14 @@ struct io_group {
>  #endif
>  };
>  
> +struct policy_node {

Would "io_policy_node" be better?

> +	struct list_head node;
> +	char dev_name[32];
> +	void *key;
> +	unsigned long weight;
> +	unsigned long ioprio_class;
> +};
> +
>  /**
>   * struct bfqio_cgroup - bfq cgroup data structure.
>   * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +277,9 @@ struct io_cgroup {
>  
>  	unsigned long weight, ioprio_class;
>  
> +	/* list of policy_node */
> +	struct list_head list;
> +

How about "struct list_head policy_list" or "struct list_head io_policy"?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]   ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-05-13 14:44     ` Vivek Goyal
  2009-05-13 15:29     ` Vivek Goyal
@ 2009-05-13 15:59     ` Vivek Goyal
  2009-05-13 17:17     ` Vivek Goyal
  2009-05-13 19:09     ` Vivek Goyal
  4 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:59 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this 
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and 
> "ioprio_class" are used as default values in this device.
> 
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
> 
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
> 
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
>  block/elevator-fq.c |  239 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  block/elevator-fq.h |   11 +++
>  2 files changed, 245 insertions(+), 5 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
>  #include "elevator-fq.h"
>  #include <linux/blktrace_api.h>
>  #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>  
>  /* Values taken from cfq */
>  const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(io_lookup_io_group_current);
>  
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> +					      void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> +			  void *key)
>  {
>  	struct io_entity *entity = &iog->entity;
> +	struct policy_node *pn;
> +
> +	spin_lock_irq(&iocg->lock);
> +	pn = policy_search_node(iocg, key);
> +	if (pn) {
> +		entity->weight = pn->weight;
> +		entity->new_weight = pn->weight;
> +		entity->ioprio_class = pn->ioprio_class;
> +		entity->new_ioprio_class = pn->ioprio_class;
> +	} else {
> +		entity->weight = iocg->weight;
> +		entity->new_weight = iocg->weight;
> +		entity->ioprio_class = iocg->ioprio_class;
> +		entity->new_ioprio_class = iocg->ioprio_class;
> +	}
> +	spin_unlock_irq(&iocg->lock);

Hi Gui,

It might make sense to also store the device name or device major and
minor number in io_group while creating the io group. This will help us
to display the io.disk_time and io.disk_sectors statistics per device instead
of as an aggregate.

I am attaching a patch I was playing around with to display per-device
statistics instead of aggregate ones, in case the user has specified a
per-device rule.
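
For illustration, with the attached patch each statistics file would list one
line per device in "major minor value" format instead of a single aggregated
number, e.g. (hypothetical device numbers and values):

# cat io.disk_time
8 16 2300
8 32 1100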

Thanks
Vivek


o Currently the statistics exported through cgroup are an aggregate of the
  statistics on all devices for that cgroup. Instead of an aggregate, make
  these per device.

o Also export another statistic, io.disk_dequeue. This keeps a count of how
  many times a particular group dropped out of the race for the disk. This is
  a debugging aid to keep track of how often we could create continuously
  backlogged queues.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c |  127 +++++++++++++++++++++++++++++++++-------------------
 block/elevator-fq.h |    3 +
 2 files changed, 85 insertions(+), 45 deletions(-)

Index: linux14/block/elevator-fq.h
===================================================================
--- linux14.orig/block/elevator-fq.h	2009-05-13 11:40:32.000000000 -0400
+++ linux14/block/elevator-fq.h	2009-05-13 11:40:57.000000000 -0400
@@ -250,6 +250,9 @@ struct io_group {
 
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 	unsigned short iocg_id;
+	dev_t	dev;
+	/* How many times this group has been removed from active tree */
+	unsigned long dequeue;
 #endif
 };
 
Index: linux14/block/elevator-fq.c
===================================================================
--- linux14.orig/block/elevator-fq.c	2009-05-13 11:40:53.000000000 -0400
+++ linux14/block/elevator-fq.c	2009-05-13 11:40:57.000000000 -0400
@@ -12,6 +12,7 @@
 #include "elevator-fq.h"
 #include <linux/blktrace_api.h>
 #include <linux/biotrack.h>
+#include <linux/seq_file.h>
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -758,6 +759,18 @@ int __bfq_deactivate_entity(struct io_en
 	BUG_ON(sd->active_entity == entity);
 	BUG_ON(sd->next_active == entity);
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		struct io_group *iog = io_entity_to_iog(entity);
+		/*
+		 * Keep track of how many times a group has been removed
+		 * from active tree because it did not have any active
+		 * backlogged ioq under it
+		 */
+		if (iog)
+			iog->dequeue++;
+	}
+#endif
 	return ret;
 }
 
@@ -1126,90 +1139,103 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
 STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
 #undef STORE_FUNCTION
 
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr disk time received by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
+static int io_cgroup_disk_time_read(struct cgroup *cgroup,
+				struct cftype *cftype, struct seq_file *m)
 {
+	struct io_cgroup *iocg;
 	struct io_group *iog;
 	struct hlist_node *n;
-	u64 disk_time = 0;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
 
 	rcu_read_lock();
+	spin_lock_irq(&iocg->lock);
 	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
 		/*
 		 * There might be groups which are not functional and
 		 * waiting to be reclaimed upon cgoup deletion.
 		 */
-		if (rcu_dereference(iog->key))
-			disk_time += iog->entity.total_service;
+		if (rcu_dereference(iog->key)) {
+			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev),
+					iog->entity.total_service);
+		}
 	}
+	spin_unlock_irq(&iocg->lock);
 	rcu_read_unlock();
 
-	return disk_time;
+	cgroup_unlock();
+
+	return 0;
 }
 
-static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
-					struct cftype *cftype)
+static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+				struct cftype *cftype, struct seq_file *m)
 {
 	struct io_cgroup *iocg;
-	u64 ret;
+	struct io_group *iog;
+	struct hlist_node *n;
 
 	if (!cgroup_lock_live_group(cgroup))
 		return -ENODEV;
 
 	iocg = cgroup_to_io_cgroup(cgroup);
-	spin_lock_irq(&iocg->lock);
-	ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));
-	spin_unlock_irq(&iocg->lock);
-
-	cgroup_unlock();
-
-	return ret;
-}
-
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr number of sectors transferred by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_sectors(struct io_cgroup *iocg)
-{
-	struct io_group *iog;
-	struct hlist_node *n;
-	u64 disk_sectors = 0;
 
 	rcu_read_lock();
+	spin_lock_irq(&iocg->lock);
 	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
 		/*
 		 * There might be groups which are not functional and
 		 * waiting to be reclaimed upon cgoup deletion.
 		 */
-		if (rcu_dereference(iog->key))
-			disk_sectors += iog->entity.total_sector_service;
+		if (rcu_dereference(iog->key)) {
+			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev),
+					iog->entity.total_sector_service);
+		}
 	}
+	spin_unlock_irq(&iocg->lock);
 	rcu_read_unlock();
 
-	return disk_sectors;
+	cgroup_unlock();
+
+	return 0;
 }
 
-static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
-					struct cftype *cftype)
+static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
+			struct cftype *cftype, struct seq_file *m)
 {
-	struct io_cgroup *iocg;
-	u64 ret;
+	struct io_cgroup *iocg = NULL;
+	struct io_group *iog = NULL;
+	struct hlist_node *n;
 
 	if (!cgroup_lock_live_group(cgroup))
 		return -ENODEV;
 
 	iocg = cgroup_to_io_cgroup(cgroup);
+
+	rcu_read_lock();
 	spin_lock_irq(&iocg->lock);
-	ret = calculate_aggr_disk_sectors(iocg);
+	/* Loop through all the io groups and print statistics */
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgoup deletion.
+		 */
+		if (rcu_dereference(iog->key)) {
+			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev), iog->dequeue);
+		}
+	}
 	spin_unlock_irq(&iocg->lock);
+	rcu_read_unlock();
 
 	cgroup_unlock();
 
-	return ret;
+	return 0;
 }
 
 /**
@@ -1222,7 +1248,7 @@ static u64 io_cgroup_disk_sectors_read(s
  * to the root has already an allocated group on @bfqd.
  */
 struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
-					struct cgroup *cgroup)
+					struct cgroup *cgroup, struct bio *bio)
 {
 	struct io_cgroup *iocg;
 	struct io_group *iog, *leaf = NULL, *prev = NULL;
@@ -1250,8 +1276,13 @@ struct io_group *io_group_chain_alloc(st
 
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
+
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 		iog->iocg_id = css_id(&iocg->css);
+		if (bio) {
+			struct gendisk *disk = bio->bi_bdev->bd_disk;
+			iog->dev = MKDEV(disk->major, disk->first_minor);
+		}
 #endif
 
 		blk_init_request_list(&iog->rl);
@@ -1364,7 +1395,7 @@ void io_group_chain_link(struct request_
  */
 struct io_group *io_find_alloc_group(struct request_queue *q,
 			struct cgroup *cgroup, struct elv_fq_data *efqd,
-			int create)
+			int create, struct bio *bio)
 {
 	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
 	struct io_group *iog = NULL;
@@ -1375,7 +1406,7 @@ struct io_group *io_find_alloc_group(str
 	if (iog != NULL || !create)
 		return iog;
 
-	iog = io_group_chain_alloc(q, key, cgroup);
+	iog = io_group_chain_alloc(q, key, cgroup, bio);
 	if (iog != NULL)
 		io_group_chain_link(q, key, cgroup, iog, efqd);
 
@@ -1481,7 +1512,7 @@ struct io_group *io_get_io_group(struct 
 		goto out;
 	}
 
-	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	iog = io_find_alloc_group(q, cgroup, efqd, create, bio);
 	if (!iog) {
 		if (create)
 			iog = efqd->root_group;
@@ -1554,12 +1585,18 @@ struct cftype bfqio_files[] = {
 	},
 	{
 		.name = "disk_time",
-		.read_u64 = io_cgroup_disk_time_read,
+		.read_seq_string = io_cgroup_disk_time_read,
 	},
 	{
 		.name = "disk_sectors",
-		.read_u64 = io_cgroup_disk_sectors_read,
+		.read_seq_string = io_cgroup_disk_sectors_read,
 	},
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		.name = "disk_dequeue",
+		.read_seq_string = io_cgroup_disk_dequeue_read,
+	},
+#endif
 };
 
 int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-13  2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
  2009-05-13 14:44   ` Vivek Goyal
  2009-05-13 15:29   ` Vivek Goyal
@ 2009-05-13 15:59   ` Vivek Goyal
  2009-05-14  1:51     ` Gui Jianfeng
                       ` (2 more replies)
       [not found]   ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
                     ` (2 subsequent siblings)
  5 siblings, 3 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 15:59 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this 
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and 
> "ioprio_class" are used as default values in this device.
> 
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
> 
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
> 
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/elevator-fq.c |  239 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  block/elevator-fq.h |   11 +++
>  2 files changed, 245 insertions(+), 5 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
>  #include "elevator-fq.h"
>  #include <linux/blktrace_api.h>
>  #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>  
>  /* Values taken from cfq */
>  const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(io_lookup_io_group_current);
>  
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> +					      void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> +			  void *key)
>  {
>  	struct io_entity *entity = &iog->entity;
> +	struct policy_node *pn;
> +
> +	spin_lock_irq(&iocg->lock);
> +	pn = policy_search_node(iocg, key);
> +	if (pn) {
> +		entity->weight = pn->weight;
> +		entity->new_weight = pn->weight;
> +		entity->ioprio_class = pn->ioprio_class;
> +		entity->new_ioprio_class = pn->ioprio_class;
> +	} else {
> +		entity->weight = iocg->weight;
> +		entity->new_weight = iocg->weight;
> +		entity->ioprio_class = iocg->ioprio_class;
> +		entity->new_ioprio_class = iocg->ioprio_class;
> +	}
> +	spin_unlock_irq(&iocg->lock);

Hi Gui,

It might make sense to also store the device name or device major and
minor number in io_group while creating the io group. This will help us
to display the io.disk_time and io.disk_sectors statistics per device instead
of as an aggregate.

I am attaching a patch I was playing around with to display per-device
statistics instead of aggregate ones, in case the user has specified a
per-device rule.

Thanks
Vivek


o Currently the statistics exported through cgroup are an aggregate of the
  statistics on all devices for that cgroup. Instead of an aggregate, make
  these per device.

o Also export another statistic, io.disk_dequeue. This keeps a count of how
  many times a particular group dropped out of the race for the disk. This is
  a debugging aid to keep track of how often we could create continuously
  backlogged queues.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |  127 +++++++++++++++++++++++++++++++++-------------------
 block/elevator-fq.h |    3 +
 2 files changed, 85 insertions(+), 45 deletions(-)

Index: linux14/block/elevator-fq.h
===================================================================
--- linux14.orig/block/elevator-fq.h	2009-05-13 11:40:32.000000000 -0400
+++ linux14/block/elevator-fq.h	2009-05-13 11:40:57.000000000 -0400
@@ -250,6 +250,9 @@ struct io_group {
 
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 	unsigned short iocg_id;
+	dev_t	dev;
+	/* How many times this group has been removed from active tree */
+	unsigned long dequeue;
 #endif
 };
 
Index: linux14/block/elevator-fq.c
===================================================================
--- linux14.orig/block/elevator-fq.c	2009-05-13 11:40:53.000000000 -0400
+++ linux14/block/elevator-fq.c	2009-05-13 11:40:57.000000000 -0400
@@ -12,6 +12,7 @@
 #include "elevator-fq.h"
 #include <linux/blktrace_api.h>
 #include <linux/biotrack.h>
+#include <linux/seq_file.h>
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -758,6 +759,18 @@ int __bfq_deactivate_entity(struct io_en
 	BUG_ON(sd->active_entity == entity);
 	BUG_ON(sd->next_active == entity);
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		struct io_group *iog = io_entity_to_iog(entity);
+		/*
+		 * Keep track of how many times a group has been removed
+		 * from active tree because it did not have any active
+		 * backlogged ioq under it
+		 */
+		if (iog)
+			iog->dequeue++;
+	}
+#endif
 	return ret;
 }
 
@@ -1126,90 +1139,103 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
 STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
 #undef STORE_FUNCTION
 
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr disk time received by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
+static int io_cgroup_disk_time_read(struct cgroup *cgroup,
+				struct cftype *cftype, struct seq_file *m)
 {
+	struct io_cgroup *iocg;
 	struct io_group *iog;
 	struct hlist_node *n;
-	u64 disk_time = 0;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
 
 	rcu_read_lock();
+	spin_lock_irq(&iocg->lock);
 	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
 		/*
 		 * There might be groups which are not functional and
 		 * waiting to be reclaimed upon cgoup deletion.
 		 */
-		if (rcu_dereference(iog->key))
-			disk_time += iog->entity.total_service;
+		if (rcu_dereference(iog->key)) {
+			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev),
+					iog->entity.total_service);
+		}
 	}
+	spin_unlock_irq(&iocg->lock);
 	rcu_read_unlock();
 
-	return disk_time;
+	cgroup_unlock();
+
+	return 0;
 }
 
-static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
-					struct cftype *cftype)
+static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+				struct cftype *cftype, struct seq_file *m)
 {
 	struct io_cgroup *iocg;
-	u64 ret;
+	struct io_group *iog;
+	struct hlist_node *n;
 
 	if (!cgroup_lock_live_group(cgroup))
 		return -ENODEV;
 
 	iocg = cgroup_to_io_cgroup(cgroup);
-	spin_lock_irq(&iocg->lock);
-	ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));
-	spin_unlock_irq(&iocg->lock);
-
-	cgroup_unlock();
-
-	return ret;
-}
-
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr number of sectors transferred by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_sectors(struct io_cgroup *iocg)
-{
-	struct io_group *iog;
-	struct hlist_node *n;
-	u64 disk_sectors = 0;
 
 	rcu_read_lock();
+	spin_lock_irq(&iocg->lock);
 	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
 		/*
 		 * There might be groups which are not functional and
 		 * waiting to be reclaimed upon cgoup deletion.
 		 */
-		if (rcu_dereference(iog->key))
-			disk_sectors += iog->entity.total_sector_service;
+		if (rcu_dereference(iog->key)) {
+			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev),
+					iog->entity.total_sector_service);
+		}
 	}
+	spin_unlock_irq(&iocg->lock);
 	rcu_read_unlock();
 
-	return disk_sectors;
+	cgroup_unlock();
+
+	return 0;
 }
 
-static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
-					struct cftype *cftype)
+static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
+			struct cftype *cftype, struct seq_file *m)
 {
-	struct io_cgroup *iocg;
-	u64 ret;
+	struct io_cgroup *iocg = NULL;
+	struct io_group *iog = NULL;
+	struct hlist_node *n;
 
 	if (!cgroup_lock_live_group(cgroup))
 		return -ENODEV;
 
 	iocg = cgroup_to_io_cgroup(cgroup);
+
+	rcu_read_lock();
 	spin_lock_irq(&iocg->lock);
-	ret = calculate_aggr_disk_sectors(iocg);
+	/* Loop through all the io groups and print statistics */
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgoup deletion.
+		 */
+		if (rcu_dereference(iog->key)) {
+			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev), iog->dequeue);
+		}
+	}
 	spin_unlock_irq(&iocg->lock);
+	rcu_read_unlock();
 
 	cgroup_unlock();
 
-	return ret;
+	return 0;
 }
 
 /**
@@ -1222,7 +1248,7 @@ static u64 io_cgroup_disk_sectors_read(s
  * to the root has already an allocated group on @bfqd.
  */
 struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
-					struct cgroup *cgroup)
+					struct cgroup *cgroup, struct bio *bio)
 {
 	struct io_cgroup *iocg;
 	struct io_group *iog, *leaf = NULL, *prev = NULL;
@@ -1250,8 +1276,13 @@ struct io_group *io_group_chain_alloc(st
 
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
+
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 		iog->iocg_id = css_id(&iocg->css);
+		if (bio) {
+			struct gendisk *disk = bio->bi_bdev->bd_disk;
+			iog->dev = MKDEV(disk->major, disk->first_minor);
+		}
 #endif
 
 		blk_init_request_list(&iog->rl);
@@ -1364,7 +1395,7 @@ void io_group_chain_link(struct request_
  */
 struct io_group *io_find_alloc_group(struct request_queue *q,
 			struct cgroup *cgroup, struct elv_fq_data *efqd,
-			int create)
+			int create, struct bio *bio)
 {
 	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
 	struct io_group *iog = NULL;
@@ -1375,7 +1406,7 @@ struct io_group *io_find_alloc_group(str
 	if (iog != NULL || !create)
 		return iog;
 
-	iog = io_group_chain_alloc(q, key, cgroup);
+	iog = io_group_chain_alloc(q, key, cgroup, bio);
 	if (iog != NULL)
 		io_group_chain_link(q, key, cgroup, iog, efqd);
 
@@ -1481,7 +1512,7 @@ struct io_group *io_get_io_group(struct 
 		goto out;
 	}
 
-	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	iog = io_find_alloc_group(q, cgroup, efqd, create, bio);
 	if (!iog) {
 		if (create)
 			iog = efqd->root_group;
@@ -1554,12 +1585,18 @@ struct cftype bfqio_files[] = {
 	},
 	{
 		.name = "disk_time",
-		.read_u64 = io_cgroup_disk_time_read,
+		.read_seq_string = io_cgroup_disk_time_read,
 	},
 	{
 		.name = "disk_sectors",
-		.read_u64 = io_cgroup_disk_sectors_read,
+		.read_seq_string = io_cgroup_disk_sectors_read,
 	},
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		.name = "disk_dequeue",
+		.read_seq_string = io_cgroup_disk_dequeue_read,
+	},
+#endif
 };
 
 int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]   ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
                       ` (2 preceding siblings ...)
  2009-05-13 15:59     ` Vivek Goyal
@ 2009-05-13 17:17     ` Vivek Goyal
  2009-05-13 19:09     ` Vivek Goyal
  4 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 17:17 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this 
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and 
> "ioprio_class" are used as default values in this device.
> 
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
> 
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
> 
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
>  block/elevator-fq.c |  239 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  block/elevator-fq.h |   11 +++
>  2 files changed, 245 insertions(+), 5 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
>  #include "elevator-fq.h"
>  #include <linux/blktrace_api.h>
>  #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>  
>  /* Values taken from cfq */
>  const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(io_lookup_io_group_current);
>  
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> +					      void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> +			  void *key)
>  {
>  	struct io_entity *entity = &iog->entity;
> +	struct policy_node *pn;
> +
> +	spin_lock_irq(&iocg->lock);
> +	pn = policy_search_node(iocg, key);
> +	if (pn) {
> +		entity->weight = pn->weight;
> +		entity->new_weight = pn->weight;
> +		entity->ioprio_class = pn->ioprio_class;
> +		entity->new_ioprio_class = pn->ioprio_class;
> +	} else {
> +		entity->weight = iocg->weight;
> +		entity->new_weight = iocg->weight;
> +		entity->ioprio_class = iocg->ioprio_class;
> +		entity->new_ioprio_class = iocg->ioprio_class;
> +	}
> +	spin_unlock_irq(&iocg->lock);
>  

I think we need to use the spin_lock_irqsave() and spin_unlock_irqrestore()
variants above, because this can be called with the request queue lock held and
we don't want to enable interrupts unconditionally here.
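
A minimal sketch of what that would look like in io_group_init_entity(), based
on the hunk quoted above (illustrative only, not a tested patch):

void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
			  void *key)
{
	struct io_entity *entity = &iog->entity;
	struct policy_node *pn;
	unsigned long flags;

	/* Preserve the caller's IRQ state instead of unconditionally
	 * re-enabling interrupts on unlock. */
	spin_lock_irqsave(&iocg->lock, flags);
	pn = policy_search_node(iocg, key);
	if (pn) {
		entity->weight = pn->weight;
		entity->new_weight = pn->weight;
		entity->ioprio_class = pn->ioprio_class;
		entity->new_ioprio_class = pn->ioprio_class;
	} else {
		entity->weight = iocg->weight;
		entity->new_weight = iocg->weight;
		entity->ioprio_class = iocg->ioprio_class;
		entity->new_ioprio_class = iocg->ioprio_class;
	}
	spin_unlock_irqrestore(&iocg->lock, flags);

	entity->ioprio_changed = 1;
	entity->my_sched_data = &iog->sched_data;
}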

I hit the following lock validator warning:

 
[   81.521242] =================================
[   81.522127] [ INFO: inconsistent lock state ]
[   81.522127] 2.6.30-rc4-ioc #47
[   81.522127] ---------------------------------
[   81.522127] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[   81.522127] io-group-bw-tes/4138 [HC0[0]:SC0[0]:HE1:SE1] takes:
[   81.522127]  (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
[   81.522127] {IN-SOFTIRQ-W} state was registered at:
[   81.522127]   [<ffffffffffffffff>] 0xffffffffffffffff
[   81.522127] irq event stamp: 1006
[   81.522127] hardirqs last  enabled at (1005): [<ffffffff810c1198>] kmem_cache_alloc+0x9d/0x105
[   81.522127] hardirqs last disabled at (1006): [<ffffffff8150343f>] _spin_lock_irq+0x12/0x3e
[   81.522127] softirqs last  enabled at (286): [<ffffffff81042039>] __do_softirq+0x17a/0x187
[   81.522127] softirqs last disabled at (271): [<ffffffff8100ccfc>] call_softirq+0x1c/0x34
[   81.522127] 
[   81.522127] other info that might help us debug this:
[   81.522127] 3 locks held by io-group-bw-tes/4138:
[   81.522127]  #0:  (&type->i_mutex_dir_key#4){+.+.+.}, at: [<ffffffff810cfd2c>] do_lookup+0x82/0x15f
[   81.522127]  #1:  (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
[   81.522127]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811e55bb>] __rcu_read_lock+0x0/0x30
[   81.522127] 
[   81.522127] stack backtrace:
[   81.522127] Pid: 4138, comm: io-group-bw-tes Not tainted 2.6.30-rc4-ioc #47
[   81.522127] Call Trace:
[   81.522127]  [<ffffffff8105edad>] valid_state+0x17c/0x18f
[   81.522127]  [<ffffffff8105eb8a>] ? check_usage_backwards+0x0/0x52
[   81.522127]  [<ffffffff8105ee9b>] mark_lock+0xdb/0x1ff
[   81.522127]  [<ffffffff8105f00c>] mark_held_locks+0x4d/0x6b
[   81.522127]  [<ffffffff8150331a>] ? _spin_unlock_irq+0x2b/0x31
[   81.522127]  [<ffffffff8105f13e>] trace_hardirqs_on_caller+0x114/0x138
[   81.522127]  [<ffffffff8105f16f>] trace_hardirqs_on+0xd/0xf
[   81.522127]  [<ffffffff8150331a>] _spin_unlock_irq+0x2b/0x31
[   81.522127]  [<ffffffff811e5534>] ? io_group_init_entity+0x2a/0xb1
[   81.522127]  [<ffffffff811e5597>] io_group_init_entity+0x8d/0xb1
[   81.522127]  [<ffffffff811e688e>] ? io_group_chain_alloc+0x49/0x167
[   81.522127]  [<ffffffff811e68fe>] io_group_chain_alloc+0xb9/0x167
[   81.522127]  [<ffffffff811e6a04>] io_find_alloc_group+0x58/0x85
[   81.522127]  [<ffffffff811e6aec>] io_get_io_group+0x6e/0x94
[   81.522127]  [<ffffffff811e6d8c>] io_group_get_request_list+0x10/0x21
[   81.522127]  [<ffffffff811d7021>] blk_get_request_list+0x9/0xb
[   81.522127]  [<ffffffff811d7ab0>] get_request_wait+0x132/0x17b
[   81.522127]  [<ffffffff811d7dc1>] __make_request+0x2c8/0x396
[   81.522127]  [<ffffffff811d6806>] generic_make_request+0x1f2/0x28c
[   81.522127]  [<ffffffff810e9ee7>] ? bio_init+0x18/0x32
[   81.522127]  [<ffffffff811d8019>] submit_bio+0xb1/0xbc
[   81.522127]  [<ffffffff810e61c1>] submit_bh+0xfb/0x11e
[   81.522127]  [<ffffffff8111f554>] __ext3_get_inode_loc+0x263/0x2c2
[   81.522127]  [<ffffffff81122286>] ext3_iget+0x69/0x399
[   81.522127]  [<ffffffff81125b92>] ext3_lookup+0x81/0xd0
[   81.522127]  [<ffffffff810cfd81>] do_lookup+0xd7/0x15f
[   81.522127]  [<ffffffff810d15c2>] __link_path_walk+0x319/0x67f
[   81.522127]  [<ffffffff810d1976>] path_walk+0x4e/0x97
[   81.522127]  [<ffffffff810d1b48>] do_path_lookup+0x115/0x15a
[   81.522127]  [<ffffffff810d0fec>] ? getname+0x19d/0x1bf
[   81.522127]  [<ffffffff810d252a>] user_path_at+0x52/0x8c
[   81.522127]  [<ffffffff811ee668>] ? __up_read+0x1c/0x8c
[   81.522127]  [<ffffffff8150379b>] ? _spin_unlock_irqrestore+0x3f/0x47
[   81.522127]  [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
[   81.522127]  [<ffffffff810cb6c1>] vfs_fstatat+0x35/0x62
[   81.522127]  [<ffffffff811ee6d0>] ? __up_read+0x84/0x8c
[   81.522127]  [<ffffffff810cb7bb>] vfs_stat+0x16/0x18
[   81.522127]  [<ffffffff810cb7d7>] sys_newstat+0x1a/0x34
[   81.522127]  [<ffffffff8100c5e9>] ? retint_swapgs+0xe/0x13
[   81.522127]  [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
[   81.522127]  [<ffffffff8107f771>] ? audit_syscall_entry+0xfe/0x12a
[   81.522127]  [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-13  2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
                     ` (3 preceding siblings ...)
       [not found]   ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-13 17:17   ` Vivek Goyal
       [not found]     ` <20090513171734.GA18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-14  1:24     ` Gui Jianfeng
  2009-05-13 19:09   ` Vivek Goyal
  5 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 17:17 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this 
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and 
> "ioprio_class" are used as default values in this device.
> 
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
> 
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
> 
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/elevator-fq.c |  239 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  block/elevator-fq.h |   11 +++
>  2 files changed, 245 insertions(+), 5 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
>  #include "elevator-fq.h"
>  #include <linux/blktrace_api.h>
>  #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>  
>  /* Values taken from cfq */
>  const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(io_lookup_io_group_current);
>  
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> +					      void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> +			  void *key)
>  {
>  	struct io_entity *entity = &iog->entity;
> +	struct policy_node *pn;
> +
> +	spin_lock_irq(&iocg->lock);
> +	pn = policy_search_node(iocg, key);
> +	if (pn) {
> +		entity->weight = pn->weight;
> +		entity->new_weight = pn->weight;
> +		entity->ioprio_class = pn->ioprio_class;
> +		entity->new_ioprio_class = pn->ioprio_class;
> +	} else {
> +		entity->weight = iocg->weight;
> +		entity->new_weight = iocg->weight;
> +		entity->ioprio_class = iocg->ioprio_class;
> +		entity->new_ioprio_class = iocg->ioprio_class;
> +	}
> +	spin_unlock_irq(&iocg->lock);
>  

I think we need to use the spin_lock_irqsave() and spin_unlock_irqrestore()
variants above, because this can be called with the request queue lock held and
we don't want to enable interrupts unconditionally here.

I hit the following lock validator warning:

 
[   81.521242] =================================
[   81.522127] [ INFO: inconsistent lock state ]
[   81.522127] 2.6.30-rc4-ioc #47
[   81.522127] ---------------------------------
[   81.522127] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[   81.522127] io-group-bw-tes/4138 [HC0[0]:SC0[0]:HE1:SE1] takes:
[   81.522127]  (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
[   81.522127] {IN-SOFTIRQ-W} state was registered at:
[   81.522127]   [<ffffffffffffffff>] 0xffffffffffffffff
[   81.522127] irq event stamp: 1006
[   81.522127] hardirqs last  enabled at (1005): [<ffffffff810c1198>] kmem_cache_alloc+0x9d/0x105
[   81.522127] hardirqs last disabled at (1006): [<ffffffff8150343f>] _spin_lock_irq+0x12/0x3e
[   81.522127] softirqs last  enabled at (286): [<ffffffff81042039>] __do_softirq+0x17a/0x187
[   81.522127] softirqs last disabled at (271): [<ffffffff8100ccfc>] call_softirq+0x1c/0x34
[   81.522127] 
[   81.522127] other info that might help us debug this:
[   81.522127] 3 locks held by io-group-bw-tes/4138:
[   81.522127]  #0:  (&type->i_mutex_dir_key#4){+.+.+.}, at: [<ffffffff810cfd2c>] do_lookup+0x82/0x15f
[   81.522127]  #1:  (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
[   81.522127]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811e55bb>] __rcu_read_lock+0x0/0x30
[   81.522127] 
[   81.522127] stack backtrace:
[   81.522127] Pid: 4138, comm: io-group-bw-tes Not tainted 2.6.30-rc4-ioc #47
[   81.522127] Call Trace:
[   81.522127]  [<ffffffff8105edad>] valid_state+0x17c/0x18f
[   81.522127]  [<ffffffff8105eb8a>] ? check_usage_backwards+0x0/0x52
[   81.522127]  [<ffffffff8105ee9b>] mark_lock+0xdb/0x1ff
[   81.522127]  [<ffffffff8105f00c>] mark_held_locks+0x4d/0x6b
[   81.522127]  [<ffffffff8150331a>] ? _spin_unlock_irq+0x2b/0x31
[   81.522127]  [<ffffffff8105f13e>] trace_hardirqs_on_caller+0x114/0x138
[   81.522127]  [<ffffffff8105f16f>] trace_hardirqs_on+0xd/0xf
[   81.522127]  [<ffffffff8150331a>] _spin_unlock_irq+0x2b/0x31
[   81.522127]  [<ffffffff811e5534>] ? io_group_init_entity+0x2a/0xb1
[   81.522127]  [<ffffffff811e5597>] io_group_init_entity+0x8d/0xb1
[   81.522127]  [<ffffffff811e688e>] ? io_group_chain_alloc+0x49/0x167
[   81.522127]  [<ffffffff811e68fe>] io_group_chain_alloc+0xb9/0x167
[   81.522127]  [<ffffffff811e6a04>] io_find_alloc_group+0x58/0x85
[   81.522127]  [<ffffffff811e6aec>] io_get_io_group+0x6e/0x94
[   81.522127]  [<ffffffff811e6d8c>] io_group_get_request_list+0x10/0x21
[   81.522127]  [<ffffffff811d7021>] blk_get_request_list+0x9/0xb
[   81.522127]  [<ffffffff811d7ab0>] get_request_wait+0x132/0x17b
[   81.522127]  [<ffffffff811d7dc1>] __make_request+0x2c8/0x396
[   81.522127]  [<ffffffff811d6806>] generic_make_request+0x1f2/0x28c
[   81.522127]  [<ffffffff810e9ee7>] ? bio_init+0x18/0x32
[   81.522127]  [<ffffffff811d8019>] submit_bio+0xb1/0xbc
[   81.522127]  [<ffffffff810e61c1>] submit_bh+0xfb/0x11e
[   81.522127]  [<ffffffff8111f554>] __ext3_get_inode_loc+0x263/0x2c2
[   81.522127]  [<ffffffff81122286>] ext3_iget+0x69/0x399
[   81.522127]  [<ffffffff81125b92>] ext3_lookup+0x81/0xd0
[   81.522127]  [<ffffffff810cfd81>] do_lookup+0xd7/0x15f
[   81.522127]  [<ffffffff810d15c2>] __link_path_walk+0x319/0x67f
[   81.522127]  [<ffffffff810d1976>] path_walk+0x4e/0x97
[   81.522127]  [<ffffffff810d1b48>] do_path_lookup+0x115/0x15a
[   81.522127]  [<ffffffff810d0fec>] ? getname+0x19d/0x1bf
[   81.522127]  [<ffffffff810d252a>] user_path_at+0x52/0x8c
[   81.522127]  [<ffffffff811ee668>] ? __up_read+0x1c/0x8c
[   81.522127]  [<ffffffff8150379b>] ? _spin_unlock_irqrestore+0x3f/0x47
[   81.522127]  [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
[   81.522127]  [<ffffffff810cb6c1>] vfs_fstatat+0x35/0x62
[   81.522127]  [<ffffffff811ee6d0>] ? __up_read+0x84/0x8c
[   81.522127]  [<ffffffff810cb7bb>] vfs_stat+0x16/0x18
[   81.522127]  [<ffffffff810cb7d7>] sys_newstat+0x1a/0x34
[   81.522127]  [<ffffffff8100c5e9>] ? retint_swapgs+0xe/0x13
[   81.522127]  [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
[   81.522127]  [<ffffffff8107f771>] ? audit_syscall_entry+0xfe/0x12a
[   81.522127]  [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b

Thanks
Vivek


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]   ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
                       ` (3 preceding siblings ...)
  2009-05-13 17:17     ` Vivek Goyal
@ 2009-05-13 19:09     ` Vivek Goyal
  4 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 19:09 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this 
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and 
> "ioprio_class" are used as default values in this device.
> 
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
> 
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
> 
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 

Hi Gui,

I noticed a few things during testing.

1. Writing 0 as the weight is not removing the policy for me if I switch
   the IO scheduler on the device.

	- echo "/dev/sdb:500:2" > io.policy
	- Change elevator on device /sdb
	- echo "/dev/sdb:0:2" > io.policy
	- cat io.policy
	  The old rule has not gone away.

2. One can add the same rule twice after changing the elevator.

	- echo "/dev/sdb:500:2" > io.policy
	- Change elevator on device /sdb
	- echo "/dev/sdb:500:2" > io.policy
	- cat io.policy

	Same rule appears twice

3. If one writes to io.weight, it should not update the weight for a
   device if there is already a rule for that device. For example, if a
   cgroup has io.weight=1000 and later I set the weight on /dev/sdb to
   500 and then change io.weight to 200, the groups on /dev/sdb should not
   be updated. Why? Because I think it makes more sense to keep the simple
   rule that as long as there is a rule for a device, it always overrides
   the generic io.weight setting.

4. A malformed rule should return an invalid-value error; instead we see an oops.

   - echo "/dev/sdb:0:" > io.policy

[ 2651.587533] BUG: unable to handle kernel NULL pointer dereference at
(null)
[ 2651.588301] IP: [<ffffffff811f035c>] strict_strtoul+0x24/0x79
[ 2651.588301] PGD 38c33067 PUD 38d67067 PMD 0 
[ 2651.588301] Oops: 0000 [#2] SMP 
[ 2651.588301] last sysfs file:
/sys/devices/pci0000:00/0000:00:1c.0/0000:0e:00.0/irq
[ 2651.588301] CPU 2 
[ 2651.588301] Modules linked in:
[ 2651.588301] Pid: 4538, comm: bash Tainted: G      D    2.6.30-rc4-ioc
#52 HP xw6600 Workstation
[ 2651.588301] RIP: 0010:[<ffffffff811f035c>]  [<ffffffff811f035c>]
strict_strtoul+0x24/0x79
[ 2651.588301] RSP: 0018:ffff88003dd73dc0  EFLAGS: 00010286
[ 2651.588301] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffffffffffffffff
[ 2651.588301] RDX: ffff88003e9ffca0 RSI: 000000000000000a RDI:
0000000000000000
[ 2651.588301] RBP: ffff88003dd73de8 R08: 000000000000000a R09:
ffff88003dd73cf8
[ 2651.588301] R10: ffff88003dcd2300 R11: ffffffff8178aa00 R12:
ffff88003f4a1e00
[ 2651.588301] R13: ffff88003e9ffca0 R14: ffff88003ac5f200 R15:
ffff88003fa7ed40
[ 2651.588301] FS:  00007ff971c466f0(0000) GS:ffff88000209c000(0000)
knlGS:0000000000000000
[ 2651.588301] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2651.588301] CR2: 0000000000000000 CR3: 000000003ad0d000 CR4:
00000000000006e0
[ 2651.588301] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 2651.588301] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 2651.588301] Process bash (pid: 4538, threadinfo ffff88003dd72000, task
ffff880038d98000)
[ 2651.588301] Stack:
[ 2651.588301]  ffffffff810d8f23 ffff88003fa7ed4a ffff88003dcdeee0
ffff88003f4a1e00
[ 2651.588301]  ffff88003e9ffc60 ffff88003dd73e68 ffffffff811e8097
ffff880038dd2780
[ 2651.588301]  ffff88003dd73e48 ffff88003fa7ed40 ffff88003fa7ed49
0000000000000000
[ 2651.588301] Call Trace:
[ 2651.588301]  [<ffffffff810d8f23>] ? iput+0x2f/0x65
[ 2651.588301]  [<ffffffff811e8097>] io_cgroup_policy_write+0x11d/0x2ac
[ 2651.588301]  [<ffffffff81072dee>] cgroup_file_write+0x1ec/0x254
[ 2651.588301]  [<ffffffff811afce8>] ? security_file_permission+0x11/0x13
[ 2651.588301]  [<ffffffff810c8394>] vfs_write+0xab/0x105
[ 2651.588301]  [<ffffffff810c84a8>] sys_write+0x47/0x6c
[ 2651.588301]  [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
[ 2651.588301] Code: 65 ff ff ff 5b c9 c3 55 48 83 c9 ff 31 c0 fc 48 89 e5
41 55 41 89 f0 49 89 d5 41 54 53 48 89 fb 48 83 ec 10 48 c7 02 00 00 00 00
<f2> ae 48 f7 d1 49 89 cc 49 ff cc 74 39 48 8d 75 e0 44 89 c2 48 
[ 2651.588301] RIP  [<ffffffff811f035c>] strict_strtoul+0x24/0x79
[ 2651.588301]  RSP <ffff88003dd73dc0>
[ 2651.588301] CR2: 0000000000000000
[ 2651.828110] ---[ end trace 537b9a98ce297f01 ]---
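
FWIW, the oops above seems to come from policy_parse_and_set() handing a NULL
token to strict_strtoul() when one of the three fields is missing. Below is a
minimal sketch of the kind of early check that would turn this into -EINVAL
instead (illustrative only, written against the parsing code quoted further
down):

static int policy_parse_and_set(char *buf, struct policy_node *newpn)
{
	char *s[3];
	char *p;
	int ret;
	int i = 0;

	memset(s, 0, sizeof(s));
	while (i < ARRAY_SIZE(s)) {
		p = strsep(&buf, ":");
		if (!p)
			break;
		if (!*p)
			continue;
		s[i++] = p;
	}

	/* Reject the rule if any of the three "dev:weight:class" fields is
	 * missing, instead of passing a NULL pointer to strict_strtoul(). */
	if (!s[0] || !s[1] || !s[2])
		return -EINVAL;

	newpn->key = devname_to_efqd(s[0]);
	if (!newpn->key)
		return -EINVAL;

	strcpy(newpn->dev_name, s[0]);

	ret = strict_strtoul(s[1], 10, &newpn->weight);
	if (ret || newpn->weight > WEIGHT_MAX)
		return -EINVAL;

	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
		return -EINVAL;

	return 0;
}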

Thanks
Vivek

> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
>  block/elevator-fq.c |  239 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  block/elevator-fq.h |   11 +++
>  2 files changed, 245 insertions(+), 5 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
>  #include "elevator-fq.h"
>  #include <linux/blktrace_api.h>
>  #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>  
>  /* Values taken from cfq */
>  const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(io_lookup_io_group_current);
>  
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> +					      void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> +			  void *key)
>  {
>  	struct io_entity *entity = &iog->entity;
> +	struct policy_node *pn;
> +
> +	spin_lock_irq(&iocg->lock);
> +	pn = policy_search_node(iocg, key);
> +	if (pn) {
> +		entity->weight = pn->weight;
> +		entity->new_weight = pn->weight;
> +		entity->ioprio_class = pn->ioprio_class;
> +		entity->new_ioprio_class = pn->ioprio_class;
> +	} else {
> +		entity->weight = iocg->weight;
> +		entity->new_weight = iocg->weight;
> +		entity->ioprio_class = iocg->ioprio_class;
> +		entity->new_ioprio_class = iocg->ioprio_class;
> +	}
> +	spin_unlock_irq(&iocg->lock);
>  
> -	entity->weight = entity->new_weight = iocg->weight;
> -	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
>  	entity->ioprio_changed = 1;
>  	entity->my_sched_data = &iog->sched_data;
>  }
> @@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
>  		atomic_set(&iog->ref, 0);
>  		iog->deleting = 0;
>  
> -		io_group_init_entity(iocg, iog);
> +		io_group_init_entity(iocg, iog, key);
>  		iog->my_entity = &iog->entity;
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>  		iog->iocg_id = css_id(&iocg->css);
> @@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
>  	return iog;
>  }
>  
> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
> +				  struct seq_file *m)
> +{
> +	struct io_cgroup *iocg;
> +	struct policy_node *pn;
> +
> +	iocg = cgroup_to_io_cgroup(cgrp);
> +
> +	if (list_empty(&iocg->list))
> +		goto out;
> +
> +	seq_printf(m, "dev weight class\n");
> +
> +	spin_lock_irq(&iocg->lock);
> +	list_for_each_entry(pn, &iocg->list, node) {
> +		seq_printf(m, "%s %lu %lu\n", pn->dev_name,
> +			   pn->weight, pn->ioprio_class);
> +	}
> +	spin_unlock_irq(&iocg->lock);
> +out:
> +	return 0;
> +}
> +
> +static inline void policy_insert_node(struct io_cgroup *iocg,
> +					  struct policy_node *pn)
> +{
> +	list_add(&pn->node, &iocg->list);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static inline void policy_delete_node(struct policy_node *pn)
> +{
> +	list_del(&pn->node);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> +					      void *key)
> +{
> +	struct policy_node *pn;
> +
> +	if (list_empty(&iocg->list))
> +		return NULL;
> +
> +	list_for_each_entry(pn, &iocg->list, node) {
> +		if (pn->key == key)
> +			return pn;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void *devname_to_efqd(const char *buf)
> +{
> +	struct block_device *bdev;
> +	void *key = NULL;
> +	struct gendisk *disk;
> +	int part;
> +
> +	bdev = lookup_bdev(buf);
> +	if (IS_ERR(bdev))
> +		return NULL;
> +
> +	disk = get_gendisk(bdev->bd_dev, &part);
> +	key = (void *)&disk->queue->elevator->efqd;
> +	bdput(bdev);
> +
> +	return key;
> +}
> +
> +static int policy_parse_and_set(char *buf, struct policy_node *newpn)
> +{
> +	char *s[3];
> +	char *p;
> +	int ret;
> +	int i = 0;
> +
> +	memset(s, 0, sizeof(s));
> +	while (i < ARRAY_SIZE(s)) {
> +		p = strsep(&buf, ":");
> +		if (!p)
> +			break;
> +		if (!*p)
> +			continue;
> +		s[i++] = p;
> +	}
> +
> +	newpn->key = devname_to_efqd(s[0]);
> +	if (!newpn->key)
> +		return -EINVAL;
> +
> +	strcpy(newpn->dev_name, s[0]);
> +
> +	ret = strict_strtoul(s[1], 10, &newpn->weight);
> +	if (ret || newpn->weight > WEIGHT_MAX)
> +		return -EINVAL;
> +
> +	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
> +	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
> +	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
> +			    const char *buffer)
> +{
> +	struct io_cgroup *iocg;
> +	struct policy_node *newpn, *pn;
> +	char *buf;
> +	int ret = 0;
> +	int keep_newpn = 0;
> +	struct hlist_node *n;
> +	struct io_group *iog;
> +
> +	buf = kstrdup(buffer, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
> +	if (!newpn) {
> +		ret = -ENOMEM;
> +		goto free_buf;
> +	}
> +
> +	ret = policy_parse_and_set(buf, newpn);
> +	if (ret)
> +		goto free_newpn;
> +
> +	if (!cgroup_lock_live_group(cgrp)) {
> +		ret = -ENODEV;
> +		goto free_newpn;
> +	}
> +
> +	iocg = cgroup_to_io_cgroup(cgrp);
> +	spin_lock_irq(&iocg->lock);
> +
> +	pn = policy_search_node(iocg, newpn->key);
> +	if (!pn) {
> +		if (newpn->weight != 0) {
> +			policy_insert_node(iocg, newpn);
> +			keep_newpn = 1;
> +		}
> +		goto update_io_group;
> +	}
> +
> +	if (newpn->weight == 0) {
> +		/* weight == 0 means deleting a policy */
> +		policy_delete_node(pn);
> +		goto update_io_group;
> +	}
> +
> +	pn->weight = newpn->weight;
> +	pn->ioprio_class = newpn->ioprio_class;
> +
> +update_io_group:
> +	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
> +		if (iog->key == newpn->key) {
> +			if (newpn->weight) {
> +				iog->entity.new_weight = newpn->weight;
> +				iog->entity.new_ioprio_class =
> +					newpn->ioprio_class;
> +				/*
> +				 * iog weight and ioprio_class updating
> +				 * actually happens if ioprio_changed is set.
> +				 * So ensure ioprio_changed is not set until
> +				 * new weight and new ioprio_class are updated.
> +				 */
> +				smp_wmb();
> +				iog->entity.ioprio_changed = 1;
> +			} else {
> +				iog->entity.new_weight = iocg->weight;
> +				iog->entity.new_ioprio_class =
> +					iocg->ioprio_class;
> +
> +				/* The same as above */
> +				smp_wmb();
> +				iog->entity.ioprio_changed = 1;
> +			}
> +		}
> +	}
> +	spin_unlock_irq(&iocg->lock);
> +
> +	cgroup_unlock();
> +
> +free_newpn:
> +	if (!keep_newpn)
> +		kfree(newpn);
> +free_buf:
> +	kfree(buf);
> +	return ret;
> +}
> +
>  struct cftype bfqio_files[] = {
>  	{
> +		.name = "policy",
> +		.read_seq_string = io_cgroup_policy_read,
> +		.write_string = io_cgroup_policy_write,
> +		.max_write_len = 256,
> +	},
> +	{
>  		.name = "weight",
>  		.read_u64 = io_cgroup_weight_read,
>  		.write_u64 = io_cgroup_weight_write,
> @@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>  	INIT_HLIST_HEAD(&iocg->group_data);
>  	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>  	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
> +	INIT_LIST_HEAD(&iocg->list);
>  
>  	return &iocg->css;
>  }
> @@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>  	unsigned long flags, flags1;
>  	int queue_lock_held = 0;
>  	struct elv_fq_data *efqd;
> +	struct policy_node *pn, *pntmp;
>  
>  	/*
>  	 * io groups are linked in two lists. One list is maintained
> @@ -1823,6 +2046,12 @@ locked:
>  	BUG_ON(!hlist_empty(&iocg->group_data));
>  
>  	free_css_id(&io_subsys, &iocg->css);
> +
> +	list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
> +		policy_delete_node(pn);
> +		kfree(pn);
> +	}
> +
>  	kfree(iocg);
>  }
>  
> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>  void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>  {
>  	entity->ioprio = entity->new_ioprio;
> -	entity->weight = entity->new_weight;
> +	entity->weight = entity->new_weigh;
>  	entity->ioprio_class = entity->new_ioprio_class;
>  	entity->sched_data = &iog->sched_data;
>  }
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..0407633 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -253,6 +253,14 @@ struct io_group {
>  #endif
>  };
>  
> +struct policy_node {
> +	struct list_head node;
> +	char dev_name[32];
> +	void *key;
> +	unsigned long weight;
> +	unsigned long ioprio_class;
> +};
> +
>  /**
>   * struct bfqio_cgroup - bfq cgroup data structure.
>   * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +277,9 @@ struct io_cgroup {
>  
>  	unsigned long weight, ioprio_class;
>  
> +	/* list of policy_node */
> +	struct list_head list;
> +
>  	spinlock_t lock;
>  	struct hlist_head group_data;
>  };
> -- 
> 1.5.4.rc3
> 
> 

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-13  2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
                     ` (4 preceding siblings ...)
  2009-05-13 17:17   ` Vivek Goyal
@ 2009-05-13 19:09   ` Vivek Goyal
  2009-05-14  1:35     ` Gui Jianfeng
                       ` (2 more replies)
  5 siblings, 3 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-13 19:09 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this 
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as the default values for that device.
> 
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
> 
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
> 
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 

Hi Gui,

Noticed a few things during testing.

1. Writing 0 as the weight does not remove the policy for me if I switch
   the IO scheduler on the device.

	- echo "/dev/sdb:500:2" > io.policy
	- Change elevator on device /sdb
	- echo "/dev/sdb:0:2" > io.policy
	- cat io.policy
	  The old rule does not go away.

2. One can add the same rule twice after changing the elevator.

	- echo "/dev/sdb:500:2" > io.policy
	- Change elevator on device /sdb
	- echo "/dev/sdb:500:2" > io.policy
	- cat io.policy

	Same rule appears twice

3. If one writes to io.weight, it should not update the weight for a
   device if there is already a rule for that device. For example, if a
   cgroup has io.weight=1000 and I later set the weight on /dev/sdb to
   500 and then change io.weight to 200, the groups on /dev/sdb should
   not be updated. Why? Because I think it makes more sense to keep the
   simple rule that as long as there is a rule for a device, it always
   overrides the generic setting of io.weight.

4. A wrong rule should return an invalid-value error; instead we see an oops.

   - echo "/dev/sdb:0:" > io.policy

[ 2651.587533] BUG: unable to handle kernel NULL pointer dereference at
(null)
[ 2651.588301] IP: [<ffffffff811f035c>] strict_strtoul+0x24/0x79
[ 2651.588301] PGD 38c33067 PUD 38d67067 PMD 0 
[ 2651.588301] Oops: 0000 [#2] SMP 
[ 2651.588301] last sysfs file:
/sys/devices/pci0000:00/0000:00:1c.0/0000:0e:00.0/irq
[ 2651.588301] CPU 2 
[ 2651.588301] Modules linked in:
[ 2651.588301] Pid: 4538, comm: bash Tainted: G      D    2.6.30-rc4-ioc
#52 HP xw6600 Workstation
[ 2651.588301] RIP: 0010:[<ffffffff811f035c>]  [<ffffffff811f035c>]
strict_strtoul+0x24/0x79
[ 2651.588301] RSP: 0018:ffff88003dd73dc0  EFLAGS: 00010286
[ 2651.588301] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
ffffffffffffffff
[ 2651.588301] RDX: ffff88003e9ffca0 RSI: 000000000000000a RDI:
0000000000000000
[ 2651.588301] RBP: ffff88003dd73de8 R08: 000000000000000a R09:
ffff88003dd73cf8
[ 2651.588301] R10: ffff88003dcd2300 R11: ffffffff8178aa00 R12:
ffff88003f4a1e00
[ 2651.588301] R13: ffff88003e9ffca0 R14: ffff88003ac5f200 R15:
ffff88003fa7ed40
[ 2651.588301] FS:  00007ff971c466f0(0000) GS:ffff88000209c000(0000)
knlGS:0000000000000000
[ 2651.588301] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2651.588301] CR2: 0000000000000000 CR3: 000000003ad0d000 CR4:
00000000000006e0
[ 2651.588301] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 2651.588301] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 2651.588301] Process bash (pid: 4538, threadinfo ffff88003dd72000, task
ffff880038d98000)
[ 2651.588301] Stack:
[ 2651.588301]  ffffffff810d8f23 ffff88003fa7ed4a ffff88003dcdeee0
ffff88003f4a1e00
[ 2651.588301]  ffff88003e9ffc60 ffff88003dd73e68 ffffffff811e8097
ffff880038dd2780
[ 2651.588301]  ffff88003dd73e48 ffff88003fa7ed40 ffff88003fa7ed49
0000000000000000
[ 2651.588301] Call Trace:
[ 2651.588301]  [<ffffffff810d8f23>] ? iput+0x2f/0x65
[ 2651.588301]  [<ffffffff811e8097>] io_cgroup_policy_write+0x11d/0x2ac
[ 2651.588301]  [<ffffffff81072dee>] cgroup_file_write+0x1ec/0x254
[ 2651.588301]  [<ffffffff811afce8>] ? security_file_permission+0x11/0x13
[ 2651.588301]  [<ffffffff810c8394>] vfs_write+0xab/0x105
[ 2651.588301]  [<ffffffff810c84a8>] sys_write+0x47/0x6c
[ 2651.588301]  [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
[ 2651.588301] Code: 65 ff ff ff 5b c9 c3 55 48 83 c9 ff 31 c0 fc 48 89 e5
41 55 41 89 f0 49 89 d5 41 54 53 48 89 fb 48 83 ec 10 48 c7 02 00 00 00 00
<f2> ae 48 f7 d1 49 89 cc 49 ff cc 74 39 48 8d 75 e0 44 89 c2 48 
[ 2651.588301] RIP  [<ffffffff811f035c>] strict_strtoul+0x24/0x79
[ 2651.588301]  RSP <ffff88003dd73dc0>
[ 2651.588301] CR2: 0000000000000000
[ 2651.828110] ---[ end trace 537b9a98ce297f01 ]---
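
Looks like "/dev/sdb:0:" leaves s[2] NULL and that NULL goes straight into
strict_strtoul(). A minimal sketch of a more defensive policy_parse_and_set()
(same fields and helpers as in your patch; the only new bits are rejecting the
rule unless all three fields are present and bounding the device name):

static int policy_parse_and_set(char *buf, struct policy_node *newpn)
{
	char *s[3], *p;
	int ret, i = 0;

	memset(s, 0, sizeof(s));
	while (i < ARRAY_SIZE(s)) {
		p = strsep(&buf, ":");
		if (!p)
			break;
		if (!*p)
			continue;
		s[i++] = p;
	}

	/* Reject the rule unless dev, weight and class are all present. */
	if (i != ARRAY_SIZE(s))
		return -EINVAL;

	newpn->key = devname_to_efqd(s[0]);
	if (!newpn->key)
		return -EINVAL;

	if (strlen(s[0]) >= sizeof(newpn->dev_name))
		return -EINVAL;
	strcpy(newpn->dev_name, s[0]);

	ret = strict_strtoul(s[1], 10, &newpn->weight);
	if (ret || newpn->weight > WEIGHT_MAX)
		return -EINVAL;

	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
		return -EINVAL;

	return 0;
}

With that, a malformed rule comes back to the writer as -EINVAL instead of
oopsing in strict_strtoul().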

Thanks
Vivek

> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/elevator-fq.c |  239 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  block/elevator-fq.h |   11 +++
>  2 files changed, 245 insertions(+), 5 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..7c95d55 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
>  #include "elevator-fq.h"
>  #include <linux/blktrace_api.h>
>  #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>  
>  /* Values taken from cfq */
>  const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(io_lookup_io_group_current);
>  
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> +					      void *key);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> +			  void *key)
>  {
>  	struct io_entity *entity = &iog->entity;
> +	struct policy_node *pn;
> +
> +	spin_lock_irq(&iocg->lock);
> +	pn = policy_search_node(iocg, key);
> +	if (pn) {
> +		entity->weight = pn->weight;
> +		entity->new_weight = pn->weight;
> +		entity->ioprio_class = pn->ioprio_class;
> +		entity->new_ioprio_class = pn->ioprio_class;
> +	} else {
> +		entity->weight = iocg->weight;
> +		entity->new_weight = iocg->weight;
> +		entity->ioprio_class = iocg->ioprio_class;
> +		entity->new_ioprio_class = iocg->ioprio_class;
> +	}
> +	spin_unlock_irq(&iocg->lock);
>  
> -	entity->weight = entity->new_weight = iocg->weight;
> -	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
>  	entity->ioprio_changed = 1;
>  	entity->my_sched_data = &iog->sched_data;
>  }
> @@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
>  		atomic_set(&iog->ref, 0);
>  		iog->deleting = 0;
>  
> -		io_group_init_entity(iocg, iog);
> +		io_group_init_entity(iocg, iog, key);
>  		iog->my_entity = &iog->entity;
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>  		iog->iocg_id = css_id(&iocg->css);
> @@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
>  	return iog;
>  }
>  
> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
> +				  struct seq_file *m)
> +{
> +	struct io_cgroup *iocg;
> +	struct policy_node *pn;
> +
> +	iocg = cgroup_to_io_cgroup(cgrp);
> +
> +	if (list_empty(&iocg->list))
> +		goto out;
> +
> +	seq_printf(m, "dev weight class\n");
> +
> +	spin_lock_irq(&iocg->lock);
> +	list_for_each_entry(pn, &iocg->list, node) {
> +		seq_printf(m, "%s %lu %lu\n", pn->dev_name,
> +			   pn->weight, pn->ioprio_class);
> +	}
> +	spin_unlock_irq(&iocg->lock);
> +out:
> +	return 0;
> +}
> +
> +static inline void policy_insert_node(struct io_cgroup *iocg,
> +					  struct policy_node *pn)
> +{
> +	list_add(&pn->node, &iocg->list);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static inline void policy_delete_node(struct policy_node *pn)
> +{
> +	list_del(&pn->node);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
> +					      void *key)
> +{
> +	struct policy_node *pn;
> +
> +	if (list_empty(&iocg->list))
> +		return NULL;
> +
> +	list_for_each_entry(pn, &iocg->list, node) {
> +		if (pn->key == key)
> +			return pn;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void *devname_to_efqd(const char *buf)
> +{
> +	struct block_device *bdev;
> +	void *key = NULL;
> +	struct gendisk *disk;
> +	int part;
> +
> +	bdev = lookup_bdev(buf);
> +	if (IS_ERR(bdev))
> +		return NULL;
> +
> +	disk = get_gendisk(bdev->bd_dev, &part);
> +	key = (void *)&disk->queue->elevator->efqd;
> +	bdput(bdev);
> +
> +	return key;
> +}
> +
> +static int policy_parse_and_set(char *buf, struct policy_node *newpn)
> +{
> +	char *s[3];
> +	char *p;
> +	int ret;
> +	int i = 0;
> +
> +	memset(s, 0, sizeof(s));
> +	while (i < ARRAY_SIZE(s)) {
> +		p = strsep(&buf, ":");
> +		if (!p)
> +			break;
> +		if (!*p)
> +			continue;
> +		s[i++] = p;
> +	}
> +
> +	newpn->key = devname_to_efqd(s[0]);
> +	if (!newpn->key)
> +		return -EINVAL;
> +
> +	strcpy(newpn->dev_name, s[0]);
> +
> +	ret = strict_strtoul(s[1], 10, &newpn->weight);
> +	if (ret || newpn->weight > WEIGHT_MAX)
> +		return -EINVAL;
> +
> +	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
> +	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
> +	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
> +			    const char *buffer)
> +{
> +	struct io_cgroup *iocg;
> +	struct policy_node *newpn, *pn;
> +	char *buf;
> +	int ret = 0;
> +	int keep_newpn = 0;
> +	struct hlist_node *n;
> +	struct io_group *iog;
> +
> +	buf = kstrdup(buffer, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
> +	if (!newpn) {
> +		ret = -ENOMEM;
> +		goto free_buf;
> +	}
> +
> +	ret = policy_parse_and_set(buf, newpn);
> +	if (ret)
> +		goto free_newpn;
> +
> +	if (!cgroup_lock_live_group(cgrp)) {
> +		ret = -ENODEV;
> +		goto free_newpn;
> +	}
> +
> +	iocg = cgroup_to_io_cgroup(cgrp);
> +	spin_lock_irq(&iocg->lock);
> +
> +	pn = policy_search_node(iocg, newpn->key);
> +	if (!pn) {
> +		if (newpn->weight != 0) {
> +			policy_insert_node(iocg, newpn);
> +			keep_newpn = 1;
> +		}
> +		goto update_io_group;
> +	}
> +
> +	if (newpn->weight == 0) {
> +		/* weight == 0 means deleting a policy */
> +		policy_delete_node(pn);
> +		goto update_io_group;
> +	}
> +
> +	pn->weight = newpn->weight;
> +	pn->ioprio_class = newpn->ioprio_class;
> +
> +update_io_group:
> +	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
> +		if (iog->key == newpn->key) {
> +			if (newpn->weight) {
> +				iog->entity.new_weight = newpn->weight;
> +				iog->entity.new_ioprio_class =
> +					newpn->ioprio_class;
> +				/*
> +				 * iog weight and ioprio_class updating
> +				 * actually happens if ioprio_changed is set.
> +				 * So ensure ioprio_changed is not set until
> +				 * new weight and new ioprio_class are updated.
> +				 */
> +				smp_wmb();
> +				iog->entity.ioprio_changed = 1;
> +			} else {
> +				iog->entity.new_weight = iocg->weight;
> +				iog->entity.new_ioprio_class =
> +					iocg->ioprio_class;
> +
> +				/* The same as above */
> +				smp_wmb();
> +				iog->entity.ioprio_changed = 1;
> +			}
> +		}
> +	}
> +	spin_unlock_irq(&iocg->lock);
> +
> +	cgroup_unlock();
> +
> +free_newpn:
> +	if (!keep_newpn)
> +		kfree(newpn);
> +free_buf:
> +	kfree(buf);
> +	return ret;
> +}
> +
>  struct cftype bfqio_files[] = {
>  	{
> +		.name = "policy",
> +		.read_seq_string = io_cgroup_policy_read,
> +		.write_string = io_cgroup_policy_write,
> +		.max_write_len = 256,
> +	},
> +	{
>  		.name = "weight",
>  		.read_u64 = io_cgroup_weight_read,
>  		.write_u64 = io_cgroup_weight_write,
> @@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>  	INIT_HLIST_HEAD(&iocg->group_data);
>  	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>  	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
> +	INIT_LIST_HEAD(&iocg->list);
>  
>  	return &iocg->css;
>  }
> @@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>  	unsigned long flags, flags1;
>  	int queue_lock_held = 0;
>  	struct elv_fq_data *efqd;
> +	struct policy_node *pn, *pntmp;
>  
>  	/*
>  	 * io groups are linked in two lists. One list is maintained
> @@ -1823,6 +2046,12 @@ locked:
>  	BUG_ON(!hlist_empty(&iocg->group_data));
>  
>  	free_css_id(&io_subsys, &iocg->css);
> +
> +	list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
> +		policy_delete_node(pn);
> +		kfree(pn);
> +	}
> +
>  	kfree(iocg);
>  }
>  
> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>  void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>  {
>  	entity->ioprio = entity->new_ioprio;
> -	entity->weight = entity->new_weight;
> +	entity->weight = entity->new_weigh;
>  	entity->ioprio_class = entity->new_ioprio_class;
>  	entity->sched_data = &iog->sched_data;
>  }
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..0407633 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -253,6 +253,14 @@ struct io_group {
>  #endif
>  };
>  
> +struct policy_node {
> +	struct list_head node;
> +	char dev_name[32];
> +	void *key;
> +	unsigned long weight;
> +	unsigned long ioprio_class;
> +};
> +
>  /**
>   * struct bfqio_cgroup - bfq cgroup data structure.
>   * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +277,9 @@ struct io_cgroup {
>  
>  	unsigned long weight, ioprio_class;
>  
> +	/* list of policy_node */
> +	struct list_head list;
> +
>  	spinlock_t lock;
>  	struct hlist_head group_data;
>  };
> -- 
> 1.5.4.rc3
> 
> 

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]     ` <20090513144432.GA7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14  0:59       ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  0:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch enables per-cgroup per-device weight and ioprio_class handling.
>> A new cgroup interface "policy" is introduced. You can make use of this 
>> file to configure weight and ioprio_class for each device in a given cgroup.
>> The original "weight" and "ioprio_class" files are still available. If you
>> don't do special configuration for a particular device, "weight" and
>> "ioprio_class" are used as the default values for that device.
>>
>> You can use the following format to play with the new interface.
>> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
>> weight=0 means removing the policy for DEV.
>>
> 
> Thanks for the patch Gui. I will test it out and let you know how it goes.

  Hi Vivek,

  I forgot to mention that this patch hasn't been tested thoroughly; it is
  just to show the design. I'd like to test it soon.
  

> 
> Thanks
> Vivek
> 

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-13 14:44   ` Vivek Goyal
       [not found]     ` <20090513144432.GA7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14  0:59     ` Gui Jianfeng
  1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  0:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch enables per-cgroup per-device weight and ioprio_class handling.
>> A new cgroup interface "policy" is introduced. You can make use of this 
>> file to configure weight and ioprio_class for each device in a given cgroup.
>> The original "weight" and "ioprio_class" files are still available. If you
>> don't do special configuration for a particular device, "weight" and
>> "ioprio_class" are used as the default values for that device.
>>
>> You can use the following format to play with the new interface.
>> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
>> weight=0 means removing the policy for DEV.
>>
> 
> Thanks for the patch Gui. I will test it out and let you know how it goes.

  Hi Vivek,

  I forgot to mention that this patch hasn't been tested thoroughly; it is
  just to show the design. I'd like to test it soon.
  

> 
> Thanks
> Vivek
> 

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]     ` <20090513152909.GD7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14  1:02       ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  1:02 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> 
> [..]
>> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>>  void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>>  {
>>  	entity->ioprio = entity->new_ioprio;
>> -	entity->weight = entity->new_weight;
>> +	entity->weight = entity->new_weigh;
>>  	entity->ioprio_class = entity->new_ioprio_class;
>>  	entity->sched_data = &iog->sched_data;
>>  }
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index db3a347..0407633 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -253,6 +253,14 @@ struct io_group {
>>  #endif
>>  };
>>  
>> +struct policy_node {
> 
> Would "io_policy_node" be better?

  Sure

> 
>> +	struct list_head node;
>> +	char dev_name[32];
>> +	void *key;
>> +	unsigned long weight;
>> +	unsigned long ioprio_class;
>> +};
>> +
>>  /**
>>   * struct bfqio_cgroup - bfq cgroup data structure.
>>   * @css: subsystem state for bfq in the containing cgroup.
>> @@ -269,6 +277,9 @@ struct io_cgroup {
>>  
>>  	unsigned long weight, ioprio_class;
>>  
>> +	/* list of policy_node */
>> +	struct list_head list;
>> +
> 
> How about "struct list_head policy_list" or "struct list_head io_policy"?

  OK

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-13 15:29   ` Vivek Goyal
@ 2009-05-14  1:02     ` Gui Jianfeng
       [not found]     ` <20090513152909.GD7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  1:02 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
> 
> [..]
>> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>>  void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>>  {
>>  	entity->ioprio = entity->new_ioprio;
>> -	entity->weight = entity->new_weight;
>> +	entity->weight = entity->new_weigh;
>>  	entity->ioprio_class = entity->new_ioprio_class;
>>  	entity->sched_data = &iog->sched_data;
>>  }
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index db3a347..0407633 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -253,6 +253,14 @@ struct io_group {
>>  #endif
>>  };
>>  
>> +struct policy_node {
> 
> Would "io_policy_node" be better?

  Sure

> 
>> +	struct list_head node;
>> +	char dev_name[32];
>> +	void *key;
>> +	unsigned long weight;
>> +	unsigned long ioprio_class;
>> +};
>> +
>>  /**
>>   * struct bfqio_cgroup - bfq cgroup data structure.
>>   * @css: subsystem state for bfq in the containing cgroup.
>> @@ -269,6 +277,9 @@ struct io_cgroup {
>>  
>>  	unsigned long weight, ioprio_class;
>>  
>> +	/* list of policy_node */
>> +	struct list_head list;
>> +
> 
> How about "struct list_head policy_list" or "struct list_head io_policy"?

  OK

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]     ` <20090513171734.GA18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14  1:24       ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  1:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
...
>> +	}
>> +	spin_unlock_irq(&iocg->lock);
>>  
> 
> I think we need to use the spin_lock_irqsave() and spin_unlock_irqrestore()
> versions above because this can be called with the request queue lock held
> and we don't want to enable interrupts unconditionally here.

  Will change.
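
  Something like the following, I think (the io_group_init_entity() from this
  patch, just switched to the irqsave/irqrestore variants so the caller's
  interrupt state is preserved):

void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
			  void *key)
{
	struct io_entity *entity = &iog->entity;
	struct policy_node *pn;
	unsigned long flags;

	/*
	 * The caller may hold the request queue lock with interrupts
	 * already disabled, so don't re-enable them unconditionally.
	 */
	spin_lock_irqsave(&iocg->lock, flags);
	pn = policy_search_node(iocg, key);
	if (pn) {
		entity->weight = pn->weight;
		entity->new_weight = pn->weight;
		entity->ioprio_class = pn->ioprio_class;
		entity->new_ioprio_class = pn->ioprio_class;
	} else {
		entity->weight = iocg->weight;
		entity->new_weight = iocg->weight;
		entity->ioprio_class = iocg->ioprio_class;
		entity->new_ioprio_class = iocg->ioprio_class;
	}
	spin_unlock_irqrestore(&iocg->lock, flags);

	entity->ioprio_changed = 1;
	entity->my_sched_data = &iog->sched_data;
}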

> 
> I hit the following lock validator warning.
> 
>  
> [   81.521242] =================================
> [   81.522127] [ INFO: inconsistent lock state ]
> [   81.522127] 2.6.30-rc4-ioc #47
> [   81.522127] ---------------------------------
> [   81.522127] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
> [   81.522127] io-group-bw-tes/4138 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [   81.522127]  (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
> [   81.522127] {IN-SOFTIRQ-W} state was registered at:
> [   81.522127]   [<ffffffffffffffff>] 0xffffffffffffffff
> [   81.522127] irq event stamp: 1006
> [   81.522127] hardirqs last  enabled at (1005): [<ffffffff810c1198>] kmem_cache_alloc+0x9d/0x105
> [   81.522127] hardirqs last disabled at (1006): [<ffffffff8150343f>] _spin_lock_irq+0x12/0x3e
> [   81.522127] softirqs last  enabled at (286): [<ffffffff81042039>] __do_softirq+0x17a/0x187
> [   81.522127] softirqs last disabled at (271): [<ffffffff8100ccfc>] call_softirq+0x1c/0x34
> [   81.522127] 
> [   81.522127] other info that might help us debug this:
> [   81.522127] 3 locks held by io-group-bw-tes/4138:
> [   81.522127]  #0:  (&type->i_mutex_dir_key#4){+.+.+.}, at: [<ffffffff810cfd2c>] do_lookup+0x82/0x15f
> [   81.522127]  #1:  (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
> [   81.522127]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811e55bb>] __rcu_read_lock+0x0/0x30
> [   81.522127] 
> [   81.522127] stack backtrace:
> [   81.522127] Pid: 4138, comm: io-group-bw-tes Not tainted 2.6.30-rc4-ioc #47
> [   81.522127] Call Trace:
> [   81.522127]  [<ffffffff8105edad>] valid_state+0x17c/0x18f
> [   81.522127]  [<ffffffff8105eb8a>] ? check_usage_backwards+0x0/0x52
> [   81.522127]  [<ffffffff8105ee9b>] mark_lock+0xdb/0x1ff
> [   81.522127]  [<ffffffff8105f00c>] mark_held_locks+0x4d/0x6b
> [   81.522127]  [<ffffffff8150331a>] ? _spin_unlock_irq+0x2b/0x31
> [   81.522127]  [<ffffffff8105f13e>] trace_hardirqs_on_caller+0x114/0x138
> [   81.522127]  [<ffffffff8105f16f>] trace_hardirqs_on+0xd/0xf
> [   81.522127]  [<ffffffff8150331a>] _spin_unlock_irq+0x2b/0x31
> [   81.522127]  [<ffffffff811e5534>] ? io_group_init_entity+0x2a/0xb1
> [   81.522127]  [<ffffffff811e5597>] io_group_init_entity+0x8d/0xb1
> [   81.522127]  [<ffffffff811e688e>] ? io_group_chain_alloc+0x49/0x167
> [   81.522127]  [<ffffffff811e68fe>] io_group_chain_alloc+0xb9/0x167
> [   81.522127]  [<ffffffff811e6a04>] io_find_alloc_group+0x58/0x85
> [   81.522127]  [<ffffffff811e6aec>] io_get_io_group+0x6e/0x94
> [   81.522127]  [<ffffffff811e6d8c>] io_group_get_request_list+0x10/0x21
> [   81.522127]  [<ffffffff811d7021>] blk_get_request_list+0x9/0xb
> [   81.522127]  [<ffffffff811d7ab0>] get_request_wait+0x132/0x17b
> [   81.522127]  [<ffffffff811d7dc1>] __make_request+0x2c8/0x396
> [   81.522127]  [<ffffffff811d6806>] generic_make_request+0x1f2/0x28c
> [   81.522127]  [<ffffffff810e9ee7>] ? bio_init+0x18/0x32
> [   81.522127]  [<ffffffff811d8019>] submit_bio+0xb1/0xbc
> [   81.522127]  [<ffffffff810e61c1>] submit_bh+0xfb/0x11e
> [   81.522127]  [<ffffffff8111f554>] __ext3_get_inode_loc+0x263/0x2c2
> [   81.522127]  [<ffffffff81122286>] ext3_iget+0x69/0x399
> [   81.522127]  [<ffffffff81125b92>] ext3_lookup+0x81/0xd0
> [   81.522127]  [<ffffffff810cfd81>] do_lookup+0xd7/0x15f
> [   81.522127]  [<ffffffff810d15c2>] __link_path_walk+0x319/0x67f
> [   81.522127]  [<ffffffff810d1976>] path_walk+0x4e/0x97
> [   81.522127]  [<ffffffff810d1b48>] do_path_lookup+0x115/0x15a
> [   81.522127]  [<ffffffff810d0fec>] ? getname+0x19d/0x1bf
> [   81.522127]  [<ffffffff810d252a>] user_path_at+0x52/0x8c
> [   81.522127]  [<ffffffff811ee668>] ? __up_read+0x1c/0x8c
> [   81.522127]  [<ffffffff8150379b>] ? _spin_unlock_irqrestore+0x3f/0x47
> [   81.522127]  [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
> [   81.522127]  [<ffffffff810cb6c1>] vfs_fstatat+0x35/0x62
> [   81.522127]  [<ffffffff811ee6d0>] ? __up_read+0x84/0x8c
> [   81.522127]  [<ffffffff810cb7bb>] vfs_stat+0x16/0x18
> [   81.522127]  [<ffffffff810cb7d7>] sys_newstat+0x1a/0x34
> [   81.522127]  [<ffffffff8100c5e9>] ? retint_swapgs+0xe/0x13
> [   81.522127]  [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
> [   81.522127]  [<ffffffff8107f771>] ? audit_syscall_entry+0xfe/0x12a
> [   81.522127]  [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
> 
> Thanks
> Vivek
> 
> 
> 
> 

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-13 17:17   ` Vivek Goyal
       [not found]     ` <20090513171734.GA18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14  1:24     ` Gui Jianfeng
  1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  1:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
...
>> +	}
>> +	spin_unlock_irq(&iocg->lock);
>>  
> 
> I think we need to use the spin_lock_irqsave() and spin_unlock_irqrestore()
> versions above because this can be called with the request queue lock held
> and we don't want to enable interrupts unconditionally here.

  Will change.

> 
> I hit the following lock validator warning.
> 
>  
> [   81.521242] =================================
> [   81.522127] [ INFO: inconsistent lock state ]
> [   81.522127] 2.6.30-rc4-ioc #47
> [   81.522127] ---------------------------------
> [   81.522127] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
> [   81.522127] io-group-bw-tes/4138 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [   81.522127]  (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
> [   81.522127] {IN-SOFTIRQ-W} state was registered at:
> [   81.522127]   [<ffffffffffffffff>] 0xffffffffffffffff
> [   81.522127] irq event stamp: 1006
> [   81.522127] hardirqs last  enabled at (1005): [<ffffffff810c1198>] kmem_cache_alloc+0x9d/0x105
> [   81.522127] hardirqs last disabled at (1006): [<ffffffff8150343f>] _spin_lock_irq+0x12/0x3e
> [   81.522127] softirqs last  enabled at (286): [<ffffffff81042039>] __do_softirq+0x17a/0x187
> [   81.522127] softirqs last disabled at (271): [<ffffffff8100ccfc>] call_softirq+0x1c/0x34
> [   81.522127] 
> [   81.522127] other info that might help us debug this:
> [   81.522127] 3 locks held by io-group-bw-tes/4138:
> [   81.522127]  #0:  (&type->i_mutex_dir_key#4){+.+.+.}, at: [<ffffffff810cfd2c>] do_lookup+0x82/0x15f
> [   81.522127]  #1:  (&q->__queue_lock){+.?...}, at: [<ffffffff811d7b2e>] __make_request+0x35/0x396
> [   81.522127]  #2:  (rcu_read_lock){.+.+..}, at: [<ffffffff811e55bb>] __rcu_read_lock+0x0/0x30
> [   81.522127] 
> [   81.522127] stack backtrace:
> [   81.522127] Pid: 4138, comm: io-group-bw-tes Not tainted 2.6.30-rc4-ioc #47
> [   81.522127] Call Trace:
> [   81.522127]  [<ffffffff8105edad>] valid_state+0x17c/0x18f
> [   81.522127]  [<ffffffff8105eb8a>] ? check_usage_backwards+0x0/0x52
> [   81.522127]  [<ffffffff8105ee9b>] mark_lock+0xdb/0x1ff
> [   81.522127]  [<ffffffff8105f00c>] mark_held_locks+0x4d/0x6b
> [   81.522127]  [<ffffffff8150331a>] ? _spin_unlock_irq+0x2b/0x31
> [   81.522127]  [<ffffffff8105f13e>] trace_hardirqs_on_caller+0x114/0x138
> [   81.522127]  [<ffffffff8105f16f>] trace_hardirqs_on+0xd/0xf
> [   81.522127]  [<ffffffff8150331a>] _spin_unlock_irq+0x2b/0x31
> [   81.522127]  [<ffffffff811e5534>] ? io_group_init_entity+0x2a/0xb1
> [   81.522127]  [<ffffffff811e5597>] io_group_init_entity+0x8d/0xb1
> [   81.522127]  [<ffffffff811e688e>] ? io_group_chain_alloc+0x49/0x167
> [   81.522127]  [<ffffffff811e68fe>] io_group_chain_alloc+0xb9/0x167
> [   81.522127]  [<ffffffff811e6a04>] io_find_alloc_group+0x58/0x85
> [   81.522127]  [<ffffffff811e6aec>] io_get_io_group+0x6e/0x94
> [   81.522127]  [<ffffffff811e6d8c>] io_group_get_request_list+0x10/0x21
> [   81.522127]  [<ffffffff811d7021>] blk_get_request_list+0x9/0xb
> [   81.522127]  [<ffffffff811d7ab0>] get_request_wait+0x132/0x17b
> [   81.522127]  [<ffffffff811d7dc1>] __make_request+0x2c8/0x396
> [   81.522127]  [<ffffffff811d6806>] generic_make_request+0x1f2/0x28c
> [   81.522127]  [<ffffffff810e9ee7>] ? bio_init+0x18/0x32
> [   81.522127]  [<ffffffff811d8019>] submit_bio+0xb1/0xbc
> [   81.522127]  [<ffffffff810e61c1>] submit_bh+0xfb/0x11e
> [   81.522127]  [<ffffffff8111f554>] __ext3_get_inode_loc+0x263/0x2c2
> [   81.522127]  [<ffffffff81122286>] ext3_iget+0x69/0x399
> [   81.522127]  [<ffffffff81125b92>] ext3_lookup+0x81/0xd0
> [   81.522127]  [<ffffffff810cfd81>] do_lookup+0xd7/0x15f
> [   81.522127]  [<ffffffff810d15c2>] __link_path_walk+0x319/0x67f
> [   81.522127]  [<ffffffff810d1976>] path_walk+0x4e/0x97
> [   81.522127]  [<ffffffff810d1b48>] do_path_lookup+0x115/0x15a
> [   81.522127]  [<ffffffff810d0fec>] ? getname+0x19d/0x1bf
> [   81.522127]  [<ffffffff810d252a>] user_path_at+0x52/0x8c
> [   81.522127]  [<ffffffff811ee668>] ? __up_read+0x1c/0x8c
> [   81.522127]  [<ffffffff8150379b>] ? _spin_unlock_irqrestore+0x3f/0x47
> [   81.522127]  [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
> [   81.522127]  [<ffffffff810cb6c1>] vfs_fstatat+0x35/0x62
> [   81.522127]  [<ffffffff811ee6d0>] ? __up_read+0x84/0x8c
> [   81.522127]  [<ffffffff810cb7bb>] vfs_stat+0x16/0x18
> [   81.522127]  [<ffffffff810cb7d7>] sys_newstat+0x1a/0x34
> [   81.522127]  [<ffffffff8100c5e9>] ? retint_swapgs+0xe/0x13
> [   81.522127]  [<ffffffff8105f13e>] ? trace_hardirqs_on_caller+0x114/0x138
> [   81.522127]  [<ffffffff8107f771>] ? audit_syscall_entry+0xfe/0x12a
> [   81.522127]  [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
> 
> Thanks
> Vivek
> 
> 
> 
> 

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]     ` <20090513190929.GB18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14  1:35       ` Gui Jianfeng
  2009-05-14  7:26       ` Gui Jianfeng
  1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  1:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch enables per-cgroup per-device weight and ioprio_class handling.
>> A new cgroup interface "policy" is introduced. You can make use of this 
>> file to configure weight and ioprio_class for each device in a given cgroup.
>> The original "weight" and "ioprio_class" files are still available. If you
>> don't do special configuration for a particular device, "weight" and
>> "ioprio_class" are used as the default values for that device.
>>
>> You can use the following format to play with the new interface.
>> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
>> weight=0 means removing the policy for DEV.
>>
>> Examples:
>> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
>> # echo /dev/hdb:300:2 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hdb 300 2
>>
>> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
>> # echo /dev/hda:500:1 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hda 500 1
>> /dev/hdb 300 2
>>
>> Remove the policy for /dev/hda in this cgroup
>> # echo /dev/hda:0:1 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hdb 300 2
>>
> 
> Hi Gui,
> 
> Noticed a few things during testing.
> 
> 1. Writing 0 as the weight does not remove the policy for me if I switch
>    the IO scheduler on the device.
> 
> 	- echo "/dev/sdb:500:2" > io.policy
> 	- Change elevator on device /sdb
> 	- echo "/dev/sdb:0:2" > io.policy
> 	- cat io.policy
> 	  The old rule does not go away.
> 
> 2. One can add the same rule twice after changing the elevator.
> 
> 	- echo "/dev/sdb:500:2" > io.policy
> 	- Change elevator on device /sdb
> 	- echo "/dev/sdb:500:2" > io.policy
> 	- cat io.policy
> 
> 	Same rule appears twice
> 
> 3. If one writes to io.weight, it should not update the weight for a
>    device if there is already a rule for that device. For example, if a
>    cgroup has io.weight=1000 and I later set the weight on /dev/sdb to
>    500 and then change io.weight to 200, the groups on /dev/sdb should
>    not be updated. Why? Because I think it makes more sense to keep the
>    simple rule that as long as there is a rule for a device, it always
>    overrides the generic setting of io.weight.
> 
> 4. A wrong rule should return an invalid-value error; instead we see an oops.
> 
>    - echo "/dev/sdb:0:" > io.policy

  Hi Vivek,

  Thanks for testing, I'll fix the above problems and send an updated version.
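
  For point 3, a rough sketch of the direction for the io.weight write path is
  below. Only the handler name and the .write_u64 hook come from the cftype
  table in this patch; the body (bounds check, locking, how the groups get
  refreshed) is an assumption about what the existing handler does, not a copy
  of it:

static int io_cgroup_weight_write(struct cgroup *cgrp, struct cftype *cft,
				  u64 val)
{
	struct io_cgroup *iocg;
	struct io_group *iog;
	struct hlist_node *n;

	if (!val || val > WEIGHT_MAX)
		return -EINVAL;

	if (!cgroup_lock_live_group(cgrp))
		return -ENODEV;

	iocg = cgroup_to_io_cgroup(cgrp);

	spin_lock_irq(&iocg->lock);
	iocg->weight = val;

	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
		/* A per-device policy rule always overrides io.weight. */
		if (policy_search_node(iocg, iog->key))
			continue;

		iog->entity.new_weight = val;
		/* Publish new_weight before setting ioprio_changed. */
		smp_wmb();
		iog->entity.ioprio_changed = 1;
	}
	spin_unlock_irq(&iocg->lock);

	cgroup_unlock();
	return 0;
}

  That keeps the simple rule you suggested: a per-device entry always wins over
  the generic io.weight until it is removed with weight=0.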

> 
> [ 2651.587533] BUG: unable to handle kernel NULL pointer dereference at
> (null)
> [ 2651.588301] IP: [<ffffffff811f035c>] strict_strtoul+0x24/0x79
> [ 2651.588301] PGD 38c33067 PUD 38d67067 PMD 0 
> [ 2651.588301] Oops: 0000 [#2] SMP 
> [ 2651.588301] last sysfs file:
> /sys/devices/pci0000:00/0000:00:1c.0/0000:0e:00.0/irq
> [ 2651.588301] CPU 2 
> [ 2651.588301] Modules linked in:
> [ 2651.588301] Pid: 4538, comm: bash Tainted: G      D    2.6.30-rc4-ioc
> #52 HP xw6600 Workstation
> [ 2651.588301] RIP: 0010:[<ffffffff811f035c>]  [<ffffffff811f035c>]
> strict_strtoul+0x24/0x79
> [ 2651.588301] RSP: 0018:ffff88003dd73dc0  EFLAGS: 00010286
> [ 2651.588301] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> ffffffffffffffff
> [ 2651.588301] RDX: ffff88003e9ffca0 RSI: 000000000000000a RDI:
> 0000000000000000
> [ 2651.588301] RBP: ffff88003dd73de8 R08: 000000000000000a R09:
> ffff88003dd73cf8
> [ 2651.588301] R10: ffff88003dcd2300 R11: ffffffff8178aa00 R12:
> ffff88003f4a1e00
> [ 2651.588301] R13: ffff88003e9ffca0 R14: ffff88003ac5f200 R15:
> ffff88003fa7ed40
> [ 2651.588301] FS:  00007ff971c466f0(0000) GS:ffff88000209c000(0000)
> knlGS:0000000000000000
> [ 2651.588301] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 2651.588301] CR2: 0000000000000000 CR3: 000000003ad0d000 CR4:
> 00000000000006e0
> [ 2651.588301] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 2651.588301] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [ 2651.588301] Process bash (pid: 4538, threadinfo ffff88003dd72000, task
> ffff880038d98000)
> [ 2651.588301] Stack:
> [ 2651.588301]  ffffffff810d8f23 ffff88003fa7ed4a ffff88003dcdeee0
> ffff88003f4a1e00
> [ 2651.588301]  ffff88003e9ffc60 ffff88003dd73e68 ffffffff811e8097
> ffff880038dd2780
> [ 2651.588301]  ffff88003dd73e48 ffff88003fa7ed40 ffff88003fa7ed49
> 0000000000000000
> [ 2651.588301] Call Trace:
> [ 2651.588301]  [<ffffffff810d8f23>] ? iput+0x2f/0x65
> [ 2651.588301]  [<ffffffff811e8097>] io_cgroup_policy_write+0x11d/0x2ac
> [ 2651.588301]  [<ffffffff81072dee>] cgroup_file_write+0x1ec/0x254
> [ 2651.588301]  [<ffffffff811afce8>] ? security_file_permission+0x11/0x13
> [ 2651.588301]  [<ffffffff810c8394>] vfs_write+0xab/0x105
> [ 2651.588301]  [<ffffffff810c84a8>] sys_write+0x47/0x6c
> [ 2651.588301]  [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
> [ 2651.588301] Code: 65 ff ff ff 5b c9 c3 55 48 83 c9 ff 31 c0 fc 48 89 e5
> 41 55 41 89 f0 49 89 d5 41 54 53 48 89 fb 48 83 ec 10 48 c7 02 00 00 00 00
> <f2> ae 48 f7 d1 49 89 cc 49 ff cc 74 39 48 8d 75 e0 44 89 c2 48 
> [ 2651.588301] RIP  [<ffffffff811f035c>] strict_strtoul+0x24/0x79
> [ 2651.588301]  RSP <ffff88003dd73dc0>
> [ 2651.588301] CR2: 0000000000000000
> [ 2651.828110] ---[ end trace 537b9a98ce297f01 ]---
> 
> Thanks
> Vivek
> 
>> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> ---
>>  block/elevator-fq.c |  239 +++++++++++++++++++++++++++++++++++++++++++++++++-
>>  block/elevator-fq.h |   11 +++
>>  2 files changed, 245 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>> index 69435ab..7c95d55 100644
>> --- a/block/elevator-fq.c
>> +++ b/block/elevator-fq.c
>> @@ -12,6 +12,9 @@
>>  #include "elevator-fq.h"
>>  #include <linux/blktrace_api.h>
>>  #include <linux/biotrack.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/genhd.h>
>> +
>>  
>>  /* Values taken from cfq */
>>  const int elv_slice_sync = HZ / 10;
>> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>>  }
>>  EXPORT_SYMBOL(io_lookup_io_group_current);
>>  
>> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
>> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
>> +					      void *key);
>> +
>> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
>> +			  void *key)
>>  {
>>  	struct io_entity *entity = &iog->entity;
>> +	struct policy_node *pn;
>> +
>> +	spin_lock_irq(&iocg->lock);
>> +	pn = policy_search_node(iocg, key);
>> +	if (pn) {
>> +		entity->weight = pn->weight;
>> +		entity->new_weight = pn->weight;
>> +		entity->ioprio_class = pn->ioprio_class;
>> +		entity->new_ioprio_class = pn->ioprio_class;
>> +	} else {
>> +		entity->weight = iocg->weight;
>> +		entity->new_weight = iocg->weight;
>> +		entity->ioprio_class = iocg->ioprio_class;
>> +		entity->new_ioprio_class = iocg->ioprio_class;
>> +	}
>> +	spin_unlock_irq(&iocg->lock);
>>  
>> -	entity->weight = entity->new_weight = iocg->weight;
>> -	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
>>  	entity->ioprio_changed = 1;
>>  	entity->my_sched_data = &iog->sched_data;
>>  }
>> @@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
>>  		atomic_set(&iog->ref, 0);
>>  		iog->deleting = 0;
>>  
>> -		io_group_init_entity(iocg, iog);
>> +		io_group_init_entity(iocg, iog, key);
>>  		iog->my_entity = &iog->entity;
>>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>>  		iog->iocg_id = css_id(&iocg->css);
>> @@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
>>  	return iog;
>>  }
>>  
>> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
>> +				  struct seq_file *m)
>> +{
>> +	struct io_cgroup *iocg;
>> +	struct policy_node *pn;
>> +
>> +	iocg = cgroup_to_io_cgroup(cgrp);
>> +
>> +	if (list_empty(&iocg->list))
>> +		goto out;
>> +
>> +	seq_printf(m, "dev weight class\n");
>> +
>> +	spin_lock_irq(&iocg->lock);
>> +	list_for_each_entry(pn, &iocg->list, node) {
>> +		seq_printf(m, "%s %lu %lu\n", pn->dev_name,
>> +			   pn->weight, pn->ioprio_class);
>> +	}
>> +	spin_unlock_irq(&iocg->lock);
>> +out:
>> +	return 0;
>> +}
>> +
>> +static inline void policy_insert_node(struct io_cgroup *iocg,
>> +					  struct policy_node *pn)
>> +{
>> +	list_add(&pn->node, &iocg->list);
>> +}
>> +
>> +/* Must be called with iocg->lock held */
>> +static inline void policy_delete_node(struct policy_node *pn)
>> +{
>> +	list_del(&pn->node);
>> +}
>> +
>> +/* Must be called with iocg->lock held */
>> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
>> +					      void *key)
>> +{
>> +	struct policy_node *pn;
>> +
>> +	if (list_empty(&iocg->list))
>> +		return NULL;
>> +
>> +	list_for_each_entry(pn, &iocg->list, node) {
>> +		if (pn->key == key)
>> +			return pn;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +static void *devname_to_efqd(const char *buf)
>> +{
>> +	struct block_device *bdev;
>> +	void *key = NULL;
>> +	struct gendisk *disk;
>> +	int part;
>> +
>> +	bdev = lookup_bdev(buf);
>> +	if (IS_ERR(bdev))
>> +		return NULL;
>> +
>> +	disk = get_gendisk(bdev->bd_dev, &part);
>> +	key = (void *)&disk->queue->elevator->efqd;
>> +	bdput(bdev);
>> +
>> +	return key;
>> +}
>> +
>> +static int policy_parse_and_set(char *buf, struct policy_node *newpn)
>> +{
>> +	char *s[3];
>> +	char *p;
>> +	int ret;
>> +	int i = 0;
>> +
>> +	memset(s, 0, sizeof(s));
>> +	while (i < ARRAY_SIZE(s)) {
>> +		p = strsep(&buf, ":");
>> +		if (!p)
>> +			break;
>> +		if (!*p)
>> +			continue;
>> +		s[i++] = p;
>> +	}
>> +
>> +	newpn->key = devname_to_efqd(s[0]);
>> +	if (!newpn->key)
>> +		return -EINVAL;
>> +
>> +	strcpy(newpn->dev_name, s[0]);
>> +
>> +	ret = strict_strtoul(s[1], 10, &newpn->weight);
>> +	if (ret || newpn->weight > WEIGHT_MAX)
>> +		return -EINVAL;
>> +
>> +	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
>> +	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
>> +	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
>> +			    const char *buffer)
>> +{
>> +	struct io_cgroup *iocg;
>> +	struct policy_node *newpn, *pn;
>> +	char *buf;
>> +	int ret = 0;
>> +	int keep_newpn = 0;
>> +	struct hlist_node *n;
>> +	struct io_group *iog;
>> +
>> +	buf = kstrdup(buffer, GFP_KERNEL);
>> +	if (!buf)
>> +		return -ENOMEM;
>> +
>> +	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
>> +	if (!newpn) {
>> +		ret = -ENOMEM;
>> +		goto free_buf;
>> +	}
>> +
>> +	ret = policy_parse_and_set(buf, newpn);
>> +	if (ret)
>> +		goto free_newpn;
>> +
>> +	if (!cgroup_lock_live_group(cgrp)) {
>> +		ret = -ENODEV;
>> +		goto free_newpn;
>> +	}
>> +
>> +	iocg = cgroup_to_io_cgroup(cgrp);
>> +	spin_lock_irq(&iocg->lock);
>> +
>> +	pn = policy_search_node(iocg, newpn->key);
>> +	if (!pn) {
>> +		if (newpn->weight != 0) {
>> +			policy_insert_node(iocg, newpn);
>> +			keep_newpn = 1;
>> +		}
>> +		goto update_io_group;
>> +	}
>> +
>> +	if (newpn->weight == 0) {
>> +		/* weight == 0 means deleting a policy */
>> +		policy_delete_node(pn);
>> +		goto update_io_group;
>> +	}
>> +
>> +	pn->weight = newpn->weight;
>> +	pn->ioprio_class = newpn->ioprio_class;
>> +
>> +update_io_group:
>> +	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
>> +		if (iog->key == newpn->key) {
>> +			if (newpn->weight) {
>> +				iog->entity.new_weight = newpn->weight;
>> +				iog->entity.new_ioprio_class =
>> +					newpn->ioprio_class;
>> +				/*
>> +				 * iog weight and ioprio_class updating
>> +				 * actually happens if ioprio_changed is set.
>> +				 * So ensure ioprio_changed is not set until
>> +				 * new weight and new ioprio_class are updated.
>> +				 */
>> +				smp_wmb();
>> +				iog->entity.ioprio_changed = 1;
>> +			} else {
>> +				iog->entity.new_weight = iocg->weight;
>> +				iog->entity.new_ioprio_class =
>> +					iocg->ioprio_class;
>> +
>> +				/* The same as above */
>> +				smp_wmb();
>> +				iog->entity.ioprio_changed = 1;
>> +			}
>> +		}
>> +	}
>> +	spin_unlock_irq(&iocg->lock);
>> +
>> +	cgroup_unlock();
>> +
>> +free_newpn:
>> +	if (!keep_newpn)
>> +		kfree(newpn);
>> +free_buf:
>> +	kfree(buf);
>> +	return ret;
>> +}
>> +
>>  struct cftype bfqio_files[] = {
>>  	{
>> +		.name = "policy",
>> +		.read_seq_string = io_cgroup_policy_read,
>> +		.write_string = io_cgroup_policy_write,
>> +		.max_write_len = 256,
>> +	},
>> +	{
>>  		.name = "weight",
>>  		.read_u64 = io_cgroup_weight_read,
>>  		.write_u64 = io_cgroup_weight_write,
>> @@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>>  	INIT_HLIST_HEAD(&iocg->group_data);
>>  	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>>  	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>> +	INIT_LIST_HEAD(&iocg->list);
>>  
>>  	return &iocg->css;
>>  }
>> @@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>>  	unsigned long flags, flags1;
>>  	int queue_lock_held = 0;
>>  	struct elv_fq_data *efqd;
>> +	struct policy_node *pn, *pntmp;
>>  
>>  	/*
>>  	 * io groups are linked in two lists. One list is maintained
>> @@ -1823,6 +2046,12 @@ locked:
>>  	BUG_ON(!hlist_empty(&iocg->group_data));
>>  
>>  	free_css_id(&io_subsys, &iocg->css);
>> +
>> +	list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
>> +		policy_delete_node(pn);
>> +		kfree(pn);
>> +	}
>> +
>>  	kfree(iocg);
>>  }
>>  
>> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>>  void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>>  {
>>  	entity->ioprio = entity->new_ioprio;
>> -	entity->weight = entity->new_weight;
>> +	entity->weight = entity->new_weigh;
>>  	entity->ioprio_class = entity->new_ioprio_class;
>>  	entity->sched_data = &iog->sched_data;
>>  }
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index db3a347..0407633 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -253,6 +253,14 @@ struct io_group {
>>  #endif
>>  };
>>  
>> +struct policy_node {
>> +	struct list_head node;
>> +	char dev_name[32];
>> +	void *key;
>> +	unsigned long weight;
>> +	unsigned long ioprio_class;
>> +};
>> +
>>  /**
>>   * struct bfqio_cgroup - bfq cgroup data structure.
>>   * @css: subsystem state for bfq in the containing cgroup.
>> @@ -269,6 +277,9 @@ struct io_cgroup {
>>  
>>  	unsigned long weight, ioprio_class;
>>  
>> +	/* list of policy_node */
>> +	struct list_head list;
>> +
>>  	spinlock_t lock;
>>  	struct hlist_head group_data;
>>  };
>> -- 
>> 1.5.4.rc3
>>
>>
> 
> 
> 

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-13 19:09   ` Vivek Goyal
@ 2009-05-14  1:35     ` Gui Jianfeng
       [not found]     ` <20090513190929.GB18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-14  7:26     ` Gui Jianfeng
  2 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  1:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:00:21AM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch enables per-cgroup per-device weight and ioprio_class handling.
>> A new cgroup interface "policy" is introduced. You can make use of this 
>> file to configure weight and ioprio_class for each device in a given cgroup.
>> The original "weight" and "ioprio_class" files are still available. If you
>> don't do special configuration for a particular device, "weight" and
>> "ioprio_class" are used as the default values for that device.
>>
>> You can use the following format to play with the new interface.
>> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
>> weight=0 means removing the policy for DEV.
>>
>> Examples:
>> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
>> # echo /dev/hdb:300:2 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hdb 300 2
>>
>> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
>> # echo /dev/hda:500:1 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hda 500 1
>> /dev/hdb 300 2
>>
>> Remove the policy for /dev/hda in this cgroup
>> # echo /dev/hda:0:1 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hdb 300 2
>>
> 
> Hi Gui,
> 
> Noticed a few things during testing.
> 
> 1. Writing 0 as the weight does not remove the policy for me if I switch
>    the IO scheduler on the device.
> 
> 	- echo "/dev/sdb:500:2" > io.policy
> 	- Change elevator on device /sdb
> 	- echo "/dev/sdb:0:2" > io.policy
> 	- cat io.policy
> 	  The old rule does not go away.
> 
> 2. One can add the same rule twice after changing the elevator.
> 
> 	- echo "/dev/sdb:500:2" > io.policy
> 	- Change elevator on device /sdb
> 	- echo "/dev/sdb:500:2" > io.policy
> 	- cat io.policy
> 
> 	Same rule appears twice
> 
> 3. If one writes to io.weight, it should not update the weight for a
>    device if there is already a rule for that device. For example, if a
>    cgroup has io.weight=1000 and I later set the weight on /dev/sdb to
>    500 and then change io.weight to 200, the groups on /dev/sdb should
>    not be updated. Why? Because I think it makes more sense to keep the
>    simple rule that as long as there is a rule for a device, it always
>    overrides the generic setting of io.weight.
> 
> 4. A wrong rule should return an invalid value error; instead we see an oops.
> 
>    - echo "/dev/sdb:0:" > io.policy

  Hi Vivek,

  Thanks for testing, I'll fix the above problems and send an updated version.

> 
> [ 2651.587533] BUG: unable to handle kernel NULL pointer dereference at
> (null)
> [ 2651.588301] IP: [<ffffffff811f035c>] strict_strtoul+0x24/0x79
> [ 2651.588301] PGD 38c33067 PUD 38d67067 PMD 0 
> [ 2651.588301] Oops: 0000 [#2] SMP 
> [ 2651.588301] last sysfs file:
> /sys/devices/pci0000:00/0000:00:1c.0/0000:0e:00.0/irq
> [ 2651.588301] CPU 2 
> [ 2651.588301] Modules linked in:
> [ 2651.588301] Pid: 4538, comm: bash Tainted: G      D    2.6.30-rc4-ioc
> #52 HP xw6600 Workstation
> [ 2651.588301] RIP: 0010:[<ffffffff811f035c>]  [<ffffffff811f035c>]
> strict_strtoul+0x24/0x79
> [ 2651.588301] RSP: 0018:ffff88003dd73dc0  EFLAGS: 00010286
> [ 2651.588301] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> ffffffffffffffff
> [ 2651.588301] RDX: ffff88003e9ffca0 RSI: 000000000000000a RDI:
> 0000000000000000
> [ 2651.588301] RBP: ffff88003dd73de8 R08: 000000000000000a R09:
> ffff88003dd73cf8
> [ 2651.588301] R10: ffff88003dcd2300 R11: ffffffff8178aa00 R12:
> ffff88003f4a1e00
> [ 2651.588301] R13: ffff88003e9ffca0 R14: ffff88003ac5f200 R15:
> ffff88003fa7ed40
> [ 2651.588301] FS:  00007ff971c466f0(0000) GS:ffff88000209c000(0000)
> knlGS:0000000000000000
> [ 2651.588301] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 2651.588301] CR2: 0000000000000000 CR3: 000000003ad0d000 CR4:
> 00000000000006e0
> [ 2651.588301] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 2651.588301] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [ 2651.588301] Process bash (pid: 4538, threadinfo ffff88003dd72000, task
> ffff880038d98000)
> [ 2651.588301] Stack:
> [ 2651.588301]  ffffffff810d8f23 ffff88003fa7ed4a ffff88003dcdeee0
> ffff88003f4a1e00
> [ 2651.588301]  ffff88003e9ffc60 ffff88003dd73e68 ffffffff811e8097
> ffff880038dd2780
> [ 2651.588301]  ffff88003dd73e48 ffff88003fa7ed40 ffff88003fa7ed49
> 0000000000000000
> [ 2651.588301] Call Trace:
> [ 2651.588301]  [<ffffffff810d8f23>] ? iput+0x2f/0x65
> [ 2651.588301]  [<ffffffff811e8097>] io_cgroup_policy_write+0x11d/0x2ac
> [ 2651.588301]  [<ffffffff81072dee>] cgroup_file_write+0x1ec/0x254
> [ 2651.588301]  [<ffffffff811afce8>] ? security_file_permission+0x11/0x13
> [ 2651.588301]  [<ffffffff810c8394>] vfs_write+0xab/0x105
> [ 2651.588301]  [<ffffffff810c84a8>] sys_write+0x47/0x6c
> [ 2651.588301]  [<ffffffff8100bb2b>] system_call_fastpath+0x16/0x1b
> [ 2651.588301] Code: 65 ff ff ff 5b c9 c3 55 48 83 c9 ff 31 c0 fc 48 89 e5
> 41 55 41 89 f0 49 89 d5 41 54 53 48 89 fb 48 83 ec 10 48 c7 02 00 00 00 00
> <f2> ae 48 f7 d1 49 89 cc 49 ff cc 74 39 48 8d 75 e0 44 89 c2 48 
> [ 2651.588301] RIP  [<ffffffff811f035c>] strict_strtoul+0x24/0x79
> [ 2651.588301]  RSP <ffff88003dd73dc0>
> [ 2651.588301] CR2: 0000000000000000
> [ 2651.828110] ---[ end trace 537b9a98ce297f01 ]---
> 
> Thanks
> Vivek
> 
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>>  block/elevator-fq.c |  239 +++++++++++++++++++++++++++++++++++++++++++++++++-
>>  block/elevator-fq.h |   11 +++
>>  2 files changed, 245 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>> index 69435ab..7c95d55 100644
>> --- a/block/elevator-fq.c
>> +++ b/block/elevator-fq.c
>> @@ -12,6 +12,9 @@
>>  #include "elevator-fq.h"
>>  #include <linux/blktrace_api.h>
>>  #include <linux/biotrack.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/genhd.h>
>> +
>>  
>>  /* Values taken from cfq */
>>  const int elv_slice_sync = HZ / 10;
>> @@ -1045,12 +1048,30 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>>  }
>>  EXPORT_SYMBOL(io_lookup_io_group_current);
>>  
>> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
>> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
>> +					      void *key);
>> +
>> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
>> +			  void *key)
>>  {
>>  	struct io_entity *entity = &iog->entity;
>> +	struct policy_node *pn;
>> +
>> +	spin_lock_irq(&iocg->lock);
>> +	pn = policy_search_node(iocg, key);
>> +	if (pn) {
>> +		entity->weight = pn->weight;
>> +		entity->new_weight = pn->weight;
>> +		entity->ioprio_class = pn->ioprio_class;
>> +		entity->new_ioprio_class = pn->ioprio_class;
>> +	} else {
>> +		entity->weight = iocg->weight;
>> +		entity->new_weight = iocg->weight;
>> +		entity->ioprio_class = iocg->ioprio_class;
>> +		entity->new_ioprio_class = iocg->ioprio_class;
>> +	}
>> +	spin_unlock_irq(&iocg->lock);
>>  
>> -	entity->weight = entity->new_weight = iocg->weight;
>> -	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
>>  	entity->ioprio_changed = 1;
>>  	entity->my_sched_data = &iog->sched_data;
>>  }
>> @@ -1263,7 +1284,7 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
>>  		atomic_set(&iog->ref, 0);
>>  		iog->deleting = 0;
>>  
>> -		io_group_init_entity(iocg, iog);
>> +		io_group_init_entity(iocg, iog, key);
>>  		iog->my_entity = &iog->entity;
>>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>>  		iog->iocg_id = css_id(&iocg->css);
>> @@ -1549,8 +1570,208 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
>>  	return iog;
>>  }
>>  
>> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
>> +				  struct seq_file *m)
>> +{
>> +	struct io_cgroup *iocg;
>> +	struct policy_node *pn;
>> +
>> +	iocg = cgroup_to_io_cgroup(cgrp);
>> +
>> +	if (list_empty(&iocg->list))
>> +		goto out;
>> +
>> +	seq_printf(m, "dev weight class\n");
>> +
>> +	spin_lock_irq(&iocg->lock);
>> +	list_for_each_entry(pn, &iocg->list, node) {
>> +		seq_printf(m, "%s %lu %lu\n", pn->dev_name,
>> +			   pn->weight, pn->ioprio_class);
>> +	}
>> +	spin_unlock_irq(&iocg->lock);
>> +out:
>> +	return 0;
>> +}
>> +
>> +static inline void policy_insert_node(struct io_cgroup *iocg,
>> +					  struct policy_node *pn)
>> +{
>> +	list_add(&pn->node, &iocg->list);
>> +}
>> +
>> +/* Must be called with iocg->lock held */
>> +static inline void policy_delete_node(struct policy_node *pn)
>> +{
>> +	list_del(&pn->node);
>> +}
>> +
>> +/* Must be called with iocg->lock held */
>> +static struct policy_node *policy_search_node(const struct io_cgroup *iocg,
>> +					      void *key)
>> +{
>> +	struct policy_node *pn;
>> +
>> +	if (list_empty(&iocg->list))
>> +		return NULL;
>> +
>> +	list_for_each_entry(pn, &iocg->list, node) {
>> +		if (pn->key == key)
>> +			return pn;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +static void *devname_to_efqd(const char *buf)
>> +{
>> +	struct block_device *bdev;
>> +	void *key = NULL;
>> +	struct gendisk *disk;
>> +	int part;
>> +
>> +	bdev = lookup_bdev(buf);
>> +	if (IS_ERR(bdev))
>> +		return NULL;
>> +
>> +	disk = get_gendisk(bdev->bd_dev, &part);
>> +	key = (void *)&disk->queue->elevator->efqd;
>> +	bdput(bdev);
>> +
>> +	return key;
>> +}
>> +
>> +static int policy_parse_and_set(char *buf, struct policy_node *newpn)
>> +{
>> +	char *s[3];
>> +	char *p;
>> +	int ret;
>> +	int i = 0;
>> +
>> +	memset(s, 0, sizeof(s));
>> +	while (i < ARRAY_SIZE(s)) {
>> +		p = strsep(&buf, ":");
>> +		if (!p)
>> +			break;
>> +		if (!*p)
>> +			continue;
>> +		s[i++] = p;
>> +	}
>> +
>> +	newpn->key = devname_to_efqd(s[0]);
>> +	if (!newpn->key)
>> +		return -EINVAL;
>> +
>> +	strcpy(newpn->dev_name, s[0]);
>> +
>> +	ret = strict_strtoul(s[1], 10, &newpn->weight);
>> +	if (ret || newpn->weight > WEIGHT_MAX)
>> +		return -EINVAL;
>> +
>> +	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
>> +	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
>> +	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
>> +			    const char *buffer)
>> +{
>> +	struct io_cgroup *iocg;
>> +	struct policy_node *newpn, *pn;
>> +	char *buf;
>> +	int ret = 0;
>> +	int keep_newpn = 0;
>> +	struct hlist_node *n;
>> +	struct io_group *iog;
>> +
>> +	buf = kstrdup(buffer, GFP_KERNEL);
>> +	if (!buf)
>> +		return -ENOMEM;
>> +
>> +	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
>> +	if (!newpn) {
>> +		ret = -ENOMEM;
>> +		goto free_buf;
>> +	}
>> +
>> +	ret = policy_parse_and_set(buf, newpn);
>> +	if (ret)
>> +		goto free_newpn;
>> +
>> +	if (!cgroup_lock_live_group(cgrp)) {
>> +		ret = -ENODEV;
>> +		goto free_newpn;
>> +	}
>> +
>> +	iocg = cgroup_to_io_cgroup(cgrp);
>> +	spin_lock_irq(&iocg->lock);
>> +
>> +	pn = policy_search_node(iocg, newpn->key);
>> +	if (!pn) {
>> +		if (newpn->weight != 0) {
>> +			policy_insert_node(iocg, newpn);
>> +			keep_newpn = 1;
>> +		}
>> +		goto update_io_group;
>> +	}
>> +
>> +	if (newpn->weight == 0) {
>> +		/* weight == 0 means deleting a policy */
>> +		policy_delete_node(pn);
>> +		goto update_io_group;
>> +	}
>> +
>> +	pn->weight = newpn->weight;
>> +	pn->ioprio_class = newpn->ioprio_class;
>> +
>> +update_io_group:
>> +	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
>> +		if (iog->key == newpn->key) {
>> +			if (newpn->weight) {
>> +				iog->entity.new_weight = newpn->weight;
>> +				iog->entity.new_ioprio_class =
>> +					newpn->ioprio_class;
>> +				/*
>> +				 * iog weight and ioprio_class updating
>> +				 * actually happens if ioprio_changed is set.
>> +				 * So ensure ioprio_changed is not set until
>> +				 * new weight and new ioprio_class are updated.
>> +				 */
>> +				smp_wmb();
>> +				iog->entity.ioprio_changed = 1;
>> +			} else {
>> +				iog->entity.new_weight = iocg->weight;
>> +				iog->entity.new_ioprio_class =
>> +					iocg->ioprio_class;
>> +
>> +				/* The same as above */
>> +				smp_wmb();
>> +				iog->entity.ioprio_changed = 1;
>> +			}
>> +		}
>> +	}
>> +	spin_unlock_irq(&iocg->lock);
>> +
>> +	cgroup_unlock();
>> +
>> +free_newpn:
>> +	if (!keep_newpn)
>> +		kfree(newpn);
>> +free_buf:
>> +	kfree(buf);
>> +	return ret;
>> +}
>> +
>>  struct cftype bfqio_files[] = {
>>  	{
>> +		.name = "policy",
>> +		.read_seq_string = io_cgroup_policy_read,
>> +		.write_string = io_cgroup_policy_write,
>> +		.max_write_len = 256,
>> +	},
>> +	{
>>  		.name = "weight",
>>  		.read_u64 = io_cgroup_weight_read,
>>  		.write_u64 = io_cgroup_weight_write,
>> @@ -1592,6 +1813,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>>  	INIT_HLIST_HEAD(&iocg->group_data);
>>  	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>>  	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>> +	INIT_LIST_HEAD(&iocg->list);
>>  
>>  	return &iocg->css;
>>  }
>> @@ -1750,6 +1972,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>>  	unsigned long flags, flags1;
>>  	int queue_lock_held = 0;
>>  	struct elv_fq_data *efqd;
>> +	struct policy_node *pn, *pntmp;
>>  
>>  	/*
>>  	 * io groups are linked in two lists. One list is maintained
>> @@ -1823,6 +2046,12 @@ locked:
>>  	BUG_ON(!hlist_empty(&iocg->group_data));
>>  
>>  	free_css_id(&io_subsys, &iocg->css);
>> +
>> +	list_for_each_entry_safe(pn, pntmp, &iocg->list, node) {
>> +		policy_delete_node(pn);
>> +		kfree(pn);
>> +	}
>> +
>>  	kfree(iocg);
>>  }
>>  
>> @@ -2137,7 +2366,7 @@ void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
>>  void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
>>  {
>>  	entity->ioprio = entity->new_ioprio;
>> -	entity->weight = entity->new_weight;
>> +	entity->weight = entity->new_weight;
>>  	entity->ioprio_class = entity->new_ioprio_class;
>>  	entity->sched_data = &iog->sched_data;
>>  }
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index db3a347..0407633 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -253,6 +253,14 @@ struct io_group {
>>  #endif
>>  };
>>  
>> +struct policy_node {
>> +	struct list_head node;
>> +	char dev_name[32];
>> +	void *key;
>> +	unsigned long weight;
>> +	unsigned long ioprio_class;
>> +};
>> +
>>  /**
>>   * struct bfqio_cgroup - bfq cgroup data structure.
>>   * @css: subsystem state for bfq in the containing cgroup.
>> @@ -269,6 +277,9 @@ struct io_cgroup {
>>  
>>  	unsigned long weight, ioprio_class;
>>  
>> +	/* list of policy_node */
>> +	struct list_head list;
>> +
>>  	spinlock_t lock;
>>  	struct hlist_head group_data;
>>  };
>> -- 
>> 1.5.4.rc3
>>
>>
> 
> 
> 

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]     ` <20090513155900.GA15623-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14  1:51       ` Gui Jianfeng
  2009-05-14  2:25       ` Gui Jianfeng
  1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  1:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
...
> 
> Hi Gui,
> 
> It might make sense to also store the device name or device major and
> minor number in io_group while creating the io group. This will help us
> to display io.disk_time and io.disk_sector statistics per device instead
> of aggregate.
> 
> I am attaching a patch I was playing around with to display per-device
> statistics instead of an aggregate one, so that per-device numbers show up
> if the user has specified a per-device rule.
> 
> Thanks
> Vivek
> 
> 
> o Currently the statistics exported through cgroup are aggregate of statistics
>   on all devices for that cgroup. Instead of aggregate, make these per device.

Hi Vivek,

Actually, I did it also.
FYI

Examples:
# cat io.disk_time
dev:/dev/hdb time:4421
dev:others time:3741

# cat io.disk_sectors
dev:/dev/hdb sectors:585696
dev:others sectors:2664
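
For anyone scripting against this output, below is a minimal userspace sketch
that parses the "dev:<name> time:<msec>" lines shown above. The cgroup mount
point and file path are assumptions for illustration only; they are not
defined by this patch.

#include <stdio.h>

/*
 * Illustrative parser for the per-device statistics format above.
 * "/cgroup/test1" is a hypothetical cgroup mount point.
 */
int main(void)
{
	FILE *fp = fopen("/cgroup/test1/io.disk_time", "r");
	char dev[64];
	unsigned long long val;

	if (!fp) {
		perror("io.disk_time");
		return 1;
	}
	/* each line looks like "dev:/dev/hdb time:4421" */
	while (fscanf(fp, "dev:%63s time:%llu\n", dev, &val) == 2)
		printf("%-16s %llu ms\n", dev, val);
	fclose(fp);
	return 0;
}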

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 block/elevator-fq.c |  104 +++++++++++++++++++++++---------------------------
 1 files changed, 48 insertions(+), 56 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7c95d55..1620074 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1162,90 +1162,82 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
 STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
 #undef STORE_FUNCTION
 
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr disk time received by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
+static int io_cgroup_disk_time_read(struct cgroup *cgroup,
+				    struct cftype *cftype,
+				    struct seq_file *m)
 {
+	struct io_cgroup *iocg;
 	struct io_group *iog;
 	struct hlist_node *n;
-	u64 disk_time = 0;
-
-	rcu_read_lock();
-	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
-		/*
-		 * There might be groups which are not functional and
-		 * waiting to be reclaimed upon cgoup deletion.
-		 */
-		if (rcu_dereference(iog->key))
-			disk_time += iog->entity.total_service;
-	}
-	rcu_read_unlock();
-
-	return disk_time;
-}
+	struct policy_node *pn;
+	unsigned int other, time;
 
-static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
-					struct cftype *cftype)
-{
-	struct io_cgroup *iocg;
-	u64 ret;
+	other = 0;
 
 	if (!cgroup_lock_live_group(cgroup))
 		return -ENODEV;
 
 	iocg = cgroup_to_io_cgroup(cgroup);
 	spin_lock_irq(&iocg->lock);
-	ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		if (iog->key != NULL) {
+			pn = policy_search_node(iocg, iog->key);
+			if (pn) {
+				time = jiffies_to_msecs(iog->entity.
+							total_service);
+				seq_printf(m, "dev:%s time:%u\n",
+					   pn->dev_name, time);
+			} else {
+				other += jiffies_to_msecs(iog->entity.
+							  total_service);
+			}
+		}
+	}
+	seq_printf(m, "dev:others time:%u\n", other);
+
 	spin_unlock_irq(&iocg->lock);
 
 	cgroup_unlock();
 
-	return ret;
+	return 0;
 }
 
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr number of sectors transferred by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_sectors(struct io_cgroup *iocg)
+static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+				       struct cftype *cftype,
+				       struct seq_file *m)
 {
+	struct io_cgroup *iocg;
 	struct io_group *iog;
 	struct hlist_node *n;
-	u64 disk_sectors = 0;
-
-	rcu_read_lock();
-	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
-		/*
-		 * There might be groups which are not functional and
-		 * waiting to be reclaimed upon cgoup deletion.
-		 */
-		if (rcu_dereference(iog->key))
-			disk_sectors += iog->entity.total_sector_service;
-	}
-	rcu_read_unlock();
+	struct policy_node *pn;
+	u64 other = 0;
 
-	return disk_sectors;
-}
-
-static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
-					struct cftype *cftype)
-{
-	struct io_cgroup *iocg;
-	u64 ret;
 
 	if (!cgroup_lock_live_group(cgroup))
 		return -ENODEV;
 
 	iocg = cgroup_to_io_cgroup(cgroup);
 	spin_lock_irq(&iocg->lock);
-	ret = calculate_aggr_disk_sectors(iocg);
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		if (iog->key) {
+			pn = policy_search_node(iocg, iog->key);
+			if (pn) {
+				seq_printf(m, "dev:%s sectors:%lu\n",
+					   pn->dev_name,
+					   iog->entity.total_sector_service);
+			} else {
+				other += iog->entity.total_sector_service;
+			}
+		}
+	}
+
+	seq_printf(m, "dev:others sectors:%llu\n", other);
+
 	spin_unlock_irq(&iocg->lock);
 
 	cgroup_unlock();
 
-	return ret;
+	return 0;
 }
 
 /**
@@ -1783,11 +1775,11 @@ struct cftype bfqio_files[] = {
 	},
 	{
 		.name = "disk_time",
-		.read_u64 = io_cgroup_disk_time_read,
+		.read_seq_string = io_cgroup_disk_time_read,
 	},
 	{
 		.name = "disk_sectors",
-		.read_u64 = io_cgroup_disk_sectors_read,
+		.read_seq_string = io_cgroup_disk_sectors_read,
 	},
 };
 
-- 
1.5.4.rc3

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-13 15:59   ` Vivek Goyal
@ 2009-05-14  1:51     ` Gui Jianfeng
       [not found]     ` <20090513155900.GA15623-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-14  2:25     ` Gui Jianfeng
  2 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  1:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
...
> 
> Hi Gui,
> 
> It might make sense to also store the device name or device major and
> minor number in io_group while creating the io group. This will help us
> to display io.disk_time and io.disk_sector statistics per device instead
> of aggregate.
> 
> I am attaching a patch I was playing around with to display per-device
> statistics instead of an aggregate one, so that per-device numbers show up
> if the user has specified a per-device rule.
> 
> Thanks
> Vivek
> 
> 
> o Currently the statistics exported through cgroup are aggregate of statistics
>   on all devices for that cgroup. Instead of aggregate, make these per device.

Hi Vivek,

Actually, I did it also.
FYI

Examples:
# cat io.disk_time
dev:/dev/hdb time:4421
dev:others time:3741

# cat io.disk_sectors
dev:/dev/hdb sectors:585696
dev:others sectors:2664

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/elevator-fq.c |  104 +++++++++++++++++++++++---------------------------
 1 files changed, 48 insertions(+), 56 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7c95d55..1620074 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1162,90 +1162,82 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
 STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
 #undef STORE_FUNCTION
 
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr disk time received by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
+static int io_cgroup_disk_time_read(struct cgroup *cgroup,
+				    struct cftype *cftype,
+				    struct seq_file *m)
 {
+	struct io_cgroup *iocg;
 	struct io_group *iog;
 	struct hlist_node *n;
-	u64 disk_time = 0;
-
-	rcu_read_lock();
-	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
-		/*
-		 * There might be groups which are not functional and
-		 * waiting to be reclaimed upon cgoup deletion.
-		 */
-		if (rcu_dereference(iog->key))
-			disk_time += iog->entity.total_service;
-	}
-	rcu_read_unlock();
-
-	return disk_time;
-}
+	struct policy_node *pn;
+	unsigned int other, time;
 
-static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
-					struct cftype *cftype)
-{
-	struct io_cgroup *iocg;
-	u64 ret;
+	other = 0;
 
 	if (!cgroup_lock_live_group(cgroup))
 		return -ENODEV;
 
 	iocg = cgroup_to_io_cgroup(cgroup);
 	spin_lock_irq(&iocg->lock);
-	ret = jiffies_to_msecs(calculate_aggr_disk_time(iocg));
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		if (iog->key != NULL) {
+			pn = policy_search_node(iocg, iog->key);
+			if (pn) {
+				time = jiffies_to_msecs(iog->entity.
+							total_service);
+				seq_printf(m, "dev:%s time:%u\n",
+					   pn->dev_name, time);
+			} else {
+				other += jiffies_to_msecs(iog->entity.
+							  total_service);
+			}
+		}
+	}
+	seq_printf(m, "dev:others time:%u\n", other);
+
 	spin_unlock_irq(&iocg->lock);
 
 	cgroup_unlock();
 
-	return ret;
+	return 0;
 }
 
-/*
- * traverse through all the io_groups associated with this cgroup and calculate
- * the aggr number of sectors transferred by all the groups on respective disks.
- */
-static u64 calculate_aggr_disk_sectors(struct io_cgroup *iocg)
+static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+				       struct cftype *cftype,
+				       struct seq_file *m)
 {
+	struct io_cgroup *iocg;
 	struct io_group *iog;
 	struct hlist_node *n;
-	u64 disk_sectors = 0;
-
-	rcu_read_lock();
-	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
-		/*
-		 * There might be groups which are not functional and
-		 * waiting to be reclaimed upon cgoup deletion.
-		 */
-		if (rcu_dereference(iog->key))
-			disk_sectors += iog->entity.total_sector_service;
-	}
-	rcu_read_unlock();
+	struct policy_node *pn;
+	u64 other = 0;
 
-	return disk_sectors;
-}
-
-static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
-					struct cftype *cftype)
-{
-	struct io_cgroup *iocg;
-	u64 ret;
 
 	if (!cgroup_lock_live_group(cgroup))
 		return -ENODEV;
 
 	iocg = cgroup_to_io_cgroup(cgroup);
 	spin_lock_irq(&iocg->lock);
-	ret = calculate_aggr_disk_sectors(iocg);
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		if (iog->key) {
+			pn = policy_search_node(iocg, iog->key);
+			if (pn) {
+				seq_printf(m, "dev:%s sectors:%lu\n",
+					   pn->dev_name,
+					   iog->entity.total_sector_service);
+			} else {
+				other += iog->entity.total_sector_service;
+			}
+		}
+	}
+
+	seq_printf(m, "dev:others sectors:%llu\n", other);
+
 	spin_unlock_irq(&iocg->lock);
 
 	cgroup_unlock();
 
-	return ret;
+	return 0;
 }
 
 /**
@@ -1783,11 +1775,11 @@ struct cftype bfqio_files[] = {
 	},
 	{
 		.name = "disk_time",
-		.read_u64 = io_cgroup_disk_time_read,
+		.read_seq_string = io_cgroup_disk_time_read,
 	},
 	{
 		.name = "disk_sectors",
-		.read_u64 = io_cgroup_disk_sectors_read,
+		.read_seq_string = io_cgroup_disk_sectors_read,
 	},
 };
 
-- 
1.5.4.rc3



^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]     ` <20090513155900.GA15623-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-14  1:51       ` Gui Jianfeng
@ 2009-05-14  2:25       ` Gui Jianfeng
  1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  2:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
...
> Hi Gui,
> 
> It might make sense to also store the device name or device major and
> minor number in io_group while creating the io group. This will help us
> to display io.disk_time and io.disk_sector statistics per device instead
> of aggregate.
> 
> I am attaching a patch I was playing around with to display per-device
> statistics instead of an aggregate one, so that per-device numbers show up
> if the user has specified a per-device rule.
> 
> Thanks
> Vivek
> 
> 
> o Currently the statistics exported through cgroup are aggregate of statistics
>   on all devices for that cgroup. Instead of aggregate, make these per device.
> 
> o Also export another statistic, io.disk_dequeue. This keeps a count of how
>   many times a particular group dropped out of the race for the disk. This is
>   a debugging aid to keep track of how often we could create continuously
>   backlogged queues.
> 
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  block/elevator-fq.c |  127 +++++++++++++++++++++++++++++++++-------------------
>  block/elevator-fq.h |    3 +
>  2 files changed, 85 insertions(+), 45 deletions(-)
> 
> Index: linux14/block/elevator-fq.h
> ===================================================================
> --- linux14.orig/block/elevator-fq.h	2009-05-13 11:40:32.000000000 -0400
> +++ linux14/block/elevator-fq.h	2009-05-13 11:40:57.000000000 -0400
> @@ -250,6 +250,9 @@ struct io_group {
>  
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>  	unsigned short iocg_id;
> +	dev_t	dev;
> +	/* How many times this group has been removed from active tree */
> +	unsigned long dequeue;
>  #endif
>  };
>  
> Index: linux14/block/elevator-fq.c
> ===================================================================
> --- linux14.orig/block/elevator-fq.c	2009-05-13 11:40:53.000000000 -0400
> +++ linux14/block/elevator-fq.c	2009-05-13 11:40:57.000000000 -0400
> @@ -12,6 +12,7 @@
>  #include "elevator-fq.h"
>  #include <linux/blktrace_api.h>
>  #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
>  
>  /* Values taken from cfq */
>  const int elv_slice_sync = HZ / 10;
> @@ -758,6 +759,18 @@ int __bfq_deactivate_entity(struct io_en
>  	BUG_ON(sd->active_entity == entity);
>  	BUG_ON(sd->next_active == entity);
>  
> +#ifdef CONFIG_DEBUG_GROUP_IOSCHED
> +	{
> +		struct io_group *iog = io_entity_to_iog(entity);
> +		/*
> +		 * Keep track of how many times a group has been removed
> +		 * from active tree because it did not have any active
> +		 * backlogged ioq under it
> +		 */
> +		if (iog)
> +			iog->dequeue++;
> +	}
> +#endif
>  	return ret;
>  }
>  
> @@ -1126,90 +1139,103 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
>  STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
>  #undef STORE_FUNCTION
>  
> -/*
> - * traverse through all the io_groups associated with this cgroup and calculate
> - * the aggr disk time received by all the groups on respective disks.
> - */
> -static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
> +static int io_cgroup_disk_time_read(struct cgroup *cgroup,
> +				struct cftype *cftype, struct seq_file *m)
>  {
> +	struct io_cgroup *iocg;
>  	struct io_group *iog;
>  	struct hlist_node *n;
> -	u64 disk_time = 0;
> +
> +	if (!cgroup_lock_live_group(cgroup))
> +		return -ENODEV;
> +
> +	iocg = cgroup_to_io_cgroup(cgroup);
>  
>  	rcu_read_lock();
> +	spin_lock_irq(&iocg->lock);
>  	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
>  		/*
>  		 * There might be groups which are not functional and
>  		 * waiting to be reclaimed upon cgoup deletion.
>  		 */
> -		if (rcu_dereference(iog->key))
> -			disk_time += iog->entity.total_service;
> +		if (rcu_dereference(iog->key)) {
> +			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
> +					MINOR(iog->dev),
> +					iog->entity.total_service);

  Hi Vivek,

  I think it's easier for users if the device name is also shown here.

> +		}
>  	}
> +	spin_unlock_irq(&iocg->lock);
>  	rcu_read_unlock();
>  
> -	return disk_time;
> +	cgroup_unlock();
> +
> +	return 0;
>  }
>  

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-13 15:59   ` Vivek Goyal
  2009-05-14  1:51     ` Gui Jianfeng
       [not found]     ` <20090513155900.GA15623-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14  2:25     ` Gui Jianfeng
  2 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  2:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
...
> Hi Gui,
> 
> It might make sense to also store the device name or device major and
> minor number in io_group while creating the io group. This will help us
> to display io.disk_time and io.disk_sector statistics per device instead
> of aggregate.
> 
> I am attaching a patch I was playing around with to display per-device
> statistics instead of an aggregate one, so that per-device numbers show up
> if the user has specified a per-device rule.
> 
> Thanks
> Vivek
> 
> 
> o Currently the statistics exported through cgroup are aggregate of statistics
>   on all devices for that cgroup. Instead of aggregate, make these per device.
> 
> o Also export another statistic, io.disk_dequeue. This keeps a count of how
>   many times a particular group dropped out of the race for the disk. This is
>   a debugging aid to keep track of how often we could create continuously
>   backlogged queues.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/elevator-fq.c |  127 +++++++++++++++++++++++++++++++++-------------------
>  block/elevator-fq.h |    3 +
>  2 files changed, 85 insertions(+), 45 deletions(-)
> 
> Index: linux14/block/elevator-fq.h
> ===================================================================
> --- linux14.orig/block/elevator-fq.h	2009-05-13 11:40:32.000000000 -0400
> +++ linux14/block/elevator-fq.h	2009-05-13 11:40:57.000000000 -0400
> @@ -250,6 +250,9 @@ struct io_group {
>  
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>  	unsigned short iocg_id;
> +	dev_t	dev;
> +	/* How many times this group has been removed from active tree */
> +	unsigned long dequeue;
>  #endif
>  };
>  
> Index: linux14/block/elevator-fq.c
> ===================================================================
> --- linux14.orig/block/elevator-fq.c	2009-05-13 11:40:53.000000000 -0400
> +++ linux14/block/elevator-fq.c	2009-05-13 11:40:57.000000000 -0400
> @@ -12,6 +12,7 @@
>  #include "elevator-fq.h"
>  #include <linux/blktrace_api.h>
>  #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
>  
>  /* Values taken from cfq */
>  const int elv_slice_sync = HZ / 10;
> @@ -758,6 +759,18 @@ int __bfq_deactivate_entity(struct io_en
>  	BUG_ON(sd->active_entity == entity);
>  	BUG_ON(sd->next_active == entity);
>  
> +#ifdef CONFIG_DEBUG_GROUP_IOSCHED
> +	{
> +		struct io_group *iog = io_entity_to_iog(entity);
> +		/*
> +		 * Keep track of how many times a group has been removed
> +		 * from active tree because it did not have any active
> +		 * backlogged ioq under it
> +		 */
> +		if (iog)
> +			iog->dequeue++;
> +	}
> +#endif
>  	return ret;
>  }
>  
> @@ -1126,90 +1139,103 @@ STORE_FUNCTION(weight, 0, WEIGHT_MAX);
>  STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
>  #undef STORE_FUNCTION
>  
> -/*
> - * traverse through all the io_groups associated with this cgroup and calculate
> - * the aggr disk time received by all the groups on respective disks.
> - */
> -static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
> +static int io_cgroup_disk_time_read(struct cgroup *cgroup,
> +				struct cftype *cftype, struct seq_file *m)
>  {
> +	struct io_cgroup *iocg;
>  	struct io_group *iog;
>  	struct hlist_node *n;
> -	u64 disk_time = 0;
> +
> +	if (!cgroup_lock_live_group(cgroup))
> +		return -ENODEV;
> +
> +	iocg = cgroup_to_io_cgroup(cgroup);
>  
>  	rcu_read_lock();
> +	spin_lock_irq(&iocg->lock);
>  	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
>  		/*
>  		 * There might be groups which are not functional and
>  		 * waiting to be reclaimed upon cgoup deletion.
>  		 */
> -		if (rcu_dereference(iog->key))
> -			disk_time += iog->entity.total_service;
> +		if (rcu_dereference(iog->key)) {
> +			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
> +					MINOR(iog->dev),
> +					iog->entity.total_service);

  Hi Vivek,

  I think it's easier for users if the device name is also shown here.

> +		}
>  	}
> +	spin_unlock_irq(&iocg->lock);
>  	rcu_read_unlock();
>  
> -	return disk_time;
> +	cgroup_unlock();
> +
> +	return 0;
>  }
>  

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]     ` <20090513190929.GB18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-14  1:35       ` Gui Jianfeng
@ 2009-05-14  7:26       ` Gui Jianfeng
  1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  7:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Hi Vivek,

This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this
file to configure the weight and ioprio_class for each device in a given
cgroup. The original "weight" and "ioprio_class" files are still available.
If you don't do special configuration for a particular device, "weight" and
"ioprio_class" are used as the default values for that device.

You can use the following format to play with the new interface.
# echo DEV:weight:ioprio_class > /path/to/cgroup/policy
weight=0 means removing the policy for DEV.

Examples:
Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
# echo /dev/hdb:300:2 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2

Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
# echo /dev/hda:500:1 > io.policy
# cat io.policy
dev weight class
/dev/hda 500 1
/dev/hdb 300 2

Remove the policy for /dev/hda in this cgroup
# echo /dev/hda:0:1 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2
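
As a usage illustration, here is a small C helper that programs a rule through
this interface by writing the "DEV:weight:ioprio_class" string described above.
The cgroup path and the helper name are assumptions for the example; only the
string format comes from this patch.

#include <stdio.h>
#include <errno.h>

/* write "DEV:weight:ioprio_class" into <cgroup>/io.policy (weight=0 removes) */
static int set_io_policy(const char *cgroup, const char *dev,
			 unsigned long weight, unsigned long ioprio_class)
{
	char path[256];
	FILE *fp;
	int ret = 0;

	snprintf(path, sizeof(path), "%s/io.policy", cgroup);
	fp = fopen(path, "w");
	if (!fp)
		return -errno;
	if (fprintf(fp, "%s:%lu:%lu", dev, weight, ioprio_class) < 0)
		ret = -EIO;
	fclose(fp);
	return ret;
}

int main(void)
{
	/* e.g. weight 300, best-effort class (2) for /dev/hdb, as above */
	return set_io_policy("/cgroup/test1", "/dev/hdb", 300, 2) ? 1 : 0;
}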

Changelog (v1 -> v2)
- Rename some structures
- Use the spin_lock_irqsave() and spin_unlock_irqrestore() variants to avoid
  enabling interrupts unconditionally.
- Fix the policy setup bug when switching to another io scheduler.
- If a policy is available for a specific device, don't update the weight and
  io class when writing "weight" and "ioprio_class".
- Fix a bug when parsing policy string.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 block/elevator-fq.c |  258 +++++++++++++++++++++++++++++++++++++++++++++++++--
 block/elevator-fq.h |   12 +++
 2 files changed, 261 insertions(+), 9 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69435ab..43b30a4 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -12,6 +12,9 @@
 #include "elevator-fq.h"
 #include <linux/blktrace_api.h>
 #include <linux/biotrack.h>
+#include <linux/seq_file.h>
+#include <linux/genhd.h>
+
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -1045,12 +1048,31 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
 }
 EXPORT_SYMBOL(io_lookup_io_group_current);
 
-void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+						 dev_t dev);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
+			  dev_t dev)
 {
 	struct io_entity *entity = &iog->entity;
+	struct io_policy_node *pn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iocg->lock, flags);
+	pn = policy_search_node(iocg, dev);
+	if (pn) {
+		entity->weight = pn->weight;
+		entity->new_weight = pn->weight;
+		entity->ioprio_class = pn->ioprio_class;
+		entity->new_ioprio_class = pn->ioprio_class;
+	} else {
+		entity->weight = iocg->weight;
+		entity->new_weight = iocg->weight;
+		entity->ioprio_class = iocg->ioprio_class;
+		entity->new_ioprio_class = iocg->ioprio_class;
+	}
+	spin_unlock_irqrestore(&iocg->lock, flags);
 
-	entity->weight = entity->new_weight = iocg->weight;
-	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
 	entity->ioprio_changed = 1;
 	entity->my_sched_data = &iog->sched_data;
 }
@@ -1114,6 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	struct io_cgroup *iocg;					\
 	struct io_group *iog;						\
 	struct hlist_node *n;						\
+	struct io_policy_node *pn;					\
 									\
 	if (val < (__MIN) || val > (__MAX))				\
 		return -EINVAL;						\
@@ -1126,6 +1149,9 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	spin_lock_irq(&iocg->lock);					\
 	iocg->__VAR = (unsigned long)val;				\
 	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		pn = policy_search_node(iocg, iog->dev);		\
+		if (pn)							\
+			continue;					\
 		iog->entity.new_##__VAR = (unsigned long)val;		\
 		smp_wmb();						\
 		iog->entity.ioprio_changed = 1;				\
@@ -1237,7 +1263,7 @@ static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
  * to the root has already an allocated group on @bfqd.
  */
 struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
-					struct cgroup *cgroup)
+				      struct cgroup *cgroup, struct bio *bio)
 {
 	struct io_cgroup *iocg;
 	struct io_group *iog, *leaf = NULL, *prev = NULL;
@@ -1263,12 +1289,17 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 		atomic_set(&iog->ref, 0);
 		iog->deleting = 0;
 
-		io_group_init_entity(iocg, iog);
-		iog->my_entity = &iog->entity;
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 		iog->iocg_id = css_id(&iocg->css);
+		if (bio) {
+			struct gendisk *disk = bio->bi_bdev->bd_disk;
+			iog->dev = MKDEV(disk->major, disk->first_minor);
+		}
 #endif
 
+		io_group_init_entity(iocg, iog, iog->dev);
+		iog->my_entity = &iog->entity;
+
 		blk_init_request_list(&iog->rl);
 
 		if (leaf == NULL) {
@@ -1379,7 +1410,7 @@ void io_group_chain_link(struct request_queue *q, void *key,
  */
 struct io_group *io_find_alloc_group(struct request_queue *q,
 			struct cgroup *cgroup, struct elv_fq_data *efqd,
-			int create)
+			     int create, struct bio *bio)
 {
 	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
 	struct io_group *iog = NULL;
@@ -1390,7 +1421,7 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
 	if (iog != NULL || !create)
 		return iog;
 
-	iog = io_group_chain_alloc(q, key, cgroup);
+	iog = io_group_chain_alloc(q, key, cgroup, bio);
 	if (iog != NULL)
 		io_group_chain_link(q, key, cgroup, iog, efqd);
 
@@ -1489,7 +1520,7 @@ struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
 		goto out;
 	}
 
-	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	iog = io_find_alloc_group(q, cgroup, efqd, create, bio);
 	if (!iog) {
 		if (create)
 			iog = efqd->root_group;
@@ -1549,8 +1580,209 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	return iog;
 }
 
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+				  struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *pn;
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+
+	if (list_empty(&iocg->policy_list))
+		goto out;
+
+	seq_printf(m, "dev weight class\n");
+
+	spin_lock_irq(&iocg->lock);
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		seq_printf(m, "%s %lu %lu\n", pn->dev_name,
+			   pn->weight, pn->ioprio_class);
+	}
+	spin_unlock_irq(&iocg->lock);
+out:
+	return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+					  struct io_policy_node *pn)
+{
+	list_add(&pn->node, &iocg->policy_list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct io_policy_node *pn)
+{
+	list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+						 dev_t dev)
+{
+	struct io_policy_node *pn;
+
+	if (list_empty(&iocg->policy_list))
+		return NULL;
+
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		if (pn->dev == dev)
+			return pn;
+	}
+
+	return NULL;
+}
+
+static int devname_to_devnum(const char *buf, dev_t *dev)
+{
+	struct block_device *bdev;
+	struct gendisk *disk;
+	int part;
+
+	bdev = lookup_bdev(buf);
+	if (IS_ERR(bdev))
+		return -ENODEV;
+
+	disk = get_gendisk(bdev->bd_dev, &part);
+	*dev = MKDEV(disk->major, disk->first_minor);
+	bdput(bdev);
+
+	return 0;
+}
+
+static int policy_parse_and_set(char *buf, struct io_policy_node *newpn)
+{
+	char *s[3], *p;
+	int ret;
+	int i = 0;
+
+	memset(s, 0, sizeof(s));
+	while ((p = strsep(&buf, ":")) != NULL) {
+		if (!*p)
+			continue;
+		s[i++] = p;
+	}
+
+	ret = devname_to_devnum(s[0], &newpn->dev);
+	if (ret)
+		return ret;
+
+	strcpy(newpn->dev_name, s[0]);
+
+	if (s[1] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[1], 10, &newpn->weight);
+	if (ret || newpn->weight > WEIGHT_MAX)
+		return -EINVAL;
+
+	if (s[2] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
+	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
+	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+			    const char *buffer)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *newpn, *pn;
+	char *buf;
+	int ret = 0;
+	int keep_newpn = 0;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	buf = kstrdup(buffer, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+	if (!newpn) {
+		ret = -ENOMEM;
+		goto free_buf;
+	}
+
+	ret = policy_parse_and_set(buf, newpn);
+	if (ret)
+		goto free_newpn;
+
+	if (!cgroup_lock_live_group(cgrp)) {
+		ret = -ENODEV;
+		goto free_newpn;
+	}
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+	spin_lock_irq(&iocg->lock);
+
+	pn = policy_search_node(iocg, newpn->dev);
+	if (!pn) {
+		if (newpn->weight != 0) {
+			policy_insert_node(iocg, newpn);
+			keep_newpn = 1;
+		}
+		goto update_io_group;
+	}
+
+	if (newpn->weight == 0) {
+		/* weight == 0 means deleting a policy */
+		policy_delete_node(pn);
+		goto update_io_group;
+	}
+
+	pn->weight = newpn->weight;
+	pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+		if (iog->dev == newpn->dev) {
+			if (newpn->weight) {
+				iog->entity.new_weight = newpn->weight;
+				iog->entity.new_ioprio_class =
+					newpn->ioprio_class;
+				/*
+				 * iog weight and ioprio_class updating
+				 * actually happens if ioprio_changed is set.
+				 * So ensure ioprio_changed is not set until
+				 * new weight and new ioprio_class are updated.
+				 */
+				smp_wmb();
+				iog->entity.ioprio_changed = 1;
+			} else {
+				iog->entity.new_weight = iocg->weight;
+				iog->entity.new_ioprio_class =
+					iocg->ioprio_class;
+
+				/* The same as above */
+				smp_wmb();
+				iog->entity.ioprio_changed = 1;
+			}
+		}
+	}
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+free_newpn:
+	if (!keep_newpn)
+		kfree(newpn);
+free_buf:
+	kfree(buf);
+	return ret;
+}
+
 struct cftype bfqio_files[] = {
 	{
+		.name = "policy",
+		.read_seq_string = io_cgroup_policy_read,
+		.write_string = io_cgroup_policy_write,
+		.max_write_len = 256,
+	},
+	{
 		.name = "weight",
 		.read_u64 = io_cgroup_weight_read,
 		.write_u64 = io_cgroup_weight_write,
@@ -1592,6 +1824,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
 	INIT_HLIST_HEAD(&iocg->group_data);
 	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
 	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+	INIT_LIST_HEAD(&iocg->policy_list);
 
 	return &iocg->css;
 }
@@ -1750,6 +1983,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 	unsigned long flags, flags1;
 	int queue_lock_held = 0;
 	struct elv_fq_data *efqd;
+	struct io_policy_node *pn, *pntmp;
 
 	/*
 	 * io groups are linked in two lists. One list is maintained
@@ -1823,6 +2057,12 @@ locked:
 	BUG_ON(!hlist_empty(&iocg->group_data));
 
 	free_css_id(&io_subsys, &iocg->css);
+
+	list_for_each_entry_safe(pn, pntmp, &iocg->policy_list, node) {
+		policy_delete_node(pn);
+		kfree(pn);
+	}
+
 	kfree(iocg);
 }
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index db3a347..b1d97e6 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -250,9 +250,18 @@ struct io_group {
 
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 	unsigned short iocg_id;
+	dev_t dev;
 #endif
 };
 
+struct io_policy_node {
+	struct list_head node;
+	char dev_name[32];
+	dev_t dev;
+	unsigned long weight;
+	unsigned long ioprio_class;
+};
+
 /**
  * struct bfqio_cgroup - bfq cgroup data structure.
  * @css: subsystem state for bfq in the containing cgroup.
@@ -269,6 +278,9 @@ struct io_cgroup {
 
 	unsigned long weight, ioprio_class;
 
+	/* list of io_policy_node */
+	struct list_head policy_list;
+
 	spinlock_t lock;
 	struct hlist_head group_data;
 };
-- 
1.5.4.rc3
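
A note on the smp_wmb() in io_cgroup_policy_write() above: it only provides the
intended ordering if the consumer of ioprio_changed issues the matching read
barrier before it loads new_weight and new_ioprio_class. The sketch below is a
hypothetical reader side for illustration; the real consumer sits elsewhere in
elevator-fq.c and may be structured differently.

/* hypothetical reader-side pairing, not part of this patch */
static void io_entity_apply_new_prio(struct io_entity *entity)
{
	if (entity->ioprio_changed) {
		/* pairs with the smp_wmb() in io_cgroup_policy_write() */
		smp_rmb();
		entity->weight = entity->new_weight;
		entity->ioprio_class = entity->new_ioprio_class;
		entity->ioprio_changed = 0;
	}
}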

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-13 19:09   ` Vivek Goyal
  2009-05-14  1:35     ` Gui Jianfeng
       [not found]     ` <20090513190929.GB18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14  7:26     ` Gui Jianfeng
  2009-05-14 15:15       ` Vivek Goyal
                         ` (2 more replies)
  2 siblings, 3 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  7:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Hi Vivek,

This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this
file to configure the weight and ioprio_class for each device in a given
cgroup. The original "weight" and "ioprio_class" files are still available.
If you don't do special configuration for a particular device, "weight" and
"ioprio_class" are used as the default values for that device.

You can use the following format to play with the new interface.
# echo DEV:weight:ioprio_class > /path/to/cgroup/policy
weight=0 means removing the policy for DEV.

Examples:
Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
# echo /dev/hdb:300:2 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2

Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
# echo /dev/hda:500:1 > io.policy
# cat io.policy
dev weight class
/dev/hda 500 1
/dev/hdb 300 2

Remove the policy for /dev/hda in this cgroup
# echo /dev/hda:0:1 > io.policy
# cat io.policy
dev weight class
/dev/hdb 300 2

Changelog (v1 -> v2)
- Rename some structures
- Use the spin_lock_irqsave() and spin_unlock_irqrestore() variants to avoid
  enabling interrupts unconditionally.
- Fix the policy setup bug when switching to another io scheduler.
- If a policy is available for a specific device, don't update the weight and
  io class when writing "weight" and "ioprio_class".
- Fix a bug when parsing policy string.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/elevator-fq.c |  258 +++++++++++++++++++++++++++++++++++++++++++++++++--
 block/elevator-fq.h |   12 +++
 2 files changed, 261 insertions(+), 9 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 69435ab..43b30a4 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -12,6 +12,9 @@
 #include "elevator-fq.h"
 #include <linux/blktrace_api.h>
 #include <linux/biotrack.h>
+#include <linux/seq_file.h>
+#include <linux/genhd.h>
+
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -1045,12 +1048,31 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
 }
 EXPORT_SYMBOL(io_lookup_io_group_current);
 
-void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+						 dev_t dev);
+
+void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
+			  dev_t dev)
 {
 	struct io_entity *entity = &iog->entity;
+	struct io_policy_node *pn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iocg->lock, flags);
+	pn = policy_search_node(iocg, dev);
+	if (pn) {
+		entity->weight = pn->weight;
+		entity->new_weight = pn->weight;
+		entity->ioprio_class = pn->ioprio_class;
+		entity->new_ioprio_class = pn->ioprio_class;
+	} else {
+		entity->weight = iocg->weight;
+		entity->new_weight = iocg->weight;
+		entity->ioprio_class = iocg->ioprio_class;
+		entity->new_ioprio_class = iocg->ioprio_class;
+	}
+	spin_unlock_irqrestore(&iocg->lock, flags);
 
-	entity->weight = entity->new_weight = iocg->weight;
-	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
 	entity->ioprio_changed = 1;
 	entity->my_sched_data = &iog->sched_data;
 }
@@ -1114,6 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	struct io_cgroup *iocg;					\
 	struct io_group *iog;						\
 	struct hlist_node *n;						\
+	struct io_policy_node *pn;					\
 									\
 	if (val < (__MIN) || val > (__MAX))				\
 		return -EINVAL;						\
@@ -1126,6 +1149,9 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	spin_lock_irq(&iocg->lock);					\
 	iocg->__VAR = (unsigned long)val;				\
 	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		pn = policy_search_node(iocg, iog->dev);		\
+		if (pn)							\
+			continue;					\
 		iog->entity.new_##__VAR = (unsigned long)val;		\
 		smp_wmb();						\
 		iog->entity.ioprio_changed = 1;				\
@@ -1237,7 +1263,7 @@ static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
  * to the root has already an allocated group on @bfqd.
  */
 struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
-					struct cgroup *cgroup)
+				      struct cgroup *cgroup, struct bio *bio)
 {
 	struct io_cgroup *iocg;
 	struct io_group *iog, *leaf = NULL, *prev = NULL;
@@ -1263,12 +1289,17 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
 		atomic_set(&iog->ref, 0);
 		iog->deleting = 0;
 
-		io_group_init_entity(iocg, iog);
-		iog->my_entity = &iog->entity;
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 		iog->iocg_id = css_id(&iocg->css);
+		if (bio) {
+			struct gendisk *disk = bio->bi_bdev->bd_disk;
+			iog->dev = MKDEV(disk->major, disk->first_minor);
+		}
 #endif
 
+		io_group_init_entity(iocg, iog, iog->dev);
+		iog->my_entity = &iog->entity;
+
 		blk_init_request_list(&iog->rl);
 
 		if (leaf == NULL) {
@@ -1379,7 +1410,7 @@ void io_group_chain_link(struct request_queue *q, void *key,
  */
 struct io_group *io_find_alloc_group(struct request_queue *q,
 			struct cgroup *cgroup, struct elv_fq_data *efqd,
-			int create)
+			     int create, struct bio *bio)
 {
 	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
 	struct io_group *iog = NULL;
@@ -1390,7 +1421,7 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
 	if (iog != NULL || !create)
 		return iog;
 
-	iog = io_group_chain_alloc(q, key, cgroup);
+	iog = io_group_chain_alloc(q, key, cgroup, bio);
 	if (iog != NULL)
 		io_group_chain_link(q, key, cgroup, iog, efqd);
 
@@ -1489,7 +1520,7 @@ struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
 		goto out;
 	}
 
-	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	iog = io_find_alloc_group(q, cgroup, efqd, create, bio);
 	if (!iog) {
 		if (create)
 			iog = efqd->root_group;
@@ -1549,8 +1580,209 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
 	return iog;
 }
 
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+				  struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *pn;
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+
+	if (list_empty(&iocg->policy_list))
+		goto out;
+
+	seq_printf(m, "dev weight class\n");
+
+	spin_lock_irq(&iocg->lock);
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		seq_printf(m, "%s %lu %lu\n", pn->dev_name,
+			   pn->weight, pn->ioprio_class);
+	}
+	spin_unlock_irq(&iocg->lock);
+out:
+	return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+					  struct io_policy_node *pn)
+{
+	list_add(&pn->node, &iocg->policy_list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct io_policy_node *pn)
+{
+	list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+						 dev_t dev)
+{
+	struct io_policy_node *pn;
+
+	if (list_empty(&iocg->policy_list))
+		return NULL;
+
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		if (pn->dev == dev)
+			return pn;
+	}
+
+	return NULL;
+}
+
+static int devname_to_devnum(const char *buf, dev_t *dev)
+{
+	struct block_device *bdev;
+	struct gendisk *disk;
+	int part;
+
+	bdev = lookup_bdev(buf);
+	if (IS_ERR(bdev))
+		return -ENODEV;
+
+	disk = get_gendisk(bdev->bd_dev, &part);
+	*dev = MKDEV(disk->major, disk->first_minor);
+	bdput(bdev);
+
+	return 0;
+}
+
+static int policy_parse_and_set(char *buf, struct io_policy_node *newpn)
+{
+	char *s[3], *p;
+	int ret;
+	int i = 0;
+
+	memset(s, 0, sizeof(s));
+	while ((p = strsep(&buf, ":")) != NULL) {
+		if (!*p)
+			continue;
+		s[i++] = p;
+	}
+
+	ret = devname_to_devnum(s[0], &newpn->dev);
+	if (ret)
+		return ret;
+
+	strcpy(newpn->dev_name, s[0]);
+
+	if (s[1] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[1], 10, &newpn->weight);
+	if (ret || newpn->weight > WEIGHT_MAX)
+		return -EINVAL;
+
+	if (s[2] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
+	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
+	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+			    const char *buffer)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *newpn, *pn;
+	char *buf;
+	int ret = 0;
+	int keep_newpn = 0;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	buf = kstrdup(buffer, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+	if (!newpn) {
+		ret = -ENOMEM;
+		goto free_buf;
+	}
+
+	ret = policy_parse_and_set(buf, newpn);
+	if (ret)
+		goto free_newpn;
+
+	if (!cgroup_lock_live_group(cgrp)) {
+		ret = -ENODEV;
+		goto free_newpn;
+	}
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+	spin_lock_irq(&iocg->lock);
+
+	pn = policy_search_node(iocg, newpn->dev);
+	if (!pn) {
+		if (newpn->weight != 0) {
+			policy_insert_node(iocg, newpn);
+			keep_newpn = 1;
+		}
+		goto update_io_group;
+	}
+
+	if (newpn->weight == 0) {
+		/* weight == 0 means deleting a policy */
+		policy_delete_node(pn);
+		goto update_io_group;
+	}
+
+	pn->weight = newpn->weight;
+	pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+		if (iog->dev == newpn->dev) {
+			if (newpn->weight) {
+				iog->entity.new_weight = newpn->weight;
+				iog->entity.new_ioprio_class =
+					newpn->ioprio_class;
+				/*
+				 * iog weight and ioprio_class updating
+				 * actually happens if ioprio_changed is set.
+				 * So ensure ioprio_changed is not set until
+				 * new weight and new ioprio_class are updated.
+				 */
+				smp_wmb();
+				iog->entity.ioprio_changed = 1;
+			} else {
+				iog->entity.new_weight = iocg->weight;
+				iog->entity.new_ioprio_class =
+					iocg->ioprio_class;
+
+				/* The same as above */
+				smp_wmb();
+				iog->entity.ioprio_changed = 1;
+			}
+		}
+	}
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+free_newpn:
+	if (!keep_newpn)
+		kfree(newpn);
+free_buf:
+	kfree(buf);
+	return ret;
+}
+
 struct cftype bfqio_files[] = {
 	{
+		.name = "policy",
+		.read_seq_string = io_cgroup_policy_read,
+		.write_string = io_cgroup_policy_write,
+		.max_write_len = 256,
+	},
+	{
 		.name = "weight",
 		.read_u64 = io_cgroup_weight_read,
 		.write_u64 = io_cgroup_weight_write,
@@ -1592,6 +1824,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
 	INIT_HLIST_HEAD(&iocg->group_data);
 	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
 	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+	INIT_LIST_HEAD(&iocg->policy_list);
 
 	return &iocg->css;
 }
@@ -1750,6 +1983,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 	unsigned long flags, flags1;
 	int queue_lock_held = 0;
 	struct elv_fq_data *efqd;
+	struct io_policy_node *pn, *pntmp;
 
 	/*
 	 * io groups are linked in two lists. One list is maintained
@@ -1823,6 +2057,12 @@ locked:
 	BUG_ON(!hlist_empty(&iocg->group_data));
 
 	free_css_id(&io_subsys, &iocg->css);
+
+	list_for_each_entry_safe(pn, pntmp, &iocg->policy_list, node) {
+		policy_delete_node(pn);
+		kfree(pn);
+	}
+
 	kfree(iocg);
 }
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index db3a347..b1d97e6 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -250,9 +250,18 @@ struct io_group {
 
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 	unsigned short iocg_id;
+	dev_t dev;
 #endif
 };
 
+struct io_policy_node {
+	struct list_head node;
+	char dev_name[32];
+	dev_t dev;
+	unsigned long weight;
+	unsigned long ioprio_class;
+};
+
 /**
  * struct bfqio_cgroup - bfq cgroup data structure.
  * @css: subsystem state for bfq in the containing cgroup.
@@ -269,6 +278,9 @@ struct io_cgroup {
 
 	unsigned long weight, ioprio_class;
 
+	/* list of io_policy_node */
+	struct list_head policy_list;
+
 	spinlock_t lock;
 	struct hlist_head group_data;
 };
-- 
1.5.4.rc3
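
[Not part of the patch above -- editorial sketch] The smp_wmb() before setting
ioprio_changed in io_cgroup_policy_write() only orders the writes; it assumes
the consumer reads ioprio_changed first and new_weight/new_ioprio_class
afterwards. Roughly, the read side would pair with it like this (the function
name and the explicit smp_rmb() placement are assumptions, not code from this
series):

	static void io_entity_update_prio(struct io_entity *entity)
	{
		if (!entity->ioprio_changed)
			return;

		/* pairs with the smp_wmb() done on the policy-update side */
		smp_rmb();
		entity->weight = entity->new_weight;
		entity->ioprio_class = entity->new_ioprio_class;
		entity->ioprio_changed = 0;
	}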




^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH 07/18] io-controller: Export disk time used and nr sectors dipatched through cgroups
       [not found]       ` <20090513145127.GB7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-14  7:53         ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-14  7:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
> On Wed, May 13, 2009 at 10:39:07AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>> ...
>>>  
>>> +/*
>>> + * traverse through all the io_groups associated with this cgroup and calculate
>>> + * the aggr disk time received by all the groups on respective disks.
>>> + */
>>> +static u64 calculate_aggr_disk_time(struct io_cgroup *iocg)
>>> +{
>>> +	struct io_group *iog;
>>> +	struct hlist_node *n;
>>> +	u64 disk_time = 0;
>>> +
>>> +	rcu_read_lock();
>>   This function is in the slow path, so there is no need to call rcu_read_lock();
>>   we just need to ensure that the caller already holds the iocg->lock.
>>
> 
> Or can we get rid of the requirement for iocg_lock here and just read the io
> group data under the rcu read lock? Actually I am wondering why we require
> an iocg_lock here at all. We are not modifying the rcu protected list. We are
> just traversing it and reading the data.

  Yes, I think removing the iocg->lock from the caller (io_cgroup_disk_time_read())
  is a better choice.
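
  A minimal sketch of that caller without the lock (illustrative; the
  read_u64-style signature is modeled on io_cgroup_disk_sectors_read() from
  the patch and is an assumption here):

	static u64 io_cgroup_disk_time_read(struct cgroup *cgroup,
						struct cftype *cftype)
	{
		struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);

		/*
		 * No spin_lock_irq(&iocg->lock) taken: calculate_aggr_disk_time()
		 * only reads the RCU-protected group_data list and already takes
		 * rcu_read_lock() around the walk.
		 */
		return calculate_aggr_disk_time(iocg);
	}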

> 
> Thanks
> Vivek
> 
>>> +	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
>>> +		/*
>>> +		 * There might be groups which are not functional and
>>> +		 * waiting to be reclaimed upon cgoup deletion.
>>> +		 */
>>> +		if (rcu_dereference(iog->key))
>>> +			disk_time += iog->entity.total_service;
>>> +	}
>>> +	rcu_read_unlock();
>>> +
>>> +	return disk_time;
>>> +}
>>> +
>> -- 
>> Regards
>> Gui Jianfeng
> 
> 
> 

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
       [not found]                                 ` <20090508215618.GJ7293-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-09  9:22                                   ` Peter Zijlstra
@ 2009-05-14 10:31                                   ` Andrea Righi
  2009-05-14 16:43                                     ` Dhaval Giani
  2 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-14 10:31 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton

On Fri, May 08, 2009 at 05:56:18PM -0400, Vivek Goyal wrote:
> On Fri, May 08, 2009 at 10:05:01PM +0200, Andrea Righi wrote:
> 
> [..]
> > > Conclusion
> > > ==========
> > > It just reaffirms that with max BW control, we are not doing a fair job
> > > of throttling hence no more hold the IO scheduler properties with-in
> > > cgroup.
> > > 
> > > With proportional BW controller implemented at IO scheduler level, one
> > > can do very tight integration with IO controller and hence retain 
> > > IO scheduler behavior with-in cgroup.
> > 
> > It is worth to bug you I would say :). Results are interesting,
> > definitely. I'll check if it's possible to merge part of the io-throttle
> > max BW control in this controller and who knows if finally we'll be able
> > to converge to a common proposal...
> 
> Great, Few thoughts though.
> 
> - What are your requirements? Do you strictly need max bw control or
>   proportional BW control will satisfy your needs? Or you need both?

The theoretical advantages of max BW control are that it acts on policy
enforcement immediately, mitigating the problem before it happens (a kind
of static partitioning, I would say), and that it probably provides a more
explicit control to contain different classes of users in a hosted
environment (e.g., give BW as a function of how much they pay). And I can
say the io-throttle approach seems to work fine at the moment in a
production environment (http://www.bluehost.com).

Apart from the motivations above, I don't have specific requirements to
provide the max BW control.

But it is also true that the io-controller approach is still in a
development stage and needs more testing. The design concepts definitely
make sense, so maybe the proportional approach alone will be sufficient
to satisfy the requirements of 90% of the users out there.

-Andrea

> 
> - With the current algorithm BFQ (modified WF2Q+), we should be able
>   to do proportional BW division while maintaining the properties of
>   IO scheduler with-in cgroup in hierarchical manner.
>  
>   I think it can be simply enhanced to do max bw control also. That is
>   whenever a queue is selected for dispatch (from fairness point of view)
>   also check the IO rate of that group and if IO rate exceeded, expire
>   the queue immediately and fake as if queue consumed its time slice
>   which will be equivalent to throttling.
> 
>   But in this simple scheme, I think throttling is still unfair with-in
>   the class. What I mean is following.
> 
>   if an RT task and an BE task are in same cgroup and cgroup exceeds its
>   max BW, RT task is next to be dispatched from fairness point of view and it
>   will end being throttled. This is still fine because until RT task is
>   finished, BE task will never get to run in that cgroup, so at some point
>   of time, cgroup rate will come down and RT task will get the IO done
>   meeting fairness and max bw constraints.
> 
>   But this simple scheme does not work with-in same class. Say prio 0
>   and prio 7 BE class readers. Now we will end up throttling the guy who
>   is scheduled to go next and there is no mechanism that prio0 and prio7
>   tasks are throttled in proportionate manner.
> 
>   So, we shall have to come up with something better, I think Dhaval was
>   implementing upper limit for cpu controller. May be PeterZ and Dhaval can
>   give us some pointers how did they manage to implement both proportional
>   and max bw control with the help of a single tree while maintaining the
>   notion of prio with-in cgroup.
> 
> PeterZ/Dhaval  ^^^^^^^^
> 
> - We should be able to get rid of reader-writer issue even with above
>   simple throttling mechanism for schedulers like deadline and AS, because at
>   elevator we see it as a single queue (for both reads and writes) and we
>   will throttle this queue. With-in queue dispatch are taken care by io
>   scheduler. So as long as IO has been queued in the queue, scheduler
>   will take care of giving advantage to readers even if throttling is
>   taking place on the queue.
> 
> Why am I thinking loud? So that we know what are we trying to achieve at the
> end of the day. So at this point of time what are the advantages/disadvantages
> of doing max bw control along with proportional bw control?
> 
> Advantages
> ==========
> - With a combined code base, total code should be less as compared to if
>   both of them are implemented separately. 
> 
> - There can be few advantages in terms of maintaining the notion of IO
>   scheduler with-in cgroup. (like RT tasks always goes first in presence
>   of BE and IDLE task etc. But simple throttling scheme will not take
>   care of fair throttling with-in class. We need a better algorithm to
>   achieve that goal).
> 
> - We probably will get rid of reader writer issue for single queue
>   schedulers like deadline and AS. (Need to run tests and see).
> 
> Disadvantages
> =============
> - Implementation at IO scheduler/elevator layer does not cover higher
>   level logical devices. So one can do max bw control only at leaf nodes
>   where IO scheduler is running and not at intermediate logical nodes.
>    
> I personally think that proportional BW control will meet more people's
>   needs as compared to max bw control.
> 
> So far nobody has come up with a solution where a single proposal covers
> all the cases without breaking things. So personally, I want to make
> things work at least at IO scheduler level and cover as much ground as
> possible without breaking things (hardware RAID, all the direct attached
> devices etc) and then worry about higher level software devices.
> 
> Thoughts?
> 
> Thanks
> Vivek
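
A minimal sketch of the "expire the queue and fake slice consumption" idea
quoted above (illustrative only: struct io_group is from the posted patches,
while the io_queue type name, iog->max_bw, iog_io_rate(), ioq_to_io_group()
and elv_ioq_slice_expired() are assumptions, not code from any posted patch):

	static struct io_queue *
	elv_enforce_max_bw(struct request_queue *q, struct io_queue *ioq)
	{
		struct io_group *iog = ioq_to_io_group(ioq);

		if (iog->max_bw && iog_io_rate(iog) > iog->max_bw) {
			/*
			 * Group is over its max bandwidth: expire the queue as
			 * if it had used its whole time slice, which throttles
			 * the group without touching the fairness logic
			 * inside it.
			 */
			elv_ioq_slice_expired(q, ioq);
			return NULL;
		}

		return ioq;	/* within limits, dispatch as usual */
	}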

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]       ` <4A0BC7AB.8030703-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-14 15:15         ` Vivek Goyal
  2009-05-18 22:33         ` IKEDA, Munehiro
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-14 15:15 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Thu, May 14, 2009 at 03:26:35PM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this 
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and
> "ioprio_class" are used as the default values for this device.
> 
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
> weight=0 means removing the policy for DEV.
> 
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Configure weight=500 ioprio_class=1 on /dev/hda in this cgroup
> # echo /dev/hda:500:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hda 500 1
> /dev/hdb 300 2
> 
> Remove the policy for /dev/hda in this cgroup
> # echo /dev/hda:0:1 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2
> 
> Changelog (v1 -> v2)
> - Rename some structures
> - Use the spin_lock_irqsave() and spin_unlock_irqrestore() versions to prevent
>   enabling the interrupts unconditionally.
> - Fix policy setup bug when switching to another io scheduler.
> - If a policy is available for a specific device, don't update weight and
>   io class when writing "weight" and "ioprio_class".
> - Fix a bug when parsing policy string.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---

Thanks a lot Gui. This patch seems to be working fine for me now. I will
continue to do more testing and let you know if there are more issues. I
will include it in the next posting (V3).

Thanks
Vivek

>  block/elevator-fq.c |  258 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  block/elevator-fq.h |   12 +++
>  2 files changed, 261 insertions(+), 9 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 69435ab..43b30a4 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -12,6 +12,9 @@
>  #include "elevator-fq.h"
>  #include <linux/blktrace_api.h>
>  #include <linux/biotrack.h>
> +#include <linux/seq_file.h>
> +#include <linux/genhd.h>
> +
>  
>  /* Values taken from cfq */
>  const int elv_slice_sync = HZ / 10;
> @@ -1045,12 +1048,31 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
>  }
>  EXPORT_SYMBOL(io_lookup_io_group_current);
>  
> -void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
> +static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
> +						 dev_t dev);
> +
> +void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
> +			  dev_t dev)
>  {
>  	struct io_entity *entity = &iog->entity;
> +	struct io_policy_node *pn;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&iocg->lock, flags);
> +	pn = policy_search_node(iocg, dev);
> +	if (pn) {
> +		entity->weight = pn->weight;
> +		entity->new_weight = pn->weight;
> +		entity->ioprio_class = pn->ioprio_class;
> +		entity->new_ioprio_class = pn->ioprio_class;
> +	} else {
> +		entity->weight = iocg->weight;
> +		entity->new_weight = iocg->weight;
> +		entity->ioprio_class = iocg->ioprio_class;
> +		entity->new_ioprio_class = iocg->ioprio_class;
> +	}
> +	spin_unlock_irqrestore(&iocg->lock, flags);
>  
> -	entity->weight = entity->new_weight = iocg->weight;
> -	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
>  	entity->ioprio_changed = 1;
>  	entity->my_sched_data = &iog->sched_data;
>  }
> @@ -1114,6 +1136,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
>  	struct io_cgroup *iocg;					\
>  	struct io_group *iog;						\
>  	struct hlist_node *n;						\
> +	struct io_policy_node *pn;					\
>  									\
>  	if (val < (__MIN) || val > (__MAX))				\
>  		return -EINVAL;						\
> @@ -1126,6 +1149,9 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
>  	spin_lock_irq(&iocg->lock);					\
>  	iocg->__VAR = (unsigned long)val;				\
>  	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
> +		pn = policy_search_node(iocg, iog->dev);		\
> +		if (pn)							\
> +			continue;					\
>  		iog->entity.new_##__VAR = (unsigned long)val;		\
>  		smp_wmb();						\
>  		iog->entity.ioprio_changed = 1;				\
> @@ -1237,7 +1263,7 @@ static u64 io_cgroup_disk_sectors_read(struct cgroup *cgroup,
>   * to the root has already an allocated group on @bfqd.
>   */
>  struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
> -					struct cgroup *cgroup)
> +				      struct cgroup *cgroup, struct bio *bio)
>  {
>  	struct io_cgroup *iocg;
>  	struct io_group *iog, *leaf = NULL, *prev = NULL;
> @@ -1263,12 +1289,17 @@ struct io_group *io_group_chain_alloc(struct request_queue *q, void *key,
>  		atomic_set(&iog->ref, 0);
>  		iog->deleting = 0;
>  
> -		io_group_init_entity(iocg, iog);
> -		iog->my_entity = &iog->entity;
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>  		iog->iocg_id = css_id(&iocg->css);
> +		if (bio) {
> +			struct gendisk *disk = bio->bi_bdev->bd_disk;
> +			iog->dev = MKDEV(disk->major, disk->first_minor);
> +		}
>  #endif
>  
> +		io_group_init_entity(iocg, iog, iog->dev);
> +		iog->my_entity = &iog->entity;
> +
>  		blk_init_request_list(&iog->rl);
>  
>  		if (leaf == NULL) {
> @@ -1379,7 +1410,7 @@ void io_group_chain_link(struct request_queue *q, void *key,
>   */
>  struct io_group *io_find_alloc_group(struct request_queue *q,
>  			struct cgroup *cgroup, struct elv_fq_data *efqd,
> -			int create)
> +			     int create, struct bio *bio)
>  {
>  	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
>  	struct io_group *iog = NULL;
> @@ -1390,7 +1421,7 @@ struct io_group *io_find_alloc_group(struct request_queue *q,
>  	if (iog != NULL || !create)
>  		return iog;
>  
> -	iog = io_group_chain_alloc(q, key, cgroup);
> +	iog = io_group_chain_alloc(q, key, cgroup, bio);
>  	if (iog != NULL)
>  		io_group_chain_link(q, key, cgroup, iog, efqd);
>  
> @@ -1489,7 +1520,7 @@ struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
>  		goto out;
>  	}
>  
> -	iog = io_find_alloc_group(q, cgroup, efqd, create);
> +	iog = io_find_alloc_group(q, cgroup, efqd, create, bio);
>  	if (!iog) {
>  		if (create)
>  			iog = efqd->root_group;
> @@ -1549,8 +1580,209 @@ struct io_group *io_alloc_root_group(struct request_queue *q,
>  	return iog;
>  }
>  
> +static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
> +				  struct seq_file *m)
> +{
> +	struct io_cgroup *iocg;
> +	struct io_policy_node *pn;
> +
> +	iocg = cgroup_to_io_cgroup(cgrp);
> +
> +	if (list_empty(&iocg->policy_list))
> +		goto out;
> +
> +	seq_printf(m, "dev weight class\n");
> +
> +	spin_lock_irq(&iocg->lock);
> +	list_for_each_entry(pn, &iocg->policy_list, node) {
> +		seq_printf(m, "%s %lu %lu\n", pn->dev_name,
> +			   pn->weight, pn->ioprio_class);
> +	}
> +	spin_unlock_irq(&iocg->lock);
> +out:
> +	return 0;
> +}
> +
> +static inline void policy_insert_node(struct io_cgroup *iocg,
> +					  struct io_policy_node *pn)
> +{
> +	list_add(&pn->node, &iocg->policy_list);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static inline void policy_delete_node(struct io_policy_node *pn)
> +{
> +	list_del(&pn->node);
> +}
> +
> +/* Must be called with iocg->lock held */
> +static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
> +						 dev_t dev)
> +{
> +	struct io_policy_node *pn;
> +
> +	if (list_empty(&iocg->policy_list))
> +		return NULL;
> +
> +	list_for_each_entry(pn, &iocg->policy_list, node) {
> +		if (pn->dev == dev)
> +			return pn;
> +	}
> +
> +	return NULL;
> +}
> +
> +static int devname_to_devnum(const char *buf, dev_t *dev)
> +{
> +	struct block_device *bdev;
> +	struct gendisk *disk;
> +	int part;
> +
> +	bdev = lookup_bdev(buf);
> +	if (IS_ERR(bdev))
> +		return -ENODEV;
> +
> +	disk = get_gendisk(bdev->bd_dev, &part);
> +	*dev = MKDEV(disk->major, disk->first_minor);
> +	bdput(bdev);
> +
> +	return 0;
> +}
> +
> +static int policy_parse_and_set(char *buf, struct io_policy_node *newpn)
> +{
> +	char *s[3], *p;
> +	int ret;
> +	int i = 0;
> +
> +	memset(s, 0, sizeof(s));
> +	while ((p = strsep(&buf, ":")) != NULL) {
> +		if (!*p)
> +			continue;
> +		s[i++] = p;
> +	}
> +
> +	ret = devname_to_devnum(s[0], &newpn->dev);
> +	if (ret)
> +		return ret;
> +
> +	strcpy(newpn->dev_name, s[0]);
> +
> +	if (s[1] == NULL)
> +		return -EINVAL;
> +
> +	ret = strict_strtoul(s[1], 10, &newpn->weight);
> +	if (ret || newpn->weight > WEIGHT_MAX)
> +		return -EINVAL;
> +
> +	if (s[2] == NULL)
> +		return -EINVAL;
> +
> +	ret = strict_strtoul(s[2], 10, &newpn->ioprio_class);
> +	if (ret || newpn->ioprio_class < IOPRIO_CLASS_RT ||
> +	    newpn->ioprio_class > IOPRIO_CLASS_IDLE)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
> +			    const char *buffer)
> +{
> +	struct io_cgroup *iocg;
> +	struct io_policy_node *newpn, *pn;
> +	char *buf;
> +	int ret = 0;
> +	int keep_newpn = 0;
> +	struct hlist_node *n;
> +	struct io_group *iog;
> +
> +	buf = kstrdup(buffer, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
> +	if (!newpn) {
> +		ret = -ENOMEM;
> +		goto free_buf;
> +	}
> +
> +	ret = policy_parse_and_set(buf, newpn);
> +	if (ret)
> +		goto free_newpn;
> +
> +	if (!cgroup_lock_live_group(cgrp)) {
> +		ret = -ENODEV;
> +		goto free_newpn;
> +	}
> +
> +	iocg = cgroup_to_io_cgroup(cgrp);
> +	spin_lock_irq(&iocg->lock);
> +
> +	pn = policy_search_node(iocg, newpn->dev);
> +	if (!pn) {
> +		if (newpn->weight != 0) {
> +			policy_insert_node(iocg, newpn);
> +			keep_newpn = 1;
> +		}
> +		goto update_io_group;
> +	}
> +
> +	if (newpn->weight == 0) {
> +		/* weight == 0 means deleting a policy */
> +		policy_delete_node(pn);
> +		goto update_io_group;
> +	}
> +
> +	pn->weight = newpn->weight;
> +	pn->ioprio_class = newpn->ioprio_class;
> +
> +update_io_group:
> +	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
> +		if (iog->dev == newpn->dev) {
> +			if (newpn->weight) {
> +				iog->entity.new_weight = newpn->weight;
> +				iog->entity.new_ioprio_class =
> +					newpn->ioprio_class;
> +				/*
> +				 * iog weight and ioprio_class updating
> +				 * actually happens if ioprio_changed is set.
> +				 * So ensure ioprio_changed is not set until
> +				 * new weight and new ioprio_class are updated.
> +				 */
> +				smp_wmb();
> +				iog->entity.ioprio_changed = 1;
> +			} else {
> +				iog->entity.new_weight = iocg->weight;
> +				iog->entity.new_ioprio_class =
> +					iocg->ioprio_class;
> +
> +				/* The same as above */
> +				smp_wmb();
> +				iog->entity.ioprio_changed = 1;
> +			}
> +		}
> +	}
> +	spin_unlock_irq(&iocg->lock);
> +
> +	cgroup_unlock();
> +
> +free_newpn:
> +	if (!keep_newpn)
> +		kfree(newpn);
> +free_buf:
> +	kfree(buf);
> +	return ret;
> +}
> +
>  struct cftype bfqio_files[] = {
>  	{
> +		.name = "policy",
> +		.read_seq_string = io_cgroup_policy_read,
> +		.write_string = io_cgroup_policy_write,
> +		.max_write_len = 256,
> +	},
> +	{
>  		.name = "weight",
>  		.read_u64 = io_cgroup_weight_read,
>  		.write_u64 = io_cgroup_weight_write,
> @@ -1592,6 +1824,7 @@ struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>  	INIT_HLIST_HEAD(&iocg->group_data);
>  	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>  	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
> +	INIT_LIST_HEAD(&iocg->policy_list);
>  
>  	return &iocg->css;
>  }
> @@ -1750,6 +1983,7 @@ void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>  	unsigned long flags, flags1;
>  	int queue_lock_held = 0;
>  	struct elv_fq_data *efqd;
> +	struct io_policy_node *pn, *pntmp;
>  
>  	/*
>  	 * io groups are linked in two lists. One list is maintained
> @@ -1823,6 +2057,12 @@ locked:
>  	BUG_ON(!hlist_empty(&iocg->group_data));
>  
>  	free_css_id(&io_subsys, &iocg->css);
> +
> +	list_for_each_entry_safe(pn, pntmp, &iocg->policy_list, node) {
> +		policy_delete_node(pn);
> +		kfree(pn);
> +	}
> +
>  	kfree(iocg);
>  }
>  
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index db3a347..b1d97e6 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -250,9 +250,18 @@ struct io_group {
>  
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>  	unsigned short iocg_id;
> +	dev_t dev;
>  #endif
>  };
>  
> +struct io_policy_node {
> +	struct list_head node;
> +	char dev_name[32];
> +	dev_t dev;
> +	unsigned long weight;
> +	unsigned long ioprio_class;
> +};
> +
>  /**
>   * struct bfqio_cgroup - bfq cgroup data structure.
>   * @css: subsystem state for bfq in the containing cgroup.
> @@ -269,6 +278,9 @@ struct io_cgroup {
>  
>  	unsigned long weight, ioprio_class;
>  
> +	/* list of io_policy_node */
> +	struct list_head policy_list;
> +
>  	spinlock_t lock;
>  	struct hlist_head group_data;
>  };
> -- 
> 1.5.4.rc3
> 
> 

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: IO scheduler based IO Controller V2
  2009-05-08 21:56                                 ` Vivek Goyal
@ 2009-05-14 16:43                                     ` Dhaval Giani
  -1 siblings, 0 replies; 297+ messages in thread
From: Dhaval Giani @ 2009-05-14 16:43 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	Bharata B Rao, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Andrea Righi

On Fri, May 08, 2009 at 05:56:18PM -0400, Vivek Goyal wrote:

>   So, we shall have to come up with something better, I think Dhaval was
>   implementing upper limit for cpu controller. May be PeterZ and Dhaval can
>   give us some pointers how did they manage to implement both proportional
>   and max bw control with the help of a single tree while maintaining the
>   notion of prio with-in cgroup.
> 
> PeterZ/Dhaval  ^^^^^^^^
> 

We still haven't :). I think the idea is to keep fairness (or
proportion) between the groups that are currently running. The throttled
groups should not be considered.

thanks,
-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-11 15:41           ` Vivek Goyal
@ 2009-05-15  5:15                 ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-15  5:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
...
>  }
> @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
>  /*
>   * Find the io group bio belongs to.
>   * If "create" is set, io group is created if it is not already present.
> + * If "curr" is set, io group information is searched for the current
> + * task and not with the help of bio.
> + *
> + * FIXME: Can we assume that if bio is NULL then lookup group for current
> + * task and not create extra function parameter ?
>   *
> - * Note: There is a narrow window of race where a group is being freed
> - * by cgroup deletion path and some rq has slipped through in this group.
> - * Fix it.
>   */
> -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> -					int create)
> +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> +					int create, int curr)

  Hi Vivek,

  IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
  get iog from bio, otherwise get it from current task.

>  {
>  	struct cgroup *cgroup;
>  	struct io_group *iog;
>  	struct elv_fq_data *efqd = &q->elevator->efqd;
>  
>  	rcu_read_lock();
> -	cgroup = get_cgroup_from_bio(bio);
> +
> +	if (curr)
> +		cgroup = task_cgroup(current, io_subsys_id);
> +	else
> +		cgroup = get_cgroup_from_bio(bio);
> +
>  	if (!cgroup) {
>  		if (create)
>  			iog = efqd->root_group;
> @@ -1500,7 +1507,7 @@ out:
>  	rcu_read_unlock();
>  	return iog;
>  }

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-11 15:41           ` Vivek Goyal
@ 2009-05-15  7:40                 ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-15  7:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
...
> Ok, here is the patch which gets rid of rq->iog and rq->rl fields. Good to
> see some code and data structures trimming. It seems to be working fine for me.
> 
> 
> o Get rid of the rq->iog and rq->rl fields. The request descriptor stores
>   a pointer to the queue it belongs to (rq->ioq), and from the io queue one
>   can determine the group the queue belongs to, hence the group the request
>   belongs to. Thanks to Nauman for the idea.
> 
> o There are a couple of places where rq->ioq information is not present yet
>   as the request and queue are being set up. In those places "bio" is passed
>   around as a function argument to determine the group the rq will go into. I
>   did not pass "iog" as a function argument because when memory is scarce,
>   we can release the queue lock and sleep to wait for memory to become
>   available, and once we wake up, it is possible that the io group is gone.
>   Passing bio around helps because one simply remaps the bio to the right
>   group after waking up.
> 
> o Got rid of io_lookup_io_group_current() function and merged it with
>   io_get_io_group() to also take care of looking for group using current
>   task info and not from bio.
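
A rough illustration of the indirection described above (a sketch only: the
accessors below are assumed for illustration and are not taken verbatim from
the posted code; only rq->ioq and iog->rl appear in the patch as quoted):

	/* request -> io queue -> io group, instead of a cached rq->iog */
	static inline struct io_group *rq_to_iog(struct request *rq)
	{
		return ioq_to_io_group(rq->ioq);	/* assumed accessor name */
	}

	/* the request list lives in the group (iog->rl), so rq->rl goes away too */
	static inline struct request_list *rq_to_rl(struct request *rq)
	{
		return &rq_to_iog(rq)->rl;
	}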

Hi Vivek,

This patch gets rid of "curr" from io_get_io_group, and seems to be 
working fine for me.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 block/cfq-iosched.c |   12 ++++++------
 block/elevator-fq.c |   16 ++++++++--------
 block/elevator-fq.h |    2 +-
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c0bb8db..bf87843 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -196,7 +196,7 @@ static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
 		 * async bio tracking is enabled and we are not caching
 		 * async queue pointer in cic.
 		 */
-		iog = io_get_io_group(cfqd->queue, bio, 0, 0);
+		iog = io_get_io_group(cfqd->queue, bio, 0);
 		if (!iog) {
 			/*
 			 * May be this is first rq/bio and io group has not
@@ -1294,7 +1294,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
-	iog = io_get_io_group(q, NULL, 0, 1);
+	iog = io_get_io_group(q, NULL, 0);
 
 	if (async_cfqq != NULL) {
 		__iog = cfqq_to_io_group(async_cfqq);
@@ -1346,9 +1346,9 @@ retry:
 	 * back.
 	 */
 	if (bio)
-		iog = io_get_io_group(q, bio, 1, 0);
+		iog = io_get_io_group(q, bio, 1);
 	else
-		iog = io_get_io_group(q, NULL, 1, 1);
+		iog = io_get_io_group(q, NULL, 1);
 
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
@@ -1469,9 +1469,9 @@ cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
 	struct io_group *iog = NULL;
 
 	if (bio)
-		iog = io_get_io_group(cfqd->queue, bio, 1, 0);
+		iog = io_get_io_group(cfqd->queue, bio, 1);
 	else
-		iog = io_get_io_group(cfqd->queue, NULL, 1, 1);
+		iog = io_get_io_group(cfqd->queue, NULL, 1);
 
 	if (!is_sync) {
 		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9b7319e..951c163 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1006,7 +1006,7 @@ struct request_list *io_group_get_request_list(struct request_queue *q,
 {
 	struct io_group *iog;
 
-	iog = io_get_io_group(q, bio, 1, 0);
+	iog = io_get_io_group(q, bio, 1);
 	BUG_ON(!iog);
 	return &iog->rl;
 }
@@ -1470,7 +1470,7 @@ struct io_cgroup *get_iocg_from_bio(struct bio *bio)
  *
  */
 struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
-					int create, int curr)
+					int create)
 {
 	struct cgroup *cgroup;
 	struct io_group *iog;
@@ -1478,7 +1478,7 @@ struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
 
 	rcu_read_lock();
 
-	if (curr)
+	if (!bio)
 		cgroup = task_cgroup(current, io_subsys_id);
 	else
 		cgroup = get_cgroup_from_bio(bio);
@@ -1959,7 +1959,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
 		return 1;
 
 	/* Determine the io group of the bio submitting task */
-	iog = io_get_io_group(q, bio, 0, 0);
+	iog = io_get_io_group(q, bio, 0);
 	if (!iog) {
 		/* May be task belongs to a differet cgroup for which io
 		 * group has not been setup yet. */
@@ -2000,9 +2000,9 @@ int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
 retry:
 	/* Determine the io group request belongs to */
 	if (bio)
-		iog = io_get_io_group(q, bio, 1, 0);
+		iog = io_get_io_group(q, bio, 1);
 	else
-		iog = io_get_io_group(q, bio, 1, 1);
+		iog = io_get_io_group(q, NULL, 1);
 
 	BUG_ON(!iog);
 
@@ -2098,7 +2098,7 @@ struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 	struct io_group *iog;
 
 	/* lookup the io group and io queue of the bio submitting task */
-	iog = io_get_io_group(q, bio, 0, 0);
+	iog = io_get_io_group(q, bio, 0);
 	if (!iog) {
 		/* May be bio belongs to a cgroup for which io group has
 		 * not been setup yet. */
@@ -2159,7 +2159,7 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
 EXPORT_SYMBOL(io_lookup_io_group_current);
 
 struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
-					int create, int curr)
+					int create)
 {
 	return q->elevator->efqd.root_group;
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 8d190ab..d8d8f61 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -657,7 +657,7 @@ extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 					int ioprio, struct io_queue *ioq);
 extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
 extern struct io_group *io_get_io_group(struct request_queue *q,
-				struct bio *bio, int create, int curr);
+				struct bio *bio, int create);
 extern int elv_nr_busy_ioq(struct elevator_queue *e);
 extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
 extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
-- 
1.5.4.rc3

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
@ 2009-05-15  7:40                 ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-15  7:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Nauman Rafique, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
...
> Ok, here is the patch which gets rid of rq->iog and rq->rl fields. Good to
> see some code and data structures trimming. It seems to be working fine for me.
> 
> 
> o Get rid of rq->iog field and rq->rl fields. request descriptor stores
>   the pointer the the queue it belongs to (rq->ioq) and from the io queue one
>   can determine the group queue belongs to hence request belongs to. Thanks
>   to Nauman for the idea.
> 
> o There are couple of places where rq->ioq information is not present yet
>   as request and queue are being setup. In those places "bio" is passed 
>   around as function argument to determine the group rq will go into. I
>   did not pass "iog" as function argument becuase when memory is scarce,
>   we can release queue lock and sleep to wait for memory to become available
>   and once we wake up, it is possible that io group is gone. Passing bio
>   around helps that one shall have to remap bio to right group after waking
>   up. 
> 
> o Got rid of io_lookup_io_group_current() function and merged it with
>   io_get_io_group() to also take care of looking for group using current
>   task info and not from bio.

Hi Vivek,

This patch gets rid of "curr" from io_get_io_group, and seems to be 
working fine for me.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/cfq-iosched.c |   12 ++++++------
 block/elevator-fq.c |   16 ++++++++--------
 block/elevator-fq.h |    2 +-
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c0bb8db..bf87843 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -196,7 +196,7 @@ static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
 		 * async bio tracking is enabled and we are not caching
 		 * async queue pointer in cic.
 		 */
-		iog = io_get_io_group(cfqd->queue, bio, 0, 0);
+		iog = io_get_io_group(cfqd->queue, bio, 0);
 		if (!iog) {
 			/*
 			 * May be this is first rq/bio and io group has not
@@ -1294,7 +1294,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
-	iog = io_get_io_group(q, NULL, 0, 1);
+	iog = io_get_io_group(q, NULL, 0);
 
 	if (async_cfqq != NULL) {
 		__iog = cfqq_to_io_group(async_cfqq);
@@ -1346,9 +1346,9 @@ retry:
 	 * back.
 	 */
 	if (bio)
-		iog = io_get_io_group(q, bio, 1, 0);
+		iog = io_get_io_group(q, bio, 1);
 	else
-		iog = io_get_io_group(q, NULL, 1, 1);
+		iog = io_get_io_group(q, NULL, 1);
 
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
@@ -1469,9 +1469,9 @@ cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
 	struct io_group *iog = NULL;
 
 	if (bio)
-		iog = io_get_io_group(cfqd->queue, bio, 1, 0);
+		iog = io_get_io_group(cfqd->queue, bio, 1);
 	else
-		iog = io_get_io_group(cfqd->queue, NULL, 1, 1);
+		iog = io_get_io_group(cfqd->queue, NULL, 1);
 
 	if (!is_sync) {
 		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 9b7319e..951c163 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1006,7 +1006,7 @@ struct request_list *io_group_get_request_list(struct request_queue *q,
 {
 	struct io_group *iog;
 
-	iog = io_get_io_group(q, bio, 1, 0);
+	iog = io_get_io_group(q, bio, 1);
 	BUG_ON(!iog);
 	return &iog->rl;
 }
@@ -1470,7 +1470,7 @@ struct io_cgroup *get_iocg_from_bio(struct bio *bio)
  *
  */
 struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
-					int create, int curr)
+					int create)
 {
 	struct cgroup *cgroup;
 	struct io_group *iog;
@@ -1478,7 +1478,7 @@ struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
 
 	rcu_read_lock();
 
-	if (curr)
+	if (!bio)
 		cgroup = task_cgroup(current, io_subsys_id);
 	else
 		cgroup = get_cgroup_from_bio(bio);
@@ -1959,7 +1959,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
 		return 1;
 
 	/* Determine the io group of the bio submitting task */
-	iog = io_get_io_group(q, bio, 0, 0);
+	iog = io_get_io_group(q, bio, 0);
 	if (!iog) {
 		/* May be task belongs to a differet cgroup for which io
 		 * group has not been setup yet. */
@@ -2000,9 +2000,9 @@ int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
 retry:
 	/* Determine the io group request belongs to */
 	if (bio)
-		iog = io_get_io_group(q, bio, 1, 0);
+		iog = io_get_io_group(q, bio, 1);
 	else
-		iog = io_get_io_group(q, bio, 1, 1);
+		iog = io_get_io_group(q, NULL, 1);
 
 	BUG_ON(!iog);
 
@@ -2098,7 +2098,7 @@ struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 	struct io_group *iog;
 
 	/* lookup the io group and io queue of the bio submitting task */
-	iog = io_get_io_group(q, bio, 0, 0);
+	iog = io_get_io_group(q, bio, 0);
 	if (!iog) {
 		/* May be bio belongs to a cgroup for which io group has
 		 * not been setup yet. */
@@ -2159,7 +2159,7 @@ struct io_group *io_lookup_io_group_current(struct request_queue *q)
 EXPORT_SYMBOL(io_lookup_io_group_current);
 
 struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
-					int create, int curr)
+					int create)
 {
 	return q->elevator->efqd.root_group;
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 8d190ab..d8d8f61 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -657,7 +657,7 @@ extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 					int ioprio, struct io_queue *ioq);
 extern struct io_group *io_lookup_io_group_current(struct request_queue *q);
 extern struct io_group *io_get_io_group(struct request_queue *q,
-				struct bio *bio, int create, int curr);
+				struct bio *bio, int create);
 extern int elv_nr_busy_ioq(struct elevator_queue *e);
 extern int elv_nr_busy_rt_ioq(struct elevator_queue *e);
 extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
-- 
1.5.4.rc3


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]                 ` <4A0CFA6C.3080609-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-15  7:48                   ` Andrea Righi
  0 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-15  7:48 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> >  }
> > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> >  /*
> >   * Find the io group bio belongs to.
> >   * If "create" is set, io group is created if it is not already present.
> > + * If "curr" is set, io group is information is searched for current
> > + * task and not with the help of bio.
> > + *
> > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > + * task and not create extra function parameter ?
> >   *
> > - * Note: There is a narrow window of race where a group is being freed
> > - * by cgroup deletion path and some rq has slipped through in this group.
> > - * Fix it.
> >   */
> > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > -					int create)
> > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > +					int create, int curr)
> 
>   Hi Vivek,
> 
>   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
>   get iog from bio, otherwise get it from current task.

Consider also that get_cgroup_from_bio() is much slower than
task_cgroup(): it needs to lock/unlock_page_cgroup() in
get_blkio_cgroup_id(), while task_cgroup() is rcu protected.

BTW, another optimization could be to use the blkio-cgroup functionality
only for dirty pages and cut out some blkio_set_owner() calls. In all the
other cases IO always occurs in the same context as the current task,
and you can use task_cgroup().

However, this is true only for page cache pages; for IO generated by
anonymous pages (swap) you still need the page tracking functionality
for both reads and writes.

-Andrea
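
A sketch of the split being discussed here (illustrative only; it just combines
helpers quoted elsewhere in this thread, and, as Vivek notes in his reply below,
the posted series already branches this way for sync vs. async IO):

	struct cgroup *cgroup;

	if (elv_bio_sync(bio)) {
		/*
		 * Sync IO is submitted from the owning task's context, so the
		 * cheap, RCU-protected task_cgroup() lookup is enough.
		 */
		cgroup = task_cgroup(current, io_subsys_id);
	} else {
		/*
		 * Async IO (e.g. writeback of dirty pages): fall back to the
		 * blkio-cgroup page tracking, which goes through
		 * lock_page_cgroup()/unlock_page_cgroup() internally.
		 */
		cgroup = get_cgroup_from_bio(bio);
	}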

> 
> >  {
> >  	struct cgroup *cgroup;
> >  	struct io_group *iog;
> >  	struct elv_fq_data *efqd = &q->elevator->efqd;
> >  
> >  	rcu_read_lock();
> > -	cgroup = get_cgroup_from_bio(bio);
> > +
> > +	if (curr)
> > +		cgroup = task_cgroup(current, io_subsys_id);
> > +	else
> > +		cgroup = get_cgroup_from_bio(bio);
> > +
> >  	if (!cgroup) {
> >  		if (create)
> >  			iog = efqd->root_group;
> > @@ -1500,7 +1507,7 @@ out:
> >  	rcu_read_unlock();
> >  	return iog;
> >  }
> 
> -- 
> Regards
> Gui Jianfeng
> 

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-15  5:15                 ` Gui Jianfeng
  (?)
@ 2009-05-15  7:48                 ` Andrea Righi
  2009-05-15  8:16                   ` Gui Jianfeng
                                     ` (3 more replies)
  -1 siblings, 4 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-15  7:48 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: Vivek Goyal, Nauman Rafique, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	jmoyer, dhaval, balbir, linux-kernel, containers, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> >  }
> > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> >  /*
> >   * Find the io group bio belongs to.
> >   * If "create" is set, io group is created if it is not already present.
> > + * If "curr" is set, io group is information is searched for current
> > + * task and not with the help of bio.
> > + *
> > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > + * task and not create extra function parameter ?
> >   *
> > - * Note: There is a narrow window of race where a group is being freed
> > - * by cgroup deletion path and some rq has slipped through in this group.
> > - * Fix it.
> >   */
> > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > -					int create)
> > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > +					int create, int curr)
> 
>   Hi Vivek,
> 
>   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
>   get iog from bio, otherwise get it from current task.

Consider also that get_cgroup_from_bio() is much slower than
task_cgroup(): it needs to lock/unlock_page_cgroup() in
get_blkio_cgroup_id(), while task_cgroup() is rcu protected.

BTW, another optimization could be to use the blkio-cgroup functionality
only for dirty pages and cut out some blkio_set_owner() calls. In all the
other cases IO always occurs in the same context as the current task,
and you can use task_cgroup().

However, this is true only for page cache pages; for IO generated by
anonymous pages (swap) you still need the page tracking functionality
for both reads and writes.

-Andrea

> 
> >  {
> >  	struct cgroup *cgroup;
> >  	struct io_group *iog;
> >  	struct elv_fq_data *efqd = &q->elevator->efqd;
> >  
> >  	rcu_read_lock();
> > -	cgroup = get_cgroup_from_bio(bio);
> > +
> > +	if (curr)
> > +		cgroup = task_cgroup(current, io_subsys_id);
> > +	else
> > +		cgroup = get_cgroup_from_bio(bio);
> > +
> >  	if (!cgroup) {
> >  		if (create)
> >  			iog = efqd->root_group;
> > @@ -1500,7 +1507,7 @@ out:
> >  	rcu_read_unlock();
> >  	return iog;
> >  }
> 
> -- 
> Regards
> Gui Jianfeng
> 

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-15  7:48                 ` Andrea Righi
@ 2009-05-15  8:16                   ` Gui Jianfeng
  2009-05-15  8:16                   ` Gui Jianfeng
                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-15  8:16 UTC (permalink / raw)
  To: Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Andrea Righi wrote:
> On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>> ...
>>>  }
>>> @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
>>>  /*
>>>   * Find the io group bio belongs to.
>>>   * If "create" is set, io group is created if it is not already present.
>>> + * If "curr" is set, io group is information is searched for current
>>> + * task and not with the help of bio.
>>> + *
>>> + * FIXME: Can we assume that if bio is NULL then lookup group for current
>>> + * task and not create extra function parameter ?
>>>   *
>>> - * Note: There is a narrow window of race where a group is being freed
>>> - * by cgroup deletion path and some rq has slipped through in this group.
>>> - * Fix it.
>>>   */
>>> -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
>>> -					int create)
>>> +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
>>> +					int create, int curr)
>>   Hi Vivek,
>>
>>   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
>>   get iog from bio, otherwise get it from current task.
> 
> Consider also that get_cgroup_from_bio() is much more slow than
> task_cgroup() and need to lock/unlock_page_cgroup() in
> get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> 
> BTW another optimization could be to use the blkio-cgroup functionality
> only for dirty pages and cut out some blkio_set_owner(). For all the
> other cases IO always occurs in the same context of the current task,
> and you can use task_cgroup().
> 
> However, this is true only for page cache pages, for IO generated by
> anonymous pages (swap) you still need the page tracking functionality
> both for reads and writes.

  Hi Andrea,

  Thanks for pointing this out. Yes, I think we can determine the io group
  based on bio->bi_rw: if the bio is a READ bio, just take the io group from
  task_cgroup(); if it's a WRITE bio, get it from blkio_cgroup.

> 
> -Andrea
> 
>>>  {
>>>  	struct cgroup *cgroup;
>>>  	struct io_group *iog;
>>>  	struct elv_fq_data *efqd = &q->elevator->efqd;
>>>  
>>>  	rcu_read_lock();
>>> -	cgroup = get_cgroup_from_bio(bio);
>>> +
>>> +	if (curr)
>>> +		cgroup = task_cgroup(current, io_subsys_id);
>>> +	else
>>> +		cgroup = get_cgroup_from_bio(bio);
>>> +
>>>  	if (!cgroup) {
>>>  		if (create)
>>>  			iog = efqd->root_group;
>>> @@ -1500,7 +1507,7 @@ out:
>>>  	rcu_read_unlock();
>>>  	return iog;
>>>  }
>> -- 
>> Regards
>> Gui Jianfeng
>>
> 
> 
> 

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-15  7:48                 ` Andrea Righi
  2009-05-15  8:16                   ` Gui Jianfeng
@ 2009-05-15  8:16                   ` Gui Jianfeng
       [not found]                     ` <4A0D24E6.6010807-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-05-15 14:06                   ` Vivek Goyal
  2009-05-15 14:06                   ` Vivek Goyal
  3 siblings, 1 reply; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-15  8:16 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Vivek Goyal, Nauman Rafique, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	jmoyer, dhaval, balbir, linux-kernel, containers, agk, dm-devel,
	snitzer, m-ikeda, akpm

Andrea Righi wrote:
> On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>> ...
>>>  }
>>> @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
>>>  /*
>>>   * Find the io group bio belongs to.
>>>   * If "create" is set, io group is created if it is not already present.
>>> + * If "curr" is set, io group is information is searched for current
>>> + * task and not with the help of bio.
>>> + *
>>> + * FIXME: Can we assume that if bio is NULL then lookup group for current
>>> + * task and not create extra function parameter ?
>>>   *
>>> - * Note: There is a narrow window of race where a group is being freed
>>> - * by cgroup deletion path and some rq has slipped through in this group.
>>> - * Fix it.
>>>   */
>>> -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
>>> -					int create)
>>> +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
>>> +					int create, int curr)
>>   Hi Vivek,
>>
>>   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
>>   get iog from bio, otherwise get it from current task.
> 
> Consider also that get_cgroup_from_bio() is much more slow than
> task_cgroup() and need to lock/unlock_page_cgroup() in
> get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> 
> BTW another optimization could be to use the blkio-cgroup functionality
> only for dirty pages and cut out some blkio_set_owner(). For all the
> other cases IO always occurs in the same context of the current task,
> and you can use task_cgroup().
> 
> However, this is true only for page cache pages, for IO generated by
> anonymous pages (swap) you still need the page tracking functionality
> both for reads and writes.

  Hi Andrea,

  Thanks for pointing this out. Yes, I think we can determine the io group
  based on bio->bi_rw: if the bio is a READ bio, just take the io group from
  task_cgroup(); if it's a WRITE bio, get it from blkio_cgroup.

> 
> -Andrea
> 
>>>  {
>>>  	struct cgroup *cgroup;
>>>  	struct io_group *iog;
>>>  	struct elv_fq_data *efqd = &q->elevator->efqd;
>>>  
>>>  	rcu_read_lock();
>>> -	cgroup = get_cgroup_from_bio(bio);
>>> +
>>> +	if (curr)
>>> +		cgroup = task_cgroup(current, io_subsys_id);
>>> +	else
>>> +		cgroup = get_cgroup_from_bio(bio);
>>> +
>>>  	if (!cgroup) {
>>>  		if (create)
>>>  			iog = efqd->root_group;
>>> @@ -1500,7 +1507,7 @@ out:
>>>  	rcu_read_unlock();
>>>  	return iog;
>>>  }
>> -- 
>> Regards
>> Gui Jianfeng
>>
> 
> 
> 

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]                 ` <4A0D1C55.9040700-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-15 14:01                   ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-15 14:01 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, May 15, 2009 at 03:40:05PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > Ok, here is the patch which gets rid of rq->iog and rq->rl fields. Good to
> > see some code and data structures trimming. It seems to be working fine for me.
> > 
> > 
> > o Get rid of rq->iog field and rq->rl fields. request descriptor stores
> >   the pointer the the queue it belongs to (rq->ioq) and from the io queue one
> >   can determine the group queue belongs to hence request belongs to. Thanks
> >   to Nauman for the idea.
> > 
> > o There are couple of places where rq->ioq information is not present yet
> >   as request and queue are being setup. In those places "bio" is passed 
> >   around as function argument to determine the group rq will go into. I
> >   did not pass "iog" as function argument becuase when memory is scarce,
> >   we can release queue lock and sleep to wait for memory to become available
> >   and once we wake up, it is possible that io group is gone. Passing bio
> >   around helps that one shall have to remap bio to right group after waking
> >   up. 
> > 
> > o Got rid of io_lookup_io_group_current() function and merged it with
> >   io_get_io_group() to also take care of looking for group using current
> >   task info and not from bio.
> 
> Hi Vivek,
> 
> This patch gets rid of "curr" from io_get_io_group, and seems to be 
> working fine for me.
> 

Thanks Gui. I can't think of a reason why "curr" should be a separate
function argument. I could only find blk_get_request() calling
io_get_io_group() without any bio passed, and even in that case
determining the group from the task should not hurt.

I will apply the patch. A couple of comments inline.

> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
>  block/cfq-iosched.c |   12 ++++++------
>  block/elevator-fq.c |   16 ++++++++--------
>  block/elevator-fq.h |    2 +-
>  3 files changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index c0bb8db..bf87843 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -196,7 +196,7 @@ static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
>  		 * async bio tracking is enabled and we are not caching
>  		 * async queue pointer in cic.
>  		 */
> -		iog = io_get_io_group(cfqd->queue, bio, 0, 0);
> +		iog = io_get_io_group(cfqd->queue, bio, 0);
>  		if (!iog) {
>  			/*
>  			 * May be this is first rq/bio and io group has not
> @@ -1294,7 +1294,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
>  
>  	spin_lock_irqsave(q->queue_lock, flags);
>  
> -	iog = io_get_io_group(q, NULL, 0, 1);
> +	iog = io_get_io_group(q, NULL, 0);
>  
>  	if (async_cfqq != NULL) {
>  		__iog = cfqq_to_io_group(async_cfqq);
> @@ -1346,9 +1346,9 @@ retry:
>  	 * back.
>  	 */
>  	if (bio)
> -		iog = io_get_io_group(q, bio, 1, 0);
> +		iog = io_get_io_group(q, bio, 1);
>  	else
> -		iog = io_get_io_group(q, NULL, 1, 1);
> +		iog = io_get_io_group(q, NULL, 1);
>  

Can we now change the above to a single statement:

		io_get_io_group(q, bio, 1);

If bio is present, we will determine the group from that, otherwise from
the current task's context.
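
In diff form, the quoted call site would then collapse to (sketch):

-	if (bio)
-		iog = io_get_io_group(q, bio, 1);
-	else
-		iog = io_get_io_group(q, NULL, 1);
+	iog = io_get_io_group(q, bio, 1);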


>  	cic = cfq_cic_lookup(cfqd, ioc);
>  	/* cic always exists here */
> @@ -1469,9 +1469,9 @@ cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
>  	struct io_group *iog = NULL;
>  
>  	if (bio)
> -		iog = io_get_io_group(cfqd->queue, bio, 1, 0);
> +		iog = io_get_io_group(cfqd->queue, bio, 1);
>  	else
> -		iog = io_get_io_group(cfqd->queue, NULL, 1, 1);
> +		iog = io_get_io_group(cfqd->queue, NULL, 1);
>  

Same here. Get rid of "if" condition.

>  	if (!is_sync) {
>  		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 9b7319e..951c163 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1006,7 +1006,7 @@ struct request_list *io_group_get_request_list(struct request_queue *q,
>  {
>  	struct io_group *iog;
>  
> -	iog = io_get_io_group(q, bio, 1, 0);
> +	iog = io_get_io_group(q, bio, 1);
>  	BUG_ON(!iog);
>  	return &iog->rl;
>  }
> @@ -1470,7 +1470,7 @@ struct io_cgroup *get_iocg_from_bio(struct bio *bio)
>   *
>   */
>  struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> -					int create, int curr)
> +					int create)
>  {
>  	struct cgroup *cgroup;
>  	struct io_group *iog;
> @@ -1478,7 +1478,7 @@ struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
>  
>  	rcu_read_lock();
>  
> -	if (curr)
> +	if (!bio)
>  		cgroup = task_cgroup(current, io_subsys_id);
>  	else
>  		cgroup = get_cgroup_from_bio(bio);
> @@ -1959,7 +1959,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
>  		return 1;
>  
>  	/* Determine the io group of the bio submitting task */
> -	iog = io_get_io_group(q, bio, 0, 0);
> +	iog = io_get_io_group(q, bio, 0);
>  	if (!iog) {
>  		/* May be task belongs to a differet cgroup for which io
>  		 * group has not been setup yet. */
> @@ -2000,9 +2000,9 @@ int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
>  retry:
>  	/* Determine the io group request belongs to */
>  	if (bio)
> -		iog = io_get_io_group(q, bio, 1, 0);
> +		iog = io_get_io_group(q, bio, 1);
>  	else
> -		iog = io_get_io_group(q, bio, 1, 1);
> +		iog = io_get_io_group(q, NULL, 1);
>  

"if" not required now.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-15  7:40                 ` Gui Jianfeng
  (?)
@ 2009-05-15 14:01                 ` Vivek Goyal
  -1 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-15 14:01 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: Nauman Rafique, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Fri, May 15, 2009 at 03:40:05PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > Ok, here is the patch which gets rid of rq->iog and rq->rl fields. Good to
> > see some code and data structures trimming. It seems to be working fine for me.
> > 
> > 
> > o Get rid of rq->iog field and rq->rl fields. request descriptor stores
> >   the pointer the the queue it belongs to (rq->ioq) and from the io queue one
> >   can determine the group queue belongs to hence request belongs to. Thanks
> >   to Nauman for the idea.
> > 
> > o There are couple of places where rq->ioq information is not present yet
> >   as request and queue are being setup. In those places "bio" is passed 
> >   around as function argument to determine the group rq will go into. I
> >   did not pass "iog" as function argument becuase when memory is scarce,
> >   we can release queue lock and sleep to wait for memory to become available
> >   and once we wake up, it is possible that io group is gone. Passing bio
> >   around helps that one shall have to remap bio to right group after waking
> >   up. 
> > 
> > o Got rid of io_lookup_io_group_current() function and merged it with
> >   io_get_io_group() to also take care of looking for group using current
> >   task info and not from bio.
> 
> Hi Vivek,
> 
> This patch gets rid of "curr" from io_get_io_group, and seems to be 
> working fine for me.
> 

Thanks Gui. I can't think of a reason why "curr" should be a separate
function argument. I could only find blk_get_request() calling
io_get_io_group() without any bio passed, and even in that case
determining the group from the task should not hurt.

I will apply the patch. A couple of comments inline.

> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/cfq-iosched.c |   12 ++++++------
>  block/elevator-fq.c |   16 ++++++++--------
>  block/elevator-fq.h |    2 +-
>  3 files changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index c0bb8db..bf87843 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -196,7 +196,7 @@ static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
>  		 * async bio tracking is enabled and we are not caching
>  		 * async queue pointer in cic.
>  		 */
> -		iog = io_get_io_group(cfqd->queue, bio, 0, 0);
> +		iog = io_get_io_group(cfqd->queue, bio, 0);
>  		if (!iog) {
>  			/*
>  			 * May be this is first rq/bio and io group has not
> @@ -1294,7 +1294,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
>  
>  	spin_lock_irqsave(q->queue_lock, flags);
>  
> -	iog = io_get_io_group(q, NULL, 0, 1);
> +	iog = io_get_io_group(q, NULL, 0);
>  
>  	if (async_cfqq != NULL) {
>  		__iog = cfqq_to_io_group(async_cfqq);
> @@ -1346,9 +1346,9 @@ retry:
>  	 * back.
>  	 */
>  	if (bio)
> -		iog = io_get_io_group(q, bio, 1, 0);
> +		iog = io_get_io_group(q, bio, 1);
>  	else
> -		iog = io_get_io_group(q, NULL, 1, 1);
> +		iog = io_get_io_group(q, NULL, 1);
>  

Can we now change the above to a single statement:

		io_get_io_group(q, bio, 1);

If bio is present, we will determine the group from that, otherwise from
the current task's context.


>  	cic = cfq_cic_lookup(cfqd, ioc);
>  	/* cic always exists here */
> @@ -1469,9 +1469,9 @@ cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
>  	struct io_group *iog = NULL;
>  
>  	if (bio)
> -		iog = io_get_io_group(cfqd->queue, bio, 1, 0);
> +		iog = io_get_io_group(cfqd->queue, bio, 1);
>  	else
> -		iog = io_get_io_group(cfqd->queue, NULL, 1, 1);
> +		iog = io_get_io_group(cfqd->queue, NULL, 1);
>  

Same here. Get rid of "if" condition.

>  	if (!is_sync) {
>  		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 9b7319e..951c163 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1006,7 +1006,7 @@ struct request_list *io_group_get_request_list(struct request_queue *q,
>  {
>  	struct io_group *iog;
>  
> -	iog = io_get_io_group(q, bio, 1, 0);
> +	iog = io_get_io_group(q, bio, 1);
>  	BUG_ON(!iog);
>  	return &iog->rl;
>  }
> @@ -1470,7 +1470,7 @@ struct io_cgroup *get_iocg_from_bio(struct bio *bio)
>   *
>   */
>  struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> -					int create, int curr)
> +					int create)
>  {
>  	struct cgroup *cgroup;
>  	struct io_group *iog;
> @@ -1478,7 +1478,7 @@ struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
>  
>  	rcu_read_lock();
>  
> -	if (curr)
> +	if (!bio)
>  		cgroup = task_cgroup(current, io_subsys_id);
>  	else
>  		cgroup = get_cgroup_from_bio(bio);
> @@ -1959,7 +1959,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
>  		return 1;
>  
>  	/* Determine the io group of the bio submitting task */
> -	iog = io_get_io_group(q, bio, 0, 0);
> +	iog = io_get_io_group(q, bio, 0);
>  	if (!iog) {
>  		/* May be task belongs to a differet cgroup for which io
>  		 * group has not been setup yet. */
> @@ -2000,9 +2000,9 @@ int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
>  retry:
>  	/* Determine the io group request belongs to */
>  	if (bio)
> -		iog = io_get_io_group(q, bio, 1, 0);
> +		iog = io_get_io_group(q, bio, 1);
>  	else
> -		iog = io_get_io_group(q, bio, 1, 1);
> +		iog = io_get_io_group(q, NULL, 1);
>  

"if" not required now.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-15  7:48                 ` Andrea Righi
  2009-05-15  8:16                   ` Gui Jianfeng
  2009-05-15  8:16                   ` Gui Jianfeng
@ 2009-05-15 14:06                   ` Vivek Goyal
  2009-05-15 14:06                   ` Vivek Goyal
  3 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-15 14:06 UTC (permalink / raw)
  To: Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > Vivek Goyal wrote:
> > ...
> > >  }
> > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > >  /*
> > >   * Find the io group bio belongs to.
> > >   * If "create" is set, io group is created if it is not already present.
> > > + * If "curr" is set, io group is information is searched for current
> > > + * task and not with the help of bio.
> > > + *
> > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > + * task and not create extra function parameter ?
> > >   *
> > > - * Note: There is a narrow window of race where a group is being freed
> > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > - * Fix it.
> > >   */
> > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > -					int create)
> > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > +					int create, int curr)
> > 
> >   Hi Vivek,
> > 
> >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> >   get iog from bio, otherwise get it from current task.
> 
> Consider also that get_cgroup_from_bio() is much more slow than
> task_cgroup() and need to lock/unlock_page_cgroup() in
> get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> 

True.

> BTW another optimization could be to use the blkio-cgroup functionality
> only for dirty pages and cut out some blkio_set_owner(). For all the
> other cases IO always occurs in the same context of the current task,
> and you can use task_cgroup().
> 

Yes, maybe in some cases we can avoid setting the page owner. I will get
to it once I have the functionality working well. In the meantime, if
you have a patch for it, that would be great.

> However, this is true only for page cache pages, for IO generated by
> anonymous pages (swap) you still need the page tracking functionality
> both for reads and writes.
> 

Right now I am assuming that all sync IO belongs to the task submitting
the bio, hence I use task_cgroup() for that. Only for async IO am I trying
to use the page tracking functionality to determine the owner.
Look at elv_bio_sync(bio).

You seem to be saying that there are cases where, even for sync IO, we
can't use the submitting task's context and need to rely on the page
tracking functionality? In the case of getting a page (read) from swap,
will it not happen in the context of the process that takes the page fault
and initiates the swap read?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-15  7:48                 ` Andrea Righi
                                     ` (2 preceding siblings ...)
  2009-05-15 14:06                   ` Vivek Goyal
@ 2009-05-15 14:06                   ` Vivek Goyal
  2009-05-17 10:26                     ` Andrea Righi
       [not found]                     ` <20090515140643.GB19350-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  3 siblings, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-15 14:06 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Gui Jianfeng, Nauman Rafique, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	jmoyer, dhaval, balbir, linux-kernel, containers, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > Vivek Goyal wrote:
> > ...
> > >  }
> > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > >  /*
> > >   * Find the io group bio belongs to.
> > >   * If "create" is set, io group is created if it is not already present.
> > > + * If "curr" is set, io group is information is searched for current
> > > + * task and not with the help of bio.
> > > + *
> > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > + * task and not create extra function parameter ?
> > >   *
> > > - * Note: There is a narrow window of race where a group is being freed
> > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > - * Fix it.
> > >   */
> > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > -					int create)
> > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > +					int create, int curr)
> > 
> >   Hi Vivek,
> > 
> >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> >   get iog from bio, otherwise get it from current task.
> 
> Consider also that get_cgroup_from_bio() is much more slow than
> task_cgroup() and need to lock/unlock_page_cgroup() in
> get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> 

True.

> BTW another optimization could be to use the blkio-cgroup functionality
> only for dirty pages and cut out some blkio_set_owner(). For all the
> other cases IO always occurs in the same context of the current task,
> and you can use task_cgroup().
> 

Yes, maybe in some cases we can avoid setting the page owner. I will get
to it once I have the functionality working well. In the meantime, if
you have a patch for it, that would be great.

> However, this is true only for page cache pages, for IO generated by
> anonymous pages (swap) you still need the page tracking functionality
> both for reads and writes.
> 

Right now I am assuming that all sync IO belongs to the task submitting
the bio, hence I use task_cgroup() for that. Only for async IO am I trying
to use the page tracking functionality to determine the owner.
Look at elv_bio_sync(bio).

You seem to be saying that there are cases where, even for sync IO, we
can't use the submitting task's context and need to rely on the page
tracking functionality? In the case of getting a page (read) from swap,
will it not happen in the context of the process that takes the page fault
and initiates the swap read?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-15  8:16                   ` Gui Jianfeng
@ 2009-05-15 14:09                         ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-15 14:09 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Andrea Righi

On Fri, May 15, 2009 at 04:16:38PM +0800, Gui Jianfeng wrote:
> Andrea Righi wrote:
> > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >> ...
> >>>  }
> >>> @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> >>>  /*
> >>>   * Find the io group bio belongs to.
> >>>   * If "create" is set, io group is created if it is not already present.
> >>> + * If "curr" is set, io group is information is searched for current
> >>> + * task and not with the help of bio.
> >>> + *
> >>> + * FIXME: Can we assume that if bio is NULL then lookup group for current
> >>> + * task and not create extra function parameter ?
> >>>   *
> >>> - * Note: There is a narrow window of race where a group is being freed
> >>> - * by cgroup deletion path and some rq has slipped through in this group.
> >>> - * Fix it.
> >>>   */
> >>> -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> >>> -					int create)
> >>> +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> >>> +					int create, int curr)
> >>   Hi Vivek,
> >>
> >>   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> >>   get iog from bio, otherwise get it from current task.
> > 
> > Consider also that get_cgroup_from_bio() is much more slow than
> > task_cgroup() and need to lock/unlock_page_cgroup() in
> > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > 
> > BTW another optimization could be to use the blkio-cgroup functionality
> > only for dirty pages and cut out some blkio_set_owner(). For all the
> > other cases IO always occurs in the same context of the current task,
> > and you can use task_cgroup().
> > 
> > However, this is true only for page cache pages, for IO generated by
> > anonymous pages (swap) you still need the page tracking functionality
> > both for reads and writes.
> 
>   Hi Andrea,
> 
>   Thanks for pointing this out. Yes, i think we can determine io group in
>   terms of bio->bi_rw. If bio is a READ bio, just taking io group by 
>   task_cgroup(). If it's a WRITE bio, getting it from blkio_cgroup.
> 

Gui, we are already doing this. The page tracking functionality is used
only for async IO; for all sync IO we use the submitting task's group
to determine the io group the bio belongs to.

	if (elv_bio_sync(bio)) {
		/* sync io. Determine cgroup from submitting task
		 * context.*/
                cgroup = task_cgroup(current, io_subsys_id);
                return cgroup;
        }

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
@ 2009-05-15 14:09                         ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-15 14:09 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: Andrea Righi, Nauman Rafique, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	jmoyer, dhaval, balbir, linux-kernel, containers, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Fri, May 15, 2009 at 04:16:38PM +0800, Gui Jianfeng wrote:
> Andrea Righi wrote:
> > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >> ...
> >>>  }
> >>> @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> >>>  /*
> >>>   * Find the io group bio belongs to.
> >>>   * If "create" is set, io group is created if it is not already present.
> >>> + * If "curr" is set, io group is information is searched for current
> >>> + * task and not with the help of bio.
> >>> + *
> >>> + * FIXME: Can we assume that if bio is NULL then lookup group for current
> >>> + * task and not create extra function parameter ?
> >>>   *
> >>> - * Note: There is a narrow window of race where a group is being freed
> >>> - * by cgroup deletion path and some rq has slipped through in this group.
> >>> - * Fix it.
> >>>   */
> >>> -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> >>> -					int create)
> >>> +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> >>> +					int create, int curr)
> >>   Hi Vivek,
> >>
> >>   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> >>   get iog from bio, otherwise get it from current task.
> > 
> > Consider also that get_cgroup_from_bio() is much more slow than
> > task_cgroup() and need to lock/unlock_page_cgroup() in
> > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > 
> > BTW another optimization could be to use the blkio-cgroup functionality
> > only for dirty pages and cut out some blkio_set_owner(). For all the
> > other cases IO always occurs in the same context of the current task,
> > and you can use task_cgroup().
> > 
> > However, this is true only for page cache pages, for IO generated by
> > anonymous pages (swap) you still need the page tracking functionality
> > both for reads and writes.
> 
>   Hi Andrea,
> 
>   Thanks for pointing this out. Yes, i think we can determine io group in
>   terms of bio->bi_rw. If bio is a READ bio, just taking io group by 
>   task_cgroup(). If it's a WRITE bio, getting it from blkio_cgroup.
> 

Gui, we are already doing this. The page tracking functionality is used
only for async IO; for all sync IO we use the submitting task's group
to determine the io group the bio belongs to.

	if (elv_bio_sync(bio)) {
		/* sync io. Determine cgroup from submitting task
		 * context.*/
                cgroup = task_cgroup(current, io_subsys_id);
                return cgroup;
        }

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]                     ` <20090515140643.GB19350-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-17 10:26                       ` Andrea Righi
  0 siblings, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-17 10:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > Vivek Goyal wrote:
> > > ...
> > > >  }
> > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > >  /*
> > > >   * Find the io group bio belongs to.
> > > >   * If "create" is set, io group is created if it is not already present.
> > > > + * If "curr" is set, io group is information is searched for current
> > > > + * task and not with the help of bio.
> > > > + *
> > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > + * task and not create extra function parameter ?
> > > >   *
> > > > - * Note: There is a narrow window of race where a group is being freed
> > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > - * Fix it.
> > > >   */
> > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > -					int create)
> > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > +					int create, int curr)
> > > 
> > >   Hi Vivek,
> > > 
> > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > >   get iog from bio, otherwise get it from current task.
> > 
> > Consider also that get_cgroup_from_bio() is much more slow than
> > task_cgroup() and need to lock/unlock_page_cgroup() in
> > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > 
> 
> True.
> 
> > BTW another optimization could be to use the blkio-cgroup functionality
> > only for dirty pages and cut out some blkio_set_owner(). For all the
> > other cases IO always occurs in the same context of the current task,
> > and you can use task_cgroup().
> > 
> 
> Yes, may be in some cases we can avoid setting page owner. I will get
> to it once I have got functionality going well. In the mean time if
> you have a patch for it, it will be great.
> 
> > However, this is true only for page cache pages, for IO generated by
> > anonymous pages (swap) you still need the page tracking functionality
> > both for reads and writes.
> > 
> 
> Right now I am assuming that all the sync IO will belong to task
> submitting the bio hence use task_cgroup() for that. Only for async
> IO, I am trying to use page tracking functionality to determine the owner.
> Look at elv_bio_sync(bio).
> 
> You seem to be saying that there are cases where even for sync IO, we
> can't use submitting task's context and need to rely on page tracking
> functionlity? In case of getting page (read) from swap, will it not happen
> in the context of process who will take a page fault and initiate the
> swap read?

No, for example in read_swap_cache_async():

@@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*

This is a read, but the current task is not always the owner of this
swap cache page, because it's a readahead operation.

Anyway, I think this is a minor corner case, and it is probably safe to
treat this like any other read IO and get rid of the
blkio_cgroup_set_owner().

I wonder if it would be better to attach the blkio_cgroup to the
anonymous page only when swap-out occurs. I mean, just put the
blkio_cgroup_set_owner() hook in try_to_unmap() in order to keep track of
the IO generated by direct reclaim of anon memory. For all the other
cases we can simply use the submitting task's context.

BTW, O_DIRECT is another case that could be optimized, because all
the bios generated by direct IO occur in the same context as the current
task.

-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-15 14:06                   ` Vivek Goyal
@ 2009-05-17 10:26                     ` Andrea Righi
  2009-05-18 14:01                         ` Vivek Goyal
       [not found]                     ` <20090515140643.GB19350-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 297+ messages in thread
From: Andrea Righi @ 2009-05-17 10:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Gui Jianfeng, Nauman Rafique, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	jmoyer, dhaval, balbir, linux-kernel, containers, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > Vivek Goyal wrote:
> > > ...
> > > >  }
> > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > >  /*
> > > >   * Find the io group bio belongs to.
> > > >   * If "create" is set, io group is created if it is not already present.
> > > > + * If "curr" is set, io group is information is searched for current
> > > > + * task and not with the help of bio.
> > > > + *
> > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > + * task and not create extra function parameter ?
> > > >   *
> > > > - * Note: There is a narrow window of race where a group is being freed
> > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > - * Fix it.
> > > >   */
> > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > -					int create)
> > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > +					int create, int curr)
> > > 
> > >   Hi Vivek,
> > > 
> > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > >   get iog from bio, otherwise get it from current task.
> > 
> > Consider also that get_cgroup_from_bio() is much more slow than
> > task_cgroup() and need to lock/unlock_page_cgroup() in
> > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > 
> 
> True.
> 
> > BTW another optimization could be to use the blkio-cgroup functionality
> > only for dirty pages and cut out some blkio_set_owner(). For all the
> > other cases IO always occurs in the same context of the current task,
> > and you can use task_cgroup().
> > 
> 
> Yes, may be in some cases we can avoid setting page owner. I will get
> to it once I have got functionality going well. In the mean time if
> you have a patch for it, it will be great.
> 
> > However, this is true only for page cache pages, for IO generated by
> > anonymous pages (swap) you still need the page tracking functionality
> > both for reads and writes.
> > 
> 
> Right now I am assuming that all the sync IO will belong to task
> submitting the bio hence use task_cgroup() for that. Only for async
> IO, I am trying to use page tracking functionality to determine the owner.
> Look at elv_bio_sync(bio).
> 
> You seem to be saying that there are cases where even for sync IO, we
> can't use submitting task's context and need to rely on page tracking
> functionality? In case of getting page (read) from swap, will it not happen
> in the context of process who will take a page fault and initiate the
> swap read?

No, for example in read_swap_cache_async():

@@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*

This is a read, but the current task is not always the owner of this
swap cache page, because it's a readahead operation.

Anyway, this is a minor corner case I think. And probably it is safe to
consider this like any other read IO and get rid of the
blkio_cgroup_set_owner().

I wonder if it would be better to attach the blkio_cgroup to the
anonymous page only when swap-out occurs. I mean, just put the
blkio_cgroup_set_owner() hook in try_to_unmap() in order to keep track of
the IO generated by direct reclaim of anon memory. For all the other
cases we can simply use the submitting task's context.

BTW, O_DIRECT is another case that is possible to optimize, because all
the bios generated by direct IO occur in the same context of the current
task.

-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-17 10:26                     ` Andrea Righi
@ 2009-05-18 14:01                         ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-18 14:01 UTC (permalink / raw)
  To: Andrea Righi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
> On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> > On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > > Vivek Goyal wrote:
> > > > ...
> > > > >  }
> > > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > > >  /*
> > > > >   * Find the io group bio belongs to.
> > > > >   * If "create" is set, io group is created if it is not already present.
> > > > > + * If "curr" is set, io group is information is searched for current
> > > > > + * task and not with the help of bio.
> > > > > + *
> > > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > > + * task and not create extra function parameter ?
> > > > >   *
> > > > > - * Note: There is a narrow window of race where a group is being freed
> > > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > > - * Fix it.
> > > > >   */
> > > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > > -					int create)
> > > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > > +					int create, int curr)
> > > > 
> > > >   Hi Vivek,
> > > > 
> > > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > > >   get iog from bio, otherwise get it from current task.
> > > 
> > > Consider also that get_cgroup_from_bio() is much more slow than
> > > task_cgroup() and need to lock/unlock_page_cgroup() in
> > > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > > 
> > 
> > True.
> > 
> > > BTW another optimization could be to use the blkio-cgroup functionality
> > > only for dirty pages and cut out some blkio_set_owner(). For all the
> > > other cases IO always occurs in the same context of the current task,
> > > and you can use task_cgroup().
> > > 
> > 
> > Yes, may be in some cases we can avoid setting page owner. I will get
> > to it once I have got functionality going well. In the mean time if
> > you have a patch for it, it will be great.
> > 
> > > However, this is true only for page cache pages, for IO generated by
> > > anonymous pages (swap) you still need the page tracking functionality
> > > both for reads and writes.
> > > 
> > 
> > Right now I am assuming that all the sync IO will belong to task
> > submitting the bio hence use task_cgroup() for that. Only for async
> > IO, I am trying to use page tracking functionality to determine the owner.
> > Look at elv_bio_sync(bio).
> > 
> > You seem to be saying that there are cases where even for sync IO, we
> > can't use submitting task's context and need to rely on page tracking
> > functionality? In case of getting page (read) from swap, will it not happen
> > in the context of process who will take a page fault and initiate the
> > swap read?
> 
> No, for example in read_swap_cache_async():
> 
> @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		 */
>  		__set_page_locked(new_page);
>  		SetPageSwapBacked(new_page);
> +		blkio_cgroup_set_owner(new_page, current->mm);
>  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
>  		if (likely(!err)) {
>  			/*
> 
> This is a read, but the current task is not always the owner of this
> swap cache page, because it's a readahead operation.
> 

But will this readahead not be initiated in the context of the task taking
the page fault?

handle_pte_fault()
	do_swap_page()
		swapin_readahead()
			read_swap_cache_async()

If yes, then swap reads issued will still be in the context of process and
we should be fine?

> Anyway, this is a minor corner case I think. And probably it is safe to
> consider this like any other read IO and get rid of the
> blkio_cgroup_set_owner().

Agreed.

> 
> I wonder if it would be better to attach the blkio_cgroup to the
> anonymous page only when swap-out occurs.

Swap seems to be an interesting case in general. Somebody raised this
question on the LWN IO controller article as well. A user process never
asked for swap activity; it is something enforced by the kernel. So while
doing swap-outs, it does not seem quite fair to charge the writeout to
the process the page belongs to, when the fact of the matter may be that
some other memory-hungry application is forcing these swap-outs.

Keeping this in mind, should swap activity be considered as system
activity and be charged to root group instead of to user tasks in other
cgroups?
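
(Purely as an illustration of what that would mean, not something these
patches implement: io_get_io_group() could short-circuit the cgroup
lookup for swap IO along the lines below; bio_is_swapio() and the
root_group pointer are assumed names, not existing interfaces.)

	/*
	 * Hypothetical fragment inside io_get_io_group(): account all
	 * swap IO to the root group instead of the issuing cgroup.
	 */
	if (bio && bio_is_swapio(bio))
		return efqd->root_group;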
  
> I mean, just put the
> blkio_cgroup_set_owner() hook in try_to_unmap() in order to keep track of
> the IO generated by direct reclaim of anon memory. For all the other
> cases we can simply use the submitting task's context.
> 
> BTW, O_DIRECT is another case that is possible to optimize, because all
> the bios generated by direct IO occur in the same context of the current
> task.

Agreed about the direct IO optimization.

Ryo, what do you think? Would you like to include these optimizations
by Andrea in the next version of the IO tracking patches?
 
Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
@ 2009-05-18 14:01                         ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-18 14:01 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Gui Jianfeng, Nauman Rafique, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	jmoyer, dhaval, balbir, linux-kernel, containers, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
> On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> > On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > > Vivek Goyal wrote:
> > > > ...
> > > > >  }
> > > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > > >  /*
> > > > >   * Find the io group bio belongs to.
> > > > >   * If "create" is set, io group is created if it is not already present.
> > > > > + * If "curr" is set, io group is information is searched for current
> > > > > + * task and not with the help of bio.
> > > > > + *
> > > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > > + * task and not create extra function parameter ?
> > > > >   *
> > > > > - * Note: There is a narrow window of race where a group is being freed
> > > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > > - * Fix it.
> > > > >   */
> > > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > > -					int create)
> > > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > > +					int create, int curr)
> > > > 
> > > >   Hi Vivek,
> > > > 
> > > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > > >   get iog from bio, otherwise get it from current task.
> > > 
> > > Consider also that get_cgroup_from_bio() is much more slow than
> > > task_cgroup() and need to lock/unlock_page_cgroup() in
> > > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > > 
> > 
> > True.
> > 
> > > BTW another optimization could be to use the blkio-cgroup functionality
> > > only for dirty pages and cut out some blkio_set_owner(). For all the
> > > other cases IO always occurs in the same context of the current task,
> > > and you can use task_cgroup().
> > > 
> > 
> > Yes, may be in some cases we can avoid setting page owner. I will get
> > to it once I have got functionality going well. In the mean time if
> > you have a patch for it, it will be great.
> > 
> > > However, this is true only for page cache pages, for IO generated by
> > > anonymous pages (swap) you still need the page tracking functionality
> > > both for reads and writes.
> > > 
> > 
> > Right now I am assuming that all the sync IO will belong to task
> > submitting the bio hence use task_cgroup() for that. Only for async
> > IO, I am trying to use page tracking functionality to determine the owner.
> > Look at elv_bio_sync(bio).
> > 
> > You seem to be saying that there are cases where even for sync IO, we
> > can't use submitting task's context and need to rely on page tracking
> > functionality? In case of getting page (read) from swap, will it not happen
> > in the context of process who will take a page fault and initiate the
> > swap read?
> 
> No, for example in read_swap_cache_async():
> 
> @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		 */
>  		__set_page_locked(new_page);
>  		SetPageSwapBacked(new_page);
> +		blkio_cgroup_set_owner(new_page, current->mm);
>  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
>  		if (likely(!err)) {
>  			/*
> 
> This is a read, but the current task is not always the owner of this
> swap cache page, because it's a readahead operation.
> 

But will this readahead not be initiated in the context of the task taking
the page fault?

handle_pte_fault()
	do_swap_page()
		swapin_readahead()
			read_swap_cache_async()

If yes, then swap reads issued will still be in the context of process and
we should be fine?

> Anyway, this is a minor corner case I think. And probably it is safe to
> consider this like any other read IO and get rid of the
> blkio_cgroup_set_owner().

Agreed.

> 
> I wonder if it would be better to attach the blkio_cgroup to the
> anonymous page only when swap-out occurs.

Swap seems to be an interesting case in general. Somebody raised this
question on the LWN IO controller article as well. A user process never
asked for swap activity; it is something enforced by the kernel. So while
doing swap-outs, it does not seem quite fair to charge the writeout to
the process the page belongs to, when the fact of the matter may be that
some other memory-hungry application is forcing these swap-outs.

Keeping this in mind, should swap activity be considered as system
activity and be charged to root group instead of to user tasks in other
cgroups?
  
> I mean, just put the
> blkio_cgroup_set_owner() hook in try_to_unmap() in order to keep track of
> the IO generated by direct reclaim of anon memory. For all the other
> cases we can simply use the submitting task's context.
> 
> BTW, O_DIRECT is another case that is possible to optimize, because all
> the bios generated by direct IO occur in the same context of the current
> task.

Agreed about the direct IO optimization.

Ryo, what do you think? Would you like to include these optimizations
by Andrea in the next version of the IO tracking patches?
 
Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]                         ` <20090518140114.GB27080-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-18 14:39                           ` Andrea Righi
  2009-05-19 12:18                           ` Ryo Tsuruta
  1 sibling, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-18 14:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
> On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
> > On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> > > On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > > > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > > > Vivek Goyal wrote:
> > > > > ...
> > > > > >  }
> > > > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > > > >  /*
> > > > > >   * Find the io group bio belongs to.
> > > > > >   * If "create" is set, io group is created if it is not already present.
> > > > > > + * If "curr" is set, io group is information is searched for current
> > > > > > + * task and not with the help of bio.
> > > > > > + *
> > > > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > > > + * task and not create extra function parameter ?
> > > > > >   *
> > > > > > - * Note: There is a narrow window of race where a group is being freed
> > > > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > > > - * Fix it.
> > > > > >   */
> > > > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > > > -					int create)
> > > > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > > > +					int create, int curr)
> > > > > 
> > > > >   Hi Vivek,
> > > > > 
> > > > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > > > >   get iog from bio, otherwise get it from current task.
> > > > 
> > > > Consider also that get_cgroup_from_bio() is much more slow than
> > > > task_cgroup() and need to lock/unlock_page_cgroup() in
> > > > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > > > 
> > > 
> > > True.
> > > 
> > > > BTW another optimization could be to use the blkio-cgroup functionality
> > > > only for dirty pages and cut out some blkio_set_owner(). For all the
> > > > other cases IO always occurs in the same context of the current task,
> > > > and you can use task_cgroup().
> > > > 
> > > 
> > > Yes, may be in some cases we can avoid setting page owner. I will get
> > > to it once I have got functionality going well. In the mean time if
> > > you have a patch for it, it will be great.
> > > 
> > > > However, this is true only for page cache pages, for IO generated by
> > > > anonymous pages (swap) you still need the page tracking functionality
> > > > both for reads and writes.
> > > > 
> > > 
> > > Right now I am assuming that all the sync IO will belong to task
> > > submitting the bio hence use task_cgroup() for that. Only for async
> > > IO, I am trying to use page tracking functionality to determine the owner.
> > > Look at elv_bio_sync(bio).
> > > 
> > > You seem to be saying that there are cases where even for sync IO, we
> > > can't use submitting task's context and need to rely on page tracking
> > > functionality? In case of getting page (read) from swap, will it not happen
> > > in the context of process who will take a page fault and initiate the
> > > swap read?
> > 
> > No, for example in read_swap_cache_async():
> > 
> > @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >  		 */
> >  		__set_page_locked(new_page);
> >  		SetPageSwapBacked(new_page);
> > +		blkio_cgroup_set_owner(new_page, current->mm);
> >  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> >  		if (likely(!err)) {
> >  			/*
> > 
> > This is a read, but the current task is not always the owner of this
> > swap cache page, because it's a readahead operation.
> > 
> 
> But will this readahead be not initiated in the context of the task taking
> the page fault?
> 
> handle_pte_fault()
> 	do_swap_page()
> 		swapin_readahead()
> 			read_swap_cache_async()
> 
> If yes, then swap reads issued will still be in the context of process and
> we should be fine?

Right. I was trying to say that the current task may also swap in pages
belonging to a different task, so from a certain point of view it's not
entirely fair to charge the current task for the whole activity. But OK,
I think it's a minor issue.
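
For reference, a condensed sketch of swapin_readahead() (simplified from
mm/swap_state.c; error handling and the lru_add_drain() call are
trimmed) which shows why that happens: the faulting task reads a whole
window of neighbouring swap slots, and those slots may hold pages that
belong to other tasks.

struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
			struct vm_area_struct *vma, unsigned long addr)
{
	unsigned long offset, end_offset;
	struct page *page;
	int nr_pages;

	/* read around the faulting entry: a cluster of nearby swap slots */
	nr_pages = valid_swaphandles(entry, &offset);
	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
		/* these neighbouring slots may belong to other tasks */
		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
					     gfp_mask, vma, addr);
		if (!page)
			break;
		page_cache_release(page);
	}
	/* finally, read the page that was actually faulted on */
	return read_swap_cache_async(entry, gfp_mask, vma, addr);
}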

> 
> > Anyway, this is a minor corner case I think. And probably it is safe to
> > consider this like any other read IO and get rid of the
> > blkio_cgroup_set_owner().
> 
> Agreed.
> 
> > 
> > I wonder if it would be better to attach the blkio_cgroup to the
> > anonymous page only when swap-out occurs.
> 
> Swap seems to be an interesting case in general. Somebody raised this
> question on lwn io controller article also. A user process never asked
> for swap activity. It is something enforced by kernel. So while doing
> some swap outs, it does not seem too fair to charge the write out to
> the process page belongs to and the fact of the matter may be that there
> is some other memory hungry application which is forcing these swap outs.
> 
> Keeping this in mind, should swap activity be considered as system
> activity and be charged to root group instead of to user tasks in other
> cgroups?

In this case I assume the swap-in activity should be charged to the root
cgroup as well.

Anyway, in the logic of the memory and swap control it would seem
reasonable to provide IO separation also for the swap IO activity.

In the MEMHOG example, it would be unfair if the memory pressure is
caused by a task in another cgroup, but with memory and swap isolation a
memory pressure condition can only be caused by a memory hog that runs
in the same cgroup. From this point of view it seems fairer to consider
the swap activity as IO activity of that particular cgroup, instead of
always charging the root cgroup.

Otherwise, I suspect, memory pressure would be a simple way to blow away
any kind of QoS guarantees provided by the IO controller.

>   
> > I mean, just put the
> > blkio_cgroup_set_owner() hook in try_to_unmap() in order to keep track of
> > the IO generated by direct reclaim of anon memory. For all the other
> > cases we can simply use the submitting task's context.
> > 
> > BTW, O_DIRECT is another case that is possible to optimize, because all
> > the bios generated by direct IO occur in the same context of the current
> > task.
> 
> Agreed about the direct IO optimization.
> 
> Ryo, what do you think? Would you like to include these optimizations
> by Andrea in the next version of the IO tracking patches?
>  
> Thanks
> Vivek

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-18 14:01                         ` Vivek Goyal
  (?)
@ 2009-05-18 14:39                         ` Andrea Righi
  2009-05-26 11:34                           ` Ryo Tsuruta
  2009-05-26 11:34                           ` Ryo Tsuruta
  -1 siblings, 2 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-18 14:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Gui Jianfeng, Nauman Rafique, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	jmoyer, dhaval, balbir, linux-kernel, containers, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
> On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
> > On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> > > On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > > > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > > > Vivek Goyal wrote:
> > > > > ...
> > > > > >  }
> > > > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > > > >  /*
> > > > > >   * Find the io group bio belongs to.
> > > > > >   * If "create" is set, io group is created if it is not already present.
> > > > > > + * If "curr" is set, io group is information is searched for current
> > > > > > + * task and not with the help of bio.
> > > > > > + *
> > > > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > > > + * task and not create extra function parameter ?
> > > > > >   *
> > > > > > - * Note: There is a narrow window of race where a group is being freed
> > > > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > > > - * Fix it.
> > > > > >   */
> > > > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > > > -					int create)
> > > > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > > > +					int create, int curr)
> > > > > 
> > > > >   Hi Vivek,
> > > > > 
> > > > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > > > >   get iog from bio, otherwise get it from current task.
> > > > 
> > > > Consider also that get_cgroup_from_bio() is much more slow than
> > > > task_cgroup() and need to lock/unlock_page_cgroup() in
> > > > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > > > 
> > > 
> > > True.
> > > 
> > > > BTW another optimization could be to use the blkio-cgroup functionality
> > > > only for dirty pages and cut out some blkio_set_owner(). For all the
> > > > other cases IO always occurs in the same context of the current task,
> > > > and you can use task_cgroup().
> > > > 
> > > 
> > > Yes, may be in some cases we can avoid setting page owner. I will get
> > > to it once I have got functionality going well. In the mean time if
> > > you have a patch for it, it will be great.
> > > 
> > > > However, this is true only for page cache pages, for IO generated by
> > > > anonymous pages (swap) you still need the page tracking functionality
> > > > both for reads and writes.
> > > > 
> > > 
> > > Right now I am assuming that all the sync IO will belong to task
> > > submitting the bio hence use task_cgroup() for that. Only for async
> > > IO, I am trying to use page tracking functionality to determine the owner.
> > > Look at elv_bio_sync(bio).
> > > 
> > > You seem to be saying that there are cases where even for sync IO, we
> > > can't use submitting task's context and need to rely on page tracking
> > > functionality? In case of getting page (read) from swap, will it not happen
> > > in the context of process who will take a page fault and initiate the
> > > swap read?
> > 
> > No, for example in read_swap_cache_async():
> > 
> > @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >  		 */
> >  		__set_page_locked(new_page);
> >  		SetPageSwapBacked(new_page);
> > +		blkio_cgroup_set_owner(new_page, current->mm);
> >  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> >  		if (likely(!err)) {
> >  			/*
> > 
> > This is a read, but the current task is not always the owner of this
> > swap cache page, because it's a readahead operation.
> > 
> 
> But will this readahead be not initiated in the context of the task taking
> the page fault?
> 
> handle_pte_fault()
> 	do_swap_page()
> 		swapin_readahead()
> 			read_swap_cache_async()
> 
> If yes, then swap reads issued will still be in the context of process and
> we should be fine?

Right. I was trying to say that the current task may also swap in pages
belonging to a different task, so from a certain point of view it's not
entirely fair to charge the current task for the whole activity. But OK,
I think it's a minor issue.

> 
> > Anyway, this is a minor corner case I think. And probably it is safe to
> > consider this like any other read IO and get rid of the
> > blkio_cgroup_set_owner().
> 
> Agreed.
> 
> > 
> > I wonder if it would be better to attach the blkio_cgroup to the
> > anonymous page only when swap-out occurs.
> 
> Swap seems to be an interesting case in general. Somebody raised this
> question on lwn io controller article also. A user process never asked
> for swap activity. It is something enforced by kernel. So while doing
> some swap outs, it does not seem too fair to charge the write out to
> the process page belongs to and the fact of the matter may be that there
> is some other memory hungry application which is forcing these swap outs.
> 
> Keeping this in mind, should swap activity be considered as system
> activity and be charged to root group instead of to user tasks in other
> cgroups?

In this case I assume the swap-in activity should be charged to the root
cgroup as well.

Anyway, in the logic of the memory and swap control it would seem
reasonable to provide IO separation also for the swap IO activity.

In the MEMHOG example, it would be unfair if the memory pressure is
caused by a task in another cgroup, but with memory and swap isolation a
memory pressure condition can only be caused by a memory hog that runs
in the same cgroup. From this point of view it seems fairer to consider
the swap activity as IO activity of that particular cgroup, instead of
always charging the root cgroup.

Otherwise, I suspect, memory pressure would be a simple way to blow away
any kind of QoS guarantees provided by the IO controller.

>   
> > I mean, just put the
> > blkio_cgroup_set_owner() hook in try_to_unmap() in order to keep track of
> > the IO generated by direct reclaim of anon memory. For all the other
> > cases we can simply use the submitting task's context.
> > 
> > BTW, O_DIRECT is another case that is possible to optimize, because all
> > the bios generated by direct IO occur in the same context of the current
> > task.
> 
> Agreed about the direct IO optimization.
> 
> Ryo, what do you think? Would you like to include these optimizations
> by Andrea in the next version of the IO tracking patches?
>  
> Thanks
> Vivek

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]       ` <4A0BC7AB.8030703-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-05-14 15:15         ` Vivek Goyal
@ 2009-05-18 22:33         ` IKEDA, Munehiro
  1 sibling, 0 replies; 297+ messages in thread
From: IKEDA, Munehiro @ 2009-05-18 22:33 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Gui,

Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this 
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and 
> "ioprio_class" are used as default values in this device.
> 
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /patch/to/cgroup/policy
> weight=0 means removing the policy for DEV.
> 
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2

Users can specify the device file of a partition for io.policy.
In this case, io_policy_node::dev_name is set to the name of the
partition device, like /dev/sda2.

ex)
  # cd /mnt/cgroup
  # echo /dev/sda2:500:2 > io.policy
  # cat io.policy
    dev weight class
    /dev/sda2 500 2

I believe io_policy_node::dev_name should be set to a generic
device name like /dev/sda.
What do you think about it?
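
To make the effect concrete, here is a small stand-alone demonstration of
the string rewrite the patch below performs (the values are just examples):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[] = "/dev/sda2";	/* path written to io.policy       */
	const char *disk_name = "sda";	/* disk->disk_name of the gendisk  */
	char *c = strrchr(buf, '/');	/* points at the trailing "/sda2"  */

	if (c)
		strcpy(c + 1, disk_name);	/* overwrite the basename  */
	printf("%s\n", buf);			/* prints "/dev/sda"       */
	return 0;
}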

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
---
 block/elevator-fq.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 39fa2a1..5d3d55c 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1631,11 +1631,12 @@ static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
 	return NULL;
 }
 
-static int devname_to_devnum(const char *buf, dev_t *dev)
+static int devname_to_devnum(char *buf, dev_t *dev)
 {
 	struct block_device *bdev;
 	struct gendisk *disk;
 	int part;
+	char *c;
 
 	bdev = lookup_bdev(buf);
 	if (IS_ERR(bdev))
@@ -1645,6 +1646,10 @@ static int devname_to_devnum(const char *buf, dev_t *dev)
 	*dev = MKDEV(disk->major, disk->first_minor);
 	bdput(bdev);
 
+	c = strrchr(buf, '/');
+	if (c)
+		strcpy(c+1, disk->disk_name);
+
 	return 0;
 }
 
-- 
1.5.4.3


-- 
IKEDA, Munehiro
  NEC Corporation of America
    m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-14  7:26     ` Gui Jianfeng
  2009-05-14 15:15       ` Vivek Goyal
@ 2009-05-18 22:33       ` IKEDA, Munehiro
  2009-05-20  1:44         ` Gui Jianfeng
       [not found]         ` <4A11E244.2000305-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
       [not found]       ` <4A0BC7AB.8030703-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2 siblings, 2 replies; 297+ messages in thread
From: IKEDA, Munehiro @ 2009-05-18 22:33 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: Vivek Goyal, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	jmoyer, dhaval, balbir, linux-kernel, containers, righi.andrea,
	agk, dm-devel, snitzer, akpm

Hi Gui,

Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch enables per-cgroup per-device weight and ioprio_class handling.
> A new cgroup interface "policy" is introduced. You can make use of this 
> file to configure weight and ioprio_class for each device in a given cgroup.
> The original "weight" and "ioprio_class" files are still available. If you
> don't do special configuration for a particular device, "weight" and 
> "ioprio_class" are used as default values in this device.
> 
> You can use the following format to play with the new interface.
> #echo DEV:weight:ioprio_class > /patch/to/cgroup/policy
> weight=0 means removing the policy for DEV.
> 
> Examples:
> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
> # echo /dev/hdb:300:2 > io.policy
> # cat io.policy
> dev weight class
> /dev/hdb 300 2

Users can specify the device file of a partition for io.policy.
In this case, io_policy_node::dev_name is set to the name of the
partition device, like /dev/sda2.

ex)
  # cd /mnt/cgroup
  # echo /dev/sda2:500:2 > io.policy
  # cat io.policy
    dev weight class
    /dev/sda2 500 2

I believe io_policy_node::dev_name should be set to a generic
device name like /dev/sda.
What do you think about it?

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
---
 block/elevator-fq.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 39fa2a1..5d3d55c 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1631,11 +1631,12 @@ static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
 	return NULL;
 }
 
-static int devname_to_devnum(const char *buf, dev_t *dev)
+static int devname_to_devnum(char *buf, dev_t *dev)
 {
 	struct block_device *bdev;
 	struct gendisk *disk;
 	int part;
+	char *c;
 
 	bdev = lookup_bdev(buf);
 	if (IS_ERR(bdev))
@@ -1645,6 +1646,10 @@ static int devname_to_devnum(const char *buf, dev_t *dev)
 	*dev = MKDEV(disk->major, disk->first_minor);
 	bdput(bdev);
 
+	c = strrchr(buf, '/');
+	if (c)
+		strcpy(c+1, disk->disk_name);
+
 	return 0;
 }
 
-- 
1.5.4.3


-- 
IKEDA, Munehiro
  NEC Corporation of America
    m-ikeda@ds.jp.nec.com


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]                         ` <20090518140114.GB27080-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-18 14:39                           ` Andrea Righi
@ 2009-05-19 12:18                           ` Ryo Tsuruta
  1 sibling, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-19 12:18 UTC (permalink / raw)
  To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Subject: Re: [PATCH] io-controller: Add io group reference handling for request
Date: Mon, 18 May 2009 10:01:14 -0400

> On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
> > On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> > > On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > > > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > > > Vivek Goyal wrote:
> > > > > ...
> > > > > >  }
> > > > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > > > >  /*
> > > > > >   * Find the io group bio belongs to.
> > > > > >   * If "create" is set, io group is created if it is not already present.
> > > > > > + * If "curr" is set, io group is information is searched for current
> > > > > > + * task and not with the help of bio.
> > > > > > + *
> > > > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > > > + * task and not create extra function parameter ?
> > > > > >   *
> > > > > > - * Note: There is a narrow window of race where a group is being freed
> > > > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > > > - * Fix it.
> > > > > >   */
> > > > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > > > -					int create)
> > > > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > > > +					int create, int curr)
> > > > > 
> > > > >   Hi Vivek,
> > > > > 
> > > > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > > > >   get iog from bio, otherwise get it from current task.
> > > > 
> > > > Consider also that get_cgroup_from_bio() is much more slow than
> > > > task_cgroup() and need to lock/unlock_page_cgroup() in
> > > > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > > > 
> > > 
> > > True.
> > > 
> > > > BTW another optimization could be to use the blkio-cgroup functionality
> > > > only for dirty pages and cut out some blkio_set_owner(). For all the
> > > > other cases IO always occurs in the same context of the current task,
> > > > and you can use task_cgroup().
> > > > 
> > > 
> > > Yes, may be in some cases we can avoid setting page owner. I will get
> > > to it once I have got functionality going well. In the mean time if
> > > you have a patch for it, it will be great.
> > > 
> > > > However, this is true only for page cache pages, for IO generated by
> > > > anonymous pages (swap) you still need the page tracking functionality
> > > > both for reads and writes.
> > > > 
> > > 
> > > Right now I am assuming that all the sync IO will belong to task
> > > submitting the bio hence use task_cgroup() for that. Only for async
> > > IO, I am trying to use page tracking functionality to determine the owner.
> > > Look at elv_bio_sync(bio).
> > > 
> > > You seem to be saying that there are cases where even for sync IO, we
> > > can't use submitting task's context and need to rely on page tracking
> > > functionality? In case of getting page (read) from swap, will it not happen
> > > in the context of process who will take a page fault and initiate the
> > > swap read?
> > 
> > No, for example in read_swap_cache_async():
> > 
> > @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >  		 */
> >  		__set_page_locked(new_page);
> >  		SetPageSwapBacked(new_page);
> > +		blkio_cgroup_set_owner(new_page, current->mm);
> >  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> >  		if (likely(!err)) {
> >  			/*
> > 
> > This is a read, but the current task is not always the owner of this
> > swap cache page, because it's a readahead operation.
> > 
> 
> But will this readahead be not initiated in the context of the task taking
> the page fault?
> 
> handle_pte_fault()
> 	do_swap_page()
> 		swapin_readahead()
> 			read_swap_cache_async()
> 
> If yes, then swap reads issued will still be in the context of process and
> we should be fine?
> 
> > Anyway, this is a minor corner case I think. And probably it is safe to
> > consider this like any other read IO and get rid of the
> > blkio_cgroup_set_owner().
> 
> Agreed.
> 
> > 
> > I wonder if it would be better to attach the blkio_cgroup to the
> > anonymous page only when swap-out occurs.
> 
> Swap seems to be an interesting case in general. Somebody raised this
> question on lwn io controller article also. A user process never asked
> for swap activity. It is something enforced by kernel. So while doing
> some swap outs, it does not seem too fair to charge the write out to
> the process page belongs to and the fact of the matter may be that there
> is some other memory hungry application which is forcing these swap outs.
> 
> Keeping this in mind, should swap activity be considered as system
> activity and be charged to root group instead of to user tasks in other
> cgroups?
>   
> > I mean, just put the
> > blkio_cgroup_set_owner() hook in try_to_unmap() in order to keep track of
> > the IO generated by direct reclaim of anon memory. For all the other
> > cases we can simply use the submitting task's context.
> > 
> > BTW, O_DIRECT is another case that is possible to optimize, because all
> > the bios generated by direct IO occur in the same context of the current
> > task.
> 
> Agreed about the direct IO optimization.
> 
> Ryo, what do you think? Would you like to include these optimizations
> by Andrea in the next version of the IO tracking patches?

I'll consider whether these optimizations are reasonable.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-18 14:01                         ` Vivek Goyal
  (?)
  (?)
@ 2009-05-19 12:18                         ` Ryo Tsuruta
  -1 siblings, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-19 12:18 UTC (permalink / raw)
  To: vgoyal
  Cc: righi.andrea, guijianfeng, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, jens.axboe, fernando, s-uchida, taka,
	jmoyer, dhaval, balbir, linux-kernel, containers, agk, dm-devel,
	snitzer, m-ikeda, akpm

From: Vivek Goyal <vgoyal@redhat.com>
Subject: Re: [PATCH] io-controller: Add io group reference handling for request
Date: Mon, 18 May 2009 10:01:14 -0400

> On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
> > On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> > > On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > > > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > > > Vivek Goyal wrote:
> > > > > ...
> > > > > >  }
> > > > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > > > >  /*
> > > > > >   * Find the io group bio belongs to.
> > > > > >   * If "create" is set, io group is created if it is not already present.
> > > > > > + * If "curr" is set, io group is information is searched for current
> > > > > > + * task and not with the help of bio.
> > > > > > + *
> > > > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > > > + * task and not create extra function parameter ?
> > > > > >   *
> > > > > > - * Note: There is a narrow window of race where a group is being freed
> > > > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > > > - * Fix it.
> > > > > >   */
> > > > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > > > -					int create)
> > > > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > > > +					int create, int curr)
> > > > > 
> > > > >   Hi Vivek,
> > > > > 
> > > > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > > > >   get iog from bio, otherwise get it from current task.
> > > > 
> > > > Consider also that get_cgroup_from_bio() is much more slow than
> > > > task_cgroup() and need to lock/unlock_page_cgroup() in
> > > > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > > > 
> > > 
> > > True.
> > > 
> > > > BTW another optimization could be to use the blkio-cgroup functionality
> > > > only for dirty pages and cut out some blkio_set_owner(). For all the
> > > > other cases IO always occurs in the same context of the current task,
> > > > and you can use task_cgroup().
> > > > 
> > > 
> > > Yes, may be in some cases we can avoid setting page owner. I will get
> > > to it once I have got functionality going well. In the mean time if
> > > you have a patch for it, it will be great.
> > > 
> > > > However, this is true only for page cache pages, for IO generated by
> > > > anonymous pages (swap) you still need the page tracking functionality
> > > > both for reads and writes.
> > > > 
> > > 
> > > Right now I am assuming that all the sync IO will belong to task
> > > submitting the bio hence use task_cgroup() for that. Only for async
> > > IO, I am trying to use page tracking functionality to determine the owner.
> > > Look at elv_bio_sync(bio).
> > > 
> > > You seem to be saying that there are cases where even for sync IO, we
> > > can't use submitting task's context and need to rely on page tracking
> > > functionality? In case of getting page (read) from swap, will it not happen
> > > in the context of process who will take a page fault and initiate the
> > > swap read?
> > 
> > No, for example in read_swap_cache_async():
> > 
> > @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >  		 */
> >  		__set_page_locked(new_page);
> >  		SetPageSwapBacked(new_page);
> > +		blkio_cgroup_set_owner(new_page, current->mm);
> >  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> >  		if (likely(!err)) {
> >  			/*
> > 
> > This is a read, but the current task is not always the owner of this
> > swap cache page, because it's a readahead operation.
> > 
> 
> But will this readahead be not initiated in the context of the task taking
> the page fault?
> 
> handle_pte_fault()
> 	do_swap_page()
> 		swapin_readahead()
> 			read_swap_cache_async()
> 
> If yes, then swap reads issued will still be in the context of process and
> we should be fine?
> 
> > Anyway, this is a minor corner case I think. And probably it is safe to
> > consider this like any other read IO and get rid of the
> > blkio_cgroup_set_owner().
> 
> Agreed.
> 
> > 
> > I wonder if it would be better to attach the blkio_cgroup to the
> > anonymous page only when swap-out occurs.
> 
> Swap seems to be an interesting case in general. Somebody raised this
> question on lwn io controller article also. A user process never asked
> for swap activity. It is something enforced by kernel. So while doing
> some swap outs, it does not seem too fair to charge the write out to
> the process page belongs to and the fact of the matter may be that there
> is some other memory hungry application which is forcing these swap outs.
> 
> Keeping this in mind, should swap activity be considered as system
> activity and be charged to root group instead of to user tasks in other
> cgroups?
>   
> > I mean, just put the
> > blkio_cgroup_set_owner() hook in try_to_unmap() in order to keep track of
> > the IO generated by direct reclaim of anon memory. For all the other
> > cases we can simply use the submitting task's context.
> > 
> > BTW, O_DIRECT is another case that is possible to optimize, because all
> > the bios generated by direct IO occur in the same context of the current
> > task.
> 
> Agreed about the direct IO optimization.
> 
> Ryo, what do you think? Would you like to include these optimizations
> by Andrea in the next version of the IO tracking patches?

I'll consider whether these optimizations are reasonable.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
       [not found]         ` <4A11E244.2000305-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
@ 2009-05-20  1:44           ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-20  1:44 UTC (permalink / raw)
  To: IKEDA, Munehiro, Vivek Goyal
  Cc: paolo.valente-rcYM44yAMweonA0d6jMUrA,
	dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, snitzer-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

IKEDA, Munehiro wrote:
> Hi Gui,
> 
> Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch enables per-cgroup per-device weight and ioprio_class
>> handling.
>> A new cgroup interface "policy" is introduced. You can make use of
>> this file to configure weight and ioprio_class for each device in a
>> given cgroup.
>> The original "weight" and "ioprio_class" files are still available. If
>> you
>> don't do special configuration for a particular device, "weight" and
>> "ioprio_class" are used as default values in this device.
>>
>> You can use the following format to play with the new interface.
>> #echo DEV:weight:ioprio_class > /patch/to/cgroup/policy
>> weight=0 means removing the policy for DEV.
>>
>> Examples:
>> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
>> # echo /dev/hdb:300:2 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hdb 300 2
> 
> Users can specify a device file of a partition for io.policy.
> In this case, io_policy_node::dev_name is set as a name of the
> partition device like /dev/sda2.
> 
> ex)
>  # cd /mnt/cgroup
>  # echo /dev/sda2:500:2 > io.policy
>  # cat io.policy
>    dev weight class
>    /dev/sda2 500 2
> 
> I believe io_policy_node::dev_name should be set a generic
> device name like /dev/sda.
> What do you think about it?

  Hi Ikeda-san,

  Sorry for the late reply. Thanks for pointing this out. 
  Yes, it does the right thing but shows a wrong name.
  IMHO, inputting a single partition should not be allowed since the
  policy is disk based. So how about the following patch?

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 1a0ca07..b620768 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1650,6 +1650,9 @@ static int devname_to_devnum(const char *buf, dev_t *dev)
 		return -ENODEV;
 
 	disk = get_gendisk(bdev->bd_dev, &part);
+	if (part)
+		return -EINVAL;
+
 	*dev = MKDEV(disk->major, disk->first_minor);
 	bdput(bdev);

> 
> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
> ---
> block/elevator-fq.c |    7 ++++++-
> 1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 39fa2a1..5d3d55c 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1631,11 +1631,12 @@ static struct io_policy_node
> *policy_search_node(const struct io_cgroup *iocg,
>     return NULL;
> }
> 
> -static int devname_to_devnum(const char *buf, dev_t *dev)
> +static int devname_to_devnum(char *buf, dev_t *dev)
> {
>     struct block_device *bdev;
>     struct gendisk *disk;
>     int part;
> +    char *c;
> 
>     bdev = lookup_bdev(buf);
>     if (IS_ERR(bdev))
> @@ -1645,6 +1646,10 @@ static int devname_to_devnum(const char *buf,
> dev_t *dev)
>     *dev = MKDEV(disk->major, disk->first_minor);
>     bdput(bdev);
> 
> +    c = strrchr(buf, '/');
> +    if (c)
> +        strcpy(c+1, disk->disk_name);
> +
>     return 0;
> }
> 

-- 
Regards
Gui Jianfeng

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-18 22:33       ` IKEDA, Munehiro
@ 2009-05-20  1:44         ` Gui Jianfeng
       [not found]           ` <4A136090.5090705-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
       [not found]         ` <4A11E244.2000305-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
  1 sibling, 1 reply; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-20  1:44 UTC (permalink / raw)
  To: IKEDA, Munehiro, Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, akpm

IKEDA, Munehiro wrote:
> Hi Gui,
> 
> Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch enables per-cgroup per-device weight and ioprio_class
>> handling.
>> A new cgroup interface "policy" is introduced. You can make use of
>> this file to configure weight and ioprio_class for each device in a
>> given cgroup.
>> The original "weight" and "ioprio_class" files are still available. If
>> you
>> don't do special configuration for a particular device, "weight" and
>> "ioprio_class" are used as default values in this device.
>>
>> You can use the following format to play with the new interface.
>> #echo DEV:weight:ioprio_class > /patch/to/cgroup/policy
>> weight=0 means removing the policy for DEV.
>>
>> Examples:
>> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
>> # echo /dev/hdb:300:2 > io.policy
>> # cat io.policy
>> dev weight class
>> /dev/hdb 300 2
> 
> Users can specify a device file of a partition for io.policy.
> In this case, io_policy_node::dev_name is set as a name of the
> partition device like /dev/sda2.
> 
> ex)
>  # cd /mnt/cgroup
>  # echo /dev/sda2:500:2 > io.policy
>  # cat io.policy
>    dev weight class
>    /dev/sda2 500 2
> 
> I believe io_policy_node::dev_name should be set a generic
> device name like /dev/sda.
> What do you think about it?

  Hi Ikeda-san,

  Sorry for the late reply. Thanks for pointing this out. 
  Yes, it does the right thing but shows a wrong name.
  IMHO, inputting a single partition should not be allowed since the
  policy is disk based. So how about the following patch?

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 1a0ca07..b620768 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1650,6 +1650,9 @@ static int devname_to_devnum(const char *buf, dev_t *dev)
 		return -ENODEV;
 
 	disk = get_gendisk(bdev->bd_dev, &part);
+	if (part)
+		return -EINVAL;
+
 	*dev = MKDEV(disk->major, disk->first_minor);
 	bdput(bdev);

> 
> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda@ds.jp.nec.com>
> ---
> block/elevator-fq.c |    7 ++++++-
> 1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 39fa2a1..5d3d55c 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1631,11 +1631,12 @@ static struct io_policy_node
> *policy_search_node(const struct io_cgroup *iocg,
>     return NULL;
> }
> 
> -static int devname_to_devnum(const char *buf, dev_t *dev)
> +static int devname_to_devnum(char *buf, dev_t *dev)
> {
>     struct block_device *bdev;
>     struct gendisk *disk;
>     int part;
> +    char *c;
> 
>     bdev = lookup_bdev(buf);
>     if (IS_ERR(bdev))
> @@ -1645,6 +1646,10 @@ static int devname_to_devnum(const char *buf,
> dev_t *dev)
>     *dev = MKDEV(disk->major, disk->first_minor);
>     bdput(bdev);
> 
> +    c = strrchr(buf, '/');
> +    if (c)
> +        strcpy(c+1, disk->disk_name);
> +
>     return 0;
> }
> 

-- 
Regards
Gui Jianfeng


^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
  2009-05-20  1:44         ` Gui Jianfeng
@ 2009-05-20 15:41               ` IKEDA, Munehiro
  0 siblings, 0 replies; 297+ messages in thread
From: IKEDA, Munehiro @ 2009-05-20 15:41 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Gui Jianfeng wrote:
> IKEDA, Munehiro wrote:
>> Hi Gui,
>>
>> Gui Jianfeng wrote:
>>> Hi Vivek,
>>>
>>> This patch enables per-cgroup per-device weight and ioprio_class
>>> handling.
>>> A new cgroup interface "policy" is introduced. You can make use of
>>> this file to configure weight and ioprio_class for each device in a
>>> given cgroup.
>>> The original "weight" and "ioprio_class" files are still available. If
>>> you
>>> don't do special configuration for a particular device, "weight" and
>>> "ioprio_class" are used as default values in this device.
>>>
>>> You can use the following format to play with the new interface.
>>> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
>>> weight=0 means removing the policy for DEV.
>>>
>>> Examples:
>>> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
>>> # echo /dev/hdb:300:2 > io.policy
>>> # cat io.policy
>>> dev weight class
>>> /dev/hdb 300 2
>> Users can specify the device file of a partition for io.policy.
>> In this case, io_policy_node::dev_name is set to the name of the
>> partition device, like /dev/sda2.
>>
>> ex)
>>  # cd /mnt/cgroup
>>  # echo /dev/sda2:500:2 > io.policy
>>  # cat io.policy
>>    dev weight class
>>    /dev/sda2 500 2
>>
>> I believe io_policy_node::dev_name should be set to a generic
>> device name like /dev/sda.
>> What do you think about it?
> 
>   Hi Ikeda-san,
> 
>   Sorry for the late reply, and thanks for pointing this out.
>   Yes, it does the right thing but shows the wrong name.
>   IMHO, specifying a single partition should not be allowed, since the
>   policy is applied on a per-disk basis. So how about the following patch?
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 1a0ca07..b620768 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1650,6 +1650,9 @@ static int devname_to_devnum(const char *buf, dev_t *dev)
>  		return -ENODEV;
>  
>  	disk = get_gendisk(bdev->bd_dev, &part);
> +	if (part)
> +		return -EINVAL;
> +
>  	*dev = MKDEV(disk->major, disk->first_minor);
>  	bdput(bdev);
> 

It looks nicer and more reasonable to me.
Thanks!



-- 
IKEDA, Munehiro
 NEC Corporation of America
   m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] IO Controller: Add per-device weight and ioprio_class handling
@ 2009-05-20 15:41               ` IKEDA, Munehiro
  0 siblings, 0 replies; 297+ messages in thread
From: IKEDA, Munehiro @ 2009-05-20 15:41 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: Vivek Goyal, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, ryov, fernando, s-uchida, taka,
	jmoyer, dhaval, balbir, linux-kernel, containers, righi.andrea,
	agk, dm-devel, snitzer, akpm

Gui Jianfeng wrote:
> IKEDA, Munehiro wrote:
>> Hi Gui,
>>
>> Gui Jianfeng wrote:
>>> Hi Vivek,
>>>
>>> This patch enables per-cgroup per-device weight and ioprio_class
>>> handling.
>>> A new cgroup interface "policy" is introduced. You can make use of
>>> this file to configure weight and ioprio_class for each device in a
>>> given cgroup.
>>> The original "weight" and "ioprio_class" files are still available. If
>>> you
>>> don't do special configuration for a particular device, "weight" and
>>> "ioprio_class" are used as default values in this device.
>>>
>>> You can use the following format to play with the new interface.
>>> #echo DEV:weight:ioprio_class > /path/to/cgroup/policy
>>> weight=0 means removing the policy for DEV.
>>>
>>> Examples:
>>> Configure weight=300 ioprio_class=2 on /dev/hdb in this cgroup
>>> # echo /dev/hdb:300:2 > io.policy
>>> # cat io.policy
>>> dev weight class
>>> /dev/hdb 300 2
>> Users can specify the device file of a partition for io.policy.
>> In this case, io_policy_node::dev_name is set to the name of the
>> partition device, like /dev/sda2.
>>
>> ex)
>>  # cd /mnt/cgroup
>>  # echo /dev/sda2:500:2 > io.policy
>>  # cat io.policy
>>    dev weight class
>>    /dev/sda2 500 2
>>
>> I believe io_policy_node::dev_name should be set to a generic
>> device name like /dev/sda.
>> What do you think about it?
> 
>   Hi Ikeda-san,
> 
>   Sorry for the late reply, and thanks for pointing this out.
>   Yes, it does the right thing but shows the wrong name.
>   IMHO, specifying a single partition should not be allowed, since the
>   policy is applied on a per-disk basis. So how about the following patch?
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 1a0ca07..b620768 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1650,6 +1650,9 @@ static int devname_to_devnum(const char *buf, dev_t *dev)
>  		return -ENODEV;
>  
>  	disk = get_gendisk(bdev->bd_dev, &part);
> +	if (part)
> +		return -EINVAL;
> +
>  	*dev = MKDEV(disk->major, disk->first_minor);
>  	bdput(bdev);
> 

It looks nicer and more reasonable to me.
Thanks!



-- 
IKEDA, Munehiro
 NEC Corporation of America
   m-ikeda@ds.jp.nec.com



^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 02/18] io-controller: Common flat fair queuing code in elevator layer
       [not found]     ` <1241553525-28095-3-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-22  6:43       ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-22  6:43 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
...
> +/* A request got completed from io_queue. Do the accounting. */
> +void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> +{
> +	const int sync = rq_is_sync(rq);
> +	struct io_queue *ioq = rq->ioq;
> +	struct elv_fq_data *efqd = &q->elevator->efqd;
> +
> +	if (!elv_iosched_fair_queuing_enabled(q->elevator))
> +		return;
> +
> +	elv_log_ioq(efqd, ioq, "complete");
> +
> +	elv_update_hw_tag(efqd);
> +
> +	WARN_ON(!efqd->rq_in_driver);
> +	WARN_ON(!ioq->dispatched);
> +	efqd->rq_in_driver--;
> +	ioq->dispatched--;
> +
> +	if (sync)
> +		ioq->last_end_request = jiffies;
> +
> +	/*
> +	 * If this is the active queue, check if it needs to be expired,
> +	 * or if we want to idle in case it has no pending requests.
> +	 */
> +
> +	if (elv_active_ioq(q->elevator) == ioq) {
> +		if (elv_ioq_slice_new(ioq)) {
> +			elv_ioq_set_prio_slice(q, ioq);

  Hi Vivek,

  Would you explain a bit why slice_end should be set when the first request completes?
  Why not set it as soon as an ioq becomes active?
  
  Thanks.
  Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 02/18] io-controller: Common flat fair queuing code in elevator layer
  2009-05-05 19:58     ` Vivek Goyal
  (?)
@ 2009-05-22  6:43     ` Gui Jianfeng
  2009-05-22 12:32       ` Vivek Goyal
       [not found]       ` <4A164978.1020604-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  -1 siblings, 2 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-22  6:43 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
...
> +/* A request got completed from io_queue. Do the accounting. */
> +void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> +{
> +	const int sync = rq_is_sync(rq);
> +	struct io_queue *ioq = rq->ioq;
> +	struct elv_fq_data *efqd = &q->elevator->efqd;
> +
> +	if (!elv_iosched_fair_queuing_enabled(q->elevator))
> +		return;
> +
> +	elv_log_ioq(efqd, ioq, "complete");
> +
> +	elv_update_hw_tag(efqd);
> +
> +	WARN_ON(!efqd->rq_in_driver);
> +	WARN_ON(!ioq->dispatched);
> +	efqd->rq_in_driver--;
> +	ioq->dispatched--;
> +
> +	if (sync)
> +		ioq->last_end_request = jiffies;
> +
> +	/*
> +	 * If this is the active queue, check if it needs to be expired,
> +	 * or if we want to idle in case it has no pending requests.
> +	 */
> +
> +	if (elv_active_ioq(q->elevator) == ioq) {
> +		if (elv_ioq_slice_new(ioq)) {
> +			elv_ioq_set_prio_slice(q, ioq);

  Hi Vivek,

  Would you explain a bit why slice_end should be set when the first request completes?
  Why not set it as soon as an ioq becomes active?
  
  Thanks.
  Gui Jianfeng



^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
       [not found]   ` <1241553525-28095-5-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-22  8:54     ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-22  8:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Hi Vivek,

Since the thinking time logic is moving to the common layer, the
corresponding items in cic are not needed.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index ed52a1f..1fe9d78 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -42,10 +42,6 @@ struct cfq_io_context {
 	unsigned long last_end_request;
 	sector_t last_request_pos;
 
-	unsigned long ttime_total;
-	unsigned long ttime_samples;
-	unsigned long ttime_mean;
-
 	unsigned int seek_samples;
 	u64 seek_total;
 	sector_t seek_mean;

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
  2009-05-05 19:58 ` [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
       [not found]   ` <1241553525-28095-5-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-22  8:54   ` Gui Jianfeng
       [not found]     ` <4A166829.6070608-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-05-22 12:33     ` Vivek Goyal
  1 sibling, 2 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-05-22  8:54 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Hi Vivek,

Since the thinking time logic is moving to the common layer, the
corresponding items in cic are not needed.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index ed52a1f..1fe9d78 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -42,10 +42,6 @@ struct cfq_io_context {
 	unsigned long last_end_request;
 	sector_t last_request_pos;
 
-	unsigned long ttime_total;
-	unsigned long ttime_samples;
-	unsigned long ttime_mean;
-
 	unsigned int seek_samples;
 	u64 seek_total;
 	sector_t seek_mean;




^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH 02/18] io-controller: Common flat fair queuing code in elevator layer
       [not found]       ` <4A164978.1020604-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-22 12:32         ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-22 12:32 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, May 22, 2009 at 02:43:04PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > +/* A request got completed from io_queue. Do the accounting. */
> > +void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> > +{
> > +	const int sync = rq_is_sync(rq);
> > +	struct io_queue *ioq = rq->ioq;
> > +	struct elv_fq_data *efqd = &q->elevator->efqd;
> > +
> > +	if (!elv_iosched_fair_queuing_enabled(q->elevator))
> > +		return;
> > +
> > +	elv_log_ioq(efqd, ioq, "complete");
> > +
> > +	elv_update_hw_tag(efqd);
> > +
> > +	WARN_ON(!efqd->rq_in_driver);
> > +	WARN_ON(!ioq->dispatched);
> > +	efqd->rq_in_driver--;
> > +	ioq->dispatched--;
> > +
> > +	if (sync)
> > +		ioq->last_end_request = jiffies;
> > +
> > +	/*
> > +	 * If this is the active queue, check if it needs to be expired,
> > +	 * or if we want to idle in case it has no pending requests.
> > +	 */
> > +
> > +	if (elv_active_ioq(q->elevator) == ioq) {
> > +		if (elv_ioq_slice_new(ioq)) {
> > +			elv_ioq_set_prio_slice(q, ioq);
> 
>   Hi Vivek,
> 
>   Would you explain a bit why slice_end should be set when the first request completes?
>   Why not set it as soon as an ioq becomes active?
>   

Hi Gui,

I have kept the behavior the same as CFQ. I guess the reason behind this is
that when a new queue is scheduled in, the first request completion might
take more time, as the head of the disk might be quite a distance away (due
to the previous queue), and one probably does not want to charge the new
queue for that first seek time. That's the reason we start the queue slice
when the first request has completed.
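
As a rough, standalone illustration of that accounting point (the 100 ms
slice and 8 ms initial delay below are assumed numbers for the sake of
example, not measurements from this patch set):

#include <stdio.h>

int main(void)
{
        const unsigned int slice_ms = 100;     /* assumed slice length */
        const unsigned int first_delay_ms = 8; /* assumed seek/writeout delay
                                                  before the new queue's first
                                                  request completes */

        /* if the slice clock started at activation, that initial delay
         * would be billed to the new queue */
        printf("slice started at activation:       %u ms of useful service\n",
               slice_ms - first_delay_ms);
        printf("slice started at first completion: %u ms of useful service\n",
               slice_ms);
        return 0;
}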

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 02/18] io-controller: Common flat fair queuing code in elevator layer
  2009-05-22  6:43     ` Gui Jianfeng
@ 2009-05-22 12:32       ` Vivek Goyal
  2009-05-23 20:04         ` Jens Axboe
       [not found]         ` <20090522123231.GA14972-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
       [not found]       ` <4A164978.1020604-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  1 sibling, 2 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-22 12:32 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Fri, May 22, 2009 at 02:43:04PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > +/* A request got completed from io_queue. Do the accounting. */
> > +void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> > +{
> > +	const int sync = rq_is_sync(rq);
> > +	struct io_queue *ioq = rq->ioq;
> > +	struct elv_fq_data *efqd = &q->elevator->efqd;
> > +
> > +	if (!elv_iosched_fair_queuing_enabled(q->elevator))
> > +		return;
> > +
> > +	elv_log_ioq(efqd, ioq, "complete");
> > +
> > +	elv_update_hw_tag(efqd);
> > +
> > +	WARN_ON(!efqd->rq_in_driver);
> > +	WARN_ON(!ioq->dispatched);
> > +	efqd->rq_in_driver--;
> > +	ioq->dispatched--;
> > +
> > +	if (sync)
> > +		ioq->last_end_request = jiffies;
> > +
> > +	/*
> > +	 * If this is the active queue, check if it needs to be expired,
> > +	 * or if we want to idle in case it has no pending requests.
> > +	 */
> > +
> > +	if (elv_active_ioq(q->elevator) == ioq) {
> > +		if (elv_ioq_slice_new(ioq)) {
> > +			elv_ioq_set_prio_slice(q, ioq);
> 
>   Hi Vivek,
> 
>   Would you explain a bit why slice_end should be set when the first request completes?
>   Why not set it as soon as an ioq becomes active?
>   

Hi Gui,

I have kept the behavior the same as CFQ. I guess the reason behind this is
that when a new queue is scheduled in, the first request completion might
take more time, as the head of the disk might be quite a distance away (due
to the previous queue), and one probably does not want to charge the new
queue for that first seek time. That's the reason we start the queue slice
when the first request has completed.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
       [not found]     ` <4A166829.6070608-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-22 12:33       ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-22 12:33 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, May 22, 2009 at 04:54:01PM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> Since the thinking time logic is moving to the common layer, the
> corresponding items in cic are not needed.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
> index ed52a1f..1fe9d78 100644
> --- a/include/linux/iocontext.h
> +++ b/include/linux/iocontext.h
> @@ -42,10 +42,6 @@ struct cfq_io_context {
>  	unsigned long last_end_request;
>  	sector_t last_request_pos;
>  
> -	unsigned long ttime_total;
> -	unsigned long ttime_samples;
> -	unsigned long ttime_mean;
> -
>  	unsigned int seek_samples;
>  	u64 seek_total;
>  	sector_t seek_mean;
> 

Thanks Gui. Queued for next posting.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing
  2009-05-22  8:54   ` Gui Jianfeng
       [not found]     ` <4A166829.6070608-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-05-22 12:33     ` Vivek Goyal
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-22 12:33 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Fri, May 22, 2009 at 04:54:01PM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> Since the thinking time logic is moving to the common layer, the
> corresponding items in cic are not needed.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
> index ed52a1f..1fe9d78 100644
> --- a/include/linux/iocontext.h
> +++ b/include/linux/iocontext.h
> @@ -42,10 +42,6 @@ struct cfq_io_context {
>  	unsigned long last_end_request;
>  	sector_t last_request_pos;
>  
> -	unsigned long ttime_total;
> -	unsigned long ttime_samples;
> -	unsigned long ttime_mean;
> -
>  	unsigned int seek_samples;
>  	u64 seek_total;
>  	sector_t seek_mean;
> 

Thanks Gui. Queued for next posting.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 02/18] io-controller: Common flat fair queuing code in elevator layer
       [not found]         ` <20090522123231.GA14972-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-05-23 20:04           ` Jens Axboe
  0 siblings, 0 replies; 297+ messages in thread
From: Jens Axboe @ 2009-05-23 20:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, May 22 2009, Vivek Goyal wrote:
> On Fri, May 22, 2009 at 02:43:04PM +0800, Gui Jianfeng wrote:
> > Vivek Goyal wrote:
> > ...
> > > +/* A request got completed from io_queue. Do the accounting. */
> > > +void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> > > +{
> > > +	const int sync = rq_is_sync(rq);
> > > +	struct io_queue *ioq = rq->ioq;
> > > +	struct elv_fq_data *efqd = &q->elevator->efqd;
> > > +
> > > +	if (!elv_iosched_fair_queuing_enabled(q->elevator))
> > > +		return;
> > > +
> > > +	elv_log_ioq(efqd, ioq, "complete");
> > > +
> > > +	elv_update_hw_tag(efqd);
> > > +
> > > +	WARN_ON(!efqd->rq_in_driver);
> > > +	WARN_ON(!ioq->dispatched);
> > > +	efqd->rq_in_driver--;
> > > +	ioq->dispatched--;
> > > +
> > > +	if (sync)
> > > +		ioq->last_end_request = jiffies;
> > > +
> > > +	/*
> > > +	 * If this is the active queue, check if it needs to be expired,
> > > +	 * or if we want to idle in case it has no pending requests.
> > > +	 */
> > > +
> > > +	if (elv_active_ioq(q->elevator) == ioq) {
> > > +		if (elv_ioq_slice_new(ioq)) {
> > > +			elv_ioq_set_prio_slice(q, ioq);
> > 
> >   Hi Vivek,
> > 
> >   Would you explain a bit why slice_end should be set when the first request completes?
> >   Why not set it as soon as an ioq becomes active?
> >   
> 
> Hi Gui,
> 
> I have kept the behavior the same as CFQ. I guess the reason behind this is
> that when a new queue is scheduled in, the first request completion might
> take more time, as the head of the disk might be quite a distance away (due
> to the previous queue), and one probably does not want to charge the new
> queue for that first seek time. That's the reason we start the queue slice
> when the first request has completed.

That's exactly why CFQ does it that way. And not just for the seek
itself, but if you have, e.g., writes issued before the switch to a new queue,
it's not fair to charge the potential cache writeout happening ahead of
the read to that new queue. So I'd definitely recommend keeping this
behaviour, as you have.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 02/18] io-controller: Common flat fair queuing code in elevator layer
  2009-05-22 12:32       ` Vivek Goyal
@ 2009-05-23 20:04         ` Jens Axboe
       [not found]         ` <20090522123231.GA14972-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 297+ messages in thread
From: Jens Axboe @ 2009-05-23 20:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Gui Jianfeng, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Fri, May 22 2009, Vivek Goyal wrote:
> On Fri, May 22, 2009 at 02:43:04PM +0800, Gui Jianfeng wrote:
> > Vivek Goyal wrote:
> > ...
> > > +/* A request got completed from io_queue. Do the accounting. */
> > > +void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> > > +{
> > > +	const int sync = rq_is_sync(rq);
> > > +	struct io_queue *ioq = rq->ioq;
> > > +	struct elv_fq_data *efqd = &q->elevator->efqd;
> > > +
> > > +	if (!elv_iosched_fair_queuing_enabled(q->elevator))
> > > +		return;
> > > +
> > > +	elv_log_ioq(efqd, ioq, "complete");
> > > +
> > > +	elv_update_hw_tag(efqd);
> > > +
> > > +	WARN_ON(!efqd->rq_in_driver);
> > > +	WARN_ON(!ioq->dispatched);
> > > +	efqd->rq_in_driver--;
> > > +	ioq->dispatched--;
> > > +
> > > +	if (sync)
> > > +		ioq->last_end_request = jiffies;
> > > +
> > > +	/*
> > > +	 * If this is the active queue, check if it needs to be expired,
> > > +	 * or if we want to idle in case it has no pending requests.
> > > +	 */
> > > +
> > > +	if (elv_active_ioq(q->elevator) == ioq) {
> > > +		if (elv_ioq_slice_new(ioq)) {
> > > +			elv_ioq_set_prio_slice(q, ioq);
> > 
> >   Hi Vivek,
> > 
> >   Would you explain a bit why slice_end should be set when the first request completes?
> >   Why not set it as soon as an ioq becomes active?
> >   
> 
> Hi Gui,
> 
> I have kept the behavior the same as CFQ. I guess the reason behind this is
> that when a new queue is scheduled in, the first request completion might
> take more time, as the head of the disk might be quite a distance away (due
> to the previous queue), and one probably does not want to charge the new
> queue for that first seek time. That's the reason we start the queue slice
> when the first request has completed.

That's exactly why CFQ does it that way. And not just for the seek
itself, but if you have, e.g., writes issued before the switch to a new queue,
it's not fair to charge the potential cache writeout happening ahead of
the read to that new queue. So I'd definitely recommend keeping this
behaviour, as you have.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-18 14:39                         ` Andrea Righi
@ 2009-05-26 11:34                           ` Ryo Tsuruta
  2009-05-26 11:34                           ` Ryo Tsuruta
  1 sibling, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-26 11:34 UTC (permalink / raw)
  To: righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Andrea and Vivek,

From: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Subject: Re: [PATCH] io-controller: Add io group reference handling for request
Date: Mon, 18 May 2009 16:39:23 +0200

> On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
> > On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
> > > On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> > > > On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > > > > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > > > > Vivek Goyal wrote:
> > > > > > ...
> > > > > > >  }
> > > > > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > > > > >  /*
> > > > > > >   * Find the io group bio belongs to.
> > > > > > >   * If "create" is set, io group is created if it is not already present.
> > > > > > > + * If "curr" is set, io group information is searched for the current
> > > > > > > + * task and not with the help of bio.
> > > > > > > + *
> > > > > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > > > > + * task and not create extra function parameter ?
> > > > > > >   *
> > > > > > > - * Note: There is a narrow window of race where a group is being freed
> > > > > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > > > > - * Fix it.
> > > > > > >   */
> > > > > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > > > > -					int create)
> > > > > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > > > > +					int create, int curr)
> > > > > > 
> > > > > >   Hi Vivek,
> > > > > > 
> > > > > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > > > > >   get iog from bio, otherwise get it from current task.
> > > > > 
> > > > > Consider also that get_cgroup_from_bio() is much more slow than
> > > > > task_cgroup() and need to lock/unlock_page_cgroup() in
> > > > > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > > > > 
> > > > 
> > > > True.
> > > > 
> > > > > BTW another optimization could be to use the blkio-cgroup functionality
> > > > > only for dirty pages and cut out some blkio_set_owner(). For all the
> > > > > other cases IO always occurs in the same context of the current task,
> > > > > and you can use task_cgroup().
> > > > > 
> > > > 
> > > > Yes, may be in some cases we can avoid setting page owner. I will get
> > > > to it once I have got functionality going well. In the mean time if
> > > > you have a patch for it, it will be great.
> > > > 
> > > > > However, this is true only for page cache pages, for IO generated by
> > > > > anonymous pages (swap) you still need the page tracking functionality
> > > > > both for reads and writes.
> > > > > 
> > > > 
> > > > Right now I am assuming that all the sync IO will belong to task
> > > > submitting the bio hence use task_cgroup() for that. Only for async
> > > > IO, I am trying to use page tracking functionality to determine the owner.
> > > > Look at elv_bio_sync(bio).
> > > > 
> > > > You seem to be saying that there are cases where even for sync IO, we
> > > > can't use submitting task's context and need to rely on page tracking
> > > > functionality?

I think that there are some kernel threads (e.g., dm-crypt, LVM and md
devices) which actually submit IOs instead of the tasks which originate the
IOs. When IOs are submitted from such kernel threads, we can't use the
submitting task's context to determine to which cgroup the IO belongs.

> > > > In case of getting page (read) from swap, will it not happen
> > > > in the context of process who will take a page fault and initiate the
> > > > swap read?
> > > 
> > > No, for example in read_swap_cache_async():
> > > 
> > > @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > >  		 */
> > >  		__set_page_locked(new_page);
> > >  		SetPageSwapBacked(new_page);
> > > +		blkio_cgroup_set_owner(new_page, current->mm);
> > >  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> > >  		if (likely(!err)) {
> > >  			/*
> > > 
> > > This is a read, but the current task is not always the owner of this
> > > swap cache page, because it's a readahead operation.
> > > 
> > 
> > But will this readahead be not initiated in the context of the task taking
> > the page fault?
> > 
> > handle_pte_fault()
> > 	do_swap_page()
> > 		swapin_readahead()
> > 			read_swap_cache_async()
> > 
> > If yes, then swap reads issued will still be in the context of process and
> > we should be fine?
> 
> Right. I was trying to say that the current task may swap-in also pages
> belonging to a different task, so from a certain point of view it's not
> so fair to charge the current task for the whole activity. But ok, I
> think it's a minor issue.
> 
> > 
> > > Anyway, this is a minor corner case I think. And probably it is safe to
> > > consider this like any other read IO and get rid of the
> > > blkio_cgroup_set_owner().
> > 
> > Agreed.
> > 
> > > 
> > > I wonder if it would be better to attach the blkio_cgroup to the
> > > anonymous page only when swap-out occurs.
> > 
> > Swap seems to be an interesting case in general. Somebody raised this
> > question on lwn io controller article also. A user process never asked
> > for swap activity. It is something enforced by kernel. So while doing
> > some swap outs, it does not seem too fair to charge the write out to
> > the process page belongs to and the fact of the matter may be that there
> > is some other memory hungry application which is forcing these swap outs.
> > 
> > Keeping this in mind, should swap activity be considered as system
> > activity and be charged to root group instead of to user tasks in other
> > cgroups?
> 
> In this case I assume the swap-in activity should be charged to the root
> cgroup as well.
> 
> Anyway, in the logic of the memory and swap control it would seem
> reasonable to provide IO separation also for the swap IO activity.
> 
> In the MEMHOG example, it would be unfair if the memory pressure is
> caused by a task in another cgroup, but with memory and swap isolation a
> memory pressure condition can only be caused by a memory hog that runs
> in the same cgroup. From this point of view it seems more fair to
> consider the swap activity as the particular cgroup IO activity, instead
> of charging always the root cgroup.
> 
> Otherwise, I suspect, memory pressure would be a simple way to blow away
> any kind of QoS guarantees provided by the IO controller.
> 
> >   
> > > I mean, just put the
> > > blkio_cgroup_set_owner() hook in try_to_unmap() in order to keep track of
> > > the IO generated by direct reclaim of anon memory. For all the other
> > > cases we can simply use the submitting task's context.

I think that putting the hook only in try_to_unmap() doesn't work
correctly, because the IOs will be charged to the reclaiming processes or
to kswapd. These IOs should be charged to the processes which cause the
memory pressure.

> > > BTW, O_DIRECT is another case that is possible to optimize, because all
> > > the bios generated by direct IO occur in the same context of the current
> > > task.
> > 
> > Agreed about the direct IO optimization.
> > 
> > Ryo, what do you think? Would you like to include these optimizations
> > by Andrea in the next version of the IO tracking patches?
> >  
> > Thanks
> > Vivek
> 
> Thanks,
> -Andrea

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-18 14:39                         ` Andrea Righi
  2009-05-26 11:34                           ` Ryo Tsuruta
@ 2009-05-26 11:34                           ` Ryo Tsuruta
  2009-05-27  6:56                               ` Ryo Tsuruta
       [not found]                             ` <20090526.203424.39179999.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  1 sibling, 2 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-26 11:34 UTC (permalink / raw)
  To: righi.andrea
  Cc: vgoyal, guijianfeng, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, fernando, s-uchida, taka, jmoyer,
	dhaval, balbir, linux-kernel, containers, agk, dm-devel, snitzer,
	m-ikeda, akpm

Hi Andrea and Vivek,

From: Andrea Righi <righi.andrea@gmail.com>
Subject: Re: [PATCH] io-controller: Add io group reference handling for request
Date: Mon, 18 May 2009 16:39:23 +0200

> On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
> > On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
> > > On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> > > > On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > > > > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > > > > Vivek Goyal wrote:
> > > > > > ...
> > > > > > >  }
> > > > > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > > > > >  /*
> > > > > > >   * Find the io group bio belongs to.
> > > > > > >   * If "create" is set, io group is created if it is not already present.
> > > > > > > + * If "curr" is set, io group information is searched for the current
> > > > > > > + * task and not with the help of bio.
> > > > > > > + *
> > > > > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > > > > + * task and not create extra function parameter ?
> > > > > > >   *
> > > > > > > - * Note: There is a narrow window of race where a group is being freed
> > > > > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > > > > - * Fix it.
> > > > > > >   */
> > > > > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > > > > -					int create)
> > > > > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > > > > +					int create, int curr)
> > > > > > 
> > > > > >   Hi Vivek,
> > > > > > 
> > > > > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > > > > >   get iog from bio, otherwise get it from current task.
> > > > > 
> > > > > Consider also that get_cgroup_from_bio() is much more slow than
> > > > > task_cgroup() and need to lock/unlock_page_cgroup() in
> > > > > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > > > > 
> > > > 
> > > > True.
> > > > 
> > > > > BTW another optimization could be to use the blkio-cgroup functionality
> > > > > only for dirty pages and cut out some blkio_set_owner(). For all the
> > > > > other cases IO always occurs in the same context of the current task,
> > > > > and you can use task_cgroup().
> > > > > 
> > > > 
> > > > Yes, may be in some cases we can avoid setting page owner. I will get
> > > > to it once I have got functionality going well. In the mean time if
> > > > you have a patch for it, it will be great.
> > > > 
> > > > > However, this is true only for page cache pages, for IO generated by
> > > > > anonymous pages (swap) you still need the page tracking functionality
> > > > > both for reads and writes.
> > > > > 
> > > > 
> > > > Right now I am assuming that all the sync IO will belong to task
> > > > submitting the bio hence use task_cgroup() for that. Only for async
> > > > IO, I am trying to use page tracking functionality to determine the owner.
> > > > Look at elv_bio_sync(bio).
> > > > 
> > > > You seem to be saying that there are cases where even for sync IO, we
> > > > can't use submitting task's context and need to rely on page tracking
> > > > functionality?

I think that there are some kernel threads (e.g., dm-crypt, LVM and md
devices) which actually submit IOs instead of the tasks which originate the
IOs. When IOs are submitted from such kernel threads, we can't use the
submitting task's context to determine to which cgroup the IO belongs.

> > > > In case of getting page (read) from swap, will it not happen
> > > > in the context of process who will take a page fault and initiate the
> > > > swap read?
> > > 
> > > No, for example in read_swap_cache_async():
> > > 
> > > @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > >  		 */
> > >  		__set_page_locked(new_page);
> > >  		SetPageSwapBacked(new_page);
> > > +		blkio_cgroup_set_owner(new_page, current->mm);
> > >  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> > >  		if (likely(!err)) {
> > >  			/*
> > > 
> > > This is a read, but the current task is not always the owner of this
> > > swap cache page, because it's a readahead operation.
> > > 
> > 
> > But will this readahead be not initiated in the context of the task taking
> > the page fault?
> > 
> > handle_pte_fault()
> > 	do_swap_page()
> > 		swapin_readahead()
> > 			read_swap_cache_async()
> > 
> > If yes, then swap reads issued will still be in the context of process and
> > we should be fine?
> 
> Right. I was trying to say that the current task may swap-in also pages
> belonging to a different task, so from a certain point of view it's not
> so fair to charge the current task for the whole activity. But ok, I
> think it's a minor issue.
> 
> > 
> > > Anyway, this is a minor corner case I think. And probably it is safe to
> > > consider this like any other read IO and get rid of the
> > > blkio_cgroup_set_owner().
> > 
> > Agreed.
> > 
> > > 
> > > I wonder if it would be better to attach the blkio_cgroup to the
> > > anonymous page only when swap-out occurs.
> > 
> > Swap seems to be an interesting case in general. Somebody raised this
> > question on lwn io controller article also. A user process never asked
> > for swap activity. It is something enforced by kernel. So while doing
> > some swap outs, it does not seem too fair to charge the write out to
> > the process page belongs to and the fact of the matter may be that there
> > is some other memory hungry application which is forcing these swap outs.
> > 
> > Keeping this in mind, should swap activity be considered as system
> > activity and be charged to root group instead of to user tasks in other
> > cgroups?
> 
> In this case I assume the swap-in activity should be charged to the root
> cgroup as well.
> 
> Anyway, in the logic of the memory and swap control it would seem
> reasonable to provide IO separation also for the swap IO activity.
> 
> In the MEMHOG example, it would be unfair if the memory pressure is
> caused by a task in another cgroup, but with memory and swap isolation a
> memory pressure condition can only be caused by a memory hog that runs
> in the same cgroup. From this point of view it seems more fair to
> consider the swap activity as the particular cgroup IO activity, instead
> of charging always the root cgroup.
> 
> Otherwise, I suspect, memory pressure would be a simple way to blow away
> any kind of QoS guarantees provided by the IO controller.
> 
> >   
> > > I mean, just put the
> > > blkio_cgroup_set_owner() hook in try_to_unmap() in order to keep track of
> > > the IO generated by direct reclaim of anon memory. For all the other
> > > cases we can simply use the submitting task's context.

I think that putting the hook only in try_to_unmap() doesn't work
correctly, because the IOs will be charged to the reclaiming processes or
to kswapd. These IOs should be charged to the processes which cause the
memory pressure.

> > > BTW, O_DIRECT is another case that is possible to optimize, because all
> > > the bios generated by direct IO occur in the same context of the current
> > > task.
> > 
> > Agreed about the direct IO optimization.
> > 
> > Ryo, what do you think? Would you like to include these optimizations
> > by Andrea in the next version of the IO tracking patches?
> >  
> > Thanks
> > Vivek
> 
> Thanks,
> -Andrea

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]                             ` <20090526.203424.39179999.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-05-27  6:56                               ` Ryo Tsuruta
  0 siblings, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-27  6:56 UTC (permalink / raw)
  To: righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Andrea and Vivek,

Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> wrote:
> Hi Andrea and Vivek,
> 
> From: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Subject: Re: [PATCH] io-controller: Add io group reference handling for request
> Date: Mon, 18 May 2009 16:39:23 +0200
> 
> > On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
> > > On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
> > > > On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> > > > > On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > > > > > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > > > > > Vivek Goyal wrote:
> > > > > > > ...
> > > > > > > >  }
> > > > > > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > > > > > >  /*
> > > > > > > >   * Find the io group bio belongs to.
> > > > > > > >   * If "create" is set, io group is created if it is not already present.
> > > > > > > > + * If "curr" is set, io group information is searched for the current
> > > > > > > > + * task and not with the help of bio.
> > > > > > > > + *
> > > > > > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > > > > > + * task and not create extra function parameter ?
> > > > > > > >   *
> > > > > > > > - * Note: There is a narrow window of race where a group is being freed
> > > > > > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > > > > > - * Fix it.
> > > > > > > >   */
> > > > > > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > > > > > -					int create)
> > > > > > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > > > > > +					int create, int curr)
> > > > > > > 
> > > > > > >   Hi Vivek,
> > > > > > > 
> > > > > > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > > > > > >   get iog from bio, otherwise get it from current task.
> > > > > > 
> > > > > > Consider also that get_cgroup_from_bio() is much more slow than
> > > > > > task_cgroup() and need to lock/unlock_page_cgroup() in
> > > > > > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > > > > > 
> > > > > 
> > > > > True.
> > > > > 
> > > > > > BTW another optimization could be to use the blkio-cgroup functionality
> > > > > > only for dirty pages and cut out some blkio_set_owner(). For all the
> > > > > > other cases IO always occurs in the same context of the current task,
> > > > > > and you can use task_cgroup().
> > > > > > 
> > > > > 
> > > > > Yes, may be in some cases we can avoid setting page owner. I will get
> > > > > to it once I have got functionality going well. In the mean time if
> > > > > you have a patch for it, it will be great.
> > > > > 
> > > > > > However, this is true only for page cache pages, for IO generated by
> > > > > > anonymous pages (swap) you still need the page tracking functionality
> > > > > > both for reads and writes.
> > > > > > 
> > > > > 
> > > > > Right now I am assuming that all the sync IO will belong to task
> > > > > submitting the bio hence use task_cgroup() for that. Only for async
> > > > > IO, I am trying to use page tracking functionality to determine the owner.
> > > > > Look at elv_bio_sync(bio).
> > > > > 
> > > > > You seem to be saying that there are cases where even for sync IO, we
> > > > > can't use submitting task's context and need to rely on page tracking
> > > > > functionality?
> 
> I think that there are some kernel threads (e.g., dm-crypt, LVM and md
> devices) which actually submit IOs instead of the tasks which originate the
> IOs. When IOs are submitted from such kernel threads, we can't use the
> submitting task's context to determine to which cgroup the IO belongs.
> 
> > > > > In case of getting page (read) from swap, will it not happen
> > > > > in the context of process who will take a page fault and initiate the
> > > > > swap read?
> > > > 
> > > > No, for example in read_swap_cache_async():
> > > > 
> > > > @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > > >  		 */
> > > >  		__set_page_locked(new_page);
> > > >  		SetPageSwapBacked(new_page);
> > > > +		blkio_cgroup_set_owner(new_page, current->mm);
> > > >  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> > > >  		if (likely(!err)) {
> > > >  			/*
> > > > 
> > > > This is a read, but the current task is not always the owner of this
> > > > swap cache page, because it's a readahead operation.
> > > > 
> > > 
> > > But will this readahead be not initiated in the context of the task taking
> > > the page fault?
> > > 
> > > handle_pte_fault()
> > > 	do_swap_page()
> > > 		swapin_readahead()
> > > 			read_swap_cache_async()
> > > 
> > > If yes, then swap reads issued will still be in the context of process and
> > > we should be fine?
> > 
> > Right. I was trying to say that the current task may swap-in also pages
> > belonging to a different task, so from a certain point of view it's not
> > so fair to charge the current task for the whole activity. But ok, I
> > think it's a minor issue.
> > 
> > > 
> > > > Anyway, this is a minor corner case I think. And probably it is safe to
> > > > consider this like any other read IO and get rid of the
> > > > blkio_cgroup_set_owner().
> > > 
> > > Agreed.
> > > 
> > > > 
> > > > I wonder if it would be better to attach the blkio_cgroup to the
> > > > anonymous page only when swap-out occurs.
> > > 
> > > Swap seems to be an interesting case in general. Somebody raised this
> > > question on lwn io controller article also. A user process never asked
> > > for swap activity. It is something enforced by kernel. So while doing
> > > some swap outs, it does not seem too fair to charge the write out to
> > > the process page belongs to and the fact of the matter may be that there
> > > is some other memory hungry application which is forcing these swap outs.
> > > 
> > > Keeping this in mind, should swap activity be considered as system
> > > activity and be charged to root group instead of to user tasks in other
> > > cgroups?
> > 
> > In this case I assume the swap-in activity should be charged to the root
> > cgroup as well.
> > 
> > Anyway, in the logic of the memory and swap control it would seem
> > reasonable to provide IO separation also for the swap IO activity.
> > 
> > In the MEMHOG example, it would be unfair if the memory pressure is
> > caused by a task in another cgroup, but with memory and swap isolation a
> > memory pressure condition can only be caused by a memory hog that runs
> > in the same cgroup. From this point of view it seems more fair to
> > consider the swap activity as the particular cgroup IO activity, instead
> > of charging always the root cgroup.
> > 
> > Otherwise, I suspect, memory pressure would be a simple way to blow away
> > any kind of QoS guarantees provided by the IO controller.
> > 
> > >   
> > > > I mean, just put the
> > > > blkio_cgroup_set_owner() hook in try_to_unmap() in order to keep track of
> > > > the IO generated by direct reclaim of anon memory. For all the other
> > > > cases we can simply use the submitting task's context.
> 
> I think that putting the hook only in try_to_unmap() doesn't work
> correctly, because the IOs will be charged to the reclaiming processes or
> to kswapd. These IOs should be charged to the processes which cause the
> memory pressure.

Consider the following case:

  (1) There are two processes Proc-A and Proc-B.
  (2) Proc-A maps a large file into many pages by mmap() and writes
      many data to the file.
  (3) After (2), Proc-B tries to get a page, but there are no available
      pages because Proc-A has used them.
  (4) The kernel starts to reclaim pages and calls try_to_unmap() to
      unmap a page which is owned by Proc-A; then blkio_cgroup_set_owner()
      sets Proc-B's ID on the page because the task's context is Proc-B.
  (5) After (4), the kernel writes the page out to disk. This IO is
      charged to Proc-B.

In the above case, I think that the IO should be charged to Proc-A,
because the IO is caused by Proc-A's memory pressure.
I think we should consider the case without memory and swap
isolation.

Thanks,
Ryo Tsuruta

> > > > BTW, O_DIRECT is another case that is possible to optimize, because all
> > > > the bios generated by direct IO occur in the same context of the current
> > > > task.
> > > 
> > > Agreed about the direct IO optimization.
> > > 
> > > Ryo, what do you think? Would you like to include these optimizations
> > > by Andrea in the next version of the IO tracking patches?
> > >  
> > > Thanks
> > > Vivek
> > 
> > Thanks,
> > -Andrea
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-26 11:34                           ` Ryo Tsuruta
@ 2009-05-27  6:56                               ` Ryo Tsuruta
       [not found]                             ` <20090526.203424.39179999.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  1 sibling, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-27  6:56 UTC (permalink / raw)
  To: righi.andrea
  Cc: vgoyal, guijianfeng, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, jens.axboe, fernando, s-uchida, taka, jmoyer,
	dhaval, balbir, linux-kernel, containers, agk, dm-devel, snitzer,
	m-ikeda, akpm

Hi Andrea and Vivek,

Ryo Tsuruta <ryov@valinux.co.jp> wrote:
> Hi Andrea and Vivek,
> 
> From: Andrea Righi <righi.andrea@gmail.com>
> Subject: Re: [PATCH] io-controller: Add io group reference handling for request
> Date: Mon, 18 May 2009 16:39:23 +0200
> 
> > On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
> > > On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
> > > > On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> > > > > On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > > > > > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > > > > > Vivek Goyal wrote:
> > > > > > > ...
> > > > > > > >  }
> > > > > > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > > > > > >  /*
> > > > > > > >   * Find the io group bio belongs to.
> > > > > > > >   * If "create" is set, io group is created if it is not already present.
> > > > > > > > + * If "curr" is set, io group information is searched for the current
> > > > > > > > + * task and not with the help of bio.
> > > > > > > > + *
> > > > > > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > > > > > + * task and not create extra function parameter ?
> > > > > > > >   *
> > > > > > > > - * Note: There is a narrow window of race where a group is being freed
> > > > > > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > > > > > - * Fix it.
> > > > > > > >   */
> > > > > > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > > > > > -					int create)
> > > > > > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > > > > > +					int create, int curr)
> > > > > > > 
> > > > > > >   Hi Vivek,
> > > > > > > 
> > > > > > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > > > > > >   get iog from bio, otherwise get it from current task.
> > > > > > 
> > > > > > Consider also that get_cgroup_from_bio() is much more slow than
> > > > > > task_cgroup() and need to lock/unlock_page_cgroup() in
> > > > > > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > > > > > 
> > > > > 
> > > > > True.
> > > > > 
> > > > > > BTW another optimization could be to use the blkio-cgroup functionality
> > > > > > only for dirty pages and cut out some blkio_set_owner(). For all the
> > > > > > other cases IO always occurs in the same context of the current task,
> > > > > > and you can use task_cgroup().
> > > > > > 
> > > > > 
> > > > > Yes, may be in some cases we can avoid setting page owner. I will get
> > > > > to it once I have got functionality going well. In the mean time if
> > > > > you have a patch for it, it will be great.
> > > > > 
> > > > > > However, this is true only for page cache pages, for IO generated by
> > > > > > anonymous pages (swap) you still need the page tracking functionality
> > > > > > both for reads and writes.
> > > > > > 
> > > > > 
> > > > > Right now I am assuming that all the sync IO will belong to task
> > > > > submitting the bio hence use task_cgroup() for that. Only for async
> > > > > IO, I am trying to use page tracking functionality to determine the owner.
> > > > > Look at elv_bio_sync(bio).
> > > > > 
> > > > > You seem to be saying that there are cases where even for sync IO, we
> > > > > can't use submitting task's context and need to rely on page tracking
> > > > > functionality?
> 
> I think that there are some kernel threads (e.g., dm-crypt, LVM and md
> devices) which actually submit IOs instead of the tasks which originate the
> IOs. When IOs are submitted from such kernel threads, we can't use the
> submitting task's context to determine to which cgroup the IO belongs.
> 
> > > > > In case of getting page (read) from swap, will it not happen
> > > > > in the context of process who will take a page fault and initiate the
> > > > > swap read?
> > > > 
> > > > No, for example in read_swap_cache_async():
> > > > 
> > > > @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > > >  		 */
> > > >  		__set_page_locked(new_page);
> > > >  		SetPageSwapBacked(new_page);
> > > > +		blkio_cgroup_set_owner(new_page, current->mm);
> > > >  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> > > >  		if (likely(!err)) {
> > > >  			/*
> > > > 
> > > > This is a read, but the current task is not always the owner of this
> > > > swap cache page, because it's a readahead operation.
> > > > 
> > > 
> > > But will this readahead be not initiated in the context of the task taking
> > > the page fault?
> > > 
> > > handle_pte_fault()
> > > 	do_swap_page()
> > > 		swapin_readahead()
> > > 			read_swap_cache_async()
> > > 
> > > If yes, then swap reads issued will still be in the context of process and
> > > we should be fine?
> > 
> > Right. I was trying to say that the current task may swap-in also pages
> > belonging to a different task, so from a certain point of view it's not
> > so fair to charge the current task for the whole activity. But ok, I
> > think it's a minor issue.
> > 
> > > 
> > > > Anyway, this is a minor corner case I think. And probably it is safe to
> > > > consider this like any other read IO and get rid of the
> > > > blkio_cgroup_set_owner().
> > > 
> > > Agreed.
> > > 
> > > > 
> > > > I wonder if it would be better to attach the blkio_cgroup to the
> > > > anonymous page only when swap-out occurs.
> > > 
> > > Swap seems to be an interesting case in general. Somebody raised this
> > > question on lwn io controller article also. A user process never asked
> > > for swap activity. It is something enforced by kernel. So while doing
> > > some swap outs, it does not seem too fair to charge the write out to
> > > the process page belongs to and the fact of the matter may be that there
> > > is some other memory hungry application which is forcing these swap outs.
> > > 
> > > Keeping this in mind, should swap activity be considered as system
> > > activity and be charged to root group instead of to user tasks in other
> > > cgroups?
> > 
> > In this case I assume the swap-in activity should be charged to the root
> > cgroup as well.
> > 
> > Anyway, in the logic of the memory and swap control it would seem
> > reasonable to provide IO separation also for the swap IO activity.
> > 
> > In the MEMHOG example, it would be unfair if the memory pressure is
> > caused by a task in another cgroup, but with memory and swap isolation a
> > memory pressure condition can only be caused by a memory hog that runs
> > in the same cgroup. From this point of view it seems more fair to
> > consider the swap activity as the particular cgroup IO activity, instead
> > of charging always the root cgroup.
> > 
> > Otherwise, I suspect, memory pressure would be a simple way to blow away
> > any kind of QoS guarantees provided by the IO controller.
> > 
> > >   
> > > > I mean, just put the
> > > > blkio_cgroup_set_owner() hook in try_to_umap() in order to keep track of
> > > > the IO generated by direct reclaim of anon memory. For all the other
> > > > cases we can simply use the submitting task's context.
> 
> I think that only putting the hook in try_to_unmap() doesn't work
> correctly, because IOs will be charged to reclaiming processes or
> kswapd. These IOs should be charged to processes which cause memory
> pressure.

Consider the following case:

  (1) There are two processes, Proc-A and Proc-B.
  (2) Proc-A maps a large file into many pages with mmap() and writes
      a lot of data to the file.
  (3) After (2), Proc-B tries to get a page, but there are no available
      pages because Proc-A has used them.
  (4) The kernel starts to reclaim pages and calls try_to_unmap() to
      unmap a page owned by Proc-A; blkio_cgroup_set_owner() then sets
      Proc-B's ID on the page because the task context is Proc-B.
  (5) After (4), the kernel writes the page out to disk. This IO is
      charged to Proc-B.

In the above case, I think that the IO should be charged to Proc-A,
because the IO is caused by Proc-A's memory pressure. I also think we
should consider the case where memory and swap isolation is not used.
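
To make the mischarge in steps (4) and (5) concrete, here is a tiny
userspace toy model (not kernel code; set_io_owner() merely stands in
for blkio_cgroup_set_owner(), which in the real patches runs in the
reclaiming task's context):

#include <stdio.h>

/* Toy model: the page "owner" is the task that dirtied it, while the
 * charge goes to whichever task is running when the owner tag is
 * (re)set during reclaim. */
struct toy_page {
        int owner_pid;    /* task that dirtied the page (Proc-A) */
        int charged_pid;  /* task the write IO will be billed to */
};

/* Stand-in for blkio_cgroup_set_owner(): records the calling task. */
static void set_io_owner(struct toy_page *pg, int current_pid)
{
        pg->charged_pid = current_pid;
}

int main(void)
{
        struct toy_page pg = { .owner_pid = 100, .charged_pid = -1 };
        int reclaimer_pid = 200;  /* Proc-B, which entered reclaim */

        /* Step (4): the hook runs in Proc-B's context at unmap time. */
        set_io_owner(&pg, reclaimer_pid);

        /* Step (5): the write-out is billed to the tagged task. */
        printf("page dirtied by pid %d, write IO charged to pid %d\n",
               pg.owner_pid, pg.charged_pid);
        return 0;
}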

Thanks,
Ryo Tsuruta

> > > > BTW, O_DIRECT is another case that is possible to optimize, because all
> > > > the bios generated by direct IO occur in the same context of the current
> > > > task.
> > > 
> > > Agreed about the direct IO optimization.
> > > 
> > > Ryo, what do you think? would you like to do include these optimizations
> > > by the Andrea in next version of IO tracking patches?
> > >  
> > > Thanks
> > > Vivek
> > 
> > Thanks,
> > -Andrea
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]                               ` <20090527.155631.226800550.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-05-27  8:17                                 ` Andrea Righi
  2009-05-27 17:32                                 ` Vivek Goyal
  1 sibling, 0 replies; 297+ messages in thread
From: Andrea Righi @ 2009-05-27  8:17 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, May 27, 2009 at 03:56:31PM +0900, Ryo Tsuruta wrote:
> > I think that only putting the hook in try_to_unmap() doesn't work
> > correctly, because IOs will be charged to reclaiming processes or
> > kswapd. These IOs should be charged to processes which cause memory
> > pressure.
> 
> Consider the following case:
> 
>   (1) There are two processes Proc-A and Proc-B.
>   (2) Proc-A maps a large file into many pages by mmap() and writes
>       many data to the file.
>   (3) After (2), Proc-B try to get a page, but there are no available
>       pages because Proc-A has used them.
>   (4) kernel starts to reclaim pages, call try_to_unmap() to unmap
>       a page which is owned by Proc-A, then blkio_cgroup_set_owner()
>       sets Proc-B's ID on the page because the task's context is Proc-B.
>   (5) After (4), kernel writes the page out to a disk. This IO is
>       charged to Proc-B.
> 
> In the above case, I think that the IO should be charged to a Proc-A,
> because the IO is caused by Proc-A's memory pressure. 
> I think we should consider in the case without memory and swap
> isolation.

mmmh.. even if they're strictly related, I think we're mixing two
different problems this way: memory pressure control and IO control.

It seems you're proposing something like badness() for OOM conditions,
charging swap IO depending on how bad a cgroup is in terms of memory
consumption. I don't think this is the right way to proceed, also
because we already have the memory and swap controllers.
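
Purely to illustrate the kind of badness()-style heuristic being
referred to here, a rough sketch follows. Nothing like this exists in
the patches and all names below are made up:

#include <stdio.h>

struct group_mem_stats {
        unsigned long usage;    /* pages currently used by the group */
        unsigned long limit;    /* group's memory limit, in pages    */
};

/* Fraction (in percent) of a swap-IO cost to bill to this group,
 * weighted by how much of its memory limit the group is using. */
static unsigned int swap_io_charge_pct(const struct group_mem_stats *g)
{
        if (!g->limit)
                return 0;
        return (unsigned int)((g->usage * 100UL) / g->limit);
}

int main(void)
{
        struct group_mem_stats g = { .usage = 750, .limit = 1000 };

        printf("charge %u%% of the swap IO to this group\n",
               swap_io_charge_pct(&g)); /* 75% */
        return 0;
}

This is exactly the kind of coupling between memory accounting and IO
accounting that the paragraph above argues against.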

-Andrea

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
  2009-05-27  8:17                                 ` Andrea Righi
@ 2009-05-27 11:53                                 ` Ryo Tsuruta
  -1 siblings, 0 replies; 297+ messages in thread
From: Ryo Tsuruta @ 2009-05-27 11:53 UTC (permalink / raw)
  To: righi.andrea-Re5JQEeQqe8AvxtiuMwx3w
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Wed, May 27, 2009 at 03:56:31PM +0900, Ryo Tsuruta wrote:
> > > I think that only putting the hook in try_to_unmap() doesn't work
> > > correctly, because IOs will be charged to reclaiming processes or
> > > kswapd. These IOs should be charged to processes which cause memory
> > > pressure.
> > 
> > Consider the following case:
> > 
> >   (1) There are two processes Proc-A and Proc-B.
> >   (2) Proc-A maps a large file into many pages by mmap() and writes
> >       many data to the file.
> >   (3) After (2), Proc-B try to get a page, but there are no available
> >       pages because Proc-A has used them.
> >   (4) kernel starts to reclaim pages, call try_to_unmap() to unmap
> >       a page which is owned by Proc-A, then blkio_cgroup_set_owner()
> >       sets Proc-B's ID on the page because the task's context is Proc-B.
> >   (5) After (4), kernel writes the page out to a disk. This IO is
> >       charged to Proc-B.
> > 
> > In the above case, I think that the IO should be charged to a Proc-A,
> > because the IO is caused by Proc-A's memory pressure. 
> > I think we should consider in the case without memory and swap
> > isolation.
> 
> mmmh.. even if they're strictly related I think we're mixing two
> different problems in this way: memory pressure control and IO control.
> 
> It seems you're proposing something like the badness() for OOM
> conditions to charge swap IO depending on how bad is a cgroup in terms
> of memory consumption. I don't think this is the right way to proceed,
> also because we already have the memory and swap control.

cgroups support multiple hierarchies, which allows tasks to be divided
differently in each hierarchy, like below:

                                 PIDs
   mem+swap /hier1/grp1/tasks <= 1, 2, 3, 4
   blkio    /hier2/grp2/tasks <= 1, 2
                   grp3/tasks <= 3, 4

Don't we need to consider this case?
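
For concreteness, the per-hierarchy membership of a task can be seen
from userspace by reading /proc/<pid>/cgroup. A minimal sketch, assuming
a setup like the table above:

#include <stdio.h>

/* Print which cgroup a task belongs to in every mounted hierarchy.
 * Each line of /proc/<pid>/cgroup is "hierarchy-id:subsystems:path",
 * so the same task can appear in different groups for mem+swap and
 * for blkio. */
int main(int argc, char **argv)
{
        char path[64], line[256];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%s/cgroup",
                 argc > 1 ? argv[1] : "self");
        f = fopen(path, "r");
        if (!f) {
                perror(path);
                return 1;
        }
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
        fclose(f);
        return 0;
}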

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH] io-controller: Add io group reference handling for request
       [not found]                               ` <20090527.155631.226800550.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2009-05-27  8:17                                 ` Andrea Righi
@ 2009-05-27 17:32                                 ` Vivek Goyal
  1 sibling, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-27 17:32 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, May 27, 2009 at 03:56:31PM +0900, Ryo Tsuruta wrote:
> Hi Andrea and Vivek,
> 
> Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> wrote:
> > Hi Andrea and Vivek,
> > 
> > From: Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> > Subject: Re: [PATCH] io-controller: Add io group reference handling for request
> > Date: Mon, 18 May 2009 16:39:23 +0200
> > 
> > > On Mon, May 18, 2009 at 10:01:14AM -0400, Vivek Goyal wrote:
> > > > On Sun, May 17, 2009 at 12:26:06PM +0200, Andrea Righi wrote:
> > > > > On Fri, May 15, 2009 at 10:06:43AM -0400, Vivek Goyal wrote:
> > > > > > On Fri, May 15, 2009 at 09:48:40AM +0200, Andrea Righi wrote:
> > > > > > > On Fri, May 15, 2009 at 01:15:24PM +0800, Gui Jianfeng wrote:
> > > > > > > > Vivek Goyal wrote:
> > > > > > > > ...
> > > > > > > > >  }
> > > > > > > > > @@ -1462,20 +1462,27 @@ struct io_cgroup *get_iocg_from_bio(stru
> > > > > > > > >  /*
> > > > > > > > >   * Find the io group bio belongs to.
> > > > > > > > >   * If "create" is set, io group is created if it is not already present.
> > > > > > > > > + * If "curr" is set, io group is information is searched for current
> > > > > > > > > + * task and not with the help of bio.
> > > > > > > > > + *
> > > > > > > > > + * FIXME: Can we assume that if bio is NULL then lookup group for current
> > > > > > > > > + * task and not create extra function parameter ?
> > > > > > > > >   *
> > > > > > > > > - * Note: There is a narrow window of race where a group is being freed
> > > > > > > > > - * by cgroup deletion path and some rq has slipped through in this group.
> > > > > > > > > - * Fix it.
> > > > > > > > >   */
> > > > > > > > > -struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > > > > > > > > -					int create)
> > > > > > > > > +struct io_group *io_get_io_group(struct request_queue *q, struct bio *bio,
> > > > > > > > > +					int create, int curr)
> > > > > > > > 
> > > > > > > >   Hi Vivek,
> > > > > > > > 
> > > > > > > >   IIUC we can get rid of curr, and just determine iog from bio. If bio is not NULL,
> > > > > > > >   get iog from bio, otherwise get it from current task.
> > > > > > > 
> > > > > > > Consider also that get_cgroup_from_bio() is much more slow than
> > > > > > > task_cgroup() and need to lock/unlock_page_cgroup() in
> > > > > > > get_blkio_cgroup_id(), while task_cgroup() is rcu protected.
> > > > > > > 
> > > > > > 
> > > > > > True.
> > > > > > 
> > > > > > > BTW another optimization could be to use the blkio-cgroup functionality
> > > > > > > only for dirty pages and cut out some blkio_set_owner(). For all the
> > > > > > > other cases IO always occurs in the same context of the current task,
> > > > > > > and you can use task_cgroup().
> > > > > > > 
> > > > > > 
> > > > > > Yes, may be in some cases we can avoid setting page owner. I will get
> > > > > > to it once I have got functionality going well. In the mean time if
> > > > > > you have a patch for it, it will be great.
> > > > > > 
> > > > > > > However, this is true only for page cache pages, for IO generated by
> > > > > > > anonymous pages (swap) you still need the page tracking functionality
> > > > > > > both for reads and writes.
> > > > > > > 
> > > > > > 
> > > > > > Right now I am assuming that all the sync IO will belong to task
> > > > > > submitting the bio hence use task_cgroup() for that. Only for async
> > > > > > IO, I am trying to use page tracking functionality to determine the owner.
> > > > > > Look at elv_bio_sync(bio).
> > > > > > 
> > > > > > You seem to be saying that there are cases where even for sync IO, we
> > > > > > can't use submitting task's context and need to rely on page tracking
> > > > > > functionlity? 
> > 
> > I think that there are some kernel threads (e.g., dm-crypt, LVM and md
> > devices) which actually submit IOs instead of tasks which originate the
> > IOs. When IOs are submitted from such kernel threads, we can't use
> > submitting task's context to determine to which cgroup the IO belongs.
> > 
> > > > > > In case of getting page (read) from swap, will it not happen
> > > > > > in the context of process who will take a page fault and initiate the
> > > > > > swap read?
> > > > > 
> > > > > No, for example in read_swap_cache_async():
> > > > > 
> > > > > @@ -308,6 +309,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > > > >  		 */
> > > > >  		__set_page_locked(new_page);
> > > > >  		SetPageSwapBacked(new_page);
> > > > > +		blkio_cgroup_set_owner(new_page, current->mm);
> > > > >  		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> > > > >  		if (likely(!err)) {
> > > > >  			/*
> > > > > 
> > > > > This is a read, but the current task is not always the owner of this
> > > > > swap cache page, because it's a readahead operation.
> > > > > 
> > > > 
> > > > But will this readahead be not initiated in the context of the task taking
> > > > the page fault?
> > > > 
> > > > handle_pte_fault()
> > > > 	do_swap_page()
> > > > 		swapin_readahead()
> > > > 			read_swap_cache_async()
> > > > 
> > > > If yes, then swap reads issued will still be in the context of process and
> > > > we should be fine?
> > > 
> > > Right. I was trying to say that the current task may swap-in also pages
> > > belonging to a different task, so from a certain point of view it's not
> > > so fair to charge the current task for the whole activity. But ok, I
> > > think it's a minor issue.
> > > 
> > > > 
> > > > > Anyway, this is a minor corner case I think. And probably it is safe to
> > > > > consider this like any other read IO and get rid of the
> > > > > blkio_cgroup_set_owner().
> > > > 
> > > > Agreed.
> > > > 
> > > > > 
> > > > > I wonder if it would be better to attach the blkio_cgroup to the
> > > > > anonymous page only when swap-out occurs.
> > > > 
> > > > Swap seems to be an interesting case in general. Somebody raised this
> > > > question on lwn io controller article also. A user process never asked
> > > > for swap activity. It is something enforced by kernel. So while doing
> > > > some swap outs, it does not seem too fair to charge the write out to
> > > > the process page belongs to and the fact of the matter may be that there
> > > > is some other memory hungry application which is forcing these swap outs.
> > > > 
> > > > Keeping this in mind, should swap activity be considered as system
> > > > activity and be charged to root group instead of to user tasks in other
> > > > cgroups?
> > > 
> > > In this case I assume the swap-in activity should be charged to the root
> > > cgroup as well.
> > > 
> > > Anyway, in the logic of the memory and swap control it would seem
> > > reasonable to provide IO separation also for the swap IO activity.
> > > 
> > > In the MEMHOG example, it would be unfair if the memory pressure is
> > > caused by a task in another cgroup, but with memory and swap isolation a
> > > memory pressure condition can only be caused by a memory hog that runs
> > > in the same cgroup. From this point of view it seems more fair to
> > > consider the swap activity as the particular cgroup IO activity, instead
> > > of charging always the root cgroup.
> > > 
> > > Otherwise, I suspect, memory pressure would be a simple way to blow away
> > > any kind of QoS guarantees provided by the IO controller.
> > > 
> > > >   
> > > > > I mean, just put the
> > > > > blkio_cgroup_set_owner() hook in try_to_umap() in order to keep track of
> > > > > the IO generated by direct reclaim of anon memory. For all the other
> > > > > cases we can simply use the submitting task's context.
> > 
> > I think that only putting the hook in try_to_unmap() doesn't work
> > correctly, because IOs will be charged to reclaiming processes or
> > kswapd. These IOs should be charged to processes which cause memory
> > pressure.
> 
> Consider the following case:
> 
>   (1) There are two processes Proc-A and Proc-B.
>   (2) Proc-A maps a large file into many pages by mmap() and writes
>       many data to the file.
>   (3) After (2), Proc-B try to get a page, but there are no available
>       pages because Proc-A has used them.
>   (4) kernel starts to reclaim pages, call try_to_unmap() to unmap
>       a page which is owned by Proc-A, then blkio_cgroup_set_owner()
>       sets Proc-B's ID on the page because the task's context is Proc-B.
>   (5) After (4), kernel writes the page out to a disk. This IO is
>       charged to Proc-B.
> 
> In the above case, I think that the IO should be charged to a Proc-A,
> because the IO is caused by Proc-A's memory pressure. 
> I think we should consider in the case without memory and swap
> isolation.
> 

But what happens if Proc-B is consuming lots of memory and then Proc-A
asks for one page of memory, and that triggers the memory reclaim? In
that case we are kind of penalizing Proc-A from the IO point of view
because some other process consumed lots of memory.

So it looks like if one mounts the mem+swap and io controllers on the
same hierarchy, then things would probably be fine, as swap IO generated
by either memory pressure or periodic reclaim by kswapd will be charged
to the right cgroup.

But if they are not mounted on the same hierarchy, then I guess it is
not too bad to charge the owner of the page for the swap IO. It is not
very accurate, but at the same time there does not seem to be an easy
way out?
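
As a sketch of what mounting the mem+swap and io controllers on the same
hierarchy means in practice (the controller names "memory" and "blkio"
are assumed here, the target directory must already exist, and this is
not part of the patches):

#include <stdio.h>
#include <sys/mount.h>

/* Co-mount the memory and block-IO controllers on one hierarchy, the C
 * equivalent of "mount -t cgroup -o memory,blkio cgroup /cgroup/memio",
 * so reclaim-driven swap IO is attributed to the same group by both
 * controllers. */
int main(void)
{
        if (mount("cgroup", "/cgroup/memio", "cgroup", 0, "memory,blkio")) {
                perror("mount cgroup");
                return 1;
        }
        return 0;
}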

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
       [not found]   ` <1241553525-28095-9-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-05-13 15:00     ` Vivek Goyal
@ 2009-06-09  7:56     ` Gui Jianfeng
  1 sibling, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-06-09  7:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
...
> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> +			  size_t count)
> +{
> +	struct elv_fq_data *efqd;
> +	unsigned int data;
> +	unsigned long flags;
> +
> +	char *p = (char *)name;
> +
> +	data = simple_strtoul(p, &p, 10);
> +
> +	if (data < 0)
> +		data = 0;
> +	else if (data > INT_MAX)
> +		data = INT_MAX;

  Hi Vivek,

  data might overflow on 64-bit systems. In addition, since "fairness" is nothing
  more than an on/off switch, just treat it as one.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 block/elevator-fq.c |   10 +++++-----
 block/elevator-fq.h |    2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 655162b..42d4279 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2605,7 +2605,7 @@ static inline int is_root_group_ioq(struct request_queue *q,
 ssize_t elv_fairness_show(struct request_queue *q, char *name)
 {
 	struct elv_fq_data *efqd;
-	unsigned int data;
+	unsigned long data;
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
@@ -2619,17 +2619,17 @@ ssize_t elv_fairness_store(struct request_queue *q, const char *name,
 			  size_t count)
 {
 	struct elv_fq_data *efqd;
-	unsigned int data;
+	unsigned long data;
 	unsigned long flags;
 
 	char *p = (char *)name;
 
 	data = simple_strtoul(p, &p, 10);
 
-	if (data < 0)
+	if (!data)
 		data = 0;
-	else if (data > INT_MAX)
-		data = INT_MAX;
+	else
+		data = 1;
 
 	spin_lock_irqsave(q->queue_lock, flags);
 	efqd = &q->elevator->efqd;
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index b2bb11a..4fe843a 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -359,7 +359,7 @@ struct elv_fq_data {
 	 * throughput and focus more on providing fairness for sync
 	 * queues.
 	 */
-	int fairness;
+	unsigned long fairness;
 };
 
 extern int elv_slice_idle;
-- 
1.5.4.rc3
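
As a small illustration of the overflow concern (not part of the patch):
simple_strtoul() returns an unsigned long, so on 64-bit the assignment to
an unsigned int silently truncates, the "data < 0" test can never be true
for an unsigned type, and the INT_MAX clamp can be bypassed. The input
value below is only an example.

/* Illustration only; kernel context, simple_strtoul() from <linux/kernel.h>. */
static void fairness_parse_example(void)
{
	unsigned int data;
	char buf[] = "4294967297";		/* 2^32 + 1 */
	char *p = buf;

	data = simple_strtoul(p, &p, 10);	/* truncated to 1 on 64-bit */

	if (data < 0)				/* always false: data is unsigned */
		data = 0;
	else if (data > INT_MAX)		/* not reached for this input */
		data = INT_MAX;

	/* data == 1 here, silently; treating "fairness" as a plain 0/1
	 * switch, as the patch above does, sidesteps all of this. */
}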

^ permalink raw reply related	[flat|nested] 297+ messages in thread

* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
       [not found]     ` <4A2E15B6.8030001-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-06-09 17:51       ` Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-06-09 17:51 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Tue, Jun 09, 2009 at 03:56:38PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> > +			  size_t count)
> > +{
> > +	struct elv_fq_data *efqd;
> > +	unsigned int data;
> > +	unsigned long flags;
> > +
> > +	char *p = (char *)name;
> > +
> > +	data = simple_strtoul(p, &p, 10);
> > +
> > +	if (data < 0)
> > +		data = 0;
> > +	else if (data > INT_MAX)
> > +		data = INT_MAX;
> 
>   Hi Vivek,
> 
>   data might overflow on 64-bit systems. In addition, since "fairness" is nothing
>   more than an on/off switch, just treat it as one.
> 
> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---

Hi Gui,

How about following patch? Currently this should apply at the end of the
patch series. If it looks good, I will merge the changes in higher level
patches.

Thanks
Vivek

o Previously the common layer elevator parameters were appearing as request
  queue parameters in sysfs. But these are actually io scheduler parameters
  in hierarchical mode. Fix it.

o Use macros to define the multiple sysfs C functions that do the same thing.
  Code borrowed from CFQ. Helps reduce the number of lines of code by 140.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/as-iosched.c       |    5 
 block/blk-sysfs.c        |   39 -------
 block/cfq-iosched.c      |    5 
 block/deadline-iosched.c |    5 
 block/elevator-fq.c      |  245 +++++++++++------------------------------------
 block/elevator-fq.h      |   26 ++--
 block/noop-iosched.c     |   10 +
 7 files changed, 97 insertions(+), 238 deletions(-)

Index: linux18/block/elevator-fq.h
===================================================================
--- linux18.orig/block/elevator-fq.h	2009-06-09 10:34:59.000000000 -0400
+++ linux18/block/elevator-fq.h	2009-06-09 13:35:03.000000000 -0400
@@ -27,6 +27,9 @@ struct io_queue;
 
 #ifdef CONFIG_ELV_FAIR_QUEUING
 
+#define ELV_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, elv_##name##_show, elv_##name##_store)
+
 /**
  * struct bfq_service_tree - per ioprio_class service tree.
  * @active: tree for active entities (i.e., those backlogged).
@@ -364,7 +367,7 @@ struct elv_fq_data {
 	 * throughput and focus more on providing fairness for sync
 	 * queues.
 	 */
-	int fairness;
+	unsigned int fairness;
 	int only_root_group;
 };
 
@@ -650,23 +653,22 @@ static inline struct io_queue *elv_looku
 
 #endif /* GROUP_IOSCHED */
 
-/* Functions used by blksysfs.c */
-extern ssize_t elv_slice_idle_show(struct request_queue *q, char *name);
-extern ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
+extern ssize_t elv_slice_idle_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_slice_idle_store(struct elevator_queue *e, const char *name,
 						size_t count);
-extern ssize_t elv_slice_sync_show(struct request_queue *q, char *name);
-extern ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
+extern ssize_t elv_slice_sync_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_slice_sync_store(struct elevator_queue *e, const char *name,
 						size_t count);
 
-extern ssize_t elv_async_slice_idle_show(struct request_queue *q, char *name);
-extern ssize_t elv_async_slice_idle_store(struct request_queue *q,
+extern ssize_t elv_async_slice_idle_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_async_slice_idle_store(struct elevator_queue *e,
 					const char *name, size_t count);
 
-extern ssize_t elv_slice_async_show(struct request_queue *q, char *name);
-extern ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
+extern ssize_t elv_slice_async_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_slice_async_store(struct elevator_queue *e, const char *name,
 						size_t count);
-extern ssize_t elv_fairness_show(struct request_queue *q, char *name);
-extern ssize_t elv_fairness_store(struct request_queue *q, const char *name,
+extern ssize_t elv_fairness_show(struct elevator_queue *e, char *name);
+extern ssize_t elv_fairness_store(struct elevator_queue *e, const char *name,
 						size_t count);
 
 /* Functions used by elevator.c */
Index: linux18/block/elevator-fq.c
===================================================================
--- linux18.orig/block/elevator-fq.c	2009-06-09 10:34:59.000000000 -0400
+++ linux18/block/elevator-fq.c	2009-06-09 13:39:48.000000000 -0400
@@ -2618,201 +2618,72 @@ static inline int is_root_group_ioq(stru
 	return (ioq->entity.sched_data == &efqd->root_group->sched_data);
 }
 
-/* Functions to show and store fairness value through sysfs */
-ssize_t elv_fairness_show(struct request_queue *q, char *name)
-{
-	struct elv_fq_data *efqd;
-	unsigned int data;
-	unsigned long flags;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	efqd = &q->elevator->efqd;
-	data = efqd->fairness;
-	spin_unlock_irqrestore(q->queue_lock, flags);
-	return sprintf(name, "%d\n", data);
-}
-
-ssize_t elv_fairness_store(struct request_queue *q, const char *name,
-			  size_t count)
-{
-	struct elv_fq_data *efqd;
-	unsigned int data;
-	unsigned long flags;
-
-	char *p = (char *)name;
-
-	data = simple_strtoul(p, &p, 10);
-
-	if (data < 0)
-		data = 0;
-	else if (data > INT_MAX)
-		data = INT_MAX;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	efqd = &q->elevator->efqd;
-	efqd->fairness = data;
-	spin_unlock_irqrestore(q->queue_lock, flags);
-
-	return count;
-}
-
-/* Functions to show and store elv_idle_slice value through sysfs */
-ssize_t elv_slice_idle_show(struct request_queue *q, char *name)
-{
-	struct elv_fq_data *efqd;
-	unsigned int data;
-	unsigned long flags;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	efqd = &q->elevator->efqd;
-	data = jiffies_to_msecs(efqd->elv_slice_idle);
-	spin_unlock_irqrestore(q->queue_lock, flags);
-	return sprintf(name, "%d\n", data);
-}
-
-ssize_t elv_slice_idle_store(struct request_queue *q, const char *name,
-			  size_t count)
-{
-	struct elv_fq_data *efqd;
-	unsigned int data;
-	unsigned long flags;
-
-	char *p = (char *)name;
-
-	data = simple_strtoul(p, &p, 10);
-
-	if (data < 0)
-		data = 0;
-	else if (data > INT_MAX)
-		data = INT_MAX;
-
-	data = msecs_to_jiffies(data);
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	efqd = &q->elevator->efqd;
-	efqd->elv_slice_idle = data;
-	spin_unlock_irqrestore(q->queue_lock, flags);
-
-	return count;
-}
-
-/* Functions to show and store elv_idle_slice value through sysfs */
-ssize_t elv_async_slice_idle_show(struct request_queue *q, char *name)
-{
-	struct elv_fq_data *efqd;
-	unsigned int data;
-	unsigned long flags;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	efqd = &q->elevator->efqd;
-	data = jiffies_to_msecs(efqd->elv_async_slice_idle);
-	spin_unlock_irqrestore(q->queue_lock, flags);
-	return sprintf(name, "%d\n", data);
-}
-
-ssize_t elv_async_slice_idle_store(struct request_queue *q, const char *name,
-			  size_t count)
-{
-	struct elv_fq_data *efqd;
-	unsigned int data;
-	unsigned long flags;
-
-	char *p = (char *)name;
-
-	data = simple_strtoul(p, &p, 10);
-
-	if (data < 0)
-		data = 0;
-	else if (data > INT_MAX)
-		data = INT_MAX;
-
-	data = msecs_to_jiffies(data);
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	efqd = &q->elevator->efqd;
-	efqd->elv_async_slice_idle = data;
-	spin_unlock_irqrestore(q->queue_lock, flags);
-
-	return count;
-}
-
-/* Functions to show and store elv_slice_sync value through sysfs */
-ssize_t elv_slice_sync_show(struct request_queue *q, char *name)
+/*
+ * sysfs parts below -->
+ */
+static ssize_t
+elv_var_show(unsigned int var, char *page)
 {
-	struct elv_fq_data *efqd;
-	unsigned int data;
-	unsigned long flags;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	efqd = &q->elevator->efqd;
-	data = efqd->elv_slice[1];
-	spin_unlock_irqrestore(q->queue_lock, flags);
-	return sprintf(name, "%d\n", data);
+	return sprintf(page, "%d\n", var);
 }
 
-ssize_t elv_slice_sync_store(struct request_queue *q, const char *name,
-			  size_t count)
+static ssize_t
+elv_var_store(unsigned int *var, const char *page, size_t count)
 {
-	struct elv_fq_data *efqd;
-	unsigned int data;
-	unsigned long flags;
-
-	char *p = (char *)name;
-
-	data = simple_strtoul(p, &p, 10);
-
-	if (data < 0)
-		data = 0;
-	/* 100ms is the limit for now*/
-	else if (data > 100)
-		data = 100;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	efqd = &q->elevator->efqd;
-	efqd->elv_slice[1] = data;
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	char *p = (char *) page;
 
+	*var = simple_strtoul(p, &p, 10);
 	return count;
 }
 
-/* Functions to show and store elv_slice_async value through sysfs */
-ssize_t elv_slice_async_show(struct request_queue *q, char *name)
-{
-	struct elv_fq_data *efqd;
-	unsigned int data;
-	unsigned long flags;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	efqd = &q->elevator->efqd;
-	data = efqd->elv_slice[0];
-	spin_unlock_irqrestore(q->queue_lock, flags);
-	return sprintf(name, "%d\n", data);
-}
-
-ssize_t elv_slice_async_store(struct request_queue *q, const char *name,
-			  size_t count)
-{
-	struct elv_fq_data *efqd;
-	unsigned int data;
-	unsigned long flags;
-
-	char *p = (char *)name;
-
-	data = simple_strtoul(p, &p, 10);
-
-	if (data < 0)
-		data = 0;
-	/* 100ms is the limit for now*/
-	else if (data > 100)
-		data = 100;
-
-	spin_lock_irqsave(q->queue_lock, flags);
-	efqd = &q->elevator->efqd;
-	efqd->elv_slice[0] = data;
-	spin_unlock_irqrestore(q->queue_lock, flags);
-
-	return count;
-}
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct elv_fq_data *efqd = &e->efqd;				\
+	unsigned int __data = __VAR;					\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return elv_var_show(__data, (page));				\
+}
+SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
+EXPORT_SYMBOL(elv_fairness_show);
+SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
+EXPORT_SYMBOL(elv_slice_idle_show);
+SHOW_FUNCTION(elv_async_slice_idle_show, efqd->elv_async_slice_idle, 1);
+EXPORT_SYMBOL(elv_async_slice_idle_show);
+SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
+EXPORT_SYMBOL(elv_slice_sync_show);
+SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
+EXPORT_SYMBOL(elv_slice_async_show);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
+{									\
+	struct elv_fq_data *efqd = &e->efqd;				\
+	unsigned int __data;						\
+	int ret = elv_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
+EXPORT_SYMBOL(elv_fairness_store);
+STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_idle_store);
+STORE_FUNCTION(elv_async_slice_idle_store, &efqd->elv_async_slice_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_async_slice_idle_store);
+STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_sync_store);
+STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_async_store);
+#undef STORE_FUNCTION
 
 void elv_schedule_dispatch(struct request_queue *q)
 {
Index: linux18/block/blk-sysfs.c
===================================================================
--- linux18.orig/block/blk-sysfs.c	2009-06-09 10:34:59.000000000 -0400
+++ linux18/block/blk-sysfs.c	2009-06-09 13:24:42.000000000 -0400
@@ -307,38 +307,6 @@ static struct queue_sysfs_entry queue_io
 	.store = queue_iostats_store,
 };
 
-#ifdef CONFIG_ELV_FAIR_QUEUING
-static struct queue_sysfs_entry queue_slice_idle_entry = {
-	.attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
-	.show = elv_slice_idle_show,
-	.store = elv_slice_idle_store,
-};
-
-static struct queue_sysfs_entry queue_async_slice_idle_entry = {
-	.attr = {.name = "async_slice_idle", .mode = S_IRUGO | S_IWUSR },
-	.show = elv_async_slice_idle_show,
-	.store = elv_async_slice_idle_store,
-};
-
-static struct queue_sysfs_entry queue_slice_sync_entry = {
-	.attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
-	.show = elv_slice_sync_show,
-	.store = elv_slice_sync_store,
-};
-
-static struct queue_sysfs_entry queue_slice_async_entry = {
-	.attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
-	.show = elv_slice_async_show,
-	.store = elv_slice_async_store,
-};
-
-static struct queue_sysfs_entry queue_fairness_entry = {
-	.attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
-	.show = elv_fairness_show,
-	.store = elv_fairness_store,
-};
-#endif
-
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 #ifdef CONFIG_GROUP_IOSCHED
@@ -353,13 +321,6 @@ static struct attribute *default_attrs[]
 	&queue_nomerges_entry.attr,
 	&queue_rq_affinity_entry.attr,
 	&queue_iostats_entry.attr,
-#ifdef CONFIG_ELV_FAIR_QUEUING
-	&queue_slice_idle_entry.attr,
-	&queue_async_slice_idle_entry.attr,
-	&queue_slice_sync_entry.attr,
-	&queue_slice_async_entry.attr,
-	&queue_fairness_entry.attr,
-#endif
 	NULL,
 };
 
Index: linux18/block/cfq-iosched.c
===================================================================
--- linux18.orig/block/cfq-iosched.c	2009-06-09 10:34:55.000000000 -0400
+++ linux18/block/cfq-iosched.c	2009-06-09 13:25:42.000000000 -0400
@@ -2095,6 +2095,11 @@ static struct elv_fs_entry cfq_attrs[] =
 	CFQ_ATTR(back_seek_max),
 	CFQ_ATTR(back_seek_penalty),
 	CFQ_ATTR(slice_async_rq),
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(async_slice_idle),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(slice_async),
 	__ATTR_NULL
 };
 
Index: linux18/block/as-iosched.c
===================================================================
--- linux18.orig/block/as-iosched.c	2009-06-09 10:34:58.000000000 -0400
+++ linux18/block/as-iosched.c	2009-06-09 13:27:38.000000000 -0400
@@ -1766,6 +1766,11 @@ static struct elv_fs_entry as_attrs[] = 
 	AS_ATTR(antic_expire),
 	AS_ATTR(read_batch_expire),
 	AS_ATTR(write_batch_expire),
+#ifdef CONFIG_IOSCHED_AS_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+#endif
 	__ATTR_NULL
 };
 
Index: linux18/block/deadline-iosched.c
===================================================================
--- linux18.orig/block/deadline-iosched.c	2009-06-09 10:34:55.000000000 -0400
+++ linux18/block/deadline-iosched.c	2009-06-09 13:28:51.000000000 -0400
@@ -460,6 +460,11 @@ static struct elv_fs_entry deadline_attr
 	DD_ATTR(writes_starved),
 	DD_ATTR(front_merges),
 	DD_ATTR(fifo_batch),
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+#endif
 	__ATTR_NULL
 };
 
Index: linux18/block/noop-iosched.c
===================================================================
--- linux18.orig/block/noop-iosched.c	2009-06-09 10:34:52.000000000 -0400
+++ linux18/block/noop-iosched.c	2009-06-09 13:31:48.000000000 -0400
@@ -82,6 +82,15 @@ static void noop_free_noop_queue(struct 
 	kfree(nq);
 }
 
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+static struct elv_fs_entry noop_attrs[] = {
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+	__ATTR_NULL
+};
+#endif
+
 static struct elevator_type elevator_noop = {
 	.ops = {
 		.elevator_merge_req_fn		= noop_merged_requests,
@@ -94,6 +103,7 @@ static struct elevator_type elevator_noo
 	},
 #ifdef CONFIG_IOSCHED_NOOP_HIER
 	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+	.elevator_attrs = noop_attrs,
 #endif
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
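
For readers skimming the macros, this is roughly what
SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0) in the patch above
expands to (hand-expanded here purely for illustration; the real code is
generated by the macro):

ssize_t elv_fairness_show(struct elevator_queue *e, char *page)
{
	struct elv_fq_data *efqd = &e->efqd;
	unsigned int __data = efqd->fairness;

	/* __CONV is 0 for "fairness", so no jiffies_to_msecs() conversion */
	return elv_var_show(__data, page);
}

With the attributes registered through elevator_attrs, they show up under
the io scheduler's own sysfs directory (e.g. queue/iosched/fairness)
instead of directly on the request queue, which is the point of the first
changelog item.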

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
       [not found]       ` <20090609175131.GB13476-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-06-10  1:30         ` Gui Jianfeng
  0 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-06-10  1:30 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Vivek Goyal wrote:
> On Tue, Jun 09, 2009 at 03:56:38PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>> ...
>>> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
>>> +			  size_t count)
>>> +{
>>> +	struct elv_fq_data *efqd;
>>> +	unsigned int data;
>>> +	unsigned long flags;
>>> +
>>> +	char *p = (char *)name;
>>> +
>>> +	data = simple_strtoul(p, &p, 10);
>>> +
>>> +	if (data < 0)
>>> +		data = 0;
>>> +	else if (data > INT_MAX)
>>> +		data = INT_MAX;
>>   Hi Vivek,
>>
>>   data might overflow on 64-bit systems. In addition, since "fairness" is nothing
>>   more than an on/off switch, just treat it as one.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> ---
> 
> Hi Gui,
> 
> How about following patch? Currently this should apply at the end of the
> patch series. If it looks good, I will merge the changes in higher level
> patches.

  This patch seems good to me. Some trivial issues comment below.

> 
> Thanks
> Vivek
> 
> o Previously common layer elevator parameters were appearing as request
>   queue parameters in sysfs. But actually these are io scheduler parameters
>   in hiearchical mode. Fix it.
> 
> o Use macros to define multiple sysfs C functions doing the same thing. Code
>   borrowed from CFQ. Helps reduce the number of lines of by 140.
> 
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
...	\
> +}
> +SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
> +EXPORT_SYMBOL(elv_fairness_show);
> +SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
> +EXPORT_SYMBOL(elv_slice_idle_show);
> +SHOW_FUNCTION(elv_async_slice_idle_show, efqd->elv_async_slice_idle, 1);
> +EXPORT_SYMBOL(elv_async_slice_idle_show);
> +SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
> +EXPORT_SYMBOL(elv_slice_sync_show);
> +SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
> +EXPORT_SYMBOL(elv_slice_async_show);
> +#undef SHOW_FUNCTION
> +
> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
> +ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
> +{									\
> +	struct elv_fq_data *efqd = &e->efqd;				\
> +	unsigned int __data;						\
> +	int ret = elv_var_store(&__data, (page), count);		\

  Since simple_strtoul returns unsigned long, it's better to make __data 
  be that type.

> +	if (__data < (MIN))						\
> +		__data = (MIN);						\
> +	else if (__data > (MAX))					\
> +		__data = (MAX);						\
> +	if (__CONV)							\
> +		*(__PTR) = msecs_to_jiffies(__data);			\
> +	else								\
> +		*(__PTR) = __data;					\
> +	return ret;							\
> +}
> +STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
> +EXPORT_SYMBOL(elv_fairness_store);
> +STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);

  Do we need to set an actual max limitation rather than UINT_MAX for these entries?

> +EXPORT_SYMBOL(elv_slice_idle_store);
> +STORE_FUNCTION(elv_async_slice_idle_store, &efqd->elv_async_slice_idle, 0, UINT_MAX, 1);
> +EXPORT_SYMBOL(elv_async_slice_idle_store);
> +STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
> +EXPORT_SYMBOL(elv_slice_sync_store);
> +STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
> +EXPORT_SYMBOL(elv_slice_async_store);
> +#undef STORE_FUNCTION
>  
>  void elv_schedule_dispatch(struct request_queue *q)
>  {
> Index: linux18/block/blk-sysfs.c
> ===================================================================
> --- linux18.orig/block/blk-sysfs.c	2009-06-09 10:34:59.000000000 -0400
> +++ linux18/block/blk-sysfs.c	2009-06-09 13:24:42.000000000 -0400
> @@ -307,38 +307,6 @@ static struct queue_sysfs_entry queue_io
>  	.store = queue_iostats_store,
>  };
>  
> -#ifdef CONFIG_ELV_FAIR_QUEUING
> -static struct queue_sysfs_entry queue_slice_idle_entry = {
> -	.attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
> -	.show = elv_slice_idle_show,
> -	.store = elv_slice_idle_store,
> -};
> -
> -static struct queue_sysfs_entry queue_async_slice_idle_entry = {
> -	.attr = {.name = "async_slice_idle", .mode = S_IRUGO | S_IWUSR },
> -	.show = elv_async_slice_idle_show,
> -	.store = elv_async_slice_idle_store,
> -};
> -
> -static struct queue_sysfs_entry queue_slice_sync_entry = {
> -	.attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
> -	.show = elv_slice_sync_show,
> -	.store = elv_slice_sync_store,
> -};
> -
> -static struct queue_sysfs_entry queue_slice_async_entry = {
> -	.attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
> -	.show = elv_slice_async_show,
> -	.store = elv_slice_async_store,
> -};
> -
> -static struct queue_sysfs_entry queue_fairness_entry = {
> -	.attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
> -	.show = elv_fairness_show,
> -	.store = elv_fairness_store,
> -};
> -#endif
> -
>  static struct attribute *default_attrs[] = {
>  	&queue_requests_entry.attr,
>  #ifdef CONFIG_GROUP_IOSCHED
> @@ -353,13 +321,6 @@ static struct attribute *default_attrs[]
>  	&queue_nomerges_entry.attr,
>  	&queue_rq_affinity_entry.attr,
>  	&queue_iostats_entry.attr,
> -#ifdef CONFIG_ELV_FAIR_QUEUING
> -	&queue_slice_idle_entry.attr,
> -	&queue_async_slice_idle_entry.attr,
> -	&queue_slice_sync_entry.attr,
> -	&queue_slice_async_entry.attr,
> -	&queue_fairness_entry.attr,
> -#endif
>  	NULL,
>  };
>  
> Index: linux18/block/cfq-iosched.c
> ===================================================================
> --- linux18.orig/block/cfq-iosched.c	2009-06-09 10:34:55.000000000 -0400
> +++ linux18/block/cfq-iosched.c	2009-06-09 13:25:42.000000000 -0400
> @@ -2095,6 +2095,11 @@ static struct elv_fs_entry cfq_attrs[] =
>  	CFQ_ATTR(back_seek_max),
>  	CFQ_ATTR(back_seek_penalty),
>  	CFQ_ATTR(slice_async_rq),
> +	ELV_ATTR(fairness),
> +	ELV_ATTR(slice_idle),
> +	ELV_ATTR(async_slice_idle),
> +	ELV_ATTR(slice_sync),
> +	ELV_ATTR(slice_async),
>  	__ATTR_NULL
>  };
>  
> Index: linux18/block/as-iosched.c
> ===================================================================
> --- linux18.orig/block/as-iosched.c	2009-06-09 10:34:58.000000000 -0400
> +++ linux18/block/as-iosched.c	2009-06-09 13:27:38.000000000 -0400
> @@ -1766,6 +1766,11 @@ static struct elv_fs_entry as_attrs[] = 
>  	AS_ATTR(antic_expire),
>  	AS_ATTR(read_batch_expire),
>  	AS_ATTR(write_batch_expire),
> +#ifdef CONFIG_IOSCHED_AS_HIER
> +	ELV_ATTR(fairness),
> +	ELV_ATTR(slice_idle),
> +	ELV_ATTR(slice_sync),
> +#endif
>  	__ATTR_NULL
>  };
>  
> Index: linux18/block/deadline-iosched.c
> ===================================================================
> --- linux18.orig/block/deadline-iosched.c	2009-06-09 10:34:55.000000000 -0400
> +++ linux18/block/deadline-iosched.c	2009-06-09 13:28:51.000000000 -0400
> @@ -460,6 +460,11 @@ static struct elv_fs_entry deadline_attr
>  	DD_ATTR(writes_starved),
>  	DD_ATTR(front_merges),
>  	DD_ATTR(fifo_batch),
> +#ifdef CONFIG_IOSCHED_DEADLINE_HIER
> +	ELV_ATTR(fairness),
> +	ELV_ATTR(slice_idle),
> +	ELV_ATTR(slice_sync),
> +#endif
>  	__ATTR_NULL
>  };
>  
> Index: linux18/block/noop-iosched.c
> ===================================================================
> --- linux18.orig/block/noop-iosched.c	2009-06-09 10:34:52.000000000 -0400
> +++ linux18/block/noop-iosched.c	2009-06-09 13:31:48.000000000 -0400
> @@ -82,6 +82,15 @@ static void noop_free_noop_queue(struct 
>  	kfree(nq);
>  }
>  
> +#ifdef CONFIG_IOSCHED_NOOP_HIER
> +static struct elv_fs_entry noop_attrs[] = {
> +	ELV_ATTR(fairness),
> +	ELV_ATTR(slice_idle),
> +	ELV_ATTR(slice_sync),
> +	__ATTR_NULL
> +};
> +#endif
> +
>  static struct elevator_type elevator_noop = {
>  	.ops = {
>  		.elevator_merge_req_fn		= noop_merged_requests,
> @@ -94,6 +103,7 @@ static struct elevator_type elevator_noo
>  	},
>  #ifdef CONFIG_IOSCHED_NOOP_HIER
>  	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
> +	.elevator_attrs = noop_attrs,
>  #endif
>  	.elevator_name = "noop",
>  	.elevator_owner = THIS_MODULE,
> 
> 
> 

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
  2009-06-09 17:51       ` Vivek Goyal
@ 2009-06-10  1:30         ` Gui Jianfeng
  -1 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-06-10  1:30 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
> On Tue, Jun 09, 2009 at 03:56:38PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>> ...
>>> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
>>> +			  size_t count)
>>> +{
>>> +	struct elv_fq_data *efqd;
>>> +	unsigned int data;
>>> +	unsigned long flags;
>>> +
>>> +	char *p = (char *)name;
>>> +
>>> +	data = simple_strtoul(p, &p, 10);
>>> +
>>> +	if (data < 0)
>>> +		data = 0;
>>> +	else if (data > INT_MAX)
>>> +		data = INT_MAX;
>>   Hi Vivek,
>>
>>   data might overflow on 64 bit systems. In addition, since "fairness" is nothing 
>>   more than a switch, just let it be.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
> 
> Hi Gui,
> 
> How about following patch? Currently this should apply at the end of the
> patch series. If it looks good, I will merge the changes in higher level
> patches.

  This patch seems good to me. A few trivial comments below.

> 
> Thanks
> Vivek
> 
> o Previously common layer elevator parameters were appearing as request
>   queue parameters in sysfs. But actually these are io scheduler parameters
>   in hiearchical mode. Fix it.
> 
> o Use macros to define multiple sysfs C functions doing the same thing. Code
>   borrowed from CFQ. Helps reduce the number of lines of by 140.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
...	\
> +}
> +SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
> +EXPORT_SYMBOL(elv_fairness_show);
> +SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
> +EXPORT_SYMBOL(elv_slice_idle_show);
> +SHOW_FUNCTION(elv_async_slice_idle_show, efqd->elv_async_slice_idle, 1);
> +EXPORT_SYMBOL(elv_async_slice_idle_show);
> +SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
> +EXPORT_SYMBOL(elv_slice_sync_show);
> +SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
> +EXPORT_SYMBOL(elv_slice_async_show);
> +#undef SHOW_FUNCTION
> +
> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
> +ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
> +{									\
> +	struct elv_fq_data *efqd = &e->efqd;				\
> +	unsigned int __data;						\
> +	int ret = elv_var_store(&__data, (page), count);		\

  Since simple_strtoul returns unsigned long, it's better to make __data 
  be that type.
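
(To make the overflow concern concrete, here is a tiny user-space sketch.
It assumes elv_var_store() parses the string with simple_strtoul() into an
unsigned long and assigns the result to the unsigned int it was handed,
the way CFQ's cfq_var_store() does.)

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* 2^32: fits in unsigned long on LP64, not in unsigned int */
	unsigned long parsed = strtoul("4294967296", NULL, 10);
	unsigned int data = parsed;	/* silently truncates to 0 */

	/* A MIN/MAX clamp applied to 'data' afterwards never sees the
	 * out-of-range value the user actually wrote. */
	printf("parsed=%lu stored=%u\n", parsed, data);
	return 0;
}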

> +	if (__data < (MIN))						\
> +		__data = (MIN);						\
> +	else if (__data > (MAX))					\
> +		__data = (MAX);						\
> +	if (__CONV)							\
> +		*(__PTR) = msecs_to_jiffies(__data);			\
> +	else								\
> +		*(__PTR) = __data;					\
> +	return ret;							\
> +}
> +STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
> +EXPORT_SYMBOL(elv_fairness_store);
> +STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);

  Do we need to set an actual max limitation rather than UINT_MAX for these entries?

> +EXPORT_SYMBOL(elv_slice_idle_store);
> +STORE_FUNCTION(elv_async_slice_idle_store, &efqd->elv_async_slice_idle, 0, UINT_MAX, 1);
> +EXPORT_SYMBOL(elv_async_slice_idle_store);
> +STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
> +EXPORT_SYMBOL(elv_slice_sync_store);
> +STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
> +EXPORT_SYMBOL(elv_slice_async_store);
> +#undef STORE_FUNCTION
>  
>  void elv_schedule_dispatch(struct request_queue *q)
>  {
> Index: linux18/block/blk-sysfs.c
> ===================================================================
> --- linux18.orig/block/blk-sysfs.c	2009-06-09 10:34:59.000000000 -0400
> +++ linux18/block/blk-sysfs.c	2009-06-09 13:24:42.000000000 -0400
> @@ -307,38 +307,6 @@ static struct queue_sysfs_entry queue_io
>  	.store = queue_iostats_store,
>  };
>  
> -#ifdef CONFIG_ELV_FAIR_QUEUING
> -static struct queue_sysfs_entry queue_slice_idle_entry = {
> -	.attr = {.name = "slice_idle", .mode = S_IRUGO | S_IWUSR },
> -	.show = elv_slice_idle_show,
> -	.store = elv_slice_idle_store,
> -};
> -
> -static struct queue_sysfs_entry queue_async_slice_idle_entry = {
> -	.attr = {.name = "async_slice_idle", .mode = S_IRUGO | S_IWUSR },
> -	.show = elv_async_slice_idle_show,
> -	.store = elv_async_slice_idle_store,
> -};
> -
> -static struct queue_sysfs_entry queue_slice_sync_entry = {
> -	.attr = {.name = "slice_sync", .mode = S_IRUGO | S_IWUSR },
> -	.show = elv_slice_sync_show,
> -	.store = elv_slice_sync_store,
> -};
> -
> -static struct queue_sysfs_entry queue_slice_async_entry = {
> -	.attr = {.name = "slice_async", .mode = S_IRUGO | S_IWUSR },
> -	.show = elv_slice_async_show,
> -	.store = elv_slice_async_store,
> -};
> -
> -static struct queue_sysfs_entry queue_fairness_entry = {
> -	.attr = {.name = "fairness", .mode = S_IRUGO | S_IWUSR },
> -	.show = elv_fairness_show,
> -	.store = elv_fairness_store,
> -};
> -#endif
> -
>  static struct attribute *default_attrs[] = {
>  	&queue_requests_entry.attr,
>  #ifdef CONFIG_GROUP_IOSCHED
> @@ -353,13 +321,6 @@ static struct attribute *default_attrs[]
>  	&queue_nomerges_entry.attr,
>  	&queue_rq_affinity_entry.attr,
>  	&queue_iostats_entry.attr,
> -#ifdef CONFIG_ELV_FAIR_QUEUING
> -	&queue_slice_idle_entry.attr,
> -	&queue_async_slice_idle_entry.attr,
> -	&queue_slice_sync_entry.attr,
> -	&queue_slice_async_entry.attr,
> -	&queue_fairness_entry.attr,
> -#endif
>  	NULL,
>  };
>  
> Index: linux18/block/cfq-iosched.c
> ===================================================================
> --- linux18.orig/block/cfq-iosched.c	2009-06-09 10:34:55.000000000 -0400
> +++ linux18/block/cfq-iosched.c	2009-06-09 13:25:42.000000000 -0400
> @@ -2095,6 +2095,11 @@ static struct elv_fs_entry cfq_attrs[] =
>  	CFQ_ATTR(back_seek_max),
>  	CFQ_ATTR(back_seek_penalty),
>  	CFQ_ATTR(slice_async_rq),
> +	ELV_ATTR(fairness),
> +	ELV_ATTR(slice_idle),
> +	ELV_ATTR(async_slice_idle),
> +	ELV_ATTR(slice_sync),
> +	ELV_ATTR(slice_async),
>  	__ATTR_NULL
>  };
>  
> Index: linux18/block/as-iosched.c
> ===================================================================
> --- linux18.orig/block/as-iosched.c	2009-06-09 10:34:58.000000000 -0400
> +++ linux18/block/as-iosched.c	2009-06-09 13:27:38.000000000 -0400
> @@ -1766,6 +1766,11 @@ static struct elv_fs_entry as_attrs[] = 
>  	AS_ATTR(antic_expire),
>  	AS_ATTR(read_batch_expire),
>  	AS_ATTR(write_batch_expire),
> +#ifdef CONFIG_IOSCHED_AS_HIER
> +	ELV_ATTR(fairness),
> +	ELV_ATTR(slice_idle),
> +	ELV_ATTR(slice_sync),
> +#endif
>  	__ATTR_NULL
>  };
>  
> Index: linux18/block/deadline-iosched.c
> ===================================================================
> --- linux18.orig/block/deadline-iosched.c	2009-06-09 10:34:55.000000000 -0400
> +++ linux18/block/deadline-iosched.c	2009-06-09 13:28:51.000000000 -0400
> @@ -460,6 +460,11 @@ static struct elv_fs_entry deadline_attr
>  	DD_ATTR(writes_starved),
>  	DD_ATTR(front_merges),
>  	DD_ATTR(fifo_batch),
> +#ifdef CONFIG_IOSCHED_DEADLINE_HIER
> +	ELV_ATTR(fairness),
> +	ELV_ATTR(slice_idle),
> +	ELV_ATTR(slice_sync),
> +#endif
>  	__ATTR_NULL
>  };
>  
> Index: linux18/block/noop-iosched.c
> ===================================================================
> --- linux18.orig/block/noop-iosched.c	2009-06-09 10:34:52.000000000 -0400
> +++ linux18/block/noop-iosched.c	2009-06-09 13:31:48.000000000 -0400
> @@ -82,6 +82,15 @@ static void noop_free_noop_queue(struct 
>  	kfree(nq);
>  }
>  
> +#ifdef CONFIG_IOSCHED_NOOP_HIER
> +static struct elv_fs_entry noop_attrs[] = {
> +	ELV_ATTR(fairness),
> +	ELV_ATTR(slice_idle),
> +	ELV_ATTR(slice_sync),
> +	__ATTR_NULL
> +};
> +#endif
> +
>  static struct elevator_type elevator_noop = {
>  	.ops = {
>  		.elevator_merge_req_fn		= noop_merged_requests,
> @@ -94,6 +103,7 @@ static struct elevator_type elevator_noo
>  	},
>  #ifdef CONFIG_IOSCHED_NOOP_HIER
>  	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
> +	.elevator_attrs = noop_attrs,
>  #endif
>  	.elevator_name = "noop",
>  	.elevator_owner = THIS_MODULE,
> 
> 
> 

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
  2009-06-10  1:30         ` Gui Jianfeng
@ 2009-06-10 13:26           ` Vivek Goyal
  -1 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-06-10 13:26 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

On Wed, Jun 10, 2009 at 09:30:38AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Tue, Jun 09, 2009 at 03:56:38PM +0800, Gui Jianfeng wrote:
> >> Vivek Goyal wrote:
> >> ...
> >>> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
> >>> +			  size_t count)
> >>> +{
> >>> +	struct elv_fq_data *efqd;
> >>> +	unsigned int data;
> >>> +	unsigned long flags;
> >>> +
> >>> +	char *p = (char *)name;
> >>> +
> >>> +	data = simple_strtoul(p, &p, 10);
> >>> +
> >>> +	if (data < 0)
> >>> +		data = 0;
> >>> +	else if (data > INT_MAX)
> >>> +		data = INT_MAX;
> >>   Hi Vivek,
> >>
> >>   data might overflow on 64 bit systems. In addition, since "fairness" is nothing 
> >>   more than a switch, just let it be.
> >>
> >> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> >> ---
> > 
> > Hi Gui,
> > 
> > How about following patch? Currently this should apply at the end of the
> > patch series. If it looks good, I will merge the changes in higher level
> > patches.
> 
>   This patch seems good to me. A few trivial comments below.
> 
> > 
> > Thanks
> > Vivek
> > 
> > o Previously common layer elevator parameters were appearing as request
> >   queue parameters in sysfs. But actually these are io scheduler parameters
> >   in hiearchical mode. Fix it.
> > 
> > o Use macros to define multiple sysfs C functions doing the same thing. Code
> >   borrowed from CFQ. Helps reduce the number of lines of by 140.
> > 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ...	\
> > +}
> > +SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
> > +EXPORT_SYMBOL(elv_fairness_show);
> > +SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
> > +EXPORT_SYMBOL(elv_slice_idle_show);
> > +SHOW_FUNCTION(elv_async_slice_idle_show, efqd->elv_async_slice_idle, 1);
> > +EXPORT_SYMBOL(elv_async_slice_idle_show);
> > +SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
> > +EXPORT_SYMBOL(elv_slice_sync_show);
> > +SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
> > +EXPORT_SYMBOL(elv_slice_async_show);
> > +#undef SHOW_FUNCTION
> > +
> > +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
> > +ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
> > +{									\
> > +	struct elv_fq_data *efqd = &e->efqd;				\
> > +	unsigned int __data;						\
> > +	int ret = elv_var_store(&__data, (page), count);		\
> 
>   Since simple_strtoul returns unsigned long, it's better to make __data 
>   be that type.
> 

I just took it from CFQ. BTW, what's the harm here in truncating unsigned
long to unsigned int? For our variables we are not expecting any value
bigger than what an unsigned int can hold, and if one does come in,
truncating it is what we would expect anyway.

> > +	if (__data < (MIN))						\
> > +		__data = (MIN);						\
> > +	else if (__data > (MAX))					\
> > +		__data = (MAX);						\
> > +	if (__CONV)							\
> > +		*(__PTR) = msecs_to_jiffies(__data);			\
> > +	else								\
> > +		*(__PTR) = __data;					\
> > +	return ret;							\
> > +}
> > +STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
> > +EXPORT_SYMBOL(elv_fairness_store);
> > +STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);
> 
>   Do we need to set an actual max limitation rather than UINT_MAX for these entries?

Again these are the same values CFQ was using.  Do you have a better upper
limit in mind? Until and unless there is strong objection to UINT_MAX, we
can stick to what CFQ has been doing so far.
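
(For reference, substituting those arguments into the STORE_FUNCTION macro
quoted above gives, modulo formatting, roughly the function below.  Note
that with MIN == 0 and MAX == UINT_MAX both clamp branches are dead for an
unsigned int, which is what the question about a tighter maximum comes down
to.  Sketch only; elv_var_store() and struct elv_fq_data are as defined
elsewhere in the series.)

ssize_t elv_slice_idle_store(struct elevator_queue *e, const char *page,
			     size_t count)
{
	struct elv_fq_data *efqd = &e->efqd;
	unsigned int __data;
	int ret = elv_var_store(&__data, page, count);

	if (__data < 0)			/* MIN == 0: never true for unsigned */
		__data = 0;
	else if (__data > UINT_MAX)	/* MAX == UINT_MAX: never true either */
		__data = UINT_MAX;
	/* __CONV == 1: the value written is interpreted as milliseconds */
	efqd->elv_slice_idle = msecs_to_jiffies(__data);
	return ret;
}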

Thanks
Vivek

^ permalink raw reply	[flat|nested] 297+ messages in thread

* Re: [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it
  2009-06-10 13:26           ` Vivek Goyal
@ 2009-06-11  1:22             ` Gui Jianfeng
  -1 siblings, 0 replies; 297+ messages in thread
From: Gui Jianfeng @ 2009-06-11  1:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	jens.axboe, ryov, fernando, s-uchida, taka, jmoyer, dhaval,
	balbir, linux-kernel, containers, righi.andrea, agk, dm-devel,
	snitzer, m-ikeda, akpm

Vivek Goyal wrote:
> On Wed, Jun 10, 2009 at 09:30:38AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Tue, Jun 09, 2009 at 03:56:38PM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>> ...
>>>>> +ssize_t elv_fairness_store(struct request_queue *q, const char *name,
>>>>> +			  size_t count)
>>>>> +{
>>>>> +	struct elv_fq_data *efqd;
>>>>> +	unsigned int data;
>>>>> +	unsigned long flags;
>>>>> +
>>>>> +	char *p = (char *)name;
>>>>> +
>>>>> +	data = simple_strtoul(p, &p, 10);
>>>>> +
>>>>> +	if (data < 0)
>>>>> +		data = 0;
>>>>> +	else if (data > INT_MAX)
>>>>> +		data = INT_MAX;
>>>>   Hi Vivek,
>>>>
>>>>   data might overflow on 64 bit systems. In addition, since "fairness" is nothing 
>>>>   more than a switch, just let it be.
>>>>
>>>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>>>> ---
>>> Hi Gui,
>>>
>>> How about following patch? Currently this should apply at the end of the
>>> patch series. If it looks good, I will merge the changes in higher level
>>> patches.
>>   This patch seems good to me. A few trivial comments below.
>>
>>> Thanks
>>> Vivek
>>>
>>> o Previously common layer elevator parameters were appearing as request
>>>   queue parameters in sysfs. But actually these are io scheduler parameters
>>>   in hiearchical mode. Fix it.
>>>
>>> o Use macros to define multiple sysfs C functions doing the same thing. Code
>>>   borrowed from CFQ. Helps reduce the number of lines of by 140.
>>>
>>> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
>> ...	\
>>> +}
>>> +SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
>>> +EXPORT_SYMBOL(elv_fairness_show);
>>> +SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
>>> +EXPORT_SYMBOL(elv_slice_idle_show);
>>> +SHOW_FUNCTION(elv_async_slice_idle_show, efqd->elv_async_slice_idle, 1);
>>> +EXPORT_SYMBOL(elv_async_slice_idle_show);
>>> +SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
>>> +EXPORT_SYMBOL(elv_slice_sync_show);
>>> +SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
>>> +EXPORT_SYMBOL(elv_slice_async_show);
>>> +#undef SHOW_FUNCTION
>>> +
>>> +#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
>>> +ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
>>> +{									\
>>> +	struct elv_fq_data *efqd = &e->efqd;				\
>>> +	unsigned int __data;						\
>>> +	int ret = elv_var_store(&__data, (page), count);		\
>>   Since simple_strtoul returns unsigned long, it's better to make __data 
>>   be that type.
>>
> 
> I just took it from CFQ. BTW, what's the harm here in truncating unsigned
> long to unsigned int? For our variables we are not expecting any value
> bigger than what an unsigned int can hold, and if one does come in,
> truncating it is what we would expect anyway.
> 
>>> +	if (__data < (MIN))						\
>>> +		__data = (MIN);						\
>>> +	else if (__data > (MAX))					\
>>> +		__data = (MAX);						\
>>> +	if (__CONV)							\
>>> +		*(__PTR) = msecs_to_jiffies(__data);			\
>>> +	else								\
>>> +		*(__PTR) = __data;					\
>>> +	return ret;							\
>>> +}
>>> +STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
>>> +EXPORT_SYMBOL(elv_fairness_store);
>>> +STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);
>>   Do we need to set an actual max limitation rather than UINT_MAX for these entries?
> 
> Again these are the same values CFQ was using.  Do you have a better upper
> limit in mind? Until and unless there is strong objection to UINT_MAX, we
> can stick to what CFQ has been doing so far.

  OK, I don't have a strong opinion on the above points.

> 
> Thanks
> Vivek
> 
> 
> 

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 297+ messages in thread

* IO scheduler based IO Controller V2
@ 2009-05-05 19:58 Vivek Goyal
  0 siblings, 0 replies; 297+ messages in thread
From: Vivek Goyal @ 2009-05-05 19:58 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA, dpshah-hpIqsD4AKlfQT0dZR+AlfA,
	lizf-BthXqXjhjHXQFUHtdCDX3A, mikew-hpIqsD4AKlfQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, ryov-jCdQPDEk3idL9jVzuh4AOg,
	fer
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


Hi All,

Here is the V2 of the IO controller patches generated on top of 2.6.30-rc4.
First version of the patches was posted here.

http://lkml.org/lkml/2009/3/11/486

This patchset is still work in progress but I want to keep on getting the
snapshot of my tree out at regular intervals to get the feedback hence V2.

Before I go into details of what are the major changes from V1, wanted
to highlight other IO controller proposals on lkml.

Other active IO controller proposals
------------------------------------
Currently primarily two other IO controller proposals are out there.

dm-ioband
---------
This patch set is from Ryo Tsuruta from valinux. It is a proportional bandwidth controller implemented as a dm driver.

http://people.valinux.co.jp/~ryov/dm-ioband/

The biggest issue (apart from others), with a 2nd level IO controller is that
buffering of BIOs takes place in a single queue and dispatch of this BIOs
to unerlying IO scheduler is in FIFO manner. That means whenever the buffering
takes place, it breaks the notion of different class and priority of CFQ.

That means RT requests might be stuck behind some write requests or some read
requests might be stuck behind somet write requests for long time etc. To
demonstrate the single FIFO dispatch issues, I had run some basic tests and
posted the results in following mail thread.

http://lkml.org/lkml/2009/4/13/2

These are hard to solve issues and one will end up maintaining the separate
queues for separate classes and priority as CFQ does to fully resolve it.
But that will make 2nd level implementation complex at the same time if
somebody is trying to use IO controller on a single disk or on a hardware RAID
using cfq as scheduler, it will be two layers of queueing maintating separate
queues per priorty level. One at dm-driver level and other at CFQ which again
does not make lot of sense.

On the other hand, if a user is running noop at the device level, at higher
level we will be maintaining multiple cfq like queues, which also does not
make sense as underlying IO scheduler never wanted that.

Hence, IMHO, I think that controlling bio at second level probably is not a
very good idea. We should instead do it at IO scheduler level where we already
maintain all the needed queues. Just that make the scheduling hierarhical and
group aware so isolate IO of one group from other.

IO-throttling
-------------
This patch set is from Andrea Righi provides max bandwidth controller. That
means, it does not gurantee the minimum bandwidth. It provides the maximum
bandwidth limits and throttles the application if it crosses its bandwidth.

So its not apple vs apple comparison. This patch set and dm-ioband provide
proportional bandwidth control where a cgroup can use much more bandwidth
if there are not other users and resource control comes into the picture
only if there is contention.

It seems that there are both the kind of users there. One set of people needing
proportional BW control and other people needing max bandwidth control.

Now the question is, where max bandwidth control should be implemented? At
higher layers or at IO scheduler level? Should proportional bw control and
max bw control be implemented separately at different layer or these should
be implemented at one place?

IMHO, if we are doing proportional bw control at IO scheduler layer, it should
be possible to extend it to do max bw control also here without lot of effort.
Then it probably does not make too much of sense to do two types of control
at two different layers. Doing it at one place should lead to lesser code
and reduced complexity.

Secondly, io-throttling solution also buffers writes at higher layer.
Which again will lead to issue of losing the notion of priority of writes.

Hence, personally I think that users will need both proportional bw as well
as max bw control and we probably should implement these at a single place
instead of splitting it. Once elevator based io controller patchset matures,
it can be enhanced to do max bw control also.

Having said that, one issue with doing upper limit control at elevator/IO
scheduler level is that it does not have the view of higher level logical
devices. So if there is a software RAID with two disks, then one can not do
max bw control on logical device, instead it shall have to be on leaf node
where io scheduler is attached.

Now back to the desciption of this patchset and changes from V1.

- Rebased patches to 2.6.30-rc4.

- Last time Andrew mentioned that async writes are big issue for us hence,
  introduced the control for async writes also.

- Implemented per group request descriptor support. This was needed to
  make sure one group doing lot of IO does not starve other group of request
  descriptors and other group does not get fair share. This is a basic patch
  right now which probably will require more changes after some discussion.

- Exported the disk time used and number of sectors dispatched by a cgroup
  through cgroup interface. This should help us in seeing how much disk
  time each group got and whether it is fair or not.

- Implemented group refcounting support. Lack of this was causing some
  cgroup related issues. There are still some races left out which needs
  to be fixed. 

- For IO tracking/async write tracking, started making use of patches of
  blkio-cgroup from ryo Tsuruta posted here.

  http://lkml.org/lkml/2009/4/28/235

  Currently people seem to be liking the idea of separate subsystem for
  tracking writes and then rest of the users can use that info instead of
  everybody implementing their own. That's a different thing that how many
  users are out there which will end up in kernel is not clear.

  So instead of carrying own versin of bio-cgroup patches, and overloading
  io controller cgroup subsystem, I am making use of blkio-cgroup patches.
  One shall have to mount io controller and blkio subsystem together on the
  same hiearchy for the time being. Later we can take care of the case where
  blkio is mounted on a different hierarchy.

- Replaced group priorities with group weights.

Testing
=======

Again, I have been able to do only very basic testing of reads and writes.
Did not want to hold the patches back because of testing. Providing support
for async writes took much more time than expected and still work is left
in that area. Will continue to do more testing.

Test1 (Fairness for synchronous reads)
======================================
- Two dd in two cgroups with cgrop weights 1000 and 500. Ran two "dd" in those
  cgroups (With CFQ scheduler and /sys/block/<device>/queue/fairness = 1)

dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

234179072 bytes (234 MB) copied, 4.13954 s, 56.6 MB/s
234179072 bytes (234 MB) copied, 5.2127 s, 44.9 MB/s

group1 time=3108 group1 sectors=460968
group2 time=1405 group2 sectors=264944

This patchset tries to provide fairness in terms of disk time received. group1
got almost double the disk time of group2 (measured when the first dd
finished). These time and sector statistics can be read using the io.disk_time
and io.disk_sector files in the cgroup. More about it in the documentation file.
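
For completeness, the per-group setup and statistics read-out for this test
were along the following lines. This is only a sketch: the io.weight file
name is an assumption, while io.disk_time and io.disk_sector are the files
described above.

# give the two groups a 2:1 weight ratio (weight file name assumed)
echo 1000 > /cgroup/bfqio/group1/io.weight
echo 500 > /cgroup/bfqio/group2/io.weight
echo 1 > /sys/block/$BLOCKDEV/queue/fairness

# place one dd reader in each group
echo $$ > /cgroup/bfqio/group1/tasks
dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
echo $$ > /cgroup/bfqio/group2/tasks
dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

# read the exported statistics
cat /cgroup/bfqio/group1/io.disk_time /cgroup/bfqio/group1/io.disk_sector
cat /cgroup/bfqio/group2/io.disk_time /cgroup/bfqio/group2/io.disk_sector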

Test2 (Fairness for async writes)
=================================
Fairness for async writes is tricky, and the biggest reason is that async
writes are cached in higher layers (the page cache) and are not necessarily
dispatched to lower layers in a proportional manner. For example, consider
two dd threads reading /dev/zero as the input file and writing out huge
files. Very soon we cross vm_dirty_ratio and a dd thread is forced to write
out some pages to disk before more pages can be dirtied. But the dirty pages
picked are not necessarily those of the same thread; writeback can very well
pick the inode of the lower priority dd thread and do some writeout there. So
effectively the higher weight dd ends up doing writeouts of the lower weight
dd's pages, and we do not see service differentiation.

IOW, the core problem with async write fairness is that the higher weight
thread does not throw enough IO traffic at the IO controller to keep its
queue continuously backlogged. There are many 0.2 to 0.8 second intervals
where the higher weight queue is empty, and in that time the lower weight
queue gets a lot of work done, giving the impression that there was no
service differentiation.

In summary, from the IO controller's point of view async write support is
there. Now we need to do some more work in higher layers to make sure a
higher weight process is not blocked behind the IO of some lower weight
process. This is a TODO item.

So to test async writes I generated lots of write traffic in two cgroups (50
fio threads each) and watched the disk time statistics of the respective
cgroups at 2 second intervals. Thanks to Ryo Tsuruta for the test case.

*****************************************************************
sync
echo 3 > /proc/sys/vm/drop_caches

fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/sdd1/fio/ --output=/mnt/sdd1/fio/test1.log &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/sdd2/fio/ --output=/mnt/sdd2/fio/test2.log &
*********************************************************************** 
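
The watcher was a simple polling loop roughly like the one below (a sketch;
io.disk_time and io.disk_sector are the statistics files described earlier):

while true; do
        t1=$(cat /cgroup/bfqio/test1/io.disk_time)
        s1=$(cat /cgroup/bfqio/test1/io.disk_sector)
        t2=$(cat /cgroup/bfqio/test2/io.disk_time)
        s2=$(cat /cgroup/bfqio/test2/io.disk_sector)
        echo "test1 statistics: time=$t1   sectors=$s1"
        echo "test2 statistics: time=$t2   sectors=$s2"
        sleep 2
done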

I watched the disk time and sector statistics for both cgroups every 2
seconds with this script. Here is a snippet from the output.

test1 statistics: time=9848   sectors=643152
test2 statistics: time=5224   sectors=258600

test1 statistics: time=11736   sectors=785792
test2 statistics: time=6509   sectors=333160

test1 statistics: time=13607   sectors=943968
test2 statistics: time=7443   sectors=394352

test1 statistics: time=15662   sectors=1089496
test2 statistics: time=8568   sectors=451152

So the disk time consumed by cgroup test1 is almost double that of test2.

Your feedback and comments are welcome.

Thanks
Vivek

Thread overview: 297+ messages
2009-05-05 19:58 IO scheduler based IO Controller V2 Vivek Goyal
2009-05-05 19:58 ` [PATCH 01/18] io-controller: Documentation Vivek Goyal
2009-05-06  3:16   ` Gui Jianfeng
     [not found]     ` <4A0100F4.4040400-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-06 13:31       ` Vivek Goyal
2009-05-06 13:31     ` Vivek Goyal
     [not found]   ` <1241553525-28095-2-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-06  3:16     ` Gui Jianfeng
2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 02/18] io-controller: Common flat fair queuing code in elevaotor layer Vivek Goyal
2009-05-05 19:58 ` [PATCH 03/18] io-controller: Charge for time slice based on average disk rate Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
     [not found]   ` <1241553525-28095-5-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-22  8:54     ` Gui Jianfeng
2009-05-22  8:54   ` Gui Jianfeng
     [not found]     ` <4A166829.6070608-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-22 12:33       ` Vivek Goyal
2009-05-22 12:33     ` Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevaotor layer Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-07  7:42   ` Gui Jianfeng
2009-05-07  8:05     ` Li Zefan
     [not found]     ` <4A0290ED.7080506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-07  8:05       ` Li Zefan
2009-05-08 12:45       ` Vivek Goyal
2009-05-08 12:45     ` Vivek Goyal
     [not found]   ` <1241553525-28095-6-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-07  7:42     ` Gui Jianfeng
2009-05-08 21:09     ` Andrea Righi
2009-05-08 21:09   ` Andrea Righi
2009-05-08 21:17     ` Vivek Goyal
2009-05-08 21:17     ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 06/18] io-controller: cfq changes to use " Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 07/18] io-controller: Export disk time used and nr sectors dipatched through cgroups Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-13  2:39   ` Gui Jianfeng
     [not found]     ` <4A0A32CB.4020609-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-13 14:51       ` Vivek Goyal
2009-05-13 14:51     ` Vivek Goyal
2009-05-14  7:53       ` Gui Jianfeng
     [not found]       ` <20090513145127.GB7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-14  7:53         ` Gui Jianfeng
     [not found]   ` <1241553525-28095-8-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-13  2:39     ` Gui Jianfeng
2009-05-05 19:58 ` [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-13 15:00   ` Vivek Goyal
2009-05-13 15:00   ` Vivek Goyal
2009-06-09  7:56   ` Gui Jianfeng
2009-06-09 17:51     ` Vivek Goyal
2009-06-09 17:51       ` Vivek Goyal
     [not found]       ` <20090609175131.GB13476-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-06-10  1:30         ` Gui Jianfeng
2009-06-10  1:30       ` Gui Jianfeng
2009-06-10  1:30         ` Gui Jianfeng
2009-06-10 13:26         ` Vivek Goyal
2009-06-10 13:26           ` Vivek Goyal
2009-06-11  1:22           ` Gui Jianfeng
2009-06-11  1:22             ` Gui Jianfeng
     [not found]           ` <20090610132638.GB19680-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-06-11  1:22             ` Gui Jianfeng
     [not found]         ` <4A2F0CBE.8030208-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-06-10 13:26           ` Vivek Goyal
     [not found]     ` <4A2E15B6.8030001-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-06-09 17:51       ` Vivek Goyal
     [not found]   ` <1241553525-28095-9-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-13 15:00     ` Vivek Goyal
2009-06-09  7:56     ` Gui Jianfeng
2009-05-05 19:58 ` [PATCH 09/18] io-controller: Separate out queue and data Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 10/18] io-conroller: Prepare elevator layer for single queue schedulers Vivek Goyal
2009-05-05 19:58 ` [PATCH 11/18] io-controller: noop changes for hierarchical fair queuing Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 12/18] io-controller: deadline " Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 13/18] io-controller: anticipatory " Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 14/18] blkio_cgroup patches from Ryo to track async bios Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 15/18] io-controller: map async requests to appropriate cgroup Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 16/18] io-controller: Per cgroup request descriptor support Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 17/18] io-controller: IO group refcounting support Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
     [not found]   ` <1241553525-28095-18-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-08  2:59     ` Gui Jianfeng
2009-05-08  2:59       ` Gui Jianfeng
2009-05-08 12:44       ` Vivek Goyal
     [not found]       ` <4A03A013.9000405-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-08 12:44         ` Vivek Goyal
2009-05-05 19:58 ` [PATCH 18/18] io-controller: Debug hierarchical IO scheduling Vivek Goyal
2009-05-05 19:58 ` Vivek Goyal
2009-05-06 21:40   ` IKEDA, Munehiro
     [not found]     ` <4A0203DB.1090809-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
2009-05-06 21:58       ` Vivek Goyal
2009-05-06 21:58         ` Vivek Goyal
     [not found]         ` <20090506215833.GK8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-06 22:19           ` IKEDA, Munehiro
2009-05-06 22:19             ` IKEDA, Munehiro
     [not found]             ` <4A020CD5.2000308-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
2009-05-06 22:24               ` Vivek Goyal
2009-05-06 22:24                 ` Vivek Goyal
     [not found]                 ` <20090506222458.GM8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-06 23:01                   ` IKEDA, Munehiro
2009-05-06 23:01                     ` IKEDA, Munehiro
     [not found]   ` <1241553525-28095-19-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-06 21:40     ` IKEDA, Munehiro
     [not found] ` <1241553525-28095-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-05 19:58   ` [PATCH 01/18] io-controller: Documentation Vivek Goyal
2009-05-05 19:58   ` [PATCH 02/18] io-controller: Common flat fair queuing code in elevaotor layer Vivek Goyal
2009-05-05 19:58     ` Vivek Goyal
2009-05-22  6:43     ` Gui Jianfeng
2009-05-22 12:32       ` Vivek Goyal
2009-05-23 20:04         ` Jens Axboe
     [not found]         ` <20090522123231.GA14972-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-23 20:04           ` Jens Axboe
     [not found]       ` <4A164978.1020604-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-22 12:32         ` Vivek Goyal
     [not found]     ` <1241553525-28095-3-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-22  6:43       ` Gui Jianfeng
2009-05-05 19:58   ` [PATCH 03/18] io-controller: Charge for time slice based on average disk rate Vivek Goyal
2009-05-05 19:58   ` [PATCH 04/18] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
2009-05-05 19:58   ` [PATCH 05/18] io-controller: Common hierarchical fair queuing code in elevaotor layer Vivek Goyal
2009-05-05 19:58   ` [PATCH 06/18] io-controller: cfq changes to use " Vivek Goyal
2009-05-05 19:58   ` [PATCH 07/18] io-controller: Export disk time used and nr sectors dipatched through cgroups Vivek Goyal
2009-05-05 19:58   ` [PATCH 08/18] io-controller: idle for sometime on sync queue before expiring it Vivek Goyal
2009-05-05 19:58   ` [PATCH 09/18] io-controller: Separate out queue and data Vivek Goyal
2009-05-05 19:58   ` [PATCH 10/18] io-conroller: Prepare elevator layer for single queue schedulers Vivek Goyal
2009-05-05 19:58     ` Vivek Goyal
2009-05-05 19:58   ` [PATCH 11/18] io-controller: noop changes for hierarchical fair queuing Vivek Goyal
2009-05-05 19:58   ` [PATCH 12/18] io-controller: deadline " Vivek Goyal
2009-05-05 19:58   ` [PATCH 13/18] io-controller: anticipatory " Vivek Goyal
2009-05-05 19:58   ` [PATCH 14/18] blkio_cgroup patches from Ryo to track async bios Vivek Goyal
2009-05-05 19:58   ` [PATCH 15/18] io-controller: map async requests to appropriate cgroup Vivek Goyal
2009-05-05 19:58   ` [PATCH 16/18] io-controller: Per cgroup request descriptor support Vivek Goyal
2009-05-05 19:58   ` [PATCH 17/18] io-controller: IO group refcounting support Vivek Goyal
2009-05-05 19:58   ` [PATCH 18/18] io-controller: Debug hierarchical IO scheduling Vivek Goyal
2009-05-05 20:24   ` IO scheduler based IO Controller V2 Andrew Morton
2009-05-05 20:24     ` Andrew Morton
2009-05-05 22:20     ` Peter Zijlstra
2009-05-06  3:42       ` Balbir Singh
2009-05-06  3:42       ` Balbir Singh
2009-05-06 10:20         ` Fabio Checconi
2009-05-06 17:10           ` Balbir Singh
2009-05-06 17:10             ` Balbir Singh
     [not found]           ` <20090506102030.GB20544-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
2009-05-06 17:10             ` Balbir Singh
2009-05-06 18:47         ` Divyesh Shah
     [not found]         ` <20090506034254.GD4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
2009-05-06 10:20           ` Fabio Checconi
2009-05-06 18:47           ` Divyesh Shah
2009-05-06 20:42           ` Andrea Righi
2009-05-06 20:42         ` Andrea Righi
2009-05-06  2:33     ` Vivek Goyal
2009-05-06 17:59       ` Nauman Rafique
2009-05-06 20:07       ` Andrea Righi
2009-05-06 21:21         ` Vivek Goyal
2009-05-06 21:21         ` Vivek Goyal
     [not found]           ` <20090506212121.GI8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-06 22:02             ` Andrea Righi
2009-05-06 22:02               ` Andrea Righi
2009-05-06 22:17               ` Vivek Goyal
2009-05-06 22:17                 ` Vivek Goyal
     [not found]       ` <20090506023332.GA1212-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-06 17:59         ` Nauman Rafique
2009-05-06 20:07         ` Andrea Righi
2009-05-06 20:32         ` Vivek Goyal
2009-05-07  0:18         ` Ryo Tsuruta
2009-05-06 20:32       ` Vivek Goyal
     [not found]         ` <20090506203228.GH8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-06 21:34           ` Andrea Righi
2009-05-06 21:34         ` Andrea Righi
2009-05-06 21:52           ` Vivek Goyal
2009-05-06 21:52             ` Vivek Goyal
2009-05-06 22:35             ` Andrea Righi
2009-05-07  1:48               ` Ryo Tsuruta
2009-05-07  1:48               ` Ryo Tsuruta
     [not found]             ` <20090506215235.GJ8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-06 22:35               ` Andrea Righi
2009-05-07  9:04               ` Andrea Righi
2009-05-07  9:04             ` Andrea Righi
2009-05-07 12:22               ` Andrea Righi
2009-05-07 12:22               ` Andrea Righi
2009-05-07 14:11               ` Vivek Goyal
2009-05-07 14:11               ` Vivek Goyal
     [not found]                 ` <20090507141126.GA9463-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-07 14:45                   ` Vivek Goyal
2009-05-07 14:45                     ` Vivek Goyal
     [not found]                     ` <20090507144501.GB9463-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-07 15:36                       ` Vivek Goyal
2009-05-07 15:36                         ` Vivek Goyal
     [not found]                         ` <20090507153642.GC9463-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-07 15:42                           ` Vivek Goyal
2009-05-07 15:42                             ` Vivek Goyal
2009-05-07 22:19                           ` Andrea Righi
2009-05-07 22:19                         ` Andrea Righi
2009-05-08 18:09                           ` Vivek Goyal
2009-05-08 20:05                             ` Andrea Righi
2009-05-08 21:56                               ` Vivek Goyal
2009-05-08 21:56                                 ` Vivek Goyal
2009-05-09  9:22                                 ` Peter Zijlstra
2009-05-14 10:31                                 ` Andrea Righi
     [not found]                                 ` <20090508215618.GJ7293-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-09  9:22                                   ` Peter Zijlstra
2009-05-14 10:31                                   ` Andrea Righi
2009-05-14 16:43                                   ` Dhaval Giani
2009-05-14 16:43                                     ` Dhaval Giani
     [not found]                             ` <20090508180951.GG7293-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-08 20:05                               ` Andrea Righi
2009-05-08 18:09                           ` Vivek Goyal
2009-05-07 22:40                       ` Andrea Righi
2009-05-07 22:40                     ` Andrea Righi
2009-05-07  0:18       ` Ryo Tsuruta
     [not found]         ` <20090507.091858.226775723.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-05-07  1:25           ` Vivek Goyal
2009-05-07  1:25             ` Vivek Goyal
     [not found]             ` <20090507012559.GC4187-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-11 11:23               ` Ryo Tsuruta
2009-05-11 11:23             ` Ryo Tsuruta
     [not found]               ` <20090511.202309.112614168.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-05-11 12:49                 ` Vivek Goyal
2009-05-11 12:49                   ` Vivek Goyal
2009-05-08 14:24           ` Rik van Riel
2009-05-08 14:24         ` Rik van Riel
     [not found]           ` <4A0440B2.7040300-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-11 10:11             ` Ryo Tsuruta
2009-05-11 10:11           ` Ryo Tsuruta
     [not found]     ` <20090505132441.1705bfad.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2009-05-05 22:20       ` Peter Zijlstra
2009-05-06  2:33       ` Vivek Goyal
2009-05-06  3:41       ` Balbir Singh
2009-05-06  3:41     ` Balbir Singh
2009-05-06 13:28       ` Vivek Goyal
2009-05-06 13:28         ` Vivek Goyal
     [not found]       ` <20090506034118.GC4416-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
2009-05-06 13:28         ` Vivek Goyal
2009-05-06  8:11   ` Gui Jianfeng
2009-05-08  9:45   ` [PATCH] io-controller: Add io group reference handling for request Gui Jianfeng
2009-05-13  2:00   ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
2009-05-06  8:11 ` IO scheduler based IO Controller V2 Gui Jianfeng
     [not found]   ` <4A014619.1040000-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-06 16:10     ` Vivek Goyal
2009-05-06 16:10       ` Vivek Goyal
2009-05-07  5:36       ` Li Zefan
     [not found]         ` <4A027348.6000808-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-08 13:37           ` Vivek Goyal
2009-05-08 13:37             ` Vivek Goyal
2009-05-11  2:59             ` Gui Jianfeng
     [not found]             ` <20090508133740.GD7293-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-11  2:59               ` Gui Jianfeng
2009-05-07  5:47       ` Gui Jianfeng
     [not found]       ` <20090506161012.GC8180-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-07  5:36         ` Li Zefan
2009-05-07  5:47         ` Gui Jianfeng
2009-05-08  9:45 ` [PATCH] io-controller: Add io group reference handling for request Gui Jianfeng
     [not found]   ` <4A03FF3C.4020506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-08 13:57     ` Vivek Goyal
2009-05-08 13:57       ` Vivek Goyal
     [not found]       ` <20090508135724.GE7293-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-08 17:41         ` Nauman Rafique
2009-05-08 17:41       ` Nauman Rafique
2009-05-08 17:41         ` Nauman Rafique
2009-05-08 18:56         ` Vivek Goyal
     [not found]           ` <20090508185644.GH7293-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-08 19:06             ` Nauman Rafique
2009-05-08 19:06           ` Nauman Rafique
2009-05-08 19:06             ` Nauman Rafique
2009-05-11  1:33         ` Gui Jianfeng
2009-05-11 15:41           ` Vivek Goyal
     [not found]             ` <20090511154127.GD6036-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-15  5:15               ` Gui Jianfeng
2009-05-15  5:15                 ` Gui Jianfeng
2009-05-15  7:48                 ` Andrea Righi
2009-05-15  8:16                   ` Gui Jianfeng
2009-05-15  8:16                   ` Gui Jianfeng
     [not found]                     ` <4A0D24E6.6010807-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-15 14:09                       ` Vivek Goyal
2009-05-15 14:09                         ` Vivek Goyal
2009-05-15 14:06                   ` Vivek Goyal
2009-05-15 14:06                   ` Vivek Goyal
2009-05-17 10:26                     ` Andrea Righi
2009-05-18 14:01                       ` Vivek Goyal
2009-05-18 14:01                         ` Vivek Goyal
2009-05-18 14:39                         ` Andrea Righi
2009-05-26 11:34                           ` Ryo Tsuruta
2009-05-26 11:34                           ` Ryo Tsuruta
2009-05-27  6:56                             ` Ryo Tsuruta
2009-05-27  6:56                               ` Ryo Tsuruta
2009-05-27  8:17                               ` Andrea Righi
2009-05-27  8:17                                 ` Andrea Righi
2009-05-27 11:53                                 ` Ryo Tsuruta
2009-05-27 11:53                                 ` Ryo Tsuruta
2009-05-27 17:32                               ` Vivek Goyal
2009-05-27 17:32                                 ` Vivek Goyal
     [not found]                               ` <20090527.155631.226800550.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-05-27  8:17                                 ` Andrea Righi
2009-05-27 17:32                                 ` Vivek Goyal
     [not found]                             ` <20090526.203424.39179999.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-05-27  6:56                               ` Ryo Tsuruta
2009-05-19 12:18                         ` Ryo Tsuruta
     [not found]                         ` <20090518140114.GB27080-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-18 14:39                           ` Andrea Righi
2009-05-19 12:18                           ` Ryo Tsuruta
     [not found]                     ` <20090515140643.GB19350-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-17 10:26                       ` Andrea Righi
     [not found]                 ` <4A0CFA6C.3080609-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-15  7:48                   ` Andrea Righi
2009-05-15  7:40               ` Gui Jianfeng
2009-05-15  7:40                 ` Gui Jianfeng
2009-05-15 14:01                 ` Vivek Goyal
     [not found]                 ` <4A0D1C55.9040700-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-15 14:01                   ` Vivek Goyal
     [not found]           ` <4A078051.5060702-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-11 15:41             ` Vivek Goyal
     [not found]         ` <e98e18940905081041r386e52a5q5a2b1f13f1e8c634-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-05-08 18:56           ` Vivek Goyal
2009-05-11  1:33           ` Gui Jianfeng
2009-05-13  2:00 ` [PATCH] IO Controller: Add per-device weight and ioprio_class handling Gui Jianfeng
2009-05-13 14:44   ` Vivek Goyal
     [not found]     ` <20090513144432.GA7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-14  0:59       ` Gui Jianfeng
2009-05-14  0:59     ` Gui Jianfeng
2009-05-13 15:29   ` Vivek Goyal
2009-05-14  1:02     ` Gui Jianfeng
     [not found]     ` <20090513152909.GD7696-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-14  1:02       ` Gui Jianfeng
2009-05-13 15:59   ` Vivek Goyal
2009-05-14  1:51     ` Gui Jianfeng
     [not found]     ` <20090513155900.GA15623-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-14  1:51       ` Gui Jianfeng
2009-05-14  2:25       ` Gui Jianfeng
2009-05-14  2:25     ` Gui Jianfeng
     [not found]   ` <4A0A29B5.7030109-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-13 14:44     ` Vivek Goyal
2009-05-13 15:29     ` Vivek Goyal
2009-05-13 15:59     ` Vivek Goyal
2009-05-13 17:17     ` Vivek Goyal
2009-05-13 19:09     ` Vivek Goyal
2009-05-13 17:17   ` Vivek Goyal
     [not found]     ` <20090513171734.GA18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-14  1:24       ` Gui Jianfeng
2009-05-14  1:24     ` Gui Jianfeng
2009-05-13 19:09   ` Vivek Goyal
2009-05-14  1:35     ` Gui Jianfeng
     [not found]     ` <20090513190929.GB18371-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-05-14  1:35       ` Gui Jianfeng
2009-05-14  7:26       ` Gui Jianfeng
2009-05-14  7:26     ` Gui Jianfeng
2009-05-14 15:15       ` Vivek Goyal
2009-05-18 22:33       ` IKEDA, Munehiro
2009-05-20  1:44         ` Gui Jianfeng
     [not found]           ` <4A136090.5090705-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-20 15:41             ` IKEDA, Munehiro
2009-05-20 15:41               ` IKEDA, Munehiro
     [not found]         ` <4A11E244.2000305-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
2009-05-20  1:44           ` Gui Jianfeng
     [not found]       ` <4A0BC7AB.8030703-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2009-05-14 15:15         ` Vivek Goyal
2009-05-18 22:33         ` IKEDA, Munehiro
