* [RFC] IO scheduler based IO controller V6
From: Vivek Goyal @ 2009-07-02 20:01 UTC
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz


Hi All,

Here is the V6 of the IO controller patches generated on top of 2.6.31-rc1.

Previous versions of the patches were posted here.

(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580
(V5) http://lkml.org/lkml/2009/6/19/279

This patchset is still a work in progress, but I want to keep posting
snapshots of my tree at regular intervals to get feedback, hence V6.

Changes from V5
===============
- Broke down two of the biggest patches into smaller patches. The core bfq
  scheduler changes are now a separate patch, which should make review a bit
  easier. I will try to break the patches down even more.

	- Broke out bfq core scheduler changes from flat fair queuing code.
	- Created a separate patch for in-class preemption logic.
	- Created a separate patch for core bfq hierarchical scheduler
	  changes.
	- Created a separate patch for cgroup related bits.

- Introduced a new patch to wait for requests to complete from previous
  queue before next queue is scheduled. It helps in achieving better
  accounting of disk time used by writes and hence better isolation between
  reads and buffered writes. This helps achieve fairness between sync queues
  and buffered writes.

- Merged gui's patch for optimization during io group deletion.

- Merged gui's per device rule interface patch resulting from Paul Menage's
  feedback.

- Merged gui's patch to read group data under rcu lock instead of taking
  spin lock.

- Took care of some of the balbir's review comments on V5.

	- Got rid of additional user defined data types "bfq_timestamp_t",
	  "bfq_weight_t" and "bfq_service_t".
	- Changed data type of "weight" to unsigned int.
	- replaced *_extract() function names with *_remove().
	- Renamed some of the bfq_* functions to io_* in comments.

- Misc code cleanups

	- Moved io_get_io_group() and other common changes from patch
	  "implement per group bdi congestion interface" to upper patches.
	- Made lots of functions static.
	- Got rid of some forward declarations.
	- Replaced rq_ioq() with req_ioq() and moved it to blkdev.h
	- Some comment cleanups.
	- Got rid of elv_ioq_set_slice_end()
	- Got rid of redundant declaration of io_disconnect_groups().
	- Got rid of io_group_ioq()

Limitations
===========

- This IO controller provides bandwidth control at the IO scheduler
  level (leaf node in a stacked hierarchy of logical devices). So there can
  be cases (depending on configuration) where an application does not see
  proportional BW division at a higher level logical device.

  LWN has written an article about the issue here.

	http://lwn.net/Articles/332839/

How to solve the issue of fairness at higher level logical devices
==================================================================
A couple of suggestions have come forward.

- Implement IO control at the IO scheduler layer and then, with the help of
  some daemon, adjust the weights on underlying devices dynamically, depending
  on what kind of BW guarantees are to be achieved at higher level logical
  block devices.

- Also implement a higher level IO controller along with IO scheduler
  based controller and let user choose one depending on his needs.

  A higher level controller does not know about the assumptions/policies
  of the underlying IO scheduler, hence it has the potential to break
  the IO scheduler's policy within a cgroup. A lower level controller
  can work with the IO scheduler much more closely and efficiently.
 
Other active IO controller developments
=======================================

IO throttling
-------------

  This is a max bandwidth controller and not a proportional one. Secondly,
  it is a second level controller which can break the IO scheduler's
  policy/assumptions within a cgroup.

dm-ioband
---------

 This is a proportional bandwidth controller implemented as a device mapper
 driver. It is also a second level controller which can break the
 IO scheduler's policy/assumptions within a cgroup.

TODO
====
- Lots of code cleanups, testing, bug fixing, optimizations, benchmarking
  etc...

- Improve time keeping so that sub jiffy queue expiry time can be accounted
  for.

- Work on a better interface (possibly cgroup based) for configuring per
  group request descriptor limits.

- Debug and fix some of the areas like page cache where higher weight cgroup
  async writes are stuck behind lower weight cgroup async writes.

Testing
=======

I have been able to do some testing as follows. All my testing is on an ext3
file system on a SATA drive which supports a queue depth of 31.

Test1 (Fairness for synchronous reads)
======================================
- Ran two "dd" readers in two cgroups with cgroup weights 1000 and 500 (with
  the CFQ scheduler and /sys/block/<device>/queue/iosched/fairness = 1).

  The higher weight dd finishes first, and at that point my script reads
  the cgroup files io.disk_time and io.disk_sectors for both groups and
  displays the results.

  dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
  dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

  234179072 bytes (234 MB) copied, 3.9065 s, 59.9 MB/s
  234179072 bytes (234 MB) copied, 5.19232 s, 45.1 MB/s

  group1 time=8 16 2471 group1 sectors=8 16 457840
  group2 time=8 16 1220 group2 sectors=8 16 225736

The first two fields in the time and sectors statistics are the major and minor
number of the device. The third field is the disk time in milliseconds or the
number of sectors transferred, respectively.

This patchset tries to provide fairness in terms of disk time received. group1
got about double the disk time of group2 (at the time the first dd finished).
These time and sectors statistics can be read from the io.disk_time and
io.disk_sectors files in the cgroup. More about it in the documentation file.

Test2 (Reader Vs Buffered Writes)
=================================
Buffered writes can be problematic and can overwhelm readers, especially with
noop and deadline. The IO controller can provide isolation between readers and
buffered (async) writers.

First I ran the test without the IO controller to see the severity of the issue.
I ran a hostile writer, started a reader 10 seconds later, and then
monitored the completion time of the reader. The reader reads a 256 MB file.
Tested this with the noop scheduler.

sample script
------------
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/dev/zero of=/mnt/sdb/reader-writer-zerofile bs=4K count=2097152 \
	conv=fdatasync &
sleep 10
time dd if=/mnt/sdb/256M-file of=/dev/null &

Results
-------
8589934592 bytes (8.6 GB) copied, 106.045 s, 81.0 MB/s (Writer)
268435456 bytes (268 MB) copied, 96.5237 s, 2.8 MB/s (Reader)

Next I tested whether the IO controller can provide isolation between
readers and writers with noop. I created two cgroups of weight 1000 each,
put the reader in group1 and the writer in group2, and ran the test again. Upon
completion of the reader, my script reads the io.disk_time and io.disk_sectors
cgroup files to get an estimate of how much disk time each group got and how
many sectors each group did IO for.

For more accurate accounting of disk time for buffered writes with queuing
hardware I had to set /sys/block/<disk>/queue/iosched/fairness to "2".

sample script
-------------
echo $$ > /cgroup/bfqio/test2/tasks
dd if=/dev/zero of=/mnt/$BLOCKDEV/testzerofile bs=4K count=2097152 &
sleep 10
echo noop > /sys/block/$BLOCKDEV/queue/scheduler
echo  2 > /sys/block/$BLOCKDEV/queue/iosched/fairness
echo $$ > /cgroup/bfqio/test1/tasks
dd if=/mnt/$BLOCKDEV/256M-file of=/dev/null &
wait $!
# Some code for reading cgroup files upon completion of reader.
-------------------------
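
The step marked "Some code for reading cgroup files" is not shown in this
mail. A minimal sketch of what such a step could look like follows. The helper
name read_group_stat and the CGROUP_ROOT variable are made up for illustration,
and the sketch assumes that each line of io.disk_time holds "major minor value",
as the results in this mail suggest.

```shell
#!/bin/sh
# read_group_stat: hypothetical helper, not part of the patchset.
# $1 = cgroup statistics file, $2 = device major, $3 = device minor.
# Prints the third field of the line that matches the given device.
# Assumes a "major minor value" line layout (an assumption, see above).
read_group_stat() {
	awk -v maj="$2" -v min="$3" '$1 == maj && $2 == min { print $3 }' "$1"
}

# Mount point taken from the scripts in this mail; override if it differs.
CGROUP_ROOT=${CGROUP_ROOT:-/cgroup/bfqio}

for grp in test1 test2; do
	f="$CGROUP_ROOT/$grp/io.disk_time"
	# Skip groups whose statistics file is not readable on this system.
	if [ -r "$f" ]; then
		echo "$grp time=$(read_group_stat "$f" 8 16)"
	fi
done
```

The device numbers 8 and 16 are the ones that appear in the results of this
test; a different disk would need its own major and minor numbers.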

Results
=======
268435456 bytes (268 MB) copied, 6.65819 s, 40.3 MB/s (Reader) 

group1 time=8 16 3063	group1 sectors=8 16 524808
group2 time=8 16 3071	group2 sectors=8 16 441752

Note that the reader now finishes in much less time (6.7 s instead of 96.5 s)
and both group1 and group2 got about 3 seconds of disk time. Hence the
IO controller provides isolation from buffered writes.

Test3 (AIO)
===========

AIO reads
-----------
Set up two fio AIO read jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.

---------------------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight

fio_args="--ioengine=libaio --rw=read --size=512M --direct=1"
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
	--output=/mnt/$BLOCKDEV/fio1/test1.log \
	--exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
	--output=/mnt/$BLOCKDEV/fio2/test2.log &
----------------------------------------------------------------

test1 and test2 are two groups with weights 1000 and 500 respectively.
"read-and-display-group-stats.sh" is a small script which reads the
test1 and test2 cgroup files to determine how much disk time each group
got by the time the first fio job finished.

Results
------
test1 statistics: time=8 16 22403   sectors=8 16 1049640
test2 statistics: time=8 16 11400   sectors=8 16 552864

The above shows that by the time the first fio job (higher weight) finished,
group test1 got 22403 ms of disk time and group test2 got 11400 ms of disk
time. The statistics for the number of sectors transferred are shown as well.

Note that the disk time given to group test1 is almost double that of group
test2.
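
As a rough cross-check, the share of disk time that test1 received can be
compared with the share its weight implies. The snippet below is only a sketch;
the helper name share_pct is made up for illustration and is not part of the
patchset. It prints the first value as a percentage of the sum of both values.

```shell
#!/bin/sh
# share_pct: hypothetical helper, not part of the patchset.
# Prints $1 as a percentage of ($1 + $2), rounded to one decimal place.
share_pct() {
	awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / (a + b) }'
}

# Share implied by the weights: test1 (1000) competing with test2 (500).
share_pct 1000 500
# Share of disk time test1 actually received, from the AIO read results.
share_pct 22403 11400
```

With these numbers the snippet prints 66.7 for the weights and 66.3 for the
measured disk time, so the observed split is close to the configured one.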

AIO writes
----------
Set up two fio AIO direct write jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.

------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight
fio_args="--ioengine=libaio --rw=write --size=512M --direct=1"

echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
	--output=/mnt/$BLOCKDEV/fio1/test1.log \
	--exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
	--output=/mnt/$BLOCKDEV/fio2/test2.log &
-------------------------------------------------

test1 and test2 are two groups with weights 1000 and 500 respectively.
"read-and-display-group-stats.sh" is a small script which reads the
test1 and test2 cgroup files to determine how much disk time each group
got by the time the first fio job finished.

Following are the results.

test1 statistics: time=8 16 29085   sectors=8 16 1049656
test2 statistics: time=8 16 14652   sectors=8 16 516728

The above shows that by the time the first fio job (higher weight) finished,
group test1 got 29085 ms of disk time and group test2 got 14652 ms of disk
time. The statistics for the number of sectors transferred are shown as well.

Note that the disk time given to group test1 is almost double that of group
test2.

Test4 (Writes with O_SYNC)
==========================
Created two groups with weight 1000 and 500 and launched two fio jobs doing
sync writes.

sample script
---------------------------
fio_args="--size=256m --rw=write --numjobs=1 --group_reporting --sync=1"

echo $$ > /cgroup/bfqio/test1/tasks
time fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
	--output=/mnt/$BLOCKDEV/fio1/test1.log > /dev/null &

echo $$ > /cgroup/bfqio/test2/tasks
time fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
	--output=/mnt/$BLOCKDEV/fio2/test2.log > /dev/null &

# some code to read group data upon completion of first fio job
----------------------------

Results
-------
group1 time=8 16 15194	group1 sectors=8 16 524864
group2 time=8 16 7689	group2 sectors=8 16 258920

Note that group1 got almost double the disk time of group2, as per the weight
settings.

Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
===================================================================
Fairness for async writes is tricky, and the biggest reason is that async writes
are cached in higher layers (page cache) as well as possibly in the file system
layer (btrfs, xfs etc.), and are not necessarily dispatched to lower layers
in a proportional manner.

For example, consider two dd threads reading /dev/zero as the input file and
writing huge files. Very soon we will cross vm_dirty_ratio and a dd thread will
be forced to write out some pages to disk before more pages can be dirtied. But
the dirty pages picked are not necessarily those of the same thread. It can very
well pick the inode of the lower priority dd thread and do some writeout. So
effectively the higher weight dd is doing writeouts of the lower weight dd's
pages and we don't see service differentiation.

IOW, the core problem with async write fairness is that the higher weight thread
does not throw enough IO traffic at the IO controller to keep its queue
continuously backlogged. In my testing, there are many 0.2 to 0.8 second
intervals where the higher weight queue is empty, and in that duration the lower
weight queue gets a lot of work done, giving the impression that there was no
service differentiation.

In summary, from the IO controller's point of view async write support is there.
But because the page cache has not been designed so that a higher prio/weight
writer can do more writeout than a lower prio/weight writer, getting service
differentiation is hard: it is visible in some cases and not in others.

Do we really care that much for fairness among two writer cgroups? One can
choose to do direct writes or sync writes if fairness for writes really
matters for him.

Following is the only case where it is hard to ensure fairness between cgroups.

- Buffered writes Vs Buffered Writes.

So to test async writes I created two partitions on a disk and created ext3
file systems on both partitions. I also created two cgroups, generated
lots of write traffic in the two cgroups (50 fio threads) and watched the disk
time statistics in the respective cgroups at an interval of 2 seconds. Thanks
to Ryo Tsuruta for the test case.

*****************************************************************
sync
echo 3 > /proc/sys/vm/drop_caches

fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/sdd1/fio/ --output=/mnt/sdd1/fio/test1.log &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/sdd2/fio/ --output=/mnt/sdd2/fio/test2.log &
*********************************************************************** 

I watched the disk time and sector statistics for both cgroups
every 2 seconds using a script. Here is a snippet from the output.

test1 statistics: time=8 48 1315   sectors=8 48 55776 dq=8 48 1
test2 statistics: time=8 48 633   sectors=8 48 14720 dq=8 48 2

test1 statistics: time=8 48 5586   sectors=8 48 339064 dq=8 48 2
test2 statistics: time=8 48 2985   sectors=8 48 146656 dq=8 48 3

test1 statistics: time=8 48 9935   sectors=8 48 628728 dq=8 48 3
test2 statistics: time=8 48 5265   sectors=8 48 278688 dq=8 48 4

test1 statistics: time=8 48 14156   sectors=8 48 932488 dq=8 48 6
test2 statistics: time=8 48 7646   sectors=8 48 412704 dq=8 48 7

test1 statistics: time=8 48 18141   sectors=8 48 1231488 dq=8 48 10
test2 statistics: time=8 48 9820   sectors=8 48 548400 dq=8 48 8

test1 statistics: time=8 48 21953   sectors=8 48 1485632 dq=8 48 13
test2 statistics: time=8 48 12394   sectors=8 48 698288 dq=8 48 10

test1 statistics: time=8 48 25167   sectors=8 48 1705264 dq=8 48 13
test2 statistics: time=8 48 14042   sectors=8 48 817808 dq=8 48 10

First two fields in time and sectors statistics represent major and minor
number of the device. Third field represents disk time in milliseconds and
number of sectors transferred respectively.

So the disk time consumed by test1 is close to double that of test2 in this
case.
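
To follow how the ratio develops from sample to sample, the statistics lines
can be run through a small filter. This is only a sketch: the function name
time_ratios is made up for illustration, and it assumes the "testN statistics:
time=MAJ MIN VALUE ..." line layout shown in the snippet above, with each test1
line followed by its matching test2 line.

```shell
#!/bin/sh
# time_ratios: hypothetical helper, not part of the patchset.
# Reads statistics lines on stdin and prints test1/test2 disk time per sample.
# Field 5 is the disk time, given the "time=MAJ MIN VALUE" layout assumed here.
time_ratios() {
	awk '$1 == "test1" { t1 = $5 }
	     $1 == "test2" && t1 > 0 && $5 > 0 {
		printf "%.2f\n", t1 / $5
		t1 = 0	# wait for the next test1 line before printing again
	     }'
}

# First and last samples from the output above.
time_ratios <<'EOF'
test1 statistics: time=8 48 1315   sectors=8 48 55776 dq=8 48 1
test2 statistics: time=8 48 633   sectors=8 48 14720 dq=8 48 2

test1 statistics: time=8 48 25167   sectors=8 48 1705264 dq=8 48 13
test2 statistics: time=8 48 14042   sectors=8 48 817808 dq=8 48 10
EOF
```

For these two samples the filter prints 2.08 and 1.79, so the ratio starts
slightly above 2 and drifts somewhat below it as the run goes on.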

Thanks
Vivek

^ permalink raw reply	[flat|nested] 191+ messages in thread

* [RFC] IO scheduler based IO controller V6
@ 2009-07-02 20:01 ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal


Hi All,

Here is the V6 of the IO controller patches generated on top of 2.6.31-rc1.

Previous versions of the patches was posted here.

(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580
(V5) http://lkml.org/lkml/2009/6/19/279

This patchset is still work in progress but I want to keep on getting the
snapshot of my tree out at regular intervals to get the feedback hence V6.

Changes from V5
===============
- Broke down two of the biggest patches in to smaller patches. Now core of
  bfq scheduler patches are separate patch and it should make review a bit
  easier. I will try to break the patches down even more.

	- Broke out bfq core scheduler changes from flat fair queuing code.
	- Created separate patch for in class preemtion logic.
	- Created separate patch to for core bfq hierarchical scheduler
	  changes. 
	- Created a separate patch for cgroup related bits.

- Introduced a new patch to wait for requests to complete from previous
  queue before next queue is scheduled. It helps in achieving better
  accounting of disk time used by writes and hence better isolation between
  reads and buffered writes. This helps achieve fairness between sync queues
  and buffered writes.

- Merged gui's patch for optimization during io group deletion.

- Merged gui's per device rule interface patch resulting from Paul Menage's
  feedback.

- Merged gui's patch to read group data under rcu lock instead of taking
  spin lock.

- Took care of some of the balbir's review comments on V5.

	- Got rid of additional user defined data tyepes. "bfq_timestamp_t",
	  bfq_weight_t and bfq_service_t.
	- Changed data type of "weight" to unsigned int.
	- replaced *_extract() function names with *_remove().
	- Renamed some of the bfq_* functions to io_* in comments.

- Misc code cleanups

	- Moved io_get_io_group() and other common changes from patch
	  "implement per group bdi congestion interface" to upper patches.
	- Made lots of functions static.
	- Got rid of some forward declarations.
	- Replaced rq_ioq() with req_ioq() and moved it to blkdev.h
	- Some comment cleanups.
	- Got rid of elv_ioq_set_slice_end()
	- Got rid of redundant declaration of io_disconnect_groups().
	- Got rid of io_group_ioq()

Limitations
===========

- This IO controller provides the bandwidth control at the IO scheduler
  level (leaf node in stacked hiearchy of logical devices). So there can
  be cases (depending on configuration) where application does not see
  proportional BW division at higher logical level device.

  LWN has written an article about the issue here.

	http://lwn.net/Articles/332839/

How to solve the issue of fairness at higher level logical devices
==================================================================
Couple of suggestions have come forward.

- Implement IO control at IO scheduler layer and then with the help of
  some daemon, adjust the weight on underlying devices dynamiclly, depending
  on what kind of BW gurantees are to be achieved at higher level logical
  block devices.

- Also implement a higher level IO controller along with IO scheduler
  based controller and let user choose one depending on his needs.

  A higher level controller does not know about the assumptions/policies
  of unerldying IO scheduler, hence it has the potential to break down
  the IO scheduler's policy with-in cgroup. A lower level controller
  can work with IO scheduler much more closely and efficiently.
 
Other active IO controller developments
=======================================

IO throttling
-------------

  This is a max bandwidth controller and not the proportional one. Secondly
  it is a second level controller which can break the IO scheduler's
  policy/assumtions with-in cgroup. 

dm-ioband
---------

 This is a proportional bandwidth controller implemented as device mapper
 driver. It is also a second level controller which can break the
 IO scheduler's policy/assumptions with-in cgroup.

TODO
====
- Lots of code cleanups, testing, bug fixing, optimizations, benchmarking
  etc...

- Improve time keeping so that sub jiffy queue expiry time can be accounted
  for.

- Work on a better interface (possibly cgroup based) for configuring per
  group request descriptor limits.

- Debug and fix some of the areas like page cache where higher weight cgroup
  async writes are stuck behind lower weight cgroup async writes.

Testing
=======

I have been able to do some testing as follows. All my testing is with ext3
file system with a SATA drive which supports queue depth of 31.

Test1 (Fairness for synchronous reads)
======================================
- Two dd in two cgroups with cgrop weights 1000 and 500. Ran two "dd" in those
  cgroups (With CFQ scheduler and /sys/block/<device>/queue/fairness = 1)

  Higher weight dd finishes first and at that point of time my script takes
  care of reading cgroup files io.disk_time and io.disk_sectors for both the
  groups and display the results.

  dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
  dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

  234179072 bytes (234 MB) copied, 3.9065 s, 59.9 MB/s
  234179072 bytes (234 MB) copied, 5.19232 s, 45.1 MB/s

  group1 time=8 16 2471 group1 sectors=8 16 457840
  group2 time=8 16 1220 group2 sectors=8 16 225736

First two fields in time and sectors statistics represent major and minor
number of the device. Third field represents disk time in milliseconds and
number of sectors transferred respectively.

This patchset tries to provide fairness in terms of disk time received. group1
got almost double of group2 disk time (At the time of first dd finish). These
time and sectors statistics can be read using io.disk_time and io.disk_sector
files in cgroup. More about it in documentation file.

Test2 (Reader Vs Buffered Writes)
================================
Buffered writes can be problematic and can overwhelm readers, especially with
noop and deadline. IO controller can provide isolation between readers and
buffered (async) writers.

First I ran the test without io controller to see the severity of the issue.
Ran a hostile writer and then after 10 seconds started a reader and then
monitored the completion time of reader. Reader reads a 256 MB file. Tested
this with noop scheduler.

sample script
------------
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/dev/zero of=/mnt/sdb/reader-writer-zerofile bs=4K count=2097152
conv=fdatasync &
sleep 10
time dd if=/mnt/sdb/256M-file of=/dev/null &

Results
-------
8589934592 bytes (8.6 GB) copied, 106.045 s, 81.0 MB/s (Writer)
268435456 bytes (268 MB) copied, 96.5237 s, 2.8 MB/s (Reader)

Now it was time to test io controller whether it can provide isolation between
readers and writers with noop. I created two cgroups of weight 1000 each and
put reader in group1 and writer in group 2 and ran the test again. Upon
comletion of reader, my scripts read io.dis_time and io.disk_group cgroup
files to get an estimate how much disk time each group got and how many
sectors each group did IO for. 

For more accurate accounting of disk time for buffered writes with queuing
hardware I had to set /sys/block/<disk>/queue/iosched/fairness to "2".

sample script
-------------
echo $$ > /cgroup/bfqio/test2/tasks
dd if=/dev/zero of=/mnt/$BLOCKDEV/testzerofile bs=4K count=2097152 &
sleep 10
echo noop > /sys/block/$BLOCKDEV/queue/scheduler
echo  2 > /sys/block/$BLOCKDEV/queue/iosched/fairness
echo $$ > /cgroup/bfqio/test1/tasks
dd if=/mnt/$BLOCKDEV/256M-file of=/dev/null &
wait $!
# Some code for reading cgroup files upon completion of reader.
-------------------------

Results
=======
268435456 bytes (268 MB) copied, 6.65819 s, 40.3 MB/s (Reader) 

group1 time=8 16 3063	group1 sectors=8 16 524808
group2 time=8 16 3071	group2 sectors=8 16 441752

Note, reader finishes now much lesser time and both group1 and group2
got almost 3 seconds of disk time. Hence io-controller provides isolation
from buffered writes.

Test3 (AIO)
===========

AIO reads
-----------
Set up two fio, AIO read jobs in two cgroup with weight 1000 and 500
respectively. I am using cfq scheduler. Following are some lines from my test
script.

---------------------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight

fio_args="--ioengine=libaio --rw=read --size=512M --direct=1"
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/
--output=/mnt/$BLOCKDEV/fio1/test1.log
--exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/
--output=/mnt/$BLOCKDEV/fio2/test2.log &
----------------------------------------------------------------

test1 and test2 are two groups with weight 1000 and 500 respectively.
"read-and-display-group-stats.sh" is one small script which reads the
test1 and test2 cgroup files to determine how much disk time each group
got till first fio job finished.

Results
------
test1 statistics: time=8 16 22403   sectors=8 16 1049640
test2 statistics: time=8 16 11400   sectors=8 16 552864

Above shows that by the time first fio (higher weight), finished, group
test1 got 22403 ms of disk time and group test2 got 11400 ms of disk time.
similarly the statistics for number of sectors transferred are also shown.

Note that disk time given to group test1 is almost double of group2 disk
time.

AIO writes
----------
Set up two fio, AIO direct write jobs in two cgroup with weight 1000 and 500
respectively. I am using cfq scheduler. Following are some lines from my test
script.

------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight
fio_args="--ioengine=libaio --rw=write --size=512M --direct=1"

echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/
--output=/mnt/$BLOCKDEV/fio1/test1.log
--exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/
--output=/mnt/$BLOCKDEV/fio2/test2.log &
-------------------------------------------------

test1 and test2 are two groups with weight 1000 and 500 respectively.
"read-and-display-group-stats.sh" is one small script which reads the
test1 and test2 cgroup files to determine how much disk time each group
got till first fio job finished.

Following are the results.

test1 statistics: time=8 16 29085   sectors=8 16 1049656
test2 statistics: time=8 16 14652   sectors=8 16 516728

Above shows that by the time first fio (higher weight), finished, group
test1 got 28085 ms of disk time and group test2 got 14652 ms of disk time.
similarly the statistics for number of sectors transferred are also shown.

Note that disk time given to group test1 is almost double of group2 disk
time.

Test4 (Writes with O_SYNC)
==========================
Created two groups with weight 1000 and 500 and launched two fio jobs doing
sync writes.

sample script
---------------------------
fio_args="--size=256m --rw=write --numjobs=1 --group_reporting --sync=1"

echo $$ > /cgroup/bfqio/test1/tasks
time fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/
--output=/mnt/$BLOCKDEV/fio1/test1.log > /dev/null &

echo $$ > /cgroup/bfqio/test2/tasks
time fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/
--output=/mnt/$BLOCKDEV/fio2/test2.log > /dev/null &

# some code to read group data upon completion of first fio job
----------------------------

Results
-------
group1 time=8 16 15194	group1 sectors=8 16 524864
group2 time=8 16 7689	group2 sectors=8 16 258920

Note, group 1 got almost double of group2 time as per the weight settings.

Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
===================================================================
Fairness for async writes is tricky and biggest reason is that async writes
are cached in higher layers (page cahe) as well as possibly in file system
layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
in proportional manner.

For example, consider two dd threads reading /dev/zero as input file and doing
writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
be forced to write out some pages to disk before more pages can be dirtied. But
not necessarily dirty pages of same thread are picked. It can very well pick
the inode of lesser priority dd thread and do some writeout. So effectively
higher weight dd is doing writeouts of lower weight dd pages and we don't see
service differentation.

IOW, the core problem with async write fairness is that higher weight thread
does not throw enought IO traffic at IO controller to keep the queue
continuously backlogged. In my testing, there are many .2 to .8 second
intervals where higher weight queue is empty and in that duration lower weight
queue get lots of job done giving the impression that there was no service
differentiation.

In summary, from IO controller point of view async writes support is there.
Because page cache has not been designed in such a manner that higher 
prio/weight writer can do more write out as compared to lower prio/weight
writer, gettting service differentiation is hard and it is visible in some
cases and not visible in some cases.

Do we really care that much for fairness among two writer cgroups? One can
choose to do direct writes or sync writes if fairness for writes really
matters for him.

Following is the only case where it is hard to ensure fairness between cgroups.

- Buffered writes Vs Buffered Writes.

So to test async writes I created two partitions on a disk and created ext3
file systems on both the partitions.  Also created two cgroups and generated
lots of write traffic in two cgroups (50 fio threads) and watched the disk
time statistics in respective cgroups at the interval of 2 seconds. Thanks to
ryo tsuruta for the test case.

*****************************************************************
sync
echo 3 > /proc/sys/vm/drop_caches

fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/sdd1/fio/ --output=/mnt/sdd1/fio/test1.log &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/sdd2/fio/ --output=/mnt/sdd2/fio/test2.log &
*********************************************************************** 

And watched the disk time and sector statistics for the both the cgroups
every 2 seconds using a script. How is snippet from output.

test1 statistics: time=8 48 1315   sectors=8 48 55776 dq=8 48 1
test2 statistics: time=8 48 633   sectors=8 48 14720 dq=8 48 2

test1 statistics: time=8 48 5586   sectors=8 48 339064 dq=8 48 2
test2 statistics: time=8 48 2985   sectors=8 48 146656 dq=8 48 3

test1 statistics: time=8 48 9935   sectors=8 48 628728 dq=8 48 3
test2 statistics: time=8 48 5265   sectors=8 48 278688 dq=8 48 4

test1 statistics: time=8 48 14156   sectors=8 48 932488 dq=8 48 6
test2 statistics: time=8 48 7646   sectors=8 48 412704 dq=8 48 7

test1 statistics: time=8 48 18141   sectors=8 48 1231488 dq=8 48 10
test2 statistics: time=8 48 9820   sectors=8 48 548400 dq=8 48 8

test1 statistics: time=8 48 21953   sectors=8 48 1485632 dq=8 48 13
test2 statistics: time=8 48 12394   sectors=8 48 698288 dq=8 48 10

test1 statistics: time=8 48 25167   sectors=8 48 1705264 dq=8 48 13
test2 statistics: time=8 48 14042   sectors=8 48 817808 dq=8 48 10

The first two fields in the time and sectors statistics are the major and
minor number of the device. The third field is the disk time in milliseconds
(for time) or the number of sectors transferred (for sectors).
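
For reference, a minimal sketch of such a monitoring script. It assumes the
per-group files io.disk_time and io.disk_sectors each hold one
"major minor value" line per device, as described in the documentation patch.
The helper names stat_value and print_group_stats are made up for this
illustration, and the dq column is left out.

```shell
#!/bin/sh
# Hypothetical helper: print the third field (the counter) of the
# "major minor value" line that matches one device in a stats file.
stat_value() {
    # $1 = stats file, $2 = major, $3 = minor
    awk -v maj="$2" -v mnr="$3" '$1 == maj && $2 == mnr { print $3 }' "$1"
}

# Hypothetical helper: print one "statistics" line for a group.
print_group_stats() {
    # $1 = cgroup mount root, $2 = group name, $3 = major, $4 = minor
    dir="$1/$2"
    t=$(stat_value "$dir/io.disk_time" "$3" "$4")
    s=$(stat_value "$dir/io.disk_sectors" "$3" "$4")
    echo "$2 statistics: time=$3 $4 $t   sectors=$3 $4 $s"
}

# Polling loop (left commented out so the helpers can be sourced safely);
# /cgroup/bfqio and device 8:48 are taken from the test above.
# while true; do
#     print_group_stats /cgroup/bfqio test1 8 48
#     print_group_stats /cgroup/bfqio test2 8 48
#     echo
#     sleep 2
# done
```

Passing the mount root and the device numbers as arguments keeps the helpers
usable on a different mount point or disk.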

So disk time consumed by test1 is almost double that of test2 in this case.
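
As a quick check of that statement, the ratio can be computed from the last
sample above (25167 ms against 14042 ms), which works out to about 1.79. A
small sketch, where the function name ratio is just an illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the ratio of two counters with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Disk time of test1 vs test2, taken from the last sample shown above.
ratio 25167 14042
```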

Thanks
Vivek

* [PATCH 01/25] io-controller: Documentation
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 02/25] io-controller: Core of the B-WF2Q+ scheduler Vivek Goyal
                     ` (26 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o Documentation for io-controller.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 Documentation/block/00-INDEX          |    2 +
 Documentation/block/io-controller.txt |  378 +++++++++++++++++++++++++++++++++
 2 files changed, 380 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/block/io-controller.txt

diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index 961a051..dc8bf95 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -10,6 +10,8 @@ capability.txt
 	- Generic Block Device Capability (/sys/block/<disk>/capability)
 deadline-iosched.txt
 	- Deadline IO scheduler tunables
+io-controller.txt
+	- IO controller for providing hierarchical IO scheduling
 ioprio.txt
 	- Block io priorities (in CFQ scheduler)
 request.txt
diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
index 0000000..b2a96b3
--- /dev/null
+++ b/Documentation/block/io-controller.txt
@@ -0,0 +1,378 @@
+				IO Controller
+				=============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is, one
+can create cgroups and assign prio/weights to them, and each task group
+will get access to the disk in proportion to the weight of the group.
+
+These patches modify the elevator layer and individual IO schedulers to do
+IO control. Hence this IO controller works only on block devices which use
+one of the standard IO schedulers; it can not be used with an arbitrary
+logical block device.
+
+The assumption/thought behind modifying IO scheduler is that resource control
+is needed only on leaf nodes where the actual contention for resources is
+present and not on intermediate logical block devices.
+
+Consider the following hypothetical scenario. Let's say there are three
+physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have
+been created on top of these. Part of sdb is in lv0 and part is in lv1.
+
+			    lv0      lv1
+			  /	\  /     \
+			sda      sdb      sdc
+
+Also consider the following cgroup hierarchy:
+
+				root
+				/   \
+			       A     B
+			      / \    / \
+			     T1 T2  T3  T4
+
+A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
+Assume T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
+only, there will not be any contention for resources between group A and B if
+IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
+the IO scheduler associated with sdb will distribute disk bandwidth to
+groups A and B in proportion to their weights.
+
+CFQ already has the notion of fairness and it provides differential disk
+access based on priority and class of the task. However, it is flat, and
+with cgroups it needs to be made hierarchical to achieve good
+hierarchical control of IO.
+
+The rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split in read and write queue
+for deadline and AS). With this patchset, now we maintain one queue per
+cgroup per device and then try to do fair queuing among those queues.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code which has been moved to a common layer (elevator
+layer). Hence we don't end up replicating code across IO schedulers. Following
+diagram depicts the concept.
+
+			--------------------------------
+			| Elevator Layer + Fair Queuing |
+			--------------------------------
+			 |	     |	     |       |
+			NOOP     DEADLINE    AS     CFQ
+
+Design
+======
+This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
+fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
+B-WF2Q+ algorithm for fair queuing.
+
+Why BFQ?
+
+- Not sure if weighted round robin logic of CFQ can be easily extended for
+  hierarchical mode. One of the things is that we can not keep dividing
+  the time slice of parent group among children. The deeper we go in the
+  hierarchy, the smaller the time slice will get.
+
+  One of the ways to implement hierarchical support could be to keep track
+  of virtual time and service provided to queue/group and select a queue/group
+  for service based on any of the various available algorithms.
+
+  BFQ already had support for hierarchical scheduling, so taking those
+  patches was easier.
+
+- BFQ was designed to provide tighter bounds/delay w.r.t service provided
+  to a queue. Delay/Jitter with BFQ is O(1).
+
+  Note: BFQ originally used amount of IO done (number of sectors) as notion
+        of service provided. IOW, it tried to provide fairness in terms of
+        actual IO done and not in terms of actual time disk access was
+	given to a queue.
+
+	This patchset modified BFQ to provide fairness in time domain because
+	that's what CFQ does. So the idea was to try not to deviate too much from
+	the CFQ behavior initially.
+
+	Providing fairness in time domain makes accounting tricky because
+	due to command queueing, at one time there might be multiple requests
+	from different queues and there is no easy way to find out how much
+	disk time actually was consumed by the requests of a particular
+	queue. More about this in comments in source code.
+
+We have taken BFQ code as starting point for providing fairness among groups
+because it already contained lots of features which we required to implement
+hierarchical IO scheduling. With this patch set, I am not trying to ensure O(1)
+delay here as my goal is to provide fairness among groups. Most likely that
+will mean that latencies are not worse than what cfq currently provides (if
+not improved ones). Once fairness is ensured, one can look more into
+ensuring O(1) latencies.
+
+From data structure point of view, one can think of a tree per device, where
+io groups and io queues are hanging and are being scheduled using B-WF2Q+
+algorithm. An io_queue is the end queue where requests are actually stored and
+dispatched from (like cfqq).
+
+These io queues are primarily created and managed by end io schedulers
+depending on their semantics. For example, noop, deadline and AS ioschedulers
+keep one io queue per cgroup and cfq keeps one io queue per io_context in
+a cgroup (apart from async queues).
+
+A request is mapped to an io group by elevator layer and which io queue it
+is mapped to within the group depends on ioscheduler. Currently "current" task
+is used to determine the cgroup (hence io group) of the request. Down the
+line we need to make use of bio-cgroup patches to map delayed writes to
+right group.
+
+Going back to old behavior
+==========================
+In new scheme of things essentially we are creating hierarchical fair
+queuing logic in elevator layer and changing IO schedulers to make use of
+that logic so that end IO schedulers start supporting hierarchical scheduling.
+
+Elevator layer continues to support the old interfaces. So even if fair queuing
+is enabled at elevator layer, one can have both new hierarchical scheduler as
+well as old non-hierarchical scheduler operating.
+
+Also noop, deadline and AS have option of enabling hierarchical scheduling.
+If it is selected, fair queuing is done in hierarchical manner. If hierarchical
+scheduling is disabled, noop, deadline and AS should retain their existing
+behavior.
+
+CFQ is the only exception where one can not disable fair queuing as it is
+needed for providing fairness among various threads even in non-hierarchical
+mode.
+
+Various user visible config options
+===================================
+CONFIG_IOSCHED_NOOP_HIER
+	- Enables hierarchical fair queuing in noop. Not selecting this option
+	  leads to old behavior of noop.
+
+CONFIG_IOSCHED_DEADLINE_HIER
+	- Enables hierarchical fair queuing in deadline. Not selecting this
+	  option leads to old behavior of deadline.
+
+CONFIG_IOSCHED_AS_HIER
+	- Enables hierarchical fair queuing in AS. Not selecting this option
+	  leads to old behavior of AS.
+
+CONFIG_IOSCHED_CFQ_HIER
+	- Enables hierarchical fair queuing in CFQ. Not selecting this option
+	  still does fair queuing among various queues but it is flat and not
+	  hierarchical.
+
+CGROUP_BLKIO
+	- This option enables blkio-cgroup controller for IO tracking
+	  purposes. That means, by this controller one can attribute a write
+	  to the original cgroup and not assume that it belongs to submitting
+	  thread.
+
+CONFIG_TRACK_ASYNC_CONTEXT
+	- Currently CFQ attributes the writes to the submitting thread and
+	  caches the async queue pointer in the io context of the process.
+	  If this option is set, it tells cfq and elevator fair queuing logic
+	  that for async writes make use of IO tracking patches and attribute
+	  writes to original cgroup and not to write submitting thread.
+
+CONFIG_DEBUG_GROUP_IOSCHED
+	- Throws extra debug messages in blktrace output, helpful in
+	  debugging a hierarchical setup.
+
+	- Also allows for export of extra debug statistics like group queue
+	  and dequeue statistics on device through cgroup interface.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+
+CONFIG_ELV_FAIR_QUEUING
+	- Enables/Disables the fair queuing logic at elevator layer.
+
+CONFIG_GROUP_IOSCHED
+	- Enables/Disables hierarchical queuing and associated cgroup bits.
+
+HOWTO
+=====
+So far I have done very simple testing of running two dd threads in two
+different cgroups. Here is what you can do.
+
+- Enable hierarchical scheduling in io scheduler of your choice (say cfq).
+	CONFIG_IOSCHED_CFQ_HIER=y
+
+- Enable IO tracking for async writes.
+	CONFIG_TRACK_ASYNC_CONTEXT=y
+
+  (This will automatically select CGROUP_BLKIO)
+
+- Compile and boot into kernel and mount IO controller and blkio io tracking
+  controller.
+
+	mount -t cgroup -o io,blkio none /cgroup
+
+- Create two cgroups
+	mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+	echo 1000 > /cgroup/test1/io.weight
+	echo 500 > /cgroup/test2/io.weight
+
+- Set "fairness" parameter to 1 at the disk you are testing.
+
+  echo 1 > /sys/block/<disk>/queue/iosched/fairness
+
+- Create two same size files (say 512MB each) on the same disk (zerofile1,
+  zerofile2) and launch two dd threads in different cgroups to read those
+  files. Make sure the right io scheduler is being used for the block device
+  where the files are present (the one you compiled in hierarchical mode).
+
+	sync
+	echo 3 > /proc/sys/vm/drop_caches
+
+	dd if=/mnt/sdb/zerofile1 of=/dev/null &
+	echo $! > /cgroup/test1/tasks
+	cat /cgroup/test1/tasks
+
+	dd if=/mnt/sdb/zerofile2 of=/dev/null &
+	echo $! > /cgroup/test2/tasks
+	cat /cgroup/test2/tasks
+
+- At macro level, the first dd should finish first. To get more precise data,
+  keep looking (with the help of a script) at the io.disk_time and
+  io.disk_sectors files of both test1 and test2 groups. This will tell how
+  much disk time (in milliseconds) each group got and how many sectors each
+  group dispatched to the disk. We provide fairness in terms of disk time, so
+  ideally io.disk_time of cgroups should be in proportion to the weight.
+  (It is hard to achieve though :-)).
+
+Regarding "fairness" parameter
+==============================
+IO controller has introduced a "fairness" tunable for every io scheduler.
+Currently this tunable can assume values 0, 1 and 2.
+
+There are places where io schedulers are geared towards throughput. Sometimes
+fairness and throughput don't go together. This tunable helps choose fairness
+over throughput depending on the need.
+
+0 --> We should achieve existing io scheduler behavior. Some of the special
+      fairness hooks are disabled.
+
+1 --> Helps in achieving better fairness for sync queues.
+
+2 --> If buffered writes come into play, use this. For example, if one cgroup
+      is doing sync reads and another cgroup is doing buffered writes, use "2"
+      to provide better fairness/isolation between the two cgroups.
+
+Details of cgroup files
+=======================
+- io.ioprio_class
+	- Specifies class of the cgroup (RT, BE, IDLE). This is default io
+	  class of the group on all the devices until and unless overridden by
+	  per device rule. (See io.policy).
+
+	  1 = RT, 2 = BE, 3 = IDLE
+
+- io.weight
+	- Specifies per cgroup weight. This is default weight of the group
+	  on all the devices until and unless overridden by per device rule.
+	  (See io.policy).
+
+- io.disk_time
+	- disk time allocated to cgroup per device in milliseconds. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the disk time allocated to group in
+	  milliseconds.
+
+- io.disk_sectors
+	- number of sectors transferred to/from disk by the group. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the number of sectors transferred by the
+	  group to/from the device.
+
+- io.disk_queue
+	- Debugging aid only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+	  gives the statistics about how many times a group was queued
+	  on service tree of the device. First two fields specify the major
+	  and minor number of the device and third field specifies the number
+	  of times a group was queued on a particular device.
+
+- io.disk_dequeue
+	- Debugging aid only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+	  gives the statistics about how many times a group was de-queued
+	  or removed from the service tree of the device. This basically gives
+	  an idea if we can generate enough IO to create continuously
+	  backlogged groups. First two fields specify the major and minor
+	  number of the device and third field specifies the number
+	  of times a group was de-queued on a particular device.
+
+- io.policy
+	- One can specify per cgroup per device rules using this interface.
+	  These rules override the default value of group weight and class as
+	  specified by io.weight and io.ioprio_class.
+
+	  Following is the format.
+
+	# echo dev_maj:dev_minor weight ioprio_class > /path/to/cgroup/io.policy
+
+	weight=0 means removing a policy.
+
+	Examples:
+
+	Configure weight=300 ioprio_class=2 on /dev/sdb (8:16) in this cgroup
+	# echo 8:16 300 2 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:16	300	2
+
+	Configure weight=500 ioprio_class=1 on /dev/sda (8:0) in this cgroup
+	# echo 8:0 500 1 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:0	500	1
+	8:16	300	2
+
+	Remove the policy for /dev/sda in this cgroup
+	# echo 8:0 0 1 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:16	300	2
+
+About configuring request descriptors
+=====================================
+Traditionally there are 128 request descriptors allocated per request queue
+where io scheduler is operating (/sys/block/<disk>/queue/nr_requests). If these
+request descriptors are exhausted, processes will be put to sleep and woken
+up once request descriptors are available.
+
+With io controller and cgroup stuff, one can not afford to allocate requests
+from single pool as one group might allocate lots of requests and then tasks
+from other groups might be put to sleep and this other group might be a
+higher weight group. Hence to make sure that a group always can get the
+request descriptors it is entitled to, one needs to make request descriptor
+limit per group on every queue.
+
+A new parameter /sys/block/<disk>/queue/nr_group_requests has been introduced
+and this parameter controls the maximum number of requests per group.
+nr_requests still continues to control total number of request descriptors
+on the queue.
+
+Ideally one should set nr_requests as follows.
+
+nr_requests = number_of_cgroups * nr_group_requests
+
+This will make sure that at any point of time nr_group_requests number of
+request descriptors will be available for any of the cgroups.
+
+Currently default nr_requests=512 and nr_group_requests=128. This will make
+sure that apart from root group one can create 3 more groups without running
+into any issues. If one decides to create more cgroups, nr_requests and
+nr_group_requests should be adjusted accordingly.
+
+Probably a better way to assign limit to group request descriptors is through
+sysfs interface. This is a future TODO item.
-- 
1.6.0.6

* [PATCH 01/25] io-controller: Documentation
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

o Documentation for io-controller.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/block/00-INDEX          |    2 +
 Documentation/block/io-controller.txt |  378 +++++++++++++++++++++++++++++++++
 2 files changed, 380 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/block/io-controller.txt

diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index 961a051..dc8bf95 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -10,6 +10,8 @@ capability.txt
 	- Generic Block Device Capability (/sys/block/<disk>/capability)
 deadline-iosched.txt
 	- Deadline IO scheduler tunables
+io-controller.txt
+	- IO controller for providing hierarchical IO scheduling
 ioprio.txt
 	- Block io priorities (in CFQ scheduler)
 request.txt
diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
index 0000000..b2a96b3
--- /dev/null
+++ b/Documentation/block/io-controller.txt
@@ -0,0 +1,378 @@
+				IO Controller
+				=============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is, one
+can create cgroups and assign prio/weights to them, and each task group
+will get access to the disk in proportion to the weight of the group.
+
+These patches modify the elevator layer and individual IO schedulers to do
+IO control. Hence this IO controller works only on block devices which use
+one of the standard IO schedulers; it can not be used with an arbitrary
+logical block device.
+
+The assumption/thought behind modifying IO scheduler is that resource control
+is needed only on leaf nodes where the actual contention for resources is
+present and not on intermediate logical block devices.
+
+Consider the following hypothetical scenario. Let's say there are three
+physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have
+been created on top of these. Part of sdb is in lv0 and part is in lv1.
+
+			    lv0      lv1
+			  /	\  /     \
+			sda      sdb      sdc
+
+Also consider the following cgroup hierarchy:
+
+				root
+				/   \
+			       A     B
+			      / \    / \
+			     T1 T2  T3  T4
+
+A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
+Assume T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
+only, there will not be any contention for resources between group A and B if
+IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
+the IO scheduler associated with sdb will distribute disk bandwidth to
+groups A and B in proportion to their weights.
+
+CFQ already has the notion of fairness and it provides differential disk
+access based on priority and class of the task. However, it is flat, and
+with cgroups it needs to be made hierarchical to achieve good
+hierarchical control of IO.
+
+The rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split in read and write queue
+for deadline and AS). With this patchset, now we maintain one queue per
+cgroup per device and then try to do fair queuing among those queues.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code which has been moved to a common layer (elevator
+layer). Hence we don't end up replicating code across IO schedulers. Following
+diagram depicts the concept.
+
+			--------------------------------
+			| Elevator Layer + Fair Queuing |
+			--------------------------------
+			 |	     |	     |       |
+			NOOP     DEADLINE    AS     CFQ
+
+Design
+======
+This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
+fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
+B-WF2Q+ algorithm for fair queuing.
+
+Why BFQ?
+
+- Not sure if weighted round robin logic of CFQ can be easily extended for
+  hierarchical mode. One of the things is that we can not keep dividing
+  the time slice of parent group among children. The deeper we go in the
+  hierarchy, the smaller the time slice will get.
+
+  One of the ways to implement hierarchical support could be to keep track
+  of virtual time and service provided to queue/group and select a queue/group
+  for service based on any of the various available algorithms.
+
+  BFQ already had support for hierarchical scheduling, so taking those
+  patches was easier.
+
+- BFQ was designed to provide tighter bounds/delay w.r.t service provided
+  to a queue. Delay/Jitter with BFQ is O(1).
+
+  Note: BFQ originally used amount of IO done (number of sectors) as notion
+        of service provided. IOW, it tried to provide fairness in terms of
+        actual IO done and not in terms of actual time disk access was
+	given to a queue.
+
+	This patchset modified BFQ to provide fairness in time domain because
+	that's what CFQ does. So the idea was to try not to deviate too much from
+	the CFQ behavior initially.
+
+	Providing fairness in time domain makes accounting tricky because
+	due to command queueing, at one time there might be multiple requests
+	from different queues and there is no easy way to find out how much
+	disk time actually was consumed by the requests of a particular
+	queue. More about this in comments in source code.
+
+We have taken BFQ code as starting point for providing fairness among groups
+because it already contained lots of features which we required to implement
+hierarchical IO scheduling. With this patch set, I am not trying to ensure O(1)
+delay here as my goal is to provide fairness among groups. Most likely that
+will mean that latencies are not worse than what cfq currently provides (if
+not improved ones). Once fairness is ensured, one can look more into
+ensuring O(1) latencies.
+
+From data structure point of view, one can think of a tree per device, where
+io groups and io queues are hanging and are being scheduled using B-WF2Q+
+algorithm. An io_queue is the end queue where requests are actually stored and
+dispatched from (like cfqq).
+
+These io queues are primarily created and managed by end io schedulers
+depending on their semantics. For example, noop, deadline and AS ioschedulers
+keep one io queue per cgroup and cfq keeps one io queue per io_context in
+a cgroup (apart from async queues).
+
+A request is mapped to an io group by elevator layer and which io queue it
+is mapped to within the group depends on ioscheduler. Currently "current" task
+is used to determine the cgroup (hence io group) of the request. Down the
+line we need to make use of bio-cgroup patches to map delayed writes to
+right group.
+
+Going back to old behavior
+==========================
+In new scheme of things essentially we are creating hierarchical fair
+queuing logic in elevator layer and changing IO schedulers to make use of
+that logic so that end IO schedulers start supporting hierarchical scheduling.
+
+Elevator layer continues to support the old interfaces. So even if fair queuing
+is enabled at elevator layer, one can have both new hierarchical scheduler as
+well as old non-hierarchical scheduler operating.
+
+Also noop, deadline and AS have option of enabling hierarchical scheduling.
+If it is selected, fair queuing is done in hierarchical manner. If hierarchical
+scheduling is disabled, noop, deadline and AS should retain their existing
+behavior.
+
+CFQ is the only exception where one can not disable fair queuing as it is
+needed for providing fairness among various threads even in non-hierarchical
+mode.
+
+Various user visible config options
+===================================
+CONFIG_IOSCHED_NOOP_HIER
+	- Enables hierarchical fair queuing in noop. Not selecting this option
+	  leads to old behavior of noop.
+
+CONFIG_IOSCHED_DEADLINE_HIER
+	- Enables hierarchical fair queuing in deadline. Not selecting this
+	  option leads to old behavior of deadline.
+
+CONFIG_IOSCHED_AS_HIER
+	- Enables hierarchical fair queuing in AS. Not selecting this option
+	  leads to old behavior of AS.
+
+CONFIG_IOSCHED_CFQ_HIER
+	- Enables hierarchical fair queuing in CFQ. Not selecting this option
+	  still does fair queuing among various queues but it is flat and not
+	  hierarchical.
+
+CGROUP_BLKIO
+	- This option enables blkio-cgroup controller for IO tracking
+	  purposes. That means, by this controller one can attribute a write
+	  to the original cgroup and not assume that it belongs to submitting
+	  thread.
+
+CONFIG_TRACK_ASYNC_CONTEXT
+	- Currently CFQ attributes the writes to the submitting thread and
+	  caches the async queue pointer in the io context of the process.
+	  If this option is set, it tells cfq and elevator fair queuing logic
+	  that for async writes make use of IO tracking patches and attribute
+	  writes to original cgroup and not to write submitting thread.
+
+CONFIG_DEBUG_GROUP_IOSCHED
+	- Throws extra debug messages in blktrace output, helpful in
+	  debugging a hierarchical setup.
+
+	- Also allows for export of extra debug statistics like group queue
+	  and dequeue statistics on device through cgroup interface.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+
+CONFIG_ELV_FAIR_QUEUING
+	- Enables/Disables the fair queuing logic at elevator layer.
+
+CONFIG_GROUP_IOSCHED
+	- Enables/Disables hierarchical queuing and associated cgroup bits.
+
+HOWTO
+=====
+So far I have done very simple testing of running two dd threads in two
+different cgroups. Here is what you can do.
+
+- Enable hierarchical scheduling in io scheduler of your choice (say cfq).
+	CONFIG_IOSCHED_CFQ_HIER=y
+
+- Enable IO tracking for async writes.
+	CONFIG_TRACK_ASYNC_CONTEXT=y
+
+  (This will automatically select CGROUP_BLKIO)
+
+- Compile and boot into kernel and mount IO controller and blkio io tracking
+  controller.
+
+	mount -t cgroup -o io,blkio none /cgroup
+
+- Create two cgroups
+	mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+	echo 1000 > /cgroup/test1/io.weight
+	echo 500 > /cgroup/test2/io.weight
+
+- Set "fairness" parameter to 1 at the disk you are testing.
+
+  echo 1 > /sys/block/<disk>/queue/iosched/fairness
+
+- Create two same size files (say 512MB each) on the same disk (zerofile1,
+  zerofile2) and launch two dd threads in different cgroups to read those
+  files. Make sure the right io scheduler is being used for the block device
+  where the files are present (the one you compiled in hierarchical mode).
+
+	sync
+	echo 3 > /proc/sys/vm/drop_caches
+
+	dd if=/mnt/sdb/zerofile1 of=/dev/null &
+	echo $! > /cgroup/test1/tasks
+	cat /cgroup/test1/tasks
+
+	dd if=/mnt/sdb/zerofile2 of=/dev/null &
+	echo $! > /cgroup/test2/tasks
+	cat /cgroup/test2/tasks
+
+- At macro level, the first dd should finish first. To get more precise data,
+  keep looking (with the help of a script) at the io.disk_time and
+  io.disk_sectors files of both test1 and test2 groups. This will tell how
+  much disk time (in milliseconds) each group got and how many sectors each
+  group dispatched to the disk. We provide fairness in terms of disk time, so
+  ideally io.disk_time of cgroups should be in proportion to the weight.
+  (It is hard to achieve though :-)).
+
+Regarding "fairness" parameter
+==============================
+IO controller has introduced a "fairness" tunable for every io scheduler.
+Currently this tunable can assume values 0, 1 and 2.
+
+There are places where io schedulers are geared towards throughput. Sometimes
+fairness and throughput don't go together. This tunable helps choose fairness
+over throughput depending on the need.
+
+0 --> We should achieve existing io scheduler behavior. Some of the special
+      fairness hooks are disabled.
+
+1 --> Helps in achieving better fairness for sync queues.
+
+2 --> If buffered writes come into play, use this. For example, if one cgroup
+      is doing sync reads and another cgroup is doing buffered writes, use "2"
+      to provide better fairness/isolation between the two cgroups.
+
+Details of cgroup files
+=======================
+- io.ioprio_class
+	- Specifies class of the cgroup (RT, BE, IDLE). This is default io
+	  class of the group on all the devices until and unless overridden by
+	  per device rule. (See io.policy).
+
+	  1 = RT, 2 = BE, 3 = IDLE
+
+- io.weight
+	- Specifies per cgroup weight. This is default weight of the group
+	  on all the devices until and unless overridden by per device rule.
+	  (See io.policy).
+
+- io.disk_time
+	- disk time allocated to cgroup per device in milliseconds. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the disk time allocated to group in
+	  milliseconds.
+
+- io.disk_sectors
+	- number of sectors transferred to/from disk by the group. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the number of sectors transferred by the
+	  group to/from the device.
+
+- io.disk_queue
+	- Debugging aid only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+	  gives the statistics about how many times a group was queued
+	  on service tree of the device. First two fields specify the major
+	  and minor number of the device and third field specifies the number
+	  of times a group was queued on a particular device.
+
+- io.disk_dequeue
+	- Debugging aid only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+	  gives the statistics about how many times a group was de-queued
+	  or removed from the service tree of the device. This basically gives
+	  an idea if we can generate enough IO to create continuously
+	  backlogged groups. First two fields specify the major and minor
+	  number of the device and third field specifies the number
+	  of times a group was de-queued on a particular device.
+
+- io.policy
+	- One can specify per cgroup per device rules using this interface.
+	  These rules override the default value of group weight and class as
+	  specified by io.weight and io.ioprio_class.
+
+	  Following is the format.
+
+	#echo dev_maj:dev_minor weight ioprio_class > /path/to/cgroup/io.policy
+
+	weight=0 means removing a policy.
+
+	Examples:
+
+	Configure weight=300 ioprio_class=2 on /dev/sdb (8:16) in this cgroup
+	# echo 8:16 300 2 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:16	300	2
+
+	Configure weight=500 ioprio_class=1 on /dev/sda (8:0) in this cgroup
+	# echo 8:0 500 1 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:0	500	1
+	8:16	300	2
+
+	Remove the policy for /dev/sda in this cgroup
+	# echo 8:0 0 1 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:16	300	2
+
+About configuring request descriptors
+=====================================
+Traditionally there are 128 request descriptors allocated per request queue
+where io scheduler is operating (/sys/block/<disk>/queue/nr_requests). If these
+request descriptors are exhausted, processes will be put to sleep and woken
+up once request descriptors are available.
+
+With io controller and cgroups, one can not afford to allocate requests
+from a single pool, as one group might allocate lots of requests and then
+tasks from other groups might be put to sleep, even though such a group might
+be a higher weight group. Hence, to make sure that a group can always get the
+request descriptors it is entitled to, one needs a per group request
+descriptor limit on every queue.
+
+A new parameter /sys/block/<disk>/queue/nr_group_requests has been introduced
+and this parameter controls the maximum number of requests per group.
+nr_requests still continues to control total number of request descriptors
+on the queue.
+
+Ideally one should set nr_requests as follows.
+
+nr_requests = number_of_cgroups * nr_group_requests
+
+This will make sure that at any point of time nr_group_requests number of
+request descriptors will be available for any of the cgroups.
+
+Currently the defaults are nr_requests=512 and nr_group_requests=128. This
+makes sure that apart from the root group one can create 3 more groups without
+running into any issues. If one decides to create more cgroups, nr_requests
+and nr_group_requests should be adjusted accordingly.
+
+Probably a better way to assign limit to group request descriptors is through
+sysfs interface. This is a future TODO item.
-- 
1.6.0.6
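
The sizing rule from the request descriptor section of the documentation
(nr_requests = number_of_cgroups * nr_group_requests) can be checked with a
small userspace sketch. This is only an illustration of the arithmetic; the
helper names nr_requests_for() and max_groups() are hypothetical and are not
part of the patchset.

```c
#include <assert.h>

/*
 * Total request descriptors needed on a queue so that every group
 * (root group included) can hold its full per-group quota at once.
 */
static unsigned long nr_requests_for(unsigned int nr_groups,
				     unsigned long nr_group_requests)
{
	assert(nr_groups > 0);
	return (unsigned long)nr_groups * nr_group_requests;
}

/*
 * Number of groups (root group included) that a given nr_requests can
 * serve without one group starving another of descriptors.
 */
static unsigned long max_groups(unsigned long nr_requests,
				unsigned long nr_group_requests)
{
	assert(nr_group_requests > 0);
	return nr_requests / nr_group_requests;
}
```

With the documented defaults, max_groups(512, 128) gives 4, i.e. the root
group plus 3 more groups, and nr_requests_for(8, 128) gives 1024 as the
nr_requests value one would want for 8 groups.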



* [PATCH 01/25] io-controller: Documentation
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

o Documentation for io-controller.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/block/00-INDEX          |    2 +
 Documentation/block/io-controller.txt |  378 +++++++++++++++++++++++++++++++++
 2 files changed, 380 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/block/io-controller.txt

diff --git a/Documentation/block/00-INDEX b/Documentation/block/00-INDEX
index 961a051..dc8bf95 100644
--- a/Documentation/block/00-INDEX
+++ b/Documentation/block/00-INDEX
@@ -10,6 +10,8 @@ capability.txt
 	- Generic Block Device Capability (/sys/block/<disk>/capability)
 deadline-iosched.txt
 	- Deadline IO scheduler tunables
+io-controller.txt
+	- IO controller for providing hierarchical IO scheduling
 ioprio.txt
 	- Block io priorities (in CFQ scheduler)
 request.txt
diff --git a/Documentation/block/io-controller.txt b/Documentation/block/io-controller.txt
new file mode 100644
index 0000000..b2a96b3
--- /dev/null
+++ b/Documentation/block/io-controller.txt
@@ -0,0 +1,378 @@
+				IO Controller
+				=============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is, one
+can create cgroups and assign prio/weights to those cgroups, and each task
+group will get access to the disk in proportion to the weight of the group.
+
+These patches modify the elevator layer and individual IO schedulers to do
+IO control. Hence this io controller works only on block devices which use
+one of the standard io schedulers; it can not be used with an arbitrary
+logical block device.
+
+The assumption/thought behind modifying IO scheduler is that resource control
+is needed only on leaf nodes where the actual contention for resources is
+present and not on intermediate logical block devices.
+
+Consider the following hypothetical scenario. Let's say there are three
+physical disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1)
+have been created on top of these. Part of sdb is in lv0 and part is in lv1.
+
+			    lv0      lv1
+			  /	\  /     \
+			sda      sdb      sdc
+
+Also consider following cgroup hierarchy
+
+				root
+				/   \
+			       A     B
+			      / \    / \
+			     T1 T2  T3  T4
+
+A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
+Assume T1, T2, T3 and T4 are doing IO on lv0 and lv1. These tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
+only, there will not be any contention for resources between group A and B if
+IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
+IO scheduler associated with the sdb will distribute disk bandwidth to
+group A and B proportionate to their weight.
+
+CFQ already has the notion of fairness and it provides differential disk
+access based on priority and class of the task. It is just that it is flat,
+and with cgroups it needs to be made hierarchical to achieve good
+hierarchical control on IO.
+
+Rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split in read and write queue
+for deadline and AS). With this patchset, now we maintain one queue per
+cgroup per device and then try to do fair queuing among those queues.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code which has been moved to a common layer (elevator
+layer). Hence we don't end up replicating code across IO schedulers. Following
+diagram depicts the concept.
+
+			--------------------------------
+			| Elevator Layer + Fair Queuing |
+			--------------------------------
+			 |	     |	     |       |
+			NOOP     DEADLINE    AS     CFQ
+
+Design
+======
+This patchset primarily uses BFQ (Budget Fair Queuing) code to provide
+fairness among different IO queues. Fabio and Paolo implemented BFQ which uses
+B-WF2Q+ algorithm for fair queuing.
+
+Why BFQ?
+
+- Not sure if weighted round robin logic of CFQ can be easily extended for
+  hierarchical mode. One of the things is that we can not keep dividing
+  the time slice of parent group among children. The deeper we go in the
+  hierarchy, the smaller the time slice will get.
+
+  One of the ways to implement hierarchical support could be to keep track
+  of virtual time and service provided to queue/group and select a queue/group
+  for service based on any of the various available algorithms.
+
+  BFQ already had support for hierarchical scheduling, taking those patches
+  was easier.
+
+- BFQ was designed to provide tighter bounds/delay w.r.t service provided
+  to a queue. Delay/Jitter with BFQ is O(1).
+
+  Note: BFQ originally used amount of IO done (number of sectors) as notion
+        of service provided. IOW, it tried to provide fairness in terms of
+        actual IO done and not in terms of actual time disk access was
+	given to a queue.
+
+	This patchset modified BFQ to provide fairness in time domain because
+	that's what CFQ does. So the idea was to try not to deviate too much
+	from the CFQ behavior initially.
+
+	Providing fairness in time domain makes accounting tricky because
+	due to command queueing, at one time there might be multiple requests
+	from different queues and there is no easy way to find out how much
+	disk time actually was consumed by the requests of a particular
+	queue. More about this in comments in source code.
+
+We have taken BFQ code as starting point for providing fairness among groups
+because it already contained lots of features which we required to implement
+hierarchical IO scheduling. With this patch set, I am not trying to ensure O(1)
+delay here as my goal is to provide fairness among groups. Most likely that
+will mean that latencies are not worse than what cfq currently provides (if
+not improved ones). Once fairness is ensured, one can look more into
+ensuring O(1) latencies.
+
+From data structure point of view, one can think of a tree per device, where
+io groups and io queues are hanging and are being scheduled using B-WF2Q+
+algorithm. io_queue is the end queue where requests are actually stored and
+dispatched from (like cfqq).
+
+These io queues are primarily created and managed by the end io schedulers
+depending on their semantics. For example, noop, deadline and AS ioschedulers
+keep one io queue per cgroup and cfq keeps one io queue per io_context in
+a cgroup (apart from async queues).
+
+A request is mapped to an io group by elevator layer and which io queue it
+is mapped to within the group depends on ioscheduler. Currently "current" task
+is used to determine the cgroup (hence io group) of the request. Down the
+line we need to make use of bio-cgroup patches to map delayed writes to
+right group.
+
+Going back to old behavior
+==========================
+In new scheme of things essentially we are creating hierarchical fair
+queuing logic in elevator layer and changing IO schedulers to make use of
+that logic so that end IO schedulers start supporting hierarchical scheduling.
+
+Elevator layer continues to support the old interfaces. So even if fair queuing
+is enabled at elevator layer, one can have both new hierarchical scheduler as
+well as old non-hierarchical scheduler operating.
+
+Also noop, deadline and AS have option of enabling hierarchical scheduling.
+If it is selected, fair queuing is done in hierarchical manner. If hierarchical
+scheduling is disabled, noop, deadline and AS should retain their existing
+behavior.
+
+CFQ is the only exception where one can not disable fair queuing as it is
+needed for providing fairness among various threads even in non-hierarchical
+mode.
+
+Various user visible config options
+===================================
+CONFIG_IOSCHED_NOOP_HIER
+	- Enables hierarchical fair queuing in noop. Not selecting this option
+	  leads to old behavior of noop.
+
+CONFIG_IOSCHED_DEADLINE_HIER
+	- Enables hierarchical fair queuing in deadline. Not selecting this
+	  option leads to old behavior of deadline.
+
+CONFIG_IOSCHED_AS_HIER
+	- Enables hierarchical fair queuing in AS. Not selecting this option
+	  leads to old behavior of AS.
+
+CONFIG_IOSCHED_CFQ_HIER
+	- Enables hierarchical fair queuing in CFQ. Not selecting this option
+	  still does fair queuing among various queues but it is flat and not
+	  hierarchical.
+
+CGROUP_BLKIO
+	- This option enables blkio-cgroup controller for IO tracking
+	  purposes. That means, by this controller one can attribute a write
+	  to the original cgroup and not assume that it belongs to submitting
+	  thread.
+
+CONFIG_TRACK_ASYNC_CONTEXT
+	- Currently CFQ attributes the writes to the submitting thread and
+	  caches the async queue pointer in the io context of the process.
+	  If this option is set, it tells cfq and elevator fair queuing logic
+	  that for async writes make use of IO tracking patches and attribute
+	  writes to original cgroup and not to write submitting thread.
+
+CONFIG_DEBUG_GROUP_IOSCHED
+	- Throws extra debug messages in blktrace output, helpful in
+	  debugging a hierarchical setup.
+
+	- Also allows for export of extra debug statistics like group queue
+	  and dequeue statistics on device through cgroup interface.
+
+Config options selected automatically
+=====================================
+These config options are not user visible and are selected/deselected
+automatically based on IO scheduler configurations.
+
+CONFIG_ELV_FAIR_QUEUING
+	- Enables/Disables the fair queuing logic at elevator layer.
+
+CONFIG_GROUP_IOSCHED
+	- Enables/Disables hierarchical queuing and associated cgroup bits.
+
+HOWTO
+=====
+So far I have done very simple testing of running two dd threads in two
+different cgroups. Here is what you can do.
+
+- Enable hierarchical scheduling in io scheduler of your choice (say cfq).
+	CONFIG_IOSCHED_CFQ_HIER=y
+
+- Enable IO tracking for async writes.
+	CONFIG_TRACK_ASYNC_CONTEXT=y
+
+  (This will automatically select CGROUP_BLKIO)
+
+- Compile and boot into kernel and mount IO controller and blkio io tracking
+  controller.
+
+	mount -t cgroup -o io,blkio none /cgroup
+
+- Create two cgroups
+	mkdir -p /cgroup/test1/ /cgroup/test2
+
+- Set weights of group test1 and test2
+	echo 1000 > /cgroup/test1/io.weight
+	echo 500 > /cgroup/test2/io.weight
+
+- Set "fairness" parameter to 1 at the disk you are testing.
+
+  echo 1 > /sys/block/<disk>/queue/iosched/fairness
+
+- Create two same size files (say 512MB each) on the same disk (zerofile1,
+  zerofile2) and launch two dd threads in different cgroups to read them. Make
+  sure the right io scheduler is being used for the block device where the
+  files are present (the one you compiled in hierarchical mode).
+
+	sync
+	echo 3 > /proc/sys/vm/drop_caches
+
+	dd if=/mnt/sdb/zerofile1 of=/dev/null &
+	echo $! > /cgroup/test1/tasks
+	cat /cgroup/test1/tasks
+
+	dd if=/mnt/sdb/zerofile2 of=/dev/null &
+	echo $! > /cgroup/test2/tasks
+	cat /cgroup/test2/tasks
+
+- At the macro level, the first dd should finish first. To get more precise
+  data, keep watching (with the help of a script) the io.disk_time and
+  io.disk_sectors files of both the test1 and test2 groups. These tell how
+  much disk time (in milliseconds) each group got and how many sectors each
+  group dispatched to the disk. We provide fairness in terms of disk time, so
+  ideally io.disk_time of the cgroups should be in proportion to the weights.
+  (It is hard to achieve though :-)).
+
+Regarding "fairness" parameter
+==============================
+IO controller has introduced a "fairness" tunable for every io scheduler.
+Currently this tunable can assume values 0, 1 and 2.
+
+There are places where io schedulers are geared towards throughput. Sometimes
+fairness and throughput don't go together. This tunable helps choose fairness
+over throughput depending on the need.
+
+0 --> We should achieve existing io scheduler behavior. Some of the special
+      fairness hooks are disabled.
+
+1 --> Helps in achieving better fairness for sync queues.
+
+2 --> If buffered writes come into play, use this. For example, if one cgroup
+      is doing sync reads and another cgroup is doing buffered writes, use "2"
+      to provide better fairness/isolation between the two cgroups.
+
+Details of cgroup files
+=======================
+- io.ioprio_class
+	- Specifies class of the cgroup (RT, BE, IDLE). This is the default io
+	  class of the group on all the devices unless overridden by a
+	  per device rule. (See io.policy).
+
+	  1 = RT, 2 = BE, 3 = IDLE
+
+- io.weight
+	- Specifies per cgroup weight. This is the default weight of the group
+	  on all the devices unless overridden by a per device rule.
+	  (See io.policy).
+
+- io.disk_time
+	- disk time allocated to cgroup per device in milliseconds. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the disk time allocated to group in
+	  milliseconds.
+
+- io.disk_sectors
+	- number of sectors transferred to/from disk by the group. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the number of sectors transferred by the
+	  group to/from the device.
+
+- io.disk_queue
+	- Debugging aid only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+	  gives the statistics about how many times a group was queued
+	  on service tree of the device. First two fields specify the major
+	  and minor number of the device and third field specifies the number
+	  of times a group was queued on a particular device.
+
+- io.disk_dequeue
+	- Debugging aid only enabled if CONFIG_DEBUG_GROUP_IOSCHED=y. This
+	  gives the statistics about how many times a group was de-queued
+	  or removed from the service tree of the device. This basically gives
+	  an idea if we can generate enough IO to create continuously
+	  backlogged groups. First two fields specify the major and minor
+	  number of the device and third field specifies the number
+	  of times a group was de-queued on a particular device.
+
+- io.policy
+	- One can specify per cgroup per device rules using this interface.
+	  These rules override the default value of group weight and class as
+	  specified by io.weight and io.ioprio_class.
+
+	  Following is the format.
+
+	#echo dev_maj:dev_minor weight ioprio_class > /path/to/cgroup/io.policy
+
+	weight=0 means removing a policy.
+
+	Examples:
+
+	Configure weight=300 ioprio_class=2 on /dev/sdb (8:16) in this cgroup
+	# echo 8:16 300 2 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:16	300	2
+
+	Configure weight=500 ioprio_class=1 on /dev/sda (8:0) in this cgroup
+	# echo 8:0 500 1 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:0	500	1
+	8:16	300	2
+
+	Remove the policy for /dev/sda in this cgroup
+	# echo 8:0 0 1 > io.policy
+	# cat io.policy
+	dev	weight	class
+	8:16	300	2
+
+About configuring request descriptors
+=====================================
+Traditionally there are 128 request descriptors allocated per request queue
+where io scheduler is operating (/sys/block/<disk>/queue/nr_requests). If these
+request descriptors are exhausted, processes will be put to sleep and woken
+up once request descriptors are available.
+
+With io controller and cgroups, one can not afford to allocate requests
+from a single pool, as one group might allocate lots of requests and then
+tasks from other groups might be put to sleep, even though such a group might
+be a higher weight group. Hence, to make sure that a group can always get the
+request descriptors it is entitled to, one needs a per group request
+descriptor limit on every queue.
+
+A new parameter /sys/block/<disk>/queue/nr_group_requests has been introduced
+and this parameter controls the maximum number of requests per group.
+nr_requests still continues to control total number of request descriptors
+on the queue.
+
+Ideally one should set nr_requests as follows.
+
+nr_requests = number_of_cgroups * nr_group_requests
+
+This will make sure that at any point of time nr_group_requests number of
+request descriptors will be available for any of the cgroups.
+
+Currently the defaults are nr_requests=512 and nr_group_requests=128. This
+makes sure that apart from the root group one can create 3 more groups without
+running into any issues. If one decides to create more cgroups, nr_requests
+and nr_group_requests should be adjusted accordingly.
+
+Probably a better way to assign limit to group request descriptors is through
+sysfs interface. This is a future TODO item.
-- 
1.6.0.6
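
To check the HOWTO result numerically, one can compare the observed disk time
share of each group with the share implied by its weight. The following is a
minimal userspace sketch, not part of the patchset. It assumes that each line
of io.disk_time is three whitespace separated fields (major, minor, value);
the documentation only says there are three fields, so the separator is an
assumption. parse_stat_line() and share_permille() are hypothetical helper
names.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one line of io.disk_time or io.disk_sectors.
 * Assumed layout: "major minor value", whitespace separated.
 * Returns 1 on success, 0 on a malformed line.
 */
static int parse_stat_line(const char *line, unsigned int *major,
			   unsigned int *minor, unsigned long *value)
{
	return sscanf(line, "%u %u %lu", major, minor, value) == 3;
}

/* Share of @part in @total, in permille (integer, rounded down). */
static unsigned long share_permille(unsigned long part, unsigned long total)
{
	assert(total > 0);
	return part * 1000 / total;
}
```

For the HOWTO weights of 1000 and 500, share_permille(1000, 1500) is 666 and
share_permille(500, 1500) is 333. Feeding the parsed io.disk_time values of
test1 and test2 through the same helper shows how far the observed split is
from that target.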


* [PATCH 02/25] io-controller: Core of the B-WF2Q+ scheduler
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-07-02 20:01   ` [PATCH 01/25] io-controller: Documentation Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 03/25] io-controller: bfq support of in-class preemption Vivek Goyal
                     ` (25 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

This is the core of the BFQ (B-WF2Q+) scheduler originally implemented by Paolo
and Fabio in the BFQ patches. Since then I have taken the relevant pieces from
BFQ and continued the work on the IO controller. It is not the full patch; I
have just pulled out some bits to show what the core scheduler looks like, so
that it becomes easier to review.

Originally BFQ code was hierarchical. This patch only shows the
non-hierarchical bits. Hierarchical code comes in later patches.

This code is the base for introducing fair queuing logic in the common
elevator layer so that it can be used by all four IO schedulers. In
later patches, CFQ's weighted round robin scheduler will be replaced with
the B-WF2Q+ scheduler.

Also note that BFQ originally provided fairness in terms of the number of
sectors of IO done by the queue. It has been modified to provide fairness
in terms of disk time (like CFQ, which allocates disk time slices in
proportion to prio/weight).
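
How service (disk time here) is charged in proportion to weight can be
sketched in userspace as below. This is only an illustration that mirrors
bfq_gt(), bfq_delta() and bfq_calc_finish() from this patch using plain C
integer division instead of do_div(); the names ts_gt(), vt_delta() and
vt_finish() are hypothetical and the snippet is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define WFQ_SERVICE_SHIFT 22	/* same shift value as in this patch */

/*
 * Wrap-safe "a > b" for 64-bit virtual timestamps, like bfq_gt().
 * Relies on the usual two's complement conversion to int64_t.
 */
static int ts_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

/* Map an amount of service into virtual time, like bfq_delta(). */
static uint64_t vt_delta(unsigned long service, unsigned int weight)
{
	assert(weight != 0);	/* bfq_calc_finish() has a BUG_ON() for this */
	return ((uint64_t)service << WFQ_SERVICE_SHIFT) / weight;
}

/* Finish timestamp of an entity starting at @start, like bfq_calc_finish(). */
static uint64_t vt_finish(uint64_t start, unsigned long service,
			  unsigned int weight)
{
	return start + vt_delta(service, weight);
}
```

For the same 125 units of service, vt_delta(125, 1000) is 524288 and
vt_delta(125, 500) is 1048576. The lower weight entity therefore moves its
finish timestamp twice as far per unit of service, which is what lets a
scheduler ordered by finish time hand out service in proportion to weight.
ts_gt(5, UINT64_MAX - 5) is true, showing that the comparison still orders
timestamps correctly across a wraparound.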

B-WF2Q+ is based on WF2Q+, that is described in [2], together with
H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
complexity derives from the one introduced with EEVDF in [3].

[1] P. Valente and F. Checconi, ``High Throughput Disk Scheduling
    with Deterministic Guarantees on Bandwidth Distribution,'' to be
    published.

    http://algo.ing.unimo.it/people/paolo/disk_sched/bfq.pdf

[2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
    Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
    Oct 1997.

    http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz

[3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
    First: A Flexible and Accurate Mechanism for Proportional Share
    Resource Allocation,'' technical report.

    http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf

Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c |  717 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h |  172 ++++++++++++
 2 files changed, 889 insertions(+), 0 deletions(-)
 create mode 100644 block/elevator-fq.c
 create mode 100644 block/elevator-fq.h

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..a58efdc
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,717 @@
+/*
+ * elevator fair queuing Layer. Uses B-WF2Q+ hierarchical scheduler for
+ * fair queuing.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
+ *		      Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+ * 	              Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+
+#define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+/* Mainly the BFQ scheduling code Follows */
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(u64 a, u64 b)
+{
+	return (s64)(a - b) > 0;
+}
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor.
+ */
+static inline u64 bfq_delta(unsigned long service, unsigned int weight)
+{
+	u64 d = (u64)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct io_entity *entity,
+					unsigned long service)
+{
+	BUG_ON(entity->weight == 0);
+
+	entity->finish = entity->start + bfq_delta(service, entity->weight);
+}
+
+static inline struct io_queue *io_entity_to_ioq(struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data == NULL)
+		ioq = container_of(entity, struct io_queue, entity);
+	return ioq;
+}
+
+/**
+ * io_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct io_entity *io_entity_of(struct rb_node *node)
+{
+	struct io_entity *entity = NULL;
+
+	if (node != NULL)
+		entity = rb_entry(node, struct io_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_remove - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_remove(struct rb_root *root, struct io_entity *entity)
+{
+	BUG_ON(entity->tree != root);
+
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_remove - remove an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_remove(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *next;
+
+	BUG_ON(entity->tree != &st->idle);
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = io_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = io_entity_of(next);
+	}
+
+	bfq_remove(&st->idle, entity);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct io_entity *entity)
+{
+	struct io_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	BUG_ON(entity->tree != NULL);
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct io_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of an entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct io_entity *entity,
+					struct rb_node *node)
+{
+	struct io_entity *child;
+
+	if (node != NULL) {
+		child = rb_entry(node, struct io_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value.  The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+	struct io_entity *entity = rb_entry(node, struct io_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (parent == NULL)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right != NULL)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left != NULL)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * per each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
+	if (node->rb_left != NULL)
+		node = node->rb_left;
+	else if (node->rb_right != NULL)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+}
+
+static void bfq_get_entity(struct io_entity *entity)
+{
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (ioq)
+		elv_get_ioq(ioq);
+}
+
+static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (node->rb_right == NULL && node->rb_left == NULL)
+		deepest = rb_parent(node);
+	else if (node->rb_right == NULL)
+		deepest = node->rb_left;
+	else if (node->rb_left == NULL)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right != NULL)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_remove - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_remove(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_remove(&st->active, entity);
+
+	if (node != NULL)
+		bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+
+	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(!entity->on_st);
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	ioq = io_entity_to_ioq(entity);
+	if (!ioq)
+		return;
+	elv_put_ioq(ioq);
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+static void bfq_put_idle_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	bfq_idle_remove(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+static void bfq_forget_idle(struct io_service_tree *st)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * Active tree is empty. Pull back vtime to finish time of
+		 * last idle entity on idle tree.
+		 * Rationale seems to be that it reduces the possibility of
+		 * vtime wraparound (bfq_gt(V-F) < 0).
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+
+static struct io_service_tree *
+__bfq_entity_update_prio(struct io_service_tree *old_st,
+				struct io_entity *entity)
+{
+	struct io_service_tree *new_st = old_st;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (entity->ioprio_changed) {
+		old_st->wsum -= entity->weight;
+		entity->ioprio = entity->new_ioprio;
+		entity->ioprio_class = entity->new_ioprio_class;
+		entity->weight = entity->new_weight;
+		entity->ioprio_changed = 0;
+
+		/*
+		 * Also update the scaled budget for ioq. Group will get the
+		 * updated budget once ioq is selected to run next.
+		 */
+		if (ioq) {
+			struct elv_fq_data *efqd = ioq->efqd;
+			/*
+			 * elv_prio_to_slice() is defined in later patches
+			 * where a slice length is calculated from the
+			 * ioprio of the queue.
+			 */
+			entity->budget = elv_prio_to_slice(efqd, ioq);
+		}
+
+		/*
+		 * NOTE: here we may be changing the weight too early,
+		 * this will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = io_entity_service_tree(entity);
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated tasks getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct io_service_tree *st)
+{
+	struct io_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
+	entry = rb_entry(node, struct io_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active_entity - find eligible entity with smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches the first schedulable entity, starting from the
+ * root of the tree and going on the left every time on this side there is
+ * a subtree with at least one eligible (start <= vtime) entity.  The path
+ * on the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct io_entity *bfq_first_active_entity(struct io_service_tree *st)
+{
+	struct io_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
+	while (node != NULL) {
+		entry = rb_entry(node, struct io_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		BUG_ON(bfq_gt(entry->min_start, st->vtime));
+
+		if (node->rb_left != NULL) {
+			entry = rb_entry(node->rb_left,
+					 struct io_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first != NULL)
+			break;
+		node = node->rb_right;
+	}
+
+	BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
+	return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct io_entity *__bfq_lookup_next_entity(struct io_service_tree *st)
+{
+	struct io_entity *entity;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+	BUG_ON(bfq_gt(entity->start, st->vtime));
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_active entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort just returning the cached next_active value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract)
+{
+	struct io_service_tree *st = sd->service_tree;
+	struct io_entity *entity;
+	int i;
+
+	/*
+	 * We should not call lookup when an entity is active, as doing lookup
+	 * can result in an erroneous vtime jump.
+	 */
+	BUG_ON(sd->active_entity != NULL);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+		entity = __bfq_lookup_next_entity(st);
+		if (entity != NULL) {
+			if (extract) {
+				bfq_active_remove(st, entity);
+				sd->active_entity = entity;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the current budget of the entity (and the
+ * service received if @entity is active) of the queue to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct io_entity *entity)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	if (entity == sd->active_entity) {
+		BUG_ON(entity->tree != NULL);
+		/*
+		 * If we are requeueing the current entity we have
+		 * to take care of not charging to it service it has
+		 * not received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_active entity below it.  We reuse the old
+		 * start time.
+		 */
+		bfq_active_remove(st, entity);
+	} else if (entity->tree == &st->idle) {
+		/*
+		 * Must be on the idle tree, bfq_idle_remove() will
+		 * check for that.
+		 */
+		bfq_idle_remove(st, entity);
+		entity->start = bfq_gt(st->vtime, entity->finish) ?
+				       st->vtime : entity->finish;
+	} else {
+		/*
+		 * The finish time of the entity may be invalid, and
+		 * it is in the past for sure, otherwise the queue
+		 * would have been on the idle tree.
+		 */
+		entity->start = st->vtime;
+		st->wsum += entity->weight;
+		bfq_get_entity(entity);
+
+		BUG_ON(entity->on_st);
+		entity->on_st = 1;
+	}
+
+	st = __bfq_entity_update_prio(st, entity);
+	bfq_calc_finish(entity, entity->budget);
+	bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity.
+ * @entity: the entity to activate.
+ */
+static void bfq_activate_entity(struct io_entity *entity)
+{
+	__bfq_activate_entity(entity);
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently from its previous state.  If the
+ * entity was not on a service tree just return, otherwise if it is on
+ * any scheduler tree, extract it from that tree, and if necessary
+ * and if the caller specified @requeue, put it on the idle tree.
+ *
+ */
+static int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+	int was_active = entity == sd->active_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	BUG_ON(was_active && entity->tree != NULL);
+
+	if (was_active) {
+		bfq_calc_finish(entity, entity->service);
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_remove(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_remove(st, entity);
+	else if (entity->tree != NULL)
+		BUG();
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	BUG_ON(sd->active_entity == entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	__bfq_deactivate_entity(entity, requeue);
+}
+
+static void entity_served(struct io_entity *entity, unsigned long served)
+{
+	struct io_service_tree *st;
+
+	st = io_entity_service_tree(entity);
+	entity->service += served;
+	BUG_ON(st->wsum == 0);
+	st->vtime += bfq_delta(served, st->wsum);
+	bfq_forget_idle(st);
+}
+
+/**
+ * io_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+static void io_flush_idle_tree(struct io_service_tree *st)
+{
+	struct io_entity *entity = st->first_idle;
+
+	for (; entity != NULL; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..4554d7f
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,172 @@
+/*
+ * elevator fair queuing Layer. Uses B-WF2Q+ hierarchical scheduler for
+ * fair queuing. Data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
+ * 	              Nauman Rafique <nauman@google.com>
+ */
+
+#include <linux/blkdev.h>
+
+#ifndef _BFQ_SCHED_H
+#define _BFQ_SCHED_H
+
+#define IO_IOPRIO_CLASSES	3
+
+struct io_entity;
+struct io_queue;
+
+/**
+ * struct io_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * io_service_tree.  All the fields are protected by the queue lock
+ * of the containing efqd.
+ */
+struct io_service_tree {
+	struct rb_root active;
+	struct rb_root idle;
+
+	struct io_entity *first_idle;
+	struct io_entity *last_idle;
+
+	u64 vtime;
+	unsigned int wsum;
+};
+
+/**
+ * struct io_sched_data - multi-class scheduler.
+ * @active_entity: entity under service.
+ * @next_active: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * io_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_active points to the active entity of the sched_data service
+ * trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among requests of the same
+ * queue requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io_sched_data {
+	struct io_entity *active_entity;
+	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+/**
+ * struct io_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ *         the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ *             this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @new_weight: when a weight change is requested, the new weight value
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
+ *                 associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ *                    ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested an ioprio or
+ *                  ioprio_class change.
+ *
+ * An io_entity is used to represent either an io_queue (leaf node in the
+ * cgroup hierarchy) or an io_group in the upper level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities also have their own sched_data, stored
+ * in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would allow
+ * different weights on different devices, but this functionality is not
+ * exported to userspace for now.  Priorities are updated lazily, first
+ * storing the new values into the new_* fields, then setting the
+ * @ioprio_changed flag.  As soon as there is a transition in the entity
+ * state that allows the priority update to take place the effective and
+ * the requested priority values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.
+ *
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io_entity {
+	struct rb_node rb_node;
+
+	int on_st;
+
+	u64 finish;
+	u64 start;
+
+	struct rb_root *tree;
+
+	u64 min_start;
+
+	unsigned long service, budget;
+	unsigned int weight, new_weight;
+
+	struct io_entity *parent;
+
+	struct io_sched_data *my_sched_data;
+	struct io_sched_data *sched_data;
+
+	unsigned short ioprio, new_ioprio;
+	unsigned short ioprio_class, new_ioprio_class;
+
+	int ioprio_changed;
+};
+
+/*
+ * A common structure representing the io queue where requests are actually
+ * queued.
+ */
+struct io_queue {
+	struct io_entity entity;
+	atomic_t ref;
+
+	/* Pointer to generic elevator fair queuing data structure */
+	struct elv_fq_data *efqd;
+};
+
+struct io_group {
+	struct io_sched_data sched_data;
+};
+
+static inline struct io_service_tree *
+io_entity_service_tree(struct io_entity *entity)
+{
+	struct io_sched_data *sched_data = entity->sched_data;
+	unsigned int idx = entity->ioprio_class - 1;
+
+	BUG_ON(idx >= IO_IOPRIO_CLASSES);
+	BUG_ON(sched_data == NULL);
+
+	return sched_data->service_tree + idx;
+}
+#endif /* _BFQ_SCHED_H */
-- 
1.6.0.6

* [PATCH 02/25] io-controller: Core of the B-WF2Q+ scheduler
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

This is the core of the BFQ (B-WF2Q+) scheduler originally implemented by Paolo
and Fabio in the BFQ patches. Since then I have taken the relevant pieces from
BFQ and continued the work on the IO controller. This is not the full patch; I
have pulled out just enough to show what the core scheduler looks like, which
should make it easier to review.

Originally the BFQ code was hierarchical. This patch only shows the
non-hierarchical bits. The hierarchical code comes in later patches.

This code is the foundation for introducing fair queuing logic in the common
elevator layer so that it can be used by all four IO schedulers. In later
patches, CFQ's weighted round robin scheduler will be replaced with the
B-WF2Q+ scheduler.

Also note that BFQ originally provided fairness in terms of the number of
sectors of IO done by the queue. It has been modified to provide fairness
in terms of disk time (like CFQ, which allocates disk time slices in
proportion to prio/weight).
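
To make the weight-proportional accounting concrete, here is a minimal
userspace sketch of the timestamp arithmetic used by bfq_gt(), bfq_delta()
and bfq_calc_finish() in the patch below. The names ts_gt, vt_delta and
vt_finish are illustrative only and do not appear in the patch; plain 64-bit
division stands in for do_div(), and the unit of "service" is left abstract.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as WFQ_SERVICE_SHIFT in the patch. */
#define SERVICE_SHIFT	22

/* Wraparound-safe "a > b" for 64-bit virtual timestamps. */
static int ts_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

/* Map an amount of service into the virtual time domain. */
static uint64_t vt_delta(unsigned long service, unsigned int weight)
{
	assert(weight != 0);	/* the patch uses BUG_ON() for this */
	return ((uint64_t)service << SERVICE_SHIFT) / weight;
}

/* F_i = S_i + service / weight, in shifted fixed-point units. */
static uint64_t vt_finish(uint64_t start, unsigned long service,
			  unsigned int weight)
{
	return start + vt_delta(service, weight);
}
```

With equal start times and equal budgets, an entity with twice the weight
gets a finish time half as far in the future, so it is selected sooner and
more often. The signed-difference comparison keeps the ordering correct when
a timestamp wraps past the 64-bit limit.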

B-WF2Q+ is based on WF2Q+, which is described in [2] together with
H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
complexity derives from the one introduced with EEVDF in [3].
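
The idea behind the augmented tree can be sketched in a few lines. This is a
simplified illustration, not the code in the patch: it assumes a plain binary
search tree ordered by finish time instead of the kernel rbtree, compares
timestamps without wraparound handling, and uses the hypothetical names
struct node, update_min and first_eligible. Each node caches the minimum
start time found in its subtree, which lets the lookup skip whole subtrees
that contain no eligible (start <= vtime) entity.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
	uint64_t start, finish;
	uint64_t min_start;	/* min start time over this subtree */
	struct node *left, *right;
};

/* Recompute min_start from the node itself and its children. */
static void update_min(struct node *n)
{
	n->min_start = n->start;
	if (n->left && n->left->min_start < n->min_start)
		n->min_start = n->left->min_start;
	if (n->right && n->right->min_start < n->min_start)
		n->min_start = n->right->min_start;
}

/* Return the eligible node with the smallest finish time, or NULL. */
static struct node *first_eligible(struct node *n, uint64_t vtime)
{
	while (n != NULL) {
		if (n->min_start > vtime)
			return NULL;	/* nothing eligible below here */
		if (n->left && n->left->min_start <= vtime) {
			/* A smaller finish time exists on the left. */
			n = n->left;
			continue;
		}
		if (n->start <= vtime)
			return n;
		n = n->right;
	}
	return NULL;
}
```

Each step descends one level, so on a balanced tree the lookup is O(log N).
The cost is that min_start has to be refreshed along the path to the root
after every insertion or removal, which is the job of
bfq_update_active_tree() in the patch.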

[1] P. Valente and F. Checconi, ``High Throughput Disk Scheduling
    with Deterministic Guarantees on Bandwidth Distribution,'' to be
    published.

    http://algo.ing.unimo.it/people/paolo/disk_sched/bfq.pdf

[2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
    Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
    Oct 1997.

    http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz

[3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
    First: A Flexible and Accurate Mechanism for Proportional Share
    Resource Allocation,'' technical report.

    http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf

Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |  717 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h |  172 ++++++++++++
 2 files changed, 889 insertions(+), 0 deletions(-)
 create mode 100644 block/elevator-fq.c
 create mode 100644 block/elevator-fq.h

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..a58efdc
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,717 @@
+/*
+ * elevator fair queuing Layer. Uses B-WF2Q+ hierarchical scheduler for
+ * fair queuing.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
+ * 	              Nauman Rafique <nauman@google.com>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+
+#define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+/* Mainly the BFQ scheduling code follows */
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(u64 a, u64 b)
+{
+	return (s64)(a - b) > 0;
+}
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor.
+ */
+static inline u64 bfq_delta(unsigned long service, unsigned int weight)
+{
+	u64 d = (u64)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct io_entity *entity,
+					unsigned long service)
+{
+	BUG_ON(entity->weight == 0);
+
+	entity->finish = entity->start + bfq_delta(service, entity->weight);
+}
+
+static inline struct io_queue *io_entity_to_ioq(struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data == NULL)
+		ioq = container_of(entity, struct io_queue, entity);
+	return ioq;
+}
+
+/**
+ * io_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct io_entity *io_entity_of(struct rb_node *node)
+{
+	struct io_entity *entity = NULL;
+
+	if (node != NULL)
+		entity = rb_entry(node, struct io_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_remove - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_remove(struct rb_root *root, struct io_entity *entity)
+{
+	BUG_ON(entity->tree != root);
+
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_remove - remove an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_remove(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *next;
+
+	BUG_ON(entity->tree != &st->idle);
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = io_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = io_entity_of(next);
+	}
+
+	bfq_remove(&st->idle, entity);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct io_entity *entity)
+{
+	struct io_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	BUG_ON(entity->tree != NULL);
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct io_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of an entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct io_entity *entity,
+					struct rb_node *node)
+{
+	struct io_entity *child;
+
+	if (node != NULL) {
+		child = rb_entry(node, struct io_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value.  The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+	struct io_entity *entity = rb_entry(node, struct io_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (parent == NULL)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right != NULL)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left != NULL)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * per each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
+	if (node->rb_left != NULL)
+		node = node->rb_left;
+	else if (node->rb_right != NULL)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+}
+
+static void bfq_get_entity(struct io_entity *entity)
+{
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (ioq)
+		elv_get_ioq(ioq);
+}
+
+static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (node->rb_right == NULL && node->rb_left == NULL)
+		deepest = rb_parent(node);
+	else if (node->rb_right == NULL)
+		deepest = node->rb_left;
+	else if (node->rb_left == NULL)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right != NULL)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_remove - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_remove(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_remove(&st->active, entity);
+
+	if (node != NULL)
+		bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+
+	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(!entity->on_st);
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	ioq = io_entity_to_ioq(entity);
+	if (!ioq)
+		return;
+	elv_put_ioq(ioq);
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+static void bfq_put_idle_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	bfq_idle_remove(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+static void bfq_forget_idle(struct io_service_tree *st)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * Active tree is empty. Pull back vtime to finish time of
+		 * last idle entity on idle tree.
+		 * The rationale seems to be that it reduces the possibility of
+		 * vtime wraparound (bfq_gt(V-F) < 0).
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+
+static struct io_service_tree *
+__bfq_entity_update_prio(struct io_service_tree *old_st,
+				struct io_entity *entity)
+{
+	struct io_service_tree *new_st = old_st;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (entity->ioprio_changed) {
+		old_st->wsum -= entity->weight;
+		entity->ioprio = entity->new_ioprio;
+		entity->ioprio_class = entity->new_ioprio_class;
+		entity->weight = entity->new_weight;
+		entity->ioprio_changed = 0;
+
+		/*
+		 * Also update the scaled budget for ioq. Group will get the
+		 * updated budget once ioq is selected to run next.
+		 */
+		if (ioq) {
+			struct elv_fq_data *efqd = ioq->efqd;
+			/*
+			 * elv_prio_to_slice() is defined in later patches
+			 * where a slice length is calculated from the
+			 * ioprio of the queue.
+			 */
+			entity->budget = elv_prio_to_slice(efqd, ioq);
+		}
+
+		/*
+		 * NOTE: here we may be changing the weight too early,
+		 * this will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = io_entity_service_tree(entity);
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated tasks getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct io_service_tree *st)
+{
+	struct io_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
+	entry = rb_entry(node, struct io_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active_entity - find eligible entity with smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches for the first schedulable entity, starting from
+ * the root of the tree and descending to the left whenever the left subtree
+ * contains at least one eligible (start <= vtime) entity.  The path on the
+ * right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct io_entity *bfq_first_active_entity(struct io_service_tree *st)
+{
+	struct io_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
+	while (node != NULL) {
+		entry = rb_entry(node, struct io_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		BUG_ON(bfq_gt(entry->min_start, st->vtime));
+
+		if (node->rb_left != NULL) {
+			entry = rb_entry(node->rb_left,
+					 struct io_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first != NULL)
+			break;
+		node = node->rb_right;
+	}
+
+	BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
+	return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct io_entity *__bfq_lookup_next_entity(struct io_service_tree *st)
+{
+	struct io_entity *entity;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+	BUG_ON(bfq_gt(entity->start, st->vtime));
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_active entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort by just returning the cached next_active value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract)
+{
+	struct io_service_tree *st = sd->service_tree;
+	struct io_entity *entity;
+	int i;
+
+	/*
+	 * We should not call lookup when an entity is active, as doing lookup
+	 * can result in an erroneous vtime jump.
+	 */
+	BUG_ON(sd->active_entity != NULL);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+		entity = __bfq_lookup_next_entity(st);
+		if (entity != NULL) {
+			if (extract) {
+				bfq_active_remove(st, entity);
+				sd->active_entity = entity;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the current budget of the entity (and the
+ * service received if @entity is active) of the queue to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct io_entity *entity)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	if (entity == sd->active_entity) {
+		BUG_ON(entity->tree != NULL);
+		/*
+		 * If we are requeueing the current entity we must
+		 * take care not to charge it for service it has
+		 * not received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_active entity below it.  We reuse the old
+		 * start time.
+		 */
+		bfq_active_remove(st, entity);
+	} else if (entity->tree == &st->idle) {
+		/*
+		 * Must be on the idle tree, bfq_idle_remove() will
+		 * check for that.
+		 */
+		bfq_idle_remove(st, entity);
+		entity->start = bfq_gt(st->vtime, entity->finish) ?
+				       st->vtime : entity->finish;
+	} else {
+		/*
+		 * The finish time of the entity may be invalid, and
+		 * it is in the past for sure, otherwise the queue
+		 * would have been on the idle tree.
+		 */
+		entity->start = st->vtime;
+		st->wsum += entity->weight;
+		bfq_get_entity(entity);
+
+		BUG_ON(entity->on_st);
+		entity->on_st = 1;
+	}
+
+	st = __bfq_entity_update_prio(st, entity);
+	bfq_calc_finish(entity, entity->budget);
+	bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity.
+ * @entity: the entity to activate.
+ */
+static void bfq_activate_entity(struct io_entity *entity)
+{
+	__bfq_activate_entity(entity);
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently of its previous state.  If the
+ * entity was not on a service tree just return; otherwise, if it is on
+ * any scheduler tree, extract it from that tree and, if necessary and
+ * if the caller set @requeue, put it on the idle tree.
+ *
+ */
+static int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+	int was_active = entity == sd->active_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	BUG_ON(was_active && entity->tree != NULL);
+
+	if (was_active) {
+		bfq_calc_finish(entity, entity->service);
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_remove(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_remove(st, entity);
+	else if (entity->tree != NULL)
+		BUG();
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	BUG_ON(sd->active_entity == entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	__bfq_deactivate_entity(entity, requeue);
+}
+
+static void entity_served(struct io_entity *entity, unsigned long served)
+{
+	struct io_service_tree *st;
+
+	st = io_entity_service_tree(entity);
+	entity->service += served;
+	BUG_ON(st->wsum == 0);
+	st->vtime += bfq_delta(served, st->wsum);
+	bfq_forget_idle(st);
+}
+
+/**
+ * io_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+static void io_flush_idle_tree(struct io_service_tree *st)
+{
+	struct io_entity *entity = st->first_idle;
+
+	for (; entity != NULL; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..4554d7f
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,172 @@
+/*
+ * elevator fair queuing Layer. Uses B-WF2Q+ hierarchical scheduler for
+ * fair queuing. Data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
+ * 	              Nauman Rafique <nauman@google.com>
+ */
+
+#include <linux/blkdev.h>
+
+#ifndef _BFQ_SCHED_H
+#define _BFQ_SCHED_H
+
+#define IO_IOPRIO_CLASSES	3
+
+struct io_entity;
+struct io_queue;
+
+/**
+ * struct io_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * io_service_tree.  All the fields are protected by the queue lock
+ * of the containing efqd.
+ */
+struct io_service_tree {
+	struct rb_root active;
+	struct rb_root idle;
+
+	struct io_entity *first_idle;
+	struct io_entity *last_idle;
+
+	u64 vtime;
+	unsigned int wsum;
+};
+
+/**
+ * struct io_sched_data - multi-class scheduler.
+ * @active_entity: entity under service.
+ * @next_active: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * io_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_active points to the active entity of the sched_data service
+ * trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among queues of the same
+ * class, requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io_sched_data {
+	struct io_entity *active_entity;
+	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+/**
+ * struct io_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ *         the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ *             this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @new_weight: when a weight change is requested, the new weight value
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
+ *                 associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ *                    ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested an ioprio or
+ *                  ioprio_class change.
+ *
+ * An io_entity is used to represent either an io_queue (leaf node in the
+ * cgroup hierarchy) or an io_group in the upper level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities also have their own sched_data, stored
+ * in @my_sched_data.
+ *
+ * Each entity stores its priority values independently; this would allow
+ * different weights on different devices, but this functionality is not
+ * exported to userspace for now.  Priorities are updated lazily, first
+ * storing the new values into the new_* fields, then setting the
+ * @ioprio_changed flag.  As soon as there is a transition in the entity
+ * state that allows the priority update to take place the effective and
+ * the requested priority values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.
+ *
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io_entity {
+	struct rb_node rb_node;
+
+	int on_st;
+
+	u64 finish;
+	u64 start;
+
+	struct rb_root *tree;
+
+	u64 min_start;
+
+	unsigned long service, budget;
+	unsigned int weight, new_weight;
+
+	struct io_entity *parent;
+
+	struct io_sched_data *my_sched_data;
+	struct io_sched_data *sched_data;
+
+	unsigned short ioprio, new_ioprio;
+	unsigned short ioprio_class, new_ioprio_class;
+
+	int ioprio_changed;
+};
+
+/*
+ * A common structure representing the io queue where requests are actually
+ * queued.
+ */
+struct io_queue {
+	struct io_entity entity;
+	atomic_t ref;
+
+	/* Pointer to generic elevator fair queuing data structure */
+	struct elv_fq_data *efqd;
+};
+
+struct io_group {
+	struct io_sched_data sched_data;
+};
+
+static inline struct io_service_tree *
+io_entity_service_tree(struct io_entity *entity)
+{
+	struct io_sched_data *sched_data = entity->sched_data;
+	unsigned int idx = entity->ioprio_class - 1;
+
+	BUG_ON(idx >= IO_IOPRIO_CLASSES);
+	BUG_ON(sched_data == NULL);
+
+	return sched_data->service_tree + idx;
+}
+#endif /* _BFQ_SCHED_H */
-- 
1.6.0.6



* [PATCH 02/25] io-controller: Core of the B-WF2Q+ scheduler
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

This is the core of the BFQ (B-WF2Q+) scheduler originally implemented by
Paolo and Fabio in the BFQ patches. Since then I have taken the relevant
pieces from BFQ and continued the work on the IO controller. It is not the
full patch; I have just pulled out some bits to show what the core scheduler
looks like and to make review easier.

Originally the BFQ code was hierarchical. This patch only shows the
non-hierarchical bits. Hierarchical code comes in later patches.

This code is the base for introducing fair queuing logic in the common
elevator layer so that it can be used by all four IO schedulers. In
later patches, CFQ's weighted round robin scheduler will be replaced with
the B-WF2Q+ scheduler.

Also note that BFQ originally provided fairness in terms of the number of
sectors of IO done by the queue. It has been modified to provide fairness
in terms of disk time (like CFQ, which allocates disk time slices in
proportion to prio/weight).
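
As an illustration only, the timestamp arithmetic used by the scheduler in
the diff below (bfq_delta(), bfq_calc_finish() and bfq_gt()) can be sketched
in userspace C. The vt_* names are hypothetical and not part of the patch,
and plain 64-bit division stands in for the kernel's do_div().

```c
#include <stdint.h>

/* Same value as WFQ_SERVICE_SHIFT in the patch. */
#define VT_SERVICE_SHIFT 22

/* Wraparound-safe "a > b" on 64-bit virtual timestamps (cf. bfq_gt()). */
static int vt_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

/* Map an amount of service into virtual time, scaled down by weight. */
static uint64_t vt_delta(unsigned long service, unsigned int weight)
{
	return ((uint64_t)service << VT_SERVICE_SHIFT) / weight;
}

/* Finish time: start time plus the weighted service (cf. bfq_calc_finish()). */
static uint64_t vt_finish(uint64_t start, unsigned long service,
			  unsigned int weight)
{
	return start + vt_delta(service, weight);
}
```

With these helpers, vt_finish(0, 100, 4) is 100 << 20 while
vt_finish(0, 100, 1) is 100 << 22: for the same amount of service, an entity
with four times the weight advances its finish time by a quarter as much.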

B-WF2Q+ is based on WF2Q+, which is described in [2] together with H-WF2Q+,
while the augmented tree used to implement B-WF2Q+ with O(log N) complexity
derives from the one introduced with EEVDF in [3].
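
The augmented tree can be pictured with a small userspace sketch. It assumes
a plain binary search tree ordered by finish time instead of the kernel rb
tree, ignores timestamp wraparound, and uses hypothetical vt_node/vt_* names;
the walk mirrors bfq_first_active_entity() in the diff below and is not a
replacement for it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node type; the patch embeds a struct rb_node instead. */
struct vt_node {
	uint64_t start, finish;
	uint64_t min_start;	/* minimum start time in this subtree */
	struct vt_node *left, *right;
};

/* Recompute min_start from the node and its (already valid) children. */
static void vt_update_min(struct vt_node *n)
{
	n->min_start = n->start;
	if (n->left != NULL && n->left->min_start < n->min_start)
		n->min_start = n->left->min_start;
	if (n->right != NULL && n->right->min_start < n->min_start)
		n->min_start = n->right->min_start;
}

/*
 * Walk down from the root: remember the current node if it is eligible
 * (start <= vtime), go left while the left subtree holds an eligible
 * node, and go right only if nothing eligible has been seen so far.
 */
static struct vt_node *vt_first_eligible(struct vt_node *root, uint64_t vtime)
{
	struct vt_node *first = NULL;
	struct vt_node *n = root;

	while (n != NULL) {
		if (n->start <= vtime)
			first = n;
		if (n->left != NULL && n->left->min_start <= vtime) {
			n = n->left;
			continue;
		}
		if (first != NULL)
			break;
		n = n->right;
	}
	return first;
}
```

Thanks to min_start, a subtree with no eligible entity is skipped without
being visited, so the walk follows a single root-to-leaf path.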

[1] P. Valente and F. Checconi, ``High Throughput Disk Scheduling
    with Deterministic Guarantees on Bandwidth Distribution,'' to be
    published.

    http://algo.ing.unimo.it/people/paolo/disk_sched/bfq.pdf

[2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
    Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
    Oct 1997.

    http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz

[3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
    First: A Flexible and Accurate Mechanism for Proportional Share
    Resource Allocation,'' technical report.

    http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf

Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |  717 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h |  172 ++++++++++++
 2 files changed, 889 insertions(+), 0 deletions(-)
 create mode 100644 block/elevator-fq.c
 create mode 100644 block/elevator-fq.h

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
new file mode 100644
index 0000000..a58efdc
--- /dev/null
+++ b/block/elevator-fq.c
@@ -0,0 +1,717 @@
+/*
+ * elevator fair queuing Layer. Uses B-WF2Q+ hierarchical scheduler for
+ * fair queuing.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
+ * 	              Nauman Rafique <nauman@google.com>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+
+#define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
+				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
+
+/* Mainly the BFQ scheduling code follows */
+
+/*
+ * Shift for timestamp calculations.  This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time wraparounds.
+ */
+#define WFQ_SERVICE_SHIFT	22
+
+/**
+ * bfq_gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq_gt(u64 a, u64 b)
+{
+	return (s64)(a - b) > 0;
+}
+
+/**
+ * bfq_delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor.
+ */
+static inline u64 bfq_delta(unsigned long service, unsigned int weight)
+{
+	u64 d = (u64)service << WFQ_SERVICE_SHIFT;
+
+	do_div(d, weight);
+	return d;
+}
+
+/**
+ * bfq_calc_finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq_calc_finish(struct io_entity *entity,
+					unsigned long service)
+{
+	BUG_ON(entity->weight == 0);
+
+	entity->finish = entity->start + bfq_delta(service, entity->weight);
+}
+
+static inline struct io_queue *io_entity_to_ioq(struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data == NULL)
+		ioq = container_of(entity, struct io_queue, entity);
+	return ioq;
+}
+
+/**
+ * io_entity_of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity.  This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct io_entity *io_entity_of(struct rb_node *node)
+{
+	struct io_entity *entity = NULL;
+
+	if (node != NULL)
+		entity = rb_entry(node, struct io_entity, rb_node);
+
+	return entity;
+}
+
+/**
+ * bfq_remove - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq_remove(struct rb_root *root, struct io_entity *entity)
+{
+	BUG_ON(entity->tree != root);
+
+	entity->tree = NULL;
+	rb_erase(&entity->rb_node, root);
+}
+
+/**
+ * bfq_idle_remove - remove an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq_idle_remove(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *next;
+
+	BUG_ON(entity->tree != &st->idle);
+
+	if (entity == st->first_idle) {
+		next = rb_next(&entity->rb_node);
+		st->first_idle = io_entity_of(next);
+	}
+
+	if (entity == st->last_idle) {
+		next = rb_prev(&entity->rb_node);
+		st->last_idle = io_entity_of(next);
+	}
+
+	bfq_remove(&st->idle, entity);
+}
+
+/**
+ * bfq_insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq_insert(struct rb_root *root, struct io_entity *entity)
+{
+	struct io_entity *entry;
+	struct rb_node **node = &root->rb_node;
+	struct rb_node *parent = NULL;
+
+	BUG_ON(entity->tree != NULL);
+
+	while (*node != NULL) {
+		parent = *node;
+		entry = rb_entry(parent, struct io_entity, rb_node);
+
+		if (bfq_gt(entry->finish, entity->finish))
+			node = &parent->rb_left;
+		else
+			node = &parent->rb_right;
+	}
+
+	rb_link_node(&entity->rb_node, parent, node);
+	rb_insert_color(&entity->rb_node, root);
+
+	entity->tree = root;
+}
+
+/**
+ * bfq_update_min - update the min_start field of an entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min_start due to updates to the active tree.  The function assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min_start value.
+ */
+static inline void bfq_update_min(struct io_entity *entity,
+					struct rb_node *node)
+{
+	struct io_entity *child;
+
+	if (node != NULL) {
+		child = rb_entry(node, struct io_entity, rb_node);
+		if (bfq_gt(entity->min_start, child->min_start))
+			entity->min_start = child->min_start;
+	}
+}
+
+/**
+ * bfq_update_active_node - recalculate min_start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min_start value.  The left and right subtrees
+ * are assumed to hold a correct min_start value.
+ */
+static inline void bfq_update_active_node(struct rb_node *node)
+{
+	struct io_entity *entity = rb_entry(node, struct io_entity, rb_node);
+
+	entity->min_start = entity->start;
+	bfq_update_min(entity, node->rb_right);
+	bfq_update_min(entity, node->rb_left);
+}
+
+/**
+ * bfq_update_active_tree - update min_start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update.  This function
+ * updates its min_start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root.  The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq_update_active_tree(struct rb_node *node)
+{
+	struct rb_node *parent;
+
+up:
+	bfq_update_active_node(node);
+
+	parent = rb_parent(node);
+	if (parent == NULL)
+		return;
+
+	if (node == parent->rb_left && parent->rb_right != NULL)
+		bfq_update_active_node(parent->rb_right);
+	else if (parent->rb_left != NULL)
+		bfq_update_active_node(parent->rb_left);
+
+	node = parent;
+	goto up;
+}
+
+/**
+ * bfq_active_insert - insert an entity in the active tree of its group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * per each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq_active_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct rb_node *node = &entity->rb_node;
+
+	bfq_insert(&st->active, entity);
+
+	if (node->rb_left != NULL)
+		node = node->rb_left;
+	else if (node->rb_right != NULL)
+		node = node->rb_right;
+
+	bfq_update_active_tree(node);
+}
+
+static void bfq_get_entity(struct io_entity *entity)
+{
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (ioq)
+		elv_get_ioq(ioq);
+}
+
+static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+/**
+ * bfq_find_deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch.  If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb_node *bfq_find_deepest(struct rb_node *node)
+{
+	struct rb_node *deepest;
+
+	if (node->rb_right == NULL && node->rb_left == NULL)
+		deepest = rb_parent(node);
+	else if (node->rb_right == NULL)
+		deepest = node->rb_left;
+	else if (node->rb_left == NULL)
+		deepest = node->rb_right;
+	else {
+		deepest = rb_next(node);
+		if (deepest->rb_right != NULL)
+			deepest = deepest->rb_right;
+		else if (rb_parent(deepest) != node)
+			deepest = rb_parent(deepest);
+	}
+
+	return deepest;
+}
+
+/**
+ * bfq_active_remove - remove an entity from the active tree.
+ * @st: the service_tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq_active_remove(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct rb_node *node;
+
+	node = bfq_find_deepest(&entity->rb_node);
+	bfq_remove(&st->active, entity);
+
+	if (node != NULL)
+		bfq_update_active_tree(node);
+}
+
+/**
+ * bfq_idle_insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq_idle_insert(struct io_service_tree *st,
+					struct io_entity *entity)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+
+	if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
+		st->first_idle = entity;
+	if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
+		st->last_idle = entity;
+
+	bfq_insert(&st->idle, entity);
+}
+
+/**
+ * bfq_forget_entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue.  Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq_forget_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	struct io_queue *ioq = NULL;
+
+	BUG_ON(!entity->on_st);
+	entity->on_st = 0;
+	st->wsum -= entity->weight;
+	ioq = io_entity_to_ioq(entity);
+	if (!ioq)
+		return;
+	elv_put_ioq(ioq);
+}
+
+/**
+ * bfq_put_idle_entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+static void bfq_put_idle_entity(struct io_service_tree *st,
+				struct io_entity *entity)
+{
+	bfq_idle_remove(st, entity);
+	bfq_forget_entity(st, entity);
+}
+
+/**
+ * bfq_forget_idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+static void bfq_forget_idle(struct io_service_tree *st)
+{
+	struct io_entity *first_idle = st->first_idle;
+	struct io_entity *last_idle = st->last_idle;
+
+	if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
+	    !bfq_gt(last_idle->finish, st->vtime)) {
+		/*
+		 * Active tree is empty. Pull back vtime to finish time of
+		 * last idle entity on idle tree.
+		 * The rationale seems to be that it reduces the possibility of
+		 * vtime wraparound (bfq_gt(V-F) < 0).
+		 */
+		st->vtime = last_idle->finish;
+	}
+
+	if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
+		bfq_put_idle_entity(st, first_idle);
+}
+
+
+static struct io_service_tree *
+__bfq_entity_update_prio(struct io_service_tree *old_st,
+				struct io_entity *entity)
+{
+	struct io_service_tree *new_st = old_st;
+	struct io_queue *ioq = io_entity_to_ioq(entity);
+
+	if (entity->ioprio_changed) {
+		old_st->wsum -= entity->weight;
+		entity->ioprio = entity->new_ioprio;
+		entity->ioprio_class = entity->new_ioprio_class;
+		entity->weight = entity->new_weight;
+		entity->ioprio_changed = 0;
+
+		/*
+		 * Also update the scaled budget for ioq. Group will get the
+		 * updated budget once ioq is selected to run next.
+		 */
+		if (ioq) {
+			struct elv_fq_data *efqd = ioq->efqd;
+			/*
+			 * elv_prio_to_slice() is defined in later patches
+			 * where a slice length is calculated from the
+			 * ioprio of the queue.
+			 */
+			entity->budget = elv_prio_to_slice(efqd, ioq);
+		}
+
+		/*
+		 * NOTE: here we may be changing the weight too early,
+		 * this will cause unfairness.  The correct approach
+		 * would have required additional complexity to defer
+		 * weight changes to the proper time instants (i.e.,
+		 * when entity->finish <= old_st->vtime).
+		 */
+		new_st = io_entity_service_tree(entity);
+		new_st->wsum += entity->weight;
+
+		if (new_st != old_st)
+			entity->start = new_st->vtime;
+	}
+
+	return new_st;
+}
+
+/**
+ * bfq_update_vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time.  Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated tasks getting timestamps after a
+ * vtime skip done because we needed a ->first_active entity on some
+ * intermediate node.
+ */
+static void bfq_update_vtime(struct io_service_tree *st)
+{
+	struct io_entity *entry;
+	struct rb_node *node = st->active.rb_node;
+
+	entry = rb_entry(node, struct io_entity, rb_node);
+	if (bfq_gt(entry->min_start, st->vtime)) {
+		st->vtime = entry->min_start;
+		bfq_forget_idle(st);
+	}
+}
+
+/**
+ * bfq_first_active_entity - find the eligible entity with the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches for the first schedulable entity, starting from
+ * the root of the tree and descending to the left whenever the left subtree
+ * contains at least one eligible (start <= vtime) entity.  The path on the
+ * right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct io_entity *bfq_first_active_entity(struct io_service_tree *st)
+{
+	struct io_entity *entry, *first = NULL;
+	struct rb_node *node = st->active.rb_node;
+
+	while (node != NULL) {
+		entry = rb_entry(node, struct io_entity, rb_node);
+left:
+		if (!bfq_gt(entry->start, st->vtime))
+			first = entry;
+
+		BUG_ON(bfq_gt(entry->min_start, st->vtime));
+
+		if (node->rb_left != NULL) {
+			entry = rb_entry(node->rb_left,
+					 struct io_entity, rb_node);
+			if (!bfq_gt(entry->min_start, st->vtime)) {
+				node = node->rb_left;
+				goto left;
+			}
+		}
+		if (first != NULL)
+			break;
+		node = node->rb_right;
+	}
+
+	BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
+	return first;
+}
+
+/**
+ * __bfq_lookup_next_entity - return the first eligible entity in @st.
+ * @st: the service tree.
+ *
+ * Update the virtual time in @st and return the first eligible entity
+ * it contains.
+ */
+static struct io_entity *__bfq_lookup_next_entity(struct io_service_tree *st)
+{
+	struct io_entity *entity;
+
+	if (RB_EMPTY_ROOT(&st->active))
+		return NULL;
+
+	bfq_update_vtime(st);
+	entity = bfq_first_active_entity(st);
+	BUG_ON(bfq_gt(entity->start, st->vtime));
+
+	return entity;
+}
+
+/**
+ * bfq_lookup_next_entity - return the first eligible entity in @sd.
+ * @sd: the sched_data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next_active entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort just returning the cached next_active value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract)
+{
+	struct io_service_tree *st = sd->service_tree;
+	struct io_entity *entity;
+	int i;
+
+	/*
+	 * We should not call lookup when an entity is active, as doing lookup
+	 * can result in an erroneous vtime jump.
+	 */
+	BUG_ON(sd->active_entity != NULL);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++, st++) {
+		entity = __bfq_lookup_next_entity(st);
+		if (entity != NULL) {
+			if (extract) {
+				bfq_active_remove(st, entity);
+				sd->active_entity = entity;
+			}
+			break;
+		}
+	}
+
+	return entity;
+}
+
+/**
+ * __bfq_activate_entity - activate an entity.
+ * @entity: the entity being activated.
+ *
+ * Called whenever an entity is activated, i.e., it is not active and one
+ * of its children receives a new request, or has to be reactivated due to
+ * budget exhaustion.  It uses the current budget of the entity (and the
+ * service received if @entity is active) of the queue to calculate its
+ * timestamps.
+ */
+static void __bfq_activate_entity(struct io_entity *entity)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	if (entity == sd->active_entity) {
+		BUG_ON(entity->tree != NULL);
+		/*
+		 * If we are requeueing the current entity we have
+		 * to take care of not charging to it service it has
+		 * not received.
+		 */
+		bfq_calc_finish(entity, entity->service);
+		entity->start = entity->finish;
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active) {
+		/*
+		 * Requeueing an entity due to a change of some
+		 * next_active entity below it.  We reuse the old
+		 * start time.
+		 */
+		bfq_active_remove(st, entity);
+	} else if (entity->tree == &st->idle) {
+		/*
+		 * Must be on the idle tree, bfq_idle_remove() will
+		 * check for that.
+		 */
+		bfq_idle_remove(st, entity);
+		entity->start = bfq_gt(st->vtime, entity->finish) ?
+				       st->vtime : entity->finish;
+	} else {
+		/*
+		 * The finish time of the entity may be invalid, and
+		 * it is in the past for sure, otherwise the queue
+		 * would have been on the idle tree.
+		 */
+		entity->start = st->vtime;
+		st->wsum += entity->weight;
+		bfq_get_entity(entity);
+
+		BUG_ON(entity->on_st);
+		entity->on_st = 1;
+	}
+
+	st = __bfq_entity_update_prio(st, entity);
+	bfq_calc_finish(entity, entity->budget);
+	bfq_active_insert(st, entity);
+}
+
+/**
+ * bfq_activate_entity - activate an entity.
+ * @entity: the entity to activate.
+ */
+static void bfq_activate_entity(struct io_entity *entity)
+{
+	__bfq_activate_entity(entity);
+}
+
+/**
+ * __bfq_deactivate_entity - deactivate an entity from its service tree.
+ * @entity: the entity to deactivate.
+ * @requeue: if false, the entity will not be put into the idle tree.
+ *
+ * Deactivate an entity, independently from its previous state.  If the
+ * entity was not on a service tree just return, otherwise if it is on
+ * any scheduler tree, extract it from that tree, and if necessary
+ * and if the caller did specify @requeue, put it on the idle tree.
+ *
+ */
+static int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	struct io_sched_data *sd = entity->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+	int was_active = entity == sd->active_entity;
+	int ret = 0;
+
+	if (!entity->on_st)
+		return 0;
+
+	BUG_ON(was_active && entity->tree != NULL);
+
+	if (was_active) {
+		bfq_calc_finish(entity, entity->service);
+		sd->active_entity = NULL;
+	} else if (entity->tree == &st->active)
+		bfq_active_remove(st, entity);
+	else if (entity->tree == &st->idle)
+		bfq_idle_remove(st, entity);
+	else if (entity->tree != NULL)
+		BUG();
+
+	if (!requeue || !bfq_gt(entity->finish, st->vtime))
+		bfq_forget_entity(st, entity);
+	else
+		bfq_idle_insert(st, entity);
+
+	BUG_ON(sd->active_entity == entity);
+
+	return ret;
+}
+
+/**
+ * bfq_deactivate_entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
+{
+	__bfq_deactivate_entity(entity, requeue);
+}
+
+static void entity_served(struct io_entity *entity, unsigned long served)
+{
+	struct io_service_tree *st;
+
+	st = io_entity_service_tree(entity);
+	entity->service += served;
+	BUG_ON(st->wsum == 0);
+	st->vtime += bfq_delta(served, st->wsum);
+	bfq_forget_idle(st);
+}
+
+/**
+ * io_flush_idle_tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+static void io_flush_idle_tree(struct io_service_tree *st)
+{
+	struct io_entity *entity = st->first_idle;
+
+	for (; entity != NULL; entity = st->first_idle)
+		__bfq_deactivate_entity(entity, 0);
+}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
new file mode 100644
index 0000000..4554d7f
--- /dev/null
+++ b/block/elevator-fq.h
@@ -0,0 +1,172 @@
+/*
+ * elevator fair queuing Layer. Uses B-WF2Q+ hierarchical scheduler for
+ * fair queuing. Data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ *		      Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
+ * 	              Nauman Rafique <nauman@google.com>
+ */
+
+#include <linux/blkdev.h>
+
+#ifndef _BFQ_SCHED_H
+#define _BFQ_SCHED_H
+
+#define IO_IOPRIO_CLASSES	3
+
+struct io_entity;
+struct io_queue;
+
+/**
+ * struct io_service_tree - per ioprio_class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
+ * @first_idle: idle entity with minimum F_i.
+ * @last_idle: idle entity with maximum F_i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
+ * ioprio_class has its own independent scheduler, and so its own
+ * io_service_tree.  All the fields are protected by the queue lock
+ * of the containing efqd.
+ */
+struct io_service_tree {
+	struct rb_root active;
+	struct rb_root idle;
+
+	struct io_entity *first_idle;
+	struct io_entity *last_idle;
+
+	u64 vtime;
+	unsigned int wsum;
+};
+
+/**
+ * struct io_sched_data - multi-class scheduler.
+ * @active_entity: entity under service.
+ * @next_active: head-of-the-line entity in the scheduler.
+ * @service_tree: array of service trees, one per ioprio_class.
+ *
+ * io_sched_data is the basic scheduler queue.  It supports three
+ * ioprio_classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next_active points to the active entity of the sched_data service
+ * trees that will be scheduled next.
+ *
+ * The supported ioprio_classes are the same as in CFQ, in descending
+ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among requests of the same
+ * queue requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io_sched_data {
+	struct io_entity *active_entity;
+	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
+};
+
+/**
+ * struct io_entity - schedulable entity.
+ * @rb_node: service_tree member.
+ * @on_st: flag, true if the entity is on a tree (either the active or
+ *         the idle one of its service_tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F_i).
+ * @start: B-WF2Q+ start timestamp (aka S_i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min_start: minimum start time of the (active) subtree rooted at
+ *             this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
+ * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @new_weight: when a weight change is requested, the new weight value
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
+ *                 associated scheduler queue, %NULL on leaf nodes.
+ * @sched_data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new_ioprio: when an ioprio change is requested, the new ioprio value
+ * @ioprio_class: the ioprio_class in use.
+ * @new_ioprio_class: when an ioprio_class change is requested, the new
+ *                    ioprio_class value.
+ * @ioprio_changed: flag, true when the user requested an ioprio or
+ *                  ioprio_class change.
+ *
+ * An io_entity is used to represent either an io_queue (leaf node in the
+ * cgroup hierarchy) or an io_group into the upper level scheduler.  Each
+ * entity belongs to the sched_data of the parent group in the cgroup
+ * hierarchy.  Non-leaf entities also have their own sched_data, stored
+ * in @my_sched_data.
+ *
+ * Each entity stores independently its priority values; this would allow
+ * different weights on different devices, but this functionality is not
+ * exported to userspace by now.  Priorities are updated lazily, first
+ * storing the new values into the new_* fields, then setting the
+ * @ioprio_changed flag.  As soon as there is a transition in the entity
+ * state that allows the priority update to take place the effective and
+ * the requested priority values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.
+ *
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io_entity {
+	struct rb_node rb_node;
+
+	int on_st;
+
+	u64 finish;
+	u64 start;
+
+	struct rb_root *tree;
+
+	u64 min_start;
+
+	unsigned long service, budget;
+	unsigned int weight, new_weight;
+
+	struct io_entity *parent;
+
+	struct io_sched_data *my_sched_data;
+	struct io_sched_data *sched_data;
+
+	unsigned short ioprio, new_ioprio;
+	unsigned short ioprio_class, new_ioprio_class;
+
+	int ioprio_changed;
+};
+
+/*
+ * A common structure representing the io queue where requests are actually
+ * queued.
+ */
+struct io_queue {
+	struct io_entity entity;
+	atomic_t ref;
+
+	/* Pointer to generic elevator fair queuing data structure */
+	struct elv_fq_data *efqd;
+};
+
+struct io_group {
+	struct io_sched_data sched_data;
+};
+
+static inline struct io_service_tree *
+io_entity_service_tree(struct io_entity *entity)
+{
+	struct io_sched_data *sched_data = entity->sched_data;
+	unsigned int idx = entity->ioprio_class - 1;
+
+	BUG_ON(idx >= IO_IOPRIO_CLASSES);
+	BUG_ON(sched_data == NULL);
+
+	return sched_data->service_tree + idx;
+}
+#endif /* _BFQ_SCHED_H */
-- 
1.6.0.6
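
For readers following the lookup logic, the descent rule used by
bfq_first_active_entity() above can be sketched outside the kernel. This is
only an illustration under stated assumptions: struct node, ts_gt(),
update_min_start() and first_eligible() are hypothetical names, a plain
binary tree ordered by finish time stands in for the kernel rbtree, and
ts_gt() assumes that bfq_gt() is a signed comparison of the difference of
the two timestamps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an io_entity on the active tree. */
struct node {
	uint64_t start, finish;
	uint64_t min_start;		/* smallest start in this subtree */
	struct node *left, *right;	/* children, ordered by finish */
};

/* Assumed equivalent of bfq_gt(): wraparound-safe "a is after b". */
static int ts_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

/* Recompute min_start for every node of the subtree rooted at n. */
static void update_min_start(struct node *n)
{
	struct node *child[2] = { n->left, n->right };
	int i;

	n->min_start = n->start;
	for (i = 0; i < 2; i++) {
		if (child[i] == NULL)
			continue;
		update_min_start(child[i]);
		if (ts_gt(n->min_start, child[i]->min_start))
			n->min_start = child[i]->min_start;
	}
}

/* Same descent rule as the loop in bfq_first_active_entity(). */
static struct node *first_eligible(struct node *root, uint64_t vtime)
{
	struct node *n = root, *first = NULL;

	while (n != NULL) {
		/* Mirrors the BUG_ON: the subtree must hold an eligible node. */
		assert(!ts_gt(n->min_start, vtime));

		if (!ts_gt(n->start, vtime))
			first = n;	/* eligible, remember it */

		if (n->left != NULL && !ts_gt(n->left->min_start, vtime)) {
			n = n->left;	/* an earlier finish may be eligible */
			continue;
		}
		if (first != NULL)
			break;
		n = n->right;
	}
	return first;
}
```

With vtime = 15 on a tree whose root (start 10, finish 50) has a left child
(start 30, finish 40) that in turn has a left child (start 0, finish 20),
the walk records the root, steps left twice because each left subtree has
min_start <= vtime, and returns the (0, 20) node.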

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 03/25] io-controller: bfq support of in-class preemption
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-07-02 20:01   ` [PATCH 01/25] io-controller: Documentation Vivek Goyal
  2009-07-02 20:01   ` [PATCH 02/25] io-controller: Core of the B-WF2Q+ scheduler Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 04/25] io-controller: Common flat fair queuing code in elevaotor layer Vivek Goyal
                     ` (24 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o Generally preemption is associated with cross class, where a pending
  request from the RT class preempts the ongoing BE or IDLE class
  request.

o CFQ also does in-class preemptions, like a sync request queue preempting
  the async request queue. In that case it looks like the preempting queue
  gains share, which is not fair.

o Implement similar functionality in bfq so that we can retain the
  existing CFQ behavior.

o This patch creates a bypass path so that a queue can be put at the
  front of the service tree (add_front, similar to CFQ), so that it will
  be selected next to run. That the queue gains share in the process is
  a separate issue.
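
The timestamp arithmetic of the add_front path can be tried out in user
space. The sketch below is not the kernel code: struct ent, ts_gt(),
ts_delta() and place_in_front() are hypothetical names, and it assumes that
bfq_gt() is a signed comparison of the timestamp difference and that
bfq_delta() scales the service by WFQ_SERVICE_SHIFT before dividing by the
weight.

```c
#include <assert.h>
#include <stdint.h>

#define WFQ_SERVICE_SHIFT	22	/* value used in elevator-fq.c */

/* Hypothetical reduced view of an io_entity. */
struct ent {
	uint64_t start, finish;
	unsigned long budget;
	unsigned int weight;
};

/* Assumed equivalent of bfq_gt(): wraparound-safe "a is after b". */
static int ts_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

/* Assumed equivalent of bfq_delta(): service scaled by 1/weight. */
static uint64_t ts_delta(unsigned long service, unsigned int weight)
{
	assert(weight != 0);
	return ((uint64_t)service << WFQ_SERVICE_SHIFT) / weight;
}

/*
 * Give e a finish time just before that of the entity that would run
 * next, derive the start time from the budget, and pull the start time
 * back to vtime if it would otherwise lie in the future.
 */
static void place_in_front(struct ent *e, const struct ent *next,
			   uint64_t vtime)
{
	e->finish = next->finish - 1;
	e->start = e->finish - ts_delta(e->budget, e->weight);
	if (ts_gt(e->start, vtime))
		e->start = vtime;
}
```

The clamp keeps the entity eligible (start <= vtime), and the finish time
one tick below next->finish sorts it ahead of the entity that was about to
be dispatched. The entity is not charged for jumping the queue, which is
why the preempting queue gains share.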

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |   43 +++++++++++++++++++++++++++++++++++++++----
 1 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index a58efdc..7ee4321 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -582,7 +582,7 @@ static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
  * service received if @entity is active) of the queue to calculate its
  * timestamps.
  */
-static void __bfq_activate_entity(struct io_entity *entity)
+static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 {
 	struct io_sched_data *sd = entity->sched_data;
 	struct io_service_tree *st = io_entity_service_tree(entity);
@@ -627,7 +627,42 @@ static void __bfq_activate_entity(struct io_entity *entity)
 	}
 
 	st = __bfq_entity_update_prio(st, entity);
-	bfq_calc_finish(entity, entity->budget);
+	/*
+	 * This is to emulate cfq-like functionality where preemption can
+	 * happen within the same class, like a sync queue preempting an async
+	 * queue. Maybe this is not a very good idea from a fairness point of
+	 * view, as the preempting queue gains share. Keeping it for now.
+	 */
+	if (add_front) {
+		struct io_entity *next_entity;
+
+		/*
+		 * Determine the entity which will be dispatched next
+		 * Use sd->next_active once hierarchical patch is applied
+		 */
+		next_entity = bfq_lookup_next_entity(sd, 0);
+
+		if (next_entity && next_entity != entity) {
+			struct io_service_tree *new_st;
+			u64 delta;
+
+			new_st = io_entity_service_tree(next_entity);
+
+			/*
+			 * At this point, both entities should belong to the
+			 * same service tree, as cross service tree preemption
+			 * is automatically taken care of by the algorithm.
+			 */
+			BUG_ON(new_st != st);
+			entity->finish = next_entity->finish - 1;
+			delta = bfq_delta(entity->budget, entity->weight);
+			entity->start = entity->finish - delta;
+			if (bfq_gt(entity->start, st->vtime))
+				entity->start = st->vtime;
+		}
+	} else {
+		bfq_calc_finish(entity, entity->budget);
+	}
 	bfq_active_insert(st, entity);
 }
 
@@ -635,9 +670,9 @@ static void __bfq_activate_entity(struct io_entity *entity)
  * bfq_activate_entity - activate an entity.
  * @entity: the entity to activate.
  */
-static void bfq_activate_entity(struct io_entity *entity)
+static void bfq_activate_entity(struct io_entity *entity, int add_front)
 {
-	__bfq_activate_entity(entity);
+	__bfq_activate_entity(entity, add_front);
 }
 
 /**
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 04/25] io-controller: Common flat fair queuing code in elevaotor layer
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 03/25] io-controller: bfq support of in-class preemption Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 05/25] io-controller: Charge for time slice based on average disk rate Vivek Goyal
                     ` (23 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

This is the common fair queuing code in the elevator layer. It is controlled
by the config option CONFIG_ELV_FAIR_QUEUING. This patch initially introduces
only flat fair queuing support, where there is a single group, the "root
group", and all tasks belong to it.

These elevator layer changes are backward compatible. That means any
ioscheduler using the old interfaces will continue to work.

This code is essentially the CFQ code for fair queuing. The primary difference
is that the flat round robin algorithm of CFQ has been replaced with BFQ (WF2Q+).
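
The budget that B-WF2Q+ schedules on is derived from the ioprio with the
CFQ-style slice formula carried over in this patch (elv_prio_slice() in the
diff below). A minimal user-space sketch of that formula follows; HZ = 1000
is an assumption, prio_slice(), slice_sync and slice_async are hypothetical
names, and the assert stands in for the WARN_ON in the kernel version.

```c
#include <assert.h>

#define HZ			1000	/* assumption: 1000 ticks per second */
#define IOPRIO_BE_NR		8	/* number of best-effort priority levels */
#define ELV_SLICE_SCALE		5

/* Base slices in jiffies, same expressions as the defaults in the patch. */
static const int slice_sync = HZ / 10;
static const int slice_async = HZ / 25;

/*
 * Priority 4 gets exactly the base slice. Each step towards priority 0
 * adds one fifth of the base slice, each step towards priority 7
 * removes one fifth.
 */
static int prio_slice(int base_slice, unsigned short prio)
{
	assert(prio < IOPRIO_BE_NR);
	return base_slice + (base_slice / ELV_SLICE_SCALE * (4 - prio));
}
```

With HZ = 1000 the sync base slice is 100 jiffies, so a sync queue gets 180
jiffies at priority 0, 100 at priority 4 and 40 at priority 7.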

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched    |   13 +
 block/Makefile           |    1 +
 block/blk.h              |    4 +
 block/elevator-fq.c      | 1254 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h      |  304 +++++++++++
 block/elevator.c         |   42 ++-
 include/linux/blkdev.h   |   14 +
 include/linux/elevator.h |   51 ++
 8 files changed, 1667 insertions(+), 16 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
 
 menu "IO Schedulers"
 
+config ELV_FAIR_QUEUING
+	bool "Elevator Fair Queuing Support"
+	default n
+	---help---
+	  Traditionally only cfq had the notion of multiple queues and it did
+	  fair queuing on its own. With cgroups and the need to control IO,
+	  now even the simple io schedulers like noop, deadline and as will
+	  have one queue per cgroup and will need hierarchical fair queuing.
+	  Instead of every io scheduler implementing its own fair queuing
+	  logic, this option enables fair queuing in elevator layer so that
+	  other ioschedulers can make use of it.
+	  If unsure, say N.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/Makefile b/block/Makefile
index e9fa4dd..94bfc6e 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING)	+= elevator-fq.o
diff --git a/block/blk.h b/block/blk.h
index 3fae6ad..99c3819 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -71,6 +71,8 @@ static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_activate_rq(q, rq);
+
 	if (e->ops->elevator_activate_req_fn)
 		e->ops->elevator_activate_req_fn(q, rq);
 }
@@ -79,6 +81,8 @@ static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_deactivate_rq(q, rq);
+
 	if (e->ops->elevator_deactivate_req_fn)
 		e->ops->elevator_deactivate_req_fn(q, rq);
 }
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7ee4321..6f23d7e 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -14,6 +14,17 @@
 
 #include <linux/blkdev.h>
 #include "elevator-fq.h"
+#include <linux/blktrace_api.h>
+
+/* Values taken from cfq */
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+int elv_slice_idle = HZ / 125;
+static struct kmem_cache *elv_ioq_pool;
+
+#define ELV_SLICE_SCALE		(5)
+#define ELV_HW_QUEUE_MIN	(5)
 
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
@@ -28,6 +39,22 @@
  */
 #define WFQ_SERVICE_SHIFT	22
 
+static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
+					unsigned short prio)
+{
+	const int base_slice = efqd->elv_slice[sync];
+
+	WARN_ON(prio >= IOPRIO_BE_NR);
+
+	return base_slice + (base_slice/ELV_SLICE_SCALE * (4 - prio));
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	return elv_prio_slice(efqd, elv_ioq_sync(ioq), ioq->entity.ioprio);
+}
+
 /**
  * bfq_gt - compare two timestamps.
  * @a: first ts.
@@ -423,11 +450,6 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
 		 */
 		if (ioq) {
 			struct elv_fq_data *efqd = ioq->efqd;
-			/*
-			 * elv_prio_to_slice() is defined in later patches
-			 * where a slice length is calculated from the
-			 * ioprio of the queue.
-			 */
 			entity->budget = elv_prio_to_slice(efqd, ioq);
 		}
 
@@ -750,3 +772,1225 @@ static void io_flush_idle_tree(struct io_service_tree *st)
 	for (; entity != NULL; entity = st->first_idle)
 		__bfq_deactivate_entity(entity, 0);
 }
+
+/* Elevator fair queuing function */
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+	return e->efqd.active_queue;
+}
+
+void *elv_active_sched_queue(struct elevator_queue *e)
+{
+	return ioq_sched_queue(elv_active_ioq(e));
+}
+EXPORT_SYMBOL(elv_active_sched_queue);
+
+int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_ioq);
+
+int elv_hw_tag(struct elevator_queue *e)
+{
+	return e->efqd.hw_tag;
+}
+EXPORT_SYMBOL(elv_hw_tag);
+
+/* Helper functions for operating on elevator idle slice timer */
+int elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return mod_timer(&efqd->idle_slice_timer, expires);
+}
+EXPORT_SYMBOL(elv_mod_idle_slice_timer);
+
+int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return del_timer(&efqd->idle_slice_timer);
+}
+EXPORT_SYMBOL(elv_del_idle_slice_timer);
+
+unsigned int elv_get_slice_idle(struct elevator_queue *eq)
+{
+	return eq->efqd.elv_slice_idle;
+}
+EXPORT_SYMBOL(elv_get_slice_idle);
+
+static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
+{
+	entity_served(&ioq->entity, served);
+}
+
+/* Tells whether ioq is queued in root group or not */
+static inline int is_root_group_ioq(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return (ioq->entity.sched_data == &efqd->root_group->sched_data);
+}
+
+/*
+ * sysfs parts below -->
+ */
+static ssize_t
+elv_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%u\n", var);
+}
+
+static ssize_t
+elv_var_store(unsigned int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+
+	*var = simple_strtoul(p, &p, 10);
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct elv_fq_data *efqd = &e->efqd;				\
+	unsigned int __data = __VAR;					\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return elv_var_show(__data, (page));				\
+}
+SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
+EXPORT_SYMBOL(elv_slice_idle_show);
+SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
+EXPORT_SYMBOL(elv_slice_sync_show);
+SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
+EXPORT_SYMBOL(elv_slice_async_show);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\
+{									\
+	struct elv_fq_data *efqd = &e->efqd;				\
+	unsigned int __data;						\
+	int ret = elv_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_idle_store);
+STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_sync_store);
+STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_async_store);
+#undef STORE_FUNCTION
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (elv_nr_busy_ioq(q->elevator)) {
+		elv_log(efqd, "schedule dispatch");
+		kblockd_schedule_work(efqd->queue, &efqd->unplug_work);
+	}
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+static void elv_kick_queue(struct work_struct *work)
+{
+	struct elv_fq_data *efqd =
+		container_of(work, struct elv_fq_data, unplug_work);
+	struct request_queue *q = efqd->queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+static void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+	del_timer_sync(&e->efqd.idle_slice_timer);
+	cancel_work_sync(&e->efqd.unplug_work);
+}
+
+static void elv_ioq_set_prio_slice(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	ioq->slice_end = jiffies + ioq->entity.budget;
+	elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->entity.budget);
+}
+
+static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	unsigned long elapsed = jiffies - ioq->last_end_request;
+	unsigned long ttime = min(elapsed, 2UL * efqd->elv_slice_idle);
+
+	ioq->ttime_samples = (7*ioq->ttime_samples + 256) / 8;
+	ioq->ttime_total = (7*ioq->ttime_total + 256*ttime) / 8;
+	ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
+}
+
+/*
+ * Disable idle window if the process thinks too long.
+ * This idle flag can also be updated by io scheduler.
+ */
+static void elv_ioq_update_idle_window(struct elevator_queue *eq,
+				struct io_queue *ioq, struct request *rq)
+{
+	int old_idle, enable_idle;
+	struct elv_fq_data *efqd = ioq->efqd;
+
+	/*
+	 * Don't idle for async or idle io prio class
+	 */
+	if (!elv_ioq_sync(ioq) || elv_ioq_class_idle(ioq))
+		return;
+
+	enable_idle = old_idle = elv_ioq_idle_window(ioq);
+
+	if (!efqd->elv_slice_idle)
+		enable_idle = 0;
+	else if (ioq_sample_valid(ioq->ttime_samples)) {
+		if (ioq->ttime_mean > efqd->elv_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+
+	/*
+	 * From think time perspective idle should be enabled. Check with
+	 * io scheduler if it wants to disable idling based on additional
+	 * considerations like seek pattern.
+	 */
+	if (enable_idle) {
+		if (eq->ops->elevator_update_idle_window_fn)
+			enable_idle = eq->ops->elevator_update_idle_window_fn(
+						eq, ioq->sched_queue, rq);
+		if (!enable_idle)
+			elv_log_ioq(efqd, ioq, "iosched disabled idle");
+	}
+
+	if (old_idle != enable_idle) {
+		elv_log_ioq(efqd, ioq, "idle=%d", enable_idle);
+		if (enable_idle)
+			elv_mark_ioq_idle_window(ioq);
+		else
+			elv_clear_ioq_idle_window(ioq);
+	}
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+	struct io_queue *ioq = NULL;
+
+	ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+	return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+	kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+		struct io_group *iog, void *sched_queue, int ioprio_class,
+		int ioprio, int is_sync)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	RB_CLEAR_NODE(&ioq->entity.rb_node);
+	atomic_set(&ioq->ref, 0);
+	ioq->efqd = efqd;
+	elv_ioq_set_ioprio_class(ioq, ioprio_class);
+	elv_ioq_set_ioprio(ioq, ioprio);
+	ioq->pid = current->pid;
+	ioq->sched_queue = sched_queue;
+	if (is_sync && !elv_ioq_class_idle(ioq))
+		elv_mark_ioq_idle_window(ioq);
+	bfq_init_entity(&ioq->entity, iog);
+	ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
+	if (is_sync)
+		ioq->last_end_request = jiffies;
+
+	return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
+						efqd);
+
+	BUG_ON(atomic_read(&ioq->ref) <= 0);
+	if (!atomic_dec_and_test(&ioq->ref))
+		return;
+	BUG_ON(ioq->nr_queued);
+	BUG_ON(ioq->entity.tree != NULL);
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(efqd->active_queue == ioq);
+
+	/* Can be called by outgoing elevator. Don't use q */
+	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+
+	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+	elv_log_ioq(efqd, ioq, "put_queue");
+	elv_free_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_put_ioq);
+
+void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+	struct io_queue *ioq = *ioq_ptr;
+
+	if (ioq != NULL) {
+		/* Drop the reference taken by the io group */
+		elv_put_ioq(ioq);
+		*ioq_ptr = NULL;
+	}
+}
+
+/*
+ * Normally next io queue to be served is selected from the service tree.
+ * This function allows one to choose a specific io queue to run next
+ * out of order. This is primarily to accommodate the close_cooperator
+ * feature of cfq.
+ *
+ * Currently this is done only at the root level: to begin with, the
+ * close cooperator feature is supported only for the root group, so that
+ * default cfq behavior in a flat hierarchy is not changed.
+ */
+static void elv_set_next_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	struct io_sched_data *sd = &efqd->root_group->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	BUG_ON(efqd->active_queue != NULL || sd->active_entity != NULL);
+	BUG_ON(!efqd->busy_queues);
+	BUG_ON(sd != entity->sched_data);
+	BUG_ON(!st);
+
+	bfq_update_vtime(st);
+	bfq_active_remove(st, entity);
+	sd->active_entity = entity;
+	entity->service = 0;
+	elv_log_ioq(efqd, ioq, "set_next_ioq");
+}
+
+/* Get next queue for service. */
+static struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = NULL;
+	struct io_queue *ioq = NULL;
+	struct io_sched_data *sd;
+
+	/*
+	 * We should not call lookup when an entity is active, as doing
+	 * lookup can result in an erroneous vtime jump.
+	 */
+	BUG_ON(efqd->active_queue != NULL);
+
+	if (!efqd->busy_queues)
+		return NULL;
+
+	sd = &efqd->root_group->sched_data;
+	entity = bfq_lookup_next_entity(sd, 1);
+
+	BUG_ON(!entity);
+	if (extract)
+		entity->service = 0;
+	ioq = io_entity_to_ioq(entity);
+
+	return ioq;
+}
+
+/*
+ * coop (cooperating queue) tells that io scheduler selected a queue for us
+ * and we did not select the next queue based on fairness.
+ */
+static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int coop)
+{
+	struct request_queue *q = efqd->queue;
+
+	if (ioq) {
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+							efqd->busy_queues);
+		ioq->slice_end = 0;
+
+		elv_clear_ioq_wait_request(ioq);
+		elv_clear_ioq_must_dispatch(ioq);
+		elv_mark_ioq_slice_new(ioq);
+
+		del_timer(&efqd->idle_slice_timer);
+	}
+
+	efqd->active_queue = ioq;
+
+	/* Let iosched know if it wants to take some action */
+	if (ioq) {
+		if (q->elevator->ops->elevator_active_ioq_set_fn)
+			q->elevator->ops->elevator_active_ioq_set_fn(q,
+							ioq->sched_queue, coop);
+	}
+}
+
+/* Get and set a new active queue for service. */
+static struct io_queue *elv_set_active_ioq(struct request_queue *q,
+						struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	int coop = 0;
+
+	if (!ioq)
+		ioq = elv_get_next_ioq(q, 1);
+	else {
+		elv_set_next_ioq(q, ioq);
+		/*
+		 * io scheduler selected the next queue for us. Pass this
+		 * info back to io scheduler. cfq currently uses it
+		 * to reset coop flag on the queue.
+		 */
+		coop = 1;
+	}
+	__elv_set_active_ioq(efqd, ioq, coop);
+	return ioq;
+}
+
+static void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+	struct request_queue *q = efqd->queue;
+	struct io_queue *ioq = elv_active_ioq(efqd->queue->elevator);
+
+	if (q->elevator->ops->elevator_active_ioq_reset_fn)
+		q->elevator->ops->elevator_active_ioq_reset_fn(q,
+							ioq->sched_queue);
+	efqd->active_queue = NULL;
+	del_timer(&efqd->idle_slice_timer);
+}
+
+static void elv_activate_ioq(struct io_queue *ioq, int add_front)
+{
+	bfq_activate_entity(&ioq->entity, add_front);
+}
+
+static void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue)
+{
+	bfq_deactivate_entity(&ioq->entity, requeue);
+}
+
+/* Called when an inactive queue receives a new request. */
+static void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(ioq == efqd->active_queue);
+	elv_log_ioq(efqd, ioq, "add to busy");
+	elv_activate_ioq(ioq, 0);
+	elv_mark_ioq_busy(ioq);
+	efqd->busy_queues++;
+	if (elv_ioq_class_rt(ioq)) {
+		struct io_group *iog = ioq_to_io_group(ioq);
+		iog->busy_rt_queues++;
+	}
+}
+
+static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	BUG_ON(!elv_ioq_busy(ioq));
+	BUG_ON(ioq->nr_queued);
+	elv_log_ioq(efqd, ioq, "del from busy");
+	elv_clear_ioq_busy(ioq);
+	BUG_ON(efqd->busy_queues == 0);
+	efqd->busy_queues--;
+	if (elv_ioq_class_rt(ioq)) {
+		struct io_group *iog = ioq_to_io_group(ioq);
+		iog->busy_rt_queues--;
+	}
+
+	elv_deactivate_ioq(efqd, ioq, requeue);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * current queue used and adjust the start, finish time of queue and vtime
+ * of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations. Especially when underlying device supports command queuing
+ * and requests from multiple queues can be there at same time, then it
+ * is not clear which queue consumed how much of disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of the queue only
+ * after the first request from the queue has completed. This does not work
+ * very well if we expire the queue without waiting for the first (or any
+ * later) request from the queue to finish. For seeky queues, we expire the
+ * queue after dispatching a few requests without waiting, and start
+ * dispatching from the next queue.
+ *
+ * Not sure how to determine the time consumed by queue in such scenarios.
+ * Currently as a crude approximation, we are charging 25% of time slice
+ * for such cases. A better mechanism is needed for accurate accounting.
+ */
+void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	long slice_unused = 0, slice_used = 0, slice_overshoot = 0;
+
+	assert_spin_locked(q->queue_lock);
+	elv_log_ioq(efqd, ioq, "slice expired");
+
+	if (elv_ioq_wait_request(ioq))
+		del_timer(&efqd->idle_slice_timer);
+
+	elv_clear_ioq_wait_request(ioq);
+
+	/*
+	 * if ioq->slice_end = 0, that means a queue was expired before first
+	 * request from the queue got completed. Of course we are not planning
+	 * to idle on the queue otherwise we would not have expired it.
+	 *
+	 * Charge for the 25% slice in such cases. This is not the best thing
+	 * to do but at the same time not very sure what's the next best
+	 * thing to do.
+	 *
+	 * This arises from the fact that we don't have the notion of
+	 * one queue being operational at one time. io scheduler can dispatch
+	 * requests from multiple queues in one dispatch round. Ideally for
+	 * more accurate accounting of exact disk time used by a queue, one
+	 * should dispatch requests from only one queue and wait for all
+	 * the requests to finish. But this will reduce throughput.
+	 */
+	if (!ioq->slice_end)
+		slice_used = entity->budget/4;
+	else {
+		if (time_after(ioq->slice_end, jiffies)) {
+			slice_unused = ioq->slice_end - jiffies;
+			if (slice_unused == entity->budget) {
+				/*
+				 * queue got expired immediately after
+				 * completing first request. Charge 25% of
+				 * slice.
+				 */
+				slice_used = entity->budget/4;
+			} else
+				slice_used = entity->budget - slice_unused;
+		} else {
+			slice_overshoot = jiffies - ioq->slice_end;
+			slice_used = entity->budget + slice_overshoot;
+		}
+	}
+
+	elv_log_ioq(efqd, ioq, "sl_end=%lx, jiffies=%lx", ioq->slice_end,
+			jiffies);
+	elv_log_ioq(efqd, ioq, "sl_used=%ld, budget=%ld overshoot=%ld",
+				slice_used, entity->budget, slice_overshoot);
+	elv_ioq_served(ioq, slice_used);
+
+	BUG_ON(ioq != efqd->active_queue);
+	elv_reset_active_ioq(efqd);
+
+	if (!ioq->nr_queued)
+		elv_del_ioq_busy(q->elevator, ioq, 1);
+	else
+		elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(__elv_ioq_slice_expired);
+
+/*
+ *  Expire the ioq.
+ */
+void elv_ioq_slice_expired(struct request_queue *q)
+{
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (ioq)
+		__elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ */
+static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+			struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elevator_queue *eq = q->elevator;
+	struct io_entity *entity, *new_entity;
+
+	ioq = elv_active_ioq(eq);
+
+	if (!ioq)
+		return 0;
+
+	entity = &ioq->entity;
+	new_entity = &new_ioq->entity;
+
+	/*
+	 * Allow an RT request to pre-empt an ongoing non-RT ioq timeslice.
+	 */
+
+	if (new_entity->ioprio_class == IOPRIO_CLASS_RT
+	    && entity->ioprio_class != IOPRIO_CLASS_RT)
+		return 1;
+	/*
+	 * Allow a BE request to pre-empt an ongoing IDLE class timeslice.
+	 */
+
+	if (new_entity->ioprio_class == IOPRIO_CLASS_BE
+	    && entity->ioprio_class == IOPRIO_CLASS_IDLE)
+		return 1;
+
+	/*
+	 * Check with io scheduler if it has additional criteria based on
+	 * which it wants to preempt existing queue.
+	 */
+	if (eq->ops->elevator_should_preempt_fn)
+		return eq->ops->elevator_should_preempt_fn(q,
+						ioq_sched_queue(new_ioq), rq);
+
+	return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
+	elv_ioq_slice_expired(q);
+
+	/*
+	 * Put the new queue at the front of the current list,
+	 * so we know that it will be selected next.
+	 */
+
+	elv_activate_ioq(ioq, 1);
+	ioq->slice_end = 0;
+	elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	BUG_ON(!efqd);
+	BUG_ON(!ioq);
+	efqd->rq_queued++;
+	ioq->nr_queued++;
+
+	if (!elv_ioq_busy(ioq))
+		elv_add_ioq_busy(efqd, ioq);
+
+	elv_ioq_update_io_thinktime(ioq);
+	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+
+	if (ioq == elv_active_ioq(q->elevator)) {
+		/*
+		 * Remember that we saw a request from this process, but
+		 * don't start queuing just yet. Otherwise we risk seeing lots
+		 * of tiny requests, because we disrupt the normal plugging
+		 * and merging. If the request is already larger than a single
+		 * page, let it rip immediately. For that case we assume that
+		 * merging is already done. Ditto for a busy system that
+		 * has other work pending, don't risk delaying until the
+		 * idle timer unplug to continue working.
+		 */
+		if (elv_ioq_wait_request(ioq)) {
+			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+			    efqd->busy_queues > 1) {
+				del_timer(&efqd->idle_slice_timer);
+				__blk_run_queue(q);
+			}
+			elv_mark_ioq_must_dispatch(ioq);
+		}
+	} else if (elv_should_preempt(q, ioq, rq)) {
+		/*
+		 * not the active queue - expire current slice if it is
+		 * idle and has expired its mean thinktime or this new queue
+		 * has some old slice time left and is of higher priority or
+		 * this new queue is RT and the current one is BE
+		 */
+		elv_preempt_queue(q, ioq);
+		__blk_run_queue(q);
+	}
+}
+
+static void elv_idle_slice_timer(unsigned long data)
+{
+	struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+	struct io_queue *ioq;
+	unsigned long flags;
+	struct request_queue *q = efqd->queue;
+
+	elv_log(efqd, "idle timer fired");
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	ioq = efqd->active_queue;
+
+	if (ioq) {
+
+		/*
+		 * We saw a request before the queue expired, let it through
+		 */
+		if (elv_ioq_must_dispatch(ioq))
+			goto out_kick;
+
+		/*
+		 * expired
+		 */
+		if (elv_ioq_slice_used(ioq))
+			goto expire;
+
+		/*
+		 * only expire and reinvoke request handler, if there are
+		 * other queues with pending requests
+		 */
+		if (!elv_nr_busy_ioq(q->elevator))
+			goto out_cont;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (ioq->nr_queued)
+			goto out_kick;
+	}
+expire:
+	elv_ioq_slice_expired(q);
+out_kick:
+	elv_schedule_dispatch(q);
+out_cont:
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	unsigned long sl;
+
+	BUG_ON(!ioq);
+
+	/*
+	 * SSD device without seek penalty, disable idling. But only do so
+	 * for devices that support queuing, otherwise we still have a problem
+	 * with sync vs async workloads.
+	 */
+	if (blk_queue_nonrot(q) && efqd->hw_tag)
+		return;
+
+	/*
+	 * still requests with the driver, don't idle
+	 */
+	if (efqd->rq_in_driver)
+		return;
+
+	/*
+	 * idle is disabled, either manually or by past process history
+	 */
+	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+		return;
+
+	/*
+	 * maybe the iosched has its own idling logic. In that case the io
+	 * scheduler will take care of arming the timer, if need be.
+	 */
+	if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+		q->elevator->ops->elevator_arm_slice_timer_fn(q,
+						ioq->sched_queue);
+	} else {
+		elv_mark_ioq_wait_request(ioq);
+		sl = efqd->elv_slice_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log_ioq(efqd, ioq, "arm idle: %lu", sl);
+	}
+}
+
+/*
+ * If io scheduler has functionality of keeping track of close cooperator, check
+ * with it if it has got a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+					struct io_queue *ioq, int probe)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *new_ioq = NULL;
+
+	/*
+	 * Currently this feature is supported only for flat hierarchy or
+	 * root group queues so that default cfq behavior is not changed.
+	 */
+	if (!is_root_group_ioq(q, ioq))
+		return NULL;
+
+	if (q->elevator->ops->elevator_close_cooperator_fn)
+		new_ioq = e->ops->elevator_close_cooperator_fn(q,
+						ioq->sched_queue, probe);
+
+	/* Only select co-operating queue if it belongs to root group */
+	if (new_ioq && !is_root_group_ioq(q, new_ioq))
+		return NULL;
+
+	return new_ioq;
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+	struct io_group *iog;
+
+	if (!elv_nr_busy_ioq(q->elevator))
+		return NULL;
+
+	if (ioq == NULL)
+		goto new_queue;
+
+	/*
+	 * Force dispatch. Continue to dispatch from current queue as long
+	 * as it has requests.
+	 */
+	if (unlikely(force)) {
+		if (ioq->nr_queued)
+			goto keep_queue;
+		else
+			goto expire;
+	}
+
+	/*
+	 * The active queue has run out of time, expire it and select new.
+	 */
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+		goto expire;
+
+	/*
+	 * If we have an RT ioq waiting, then we pre-empt the current non-RT
+	 * ioq.
+	 */
+	iog = ioq_to_io_group(ioq);
+
+	if (!elv_ioq_class_rt(ioq) && iog->busy_rt_queues) {
+		/*
+		 * We simulate this as the queue timing out so that it gets to bank
+		 * the remainder of its time slice.
+		 */
+		elv_log_ioq(efqd, ioq, "preempt");
+		goto expire;
+	}
+
+	/*
+	 * The active queue has requests and isn't expired, allow it to
+	 * dispatch.
+	 */
+
+	if (ioq->nr_queued)
+		goto keep_queue;
+
+	/*
+	 * If another queue has a request waiting within our mean seek
+	 * distance, let it run.  The expire code will check for close
+	 * cooperators and put the close queue at the front of the service
+	 * tree.
+	 */
+	new_ioq = elv_close_cooperator(q, ioq, 0);
+	if (new_ioq)
+		goto expire;
+
+	/*
+	 * No requests pending. If the active queue still has requests in
+	 * flight or is idling for a new request, allow either of these
+	 * conditions to happen (or time out) before selecting a new queue.
+	 */
+
+	if (timer_pending(&efqd->idle_slice_timer) ||
+	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
+expire:
+	elv_ioq_slice_expired(q);
+new_queue:
+	ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+	return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	ioq = rq->ioq;
+	BUG_ON(!ioq);
+	ioq->nr_queued--;
+
+	efqd = ioq->efqd;
+	BUG_ON(!efqd);
+	efqd->rq_queued--;
+
+	if (elv_ioq_busy(ioq) && (elv_active_ioq(e) != ioq) && !ioq->nr_queued)
+		elv_del_ioq_busy(e, ioq, 1);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	BUG_ON(!ioq);
+	elv_ioq_request_dispatched(ioq);
+	elv_ioq_request_removed(e, rq);
+	elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	efqd->rq_in_driver++;
+	elv_log_ioq(efqd, rq->ioq, "activate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+	elv_log_ioq(efqd, rq->ioq, "deactivate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+/*
+ * Update hw_tag based on peak queue depth over 50 samples under
+ * sufficient load.
+ */
+static void elv_update_hw_tag(struct elv_fq_data *efqd)
+{
+	if (efqd->rq_in_driver > efqd->rq_in_driver_peak)
+		efqd->rq_in_driver_peak = efqd->rq_in_driver;
+
+	if (efqd->rq_queued <= ELV_HW_QUEUE_MIN &&
+	    efqd->rq_in_driver <= ELV_HW_QUEUE_MIN)
+		return;
+
+	if (efqd->hw_tag_samples++ < 50)
+		return;
+
+	if (efqd->rq_in_driver_peak >= ELV_HW_QUEUE_MIN)
+		efqd->hw_tag = 1;
+	else
+		efqd->hw_tag = 0;
+
+	efqd->hw_tag_samples = 0;
+	efqd->rq_in_driver_peak = 0;
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+	const int sync = rq_is_sync(rq);
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	ioq = rq->ioq;
+
+	elv_log_ioq(efqd, ioq, "complete");
+
+	elv_update_hw_tag(efqd);
+
+	WARN_ON(!efqd->rq_in_driver);
+	WARN_ON(!ioq->dispatched);
+	efqd->rq_in_driver--;
+	ioq->dispatched--;
+
+	if (sync)
+		ioq->last_end_request = jiffies;
+
+	/*
+	 * If this is the active queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+
+	if (elv_active_ioq(q->elevator) == ioq) {
+		if (elv_ioq_slice_new(ioq)) {
+			elv_ioq_set_prio_slice(q, ioq);
+			elv_clear_ioq_slice_new(ioq);
+		}
+		/*
+		 * If there are no requests waiting in this queue, and
+		 * there are other queues ready to issue requests, AND
+		 * those other queues are issuing requests within our
+		 * mean seek distance, give them a chance to run instead
+		 * of idling.
+		 */
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+			elv_ioq_slice_expired(q);
+		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+			 && sync && !rq_noidle(rq))
+			elv_ioq_arm_slice_timer(q);
+	}
+
+	if (!efqd->rq_in_driver)
+		elv_schedule_dispatch(q);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	/* In flat mode, there is only root group */
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_get_io_group);
+
+void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio)
+{
+	struct io_queue *ioq = NULL;
+
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		ioq = iog->async_queue[0][ioprio];
+		break;
+	case IOPRIO_CLASS_BE:
+		ioq = iog->async_queue[1][ioprio];
+		break;
+	case IOPRIO_CLASS_IDLE:
+		ioq = iog->async_idle_queue;
+		break;
+	default:
+		BUG();
+	}
+
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+EXPORT_SYMBOL(io_group_async_queue_prio);
+
+void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		iog->async_queue[0][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_BE:
+		iog->async_queue[1][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		iog->async_idle_queue = ioq;
+		break;
+	default:
+		BUG();
+	}
+
+	/*
+	 * Take the group reference and pin the queue. Group exit will
+	 * clean it up
+	 */
+	elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(io_group_set_async_queue);
+
+/*
+ * Release all the io group references to its async queues.
+ */
+static void
+io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	struct io_service_tree *st;
+	int i;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
+
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+static void elv_slab_kill(void)
+{
+	/*
+	 * Caller already ensured that pending RCU callbacks are completed,
+	 * so we should have no busy allocations at this point.
+	 */
+	if (elv_ioq_pool)
+		kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+	elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+	if (!elv_ioq_pool)
+		goto fail;
+
+	return 0;
+fail:
+	elv_slab_kill();
+	return -ENOMEM;
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	iog = io_alloc_root_group(q, e, efqd);
+	if (iog == NULL)
+		return 1;
+
+	efqd->root_group = iog;
+	efqd->queue = q;
+
+	init_timer(&efqd->idle_slice_timer);
+	efqd->idle_slice_timer.function = elv_idle_slice_timer;
+	efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+	efqd->elv_slice[0] = elv_slice_async;
+	efqd->elv_slice[1] = elv_slice_sync;
+	efqd->elv_slice_idle = elv_slice_idle;
+	efqd->hw_tag = 1;
+
+	return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask elevator to cleanup its queues, we do the cleanup here so
+ * that all the group and idle tree references to ioq are dropped. Later
+ * during elevator cleanup, ioc reference will be dropped which will lead
+ * to removal of ioscheduler queue as well as associated ioq object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+	io_free_root_group(e);
+}
+
+/*
+ * This is called after the io scheduler has cleaned up its data structures.
+ * I don't think that this function is required. Right now just keeping it
+ * because cfq cleans up timer and work queue again after freeing up
+ * io contexts. To me io scheduler has already been drained out, and all
+ * the active queues have already been expired, so the timer and work queue
+ * should not have been activated during the cleanup process.
+ *
+ * Keeping it here for the time being. Will get rid of it later.
+ */
+void elv_exit_fq_data_post(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+}
+
+
+static int __init elv_fq_init(void)
+{
+	if (elv_slab_setup())
+		return -ENOMEM;
+
+	/* could be 0 on HZ < 1000 setups */
+
+	if (!elv_slice_async)
+		elv_slice_async = 1;
+
+	if (!elv_slice_idle)
+		elv_slice_idle = 1;
+
+	return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 4554d7f..a7cbc0f 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -22,6 +22,10 @@
 struct io_entity;
 struct io_queue;
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+#define ELV_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, elv_##name##_show, elv_##name##_store)
+
 /**
  * struct io_service_tree - per ioprio_class service tree.
  * @active: tree for active entities (i.e., those backlogged).
@@ -149,15 +153,125 @@ struct io_entity {
 struct io_queue {
 	struct io_entity entity;
 	atomic_t ref;
+	unsigned int flags;
 
 	/* Pointer to generic elevator fair queuing data structure */
 	struct elv_fq_data *efqd;
+	pid_t pid;
+
+	/* Number of requests queued on this io queue */
+	unsigned long nr_queued;
+
+	/* Requests dispatched from this queue */
+	int dispatched;
+
+	/* Keep a track of think time of processes in this queue */
+	unsigned long last_end_request;
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+
+	unsigned long slice_end;
+
+	/* Pointer to io scheduler's queue */
+	void *sched_queue;
 };
 
 struct io_group {
 	struct io_sched_data sched_data;
+	/*
+	 * async queue for each priority case for RT and BE class.
+	 * Used only for cfq.
+	 */
+
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+
+	/*
+	 * Used to track any pending rt requests so we can pre-empt current
+	 * non-RT cfqq in service when this value is non-zero.
+	 */
+	unsigned int busy_rt_queues;
+};
+
+struct elv_fq_data {
+	struct io_group *root_group;
+
+	struct request_queue *queue;
+	unsigned int busy_queues;
+
+	/* Number of requests queued */
+	int rq_queued;
+
+	/* Pointer to the ioscheduler queue being served */
+	void *active_queue;
+
+	/*
+	 * queue-depth detection
+	 */
+	int rq_in_driver;
+	int hw_tag;
+	int hw_tag_samples;
+	int rq_in_driver_peak;
+
+	/*
+	 * elevator fair queuing layer has the capability to provide idling
+	 * for ensuring fairness for processes doing dependent reads.
+	 * This might be needed to ensure fairness among two processes doing
+	 * synchronous reads in two different cgroups. noop and deadline don't
+	 * have any notion of anticipation/idling. As of now, these are the
+	 * users of this functionality.
+	 */
+	unsigned int elv_slice_idle;
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
+	/* Base slice length for sync and async queues */
+	unsigned int elv_slice[2];
 };
 
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
+				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples)   ((samples) > 80)
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_busy = 0,          /* has requests or is under service */
+	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+	ELV_QUEUE_FLAG_idle_window,	  /* elevator slice idling enabled */
+	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
+	ELV_QUEUE_FLAG_must_dispatch,	  /* must be allowed a dispatch */
+	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name)					\
+static inline void elv_mark_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline int elv_ioq_##name(struct io_queue *ioq)         		\
+{                                                                       \
+	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
+}
+
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
+
 static inline struct io_service_tree *
 io_entity_service_tree(struct io_entity *entity)
 {
@@ -169,4 +283,194 @@ io_entity_service_tree(struct io_entity *entity)
 
 	return sched_data->service_tree + idx;
 }
+
+/* A request got dispatched from the io_queue. Do the accounting. */
+static inline void elv_ioq_request_dispatched(struct io_queue *ioq)
+{
+	ioq->dispatched++;
+}
+
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+	if (elv_ioq_slice_new(ioq))
+		return 0;
+	if (time_before(jiffies, ioq->slice_end))
+		return 0;
+
+	return 1;
+}
+
+/* How many requests are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+	return ioq->dispatched;
+}
+
+/* How many requests are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+	return ioq->nr_queued;
+}
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+	atomic_inc(&ioq->ref);
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+						int ioprio_class)
+{
+	ioq->entity.new_ioprio_class = ioprio_class;
+	ioq->entity.ioprio_changed = 1;
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static inline unsigned int bfq_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	return IOPRIO_BE_NR - ioprio;
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+	ioq->entity.new_ioprio = ioprio;
+	ioq->entity.new_weight = bfq_ioprio_to_weight(ioprio);
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq)
+{
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+
+static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return container_of(ioq->entity.sched_data, struct io_group,
+						sched_data);
+}
+
+extern ssize_t elv_slice_idle_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_idle_store(struct elevator_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_async_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
+						size_t count);
+
+/* Functions used by elevator.c */
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+extern void elv_exit_fq_data_post(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+					struct request *rq);
+extern void elv_fq_dispatched_request(struct elevator_queue *e,
+					struct request *rq);
+
+extern void elv_fq_activate_rq(struct request_queue *q, struct request *rq);
+extern void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+				struct request *rq);
+
+extern void *elv_fq_select_ioq(struct request_queue *q, int force);
+
+/* Functions used by io schedulers */
+extern void elv_put_ioq(struct io_queue *ioq);
+extern void __elv_ioq_slice_expired(struct request_queue *q,
+					struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+		struct io_group *iog, void *sched_queue, int ioprio_class,
+		int ioprio, int is_sync);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern int elv_hw_tag(struct elevator_queue *e);
+extern void *elv_active_sched_queue(struct elevator_queue *e);
+extern int elv_mod_idle_slice_timer(struct elevator_queue *eq,
+					unsigned long expires);
+extern int elv_del_idle_slice_timer(struct elevator_queue *eq);
+extern unsigned int elv_get_slice_idle(struct elevator_queue *eq);
+extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio);
+extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq);
+extern struct io_group *io_get_io_group(struct request_queue *q);
+extern int elv_nr_busy_ioq(struct elevator_queue *e);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_init_fq_data(struct request_queue *q,
+					struct elevator_queue *e)
+{
+	return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+static inline void elv_exit_fq_data_post(struct elevator_queue *e) {}
+
+static inline void elv_fq_activate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_deactivate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_dispatched_request(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_removed(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_add(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_ioq_completed_request(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index ca86192..357f529 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -226,6 +226,9 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
 	for (i = 0; i < ELV_HASH_ENTRIES; i++)
 		INIT_HLIST_HEAD(&eq->hash[i]);
 
+	if (elv_init_fq_data(q, eq))
+		goto err;
+
 	return eq;
 err:
 	kfree(eq);
@@ -296,9 +299,11 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
 	e->ops = NULL;
+	elv_exit_fq_data_post(e);
 	mutex_unlock(&e->sysfs_lock);
 
 	kobject_put(&e->kobj);
@@ -425,6 +430,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	boundary = q->end_sector;
 	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -465,6 +471,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	q->end_sector = rq_end_sector(rq);
 	q->boundary_rq = rq;
@@ -532,6 +539,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	elv_rqhash_del(q, next);
 
 	q->nr_sorted--;
+	elv_ioq_request_removed(e, next);
 	q->last_merge = rq;
 }
 
@@ -638,12 +646,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 				q->last_merge = rq;
 		}
 
-		/*
-		 * Some ioscheds (cfq) run q->request_fn directly, so
-		 * rq cannot be accessed after calling
-		 * elevator_add_req_fn.
-		 */
 		q->elevator->ops->elevator_add_req_fn(q, rq);
+		elv_ioq_request_add(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_REQUEUE:
@@ -742,13 +746,12 @@ EXPORT_SYMBOL(elv_add_request);
 
 int elv_queue_empty(struct request_queue *q)
 {
-	struct elevator_queue *e = q->elevator;
-
 	if (!list_empty(&q->queue_head))
 		return 0;
 
-	if (e->ops->elevator_queue_empty_fn)
-		return e->ops->elevator_queue_empty_fn(q);
+	/* Hopefully nr_sorted works and no need to call queue_empty_fn */
+	if (q->nr_sorted)
+		return 0;
 
 	return 1;
 }
@@ -828,8 +831,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight[rq_is_sync(rq)]--;
-		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
-			e->ops->elevator_completed_req_fn(q, rq);
+		if (blk_sorted_rq(rq)) {
+			if (e->ops->elevator_completed_req_fn)
+				e->ops->elevator_completed_req_fn(q, rq);
+			elv_ioq_completed_request(q, rq);
+		}
 	}
 
 	/*
@@ -1125,3 +1131,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 	return NULL;
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq */
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+	return ioq_sched_queue(req_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+	return ioq_sched_queue(elv_fq_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8963d91..551e17d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -234,6 +234,11 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* io queue request belongs to */
+	struct io_queue *ioq;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
@@ -241,6 +246,15 @@ static inline unsigned short req_get_ioprio(struct request *req)
 	return req->ioprio;
 }
 
+static inline struct io_queue *req_ioq(struct request *req)
+{
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	return req->ioq;
+#else
+	return NULL;
+#endif
+}
+
 /*
  * State information carried for REQ_TYPE_PM_SUSPEND and REQ_TYPE_PM_RESUME
  * requests. Some step values could eventually be made generic.
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 1cb3372..81f1ed8 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -2,6 +2,7 @@
 #define _LINUX_ELEVATOR_H
 
 #include <linux/percpu.h>
+#include "../../block/elevator-fq.h"
 
 #ifdef CONFIG_BLOCK
 
@@ -29,6 +30,18 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+						struct request*);
+typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
+						struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+						void*, int probe);
+#endif
 
 struct elevator_ops
 {
@@ -56,6 +69,17 @@ struct elevator_ops
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+	elevator_should_preempt_fn *elevator_should_preempt_fn;
+	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
 };
 
 #define ELV_NAME_MAX	(16)
@@ -76,6 +100,9 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	int elevator_features;
+#endif
 };
 
 /*
@@ -89,6 +116,10 @@ struct elevator_queue
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
 	struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* fair queuing data */
+	struct elv_fq_data efqd;
+#endif
 };
 
 /*
@@ -207,5 +238,25 @@ enum {
 	__val;							\
 })
 
+/* iosched can let elevator know their feature set/capability */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fair queuing logic of elevator layer */
+#define	ELV_IOSCHED_NEED_FQ	1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6


* [PATCH 04/25] io-controller: Common flat fair queuing code in elevator layer
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

This is the common fair queuing code in the elevator layer. It is controlled by
the config option CONFIG_ELV_FAIR_QUEUING. This patch initially introduces only
flat fair queuing support, where there is a single group, the "root group", and
all tasks belong to it.

These elevator layer changes are backward compatible: any ioscheduler using the
old interfaces will continue to work.
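
As a rough illustration of why old ioschedulers are unaffected: every fair
queuing entry point in this patch begins by checking a per-iosched feature
flag and returns early when it is not set. The sketch below is a userspace
mock-up, not the kernel code. The struct layouts are trimmed to the fields
used here, and fq_setup_needed() is a hypothetical helper that only mirrors
the early return at the top of elv_init_fq_data().

```c
#include <stdio.h>

/* Same value as in the patch: iosched wants elevator-layer fair queuing. */
#define ELV_IOSCHED_NEED_FQ	1

/* Userspace mock-ups; only the fields needed for this sketch. */
struct elevator_type {
	const char *elevator_name;
	int elevator_features;
};

struct elevator_queue {
	struct elevator_type *elevator_type;
};

/* Same test as elv_iosched_fair_queuing_enabled() in the patch. */
int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
{
	return e->elevator_type->elevator_features & ELV_IOSCHED_NEED_FQ;
}

/*
 * Hypothetical helper: returns 1 if the fair queuing state would be set up
 * for this elevator, 0 if the old code path is taken unchanged.
 */
int fq_setup_needed(struct elevator_queue *e)
{
	if (!elv_iosched_fair_queuing_enabled(e))
		return 0;	/* old-style iosched: nothing to do */
	return 1;		/* real code would allocate the root group here */
}

/* Print which path a given elevator would take. */
void report(struct elevator_queue *e)
{
	printf("%s: fair queuing %s\n", e->elevator_type->elevator_name,
	       fq_setup_needed(e) ? "enabled" : "not used");
}
```

An iosched that leaves elevator_features at 0 takes the early return in every
hook, so its behavior is the same as before the patch.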

This code is essentially the CFQ code for fair queuing. The primary difference
is that CFQ's flat round robin algorithm has been replaced with BFQ (WF2Q+).
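
To make the CFQ/BFQ split a bit more concrete: in this patch a queue's budget
is still derived from its ioprio the CFQ way (elv_prio_slice()), while its
scheduling weight comes from bfq_ioprio_to_weight(). The sketch below copies
those two formulas into standalone userspace C. It assumes IOPRIO_BE_NR is 8
and takes the base slice as a plain integer; prio_slice(), ioprio_to_weight()
and print_prio_table() are illustrative names, not functions from the patch.

```c
#include <stdio.h>

#define IOPRIO_BE_NR	8	/* assumed: number of best-effort prio levels */
#define ELV_SLICE_SCALE	5	/* same value as in the patch */

/* Same arithmetic as elv_prio_slice(): prio 4 gets exactly the base slice. */
int prio_slice(int base_slice, int prio)
{
	return base_slice + (base_slice / ELV_SLICE_SCALE * (4 - prio));
}

/* Same arithmetic as bfq_ioprio_to_weight(): lower prio value, higher weight. */
unsigned int ioprio_to_weight(int ioprio)
{
	return IOPRIO_BE_NR - ioprio;
}

/* Print slice and weight for every best-effort priority level. */
void print_prio_table(int base_slice)
{
	int prio;

	for (prio = 0; prio < IOPRIO_BE_NR; prio++)
		printf("prio %d: slice %d weight %u\n", prio,
		       prio_slice(base_slice, prio), ioprio_to_weight(prio));
}
```

With a base slice of 100, prio 0 gets a slice of 180 and weight 8, prio 4 gets
100 and weight 4, and prio 7 gets 40 and weight 1.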

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   13 +
 block/Makefile           |    1 +
 block/blk.h              |    4 +
 block/elevator-fq.c      | 1254 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h      |  304 +++++++++++
 block/elevator.c         |   42 ++-
 include/linux/blkdev.h   |   14 +
 include/linux/elevator.h |   51 ++
 8 files changed, 1667 insertions(+), 16 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
 
 menu "IO Schedulers"
 
+config ELV_FAIR_QUEUING
+	bool "Elevator Fair Queuing Support"
+	default n
+	---help---
+	  Traditionally only cfq had a notion of multiple queues and it did
+	  fair queuing on its own. With cgroups and the need to control
+	  IO, even the simple io schedulers like noop, deadline and as will
+	  have one queue per cgroup and will need hierarchical fair queuing.
+	  Instead of every io scheduler implementing its own fair queuing
+	  logic, this option enables fair queuing in the elevator layer so
+	  that other ioschedulers can make use of it.
+	  If unsure, say N.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/Makefile b/block/Makefile
index e9fa4dd..94bfc6e 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING)	+= elevator-fq.o
diff --git a/block/blk.h b/block/blk.h
index 3fae6ad..99c3819 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -71,6 +71,8 @@ static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_activate_rq(q, rq);
+
 	if (e->ops->elevator_activate_req_fn)
 		e->ops->elevator_activate_req_fn(q, rq);
 }
@@ -79,6 +81,8 @@ static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_deactivate_rq(q, rq);
+
 	if (e->ops->elevator_deactivate_req_fn)
 		e->ops->elevator_deactivate_req_fn(q, rq);
 }
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7ee4321..6f23d7e 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -14,6 +14,17 @@
 
 #include <linux/blkdev.h>
 #include "elevator-fq.h"
+#include <linux/blktrace_api.h>
+
+/* Values taken from cfq */
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+int elv_slice_idle = HZ / 125;
+static struct kmem_cache *elv_ioq_pool;
+
+#define ELV_SLICE_SCALE		(5)
+#define ELV_HW_QUEUE_MIN	(5)
 
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
@@ -28,6 +39,22 @@
  */
 #define WFQ_SERVICE_SHIFT	22
 
+static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
+					unsigned short prio)
+{
+	const int base_slice = efqd->elv_slice[sync];
+
+	WARN_ON(prio >= IOPRIO_BE_NR);
+
+	return base_slice + (base_slice/ELV_SLICE_SCALE * (4 - prio));
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	return elv_prio_slice(efqd, elv_ioq_sync(ioq), ioq->entity.ioprio);
+}
+
 /**
  * bfq_gt - compare two timestamps.
  * @a: first ts.
@@ -423,11 +450,6 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
 		 */
 		if (ioq) {
 			struct elv_fq_data *efqd = ioq->efqd;
-			/*
-			 * elv_prio_to_slice() is defined in later patches
-			 * where a slice length is calculated from the
-			 * ioprio of the queue.
-			 */
 			entity->budget = elv_prio_to_slice(efqd, ioq);
 		}
 
@@ -750,3 +772,1225 @@ static void io_flush_idle_tree(struct io_service_tree *st)
 	for (; entity != NULL; entity = st->first_idle)
 		__bfq_deactivate_entity(entity, 0);
 }
+
+/* Elevator fair queuing function */
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+	return e->efqd.active_queue;
+}
+
+void *elv_active_sched_queue(struct elevator_queue *e)
+{
+	return ioq_sched_queue(elv_active_ioq(e));
+}
+EXPORT_SYMBOL(elv_active_sched_queue);
+
+int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_ioq);
+
+int elv_hw_tag(struct elevator_queue *e)
+{
+	return e->efqd.hw_tag;
+}
+EXPORT_SYMBOL(elv_hw_tag);
+
+/* Helper functions for operating on elevator idle slice timer */
+int elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return mod_timer(&efqd->idle_slice_timer, expires);
+}
+EXPORT_SYMBOL(elv_mod_idle_slice_timer);
+
+int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return del_timer(&efqd->idle_slice_timer);
+}
+EXPORT_SYMBOL(elv_del_idle_slice_timer);
+
+unsigned int elv_get_slice_idle(struct elevator_queue *eq)
+{
+	return eq->efqd.elv_slice_idle;
+}
+EXPORT_SYMBOL(elv_get_slice_idle);
+
+static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
+{
+	entity_served(&ioq->entity, served);
+}
+
+/* Tells whether ioq is queued in root group or not */
+static inline int is_root_group_ioq(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return (ioq->entity.sched_data == &efqd->root_group->sched_data);
+}
+
+/*
+ * sysfs parts below -->
+ */
+static ssize_t
+elv_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%u\n", var);
+}
+
+static ssize_t
+elv_var_store(unsigned int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+
+	*var = simple_strtoul(p, &p, 10);
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct elv_fq_data *efqd = &e->efqd;				\
+	unsigned int __data = __VAR;					\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return elv_var_show(__data, (page));				\
+}
+SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
+EXPORT_SYMBOL(elv_slice_idle_show);
+SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
+EXPORT_SYMBOL(elv_slice_sync_show);
+SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
+EXPORT_SYMBOL(elv_slice_async_show);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\
+{									\
+	struct elv_fq_data *efqd = &e->efqd;				\
+	unsigned int __data;						\
+	int ret = elv_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_idle_store);
+STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_sync_store);
+STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_async_store);
+#undef STORE_FUNCTION
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (elv_nr_busy_ioq(q->elevator)) {
+		elv_log(efqd, "schedule dispatch");
+		kblockd_schedule_work(efqd->queue, &efqd->unplug_work);
+	}
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+static void elv_kick_queue(struct work_struct *work)
+{
+	struct elv_fq_data *efqd =
+		container_of(work, struct elv_fq_data, unplug_work);
+	struct request_queue *q = efqd->queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+static void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+	del_timer_sync(&e->efqd.idle_slice_timer);
+	cancel_work_sync(&e->efqd.unplug_work);
+}
+
+static void elv_ioq_set_prio_slice(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	ioq->slice_end = jiffies + ioq->entity.budget;
+	elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->entity.budget);
+}
+
+static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	unsigned long elapsed = jiffies - ioq->last_end_request;
+	unsigned long ttime = min(elapsed, 2UL * efqd->elv_slice_idle);
+
+	ioq->ttime_samples = (7*ioq->ttime_samples + 256) / 8;
+	ioq->ttime_total = (7*ioq->ttime_total + 256*ttime) / 8;
+	ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
+}
+
+/*
+ * Disable idle window if the process thinks too long.
+ * This idle flag can also be updated by io scheduler.
+ */
+static void elv_ioq_update_idle_window(struct elevator_queue *eq,
+				struct io_queue *ioq, struct request *rq)
+{
+	int old_idle, enable_idle;
+	struct elv_fq_data *efqd = ioq->efqd;
+
+	/*
+	 * Don't idle for async or idle io prio class
+	 */
+	if (!elv_ioq_sync(ioq) || elv_ioq_class_idle(ioq))
+		return;
+
+	enable_idle = old_idle = elv_ioq_idle_window(ioq);
+
+	if (!efqd->elv_slice_idle)
+		enable_idle = 0;
+	else if (ioq_sample_valid(ioq->ttime_samples)) {
+		if (ioq->ttime_mean > efqd->elv_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+
+	/*
+	 * From think time perspective idle should be enabled. Check with
+	 * io scheduler if it wants to disable idling based on additional
+	 * considerations like seek pattern.
+	 */
+	if (enable_idle) {
+		if (eq->ops->elevator_update_idle_window_fn)
+			enable_idle = eq->ops->elevator_update_idle_window_fn(
+						eq, ioq->sched_queue, rq);
+		if (!enable_idle)
+			elv_log_ioq(efqd, ioq, "iosched disabled idle");
+	}
+
+	if (old_idle != enable_idle) {
+		elv_log_ioq(efqd, ioq, "idle=%d", enable_idle);
+		if (enable_idle)
+			elv_mark_ioq_idle_window(ioq);
+		else
+			elv_clear_ioq_idle_window(ioq);
+	}
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+	struct io_queue *ioq = NULL;
+
+	ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+	return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+	kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+		struct io_group *iog, void *sched_queue, int ioprio_class,
+		int ioprio, int is_sync)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	RB_CLEAR_NODE(&ioq->entity.rb_node);
+	atomic_set(&ioq->ref, 0);
+	ioq->efqd = efqd;
+	elv_ioq_set_ioprio_class(ioq, ioprio_class);
+	elv_ioq_set_ioprio(ioq, ioprio);
+	ioq->pid = current->pid;
+	ioq->sched_queue = sched_queue;
+	if (is_sync && !elv_ioq_class_idle(ioq))
+		elv_mark_ioq_idle_window(ioq);
+	bfq_init_entity(&ioq->entity, iog);
+	ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
+	if (is_sync)
+		ioq->last_end_request = jiffies;
+
+	return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
+						efqd);
+
+	BUG_ON(atomic_read(&ioq->ref) <= 0);
+	if (!atomic_dec_and_test(&ioq->ref))
+		return;
+	BUG_ON(ioq->nr_queued);
+	BUG_ON(ioq->entity.tree != NULL);
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(efqd->active_queue == ioq);
+
+	/* Can be called by outgoing elevator. Don't use q */
+	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+
+	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+	elv_log_ioq(efqd, ioq, "put_queue");
+	elv_free_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_put_ioq);
+
+void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+	struct io_queue *ioq = *ioq_ptr;
+
+	if (ioq != NULL) {
+		/* Drop the reference taken by the io group */
+		elv_put_ioq(ioq);
+		*ioq_ptr = NULL;
+	}
+}
+
+/*
+ * Normally next io queue to be served is selected from the service tree.
+ * This function allows one to choose a specific io queue to run next
+ * out of order. This is primarily to accommodate the close_cooperator
+ * feature of cfq.
+ *
+ * Currently this is done only at the root level: to begin with, the
+ * close cooperator feature is supported only for the root group, so that
+ * default cfq behavior in a flat hierarchy is not changed.
+ */
+static void elv_set_next_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	struct io_sched_data *sd = &efqd->root_group->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	BUG_ON(efqd->active_queue != NULL || sd->active_entity != NULL);
+	BUG_ON(!efqd->busy_queues);
+	BUG_ON(sd != entity->sched_data);
+	BUG_ON(!st);
+
+	bfq_update_vtime(st);
+	bfq_active_remove(st, entity);
+	sd->active_entity = entity;
+	entity->service = 0;
+	elv_log_ioq(efqd, ioq, "set_next_ioq");
+}
+
+/* Get next queue for service. */
+static struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = NULL;
+	struct io_queue *ioq = NULL;
+	struct io_sched_data *sd;
+
+	/*
+	 * We should not call lookup when an entity is active, as doing
+	 * lookup can result in an erroneous vtime jump.
+	 */
+	BUG_ON(efqd->active_queue != NULL);
+
+	if (!efqd->busy_queues)
+		return NULL;
+
+	sd = &efqd->root_group->sched_data;
+	entity = bfq_lookup_next_entity(sd, 1);
+
+	BUG_ON(!entity);
+	if (extract)
+		entity->service = 0;
+	ioq = io_entity_to_ioq(entity);
+
+	return ioq;
+}
+
+/*
+ * coop (cooperating queue) tells that io scheduler selected a queue for us
+ * and we did not select the next queue based on fairness.
+ */
+static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int coop)
+{
+	struct request_queue *q = efqd->queue;
+
+	if (ioq) {
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+							efqd->busy_queues);
+		ioq->slice_end = 0;
+
+		elv_clear_ioq_wait_request(ioq);
+		elv_clear_ioq_must_dispatch(ioq);
+		elv_mark_ioq_slice_new(ioq);
+
+		del_timer(&efqd->idle_slice_timer);
+	}
+
+	efqd->active_queue = ioq;
+
+	/* Let iosched know if it wants to take some action */
+	if (ioq) {
+		if (q->elevator->ops->elevator_active_ioq_set_fn)
+			q->elevator->ops->elevator_active_ioq_set_fn(q,
+							ioq->sched_queue, coop);
+	}
+}
+
+/* Get and set a new active queue for service. */
+static struct io_queue *elv_set_active_ioq(struct request_queue *q,
+						struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	int coop = 0;
+
+	if (!ioq)
+		ioq = elv_get_next_ioq(q, 1);
+	else {
+		elv_set_next_ioq(q, ioq);
+		/*
+		 * io scheduler selected the next queue for us. Pass this
+		 * info back to the io scheduler. cfq currently uses it
+		 * to reset coop flag on the queue.
+		 */
+		coop = 1;
+	}
+	__elv_set_active_ioq(efqd, ioq, coop);
+	return ioq;
+}
+
+static void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+	struct request_queue *q = efqd->queue;
+	struct io_queue *ioq = elv_active_ioq(efqd->queue->elevator);
+
+	if (q->elevator->ops->elevator_active_ioq_reset_fn)
+		q->elevator->ops->elevator_active_ioq_reset_fn(q,
+							ioq->sched_queue);
+	efqd->active_queue = NULL;
+	del_timer(&efqd->idle_slice_timer);
+}
+
+static void elv_activate_ioq(struct io_queue *ioq, int add_front)
+{
+	bfq_activate_entity(&ioq->entity, add_front);
+}
+
+static void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue)
+{
+	bfq_deactivate_entity(&ioq->entity, requeue);
+}
+
+/* Called when an inactive queue receives a new request. */
+static void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(ioq == efqd->active_queue);
+	elv_log_ioq(efqd, ioq, "add to busy");
+	elv_activate_ioq(ioq, 0);
+	elv_mark_ioq_busy(ioq);
+	efqd->busy_queues++;
+	if (elv_ioq_class_rt(ioq)) {
+		struct io_group *iog = ioq_to_io_group(ioq);
+		iog->busy_rt_queues++;
+	}
+}
+
+static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	BUG_ON(!elv_ioq_busy(ioq));
+	BUG_ON(ioq->nr_queued);
+	elv_log_ioq(efqd, ioq, "del from busy");
+	elv_clear_ioq_busy(ioq);
+	BUG_ON(efqd->busy_queues == 0);
+	efqd->busy_queues--;
+	if (elv_ioq_class_rt(ioq)) {
+		struct io_group *iog = ioq_to_io_group(ioq);
+		iog->busy_rt_queues--;
+	}
+
+	elv_deactivate_ioq(efqd, ioq, requeue);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * current queue used and adjust the start, finish time of queue and vtime
+ * of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations. Especially when underlying device supports command queuing
+ * and requests from multiple queues can be there at same time, then it
+ * is not clear which queue consumed how much of disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of the queue only
+ * after the first request from the queue has completed. This does not work
+ * very well if we expire the queue before its first (or any later)
+ * request has finished. For seeky queues, we will expire the queue
+ * after dispatching a few requests without waiting and start dispatching
+ * from the next queue.
+ *
+ * Not sure how to determine the time consumed by queue in such scenarios.
+ * Currently as a crude approximation, we are charging 25% of time slice
+ * for such cases. A better mechanism is needed for accurate accounting.
+ */
+void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	long slice_unused = 0, slice_used = 0, slice_overshoot = 0;
+
+	assert_spin_locked(q->queue_lock);
+	elv_log_ioq(efqd, ioq, "slice expired");
+
+	if (elv_ioq_wait_request(ioq))
+		del_timer(&efqd->idle_slice_timer);
+
+	elv_clear_ioq_wait_request(ioq);
+
+	/*
+	 * If ioq->slice_end == 0, the queue was expired before the first
+	 * request from the queue completed. Of course we are not planning
+	 * to idle on the queue, otherwise we would not have expired it.
+	 *
+	 * Charge for the 25% slice in such cases. This is not the best thing
+	 * to do but at the same time not very sure what's the next best
+	 * thing to do.
+	 *
+	 * This arises from the fact that we don't have the notion of
+	 * one queue being operational at one time. io scheduler can dispatch
+	 * requests from multiple queues in one dispatch round. Ideally for
+	 * more accurate accounting of exact disk time used by a queue, one
+	 * should dispatch requests from only one queue and wait for all
+	 * the requests to finish. But this will reduce throughput.
+	 */
+	if (!ioq->slice_end)
+		slice_used = entity->budget/4;
+	else {
+		if (time_after(ioq->slice_end, jiffies)) {
+			slice_unused = ioq->slice_end - jiffies;
+			if (slice_unused == entity->budget) {
+				/*
+				 * queue got expired immediately after
+				 * completing first request. Charge 25% of
+				 * slice.
+				 */
+				slice_used = entity->budget/4;
+			} else
+				slice_used = entity->budget - slice_unused;
+		} else {
+			slice_overshoot = jiffies - ioq->slice_end;
+			slice_used = entity->budget + slice_overshoot;
+		}
+	}
+
+	elv_log_ioq(efqd, ioq, "sl_end=%lx, jiffies=%lx", ioq->slice_end,
+			jiffies);
+	elv_log_ioq(efqd, ioq, "sl_used=%ld, budget=%ld overshoot=%ld",
+				slice_used, entity->budget, slice_overshoot);
+	elv_ioq_served(ioq, slice_used);
+
+	BUG_ON(ioq != efqd->active_queue);
+	elv_reset_active_ioq(efqd);
+
+	if (!ioq->nr_queued)
+		elv_del_ioq_busy(q->elevator, ioq, 1);
+	else
+		elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(__elv_ioq_slice_expired);
+
+/*
+ * Expire the ioq.
+ */
+void elv_ioq_slice_expired(struct request_queue *q)
+{
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (ioq)
+		__elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no or if we aren't sure; returning 1 will cause a preemption attempt.
+ */
+static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+			struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elevator_queue *eq = q->elevator;
+	struct io_entity *entity, *new_entity;
+
+	ioq = elv_active_ioq(eq);
+
+	if (!ioq)
+		return 0;
+
+	entity = &ioq->entity;
+	new_entity = &new_ioq->entity;
+
+	/*
+	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
+	 */
+
+	if (new_entity->ioprio_class == IOPRIO_CLASS_RT
+	    && entity->ioprio_class != IOPRIO_CLASS_RT)
+		return 1;
+	/*
+	 * Allow a BE request to pre-empt an ongoing IDLE class timeslice.
+	 */
+
+	if (new_entity->ioprio_class == IOPRIO_CLASS_BE
+	    && entity->ioprio_class == IOPRIO_CLASS_IDLE)
+		return 1;
+
+	/*
+	 * Check with io scheduler if it has additional criterion based on
+	 * which it wants to preempt existing queue.
+	 */
+	if (eq->ops->elevator_should_preempt_fn)
+		return eq->ops->elevator_should_preempt_fn(q,
+						ioq_sched_queue(new_ioq), rq);
+
+	return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
+	elv_ioq_slice_expired(q);
+
+	/*
+	 * Put the new queue at the front of the current list,
+	 * so we know that it will be selected next.
+	 */
+
+	elv_activate_ioq(ioq, 1);
+	ioq->slice_end = 0;
+	elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	BUG_ON(!efqd);
+	BUG_ON(!ioq);
+	efqd->rq_queued++;
+	ioq->nr_queued++;
+
+	if (!elv_ioq_busy(ioq))
+		elv_add_ioq_busy(efqd, ioq);
+
+	elv_ioq_update_io_thinktime(ioq);
+	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+
+	if (ioq == elv_active_ioq(q->elevator)) {
+		/*
+		 * Remember that we saw a request from this process, but
+		 * don't start queuing just yet. Otherwise we risk seeing lots
+		 * of tiny requests, because we disrupt the normal plugging
+		 * and merging. If the request is already larger than a single
+		 * page, let it rip immediately. For that case we assume that
+		 * merging is already done. Ditto for a busy system that
+		 * has other work pending, don't risk delaying until the
+		 * idle timer unplug to continue working.
+		 */
+		if (elv_ioq_wait_request(ioq)) {
+			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+			    efqd->busy_queues > 1) {
+				del_timer(&efqd->idle_slice_timer);
+				__blk_run_queue(q);
+			}
+			elv_mark_ioq_must_dispatch(ioq);
+		}
+	} else if (elv_should_preempt(q, ioq, rq)) {
+		/*
+		 * not the active queue - expire current slice if it is
+		 * idle and has expired its mean thinktime or this new queue
+		 * has some old slice time left and is of higher priority or
+		 * this new queue is RT and the current one is BE
+		 */
+		elv_preempt_queue(q, ioq);
+		__blk_run_queue(q);
+	}
+}
+
+static void elv_idle_slice_timer(unsigned long data)
+{
+	struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+	struct io_queue *ioq;
+	unsigned long flags;
+	struct request_queue *q = efqd->queue;
+
+	elv_log(efqd, "idle timer fired");
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	ioq = efqd->active_queue;
+
+	if (ioq) {
+
+		/*
+		 * We saw a request before the queue expired, let it through
+		 */
+		if (elv_ioq_must_dispatch(ioq))
+			goto out_kick;
+
+		/*
+		 * expired
+		 */
+		if (elv_ioq_slice_used(ioq))
+			goto expire;
+
+		/*
+		 * only expire and reinvoke request handler, if there are
+		 * other queues with pending requests
+		 */
+		if (!elv_nr_busy_ioq(q->elevator))
+			goto out_cont;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (ioq->nr_queued)
+			goto out_kick;
+	}
+expire:
+	elv_ioq_slice_expired(q);
+out_kick:
+	elv_schedule_dispatch(q);
+out_cont:
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	unsigned long sl;
+
+	BUG_ON(!ioq);
+
+	/*
+	 * SSD device without seek penalty, disable idling. But only do so
+	 * for devices that support queuing, otherwise we still have a problem
+	 * with sync vs async workloads.
+	 */
+	if (blk_queue_nonrot(q) && efqd->hw_tag)
+		return;
+
+	/*
+	 * still requests with the driver, don't idle
+	 */
+	if (efqd->rq_in_driver)
+		return;
+
+	/*
+	 * idle is disabled, either manually or by past process history
+	 */
+	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+		return;
+
+	/*
+	 * The io scheduler may have its own idling logic. In that case the io
+	 * scheduler will take care of arming the timer, if need be.
+	 */
+	if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+		q->elevator->ops->elevator_arm_slice_timer_fn(q,
+						ioq->sched_queue);
+	} else {
+		elv_mark_ioq_wait_request(ioq);
+		sl = efqd->elv_slice_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log_ioq(efqd, ioq, "arm idle: %lu", sl);
+	}
+}
+
+/*
+ * If io scheduler has functionality of keeping track of close cooperator, check
+ * with it if it has got a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+					struct io_queue *ioq, int probe)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *new_ioq = NULL;
+
+	/*
+	 * Currently this feature is supported only for flat hierarchy or
+	 * root group queues so that default cfq behavior is not changed.
+	 */
+	if (!is_root_group_ioq(q, ioq))
+		return NULL;
+
+	if (q->elevator->ops->elevator_close_cooperator_fn)
+		new_ioq = e->ops->elevator_close_cooperator_fn(q,
+						ioq->sched_queue, probe);
+
+	/* Only select co-operating queue if it belongs to root group */
+	if (new_ioq && !is_root_group_ioq(q, new_ioq))
+		return NULL;
+
+	return new_ioq;
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+	struct io_group *iog;
+
+	if (!elv_nr_busy_ioq(q->elevator))
+		return NULL;
+
+	if (ioq == NULL)
+		goto new_queue;
+
+	/*
+	 * Force dispatch. Continue to dispatch from current queue as long
+	 * as it has requests.
+	 */
+	if (unlikely(force)) {
+		if (ioq->nr_queued)
+			goto keep_queue;
+		else
+			goto expire;
+	}
+
+	/*
+	 * The active queue has run out of time, expire it and select new.
+	 */
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+		goto expire;
+
+	/*
+	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
+	 * cfqq.
+	 */
+	iog = ioq_to_io_group(ioq);
+
+	if (!elv_ioq_class_rt(ioq) && iog->busy_rt_queues) {
+		/*
+		 * We simulate this as cfqq timed out so that it gets to bank
+		 * the remaining of its time slice.
+		 */
+		elv_log_ioq(efqd, ioq, "preempt");
+		goto expire;
+	}
+
+	/*
+	 * The active queue has requests and isn't expired, allow it to
+	 * dispatch.
+	 */
+
+	if (ioq->nr_queued)
+		goto keep_queue;
+
+	/*
+	 * If another queue has a request waiting within our mean seek
+	 * distance, let it run.  The expire code will check for close
+	 * cooperators and put the close queue at the front of the service
+	 * tree.
+	 */
+	new_ioq = elv_close_cooperator(q, ioq, 0);
+	if (new_ioq)
+		goto expire;
+
+	/*
+	 * No requests pending. If the active queue still has requests in
+	 * flight or is idling for a new request, allow either of these
+	 * conditions to happen (or time out) before selecting a new queue.
+	 */
+
+	if (timer_pending(&efqd->idle_slice_timer) ||
+	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
+expire:
+	elv_ioq_slice_expired(q);
+new_queue:
+	ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+	return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	ioq = rq->ioq;
+	BUG_ON(!ioq);
+	ioq->nr_queued--;
+
+	efqd = ioq->efqd;
+	BUG_ON(!efqd);
+	efqd->rq_queued--;
+
+	if (elv_ioq_busy(ioq) && (elv_active_ioq(e) != ioq) && !ioq->nr_queued)
+		elv_del_ioq_busy(e, ioq, 1);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	BUG_ON(!ioq);
+	elv_ioq_request_dispatched(ioq);
+	elv_ioq_request_removed(e, rq);
+	elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	efqd->rq_in_driver++;
+	elv_log_ioq(efqd, rq->ioq, "activate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+	elv_log_ioq(efqd, rq->ioq, "deactivate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+/*
+ * Update hw_tag based on peak queue depth over 50 samples under
+ * sufficient load.
+ */
+static void elv_update_hw_tag(struct elv_fq_data *efqd)
+{
+	if (efqd->rq_in_driver > efqd->rq_in_driver_peak)
+		efqd->rq_in_driver_peak = efqd->rq_in_driver;
+
+	if (efqd->rq_queued <= ELV_HW_QUEUE_MIN &&
+	    efqd->rq_in_driver <= ELV_HW_QUEUE_MIN)
+		return;
+
+	if (efqd->hw_tag_samples++ < 50)
+		return;
+
+	if (efqd->rq_in_driver_peak >= ELV_HW_QUEUE_MIN)
+		efqd->hw_tag = 1;
+	else
+		efqd->hw_tag = 0;
+
+	efqd->hw_tag_samples = 0;
+	efqd->rq_in_driver_peak = 0;
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+	const int sync = rq_is_sync(rq);
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	ioq = rq->ioq;
+
+	elv_log_ioq(efqd, ioq, "complete");
+
+	elv_update_hw_tag(efqd);
+
+	WARN_ON(!efqd->rq_in_driver);
+	WARN_ON(!ioq->dispatched);
+	efqd->rq_in_driver--;
+	ioq->dispatched--;
+
+	if (sync)
+		ioq->last_end_request = jiffies;
+
+	/*
+	 * If this is the active queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+
+	if (elv_active_ioq(q->elevator) == ioq) {
+		if (elv_ioq_slice_new(ioq)) {
+			elv_ioq_set_prio_slice(q, ioq);
+			elv_clear_ioq_slice_new(ioq);
+		}
+		/*
+		 * If there are no requests waiting in this queue, and
+		 * there are other queues ready to issue requests, AND
+		 * those other queues are issuing requests within our
+		 * mean seek distance, give them a chance to run instead
+		 * of idling.
+		 */
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+			elv_ioq_slice_expired(q);
+		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+			 && sync && !rq_noidle(rq))
+			elv_ioq_arm_slice_timer(q);
+	}
+
+	if (!efqd->rq_in_driver)
+		elv_schedule_dispatch(q);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	/* In flat mode, there is only root group */
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_get_io_group);
+
+void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio)
+{
+	struct io_queue *ioq = NULL;
+
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		ioq = iog->async_queue[0][ioprio];
+		break;
+	case IOPRIO_CLASS_BE:
+		ioq = iog->async_queue[1][ioprio];
+		break;
+	case IOPRIO_CLASS_IDLE:
+		ioq = iog->async_idle_queue;
+		break;
+	default:
+		BUG();
+	}
+
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+EXPORT_SYMBOL(io_group_async_queue_prio);
+
+void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		iog->async_queue[0][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_BE:
+		iog->async_queue[1][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		iog->async_idle_queue = ioq;
+		break;
+	default:
+		BUG();
+	}
+
+	/*
+	 * Take the group reference and pin the queue. Group exit will
+	 * clean it up
+	 */
+	elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(io_group_set_async_queue);
+
+/*
+ * Release all the io group references to its async queues.
+ */
+static void
+io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	struct io_service_tree *st;
+	int i;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
+
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+static void elv_slab_kill(void)
+{
+	/*
+	 * Caller already ensured that pending RCU callbacks are completed,
+	 * so we should have no busy allocations at this point.
+	 */
+	if (elv_ioq_pool)
+		kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+	elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+	if (!elv_ioq_pool)
+		goto fail;
+
+	return 0;
+fail:
+	elv_slab_kill();
+	return -ENOMEM;
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	iog = io_alloc_root_group(q, e, efqd);
+	if (iog == NULL)
+		return 1;
+
+	efqd->root_group = iog;
+	efqd->queue = q;
+
+	init_timer(&efqd->idle_slice_timer);
+	efqd->idle_slice_timer.function = elv_idle_slice_timer;
+	efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+	efqd->elv_slice[0] = elv_slice_async;
+	efqd->elv_slice[1] = elv_slice_sync;
+	efqd->elv_slice_idle = elv_slice_idle;
+	efqd->hw_tag = 1;
+
+	return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask elevator to cleanup its queues, we do the cleanup here so
+ * that all the group and idle tree references to ioq are dropped. Later
+ * during elevator cleanup, ioc reference will be dropped which will lead
+ * to removal of ioscheduler queue as well as associated ioq object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+	io_free_root_group(e);
+}
+
+/*
+ * This is called after the io scheduler has cleaned up its data structures.
+ * I don't think that this function is required. Right now just keeping it
+ * because cfq cleans up timer and work queue again after freeing up
+ * io contexts. To me the io scheduler has already been drained, and all
+ * active queues have already been expired, so the timer and work queue
+ * should not be activated during the cleanup process.
+ *
+ * Keeping it here for the time being. Will get rid of it later.
+ */
+void elv_exit_fq_data_post(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+}
+
+
+static int __init elv_fq_init(void)
+{
+	if (elv_slab_setup())
+		return -ENOMEM;
+
+	/* could be 0 on HZ < 1000 setups */
+
+	if (!elv_slice_async)
+		elv_slice_async = 1;
+
+	if (!elv_slice_idle)
+		elv_slice_idle = 1;
+
+	return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 4554d7f..a7cbc0f 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -22,6 +22,10 @@
 struct io_entity;
 struct io_queue;
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+#define ELV_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, elv_##name##_show, elv_##name##_store)
+
 /**
  * struct io_service_tree - per ioprio_class service tree.
  * @active: tree for active entities (i.e., those backlogged).
@@ -149,15 +153,125 @@ struct io_entity {
 struct io_queue {
 	struct io_entity entity;
 	atomic_t ref;
+	unsigned int flags;
 
 	/* Pointer to generic elevator fair queuing data structure */
 	struct elv_fq_data *efqd;
+	pid_t pid;
+
+	/* Number of requests queued on this io queue */
+	unsigned long nr_queued;
+
+	/* Requests dispatched from this queue */
+	int dispatched;
+
+	/* Keep track of think time of processes in this queue */
+	unsigned long last_end_request;
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+
+	unsigned long slice_end;
+
+	/* Pointer to io scheduler's queue */
+	void *sched_queue;
 };
 
 struct io_group {
 	struct io_sched_data sched_data;
+	/*
+	 * Async queues for each priority level of the RT and BE classes.
+	 * Used only for cfq.
+	 */
+
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+
+	/*
+	 * Used to track any pending rt requests so we can pre-empt current
+	 * non-RT cfqq in service when this value is non-zero.
+	 */
+	unsigned int busy_rt_queues;
+};
+
+struct elv_fq_data {
+	struct io_group *root_group;
+
+	struct request_queue *queue;
+	unsigned int busy_queues;
+
+	/* Number of requests queued */
+	int rq_queued;
+
+	/* Pointer to the ioscheduler queue being served */
+	void *active_queue;
+
+	/*
+	 * queue-depth detection
+	 */
+	int rq_in_driver;
+	int hw_tag;
+	int hw_tag_samples;
+	int rq_in_driver_peak;
+
+	/*
+	 * elevator fair queuing layer has the capability to provide idling
+	 * for ensuring fairness for processes doing dependent reads.
+	 * This might be needed to ensure fairness among two processes doing
+	 * synchronous reads in two different cgroups. noop and deadline don't
+	 * have any notion of anticipation/idling. As of now, these are the
+	 * users of this functionality.
+	 */
+	unsigned int elv_slice_idle;
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
+	/* Base slice length for sync and async queues */
+	unsigned int elv_slice[2];
 };
 
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
+				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples)   ((samples) > 80)
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_busy = 0,          /* has requests or is under service */
+	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+	ELV_QUEUE_FLAG_idle_window,	  /* elevator slice idling enabled */
+	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
+	ELV_QUEUE_FLAG_must_dispatch,	  /* must be allowed a dispatch */
+	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name)					\
+static inline void elv_mark_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline int elv_ioq_##name(struct io_queue *ioq)         		\
+{                                                                       \
+	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
+}
+
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
+
 static inline struct io_service_tree *
 io_entity_service_tree(struct io_entity *entity)
 {
@@ -169,4 +283,194 @@ io_entity_service_tree(struct io_entity *entity)
 
 	return sched_data->service_tree + idx;
 }
+
+/* A request got dispatched from the io_queue. Do the accounting. */
+static inline void elv_ioq_request_dispatched(struct io_queue *ioq)
+{
+	ioq->dispatched++;
+}
+
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+	if (elv_ioq_slice_new(ioq))
+		return 0;
+	if (time_before(jiffies, ioq->slice_end))
+		return 0;
+
+	return 1;
+}
+
+/* How many requests are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+	return ioq->dispatched;
+}
+
+/* How many requests are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+	return ioq->nr_queued;
+}
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+	atomic_inc(&ioq->ref);
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+						int ioprio_class)
+{
+	ioq->entity.new_ioprio_class = ioprio_class;
+	ioq->entity.ioprio_changed = 1;
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static inline unsigned int bfq_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	return IOPRIO_BE_NR - ioprio;
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+	ioq->entity.new_ioprio = ioprio;
+	ioq->entity.new_weight = bfq_ioprio_to_weight(ioprio);
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq)
+{
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+
+static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return container_of(ioq->entity.sched_data, struct io_group,
+						sched_data);
+}
+
+extern ssize_t elv_slice_idle_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_idle_store(struct elevator_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_async_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
+						size_t count);
+
+/* Functions used by elevator.c */
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+extern void elv_exit_fq_data_post(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+					struct request *rq);
+extern void elv_fq_dispatched_request(struct elevator_queue *e,
+					struct request *rq);
+
+extern void elv_fq_activate_rq(struct request_queue *q, struct request *rq);
+extern void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+				struct request *rq);
+
+extern void *elv_fq_select_ioq(struct request_queue *q, int force);
+
+/* Functions used by io schedulers */
+extern void elv_put_ioq(struct io_queue *ioq);
+extern void __elv_ioq_slice_expired(struct request_queue *q,
+					struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+		struct io_group *iog, void *sched_queue, int ioprio_class,
+		int ioprio, int is_sync);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern int elv_hw_tag(struct elevator_queue *e);
+extern void *elv_active_sched_queue(struct elevator_queue *e);
+extern int elv_mod_idle_slice_timer(struct elevator_queue *eq,
+					unsigned long expires);
+extern int elv_del_idle_slice_timer(struct elevator_queue *eq);
+extern unsigned int elv_get_slice_idle(struct elevator_queue *eq);
+extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio);
+extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq);
+extern struct io_group *io_get_io_group(struct request_queue *q);
+extern int elv_nr_busy_ioq(struct elevator_queue *e);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_init_fq_data(struct request_queue *q,
+					struct elevator_queue *e)
+{
+	return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+static inline void elv_exit_fq_data_post(struct elevator_queue *e) {}
+
+static inline void elv_fq_activate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_deactivate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_dispatched_request(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_removed(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_add(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_ioq_completed_request(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index ca86192..357f529 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -226,6 +226,9 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
 	for (i = 0; i < ELV_HASH_ENTRIES; i++)
 		INIT_HLIST_HEAD(&eq->hash[i]);
 
+	if (elv_init_fq_data(q, eq))
+		goto err;
+
 	return eq;
 err:
 	kfree(eq);
@@ -296,9 +299,11 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
 	e->ops = NULL;
+	elv_exit_fq_data_post(e);
 	mutex_unlock(&e->sysfs_lock);
 
 	kobject_put(&e->kobj);
@@ -425,6 +430,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	boundary = q->end_sector;
 	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -465,6 +471,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	q->end_sector = rq_end_sector(rq);
 	q->boundary_rq = rq;
@@ -532,6 +539,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	elv_rqhash_del(q, next);
 
 	q->nr_sorted--;
+	elv_ioq_request_removed(e, next);
 	q->last_merge = rq;
 }
 
@@ -638,12 +646,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 				q->last_merge = rq;
 		}
 
-		/*
-		 * Some ioscheds (cfq) run q->request_fn directly, so
-		 * rq cannot be accessed after calling
-		 * elevator_add_req_fn.
-		 */
 		q->elevator->ops->elevator_add_req_fn(q, rq);
+		elv_ioq_request_add(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_REQUEUE:
@@ -742,13 +746,12 @@ EXPORT_SYMBOL(elv_add_request);
 
 int elv_queue_empty(struct request_queue *q)
 {
-	struct elevator_queue *e = q->elevator;
-
 	if (!list_empty(&q->queue_head))
 		return 0;
 
-	if (e->ops->elevator_queue_empty_fn)
-		return e->ops->elevator_queue_empty_fn(q);
+	/* Hopefully nr_sorted works and no need to call queue_empty_fn */
+	if (q->nr_sorted)
+		return 0;
 
 	return 1;
 }
@@ -828,8 +831,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight[rq_is_sync(rq)]--;
-		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
-			e->ops->elevator_completed_req_fn(q, rq);
+		if (blk_sorted_rq(rq)) {
+			if (e->ops->elevator_completed_req_fn)
+				e->ops->elevator_completed_req_fn(q, rq);
+			elv_ioq_completed_request(q, rq);
+		}
 	}
 
 	/*
@@ -1125,3 +1131,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 	return NULL;
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+	return ioq_sched_queue(req_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+	return ioq_sched_queue(elv_fq_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8963d91..551e17d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -234,6 +234,11 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* io queue request belongs to */
+	struct io_queue *ioq;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
@@ -241,6 +246,15 @@ static inline unsigned short req_get_ioprio(struct request *req)
 	return req->ioprio;
 }
 
+static inline struct io_queue *req_ioq(struct request *req)
+{
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	return req->ioq;
+#else
+	return NULL;
+#endif
+}
+
 /*
  * State information carried for REQ_TYPE_PM_SUSPEND and REQ_TYPE_PM_RESUME
  * requests. Some step values could eventually be made generic.
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 1cb3372..81f1ed8 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -2,6 +2,7 @@
 #define _LINUX_ELEVATOR_H
 
 #include <linux/percpu.h>
+#include "../../block/elevator-fq.h"
 
 #ifdef CONFIG_BLOCK
 
@@ -29,6 +30,18 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+						struct request*);
+typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
+						struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+						void*, int probe);
+#endif
 
 struct elevator_ops
 {
@@ -56,6 +69,17 @@ struct elevator_ops
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+	elevator_should_preempt_fn *elevator_should_preempt_fn;
+	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
 };
 
 #define ELV_NAME_MAX	(16)
@@ -76,6 +100,9 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	int elevator_features;
+#endif
 };
 
 /*
@@ -89,6 +116,10 @@ struct elevator_queue
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
 	struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* fair queuing data */
+	struct elv_fq_data efqd;
+#endif
 };
 
 /*
@@ -207,5 +238,25 @@ enum {
 	__val;							\
 })
 
+/* iosched can let elevator know their feature set/capability */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fair queuing logic of elevator layer */
+#define	ELV_IOSCHED_NEED_FQ	1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6



* [PATCH 04/25] io-controller: Common flat fair queuing code in elevator layer
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

This is the common fair queuing code in the elevator layer. It is controlled by
the config option CONFIG_ELV_FAIR_QUEUING. This patch initially introduces only
flat fair queuing support, where there is a single group, the "root group", and
all tasks belong to it.

These elevator layer changes are backward compatible: any ioscheduler using the
old interfaces will continue to work.
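
As a rough illustration of why old ioschedulers are unaffected (a userspace
sketch, not part of the patch): the fair queuing hooks test the
elevator_features flag and return early unless the iosched has set
ELV_IOSCHED_NEED_FQ. The structs below are trimmed stand-ins for the kernel
ones, and demo_request_add() is a hypothetical helper modelled on the early
return at the top of elv_ioq_request_add().

```c
#include <assert.h>

/* Same value as the flag this patch adds to include/linux/elevator.h. */
#define ELV_IOSCHED_NEED_FQ	1

/* Trimmed stand-ins for the kernel structures (illustration only). */
struct elevator_type {
	int elevator_features;
};

struct elevator_queue {
	struct elevator_type *elevator_type;
};

/* Mirrors the helper of the same name in the patch. */
static int elv_iosched_fair_queuing_enabled(const struct elevator_queue *e)
{
	return e->elevator_type->elevator_features & ELV_IOSCHED_NEED_FQ;
}

/*
 * Hypothetical hook: bumps the queued-request counter only when the
 * iosched opted in to fair queuing. Returns 1 if accounting was done
 * and 0 if the call was a no-op.
 */
static int demo_request_add(const struct elevator_queue *e, int *rq_queued)
{
	if (!elv_iosched_fair_queuing_enabled(e))
		return 0;

	(*rq_queued)++;
	return 1;
}
```

An iosched that never sets elevator_features takes the early return, so its
behaviour is unchanged.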

This code is essentially the CFQ code for fair queuing. The primary difference
is that the flat round robin algorithm of CFQ has been replaced with BFQ (WF2Q+).
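
To make the split concrete, here is a small userspace sketch (again not part
of the patch) of the two ioprio mappings this series relies on: the CFQ-style
slice length computed by elv_prio_slice(), and the weight BFQ derives in
bfq_ioprio_to_weight(). The function names prio_slice() and ioprio_to_weight()
are illustrative, assert() stands in for WARN_ON(), and the base slice of 100
used in the note below assumes HZ = 1000 (sync slice HZ / 10).

```c
#include <assert.h>

#define IOPRIO_BE_NR	8	/* number of best-effort priority levels */
#define ELV_SLICE_SCALE	5

/*
 * Slice length for a given priority. base_slice is the per-class base
 * (sync or async) in jiffies; priority 4 gets exactly the base slice,
 * higher priorities (lower numbers) get more, lower priorities get less.
 */
static int prio_slice(int base_slice, int prio)
{
	assert(prio >= 0 && prio < IOPRIO_BE_NR);
	return base_slice + (base_slice / ELV_SLICE_SCALE * (4 - prio));
}

/* Weight handed to the BFQ entity: 8 for ioprio 0 down to 1 for ioprio 7. */
static unsigned int ioprio_to_weight(int ioprio)
{
	assert(ioprio >= 0 && ioprio < IOPRIO_BE_NR);
	return IOPRIO_BE_NR - ioprio;
}
```

With a base slice of 100 jiffies, priorities 0, 4 and 7 get slices of 180, 100
and 40 jiffies, and weights of 8, 4 and 1 respectively.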

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   13 +
 block/Makefile           |    1 +
 block/blk.h              |    4 +
 block/elevator-fq.c      | 1254 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h      |  304 +++++++++++
 block/elevator.c         |   42 ++-
 include/linux/blkdev.h   |   14 +
 include/linux/elevator.h |   51 ++
 8 files changed, 1667 insertions(+), 16 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 7e803fc..3398134 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -2,6 +2,19 @@ if BLOCK
 
 menu "IO Schedulers"
 
+config ELV_FAIR_QUEUING
+	bool "Elevator Fair Queuing Support"
+	default n
+	---help---
+	  Traditionally only cfq had the notion of multiple queues, and it
+	  did fair queuing on its own. With cgroups and the need to control
+	  IO, even simple io schedulers like noop, deadline and as will
+	  have one queue per cgroup and will need hierarchical fair queuing.
+	  Instead of every io scheduler implementing its own fair queuing
+	  logic, this option enables fair queuing in elevator layer so that
+	  other ioschedulers can make use of it.
+	  If unsure, say N.
+
 config IOSCHED_NOOP
 	bool
 	default y
diff --git a/block/Makefile b/block/Makefile
index e9fa4dd..94bfc6e 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -15,3 +15,4 @@ obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY)	+= blk-integrity.o
+obj-$(CONFIG_ELV_FAIR_QUEUING)	+= elevator-fq.o
diff --git a/block/blk.h b/block/blk.h
index 3fae6ad..99c3819 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -71,6 +71,8 @@ static inline void elv_activate_rq(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_activate_rq(q, rq);
+
 	if (e->ops->elevator_activate_req_fn)
 		e->ops->elevator_activate_req_fn(q, rq);
 }
@@ -79,6 +81,8 @@ static inline void elv_deactivate_rq(struct request_queue *q, struct request *rq
 {
 	struct elevator_queue *e = q->elevator;
 
+	elv_fq_deactivate_rq(q, rq);
+
 	if (e->ops->elevator_deactivate_req_fn)
 		e->ops->elevator_deactivate_req_fn(q, rq);
 }
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7ee4321..6f23d7e 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -14,6 +14,17 @@
 
 #include <linux/blkdev.h>
 #include "elevator-fq.h"
+#include <linux/blktrace_api.h>
+
+/* Values taken from cfq */
+const int elv_slice_sync = HZ / 10;
+int elv_slice_async = HZ / 25;
+const int elv_slice_async_rq = 2;
+int elv_slice_idle = HZ / 125;
+static struct kmem_cache *elv_ioq_pool;
+
+#define ELV_SLICE_SCALE		(5)
+#define ELV_HW_QUEUE_MIN	(5)
 
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
@@ -28,6 +39,22 @@
  */
 #define WFQ_SERVICE_SHIFT	22
 
+static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
+					unsigned short prio)
+{
+	const int base_slice = efqd->elv_slice[sync];
+
+	WARN_ON(prio >= IOPRIO_BE_NR);
+
+	return base_slice + (base_slice/ELV_SLICE_SCALE * (4 - prio));
+}
+
+static inline int
+elv_prio_to_slice(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	return elv_prio_slice(efqd, elv_ioq_sync(ioq), ioq->entity.ioprio);
+}
+
 /**
  * bfq_gt - compare two timestamps.
  * @a: first ts.
@@ -423,11 +450,6 @@ __bfq_entity_update_prio(struct io_service_tree *old_st,
 		 */
 		if (ioq) {
 			struct elv_fq_data *efqd = ioq->efqd;
-			/*
-			 * elv_prio_to_slice() is defined in later patches
-			 * where a slice length is calculated from the
-			 * ioprio of the queue.
-			 */
 			entity->budget = elv_prio_to_slice(efqd, ioq);
 		}
 
@@ -750,3 +772,1225 @@ static void io_flush_idle_tree(struct io_service_tree *st)
 	for (; entity != NULL; entity = st->first_idle)
 		__bfq_deactivate_entity(entity, 0);
 }
+
+/* Elevator fair queuing function */
+static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
+{
+	return e->efqd.active_queue;
+}
+
+void *elv_active_sched_queue(struct elevator_queue *e)
+{
+	return ioq_sched_queue(elv_active_ioq(e));
+}
+EXPORT_SYMBOL(elv_active_sched_queue);
+
+int elv_nr_busy_ioq(struct elevator_queue *e)
+{
+	return e->efqd.busy_queues;
+}
+EXPORT_SYMBOL(elv_nr_busy_ioq);
+
+int elv_hw_tag(struct elevator_queue *e)
+{
+	return e->efqd.hw_tag;
+}
+EXPORT_SYMBOL(elv_hw_tag);
+
+/* Helper functions for operating on elevator idle slice timer */
+int elv_mod_idle_slice_timer(struct elevator_queue *eq, unsigned long expires)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return mod_timer(&efqd->idle_slice_timer, expires);
+}
+EXPORT_SYMBOL(elv_mod_idle_slice_timer);
+
+int elv_del_idle_slice_timer(struct elevator_queue *eq)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	return del_timer(&efqd->idle_slice_timer);
+}
+EXPORT_SYMBOL(elv_del_idle_slice_timer);
+
+unsigned int elv_get_slice_idle(struct elevator_queue *eq)
+{
+	return eq->efqd.elv_slice_idle;
+}
+EXPORT_SYMBOL(elv_get_slice_idle);
+
+static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
+{
+	entity_served(&ioq->entity, served);
+}
+
+/* Tells whether ioq is queued in root group or not */
+static inline int is_root_group_ioq(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	return (ioq->entity.sched_data == &efqd->root_group->sched_data);
+}
+
+/*
+ * sysfs parts below -->
+ */
+static ssize_t
+elv_var_show(unsigned int var, char *page)
+{
+	return sprintf(page, "%u\n", var);
+}
+
+static ssize_t
+elv_var_store(unsigned int *var, const char *page, size_t count)
+{
+	char *p = (char *) page;
+
+	*var = simple_strtoul(p, &p, 10);
+	return count;
+}
+
+#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+ssize_t __FUNC(struct elevator_queue *e, char *page)		\
+{									\
+	struct elv_fq_data *efqd = &e->efqd;				\
+	unsigned int __data = __VAR;					\
+	if (__CONV)							\
+		__data = jiffies_to_msecs(__data);			\
+	return elv_var_show(__data, (page));				\
+}
+SHOW_FUNCTION(elv_slice_idle_show, efqd->elv_slice_idle, 1);
+EXPORT_SYMBOL(elv_slice_idle_show);
+SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
+EXPORT_SYMBOL(elv_slice_sync_show);
+SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
+EXPORT_SYMBOL(elv_slice_async_show);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
+ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\
+{									\
+	struct elv_fq_data *efqd = &e->efqd;				\
+	unsigned int __data;						\
+	int ret = elv_var_store(&__data, (page), count);		\
+	if (__data < (MIN))						\
+		__data = (MIN);						\
+	else if (__data > (MAX))					\
+		__data = (MAX);						\
+	if (__CONV)							\
+		*(__PTR) = msecs_to_jiffies(__data);			\
+	else								\
+		*(__PTR) = __data;					\
+	return ret;							\
+}
+STORE_FUNCTION(elv_slice_idle_store, &efqd->elv_slice_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_idle_store);
+STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_sync_store);
+STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_slice_async_store);
+#undef STORE_FUNCTION
+
+void elv_schedule_dispatch(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (elv_nr_busy_ioq(q->elevator)) {
+		elv_log(efqd, "schedule dispatch");
+		kblockd_schedule_work(efqd->queue, &efqd->unplug_work);
+	}
+}
+EXPORT_SYMBOL(elv_schedule_dispatch);
+
+static void elv_kick_queue(struct work_struct *work)
+{
+	struct elv_fq_data *efqd =
+		container_of(work, struct elv_fq_data, unplug_work);
+	struct request_queue *q = efqd->queue;
+
+	spin_lock_irq(q->queue_lock);
+	__blk_run_queue(q);
+	spin_unlock_irq(q->queue_lock);
+}
+
+static void elv_shutdown_timer_wq(struct elevator_queue *e)
+{
+	del_timer_sync(&e->efqd.idle_slice_timer);
+	cancel_work_sync(&e->efqd.unplug_work);
+}
+
+static void elv_ioq_set_prio_slice(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	ioq->slice_end = jiffies + ioq->entity.budget;
+	elv_log_ioq(efqd, ioq, "set_slice=%lu", ioq->entity.budget);
+}
+
+static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	unsigned long elapsed = jiffies - ioq->last_end_request;
+	unsigned long ttime = min(elapsed, 2UL * efqd->elv_slice_idle);
+
+	ioq->ttime_samples = (7*ioq->ttime_samples + 256) / 8;
+	ioq->ttime_total = (7*ioq->ttime_total + 256*ttime) / 8;
+	ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
+}
+
+/*
+ * Disable idle window if the process thinks too long.
+ * This idle flag can also be updated by io scheduler.
+ */
+static void elv_ioq_update_idle_window(struct elevator_queue *eq,
+				struct io_queue *ioq, struct request *rq)
+{
+	int old_idle, enable_idle;
+	struct elv_fq_data *efqd = ioq->efqd;
+
+	/*
+	 * Don't idle for async or idle io prio class
+	 */
+	if (!elv_ioq_sync(ioq) || elv_ioq_class_idle(ioq))
+		return;
+
+	enable_idle = old_idle = elv_ioq_idle_window(ioq);
+
+	if (!efqd->elv_slice_idle)
+		enable_idle = 0;
+	else if (ioq_sample_valid(ioq->ttime_samples)) {
+		if (ioq->ttime_mean > efqd->elv_slice_idle)
+			enable_idle = 0;
+		else
+			enable_idle = 1;
+	}
+
+	/*
+	 * From think time perspective idle should be enabled. Check with
+	 * io scheduler if it wants to disable idling based on additional
+	 * considerations like seek pattern.
+	 */
+	if (enable_idle) {
+		if (eq->ops->elevator_update_idle_window_fn)
+			enable_idle = eq->ops->elevator_update_idle_window_fn(
+						eq, ioq->sched_queue, rq);
+		if (!enable_idle)
+			elv_log_ioq(efqd, ioq, "iosched disabled idle");
+	}
+
+	if (old_idle != enable_idle) {
+		elv_log_ioq(efqd, ioq, "idle=%d", enable_idle);
+		if (enable_idle)
+			elv_mark_ioq_idle_window(ioq);
+		else
+			elv_clear_ioq_idle_window(ioq);
+	}
+}
+
+struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask)
+{
+	struct io_queue *ioq = NULL;
+
+	ioq = kmem_cache_alloc_node(elv_ioq_pool, gfp_mask, q->node);
+	return ioq;
+}
+EXPORT_SYMBOL(elv_alloc_ioq);
+
+void elv_free_ioq(struct io_queue *ioq)
+{
+	kmem_cache_free(elv_ioq_pool, ioq);
+}
+EXPORT_SYMBOL(elv_free_ioq);
+
+int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+		struct io_group *iog, void *sched_queue, int ioprio_class,
+		int ioprio, int is_sync)
+{
+	struct elv_fq_data *efqd = &eq->efqd;
+
+	RB_CLEAR_NODE(&ioq->entity.rb_node);
+	atomic_set(&ioq->ref, 0);
+	ioq->efqd = efqd;
+	elv_ioq_set_ioprio_class(ioq, ioprio_class);
+	elv_ioq_set_ioprio(ioq, ioprio);
+	ioq->pid = current->pid;
+	ioq->sched_queue = sched_queue;
+	if (is_sync && !elv_ioq_class_idle(ioq))
+		elv_mark_ioq_idle_window(ioq);
+	bfq_init_entity(&ioq->entity, iog);
+	ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
+	if (is_sync)
+		ioq->last_end_request = jiffies;
+
+	return 0;
+}
+EXPORT_SYMBOL(elv_init_ioq);
+
+void elv_put_ioq(struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = ioq->efqd;
+	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
+						efqd);
+
+	BUG_ON(atomic_read(&ioq->ref) <= 0);
+	if (!atomic_dec_and_test(&ioq->ref))
+		return;
+	BUG_ON(ioq->nr_queued);
+	BUG_ON(ioq->entity.tree != NULL);
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(efqd->active_queue == ioq);
+
+	/* Can be called by outgoing elevator. Don't use q */
+	BUG_ON(!e->ops->elevator_free_sched_queue_fn);
+
+	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
+	elv_log_ioq(efqd, ioq, "put_queue");
+	elv_free_ioq(ioq);
+}
+EXPORT_SYMBOL(elv_put_ioq);
+
+void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+{
+	struct io_queue *ioq = *ioq_ptr;
+
+	if (ioq != NULL) {
+		/* Drop the reference taken by the io group */
+		elv_put_ioq(ioq);
+		*ioq_ptr = NULL;
+	}
+}
+
+/*
+ * Normally next io queue to be served is selected from the service tree.
+ * This function allows one to choose a specific io queue to run next
+ * out of order. This is primarily to accommodate the close_cooperator
+ * feature of cfq.
+ *
+ * Currently this is done only at the root level. To begin with, the
+ * close cooperator feature is supported only for the root group, to make
+ * sure the default cfq behavior in a flat hierarchy is not changed.
+ */
+static void elv_set_next_ioq(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	struct io_sched_data *sd = &efqd->root_group->sched_data;
+	struct io_service_tree *st = io_entity_service_tree(entity);
+
+	BUG_ON(efqd->active_queue != NULL || sd->active_entity != NULL);
+	BUG_ON(!efqd->busy_queues);
+	BUG_ON(sd != entity->sched_data);
+	BUG_ON(!st);
+
+	bfq_update_vtime(st);
+	bfq_active_remove(st, entity);
+	sd->active_entity = entity;
+	entity->service = 0;
+	elv_log_ioq(efqd, ioq, "set_next_ioq");
+}
+
+/* Get next queue for service. */
+static struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = NULL;
+	struct io_queue *ioq = NULL;
+	struct io_sched_data *sd;
+
+	/*
+	 * We should not call lookup when an entity is active, as doing
+	 * lookup can result in an erroneous vtime jump.
+	 */
+	BUG_ON(efqd->active_queue != NULL);
+
+	if (!efqd->busy_queues)
+		return NULL;
+
+	sd = &efqd->root_group->sched_data;
+	entity = bfq_lookup_next_entity(sd, 1);
+
+	BUG_ON(!entity);
+	if (extract)
+		entity->service = 0;
+	ioq = io_entity_to_ioq(entity);
+
+	return ioq;
+}
+
+/*
+ * coop (cooperating queue) tells that io scheduler selected a queue for us
+ * and we did not select the next queue based on fairness.
+ */
+static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int coop)
+{
+	struct request_queue *q = efqd->queue;
+
+	if (ioq) {
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
+							efqd->busy_queues);
+		ioq->slice_end = 0;
+
+		elv_clear_ioq_wait_request(ioq);
+		elv_clear_ioq_must_dispatch(ioq);
+		elv_mark_ioq_slice_new(ioq);
+
+		del_timer(&efqd->idle_slice_timer);
+	}
+
+	efqd->active_queue = ioq;
+
+	/* Let iosched know if it wants to take some action */
+	if (ioq) {
+		if (q->elevator->ops->elevator_active_ioq_set_fn)
+			q->elevator->ops->elevator_active_ioq_set_fn(q,
+							ioq->sched_queue, coop);
+	}
+}
+
+/* Get and set a new active queue for service. */
+static struct io_queue *elv_set_active_ioq(struct request_queue *q,
+						struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	int coop = 0;
+
+	if (!ioq)
+		ioq = elv_get_next_ioq(q, 1);
+	else {
+		elv_set_next_ioq(q, ioq);
+		/*
+		 * io scheduler selected the next queue for us. Pass
+		 * this info back to io scheduler. cfq currently uses it
+		 * to reset coop flag on the queue.
+		 */
+		coop = 1;
+	}
+	__elv_set_active_ioq(efqd, ioq, coop);
+	return ioq;
+}
+
+static void elv_reset_active_ioq(struct elv_fq_data *efqd)
+{
+	struct request_queue *q = efqd->queue;
+	struct io_queue *ioq = elv_active_ioq(efqd->queue->elevator);
+
+	if (q->elevator->ops->elevator_active_ioq_reset_fn)
+		q->elevator->ops->elevator_active_ioq_reset_fn(q,
+							ioq->sched_queue);
+	efqd->active_queue = NULL;
+	del_timer(&efqd->idle_slice_timer);
+}
+
+static void elv_activate_ioq(struct io_queue *ioq, int add_front)
+{
+	bfq_activate_entity(&ioq->entity, add_front);
+}
+
+static void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
+					int requeue)
+{
+	bfq_deactivate_entity(&ioq->entity, requeue);
+}
+
+/* Called when an inactive queue receives a new request. */
+static void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
+{
+	BUG_ON(elv_ioq_busy(ioq));
+	BUG_ON(ioq == efqd->active_queue);
+	elv_log_ioq(efqd, ioq, "add to busy");
+	elv_activate_ioq(ioq, 0);
+	elv_mark_ioq_busy(ioq);
+	efqd->busy_queues++;
+	if (elv_ioq_class_rt(ioq)) {
+		struct io_group *iog = ioq_to_io_group(ioq);
+		iog->busy_rt_queues++;
+	}
+}
+
+static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
+					int requeue)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	BUG_ON(!elv_ioq_busy(ioq));
+	BUG_ON(ioq->nr_queued);
+	elv_log_ioq(efqd, ioq, "del from busy");
+	elv_clear_ioq_busy(ioq);
+	BUG_ON(efqd->busy_queues == 0);
+	efqd->busy_queues--;
+	if (elv_ioq_class_rt(ioq)) {
+		struct io_group *iog = ioq_to_io_group(ioq);
+		iog->busy_rt_queues--;
+	}
+
+	elv_deactivate_ioq(efqd, ioq, requeue);
+}
+
+/*
+ * Do the accounting. Determine how much service (in terms of time slices)
+ * current queue used and adjust the start, finish time of queue and vtime
+ * of the tree accordingly.
+ *
+ * Determining the service used in terms of time is tricky in certain
+ * situations. Especially when underlying device supports command queuing
+ * and requests from multiple queues can be there at same time, then it
+ * is not clear which queue consumed how much of disk time.
+ *
+ * To mitigate this problem, cfq starts the time slice of the queue only
+ * after first request from the queue has completed. This does not work
+ * very well if we expire the queue before waiting for the first (or any
+ * further) request from the queue to finish. For seeky queues, we will
+ * expire the queue after dispatching few requests without waiting and
+ * start dispatching from next queue.
+ *
+ * Not sure how to determine the time consumed by queue in such scenarios.
+ * Currently as a crude approximation, we are charging 25% of time slice
+ * for such cases. A better mechanism is needed for accurate accounting.
+ */
+void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	long slice_unused = 0, slice_used = 0, slice_overshoot = 0;
+
+	assert_spin_locked(q->queue_lock);
+	elv_log_ioq(efqd, ioq, "slice expired");
+
+	if (elv_ioq_wait_request(ioq))
+		del_timer(&efqd->idle_slice_timer);
+
+	elv_clear_ioq_wait_request(ioq);
+
+	/*
+	 * if ioq->slice_end = 0, that means a queue was expired before first
+	 * request from the queue got completed. Of course we are not planning
+	 * to idle on the queue otherwise we would not have expired it.
+	 *
+	 * Charge 25% of the slice in such cases. This is not the best thing
+	 * to do but at the same time not very sure what's the next best
+	 * thing to do.
+	 *
+	 * This arises from the fact that we don't have the notion of
+	 * one queue being operational at one time. io scheduler can dispatch
+	 * requests from multiple queues in one dispatch round. Ideally for
+	 * more accurate accounting of exact disk time used by a queue, one
+	 * should dispatch requests from only one queue and wait for all
+	 * the requests to finish. But this will reduce throughput.
+	 */
+	if (!ioq->slice_end)
+		slice_used = entity->budget/4;
+	else {
+		if (time_after(ioq->slice_end, jiffies)) {
+			slice_unused = ioq->slice_end - jiffies;
+			if (slice_unused == entity->budget) {
+				/*
+				 * queue got expired immediately after
+				 * completing first request. Charge 25% of
+				 * slice.
+				 */
+				slice_used = entity->budget/4;
+			} else
+				slice_used = entity->budget - slice_unused;
+		} else {
+			slice_overshoot = jiffies - ioq->slice_end;
+			slice_used = entity->budget + slice_overshoot;
+		}
+	}
+
+	elv_log_ioq(efqd, ioq, "sl_end=%lx, jiffies=%lx", ioq->slice_end,
+			jiffies);
+	elv_log_ioq(efqd, ioq, "sl_used=%ld, budget=%ld overshoot=%ld",
+				slice_used, entity->budget, slice_overshoot);
+	elv_ioq_served(ioq, slice_used);
+
+	BUG_ON(ioq != efqd->active_queue);
+	elv_reset_active_ioq(efqd);
+
+	if (!ioq->nr_queued)
+		elv_del_ioq_busy(q->elevator, ioq, 1);
+	else
+		elv_activate_ioq(ioq, 0);
+}
+EXPORT_SYMBOL(__elv_ioq_slice_expired);
+
+/*
+ *  Expire the ioq.
+ */
+void elv_ioq_slice_expired(struct request_queue *q)
+{
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+
+	if (ioq)
+		__elv_ioq_slice_expired(q, ioq);
+}
+
+/*
+ * Check if new_ioq should preempt the currently active queue. Return 0 for
+ * no or if we aren't sure; a 1 will cause a preemption attempt.
+ */
+static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
+			struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elevator_queue *eq = q->elevator;
+	struct io_entity *entity, *new_entity;
+
+	ioq = elv_active_ioq(eq);
+
+	if (!ioq)
+		return 0;
+
+	entity = &ioq->entity;
+	new_entity = &new_ioq->entity;
+
+	/*
+	 * Allow an RT request to pre-empt an ongoing non-RT ioq timeslice.
+	 */
+
+	if (new_entity->ioprio_class == IOPRIO_CLASS_RT
+	    && entity->ioprio_class != IOPRIO_CLASS_RT)
+		return 1;
+	/*
+	 * Allow a BE request to pre-empt an ongoing IDLE class timeslice.
+	 */
+
+	if (new_entity->ioprio_class == IOPRIO_CLASS_BE
+	    && entity->ioprio_class == IOPRIO_CLASS_IDLE)
+		return 1;
+
+	/*
+	 * Check with io scheduler if it has additional criterion based on
+	 * which it wants to preempt existing queue.
+	 */
+	if (eq->ops->elevator_should_preempt_fn)
+		return eq->ops->elevator_should_preempt_fn(q,
+						ioq_sched_queue(new_ioq), rq);
+
+	return 0;
+}
+
+static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
+{
+	elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
+	elv_ioq_slice_expired(q);
+
+	/*
+	 * Put the new queue at the front of the current list,
+	 * so we know that it will be selected next.
+	 */
+
+	elv_activate_ioq(ioq, 1);
+	ioq->slice_end = 0;
+	elv_mark_ioq_slice_new(ioq);
+}
+
+void elv_ioq_request_add(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	BUG_ON(!efqd);
+	BUG_ON(!ioq);
+	efqd->rq_queued++;
+	ioq->nr_queued++;
+
+	if (!elv_ioq_busy(ioq))
+		elv_add_ioq_busy(efqd, ioq);
+
+	elv_ioq_update_io_thinktime(ioq);
+	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+
+	if (ioq == elv_active_ioq(q->elevator)) {
+		/*
+		 * Remember that we saw a request from this process, but
+		 * don't start queuing just yet. Otherwise we risk seeing lots
+		 * of tiny requests, because we disrupt the normal plugging
+		 * and merging. If the request is already larger than a single
+		 * page, let it rip immediately. For that case we assume that
+		 * merging is already done. Ditto for a busy system that
+		 * has other work pending, don't risk delaying until the
+		 * idle timer unplug to continue working.
+		 */
+		if (elv_ioq_wait_request(ioq)) {
+			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
+			    efqd->busy_queues > 1) {
+				del_timer(&efqd->idle_slice_timer);
+				__blk_run_queue(q);
+			}
+			elv_mark_ioq_must_dispatch(ioq);
+		}
+	} else if (elv_should_preempt(q, ioq, rq)) {
+		/*
+		 * not the active queue - expire current slice if it is
+		 * idle and has expired its mean thinktime or this new queue
+		 * has some old slice time left and is of higher priority or
+		 * this new queue is RT and the current one is BE
+		 */
+		elv_preempt_queue(q, ioq);
+		__blk_run_queue(q);
+	}
+}
+
+static void elv_idle_slice_timer(unsigned long data)
+{
+	struct elv_fq_data *efqd = (struct elv_fq_data *)data;
+	struct io_queue *ioq;
+	unsigned long flags;
+	struct request_queue *q = efqd->queue;
+
+	elv_log(efqd, "idle timer fired");
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	ioq = efqd->active_queue;
+
+	if (ioq) {
+
+		/*
+		 * We saw a request before the queue expired, let it through
+		 */
+		if (elv_ioq_must_dispatch(ioq))
+			goto out_kick;
+
+		/*
+		 * expired
+		 */
+		if (elv_ioq_slice_used(ioq))
+			goto expire;
+
+		/*
+		 * only expire and reinvoke request handler, if there are
+		 * other queues with pending requests
+		 */
+		if (!elv_nr_busy_ioq(q->elevator))
+			goto out_cont;
+
+		/*
+		 * not expired and it has a request pending, let it dispatch
+		 */
+		if (ioq->nr_queued)
+			goto out_kick;
+	}
+expire:
+	elv_ioq_slice_expired(q);
+out_kick:
+	elv_schedule_dispatch(q);
+out_cont:
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void elv_ioq_arm_slice_timer(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	unsigned long sl;
+
+	BUG_ON(!ioq);
+
+	/*
+	 * SSD device without seek penalty, disable idling. But only do so
+	 * for devices that support queuing, otherwise we still have a problem
+	 * with sync vs async workloads.
+	 */
+	if (blk_queue_nonrot(q) && efqd->hw_tag)
+		return;
+
+	/*
+	 * still requests with the driver, don't idle
+	 */
+	if (efqd->rq_in_driver)
+		return;
+
+	/*
+	 * idle is disabled, either manually or by past process history
+	 */
+	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+		return;
+
+	/*
+	 * The iosched may have its own idling logic. In that case the io
+	 * scheduler will take care of arming the timer, if need be.
+	 */
+	if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+		q->elevator->ops->elevator_arm_slice_timer_fn(q,
+						ioq->sched_queue);
+	} else {
+		elv_mark_ioq_wait_request(ioq);
+		sl = efqd->elv_slice_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log_ioq(efqd, ioq, "arm idle: %lu", sl);
+	}
+}
+
+/*
+ * If io scheduler has functionality of keeping track of close cooperator, check
+ * with it if it has got a closely co-operating queue.
+ */
+static inline struct io_queue *elv_close_cooperator(struct request_queue *q,
+					struct io_queue *ioq, int probe)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *new_ioq = NULL;
+
+	/*
+	 * Currently this feature is supported only for flat hierarchy or
+	 * root group queues so that default cfq behavior is not changed.
+	 */
+	if (!is_root_group_ioq(q, ioq))
+		return NULL;
+
+	if (q->elevator->ops->elevator_close_cooperator_fn)
+		new_ioq = e->ops->elevator_close_cooperator_fn(q,
+						ioq->sched_queue, probe);
+
+	/* Only select co-operating queue if it belongs to root group */
+	if (new_ioq && !is_root_group_ioq(q, new_ioq))
+		return NULL;
+
+	return new_ioq;
+}
+
+/* Common layer function to select the next queue to dispatch from */
+void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
+	struct io_group *iog;
+
+	if (!elv_nr_busy_ioq(q->elevator))
+		return NULL;
+
+	if (ioq == NULL)
+		goto new_queue;
+
+	/*
+	 * Force dispatch. Continue to dispatch from current queue as long
+	 * as it has requests.
+	 */
+	if (unlikely(force)) {
+		if (ioq->nr_queued)
+			goto keep_queue;
+		else
+			goto expire;
+	}
+
+	/*
+	 * The active queue has run out of time, expire it and select new.
+	 */
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
+		goto expire;
+
+	/*
+	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
+	 * cfqq.
+	 */
+	iog = ioq_to_io_group(ioq);
+
+	if (!elv_ioq_class_rt(ioq) && iog->busy_rt_queues) {
+		/*
+		 * We simulate this as cfqq timed out so that it gets to bank
+		 * the remaining of its time slice.
+		 */
+		elv_log_ioq(efqd, ioq, "preempt");
+		goto expire;
+	}
+
+	/*
+	 * The active queue has requests and isn't expired, allow it to
+	 * dispatch.
+	 */
+
+	if (ioq->nr_queued)
+		goto keep_queue;
+
+	/*
+	 * If another queue has a request waiting within our mean seek
+	 * distance, let it run.  The expire code will check for close
+	 * cooperators and put the close queue at the front of the service
+	 * tree.
+	 */
+	new_ioq = elv_close_cooperator(q, ioq, 0);
+	if (new_ioq)
+		goto expire;
+
+	/*
+	 * No requests pending. If the active queue still has requests in
+	 * flight or is idling for a new request, allow either of these
+	 * conditions to happen (or time out) before selecting a new queue.
+	 */
+
+	if (timer_pending(&efqd->idle_slice_timer) ||
+	    (elv_ioq_nr_dispatched(ioq) && elv_ioq_idle_window(ioq))) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
+expire:
+	elv_ioq_slice_expired(q);
+new_queue:
+	ioq = elv_set_active_ioq(q, new_ioq);
+keep_queue:
+	return ioq;
+}
+
+/* A request got removed from io_queue. Do the accounting */
+void elv_ioq_request_removed(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	ioq = rq->ioq;
+	BUG_ON(!ioq);
+	ioq->nr_queued--;
+
+	efqd = ioq->efqd;
+	BUG_ON(!efqd);
+	efqd->rq_queued--;
+
+	if (elv_ioq_busy(ioq) && (elv_active_ioq(e) != ioq) && !ioq->nr_queued)
+		elv_del_ioq_busy(e, ioq, 1);
+}
+
+/* A request got dispatched. Do the accounting. */
+void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	BUG_ON(!ioq);
+	elv_ioq_request_dispatched(ioq);
+	elv_ioq_request_removed(e, rq);
+	elv_clear_ioq_must_dispatch(ioq);
+}
+
+void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	efqd->rq_in_driver++;
+	elv_log_ioq(efqd, rq->ioq, "activate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	WARN_ON(!efqd->rq_in_driver);
+	efqd->rq_in_driver--;
+	elv_log_ioq(efqd, rq->ioq, "deactivate rq, drv=%d",
+						efqd->rq_in_driver);
+}
+
+/*
+ * Update hw_tag based on peak queue depth over 50 samples under
+ * sufficient load.
+ */
+static void elv_update_hw_tag(struct elv_fq_data *efqd)
+{
+	if (efqd->rq_in_driver > efqd->rq_in_driver_peak)
+		efqd->rq_in_driver_peak = efqd->rq_in_driver;
+
+	if (efqd->rq_queued <= ELV_HW_QUEUE_MIN &&
+	    efqd->rq_in_driver <= ELV_HW_QUEUE_MIN)
+		return;
+
+	if (efqd->hw_tag_samples++ < 50)
+		return;
+
+	if (efqd->rq_in_driver_peak >= ELV_HW_QUEUE_MIN)
+		efqd->hw_tag = 1;
+	else
+		efqd->hw_tag = 0;
+
+	efqd->hw_tag_samples = 0;
+	efqd->rq_in_driver_peak = 0;
+}
+
+/* A request got completed from io_queue. Do the accounting. */
+void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
+{
+	const int sync = rq_is_sync(rq);
+	struct io_queue *ioq;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	ioq = rq->ioq;
+
+	elv_log_ioq(efqd, ioq, "complete");
+
+	elv_update_hw_tag(efqd);
+
+	WARN_ON(!efqd->rq_in_driver);
+	WARN_ON(!ioq->dispatched);
+	efqd->rq_in_driver--;
+	ioq->dispatched--;
+
+	if (sync)
+		ioq->last_end_request = jiffies;
+
+	/*
+	 * If this is the active queue, check if it needs to be expired,
+	 * or if we want to idle in case it has no pending requests.
+	 */
+
+	if (elv_active_ioq(q->elevator) == ioq) {
+		if (elv_ioq_slice_new(ioq)) {
+			elv_ioq_set_prio_slice(q, ioq);
+			elv_clear_ioq_slice_new(ioq);
+		}
+		/*
+		 * If there are no requests waiting in this queue, and
+		 * there are other queues ready to issue requests, AND
+		 * those other queues are issuing requests within our
+		 * mean seek distance, give them a chance to run instead
+		 * of idling.
+		 */
+		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
+			elv_ioq_slice_expired(q);
+		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+			 && sync && !rq_noidle(rq))
+			elv_ioq_arm_slice_timer(q);
+	}
+
+	if (!efqd->rq_in_driver)
+		elv_schedule_dispatch(q);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	/* In flat mode, there is only root group */
+	return efqd->root_group;
+}
+EXPORT_SYMBOL(io_get_io_group);
+
+void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio)
+{
+	struct io_queue *ioq = NULL;
+
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		ioq = iog->async_queue[0][ioprio];
+		break;
+	case IOPRIO_CLASS_BE:
+		ioq = iog->async_queue[1][ioprio];
+		break;
+	case IOPRIO_CLASS_IDLE:
+		ioq = iog->async_idle_queue;
+		break;
+	default:
+		BUG();
+	}
+
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+EXPORT_SYMBOL(io_group_async_queue_prio);
+
+void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq)
+{
+	switch (ioprio_class) {
+	case IOPRIO_CLASS_RT:
+		iog->async_queue[0][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_BE:
+		iog->async_queue[1][ioprio] = ioq;
+		break;
+	case IOPRIO_CLASS_IDLE:
+		iog->async_idle_queue = ioq;
+		break;
+	default:
+		BUG();
+	}
+
+	/*
+	 * Take the group reference and pin the queue. Group exit will
+	 * clean it up
+	 */
+	elv_get_ioq(ioq);
+}
+EXPORT_SYMBOL(io_group_set_async_queue);
+
+/*
+ * Release all the io group references to its async queues.
+ */
+static void
+io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	struct io_service_tree *st;
+	int i;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
+
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+static void elv_slab_kill(void)
+{
+	/*
+	 * Caller already ensured that pending RCU callbacks are completed,
+	 * so we should have no busy allocations at this point.
+	 */
+	if (elv_ioq_pool)
+		kmem_cache_destroy(elv_ioq_pool);
+}
+
+static int __init elv_slab_setup(void)
+{
+	elv_ioq_pool = KMEM_CACHE(io_queue, 0);
+	if (!elv_ioq_pool)
+		goto fail;
+
+	return 0;
+fail:
+	elv_slab_kill();
+	return -ENOMEM;
+}
+
+/* Initialize fair queueing data associated with elevator */
+int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
+{
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	iog = io_alloc_root_group(q, e, efqd);
+	if (iog == NULL)
+		return 1;
+
+	efqd->root_group = iog;
+	efqd->queue = q;
+
+	init_timer(&efqd->idle_slice_timer);
+	efqd->idle_slice_timer.function = elv_idle_slice_timer;
+	efqd->idle_slice_timer.data = (unsigned long) efqd;
+
+	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+
+	efqd->elv_slice[0] = elv_slice_async;
+	efqd->elv_slice[1] = elv_slice_sync;
+	efqd->elv_slice_idle = elv_slice_idle;
+	efqd->hw_tag = 1;
+
+	return 0;
+}
+
+/*
+ * elv_exit_fq_data is called before we call elevator_exit_fn. Before
+ * we ask elevator to cleanup its queues, we do the cleanup here so
+ * that all the group and idle tree references to ioq are dropped. Later
+ * during elevator cleanup, ioc reference will be dropped which will lead
+ * to removal of ioscheduler queue as well as associated ioq object.
+ */
+void elv_exit_fq_data(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+	io_free_root_group(e);
+}
+
+/*
+ * This is called after the io scheduler has cleaned up its data structures.
+ * I don't think that this function is required. Right now just keeping it
+ * because cfq cleans up timer and work queue again after freeing up
+ * io contexts. To me io scheduler has already been drained out, and all
+ * the active queues have already been expired, so the timer and work queue
+ * should not be activated during the cleanup process.
+ *
+ * Keeping it here for the time being. Will get rid of it later.
+ */
+void elv_exit_fq_data_post(struct elevator_queue *e)
+{
+	struct elv_fq_data *efqd = &e->efqd;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return;
+
+	elv_shutdown_timer_wq(e);
+	BUG_ON(timer_pending(&efqd->idle_slice_timer));
+}
+
+
+static int __init elv_fq_init(void)
+{
+	if (elv_slab_setup())
+		return -ENOMEM;
+
+	/* could be 0 on HZ < 1000 setups */
+
+	if (!elv_slice_async)
+		elv_slice_async = 1;
+
+	if (!elv_slice_idle)
+		elv_slice_idle = 1;
+
+	return 0;
+}
+
+module_init(elv_fq_init);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 4554d7f..a7cbc0f 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -22,6 +22,10 @@
 struct io_entity;
 struct io_queue;
 
+#ifdef CONFIG_ELV_FAIR_QUEUING
+#define ELV_ATTR(name) \
+	__ATTR(name, S_IRUGO|S_IWUSR, elv_##name##_show, elv_##name##_store)
+
 /**
  * struct io_service_tree - per ioprio_class service tree.
  * @active: tree for active entities (i.e., those backlogged).
@@ -149,15 +153,125 @@ struct io_entity {
 struct io_queue {
 	struct io_entity entity;
 	atomic_t ref;
+	unsigned int flags;
 
 	/* Pointer to generic elevator fair queuing data structure */
 	struct elv_fq_data *efqd;
+	pid_t pid;
+
+	/* Number of requests queued on this io queue */
+	unsigned long nr_queued;
+
+	/* Requests dispatched from this queue */
+	int dispatched;
+
+	/* Keep a track of think time of processes in this queue */
+	unsigned long last_end_request;
+	unsigned long ttime_total;
+	unsigned long ttime_samples;
+	unsigned long ttime_mean;
+
+	unsigned long slice_end;
+
+	/* Pointer to io scheduler's queue */
+	void *sched_queue;
 };
 
 struct io_group {
 	struct io_sched_data sched_data;
+	/*
+	 * async queue for each priority case for RT and BE class.
+	 * Used only for cfq.
+	 */
+
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+
+	/*
+	 * Used to track any pending rt requests so we can pre-empt current
+	 * non-RT cfqq in service when this value is non-zero.
+	 */
+	unsigned int busy_rt_queues;
+};
+
+struct elv_fq_data {
+	struct io_group *root_group;
+
+	struct request_queue *queue;
+	unsigned int busy_queues;
+
+	/* Number of requests queued */
+	int rq_queued;
+
+	/* Pointer to the ioscheduler queue being served */
+	void *active_queue;
+
+	/*
+	 * queue-depth detection
+	 */
+	int rq_in_driver;
+	int hw_tag;
+	int hw_tag_samples;
+	int rq_in_driver_peak;
+
+	/*
+	 * elevator fair queuing layer has the capability to provide idling
+	 * for ensuring fairness for processes doing dependent reads.
+	 * This might be needed to ensure fairness among two processes doing
+	 * synchronous reads in two different cgroups. noop and deadline don't
+	 * have any notion of anticipation/idling. As of now, these are the
+	 * users of this functionality.
+	 */
+	unsigned int elv_slice_idle;
+	struct timer_list idle_slice_timer;
+	struct work_struct unplug_work;
+
+	/* Base slice length for sync and async queues */
+	unsigned int elv_slice[2];
 };
 
+/* Logging facilities. */
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
+				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
+
+#define elv_log(efqd, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#define ioq_sample_valid(samples)   ((samples) > 80)
+
+/* Some shared queue flag manipulation functions among elevators */
+
+enum elv_queue_state_flags {
+	ELV_QUEUE_FLAG_busy = 0,          /* has requests or is under service */
+	ELV_QUEUE_FLAG_sync,              /* synchronous queue */
+	ELV_QUEUE_FLAG_idle_window,	  /* elevator slice idling enabled */
+	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
+	ELV_QUEUE_FLAG_must_dispatch,	  /* must be allowed a dispatch */
+	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
+};
+
+#define ELV_IO_QUEUE_FLAG_FNS(name)					\
+static inline void elv_mark_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags |= (1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline void elv_clear_ioq_##name(struct io_queue *ioq)		\
+{                                                                       \
+	(ioq)->flags &= ~(1 << ELV_QUEUE_FLAG_##name);			\
+}                                                                       \
+static inline int elv_ioq_##name(struct io_queue *ioq)         		\
+{                                                                       \
+	return ((ioq)->flags & (1 << ELV_QUEUE_FLAG_##name)) != 0;	\
+}
+
+ELV_IO_QUEUE_FLAG_FNS(busy)
+ELV_IO_QUEUE_FLAG_FNS(sync)
+ELV_IO_QUEUE_FLAG_FNS(wait_request)
+ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
+ELV_IO_QUEUE_FLAG_FNS(idle_window)
+ELV_IO_QUEUE_FLAG_FNS(slice_new)
+
 static inline struct io_service_tree *
 io_entity_service_tree(struct io_entity *entity)
 {
@@ -169,4 +283,194 @@ io_entity_service_tree(struct io_entity *entity)
 
 	return sched_data->service_tree + idx;
 }
+
+/* A request got dispatched from the io_queue. Do the accounting. */
+static inline void elv_ioq_request_dispatched(struct io_queue *ioq)
+{
+	ioq->dispatched++;
+}
+
+static inline int elv_ioq_slice_used(struct io_queue *ioq)
+{
+	if (elv_ioq_slice_new(ioq))
+		return 0;
+	if (time_before(jiffies, ioq->slice_end))
+		return 0;
+
+	return 1;
+}
+
+/* How many requests are currently dispatched from the queue */
+static inline int elv_ioq_nr_dispatched(struct io_queue *ioq)
+{
+	return ioq->dispatched;
+}
+
+/* How many requests are currently queued in the queue */
+static inline int elv_ioq_nr_queued(struct io_queue *ioq)
+{
+	return ioq->nr_queued;
+}
+
+static inline void elv_get_ioq(struct io_queue *ioq)
+{
+	atomic_inc(&ioq->ref);
+}
+
+static inline int elv_ioq_class_idle(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_IDLE;
+}
+
+static inline int elv_ioq_class_rt(struct io_queue *ioq)
+{
+	return ioq->entity.ioprio_class == IOPRIO_CLASS_RT;
+}
+
+static inline int elv_ioq_ioprio_class(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio_class;
+}
+
+static inline int elv_ioq_ioprio(struct io_queue *ioq)
+{
+	return ioq->entity.new_ioprio;
+}
+
+static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
+						int ioprio_class)
+{
+	ioq->entity.new_ioprio_class = ioprio_class;
+	ioq->entity.ioprio_changed = 1;
+}
+
+/**
+ * bfq_ioprio_to_weight - calc a weight from an ioprio.
+ * @ioprio: the ioprio value to convert.
+ */
+static inline unsigned int bfq_ioprio_to_weight(int ioprio)
+{
+	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
+	return IOPRIO_BE_NR - ioprio;
+}
+
+static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
+{
+	ioq->entity.new_ioprio = ioprio;
+	ioq->entity.new_weight = bfq_ioprio_to_weight(ioprio);
+	ioq->entity.ioprio_changed = 1;
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq)
+{
+	if (ioq)
+		return ioq->sched_queue;
+	return NULL;
+}
+
+static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
+{
+	return container_of(ioq->entity.sched_data, struct io_group,
+						sched_data);
+}
+
+extern ssize_t elv_slice_idle_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_idle_store(struct elevator_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_sync_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
+						size_t count);
+extern ssize_t elv_slice_async_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
+						size_t count);
+
+/* Functions used by elevator.c */
+extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
+extern void elv_exit_fq_data(struct elevator_queue *e);
+extern void elv_exit_fq_data_post(struct elevator_queue *e);
+
+extern void elv_ioq_request_add(struct request_queue *q, struct request *rq);
+extern void elv_ioq_request_removed(struct elevator_queue *e,
+					struct request *rq);
+extern void elv_fq_dispatched_request(struct elevator_queue *e,
+					struct request *rq);
+
+extern void elv_fq_activate_rq(struct request_queue *q, struct request *rq);
+extern void elv_fq_deactivate_rq(struct request_queue *q, struct request *rq);
+
+extern void elv_ioq_completed_request(struct request_queue *q,
+				struct request *rq);
+
+extern void *elv_fq_select_ioq(struct request_queue *q, int force);
+
+/* Functions used by io schedulers */
+extern void elv_put_ioq(struct io_queue *ioq);
+extern void __elv_ioq_slice_expired(struct request_queue *q,
+					struct io_queue *ioq);
+extern int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
+		struct io_group *iog, void *sched_queue, int ioprio_class,
+		int ioprio, int is_sync);
+extern void elv_schedule_dispatch(struct request_queue *q);
+extern int elv_hw_tag(struct elevator_queue *e);
+extern void *elv_active_sched_queue(struct elevator_queue *e);
+extern int elv_mod_idle_slice_timer(struct elevator_queue *eq,
+					unsigned long expires);
+extern int elv_del_idle_slice_timer(struct elevator_queue *eq);
+extern unsigned int elv_get_slice_idle(struct elevator_queue *eq);
+extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
+					int ioprio);
+extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
+					int ioprio, struct io_queue *ioq);
+extern struct io_group *io_get_io_group(struct request_queue *q);
+extern int elv_nr_busy_ioq(struct elevator_queue *e);
+extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
+extern void elv_free_ioq(struct io_queue *ioq);
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_init_fq_data(struct request_queue *q,
+					struct elevator_queue *e)
+{
+	return 0;
+}
+
+static inline void elv_exit_fq_data(struct elevator_queue *e) {}
+static inline void elv_exit_fq_data_post(struct elevator_queue *e) {}
+
+static inline void elv_fq_activate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_deactivate_rq(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_fq_dispatched_request(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_removed(struct elevator_queue *e,
+						struct request *rq)
+{
+}
+
+static inline void elv_ioq_request_add(struct request_queue *q,
+					struct request *rq)
+{
+}
+
+static inline void elv_ioq_completed_request(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline void *ioq_sched_queue(struct io_queue *ioq) { return NULL; }
+static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
+{
+	return NULL;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index ca86192..357f529 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -226,6 +226,9 @@ static struct elevator_queue *elevator_alloc(struct request_queue *q,
 	for (i = 0; i < ELV_HASH_ENTRIES; i++)
 		INIT_HLIST_HEAD(&eq->hash[i]);
 
+	if (elv_init_fq_data(q, eq))
+		goto err;
+
 	return eq;
 err:
 	kfree(eq);
@@ -296,9 +299,11 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
 	e->ops = NULL;
+	elv_exit_fq_data_post(e);
 	mutex_unlock(&e->sysfs_lock);
 
 	kobject_put(&e->kobj);
@@ -425,6 +430,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	boundary = q->end_sector;
 	stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
@@ -465,6 +471,7 @@ void elv_dispatch_add_tail(struct request_queue *q, struct request *rq)
 	elv_rqhash_del(q, rq);
 
 	q->nr_sorted--;
+	elv_fq_dispatched_request(q->elevator, rq);
 
 	q->end_sector = rq_end_sector(rq);
 	q->boundary_rq = rq;
@@ -532,6 +539,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	elv_rqhash_del(q, next);
 
 	q->nr_sorted--;
+	elv_ioq_request_removed(e, next);
 	q->last_merge = rq;
 }
 
@@ -638,12 +646,8 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 				q->last_merge = rq;
 		}
 
-		/*
-		 * Some ioscheds (cfq) run q->request_fn directly, so
-		 * rq cannot be accessed after calling
-		 * elevator_add_req_fn.
-		 */
 		q->elevator->ops->elevator_add_req_fn(q, rq);
+		elv_ioq_request_add(q, rq);
 		break;
 
 	case ELEVATOR_INSERT_REQUEUE:
@@ -742,13 +746,12 @@ EXPORT_SYMBOL(elv_add_request);
 
 int elv_queue_empty(struct request_queue *q)
 {
-	struct elevator_queue *e = q->elevator;
-
 	if (!list_empty(&q->queue_head))
 		return 0;
 
-	if (e->ops->elevator_queue_empty_fn)
-		return e->ops->elevator_queue_empty_fn(q);
+	/* Hopefully nr_sorted works and no need to call queue_empty_fn */
+	if (q->nr_sorted)
+		return 0;
 
 	return 1;
 }
@@ -828,8 +831,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 	 */
 	if (blk_account_rq(rq)) {
 		q->in_flight[rq_is_sync(rq)]--;
-		if (blk_sorted_rq(rq) && e->ops->elevator_completed_req_fn)
-			e->ops->elevator_completed_req_fn(q, rq);
+		if (blk_sorted_rq(rq)) {
+			if (e->ops->elevator_completed_req_fn)
+				e->ops->elevator_completed_req_fn(q, rq);
+			elv_ioq_completed_request(q, rq);
+		}
 	}
 
 	/*
@@ -1125,3 +1131,17 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 	return NULL;
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
+
+/* Get the io scheduler queue pointer. For cfq, it is reached via rq->ioq. */
+void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
+{
+	return ioq_sched_queue(req_ioq(rq));
+}
+EXPORT_SYMBOL(elv_get_sched_queue);
+
+/* Select an ioscheduler queue to dispatch request from. */
+void *elv_select_sched_queue(struct request_queue *q, int force)
+{
+	return ioq_sched_queue(elv_fq_select_ioq(q, force));
+}
+EXPORT_SYMBOL(elv_select_sched_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8963d91..551e17d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -234,6 +234,11 @@ struct request {
 
 	/* for bidi */
 	struct request *next_rq;
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* io queue request belongs to */
+	struct io_queue *ioq;
+#endif
 };
 
 static inline unsigned short req_get_ioprio(struct request *req)
@@ -241,6 +246,15 @@ static inline unsigned short req_get_ioprio(struct request *req)
 	return req->ioprio;
 }
 
+static inline struct io_queue *req_ioq(struct request *req)
+{
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	return req->ioq;
+#else
+	return NULL;
+#endif
+}
+
 /*
  * State information carried for REQ_TYPE_PM_SUSPEND and REQ_TYPE_PM_RESUME
  * requests. Some step values could eventually be made generic.
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 1cb3372..81f1ed8 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -2,6 +2,7 @@
 #define _LINUX_ELEVATOR_H
 
 #include <linux/percpu.h>
+#include "../../block/elevator-fq.h"
 
 #ifdef CONFIG_BLOCK
 
@@ -29,6 +30,18 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
+typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
+typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
+typedef int (elevator_should_preempt_fn) (struct request_queue*, void*,
+						struct request*);
+typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
+						struct request*);
+typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
+						void*, int probe);
+#endif
 
 struct elevator_ops
 {
@@ -56,6 +69,17 @@ struct elevator_ops
 	elevator_init_fn *elevator_init_fn;
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
+
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
+	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
+
+	elevator_arm_slice_timer_fn *elevator_arm_slice_timer_fn;
+	elevator_should_preempt_fn *elevator_should_preempt_fn;
+	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
+	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+#endif
 };
 
 #define ELV_NAME_MAX	(16)
@@ -76,6 +100,9 @@ struct elevator_type
 	struct elv_fs_entry *elevator_attrs;
 	char elevator_name[ELV_NAME_MAX];
 	struct module *elevator_owner;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	int elevator_features;
+#endif
 };
 
 /*
@@ -89,6 +116,10 @@ struct elevator_queue
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
 	struct hlist_head *hash;
+#ifdef CONFIG_ELV_FAIR_QUEUING
+	/* fair queuing data */
+	struct elv_fq_data efqd;
+#endif
 };
 
 /*
@@ -207,5 +238,25 @@ enum {
 	__val;							\
 })
 
+/* An iosched can let the elevator know its feature set/capabilities */
+#ifdef CONFIG_ELV_FAIR_QUEUING
+
+/* iosched wants to use fair queuing logic of elevator layer */
+#define	ELV_IOSCHED_NEED_FQ	1
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
+}
+
+#else /* CONFIG_ELV_FAIR_QUEUING */
+
+static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
+{
+	return 0;
+}
+#endif /* CONFIG_ELV_FAIR_QUEUING */
+extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
+extern void *elv_select_sched_queue(struct request_queue *q, int force);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6
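
A note on how the feature bit and the optional hooks added above might fit
together. This is only a userspace sketch under simplifying assumptions: the
structures are trimmed-down stand-ins, the hook is placed directly in
elevator_type (the patch keeps hooks in elevator_ops), and the names
fq_enabled(), fq_should_preempt() and always_preempt() are hypothetical. Only
ELV_IOSCHED_NEED_FQ and elevator_features come from the patch.

```c
#include <stddef.h>

#define ELV_IOSCHED_NEED_FQ	1	/* same bit value as in the patch */

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct elevator_type {
	int elevator_features;
	int (*should_preempt)(void *sched_queue);	/* optional hook */
};

struct elevator_queue {
	struct elevator_type *elevator_type;
};

/* Mirrors the check done by elv_iosched_fair_queuing_enabled(). */
static int fq_enabled(const struct elevator_queue *e)
{
	return e->elevator_type->elevator_features & ELV_IOSCHED_NEED_FQ;
}

/* Call the optional hook only if FQ was requested and the hook exists. */
static int fq_should_preempt(const struct elevator_queue *e, void *sched_queue)
{
	const struct elevator_type *t = e->elevator_type;

	if (!fq_enabled(e) || !t->should_preempt)
		return 0;
	return t->should_preempt(sched_queue);
}

/* Example hook: a scheduler that always asks for preemption. */
static int always_preempt(void *sched_queue)
{
	(void)sched_queue;
	return 1;
}
```

With this shape, a scheduler that does not set ELV_IOSCHED_NEED_FQ never has
its hook consulted, and a scheduler that sets the bit but leaves the hook NULL
falls back to "do not preempt".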

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 05/25] io-controller: Charge for time slice based on average disk rate
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 04/25] io-controller: Common flat fair queuing code in elevaotor layer Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 06/25] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
                     ` (22 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o There are situations where a queue is expired so quickly that the time
  slice it used appears to be zero. For example, an async queue dispatches
  a bunch of requests and is expired before the first request completes.
  Another example is a queue that is expired as soon as its first request
  completes because it has no more requests (sync queues on SSD).

o Currently we just charge 25% of the slice length in such cases. This patch
  tries to improve on that approximation by keeping track of the average
  disk rate and charging for time as nr_sectors/disk_rate.

o This is still experimental; I am not sure whether it gives a measurable
  improvement. Maybe a better scheme is to use something more granular than
  jiffies for time keeping for io queues.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c |   97 +++++++++++++++++++++++++++++++++++++++++++++++----
 block/elevator-fq.h |   11 ++++++
 2 files changed, 101 insertions(+), 7 deletions(-)
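
For readers who want to try the arithmetic outside the kernel, the charging
scheme described in the changelog can be sketched in userspace roughly as
follows. This is a minimal sketch, not the patch itself: struct rate_state,
rate_update() and disk_time_used() are hypothetical names, the sampling-window
bookkeeping is left out, and only the weighted-average formulas and the 25%
fallback are taken from the diff below.

```c
/* Hypothetical userspace stand-in for the rate fields of struct elv_fq_data. */
struct rate_state {
	unsigned long rate_sectors;	/* weighted sectors, scaled by 256 */
	unsigned long rate_time;	/* weighted jiffies, scaled by 256 */
	unsigned long mean_rate;	/* sectors per jiffy */
};

/* Fold one finished sampling window into the running averages. */
static void rate_update(struct rate_state *rs, unsigned long sectors,
			unsigned long elapsed)
{
	if (!elapsed)
		elapsed = 1;	/* window shorter than one jiffy */

	/* Old value keeps 7/8 of its weight; the new sample gets 1/8. */
	rs->rate_sectors = (7 * rs->rate_sectors + 256 * sectors) / 8;
	rs->rate_time = (7 * rs->rate_time + 256 * elapsed) / 8;

	/* Adding rate_time / 2 rounds the quotient to nearest. */
	rs->mean_rate = (rs->rate_sectors + rs->rate_time / 2) / rs->rate_time;
}

/* Jiffies to charge a queue that dispatched nr_sectors. */
static unsigned long disk_time_used(const struct rate_state *rs,
				    unsigned long nr_sectors,
				    unsigned long budget)
{
	unsigned long used;

	if (!rs->mean_rate)
		return budget / 4;	/* no rate data yet: old 25% rule */

	used = nr_sectors / rs->mean_rate;
	return used ? used : 1;		/* never charge zero */
}
```

Starting from a zeroed state, a queue with a budget of 100 jiffies is charged
25. After one window of 2048 sectors completed in 4 jiffies the estimate
becomes 512 sectors per jiffy, so a queue that dispatched 1024 sectors is
charged 2 jiffies, and one that dispatched only 10 sectors is charged the
minimum of 1.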

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 6f23d7e..67c02b9 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -23,6 +23,9 @@ const int elv_slice_async_rq = 2;
 int elv_slice_idle = HZ / 125;
 static struct kmem_cache *elv_ioq_pool;
 
+/* Maximum Window length for updating average disk rate */
+static int elv_rate_sampling_window = HZ / 10;
+
 #define ELV_SLICE_SCALE		(5)
 #define ELV_HW_QUEUE_MIN	(5)
 
@@ -941,6 +944,47 @@ static void elv_ioq_update_io_thinktime(struct io_queue *ioq)
 	ioq->ttime_mean = (ioq->ttime_total + 128) / ioq->ttime_samples;
 }
 
+static void elv_update_io_rate(struct elv_fq_data *efqd, struct request *rq)
+{
+	long elapsed = jiffies - efqd->rate_sampling_start;
+	unsigned long total;
+
+	/* sampling window is off */
+	if (!efqd->rate_sampling_start)
+		return;
+
+	efqd->rate_sectors_current += blk_rq_sectors(rq);
+
+	if (efqd->rq_in_driver && (elapsed < elv_rate_sampling_window))
+		return;
+
+	efqd->rate_sectors = (7*efqd->rate_sectors +
+				256*efqd->rate_sectors_current) / 8;
+
+	if (!elapsed) {
+		/*
+		 * updating rate before a jiffy could complete. Could be a
+		 * problem with fast queuing/non-queuing hardware. Should we
+		 * look at higher resolution time source?
+		 *
+		 * In case of non-queuing hardware we will probably not try to
+		 * dispatch from multiple queues and will be able to account
+		 * for disk time used and will not need this approximation
+		 * anyway?
+		 */
+		elapsed = 1;
+	}
+
+	efqd->rate_time = (7*efqd->rate_time + 256*elapsed) / 8;
+	total = efqd->rate_sectors + (efqd->rate_time/2);
+	efqd->mean_rate = total/efqd->rate_time;
+
+	elv_log(efqd, "mean_rate=%lu, t=%ld s=%lu", efqd->mean_rate,
+			elapsed, efqd->rate_sectors_current);
+	efqd->rate_sampling_start = 0;
+	efqd->rate_sectors_current = 0;
+}
+
 /*
  * Disable idle window if the process thinks too long.
  * This idle flag can also be updated by io scheduler.
@@ -1231,6 +1275,34 @@ static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
 }
 
 /*
+ * Calculate the effective disk time used by the queue based on how many
+ * sectors the queue has dispatched and the average disk rate.
+ * Returns disk time in jiffies.
+ */
+static inline unsigned long elv_disk_time_used(struct request_queue *q,
+					struct io_queue *ioq)
+{
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_entity *entity = &ioq->entity;
+	unsigned long jiffies_used = 0;
+
+	if (!efqd->mean_rate)
+		return entity->budget/4;
+
+	/* Charge the queue based on average disk rate */
+	jiffies_used = ioq->nr_sectors/efqd->mean_rate;
+
+	if (!jiffies_used)
+		jiffies_used = 1;
+
+	elv_log_ioq(efqd, ioq, "disk time=%ums sect=%lu rate=%lu",
+				jiffies_to_msecs(jiffies_used),
+				ioq->nr_sectors, efqd->mean_rate);
+
+	return jiffies_used;
+}
+
+/*
  * Do the accounting. Determine how much service (in terms of time slices)
  * current queue used and adjust the start, finish time of queue and vtime
  * of the tree accordingly.
@@ -1248,8 +1320,10 @@ static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
  * from next queue.
  *
  * Not sure how to determine the time consumed by queue in such scenarios.
- * Currently as a crude approximation, we are charging 25% of time slice
- * for such cases. A better mechanism is needed for accurate accounting.
+ * Currently as a crude approximation, try to keep track of average disk rate
+ * and charge the queue based on number of sectors transferred. If sufficient
+ * disk rate data is not available then we are charging 25% of time slice
+ * for such cases. A better mechanism is needed for accurate accounting.
  */
 void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 {
@@ -1270,9 +1344,9 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	 * reuqest from the queue got completed. Of course we are not planning
 	 * to idle on the queue otherwise we would not have expired it.
 	 *
-	 * Charge for the 25% slice in such cases. This is not the best thing
-	 * to do but at the same time not very sure what's the next best
-	 * thing to do.
+	 * Charge the queue based on average disk rate or the 25% slice if
+	 * mean rate is 0. This is not the best thing to do but at the same
+	 * time not very sure what's the next best thing to do.
 	 *
 	 * This arises from that fact that we don't have the notion of
 	 * one queue being operational at one time. io scheduler can dispatch
@@ -1282,7 +1356,7 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	 * the requests to finish. But this will reduce throughput.
 	 */
 	if (!ioq->slice_end)
-		slice_used = entity->budget/4;
+		slice_used = elv_disk_time_used(q, ioq);
 	else {
 		if (time_after(ioq->slice_end, jiffies)) {
 			slice_unused = ioq->slice_end - jiffies;
@@ -1292,7 +1366,7 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 				 * completing first request. Charge 25% of
 				 * slice.
 				 */
-				slice_used = entity->budget/4;
+				slice_used = elv_disk_time_used(q, ioq);
 			} else
 				slice_used = entity->budget - slice_unused;
 		} else {
@@ -1310,6 +1384,8 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	BUG_ON(ioq != efqd->active_queue);
 	elv_reset_active_ioq(efqd);
 
+	/* Queue is being expired. Reset number of sectors dispatched */
+	ioq->nr_sectors = 0;
 	if (!ioq->nr_queued)
 		elv_del_ioq_busy(q->elevator, ioq, 1);
 	else
@@ -1671,6 +1747,7 @@ void elv_fq_dispatched_request(struct elevator_queue *e, struct request *rq)
 
 	BUG_ON(!ioq);
 	elv_ioq_request_dispatched(ioq);
+	ioq->nr_sectors += blk_rq_sectors(rq);
 	elv_ioq_request_removed(e, rq);
 	elv_clear_ioq_must_dispatch(ioq);
 }
@@ -1683,6 +1760,10 @@ void elv_fq_activate_rq(struct request_queue *q, struct request *rq)
 		return;
 
 	efqd->rq_in_driver++;
+
+	if (!efqd->rate_sampling_start)
+		efqd->rate_sampling_start = jiffies;
+
 	elv_log_ioq(efqd, rq->ioq, "activate rq, drv=%d",
 						efqd->rq_in_driver);
 }
@@ -1746,6 +1827,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 	efqd->rq_in_driver--;
 	ioq->dispatched--;
 
+	elv_update_io_rate(efqd, rq);
+
 	if (sync)
 		ioq->last_end_request = jiffies;
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index a7cbc0f..4b69239 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -165,6 +165,9 @@ struct io_queue {
 	/* Requests dispatched from this queue */
 	int dispatched;
 
+	/* Number of sectors dispatched in current dispatch round */
+	unsigned long nr_sectors;
+
 	/* Keep a track of think time of processes in this queue */
 	unsigned long last_end_request;
 	unsigned long ttime_total;
@@ -228,6 +231,14 @@ struct elv_fq_data {
 
 	/* Base slice length for sync and async queues */
 	unsigned int elv_slice[2];
+
+	/* Fields for keeping track of average disk rate */
+	unsigned long rate_sectors; /* number of sectors finished */
+	unsigned long rate_time;   /* jiffies elapsed */
+	unsigned long mean_rate; /* sectors per jiffy */
+	unsigned long long rate_sampling_start; /* sampling window start (jiffies) */
+	/* number of sectors finished io during current sampling window */
+	unsigned long rate_sectors_current;
 };
 
 /* Logging facilities. */
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 06/25] io-controller: Modify cfq to make use of flat elevator fair queuing
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 05/25] io-controller: Charge for time slice based on average disk rate Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 07/25] io-controller: core bfq scheduler changes for hierarchical setup Vivek Goyal
                     ` (21 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

This patch changes cfq to use the fair queuing code from the elevator layer.

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched     |    3 +-
 block/cfq-iosched.c       | 1105 +++++++++------------------------------------
 include/linux/iocontext.h |    5 -
 3 files changed, 226 insertions(+), 887 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
 menu "IO Schedulers"
 
 config ELV_FAIR_QUEUING
-	bool "Elevator Fair Queuing Support"
+	bool
 	default n
 	---help---
 	  Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
+	select ELV_FAIR_QUEUING
 	default y
 	---help---
 	  The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 833ec18..f852b00 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,7 +12,6 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
-
 /*
  * tunables
  */
@@ -23,15 +22,7 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 static const int cfq_back_max = 16 * 1024;
 /* penalty of a backwards seek */
 static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
-static int cfq_slice_idle = HZ / 125;
-
-/*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY		(HZ / 5)
 
 /*
  * below this threshold, we consider thinktime immediate
@@ -43,7 +34,7 @@ static int cfq_slice_idle = HZ / 125;
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq)	((struct cfq_queue *) ioq_sched_queue((rq)->ioq))
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +44,6 @@ static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
-#define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
 
 #define sample_valid(samples)	((samples) > 80)
 
@@ -75,12 +64,6 @@ struct cfq_rb_root {
  */
 struct cfq_data {
 	struct request_queue *queue;
-
-	/*
-	 * rr list of queues with requests and the count of them
-	 */
-	struct cfq_rb_root service_tree;
-
 	/*
 	 * Each priority tree is sorted by next_request position.  These
 	 * trees are used when determining if two or more queues are
@@ -88,39 +71,10 @@ struct cfq_data {
 	 */
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
-	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
-
-	int rq_in_driver;
 	int sync_flight;
 
-	/*
-	 * queue-depth detection
-	 */
-	int rq_queued;
-	int hw_tag;
-	int hw_tag_samples;
-	int rq_in_driver_peak;
-
-	/*
-	 * idle window management
-	 */
-	struct timer_list idle_slice_timer;
-	struct work_struct unplug_work;
-
-	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 
 	/*
@@ -130,9 +84,7 @@ struct cfq_data {
 	unsigned int cfq_fifo_expire[2];
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
-	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
-	unsigned int cfq_slice_idle;
 
 	struct list_head cic_list;
 };
@@ -141,16 +93,11 @@ struct cfq_data {
  * Per process-grouping structure
  */
 struct cfq_queue {
-	/* reference count */
-	atomic_t ref;
+	struct io_queue *ioq;
 	/* various state flags, see below */
 	unsigned int flags;
 	/* parent cfq_data */
 	struct cfq_data *cfqd;
-	/* service_tree member */
-	struct rb_node rb_node;
-	/* service_tree key */
-	unsigned long rb_key;
 	/* prio tree member */
 	struct rb_node p_node;
 	/* prio tree root we belong to, if any */
@@ -166,33 +113,23 @@ struct cfq_queue {
 	/* fifo list of requests in sort_list */
 	struct list_head fifo;
 
-	unsigned long slice_end;
-	long slice_resid;
 	unsigned int slice_dispatch;
 
 	/* pending metadata requests */
 	int meta_pending;
-	/* number of requests that are on the dispatch list or inside driver */
-	int dispatched;
 
 	/* io prio of this group */
-	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class, org_ioprio_class;
+	unsigned short org_ioprio;
+	unsigned short org_ioprio_class;
 
 	pid_t pid;
 };
 
 enum cfqq_state_flags {
-	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
-	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
-	CFQ_CFQQ_FLAG_must_dispatch,	/* must be allowed a dispatch */
 	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
 	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
 	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
-	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
 	CFQ_CFQQ_FLAG_prio_changed,	/* task priority has changed */
-	CFQ_CFQQ_FLAG_slice_new,	/* no requests dispatched in slice */
-	CFQ_CFQQ_FLAG_sync,		/* synchronous queue */
 	CFQ_CFQQ_FLAG_coop,		/* has done a coop jump of the queue */
 };
 
@@ -210,16 +147,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 	return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0;	\
 }
 
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
 CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
 CFQ_CFQQ_FNS(coop);
 #undef CFQ_CFQQ_FNS
 
@@ -258,66 +189,27 @@ static inline int cfq_bio_sync(struct bio *bio)
 	return 0;
 }
 
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
 {
-	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
-	}
+	return ioq_to_io_group(cfqq->ioq);
 }
 
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	return !cfqd->busy_queues;
+	return elv_ioq_class_idle(cfqq->ioq);
 }
 
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
-				 unsigned short prio)
-{
-	const int base_slice = cfqd->cfq_slice[sync];
-
-	WARN_ON(prio >= IOPRIO_BE_NR);
-
-	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
-}
-
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
-}
-
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
-	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
+	return elv_ioq_sync(cfqq->ioq);
 }
 
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
 {
-	if (cfq_cfqq_slice_new(cfqq))
-		return 0;
-	if (time_before(jiffies, cfqq->slice_end))
-		return 0;
+	struct cfq_data *cfqd = cfqq->cfqd;
+	struct elevator_queue *e = cfqd->queue->elevator;
 
-	return 1;
+	return (elv_active_sched_queue(e) == cfqq);
 }
 
 /*
@@ -416,33 +308,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 }
 
 /*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
-	if (!root->left)
-		root->left = rb_first(&root->rb);
-
-	if (root->left)
-		return rb_entry(root->left, struct cfq_queue, rb_node);
-
-	return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
-	rb_erase(n, root);
-	RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
-	if (root->left == n)
-		root->left = NULL;
-	rb_erase_init(n, &root->rb);
-}
-
-/*
  * would be nice to take fifo expire time into account as well
  */
 static struct request *
@@ -455,10 +320,10 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	BUG_ON(RB_EMPTY_NODE(&last->rb_node));
 
-	if (rbprev)
+	if (rbprev != NULL)
 		prev = rb_entry_rq(rbprev);
 
-	if (rbnext)
+	if (rbnext != NULL)
 		next = rb_entry_rq(rbnext);
 	else {
 		rbnext = rb_first(&cfqq->sort_list);
@@ -469,95 +334,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	return cfq_choose_req(cfqd, next, prev);
 }
 
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
-				      struct cfq_queue *cfqq)
-{
-	/*
-	 * just an approximation, should be ok.
-	 */
-	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
-{
-	struct rb_node **p, *parent;
-	struct cfq_queue *__cfqq;
-	unsigned long rb_key;
-	int left;
-
-	if (cfq_class_idle(cfqq)) {
-		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
-		if (parent && parent != &cfqq->rb_node) {
-			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-			rb_key += __cfqq->rb_key;
-		} else
-			rb_key += jiffies;
-	} else if (!add_front) {
-		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
-		rb_key += cfqq->slice_resid;
-		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
-		/*
-		 * same position, nothing more to do
-		 */
-		if (rb_key == cfqq->rb_key)
-			return;
-
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	}
-
-	left = 1;
-	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
-	while (*p) {
-		struct rb_node **n;
-
-		parent = *p;
-		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
-		/*
-		 * sort RT queues first, we always want to give
-		 * preference to them. IDLE queues goes to the back.
-		 * after that, sort on the next service time.
-		 */
-		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			n = &(*p)->rb_right;
-		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
-			n = &(*p)->rb_right;
-		else if (rb_key < __cfqq->rb_key)
-			n = &(*p)->rb_left;
-		else
-			n = &(*p)->rb_right;
-
-		if (n == &(*p)->rb_right)
-			left = 0;
-
-		p = n;
-	}
-
-	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
-
-	cfqq->rb_key = rb_key;
-	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
 static struct cfq_queue *
 cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
 		     sector_t sector, struct rb_node **ret_parent,
@@ -619,57 +395,34 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 		cfqq->p_root = NULL;
 }
 
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
 {
-	/*
-	 * Resorting requires the cfqq to be on the RR list already.
-	 */
-	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
-		cfq_prio_tree_add(cfqd, cfqq);
-	}
-}
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
-	cfq_mark_cfqq_on_rr(cfqq);
-	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
+	if (cfqd->active_cic) {
+		put_io_context(cfqd->active_cic->ioc);
+		cfqd->active_cic = NULL;
+	}
 
-	cfq_resort_rr_list(cfqd, cfqq);
+	/* Resort the cfqq in prio tree */
+	if (cfqq)
+		cfq_prio_tree_add(cfqd, cfqq);
 }
 
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as the active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
 {
-	cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-	cfq_clear_cfqq_on_rr(cfqq);
+	struct cfq_queue *cfqq = sched_queue;
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	if (cfqq->p_root) {
-		rb_erase(&cfqq->p_node, cfqq->p_root);
-		cfqq->p_root = NULL;
-	}
+	cfqq->slice_dispatch = 0;
 
-	BUG_ON(!cfqd->busy_queues);
-	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
+	cfq_clear_cfqq_must_alloc_slice(cfqq);
+	cfq_clear_cfqq_fifo_expire(cfqq);
+	if (!coop)
+		cfq_clear_cfqq_coop(cfqq);
 }
 
 /*
@@ -678,7 +431,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_del_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
-	struct cfq_data *cfqd = cfqq->cfqd;
 	const int sync = rq_is_sync(rq);
 
 	BUG_ON(!cfqq->queued[sync]);
@@ -686,8 +438,17 @@ static void cfq_del_rq_rb(struct request *rq)
 
 	elv_rb_del(&cfqq->sort_list, rq);
 
-	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
-		cfq_del_cfqq_rr(cfqd, cfqq);
+	/*
+	 * If this was last request in the queue, remove this queue from
+	 * prio trees. For last request nr_queued count will still be 1 as
+	 * elevator fair queuing layer is yet to do the accounting.
+	 */
+	if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+		if (cfqq->p_root) {
+			rb_erase(&cfqq->p_node, cfqq->p_root);
+			cfqq->p_root = NULL;
+		}
+	}
 }
 
 static void cfq_add_rq_rb(struct request *rq)
@@ -705,9 +466,6 @@ static void cfq_add_rq_rb(struct request *rq)
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
 		cfq_dispatch_insert(cfqd->queue, __alias);
 
-	if (!cfq_cfqq_on_rr(cfqq))
-		cfq_add_cfqq_rr(cfqd, cfqq);
-
 	/*
 	 * check if this request is a better next-serve candidate
 	 */
@@ -755,23 +513,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
-
 	cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
 }
 
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
-}
-
 static void cfq_remove_request(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -782,7 +526,6 @@ static void cfq_remove_request(struct request *rq)
 	list_del_init(&rq->queuelist);
 	cfq_del_rq_rb(rq);
 
-	cfqq->cfqd->rq_queued--;
 	if (rq_is_meta(rq)) {
 		WARN_ON(!cfqq->meta_pending);
 		cfqq->meta_pending--;
@@ -856,93 +599,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	return 0;
 }
 
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
-				   struct cfq_queue *cfqq)
-{
-	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active");
-		cfqq->slice_end = 0;
-		cfqq->slice_dispatch = 0;
-
-		cfq_clear_cfqq_wait_request(cfqq);
-		cfq_clear_cfqq_must_dispatch(cfqq);
-		cfq_clear_cfqq_must_alloc_slice(cfqq);
-		cfq_clear_cfqq_fifo_expire(cfqq);
-		cfq_mark_cfqq_slice_new(cfqq);
-
-		del_timer(&cfqd->idle_slice_timer);
-	}
-
-	cfqd->active_queue = cfqq;
-}
-
 /*
  * current cfqq expired its slice (or was too idle), select new one
  */
 static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
-	if (cfq_cfqq_wait_request(cfqq))
-		del_timer(&cfqd->idle_slice_timer);
-
-	cfq_clear_cfqq_wait_request(cfqq);
-
-	/*
-	 * store what was left of this slice, if the queue idled/timed out
-	 */
-	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
-		cfqq->slice_resid = cfqq->slice_end - jiffies;
-		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
-	}
-
-	cfq_resort_rr_list(cfqd, cfqq);
-
-	if (cfqq == cfqd->active_queue)
-		cfqd->active_queue = NULL;
-
-	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
-		cfqd->active_cic = NULL;
-	}
+	__elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);
 
 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
-
-	return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
-					      struct cfq_queue *cfqq)
-{
-	if (!cfqq) {
-		cfqq = cfq_get_next_queue(cfqd);
-		if (cfqq)
-			cfq_clear_cfqq_coop(cfqq);
-	}
-
-	__cfq_set_active_queue(cfqd, cfqq);
-	return cfqq;
+		__cfq_slice_expired(cfqd, cfqq);
 }
 
 static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1019,11 +690,12 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
  * associated with the I/O issued by cur_cfqq.  I'm not sure this is a valid
  * assumption.
  */
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
-					      struct cfq_queue *cur_cfqq,
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+					      void *cur_sched_queue,
 					      int probe)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
 	/*
 	 * A valid cfq_io_context is necessary to compare requests against
@@ -1046,38 +718,18 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 
 	if (!probe)
 		cfq_mark_cfqq_coop(cfqq);
-	return cfqq;
+	return cfqq->ioq;
 }
 
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_io_context *cic;
 	unsigned long sl;
 
-	/*
-	 * SSD device without seek penalty, disable idling. But only do so
-	 * for devices that support queuing, otherwise we still have a problem
-	 * with sync vs async workloads.
-	 */
-	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
-		return;
-
 	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
-	WARN_ON(cfq_cfqq_slice_new(cfqq));
-
-	/*
-	 * idle is disabled, either manually or by past process history
-	 */
-	if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
-		return;
-
-	/*
-	 * still requests with the driver, don't idle
-	 */
-	if (cfqd->rq_in_driver)
-		return;
-
+	WARN_ON(elv_ioq_slice_new(cfqq->ioq));
 	/*
 	 * task has exited, don't wait
 	 */
@@ -1085,18 +737,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
 		return;
 
-	cfq_mark_cfqq_wait_request(cfqq);
 
+	elv_mark_ioq_wait_request(cfqq->ioq);
 	/*
 	 * we don't want to idle for seeks, but we do want to allow
 	 * fair distribution of slice time for a process doing back-to-back
 	 * seeks. so allow a little bit of time for him to submit a new rq
 	 */
-	sl = cfqd->cfq_slice_idle;
+	sl = elv_get_slice_idle(q->elevator);
 	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
 
-	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+	elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
 }
 
@@ -1105,13 +757,12 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
  */
 static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%u", blk_rq_sectors(rq));
 
 	cfq_remove_request(rq);
-	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
 
 	if (cfq_cfqq_sync(cfqq))
@@ -1149,78 +800,11 @@ static inline int
 cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	const int base_rq = cfqd->cfq_slice_async_rq;
+	unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
 
-	WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
+	WARN_ON(ioprio >= IOPRIO_BE_NR);
 
-	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq, *new_cfqq = NULL;
-
-	cfqq = cfqd->active_queue;
-	if (!cfqq)
-		goto new_queue;
-
-	/*
-	 * The active queue has run out of time, expire it and select new.
-	 */
-	if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
-		goto expire;
-
-	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
-	 * The active queue has requests and isn't expired, allow it to
-	 * dispatch.
-	 */
-	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-		goto keep_queue;
-
-	/*
-	 * If another queue has a request waiting within our mean seek
-	 * distance, let it run.  The expire code will check for close
-	 * cooperators and put the close queue at the front of the service
-	 * tree.
-	 */
-	new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
-	if (new_cfqq)
-		goto expire;
-
-	/*
-	 * No requests pending. If the active queue still has requests in
-	 * flight or is idling for a new request, allow either of these
-	 * conditions to happen (or time out) before selecting a new queue.
-	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
-	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
-		cfqq = NULL;
-		goto keep_queue;
-	}
-
-expire:
-	cfq_slice_expired(cfqd, 0);
-new_queue:
-	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
-	return cfqq;
+	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
 }
 
 static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1245,12 +829,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	cfq_slice_expired(cfqd, 0);
+	/* This is probably redundant now. The above loop should make sure
+	 * that all the busy queues have expired */
+	cfq_slice_expired(cfqd);
 
-	BUG_ON(cfqd->busy_queues);
+	BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));
 
 	cfq_log(cfqd, "forced_dispatch=%d", dispatched);
 	return dispatched;
@@ -1296,13 +882,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
 
-	if (!cfqd->busy_queues)
-		return 0;
-
 	if (unlikely(force))
 		return cfq_forced_dispatch(cfqd);
 
-	cfqq = cfq_select_queue(cfqd);
+	cfqq = elv_select_sched_queue(q, 0);
 	if (!cfqq)
 		return 0;
 
@@ -1319,7 +902,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
-	if (cfqq->dispatched >= max_dispatch) {
+	if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
 		/*
 		 * idle queue must always only have a single IO in flight
 		 */
@@ -1329,13 +912,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (elv_nr_busy_ioq(q->elevator) > 1)
 			return 0;
 
 		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
-		if (cfqq->dispatched >= 4 * max_dispatch)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
 			return 0;
 	}
 
@@ -1344,51 +927,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	 */
 	cfq_dispatch_request(cfqd, cfqq);
 	cfqq->slice_dispatch++;
-	cfq_clear_cfqq_must_dispatch(cfqq);
 
 	/*
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+	if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
-		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0);
+		cfq_slice_expired(cfqd);
 	}
 
 	cfq_log(cfqd, "dispatched a request");
 	return 1;
 }
 
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
 {
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
 
-	BUG_ON(atomic_read(&cfqq->ref) <= 0);
-
-	if (!atomic_dec_and_test(&cfqq->ref))
-		return;
+	BUG_ON(!cfqq);
 
-	cfq_log_cfqq(cfqd, cfqq, "put_queue");
+	cfq_log_cfqq(cfqd, cfqq, "free_queue");
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
-	if (unlikely(cfqd->active_queue == cfqq)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq_is_active_queue(cfqq))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+	elv_put_ioq(cfqq->ioq);
+}
+
 /*
  * Must always be called with the rcu_read_lock() held
  */
@@ -1476,9 +1053,9 @@ static void cfq_free_io_context(struct io_context *ioc)
 
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	if (unlikely(cfqq == cfqd->active_queue)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	cfq_put_queue(cfqq);
@@ -1548,11 +1125,11 @@ static struct cfq_io_context *
 cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
 
 	cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
-							cfqd->queue->node);
+							q->node);
 	if (cic) {
-		cic->last_end_request = jiffies;
 		INIT_LIST_HEAD(&cic->queue_list);
 		INIT_HLIST_NODE(&cic->cic_list);
 		cic->dtor = cfq_free_io_context;
@@ -1566,7 +1143,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
-	int ioprio_class;
+	int ioprio_class, ioprio;
 
 	if (!cfq_cfqq_prio_changed(cfqq))
 		return;
@@ -1579,30 +1156,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 		/*
 		 * no prio set, inherit CPU scheduling settings
 		 */
-		cfqq->ioprio = task_nice_ioprio(tsk);
-		cfqq->ioprio_class = task_nice_ioclass(tsk);
+		ioprio = task_nice_ioprio(tsk);
+		ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_RT;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
-		cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
-		cfqq->ioprio = 7;
-		cfq_clear_cfqq_idle_window(cfqq);
+		ioprio_class = IOPRIO_CLASS_IDLE;
+		ioprio = 7;
+		elv_clear_ioq_idle_window(cfqq->ioq);
 		break;
 	}
 
+	elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+	elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
 	/*
 	 * keep track of original prio settings in case we have to temporarily
 	 * elevate the priority of this queue
 	 */
-	cfqq->org_ioprio = cfqq->ioprio;
-	cfqq->org_ioprio_class = cfqq->ioprio_class;
+	cfqq->org_ioprio = ioprio;
+	cfqq->org_ioprio_class = ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -1611,11 +1191,12 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	struct cfq_data *cfqd = cic->key;
 	struct cfq_queue *cfqq;
 	unsigned long flags;
+	struct request_queue *q = cfqd ? cfqd->queue : NULL;
 
 	if (unlikely(!cfqd))
 		return;
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
 	if (cfqq) {
@@ -1632,7 +1213,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	if (cfqq)
 		cfq_mark_cfqq_prio_changed(cfqq);
 
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
 static void cfq_ioc_set_ioprio(struct io_context *ioc)
@@ -1643,20 +1224,23 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
-		     struct io_context *ioc, gfp_t gfp_mask)
+				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
-
+	struct request_queue *q = cfqd->queue;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog = NULL;
 retry:
+	iog = io_get_io_group(q);
+
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
 
 	if (!cfqq) {
 		if (new_cfqq) {
-			cfqq = new_cfqq;
-			new_cfqq = NULL;
+			goto alloc_ioq;
 		} else if (gfp_mask & __GFP_WAIT) {
 			/*
 			 * Inform the allocator of the fact that we will
@@ -1677,22 +1261,53 @@ retry:
 			if (!cfqq)
 				goto out;
 		}
+alloc_ioq:
+		if (new_ioq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			cfqq = new_cfqq;
+			new_cfqq = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q,
+					gfp_mask | __GFP_NOFAIL | __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq) {
+				kmem_cache_free(cfq_pool, cfqq);
+				cfqq = NULL;
+				goto out;
+			}
+		}
 
-		RB_CLEAR_NODE(&cfqq->rb_node);
+		/*
+		 * Both cfqq and ioq objects allocated. Do the initializations
+		 * now.
+		 */
 		RB_CLEAR_NODE(&cfqq->p_node);
 		INIT_LIST_HEAD(&cfqq->fifo);
-
-		atomic_set(&cfqq->ref, 0);
 		cfqq->cfqd = cfqd;
 
 		cfq_mark_cfqq_prio_changed(cfqq);
 
+		cfqq->ioq = ioq;
 		cfq_init_prio_data(cfqq, ioc);
+		elv_init_ioq(q->elevator, ioq, iog, cfqq,
+				cfqq->org_ioprio_class, cfqq->org_ioprio,
+				is_sync);
 
 		if (is_sync) {
 			if (!cfq_class_idle(cfqq))
-				cfq_mark_cfqq_idle_window(cfqq);
-			cfq_mark_cfqq_sync(cfqq);
+				elv_mark_ioq_idle_window(cfqq->ioq);
+			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
@@ -1701,38 +1316,28 @@ retry:
 	if (new_cfqq)
 		kmem_cache_free(cfq_pool, new_cfqq);
 
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
 out:
 	WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
 	return cfqq;
 }
 
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
-	switch (ioprio_class) {
-	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
-	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
-	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
-	default:
-		BUG();
-	}
-}
-
 static struct cfq_queue *
 cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-	      gfp_t gfp_mask)
+					gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
-	struct cfq_queue **async_cfqq = NULL;
+	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct io_group *iog = io_get_io_group(cfqd->queue);
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
-		cfqq = *async_cfqq;
+		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
+								ioprio);
+		cfqq = async_cfqq;
 	}
 
 	if (!cfqq) {
@@ -1741,15 +1346,11 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 			return NULL;
 	}
 
-	/*
-	 * pin the queue now that it's allocated, scheduler exit will prune it
-	 */
-	if (!is_sync && !(*async_cfqq)) {
-		atomic_inc(&cfqq->ref);
-		*async_cfqq = cfqq;
-	}
+	if (!is_sync && !async_cfqq)
+		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	atomic_inc(&cfqq->ref);
+	/* ioc reference */
+	elv_get_ioq(cfqq->ioq);
 	return cfqq;
 }
 
@@ -1828,6 +1429,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 {
 	unsigned long flags;
 	int ret;
+	struct request_queue *q = cfqd->queue;
 
 	ret = radix_tree_preload(gfp_mask);
 	if (!ret) {
@@ -1844,9 +1446,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 		radix_tree_preload_end();
 
 		if (!ret) {
-			spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+			spin_lock_irqsave(q->queue_lock, flags);
 			list_add(&cic->queue_list, &cfqd->cic_list);
-			spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+			spin_unlock_irqrestore(q->queue_lock, flags);
 		}
 	}
 
@@ -1866,10 +1468,11 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct io_context *ioc = NULL;
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
-	ioc = get_io_context(gfp_mask, cfqd->queue->node);
+	ioc = get_io_context(gfp_mask, q->node);
 	if (!ioc)
 		return NULL;
 
@@ -1888,7 +1491,6 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
-
 	return cic;
 err_free:
 	cfq_cic_free(cic);
@@ -1898,17 +1500,6 @@ err:
 }
 
 static void
-cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
-{
-	unsigned long elapsed = jiffies - cic->last_end_request;
-	unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
-
-	cic->ttime_samples = (7*cic->ttime_samples + 256) / 8;
-	cic->ttime_total = (7*cic->ttime_total + 256*ttime) / 8;
-	cic->ttime_mean = (cic->ttime_total + 128) / cic->ttime_samples;
-}
-
-static void
 cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 		       struct request *rq)
 {
@@ -1939,57 +1530,41 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 }
 
 /*
- * Disable idle window if the process thinks too long or seeks so much that
- * it doesn't matter
+ * Disable idle window if the process seeks so much that it doesn't matter
  */
-static void
-cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		       struct cfq_io_context *cic)
+static int
+cfq_update_idle_window(struct elevator_queue *eq, void *cfqq,
+					struct request *rq)
 {
-	int old_idle, enable_idle;
+	struct cfq_io_context *cic = RQ_CIC(rq);
 
 	/*
-	 * Don't idle for async or idle io prio class
+	 * Enabling/disabling idling based on thinktime has been moved
+	 * to the common layer.
 	 */
-	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
-		return;
-
-	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
-
-	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
-		enable_idle = 0;
-	else if (sample_valid(cic->ttime_samples)) {
-		if (cic->ttime_mean > cfqd->cfq_slice_idle)
-			enable_idle = 0;
-		else
-			enable_idle = 1;
-	}
+	if (!atomic_read(&cic->ioc->nr_tasks) ||
+	    (elv_hw_tag(eq) && CIC_SEEKY(cic)))
+		return 0;
 
-	if (old_idle != enable_idle) {
-		cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
-		if (enable_idle)
-			cfq_mark_cfqq_idle_window(cfqq);
-		else
-			cfq_clear_cfqq_idle_window(cfqq);
-	}
+	return 1;
 }
 
 /*
  * Check if new_cfqq should preempt the currently active queue. Return 0 for
- * no or if we aren't sure, a 1 will cause a preempt.
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ * Some of the preemption logic has been moved to the common layer. Only
+ * cfq-specific parts are left here.
  */
 static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
-		   struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
 
-	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		return 0;
 
-	if (cfq_slice_used(cfqq))
+	if (elv_ioq_slice_used(cfqq->ioq))
 		return 1;
 
 	if (cfq_class_idle(new_cfqq))
@@ -2012,13 +1587,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_meta(rq) && !cfqq->meta_pending)
 		return 1;
 
-	/*
-	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
-	 */
-	if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
-		return 1;
-
-	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+	if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
 		return 0;
 
 	/*
@@ -2032,29 +1601,10 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 }
 
 /*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
-
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
-	cfq_service_tree_add(cfqd, cfqq, 1);
-
-	cfqq->slice_end = 0;
-	cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
  * Called when a new fs request (rq) is added (to cfqq). Check if there's
  * something we should do about it
+ * After the request is enqueued, the common layer decides whether the
+ * queue should be preempted or kicked.
  */
 static void
 cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -2062,45 +1612,12 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 {
 	struct cfq_io_context *cic = RQ_CIC(rq);
 
-	cfqd->rq_queued++;
 	if (rq_is_meta(rq))
 		cfqq->meta_pending++;
 
-	cfq_update_io_thinktime(cfqd, cic);
 	cfq_update_io_seektime(cfqd, cic, rq);
-	cfq_update_idle_window(cfqd, cfqq, cic);
 
 	cic->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
-
-	if (cfqq == cfqd->active_queue) {
-		/*
-		 * Remember that we saw a request from this process, but
-		 * don't start queuing just yet. Otherwise we risk seeing lots
-		 * of tiny requests, because we disrupt the normal plugging
-		 * and merging. If the request is already larger than a single
-		 * page, let it rip immediately. For that case we assume that
-		 * merging is already done. Ditto for a busy system that
-		 * has other work pending, don't risk delaying until the
-		 * idle timer unplug to continue working.
-		 */
-		if (cfq_cfqq_wait_request(cfqq)) {
-			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
-			    cfqd->busy_queues > 1) {
-				del_timer(&cfqd->idle_slice_timer);
-			__blk_run_queue(cfqd->queue);
-			}
-			cfq_mark_cfqq_must_dispatch(cfqq);
-		}
-	} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
-		/*
-		 * not the active queue - expire current slice if it is
-		 * idle and has expired it's mean thinktime or this new queue
-		 * has some old slice time left and is of higher priority or
-		 * this new queue is RT and the current one is BE
-		 */
-		cfq_preempt_queue(cfqd, cfqq);
-		__blk_run_queue(cfqd->queue);
-	}
 }
 
 static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2118,81 +1635,17 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	cfq_rq_enqueued(cfqd, cfqq, rq);
 }
 
-/*
- * Update hw_tag based on peak queue depth over 50 samples under
- * sufficient load.
- */
-static void cfq_update_hw_tag(struct cfq_data *cfqd)
-{
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
-
-	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
-		return;
-
-	if (cfqd->hw_tag_samples++ < 50)
-		return;
-
-	if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
-		cfqd->hw_tag = 1;
-	else
-		cfqd->hw_tag = 0;
-
-	cfqd->hw_tag_samples = 0;
-	cfqd->rq_in_driver_peak = 0;
-}
-
 static void cfq_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 	struct cfq_data *cfqd = cfqq->cfqd;
-	const int sync = rq_is_sync(rq);
 	unsigned long now;
 
 	now = jiffies;
 	cfq_log_cfqq(cfqd, cfqq, "complete");
 
-	cfq_update_hw_tag(cfqd);
-
-	WARN_ON(!cfqd->rq_in_driver);
-	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
-	cfqq->dispatched--;
-
 	if (cfq_cfqq_sync(cfqq))
 		cfqd->sync_flight--;
-
-	if (sync)
-		RQ_CIC(rq)->last_end_request = now;
-
-	/*
-	 * If this is the active queue, check if it needs to be expired,
-	 * or if we want to idle in case it has no pending requests.
-	 */
-	if (cfqd->active_queue == cfqq) {
-		const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
-		if (cfq_cfqq_slice_new(cfqq)) {
-			cfq_set_prio_slice(cfqd, cfqq);
-			cfq_clear_cfqq_slice_new(cfqq);
-		}
-		/*
-		 * If there are no requests waiting in this queue, and
-		 * there are other queues ready to issue requests, AND
-		 * those other queues are issuing requests within our
-		 * mean seek distance, give them a chance to run instead
-		 * of idling.
-		 */
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
-			cfq_slice_expired(cfqd, 1);
-		else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
-			 sync && !rq_noidle(rq))
-			cfq_arm_slice_timer(cfqd);
-	}
-
-	if (!cfqd->rq_in_driver)
-		cfq_schedule_dispatch(cfqd);
 }
 
 /*
@@ -2201,30 +1654,33 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_prio_boost(struct cfq_queue *cfqq)
 {
+	struct io_queue *ioq = cfqq->ioq;
+
 	if (has_fs_excl()) {
 		/*
 		 * boost idle prio on transactions that would lock out other
 		 * users of the filesystem
 		 */
 		if (cfq_class_idle(cfqq))
-			cfqq->ioprio_class = IOPRIO_CLASS_BE;
-		if (cfqq->ioprio > IOPRIO_NORM)
-			cfqq->ioprio = IOPRIO_NORM;
+			elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+		if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+			elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
 	} else {
 		/*
 		 * check if we need to unboost the queue
 		 */
-		if (cfqq->ioprio_class != cfqq->org_ioprio_class)
-			cfqq->ioprio_class = cfqq->org_ioprio_class;
-		if (cfqq->ioprio != cfqq->org_ioprio)
-			cfqq->ioprio = cfqq->org_ioprio;
+		if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+			elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+		if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+			elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
 	}
 }
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
-	    !cfq_cfqq_must_alloc_slice(cfqq)) {
+	if ((elv_ioq_wait_request(cfqq->ioq) ||
+	   cfq_cfqq_must_alloc(cfqq)) && !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
 	}
@@ -2276,7 +1732,7 @@ static void cfq_put_request(struct request *rq)
 		put_io_context(RQ_CIC(rq)->ioc);
 
 		rq->elevator_private = NULL;
-		rq->elevator_private2 = NULL;
+		rq->ioq = NULL;
 
 		cfq_put_queue(cfqq);
 	}
@@ -2316,119 +1772,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq->allocated[rw]++;
 	cfq_clear_cfqq_must_alloc(cfqq);
-	atomic_inc(&cfqq->ref);
+	elv_get_ioq(cfqq->ioq);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
+	rq->ioq = cfqq->ioq;
 	return 0;
 
 queue_fail:
 	if (cic)
 		put_io_context(cic->ioc);
 
-	cfq_schedule_dispatch(cfqd);
+	elv_schedule_dispatch(cfqd->queue);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	cfq_log(cfqd, "set_request fail");
 	return 1;
 }
 
-static void cfq_kick_queue(struct work_struct *work)
-{
-	struct cfq_data *cfqd =
-		container_of(work, struct cfq_data, unplug_work);
-	struct request_queue *q = cfqd->queue;
-
-	spin_lock_irq(q->queue_lock);
-	__blk_run_queue(cfqd->queue);
-	spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
-	struct cfq_data *cfqd = (struct cfq_data *) data;
-	struct cfq_queue *cfqq;
-	unsigned long flags;
-	int timed_out = 1;
-
-	cfq_log(cfqd, "idle timer fired");
-
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
-	cfqq = cfqd->active_queue;
-	if (cfqq) {
-		timed_out = 0;
-
-		/*
-		 * We saw a request before the queue expired, let it through
-		 */
-		if (cfq_cfqq_must_dispatch(cfqq))
-			goto out_kick;
-
-		/*
-		 * expired
-		 */
-		if (cfq_slice_used(cfqq))
-			goto expire;
-
-		/*
-		 * only expire and reinvoke request handler, if there are
-		 * other queues with pending requests
-		 */
-		if (!cfqd->busy_queues)
-			goto out_cont;
-
-		/*
-		 * not expired and it has a request pending, let it dispatch
-		 */
-		if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-			goto out_kick;
-	}
-expire:
-	cfq_slice_expired(cfqd, timed_out);
-out_kick:
-	cfq_schedule_dispatch(cfqd);
-out_cont:
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
-	del_timer_sync(&cfqd->idle_slice_timer);
-	cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
-	int i;
-
-	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
-	}
-
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
 	struct request_queue *q = cfqd->queue;
 
-	cfq_shutdown_timer_wq(cfqd);
-
 	spin_lock_irq(q->queue_lock);
 
-	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
 							struct cfq_io_context,
@@ -2437,12 +1805,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		__cfq_exit_single_io_context(cfqd, cic);
 	}
 
-	cfq_put_async_queues(cfqd);
-
 	spin_unlock_irq(q->queue_lock);
-
-	cfq_shutdown_timer_wq(cfqd);
-
 	kfree(cfqd);
 }
 
@@ -2455,8 +1818,6 @@ static void *cfq_init_queue(struct request_queue *q)
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
-
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
 	 * zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2469,22 +1830,12 @@ static void *cfq_init_queue(struct request_queue *q)
 
 	cfqd->queue = q;
 
-	init_timer(&cfqd->idle_slice_timer);
-	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
-	cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
-	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
 	cfqd->cfq_quantum = cfq_quantum;
 	cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
 	cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
-	cfqd->cfq_slice[0] = cfq_slice_async;
-	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
-	cfqd->cfq_slice_idle = cfq_slice_idle;
-	cfqd->hw_tag = 1;
 
 	return cfqd;
 }
@@ -2549,9 +1900,6 @@ SHOW_FUNCTION(cfq_fifo_expire_sync_show, cfqd->cfq_fifo_expire[1], 1);
 SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
-SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 #undef SHOW_FUNCTION
 
@@ -2579,9 +1927,6 @@ STORE_FUNCTION(cfq_fifo_expire_async_store, &cfqd->cfq_fifo_expire[0], 1,
 STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
 		UINT_MAX, 0);
-STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 #undef STORE_FUNCTION
@@ -2595,10 +1940,10 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(fifo_expire_async),
 	CFQ_ATTR(back_seek_max),
 	CFQ_ATTR(back_seek_penalty),
-	CFQ_ATTR(slice_sync),
-	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
-	CFQ_ATTR(slice_idle),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(slice_async),
 	__ATTR_NULL
 };
 
@@ -2611,8 +1956,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_dispatch_fn =		cfq_dispatch_requests,
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
-		.elevator_deactivate_req_fn =	cfq_deactivate_request,
-		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -2622,7 +1965,15 @@ static struct elevator_type iosched_cfq = {
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
 		.trim =				cfq_free_io_context,
+		.elevator_free_sched_queue_fn =	cfq_free_cfq_queue,
+		.elevator_active_ioq_set_fn = 	cfq_active_ioq_set,
+		.elevator_active_ioq_reset_fn =	cfq_active_ioq_reset,
+		.elevator_arm_slice_timer_fn = 	cfq_arm_slice_timer,
+		.elevator_should_preempt_fn = 	cfq_should_preempt,
+		.elevator_update_idle_window_fn = cfq_update_idle_window,
+		.elevator_close_cooperator_fn = cfq_close_cooperator,
 	},
+	.elevator_features =    ELV_IOSCHED_NEED_FQ,
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
 	.elevator_owner =	THIS_MODULE,
@@ -2630,14 +1981,6 @@ static struct elevator_type iosched_cfq = {
 
 static int __init cfq_init(void)
 {
-	/*
-	 * could be 0 on HZ < 1000 setups
-	 */
-	if (!cfq_slice_async)
-		cfq_slice_async = 1;
-	if (!cfq_slice_idle)
-		cfq_slice_idle = 1;
-
 	if (cfq_slab_setup())
 		return -ENOMEM;
 
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index dd05434..1482b20 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -39,13 +39,8 @@ struct cfq_io_context {
 
 	struct io_context *ioc;
 
-	unsigned long last_end_request;
 	sector_t last_request_pos;
 
-	unsigned long ttime_total;
-	unsigned long ttime_samples;
-	unsigned long ttime_mean;
-
 	unsigned int seek_samples;
 	u64 seek_total;
 	sector_t seek_mean;
-- 
1.6.0.6


* [PATCH 06/25] io-controller: Modify cfq to make use of flat elevator fair queuing
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

This patch changes cfq to use the fair queuing code from the elevator layer.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
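For reviewers, here is a rough userspace sketch of the layering this patch
moves to: cfq_queue keeps only cfq-specific state and points at an io_queue
owned by the common layer, which now holds the reference count, the current
ioprio class and the sync flag. This is not kernel code. All *_mock names,
the struct layouts and the class constants are assumptions made up for
illustration; the real helpers are the elv_* calls used in the diff below.

```c
/*
 * Userspace mock only. Struct layouts and every *_mock name are
 * hypothetical; they merely mirror the delegation pattern in the patch.
 */
#include <assert.h>
#include <stddef.h>

enum { MOCK_CLASS_RT = 1, MOCK_CLASS_BE = 2, MOCK_CLASS_IDLE = 3 };

/* Stand-in for the common layer's per-queue object. */
struct io_queue_mock {
	int ref;		/* reference count, no longer kept in cfq_queue */
	int ioprio_class;	/* current (possibly boosted) class */
	int sync;		/* replaces the removed CFQ_CFQQ_FLAG_sync */
};

/* Stand-in for cfq's private queue: only cfq-specific state remains. */
struct cfq_queue_mock {
	struct io_queue_mock *ioq;
	int org_ioprio_class;	/* original class, kept for unboosting */
};

/* Take a reference on the common-layer queue. */
static void ioq_get_mock(struct io_queue_mock *ioq)
{
	ioq->ref++;
}

/* Drop a reference; returns 1 when the last reference went away. */
static int ioq_put_mock(struct io_queue_mock *ioq)
{
	assert(ioq->ref > 0);
	return --ioq->ref == 0;
}

/* cfq helpers become thin wrappers that ask the common layer. */
static int cfq_class_idle_mock(const struct cfq_queue_mock *cfqq)
{
	return cfqq->ioq->ioprio_class == MOCK_CLASS_IDLE;
}

static int cfq_cfqq_sync_mock(const struct cfq_queue_mock *cfqq)
{
	return cfqq->ioq->sync;
}

/* Boost an idle-class queue to BE, or restore the original class. */
static void cfq_prio_boost_mock(struct cfq_queue_mock *cfqq, int fs_excl)
{
	if (fs_excl) {
		if (cfq_class_idle_mock(cfqq))
			cfqq->ioq->ioprio_class = MOCK_CLASS_BE;
	} else if (cfqq->ioq->ioprio_class != cfqq->org_ioprio_class) {
		cfqq->ioq->ioprio_class = cfqq->org_ioprio_class;
	}
}
```

The point of the sketch is only that cfq no longer owns the scheduling
state itself: every query or update goes through the embedded ioq pointer,
which is what lets other elevators share the same fair queuing code.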
 block/Kconfig.iosched     |    3 +-
 block/cfq-iosched.c       | 1105 +++++++++------------------------------------
 include/linux/iocontext.h |    5 -
 3 files changed, 226 insertions(+), 887 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
 menu "IO Schedulers"
 
 config ELV_FAIR_QUEUING
-	bool "Elevator Fair Queuing Support"
+	bool
 	default n
 	---help---
 	  Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
+	select ELV_FAIR_QUEUING
 	default y
 	---help---
 	  The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 833ec18..f852b00 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,7 +12,6 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
-
 /*
  * tunables
  */
@@ -23,15 +22,7 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 static const int cfq_back_max = 16 * 1024;
 /* penalty of a backwards seek */
 static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
-static int cfq_slice_idle = HZ / 125;
-
-/*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY		(HZ / 5)
 
 /*
  * below this threshold, we consider thinktime immediate
@@ -43,7 +34,7 @@ static int cfq_slice_idle = HZ / 125;
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq)	(struct cfq_queue *) (ioq_sched_queue((rq)->ioq))
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +44,6 @@ static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
-#define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
 
 #define sample_valid(samples)	((samples) > 80)
 
@@ -75,12 +64,6 @@ struct cfq_rb_root {
  */
 struct cfq_data {
 	struct request_queue *queue;
-
-	/*
-	 * rr list of queues with requests and the count of them
-	 */
-	struct cfq_rb_root service_tree;
-
 	/*
 	 * Each priority tree is sorted by next_request position.  These
 	 * trees are used when determining if two or more queues are
@@ -88,39 +71,10 @@ struct cfq_data {
 	 */
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
-	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
-
-	int rq_in_driver;
 	int sync_flight;
 
-	/*
-	 * queue-depth detection
-	 */
-	int rq_queued;
-	int hw_tag;
-	int hw_tag_samples;
-	int rq_in_driver_peak;
-
-	/*
-	 * idle window management
-	 */
-	struct timer_list idle_slice_timer;
-	struct work_struct unplug_work;
-
-	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 
 	/*
@@ -130,9 +84,7 @@ struct cfq_data {
 	unsigned int cfq_fifo_expire[2];
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
-	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
-	unsigned int cfq_slice_idle;
 
 	struct list_head cic_list;
 };
@@ -141,16 +93,11 @@ struct cfq_data {
  * Per process-grouping structure
  */
 struct cfq_queue {
-	/* reference count */
-	atomic_t ref;
+	struct io_queue *ioq;
 	/* various state flags, see below */
 	unsigned int flags;
 	/* parent cfq_data */
 	struct cfq_data *cfqd;
-	/* service_tree member */
-	struct rb_node rb_node;
-	/* service_tree key */
-	unsigned long rb_key;
 	/* prio tree member */
 	struct rb_node p_node;
 	/* prio tree root we belong to, if any */
@@ -166,33 +113,23 @@ struct cfq_queue {
 	/* fifo list of requests in sort_list */
 	struct list_head fifo;
 
-	unsigned long slice_end;
-	long slice_resid;
 	unsigned int slice_dispatch;
 
 	/* pending metadata requests */
 	int meta_pending;
-	/* number of requests that are on the dispatch list or inside driver */
-	int dispatched;
 
 	/* io prio of this group */
-	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class, org_ioprio_class;
+	unsigned short org_ioprio;
+	unsigned short org_ioprio_class;
 
 	pid_t pid;
 };
 
 enum cfqq_state_flags {
-	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
-	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
-	CFQ_CFQQ_FLAG_must_dispatch,	/* must be allowed a dispatch */
 	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
 	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
 	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
-	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
 	CFQ_CFQQ_FLAG_prio_changed,	/* task priority has changed */
-	CFQ_CFQQ_FLAG_slice_new,	/* no requests dispatched in slice */
-	CFQ_CFQQ_FLAG_sync,		/* synchronous queue */
 	CFQ_CFQQ_FLAG_coop,		/* has done a coop jump of the queue */
 };
 
@@ -210,16 +147,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 	return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0;	\
 }
 
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
 CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
 CFQ_CFQQ_FNS(coop);
 #undef CFQ_CFQQ_FNS
 
@@ -258,66 +189,27 @@ static inline int cfq_bio_sync(struct bio *bio)
 	return 0;
 }
 
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
 {
-	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
-	}
+	return ioq_to_io_group(cfqq->ioq);
 }
 
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	return !cfqd->busy_queues;
+	return elv_ioq_class_idle(cfqq->ioq);
 }
 
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
-				 unsigned short prio)
-{
-	const int base_slice = cfqd->cfq_slice[sync];
-
-	WARN_ON(prio >= IOPRIO_BE_NR);
-
-	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
-}
-
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
-}
-
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
-	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
+	return elv_ioq_sync(cfqq->ioq);
 }
 
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
 {
-	if (cfq_cfqq_slice_new(cfqq))
-		return 0;
-	if (time_before(jiffies, cfqq->slice_end))
-		return 0;
+	struct cfq_data *cfqd = cfqq->cfqd;
+	struct elevator_queue *e = cfqd->queue->elevator;
 
-	return 1;
+	return (elv_active_sched_queue(e) == cfqq);
 }
 
 /*
@@ -416,33 +308,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 }
 
 /*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
-	if (!root->left)
-		root->left = rb_first(&root->rb);
-
-	if (root->left)
-		return rb_entry(root->left, struct cfq_queue, rb_node);
-
-	return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
-	rb_erase(n, root);
-	RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
-	if (root->left == n)
-		root->left = NULL;
-	rb_erase_init(n, &root->rb);
-}
-
-/*
  * would be nice to take fifo expire time into account as well
  */
 static struct request *
@@ -455,10 +320,10 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	BUG_ON(RB_EMPTY_NODE(&last->rb_node));
 
-	if (rbprev)
+	if (rbprev != NULL)
 		prev = rb_entry_rq(rbprev);
 
-	if (rbnext)
+	if (rbnext != NULL)
 		next = rb_entry_rq(rbnext);
 	else {
 		rbnext = rb_first(&cfqq->sort_list);
@@ -469,95 +334,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	return cfq_choose_req(cfqd, next, prev);
 }
 
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
-				      struct cfq_queue *cfqq)
-{
-	/*
-	 * just an approximation, should be ok.
-	 */
-	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
-{
-	struct rb_node **p, *parent;
-	struct cfq_queue *__cfqq;
-	unsigned long rb_key;
-	int left;
-
-	if (cfq_class_idle(cfqq)) {
-		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
-		if (parent && parent != &cfqq->rb_node) {
-			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-			rb_key += __cfqq->rb_key;
-		} else
-			rb_key += jiffies;
-	} else if (!add_front) {
-		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
-		rb_key += cfqq->slice_resid;
-		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
-		/*
-		 * same position, nothing more to do
-		 */
-		if (rb_key == cfqq->rb_key)
-			return;
-
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	}
-
-	left = 1;
-	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
-	while (*p) {
-		struct rb_node **n;
-
-		parent = *p;
-		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
-		/*
-		 * sort RT queues first, we always want to give
-		 * preference to them. IDLE queues goes to the back.
-		 * after that, sort on the next service time.
-		 */
-		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			n = &(*p)->rb_right;
-		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
-			n = &(*p)->rb_right;
-		else if (rb_key < __cfqq->rb_key)
-			n = &(*p)->rb_left;
-		else
-			n = &(*p)->rb_right;
-
-		if (n == &(*p)->rb_right)
-			left = 0;
-
-		p = n;
-	}
-
-	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
-
-	cfqq->rb_key = rb_key;
-	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
 static struct cfq_queue *
 cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
 		     sector_t sector, struct rb_node **ret_parent,
@@ -619,57 +395,34 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 		cfqq->p_root = NULL;
 }
 
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
 {
-	/*
-	 * Resorting requires the cfqq to be on the RR list already.
-	 */
-	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
-		cfq_prio_tree_add(cfqd, cfqq);
-	}
-}
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
-	cfq_mark_cfqq_on_rr(cfqq);
-	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
+	if (cfqd->active_cic) {
+		put_io_context(cfqd->active_cic->ioc);
+		cfqd->active_cic = NULL;
+	}
 
-	cfq_resort_rr_list(cfqd, cfqq);
+	/* Resort the cfqq in prio tree */
+	if (cfqq)
+		cfq_prio_tree_add(cfqd, cfqq);
 }
 
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as the active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
 {
-	cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-	cfq_clear_cfqq_on_rr(cfqq);
+	struct cfq_queue *cfqq = sched_queue;
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	if (cfqq->p_root) {
-		rb_erase(&cfqq->p_node, cfqq->p_root);
-		cfqq->p_root = NULL;
-	}
+	cfqq->slice_dispatch = 0;
 
-	BUG_ON(!cfqd->busy_queues);
-	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
+	cfq_clear_cfqq_must_alloc_slice(cfqq);
+	cfq_clear_cfqq_fifo_expire(cfqq);
+	if (!coop)
+		cfq_clear_cfqq_coop(cfqq);
 }
 
 /*
@@ -678,7 +431,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_del_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
-	struct cfq_data *cfqd = cfqq->cfqd;
 	const int sync = rq_is_sync(rq);
 
 	BUG_ON(!cfqq->queued[sync]);
@@ -686,8 +438,17 @@ static void cfq_del_rq_rb(struct request *rq)
 
 	elv_rb_del(&cfqq->sort_list, rq);
 
-	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
-		cfq_del_cfqq_rr(cfqd, cfqq);
+	/*
+	 * If this was the last request in the queue, remove the queue from the
+	 * prio trees. The nr_queued count is still 1 for the last request, as
+	 * the elevator fair queuing layer has yet to do the accounting.
+	 */
+	if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+		if (cfqq->p_root) {
+			rb_erase(&cfqq->p_node, cfqq->p_root);
+			cfqq->p_root = NULL;
+		}
+	}
 }
 
 static void cfq_add_rq_rb(struct request *rq)
@@ -705,9 +466,6 @@ static void cfq_add_rq_rb(struct request *rq)
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
 		cfq_dispatch_insert(cfqd->queue, __alias);
 
-	if (!cfq_cfqq_on_rr(cfqq))
-		cfq_add_cfqq_rr(cfqd, cfqq);
-
 	/*
 	 * check if this request is a better next-serve candidate
 	 */
@@ -755,23 +513,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
-
 	cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
 }
 
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
-}
-
 static void cfq_remove_request(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -782,7 +526,6 @@ static void cfq_remove_request(struct request *rq)
 	list_del_init(&rq->queuelist);
 	cfq_del_rq_rb(rq);
 
-	cfqq->cfqd->rq_queued--;
 	if (rq_is_meta(rq)) {
 		WARN_ON(!cfqq->meta_pending);
 		cfqq->meta_pending--;
@@ -856,93 +599,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	return 0;
 }
 
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
-				   struct cfq_queue *cfqq)
-{
-	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active");
-		cfqq->slice_end = 0;
-		cfqq->slice_dispatch = 0;
-
-		cfq_clear_cfqq_wait_request(cfqq);
-		cfq_clear_cfqq_must_dispatch(cfqq);
-		cfq_clear_cfqq_must_alloc_slice(cfqq);
-		cfq_clear_cfqq_fifo_expire(cfqq);
-		cfq_mark_cfqq_slice_new(cfqq);
-
-		del_timer(&cfqd->idle_slice_timer);
-	}
-
-	cfqd->active_queue = cfqq;
-}
-
 /*
  * current cfqq expired its slice (or was too idle), select new one
  */
 static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
-	if (cfq_cfqq_wait_request(cfqq))
-		del_timer(&cfqd->idle_slice_timer);
-
-	cfq_clear_cfqq_wait_request(cfqq);
-
-	/*
-	 * store what was left of this slice, if the queue idled/timed out
-	 */
-	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
-		cfqq->slice_resid = cfqq->slice_end - jiffies;
-		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
-	}
-
-	cfq_resort_rr_list(cfqd, cfqq);
-
-	if (cfqq == cfqd->active_queue)
-		cfqd->active_queue = NULL;
-
-	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
-		cfqd->active_cic = NULL;
-	}
+	__elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);
 
 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
-
-	return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
-					      struct cfq_queue *cfqq)
-{
-	if (!cfqq) {
-		cfqq = cfq_get_next_queue(cfqd);
-		if (cfqq)
-			cfq_clear_cfqq_coop(cfqq);
-	}
-
-	__cfq_set_active_queue(cfqd, cfqq);
-	return cfqq;
+		__cfq_slice_expired(cfqd, cfqq);
 }
 
 static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1019,11 +690,12 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
  * associated with the I/O issued by cur_cfqq.  I'm not sure this is a valid
  * assumption.
  */
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
-					      struct cfq_queue *cur_cfqq,
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+					      void *cur_sched_queue,
 					      int probe)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
 	/*
 	 * A valid cfq_io_context is necessary to compare requests against
@@ -1046,38 +718,18 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 
 	if (!probe)
 		cfq_mark_cfqq_coop(cfqq);
-	return cfqq;
+	return cfqq->ioq;
 }
 
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_io_context *cic;
 	unsigned long sl;
 
-	/*
-	 * SSD device without seek penalty, disable idling. But only do so
-	 * for devices that support queuing, otherwise we still have a problem
-	 * with sync vs async workloads.
-	 */
-	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
-		return;
-
 	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
-	WARN_ON(cfq_cfqq_slice_new(cfqq));
-
-	/*
-	 * idle is disabled, either manually or by past process history
-	 */
-	if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
-		return;
-
-	/*
-	 * still requests with the driver, don't idle
-	 */
-	if (cfqd->rq_in_driver)
-		return;
-
+	WARN_ON(elv_ioq_slice_new(cfqq->ioq));
 	/*
 	 * task has exited, don't wait
 	 */
@@ -1085,18 +737,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
 		return;
 
-	cfq_mark_cfqq_wait_request(cfqq);
 
+	elv_mark_ioq_wait_request(cfqq->ioq);
 	/*
 	 * we don't want to idle for seeks, but we do want to allow
 	 * fair distribution of slice time for a process doing back-to-back
 	 * seeks. so allow a little bit of time for him to submit a new rq
 	 */
-	sl = cfqd->cfq_slice_idle;
+	sl = elv_get_slice_idle(q->elevator);
 	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
 
-	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+	elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
 }
 
@@ -1105,13 +757,12 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
  */
 static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%u", blk_rq_sectors(rq));
 
 	cfq_remove_request(rq);
-	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
 
 	if (cfq_cfqq_sync(cfqq))
@@ -1149,78 +800,11 @@ static inline int
 cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	const int base_rq = cfqd->cfq_slice_async_rq;
+	unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
 
-	WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
+	WARN_ON(ioprio >= IOPRIO_BE_NR);
 
-	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq, *new_cfqq = NULL;
-
-	cfqq = cfqd->active_queue;
-	if (!cfqq)
-		goto new_queue;
-
-	/*
-	 * The active queue has run out of time, expire it and select new.
-	 */
-	if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
-		goto expire;
-
-	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
-	 * The active queue has requests and isn't expired, allow it to
-	 * dispatch.
-	 */
-	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-		goto keep_queue;
-
-	/*
-	 * If another queue has a request waiting within our mean seek
-	 * distance, let it run.  The expire code will check for close
-	 * cooperators and put the close queue at the front of the service
-	 * tree.
-	 */
-	new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
-	if (new_cfqq)
-		goto expire;
-
-	/*
-	 * No requests pending. If the active queue still has requests in
-	 * flight or is idling for a new request, allow either of these
-	 * conditions to happen (or time out) before selecting a new queue.
-	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
-	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
-		cfqq = NULL;
-		goto keep_queue;
-	}
-
-expire:
-	cfq_slice_expired(cfqd, 0);
-new_queue:
-	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
-	return cfqq;
+	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
 }
 
 static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1245,12 +829,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	cfq_slice_expired(cfqd, 0);
+	/* This is probably redundant now. The loop above should make sure
+	 * that all the busy queues have expired. */
+	cfq_slice_expired(cfqd);
 
-	BUG_ON(cfqd->busy_queues);
+	BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));
 
 	cfq_log(cfqd, "forced_dispatch=%d", dispatched);
 	return dispatched;
@@ -1296,13 +882,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
 
-	if (!cfqd->busy_queues)
-		return 0;
-
 	if (unlikely(force))
 		return cfq_forced_dispatch(cfqd);
 
-	cfqq = cfq_select_queue(cfqd);
+	cfqq = elv_select_sched_queue(q, 0);
 	if (!cfqq)
 		return 0;
 
@@ -1319,7 +902,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
-	if (cfqq->dispatched >= max_dispatch) {
+	if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
 		/*
 		 * idle queue must always only have a single IO in flight
 		 */
@@ -1329,13 +912,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (elv_nr_busy_ioq(q->elevator) > 1)
 			return 0;
 
 		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
-		if (cfqq->dispatched >= 4 * max_dispatch)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
 			return 0;
 	}
 
@@ -1344,51 +927,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	 */
 	cfq_dispatch_request(cfqd, cfqq);
 	cfqq->slice_dispatch++;
-	cfq_clear_cfqq_must_dispatch(cfqq);
 
 	/*
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+	if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
-		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0);
+		cfq_slice_expired(cfqd);
 	}
 
 	cfq_log(cfqd, "dispatched a request");
 	return 1;
 }
 
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
 {
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
 
-	BUG_ON(atomic_read(&cfqq->ref) <= 0);
-
-	if (!atomic_dec_and_test(&cfqq->ref))
-		return;
+	BUG_ON(!cfqq);
 
-	cfq_log_cfqq(cfqd, cfqq, "put_queue");
+	cfq_log_cfqq(cfqd, cfqq, "free_queue");
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
-	if (unlikely(cfqd->active_queue == cfqq)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq_is_active_queue(cfqq))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+	elv_put_ioq(cfqq->ioq);
+}
+
 /*
  * Must always be called with the rcu_read_lock() held
  */
@@ -1476,9 +1053,9 @@ static void cfq_free_io_context(struct io_context *ioc)
 
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	if (unlikely(cfqq == cfqd->active_queue)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	cfq_put_queue(cfqq);
@@ -1548,11 +1125,11 @@ static struct cfq_io_context *
 cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
 
 	cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
-							cfqd->queue->node);
+							q->node);
 	if (cic) {
-		cic->last_end_request = jiffies;
 		INIT_LIST_HEAD(&cic->queue_list);
 		INIT_HLIST_NODE(&cic->cic_list);
 		cic->dtor = cfq_free_io_context;
@@ -1566,7 +1143,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
-	int ioprio_class;
+	int ioprio_class, ioprio;
 
 	if (!cfq_cfqq_prio_changed(cfqq))
 		return;
@@ -1579,30 +1156,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 		/*
 		 * no prio set, inherit CPU scheduling settings
 		 */
-		cfqq->ioprio = task_nice_ioprio(tsk);
-		cfqq->ioprio_class = task_nice_ioclass(tsk);
+		ioprio = task_nice_ioprio(tsk);
+		ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_RT;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
-		cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
-		cfqq->ioprio = 7;
-		cfq_clear_cfqq_idle_window(cfqq);
+		ioprio_class = IOPRIO_CLASS_IDLE;
+		ioprio = 7;
+		elv_clear_ioq_idle_window(cfqq->ioq);
 		break;
 	}
 
+	elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+	elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
 	/*
 	 * keep track of original prio settings in case we have to temporarily
 	 * elevate the priority of this queue
 	 */
-	cfqq->org_ioprio = cfqq->ioprio;
-	cfqq->org_ioprio_class = cfqq->ioprio_class;
+	cfqq->org_ioprio = ioprio;
+	cfqq->org_ioprio_class = ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -1611,11 +1191,12 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	struct cfq_data *cfqd = cic->key;
 	struct cfq_queue *cfqq;
 	unsigned long flags;
+	struct request_queue *q = cfqd->queue;
 
 	if (unlikely(!cfqd))
 		return;
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
 	if (cfqq) {
@@ -1632,7 +1213,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	if (cfqq)
 		cfq_mark_cfqq_prio_changed(cfqq);
 
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
 static void cfq_ioc_set_ioprio(struct io_context *ioc)
@@ -1643,20 +1224,23 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
-		     struct io_context *ioc, gfp_t gfp_mask)
+				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
-
+	struct request_queue *q = cfqd->queue;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog = NULL;
 retry:
+	iog = io_get_io_group(q);
+
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
 
 	if (!cfqq) {
 		if (new_cfqq) {
-			cfqq = new_cfqq;
-			new_cfqq = NULL;
+			goto alloc_ioq;
 		} else if (gfp_mask & __GFP_WAIT) {
 			/*
 			 * Inform the allocator of the fact that we will
@@ -1677,22 +1261,53 @@ retry:
 			if (!cfqq)
 				goto out;
 		}
+alloc_ioq:
+		if (new_ioq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			cfqq = new_cfqq;
+			new_cfqq = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q,
+					gfp_mask | __GFP_NOFAIL | __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq) {
+				kmem_cache_free(cfq_pool, cfqq);
+				cfqq = NULL;
+				goto out;
+			}
+		}
 
-		RB_CLEAR_NODE(&cfqq->rb_node);
+		/*
+		 * Both cfqq and ioq objects allocated. Do the initializations
+		 * now.
+		 */
 		RB_CLEAR_NODE(&cfqq->p_node);
 		INIT_LIST_HEAD(&cfqq->fifo);
-
-		atomic_set(&cfqq->ref, 0);
 		cfqq->cfqd = cfqd;
 
 		cfq_mark_cfqq_prio_changed(cfqq);
 
+		cfqq->ioq = ioq;
 		cfq_init_prio_data(cfqq, ioc);
+		elv_init_ioq(q->elevator, ioq, iog, cfqq,
+				cfqq->org_ioprio_class, cfqq->org_ioprio,
+				is_sync);
 
 		if (is_sync) {
 			if (!cfq_class_idle(cfqq))
-				cfq_mark_cfqq_idle_window(cfqq);
-			cfq_mark_cfqq_sync(cfqq);
+				elv_mark_ioq_idle_window(cfqq->ioq);
+			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
@@ -1701,38 +1316,28 @@ retry:
 	if (new_cfqq)
 		kmem_cache_free(cfq_pool, new_cfqq);
 
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
 out:
 	WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
 	return cfqq;
 }
 
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
-	switch (ioprio_class) {
-	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
-	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
-	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
-	default:
-		BUG();
-	}
-}
-
 static struct cfq_queue *
 cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-	      gfp_t gfp_mask)
+					gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
-	struct cfq_queue **async_cfqq = NULL;
+	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct io_group *iog = io_get_io_group(cfqd->queue);
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
-		cfqq = *async_cfqq;
+		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
+								ioprio);
+		cfqq = async_cfqq;
 	}
 
 	if (!cfqq) {
@@ -1741,15 +1346,11 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 			return NULL;
 	}
 
-	/*
-	 * pin the queue now that it's allocated, scheduler exit will prune it
-	 */
-	if (!is_sync && !(*async_cfqq)) {
-		atomic_inc(&cfqq->ref);
-		*async_cfqq = cfqq;
-	}
+	if (!is_sync && !async_cfqq)
+		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	atomic_inc(&cfqq->ref);
+	/* ioc reference */
+	elv_get_ioq(cfqq->ioq);
 	return cfqq;
 }
 
@@ -1828,6 +1429,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 {
 	unsigned long flags;
 	int ret;
+	struct request_queue *q = cfqd->queue;
 
 	ret = radix_tree_preload(gfp_mask);
 	if (!ret) {
@@ -1844,9 +1446,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 		radix_tree_preload_end();
 
 		if (!ret) {
-			spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+			spin_lock_irqsave(q->queue_lock, flags);
 			list_add(&cic->queue_list, &cfqd->cic_list);
-			spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+			spin_unlock_irqrestore(q->queue_lock, flags);
 		}
 	}
 
@@ -1866,10 +1468,11 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct io_context *ioc = NULL;
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
-	ioc = get_io_context(gfp_mask, cfqd->queue->node);
+	ioc = get_io_context(gfp_mask, q->node);
 	if (!ioc)
 		return NULL;
 
@@ -1888,7 +1491,6 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
-
 	return cic;
 err_free:
 	cfq_cic_free(cic);
@@ -1898,17 +1500,6 @@ err:
 }
 
 static void
-cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
-{
-	unsigned long elapsed = jiffies - cic->last_end_request;
-	unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
-
-	cic->ttime_samples = (7*cic->ttime_samples + 256) / 8;
-	cic->ttime_total = (7*cic->ttime_total + 256*ttime) / 8;
-	cic->ttime_mean = (cic->ttime_total + 128) / cic->ttime_samples;
-}
-
-static void
 cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 		       struct request *rq)
 {
@@ -1939,57 +1530,41 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 }
 
 /*
- * Disable idle window if the process thinks too long or seeks so much that
- * it doesn't matter
+ * Disable idle window if the process seeks so much that it doesn't matter
  */
-static void
-cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		       struct cfq_io_context *cic)
+static int
+cfq_update_idle_window(struct elevator_queue *eq, void *cfqq,
+					struct request *rq)
 {
-	int old_idle, enable_idle;
+	struct cfq_io_context *cic = RQ_CIC(rq);
 
 	/*
-	 * Don't idle for async or idle io prio class
+	 * Enabling/disabling of idling based on thinktime has been moved
+	 * to the common layer.
 	 */
-	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
-		return;
-
-	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
-
-	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
-		enable_idle = 0;
-	else if (sample_valid(cic->ttime_samples)) {
-		if (cic->ttime_mean > cfqd->cfq_slice_idle)
-			enable_idle = 0;
-		else
-			enable_idle = 1;
-	}
+	if (!atomic_read(&cic->ioc->nr_tasks) ||
+	    (elv_hw_tag(eq) && CIC_SEEKY(cic)))
+		return 0;
 
-	if (old_idle != enable_idle) {
-		cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
-		if (enable_idle)
-			cfq_mark_cfqq_idle_window(cfqq);
-		else
-			cfq_clear_cfqq_idle_window(cfqq);
-	}
+	return 1;
 }
 
 /*
  * Check if new_cfqq should preempt the currently active queue. Return 0 for
- * no or if we aren't sure, a 1 will cause a preempt.
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ * Some of the preemption logic has been moved to the common layer. Only
+ * the cfq-specific parts are left here.
  */
 static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
-		   struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
 
-	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		return 0;
 
-	if (cfq_slice_used(cfqq))
+	if (elv_ioq_slice_used(cfqq->ioq))
 		return 1;
 
 	if (cfq_class_idle(new_cfqq))
@@ -2012,13 +1587,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_meta(rq) && !cfqq->meta_pending)
 		return 1;
 
-	/*
-	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
-	 */
-	if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
-		return 1;
-
-	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+	if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
 		return 0;
 
 	/*
@@ -2032,29 +1601,10 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 }
 
 /*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
-
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
-	cfq_service_tree_add(cfqd, cfqq, 1);
-
-	cfqq->slice_end = 0;
-	cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
  * Called when a new fs request (rq) is added (to cfqq). Check if there's
  * something we should do about it
+ * After the request is enqueued, the common layer decides whether the
+ * queue should be preempted or kicked.
  */
 static void
 cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -2062,45 +1612,12 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 {
 	struct cfq_io_context *cic = RQ_CIC(rq);
 
-	cfqd->rq_queued++;
 	if (rq_is_meta(rq))
 		cfqq->meta_pending++;
 
-	cfq_update_io_thinktime(cfqd, cic);
 	cfq_update_io_seektime(cfqd, cic, rq);
-	cfq_update_idle_window(cfqd, cfqq, cic);
 
 	cic->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
-
-	if (cfqq == cfqd->active_queue) {
-		/*
-		 * Remember that we saw a request from this process, but
-		 * don't start queuing just yet. Otherwise we risk seeing lots
-		 * of tiny requests, because we disrupt the normal plugging
-		 * and merging. If the request is already larger than a single
-		 * page, let it rip immediately. For that case we assume that
-		 * merging is already done. Ditto for a busy system that
-		 * has other work pending, don't risk delaying until the
-		 * idle timer unplug to continue working.
-		 */
-		if (cfq_cfqq_wait_request(cfqq)) {
-			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
-			    cfqd->busy_queues > 1) {
-				del_timer(&cfqd->idle_slice_timer);
-			__blk_run_queue(cfqd->queue);
-			}
-			cfq_mark_cfqq_must_dispatch(cfqq);
-		}
-	} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
-		/*
-		 * not the active queue - expire current slice if it is
-		 * idle and has expired it's mean thinktime or this new queue
-		 * has some old slice time left and is of higher priority or
-		 * this new queue is RT and the current one is BE
-		 */
-		cfq_preempt_queue(cfqd, cfqq);
-		__blk_run_queue(cfqd->queue);
-	}
 }
 
 static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2118,81 +1635,17 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	cfq_rq_enqueued(cfqd, cfqq, rq);
 }
 
-/*
- * Update hw_tag based on peak queue depth over 50 samples under
- * sufficient load.
- */
-static void cfq_update_hw_tag(struct cfq_data *cfqd)
-{
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
-
-	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
-		return;
-
-	if (cfqd->hw_tag_samples++ < 50)
-		return;
-
-	if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
-		cfqd->hw_tag = 1;
-	else
-		cfqd->hw_tag = 0;
-
-	cfqd->hw_tag_samples = 0;
-	cfqd->rq_in_driver_peak = 0;
-}
-
 static void cfq_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 	struct cfq_data *cfqd = cfqq->cfqd;
-	const int sync = rq_is_sync(rq);
 	unsigned long now;
 
 	now = jiffies;
 	cfq_log_cfqq(cfqd, cfqq, "complete");
 
-	cfq_update_hw_tag(cfqd);
-
-	WARN_ON(!cfqd->rq_in_driver);
-	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
-	cfqq->dispatched--;
-
 	if (cfq_cfqq_sync(cfqq))
 		cfqd->sync_flight--;
-
-	if (sync)
-		RQ_CIC(rq)->last_end_request = now;
-
-	/*
-	 * If this is the active queue, check if it needs to be expired,
-	 * or if we want to idle in case it has no pending requests.
-	 */
-	if (cfqd->active_queue == cfqq) {
-		const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
-		if (cfq_cfqq_slice_new(cfqq)) {
-			cfq_set_prio_slice(cfqd, cfqq);
-			cfq_clear_cfqq_slice_new(cfqq);
-		}
-		/*
-		 * If there are no requests waiting in this queue, and
-		 * there are other queues ready to issue requests, AND
-		 * those other queues are issuing requests within our
-		 * mean seek distance, give them a chance to run instead
-		 * of idling.
-		 */
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
-			cfq_slice_expired(cfqd, 1);
-		else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
-			 sync && !rq_noidle(rq))
-			cfq_arm_slice_timer(cfqd);
-	}
-
-	if (!cfqd->rq_in_driver)
-		cfq_schedule_dispatch(cfqd);
 }
 
 /*
@@ -2201,30 +1654,33 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_prio_boost(struct cfq_queue *cfqq)
 {
+	struct io_queue *ioq = cfqq->ioq;
+
 	if (has_fs_excl()) {
 		/*
 		 * boost idle prio on transactions that would lock out other
 		 * users of the filesystem
 		 */
 		if (cfq_class_idle(cfqq))
-			cfqq->ioprio_class = IOPRIO_CLASS_BE;
-		if (cfqq->ioprio > IOPRIO_NORM)
-			cfqq->ioprio = IOPRIO_NORM;
+			elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+		if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+			elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
 	} else {
 		/*
 		 * check if we need to unboost the queue
 		 */
-		if (cfqq->ioprio_class != cfqq->org_ioprio_class)
-			cfqq->ioprio_class = cfqq->org_ioprio_class;
-		if (cfqq->ioprio != cfqq->org_ioprio)
-			cfqq->ioprio = cfqq->org_ioprio;
+		if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+			elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+		if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+			elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
 	}
 }
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
-	    !cfq_cfqq_must_alloc_slice(cfqq)) {
+	if ((elv_ioq_wait_request(cfqq->ioq) ||
+	   cfq_cfqq_must_alloc(cfqq)) && !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
 	}
@@ -2276,7 +1732,7 @@ static void cfq_put_request(struct request *rq)
 		put_io_context(RQ_CIC(rq)->ioc);
 
 		rq->elevator_private = NULL;
-		rq->elevator_private2 = NULL;
+		rq->ioq = NULL;
 
 		cfq_put_queue(cfqq);
 	}
@@ -2316,119 +1772,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq->allocated[rw]++;
 	cfq_clear_cfqq_must_alloc(cfqq);
-	atomic_inc(&cfqq->ref);
+	elv_get_ioq(cfqq->ioq);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
+	rq->ioq = cfqq->ioq;
 	return 0;
 
 queue_fail:
 	if (cic)
 		put_io_context(cic->ioc);
 
-	cfq_schedule_dispatch(cfqd);
+	elv_schedule_dispatch(cfqd->queue);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	cfq_log(cfqd, "set_request fail");
 	return 1;
 }
 
-static void cfq_kick_queue(struct work_struct *work)
-{
-	struct cfq_data *cfqd =
-		container_of(work, struct cfq_data, unplug_work);
-	struct request_queue *q = cfqd->queue;
-
-	spin_lock_irq(q->queue_lock);
-	__blk_run_queue(cfqd->queue);
-	spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
-	struct cfq_data *cfqd = (struct cfq_data *) data;
-	struct cfq_queue *cfqq;
-	unsigned long flags;
-	int timed_out = 1;
-
-	cfq_log(cfqd, "idle timer fired");
-
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
-	cfqq = cfqd->active_queue;
-	if (cfqq) {
-		timed_out = 0;
-
-		/*
-		 * We saw a request before the queue expired, let it through
-		 */
-		if (cfq_cfqq_must_dispatch(cfqq))
-			goto out_kick;
-
-		/*
-		 * expired
-		 */
-		if (cfq_slice_used(cfqq))
-			goto expire;
-
-		/*
-		 * only expire and reinvoke request handler, if there are
-		 * other queues with pending requests
-		 */
-		if (!cfqd->busy_queues)
-			goto out_cont;
-
-		/*
-		 * not expired and it has a request pending, let it dispatch
-		 */
-		if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-			goto out_kick;
-	}
-expire:
-	cfq_slice_expired(cfqd, timed_out);
-out_kick:
-	cfq_schedule_dispatch(cfqd);
-out_cont:
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
-	del_timer_sync(&cfqd->idle_slice_timer);
-	cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
-	int i;
-
-	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
-	}
-
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
 	struct request_queue *q = cfqd->queue;
 
-	cfq_shutdown_timer_wq(cfqd);
-
 	spin_lock_irq(q->queue_lock);
 
-	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
 							struct cfq_io_context,
@@ -2437,12 +1805,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		__cfq_exit_single_io_context(cfqd, cic);
 	}
 
-	cfq_put_async_queues(cfqd);
-
 	spin_unlock_irq(q->queue_lock);
-
-	cfq_shutdown_timer_wq(cfqd);
-
 	kfree(cfqd);
 }
 
@@ -2455,8 +1818,6 @@ static void *cfq_init_queue(struct request_queue *q)
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
-
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
 	 * zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2469,22 +1830,12 @@ static void *cfq_init_queue(struct request_queue *q)
 
 	cfqd->queue = q;
 
-	init_timer(&cfqd->idle_slice_timer);
-	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
-	cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
-	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
 	cfqd->cfq_quantum = cfq_quantum;
 	cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
 	cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
-	cfqd->cfq_slice[0] = cfq_slice_async;
-	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
-	cfqd->cfq_slice_idle = cfq_slice_idle;
-	cfqd->hw_tag = 1;
 
 	return cfqd;
 }
@@ -2549,9 +1900,6 @@ SHOW_FUNCTION(cfq_fifo_expire_sync_show, cfqd->cfq_fifo_expire[1], 1);
 SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
-SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 #undef SHOW_FUNCTION
 
@@ -2579,9 +1927,6 @@ STORE_FUNCTION(cfq_fifo_expire_async_store, &cfqd->cfq_fifo_expire[0], 1,
 STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
 		UINT_MAX, 0);
-STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 #undef STORE_FUNCTION
@@ -2595,10 +1940,10 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(fifo_expire_async),
 	CFQ_ATTR(back_seek_max),
 	CFQ_ATTR(back_seek_penalty),
-	CFQ_ATTR(slice_sync),
-	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
-	CFQ_ATTR(slice_idle),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(slice_async),
 	__ATTR_NULL
 };
 
@@ -2611,8 +1956,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_dispatch_fn =		cfq_dispatch_requests,
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
-		.elevator_deactivate_req_fn =	cfq_deactivate_request,
-		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -2622,7 +1965,15 @@ static struct elevator_type iosched_cfq = {
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
 		.trim =				cfq_free_io_context,
+		.elevator_free_sched_queue_fn =	cfq_free_cfq_queue,
+		.elevator_active_ioq_set_fn = 	cfq_active_ioq_set,
+		.elevator_active_ioq_reset_fn =	cfq_active_ioq_reset,
+		.elevator_arm_slice_timer_fn = 	cfq_arm_slice_timer,
+		.elevator_should_preempt_fn = 	cfq_should_preempt,
+		.elevator_update_idle_window_fn = cfq_update_idle_window,
+		.elevator_close_cooperator_fn = cfq_close_cooperator,
 	},
+	.elevator_features =    ELV_IOSCHED_NEED_FQ,
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
 	.elevator_owner =	THIS_MODULE,
@@ -2630,14 +1981,6 @@ static struct elevator_type iosched_cfq = {
 
 static int __init cfq_init(void)
 {
-	/*
-	 * could be 0 on HZ < 1000 setups
-	 */
-	if (!cfq_slice_async)
-		cfq_slice_async = 1;
-	if (!cfq_slice_idle)
-		cfq_slice_idle = 1;
-
 	if (cfq_slab_setup())
 		return -ENOMEM;
 
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index dd05434..1482b20 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -39,13 +39,8 @@ struct cfq_io_context {
 
 	struct io_context *ioc;
 
-	unsigned long last_end_request;
 	sector_t last_request_pos;
 
-	unsigned long ttime_total;
-	unsigned long ttime_samples;
-	unsigned long ttime_mean;
-
 	unsigned int seek_samples;
 	u64 seek_total;
 	sector_t seek_mean;
-- 
1.6.0.6


* [PATCH 06/25] io-controller: Modify cfq to make use of flat elevator fair queuing
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

This patch changes cfq to use the fair queuing code from the elevator layer.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched     |    3 +-
 block/cfq-iosched.c       | 1105 +++++++++------------------------------------
 include/linux/iocontext.h |    5 -
 3 files changed, 226 insertions(+), 887 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3398134..dd5224d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -3,7 +3,7 @@ if BLOCK
 menu "IO Schedulers"
 
 config ELV_FAIR_QUEUING
-	bool "Elevator Fair Queuing Support"
+	bool
 	default n
 	---help---
 	  Traditionally only cfq had notion of multiple queues and it did
@@ -46,6 +46,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
+	select ELV_FAIR_QUEUING
 	default y
 	---help---
 	  The CFQ I/O scheduler tries to distribute bandwidth equally
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 833ec18..f852b00 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -12,7 +12,6 @@
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
 #include <linux/blktrace_api.h>
-
 /*
  * tunables
  */
@@ -23,15 +22,7 @@ static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 static const int cfq_back_max = 16 * 1024;
 /* penalty of a backwards seek */
 static const int cfq_back_penalty = 2;
-static const int cfq_slice_sync = HZ / 10;
-static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
-static int cfq_slice_idle = HZ / 125;
-
-/*
- * offset from end of service tree
- */
-#define CFQ_IDLE_DELAY		(HZ / 5)
 
 /*
  * below this threshold, we consider thinktime immediate
@@ -43,7 +34,7 @@ static int cfq_slice_idle = HZ / 125;
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
+#define RQ_CFQQ(rq)	(struct cfq_queue *) (ioq_sched_queue((rq)->ioq))
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -53,8 +44,6 @@ static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 #define CFQ_PRIO_LISTS		IOPRIO_BE_NR
-#define cfq_class_idle(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_IDLE)
-#define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
 
 #define sample_valid(samples)	((samples) > 80)
 
@@ -75,12 +64,6 @@ struct cfq_rb_root {
  */
 struct cfq_data {
 	struct request_queue *queue;
-
-	/*
-	 * rr list of queues with requests and the count of them
-	 */
-	struct cfq_rb_root service_tree;
-
 	/*
 	 * Each priority tree is sorted by next_request position.  These
 	 * trees are used when determining if two or more queues are
@@ -88,39 +71,10 @@ struct cfq_data {
 	 */
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
-	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
-
-	int rq_in_driver;
 	int sync_flight;
 
-	/*
-	 * queue-depth detection
-	 */
-	int rq_queued;
-	int hw_tag;
-	int hw_tag_samples;
-	int rq_in_driver_peak;
-
-	/*
-	 * idle window management
-	 */
-	struct timer_list idle_slice_timer;
-	struct work_struct unplug_work;
-
-	struct cfq_queue *active_queue;
 	struct cfq_io_context *active_cic;
 
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
 	sector_t last_position;
 
 	/*
@@ -130,9 +84,7 @@ struct cfq_data {
 	unsigned int cfq_fifo_expire[2];
 	unsigned int cfq_back_penalty;
 	unsigned int cfq_back_max;
-	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
-	unsigned int cfq_slice_idle;
 
 	struct list_head cic_list;
 };
@@ -141,16 +93,11 @@ struct cfq_data {
  * Per process-grouping structure
  */
 struct cfq_queue {
-	/* reference count */
-	atomic_t ref;
+	struct io_queue *ioq;
 	/* various state flags, see below */
 	unsigned int flags;
 	/* parent cfq_data */
 	struct cfq_data *cfqd;
-	/* service_tree member */
-	struct rb_node rb_node;
-	/* service_tree key */
-	unsigned long rb_key;
 	/* prio tree member */
 	struct rb_node p_node;
 	/* prio tree root we belong to, if any */
@@ -166,33 +113,23 @@ struct cfq_queue {
 	/* fifo list of requests in sort_list */
 	struct list_head fifo;
 
-	unsigned long slice_end;
-	long slice_resid;
 	unsigned int slice_dispatch;
 
 	/* pending metadata requests */
 	int meta_pending;
-	/* number of requests that are on the dispatch list or inside driver */
-	int dispatched;
 
 	/* io prio of this group */
-	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class, org_ioprio_class;
+	unsigned short org_ioprio;
+	unsigned short org_ioprio_class;
 
 	pid_t pid;
 };
 
 enum cfqq_state_flags {
-	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
-	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
-	CFQ_CFQQ_FLAG_must_dispatch,	/* must be allowed a dispatch */
 	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
 	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
 	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
-	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
 	CFQ_CFQQ_FLAG_prio_changed,	/* task priority has changed */
-	CFQ_CFQQ_FLAG_slice_new,	/* no requests dispatched in slice */
-	CFQ_CFQQ_FLAG_sync,		/* synchronous queue */
 	CFQ_CFQQ_FLAG_coop,		/* has done a coop jump of the queue */
 };
 
@@ -210,16 +147,10 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 	return ((cfqq)->flags & (1 << CFQ_CFQQ_FLAG_##name)) != 0;	\
 }
 
-CFQ_CFQQ_FNS(on_rr);
-CFQ_CFQQ_FNS(wait_request);
-CFQ_CFQQ_FNS(must_dispatch);
 CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(fifo_expire);
-CFQ_CFQQ_FNS(idle_window);
 CFQ_CFQQ_FNS(prio_changed);
-CFQ_CFQQ_FNS(slice_new);
-CFQ_CFQQ_FNS(sync);
 CFQ_CFQQ_FNS(coop);
 #undef CFQ_CFQQ_FNS
 
@@ -258,66 +189,27 @@ static inline int cfq_bio_sync(struct bio *bio)
 	return 0;
 }
 
-/*
- * scheduler run of queue, if there are requests pending and no one in the
- * driver that will restart queueing
- */
-static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
+static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
 {
-	if (cfqd->busy_queues) {
-		cfq_log(cfqd, "schedule dispatch");
-		kblockd_schedule_work(cfqd->queue, &cfqd->unplug_work);
-	}
+	return ioq_to_io_group(cfqq->ioq);
 }
 
-static int cfq_queue_empty(struct request_queue *q)
+static inline int cfq_class_idle(struct cfq_queue *cfqq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	return !cfqd->busy_queues;
+	return elv_ioq_class_idle(cfqq->ioq);
 }
 
-/*
- * Scale schedule slice based on io priority. Use the sync time slice only
- * if a queue is marked sync and has sync io queued. A sync queue with async
- * io only, should not get full sync slice length.
- */
-static inline int cfq_prio_slice(struct cfq_data *cfqd, int sync,
-				 unsigned short prio)
-{
-	const int base_slice = cfqd->cfq_slice[sync];
-
-	WARN_ON(prio >= IOPRIO_BE_NR);
-
-	return base_slice + (base_slice/CFQ_SLICE_SCALE * (4 - prio));
-}
-
-static inline int
-cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
-}
-
-static inline void
-cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+static inline int cfq_cfqq_sync(struct cfq_queue *cfqq)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
-	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
+	return elv_ioq_sync(cfqq->ioq);
 }
 
-/*
- * We need to wrap this check in cfq_cfqq_slice_new(), since ->slice_end
- * isn't valid until the first request from the dispatch is activated
- * and the slice time set.
- */
-static inline int cfq_slice_used(struct cfq_queue *cfqq)
+static inline int cfqq_is_active_queue(struct cfq_queue *cfqq)
 {
-	if (cfq_cfqq_slice_new(cfqq))
-		return 0;
-	if (time_before(jiffies, cfqq->slice_end))
-		return 0;
+	struct cfq_data *cfqd = cfqq->cfqd;
+	struct elevator_queue *e = cfqd->queue->elevator;
 
-	return 1;
+	return (elv_active_sched_queue(e) == cfqq);
 }
 
 /*
@@ -416,33 +308,6 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 }
 
 /*
- * The below is leftmost cache rbtree addon
- */
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
-{
-	if (!root->left)
-		root->left = rb_first(&root->rb);
-
-	if (root->left)
-		return rb_entry(root->left, struct cfq_queue, rb_node);
-
-	return NULL;
-}
-
-static void rb_erase_init(struct rb_node *n, struct rb_root *root)
-{
-	rb_erase(n, root);
-	RB_CLEAR_NODE(n);
-}
-
-static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
-{
-	if (root->left == n)
-		root->left = NULL;
-	rb_erase_init(n, &root->rb);
-}
-
-/*
  * would be nice to take fifo expire time into account as well
  */
 static struct request *
@@ -455,10 +320,10 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	BUG_ON(RB_EMPTY_NODE(&last->rb_node));
 
-	if (rbprev)
+	if (rbprev != NULL)
 		prev = rb_entry_rq(rbprev);
 
-	if (rbnext)
+	if (rbnext != NULL)
 		next = rb_entry_rq(rbnext);
 	else {
 		rbnext = rb_first(&cfqq->sort_list);
@@ -469,95 +334,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	return cfq_choose_req(cfqd, next, prev);
 }
 
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
-				      struct cfq_queue *cfqq)
-{
-	/*
-	 * just an approximation, should be ok.
-	 */
-	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
-/*
- * The cfqd->service_tree holds all pending cfq_queue's that have
- * requests waiting to be processed. It is sorted in the order that
- * we will service the queues.
- */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
-{
-	struct rb_node **p, *parent;
-	struct cfq_queue *__cfqq;
-	unsigned long rb_key;
-	int left;
-
-	if (cfq_class_idle(cfqq)) {
-		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
-		if (parent && parent != &cfqq->rb_node) {
-			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-			rb_key += __cfqq->rb_key;
-		} else
-			rb_key += jiffies;
-	} else if (!add_front) {
-		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
-		rb_key += cfqq->slice_resid;
-		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
-
-	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
-		/*
-		 * same position, nothing more to do
-		 */
-		if (rb_key == cfqq->rb_key)
-			return;
-
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	}
-
-	left = 1;
-	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
-	while (*p) {
-		struct rb_node **n;
-
-		parent = *p;
-		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
-
-		/*
-		 * sort RT queues first, we always want to give
-		 * preference to them. IDLE queues goes to the back.
-		 * after that, sort on the next service time.
-		 */
-		if (cfq_class_rt(cfqq) > cfq_class_rt(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_rt(cfqq) < cfq_class_rt(__cfqq))
-			n = &(*p)->rb_right;
-		else if (cfq_class_idle(cfqq) < cfq_class_idle(__cfqq))
-			n = &(*p)->rb_left;
-		else if (cfq_class_idle(cfqq) > cfq_class_idle(__cfqq))
-			n = &(*p)->rb_right;
-		else if (rb_key < __cfqq->rb_key)
-			n = &(*p)->rb_left;
-		else
-			n = &(*p)->rb_right;
-
-		if (n == &(*p)->rb_right)
-			left = 0;
-
-		p = n;
-	}
-
-	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
-
-	cfqq->rb_key = rb_key;
-	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
-}
-
 static struct cfq_queue *
 cfq_prio_tree_lookup(struct cfq_data *cfqd, struct rb_root *root,
 		     sector_t sector, struct rb_node **ret_parent,
@@ -619,57 +395,34 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 		cfqq->p_root = NULL;
 }
 
-/*
- * Update cfqq's position in the service tree.
- */
-static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An active ioq is being reset. A chance to do cic related stuff. */
+static void cfq_active_ioq_reset(struct request_queue *q, void *sched_queue)
 {
-	/*
-	 * Resorting requires the cfqq to be on the RR list already.
-	 */
-	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
-		cfq_prio_tree_add(cfqd, cfqq);
-	}
-}
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 
-/*
- * add to busy list of queues for service, trying to be fair in ordering
- * the pending list according to last request service
- */
-static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "add_to_rr");
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
-	cfq_mark_cfqq_on_rr(cfqq);
-	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
+	if (cfqd->active_cic) {
+		put_io_context(cfqd->active_cic->ioc);
+		cfqd->active_cic = NULL;
+	}
 
-	cfq_resort_rr_list(cfqd, cfqq);
+	/* Resort the cfqq in prio tree */
+	if (cfqq)
+		cfq_prio_tree_add(cfqd, cfqq);
 }
 
-/*
- * Called when the cfqq no longer has requests pending, remove it from
- * the service tree.
- */
-static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+/* An ioq has been set as the active one. */
+static void cfq_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
 {
-	cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-	cfq_clear_cfqq_on_rr(cfqq);
+	struct cfq_queue *cfqq = sched_queue;
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
-	if (cfqq->p_root) {
-		rb_erase(&cfqq->p_node, cfqq->p_root);
-		cfqq->p_root = NULL;
-	}
+	cfqq->slice_dispatch = 0;
 
-	BUG_ON(!cfqd->busy_queues);
-	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
+	cfq_clear_cfqq_must_alloc_slice(cfqq);
+	cfq_clear_cfqq_fifo_expire(cfqq);
+	if (!coop)
+		cfq_clear_cfqq_coop(cfqq);
 }
 
 /*
@@ -678,7 +431,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 static void cfq_del_rq_rb(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
-	struct cfq_data *cfqd = cfqq->cfqd;
 	const int sync = rq_is_sync(rq);
 
 	BUG_ON(!cfqq->queued[sync]);
@@ -686,8 +438,17 @@ static void cfq_del_rq_rb(struct request *rq)
 
 	elv_rb_del(&cfqq->sort_list, rq);
 
-	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
-		cfq_del_cfqq_rr(cfqd, cfqq);
+	/*
+	 * If this was the last request in the queue, remove this queue from
+	 * the prio trees. For the last request, nr_queued will still be 1 as
+	 * the elevator fair queuing layer is yet to do the accounting.
+	 */
+	if (elv_ioq_nr_queued(cfqq->ioq) == 1) {
+		if (cfqq->p_root) {
+			rb_erase(&cfqq->p_node, cfqq->p_root);
+			cfqq->p_root = NULL;
+		}
+	}
 }
 
 static void cfq_add_rq_rb(struct request *rq)
@@ -705,9 +466,6 @@ static void cfq_add_rq_rb(struct request *rq)
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
 		cfq_dispatch_insert(cfqd->queue, __alias);
 
-	if (!cfq_cfqq_on_rr(cfqq))
-		cfq_add_cfqq_rr(cfqd, cfqq);
-
 	/*
 	 * check if this request is a better next-serve candidate
 	 */
@@ -755,23 +513,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
-
 	cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
 }
 
-static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
-{
-	struct cfq_data *cfqd = q->elevator->elevator_data;
-
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
-	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
-}
-
 static void cfq_remove_request(struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
@@ -782,7 +526,6 @@ static void cfq_remove_request(struct request *rq)
 	list_del_init(&rq->queuelist);
 	cfq_del_rq_rb(rq);
 
-	cfqq->cfqd->rq_queued--;
 	if (rq_is_meta(rq)) {
 		WARN_ON(!cfqq->meta_pending);
 		cfqq->meta_pending--;
@@ -856,93 +599,21 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	return 0;
 }
 
-static void __cfq_set_active_queue(struct cfq_data *cfqd,
-				   struct cfq_queue *cfqq)
-{
-	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active");
-		cfqq->slice_end = 0;
-		cfqq->slice_dispatch = 0;
-
-		cfq_clear_cfqq_wait_request(cfqq);
-		cfq_clear_cfqq_must_dispatch(cfqq);
-		cfq_clear_cfqq_must_alloc_slice(cfqq);
-		cfq_clear_cfqq_fifo_expire(cfqq);
-		cfq_mark_cfqq_slice_new(cfqq);
-
-		del_timer(&cfqd->idle_slice_timer);
-	}
-
-	cfqd->active_queue = cfqq;
-}
-
 /*
  * current cfqq expired its slice (or was too idle), select new one
  */
 static void
-__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    int timed_out)
+__cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
-
-	if (cfq_cfqq_wait_request(cfqq))
-		del_timer(&cfqd->idle_slice_timer);
-
-	cfq_clear_cfqq_wait_request(cfqq);
-
-	/*
-	 * store what was left of this slice, if the queue idled/timed out
-	 */
-	if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
-		cfqq->slice_resid = cfqq->slice_end - jiffies;
-		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
-	}
-
-	cfq_resort_rr_list(cfqd, cfqq);
-
-	if (cfqq == cfqd->active_queue)
-		cfqd->active_queue = NULL;
-
-	if (cfqd->active_cic) {
-		put_io_context(cfqd->active_cic->ioc);
-		cfqd->active_cic = NULL;
-	}
+	__elv_ioq_slice_expired(cfqd->queue, cfqq->ioq);
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_queue *cfqq = elv_active_sched_queue(cfqd->queue->elevator);
 
 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, timed_out);
-}
-
-/*
- * Get next queue for service. Unless we have a queue preemption,
- * we'll simply select the first cfqq in the service tree.
- */
-static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
-{
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
-
-	return cfq_rb_first(&cfqd->service_tree);
-}
-
-/*
- * Get and set a new active queue for service.
- */
-static struct cfq_queue *cfq_set_active_queue(struct cfq_data *cfqd,
-					      struct cfq_queue *cfqq)
-{
-	if (!cfqq) {
-		cfqq = cfq_get_next_queue(cfqd);
-		if (cfqq)
-			cfq_clear_cfqq_coop(cfqq);
-	}
-
-	__cfq_set_active_queue(cfqd, cfqq);
-	return cfqq;
+		__cfq_slice_expired(cfqd, cfqq);
 }
 
 static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
@@ -1019,11 +690,12 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
  * associated with the I/O issued by cur_cfqq.  I'm not sure this is a valid
  * assumption.
  */
-static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
-					      struct cfq_queue *cur_cfqq,
+static struct io_queue *cfq_close_cooperator(struct request_queue *q,
+					      void *cur_sched_queue,
 					      int probe)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_queue *cur_cfqq = cur_sched_queue, *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
 	/*
 	 * A valid cfq_io_context is necessary to compare requests against
@@ -1046,38 +718,18 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 
 	if (!probe)
 		cfq_mark_cfqq_coop(cfqq);
-	return cfqq;
+	return cfqq->ioq;
 }
 
-static void cfq_arm_slice_timer(struct cfq_data *cfqd)
+static void cfq_arm_slice_timer(struct request_queue *q, void *sched_queue)
 {
-	struct cfq_queue *cfqq = cfqd->active_queue;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_io_context *cic;
 	unsigned long sl;
 
-	/*
-	 * SSD device without seek penalty, disable idling. But only do so
-	 * for devices that support queuing, otherwise we still have a problem
-	 * with sync vs async workloads.
-	 */
-	if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
-		return;
-
 	WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
-	WARN_ON(cfq_cfqq_slice_new(cfqq));
-
-	/*
-	 * idle is disabled, either manually or by past process history
-	 */
-	if (!cfqd->cfq_slice_idle || !cfq_cfqq_idle_window(cfqq))
-		return;
-
-	/*
-	 * still requests with the driver, don't idle
-	 */
-	if (cfqd->rq_in_driver)
-		return;
-
+	WARN_ON(elv_ioq_slice_new(cfqq->ioq));
 	/*
 	 * task has exited, don't wait
 	 */
@@ -1085,18 +737,18 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	if (!cic || !atomic_read(&cic->ioc->nr_tasks))
 		return;
 
-	cfq_mark_cfqq_wait_request(cfqq);
 
+	elv_mark_ioq_wait_request(cfqq->ioq);
 	/*
 	 * we don't want to idle for seeks, but we do want to allow
 	 * fair distribution of slice time for a process doing back-to-back
 	 * seeks. so allow a little bit of time for him to submit a new rq
 	 */
-	sl = cfqd->cfq_slice_idle;
+	sl = elv_get_slice_idle(q->elevator);
 	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
 
-	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+	elv_mod_idle_slice_timer(q->elevator, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
 }
 
@@ -1105,13 +757,12 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
  */
 static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 {
-	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
+	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
+	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert sect=%u", blk_rq_sectors(rq));
 
 	cfq_remove_request(rq);
-	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
 
 	if (cfq_cfqq_sync(cfqq))
@@ -1149,78 +800,11 @@ static inline int
 cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	const int base_rq = cfqd->cfq_slice_async_rq;
+	unsigned short ioprio = elv_ioq_ioprio(cfqq->ioq);
 
-	WARN_ON(cfqq->ioprio >= IOPRIO_BE_NR);
+	WARN_ON(ioprio >= IOPRIO_BE_NR);
 
-	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
-}
-
-/*
- * Select a queue for service. If we have a current active queue,
- * check whether to continue servicing it, or retrieve and set a new one.
- */
-static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
-{
-	struct cfq_queue *cfqq, *new_cfqq = NULL;
-
-	cfqq = cfqd->active_queue;
-	if (!cfqq)
-		goto new_queue;
-
-	/*
-	 * The active queue has run out of time, expire it and select new.
-	 */
-	if (cfq_slice_used(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
-		goto expire;
-
-	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
-	 * The active queue has requests and isn't expired, allow it to
-	 * dispatch.
-	 */
-	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-		goto keep_queue;
-
-	/*
-	 * If another queue has a request waiting within our mean seek
-	 * distance, let it run.  The expire code will check for close
-	 * cooperators and put the close queue at the front of the service
-	 * tree.
-	 */
-	new_cfqq = cfq_close_cooperator(cfqd, cfqq, 0);
-	if (new_cfqq)
-		goto expire;
-
-	/*
-	 * No requests pending. If the active queue still has requests in
-	 * flight or is idling for a new request, allow either of these
-	 * conditions to happen (or time out) before selecting a new queue.
-	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
-	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
-		cfqq = NULL;
-		goto keep_queue;
-	}
-
-expire:
-	cfq_slice_expired(cfqd, 0);
-new_queue:
-	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
-keep_queue:
-	return cfqq;
+	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - ioprio));
 }
 
 static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
@@ -1245,12 +829,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = elv_select_sched_queue(cfqd->queue, 1)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	cfq_slice_expired(cfqd, 0);
+	/* This is probably redundant now. The above loop should make sure
+	 * that all the busy queues have expired */
+	cfq_slice_expired(cfqd);
 
-	BUG_ON(cfqd->busy_queues);
+	BUG_ON(elv_nr_busy_ioq(cfqd->queue->elevator));
 
 	cfq_log(cfqd, "forced_dispatch=%d", dispatched);
 	return dispatched;
@@ -1296,13 +882,10 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
 
-	if (!cfqd->busy_queues)
-		return 0;
-
 	if (unlikely(force))
 		return cfq_forced_dispatch(cfqd);
 
-	cfqq = cfq_select_queue(cfqd);
+	cfqq = elv_select_sched_queue(q, 0);
 	if (!cfqq)
 		return 0;
 
@@ -1319,7 +902,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
-	if (cfqq->dispatched >= max_dispatch) {
+	if (elv_ioq_nr_dispatched(cfqq->ioq) >= max_dispatch) {
 		/*
 		 * idle queue must always only have a single IO in flight
 		 */
@@ -1329,13 +912,13 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (elv_nr_busy_ioq(q->elevator) > 1)
 			return 0;
 
 		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
-		if (cfqq->dispatched >= 4 * max_dispatch)
+		if (elv_ioq_nr_dispatched(cfqq->ioq) >= 4 * max_dispatch)
 			return 0;
 	}
 
@@ -1344,51 +927,45 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	 */
 	cfq_dispatch_request(cfqd, cfqq);
 	cfqq->slice_dispatch++;
-	cfq_clear_cfqq_must_dispatch(cfqq);
 
 	/*
 	 * expire an async queue immediately if it has used up its slice. idle
 	 * queue always expire after 1 dispatch round.
 	 */
-	if (cfqd->busy_queues > 1 && ((!cfq_cfqq_sync(cfqq) &&
+	if (elv_nr_busy_ioq(q->elevator) > 1 && ((!cfq_cfqq_sync(cfqq) &&
 	    cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
-		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0);
+		cfq_slice_expired(cfqd);
 	}
 
 	cfq_log(cfqd, "dispatched a request");
 	return 1;
 }
 
-/*
- * task holds one reference to the queue, dropped when task exits. each rq
- * in-flight on this queue also holds a reference, dropped when rq is freed.
- *
- * queue lock must be held here.
- */
-static void cfq_put_queue(struct cfq_queue *cfqq)
+static void cfq_free_cfq_queue(struct elevator_queue *e, void *sched_queue)
 {
+	struct cfq_queue *cfqq = sched_queue;
 	struct cfq_data *cfqd = cfqq->cfqd;
 
-	BUG_ON(atomic_read(&cfqq->ref) <= 0);
-
-	if (!atomic_dec_and_test(&cfqq->ref))
-		return;
+	BUG_ON(!cfqq);
 
-	cfq_log_cfqq(cfqd, cfqq, "put_queue");
+	cfq_log_cfqq(cfqd, cfqq, "free_queue");
 	BUG_ON(rb_first(&cfqq->sort_list));
 	BUG_ON(cfqq->allocated[READ] + cfqq->allocated[WRITE]);
-	BUG_ON(cfq_cfqq_on_rr(cfqq));
 
-	if (unlikely(cfqd->active_queue == cfqq)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq_is_active_queue(cfqq))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	kmem_cache_free(cfq_pool, cfqq);
 }
 
+static inline void cfq_put_queue(struct cfq_queue *cfqq)
+{
+	elv_put_ioq(cfqq->ioq);
+}
+
 /*
  * Must always be called with the rcu_read_lock() held
  */
@@ -1476,9 +1053,9 @@ static void cfq_free_io_context(struct io_context *ioc)
 
 static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	if (unlikely(cfqq == cfqd->active_queue)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
-		cfq_schedule_dispatch(cfqd);
+	if (unlikely(cfqq == elv_active_sched_queue(cfqd->queue->elevator))) {
+		__cfq_slice_expired(cfqd, cfqq);
+		elv_schedule_dispatch(cfqd->queue);
 	}
 
 	cfq_put_queue(cfqq);
@@ -1548,11 +1125,11 @@ static struct cfq_io_context *
 cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
 
 	cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO,
-							cfqd->queue->node);
+							q->node);
 	if (cic) {
-		cic->last_end_request = jiffies;
 		INIT_LIST_HEAD(&cic->queue_list);
 		INIT_HLIST_NODE(&cic->cic_list);
 		cic->dtor = cfq_free_io_context;
@@ -1566,7 +1143,7 @@ cfq_alloc_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 {
 	struct task_struct *tsk = current;
-	int ioprio_class;
+	int ioprio_class, ioprio;
 
 	if (!cfq_cfqq_prio_changed(cfqq))
 		return;
@@ -1579,30 +1156,33 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 		/*
 		 * no prio set, inherit CPU scheduling settings
 		 */
-		cfqq->ioprio = task_nice_ioprio(tsk);
-		cfqq->ioprio_class = task_nice_ioclass(tsk);
+		ioprio = task_nice_ioprio(tsk);
+		ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_RT;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		cfqq->ioprio = task_ioprio(ioc);
-		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+		ioprio = task_ioprio(ioc);
+		ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
-		cfqq->ioprio_class = IOPRIO_CLASS_IDLE;
-		cfqq->ioprio = 7;
-		cfq_clear_cfqq_idle_window(cfqq);
+		ioprio_class = IOPRIO_CLASS_IDLE;
+		ioprio = 7;
+		elv_clear_ioq_idle_window(cfqq->ioq);
 		break;
 	}
 
+	elv_ioq_set_ioprio_class(cfqq->ioq, ioprio_class);
+	elv_ioq_set_ioprio(cfqq->ioq, ioprio);
+
 	/*
 	 * keep track of original prio settings in case we have to temporarily
 	 * elevate the priority of this queue
 	 */
-	cfqq->org_ioprio = cfqq->ioprio;
-	cfqq->org_ioprio_class = cfqq->ioprio_class;
+	cfqq->org_ioprio = ioprio;
+	cfqq->org_ioprio_class = ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -1611,11 +1191,12 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	struct cfq_data *cfqd = cic->key;
 	struct cfq_queue *cfqq;
 	unsigned long flags;
+	struct request_queue *q = cfqd->queue;
 
 	if (unlikely(!cfqd))
 		return;
 
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
 	if (cfqq) {
@@ -1632,7 +1213,7 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	if (cfqq)
 		cfq_mark_cfqq_prio_changed(cfqq);
 
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
 static void cfq_ioc_set_ioprio(struct io_context *ioc)
@@ -1643,20 +1224,23 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
-		     struct io_context *ioc, gfp_t gfp_mask)
+				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
 	struct cfq_io_context *cic;
-
+	struct request_queue *q = cfqd->queue;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog = NULL;
 retry:
+	iog = io_get_io_group(q);
+
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
 
 	if (!cfqq) {
 		if (new_cfqq) {
-			cfqq = new_cfqq;
-			new_cfqq = NULL;
+			goto alloc_ioq;
 		} else if (gfp_mask & __GFP_WAIT) {
 			/*
 			 * Inform the allocator of the fact that we will
@@ -1677,22 +1261,53 @@ retry:
 			if (!cfqq)
 				goto out;
 		}
+alloc_ioq:
+		if (new_ioq) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			cfqq = new_cfqq;
+			new_cfqq = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q,
+					gfp_mask | __GFP_NOFAIL | __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq) {
+				kmem_cache_free(cfq_pool, cfqq);
+				cfqq = NULL;
+				goto out;
+			}
+		}
 
-		RB_CLEAR_NODE(&cfqq->rb_node);
+		/*
+		 * Both cfqq and ioq objects allocated. Do the initializations
+		 * now.
+		 */
 		RB_CLEAR_NODE(&cfqq->p_node);
 		INIT_LIST_HEAD(&cfqq->fifo);
-
-		atomic_set(&cfqq->ref, 0);
 		cfqq->cfqd = cfqd;
 
 		cfq_mark_cfqq_prio_changed(cfqq);
 
+		cfqq->ioq = ioq;
 		cfq_init_prio_data(cfqq, ioc);
+		elv_init_ioq(q->elevator, ioq, iog, cfqq,
+				cfqq->org_ioprio_class, cfqq->org_ioprio,
+				is_sync);
 
 		if (is_sync) {
 			if (!cfq_class_idle(cfqq))
-				cfq_mark_cfqq_idle_window(cfqq);
-			cfq_mark_cfqq_sync(cfqq);
+				elv_mark_ioq_idle_window(cfqq->ioq);
+			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
@@ -1701,38 +1316,28 @@ retry:
 	if (new_cfqq)
 		kmem_cache_free(cfq_pool, new_cfqq);
 
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
 out:
 	WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
 	return cfqq;
 }
 
-static struct cfq_queue **
-cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
-{
-	switch (ioprio_class) {
-	case IOPRIO_CLASS_RT:
-		return &cfqd->async_cfqq[0][ioprio];
-	case IOPRIO_CLASS_BE:
-		return &cfqd->async_cfqq[1][ioprio];
-	case IOPRIO_CLASS_IDLE:
-		return &cfqd->async_idle_cfqq;
-	default:
-		BUG();
-	}
-}
-
 static struct cfq_queue *
 cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-	      gfp_t gfp_mask)
+					gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
-	struct cfq_queue **async_cfqq = NULL;
+	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
+	struct io_group *iog = io_get_io_group(cfqd->queue);
 
 	if (!is_sync) {
-		async_cfqq = cfq_async_queue_prio(cfqd, ioprio_class, ioprio);
-		cfqq = *async_cfqq;
+		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
+								ioprio);
+		cfqq = async_cfqq;
 	}
 
 	if (!cfqq) {
@@ -1741,15 +1346,11 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 			return NULL;
 	}
 
-	/*
-	 * pin the queue now that it's allocated, scheduler exit will prune it
-	 */
-	if (!is_sync && !(*async_cfqq)) {
-		atomic_inc(&cfqq->ref);
-		*async_cfqq = cfqq;
-	}
+	if (!is_sync && !async_cfqq)
+		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	atomic_inc(&cfqq->ref);
+	/* ioc reference */
+	elv_get_ioq(cfqq->ioq);
 	return cfqq;
 }
 
@@ -1828,6 +1429,7 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 {
 	unsigned long flags;
 	int ret;
+	struct request_queue *q = cfqd->queue;
 
 	ret = radix_tree_preload(gfp_mask);
 	if (!ret) {
@@ -1844,9 +1446,9 @@ static int cfq_cic_link(struct cfq_data *cfqd, struct io_context *ioc,
 		radix_tree_preload_end();
 
 		if (!ret) {
-			spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+			spin_lock_irqsave(q->queue_lock, flags);
 			list_add(&cic->queue_list, &cfqd->cic_list);
-			spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
+			spin_unlock_irqrestore(q->queue_lock, flags);
 		}
 	}
 
@@ -1866,10 +1468,11 @@ cfq_get_io_context(struct cfq_data *cfqd, gfp_t gfp_mask)
 {
 	struct io_context *ioc = NULL;
 	struct cfq_io_context *cic;
+	struct request_queue *q = cfqd->queue;
 
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
-	ioc = get_io_context(gfp_mask, cfqd->queue->node);
+	ioc = get_io_context(gfp_mask, q->node);
 	if (!ioc)
 		return NULL;
 
@@ -1888,7 +1491,6 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
-
 	return cic;
 err_free:
 	cfq_cic_free(cic);
@@ -1898,17 +1500,6 @@ err:
 }
 
 static void
-cfq_update_io_thinktime(struct cfq_data *cfqd, struct cfq_io_context *cic)
-{
-	unsigned long elapsed = jiffies - cic->last_end_request;
-	unsigned long ttime = min(elapsed, 2UL * cfqd->cfq_slice_idle);
-
-	cic->ttime_samples = (7*cic->ttime_samples + 256) / 8;
-	cic->ttime_total = (7*cic->ttime_total + 256*ttime) / 8;
-	cic->ttime_mean = (cic->ttime_total + 128) / cic->ttime_samples;
-}
-
-static void
 cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 		       struct request *rq)
 {
@@ -1939,57 +1530,41 @@ cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_io_context *cic,
 }
 
 /*
- * Disable idle window if the process thinks too long or seeks so much that
- * it doesn't matter
+ * Disable idle window if the process seeks so much that it doesn't matter
  */
-static void
-cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		       struct cfq_io_context *cic)
+static int
+cfq_update_idle_window(struct elevator_queue *eq, void *cfqq,
+					struct request *rq)
 {
-	int old_idle, enable_idle;
+	struct cfq_io_context *cic = RQ_CIC(rq);
 
 	/*
-	 * Don't idle for async or idle io prio class
+	 * Enabling/disabling idling based on thinktime has been moved
+	 * to the common layer.
 	 */
-	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
-		return;
-
-	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
-
-	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
-		enable_idle = 0;
-	else if (sample_valid(cic->ttime_samples)) {
-		if (cic->ttime_mean > cfqd->cfq_slice_idle)
-			enable_idle = 0;
-		else
-			enable_idle = 1;
-	}
+	if (!atomic_read(&cic->ioc->nr_tasks) ||
+	    (elv_hw_tag(eq) && CIC_SEEKY(cic)))
+		return 0;
 
-	if (old_idle != enable_idle) {
-		cfq_log_cfqq(cfqd, cfqq, "idle=%d", enable_idle);
-		if (enable_idle)
-			cfq_mark_cfqq_idle_window(cfqq);
-		else
-			cfq_clear_cfqq_idle_window(cfqq);
-	}
+	return 1;
 }
 
 /*
  * Check if new_cfqq should preempt the currently active queue. Return 0 for
- * no or if we aren't sure, a 1 will cause a preempt.
+ * no or if we aren't sure, a 1 will cause a preemption attempt.
+ * Some of the preemption logic has been moved to the common layer. Only
+ * cfq-specific parts are left here.
  */
 static int
-cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
-		   struct request *rq)
+cfq_should_preempt(struct request_queue *q, void *new_cfqq, struct request *rq)
 {
-	struct cfq_queue *cfqq;
+	struct cfq_data *cfqd = q->elevator->elevator_data;
+	struct cfq_queue *cfqq = elv_active_sched_queue(q->elevator);
 
-	cfqq = cfqd->active_queue;
 	if (!cfqq)
 		return 0;
 
-	if (cfq_slice_used(cfqq))
+	if (elv_ioq_slice_used(cfqq->ioq))
 		return 1;
 
 	if (cfq_class_idle(new_cfqq))
@@ -2012,13 +1587,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_meta(rq) && !cfqq->meta_pending)
 		return 1;
 
-	/*
-	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
-	 */
-	if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
-		return 1;
-
-	if (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
+	if (!cfqd->active_cic || !elv_ioq_wait_request(cfqq->ioq))
 		return 0;
 
 	/*
@@ -2032,29 +1601,10 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 }
 
 /*
- * cfqq preempts the active queue. if we allowed preempt with no slice left,
- * let it have half of its nominal slice.
- */
-static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
-{
-	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
-
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
-	BUG_ON(!cfq_cfqq_on_rr(cfqq));
-
-	cfq_service_tree_add(cfqd, cfqq, 1);
-
-	cfqq->slice_end = 0;
-	cfq_mark_cfqq_slice_new(cfqq);
-}
-
-/*
  * Called when a new fs request (rq) is added (to cfqq). Check if there's
  * something we should do about it
+ * After the request is enqueued, the common layer decides whether the
+ * queue should be preempted or kicked.
  */
 static void
 cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
@@ -2062,45 +1612,12 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 {
 	struct cfq_io_context *cic = RQ_CIC(rq);
 
-	cfqd->rq_queued++;
 	if (rq_is_meta(rq))
 		cfqq->meta_pending++;
 
-	cfq_update_io_thinktime(cfqd, cic);
 	cfq_update_io_seektime(cfqd, cic, rq);
-	cfq_update_idle_window(cfqd, cfqq, cic);
 
 	cic->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
-
-	if (cfqq == cfqd->active_queue) {
-		/*
-		 * Remember that we saw a request from this process, but
-		 * don't start queuing just yet. Otherwise we risk seeing lots
-		 * of tiny requests, because we disrupt the normal plugging
-		 * and merging. If the request is already larger than a single
-		 * page, let it rip immediately. For that case we assume that
-		 * merging is already done. Ditto for a busy system that
-		 * has other work pending, don't risk delaying until the
-		 * idle timer unplug to continue working.
-		 */
-		if (cfq_cfqq_wait_request(cfqq)) {
-			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
-			    cfqd->busy_queues > 1) {
-				del_timer(&cfqd->idle_slice_timer);
-			__blk_run_queue(cfqd->queue);
-			}
-			cfq_mark_cfqq_must_dispatch(cfqq);
-		}
-	} else if (cfq_should_preempt(cfqd, cfqq, rq)) {
-		/*
-		 * not the active queue - expire current slice if it is
-		 * idle and has expired it's mean thinktime or this new queue
-		 * has some old slice time left and is of higher priority or
-		 * this new queue is RT and the current one is BE
-		 */
-		cfq_preempt_queue(cfqd, cfqq);
-		__blk_run_queue(cfqd->queue);
-	}
 }
 
 static void cfq_insert_request(struct request_queue *q, struct request *rq)
@@ -2118,81 +1635,17 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	cfq_rq_enqueued(cfqd, cfqq, rq);
 }
 
-/*
- * Update hw_tag based on peak queue depth over 50 samples under
- * sufficient load.
- */
-static void cfq_update_hw_tag(struct cfq_data *cfqd)
-{
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
-
-	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
-		return;
-
-	if (cfqd->hw_tag_samples++ < 50)
-		return;
-
-	if (cfqd->rq_in_driver_peak >= CFQ_HW_QUEUE_MIN)
-		cfqd->hw_tag = 1;
-	else
-		cfqd->hw_tag = 0;
-
-	cfqd->hw_tag_samples = 0;
-	cfqd->rq_in_driver_peak = 0;
-}
-
 static void cfq_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_queue *cfqq = RQ_CFQQ(rq);
 	struct cfq_data *cfqd = cfqq->cfqd;
-	const int sync = rq_is_sync(rq);
 	unsigned long now;
 
 	now = jiffies;
 	cfq_log_cfqq(cfqd, cfqq, "complete");
 
-	cfq_update_hw_tag(cfqd);
-
-	WARN_ON(!cfqd->rq_in_driver);
-	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
-	cfqq->dispatched--;
-
 	if (cfq_cfqq_sync(cfqq))
 		cfqd->sync_flight--;
-
-	if (sync)
-		RQ_CIC(rq)->last_end_request = now;
-
-	/*
-	 * If this is the active queue, check if it needs to be expired,
-	 * or if we want to idle in case it has no pending requests.
-	 */
-	if (cfqd->active_queue == cfqq) {
-		const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
-
-		if (cfq_cfqq_slice_new(cfqq)) {
-			cfq_set_prio_slice(cfqd, cfqq);
-			cfq_clear_cfqq_slice_new(cfqq);
-		}
-		/*
-		 * If there are no requests waiting in this queue, and
-		 * there are other queues ready to issue requests, AND
-		 * those other queues are issuing requests within our
-		 * mean seek distance, give them a chance to run instead
-		 * of idling.
-		 */
-		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
-			cfq_slice_expired(cfqd, 1);
-		else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
-			 sync && !rq_noidle(rq))
-			cfq_arm_slice_timer(cfqd);
-	}
-
-	if (!cfqd->rq_in_driver)
-		cfq_schedule_dispatch(cfqd);
 }
 
 /*
@@ -2201,30 +1654,33 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_prio_boost(struct cfq_queue *cfqq)
 {
+	struct io_queue *ioq = cfqq->ioq;
+
 	if (has_fs_excl()) {
 		/*
 		 * boost idle prio on transactions that would lock out other
 		 * users of the filesystem
 		 */
 		if (cfq_class_idle(cfqq))
-			cfqq->ioprio_class = IOPRIO_CLASS_BE;
-		if (cfqq->ioprio > IOPRIO_NORM)
-			cfqq->ioprio = IOPRIO_NORM;
+			elv_ioq_set_ioprio_class(ioq, IOPRIO_CLASS_BE);
+		if (elv_ioq_ioprio(ioq) > IOPRIO_NORM)
+			elv_ioq_set_ioprio(ioq, IOPRIO_NORM);
+
 	} else {
 		/*
 		 * check if we need to unboost the queue
 		 */
-		if (cfqq->ioprio_class != cfqq->org_ioprio_class)
-			cfqq->ioprio_class = cfqq->org_ioprio_class;
-		if (cfqq->ioprio != cfqq->org_ioprio)
-			cfqq->ioprio = cfqq->org_ioprio;
+		if (elv_ioq_ioprio_class(ioq) != cfqq->org_ioprio_class)
+			elv_ioq_set_ioprio_class(ioq, cfqq->org_ioprio_class);
+		if (elv_ioq_ioprio(ioq) != cfqq->org_ioprio)
+			elv_ioq_set_ioprio(ioq, cfqq->org_ioprio);
 	}
 }
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
-	    !cfq_cfqq_must_alloc_slice(cfqq)) {
+	if ((elv_ioq_wait_request(cfqq->ioq) ||
+	   cfq_cfqq_must_alloc(cfqq)) && !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
 	}
@@ -2276,7 +1732,7 @@ static void cfq_put_request(struct request *rq)
 		put_io_context(RQ_CIC(rq)->ioc);
 
 		rq->elevator_private = NULL;
-		rq->elevator_private2 = NULL;
+		rq->ioq = NULL;
 
 		cfq_put_queue(cfqq);
 	}
@@ -2316,119 +1772,31 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq->allocated[rw]++;
 	cfq_clear_cfqq_must_alloc(cfqq);
-	atomic_inc(&cfqq->ref);
+	elv_get_ioq(cfqq->ioq);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
+	rq->ioq = cfqq->ioq;
 	return 0;
 
 queue_fail:
 	if (cic)
 		put_io_context(cic->ioc);
 
-	cfq_schedule_dispatch(cfqd);
+	elv_schedule_dispatch(cfqd->queue);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 	cfq_log(cfqd, "set_request fail");
 	return 1;
 }
 
-static void cfq_kick_queue(struct work_struct *work)
-{
-	struct cfq_data *cfqd =
-		container_of(work, struct cfq_data, unplug_work);
-	struct request_queue *q = cfqd->queue;
-
-	spin_lock_irq(q->queue_lock);
-	__blk_run_queue(cfqd->queue);
-	spin_unlock_irq(q->queue_lock);
-}
-
-/*
- * Timer running if the active_queue is currently idling inside its time slice
- */
-static void cfq_idle_slice_timer(unsigned long data)
-{
-	struct cfq_data *cfqd = (struct cfq_data *) data;
-	struct cfq_queue *cfqq;
-	unsigned long flags;
-	int timed_out = 1;
-
-	cfq_log(cfqd, "idle timer fired");
-
-	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
-
-	cfqq = cfqd->active_queue;
-	if (cfqq) {
-		timed_out = 0;
-
-		/*
-		 * We saw a request before the queue expired, let it through
-		 */
-		if (cfq_cfqq_must_dispatch(cfqq))
-			goto out_kick;
-
-		/*
-		 * expired
-		 */
-		if (cfq_slice_used(cfqq))
-			goto expire;
-
-		/*
-		 * only expire and reinvoke request handler, if there are
-		 * other queues with pending requests
-		 */
-		if (!cfqd->busy_queues)
-			goto out_cont;
-
-		/*
-		 * not expired and it has a request pending, let it dispatch
-		 */
-		if (!RB_EMPTY_ROOT(&cfqq->sort_list))
-			goto out_kick;
-	}
-expire:
-	cfq_slice_expired(cfqd, timed_out);
-out_kick:
-	cfq_schedule_dispatch(cfqd);
-out_cont:
-	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
-}
-
-static void cfq_shutdown_timer_wq(struct cfq_data *cfqd)
-{
-	del_timer_sync(&cfqd->idle_slice_timer);
-	cancel_work_sync(&cfqd->unplug_work);
-}
-
-static void cfq_put_async_queues(struct cfq_data *cfqd)
-{
-	int i;
-
-	for (i = 0; i < IOPRIO_BE_NR; i++) {
-		if (cfqd->async_cfqq[0][i])
-			cfq_put_queue(cfqd->async_cfqq[0][i]);
-		if (cfqd->async_cfqq[1][i])
-			cfq_put_queue(cfqd->async_cfqq[1][i]);
-	}
-
-	if (cfqd->async_idle_cfqq)
-		cfq_put_queue(cfqd->async_idle_cfqq);
-}
-
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
 	struct request_queue *q = cfqd->queue;
 
-	cfq_shutdown_timer_wq(cfqd);
-
 	spin_lock_irq(q->queue_lock);
 
-	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
-
 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
 							struct cfq_io_context,
@@ -2437,12 +1805,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 		__cfq_exit_single_io_context(cfqd, cic);
 	}
 
-	cfq_put_async_queues(cfqd);
-
 	spin_unlock_irq(q->queue_lock);
-
-	cfq_shutdown_timer_wq(cfqd);
-
 	kfree(cfqd);
 }
 
@@ -2455,8 +1818,6 @@ static void *cfq_init_queue(struct request_queue *q)
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
-
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
 	 * zeroed cfqd on alloc), but better be safe in case someone decides
@@ -2469,22 +1830,12 @@ static void *cfq_init_queue(struct request_queue *q)
 
 	cfqd->queue = q;
 
-	init_timer(&cfqd->idle_slice_timer);
-	cfqd->idle_slice_timer.function = cfq_idle_slice_timer;
-	cfqd->idle_slice_timer.data = (unsigned long) cfqd;
-
-	INIT_WORK(&cfqd->unplug_work, cfq_kick_queue);
-
 	cfqd->cfq_quantum = cfq_quantum;
 	cfqd->cfq_fifo_expire[0] = cfq_fifo_expire[0];
 	cfqd->cfq_fifo_expire[1] = cfq_fifo_expire[1];
 	cfqd->cfq_back_max = cfq_back_max;
 	cfqd->cfq_back_penalty = cfq_back_penalty;
-	cfqd->cfq_slice[0] = cfq_slice_async;
-	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
-	cfqd->cfq_slice_idle = cfq_slice_idle;
-	cfqd->hw_tag = 1;
 
 	return cfqd;
 }
@@ -2549,9 +1900,6 @@ SHOW_FUNCTION(cfq_fifo_expire_sync_show, cfqd->cfq_fifo_expire[1], 1);
 SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
-SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
-SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
-SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 #undef SHOW_FUNCTION
 
@@ -2579,9 +1927,6 @@ STORE_FUNCTION(cfq_fifo_expire_async_store, &cfqd->cfq_fifo_expire[0], 1,
 STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
 		UINT_MAX, 0);
-STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
-STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 #undef STORE_FUNCTION
@@ -2595,10 +1940,10 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(fifo_expire_async),
 	CFQ_ATTR(back_seek_max),
 	CFQ_ATTR(back_seek_penalty),
-	CFQ_ATTR(slice_sync),
-	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
-	CFQ_ATTR(slice_idle),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+	ELV_ATTR(slice_async),
 	__ATTR_NULL
 };
 
@@ -2611,8 +1956,6 @@ static struct elevator_type iosched_cfq = {
 		.elevator_dispatch_fn =		cfq_dispatch_requests,
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
-		.elevator_deactivate_req_fn =	cfq_deactivate_request,
-		.elevator_queue_empty_fn =	cfq_queue_empty,
 		.elevator_completed_req_fn =	cfq_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -2622,7 +1965,15 @@ static struct elevator_type iosched_cfq = {
 		.elevator_init_fn =		cfq_init_queue,
 		.elevator_exit_fn =		cfq_exit_queue,
 		.trim =				cfq_free_io_context,
+		.elevator_free_sched_queue_fn =	cfq_free_cfq_queue,
+		.elevator_active_ioq_set_fn = 	cfq_active_ioq_set,
+		.elevator_active_ioq_reset_fn =	cfq_active_ioq_reset,
+		.elevator_arm_slice_timer_fn = 	cfq_arm_slice_timer,
+		.elevator_should_preempt_fn = 	cfq_should_preempt,
+		.elevator_update_idle_window_fn = cfq_update_idle_window,
+		.elevator_close_cooperator_fn = cfq_close_cooperator,
 	},
+	.elevator_features =    ELV_IOSCHED_NEED_FQ,
 	.elevator_attrs =	cfq_attrs,
 	.elevator_name =	"cfq",
 	.elevator_owner =	THIS_MODULE,
@@ -2630,14 +1981,6 @@ static struct elevator_type iosched_cfq = {
 
 static int __init cfq_init(void)
 {
-	/*
-	 * could be 0 on HZ < 1000 setups
-	 */
-	if (!cfq_slice_async)
-		cfq_slice_async = 1;
-	if (!cfq_slice_idle)
-		cfq_slice_idle = 1;
-
 	if (cfq_slab_setup())
 		return -ENOMEM;
 
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index dd05434..1482b20 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -39,13 +39,8 @@ struct cfq_io_context {
 
 	struct io_context *ioc;
 
-	unsigned long last_end_request;
 	sector_t last_request_pos;
 
-	unsigned long ttime_total;
-	unsigned long ttime_samples;
-	unsigned long ttime_mean;
-
 	unsigned int seek_samples;
 	u64 seek_total;
 	sector_t seek_mean;
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 07/25] io-controller: core bfq scheduler changes for hierarchical setup
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 06/25] io-controller: Modify cfq to make use of flat elevator fair queuing Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 08/25] io-controller: cgroup related changes for hierarchical group support Vivek Goyal
                     ` (20 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o Some of the core bfq scheduler changes for hierarchical groups.

Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c |  169 ++++++++++++++++++++++++++++++++++++++++++++++-----
 block/elevator-fq.h |    4 +
 init/Kconfig        |    8 +++
 3 files changed, 165 insertions(+), 16 deletions(-)
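
A note for reviewers, not part of the patch: the recurring idea in this
change is that work done on a leaf entity is repeated on every ancestor by
following entity->parent. Below is a minimal user-space sketch of that walk
as applied to service accounting. All toy_* names are hypothetical
stand-ins, the structures are heavily simplified, and the fixed-point
scaling in toy_delta() is an assumption about what bfq_delta() does, not a
copy of it.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_SERVICE_SHIFT 22	/* same value as WFQ_SERVICE_SHIFT in the patch */

/* Hypothetical, simplified stand-in for struct io_service_tree. */
struct toy_tree {
	uint64_t vtime;		/* virtual time of this scheduling level */
	unsigned long wsum;	/* sum of the weights of the active entities */
};

/* Hypothetical, simplified stand-in for struct io_entity. */
struct toy_entity {
	struct toy_entity *parent;	/* NULL for a top-level entity */
	struct toy_tree *st;		/* tree this entity is queued on */
	unsigned long service;		/* service received so far */
};

/* Same shape as the CONFIG_GROUP_IOSCHED for_each_entity() macro. */
#define toy_for_each_entity(entity) \
	for (; (entity) != NULL; (entity) = (entity)->parent)

/* Assumed scaling: service divided by total weight, in fixed point. */
static uint64_t toy_delta(unsigned long served, unsigned long wsum)
{
	return ((uint64_t)served << TOY_SERVICE_SHIFT) / wsum;
}

/* Charge 'served' to the entity and to each ancestor up to the top. */
static void toy_entity_served(struct toy_entity *entity, unsigned long served)
{
	toy_for_each_entity(entity) {
		entity->service += served;
		/* The patch uses BUG_ON(wsum == 0); the sketch just skips. */
		if (entity->st->wsum != 0)
			entity->st->vtime += toy_delta(served, entity->st->wsum);
	}
}
```

With a queue entity whose parent is a group entity, a single call to
toy_entity_served() on the queue adds the same amount of service to both and
advances the virtual time of each level's tree according to that level's
weight sum.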

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 67c02b9..0acfa2c 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -42,6 +42,69 @@ static int elv_rate_sampling_window = HZ / 10;
  */
 #define WFQ_SERVICE_SHIFT	22
 
+#ifdef CONFIG_GROUP_IOSCHED
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = entity->parent)
+
+#define for_each_entity_safe(entity, parent) \
+	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
+
+
+static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
+						 int extract);
+
+static int bfq_update_next_active(struct io_sched_data *sd)
+{
+	struct io_group *iog;
+	struct io_entity *entity, *next_active;
+
+	if (sd->active_entity != NULL)
+		/* will update/requeue at the end of service */
+		return 0;
+
+	/*
+	 * NOTE: this can be improved in many ways, such as returning
+	 * 1 (and thus propagating the update upwards) only when the
+	 * budget changes, or caching the bfqq that will be scheduled
+	 * next from this subtree.  For now we worry more about
+	 * correctness than about performance...
+	 */
+	next_active = bfq_lookup_next_entity(sd, 0);
+	sd->next_active = next_active;
+
+	if (next_active != NULL) {
+		iog = container_of(sd, struct io_group, sched_data);
+		entity = iog->my_entity;
+		if (entity != NULL)
+			entity->budget = next_active->budget;
+	}
+
+	return 1;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+					 struct io_entity *entity)
+{
+	BUG_ON(sd->next_active != entity);
+}
+#else /* GROUP_IOSCHED */
+#define for_each_entity(entity)	\
+	for (; entity != NULL; entity = NULL)
+
+#define for_each_entity_safe(entity, parent) \
+	for (parent = NULL; entity != NULL; entity = parent)
+
+static inline int bfq_update_next_active(struct io_sched_data *sd)
+{
+	return 0;
+}
+
+static inline void bfq_check_next_active(struct io_sched_data *sd,
+					 struct io_entity *entity)
+{
+}
+#endif /* GROUP_IOSCHED */
+
 static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
 					unsigned short prio)
 {
@@ -587,8 +650,10 @@ static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 		entity = __bfq_lookup_next_entity(st);
 		if (entity != NULL) {
 			if (extract) {
+				bfq_check_next_active(sd, entity);
 				bfq_active_remove(st, entity);
 				sd->active_entity = entity;
+				sd->next_active = NULL;
 			}
 			break;
 		}
@@ -661,11 +726,8 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 	if (add_front) {
 		struct io_entity *next_entity;
 
-		/*
-		 * Determine the entity which will be dispatched next
-		 * Use sd->next_active once hierarchical patch is applied
-		 */
-		next_entity = bfq_lookup_next_entity(sd, 0);
+		/* Determine the entity which will be dispatched next */
+		next_entity = sd->next_active;
 
 		if (next_entity && next_entity != entity) {
 			struct io_service_tree *new_st;
@@ -697,7 +759,21 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
  */
 static void bfq_activate_entity(struct io_entity *entity, int add_front)
 {
-	__bfq_activate_entity(entity, add_front);
+	struct io_sched_data *sd;
+
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity, add_front);
+
+		add_front = 0;
+		sd = entity->sched_data;
+		if (!bfq_update_next_active(sd))
+			/*
+			 * No need to propagate the activation to the
+			 * upper entities, as they will be updated when
+			 * the active entity is rescheduled.
+			 */
+			break;
+	}
 }
 
 /**
@@ -732,6 +808,8 @@ static int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 		bfq_idle_remove(st, entity);
 	else if (entity->tree != NULL)
 		BUG();
+	if (was_active || sd->next_active == entity)
+		ret = bfq_update_next_active(sd);
 
 	if (!requeue || !bfq_gt(entity->finish, st->vtime))
 		bfq_forget_entity(st, entity);
@@ -739,6 +817,7 @@ static int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 		bfq_idle_insert(st, entity);
 
 	BUG_ON(sd->active_entity == entity);
+	BUG_ON(sd->next_active == entity);
 
 	return ret;
 }
@@ -750,18 +829,62 @@ static int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
  */
 static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 {
-	__bfq_deactivate_entity(entity, requeue);
+	struct io_sched_data *sd;
+	struct io_entity *parent;
+
+	for_each_entity_safe(entity, parent) {
+		sd = entity->sched_data;
+
+		if (!__bfq_deactivate_entity(entity, requeue))
+			/*
+			 * The parent entity is still backlogged, and
+			 * we don't need to update it as it is still
+			 * under service.
+			 */
+			break;
+
+		if (sd->next_active != NULL) {
+			/*
+			 * The parent entity is still backlogged and
+			 * the budgets on the path towards the root
+			 * need to be updated.
+			 */
+			goto update;
+		}
+
+		/*
+		 * If we reach here the parent is no longer backlogged and
+		 * we want to propagate the dequeue upwards.
+		 *
+		 */
+
+		requeue = 1;
+	}
+
+	return;
+
+update:
+	entity = parent;
+	for_each_entity(entity) {
+		__bfq_activate_entity(entity, 0);
+
+		sd = entity->sched_data;
+		if (!bfq_update_next_active(sd))
+			break;
+	}
 }
 
 static void entity_served(struct io_entity *entity, unsigned long served)
 {
 	struct io_service_tree *st;
 
-	st = io_entity_service_tree(entity);
-	entity->service += served;
-	BUG_ON(st->wsum == 0);
-	st->vtime += bfq_delta(served, st->wsum);
-	bfq_forget_idle(st);
+	for_each_entity(entity) {
+		st = io_entity_service_tree(entity);
+		entity->service += served;
+		BUG_ON(st->wsum == 0);
+		st->vtime += bfq_delta(served, st->wsum);
+		bfq_forget_idle(st);
+	}
 }
 
 /**
@@ -1154,11 +1277,25 @@ static struct io_queue *elv_get_next_ioq(struct request_queue *q, int extract)
 		return NULL;
 
 	sd = &efqd->root_group->sched_data;
-	entity = bfq_lookup_next_entity(sd, 1);
 
-	BUG_ON(!entity);
-	if (extract)
-		entity->service = 0;
+	for (; sd != NULL; sd = entity->my_sched_data) {
+		entity = bfq_lookup_next_entity(sd, 1);
+		/*
+		 * entity can be NULL even though there are busy queues,
+		 * if all the busy queues are under a group which is
+		 * currently under service.
+		 * So if we are just looking for the next ioq while something
+		 * is being served, a NULL entity is not an error.
+		 */
+		BUG_ON(!entity && extract);
+
+		if (extract)
+			entity->service = 0;
+
+		if (!entity)
+			return NULL;
+	}
+
 	ioq = io_entity_to_ioq(entity);
 
 	return ioq;
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 4b69239..57207c4 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -72,6 +72,7 @@ struct io_service_tree {
  */
 struct io_sched_data {
 	struct io_entity *active_entity;
+	struct io_entity *next_active;
 	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
 };
 
@@ -181,7 +182,10 @@ struct io_queue {
 };
 
 struct io_group {
+	struct io_entity entity;
 	struct io_sched_data sched_data;
+	struct io_entity *my_entity;
+
 	/*
 	 * async queue for each priority case for RT and BE class.
 	 * Used only for cfq.
diff --git a/init/Kconfig b/init/Kconfig
index 1ce05a4..a380f46 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -612,6 +612,14 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
 	  size is 4096bytes, 512k per 1Gbytes of swap.
 
+config GROUP_IOSCHED
+	bool "Group IO Scheduler"
+	depends on CGROUPS && ELV_FAIR_QUEUING
+	default n
+	---help---
+	  This feature lets the IO scheduler recognize task groups and control
+	  disk bandwidth allocation to such task groups.
+
 endif # CGROUPS
 
 config MM_OWNER
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 08/25] io-controller: cgroup related changes for hierarchical group support
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 07/25] io-controller: core bfq scheduler changes for hierarchical setup Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 09/25] io-controller: Common hierarchical fair queuing code in elevaotor layer Vivek Goyal
                     ` (19 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o This patch introduces some of the cgroup-related code for the IO controller.

Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-ioc.c               |    3 +
 block/elevator-fq.c           |  174 +++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h           |   43 ++++++++++-
 include/linux/cgroup_subsys.h |    6 ++
 include/linux/iocontext.h     |    5 +
 5 files changed, 230 insertions(+), 1 deletions(-)
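
A note for reviewers, not part of the patch: two small rules carry most of
the cgroup glue here. A task may move to another cgroup only if it is the
sole owner of its io context, and a new weight is published to a group by
storing the value first and raising a "changed" flag afterwards. The sketch
below restates both in plain user-space C. All toy_* names and structures
are hypothetical, the locking and the smp_wmb() of the real code are left
out because the sketch is single-threaded, and the weight limits are passed
in by the caller since the real bounds are not shown in this patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct io_context. */
struct toy_ioc {
	int nr_tasks;		/* number of tasks sharing this context */
	int cgroup_changed;	/* set when the owner moved to a new cgroup */
};

/* Hypothetical, simplified stand-in for the entity inside an io_group. */
struct toy_group {
	unsigned long new_weight;	/* value to pick up at next reschedule */
	int ioprio_changed;		/* tells the scheduler to re-read it */
};

/* Refuse the move when the io context is shared by more than one task. */
static int toy_can_attach(const struct toy_ioc *ioc)
{
	if (ioc != NULL && ioc->nr_tasks > 1)
		return -EINVAL;
	return 0;	/* no ioc yet, or a private one: the move is allowed */
}

/* After a successful move, only mark the context; nothing else is touched. */
static void toy_attach(struct toy_ioc *ioc)
{
	if (ioc != NULL)
		ioc->cgroup_changed = 1;
}

/* Range-check the value, store it, then raise the flag (in that order). */
static int toy_weight_write(struct toy_group *grp, unsigned long long val,
			    unsigned long long min, unsigned long long max)
{
	if (val < min || val > max)
		return -EINVAL;

	grp->new_weight = (unsigned long)val;
	grp->ioprio_changed = 1;
	return 0;
}
```

Storing the value before raising the flag means a reader that sees the flag
set can rely on the new value being in place; in the kernel code that
ordering is what the write barrier between the two stores is there to
enforce.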

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index d4ed600..0d56336 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -95,6 +95,9 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 		spin_lock_init(&ret->lock);
 		ret->ioprio_changed = 0;
 		ret->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+		ret->cgroup_changed = 0;
+#endif
 		ret->last_waited = jiffies; /* doesn't matter... */
 		ret->nr_batch_requests = 0; /* because this is 0 */
 		ret->aic = NULL;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 0acfa2c..84276d5 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -29,6 +29,9 @@ static int elv_rate_sampling_window = HZ / 10;
 #define ELV_SLICE_SCALE		(5)
 #define ELV_HW_QUEUE_MIN	(5)
 
+#define IO_DEFAULT_GRP_WEIGHT  500
+#define IO_DEFAULT_GRP_CLASS   IOPRIO_CLASS_BE
+
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
@@ -899,6 +902,177 @@ static void io_flush_idle_tree(struct io_service_tree *st)
 		__bfq_deactivate_entity(entity, 0);
 }
 
+/* Mainly hierarchical grouping code */
+#ifdef CONFIG_GROUP_IOSCHED
+
+struct io_cgroup io_root_cgroup = {
+	.weight = IO_DEFAULT_GRP_WEIGHT,
+	.ioprio_class = IO_DEFAULT_GRP_CLASS,
+};
+
+static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
+{
+	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
+			    struct io_cgroup, css);
+}
+
+#define SHOW_FUNCTION(__VAR)						\
+static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
+				       struct cftype *cftype)		\
+{									\
+	struct io_cgroup *iocg;					\
+	u64 ret;							\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+	spin_lock_irq(&iocg->lock);					\
+	ret = iocg->__VAR;						\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return ret;							\
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
+static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
+					struct cftype *cftype,		\
+					u64 val)			\
+{									\
+	struct io_cgroup *iocg;					\
+	struct io_group *iog;						\
+	struct hlist_node *n;						\
+									\
+	if (val < (__MIN) || val > (__MAX))				\
+		return -EINVAL;						\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+									\
+	spin_lock_irq(&iocg->lock);					\
+	iocg->__VAR = (unsigned long)val;				\
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		iog->entity.new_##__VAR = (unsigned long)val;		\
+		smp_wmb();						\
+		iog->entity.ioprio_changed = 1;				\
+	}								\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return 0;							\
+}
+
+STORE_FUNCTION(weight, 1, WEIGHT_MAX);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+struct cftype bfqio_files[] = {
+	{
+		.name = "weight",
+		.read_u64 = io_cgroup_weight_read,
+		.write_u64 = io_cgroup_weight_write,
+	},
+	{
+		.name = "ioprio_class",
+		.read_u64 = io_cgroup_ioprio_class_read,
+		.write_u64 = io_cgroup_ioprio_class_write,
+	},
+};
+
+static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	return cgroup_add_files(cgroup, subsys, bfqio_files,
+				ARRAY_SIZE(bfqio_files));
+}
+
+static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
+						struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+
+	if (cgroup->parent != NULL) {
+		iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
+		if (iocg == NULL)
+			return ERR_PTR(-ENOMEM);
+	} else
+		iocg = &io_root_cgroup;
+
+	spin_lock_init(&iocg->lock);
+	INIT_HLIST_HEAD(&iocg->group_data);
+	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
+	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+
+	return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic/bfqq data structures.  For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+static int iocg_can_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			    struct task_struct *tsk)
+{
+	struct io_context *ioc;
+	int ret = 0;
+
+	/* task_lock() is needed to avoid races with exit_io_context() */
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+		/*
+		 * ioc == NULL means that the task is either too young or
+		 * exiting: if it has still no ioc the ioc can't be shared,
+		 * if the task is exiting the attach will fail anyway, no
+		 * matter what we return here.
+		 */
+		ret = -EINVAL;
+	task_unlock(tsk);
+
+	return ret;
+}
+
+static void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			 struct cgroup *prev, struct task_struct *tsk)
+{
+	struct io_context *ioc;
+
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL)
+		ioc->cgroup_changed = 1;
+	task_unlock(tsk);
+}
+
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+
+	/* Implemented in later patch */
+}
+
+struct cgroup_subsys io_subsys = {
+	.name = "io",
+	.create = iocg_create,
+	.can_attach = iocg_can_attach,
+	.attach = iocg_attach,
+	.destroy = iocg_destroy,
+	.populate = iocg_populate,
+	.subsys_id = io_subsys_id,
+	.use_id = 1,
+};
+#endif /* GROUP_IOSCHED */
 /* Elevator fair queuing function */
 static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
 {
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 57207c4..d9acb75 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -13,11 +13,13 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/cgroup.h>
 
 #ifndef _BFQ_SCHED_H
 #define _BFQ_SCHED_H
 
 #define IO_IOPRIO_CLASSES	3
+#define WEIGHT_MAX             1000
 
 struct io_entity;
 struct io_queue;
@@ -88,7 +90,7 @@ struct io_sched_data {
  *             this entity; used for O(log N) lookups into active trees.
  * @service: service received during the last round of service.
  * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
- * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @weight: the weight in use.
  * @new_weight: when a weight change is requested, the new weight value
  * @parent: parent entity, for hierarchical scheduling.
  * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
@@ -181,8 +183,10 @@ struct io_queue {
 	void *sched_queue;
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
 struct io_group {
 	struct io_entity entity;
+	struct hlist_node group_node;
 	struct io_sched_data sched_data;
 	struct io_entity *my_entity;
 
@@ -199,8 +203,45 @@ struct io_group {
 	 * non-RT cfqq in service when this value is non-zero.
 	 */
 	unsigned int busy_rt_queues;
+	unsigned short iocg_id;
 };
 
+/**
+ * struct io_cgroup - io cgroup data structure.
+ * @css: subsystem state for io in the containing cgroup.
+ * @weight: cgroup weight.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @weight, @ioprio_class and @group_data.
+ * @group_data: list containing the io_group belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates,
+ * @weight and @ioprio_class are protected by @lock.
+ */
+struct io_cgroup {
+	struct cgroup_subsys_state css;
+
+	unsigned int weight;
+	unsigned short ioprio_class;
+
+	spinlock_t lock;
+	struct hlist_head group_data;
+};
+#else
+struct io_group {
+	struct io_sched_data sched_data;
+
+	/* async_queue and idle_queue are used only for cfq */
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+
+	/*
+	 * Used to track any pending rt requests so we can pre-empt current
+	 * non-RT cfqq in service when this value is non-zero.
+	 */
+	unsigned int busy_rt_queues;
+};
+#endif /* CONFIG_GROUP_IOSCHED */
+
 struct elv_fq_data {
 	struct io_group *root_group;
 
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..baf544f 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,9 @@ SUBSYS(net_cls)
 #endif
 
 /* */
+
+#ifdef CONFIG_GROUP_IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 1482b20..ccecf53 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -68,6 +68,11 @@ struct io_context {
 	unsigned short ioprio;
 	unsigned short ioprio_changed;
 
+#ifdef CONFIG_GROUP_IOSCHED
+	/* If task changes the cgroup, elevator processes it asynchronously */
+	unsigned short cgroup_changed;
+#endif
+
 	/*
 	 * For request batching
 	 */
-- 
1.6.0.6


* [PATCH 08/25] io-controller: cgroup related changes for hierarchical group support
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

o This patch introduces some of the cgroup related code for io controller.

Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-ioc.c               |    3 +
 block/elevator-fq.c           |  174 +++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h           |   43 ++++++++++-
 include/linux/cgroup_subsys.h |    6 ++
 include/linux/iocontext.h     |    5 +
 5 files changed, 230 insertions(+), 1 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index d4ed600..0d56336 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -95,6 +95,9 @@ struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 		spin_lock_init(&ret->lock);
 		ret->ioprio_changed = 0;
 		ret->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+		ret->cgroup_changed = 0;
+#endif
 		ret->last_waited = jiffies; /* doesn't matter... */
 		ret->nr_batch_requests = 0; /* because this is 0 */
 		ret->aic = NULL;
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 0acfa2c..84276d5 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -29,6 +29,9 @@ static int elv_rate_sampling_window = HZ / 10;
 #define ELV_SLICE_SCALE		(5)
 #define ELV_HW_QUEUE_MIN	(5)
 
+#define IO_DEFAULT_GRP_WEIGHT  500
+#define IO_DEFAULT_GRP_CLASS   IOPRIO_CLASS_BE
+
 #define IO_SERVICE_TREE_INIT   ((struct io_service_tree)		\
 				{ RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
 
@@ -899,6 +902,177 @@ static void io_flush_idle_tree(struct io_service_tree *st)
 		__bfq_deactivate_entity(entity, 0);
 }
 
+/* Mainly hierarchical grouping code */
+#ifdef CONFIG_GROUP_IOSCHED
+
+struct io_cgroup io_root_cgroup = {
+	.weight = IO_DEFAULT_GRP_WEIGHT,
+	.ioprio_class = IO_DEFAULT_GRP_CLASS,
+};
+
+static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
+{
+	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
+			    struct io_cgroup, css);
+}
+
+#define SHOW_FUNCTION(__VAR)						\
+static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
+				       struct cftype *cftype)		\
+{									\
+	struct io_cgroup *iocg;					\
+	u64 ret;							\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+	spin_lock_irq(&iocg->lock);					\
+	ret = iocg->__VAR;						\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return ret;							\
+}
+
+SHOW_FUNCTION(weight);
+SHOW_FUNCTION(ioprio_class);
+#undef SHOW_FUNCTION
+
+#define STORE_FUNCTION(__VAR, __MIN, __MAX)				\
+static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
+					struct cftype *cftype,		\
+					u64 val)			\
+{									\
+	struct io_cgroup *iocg;					\
+	struct io_group *iog;						\
+	struct hlist_node *n;						\
+									\
+	if (val < (__MIN) || val > (__MAX))				\
+		return -EINVAL;						\
+									\
+	if (!cgroup_lock_live_group(cgroup))				\
+		return -ENODEV;						\
+									\
+	iocg = cgroup_to_io_cgroup(cgroup);				\
+									\
+	spin_lock_irq(&iocg->lock);					\
+	iocg->__VAR = (unsigned long)val;				\
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		iog->entity.new_##__VAR = (unsigned long)val;		\
+		smp_wmb();						\
+		iog->entity.ioprio_changed = 1;				\
+	}								\
+	spin_unlock_irq(&iocg->lock);					\
+									\
+	cgroup_unlock();						\
+									\
+	return 0;							\
+}
+
+STORE_FUNCTION(weight, 1, WEIGHT_MAX);
+STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
+#undef STORE_FUNCTION
+
+struct cftype bfqio_files[] = {
+	{
+		.name = "weight",
+		.read_u64 = io_cgroup_weight_read,
+		.write_u64 = io_cgroup_weight_write,
+	},
+	{
+		.name = "ioprio_class",
+		.read_u64 = io_cgroup_ioprio_class_read,
+		.write_u64 = io_cgroup_ioprio_class_write,
+	},
+};
+
+static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	return cgroup_add_files(cgroup, subsys, bfqio_files,
+				ARRAY_SIZE(bfqio_files));
+}
+
+static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
+						struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+
+	if (cgroup->parent != NULL) {
+		iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
+		if (iocg == NULL)
+			return ERR_PTR(-ENOMEM);
+	} else
+		iocg = &io_root_cgroup;
+
+	spin_lock_init(&iocg->lock);
+	INIT_HLIST_HEAD(&iocg->group_data);
+	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
+	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+
+	return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic/bfqq data structures.  For now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE_IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+static int iocg_can_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			    struct task_struct *tsk)
+{
+	struct io_context *ioc;
+	int ret = 0;
+
+	/* task_lock() is needed to avoid races with exit_io_context() */
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
+		/*
+		 * ioc == NULL means that the task is either too young or
+		 * exiting: if it has still no ioc the ioc can't be shared,
+		 * if the task is exiting the attach will fail anyway, no
+		 * matter what we return here.
+		 */
+		ret = -EINVAL;
+	task_unlock(tsk);
+
+	return ret;
+}
+
+static void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
+			 struct cgroup *prev, struct task_struct *tsk)
+{
+	struct io_context *ioc;
+
+	task_lock(tsk);
+	ioc = tsk->io_context;
+	if (ioc != NULL)
+		ioc->cgroup_changed = 1;
+	task_unlock(tsk);
+}
+
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+
+	/* Implemented in later patch */
+}
+
+struct cgroup_subsys io_subsys = {
+	.name = "io",
+	.create = iocg_create,
+	.can_attach = iocg_can_attach,
+	.attach = iocg_attach,
+	.destroy = iocg_destroy,
+	.populate = iocg_populate,
+	.subsys_id = io_subsys_id,
+	.use_id = 1,
+};
+#endif /* GROUP_IOSCHED */
 /* Elevator fair queuing function */
 static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
 {
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 57207c4..d9acb75 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -13,11 +13,13 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/cgroup.h>
 
 #ifndef _BFQ_SCHED_H
 #define _BFQ_SCHED_H
 
 #define IO_IOPRIO_CLASSES	3
+#define WEIGHT_MAX             1000
 
 struct io_entity;
 struct io_queue;
@@ -88,7 +90,7 @@ struct io_sched_data {
  *             this entity; used for O(log N) lookups into active trees.
  * @service: service received during the last round of service.
  * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
- * @weight: weight of the queue, calculated as IOPRIO_BE_NR - @ioprio.
+ * @weight: the weight in use.
  * @new_weight: when a weight change is requested, the new weight value
  * @parent: parent entity, for hierarchical scheduling.
  * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
@@ -181,8 +183,10 @@ struct io_queue {
 	void *sched_queue;
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
 struct io_group {
 	struct io_entity entity;
+	struct hlist_node group_node;
 	struct io_sched_data sched_data;
 	struct io_entity *my_entity;
 
@@ -199,8 +203,45 @@ struct io_group {
 	 * non-RT cfqq in service when this value is non-zero.
 	 */
 	unsigned int busy_rt_queues;
+	unsigned short iocg_id;
 };
 
+/**
+ * struct io_cgroup - io cgroup data structure.
+ * @css: subsystem state for io in the containing cgroup.
+ * @weight: cgroup weight.
+ * @ioprio_class: cgroup ioprio_class.
+ * @lock: spinlock that protects @weight, @ioprio_class and @group_data.
+ * @group_data: list containing the io_group belonging to this cgroup.
+ *
+ * @group_data is accessed using RCU, with @lock protecting the updates,
+ * @weight and @ioprio_class are protected by @lock.
+ */
+struct io_cgroup {
+	struct cgroup_subsys_state css;
+
+	unsigned int weight;
+	unsigned short ioprio_class;
+
+	spinlock_t lock;
+	struct hlist_head group_data;
+};
+#else
+struct io_group {
+	struct io_sched_data sched_data;
+
+	/* async_queue and idle_queue are used only for cfq */
+	struct io_queue *async_queue[2][IOPRIO_BE_NR];
+	struct io_queue *async_idle_queue;
+
+	/*
+	 * Used to track any pending rt requests so we can pre-empt current
+	 * non-RT cfqq in service when this value is non-zero.
+	 */
+	unsigned int busy_rt_queues;
+};
+#endif /* CONFIG_GROUP_IOSCHED */
+
 struct elv_fq_data {
 	struct io_group *root_group;
 
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 9c8d31b..baf544f 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -60,3 +60,9 @@ SUBSYS(net_cls)
 #endif
 
 /* */
+
+#ifdef CONFIG_GROUP_IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 1482b20..ccecf53 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -68,6 +68,11 @@ struct io_context {
 	unsigned short ioprio;
 	unsigned short ioprio_changed;
 
+#ifdef CONFIG_GROUP_IOSCHED
+	/* If task changes the cgroup, elevator processes it asynchronously */
+	unsigned short cgroup_changed;
+#endif
+
 	/*
 	 * For request batching
 	 */
-- 
1.6.0.6



* [PATCH 09/25] io-controller: Common hierarchical fair queuing code in elevator layer
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 08/25] io-controller: cgroup related changes for hierarchical group support Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 10/25] io-controller: cfq changes to use " Vivek Goyal
                     ` (18 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o This patch enables hierarchical fair queuing in common layer. It is
  controlled by config option CONFIG_GROUP_IOSCHED.

o Requests keep a reference on ioq, and ioq keeps a reference
  on its group. For async queues in CFQ, and single ioq in other
  schedulers, io_group also keeps a reference on io_queue. This
  reference on ioq is dropped when the queue is released
  (elv_release_ioq), so the queue can be freed.

  When a queue is released, it puts the reference to io_group and the
  io_group is released after all the queues are released. Child groups
  also take reference on parent groups, and release it when they are
  destroyed.
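
  The reference chain described above can be sketched in userspace C. This
  is only an illustrative model, not code from the patch: the sk_* names and
  the freed counters are hypothetical, plain ints stand in for the kernel's
  reference counts, and no locking is modelled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model; the sk_* names are not from the patch. */
struct sk_group {
	int ref;			/* creator + queues + child groups */
	struct sk_group *parent;
};

struct sk_queue {
	int ref;			/* creator + in-flight requests */
	struct sk_group *group;
};

static int sk_groups_freed;		/* demonstration counters only */
static int sk_queues_freed;

/* A new group starts with its creator's reference and pins its parent. */
static struct sk_group *sk_group_new(struct sk_group *parent)
{
	struct sk_group *g = calloc(1, sizeof(*g));

	assert(g != NULL);
	g->ref = 1;
	g->parent = parent;
	if (parent)
		parent->ref++;
	return g;
}

/* Dropping the last reference frees the group and unpins its parent. */
static void sk_group_put(struct sk_group *g)
{
	while (g != NULL) {
		struct sk_group *parent = g->parent;

		assert(g->ref > 0);
		if (--g->ref > 0)
			break;
		free(g);
		sk_groups_freed++;
		g = parent;
	}
}

/* A new queue starts with its creator's reference and pins its group. */
static struct sk_queue *sk_queue_new(struct sk_group *g)
{
	struct sk_queue *q = calloc(1, sizeof(*q));

	assert(q != NULL);
	q->ref = 1;
	q->group = g;
	g->ref++;
	return q;
}

/* Each request takes a reference on the queue it belongs to. */
static void sk_queue_get(struct sk_queue *q)
{
	q->ref++;
}

/* Releasing the last queue reference also drops the group reference. */
static void sk_queue_put(struct sk_queue *q)
{
	assert(q->ref > 0);
	if (--q->ref > 0)
		return;
	sk_group_put(q->group);
	free(q);
	sk_queues_freed++;
}
```

  In this model a group whose creator has already dropped its reference
  stays alive until its last queue is released, and a parent outlives all
  of its children.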

o Reads of iocg->group_data are not always done under iocg->lock, so all
  the operations on that list are still protected by RCU. All modifications
  to iocg->group_data should always be done under iocg->lock.

  Whenever iocg->lock and queue_lock can both be held, queue_lock should
  be held first. This keeps the lock ordering consistent and avoids
  deadlocks between the two. In order to avoid a race between cgroup
  deletion and elevator switch, the following algorithm is used:

	- Cgroup deletion path holds iocg->lock and removes iog entry
	  from iocg->group_data list. Then it drops iocg->lock, holds
	  queue_lock and destroys iog. So in this path, we never hold
	  iocg->lock and queue_lock at the same time. Also, since we
	  remove iog from iocg->group_data under iocg->lock, we can't
	  race with elevator switch.

	- Elevator switch path does not remove iog from
	  iocg->group_data list directly. It first holds iocg->lock,
	  scans iocg->group_data again to see if iog is still there;
	  it removes iog only if it finds iog there. Otherwise, cgroup
	  deletion must have removed it from the list, and cgroup
	  deletion is responsible for removing iog.

  So the path which removes iog from iocg->group_data list does
  the final removal of iog by calling __io_destroy_group()
  function.
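
  The two paths can be modelled in userspace C as below. This is a sketch
  under stated assumptions, not the patch's implementation: the sk_* names
  are hypothetical, boolean flags with assertions stand in for the real
  spinlocks, RCU is not modelled, and it is assumed that the elevator
  switch path already holds queue_lock when it takes iocg->lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; a flag plus assertions replaces a real lock. */
struct sk_lock { bool held; };

struct sk_iog {
	struct sk_iog *next;
	bool destroyed;
};

struct sk_iocg {
	struct sk_lock lock;		/* models iocg->lock */
	struct sk_iog *group_data;	/* models iocg->group_data */
};

static struct sk_lock sk_queue_lock;	/* models queue_lock */

static void sk_acquire(struct sk_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void sk_release(struct sk_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Ordering rule: queue_lock is never taken while iocg->lock is held. */
static void sk_acquire_queue_lock(const struct sk_iocg *iocg)
{
	assert(!iocg->lock.held);
	sk_acquire(&sk_queue_lock);
}

/* Unlink iog if it is still listed; true means this caller removed it. */
static bool sk_unlink(struct sk_iocg *iocg, struct sk_iog *iog)
{
	struct sk_iog **pp;

	assert(iocg->lock.held);
	for (pp = &iocg->group_data; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == iog) {
			*pp = iog->next;
			iog->next = NULL;
			return true;
		}
	}
	return false;
}

/* Final teardown; must happen exactly once, under queue_lock. */
static void sk_destroy(struct sk_iog *iog)
{
	assert(sk_queue_lock.held);
	assert(!iog->destroyed);
	iog->destroyed = true;
}

/* Cgroup deletion: unlink under iocg->lock, drop it, then destroy. */
static void sk_cgroup_delete(struct sk_iocg *iocg)
{
	for (;;) {
		struct sk_iog *iog;

		sk_acquire(&iocg->lock);
		iog = iocg->group_data;
		if (iog != NULL)
			sk_unlink(iocg, iog);
		sk_release(&iocg->lock);
		if (iog == NULL)
			break;

		sk_acquire_queue_lock(iocg);
		sk_destroy(iog);
		sk_release(&sk_queue_lock);
	}
}

/* Elevator switch: destroy iog only if this path unlinked it. */
static bool sk_elevator_switch(struct sk_iocg *iocg, struct sk_iog *iog)
{
	bool mine;

	sk_acquire_queue_lock(iocg);
	sk_acquire(&iocg->lock);
	mine = sk_unlink(iocg, iog);
	sk_release(&iocg->lock);
	if (mine)
		sk_destroy(iog);
	sk_release(&sk_queue_lock);
	return mine;
}
```

  Whichever path wins the unlink owns the final destroy, so the assertion
  in sk_destroy() would trip if both paths ever tore down the same group.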

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/cfq-iosched.c |    2 +
 block/elevator-fq.c |  885 ++++++++++++++++++++++++++++++++++++++++++++++-----
 block/elevator-fq.h |   93 ++++++-
 block/elevator.c    |    4 +
 4 files changed, 906 insertions(+), 78 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f852b00..6ddc882 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1310,6 +1310,8 @@ alloc_ioq:
 			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
+		/* ioq reference on iog */
+		elv_get_iog(iog);
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
 	}
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 84276d5..f8d0b90 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -45,6 +45,9 @@ static int elv_rate_sampling_window = HZ / 10;
  */
 #define WFQ_SERVICE_SHIFT	22
 
+static void
+elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+
 #ifdef CONFIG_GROUP_IOSCHED
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = entity->parent)
@@ -90,6 +93,69 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
 {
 	BUG_ON(sd->next_active != entity);
 }
+
+static inline int iog_deleting(struct io_group *iog)
+{
+	return iog->deleting;
+}
+
+/* Do the two (enqueued) entities belong to the same group ? */
+static inline int
+is_same_group(struct io_entity *entity, struct io_entity *new_entity)
+{
+	if (entity->sched_data == new_entity->sched_data)
+		return 1;
+
+	return 0;
+}
+
+static inline struct io_entity *parent_entity(struct io_entity *entity)
+{
+	return entity->parent;
+}
+
+/* return depth at which an io entity is present in the hierarchy */
+static inline int depth_entity(struct io_entity *entity)
+{
+	int depth = 0;
+
+	for_each_entity(entity)
+		depth++;
+
+	return depth;
+}
+
+static void bfq_find_matching_entity(struct io_entity **entity,
+			struct io_entity **new_entity)
+{
+	int entity_depth, new_entity_depth;
+
+	/*
+	 * preemption test can be made between sibling entities who are in the
+	 * same group i.e who have a common parent. Walk up the hierarchy of
+	 * both entities until we find their ancestors who are siblings of
+	 * common parent.
+	 */
+
+	/* First walk up until both entities are at same depth */
+	entity_depth = depth_entity(*entity);
+	new_entity_depth = depth_entity(*new_entity);
+
+	while (entity_depth > new_entity_depth) {
+		entity_depth--;
+		*entity = parent_entity(*entity);
+	}
+
+	while (new_entity_depth > entity_depth) {
+		new_entity_depth--;
+		*new_entity = parent_entity(*new_entity);
+	}
+
+	while (!is_same_group(*entity, *new_entity)) {
+		*entity = parent_entity(*entity);
+		*new_entity = parent_entity(*new_entity);
+	}
+}
 #else /* GROUP_IOSCHED */
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
@@ -106,6 +172,17 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
 					 struct io_entity *entity)
 {
 }
+
+static inline int iog_deleting(struct io_group *iog)
+{
+	/* In flat mode, root cgroup can't be deleted. */
+	return 0;
+}
+
+static void bfq_find_matching_entity(struct io_entity **entity,
+					struct io_entity **new_entity)
+{
+}
 #endif /* GROUP_IOSCHED */
 
 static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
@@ -363,13 +440,6 @@ static void bfq_get_entity(struct io_entity *entity)
 		elv_get_ioq(ioq);
 }
 
-static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
-{
-	entity->ioprio = entity->new_ioprio;
-	entity->ioprio_class = entity->new_ioprio_class;
-	entity->sched_data = &iog->sched_data;
-}
-
 /**
  * bfq_find_deepest - find the deepest node that an extraction can modify.
  * @node: the node being removed.
@@ -833,8 +903,26 @@ static int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 {
 	struct io_sched_data *sd;
+	struct io_group *iog, *__iog;
 	struct io_entity *parent;
 
+	iog = container_of(entity->sched_data, struct io_group, sched_data);
+
+	/*
+	 * Hold a reference to entity's iog until we are done. This function
+	 * travels the hierarchy and we don't want to free up the group yet
+	 * while we are traversing the hierarchy. It is possible that this
+	 * group's cgroup has been removed hence cgroup reference is gone.
+	 * If this entity was active entity, then its group will not be on
+	 * any of the trees and it will be freed up the moment queue is
+	 * freed up in __bfq_deactivate_entity().
+	 *
+	 * Hence, hold a reference, deactivate the hierarchy of entities and
+	 * then drop the reference which should free up the whole chain of
+	 * groups.
+	 */
+	elv_get_iog(iog);
+
 	for_each_entity_safe(entity, parent) {
 		sd = entity->sched_data;
 
@@ -852,6 +940,7 @@ static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 			 * the budgets on the path towards the root
 			 * need to be updated.
 			 */
+			elv_put_iog(iog);
 			goto update;
 		}
 
@@ -859,11 +948,16 @@ static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 		 * If we reach there the parent is no more backlogged and
 		 * we want to propagate the dequeue upwards.
 		 *
+		 * If entity's group has been marked for deletion, don't
+		 * requeue the group in idle tree so that it can be freed.
 		 */
-
-		requeue = 1;
+		__iog = container_of(entity->sched_data, struct io_group,
+						sched_data);
+		if (!iog_deleting(__iog))
+			requeue = 1;
 	}
 
+	elv_put_iog(iog);
 	return;
 
 update:
@@ -902,8 +996,59 @@ static void io_flush_idle_tree(struct io_service_tree *st)
 		__bfq_deactivate_entity(entity, 0);
 }
 
+/*
+ * Release all the io group references to its async queues.
+ */
+static void
+io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
 /* Mainly hierarchical grouping code */
 #ifdef CONFIG_GROUP_IOSCHED
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup);
+
+static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->weight = entity->new_weight;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+}
+
+static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct io_entity *entity = &iog->entity;
+
+	entity->weight = entity->new_weight = iocg->weight;
+	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
+	entity->ioprio_changed = 1;
+	entity->my_sched_data = &iog->sched_data;
+}
+
+static void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+	struct io_entity *entity;
+
+	BUG_ON(parent == NULL);
+	BUG_ON(iog == NULL);
+
+	entity = &iog->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+	if (entity->parent)
+		/* Child group reference on parent group. */
+		elv_get_iog(parent);
+}
 
 struct io_cgroup io_root_cgroup = {
 	.weight = IO_DEFAULT_GRP_WEIGHT,
@@ -916,6 +1061,26 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+/*
+ * Search for the io_group matching @key (the efqd) in the hash table (for
+ * now only a list) of @iocg.  Must be called under rcu_read_lock().
+ */
+static struct io_group *
+io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	void *__key;
+
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		__key = rcu_dereference(iog->key);
+		if (__key == key)
+			return iog;
+	}
+
+	return NULL;
+}
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -1056,12 +1221,6 @@ static void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
 	task_unlock(tsk);
 }
 
-static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
-
-	/* Implemented in later patch */
-}
-
 struct cgroup_subsys io_subsys = {
 	.name = "io",
 	.create = iocg_create,
@@ -1072,7 +1231,599 @@ struct cgroup_subsys io_subsys = {
 	.subsys_id = io_subsys_id,
 	.use_id = 1,
 };
+
+static inline unsigned int iog_weight(struct io_group *iog)
+{
+	return iog->entity.weight;
+}
+
+/**
+ * io_group_chain_alloc - allocate a chain of groups.
+ * @efqd: queue descriptor.
+ * @cgroup: the leaf cgroup this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup.  Stop if a cgroup on the chain
+ * to the root has already an allocated group on @efqd.
+ */
+static struct io_group *
+io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *leaf = NULL, *prev = NULL;
+	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+
+	for (; cgroup != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		if (iog != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a io_group for efqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		iog = kzalloc_node(sizeof(*iog), flags, q->node);
+		if (!iog)
+			goto cleanup;
+
+		iog->iocg_id = css_id(&iocg->css);
+
+		io_group_init_entity(iocg, iog);
+		iog->my_entity = &iog->entity;
+
+		atomic_set(&iog->ref, 0);
+		iog->deleting = 0;
+
+		/*
+		 * Take the initial reference that will be released on destroy.
+		 * This can be thought of as a joint reference by cgroup and
+		 * elevator which will be dropped by either elevator exit
+		 * or cgroup deletion path depending on who is exiting first.
+		 */
+		elv_get_iog(iog);
+
+		if (leaf == NULL) {
+			leaf = iog;
+			prev = leaf;
+		} else {
+			io_group_set_parent(prev, iog);
+			/*
+			 * Build a list of allocated nodes using the key
+			 * field, which is still unused and will be initialized
+			 * only after the node is connected.
+			 */
+			prev->key = iog;
+			prev = iog;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->key;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+/**
+ * io_group_chain_link - link an allocated group chain to a cgroup hierarchy.
+ * @efqd: the queue descriptor.
+ * @cgroup: the leaf cgroup to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @efqd, all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the io_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+static void io_group_chain_link(struct request_queue *q, void *key,
+				struct cgroup *cgroup,
+				struct io_group *leaf,
+				struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(q->queue_lock);
+
+	for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		next = leaf->key;
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		BUG_ON(iog != NULL);
+
+		spin_lock_irqsave(&iocg->lock, flags);
+
+		rcu_assign_pointer(leaf->key, key);
+		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+		spin_unlock_irqrestore(&iocg->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	BUG_ON(cgroup == NULL && leaf != NULL);
+
+	if (cgroup != NULL && prev != NULL) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		iog = io_cgroup_lookup_group(iocg, key);
+		io_group_set_parent(prev, iog);
+	}
+}
+
+/**
+ * io_find_alloc_group - return the group associated to @efqd in @cgroup.
+ * @fqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ * @create: if set to 1, create the io group if it has not been created yet.
+ *
+ * Return a group associated to @fqd in @cgroup, allocating one if
+ * necessary.  When a group is returned all the cgroups in the path
+ * to the root have a group associated to @efqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback.  If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+static struct io_group *io_find_alloc_group(struct request_queue *q,
+			struct cgroup *cgroup, struct elv_fq_data *efqd,
+			int create)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog = NULL;
+	/* Note: Use efqd as key */
+	void *key = efqd;
+
+	/*
+	 * Take a reference to css object. Don't want to map a bio to
+	 * a group if it has been marked for deletion
+	 */
+
+	if (!css_tryget(&iocg->css))
+		return iog;
+
+	iog = io_cgroup_lookup_group(iocg, key);
+	if (iog != NULL || !create)
+		goto end;
+
+	iog = io_group_chain_alloc(q, key, cgroup);
+	if (iog != NULL)
+		io_group_chain_link(q, key, cgroup, iog, efqd);
+
+end:
+	css_put(&iocg->css);
+	return iog;
+}
+
+/*
+ * Search for the io group current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ *
+ * Note: This function should be called with queue lock held. It returns
+ * a pointer to io group without taking any reference. That group will
+ * be around as long as queue lock is not dropped (as group reclaim code
+ * needs to get hold of queue lock). So if somebody needs to use group
+ * pointer even after dropping queue lock, take a reference to the group
+ * before dropping queue lock.
+ */
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+	struct cgroup *cgroup;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	assert_spin_locked(q->queue_lock);
+
+	rcu_read_lock();
+	cgroup = task_cgroup(current, io_subsys_id);
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			/*
+			 * bio merge functions doing lookup don't want to
+			 * map bio to root group by default
+			 */
+			iog = NULL;
+	}
+	rcu_read_unlock();
+	return iog;
+}
+EXPORT_SYMBOL(io_get_io_group);
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_cgroup *iocg = &io_root_cgroup;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_group *iog = efqd->root_group;
+	struct io_service_tree *st;
+	int i;
+
+	BUG_ON(!iog);
+	spin_lock_irq(&iocg->lock);
+	hlist_del_rcu(&iog->group_node);
+	spin_unlock_irq(&iocg->lock);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
+
+	io_put_io_group_queues(e, iog);
+	elv_put_iog(iog);
+}
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	elv_get_iog(iog);
+	iog->entity.parent = NULL;
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	iocg = &io_root_cgroup;
+	spin_lock_irq(&iocg->lock);
+	rcu_assign_pointer(iog->key, key);
+	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+	iog->iocg_id = css_id(&iocg->css);
+	spin_unlock_irq(&iocg->lock);
+
+	return iog;
+}
+
+static void io_group_free_rcu(struct rcu_head *head)
+{
+	struct io_group *iog;
+
+	iog = container_of(head, struct io_group, rcu_head);
+	kfree(iog);
+}
+
+/*
+ * This cleanup function does the last bit of things to destroy cgroup.
+ * It should only get called after io_destroy_group has been invoked.
+ */
+static void io_group_cleanup(struct io_group *iog)
+{
+	struct io_service_tree *st;
+	struct io_entity *entity = iog->my_entity;
+	int i;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+		BUG_ON(st->wsum != 0);
+	}
+
+	BUG_ON(iog->sched_data.next_active != NULL);
+	BUG_ON(iog->sched_data.active_entity != NULL);
+	BUG_ON(entity != NULL && entity->tree != NULL);
+
+	/*
+	 * Wait for any rcu readers to exit before freeing up the group.
+	 * Primarily useful when io_get_io_group() is called without queue
+	 * lock to access some group data from bdi_congested_group() path.
+	 */
+	call_rcu(&iog->rcu_head, io_group_free_rcu);
+}
+
+void elv_put_iog(struct io_group *iog)
+{
+	struct io_group *parent = NULL;
+	struct io_entity *entity;
+
+	BUG_ON(!iog);
+
+	entity = iog->my_entity;
+
+	BUG_ON(atomic_read(&iog->ref) <= 0);
+	if (!atomic_dec_and_test(&iog->ref))
+		return;
+
+	if (entity)
+		parent = container_of(iog->my_entity->parent,
+					struct io_group, entity);
+
+	io_group_cleanup(iog);
+
+	if (parent)
+		elv_put_iog(parent);
+}
+EXPORT_SYMBOL(elv_put_iog);
+
+/*
+ * Check whether a given group has any active entities on any of its
+ * service trees.
+ */
+static inline int io_group_has_active_entities(struct io_group *iog)
+{
+	int i;
+	struct io_service_tree *st;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		if (!RB_EMPTY_ROOT(&st->active))
+			return 1;
+	}
+
+	/*
+	 * Also check there are no active entities being served which are
+	 * not on active tree
+	 */
+
+	if (iog->sched_data.active_entity)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * After the group is destroyed, no new sync IO should come to the group.
+ * It might still have pending IOs in some busy queues. It should be able to
+ * send those IOs down to the disk. The async IOs (due to dirty page writeback)
+ * would go in the root group queues after this, as the group does not exist
+ * anymore.
+ */
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	struct elevator_queue *eq;
+	struct io_service_tree *st;
+	int i;
+
+	BUG_ON(iog->my_entity == NULL);
+
+	/*
+	 * Mark io group for deletion so that no new entry goes in
+	 * idle tree. Any active queue will be removed from active
+	 * tree and not put in to idle tree.
+	 */
+	iog->deleting = 1;
+
+	/* We flush idle tree now, and don't put things in there any more. */
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		io_flush_idle_tree(st);
+	}
+
+	eq = container_of(efqd, struct elevator_queue, efqd);
+	hlist_del(&iog->elv_data_node);
+	io_put_io_group_queues(eq, iog);
+
+	/*
+	 * We can come here either through cgroup deletion path or through
+	 * elevator exit path. If we come here through cgroup deletion path
+	 * check if io group has any active entities or not. If not, then
+	 * deactivate this io group to make sure it is removed from idle
+	 * tree it might have been on. If this group was on idle tree, then
+	 * this probably will be the last reference and group will be
+	 * freed upon putting the reference down.
+	 */
+
+	if (!io_group_has_active_entities(iog)) {
+		/*
+		 * io group does not have any active entities. Because this
+		 * group has been decoupled from io_cgroup list and this
+		 * cgroup is being deleted, this group should not receive
+		 * any new IO. Hence it should be safe to deactivate this
+		 * io group and remove from the scheduling tree.
+		 */
+		__bfq_deactivate_entity(iog->my_entity, 0);
+	}
+
+	/*
+	 * Put the reference taken at the time of creation so that when all
+	 * queues are gone, cgroup can be destroyed.
+	 */
+	elv_put_iog(iog);
+}
+
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog;
+	struct elv_fq_data *efqd;
+	unsigned long uninitialized_var(flags);
+
+	/*
+	 * io groups are linked in two lists. One list is maintained
+	 * in elevator (efqd->group_list) and other is maintained
+	 * per cgroup structure (iocg->group_data).
+	 *
+	 * While a cgroup is being deleted, elevator also might be
+	 * exiting and both might try to cleanup the same io group
+	 * so we need to be a little careful.
+	 *
+	 * (iocg->group_data) is protected by iocg->lock. To avoid deadlock,
+	 * we can't hold the queue lock while holding iocg->lock. So we first
+	 * remove iog from iocg->group_data under iocg->lock. Whoever removes
+	 * iog from iocg->group_data should call __io_destroy_group to remove
+	 * iog.
+	 */
+
+	rcu_read_lock();
+
+remove_entry:
+	spin_lock_irqsave(&iocg->lock, flags);
+
+	if (hlist_empty(&iocg->group_data)) {
+		spin_unlock_irqrestore(&iocg->lock, flags);
+		goto done;
+	}
+	iog = hlist_entry(iocg->group_data.first, struct io_group,
+			  group_node);
+	efqd = rcu_dereference(iog->key);
+	hlist_del_rcu(&iog->group_node);
+	iog->iocg_id = 0;
+	spin_unlock_irqrestore(&iocg->lock, flags);
+
+	spin_lock_irqsave(efqd->queue->queue_lock, flags);
+	__io_destroy_group(efqd, iog);
+	spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+	goto remove_entry;
+
+done:
+	free_css_id(&io_subsys, &iocg->css);
+	rcu_read_unlock();
+	BUG_ON(!hlist_empty(&iocg->group_data));
+	kfree(iocg);
+}
+
+/*
+ * This function checks if iog is still in iocg->group_data, and removes it.
+ * If iog is not in that list, then cgroup destroy path has removed it, and
+ * we do not need to remove it.
+ */
+static void io_group_check_and_destroy(struct elv_fq_data *efqd,
+					struct io_group *iog)
+{
+	struct io_cgroup *iocg;
+	unsigned long flags;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+
+	css = css_lookup(&io_subsys, iog->iocg_id);
+
+	if (!css)
+		goto out;
+
+	iocg = container_of(css, struct io_cgroup, css);
+
+	spin_lock_irqsave(&iocg->lock, flags);
+
+	if (iog->iocg_id) {
+		hlist_del_rcu(&iog->group_node);
+		__io_destroy_group(efqd, iog);
+	}
+
+	spin_unlock_irqrestore(&iocg->lock, flags);
+out:
+	rcu_read_unlock();
+}
+
+static void io_disconnect_groups(struct elevator_queue *e)
+{
+	struct hlist_node *pos, *n;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+					elv_data_node) {
+		io_group_check_and_destroy(efqd, iog);
+	}
+}
+
+/*
+ * If the bio-submitting task and the rq don't belong to the same io_group,
+ * the bio can't be merged.
+ */
+int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	struct request_queue *q = rq->q;
+	struct io_queue *ioq = rq->ioq;
+	struct io_group *iog, *__iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return 1;
+
+	/* Determine the io group of the bio submitting task */
+	iog = io_get_io_group(q, 0);
+	if (!iog) {
+		/* Maybe the task belongs to a different cgroup for which
+		 * the io group has not been set up yet. */
+		return 0;
+	}
+
+	/* Determine the io group of the ioq that rq belongs to */
+	__iog = ioq_to_io_group(ioq);
+
+	return (iog == __iog);
+}
+#else /* GROUP_IOSCHED */
+static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->weight = entity->new_weight;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+static inline void io_disconnect_groups(struct elevator_queue *e) {}
+static inline unsigned int iog_weight(struct io_group *iog) { return 0; }
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	struct io_service_tree *st;
+	int i;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
+
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+	/* In flat mode, there is only root group */
+	return q->elevator->efqd.root_group;
+}
+EXPORT_SYMBOL(io_get_io_group);
 #endif /* GROUP_IOSCHED */
+
 /* Elevator fair queuing function */
 static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
 {
@@ -1375,10 +2126,14 @@ void elv_put_ioq(struct io_queue *ioq)
 	struct elv_fq_data *efqd = ioq->efqd;
 	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
 						efqd);
+	struct io_group *iog;
 
 	BUG_ON(atomic_read(&ioq->ref) <= 0);
 	if (!atomic_dec_and_test(&ioq->ref))
 		return;
+
+	iog = ioq_to_io_group(ioq);
+
 	BUG_ON(ioq->nr_queued);
 	BUG_ON(ioq->entity.tree != NULL);
 	BUG_ON(elv_ioq_busy(ioq));
@@ -1390,10 +2145,11 @@ void elv_put_ioq(struct io_queue *ioq)
 	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
 	elv_log_ioq(efqd, ioq, "put_queue");
 	elv_free_ioq(ioq);
+	elv_put_iog(iog);
 }
 EXPORT_SYMBOL(elv_put_ioq);
 
-void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+static void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
 {
 	struct io_queue *ioq = *ioq_ptr;
 
@@ -1485,8 +2241,12 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 	struct request_queue *q = efqd->queue;
 
 	if (ioq) {
-		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
-							efqd->busy_queues);
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
+				" weight=%u group_weight=%u",
+				efqd->busy_queues,
+				ioq->entity.ioprio, ioq->entity.weight,
+				iog_weight(iog));
 		ioq->slice_end = 0;
 
 		elv_clear_ioq_wait_request(ioq);
@@ -1548,6 +2308,7 @@ static void elv_activate_ioq(struct io_queue *ioq, int add_front)
 static void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 					int requeue)
 {
+	requeue = update_requeue(ioq, requeue);
 	bfq_deactivate_entity(&ioq->entity, requeue);
 }
 
@@ -1725,6 +2486,7 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	struct io_queue *ioq;
 	struct elevator_queue *eq = q->elevator;
 	struct io_entity *entity, *new_entity;
+	struct io_group *iog = NULL, *new_iog = NULL;
 
 	ioq = elv_active_ioq(eq);
 
@@ -1735,6 +2497,13 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	new_entity = &new_ioq->entity;
 
 	/*
+	 * In a hierarchical setup, one needs to traverse up the hierarchy
+	 * till both the queues are children of same parent to make a
+	 * decision whether to do the preemption or not.
+	 */
+	bfq_find_matching_entity(&entity, &new_entity);
+
+	/*
 	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
 	 */
 
@@ -1750,9 +2519,17 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 		return 1;
 
 	/*
-	 * Check with io scheduler if it has additional criterion based on
-	 * which it wants to preempt existing queue.
+	 * If both the queues belong to same group, check with io scheduler
+	 * if it has additional criterion based on which it wants to
+	 * preempt existing queue.
 	 */
+	iog = ioq_to_io_group(ioq);
+	new_iog = ioq_to_io_group(new_ioq);
+
+	if (iog != new_iog)
+		return 0;
+
+
 	if (eq->ops->elevator_should_preempt_fn)
 		return eq->ops->elevator_should_preempt_fn(q,
 						ioq_sched_queue(new_ioq), rq);
@@ -2171,15 +2948,6 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		elv_schedule_dispatch(q);
 }
 
-struct io_group *io_get_io_group(struct request_queue *q)
-{
-	struct elv_fq_data *efqd = &q->elevator->efqd;
-
-	/* In flat mode, there is only root group */
-	return efqd->root_group;
-}
-EXPORT_SYMBOL(io_get_io_group);
-
 void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio)
 {
@@ -2230,53 +2998,6 @@ void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 }
 EXPORT_SYMBOL(io_group_set_async_queue);
 
-/*
- * Release all the io group references to its async queues.
- */
-static void
-io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
-{
-	int i, j;
-
-	for (i = 0; i < 2; i++)
-		for (j = 0; j < IOPRIO_BE_NR; j++)
-			elv_release_ioq(e, &iog->async_queue[i][j]);
-
-	/* Free up async idle queue */
-	elv_release_ioq(e, &iog->async_idle_queue);
-}
-
-static struct io_group *io_alloc_root_group(struct request_queue *q,
-					struct elevator_queue *e, void *key)
-{
-	struct io_group *iog;
-	int i;
-
-	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
-	if (iog == NULL)
-		return NULL;
-
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
-		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
-
-	return iog;
-}
-
-static void io_free_root_group(struct elevator_queue *e)
-{
-	struct io_group *iog = e->efqd.root_group;
-	struct io_service_tree *st;
-	int i;
-
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
-		st = iog->sched_data.service_tree + i;
-		io_flush_idle_tree(st);
-	}
-
-	io_put_io_group_queues(e, iog);
-	kfree(iog);
-}
-
 static void elv_slab_kill(void)
 {
 	/*
@@ -2320,6 +3041,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	efqd->idle_slice_timer.data = (unsigned long) efqd;
 
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
@@ -2339,12 +3061,23 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 void elv_exit_fq_data(struct elevator_queue *e)
 {
 	struct elv_fq_data *efqd = &e->efqd;
+	struct request_queue *q = efqd->queue;
 
 	if (!elv_iosched_fair_queuing_enabled(e))
 		return;
 
 	elv_shutdown_timer_wq(e);
 
+	spin_lock_irq(q->queue_lock);
+	/* This should drop all the io group references of async queues */
+	io_disconnect_groups(e);
+	spin_unlock_irq(q->queue_lock);
+
+	elv_shutdown_timer_wq(e);
+
+	/* Wait for iog->key accessors to exit their grace periods. */
+	synchronize_rcu();
+
 	BUG_ON(timer_pending(&efqd->idle_slice_timer));
 	io_free_root_group(e);
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index d9acb75..c8987c0 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -184,13 +184,49 @@ struct io_queue {
 };
 
 #ifdef CONFIG_GROUP_IOSCHED
+/**
+ * struct io_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both io_queues and io_groups).
+ * @group_node: node to be inserted into the io_cgroup->group_data
+ *              list of the containing cgroup's io_cgroup.
+ * @elv_data_node: node to be inserted into the @efqd->group_list list
+ *             of the groups active on the same device; used for cleanup.
+ * @async_queue: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_queue: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/migration.
+ *
+ * Each (device, cgroup) pair has its own io_group, i.e., for each cgroup
+ * there is a set of io_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @group_node is protected by the io_cgroup lock, and is accessed
+ *      via RCU from its readers.
+ *    o @key (the efqd pointer) is protected by the queue lock; RCU is
+ *      used to access it from the readers.
+ *    o All the other fields are protected by the @efqd queue lock.
+ */
 struct io_group {
 	struct io_entity entity;
+	struct hlist_node elv_data_node;
 	struct hlist_node group_node;
 	struct io_sched_data sched_data;
+	atomic_t ref;
 	struct io_entity *my_entity;
 
 	/*
+	 * A cgroup has multiple io_groups, one for each request queue.
+	 * To find the io group belonging to a particular queue, elv_fq_data
+	 * pointer is stored as a key.
+	 */
+	void *key;
+
+	/*
 	 * async queue for each priority case for RT and BE class.
 	 * Used only for cfq.
 	 */
@@ -198,11 +234,15 @@ struct io_group {
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
 
+	struct rcu_head rcu_head;
+
 	/*
 	 * Used to track any pending rt requests so we can pre-empt current
 	 * non-RT cfqq in service when this value is non-zero.
 	 */
 	unsigned int busy_rt_queues;
+
+	int deleting;
 	unsigned short iocg_id;
 };
 
@@ -245,6 +285,9 @@ struct io_group {
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	/* List of io groups hanging on this elevator */
+	struct hlist_head group_list;
+
 	struct request_queue *queue;
 	unsigned int busy_queues;
 
@@ -407,7 +450,7 @@ static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
 static inline unsigned int bfq_ioprio_to_weight(int ioprio)
 {
 	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
-	return IOPRIO_BE_NR - ioprio;
+	return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX)/IOPRIO_BE_NR;
 }
 
 static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
@@ -430,6 +473,46 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 						sched_data);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void elv_put_iog(struct io_group *iog);
+
+static inline void elv_get_iog(struct io_group *iog)
+{
+	atomic_inc(&iog->ref);
+}
+
+static inline int update_requeue(struct io_queue *ioq, int requeue)
+{
+	struct io_group *iog = ioq_to_io_group(ioq);
+
+	if (iog->deleting == 1)
+		return 0;
+
+	return requeue;
+}
+
+#else /* !GROUP_IOSCHED */
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
+
+static inline void elv_get_iog(struct io_group *iog)
+{
+}
+
+static inline void elv_put_iog(struct io_group *iog)
+{
+}
+
+static inline int update_requeue(struct io_queue *ioq, int requeue)
+{
+	return requeue;
+}
+
+#endif /* GROUP_IOSCHED */
+
 extern ssize_t elv_slice_idle_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_slice_idle_store(struct elevator_queue *q, const char *name,
 						size_t count);
@@ -477,7 +560,7 @@ extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio);
 extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 					int ioprio, struct io_queue *ioq);
-extern struct io_group *io_get_io_group(struct request_queue *q);
+extern struct io_group *io_get_io_group(struct request_queue *q, int create);
 extern int elv_nr_busy_ioq(struct elevator_queue *e);
 extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
 extern void elv_free_ioq(struct io_queue *ioq);
@@ -528,5 +611,11 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
 {
 	return NULL;
 }
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+
+{
+	return 1;
+}
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 357f529..a6ef1f1 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -100,6 +100,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (bio_integrity(bio) != blk_integrity_rq(rq))
 		return 0;
 
+	/* If rq and bio belong to different groups, don't allow merging */
+	if (!io_group_allow_merge(rq, bio))
+		return 0;
+
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;
 
-- 
1.6.0.6


* [PATCH 09/25] io-controller: Common hierarchical fair queuing code in elevator layer
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

o This patch enables hierarchical fair queuing in the common layer. It is
  controlled by the config option CONFIG_GROUP_IOSCHED.

o Requests keep a reference on their ioq, and each ioq keeps a reference
  on its group. For async queues in CFQ, and for the single ioq in other
  schedulers, the io_group also keeps a reference on the io_queue. This
  reference on the ioq is dropped when the queue is released
  (elv_release_ioq), so that the queue can be freed.

  When a queue is released, it drops its reference on the io_group, and
  the io_group is released once all of its queues have been released.
  Child groups also take a reference on their parent group, and release
  it when they are destroyed.
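
  The reference chain described above can be sketched in user space as
  follows. This is only an illustration: struct grp, grp_alloc(),
  grp_get(), grp_put() and grp_freed are hypothetical names that do not
  appear in the patch, the counter is a plain int rather than an
  atomic_t, and the walk up the parent chain is iterative whereas
  elv_put_iog() in this patch is recursive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct io_group: a refcount and a parent. */
struct grp {
	int ref;
	struct grp *parent;
};

/* Illustration only: counts how many groups have been freed so far. */
static int grp_freed;

/* Allocate a group holding its creation reference. */
static struct grp *grp_alloc(struct grp *parent)
{
	struct grp *g = calloc(1, sizeof(*g));

	if (!g)
		return NULL;

	g->ref = 1;		/* creation reference, dropped on destroy */
	g->parent = parent;
	if (parent)
		parent->ref++;	/* child group reference on parent group */
	return g;
}

/* A queue (or any other user) pins the group. */
static void grp_get(struct grp *g)
{
	g->ref++;
}

/*
 * Drop one reference. When the count reaches zero the group is freed
 * and the reference it held on its parent is dropped in turn, so a
 * whole chain of otherwise unused groups can go away in one call.
 */
static void grp_put(struct grp *g)
{
	while (g) {
		struct grp *parent = g->parent;

		assert(g->ref > 0);
		if (--g->ref)
			return;

		free(g);
		grp_freed++;
		g = parent;
	}
}
```

  In this model a child that still has a queue pinning it survives the
  destroy path dropping the creation reference, and the parent stays
  alive until the last child is gone.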

o Reads of iocg->group_data are not always done under iocg->lock, so all
  operations on that list are still protected by RCU. All modifications to
  iocg->group_data must be done under iocg->lock.

  Whenever iocg->lock and queue_lock can both be held, queue_lock should
  be taken first. This fixed ordering avoids deadlocks between the two
  locks. To avoid a race between cgroup deletion and elevator switch, the
  following algorithm is used:

	- Cgroup deletion path holds iocg->lock and removes the iog entry
	  from the iocg->group_data list. Then it drops iocg->lock, takes
	  queue_lock and destroys iog. So in this path, we never hold
	  iocg->lock and queue_lock at the same time. Also, since we
	  remove iog from iocg->group_data under iocg->lock, we can't
	  race with elevator switch.

	- Elevator switch path does not remove iog from the
	  iocg->group_data list directly. It first takes iocg->lock and
	  scans iocg->group_data again to see if iog is still there;
	  it removes iog only if it finds iog there. Otherwise, cgroup
	  deletion must have removed it from the list, and cgroup
	  deletion is responsible for removing iog.

  So whichever path removes iog from the iocg->group_data list also
  does the final teardown of iog by calling the __io_destroy_group()
  function.
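
  The ownership rule can be modelled in a few lines of single-threaded
  C. This is a sketch under stated assumptions, not the code in this
  patch: struct grp_model and the three functions are hypothetical
  names, the two spinlocks are reduced to comments marking where they
  would be held, and membership in iocg->group_data is reduced to one
  flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one group linked on one cgroup list. */
struct grp_model {
	bool on_cgroup_list;	/* stands in for iocg->group_data membership */
	int destroyed;		/* how many times the final teardown ran */
};

/*
 * Unlink the group from the cgroup list. Returns true only to the
 * caller that actually performed the removal.
 */
static bool unlink_from_cgroup(struct grp_model *g)
{
	bool removed;

	/* iocg->lock would be taken here */
	removed = g->on_cgroup_list;
	g->on_cgroup_list = false;
	/* iocg->lock would be released here */

	return removed;
}

/* Cgroup deletion: unlink first, then tear down under queue_lock. */
static void cgroup_delete_path(struct grp_model *g)
{
	if (unlink_from_cgroup(g)) {
		/* iocg->lock is already dropped; queue_lock taken here */
		g->destroyed++;
		/* queue_lock released here */
	}
}

/* Elevator switch: the caller already holds queue_lock. */
static void elevator_exit_path(struct grp_model *g)
{
	/* tear down only if cgroup deletion has not got there first */
	if (unlink_from_cgroup(g))
		g->destroyed++;
}
```

  Whichever order the two paths run in, the teardown happens exactly
  once, because only the path that saw the group on the list proceeds
  to destroy it.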

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    2 +
 block/elevator-fq.c |  885 ++++++++++++++++++++++++++++++++++++++++++++++-----
 block/elevator-fq.h |   93 ++++++-
 block/elevator.c    |    4 +
 4 files changed, 906 insertions(+), 78 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f852b00..6ddc882 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1310,6 +1310,8 @@ alloc_ioq:
 			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
+		/* ioq reference on iog */
+		elv_get_iog(iog);
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
 	}
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 84276d5..f8d0b90 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -45,6 +45,9 @@ static int elv_rate_sampling_window = HZ / 10;
  */
 #define WFQ_SERVICE_SHIFT	22
 
+static void
+elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+
 #ifdef CONFIG_GROUP_IOSCHED
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = entity->parent)
@@ -90,6 +93,69 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
 {
 	BUG_ON(sd->next_active != entity);
 }
+
+static inline int iog_deleting(struct io_group *iog)
+{
+	return iog->deleting;
+}
+
+/* Do the two (enqueued) entities belong to the same group ? */
+static inline int
+is_same_group(struct io_entity *entity, struct io_entity *new_entity)
+{
+	if (entity->sched_data == new_entity->sched_data)
+		return 1;
+
+	return 0;
+}
+
+static inline struct io_entity *parent_entity(struct io_entity *entity)
+{
+	return entity->parent;
+}
+
+/* return depth at which a io entity is present in the hierarchy */
+static inline int depth_entity(struct io_entity *entity)
+{
+	int depth = 0;
+
+	for_each_entity(entity)
+		depth++;
+
+	return depth;
+}
+
+static void bfq_find_matching_entity(struct io_entity **entity,
+			struct io_entity **new_entity)
+{
+	int entity_depth, new_entity_depth;
+
+	/*
+	 * preemption test can be made between sibling entities who are in the
+	 * same group i.e who have a common parent. Walk up the hierarchy of
+	 * both entities until we find their ancestors who are siblings of
+	 * common parent.
+	 */
+
+	/* First walk up until both entities are at same depth */
+	entity_depth = depth_entity(*entity);
+	new_entity_depth = depth_entity(*new_entity);
+
+	while (entity_depth > new_entity_depth) {
+		entity_depth--;
+		*entity = parent_entity(*entity);
+	}
+
+	while (new_entity_depth > entity_depth) {
+		new_entity_depth--;
+		*new_entity = parent_entity(*new_entity);
+	}
+
+	while (!is_same_group(*entity, *new_entity)) {
+		*entity = parent_entity(*entity);
+		*new_entity = parent_entity(*new_entity);
+	}
+}
 #else /* GROUP_IOSCHED */
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
@@ -106,6 +172,17 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
 					 struct io_entity *entity)
 {
 }
+
+static inline int iog_deleting(struct io_group *iog)
+{
+	/* In flat mode, root cgroup can't be deleted. */
+	return 0;
+}
+
+static void bfq_find_matching_entity(struct io_entity **entity,
+					struct io_entity **new_entity)
+{
+}
 #endif /* GROUP_IOSCHED */
 
 static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
@@ -363,13 +440,6 @@ static void bfq_get_entity(struct io_entity *entity)
 		elv_get_ioq(ioq);
 }
 
-static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
-{
-	entity->ioprio = entity->new_ioprio;
-	entity->ioprio_class = entity->new_ioprio_class;
-	entity->sched_data = &iog->sched_data;
-}
-
 /**
  * bfq_find_deepest - find the deepest node that an extraction can modify.
  * @node: the node being removed.
@@ -833,8 +903,26 @@ static int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 {
 	struct io_sched_data *sd;
+	struct io_group *iog, *__iog;
 	struct io_entity *parent;
 
+	iog = container_of(entity->sched_data, struct io_group, sched_data);
+
+	/*
+	 * Hold a reference to entity's iog until we are done. This function
+	 * travels the hierarchy and we don't want to free up the group yet
+	 * while we are traversing the hierarchy. It is possible that this
+	 * group's cgroup has been removed hence cgroup reference is gone.
+	 * If this entity was active entity, then its group will not be on
+	 * any of the trees and it will be freed up the moment queue is
+	 * freed up in __bfq_deactivate_entity().
+	 *
+	 * Hence, hold a reference, deactivate the hierarchy of entities and
+	 * then drop the reference which should free up the whole chain of
+	 * groups.
+	 */
+	elv_get_iog(iog);
+
 	for_each_entity_safe(entity, parent) {
 		sd = entity->sched_data;
 
@@ -852,6 +940,7 @@ static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 			 * the budgets on the path towards the root
 			 * need to be updated.
 			 */
+			elv_put_iog(iog);
 			goto update;
 		}
 
@@ -859,11 +948,16 @@ static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 		 * If we reach there the parent is no more backlogged and
 		 * we want to propagate the dequeue upwards.
 		 *
+		 * If entity's group has been marked for deletion, don't
+		 * requeue the group in idle tree so that it can be freed.
 		 */
-
-		requeue = 1;
+		__iog = container_of(entity->sched_data, struct io_group,
+						sched_data);
+		if (!iog_deleting(__iog))
+			requeue = 1;
 	}
 
+	elv_put_iog(iog);
 	return;
 
 update:
@@ -902,8 +996,59 @@ static void io_flush_idle_tree(struct io_service_tree *st)
 		__bfq_deactivate_entity(entity, 0);
 }
 
+/*
+ * Release all the io group references to its async queues.
+ */
+static void
+io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
 /* Mainly hierarchical grouping code */
 #ifdef CONFIG_GROUP_IOSCHED
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup);
+
+static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->weight = entity->new_weight;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+}
+
+static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct io_entity *entity = &iog->entity;
+
+	entity->weight = entity->new_weight = iocg->weight;
+	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
+	entity->ioprio_changed = 1;
+	entity->my_sched_data = &iog->sched_data;
+}
+
+static void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+	struct io_entity *entity;
+
+	BUG_ON(parent == NULL);
+	BUG_ON(iog == NULL);
+
+	entity = &iog->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+	if (entity->parent)
+		/* Child group reference on parent group. */
+		elv_get_iog(parent);
+}
 
 struct io_cgroup io_root_cgroup = {
 	.weight = IO_DEFAULT_GRP_WEIGHT,
@@ -916,6 +1061,26 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+/*
+ * Search the io_group for efqd into the hash table (by now only a list)
+ * of bgrp.  Must be called under rcu_read_lock().
+ */
+static struct io_group *
+io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	void *__key;
+
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		__key = rcu_dereference(iog->key);
+		if (__key == key)
+			return iog;
+	}
+
+	return NULL;
+}
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -1056,12 +1221,6 @@ static void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
 	task_unlock(tsk);
 }
 
-static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
-
-	/* Implemented in later patch */
-}
-
 struct cgroup_subsys io_subsys = {
 	.name = "io",
 	.create = iocg_create,
@@ -1072,7 +1231,599 @@ struct cgroup_subsys io_subsys = {
 	.subsys_id = io_subsys_id,
 	.use_id = 1,
 };
+
+static inline unsigned int iog_weight(struct io_group *iog)
+{
+	return iog->entity.weight;
+}
+
+/**
+ * io_group_chain_alloc - allocate a chain of groups.
+ * @efqd: queue descriptor.
+ * @cgroup: the leaf cgroup this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup.  Stop if a cgroup on the chain
+ * to the root has already an allocated group on @efqd.
+ */
+static struct io_group *
+io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *leaf = NULL, *prev = NULL;
+	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+
+	for (; cgroup != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		if (iog != NULL) {
+			/*
+			 * All the cgroups in the path from there to the
+			 * root must have a io_group for efqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		iog = kzalloc_node(sizeof(*iog), flags, q->node);
+		if (!iog)
+			goto cleanup;
+
+		iog->iocg_id = css_id(&iocg->css);
+
+		io_group_init_entity(iocg, iog);
+		iog->my_entity = &iog->entity;
+
+		atomic_set(&iog->ref, 0);
+		iog->deleting = 0;
+
+		/*
+		 * Take the initial reference that will be released on destroy
+		 * This can be thought of a joint reference by cgroup and
+		 * elevator which will be dropped by either elevator exit
+		 * or cgroup deletion path depending on who is exiting first.
+		 */
+		elv_get_iog(iog);
+
+		if (leaf == NULL) {
+			leaf = iog;
+			prev = leaf;
+		} else {
+			io_group_set_parent(prev, iog);
+			/*
+			 * Build a list of allocated nodes using the efqd
+			 * field, that is still unused and will be initialized
+			 * only after the node will be connected.
+			 */
+			prev->key = iog;
+			prev = iog;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->key;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+/**
+ * io_group_chain_link - link an allocated group chain to a cgroup hierarchy.
+ * @efqd: the queue descriptor.
+ * @cgroup: the leaf cgroup to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @efqd all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the io_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+static void io_group_chain_link(struct request_queue *q, void *key,
+				struct cgroup *cgroup,
+				struct io_group *leaf,
+				struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(q->queue_lock);
+
+	for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		next = leaf->key;
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		BUG_ON(iog != NULL);
+
+		spin_lock_irqsave(&iocg->lock, flags);
+
+		rcu_assign_pointer(leaf->key, key);
+		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+		spin_unlock_irqrestore(&iocg->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	BUG_ON(cgroup == NULL && leaf != NULL);
+
+	if (cgroup != NULL && prev != NULL) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		iog = io_cgroup_lookup_group(iocg, key);
+		io_group_set_parent(prev, iog);
+	}
+}
+
+/**
+ * io_find_alloc_group - return the group associated to @efqd in @cgroup.
+ * @fqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ * @create: if set to 1, create the io group if it has not been created yet.
+ *
+ * Return a group associated to @fqd in @cgroup, allocating one if
+ * necessary.  When a group is returned all the cgroups in the path
+ * to the root have a group associated to @efqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback.  If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+static struct io_group *io_find_alloc_group(struct request_queue *q,
+			struct cgroup *cgroup, struct elv_fq_data *efqd,
+			int create)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog = NULL;
+	/* Note: Use efqd as key */
+	void *key = efqd;
+
+	/*
+	 * Take a reference to css object. Don't want to map a bio to
+	 * a group if it has been marked for deletion
+	 */
+
+	if (!css_tryget(&iocg->css))
+		return iog;
+
+	iog = io_cgroup_lookup_group(iocg, key);
+	if (iog != NULL || !create)
+		goto end;
+
+	iog = io_group_chain_alloc(q, key, cgroup);
+	if (iog != NULL)
+		io_group_chain_link(q, key, cgroup, iog, efqd);
+
+end:
+	css_put(&iocg->css);
+	return iog;
+}
+
+/*
+ * Search for the io group current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ *
+ * Note: This function should be called with queue lock held. It returns
+ * a pointer to io group without taking any reference. That group will
+ * be around as long as queue lock is not dropped (as group reclaim code
+ * needs to get hold of queue lock). So if somebody needs to use group
+ * pointer even after dropping queue lock, take a reference to the group
+ * before dropping queue lock.
+ */
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+	struct cgroup *cgroup;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	assert_spin_locked(q->queue_lock);
+
+	rcu_read_lock();
+	cgroup = task_cgroup(current, io_subsys_id);
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			/*
+			 * bio merge functions doing lookup don't want to
+			 * map bio to root group by default
+			 */
+			iog = NULL;
+	}
+	rcu_read_unlock();
+	return iog;
+}
+EXPORT_SYMBOL(io_get_io_group);
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_cgroup *iocg = &io_root_cgroup;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_group *iog = efqd->root_group;
+	struct io_service_tree *st;
+	int i;
+
+	BUG_ON(!iog);
+	spin_lock_irq(&iocg->lock);
+	hlist_del_rcu(&iog->group_node);
+	spin_unlock_irq(&iocg->lock);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
+
+	io_put_io_group_queues(e, iog);
+	elv_put_iog(iog);
+}
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	elv_get_iog(iog);
+	iog->entity.parent = NULL;
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	iocg = &io_root_cgroup;
+	spin_lock_irq(&iocg->lock);
+	rcu_assign_pointer(iog->key, key);
+	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+	iog->iocg_id = css_id(&iocg->css);
+	spin_unlock_irq(&iocg->lock);
+
+	return iog;
+}
+
+static void io_group_free_rcu(struct rcu_head *head)
+{
+	struct io_group *iog;
+
+	iog = container_of(head, struct io_group, rcu_head);
+	kfree(iog);
+}
+
+/*
+ * This cleanup function does the last bit of things to destroy cgroup.
+ * It should only get called after io_destroy_group has been invoked.
+ */
+static void io_group_cleanup(struct io_group *iog)
+{
+	struct io_service_tree *st;
+	struct io_entity *entity = iog->my_entity;
+	int i;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+		BUG_ON(st->wsum != 0);
+	}
+
+	BUG_ON(iog->sched_data.next_active != NULL);
+	BUG_ON(iog->sched_data.active_entity != NULL);
+	BUG_ON(entity != NULL && entity->tree != NULL);
+
+	/*
+	 * Wait for any rcu readers to exit before freeing up the group.
+	 * Primarily useful when io_get_io_group() is called without queue
+	 * lock to access some group data from bdi_congested_group() path.
+	 */
+	call_rcu(&iog->rcu_head, io_group_free_rcu);
+}
+
+void elv_put_iog(struct io_group *iog)
+{
+	struct io_group *parent = NULL;
+	struct io_entity *entity;
+
+	BUG_ON(!iog);
+
+	entity = iog->my_entity;
+
+	BUG_ON(atomic_read(&iog->ref) <= 0);
+	if (!atomic_dec_and_test(&iog->ref))
+		return;
+
+	if (entity)
+		parent = container_of(iog->my_entity->parent,
+					struct io_group, entity);
+
+	io_group_cleanup(iog);
+
+	if (parent)
+		elv_put_iog(parent);
+}
+EXPORT_SYMBOL(elv_put_iog);
+
+/*
+ * check whether a given group has got any active entities on any of the
+ * service tree.
+ */
+static inline int io_group_has_active_entities(struct io_group *iog)
+{
+	int i;
+	struct io_service_tree *st;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		if (!RB_EMPTY_ROOT(&st->active))
+			return 1;
+	}
+
+	/*
+	 * Also check there are no active entities being served which are
+	 * not on active tree
+	 */
+
+	if (iog->sched_data.active_entity)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * After the group is destroyed, no new sync IO should come to the group.
+ * It might still have pending IOs in some busy queues. It should be able to
+ * send those IOs down to the disk. The async IOs (due to dirty page writeback)
+ * would go in the root group queues after this, as the group does not exist
+ * anymore.
+ */
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	struct elevator_queue *eq;
+	struct io_service_tree *st;
+	int i;
+
+	BUG_ON(iog->my_entity == NULL);
+
+	/*
+	 * Mark io group for deletion so that no new entry goes in
+	 * idle tree. Any active queue will be removed from active
+	 * tree and not put in to idle tree.
+	 */
+	iog->deleting = 1;
+
+	/* We flush idle tree now, and don't put things in there any more. */
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		io_flush_idle_tree(st);
+	}
+
+	eq = container_of(efqd, struct elevator_queue, efqd);
+	hlist_del(&iog->elv_data_node);
+	io_put_io_group_queues(eq, iog);
+
+	/*
+	 * We can come here either through cgroup deletion path or through
+	 * elevator exit path. If we come here through cgroup deletion path
+	 * check if io group has any active entities or not. If not, then
+	 * deactivate this io group to make sure it is removed from idle
+	 * tree it might have been on. If this group was on idle tree, then
+	 * this probably will be the last reference and group will be
+	 * freed upon putting the reference down.
+	 */
+
+	if (!io_group_has_active_entities(iog)) {
+		/*
+		 * io group does not have any active entities. Because this
+		 * group has been decoupled from io_cgroup list and this
+		 * cgroup is being deleted, this group should not receive
+		 * any new IO. Hence it should be safe to deactivate this
+		 * io group and remove from the scheduling tree.
+		 */
+		__bfq_deactivate_entity(iog->my_entity, 0);
+	}
+
+	/*
+	 * Put the reference taken at the time of creation so that when all
+	 * queues are gone, cgroup can be destroyed.
+	 */
+	elv_put_iog(iog);
+}
+
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog;
+	struct elv_fq_data *efqd;
+	unsigned long uninitialized_var(flags);
+
+	/*
+	 * io groups are linked in two lists. One list is maintained
+	 * in elevator (efqd->group_list) and other is maintained
+	 * per cgroup structure (iocg->group_data).
+	 *
+	 * While a cgroup is being deleted, elevator also might be
+	 * exiting and both might try to cleanup the same io group
+	 * so we need to be a little careful.
+	 *
+	 * (iocg->group_data) is protected by iocg->lock. To avoid deadlock,
+	 * we can't hold the queue lock while holding iocg->lock. So we first
+	 * remove iog from iocg->group_data under iocg->lock. Whoever removes
+	 * iog from iocg->group_data should call __io_destroy_group to remove
+	 * iog.
+	 */
+
+	rcu_read_lock();
+
+remove_entry:
+	spin_lock_irqsave(&iocg->lock, flags);
+
+	if (hlist_empty(&iocg->group_data)) {
+		spin_unlock_irqrestore(&iocg->lock, flags);
+		goto done;
+	}
+	iog = hlist_entry(iocg->group_data.first, struct io_group,
+			  group_node);
+	efqd = rcu_dereference(iog->key);
+	hlist_del_rcu(&iog->group_node);
+	iog->iocg_id = 0;
+	spin_unlock_irqrestore(&iocg->lock, flags);
+
+	spin_lock_irqsave(efqd->queue->queue_lock, flags);
+	__io_destroy_group(efqd, iog);
+	spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+	goto remove_entry;
+
+done:
+	free_css_id(&io_subsys, &iocg->css);
+	rcu_read_unlock();
+	BUG_ON(!hlist_empty(&iocg->group_data));
+	kfree(iocg);
+}
+
+/*
+ * This function checks if iog is still in iocg->group_data, and removes it.
+ * If iog is not in that list, then cgroup destroy path has removed it, and
+ * we do not need to remove it.
+ */
+static void io_group_check_and_destroy(struct elv_fq_data *efqd,
+					struct io_group *iog)
+{
+	struct io_cgroup *iocg;
+	unsigned long flags;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+
+	css = css_lookup(&io_subsys, iog->iocg_id);
+
+	if (!css)
+		goto out;
+
+	iocg = container_of(css, struct io_cgroup, css);
+
+	spin_lock_irqsave(&iocg->lock, flags);
+
+	if (iog->iocg_id) {
+		hlist_del_rcu(&iog->group_node);
+		__io_destroy_group(efqd, iog);
+	}
+
+	spin_unlock_irqrestore(&iocg->lock, flags);
+out:
+	rcu_read_unlock();
+}
+
+static void io_disconnect_groups(struct elevator_queue *e)
+{
+	struct hlist_node *pos, *n;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+					elv_data_node) {
+		io_group_check_and_destroy(efqd, iog);
+	}
+}
+
+/*
+ * If the bio submitting task and rq don't belong to the same io_group, it can't
+ * be merged
+ */
+int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	struct request_queue *q = rq->q;
+	struct io_queue *ioq = rq->ioq;
+	struct io_group *iog, *__iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return 1;
+
+	/* Determine the io group of the bio submitting task */
+	iog = io_get_io_group(q, 0);
+	if (!iog) {
+		/* Maybe the task belongs to a different cgroup for which io
+		 * group has not been setup yet. */
+		return 0;
+	}
+
+	/* Determine the io group of the ioq, rq belongs to*/
+	__iog = ioq_to_io_group(ioq);
+
+	return (iog == __iog);
+}
+#else /* GROUP_IOSCHED */
+static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->weight = entity->new_weight;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+static inline void io_disconnect_groups(struct elevator_queue *e) {}
+static inline unsigned int iog_weight(struct io_group *iog) { return 0; }
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	struct io_service_tree *st;
+	int i;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
+
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+	/* In flat mode, there is only root group */
+	return q->elevator->efqd.root_group;
+}
+EXPORT_SYMBOL(io_get_io_group);
 #endif /* GROUP_IOSCHED */
+
 /* Elevator fair queuing function */
 static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
 {
@@ -1375,10 +2126,14 @@ void elv_put_ioq(struct io_queue *ioq)
 	struct elv_fq_data *efqd = ioq->efqd;
 	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
 						efqd);
+	struct io_group *iog;
 
 	BUG_ON(atomic_read(&ioq->ref) <= 0);
 	if (!atomic_dec_and_test(&ioq->ref))
 		return;
+
+	iog = ioq_to_io_group(ioq);
+
 	BUG_ON(ioq->nr_queued);
 	BUG_ON(ioq->entity.tree != NULL);
 	BUG_ON(elv_ioq_busy(ioq));
@@ -1390,10 +2145,11 @@ void elv_put_ioq(struct io_queue *ioq)
 	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
 	elv_log_ioq(efqd, ioq, "put_queue");
 	elv_free_ioq(ioq);
+	elv_put_iog(iog);
 }
 EXPORT_SYMBOL(elv_put_ioq);
 
-void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+static void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
 {
 	struct io_queue *ioq = *ioq_ptr;
 
@@ -1485,8 +2241,12 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 	struct request_queue *q = efqd->queue;
 
 	if (ioq) {
-		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
-							efqd->busy_queues);
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
+				" weight=%u group_weight=%u",
+				efqd->busy_queues,
+				ioq->entity.ioprio, ioq->entity.weight,
+				iog_weight(iog));
 		ioq->slice_end = 0;
 
 		elv_clear_ioq_wait_request(ioq);
@@ -1548,6 +2308,7 @@ static void elv_activate_ioq(struct io_queue *ioq, int add_front)
 static void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 					int requeue)
 {
+	requeue = update_requeue(ioq, requeue);
 	bfq_deactivate_entity(&ioq->entity, requeue);
 }
 
@@ -1725,6 +2486,7 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	struct io_queue *ioq;
 	struct elevator_queue *eq = q->elevator;
 	struct io_entity *entity, *new_entity;
+	struct io_group *iog = NULL, *new_iog = NULL;
 
 	ioq = elv_active_ioq(eq);
 
@@ -1735,6 +2497,13 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	new_entity = &new_ioq->entity;
 
 	/*
+	 * In hierarchical setup, one need to traverse up the hierarchy
+	 * till both the queues are children of same parent to make a
+	 * decision whether to do the preemption or not.
+	 */
+	bfq_find_matching_entity(&entity, &new_entity);
+
+	/*
 	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
 	 */
 
@@ -1750,9 +2519,17 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 		return 1;
 
 	/*
-	 * Check with io scheduler if it has additional criterion based on
-	 * which it wants to preempt existing queue.
+	 * If both the queues belong to same group, check with io scheduler
+	 * if it has additional criterion based on which it wants to
+	 * preempt existing queue.
 	 */
+	iog = ioq_to_io_group(ioq);
+	new_iog = ioq_to_io_group(new_ioq);
+
+	if (iog != new_iog)
+		return 0;
+
+
 	if (eq->ops->elevator_should_preempt_fn)
 		return eq->ops->elevator_should_preempt_fn(q,
 						ioq_sched_queue(new_ioq), rq);
@@ -2171,15 +2948,6 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		elv_schedule_dispatch(q);
 }
 
-struct io_group *io_get_io_group(struct request_queue *q)
-{
-	struct elv_fq_data *efqd = &q->elevator->efqd;
-
-	/* In flat mode, there is only root group */
-	return efqd->root_group;
-}
-EXPORT_SYMBOL(io_get_io_group);
-
 void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio)
 {
@@ -2230,53 +2998,6 @@ void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 }
 EXPORT_SYMBOL(io_group_set_async_queue);
 
-/*
- * Release all the io group references to its async queues.
- */
-static void
-io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
-{
-	int i, j;
-
-	for (i = 0; i < 2; i++)
-		for (j = 0; j < IOPRIO_BE_NR; j++)
-			elv_release_ioq(e, &iog->async_queue[i][j]);
-
-	/* Free up async idle queue */
-	elv_release_ioq(e, &iog->async_idle_queue);
-}
-
-static struct io_group *io_alloc_root_group(struct request_queue *q,
-					struct elevator_queue *e, void *key)
-{
-	struct io_group *iog;
-	int i;
-
-	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
-	if (iog == NULL)
-		return NULL;
-
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
-		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
-
-	return iog;
-}
-
-static void io_free_root_group(struct elevator_queue *e)
-{
-	struct io_group *iog = e->efqd.root_group;
-	struct io_service_tree *st;
-	int i;
-
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
-		st = iog->sched_data.service_tree + i;
-		io_flush_idle_tree(st);
-	}
-
-	io_put_io_group_queues(e, iog);
-	kfree(iog);
-}
-
 static void elv_slab_kill(void)
 {
 	/*
@@ -2320,6 +3041,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	efqd->idle_slice_timer.data = (unsigned long) efqd;
 
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
@@ -2339,12 +3061,23 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 void elv_exit_fq_data(struct elevator_queue *e)
 {
 	struct elv_fq_data *efqd = &e->efqd;
+	struct request_queue *q = efqd->queue;
 
 	if (!elv_iosched_fair_queuing_enabled(e))
 		return;
 
 	elv_shutdown_timer_wq(e);
 
+	spin_lock_irq(q->queue_lock);
+	/* This should drop all the io group references of async queues */
+	io_disconnect_groups(e);
+	spin_unlock_irq(q->queue_lock);
+
+	elv_shutdown_timer_wq(e);
+
+	/* Wait for iog->key accessors to exit their grace periods. */
+	synchronize_rcu();
+
 	BUG_ON(timer_pending(&efqd->idle_slice_timer));
 	io_free_root_group(e);
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index d9acb75..c8987c0 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -184,13 +184,49 @@ struct io_queue {
 };
 
 #ifdef CONFIG_GROUP_IOSCHED
+/**
+ * struct io_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both io_queues and io_groups).
+ * @group_node: node to be inserted into the io_cgroup->group_data
+ *              list of the containing cgroup's io_cgroup.
+ * @elv_data_node: node to be inserted into the @efqd->group_list list
+ *             of the groups active on the same device; used for cleanup.
+ * @async_queue: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_queue: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/migration.
+ *
+ * Each (device, cgroup) pair has its own io_group, i.e., for each cgroup
+ * there is a set of io_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @group_node is protected by the io_cgroup lock, and is accessed
+ *      via RCU from its readers.
+ *    o @efqd is protected by the queue lock, RCU is used to access it
+ *      from the readers.
+ *    o All the other fields are protected by the @efqd queue lock.
+ */
 struct io_group {
 	struct io_entity entity;
+	struct hlist_node elv_data_node;
 	struct hlist_node group_node;
 	struct io_sched_data sched_data;
+	atomic_t ref;
 	struct io_entity *my_entity;
 
 	/*
+	 * A cgroup has multiple io_groups, one for each request queue.
+	 * To find the io group belonging to a particular queue, the
+	 * elv_fq_data pointer is stored as a key.
+	 */
+	void *key;
+
+	/*
 	 * async queue for each priority case for RT and BE class.
 	 * Used only for cfq.
 	 */
@@ -198,11 +234,15 @@ struct io_group {
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
 
+	struct rcu_head rcu_head;
+
 	/*
 	 * Used to track any pending rt requests so we can pre-empt current
 	 * non-RT cfqq in service when this value is non-zero.
 	 */
 	unsigned int busy_rt_queues;
+
+	int deleting;
 	unsigned short iocg_id;
 };
 
@@ -245,6 +285,9 @@ struct io_group {
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	/* List of io groups hanging on this elevator */
+	struct hlist_head group_list;
+
 	struct request_queue *queue;
 	unsigned int busy_queues;
 
@@ -407,7 +450,7 @@ static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
 static inline unsigned int bfq_ioprio_to_weight(int ioprio)
 {
 	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
-	return IOPRIO_BE_NR - ioprio;
+	return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX)/IOPRIO_BE_NR;
 }
 
 static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
@@ -430,6 +473,46 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 						sched_data);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void elv_put_iog(struct io_group *iog);
+
+static inline void elv_get_iog(struct io_group *iog)
+{
+	atomic_inc(&iog->ref);
+}
+
+static inline int update_requeue(struct io_queue *ioq, int requeue)
+{
+	struct io_group *iog = ioq_to_io_group(ioq);
+
+	if (iog->deleting == 1)
+		return 0;
+
+	return requeue;
+}
+
+#else /* !GROUP_IOSCHED */
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
+
+static inline void elv_get_iog(struct io_group *iog)
+{
+}
+
+static inline void elv_put_iog(struct io_group *iog)
+{
+}
+
+static inline int update_requeue(struct io_queue *ioq, int requeue)
+{
+	return requeue;
+}
+
+#endif /* GROUP_IOSCHED */
+
 extern ssize_t elv_slice_idle_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_slice_idle_store(struct elevator_queue *q, const char *name,
 						size_t count);
@@ -477,7 +560,7 @@ extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio);
 extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 					int ioprio, struct io_queue *ioq);
-extern struct io_group *io_get_io_group(struct request_queue *q);
+extern struct io_group *io_get_io_group(struct request_queue *q, int create);
 extern int elv_nr_busy_ioq(struct elevator_queue *e);
 extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
 extern void elv_free_ioq(struct io_queue *ioq);
@@ -528,5 +611,11 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
 {
 	return NULL;
 }
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+
+{
+	return 1;
+}
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 357f529..a6ef1f1 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -100,6 +100,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (bio_integrity(bio) != blk_integrity_rq(rq))
 		return 0;
 
+	/* If rq and bio belong to different groups, don't allow merging */
+	if (!io_group_allow_merge(rq, bio))
+		return 0;
+
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;
 
-- 
1.6.0.6



* [PATCH 09/25] io-controller: Common hierarchical fair queuing code in elevator layer
@ 2009-07-02 20:01   ` Vivek Goyal
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

o This patch enables hierarchical fair queuing in common layer. It is
  controlled by config option CONFIG_GROUP_IOSCHED.

o Requests keep a reference on ioq and ioq keeps a reference
  on groups. For async queues in CFQ, and the single ioq in other
  schedulers, io_group also keeps a reference on io_queue. This
  reference on ioq is dropped when the queue is released
  (elv_release_ioq), so the queue can be freed.

  When a queue is released, it puts the reference to io_group and the
  io_group is released after all the queues are released. Child groups
  also take reference on parent groups, and release it when they are
  destroyed.

o Reads of iocg->group_data are not always done under iocg->lock, so all
  the operations on that list are still protected by RCU. All modifications
  to iocg->group_data should always be done under iocg->lock.

  Whenever iocg->lock and queue_lock can both be held, queue_lock should
  be taken first. This ordering avoids deadlocks between the two locks.
  In order to avoid a race between cgroup deletion and elevator switch,
  the following algorithm is used:

	- Cgroup deletion path holds iocg->lock and removes the iog entry
	  from the iocg->group_data list. Then it drops iocg->lock, holds
	  queue_lock and destroys iog. So in this path, we never hold
	  iocg->lock and queue_lock at the same time. Also, since we
	  remove iog from iocg->group_data under iocg->lock, we can't
	  race with elevator switch.

	- Elevator switch path does not remove iog from the
	  iocg->group_data list directly. It first holds iocg->lock and
	  scans iocg->group_data again to see if iog is still there;
	  it removes iog only if it finds iog there. Otherwise, cgroup
	  deletion must have removed it from the list, and cgroup
	  deletion is responsible for destroying iog.

  So the path which removes iog from iocg->group_data list does
  the final removal of iog by calling __io_destroy_group()
  function.
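
  The "whoever unlinks, destroys" rule above can be sketched in userspace
  C roughly as follows. This is only an illustrative model under stated
  assumptions, not code from the patch: struct cgroup_side, struct group,
  group_unlink() and teardown_path() are hypothetical names, a pthread
  mutex stands in for iocg->lock, queue_lock is left out entirely, and a
  flag replaces the real free.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these names come from the patch. */
struct cgroup_side {
	pthread_mutex_t lock;	/* plays the role of iocg->lock */
};

struct group {
	struct cgroup_side *cg;
	bool linked;		/* still on the cgroup's group list? */
	int ref;		/* creation reference plus one per queue */
	bool destroyed;		/* set when the last reference is dropped */
};

/* Drop one reference; the group goes away only when the count hits zero. */
static void group_put(struct group *g)
{
	assert(g->ref > 0);
	if (--g->ref == 0)
		g->destroyed = true;	/* a real implementation would free here */
}

/* Unlink under the cgroup lock; true only for the caller that did the unlink. */
static bool group_unlink(struct group *g)
{
	bool won;

	pthread_mutex_lock(&g->cg->lock);
	won = g->linked;
	g->linked = false;
	pthread_mutex_unlock(&g->cg->lock);
	return won;
}

/* Used by both teardown paths: the one that unlinks drops the creation ref. */
static bool teardown_path(struct group *g)
{
	if (!group_unlink(g))
		return false;	/* the other path already owns destruction */
	group_put(g);
	return true;
}
```

  In this model, calling teardown_path() from both the cgroup deletion
  side and the elevator switch side drops the creation reference exactly
  once, and the group survives until the last queue reference is put.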

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    2 +
 block/elevator-fq.c |  885 ++++++++++++++++++++++++++++++++++++++++++++++-----
 block/elevator-fq.h |   93 ++++++-
 block/elevator.c    |    4 +
 4 files changed, 906 insertions(+), 78 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f852b00..6ddc882 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1310,6 +1310,8 @@ alloc_ioq:
 			elv_mark_ioq_sync(cfqq->ioq);
 		}
 		cfqq->pid = current->pid;
+		/* ioq reference on iog */
+		elv_get_iog(iog);
 		cfq_log_cfqq(cfqd, cfqq, "alloced");
 	}
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 84276d5..f8d0b90 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -45,6 +45,9 @@ static int elv_rate_sampling_window = HZ / 10;
  */
 #define WFQ_SERVICE_SHIFT	22
 
+static void
+elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
+
 #ifdef CONFIG_GROUP_IOSCHED
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = entity->parent)
@@ -90,6 +93,69 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
 {
 	BUG_ON(sd->next_active != entity);
 }
+
+static inline int iog_deleting(struct io_group *iog)
+{
+	return iog->deleting;
+}
+
+/* Do the two (enqueued) entities belong to the same group ? */
+static inline int
+is_same_group(struct io_entity *entity, struct io_entity *new_entity)
+{
+	if (entity->sched_data == new_entity->sched_data)
+		return 1;
+
+	return 0;
+}
+
+static inline struct io_entity *parent_entity(struct io_entity *entity)
+{
+	return entity->parent;
+}
+
+/* Return the depth at which an io entity is present in the hierarchy */
+static inline int depth_entity(struct io_entity *entity)
+{
+	int depth = 0;
+
+	for_each_entity(entity)
+		depth++;
+
+	return depth;
+}
+
+static void bfq_find_matching_entity(struct io_entity **entity,
+			struct io_entity **new_entity)
+{
+	int entity_depth, new_entity_depth;
+
+	/*
+	 * A preemption test can only be made between sibling entities,
+	 * i.e. entities in the same group with a common parent. Walk up
+	 * the hierarchy of both entities until we find their ancestors
+	 * which are siblings under a common parent.
+	 */
+
+	/* First walk up until both entities are at same depth */
+	entity_depth = depth_entity(*entity);
+	new_entity_depth = depth_entity(*new_entity);
+
+	while (entity_depth > new_entity_depth) {
+		entity_depth--;
+		*entity = parent_entity(*entity);
+	}
+
+	while (new_entity_depth > entity_depth) {
+		new_entity_depth--;
+		*new_entity = parent_entity(*new_entity);
+	}
+
+	while (!is_same_group(*entity, *new_entity)) {
+		*entity = parent_entity(*entity);
+		*new_entity = parent_entity(*new_entity);
+	}
+}
 #else /* GROUP_IOSCHED */
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
@@ -106,6 +172,17 @@ static inline void bfq_check_next_active(struct io_sched_data *sd,
 					 struct io_entity *entity)
 {
 }
+
+static inline int iog_deleting(struct io_group *iog)
+{
+	/* In flat mode, root cgroup can't be deleted. */
+	return 0;
+}
+
+static void bfq_find_matching_entity(struct io_entity **entity,
+					struct io_entity **new_entity)
+{
+}
 #endif /* GROUP_IOSCHED */
 
 static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
@@ -363,13 +440,6 @@ static void bfq_get_entity(struct io_entity *entity)
 		elv_get_ioq(ioq);
 }
 
-static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
-{
-	entity->ioprio = entity->new_ioprio;
-	entity->ioprio_class = entity->new_ioprio_class;
-	entity->sched_data = &iog->sched_data;
-}
-
 /**
  * bfq_find_deepest - find the deepest node that an extraction can modify.
  * @node: the node being removed.
@@ -833,8 +903,26 @@ static int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 {
 	struct io_sched_data *sd;
+	struct io_group *iog, *__iog;
 	struct io_entity *parent;
 
+	iog = container_of(entity->sched_data, struct io_group, sched_data);
+
+	/*
+	 * Hold a reference to entity's iog until we are done. This function
+	 * travels the hierarchy and we don't want to free up the group
+	 * while we are traversing the hierarchy. It is possible that this
+	 * group's cgroup has been removed, hence the cgroup reference is gone.
+	 * If this entity was the active entity, then its group will not be on
+	 * any of the trees and it will be freed up the moment the queue is
+	 * freed up in __bfq_deactivate_entity().
+	 *
+	 * Hence, hold a reference, deactivate the hierarchy of entities and
+	 * then drop the reference, which should free up the whole chain of
+	 * groups.
+	 */
+	elv_get_iog(iog);
+
 	for_each_entity_safe(entity, parent) {
 		sd = entity->sched_data;
 
@@ -852,6 +940,7 @@ static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 			 * the budgets on the path towards the root
 			 * need to be updated.
 			 */
+			elv_put_iog(iog);
 			goto update;
 		}
 
@@ -859,11 +948,16 @@ static void bfq_deactivate_entity(struct io_entity *entity, int requeue)
 		 * If we reach there the parent is no more backlogged and
 		 * we want to propagate the dequeue upwards.
 		 *
+		 * If entity's group has been marked for deletion, don't
+		 * requeue the group in idle tree so that it can be freed.
 		 */
-
-		requeue = 1;
+		__iog = container_of(entity->sched_data, struct io_group,
+						sched_data);
+		if (!iog_deleting(__iog))
+			requeue = 1;
 	}
 
+	elv_put_iog(iog);
 	return;
 
 update:
@@ -902,8 +996,59 @@ static void io_flush_idle_tree(struct io_service_tree *st)
 		__bfq_deactivate_entity(entity, 0);
 }
 
+/*
+ * Release all the io group references to its async queues.
+ */
+static void
+io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
+{
+	int i, j;
+
+	for (i = 0; i < 2; i++)
+		for (j = 0; j < IOPRIO_BE_NR; j++)
+			elv_release_ioq(e, &iog->async_queue[i][j]);
+
+	/* Free up async idle queue */
+	elv_release_ioq(e, &iog->async_idle_queue);
+}
+
 /* Mainly hierarchical grouping code */
 #ifdef CONFIG_GROUP_IOSCHED
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup);
+
+static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->weight = entity->new_weight;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->parent = iog->my_entity;
+	entity->sched_data = &iog->sched_data;
+}
+
+static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+{
+	struct io_entity *entity = &iog->entity;
+
+	entity->weight = entity->new_weight = iocg->weight;
+	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
+	entity->ioprio_changed = 1;
+	entity->my_sched_data = &iog->sched_data;
+}
+
+static void io_group_set_parent(struct io_group *iog, struct io_group *parent)
+{
+	struct io_entity *entity;
+
+	BUG_ON(parent == NULL);
+	BUG_ON(iog == NULL);
+
+	entity = &iog->entity;
+	entity->parent = parent->my_entity;
+	entity->sched_data = &parent->sched_data;
+	if (entity->parent)
+		/* Child group reference on parent group. */
+		elv_get_iog(parent);
+}
 
 struct io_cgroup io_root_cgroup = {
 	.weight = IO_DEFAULT_GRP_WEIGHT,
@@ -916,6 +1061,26 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+/*
+ * Search for the io_group for efqd in the hash table (for now only a list)
+ * of iocg.  Must be called under rcu_read_lock().
+ */
+static struct io_group *
+io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
+{
+	struct io_group *iog;
+	struct hlist_node *n;
+	void *__key;
+
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		__key = rcu_dereference(iog->key);
+		if (__key == key)
+			return iog;
+	}
+
+	return NULL;
+}
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -1056,12 +1221,6 @@ static void iocg_attach(struct cgroup_subsys *subsys, struct cgroup *cgroup,
 	task_unlock(tsk);
 }
 
-static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
-{
-
-	/* Implemented in later patch */
-}
-
 struct cgroup_subsys io_subsys = {
 	.name = "io",
 	.create = iocg_create,
@@ -1072,7 +1231,599 @@ struct cgroup_subsys io_subsys = {
 	.subsys_id = io_subsys_id,
 	.use_id = 1,
 };
+
+static inline unsigned int iog_weight(struct io_group *iog)
+{
+	return iog->entity.weight;
+}
+
+/**
+ * io_group_chain_alloc - allocate a chain of groups.
+ * @efqd: queue descriptor.
+ * @cgroup: the leaf cgroup this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup.  Stop if a cgroup on the chain
+ * to the root has already an allocated group on @efqd.
+ */
+static struct io_group *
+io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *leaf = NULL, *prev = NULL;
+	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+
+	for (; cgroup != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		if (iog != NULL) {
+			/*
+			 * All the cgroups in the path from here to the
+			 * root must have an io_group for efqd, so we don't
+			 * need any more allocations.
+			 */
+			break;
+		}
+
+		iog = kzalloc_node(sizeof(*iog), flags, q->node);
+		if (!iog)
+			goto cleanup;
+
+		iog->iocg_id = css_id(&iocg->css);
+
+		io_group_init_entity(iocg, iog);
+		iog->my_entity = &iog->entity;
+
+		atomic_set(&iog->ref, 0);
+		iog->deleting = 0;
+
+		/*
+		 * Take the initial reference that will be released on destroy.
+		 * This can be thought of as a joint reference by cgroup and
+		 * elevator which will be dropped by either the elevator exit
+		 * or the cgroup deletion path, depending on who exits first.
+		 */
+		elv_get_iog(iog);
+
+		if (leaf == NULL) {
+			leaf = iog;
+			prev = leaf;
+		} else {
+			io_group_set_parent(prev, iog);
+			/*
+			 * Build a list of allocated nodes using the key
+			 * field, which is still unused and will be initialized
+			 * only after the node is connected.
+			 */
+			prev->key = iog;
+			prev = iog;
+		}
+	}
+
+	return leaf;
+
+cleanup:
+	while (leaf != NULL) {
+		prev = leaf;
+		leaf = leaf->key;
+		kfree(prev);
+	}
+
+	return NULL;
+}
+
+/**
+ * io_group_chain_link - link an allocated group chain to a cgroup hierarchy.
+ * @efqd: the queue descriptor.
+ * @cgroup: the leaf cgroup to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @efqd all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the io_cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+static void io_group_chain_link(struct request_queue *q, void *key,
+				struct cgroup *cgroup,
+				struct io_group *leaf,
+				struct elv_fq_data *efqd)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog, *next, *prev = NULL;
+	unsigned long flags;
+
+	assert_spin_locked(q->queue_lock);
+
+	for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		next = leaf->key;
+
+		iog = io_cgroup_lookup_group(iocg, key);
+		BUG_ON(iog != NULL);
+
+		spin_lock_irqsave(&iocg->lock, flags);
+
+		rcu_assign_pointer(leaf->key, key);
+		hlist_add_head_rcu(&leaf->group_node, &iocg->group_data);
+		hlist_add_head(&leaf->elv_data_node, &efqd->group_list);
+
+		spin_unlock_irqrestore(&iocg->lock, flags);
+
+		prev = leaf;
+		leaf = next;
+	}
+
+	BUG_ON(cgroup == NULL && leaf != NULL);
+
+	if (cgroup != NULL && prev != NULL) {
+		iocg = cgroup_to_io_cgroup(cgroup);
+		iog = io_cgroup_lookup_group(iocg, key);
+		io_group_set_parent(prev, iog);
+	}
+}
+
+/**
+ * io_find_alloc_group - return the group associated to @efqd in @cgroup.
+ * @fqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ * @create: if set to 1, create the io group if it has not been created yet.
+ *
+ * Return a group associated to @fqd in @cgroup, allocating one if
+ * necessary.  When a group is returned all the cgroups in the path
+ * to the root have a group associated to @efqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback.  If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+static struct io_group *io_find_alloc_group(struct request_queue *q,
+			struct cgroup *cgroup, struct elv_fq_data *efqd,
+			int create)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog = NULL;
+	/* Note: Use efqd as key */
+	void *key = efqd;
+
+	/*
+	 * Take a reference to the css object. We don't want to map a bio to
+	 * a group if it has been marked for deletion.
+	 */
+
+	if (!css_tryget(&iocg->css))
+		return iog;
+
+	iog = io_cgroup_lookup_group(iocg, key);
+	if (iog != NULL || !create)
+		goto end;
+
+	iog = io_group_chain_alloc(q, key, cgroup);
+	if (iog != NULL)
+		io_group_chain_link(q, key, cgroup, iog, efqd);
+
+end:
+	css_put(&iocg->css);
+	return iog;
+}
+
+/*
+ * Search for the io group current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ *
+ * Note: This function should be called with queue lock held. It returns
+ * a pointer to io group without taking any reference. That group will
+ * be around as long as queue lock is not dropped (as group reclaim code
+ * needs to get hold of queue lock). So if somebody needs to use group
+ * pointer even after dropping queue lock, take a reference to the group
+ * before dropping queue lock.
+ */
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+	struct cgroup *cgroup;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &q->elevator->efqd;
+
+	assert_spin_locked(q->queue_lock);
+
+	rcu_read_lock();
+	cgroup = task_cgroup(current, io_subsys_id);
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			/*
+			 * bio merge functions doing lookup don't want to
+			 * map bio to root group by default
+			 */
+			iog = NULL;
+	}
+	rcu_read_unlock();
+	return iog;
+}
+EXPORT_SYMBOL(io_get_io_group);
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_cgroup *iocg = &io_root_cgroup;
+	struct elv_fq_data *efqd = &e->efqd;
+	struct io_group *iog = efqd->root_group;
+	struct io_service_tree *st;
+	int i;
+
+	BUG_ON(!iog);
+	spin_lock_irq(&iocg->lock);
+	hlist_del_rcu(&iog->group_node);
+	spin_unlock_irq(&iocg->lock);
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
+
+	io_put_io_group_queues(e, iog);
+	elv_put_iog(iog);
+}
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	struct io_cgroup *iocg;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	elv_get_iog(iog);
+	iog->entity.parent = NULL;
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	iocg = &io_root_cgroup;
+	spin_lock_irq(&iocg->lock);
+	rcu_assign_pointer(iog->key, key);
+	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
+	iog->iocg_id = css_id(&iocg->css);
+	spin_unlock_irq(&iocg->lock);
+
+	return iog;
+}
+
+static void io_group_free_rcu(struct rcu_head *head)
+{
+	struct io_group *iog;
+
+	iog = container_of(head, struct io_group, rcu_head);
+	kfree(iog);
+}
+
+/*
+ * This cleanup function does the final steps of destroying an io group.
+ * It should only get called after io_destroy_group has been invoked.
+ */
+static void io_group_cleanup(struct io_group *iog)
+{
+	struct io_service_tree *st;
+	struct io_entity *entity = iog->my_entity;
+	int i;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		BUG_ON(!RB_EMPTY_ROOT(&st->active));
+		BUG_ON(!RB_EMPTY_ROOT(&st->idle));
+		BUG_ON(st->wsum != 0);
+	}
+
+	BUG_ON(iog->sched_data.next_active != NULL);
+	BUG_ON(iog->sched_data.active_entity != NULL);
+	BUG_ON(entity != NULL && entity->tree != NULL);
+
+	/*
+	 * Wait for any rcu readers to exit before freeing up the group.
+	 * Primarily useful when io_get_io_group() is called without queue
+	 * lock to access some group data from bdi_congested_group() path.
+	 */
+	call_rcu(&iog->rcu_head, io_group_free_rcu);
+}
+
+void elv_put_iog(struct io_group *iog)
+{
+	struct io_group *parent = NULL;
+	struct io_entity *entity;
+
+	BUG_ON(!iog);
+
+	entity = iog->my_entity;
+
+	BUG_ON(atomic_read(&iog->ref) <= 0);
+	if (!atomic_dec_and_test(&iog->ref))
+		return;
+
+	if (entity)
+		parent = container_of(iog->my_entity->parent,
+					struct io_group, entity);
+
+	io_group_cleanup(iog);
+
+	if (parent)
+		elv_put_iog(parent);
+}
+EXPORT_SYMBOL(elv_put_iog);
+
+/*
+ * Check whether a given group has any active entities on any of its
+ * service trees.
+ */
+static inline int io_group_has_active_entities(struct io_group *iog)
+{
+	int i;
+	struct io_service_tree *st;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		if (!RB_EMPTY_ROOT(&st->active))
+			return 1;
+	}
+
+	/*
+	 * Also check there are no active entities being served which are
+	 * not on active tree
+	 */
+
+	if (iog->sched_data.active_entity)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * After the group is destroyed, no new sync IO should come to the group.
+ * It might still have pending IOs in some busy queues. It should be able to
+ * send those IOs down to the disk. The async IOs (due to dirty page writeback)
+ * would go in the root group queues after this, as the group does not exist
+ * anymore.
+ */
+static void __io_destroy_group(struct elv_fq_data *efqd, struct io_group *iog)
+{
+	struct elevator_queue *eq;
+	struct io_service_tree *st;
+	int i;
+
+	BUG_ON(iog->my_entity == NULL);
+
+	/*
+	 * Mark io group for deletion so that no new entry goes in
+	 * idle tree. Any active queue will be removed from active
+	 * tree and not put into idle tree.
+	 */
+	iog->deleting = 1;
+
+	/* We flush idle tree now, and don't put things in there any more. */
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+
+		io_flush_idle_tree(st);
+	}
+
+	eq = container_of(efqd, struct elevator_queue, efqd);
+	hlist_del(&iog->elv_data_node);
+	io_put_io_group_queues(eq, iog);
+
+	/*
+	 * We can come here either through cgroup deletion path or through
+	 * elevator exit path. If we come here through cgroup deletion path
+	 * check if io group has any active entities or not. If not, then
+	 * deactivate this io group to make sure it is removed from idle
+	 * tree it might have been on. If this group was on idle tree, then
+	 * this probably will be the last reference and group will be
+	 * freed upon putting the reference down.
+	 */
+
+	if (!io_group_has_active_entities(iog)) {
+		/*
+		 * io group does not have any active entities. Because this
+		 * group has been decoupled from io_cgroup list and this
+		 * cgroup is being deleted, this group should not receive
+		 * any new IO. Hence it should be safe to deactivate this
+		 * io group and remove from the scheduling tree.
+		 */
+		__bfq_deactivate_entity(iog->my_entity, 0);
+	}
+
+	/*
+	 * Put the reference taken at the time of creation so that when all
+	 * queues are gone, cgroup can be destroyed.
+	 */
+	elv_put_iog(iog);
+}
+
+static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	struct io_cgroup *iocg = cgroup_to_io_cgroup(cgroup);
+	struct io_group *iog;
+	struct elv_fq_data *efqd;
+	unsigned long uninitialized_var(flags);
+
+	/*
+	 * io groups are linked in two lists. One list is maintained
+	 * in elevator (efqd->group_list) and other is maintained
+	 * per cgroup structure (iocg->group_data).
+	 *
+	 * While a cgroup is being deleted, elevator also might be
+	 * exiting and both might try to cleanup the same io group
+	 * so we need to be a little careful.
+	 *
+	 * (iocg->group_data) is protected by iocg->lock. To avoid deadlock,
+	 * we can't hold the queue lock while holding iocg->lock. So we first
+	 * remove iog from iocg->group_data under iocg->lock. Whoever removes
+	 * iog from iocg->group_data should call __io_destroy_group to remove
+	 * iog.
+	 */
+
+	rcu_read_lock();
+
+remove_entry:
+	spin_lock_irqsave(&iocg->lock, flags);
+
+	if (hlist_empty(&iocg->group_data)) {
+		spin_unlock_irqrestore(&iocg->lock, flags);
+		goto done;
+	}
+	iog = hlist_entry(iocg->group_data.first, struct io_group,
+			  group_node);
+	efqd = rcu_dereference(iog->key);
+	hlist_del_rcu(&iog->group_node);
+	iog->iocg_id = 0;
+	spin_unlock_irqrestore(&iocg->lock, flags);
+
+	spin_lock_irqsave(efqd->queue->queue_lock, flags);
+	__io_destroy_group(efqd, iog);
+	spin_unlock_irqrestore(efqd->queue->queue_lock, flags);
+	goto remove_entry;
+
+done:
+	free_css_id(&io_subsys, &iocg->css);
+	rcu_read_unlock();
+	BUG_ON(!hlist_empty(&iocg->group_data));
+	kfree(iocg);
+}
+
+/*
+ * This function checks if iog is still in iocg->group_data, and removes it.
+ * If iog is not in that list, then cgroup destroy path has removed it, and
+ * we do not need to remove it.
+ */
+static void io_group_check_and_destroy(struct elv_fq_data *efqd,
+					struct io_group *iog)
+{
+	struct io_cgroup *iocg;
+	unsigned long flags;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+
+	css = css_lookup(&io_subsys, iog->iocg_id);
+
+	if (!css)
+		goto out;
+
+	iocg = container_of(css, struct io_cgroup, css);
+
+	spin_lock_irqsave(&iocg->lock, flags);
+
+	if (iog->iocg_id) {
+		hlist_del_rcu(&iog->group_node);
+		__io_destroy_group(efqd, iog);
+	}
+
+	spin_unlock_irqrestore(&iocg->lock, flags);
+out:
+	rcu_read_unlock();
+}
+
+static void io_disconnect_groups(struct elevator_queue *e)
+{
+	struct hlist_node *pos, *n;
+	struct io_group *iog;
+	struct elv_fq_data *efqd = &e->efqd;
+
+	hlist_for_each_entry_safe(iog, pos, n, &efqd->group_list,
+					elv_data_node) {
+		io_group_check_and_destroy(efqd, iog);
+	}
+}
+
+/*
+ * if bio submitting task and rq don't belong to same io_group, it can't
+ * be merged
+ */
+int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	struct request_queue *q = rq->q;
+	struct io_queue *ioq = rq->ioq;
+	struct io_group *iog, *__iog;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return 1;
+
+	/* Determine the io group of the bio submitting task */
+	iog = io_get_io_group(q, 0);
+	if (!iog) {
+		/* May be task belongs to a different cgroup for which io
+		 * group has not been setup yet. */
+		return 0;
+	}
+
+	/* Determine the io group of the ioq that rq belongs to */
+	__iog = ioq_to_io_group(ioq);
+
+	return (iog == __iog);
+}
+#else /* GROUP_IOSCHED */
+static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
+{
+	entity->ioprio = entity->new_ioprio;
+	entity->weight = entity->new_weight;
+	entity->ioprio_class = entity->new_ioprio_class;
+	entity->sched_data = &iog->sched_data;
+}
+
+static inline void io_disconnect_groups(struct elevator_queue *e) {}
+static inline unsigned int iog_weight(struct io_group *iog) { return 0; }
+
+static struct io_group *io_alloc_root_group(struct request_queue *q,
+					struct elevator_queue *e, void *key)
+{
+	struct io_group *iog;
+	int i;
+
+	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (iog == NULL)
+		return NULL;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
+		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
+
+	return iog;
+}
+
+static void io_free_root_group(struct elevator_queue *e)
+{
+	struct io_group *iog = e->efqd.root_group;
+	struct io_service_tree *st;
+	int i;
+
+	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
+		st = iog->sched_data.service_tree + i;
+		io_flush_idle_tree(st);
+	}
+
+	io_put_io_group_queues(e, iog);
+	kfree(iog);
+}
+
+struct io_group *io_get_io_group(struct request_queue *q, int create)
+{
+	/* In flat mode, there is only root group */
+	return q->elevator->efqd.root_group;
+}
+EXPORT_SYMBOL(io_get_io_group);
 #endif /* GROUP_IOSCHED */
+
 /* Elevator fair queuing function */
 static inline struct io_queue *elv_active_ioq(struct elevator_queue *e)
 {
@@ -1375,10 +2126,14 @@ void elv_put_ioq(struct io_queue *ioq)
 	struct elv_fq_data *efqd = ioq->efqd;
 	struct elevator_queue *e = container_of(efqd, struct elevator_queue,
 						efqd);
+	struct io_group *iog;
 
 	BUG_ON(atomic_read(&ioq->ref) <= 0);
 	if (!atomic_dec_and_test(&ioq->ref))
 		return;
+
+	iog = ioq_to_io_group(ioq);
+
 	BUG_ON(ioq->nr_queued);
 	BUG_ON(ioq->entity.tree != NULL);
 	BUG_ON(elv_ioq_busy(ioq));
@@ -1390,10 +2145,11 @@ void elv_put_ioq(struct io_queue *ioq)
 	e->ops->elevator_free_sched_queue_fn(e, ioq->sched_queue);
 	elv_log_ioq(efqd, ioq, "put_queue");
 	elv_free_ioq(ioq);
+	elv_put_iog(iog);
 }
 EXPORT_SYMBOL(elv_put_ioq);
 
-void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
+static void elv_release_ioq(struct elevator_queue *e, struct io_queue **ioq_ptr)
 {
 	struct io_queue *ioq = *ioq_ptr;
 
@@ -1485,8 +2241,12 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 	struct request_queue *q = efqd->queue;
 
 	if (ioq) {
-		elv_log_ioq(efqd, ioq, "set_active, busy=%d",
-							efqd->busy_queues);
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
+				" weight=%u group_weight=%u",
+				efqd->busy_queues,
+				ioq->entity.ioprio, ioq->entity.weight,
+				iog_weight(iog));
 		ioq->slice_end = 0;
 
 		elv_clear_ioq_wait_request(ioq);
@@ -1548,6 +2308,7 @@ static void elv_activate_ioq(struct io_queue *ioq, int add_front)
 static void elv_deactivate_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 					int requeue)
 {
+	requeue = update_requeue(ioq, requeue);
 	bfq_deactivate_entity(&ioq->entity, requeue);
 }
 
@@ -1725,6 +2486,7 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	struct io_queue *ioq;
 	struct elevator_queue *eq = q->elevator;
 	struct io_entity *entity, *new_entity;
+	struct io_group *iog = NULL, *new_iog = NULL;
 
 	ioq = elv_active_ioq(eq);
 
@@ -1735,6 +2497,13 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	new_entity = &new_ioq->entity;
 
 	/*
+	 * In hierarchical setup, one needs to traverse up the hierarchy
+	 * till both the queues are children of same parent to make a
+	 * decision whether to do the preemption or not.
+	 */
+	bfq_find_matching_entity(&entity, &new_entity);
+
+	/*
 	 * Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
 	 */
 
@@ -1750,9 +2519,17 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 		return 1;
 
 	/*
-	 * Check with io scheduler if it has additional criterion based on
-	 * which it wants to preempt existing queue.
+	 * If both the queues belong to same group, check with io scheduler
+	 * if it has additional criterion based on which it wants to
+	 * preempt existing queue.
 	 */
+	iog = ioq_to_io_group(ioq);
+	new_iog = ioq_to_io_group(new_ioq);
+
+	if (iog != new_iog)
+		return 0;
+
+
 	if (eq->ops->elevator_should_preempt_fn)
 		return eq->ops->elevator_should_preempt_fn(q,
 						ioq_sched_queue(new_ioq), rq);
@@ -2171,15 +2948,6 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		elv_schedule_dispatch(q);
 }
 
-struct io_group *io_get_io_group(struct request_queue *q)
-{
-	struct elv_fq_data *efqd = &q->elevator->efqd;
-
-	/* In flat mode, there is only root group */
-	return efqd->root_group;
-}
-EXPORT_SYMBOL(io_get_io_group);
-
 void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio)
 {
@@ -2230,53 +2998,6 @@ void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 }
 EXPORT_SYMBOL(io_group_set_async_queue);
 
-/*
- * Release all the io group references to its async queues.
- */
-static void
-io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
-{
-	int i, j;
-
-	for (i = 0; i < 2; i++)
-		for (j = 0; j < IOPRIO_BE_NR; j++)
-			elv_release_ioq(e, &iog->async_queue[i][j]);
-
-	/* Free up async idle queue */
-	elv_release_ioq(e, &iog->async_idle_queue);
-}
-
-static struct io_group *io_alloc_root_group(struct request_queue *q,
-					struct elevator_queue *e, void *key)
-{
-	struct io_group *iog;
-	int i;
-
-	iog = kmalloc_node(sizeof(*iog), GFP_KERNEL | __GFP_ZERO, q->node);
-	if (iog == NULL)
-		return NULL;
-
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
-		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
-
-	return iog;
-}
-
-static void io_free_root_group(struct elevator_queue *e)
-{
-	struct io_group *iog = e->efqd.root_group;
-	struct io_service_tree *st;
-	int i;
-
-	for (i = 0; i < IO_IOPRIO_CLASSES; i++) {
-		st = iog->sched_data.service_tree + i;
-		io_flush_idle_tree(st);
-	}
-
-	io_put_io_group_queues(e, iog);
-	kfree(iog);
-}
-
 static void elv_slab_kill(void)
 {
 	/*
@@ -2320,6 +3041,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	efqd->idle_slice_timer.data = (unsigned long) efqd;
 
 	INIT_WORK(&efqd->unplug_work, elv_kick_queue);
+	INIT_HLIST_HEAD(&efqd->group_list);
 
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
@@ -2339,12 +3061,23 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 void elv_exit_fq_data(struct elevator_queue *e)
 {
 	struct elv_fq_data *efqd = &e->efqd;
+	struct request_queue *q = efqd->queue;
 
 	if (!elv_iosched_fair_queuing_enabled(e))
 		return;
 
 	elv_shutdown_timer_wq(e);
 
+	spin_lock_irq(q->queue_lock);
+	/* This should drop all the io group references of async queues */
+	io_disconnect_groups(e);
+	spin_unlock_irq(q->queue_lock);
+
+	elv_shutdown_timer_wq(e);
+
+	/* Wait for iog->key accessors to leave their RCU read-side sections. */
+	synchronize_rcu();
+
 	BUG_ON(timer_pending(&efqd->idle_slice_timer));
 	io_free_root_group(e);
 }
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index d9acb75..c8987c0 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -184,13 +184,49 @@ struct io_queue {
 };
 
 #ifdef CONFIG_GROUP_IOSCHED
+/**
+ * struct io_group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched_data.
+ * @sched_data: own sched_data, to contain child entities (they may be
+ *              both io_queues and io_groups).
+ * @group_node: node to be inserted into the io_cgroup->group_data
+ *              list of the containing cgroup's io_cgroup.
+ * @elv_data_node: node to be inserted into the @efqd->group_list list
+ *             of the groups active on the same device; used for cleanup.
+ * @async_queue: array of async queues for all the tasks belonging to
+ *              the group, one queue per ioprio value per ioprio_class,
+ *              except for the idle class that has only one queue.
+ * @async_idle_queue: async queue for the idle class (ioprio is ignored).
+ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
+ *             to avoid too many special cases during group creation/migration.
+ *
+ * Each (device, cgroup) pair has its own io_group, i.e., for each cgroup
+ * there is a set of io_groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ *    o @group_node is protected by the io_cgroup lock, and is accessed
+ *      via RCU from its readers.
+ *    o @efqd is protected by the queue lock, RCU is used to access it
+ *      from the readers.
+ *    o All the other fields are protected by the @efqd queue lock.
+ */
 struct io_group {
 	struct io_entity entity;
+	struct hlist_node elv_data_node;
 	struct hlist_node group_node;
 	struct io_sched_data sched_data;
+	atomic_t ref;
 	struct io_entity *my_entity;
 
 	/*
+	 * A cgroup has multiple io_groups, one for each request queue.
+	 * To find the io group belonging to a particular queue, elv_fq_data
+	 * pointer is stored as a key.
+	 */
+	void *key;
+
+	/*
 	 * async queue for each priority case for RT and BE class.
 	 * Used only for cfq.
 	 */
@@ -198,11 +234,15 @@ struct io_group {
 	struct io_queue *async_queue[2][IOPRIO_BE_NR];
 	struct io_queue *async_idle_queue;
 
+	struct rcu_head rcu_head;
+
 	/*
 	 * Used to track any pending rt requests so we can pre-empt current
 	 * non-RT cfqq in service when this value is non-zero.
 	 */
 	unsigned int busy_rt_queues;
+
+	int deleting;
 	unsigned short iocg_id;
 };
 
@@ -245,6 +285,9 @@ struct io_group {
 struct elv_fq_data {
 	struct io_group *root_group;
 
+	/* List of io groups hanging on this elevator */
+	struct hlist_head group_list;
+
 	struct request_queue *queue;
 	unsigned int busy_queues;
 
@@ -407,7 +450,7 @@ static inline void elv_ioq_set_ioprio_class(struct io_queue *ioq,
 static inline unsigned int bfq_ioprio_to_weight(int ioprio)
 {
 	WARN_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
-	return IOPRIO_BE_NR - ioprio;
+	return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX)/IOPRIO_BE_NR;
 }
 
 static inline void elv_ioq_set_ioprio(struct io_queue *ioq, int ioprio)
@@ -430,6 +473,46 @@ static inline struct io_group *ioq_to_io_group(struct io_queue *ioq)
 						sched_data);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int io_group_allow_merge(struct request *rq, struct bio *bio);
+extern void elv_put_iog(struct io_group *iog);
+
+static inline void elv_get_iog(struct io_group *iog)
+{
+	atomic_inc(&iog->ref);
+}
+
+static inline int update_requeue(struct io_queue *ioq, int requeue)
+{
+	struct io_group *iog = ioq_to_io_group(ioq);
+
+	if (iog->deleting == 1)
+		return 0;
+
+	return requeue;
+}
+
+#else /* !GROUP_IOSCHED */
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+{
+	return 1;
+}
+
+static inline void elv_get_iog(struct io_group *iog)
+{
+}
+
+static inline void elv_put_iog(struct io_group *iog)
+{
+}
+
+static inline int update_requeue(struct io_queue *ioq, int requeue)
+{
+	return requeue;
+}
+
+#endif /* GROUP_IOSCHED */
+
 extern ssize_t elv_slice_idle_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_slice_idle_store(struct elevator_queue *q, const char *name,
 						size_t count);
@@ -477,7 +560,7 @@ extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio);
 extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 					int ioprio, struct io_queue *ioq);
-extern struct io_group *io_get_io_group(struct request_queue *q);
+extern struct io_group *io_get_io_group(struct request_queue *q, int create);
 extern int elv_nr_busy_ioq(struct elevator_queue *e);
 extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
 extern void elv_free_ioq(struct io_queue *ioq);
@@ -528,5 +611,11 @@ static inline void *elv_fq_select_ioq(struct request_queue *q, int force)
 {
 	return NULL;
 }
+
+static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
+
+{
+	return 1;
+}
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 357f529..a6ef1f1 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -100,6 +100,10 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (bio_integrity(bio) != blk_integrity_rq(rq))
 		return 0;
 
+	/* If rq and bio belong to different groups, don't allow merging */
+	if (!io_group_allow_merge(rq, bio))
+		return 0;
+
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;
 
-- 
1.6.0.6
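
For reference, the reworked bfq_ioprio_to_weight() in the elevator-fq.h hunk
above scales best-effort priorities linearly into the weight range. Below is a
minimal standalone sketch of that arithmetic. It assumes WEIGHT_MAX is 1000
(its definition is not part of this hunk, so the value is only for
illustration) and IOPRIO_BE_NR is 8; ioprio_to_weight() is a hypothetical
userspace stand-in, not the kernel helper itself.

```c
#include <assert.h>

/* Assumed values, for illustration only. */
#define IOPRIO_BE_NR 8      /* number of best-effort priority levels */
#define WEIGHT_MAX   1000   /* assumption: not defined in this hunk */

/* Same integer arithmetic as the patched bfq_ioprio_to_weight(). */
static unsigned int ioprio_to_weight(int ioprio)
{
	assert(ioprio >= 0 && ioprio < IOPRIO_BE_NR);
	return ((IOPRIO_BE_NR - ioprio) * WEIGHT_MAX) / IOPRIO_BE_NR;
}
```

With these assumed values, ioprio 0 maps to 1000, ioprio 4 to 500 and ioprio 7
to 125, so the highest priority gets the full weight and each step down
removes an equal share.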

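The lock ordering described in the iocg_destroy() comment above can be
sketched in userspace as follows: unlink one group under the cgroup lock,
drop that lock, and only then take the queue lock to destroy the group. This
is a single-threaded toy model under stated assumptions. All toy_* names are
hypothetical, the "locks" only record whether they are held, and RCU and the
race with elevator exit are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lock that only records whether it is held (no real concurrency). */
struct toy_lock {
	int held;
};

static void toy_acquire(struct toy_lock *l) { assert(!l->held); l->held = 1; }
static void toy_release(struct toy_lock *l) { assert(l->held); l->held = 0; }

struct toy_group {
	struct toy_group *next;       /* link in the per-cgroup list */
	struct toy_lock *queue_lock;  /* lock of the owning device queue */
	int destroyed;
};

struct toy_cgroup {
	struct toy_lock lock;         /* protects the groups list */
	struct toy_group *groups;
};

/* Tear down every group of the cgroup; returns how many were destroyed. */
static int toy_cgroup_destroy(struct toy_cgroup *cg)
{
	int n = 0;

	for (;;) {
		struct toy_group *g;

		/* Step 1: unlink one group under the cgroup lock only. */
		toy_acquire(&cg->lock);
		g = cg->groups;
		if (g)
			cg->groups = g->next;
		toy_release(&cg->lock);

		if (!g)
			break;

		/* Step 2: whoever unlinked it destroys it under the queue lock. */
		assert(!cg->lock.held);   /* the two locks are never nested here */
		toy_acquire(g->queue_lock);
		g->destroyed = 1;
		toy_release(g->queue_lock);
		n++;
	}

	return n;
}
```

The point of the loop is that the list lock is released before the queue lock
is taken on every iteration, which mirrors the "remove_entry" loop in the
patch.
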
^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 10/25] io-controller: cfq changes to use hierarchical fair queuing code in elevator layer
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 09/25] io-controller: Common hierarchical fair queuing code in elevator layer Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 11/25] io-controller: Export disk time used and nr sectors dispatched through cgroups Vivek Goyal
                     ` (17 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

Make cfq hierarchical.
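
The main behavioural change on the cfq side is in changed_cgroup(): when a
task has moved to another cgroup, its cached async and sync queues are
dropped if they belong to a different io group than the task's current one,
and a new queue is set up in the new group when the next request arrives. A
minimal userspace sketch of that check, with hypothetical toy_* types
standing in for cfq_io_context and cfq_queue and a plain integer standing in
for the io group:

```c
#include <assert.h>
#include <stddef.h>

struct toy_queue {
	int ref;     /* reference count */
	int group;   /* id of the group the queue was created in */
};

/* Per-task, per-device context caching one async and one sync queue. */
struct toy_ctx {
	struct toy_queue *q[2];   /* [0] = async, [1] = sync */
};

static void toy_put_queue(struct toy_queue *q)
{
	assert(q->ref > 0);
	q->ref--;   /* a real implementation would free the queue at zero */
}

/*
 * Drop cached queues that belong to a group other than cur_group.
 * Returns the number of queues dropped.
 */
static int toy_changed_cgroup(struct toy_ctx *ctx, int cur_group)
{
	int i, dropped = 0;

	for (i = 0; i < 2; i++) {
		struct toy_queue *q = ctx->q[i];

		if (q && q->group != cur_group) {
			ctx->q[i] = NULL;   /* forget the cached pointer first */
			toy_put_queue(q);   /* then drop the context's reference */
			dropped++;
		}
	}

	return dropped;
}
```

A queue that still has other references, for example from requests already
queued on it, survives the put and goes away once those are dispatched.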

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Fabio Checconi <fabio-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
Signed-off-by: Paolo Valente <paolo.valente-rcYM44yAMweonA0d6jMUrA@public.gmane.org>
Signed-off-by: Aristeu Rozanski <aris-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched |    8 ++++++
 block/cfq-iosched.c   |   62 +++++++++++++++++++++++++++++++++++++++++++++++-
 init/Kconfig          |    2 +-
 3 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index dd5224d..a91a807 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -54,6 +54,14 @@ config IOSCHED_CFQ
 	  working environment, suitable for desktop systems.
 	  This is the default I/O scheduler.
 
+config IOSCHED_CFQ_HIER
+	bool "CFQ Hierarchical Scheduling support"
+	depends on IOSCHED_CFQ && CGROUPS
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in cfq.
+
 choice
 	prompt "Default I/O scheduler"
 	default DEFAULT_CFQ
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6ddc882..02b5cd5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1222,6 +1222,60 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
 	ioc->ioprio_changed = 0;
 }
 
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
+{
+	struct cfq_queue *async_cfqq = cic_to_cfqq(cic, 0);
+	struct cfq_queue *sync_cfqq = cic_to_cfqq(cic, 1);
+	struct cfq_data *cfqd = cic->key;
+	struct io_group *iog, *__iog;
+	unsigned long flags;
+	struct request_queue *q;
+
+	if (unlikely(!cfqd))
+		return;
+
+	q = cfqd->queue;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+
+	iog = io_get_io_group(q, 0);
+
+	if (async_cfqq != NULL) {
+		__iog = cfqq_to_io_group(async_cfqq);
+		if (iog != __iog) {
+			/* cgroup changed, drop the reference to async queue */
+			cic_set_cfqq(cic, NULL, 0);
+			cfq_put_queue(async_cfqq);
+		}
+	}
+
+	if (sync_cfqq != NULL) {
+		__iog = cfqq_to_io_group(sync_cfqq);
+
+		/*
+		 * Drop reference to sync queue. A new sync queue will
+		 * be assigned in new group upon arrival of a fresh request.
+		 * If old queue has got requests, those requests will be
+		 * dispatched over a period of time and queue will be freed
+		 * automatically.
+		 */
+		if (iog != __iog) {
+			cic_set_cfqq(cic, NULL, 1);
+			cfq_put_queue(sync_cfqq);
+		}
+	}
+
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void cfq_ioc_set_cgroup(struct io_context *ioc)
+{
+	call_for_each_cic(ioc, changed_cgroup);
+	ioc->cgroup_changed = 0;
+}
+#endif  /* CONFIG_IOSCHED_CFQ_HIER */
+
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 				struct io_context *ioc, gfp_t gfp_mask)
@@ -1232,7 +1286,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 	struct io_queue *ioq = NULL, *new_ioq = NULL;
 	struct io_group *iog = NULL;
 retry:
-	iog = io_get_io_group(q);
+	iog = io_get_io_group(q, 1);
 
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
@@ -1334,7 +1388,7 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
-	struct io_group *iog = io_get_io_group(cfqd->queue);
+	struct io_group *iog = io_get_io_group(cfqd->queue, 1);
 
 	if (!is_sync) {
 		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
@@ -1493,6 +1547,10 @@ out:
 	smp_read_barrier_depends();
 	if (unlikely(ioc->ioprio_changed))
 		cfq_ioc_set_ioprio(ioc);
+#ifdef CONFIG_IOSCHED_CFQ_HIER
+	if (unlikely(ioc->cgroup_changed))
+		cfq_ioc_set_cgroup(ioc);
+#endif
 	return cic;
 err_free:
 	cfq_cic_free(cic);
diff --git a/init/Kconfig b/init/Kconfig
index a380f46..eaa44db 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -613,7 +613,7 @@ config CGROUP_MEM_RES_CTLR_SWAP
 	  size is 4096bytes, 512k per 1Gbytes of swap.
 
 config GROUP_IOSCHED
-	bool "Group IO Scheduler"
+	bool
 	depends on CGROUPS && ELV_FAIR_QUEUING
 	default n
 	---help---
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 11/25] io-controller: Export disk time used and nr sectors dispatched through cgroups
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 10/25] io-controller: cfq changes to use " Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 12/25] io-controller: idle for sometime on sync queue before expiring it Vivek Goyal
                     ` (16 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o This patch exports some statistics through the cgroup interface. Two of the
  statistics currently exported are the actual disk time assigned to the cgroup
  and the actual number of sectors dispatched to disk on behalf of this cgroup.
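
  Both read handlers print one line per device in the form
  "major minor value" (format string "%u %u %lu\n"). A minimal userspace
  sketch of parsing such a line follows. The toy_* names are hypothetical,
  and the unit of the disk time value is not specified here.

```c
#include <assert.h>
#include <stdio.h>

struct toy_dev_stat {
	unsigned int major;
	unsigned int minor;
	unsigned long value;   /* disk time or sector count, per the file read */
};

/* Parse one "major minor value" line; returns 0 on success, -1 otherwise. */
static int toy_parse_stat_line(const char *line, struct toy_dev_stat *out)
{
	if (sscanf(line, "%u %u %lu",
		   &out->major, &out->minor, &out->value) != 3)
		return -1;

	return 0;
}
```

  A caller would read the stat file line by line and pass each line to
  toy_parse_stat_line(), treating a -1 return as an unexpected format.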

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/elevator-fq.c |   81 +++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h |   10 ++++++
 2 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index f8d0b90..bab01b5 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -15,6 +15,7 @@
 #include <linux/blkdev.h>
 #include "elevator-fq.h"
 #include <linux/blktrace_api.h>
+#include <linux/seq_file.h>
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -971,13 +972,16 @@ update:
 	}
 }
 
-static void entity_served(struct io_entity *entity, unsigned long served)
+void entity_served(struct io_entity *entity, unsigned long served,
+					unsigned long nr_sectors)
 {
 	struct io_service_tree *st;
 
 	for_each_entity(entity) {
 		st = io_entity_service_tree(entity);
 		entity->service += served;
+		entity->total_service += served;
+		entity->total_sector_service += nr_sectors;
 		BUG_ON(st->wsum == 0);
 		st->vtime += bfq_delta(served, st->wsum);
 		bfq_forget_idle(st);
@@ -1140,6 +1144,66 @@ STORE_FUNCTION(weight, 1, WEIGHT_MAX);
 STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
 #undef STORE_FUNCTION
 
+static int io_cgroup_disk_time_read(struct cgroup *cgroup,
+				struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgoup deletion.
+		 */
+		if (iog->key) {
+			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev),
+					iog->entity.total_service);
+		}
+	}
+	rcu_read_unlock();
+	cgroup_unlock();
+
+	return 0;
+}
+
+static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
+				struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_group *iog;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgoup deletion.
+		 */
+		if (iog->key) {
+			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev),
+					iog->entity.total_sector_service);
+		}
+	}
+	rcu_read_unlock();
+	cgroup_unlock();
+
+	return 0;
+}
+
 struct cftype bfqio_files[] = {
 	{
 		.name = "weight",
@@ -1151,6 +1215,14 @@ struct cftype bfqio_files[] = {
 		.read_u64 = io_cgroup_ioprio_class_read,
 		.write_u64 = io_cgroup_ioprio_class_write,
 	},
+	{
+		.name = "disk_time",
+		.read_seq_string = io_cgroup_disk_time_read,
+	},
+	{
+		.name = "disk_sectors",
+		.read_seq_string = io_cgroup_disk_sectors_read,
+	},
 };
 
 static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1252,6 +1324,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 	struct io_cgroup *iocg;
 	struct io_group *iog, *leaf = NULL, *prev = NULL;
 	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
+	unsigned int major, minor;
+	struct backing_dev_info *bdi = &q->backing_dev_info;
 
 	for (; cgroup != NULL; cgroup = cgroup->parent) {
 		iocg = cgroup_to_io_cgroup(cgroup);
@@ -1272,6 +1346,9 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 
 		iog->iocg_id = css_id(&iocg->css);
 
+		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+		iog->dev = MKDEV(major, minor);
+
 		io_group_init_entity(iocg, iog);
 		iog->my_entity = &iog->entity;
 
@@ -1873,7 +1950,7 @@ EXPORT_SYMBOL(elv_get_slice_idle);
 
 static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
 {
-	entity_served(&ioq->entity, served);
+	entity_served(&ioq->entity, served, ioq->nr_sectors);
 }
 
 /* Tells whether ioq is queued in root group or not */
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index c8987c0..d76bd96 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -147,6 +147,13 @@ struct io_entity {
 	unsigned short ioprio_class, new_ioprio_class;
 
 	int ioprio_changed;
+
+	/*
+	 * Keep track of total service received by this entity. Keep the
+	 * stats both for time slices and number of sectors dispatched
+	 */
+	unsigned long total_service;
+	unsigned long total_sector_service;
 };
 
 /*
@@ -244,6 +251,9 @@ struct io_group {
 
 	int deleting;
 	unsigned short iocg_id;
+
+	/* The device MKDEV(major, minor), this group has been created for */
+	dev_t	dev;
 };
 
 /**
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 12/25] io-controller: idle for sometime on sync queue before expiring it
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (10 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 11/25] io-controller: Export disk time used and nr sectors dipatched through cgroups Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 13/25] io-controller: Wait for requests to complete from last queue before new queue is scheduled Vivek Goyal
                     ` (15 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o When a sync queue expires, it is often empty and is therefore deleted
  from the active tree. Of two competing queues, only one is then on the
  tree, so when a new queue is selected a vtime jump takes place and
  service is no longer provided in proportion to weight.

o In general this is a fundamental problem with fairness for sync queues
  that are not continuously backlogged. Idling looks like the only way to
  make sure such queues get a decent amount of disk bandwidth in the face
  of competition from continuously backlogged queues. But excessive idling
  can reduce performance on SSDs and on disks with command queuing.

o This patch experiments with waiting for the next request to arrive
  before expiring a queue that has consumed its time slice. This can give
  more accurate fairness numbers in some cases.

o Introduced a tunable "fairness". If set, the io-controller puts more
  focus on getting fairness right than on getting throughput right. It is
  enabled by default for the time being.

o When writes are done on a file opened with O_SYNC, the io scheduler sees
  synchronous write requests with the noidle flag set. In practice,
  however, we see a continuous stream of writes within 1ms or so of each
  other, so it makes sense to wait on these writes. For the time being, to
  achieve fairness for O_SYNC writes, continue to idle even if the last
  request was a sync write with the noidle flag set (only done if
  "fairness" is set). The right fix is probably to make sure that requests
  in the O_SYNC path are not marked with the noidle flag.
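
The decision taken when a request completes can be sketched as a small
user-space model. This is a simplification under stated assumptions, not
the kernel code: the struct, enum and function names (ioq_state,
completion_action, on_slice_check) are hypothetical, and the model leaves
out the idle-class check and the close-cooperator test that the real
elv_ioq_completed_request() also performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the state consulted on request completion. */
struct ioq_state {
	bool slice_used;	/* queue has consumed its time slice */
	bool sync;		/* completed request was synchronous */
	int nr_queued;		/* requests still queued in this queue */
	int group_nr_active;	/* active entities in the queue's group */
};

enum completion_action {
	ACTION_CONTINUE,	/* slice not used up; normal idling rules apply */
	ACTION_WAIT_BUSY,	/* arm the idle timer, wait for the queue to get busy */
	ACTION_EXPIRE,		/* expire the queue and select another */
};

/*
 * Model of the slice-used branch: an empty sync queue that is the only
 * active entity in its group gets one extra idle period instead of being
 * expired immediately, so that it is still on the tree when the next
 * queue is selected.
 */
enum completion_action on_slice_check(const struct ioq_state *s)
{
	assert(s != NULL);

	if (!s->slice_used)
		return ACTION_CONTINUE;
	if (s->sync && s->nr_queued == 0 && s->group_nr_active <= 1)
		return ACTION_WAIT_BUSY;
	return ACTION_EXPIRE;
}
```

The group_nr_active <= 1 test mirrors the elv_iog_nr_active() check in the
patch: when the group has other active queues, waiting is unnecessary
because the group as a whole stays backlogged.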

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/cfq-iosched.c |    1 +
 block/elevator-fq.c |  133 ++++++++++++++++++++++++++++++++++++++++++++-------
 block/elevator-fq.h |   15 ++++++
 3 files changed, 132 insertions(+), 17 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 02b5cd5..98a35fd 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2004,6 +2004,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	ELV_ATTR(slice_idle),
 	ELV_ATTR(slice_sync),
 	ELV_ATTR(slice_async),
+	ELV_ATTR(fairness),
 	__ATTR_NULL
 };
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index bab01b5..68be1dc 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -424,6 +424,7 @@ static void bfq_active_insert(struct io_service_tree *st,
 	struct rb_node *node = &entity->rb_node;
 
 	bfq_insert(&st->active, entity);
+	entity->sched_data->nr_active++;
 
 	if (node->rb_left != NULL)
 		node = node->rb_left;
@@ -483,6 +484,7 @@ static void bfq_active_remove(struct io_service_tree *st,
 
 	node = bfq_find_deepest(&entity->rb_node);
 	bfq_remove(&st->active, entity);
+	entity->sched_data->nr_active--;
 
 	if (node != NULL)
 		bfq_update_active_tree(node);
@@ -569,6 +571,21 @@ static void bfq_forget_idle(struct io_service_tree *st)
 		bfq_put_idle_entity(st, first_idle);
 }
 
+/*
+ * Returns the number of active entities a particular io group has. This
+ * includes number of active entities on service tree as well as the active
+ * entity which is being served currently, if any.
+ */
+
+static inline int elv_iog_nr_active(struct io_group *iog)
+{
+	struct io_sched_data *sd = &iog->sched_data;
+
+	if (sd->active_entity)
+		return sd->nr_active + 1;
+	else
+		return sd->nr_active;
+}
 
 static struct io_service_tree *
 __bfq_entity_update_prio(struct io_service_tree *old_st,
@@ -1995,6 +2012,8 @@ SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
 EXPORT_SYMBOL(elv_slice_sync_show);
 SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
 EXPORT_SYMBOL(elv_slice_async_show);
+SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
+EXPORT_SYMBOL(elv_fairness_show);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2019,6 +2038,8 @@ STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_sync_store);
 STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_async_store);
+STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
+EXPORT_SYMBOL(elv_fairness_store);
 #undef STORE_FUNCTION
 
 void elv_schedule_dispatch(struct request_queue *q)
@@ -2142,7 +2163,7 @@ static void elv_ioq_update_idle_window(struct elevator_queue *eq,
 	 * io scheduler if it wants to disable idling based on additional
 	 * considrations like seek pattern.
 	 */
-	if (enable_idle) {
+	if (enable_idle && !efqd->fairness) {
 		if (eq->ops->elevator_update_idle_window_fn)
 			enable_idle = eq->ops->elevator_update_idle_window_fn(
 						eq, ioq->sched_queue, rq);
@@ -2328,6 +2349,7 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 
 		elv_clear_ioq_wait_request(ioq);
 		elv_clear_ioq_must_dispatch(ioq);
+		elv_clear_ioq_wait_busy_done(ioq);
 		elv_mark_ioq_slice_new(ioq);
 
 		del_timer(&efqd->idle_slice_timer);
@@ -2483,10 +2505,12 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	assert_spin_locked(q->queue_lock);
 	elv_log_ioq(efqd, ioq, "slice expired");
 
-	if (elv_ioq_wait_request(ioq))
+	if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
 		del_timer(&efqd->idle_slice_timer);
 
 	elv_clear_ioq_wait_request(ioq);
+	elv_clear_ioq_wait_busy(ioq);
+	elv_clear_ioq_wait_busy_done(ioq);
 
 	/*
 	 * if ioq->slice_end = 0, that means a queue was expired before first
@@ -2659,7 +2683,7 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 		 * has other work pending, don't risk delaying until the
 		 * idle timer unplug to continue working.
 		 */
-		if (elv_ioq_wait_request(ioq)) {
+		if (elv_ioq_wait_request(ioq) && !elv_ioq_wait_busy(ioq)) {
 			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
 			    efqd->busy_queues > 1) {
 				del_timer(&efqd->idle_slice_timer);
@@ -2667,6 +2691,18 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 			}
 			elv_mark_ioq_must_dispatch(ioq);
 		}
+
+		/*
+		 * If we were waiting for a request on this queue, wait is
+		 * done. Schedule the next dispatch
+		 */
+		if (elv_ioq_wait_busy(ioq)) {
+			del_timer(&efqd->idle_slice_timer);
+			elv_clear_ioq_wait_busy(ioq);
+			elv_mark_ioq_wait_busy_done(ioq);
+			elv_clear_ioq_must_dispatch(ioq);
+			elv_schedule_dispatch(q);
+		}
 	} else if (elv_should_preempt(q, ioq, rq)) {
 		/*
 		 * not the active queue - expire current slice if it is
@@ -2694,6 +2730,9 @@ static void elv_idle_slice_timer(unsigned long data)
 
 	if (ioq) {
 
+		if (elv_ioq_wait_busy(ioq))
+			goto expire;
+
 		/*
 		 * We saw a request before the queue expired, let it through
 		 */
@@ -2727,7 +2766,7 @@ out_cont:
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
-static void elv_ioq_arm_slice_timer(struct request_queue *q)
+static void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
 {
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 	struct io_queue *ioq = elv_active_ioq(q->elevator);
@@ -2740,26 +2779,38 @@ static void elv_ioq_arm_slice_timer(struct request_queue *q)
 	 * for devices that support queuing, otherwise we still have a problem
 	 * with sync vs async workloads.
 	 */
-	if (blk_queue_nonrot(q) && efqd->hw_tag)
+	if (blk_queue_nonrot(q) && efqd->hw_tag && !efqd->fairness)
 		return;
 
 	/*
-	 * still requests with the driver, don't idle
+	 * idle is disabled, either manually or by past process history
 	 */
-	if (efqd->rq_in_driver)
+	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
 		return;
 
 	/*
-	 * idle is disabled, either manually or by past process history
+	 * This queue has consumed its time slice. We are waiting only for
+	 * it to become busy before we select next queue for dispatch.
 	 */
-	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+	if (wait_for_busy) {
+		elv_mark_ioq_wait_busy(ioq);
+		sl = efqd->elv_slice_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log_ioq(efqd, ioq, "arm idle: %lu wait busy=1", sl);
+		return;
+	}
+
+	/*
+	 * still requests with the driver, don't idle
+	 */
+	if (efqd->rq_in_driver && !efqd->fairness)
 		return;
 
 	/*
 	 * may be iosched got its own idling logic. In that case io
 	 * schduler will take care of arming the timer, if need be.
 	 */
-	if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+	if (q->elevator->ops->elevator_arm_slice_timer_fn && !efqd->fairness) {
 		q->elevator->ops->elevator_arm_slice_timer_fn(q,
 						ioq->sched_queue);
 	} else {
@@ -2822,11 +2873,38 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/* We are waiting for this queue to become busy before it expires.*/
+	if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
 	/*
 	 * The active queue has run out of time, expire it and select new.
 	 */
-	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
-		goto expire;
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq)) {
+		/*
+		 * Queue has used up its slice. Wait busy is not on otherwise
+		 * we wouldn't have been here. There is a chance that after
+		 * slice expiry no request from the queue completed hence
+		 * wait busy timer could not be turned on. If that's the case
+		 * don't expire the queue yet. Next request completion from
+		 * the queue will arm the wait busy timer.
+		 *
+		 * Don't wait if this group has other active queues. This
+		 * will make sure that we don't loose fairness at group level
+		 * at the same time in root group we will not see cfq
+		 * regressions.
+		 */
+		if (elv_ioq_sync(ioq) && !ioq->nr_queued
+		    && elv_ioq_nr_dispatched(ioq)
+		    && (elv_iog_nr_active(ioq_to_io_group(ioq)) <= 1)
+		    && !elv_ioq_wait_busy_done(ioq)) {
+			ioq = NULL;
+			goto keep_queue;
+		} else
+			goto expire;
+	}
 
 	/*
 	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
@@ -2977,11 +3055,13 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 	const int sync = rq_is_sync(rq);
 	struct io_queue *ioq;
 	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_group *iog;
 
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return;
 
 	ioq = rq->ioq;
+	iog = ioq_to_io_group(ioq);
 
 	elv_log_ioq(efqd, ioq, "complete");
 
@@ -3007,6 +3087,12 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 			elv_ioq_set_prio_slice(q, ioq);
 			elv_clear_ioq_slice_new(ioq);
 		}
+
+		if (elv_ioq_class_idle(ioq)) {
+			elv_ioq_slice_expired(q);
+			goto done;
+		}
+
 		/*
 		 * If there are no requests waiting in this queue, and
 		 * there are other queues ready to issue requests, AND
@@ -3014,13 +3100,24 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		 * mean seek distance, give them a chance to run instead
 		 * of idling.
 		 */
-		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
-			elv_ioq_slice_expired(q);
-		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
-			 && sync && !rq_noidle(rq))
-			elv_ioq_arm_slice_timer(q);
+		if (elv_ioq_slice_used(ioq)) {
+			if (sync && !ioq->nr_queued
+			    && (elv_iog_nr_active(iog) <= 1)) {
+				/*
+				 * Idle for one extra period in hierarchical
+				 * setup
+				 */
+				elv_ioq_arm_slice_timer(q, 1);
+			} else {
+				/* Expire the queue */
+				elv_ioq_slice_expired(q);
+			}
+		} else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+			 && sync && (!rq_noidle(rq) || efqd->fairness))
+			elv_ioq_arm_slice_timer(q, 0);
 	}
 
+done:
 	if (!efqd->rq_in_driver)
 		elv_schedule_dispatch(q);
 }
@@ -3125,6 +3222,8 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	efqd->elv_slice_idle = elv_slice_idle;
 	efqd->hw_tag = 1;
 
+	/* For the time being keep fairness enabled by default */
+	efqd->fairness = 1;
 	return 0;
 }
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index d76bd96..a414309 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -75,6 +75,7 @@ struct io_service_tree {
 struct io_sched_data {
 	struct io_entity *active_entity;
 	struct io_entity *next_active;
+	int nr_active;
 	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
 };
 
@@ -337,6 +338,13 @@ struct elv_fq_data {
 	unsigned long long rate_sampling_start; /*sampling window start jifies*/
 	/* number of sectors finished io during current sampling window */
 	unsigned long rate_sectors_current;
+
+	/*
+	 * If set to 1, will disable many optimizations done for boost
+	 * throughput and focus more on providing fairness for sync
+	 * queues.
+	 */
+	unsigned int fairness;
 };
 
 /* Logging facilities. */
@@ -358,6 +366,8 @@ enum elv_queue_state_flags {
 	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
 	ELV_QUEUE_FLAG_must_dispatch,	  /* must be allowed a dispatch */
 	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
+	ELV_QUEUE_FLAG_wait_busy,	  /* wait for this queue to get busy */
+	ELV_QUEUE_FLAG_wait_busy_done,	  /* Have already waited on this queue*/
 };
 
 #define ELV_IO_QUEUE_FLAG_FNS(name)					\
@@ -380,6 +390,8 @@ ELV_IO_QUEUE_FLAG_FNS(wait_request)
 ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
 ELV_IO_QUEUE_FLAG_FNS(idle_window)
 ELV_IO_QUEUE_FLAG_FNS(slice_new)
+ELV_IO_QUEUE_FLAG_FNS(wait_busy)
+ELV_IO_QUEUE_FLAG_FNS(wait_busy_done)
 
 static inline struct io_service_tree *
 io_entity_service_tree(struct io_entity *entity)
@@ -532,6 +544,9 @@ extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
 extern ssize_t elv_slice_async_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
 						size_t count);
+extern ssize_t elv_fairness_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_fairness_store(struct elevator_queue *q, const char *name,
+						size_t count);
 
 /* Functions used by elevator.c */
 extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 12/25] io-controller: idle for sometime on sync queue before expiring it
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

o When a sync queue expires, it is often empty and is therefore deleted
  from the active tree. Of two competing queues, only one is then on the
  tree, so when a new queue is selected a vtime jump takes place and
  service is no longer provided in proportion to weight.

o In general this is a fundamental problem with fairness for sync queues
  that are not continuously backlogged. Idling looks like the only way to
  make sure such queues get a decent amount of disk bandwidth in the face
  of competition from continuously backlogged queues. But excessive idling
  can reduce performance on SSDs and on disks with command queuing.

o This patch experiments with waiting for the next request to arrive
  before expiring a queue that has consumed its time slice. This can give
  more accurate fairness numbers in some cases.

o Introduced a tunable "fairness". If set, the io-controller puts more
  focus on getting fairness right than on getting throughput right. It is
  enabled by default for the time being.

o When writes are done on a file opened with O_SYNC, the io scheduler sees
  synchronous write requests with the noidle flag set. In practice,
  however, we see a continuous stream of writes within 1ms or so of each
  other, so it makes sense to wait on these writes. For the time being, to
  achieve fairness for O_SYNC writes, continue to idle even if the last
  request was a sync write with the noidle flag set (only done if
  "fairness" is set). The right fix is probably to make sure that requests
  in the O_SYNC path are not marked with the noidle flag.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
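Not part of the patch: a minimal user-space sketch of the decision this
patch makes when a request completes on the active queue, for readers who
want the logic without the surrounding diff. All sketch_* names and both
structs are hypothetical stand-ins, not kernel types. The sketch leaves out
the elv_close_cooperator() check and the early returns inside
elv_ioq_arm_slice_timer(), so it shows only which path is attempted.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct io_sched_data. */
struct sketch_group {
	bool has_active_entity;	/* an entity is being served right now */
	int nr_active;		/* entities sitting on the active tree */
};

/* Hypothetical, simplified stand-in for struct io_queue. */
struct sketch_queue {
	bool sync;		/* queue carries synchronous requests */
	bool slice_used;	/* time slice has been consumed */
	int nr_queued;		/* requests still queued, not dispatched */
	const struct sketch_group *group;
};

enum sketch_action {
	SKETCH_EXPIRE,		/* expire the queue now */
	SKETCH_WAIT_BUSY,	/* arm the timer, wait for queue to get busy */
	SKETCH_IDLE,		/* ordinary idling inside the slice */
	SKETCH_NONE		/* nothing to do */
};

/* Mirrors elv_iog_nr_active(): tree entries plus the one being served. */
static int sketch_nr_active(const struct sketch_group *g)
{
	return g->nr_active + (g->has_active_entity ? 1 : 0);
}

/* Mirrors the branch structure added to elv_ioq_completed_request(). */
static enum sketch_action
sketch_on_completion(const struct sketch_queue *q, bool rq_noidle,
		     bool fairness)
{
	if (q->slice_used) {
		/* Only wait if this queue is alone in its group. */
		if (q->sync && q->nr_queued == 0 &&
		    sketch_nr_active(q->group) <= 1)
			return SKETCH_WAIT_BUSY;
		return SKETCH_EXPIRE;
	}

	/* With fairness set, the noidle hint on the request is ignored. */
	if (q->nr_queued == 0 && q->sync && (!rq_noidle || fairness))
		return SKETCH_IDLE;

	return SKETCH_NONE;
}
```

In this model a sync queue that is the only active entity in its group
gets one extra idle period after its slice is used, while a queue with
siblings in the same group is expired immediately.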
 block/cfq-iosched.c |    1 +
 block/elevator-fq.c |  133 ++++++++++++++++++++++++++++++++++++++++++++-------
 block/elevator-fq.h |   15 ++++++
 3 files changed, 132 insertions(+), 17 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 02b5cd5..98a35fd 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2004,6 +2004,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	ELV_ATTR(slice_idle),
 	ELV_ATTR(slice_sync),
 	ELV_ATTR(slice_async),
+	ELV_ATTR(fairness),
 	__ATTR_NULL
 };
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index bab01b5..68be1dc 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -424,6 +424,7 @@ static void bfq_active_insert(struct io_service_tree *st,
 	struct rb_node *node = &entity->rb_node;
 
 	bfq_insert(&st->active, entity);
+	entity->sched_data->nr_active++;
 
 	if (node->rb_left != NULL)
 		node = node->rb_left;
@@ -483,6 +484,7 @@ static void bfq_active_remove(struct io_service_tree *st,
 
 	node = bfq_find_deepest(&entity->rb_node);
 	bfq_remove(&st->active, entity);
+	entity->sched_data->nr_active--;
 
 	if (node != NULL)
 		bfq_update_active_tree(node);
@@ -569,6 +571,21 @@ static void bfq_forget_idle(struct io_service_tree *st)
 		bfq_put_idle_entity(st, first_idle);
 }
 
+/*
+ * Returns the number of active entities a particular io group has. This
+ * includes number of active entities on service tree as well as the active
+ * entity which is being served currently, if any.
+ */
+
+static inline int elv_iog_nr_active(struct io_group *iog)
+{
+	struct io_sched_data *sd = &iog->sched_data;
+
+	if (sd->active_entity)
+		return sd->nr_active + 1;
+	else
+		return sd->nr_active;
+}
 
 static struct io_service_tree *
 __bfq_entity_update_prio(struct io_service_tree *old_st,
@@ -1995,6 +2012,8 @@ SHOW_FUNCTION(elv_slice_sync_show, efqd->elv_slice[1], 1);
 EXPORT_SYMBOL(elv_slice_sync_show);
 SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
 EXPORT_SYMBOL(elv_slice_async_show);
+SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
+EXPORT_SYMBOL(elv_fairness_show);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2019,6 +2038,8 @@ STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_sync_store);
 STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_async_store);
+STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
+EXPORT_SYMBOL(elv_fairness_store);
 #undef STORE_FUNCTION
 
 void elv_schedule_dispatch(struct request_queue *q)
@@ -2142,7 +2163,7 @@ static void elv_ioq_update_idle_window(struct elevator_queue *eq,
 	 * io scheduler if it wants to disable idling based on additional
 	 * considrations like seek pattern.
 	 */
-	if (enable_idle) {
+	if (enable_idle && !efqd->fairness) {
 		if (eq->ops->elevator_update_idle_window_fn)
 			enable_idle = eq->ops->elevator_update_idle_window_fn(
 						eq, ioq->sched_queue, rq);
@@ -2328,6 +2349,7 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 
 		elv_clear_ioq_wait_request(ioq);
 		elv_clear_ioq_must_dispatch(ioq);
+		elv_clear_ioq_wait_busy_done(ioq);
 		elv_mark_ioq_slice_new(ioq);
 
 		del_timer(&efqd->idle_slice_timer);
@@ -2483,10 +2505,12 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	assert_spin_locked(q->queue_lock);
 	elv_log_ioq(efqd, ioq, "slice expired");
 
-	if (elv_ioq_wait_request(ioq))
+	if (elv_ioq_wait_request(ioq) || elv_ioq_wait_busy(ioq))
 		del_timer(&efqd->idle_slice_timer);
 
 	elv_clear_ioq_wait_request(ioq);
+	elv_clear_ioq_wait_busy(ioq);
+	elv_clear_ioq_wait_busy_done(ioq);
 
 	/*
 	 * if ioq->slice_end = 0, that means a queue was expired before first
@@ -2659,7 +2683,7 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 		 * has other work pending, don't risk delaying until the
 		 * idle timer unplug to continue working.
 		 */
-		if (elv_ioq_wait_request(ioq)) {
+		if (elv_ioq_wait_request(ioq) && !elv_ioq_wait_busy(ioq)) {
 			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
 			    efqd->busy_queues > 1) {
 				del_timer(&efqd->idle_slice_timer);
@@ -2667,6 +2691,18 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 			}
 			elv_mark_ioq_must_dispatch(ioq);
 		}
+
+		/*
+		 * If we were waiting for a request on this queue, wait is
+		 * done. Schedule the next dispatch
+		 */
+		if (elv_ioq_wait_busy(ioq)) {
+			del_timer(&efqd->idle_slice_timer);
+			elv_clear_ioq_wait_busy(ioq);
+			elv_mark_ioq_wait_busy_done(ioq);
+			elv_clear_ioq_must_dispatch(ioq);
+			elv_schedule_dispatch(q);
+		}
 	} else if (elv_should_preempt(q, ioq, rq)) {
 		/*
 		 * not the active queue - expire current slice if it is
@@ -2694,6 +2730,9 @@ static void elv_idle_slice_timer(unsigned long data)
 
 	if (ioq) {
 
+		if (elv_ioq_wait_busy(ioq))
+			goto expire;
+
 		/*
 		 * We saw a request before the queue expired, let it through
 		 */
@@ -2727,7 +2766,7 @@ out_cont:
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
-static void elv_ioq_arm_slice_timer(struct request_queue *q)
+static void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
 {
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 	struct io_queue *ioq = elv_active_ioq(q->elevator);
@@ -2740,26 +2779,38 @@ static void elv_ioq_arm_slice_timer(struct request_queue *q)
 	 * for devices that support queuing, otherwise we still have a problem
 	 * with sync vs async workloads.
 	 */
-	if (blk_queue_nonrot(q) && efqd->hw_tag)
+	if (blk_queue_nonrot(q) && efqd->hw_tag && !efqd->fairness)
 		return;
 
 	/*
-	 * still requests with the driver, don't idle
+	 * idle is disabled, either manually or by past process history
 	 */
-	if (efqd->rq_in_driver)
+	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
 		return;
 
 	/*
-	 * idle is disabled, either manually or by past process history
+	 * This queue has consumed its time slice. We are waiting only for
+	 * it to become busy before we select next queue for dispatch.
 	 */
-	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+	if (wait_for_busy) {
+		elv_mark_ioq_wait_busy(ioq);
+		sl = efqd->elv_slice_idle;
+		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
+		elv_log_ioq(efqd, ioq, "arm idle: %lu wait busy=1", sl);
+		return;
+	}
+
+	/*
+	 * still requests with the driver, don't idle
+	 */
+	if (efqd->rq_in_driver && !efqd->fairness)
 		return;
 
 	/*
 	 * may be iosched got its own idling logic. In that case io
 	 * schduler will take care of arming the timer, if need be.
 	 */
-	if (q->elevator->ops->elevator_arm_slice_timer_fn) {
+	if (q->elevator->ops->elevator_arm_slice_timer_fn && !efqd->fairness) {
 		q->elevator->ops->elevator_arm_slice_timer_fn(q,
 						ioq->sched_queue);
 	} else {
@@ -2822,11 +2873,38 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/* We are waiting for this queue to become busy before it expires.*/
+	if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
 	/*
 	 * The active queue has run out of time, expire it and select new.
 	 */
-	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq))
-		goto expire;
+	if (elv_ioq_slice_used(ioq) && !elv_ioq_must_dispatch(ioq)) {
+		/*
+		 * Queue has used up its slice. Wait busy is not on otherwise
+		 * we wouldn't have been here. There is a chance that after
+		 * slice expiry no request from the queue completed hence
+		 * wait busy timer could not be turned on. If that's the case
+		 * don't expire the queue yet. Next request completion from
+		 * the queue will arm the wait busy timer.
+		 *
+		 * Don't wait if this group has other active queues. This
+		 * will make sure that we don't lose fairness at group level
+		 * at the same time in root group we will not see cfq
+		 * regressions.
+		 */
+		if (elv_ioq_sync(ioq) && !ioq->nr_queued
+		    && elv_ioq_nr_dispatched(ioq)
+		    && (elv_iog_nr_active(ioq_to_io_group(ioq)) <= 1)
+		    && !elv_ioq_wait_busy_done(ioq)) {
+			ioq = NULL;
+			goto keep_queue;
+		} else
+			goto expire;
+	}
 
 	/*
 	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
@@ -2977,11 +3055,13 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 	const int sync = rq_is_sync(rq);
 	struct io_queue *ioq;
 	struct elv_fq_data *efqd = &q->elevator->efqd;
+	struct io_group *iog;
 
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return;
 
 	ioq = rq->ioq;
+	iog = ioq_to_io_group(ioq);
 
 	elv_log_ioq(efqd, ioq, "complete");
 
@@ -3007,6 +3087,12 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 			elv_ioq_set_prio_slice(q, ioq);
 			elv_clear_ioq_slice_new(ioq);
 		}
+
+		if (elv_ioq_class_idle(ioq)) {
+			elv_ioq_slice_expired(q);
+			goto done;
+		}
+
 		/*
 		 * If there are no requests waiting in this queue, and
 		 * there are other queues ready to issue requests, AND
@@ -3014,13 +3100,24 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		 * mean seek distance, give them a chance to run instead
 		 * of idling.
 		 */
-		if (elv_ioq_slice_used(ioq) || elv_ioq_class_idle(ioq))
-			elv_ioq_slice_expired(q);
-		else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
-			 && sync && !rq_noidle(rq))
-			elv_ioq_arm_slice_timer(q);
+		if (elv_ioq_slice_used(ioq)) {
+			if (sync && !ioq->nr_queued
+			    && (elv_iog_nr_active(iog) <= 1)) {
+				/*
+				 * Idle for one extra period in hierarchical
+				 * setup
+				 */
+				elv_ioq_arm_slice_timer(q, 1);
+			} else {
+				/* Expire the queue */
+				elv_ioq_slice_expired(q);
+			}
+		} else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
+			 && sync && (!rq_noidle(rq) || efqd->fairness))
+			elv_ioq_arm_slice_timer(q, 0);
 	}
 
+done:
 	if (!efqd->rq_in_driver)
 		elv_schedule_dispatch(q);
 }
@@ -3125,6 +3222,8 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	efqd->elv_slice_idle = elv_slice_idle;
 	efqd->hw_tag = 1;
 
+	/* For the time being keep fairness enabled by default */
+	efqd->fairness = 1;
 	return 0;
 }
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index d76bd96..a414309 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -75,6 +75,7 @@ struct io_service_tree {
 struct io_sched_data {
 	struct io_entity *active_entity;
 	struct io_entity *next_active;
+	int nr_active;
 	struct io_service_tree service_tree[IO_IOPRIO_CLASSES];
 };
 
@@ -337,6 +338,13 @@ struct elv_fq_data {
 	unsigned long long rate_sampling_start; /*sampling window start jifies*/
 	/* number of sectors finished io during current sampling window */
 	unsigned long rate_sectors_current;
+
+	/*
+	 * If set to 1, will disable many optimizations done to boost
+	 * throughput and focus more on providing fairness for sync
+	 * queues.
+	 */
+	unsigned int fairness;
 };
 
 /* Logging facilities. */
@@ -358,6 +366,8 @@ enum elv_queue_state_flags {
 	ELV_QUEUE_FLAG_wait_request,	  /* waiting for a request */
 	ELV_QUEUE_FLAG_must_dispatch,	  /* must be allowed a dispatch */
 	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
+	ELV_QUEUE_FLAG_wait_busy,	  /* wait for this queue to get busy */
+	ELV_QUEUE_FLAG_wait_busy_done,	  /* Have already waited on this queue*/
 };
 
 #define ELV_IO_QUEUE_FLAG_FNS(name)					\
@@ -380,6 +390,8 @@ ELV_IO_QUEUE_FLAG_FNS(wait_request)
 ELV_IO_QUEUE_FLAG_FNS(must_dispatch)
 ELV_IO_QUEUE_FLAG_FNS(idle_window)
 ELV_IO_QUEUE_FLAG_FNS(slice_new)
+ELV_IO_QUEUE_FLAG_FNS(wait_busy)
+ELV_IO_QUEUE_FLAG_FNS(wait_busy_done)
 
 static inline struct io_service_tree *
 io_entity_service_tree(struct io_entity *entity)
@@ -532,6 +544,9 @@ extern ssize_t elv_slice_sync_store(struct elevator_queue *q, const char *name,
 extern ssize_t elv_slice_async_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
 						size_t count);
+extern ssize_t elv_fairness_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_fairness_store(struct elevator_queue *q, const char *name,
+						size_t count);
 
 /* Functions used by elevator.c */
 extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 13/25] io-controller: Wait for requests to complete from last queue before new queue is scheduled
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (11 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 12/25] io-controller: idle for sometime on sync queue before expiring it Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 14/25] io-controller: Separate out queue and data Vivek Goyal
                     ` (14 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o Currently one can dispatch requests from multiple queues to the disk.
  This is true for hardware which supports queuing. So if a disk supports
  a queue depth of 31, it is possible that 20 requests are dispatched from
  queue 1 and then the next queue is scheduled in, which dispatches more
  requests.

o This multiple queue dispatch makes it hard to account accurately for the
  disk time consumed by a particular queue. For example, if an async queue
  is scheduled in, it can dispatch 31 requests to the disk before it is
  expired and a new sync queue gets scheduled in. These 31 requests might
  take a long time to finish, but this time is never charged to the async
  queue which dispatched them.

o This patch introduces functionality where we wait for all the requests
  from the previous queue to finish before the next queue is scheduled in.
  That way a queue is charged more accurately for the disk time it has
  consumed. Note this still does not take care of errors introduced by
  disk write caching.

o Because the above behavior can reduce throughput, it is enabled only if
  the user sets the "fairness" tunable to 2 or higher.

o This patch helps in achieving more isolation between reads and buffered
  writes in different cgroups. Buffered writes typically utilize the full
  queue depth and then expire the queue. On the contrary, sequential reads
  typically drive a queue depth of 1. So despite the fact that writes are
  using more disk time, it is never charged to the write queue because we
  don't wait for requests to finish after dispatching them. This patch
  helps do more accurate accounting of disk time, especially for buffered
  writes, and hence provides better fairness and better isolation between
  two cgroups running read and write workloads.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
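Not part of the patch: a small user-space sketch of the drain rule
described above, assuming a queue is modelled by nothing more than its
count of in-flight requests. The sketch_* names are hypothetical; the
predicate mirrors the condition added at the expire: label in
elv_fq_select_ioq(), and the loop simply pretends one request completes
per deferred selection.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io_queue: in-flight requests only. */
struct sketch_ioq {
	unsigned long dispatched;
};

/*
 * True when selection of a new queue must be deferred because the old
 * queue still has requests at the device. Mirrors:
 *   efqd->fairness >= 2 && !force && ioq && ioq->dispatched
 */
static bool sketch_should_wait_for_drain(unsigned int fairness, bool force,
					 const struct sketch_ioq *ioq)
{
	return fairness >= 2 && !force && ioq != NULL && ioq->dispatched > 0;
}

/*
 * Count how many selection attempts are deferred before the old queue
 * may be expired, assuming one request completes between attempts.
 */
static unsigned int sketch_selects_deferred(unsigned int fairness,
					    struct sketch_ioq *ioq)
{
	unsigned int deferred = 0;

	while (sketch_should_wait_for_drain(fairness, false, ioq)) {
		deferred++;
		ioq->dispatched--;	/* one in-flight request completes */
	}
	return deferred;
}
```

With fairness below 2 the old queue is expired at once and its in-flight
requests overlap with the next queue's slice, which is the accounting
error the commit message describes. A forced selection bypasses the wait
in either case.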
 block/elevator-fq.c |   31 ++++++++++++++++++++++++++++++-
 1 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 68be1dc..7609579 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2038,7 +2038,7 @@ STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_sync_store);
 STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_async_store);
-STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
+STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 2, 0);
 EXPORT_SYMBOL(elv_fairness_store);
 #undef STORE_FUNCTION
 
@@ -2952,6 +2952,24 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 	}
 
 expire:
+	if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched) {
+		/*
+		 * If there are request dispatched from this queue, don't
+		 * dispatch requests from new queue till all the requests from
+		 * this queue have completed.
+		 *
+		 * This helps in attributing right amount of disk time consumed
+		 * by a particular queue when hardware allows queuing.
+		 *
+		 * Set ioq = NULL so that no more requests are dispatched from
+		 * this queue.
+		 */
+		elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
+				" disp=%lu", ioq->dispatched);
+		ioq = NULL;
+		goto keep_queue;
+	}
+
 	elv_ioq_slice_expired(q);
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
@@ -3109,6 +3127,17 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 				 */
 				elv_ioq_arm_slice_timer(q, 1);
 			} else {
+				/* If fairness >=2 and there are requests
+				 * dispatched from this queue, don't dispatch
+				 * new requests from a different queue till
+				 * all requests from this queue have finished.
+				 * This helps in attributing right disk time
+				 * to a queue when hardware supports queuing.
+				 */
+
+				if (efqd->fairness >= 2 && ioq->dispatched)
+					goto done;
+
 				/* Expire the queue */
 				elv_ioq_slice_expired(q);
 			}
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 13/25] io-controller: Wait for requests to complete from last queue before new queue is scheduled
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

o Currently one can dispatch requests from multiple queues to the disk. This
  is true for hardware which supports queuing. So if a disk supports a queue
  depth of 31, it is possible that 20 requests are dispatched from queue 1
  and then the next queue is scheduled in, which dispatches more requests.

o This multiple queue dispatch makes it hard to account accurately for the
  disk time consumed by a particular queue. For example, if one async queue
  is scheduled in, it can dispatch 31 requests to the disk and then it will
  be expired and a new sync queue might get scheduled in. These 31 requests
  might take a long time to finish, but this time is never accounted to the
  async queue which dispatched them.

o This patch introduces functionality where we wait for all the requests
  from the previous queue to finish before the next queue is scheduled in.
  That way a queue is more accurately accounted for the disk time it has
  consumed. Note this still does not take care of errors introduced by disk
  write caching.

o Because the above behavior can result in reduced throughput, it is
  enabled only if the user sets the "fairness" tunable to 2 or higher.

o This patch helps in achieving more isolation between reads and buffered
  writes in different cgroups. Buffered writes typically utilize the full
  queue depth and then expire the queue. On the contrary, sequential reads
  typically drive a queue depth of 1. So despite the fact that writes are
  using more disk time, it is never accounted to the write queue because we
  don't wait for requests to finish after dispatching them. This patch helps
  do more accurate accounting of disk time, especially for buffered writes,
  hence providing better fairness and better isolation between two cgroups
  running read and write workloads.
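
The points above can be illustrated with a small userspace sketch. It is not
part of the patch: struct toy_ioq and the toy_* helpers are hypothetical
names, and the accounting model (a device that serves one request at a time,
with fixed per-request dispatch and service costs) is a simplifying
assumption made only for this example. The predicate mirrors the condition
the patch adds at the "expire:" label.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the io queue; only "dispatched" mirrors the patch. */
struct toy_ioq {
	unsigned long dispatched;	/* requests sent to the device, not yet completed */
};

/*
 * Same condition as the check added at the "expire:" label: when true, the
 * scheduler keeps the current queue instead of switching to a new one.
 */
static bool toy_must_wait_for_drain(int fairness, bool force,
				    const struct toy_ioq *ioq)
{
	return fairness >= 2 && !force && ioq != NULL && ioq->dispatched > 0;
}

/*
 * Toy accounting model (an assumption, not kernel code): handing a request
 * to the device costs dispatch_ticks, completing it costs service_ticks, and
 * the device serves one request at a time. Without the wait, the queue is
 * charged only until its last request is dispatched; with the wait, it is
 * charged until its last request completes.
 */
static unsigned long toy_charged_ticks(int fairness, unsigned long nr_reqs,
				       unsigned long dispatch_ticks,
				       unsigned long service_ticks)
{
	unsigned long until_dispatched = nr_reqs * dispatch_ticks;
	unsigned long until_completed = nr_reqs * service_ticks;

	if (fairness >= 2 && until_completed > until_dispatched)
		return until_completed;
	return until_dispatched;
}

/* Usage example: a buffered writer that fills a queue depth of 31. */
static void toy_self_check(void)
{
	struct toy_ioq writer = { .dispatched = 31 };

	assert(!toy_must_wait_for_drain(1, false, &writer));
	assert(toy_must_wait_for_drain(2, false, &writer));
	assert(!toy_must_wait_for_drain(2, true, &writer));	/* forced dispatch */

	assert(toy_charged_ticks(1, 31, 1, 8) == 31);
	assert(toy_charged_ticks(2, 31, 1, 8) == 248);
}
```

Under these assumed costs the writer is charged 31 ticks without the wait and
248 ticks with it, which is the kind of under-accounting the patch tries to
reduce for buffered writes.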

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |   31 ++++++++++++++++++++++++++++++-
 1 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 68be1dc..7609579 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2038,7 +2038,7 @@ STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_sync_store);
 STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_async_store);
-STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
+STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 2, 0);
 EXPORT_SYMBOL(elv_fairness_store);
 #undef STORE_FUNCTION
 
@@ -2952,6 +2952,24 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 	}
 
 expire:
+	if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched) {
+		/*
+		 * If there are request dispatched from this queue, don't
+		 * dispatch requests from new queue till all the requests from
+		 * this queue have completed.
+		 *
+		 * This helps in attributing right amount of disk time consumed
+		 * by a particular queue when hardware allows queuing.
+		 *
+		 * Set ioq = NULL so that no more requests are dispatched from
+		 * this queue.
+		 */
+		elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
+				" disp=%lu", ioq->dispatched);
+		ioq = NULL;
+		goto keep_queue;
+	}
+
 	elv_ioq_slice_expired(q);
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
@@ -3109,6 +3127,17 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 				 */
 				elv_ioq_arm_slice_timer(q, 1);
 			} else {
+				/* If fairness >=2 and there are requests
+				 * dispatched from this queue, don't dispatch
+				 * new requests from a different queue till
+				 * all requests from this queue have finished.
+				 * This helps in attributing right disk time
+				 * to a queue when hardware supports queuing.
+				 */
+
+				if (efqd->fairness >= 2 && ioq->dispatched)
+					goto done;
+
 				/* Expire the queue */
 				elv_ioq_slice_expired(q);
 			}
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 14/25] io-controller: Separate out queue and data
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (12 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 13/25] io-controller: Wait for requests to complete from last queue before new queue is scheduled Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 15/25] io-conroller: Prepare elevator layer for single queue schedulers Vivek Goyal
                     ` (13 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o So far noop, deadline and AS had one common structure called *_data which
  contained both the queue information where requests are queued and also
  common data used for scheduling. This patch breaks this common structure
  down into two parts, *_queue and *_data. This is along the lines of cfq,
  where all the requests are queued in the queue and common data and
  tunables are part of data.

o It does not change the functionality, but this reorganization helps once
  noop, deadline and AS are changed to use hierarchical fair queuing.

o It looks like the queue_empty function is not required and we can check
  q->nr_sorted in the elevator layer to see whether the ioscheduler queues
  are empty or not.
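
The split described above can be sketched in plain userspace C. This is only
an illustration, not code from the patch: struct toy_queue, struct toy_data
and the toy_* functions are hypothetical, and simple counters stand in for
the fifo and sort lists. The per-queue default for write_batch_count follows
the same rule as as_alloc_as_queue() in the diff below (a tenth of the async
batch expiry, with a floor of 2).

```c
#include <assert.h>
#include <stdlib.h>

enum { TOY_ASYNC = 0, TOY_SYNC = 1 };

/* Per-queue state: where requests wait. One instance per scheduling queue. */
struct toy_queue {
	unsigned int nr_queued[2];	/* stand-in for the fifo/sort lists */
	int write_batch_count;		/* max # of reqs in a write batch */
};

/* Per-device state: tunables and data shared by every queue. */
struct toy_data {
	unsigned long batch_expire[2];
};

/* Allocate a zeroed queue and derive its defaults from the shared tunables. */
static struct toy_queue *toy_alloc_queue(const struct toy_data *td)
{
	struct toy_queue *tq = calloc(1, sizeof(*tq));

	if (tq == NULL)
		return NULL;

	tq->write_batch_count = (int)(td->batch_expire[TOY_ASYNC] / 10);
	if (tq->write_batch_count < 2)
		tq->write_batch_count = 2;
	return tq;
}

/* A queue must be empty before it is freed, as the BUG_ON()s in the patch require. */
static void toy_free_queue(struct toy_queue *tq)
{
	assert(tq->nr_queued[TOY_SYNC] == 0);
	assert(tq->nr_queued[TOY_ASYNC] == 0);
	free(tq);
}
```

Because the tunables stay in the shared structure, several queues can later
be attached to one device without duplicating them, which is what the move to
hierarchical fair queuing relies on.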

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/as-iosched.c       |  208 ++++++++++++++++++++++++++--------------------
 block/deadline-iosched.c |  117 ++++++++++++++++----------
 block/elevator.c         |  111 +++++++++++++++++++++----
 block/noop-iosched.c     |   59 ++++++-------
 include/linux/elevator.h |    8 ++-
 5 files changed, 319 insertions(+), 184 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index 7a12cf6..cafc734 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
 				 * or timed out */
 };
 
-struct as_data {
-	/*
-	 * run time data
-	 */
-
-	struct request_queue *q;	/* the "owner" queue */
-
+struct as_queue {
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -90,6 +84,14 @@ struct as_data {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+	unsigned long last_check_fifo[2];
+	int write_batch_count;		/* max # of reqs in a write batch */
+	int current_write_count;	/* how many requests left this batch */
+	int write_batch_idled;		/* has the write batch gone idle? */
+};
+
+struct as_data {
+	struct request_queue *q;	/* the "owner" queue */
 	sector_t last_sector[2];	/* last SYNC & ASYNC sectors */
 
 	unsigned long exit_prob;	/* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
 	sector_t new_seek_mean;
 
 	unsigned long current_batch_expires;
-	unsigned long last_check_fifo[2];
 	int changed_batch;		/* 1: waiting for old batch to end */
 	int new_batch;			/* 1: waiting on first read complete */
-	int batch_data_dir;		/* current batch SYNC / ASYNC */
-	int write_batch_count;		/* max # of reqs in a write batch */
-	int current_write_count;	/* how many requests left this batch */
-	int write_batch_idled;		/* has the write batch gone idle? */
 
 	enum anticipation_status antic_status;
 	unsigned long antic_start;	/* jiffies: when it started */
 	struct timer_list antic_timer;	/* anticipatory scheduling timer */
-	struct work_struct antic_work;	/* Deferred unplugging */
+	struct work_struct antic_work;  /* Deferred unplugging */
 	struct io_context *io_context;	/* Identify the expected process */
 	int ioc_finished; /* IO associated with io_context is finished */
 	int nr_dispatched;
+	int batch_data_dir;		/* current batch SYNC / ASYNC */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
 /*
  * rb tree support functions
  */
-#define RQ_RB_ROOT(ad, rq)	(&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq)	(&(asq)->sort_list[rq_is_sync((rq))])
 
 static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 {
 	struct request *alias;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
-	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
 		as_move_to_dispatch(ad, alias);
 		as_antic_stop(ad);
 	}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 
 static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
 {
-	elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+	elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
 }
 
 /*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
  * what request to process next. Anticipation works on top of this.
  */
 static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
 {
 	struct rb_node *rbnext = rb_next(&last->rb_node);
 	struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
 	else {
 		const int data_dir = rq_is_sync(last);
 
-		rbnext = rb_first(&ad->sort_list[data_dir]);
+		rbnext = rb_first(&asq->sort_list[data_dir]);
 		if (rbnext && rbnext != &last->rb_node)
 			next = rb_entry_rq(rbnext);
 	}
@@ -789,9 +790,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
 static void as_update_rq(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	/* keep the next_rq cache up to date */
-	ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+	asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
 
 	/*
 	 * have we been anticipating this request?
@@ -812,25 +814,26 @@ static void update_write_batch(struct as_data *ad)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
-	if (write_time > batch && !ad->write_batch_idled) {
+	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
-			ad->write_batch_count /= 2;
+			asq->write_batch_count /= 2;
 		else
-			ad->write_batch_count--;
-	} else if (write_time < batch && ad->current_write_count == 0) {
+			asq->write_batch_count--;
+	} else if (write_time < batch && asq->current_write_count == 0) {
 		if (batch > write_time * 3)
-			ad->write_batch_count *= 2;
+			asq->write_batch_count *= 2;
 		else
-			ad->write_batch_count++;
+			asq->write_batch_count++;
 	}
 
-	if (ad->write_batch_count < 1)
-		ad->write_batch_count = 1;
+	if (asq->write_batch_count < 1)
+		asq->write_batch_count = 1;
 }
 
 /*
@@ -901,6 +904,7 @@ static void as_remove_queued_request(struct request_queue *q,
 	const int data_dir = rq_is_sync(rq);
 	struct as_data *ad = q->elevator->elevator_data;
 	struct io_context *ioc;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
@@ -914,8 +918,8 @@ static void as_remove_queued_request(struct request_queue *q,
 	 * Update the "next_rq" cache if we are about to remove its
 	 * entry
 	 */
-	if (ad->next_rq[data_dir] == rq)
-		ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	if (asq->next_rq[data_dir] == rq)
+		asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	rq_fifo_clear(rq);
 	as_del_rq_rb(ad, rq);
@@ -929,23 +933,23 @@ static void as_remove_queued_request(struct request_queue *q,
  *
  * See as_antic_expired comment.
  */
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
 {
 	struct request *rq;
 	long delta_jif;
 
-	delta_jif = jiffies - ad->last_check_fifo[adir];
+	delta_jif = jiffies - asq->last_check_fifo[adir];
 	if (unlikely(delta_jif < 0))
 		delta_jif = -delta_jif;
 	if (delta_jif < ad->fifo_expire[adir])
 		return 0;
 
-	ad->last_check_fifo[adir] = jiffies;
+	asq->last_check_fifo[adir] = jiffies;
 
-	if (list_empty(&ad->fifo_list[adir]))
+	if (list_empty(&asq->fifo_list[adir]))
 		return 0;
 
-	rq = rq_entry_fifo(ad->fifo_list[adir].next);
+	rq = rq_entry_fifo(asq->fifo_list[adir].next);
 
 	return time_after(jiffies, rq_fifo_time(rq));
 }
@@ -954,7 +958,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
  * as_batch_expired returns true if the current batch has expired. A batch
  * is a set of reads or a set of writes.
  */
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
 {
 	if (ad->changed_batch || ad->new_batch)
 		return 0;
@@ -964,7 +968,7 @@ static inline int as_batch_expired(struct as_data *ad)
 		return time_after(jiffies, ad->current_batch_expires);
 
 	return time_after(jiffies, ad->current_batch_expires)
-		|| ad->current_write_count == 0;
+		|| asq->current_write_count == 0;
 }
 
 /*
@@ -973,6 +977,7 @@ static inline int as_batch_expired(struct as_data *ad)
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
 
@@ -995,12 +1000,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 			ad->io_context = NULL;
 		}
 
-		if (ad->current_write_count != 0)
-			ad->current_write_count--;
+		if (asq->current_write_count != 0)
+			asq->current_write_count--;
 	}
 	ad->ioc_finished = 0;
 
-	ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	/*
 	 * take it off the sort and fifo list, add to dispatch queue
@@ -1024,9 +1029,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 static int as_dispatch_request(struct request_queue *q, int force)
 {
 	struct as_data *ad = q->elevator->elevator_data;
-	const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-	const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
 	struct request *rq;
+	struct as_queue *asq = elv_select_sched_queue(q, force);
+	int reads, writes;
+
+	if (!asq)
+		return 0;
+
+	reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+	writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+
 
 	if (unlikely(force)) {
 		/*
@@ -1042,25 +1054,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		ad->changed_batch = 0;
 		ad->new_batch = 0;
 
-		while (ad->next_rq[BLK_RW_SYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+		while (asq->next_rq[BLK_RW_SYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_SYNC] = jiffies;
 
-		while (ad->next_rq[BLK_RW_ASYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+		while (asq->next_rq[BLK_RW_ASYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
 		return dispatched;
 	}
 
 	/* Signal that the write batch was uncontended, so we can't time it */
 	if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
-		if (ad->current_write_count == 0 || !writes)
-			ad->write_batch_idled = 1;
+		if (asq->current_write_count == 0 || !writes)
+			asq->write_batch_idled = 1;
 	}
 
 	if (!(reads || writes)
@@ -1069,14 +1081,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->changed_batch)
 		return 0;
 
-	if (!(reads && writes && as_batch_expired(ad))) {
+	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
 		 * batch is still running or no reads or no writes
 		 */
-		rq = ad->next_rq[ad->batch_data_dir];
+		rq = asq->next_rq[ad->batch_data_dir];
 
 		if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
-			if (as_fifo_expired(ad, BLK_RW_SYNC))
+			if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
@@ -1100,7 +1112,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
 		if (writes && ad->batch_data_dir == BLK_RW_SYNC)
 			/*
@@ -1113,8 +1125,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
-		ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1124,7 +1136,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));
 
 		if (ad->batch_data_dir == BLK_RW_SYNC) {
 			ad->changed_batch = 1;
@@ -1137,10 +1149,10 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
-		ad->current_write_count = ad->write_batch_count;
-		ad->write_batch_idled = 0;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->current_write_count = asq->write_batch_count;
+		asq->write_batch_idled = 0;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1152,9 +1164,9 @@ dispatch_request:
 	 * If a request has expired, service it.
 	 */
 
-	if (as_fifo_expired(ad, ad->batch_data_dir)) {
+	if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
 fifo_expired:
-		rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+		rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
 	}
 
 	if (ad->changed_batch) {
@@ -1187,6 +1199,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
 	int data_dir;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	RQ_SET_STATE(rq, AS_RQ_NEW);
 
@@ -1205,7 +1218,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
 
 	as_update_rq(ad, rq); /* keep state machine up to date */
 	RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1227,31 +1240,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 }
 
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
-	struct as_data *ad = q->elevator->elevator_data;
-
-	return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
-		&& list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
 static int
 as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
-	struct as_data *ad = q->elevator->elevator_data;
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
+	struct as_queue *asq = elv_get_sched_queue_current(q);
+
+	if (!asq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
 	 */
-	__rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+	__rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_FRONT_MERGE;
@@ -1334,6 +1336,41 @@ static int as_may_queue(struct request_queue *q, int rw)
 	return ret;
 }
 
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
+{
+	struct as_queue *asq;
+	struct as_data *ad = eq->elevator_data;
+
+	asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+	if (asq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+	asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+	asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+	if (ad)
+		asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+	else
+		asq->write_batch_count = default_write_batch_expire / 10;
+
+	if (asq->write_batch_count < 2)
+		asq->write_batch_count = 2;
+out:
+	return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+	kfree(asq);
+}
+
 static void as_exit_queue(struct elevator_queue *e)
 {
 	struct as_data *ad = e->elevator_data;
@@ -1341,9 +1378,6 @@ static void as_exit_queue(struct elevator_queue *e)
 	del_timer_sync(&ad->antic_timer);
 	cancel_work_sync(&ad->antic_work);
 
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
 	put_io_context(ad->io_context);
 	kfree(ad);
 }
@@ -1367,10 +1401,6 @@ static void *as_init_queue(struct request_queue *q)
 	init_timer(&ad->antic_timer);
 	INIT_WORK(&ad->antic_work, as_work_handler);
 
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
-	ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
-	ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
 	ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
 	ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
 	ad->antic_expire = default_antic_expire;
@@ -1378,9 +1408,6 @@ static void *as_init_queue(struct request_queue *q)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
-	ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
-	if (ad->write_batch_count < 2)
-		ad->write_batch_count = 2;
 
 	return ad;
 }
@@ -1478,7 +1505,6 @@ static struct elevator_type iosched_as = {
 		.elevator_add_req_fn =		as_add_request,
 		.elevator_activate_req_fn =	as_activate_request,
 		.elevator_deactivate_req_fn = 	as_deactivate_request,
-		.elevator_queue_empty_fn =	as_queue_empty,
 		.elevator_completed_req_fn =	as_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -1486,6 +1512,8 @@ static struct elevator_type iosched_as = {
 		.elevator_init_fn =		as_init_queue,
 		.elevator_exit_fn =		as_exit_queue,
 		.trim =				as_trim,
+		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+		.elevator_free_sched_queue_fn = as_free_as_queue,
 	},
 
 	.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index b547cbc..5bd5257 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
-struct deadline_data {
-	/*
-	 * run time data
-	 */
-
+struct deadline_queue {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];	
+	struct rb_root sort_list[2];
 	struct list_head fifo_list[2];
-
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
 	struct request *next_rq[2];
 	unsigned int batching;		/* number of sequential requests made */
-	sector_t last_sector;		/* head position */
 	unsigned int starved;		/* times reads have starved writes */
+};
 
+struct deadline_data {
+	struct request_queue *q;
+	sector_t last_sector;		/* head position */
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
-	return &dd->sort_list[rq_data_dir(rq)];
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+	return &dq->sort_list[rq_data_dir(rq)];
 }
 
 /*
@@ -87,9 +87,10 @@ static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	if (dd->next_rq[data_dir] == rq)
-		dd->next_rq[data_dir] = deadline_latter_request(rq);
+	if (dq->next_rq[data_dir] == rq)
+		dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	elv_rb_del(deadline_rb_root(dd, rq), rq);
 }
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(q, rq);
 
 	deadline_add_rq_rb(dd, rq);
 
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
 }
 
 /*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	struct request *__rq;
 	int ret;
+	struct deadline_queue *dq;
+
+	dq = elv_get_sched_queue_current(q);
+	if (!dq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	if (dd->front_merges) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
-		__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+		__rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
 		if (__rq) {
 			BUG_ON(sector != blk_rq_pos(__rq));
 
@@ -207,10 +214,11 @@ static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
-	dd->next_rq[data_dir] = deadline_latter_request(rq);
+	dq->next_rq[READ] = NULL;
+	dq->next_rq[WRITE] = NULL;
+	dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	dd->last_sector = rq_end_sector(rq);
 
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
 
 	/*
 	 * rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
 static int deadline_dispatch_requests(struct request_queue *q, int force)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int reads = !list_empty(&dd->fifo_list[READ]);
-	const int writes = !list_empty(&dd->fifo_list[WRITE]);
+	struct deadline_queue *dq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 	int data_dir;
 
+	if (!dq)
+		return 0;
+
+	reads = !list_empty(&dq->fifo_list[READ]);
+	writes = !list_empty(&dq->fifo_list[WRITE]);
+
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	if (dd->next_rq[WRITE])
-		rq = dd->next_rq[WRITE];
+	if (dq->next_rq[WRITE])
+		rq = dq->next_rq[WRITE];
 	else
-		rq = dd->next_rq[READ];
+		rq = dq->next_rq[READ];
 
-	if (rq && dd->batching < dd->fifo_batch)
+	if (rq && dq->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
 		goto dispatch_request;
 
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
 
-		if (writes && (dd->starved++ >= dd->writes_starved))
+		if (writes && (dq->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
 		data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
 
-		dd->starved = 0;
+		dq->starved = 0;
 
 		data_dir = WRITE;
 
@@ -299,48 +313,62 @@ dispatch_find_request:
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+	if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+		rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
 		 * sort order. No expired requests so continue on from here.
 		 */
-		rq = dd->next_rq[data_dir];
+		rq = dq->next_rq[data_dir];
 	}
 
-	dd->batching = 0;
+	dq->batching = 0;
 
 dispatch_request:
 	/*
 	 * rq is the selected appropriate request.
 	 */
-	dd->batching++;
+	dq->batching++;
 	deadline_move_request(dd, rq);
 
 	return 1;
 }
 
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_queue *dq;
 
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
+	dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+	if (dq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&dq->fifo_list[READ]);
+	INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+	dq->sort_list[READ] = RB_ROOT;
+	dq->sort_list[WRITE] = RB_ROOT;
+out:
+	return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+						void *sched_queue)
+{
+	struct deadline_queue *dq = sched_queue;
+
+	kfree(dq);
 }
 
 static void deadline_exit_queue(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
 	kfree(dd);
 }
 
@@ -355,10 +383,7 @@ static void *deadline_init_queue(struct request_queue *q)
 	if (!dd)
 		return NULL;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
+	dd->q = q;
 	dd->fifo_expire[READ] = read_expire;
 	dd->fifo_expire[WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
@@ -445,13 +470,13 @@ static struct elevator_type iosched_deadline = {
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
-		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_init_fn =		deadline_init_queue,
 		.elevator_exit_fn =		deadline_exit_queue,
+		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
-
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index a6ef1f1..76cfc3a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -175,17 +175,54 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static void *elevator_init_queue(struct request_queue *q,
-				 struct elevator_queue *eq)
+static void *elevator_init_data(struct request_queue *q,
+					struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q);
+	void *data = NULL;
+
+	if (eq->ops->elevator_init_fn) {
+		data = eq->ops->elevator_init_fn(q);
+		if (data)
+			return data;
+		else
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* IO scheduler does not instantiate data (noop), it is not an error */
+	return NULL;
+}
+
+static void elevator_free_sched_queue(struct elevator_queue *eq,
+						void *sched_queue)
+{
+	/* Not all io schedulers (cfq) store sched_queue */
+	if (!sched_queue)
+		return;
+	eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *elevator_alloc_sched_queue(struct request_queue *q,
+					struct elevator_queue *eq)
+{
+	void *sched_queue = NULL;
+
+	if (eq->ops->elevator_alloc_sched_queue_fn) {
+		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+								GFP_KERNEL);
+		if (!sched_queue)
+			return ERR_PTR(-ENOMEM);
+
+	}
+
+	return sched_queue;
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
-			   void *data)
+			   void *data, void *sched_queue)
 {
 	q->elevator = eq;
 	eq->elevator_data = data;
+	eq->sched_queue = sched_queue;
 }
 
 static char chosen_elevator[16];
@@ -255,7 +292,7 @@ int elevator_init(struct request_queue *q, char *name)
 	struct elevator_type *e = NULL;
 	struct elevator_queue *eq;
 	int ret = 0;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	INIT_LIST_HEAD(&q->queue_head);
 	q->last_merge = NULL;
@@ -289,13 +326,21 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!eq)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, eq);
-	if (!data) {
+	data = elevator_init_data(q, eq);
+
+	if (IS_ERR(data)) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, eq);
+
+	if (IS_ERR(sched_queue)) {
 		kobject_put(&eq->kobj);
 		return -ENOMEM;
 	}
 
-	elevator_attach(q, eq, data);
+	elevator_attach(q, eq, data, sched_queue);
 	return ret;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -303,6 +348,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elevator_free_sched_queue(e, e->sched_queue);
 	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
@@ -992,7 +1038,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old_elevator, *e;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	/*
 	 * Allocate new elevator
@@ -1001,10 +1047,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (!e)
 		return 0;
 
-	data = elevator_init_queue(q, e);
-	if (!data) {
+	data = elevator_init_data(q, e);
+
+	if (IS_ERR(data)) {
 		kobject_put(&e->kobj);
-		return 0;
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, e);
+
+	if (IS_ERR(sched_queue)) {
+		kobject_put(&e->kobj);
+		return -ENOMEM;
 	}
 
 	/*
@@ -1021,7 +1075,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * attach and start new elevator
 	 */
-	elevator_attach(q, e, data);
+	elevator_attach(q, e, data, sched_queue);
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -1136,16 +1190,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
 
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
 void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
 {
-	return ioq_sched_queue(req_ioq(rq));
+	/*
+	 * io scheduler is not using fair queuing. Return sched_queue
+	 * pointer stored in elevator_queue. It will be null if io
+	 * scheduler never stored anything there to begin with (cfq)
+	 */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	/*
+	 * IO scheduler is using fair queuing infrastructure. If io scheduler
+	 * has passed a non null rq, retrieve sched_queue pointer from
+	 * there. */
+	if (rq)
+		return ioq_sched_queue(req_ioq(rq));
+
+	return NULL;
 }
 EXPORT_SYMBOL(elv_get_sched_queue);
 
 /* Select an ioscheduler queue to dispatch request from. */
 void *elv_select_sched_queue(struct request_queue *q, int force)
 {
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
 	return ioq_sched_queue(elv_fq_select_ioq(q, force));
 }
 EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+	return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 
-struct noop_data {
+struct noop_queue {
 	struct list_head queue;
 };
 
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
 
 static int noop_dispatch(struct request_queue *q, int force)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_select_sched_queue(q, force);
 
-	if (!list_empty(&nd->queue)) {
+	if (!nq)
+		return 0;
+
+	if (!list_empty(&nq->queue)) {
 		struct request *rq;
-		rq = list_entry(nd->queue.next, struct request, queuelist);
+		rq = list_entry(nq->queue.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		elv_dispatch_sort(q, rq);
 		return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
 
 static void noop_add_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
-	struct noop_data *nd = q->elevator->elevator_data;
-
-	return list_empty(&nd->queue);
+	list_add_tail(&rq->queuelist, &nq->queue);
 }
 
 static struct request *
 noop_former_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.prev == &nd->queue)
+	if (rq->queuelist.prev == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.prev, struct request, queuelist);
 }
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
 static struct request *
 noop_latter_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.next == &nd->queue)
+	if (rq->queuelist.next == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct noop_data *nd;
+	struct noop_queue *nq;
 
-	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
-	if (!nd)
-		return NULL;
-	INIT_LIST_HEAD(&nd->queue);
-	return nd;
+	nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+	if (nq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&nq->queue);
+out:
+	return nq;
 }
 
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 {
-	struct noop_data *nd = e->elevator_data;
+	struct noop_queue *nq = sched_queue;
 
-	BUG_ON(!list_empty(&nd->queue));
-	kfree(nd);
+	kfree(nq);
 }
 
 static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
-		.elevator_queue_empty_fn	= noop_queue_empty,
 		.elevator_former_req_fn		= noop_former_request,
 		.elevator_latter_req_fn		= noop_latter_request,
-		.elevator_init_fn		= noop_init_queue,
-		.elevator_exit_fn		= noop_exit_queue,
+		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
+		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 81f1ed8..e7048b9 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
 typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -70,8 +71,9 @@ struct elevator_ops
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
 
-#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
 	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
 	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
 	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
 
@@ -112,6 +114,7 @@ struct elevator_queue
 {
 	struct elevator_ops *ops;
 	void *elevator_data;
+	void *sched_queue;
 	struct kobject kobj;
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
@@ -258,5 +261,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6

* [PATCH 14/25] io-controller: Separate out queue and data
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

o So far noop, deadline and AS each had a single structure called *_data,
  which held both the queues where requests sit and the common data used
  for scheduling. This patch splits that structure into two parts, *_queue
  and *_data. This follows cfq, where all the requests are queued in the
  queue and the common data and tunables are part of the data.

o This does not change functionality, but the reorganization helps once
  noop, deadline and AS are changed to use hierarchical fair queuing.

o It looks like the queue_empty function is not required; the elevator
  layer can check q->nr_sorted to see whether the io scheduler queues are
  empty or not.
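
The split described above can be pictured with a small user-space sketch.
This is only an illustration under stated assumptions, not the kernel code
in this patch: every demo_* name is hypothetical, the single counter stands
in for q->nr_sorted, and fifo_batch is used merely as an example tunable.
It shows requests living in a *_queue object, tunables living in a *_data
object, and the emptiness check done from a counter instead of a per
scheduler queue_empty hook.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request: only what the sketch needs. */
struct demo_request {
	struct demo_request *next;
	long sector;
};

/* Models *_queue: where requests are actually queued. */
struct demo_queue {
	struct demo_request *head;
	struct demo_request *tail;
};

/* Models *_data: scheduler-wide tunables, no request lists. */
struct demo_data {
	int fifo_batch;		/* example tunable */
};

/* Models the elevator: holds both pointers plus a request counter. */
struct demo_elevator {
	struct demo_data *data;
	struct demo_queue *sched_queue;
	int nr_sorted;		/* stands in for q->nr_sorted */
};

/* Allocate data and queue separately, as the two alloc hooks would. */
static int demo_elevator_init(struct demo_elevator *e)
{
	e->nr_sorted = 0;
	e->data = calloc(1, sizeof(*e->data));
	if (!e->data)
		return -1;
	e->sched_queue = calloc(1, sizeof(*e->sched_queue));
	if (!e->sched_queue) {
		free(e->data);
		e->data = NULL;
		return -1;
	}
	e->data->fifo_batch = 16;
	return 0;
}

/* Free the queue first, then the data. */
static void demo_elevator_exit(struct demo_elevator *e)
{
	free(e->sched_queue);
	free(e->data);
	e->sched_queue = NULL;
	e->data = NULL;
}

/* Append a request to the tail of the queue object. */
static void demo_add_request(struct demo_elevator *e, struct demo_request *rq)
{
	struct demo_queue *dq = e->sched_queue;

	rq->next = NULL;
	if (dq->tail)
		dq->tail->next = rq;
	else
		dq->head = rq;
	dq->tail = rq;
	e->nr_sorted++;
}

/* Pop the oldest request, or return NULL if nothing is queued. */
static struct demo_request *demo_dispatch(struct demo_elevator *e)
{
	struct demo_queue *dq = e->sched_queue;
	struct demo_request *rq;

	if (!dq || !dq->head)
		return NULL;

	rq = dq->head;
	dq->head = rq->next;
	if (!dq->head)
		dq->tail = NULL;
	assert(e->nr_sorted > 0);
	e->nr_sorted--;
	return rq;
}

/* Emptiness is answered from the counter, with no scheduler hook. */
static int demo_queue_empty(const struct demo_elevator *e)
{
	return e->nr_sorted == 0;
}
```

Keeping the request lists out of demo_data means several queue objects could
later share one set of tunables, which is the property the hierarchical fair
queuing work relies on.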

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/as-iosched.c       |  208 ++++++++++++++++++++++++++--------------------
 block/deadline-iosched.c |  117 ++++++++++++++++----------
 block/elevator.c         |  111 +++++++++++++++++++++----
 block/noop-iosched.c     |   59 ++++++-------
 include/linux/elevator.h |    8 ++-
 5 files changed, 319 insertions(+), 184 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index 7a12cf6..cafc734 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
 				 * or timed out */
 };
 
-struct as_data {
-	/*
-	 * run time data
-	 */
-
-	struct request_queue *q;	/* the "owner" queue */
-
+struct as_queue {
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -90,6 +84,14 @@ struct as_data {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+	unsigned long last_check_fifo[2];
+	int write_batch_count;		/* max # of reqs in a write batch */
+	int current_write_count;	/* how many requests left this batch */
+	int write_batch_idled;		/* has the write batch gone idle? */
+};
+
+struct as_data {
+	struct request_queue *q;	/* the "owner" queue */
 	sector_t last_sector[2];	/* last SYNC & ASYNC sectors */
 
 	unsigned long exit_prob;	/* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
 	sector_t new_seek_mean;
 
 	unsigned long current_batch_expires;
-	unsigned long last_check_fifo[2];
 	int changed_batch;		/* 1: waiting for old batch to end */
 	int new_batch;			/* 1: waiting on first read complete */
-	int batch_data_dir;		/* current batch SYNC / ASYNC */
-	int write_batch_count;		/* max # of reqs in a write batch */
-	int current_write_count;	/* how many requests left this batch */
-	int write_batch_idled;		/* has the write batch gone idle? */
 
 	enum anticipation_status antic_status;
 	unsigned long antic_start;	/* jiffies: when it started */
 	struct timer_list antic_timer;	/* anticipatory scheduling timer */
-	struct work_struct antic_work;	/* Deferred unplugging */
+	struct work_struct antic_work;  /* Deferred unplugging */
 	struct io_context *io_context;	/* Identify the expected process */
 	int ioc_finished; /* IO associated with io_context is finished */
 	int nr_dispatched;
+	int batch_data_dir;		/* current batch SYNC / ASYNC */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
 /*
  * rb tree support functions
  */
-#define RQ_RB_ROOT(ad, rq)	(&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq)	(&(asq)->sort_list[rq_is_sync((rq))])
 
 static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 {
 	struct request *alias;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
-	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
 		as_move_to_dispatch(ad, alias);
 		as_antic_stop(ad);
 	}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 
 static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
 {
-	elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+	elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
 }
 
 /*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
  * what request to process next. Anticipation works on top of this.
  */
 static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
 {
 	struct rb_node *rbnext = rb_next(&last->rb_node);
 	struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
 	else {
 		const int data_dir = rq_is_sync(last);
 
-		rbnext = rb_first(&ad->sort_list[data_dir]);
+		rbnext = rb_first(&asq->sort_list[data_dir]);
 		if (rbnext && rbnext != &last->rb_node)
 			next = rb_entry_rq(rbnext);
 	}
@@ -789,9 +790,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
 static void as_update_rq(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	/* keep the next_rq cache up to date */
-	ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+	asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
 
 	/*
 	 * have we been anticipating this request?
@@ -812,25 +814,26 @@ static void update_write_batch(struct as_data *ad)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
-	if (write_time > batch && !ad->write_batch_idled) {
+	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
-			ad->write_batch_count /= 2;
+			asq->write_batch_count /= 2;
 		else
-			ad->write_batch_count--;
-	} else if (write_time < batch && ad->current_write_count == 0) {
+			asq->write_batch_count--;
+	} else if (write_time < batch && asq->current_write_count == 0) {
 		if (batch > write_time * 3)
-			ad->write_batch_count *= 2;
+			asq->write_batch_count *= 2;
 		else
-			ad->write_batch_count++;
+			asq->write_batch_count++;
 	}
 
-	if (ad->write_batch_count < 1)
-		ad->write_batch_count = 1;
+	if (asq->write_batch_count < 1)
+		asq->write_batch_count = 1;
 }
 
 /*
@@ -901,6 +904,7 @@ static void as_remove_queued_request(struct request_queue *q,
 	const int data_dir = rq_is_sync(rq);
 	struct as_data *ad = q->elevator->elevator_data;
 	struct io_context *ioc;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
@@ -914,8 +918,8 @@ static void as_remove_queued_request(struct request_queue *q,
 	 * Update the "next_rq" cache if we are about to remove its
 	 * entry
 	 */
-	if (ad->next_rq[data_dir] == rq)
-		ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	if (asq->next_rq[data_dir] == rq)
+		asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	rq_fifo_clear(rq);
 	as_del_rq_rb(ad, rq);
@@ -929,23 +933,23 @@ static void as_remove_queued_request(struct request_queue *q,
  *
  * See as_antic_expired comment.
  */
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
 {
 	struct request *rq;
 	long delta_jif;
 
-	delta_jif = jiffies - ad->last_check_fifo[adir];
+	delta_jif = jiffies - asq->last_check_fifo[adir];
 	if (unlikely(delta_jif < 0))
 		delta_jif = -delta_jif;
 	if (delta_jif < ad->fifo_expire[adir])
 		return 0;
 
-	ad->last_check_fifo[adir] = jiffies;
+	asq->last_check_fifo[adir] = jiffies;
 
-	if (list_empty(&ad->fifo_list[adir]))
+	if (list_empty(&asq->fifo_list[adir]))
 		return 0;
 
-	rq = rq_entry_fifo(ad->fifo_list[adir].next);
+	rq = rq_entry_fifo(asq->fifo_list[adir].next);
 
 	return time_after(jiffies, rq_fifo_time(rq));
 }
@@ -954,7 +958,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
  * as_batch_expired returns true if the current batch has expired. A batch
  * is a set of reads or a set of writes.
  */
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
 {
 	if (ad->changed_batch || ad->new_batch)
 		return 0;
@@ -964,7 +968,7 @@ static inline int as_batch_expired(struct as_data *ad)
 		return time_after(jiffies, ad->current_batch_expires);
 
 	return time_after(jiffies, ad->current_batch_expires)
-		|| ad->current_write_count == 0;
+		|| asq->current_write_count == 0;
 }
 
 /*
@@ -973,6 +977,7 @@ static inline int as_batch_expired(struct as_data *ad)
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
 
@@ -995,12 +1000,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 			ad->io_context = NULL;
 		}
 
-		if (ad->current_write_count != 0)
-			ad->current_write_count--;
+		if (asq->current_write_count != 0)
+			asq->current_write_count--;
 	}
 	ad->ioc_finished = 0;
 
-	ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	/*
 	 * take it off the sort and fifo list, add to dispatch queue
@@ -1024,9 +1029,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 static int as_dispatch_request(struct request_queue *q, int force)
 {
 	struct as_data *ad = q->elevator->elevator_data;
-	const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-	const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
 	struct request *rq;
+	struct as_queue *asq = elv_select_sched_queue(q, force);
+	int reads, writes;
+
+	if (!asq)
+		return 0;
+
+	reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+	writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+
 
 	if (unlikely(force)) {
 		/*
@@ -1042,25 +1054,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		ad->changed_batch = 0;
 		ad->new_batch = 0;
 
-		while (ad->next_rq[BLK_RW_SYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+		while (asq->next_rq[BLK_RW_SYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_SYNC] = jiffies;
 
-		while (ad->next_rq[BLK_RW_ASYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+		while (asq->next_rq[BLK_RW_ASYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
 		return dispatched;
 	}
 
 	/* Signal that the write batch was uncontended, so we can't time it */
 	if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
-		if (ad->current_write_count == 0 || !writes)
-			ad->write_batch_idled = 1;
+		if (asq->current_write_count == 0 || !writes)
+			asq->write_batch_idled = 1;
 	}
 
 	if (!(reads || writes)
@@ -1069,14 +1081,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->changed_batch)
 		return 0;
 
-	if (!(reads && writes && as_batch_expired(ad))) {
+	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
 		 * batch is still running or no reads or no writes
 		 */
-		rq = ad->next_rq[ad->batch_data_dir];
+		rq = asq->next_rq[ad->batch_data_dir];
 
 		if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
-			if (as_fifo_expired(ad, BLK_RW_SYNC))
+			if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
@@ -1100,7 +1112,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
 		if (writes && ad->batch_data_dir == BLK_RW_SYNC)
 			/*
@@ -1113,8 +1125,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
-		ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1124,7 +1136,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));
 
 		if (ad->batch_data_dir == BLK_RW_SYNC) {
 			ad->changed_batch = 1;
@@ -1137,10 +1149,10 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
-		ad->current_write_count = ad->write_batch_count;
-		ad->write_batch_idled = 0;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->current_write_count = asq->write_batch_count;
+		asq->write_batch_idled = 0;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1152,9 +1164,9 @@ dispatch_request:
 	 * If a request has expired, service it.
 	 */
 
-	if (as_fifo_expired(ad, ad->batch_data_dir)) {
+	if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
 fifo_expired:
-		rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+		rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
 	}
 
 	if (ad->changed_batch) {
@@ -1187,6 +1199,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
 	int data_dir;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	RQ_SET_STATE(rq, AS_RQ_NEW);
 
@@ -1205,7 +1218,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
 
 	as_update_rq(ad, rq); /* keep state machine up to date */
 	RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1227,31 +1240,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 }
 
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
-	struct as_data *ad = q->elevator->elevator_data;
-
-	return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
-		&& list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
 static int
 as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
-	struct as_data *ad = q->elevator->elevator_data;
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
+	struct as_queue *asq = elv_get_sched_queue_current(q);
+
+	if (!asq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
 	 */
-	__rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+	__rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_FRONT_MERGE;
@@ -1334,6 +1336,41 @@ static int as_may_queue(struct request_queue *q, int rw)
 	return ret;
 }
 
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
+{
+	struct as_queue *asq;
+	struct as_data *ad = eq->elevator_data;
+
+	asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+	if (asq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+	asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+	asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+	if (ad)
+		asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+	else
+		asq->write_batch_count = default_write_batch_expire / 10;
+
+	if (asq->write_batch_count < 2)
+		asq->write_batch_count = 2;
+out:
+	return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+	kfree(asq);
+}
+
 static void as_exit_queue(struct elevator_queue *e)
 {
 	struct as_data *ad = e->elevator_data;
@@ -1341,9 +1378,6 @@ static void as_exit_queue(struct elevator_queue *e)
 	del_timer_sync(&ad->antic_timer);
 	cancel_work_sync(&ad->antic_work);
 
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
 	put_io_context(ad->io_context);
 	kfree(ad);
 }
@@ -1367,10 +1401,6 @@ static void *as_init_queue(struct request_queue *q)
 	init_timer(&ad->antic_timer);
 	INIT_WORK(&ad->antic_work, as_work_handler);
 
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
-	ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
-	ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
 	ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
 	ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
 	ad->antic_expire = default_antic_expire;
@@ -1378,9 +1408,6 @@ static void *as_init_queue(struct request_queue *q)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
-	ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
-	if (ad->write_batch_count < 2)
-		ad->write_batch_count = 2;
 
 	return ad;
 }
@@ -1478,7 +1505,6 @@ static struct elevator_type iosched_as = {
 		.elevator_add_req_fn =		as_add_request,
 		.elevator_activate_req_fn =	as_activate_request,
 		.elevator_deactivate_req_fn = 	as_deactivate_request,
-		.elevator_queue_empty_fn =	as_queue_empty,
 		.elevator_completed_req_fn =	as_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -1486,6 +1512,8 @@ static struct elevator_type iosched_as = {
 		.elevator_init_fn =		as_init_queue,
 		.elevator_exit_fn =		as_exit_queue,
 		.trim =				as_trim,
+		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+		.elevator_free_sched_queue_fn = as_free_as_queue,
 	},
 
 	.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index b547cbc..5bd5257 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
-struct deadline_data {
-	/*
-	 * run time data
-	 */
-
+struct deadline_queue {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];	
+	struct rb_root sort_list[2];
 	struct list_head fifo_list[2];
-
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
 	struct request *next_rq[2];
 	unsigned int batching;		/* number of sequential requests made */
-	sector_t last_sector;		/* head position */
 	unsigned int starved;		/* times reads have starved writes */
+};
 
+struct deadline_data {
+	struct request_queue *q;
+	sector_t last_sector;		/* head position */
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
-	return &dd->sort_list[rq_data_dir(rq)];
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+	return &dq->sort_list[rq_data_dir(rq)];
 }
 
 /*
@@ -87,9 +87,10 @@ static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	if (dd->next_rq[data_dir] == rq)
-		dd->next_rq[data_dir] = deadline_latter_request(rq);
+	if (dq->next_rq[data_dir] == rq)
+		dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	elv_rb_del(deadline_rb_root(dd, rq), rq);
 }
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(q, rq);
 
 	deadline_add_rq_rb(dd, rq);
 
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
 }
 
 /*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	struct request *__rq;
 	int ret;
+	struct deadline_queue *dq;
+
+	dq = elv_get_sched_queue_current(q);
+	if (!dq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	if (dd->front_merges) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
-		__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+		__rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
 		if (__rq) {
 			BUG_ON(sector != blk_rq_pos(__rq));
 
@@ -207,10 +214,11 @@ static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
-	dd->next_rq[data_dir] = deadline_latter_request(rq);
+	dq->next_rq[READ] = NULL;
+	dq->next_rq[WRITE] = NULL;
+	dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	dd->last_sector = rq_end_sector(rq);
 
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
 
 	/*
 	 * rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
 static int deadline_dispatch_requests(struct request_queue *q, int force)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int reads = !list_empty(&dd->fifo_list[READ]);
-	const int writes = !list_empty(&dd->fifo_list[WRITE]);
+	struct deadline_queue *dq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 	int data_dir;
 
+	if (!dq)
+		return 0;
+
+	reads = !list_empty(&dq->fifo_list[READ]);
+	writes = !list_empty(&dq->fifo_list[WRITE]);
+
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	if (dd->next_rq[WRITE])
-		rq = dd->next_rq[WRITE];
+	if (dq->next_rq[WRITE])
+		rq = dq->next_rq[WRITE];
 	else
-		rq = dd->next_rq[READ];
+		rq = dq->next_rq[READ];
 
-	if (rq && dd->batching < dd->fifo_batch)
+	if (rq && dq->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
 		goto dispatch_request;
 
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
 
-		if (writes && (dd->starved++ >= dd->writes_starved))
+		if (writes && (dq->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
 		data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
 
-		dd->starved = 0;
+		dq->starved = 0;
 
 		data_dir = WRITE;
 
@@ -299,48 +313,62 @@ dispatch_find_request:
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+	if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+		rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
 		 * sort order. No expired requests so continue on from here.
 		 */
-		rq = dd->next_rq[data_dir];
+		rq = dq->next_rq[data_dir];
 	}
 
-	dd->batching = 0;
+	dq->batching = 0;
 
 dispatch_request:
 	/*
 	 * rq is the selected appropriate request.
 	 */
-	dd->batching++;
+	dq->batching++;
 	deadline_move_request(dd, rq);
 
 	return 1;
 }
 
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_queue *dq;
 
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
+	dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+	if (dq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&dq->fifo_list[READ]);
+	INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+	dq->sort_list[READ] = RB_ROOT;
+	dq->sort_list[WRITE] = RB_ROOT;
+out:
+	return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+						void *sched_queue)
+{
+	struct deadline_queue *dq = sched_queue;
+
+	kfree(dq);
 }
 
 static void deadline_exit_queue(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
 	kfree(dd);
 }
 
@@ -355,10 +383,7 @@ static void *deadline_init_queue(struct request_queue *q)
 	if (!dd)
 		return NULL;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
+	dd->q = q;
 	dd->fifo_expire[READ] = read_expire;
 	dd->fifo_expire[WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
@@ -445,13 +470,13 @@ static struct elevator_type iosched_deadline = {
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
-		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_init_fn =		deadline_init_queue,
 		.elevator_exit_fn =		deadline_exit_queue,
+		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
-
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index a6ef1f1..76cfc3a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -175,17 +175,54 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static void *elevator_init_queue(struct request_queue *q,
-				 struct elevator_queue *eq)
+static void *elevator_init_data(struct request_queue *q,
+					struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q);
+	void *data = NULL;
+
+	if (eq->ops->elevator_init_fn) {
+		data = eq->ops->elevator_init_fn(q);
+		if (data)
+			return data;
+		else
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* IO scheduler does not instantiate data (noop), it is not an error */
+	return NULL;
+}
+
+static void elevator_free_sched_queue(struct elevator_queue *eq,
+						void *sched_queue)
+{
+	/* Not all io schedulers (cfq) store sched_queue */
+	if (!sched_queue)
+		return;
+	eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *elevator_alloc_sched_queue(struct request_queue *q,
+					struct elevator_queue *eq)
+{
+	void *sched_queue = NULL;
+
+	if (eq->ops->elevator_alloc_sched_queue_fn) {
+		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+								GFP_KERNEL);
+		if (!sched_queue)
+			return ERR_PTR(-ENOMEM);
+
+	}
+
+	return sched_queue;
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
-			   void *data)
+			   void *data, void *sched_queue)
 {
 	q->elevator = eq;
 	eq->elevator_data = data;
+	eq->sched_queue = sched_queue;
 }
 
 static char chosen_elevator[16];
@@ -255,7 +292,7 @@ int elevator_init(struct request_queue *q, char *name)
 	struct elevator_type *e = NULL;
 	struct elevator_queue *eq;
 	int ret = 0;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	INIT_LIST_HEAD(&q->queue_head);
 	q->last_merge = NULL;
@@ -289,13 +326,21 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!eq)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, eq);
-	if (!data) {
+	data = elevator_init_data(q, eq);
+
+	if (IS_ERR(data)) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, eq);
+
+	if (IS_ERR(sched_queue)) {
 		kobject_put(&eq->kobj);
 		return -ENOMEM;
 	}
 
-	elevator_attach(q, eq, data);
+	elevator_attach(q, eq, data, sched_queue);
 	return ret;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -303,6 +348,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elevator_free_sched_queue(e, e->sched_queue);
 	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
@@ -992,7 +1038,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old_elevator, *e;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	/*
 	 * Allocate new elevator
@@ -1001,10 +1047,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (!e)
 		return 0;
 
-	data = elevator_init_queue(q, e);
-	if (!data) {
+	data = elevator_init_data(q, e);
+
+	if (IS_ERR(data)) {
 		kobject_put(&e->kobj);
-		return 0;
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, e);
+
+	if (IS_ERR(sched_queue)) {
+		kobject_put(&e->kobj);
+		return -ENOMEM;
 	}
 
 	/*
@@ -1021,7 +1075,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * attach and start new elevator
 	 */
-	elevator_attach(q, e, data);
+	elevator_attach(q, e, data, sched_queue);
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -1136,16 +1190,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
 
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
 void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
 {
-	return ioq_sched_queue(req_ioq(rq));
+	/*
+	 * io scheduler is not using fair queuing. Return sched_queue
+	 * pointer stored in elevator_queue. It will be null if io
+	 * scheduler never stored anything there to begin with (cfq)
+	 */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	/*
+	 * IO scheduler is using fair queuing infrastructure. If io scheduler
+	 * has passed a non null rq, retrieve sched_queue pointer from
+	 * there. */
+	if (rq)
+		return ioq_sched_queue(req_ioq(rq));
+
+	return NULL;
 }
 EXPORT_SYMBOL(elv_get_sched_queue);
 
 /* Select an ioscheduler queue to dispatch request from. */
 void *elv_select_sched_queue(struct request_queue *q, int force)
 {
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
 	return ioq_sched_queue(elv_fq_select_ioq(q, force));
 }
 EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+	return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 
-struct noop_data {
+struct noop_queue {
 	struct list_head queue;
 };
 
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
 
 static int noop_dispatch(struct request_queue *q, int force)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_select_sched_queue(q, force);
 
-	if (!list_empty(&nd->queue)) {
+	if (!nq)
+		return 0;
+
+	if (!list_empty(&nq->queue)) {
 		struct request *rq;
-		rq = list_entry(nd->queue.next, struct request, queuelist);
+		rq = list_entry(nq->queue.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		elv_dispatch_sort(q, rq);
 		return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
 
 static void noop_add_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
-	struct noop_data *nd = q->elevator->elevator_data;
-
-	return list_empty(&nd->queue);
+	list_add_tail(&rq->queuelist, &nq->queue);
 }
 
 static struct request *
 noop_former_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.prev == &nd->queue)
+	if (rq->queuelist.prev == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.prev, struct request, queuelist);
 }
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
 static struct request *
 noop_latter_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.next == &nd->queue)
+	if (rq->queuelist.next == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct noop_data *nd;
+	struct noop_queue *nq;
 
-	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
-	if (!nd)
-		return NULL;
-	INIT_LIST_HEAD(&nd->queue);
-	return nd;
+	nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+	if (nq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&nq->queue);
+out:
+	return nq;
 }
 
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 {
-	struct noop_data *nd = e->elevator_data;
+	struct noop_queue *nq = sched_queue;
 
-	BUG_ON(!list_empty(&nd->queue));
-	kfree(nd);
+	kfree(nq);
 }
 
 static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
-		.elevator_queue_empty_fn	= noop_queue_empty,
 		.elevator_former_req_fn		= noop_former_request,
 		.elevator_latter_req_fn		= noop_latter_request,
-		.elevator_init_fn		= noop_init_queue,
-		.elevator_exit_fn		= noop_exit_queue,
+		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
+		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 81f1ed8..e7048b9 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
 typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -70,8 +71,9 @@ struct elevator_ops
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
 
-#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
 	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
 	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
 	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
 
@@ -112,6 +114,7 @@ struct elevator_queue
 {
 	struct elevator_ops *ops;
 	void *elevator_data;
+	void *sched_queue;
 	struct kobject kobj;
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
@@ -258,5 +261,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6



* [PATCH 14/25] io-controller: Separate out queue and data
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

o Until now noop, deadline and AS each had a single structure, *_data,
  that held both the lists on which requests are queued and the common
  data used for scheduling. This patch splits that structure into two
  parts, *_queue and *_data. This follows cfq, where all requests are
  queued in the queue structure while common data and tunables live in
  the data structure.

o This does not change functionality, but the reorganization helps once
  noop, deadline and AS are changed to use hierarchical fair queuing.

o It looks like the queue_empty function is not required: the elevator
  layer can check q->nr_sorted to see whether the ioscheduler queues are
  empty.
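
For readers who want the shape of the split without reading the whole diff,
here is a minimal user-space sketch. It is not kernel code: toy_queue,
toy_data, toy_elevator and every toy_* function are hypothetical names made
up for illustration, and the fixed-size ring buffer stands in for the real
fifo and rb-tree lists. It only shows the idea described above: request
storage lives in a per-queue object, tunables live in the shared data
object, and the elevator holds a pointer to each.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_FIFO_MAX 8

/* Per-queue state: where requests actually sit (plays the role of *_queue). */
struct toy_queue {
	int fifo[TOY_FIFO_MAX];	/* ring buffer of request ids */
	int head;		/* index of the oldest request */
	int nr_queued;		/* number of requests currently queued */
};

/* Scheduler-wide tunables shared by all queues (plays the role of *_data). */
struct toy_data {
	int fifo_batch;		/* max requests handed out per dispatch call */
};

/* Holds both pointers side by side, loosely like struct elevator_queue. */
struct toy_elevator {
	void *elevator_data;
	void *sched_queue;
};

/* Allocate a zeroed queue; returns NULL on allocation failure. */
static struct toy_queue *toy_alloc_queue(void)
{
	return calloc(1, sizeof(struct toy_queue));
}

/* Free a queue; it must be empty, mirroring the BUG_ON checks in the patch. */
static void toy_free_queue(struct toy_queue *tq)
{
	assert(tq == NULL || tq->nr_queued == 0);
	free(tq);
}

/* Queue a request on the elevator's queue; returns 0 on success, -1 if full. */
static int toy_add_request(struct toy_elevator *e, int rq)
{
	struct toy_queue *tq = e->sched_queue;

	if (!tq || tq->nr_queued == TOY_FIFO_MAX)
		return -1;

	tq->fifo[(tq->head + tq->nr_queued) % TOY_FIFO_MAX] = rq;
	tq->nr_queued++;
	return 0;
}

/*
 * Dispatch in FIFO order. Storage comes from the queue, the batch limit
 * comes from the data object. Returns the number of requests written to out.
 */
static int toy_dispatch(struct toy_elevator *e, int *out)
{
	struct toy_queue *tq = e->sched_queue;
	struct toy_data *td = e->elevator_data;
	int n = 0;

	if (!tq || !td)
		return 0;

	while (tq->nr_queued > 0 && n < td->fifo_batch) {
		out[n++] = tq->fifo[tq->head];
		tq->head = (tq->head + 1) % TOY_FIFO_MAX;
		tq->nr_queued--;
	}
	return n;
}
```

With the storage separated like this, swapping which toy_queue the elevator
dispatches from does not touch the tunables, which is roughly why the split
is expected to help once several queues exist per device.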

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/as-iosched.c       |  208 ++++++++++++++++++++++++++--------------------
 block/deadline-iosched.c |  117 ++++++++++++++++----------
 block/elevator.c         |  111 +++++++++++++++++++++----
 block/noop-iosched.c     |   59 ++++++-------
 include/linux/elevator.h |    8 ++-
 5 files changed, 319 insertions(+), 184 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index 7a12cf6..cafc734 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -76,13 +76,7 @@ enum anticipation_status {
 				 * or timed out */
 };
 
-struct as_data {
-	/*
-	 * run time data
-	 */
-
-	struct request_queue *q;	/* the "owner" queue */
-
+struct as_queue {
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -90,6 +84,14 @@ struct as_data {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+	unsigned long last_check_fifo[2];
+	int write_batch_count;		/* max # of reqs in a write batch */
+	int current_write_count;	/* how many requests left this batch */
+	int write_batch_idled;		/* has the write batch gone idle? */
+};
+
+struct as_data {
+	struct request_queue *q;	/* the "owner" queue */
 	sector_t last_sector[2];	/* last SYNC & ASYNC sectors */
 
 	unsigned long exit_prob;	/* probability a task will exit while
@@ -103,21 +105,17 @@ struct as_data {
 	sector_t new_seek_mean;
 
 	unsigned long current_batch_expires;
-	unsigned long last_check_fifo[2];
 	int changed_batch;		/* 1: waiting for old batch to end */
 	int new_batch;			/* 1: waiting on first read complete */
-	int batch_data_dir;		/* current batch SYNC / ASYNC */
-	int write_batch_count;		/* max # of reqs in a write batch */
-	int current_write_count;	/* how many requests left this batch */
-	int write_batch_idled;		/* has the write batch gone idle? */
 
 	enum anticipation_status antic_status;
 	unsigned long antic_start;	/* jiffies: when it started */
 	struct timer_list antic_timer;	/* anticipatory scheduling timer */
-	struct work_struct antic_work;	/* Deferred unplugging */
+	struct work_struct antic_work;  /* Deferred unplugging */
 	struct io_context *io_context;	/* Identify the expected process */
 	int ioc_finished; /* IO associated with io_context is finished */
 	int nr_dispatched;
+	int batch_data_dir;		/* current batch SYNC / ASYNC */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as_put_io_context(struct request *rq)
 /*
  * rb tree support functions
  */
-#define RQ_RB_ROOT(ad, rq)	(&(ad)->sort_list[rq_is_sync((rq))])
+#define RQ_RB_ROOT(asq, rq)	(&(asq)->sort_list[rq_is_sync((rq))])
 
 static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 {
 	struct request *alias;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
-	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(ad, rq), rq)))) {
+	while ((unlikely(alias = elv_rb_add(RQ_RB_ROOT(asq, rq), rq)))) {
 		as_move_to_dispatch(ad, alias);
 		as_antic_stop(ad);
 	}
@@ -272,7 +271,9 @@ static void as_add_rq_rb(struct as_data *ad, struct request *rq)
 
 static inline void as_del_rq_rb(struct as_data *ad, struct request *rq)
 {
-	elv_rb_del(RQ_RB_ROOT(ad, rq), rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+
+	elv_rb_del(RQ_RB_ROOT(asq, rq), rq);
 }
 
 /*
@@ -366,7 +367,7 @@ as_choose_req(struct as_data *ad, struct request *rq1, struct request *rq2)
  * what request to process next. Anticipation works on top of this.
  */
 static struct request *
-as_find_next_rq(struct as_data *ad, struct request *last)
+as_find_next_rq(struct as_data *ad, struct as_queue *asq, struct request *last)
 {
 	struct rb_node *rbnext = rb_next(&last->rb_node);
 	struct rb_node *rbprev = rb_prev(&last->rb_node);
@@ -382,7 +383,7 @@ as_find_next_rq(struct as_data *ad, struct request *last)
 	else {
 		const int data_dir = rq_is_sync(last);
 
-		rbnext = rb_first(&ad->sort_list[data_dir]);
+		rbnext = rb_first(&asq->sort_list[data_dir]);
 		if (rbnext && rbnext != &last->rb_node)
 			next = rb_entry_rq(rbnext);
 	}
@@ -789,9 +790,10 @@ static int as_can_anticipate(struct as_data *ad, struct request *rq)
 static void as_update_rq(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	/* keep the next_rq cache up to date */
-	ad->next_rq[data_dir] = as_choose_req(ad, rq, ad->next_rq[data_dir]);
+	asq->next_rq[data_dir] = as_choose_req(ad, rq, asq->next_rq[data_dir]);
 
 	/*
 	 * have we been anticipating this request?
@@ -812,25 +814,26 @@ static void update_write_batch(struct as_data *ad)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
+	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
-	if (write_time > batch && !ad->write_batch_idled) {
+	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
-			ad->write_batch_count /= 2;
+			asq->write_batch_count /= 2;
 		else
-			ad->write_batch_count--;
-	} else if (write_time < batch && ad->current_write_count == 0) {
+			asq->write_batch_count--;
+	} else if (write_time < batch && asq->current_write_count == 0) {
 		if (batch > write_time * 3)
-			ad->write_batch_count *= 2;
+			asq->write_batch_count *= 2;
 		else
-			ad->write_batch_count++;
+			asq->write_batch_count++;
 	}
 
-	if (ad->write_batch_count < 1)
-		ad->write_batch_count = 1;
+	if (asq->write_batch_count < 1)
+		asq->write_batch_count = 1;
 }
 
 /*
@@ -901,6 +904,7 @@ static void as_remove_queued_request(struct request_queue *q,
 	const int data_dir = rq_is_sync(rq);
 	struct as_data *ad = q->elevator->elevator_data;
 	struct io_context *ioc;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
@@ -914,8 +918,8 @@ static void as_remove_queued_request(struct request_queue *q,
 	 * Update the "next_rq" cache if we are about to remove its
 	 * entry
 	 */
-	if (ad->next_rq[data_dir] == rq)
-		ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	if (asq->next_rq[data_dir] == rq)
+		asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	rq_fifo_clear(rq);
 	as_del_rq_rb(ad, rq);
@@ -929,23 +933,23 @@ static void as_remove_queued_request(struct request_queue *q,
  *
  * See as_antic_expired comment.
  */
-static int as_fifo_expired(struct as_data *ad, int adir)
+static int as_fifo_expired(struct as_data *ad, struct as_queue *asq, int adir)
 {
 	struct request *rq;
 	long delta_jif;
 
-	delta_jif = jiffies - ad->last_check_fifo[adir];
+	delta_jif = jiffies - asq->last_check_fifo[adir];
 	if (unlikely(delta_jif < 0))
 		delta_jif = -delta_jif;
 	if (delta_jif < ad->fifo_expire[adir])
 		return 0;
 
-	ad->last_check_fifo[adir] = jiffies;
+	asq->last_check_fifo[adir] = jiffies;
 
-	if (list_empty(&ad->fifo_list[adir]))
+	if (list_empty(&asq->fifo_list[adir]))
 		return 0;
 
-	rq = rq_entry_fifo(ad->fifo_list[adir].next);
+	rq = rq_entry_fifo(asq->fifo_list[adir].next);
 
 	return time_after(jiffies, rq_fifo_time(rq));
 }
@@ -954,7 +958,7 @@ static int as_fifo_expired(struct as_data *ad, int adir)
  * as_batch_expired returns true if the current batch has expired. A batch
  * is a set of reads or a set of writes.
  */
-static inline int as_batch_expired(struct as_data *ad)
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq)
 {
 	if (ad->changed_batch || ad->new_batch)
 		return 0;
@@ -964,7 +968,7 @@ static inline int as_batch_expired(struct as_data *ad)
 		return time_after(jiffies, ad->current_batch_expires);
 
 	return time_after(jiffies, ad->current_batch_expires)
-		|| ad->current_write_count == 0;
+		|| asq->current_write_count == 0;
 }
 
 /*
@@ -973,6 +977,7 @@ static inline int as_batch_expired(struct as_data *ad)
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 {
 	const int data_dir = rq_is_sync(rq);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
 
@@ -995,12 +1000,12 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 			ad->io_context = NULL;
 		}
 
-		if (ad->current_write_count != 0)
-			ad->current_write_count--;
+		if (asq->current_write_count != 0)
+			asq->current_write_count--;
 	}
 	ad->ioc_finished = 0;
 
-	ad->next_rq[data_dir] = as_find_next_rq(ad, rq);
+	asq->next_rq[data_dir] = as_find_next_rq(ad, asq, rq);
 
 	/*
 	 * take it off the sort and fifo list, add to dispatch queue
@@ -1024,9 +1029,16 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 static int as_dispatch_request(struct request_queue *q, int force)
 {
 	struct as_data *ad = q->elevator->elevator_data;
-	const int reads = !list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-	const int writes = !list_empty(&ad->fifo_list[BLK_RW_ASYNC]);
 	struct request *rq;
+	struct as_queue *asq = elv_select_sched_queue(q, force);
+	int reads, writes;
+
+	if (!asq)
+		return 0;
+
+	reads = !list_empty(&asq->fifo_list[BLK_RW_SYNC]);
+	writes = !list_empty(&asq->fifo_list[BLK_RW_ASYNC]);
+
 
 	if (unlikely(force)) {
 		/*
@@ -1042,25 +1054,25 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		ad->changed_batch = 0;
 		ad->new_batch = 0;
 
-		while (ad->next_rq[BLK_RW_SYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_SYNC]);
+		while (asq->next_rq[BLK_RW_SYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_SYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_SYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_SYNC] = jiffies;
 
-		while (ad->next_rq[BLK_RW_ASYNC]) {
-			as_move_to_dispatch(ad, ad->next_rq[BLK_RW_ASYNC]);
+		while (asq->next_rq[BLK_RW_ASYNC]) {
+			as_move_to_dispatch(ad, asq->next_rq[BLK_RW_ASYNC]);
 			dispatched++;
 		}
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
 		return dispatched;
 	}
 
 	/* Signal that the write batch was uncontended, so we can't time it */
 	if (ad->batch_data_dir == BLK_RW_ASYNC && !reads) {
-		if (ad->current_write_count == 0 || !writes)
-			ad->write_batch_idled = 1;
+		if (asq->current_write_count == 0 || !writes)
+			asq->write_batch_idled = 1;
 	}
 
 	if (!(reads || writes)
@@ -1069,14 +1081,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->changed_batch)
 		return 0;
 
-	if (!(reads && writes && as_batch_expired(ad))) {
+	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
 		 * batch is still running or no reads or no writes
 		 */
-		rq = ad->next_rq[ad->batch_data_dir];
+		rq = asq->next_rq[ad->batch_data_dir];
 
 		if (ad->batch_data_dir == BLK_RW_SYNC && ad->antic_expire) {
-			if (as_fifo_expired(ad, BLK_RW_SYNC))
+			if (as_fifo_expired(ad, asq, BLK_RW_SYNC))
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
@@ -1100,7 +1112,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_SYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
 		if (writes && ad->batch_data_dir == BLK_RW_SYNC)
 			/*
@@ -1113,8 +1125,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_SYNC].next);
-		ad->last_check_fifo[ad->batch_data_dir] = jiffies;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
+		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1124,7 +1136,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&ad->sort_list[BLK_RW_ASYNC]));
+		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_ASYNC]));
 
 		if (ad->batch_data_dir == BLK_RW_SYNC) {
 			ad->changed_batch = 1;
@@ -1137,10 +1149,10 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
-		ad->current_write_count = ad->write_batch_count;
-		ad->write_batch_idled = 0;
-		rq = rq_entry_fifo(ad->fifo_list[BLK_RW_ASYNC].next);
-		ad->last_check_fifo[BLK_RW_ASYNC] = jiffies;
+		asq->current_write_count = asq->write_batch_count;
+		asq->write_batch_idled = 0;
+		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
+		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 		goto dispatch_request;
 	}
 
@@ -1152,9 +1164,9 @@ dispatch_request:
 	 * If a request has expired, service it.
 	 */
 
-	if (as_fifo_expired(ad, ad->batch_data_dir)) {
+	if (as_fifo_expired(ad, asq, ad->batch_data_dir)) {
 fifo_expired:
-		rq = rq_entry_fifo(ad->fifo_list[ad->batch_data_dir].next);
+		rq = rq_entry_fifo(asq->fifo_list[ad->batch_data_dir].next);
 	}
 
 	if (ad->changed_batch) {
@@ -1187,6 +1199,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
 	int data_dir;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	RQ_SET_STATE(rq, AS_RQ_NEW);
 
@@ -1205,7 +1218,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + ad->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &ad->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &asq->fifo_list[data_dir]);
 
 	as_update_rq(ad, rq); /* keep state machine up to date */
 	RQ_SET_STATE(rq, AS_RQ_QUEUED);
@@ -1227,31 +1240,20 @@ static void as_deactivate_request(struct request_queue *q, struct request *rq)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 }
 
-/*
- * as_queue_empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as_queue_empty(struct request_queue *q)
-{
-	struct as_data *ad = q->elevator->elevator_data;
-
-	return list_empty(&ad->fifo_list[BLK_RW_ASYNC])
-		&& list_empty(&ad->fifo_list[BLK_RW_SYNC]);
-}
-
 static int
 as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
-	struct as_data *ad = q->elevator->elevator_data;
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
+	struct as_queue *asq = elv_get_sched_queue_current(q);
+
+	if (!asq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
 	 */
-	__rq = elv_rb_find(&ad->sort_list[bio_data_dir(bio)], rb_key);
+	__rq = elv_rb_find(&asq->sort_list[bio_data_dir(bio)], rb_key);
 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
 		*req = __rq;
 		return ELEVATOR_FRONT_MERGE;
@@ -1334,6 +1336,41 @@ static int as_may_queue(struct request_queue *q, int rw)
 	return ret;
 }
 
+/* Called with queue lock held */
+static void *as_alloc_as_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
+{
+	struct as_queue *asq;
+	struct as_data *ad = eq->elevator_data;
+
+	asq = kmalloc_node(sizeof(*asq), gfp_mask | __GFP_ZERO, q->node);
+	if (asq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_SYNC]);
+	INIT_LIST_HEAD(&asq->fifo_list[BLK_RW_ASYNC]);
+	asq->sort_list[BLK_RW_SYNC] = RB_ROOT;
+	asq->sort_list[BLK_RW_ASYNC] = RB_ROOT;
+	if (ad)
+		asq->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
+	else
+		asq->write_batch_count = default_write_batch_expire / 10;
+
+	if (asq->write_batch_count < 2)
+		asq->write_batch_count = 2;
+out:
+	return asq;
+}
+
+static void as_free_as_queue(struct elevator_queue *e, void *sched_queue)
+{
+	struct as_queue *asq = sched_queue;
+
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_SYNC]));
+	BUG_ON(!list_empty(&asq->fifo_list[BLK_RW_ASYNC]));
+	kfree(asq);
+}
+
 static void as_exit_queue(struct elevator_queue *e)
 {
 	struct as_data *ad = e->elevator_data;
@@ -1341,9 +1378,6 @@ static void as_exit_queue(struct elevator_queue *e)
 	del_timer_sync(&ad->antic_timer);
 	cancel_work_sync(&ad->antic_work);
 
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_SYNC]));
-	BUG_ON(!list_empty(&ad->fifo_list[BLK_RW_ASYNC]));
-
 	put_io_context(ad->io_context);
 	kfree(ad);
 }
@@ -1367,10 +1401,6 @@ static void *as_init_queue(struct request_queue *q)
 	init_timer(&ad->antic_timer);
 	INIT_WORK(&ad->antic_work, as_work_handler);
 
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_SYNC]);
-	INIT_LIST_HEAD(&ad->fifo_list[BLK_RW_ASYNC]);
-	ad->sort_list[BLK_RW_SYNC] = RB_ROOT;
-	ad->sort_list[BLK_RW_ASYNC] = RB_ROOT;
 	ad->fifo_expire[BLK_RW_SYNC] = default_read_expire;
 	ad->fifo_expire[BLK_RW_ASYNC] = default_write_expire;
 	ad->antic_expire = default_antic_expire;
@@ -1378,9 +1408,6 @@ static void *as_init_queue(struct request_queue *q)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
-	ad->write_batch_count = ad->batch_expire[BLK_RW_ASYNC] / 10;
-	if (ad->write_batch_count < 2)
-		ad->write_batch_count = 2;
 
 	return ad;
 }
@@ -1478,7 +1505,6 @@ static struct elevator_type iosched_as = {
 		.elevator_add_req_fn =		as_add_request,
 		.elevator_activate_req_fn =	as_activate_request,
 		.elevator_deactivate_req_fn = 	as_deactivate_request,
-		.elevator_queue_empty_fn =	as_queue_empty,
 		.elevator_completed_req_fn =	as_completed_request,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
@@ -1486,6 +1512,8 @@ static struct elevator_type iosched_as = {
 		.elevator_init_fn =		as_init_queue,
 		.elevator_exit_fn =		as_exit_queue,
 		.trim =				as_trim,
+		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
+		.elevator_free_sched_queue_fn = as_free_as_queue,
 	},
 
 	.elevator_attrs = as_attrs,
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index b547cbc..5bd5257 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,25 +23,23 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
-struct deadline_data {
-	/*
-	 * run time data
-	 */
-
+struct deadline_queue {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];	
+	struct rb_root sort_list[2];
 	struct list_head fifo_list[2];
-
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
 	struct request *next_rq[2];
 	unsigned int batching;		/* number of sequential requests made */
-	sector_t last_sector;		/* head position */
 	unsigned int starved;		/* times reads have starved writes */
+};
 
+struct deadline_data {
+	struct request_queue *q;
+	sector_t last_sector;		/* head position */
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -56,7 +54,9 @@ static void deadline_move_request(struct deadline_data *, struct request *);
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
-	return &dd->sort_list[rq_data_dir(rq)];
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
+
+	return &dq->sort_list[rq_data_dir(rq)];
 }
 
 /*
@@ -87,9 +87,10 @@ static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	if (dd->next_rq[data_dir] == rq)
-		dd->next_rq[data_dir] = deadline_latter_request(rq);
+	if (dq->next_rq[data_dir] == rq)
+		dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	elv_rb_del(deadline_rb_root(dd, rq), rq);
 }
@@ -102,6 +103,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(q, rq);
 
 	deadline_add_rq_rb(dd, rq);
 
@@ -109,7 +111,7 @@ deadline_add_request(struct request_queue *q, struct request *rq)
 	 * set expire time and add to fifo list
 	 */
 	rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
-	list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+	list_add_tail(&rq->queuelist, &dq->fifo_list[data_dir]);
 }
 
 /*
@@ -129,6 +131,11 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	struct deadline_data *dd = q->elevator->elevator_data;
 	struct request *__rq;
 	int ret;
+	struct deadline_queue *dq;
+
+	dq = elv_get_sched_queue_current(q);
+	if (!dq)
+		return ELEVATOR_NO_MERGE;
 
 	/*
 	 * check for front merge
@@ -136,7 +143,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	if (dd->front_merges) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
-		__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+		__rq = elv_rb_find(&dq->sort_list[bio_data_dir(bio)], sector);
 		if (__rq) {
 			BUG_ON(sector != blk_rq_pos(__rq));
 
@@ -207,10 +214,11 @@ static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
 	const int data_dir = rq_data_dir(rq);
+	struct deadline_queue *dq = elv_get_sched_queue(dd->q, rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
-	dd->next_rq[data_dir] = deadline_latter_request(rq);
+	dq->next_rq[READ] = NULL;
+	dq->next_rq[WRITE] = NULL;
+	dq->next_rq[data_dir] = deadline_latter_request(rq);
 
 	dd->last_sector = rq_end_sector(rq);
 
@@ -225,9 +233,9 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_queue *dq, int ddir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dq->fifo_list[ddir].next);
 
 	/*
 	 * rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
 static int deadline_dispatch_requests(struct request_queue *q, int force)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int reads = !list_empty(&dd->fifo_list[READ]);
-	const int writes = !list_empty(&dd->fifo_list[WRITE]);
+	struct deadline_queue *dq = elv_select_sched_queue(q, force);
+	int reads, writes;
 	struct request *rq;
 	int data_dir;
 
+	if (!dq)
+		return 0;
+
+	reads = !list_empty(&dq->fifo_list[READ]);
+	writes = !list_empty(&dq->fifo_list[WRITE]);
+
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	if (dd->next_rq[WRITE])
-		rq = dd->next_rq[WRITE];
+	if (dq->next_rq[WRITE])
+		rq = dq->next_rq[WRITE];
 	else
-		rq = dd->next_rq[READ];
+		rq = dq->next_rq[READ];
 
-	if (rq && dd->batching < dd->fifo_batch)
+	if (rq && dq->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
 		goto dispatch_request;
 
@@ -268,9 +282,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 	 */
 
 	if (reads) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[READ]));
 
-		if (writes && (dd->starved++ >= dd->writes_starved))
+		if (writes && (dq->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
 		data_dir = READ;
@@ -284,9 +298,9 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
 
 	if (writes) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dq->sort_list[WRITE]));
 
-		dd->starved = 0;
+		dq->starved = 0;
 
 		data_dir = WRITE;
 
@@ -299,48 +313,62 @@ dispatch_find_request:
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
+	if (deadline_check_fifo(dq, data_dir) || !dq->next_rq[data_dir]) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+		rq = rq_entry_fifo(dq->fifo_list[data_dir].next);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
 		 * sort order. No expired requests so continue on from here.
 		 */
-		rq = dd->next_rq[data_dir];
+		rq = dq->next_rq[data_dir];
 	}
 
-	dd->batching = 0;
+	dq->batching = 0;
 
 dispatch_request:
 	/*
 	 * rq is the selected appropriate request.
 	 */
-	dd->batching++;
+	dq->batching++;
 	deadline_move_request(dd, rq);
 
 	return 1;
 }
 
-static int deadline_queue_empty(struct request_queue *q)
+static void *deadline_alloc_deadline_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_queue *dq;
 
-	return list_empty(&dd->fifo_list[WRITE])
-		&& list_empty(&dd->fifo_list[READ]);
+	dq = kmalloc_node(sizeof(*dq), gfp_mask | __GFP_ZERO, q->node);
+	if (dq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&dq->fifo_list[READ]);
+	INIT_LIST_HEAD(&dq->fifo_list[WRITE]);
+	dq->sort_list[READ] = RB_ROOT;
+	dq->sort_list[WRITE] = RB_ROOT;
+out:
+	return dq;
+}
+
+static void deadline_free_deadline_queue(struct elevator_queue *e,
+						void *sched_queue)
+{
+	struct deadline_queue *dq = sched_queue;
+
+	kfree(dq);
 }
 
 static void deadline_exit_queue(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
-
 	kfree(dd);
 }
 
@@ -355,10 +383,7 @@ static void *deadline_init_queue(struct request_queue *q)
 	if (!dd)
 		return NULL;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
+	dd->q = q;
 	dd->fifo_expire[READ] = read_expire;
 	dd->fifo_expire[WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
@@ -445,13 +470,13 @@ static struct elevator_type iosched_deadline = {
 		.elevator_merge_req_fn =	deadline_merged_requests,
 		.elevator_dispatch_fn =		deadline_dispatch_requests,
 		.elevator_add_req_fn =		deadline_add_request,
-		.elevator_queue_empty_fn =	deadline_queue_empty,
 		.elevator_former_req_fn =	elv_rb_former_request,
 		.elevator_latter_req_fn =	elv_rb_latter_request,
 		.elevator_init_fn =		deadline_init_queue,
 		.elevator_exit_fn =		deadline_exit_queue,
+		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
+		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
-
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator.c b/block/elevator.c
index a6ef1f1..76cfc3a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -175,17 +175,54 @@ static struct elevator_type *elevator_get(const char *name)
 	return e;
 }
 
-static void *elevator_init_queue(struct request_queue *q,
-				 struct elevator_queue *eq)
+static void *elevator_init_data(struct request_queue *q,
+					struct elevator_queue *eq)
 {
-	return eq->ops->elevator_init_fn(q);
+	void *data = NULL;
+
+	if (eq->ops->elevator_init_fn) {
+		data = eq->ops->elevator_init_fn(q);
+		if (data)
+			return data;
+		else
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* IO scheduler does not instantiate data (noop), it is not an error */
+	return NULL;
+}
+
+static void elevator_free_sched_queue(struct elevator_queue *eq,
+						void *sched_queue)
+{
+	/* Not all io schedulers (cfq) store sched_queue */
+	if (!sched_queue)
+		return;
+	eq->ops->elevator_free_sched_queue_fn(eq, sched_queue);
+}
+
+static void *elevator_alloc_sched_queue(struct request_queue *q,
+					struct elevator_queue *eq)
+{
+	void *sched_queue = NULL;
+
+	if (eq->ops->elevator_alloc_sched_queue_fn) {
+		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
+								GFP_KERNEL);
+		if (!sched_queue)
+			return ERR_PTR(-ENOMEM);
+
+	}
+
+	return sched_queue;
 }
 
 static void elevator_attach(struct request_queue *q, struct elevator_queue *eq,
-			   void *data)
+			   void *data, void *sched_queue)
 {
 	q->elevator = eq;
 	eq->elevator_data = data;
+	eq->sched_queue = sched_queue;
 }
 
 static char chosen_elevator[16];
@@ -255,7 +292,7 @@ int elevator_init(struct request_queue *q, char *name)
 	struct elevator_type *e = NULL;
 	struct elevator_queue *eq;
 	int ret = 0;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	INIT_LIST_HEAD(&q->queue_head);
 	q->last_merge = NULL;
@@ -289,13 +326,21 @@ int elevator_init(struct request_queue *q, char *name)
 	if (!eq)
 		return -ENOMEM;
 
-	data = elevator_init_queue(q, eq);
-	if (!data) {
+	data = elevator_init_data(q, eq);
+
+	if (IS_ERR(data)) {
+		kobject_put(&eq->kobj);
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, eq);
+
+	if (IS_ERR(sched_queue)) {
 		kobject_put(&eq->kobj);
 		return -ENOMEM;
 	}
 
-	elevator_attach(q, eq, data);
+	elevator_attach(q, eq, data, sched_queue);
 	return ret;
 }
 EXPORT_SYMBOL(elevator_init);
@@ -303,6 +348,7 @@ EXPORT_SYMBOL(elevator_init);
 void elevator_exit(struct elevator_queue *e)
 {
 	mutex_lock(&e->sysfs_lock);
+	elevator_free_sched_queue(e, e->sched_queue);
 	elv_exit_fq_data(e);
 	if (e->ops->elevator_exit_fn)
 		e->ops->elevator_exit_fn(e);
@@ -992,7 +1038,7 @@ EXPORT_SYMBOL_GPL(elv_unregister);
 static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 {
 	struct elevator_queue *old_elevator, *e;
-	void *data;
+	void *data = NULL, *sched_queue = NULL;
 
 	/*
 	 * Allocate new elevator
@@ -1001,10 +1047,18 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	if (!e)
 		return 0;
 
-	data = elevator_init_queue(q, e);
-	if (!data) {
+	data = elevator_init_data(q, e);
+
+	if (IS_ERR(data)) {
 		kobject_put(&e->kobj);
-		return 0;
+		return -ENOMEM;
+	}
+
+	sched_queue = elevator_alloc_sched_queue(q, e);
+
+	if (IS_ERR(sched_queue)) {
+		kobject_put(&e->kobj);
+		return -ENOMEM;
 	}
 
 	/*
@@ -1021,7 +1075,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * attach and start new elevator
 	 */
-	elevator_attach(q, e, data);
+	elevator_attach(q, e, data, sched_queue);
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -1136,16 +1190,43 @@ struct request *elv_rb_latter_request(struct request_queue *q,
 }
 EXPORT_SYMBOL(elv_rb_latter_request);
 
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
 void *elv_get_sched_queue(struct request_queue *q, struct request *rq)
 {
-	return ioq_sched_queue(req_ioq(rq));
+	/*
+	 * io scheduler is not using fair queuing. Return sched_queue
+	 * pointer stored in elevator_queue. It will be null if io
+	 * scheduler never stored anything there to begin with (cfq)
+	 */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	/*
+	 * IO scheduler is using fair queuing infrastructure. If io scheduler
+	 * has passed a non null rq, retrieve sched_queue pointer from
+	 * there. */
+	if (rq)
+		return ioq_sched_queue(req_ioq(rq));
+
+	return NULL;
 }
 EXPORT_SYMBOL(elv_get_sched_queue);
 
 /* Select an ioscheduler queue to dispatch request from. */
 void *elv_select_sched_queue(struct request_queue *q, int force)
 {
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
 	return ioq_sched_queue(elv_fq_select_ioq(q, force));
 }
 EXPORT_SYMBOL(elv_select_sched_queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv_get_sched_queue_current(struct request_queue *q)
+{
+	return q->elevator->sched_queue;
+}
+EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 3a0d369..d587832 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -7,7 +7,7 @@
 #include <linux/module.h>
 #include <linux/init.h>
 
-struct noop_data {
+struct noop_queue {
 	struct list_head queue;
 };
 
@@ -19,11 +19,14 @@ static void noop_merged_requests(struct request_queue *q, struct request *rq,
 
 static int noop_dispatch(struct request_queue *q, int force)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_select_sched_queue(q, force);
 
-	if (!list_empty(&nd->queue)) {
+	if (!nq)
+		return 0;
+
+	if (!list_empty(&nq->queue)) {
 		struct request *rq;
-		rq = list_entry(nd->queue.next, struct request, queuelist);
+		rq = list_entry(nq->queue.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		elv_dispatch_sort(q, rq);
 		return 1;
@@ -33,24 +36,17 @@ static int noop_dispatch(struct request_queue *q, int force)
 
 static void noop_add_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	list_add_tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop_queue_empty(struct request_queue *q)
-{
-	struct noop_data *nd = q->elevator->elevator_data;
-
-	return list_empty(&nd->queue);
+	list_add_tail(&rq->queuelist, &nq->queue);
 }
 
 static struct request *
 noop_former_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.prev == &nd->queue)
+	if (rq->queuelist.prev == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.prev, struct request, queuelist);
 }
@@ -58,30 +54,32 @@ noop_former_request(struct request_queue *q, struct request *rq)
 static struct request *
 noop_latter_request(struct request_queue *q, struct request *rq)
 {
-	struct noop_data *nd = q->elevator->elevator_data;
+	struct noop_queue *nq = elv_get_sched_queue(q, rq);
 
-	if (rq->queuelist.next == &nd->queue)
+	if (rq->queuelist.next == &nq->queue)
 		return NULL;
 	return list_entry(rq->queuelist.next, struct request, queuelist);
 }
 
-static void *noop_init_queue(struct request_queue *q)
+static void *noop_alloc_noop_queue(struct request_queue *q,
+				struct elevator_queue *eq, gfp_t gfp_mask)
 {
-	struct noop_data *nd;
+	struct noop_queue *nq;
 
-	nd = kmalloc_node(sizeof(*nd), GFP_KERNEL, q->node);
-	if (!nd)
-		return NULL;
-	INIT_LIST_HEAD(&nd->queue);
-	return nd;
+	nq = kmalloc_node(sizeof(*nq), gfp_mask | __GFP_ZERO, q->node);
+	if (nq == NULL)
+		goto out;
+
+	INIT_LIST_HEAD(&nq->queue);
+out:
+	return nq;
 }
 
-static void noop_exit_queue(struct elevator_queue *e)
+static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 {
-	struct noop_data *nd = e->elevator_data;
+	struct noop_queue *nq = sched_queue;
 
-	BUG_ON(!list_empty(&nd->queue));
-	kfree(nd);
+	kfree(nq);
 }
 
 static struct elevator_type elevator_noop = {
@@ -89,11 +87,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_merge_req_fn		= noop_merged_requests,
 		.elevator_dispatch_fn		= noop_dispatch,
 		.elevator_add_req_fn		= noop_add_request,
-		.elevator_queue_empty_fn	= noop_queue_empty,
 		.elevator_former_req_fn		= noop_former_request,
 		.elevator_latter_req_fn		= noop_latter_request,
-		.elevator_init_fn		= noop_init_queue,
-		.elevator_exit_fn		= noop_exit_queue,
+		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
+		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 81f1ed8..e7048b9 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,8 +30,9 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-#ifdef CONFIG_ELV_FAIR_QUEUING
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
+#ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
 typedef void (elevator_active_ioq_reset_fn) (struct request_queue *, void*);
 typedef void (elevator_arm_slice_timer_fn) (struct request_queue*, void*);
@@ -70,8 +71,9 @@ struct elevator_ops
 	elevator_exit_fn *elevator_exit_fn;
 	void (*trim)(struct io_context *);
 
-#ifdef CONFIG_ELV_FAIR_QUEUING
+	elevator_alloc_sched_queue_fn *elevator_alloc_sched_queue_fn;
 	elevator_free_sched_queue_fn *elevator_free_sched_queue_fn;
+#ifdef CONFIG_ELV_FAIR_QUEUING
 	elevator_active_ioq_set_fn *elevator_active_ioq_set_fn;
 	elevator_active_ioq_reset_fn *elevator_active_ioq_reset_fn;
 
@@ -112,6 +114,7 @@ struct elevator_queue
 {
 	struct elevator_ops *ops;
 	void *elevator_data;
+	void *sched_queue;
 	struct kobject kobj;
 	struct elevator_type *elevator_type;
 	struct mutex sysfs_lock;
@@ -258,5 +261,6 @@ static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
+extern void *elv_get_sched_queue_current(struct request_queue *q);
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6


* [PATCH 15/25] io-controller: Prepare elevator layer for single queue schedulers
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (13 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 14/25] io-controller: Separate out queue and data Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 16/25] io-controller: noop changes for hierarchical fair queuing Vivek Goyal
                     ` (12 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

The elevator layer now supports hierarchical fair queuing. cfq has been
migrated to use it, and it is now time to do the groundwork for noop,
deadline and AS.

noop, deadline and AS don't maintain separate queues for different
processes; there is only a single queue. In a hierarchical setup, one can
effectively think of there being one queue per cgroup, where requests from
all the processes in the cgroup are queued.

Generally the io scheduler takes care of creating queues. Because there is
only one queue here, we have modified the common layer to take care of queue
creation and some other functionality. This special casing helps keep the
changes to noop, deadline and AS to a minimum.

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/as-iosched.c       |    2 +-
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |  206 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h      |   54 ++++++++++++
 block/elevator.c         |   37 ++++++++-
 block/noop-iosched.c     |    2 +-
 include/linux/elevator.h |   16 ++++-
 7 files changed, 312 insertions(+), 7 deletions(-)
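
As a rough illustration of the "one queue per cgroup" idea described in the
changelog, here is a minimal userspace sketch. It is not the code in this
patch: struct io_queue and struct io_group are cut-down stand-ins, and
group_get_single_ioq()/group_put_single_ioq() are hypothetical names chosen
for the example. It only shows the common layer keeping a single queue
pointer in the group and creating the queue on first use.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's io_queue and io_group. */
struct io_queue {
	int nr_queued;		/* requests currently queued */
};

struct io_group {
	struct io_queue *ioq;	/* the single queue of this group, or NULL */
};

/* Find the group's single queue, creating it on first use. */
static struct io_queue *group_get_single_ioq(struct io_group *iog)
{
	if (!iog->ioq)
		iog->ioq = calloc(1, sizeof(*iog->ioq));
	return iog->ioq;	/* NULL only if the allocation failed */
}

/* Release the group's queue; it is expected to be empty by now. */
static void group_put_single_ioq(struct io_group *iog)
{
	if (!iog->ioq)
		return;
	assert(iog->ioq->nr_queued == 0);
	free(iog->ioq);
	iog->ioq = NULL;
}
```

Every request from a group would go through group_get_single_ioq(), so the
io scheduler itself never has to decide which queue a request belongs to.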

diff --git a/block/as-iosched.c b/block/as-iosched.c
index cafc734..e3514eb 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1338,7 +1338,7 @@ static int as_may_queue(struct request_queue *q, int rw)
 
 /* Called with queue lock held */
 static void *as_alloc_as_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct as_queue *asq;
 	struct as_data *ad = eq->elevator_data;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5bd5257..03d7208 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -341,7 +341,7 @@ dispatch_request:
 }
 
 static void *deadline_alloc_deadline_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct deadline_queue *dq;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7609579..25fdac6 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1031,6 +1031,12 @@ io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
 
 	/* Free up async idle queue */
 	elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* Optimization for io schedulers having single ioq */
+	if (elv_iosched_single_ioq(e))
+		elv_release_ioq(e, &iog->ioq);
+#endif
 }
 
 /* Mainly hierarchical grouping code */
@@ -1867,6 +1873,162 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
 
 	return (iog == __iog);
 }
+
+/*
+ * Find/Create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain only single
+ * io queue per cgroup. In this case common layer can just maintain a
+ * pointer in group data structure and keep track of it.
+ *
+ * For the io schedulers like cfq, which maintain multiple io queues per
+ * cgroup, and decide the io queue of request based on process, this
+ * function is not invoked.
+ */
+int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask)
+{
+	struct elevator_queue *e = q->elevator;
+	unsigned long flags;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog;
+	void *sched_q = NULL, *new_sched_q = NULL;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+	spin_lock_irqsave(q->queue_lock, flags);
+
+retry:
+	/* Determine the io group request belongs to */
+	iog = io_get_io_group(q, 1);
+	BUG_ON(!iog);
+
+	/* Get the iosched queue */
+	ioq = iog->ioq;
+	if (!ioq) {
+		/* io queue and sched_queue need to be allocated */
+		BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+		if (new_ioq) {
+			goto alloc_sched_q;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+							| __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq)
+				goto queue_fail;
+		}
+
+alloc_sched_q:
+		if (new_sched_q) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			sched_q = new_sched_q;
+			new_sched_q = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			/* Call io scheduler to create scheduler queue */
+			new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+					e, gfp_mask | __GFP_NOFAIL
+					| __GFP_ZERO, new_ioq);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+						gfp_mask | __GFP_ZERO, ioq);
+			if (!sched_q) {
+				elv_free_ioq(ioq);
+				goto queue_fail;
+			}
+		}
+
+		elv_init_ioq(e, ioq, iog, sched_q, IOPRIO_CLASS_BE,
+					IOPRIO_NORM, 1);
+		io_group_set_ioq(iog, ioq);
+		elv_mark_ioq_sync(ioq);
+		elv_get_iog(iog);
+	}
+
+	if (new_sched_q)
+		e->ops->elevator_free_sched_queue_fn(q->elevator, new_sched_q);
+
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
+	/* Request reference */
+	elv_get_ioq(ioq);
+	rq->ioq = ioq;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 0;
+
+queue_fail:
+	WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+	elv_schedule_dispatch(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 1;
+}
+
+/*
+ * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	struct io_group *iog;
+
+	/* Determine the io group and io queue of the bio submitting task */
+	iog = io_get_io_group(q, 0);
+	if (!iog) {
+		/* The task may belong to a cgroup for which the io group has
+		 * not been set up yet. */
+		return NULL;
+	}
+	return iog->ioq;
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	if (ioq) {
+		rq->ioq = NULL;
+		elv_put_ioq(ioq);
+	}
+}
+
+static inline int is_only_root_group(void)
+{
+	if (list_empty(&io_root_cgroup.css.cgroup->children))
+		return 1;
+
+	return 0;
+}
+
 #else /* GROUP_IOSCHED */
 static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
 {
@@ -1916,6 +2078,11 @@ struct io_group *io_get_io_group(struct request_queue *q, int create)
 	return q->elevator->efqd.root_group;
 }
 EXPORT_SYMBOL(io_get_io_group);
+
+static inline int is_only_root_group(void)
+{
+	return 1;
+}
 #endif /* GROUP_IOSCHED */
 
 /* Elevator fair queuing function */
@@ -2206,7 +2373,12 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
 	ioq->efqd = efqd;
 	elv_ioq_set_ioprio_class(ioq, ioprio_class);
 	elv_ioq_set_ioprio(ioq, ioprio);
-	ioq->pid = current->pid;
+
+	if (elv_iosched_single_ioq(eq))
+		ioq->pid = 0;
+	else
+		ioq->pid = current->pid;
+
 	ioq->sched_queue = sched_queue;
 	if (is_sync && !elv_ioq_class_idle(ioq))
 		elv_mark_ioq_idle_window(ioq);
@@ -2589,6 +2761,14 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	struct io_entity *entity, *new_entity;
 	struct io_group *iog = NULL, *new_iog = NULL;
 
+	/*
+	 * Currently only CFQ has preemption logic. Other schedulers don't
+	 * have any notion of preemption across classes or preemption within
+	 * class etc.
+	 */
+	if (elv_iosched_single_ioq(eq))
+		return 0;
+
 	ioq = elv_active_ioq(eq);
 
 	if (!ioq)
@@ -2873,6 +3053,17 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/*
+	 * If there is only root group present, don't expire the queue for
+	 * single queue ioschedulers (noop, deadline, AS). It is unnecessary
+	 * overhead.
+	 */
+
+	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator)) {
+		elv_log_ioq(efqd, ioq, "select: only root group, no expiry");
+		goto keep_queue;
+	}
+
 	/* We are waiting for this queue to become busy before it expires.*/
 	if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
 		ioq = NULL;
@@ -3112,6 +3303,19 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		}
 
 		/*
+		 * If there is only root group present, don't expire the queue
+		 * for single queue ioschedulers (noop, deadline, AS). It is
+		 * unnecessary overhead.
+		 */
+
+		if (is_only_root_group() &&
+			elv_iosched_single_ioq(q->elevator)) {
+			elv_log_ioq(efqd, ioq, "select: only root group,"
+					" no expiry");
+			goto done;
+		}
+
+		/*
 		 * If there are no requests waiting in this queue, and
 		 * there are other queues ready to issue requests, AND
 		 * those other queues are issuing requests within our
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index a414309..baa6cee 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -255,6 +255,9 @@ struct io_group {
 
 	/* The device MKDEV(major, minor), this group has been created for */
 	dev_t	dev;
+
+	/* Single ioq per group, used for noop, deadline, anticipatory */
+	struct io_queue *ioq;
 };
 
 /**
@@ -514,6 +517,21 @@ static inline int update_requeue(struct io_queue *ioq, int requeue)
 	return requeue;
 }
 
+extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask);
+extern void elv_fq_unset_request_ioq(struct request_queue *q,
+					struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+	BUG_ON(!iog);
+	/* io group reference. Will be dropped when group is destroyed. */
+	elv_get_ioq(ioq);
+	iog->ioq = ioq;
+}
+
 #else /* !GROUP_IOSCHED */
 static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 {
@@ -533,6 +551,26 @@ static inline int update_requeue(struct io_queue *ioq, int requeue)
 	return requeue;
 }
 
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+}
+
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+					struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* GROUP_IOSCHED */
 
 extern ssize_t elv_slice_idle_show(struct elevator_queue *q, char *name);
@@ -642,5 +680,21 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 {
 	return 1;
 }
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+					struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 76cfc3a..862be80 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -206,9 +206,17 @@ static void *elevator_alloc_sched_queue(struct request_queue *q,
 {
 	void *sched_queue = NULL;
 
+	/*
+	 * If fair queuing is enabled, then queue allocation takes place
+	 * during set_request() functions when request actually comes
+	 * in.
+	 */
+	if (elv_iosched_fair_queuing_enabled(eq))
+		return NULL;
+
 	if (eq->ops->elevator_alloc_sched_queue_fn) {
 		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
-								GFP_KERNEL);
+							GFP_KERNEL, NULL);
 		if (!sched_queue)
 			return ERR_PTR(-ENOMEM);
 
@@ -829,6 +837,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e))
+		return elv_fq_set_request_ioq(q, rq, gfp_mask);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
@@ -840,6 +855,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e)) {
+		elv_fq_unset_request_ioq(q, rq);
+		return;
+	}
+
 	if (e->ops->elevator_put_req_fn)
 		e->ops->elevator_put_req_fn(rq);
 }
@@ -1224,9 +1248,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
  * Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of task and retrieve
+ * the ioq pointer from that. This is used by only single queue ioschedulers
+ * for retrieving the queue associated with the group to decide whether the
+ * new bio can do a front merge or not.
  */
 void *elv_get_sched_queue_current(struct request_queue *q)
 {
-	return q->elevator->sched_queue;
+	/* Fair queuing is not enabled. There is only one queue. */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	return ioq_sched_queue(elv_lookup_ioq_current(q));
 }
 EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..731dbf2 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -62,7 +62,7 @@ noop_latter_request(struct request_queue *q, struct request *rq)
 }
 
 static void *noop_alloc_noop_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct noop_queue *nq;
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index e7048b9..6d2c8db 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,7 +30,7 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t, struct io_queue *ioq);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
 #ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
@@ -247,17 +247,31 @@ enum {
 /* iosched wants to use fair queuing logic of elevator layer */
 #define	ELV_IOSCHED_NEED_FQ	1
 
+/* iosched maintains only single ioq per group.*/
+#define ELV_IOSCHED_SINGLE_IOQ        2
+
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
 }
 
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
 #else /* ELV_IOSCHED_FAIR_QUEUING */
 
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return 0;
 }
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return 0;
+}
+
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 15/25] io-controller: Prepare elevator layer for single queue schedulers
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

Elevator layer now has support for hierarchical fair queuing. cfq has
been migrated to make use of it and now it is time to do groundwork for
noop, deadline and AS.

noop, deadline and AS don't maintain separate queues for different processes;
each has only a single queue. In a hierarchical setup this effectively means
one queue per cgroup, in which requests from all the processes in the cgroup
are queued.

Generally the io scheduler takes care of creating queues. Because there is
only one queue per group here, the common layer has been modified to take care
of queue creation and some other functionality. This special casing helps keep
the changes to noop, deadline and AS to a minimum.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/as-iosched.c       |    2 +-
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |  206 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h      |   54 ++++++++++++
 block/elevator.c         |   37 ++++++++-
 block/noop-iosched.c     |    2 +-
 include/linux/elevator.h |   16 ++++-
 7 files changed, 312 insertions(+), 7 deletions(-)
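
As a reading aid for the reference counting in elv_fq_set_request_ioq() and
elv_fq_unset_request_ioq(), the following user-space sketch models one queue
per group with one reference held by the group and one per in-flight request.
All sketch_* types and functions are hypothetical simplifications; locking,
the scheduler queue and the retry-on-allocation logic are left out, and the
initial count set by the real elv_init_ioq() is not shown in this patch, so
the counts below are an assumption.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for io_queue, io_group and request. */
struct sketch_ioq {
	int ref;
};

struct sketch_iog {
	struct sketch_ioq *ioq;	/* single queue shared by the whole group */
};

struct sketch_rq {
	struct sketch_ioq *ioq;
};

/* Drop one reference; free the queue when the last one goes away. */
static void sketch_put_ioq(struct sketch_ioq *ioq)
{
	if (--ioq->ref == 0)
		free(ioq);
}

/* Find or create the group's queue. Returns 0 on success, 1 on failure. */
static int sketch_set_request_ioq(struct sketch_iog *iog, struct sketch_rq *rq)
{
	struct sketch_ioq *ioq = iog->ioq;

	if (!ioq) {
		ioq = calloc(1, sizeof(*ioq));
		if (!ioq)
			return 1;
		ioq->ref = 1;	/* group reference, cf. io_group_set_ioq() */
		iog->ioq = ioq;
	}

	ioq->ref++;		/* request reference */
	rq->ioq = ioq;
	return 0;
}

/* The request is done: clear the pointer and drop its reference. */
static void sketch_unset_request_ioq(struct sketch_rq *rq)
{
	struct sketch_ioq *ioq = rq->ioq;

	if (ioq) {
		rq->ioq = NULL;
		sketch_put_ioq(ioq);
	}
}

/* Group teardown drops the group's own reference. */
static void sketch_release_group_ioq(struct sketch_iog *iog)
{
	if (iog->ioq) {
		sketch_put_ioq(iog->ioq);
		iog->ioq = NULL;
	}
}
```

In this model two requests from the same group share one queue, and the queue
outlives both requests because the group still holds its reference until the
group itself is torn down.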

diff --git a/block/as-iosched.c b/block/as-iosched.c
index cafc734..e3514eb 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1338,7 +1338,7 @@ static int as_may_queue(struct request_queue *q, int rw)
 
 /* Called with queue lock held */
 static void *as_alloc_as_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct as_queue *asq;
 	struct as_data *ad = eq->elevator_data;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5bd5257..03d7208 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -341,7 +341,7 @@ dispatch_request:
 }
 
 static void *deadline_alloc_deadline_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct deadline_queue *dq;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7609579..25fdac6 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1031,6 +1031,12 @@ io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
 
 	/* Free up async idle queue */
 	elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* Optimization for io schedulers having single ioq */
+	if (elv_iosched_single_ioq(e))
+		elv_release_ioq(e, &iog->ioq);
+#endif
 }
 
 /* Mainly hierarchical grouping code */
@@ -1867,6 +1873,162 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
 
 	return (iog == __iog);
 }
+
+/*
+ * Find/Create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain only single
+ * io queue per cgroup. In this case common layer can just maintain a
+ * pointer in group data structure and keep track of it.
+ *
+ * For the io schedulers like cfq, which maintain multiple io queues per
+ * cgroup, and decide the io queue of request based on process, this
+ * function is not invoked.
+ */
+int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask)
+{
+	struct elevator_queue *e = q->elevator;
+	unsigned long flags;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog;
+	void *sched_q = NULL, *new_sched_q = NULL;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+	spin_lock_irqsave(q->queue_lock, flags);
+
+retry:
+	/* Determine the io group request belongs to */
+	iog = io_get_io_group(q, 1);
+	BUG_ON(!iog);
+
+	/* Get the iosched queue */
+	ioq = iog->ioq;
+	if (!ioq) {
+		/* io queue and sched_queue need to be allocated */
+		BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+		if (new_ioq) {
+			goto alloc_sched_q;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+							| __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq)
+				goto queue_fail;
+		}
+
+alloc_sched_q:
+		if (new_sched_q) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			sched_q = new_sched_q;
+			new_sched_q = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			/* Call io scheduler to create scheduler queue */
+			new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+					e, gfp_mask | __GFP_NOFAIL
+					| __GFP_ZERO, new_ioq);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+						gfp_mask | __GFP_ZERO, ioq);
+			if (!sched_q) {
+				elv_free_ioq(ioq);
+				goto queue_fail;
+			}
+		}
+
+		elv_init_ioq(e, ioq, iog, sched_q, IOPRIO_CLASS_BE,
+					IOPRIO_NORM, 1);
+		io_group_set_ioq(iog, ioq);
+		elv_mark_ioq_sync(ioq);
+		elv_get_iog(iog);
+	}
+
+	if (new_sched_q)
+		e->ops->elevator_free_sched_queue_fn(q->elevator, new_sched_q);
+
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
+	/* Request reference */
+	elv_get_ioq(ioq);
+	rq->ioq = ioq;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 0;
+
+queue_fail:
+	WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+	elv_schedule_dispatch(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 1;
+}
+
+/*
+ * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	struct io_group *iog;
+
+	/* Determine the io group and io queue of the bio submitting task */
+	iog = io_get_io_group(q, 0);
+	if (!iog) {
+		/* The task may belong to a cgroup for which the io group has
+		 * not been set up yet. */
+		return NULL;
+	}
+	return iog->ioq;
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	if (ioq) {
+		rq->ioq = NULL;
+		elv_put_ioq(ioq);
+	}
+}
+
+static inline int is_only_root_group(void)
+{
+	if (list_empty(&io_root_cgroup.css.cgroup->children))
+		return 1;
+
+	return 0;
+}
+
 #else /* GROUP_IOSCHED */
 static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
 {
@@ -1916,6 +2078,11 @@ struct io_group *io_get_io_group(struct request_queue *q, int create)
 	return q->elevator->efqd.root_group;
 }
 EXPORT_SYMBOL(io_get_io_group);
+
+static inline int is_only_root_group(void)
+{
+	return 1;
+}
 #endif /* GROUP_IOSCHED */
 
 /* Elevator fair queuing function */
@@ -2206,7 +2373,12 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
 	ioq->efqd = efqd;
 	elv_ioq_set_ioprio_class(ioq, ioprio_class);
 	elv_ioq_set_ioprio(ioq, ioprio);
-	ioq->pid = current->pid;
+
+	if (elv_iosched_single_ioq(eq))
+		ioq->pid = 0;
+	else
+		ioq->pid = current->pid;
+
 	ioq->sched_queue = sched_queue;
 	if (is_sync && !elv_ioq_class_idle(ioq))
 		elv_mark_ioq_idle_window(ioq);
@@ -2589,6 +2761,14 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	struct io_entity *entity, *new_entity;
 	struct io_group *iog = NULL, *new_iog = NULL;
 
+	/*
+	 * Currently only CFQ has preemption logic. Other schedulers don't
+	 * have any notion of preemption across classes or preemption within
+	 * class etc.
+	 */
+	if (elv_iosched_single_ioq(eq))
+		return 0;
+
 	ioq = elv_active_ioq(eq);
 
 	if (!ioq)
@@ -2873,6 +3053,17 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/*
+	 * If there is only root group present, don't expire the queue for
+	 * single queue ioschedulers (noop, deadline, AS). It is unnecessary
+	 * overhead.
+	 */
+
+	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator)) {
+		elv_log_ioq(efqd, ioq, "select: only root group, no expiry");
+		goto keep_queue;
+	}
+
 	/* We are waiting for this queue to become busy before it expires.*/
 	if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
 		ioq = NULL;
@@ -3112,6 +3303,19 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		}
 
 		/*
+		 * If there is only root group present, don't expire the queue
+		 * for single queue ioschedulers (noop, deadline, AS). It is
+		 * unnecessary overhead.
+		 */
+
+		if (is_only_root_group() &&
+			elv_iosched_single_ioq(q->elevator)) {
+			elv_log_ioq(efqd, ioq, "select: only root group,"
+					" no expiry");
+			goto done;
+		}
+
+		/*
 		 * If there are no requests waiting in this queue, and
 		 * there are other queues ready to issue requests, AND
 		 * those other queues are issuing requests within our
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index a414309..baa6cee 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -255,6 +255,9 @@ struct io_group {
 
 	/* The device MKDEV(major, minor), this group has been created for */
 	dev_t	dev;
+
+	/* Single ioq per group, used for noop, deadline, anticipatory */
+	struct io_queue *ioq;
 };
 
 /**
@@ -514,6 +517,21 @@ static inline int update_requeue(struct io_queue *ioq, int requeue)
 	return requeue;
 }
 
+extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask);
+extern void elv_fq_unset_request_ioq(struct request_queue *q,
+					struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+	BUG_ON(!iog);
+	/* io group reference. Will be dropped when group is destroyed. */
+	elv_get_ioq(ioq);
+	iog->ioq = ioq;
+}
+
 #else /* !GROUP_IOSCHED */
 static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 {
@@ -533,6 +551,26 @@ static inline int update_requeue(struct io_queue *ioq, int requeue)
 	return requeue;
 }
 
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+}
+
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+					struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* GROUP_IOSCHED */
 
 extern ssize_t elv_slice_idle_show(struct elevator_queue *q, char *name);
@@ -642,5 +680,21 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 {
 	return 1;
 }
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+					struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 76cfc3a..862be80 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -206,9 +206,17 @@ static void *elevator_alloc_sched_queue(struct request_queue *q,
 {
 	void *sched_queue = NULL;
 
+	/*
+	 * If fair queuing is enabled, then queue allocation takes place
+	 * during set_request() functions when request actually comes
+	 * in.
+	 */
+	if (elv_iosched_fair_queuing_enabled(eq))
+		return NULL;
+
 	if (eq->ops->elevator_alloc_sched_queue_fn) {
 		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
-								GFP_KERNEL);
+							GFP_KERNEL, NULL);
 		if (!sched_queue)
 			return ERR_PTR(-ENOMEM);
 
@@ -829,6 +837,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e))
+		return elv_fq_set_request_ioq(q, rq, gfp_mask);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
@@ -840,6 +855,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e)) {
+		elv_fq_unset_request_ioq(q, rq);
+		return;
+	}
+
 	if (e->ops->elevator_put_req_fn)
 		e->ops->elevator_put_req_fn(rq);
 }
@@ -1224,9 +1248,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
  * Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of task and retrieve
+ * the ioq pointer from that. This is used by only single queue ioschedulers
+ * for retrieving the queue associated with the group to decide whether the
+ * new bio can do a front merge or not.
  */
 void *elv_get_sched_queue_current(struct request_queue *q)
 {
-	return q->elevator->sched_queue;
+	/* Fair queuing is not enabled. There is only one queue. */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	return ioq_sched_queue(elv_lookup_ioq_current(q));
 }
 EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..731dbf2 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -62,7 +62,7 @@ noop_latter_request(struct request_queue *q, struct request *rq)
 }
 
 static void *noop_alloc_noop_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct noop_queue *nq;
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index e7048b9..6d2c8db 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,7 +30,7 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t, struct io_queue *ioq);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
 #ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
@@ -247,17 +247,31 @@ enum {
 /* iosched wants to use fair queuing logic of elevator layer */
 #define	ELV_IOSCHED_NEED_FQ	1
 
+/* iosched maintains only single ioq per group.*/
+#define ELV_IOSCHED_SINGLE_IOQ        2
+
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
 }
 
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
 #else /* ELV_IOSCHED_FAIR_QUEUING */
 
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return 0;
 }
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return 0;
+}
+
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 15/25] io-controller: Prepare elevator layer for single queue schedulers
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

The elevator layer now supports hierarchical fair queuing. cfq has been
migrated to use it, and it is now time to do the groundwork for noop,
deadline and AS.

noop, deadline and AS don't maintain separate queues for different
processes; each has a single queue. In a hierarchical setup this effectively
becomes one queue per cgroup, in which requests from all the processes in
the cgroup are queued.
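
As a rough userspace illustration of the "one queue per cgroup" idea (not
the kernel code from this patch), the sketch below looks up a group's single
queue, creates it on first use, and takes one reference for the group plus
one per request. All model_* names are hypothetical stand-ins for
struct io_group, struct io_queue and their helpers.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct io_queue. */
struct model_ioq {
	int ref;       /* reference count: one for the group, one per request */
	int nr_queued; /* requests currently holding a reference */
};

/* Hypothetical userspace stand-in for struct io_group. */
struct model_iog {
	struct model_ioq *ioq; /* the single queue of this group, or NULL */
};

/* Drop one reference and free the queue when the last one goes away. */
static void model_put_ioq(struct model_ioq *ioq)
{
	assert(ioq && ioq->ref > 0);
	if (--ioq->ref == 0)
		free(ioq);
}

/* Find the group's queue, creating it on first use; NULL on allocation failure. */
struct model_ioq *model_get_group_ioq(struct model_iog *iog)
{
	assert(iog);
	if (!iog->ioq) {
		struct model_ioq *ioq = calloc(1, sizeof(*ioq));

		if (!ioq)
			return NULL;
		ioq->ref = 1;	/* reference held by the group */
		iog->ioq = ioq;
	}
	iog->ioq->ref++;	/* reference held by the request */
	iog->ioq->nr_queued++;
	return iog->ioq;
}

/* The request has been serviced: drop its reference. */
void model_put_request_ioq(struct model_ioq *ioq)
{
	ioq->nr_queued--;
	model_put_ioq(ioq);
}

/* The group is going away: drop the group's reference. */
void model_destroy_group(struct model_iog *iog)
{
	if (iog->ioq)
		model_put_ioq(iog->ioq);
	iog->ioq = NULL;
}
```

Two requests from different processes in the same group end up with the same
queue pointer, which is the behavior described above.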

Generally the io scheduler takes care of creating queues. Because there is
only one queue per group here, the common layer has been modified to take
care of queue creation and some other functionality. This special casing
helps keep the changes to noop, deadline and AS to a minimum.
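
The special casing can be pictured with the small userspace sketch below.
The two flag values are the ones used in this patch; the struct, the enum
and the model_* functions are hypothetical and only model which path a
set_request call would take.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as defined in include/linux/elevator.h by this series. */
#define ELV_IOSCHED_NEED_FQ	1
#define ELV_IOSCHED_SINGLE_IOQ	2

/* Hypothetical stand-in for the relevant bits of an elevator. */
struct model_elevator {
	unsigned int features;
	int (*set_req_fn)(void *rq);	/* scheduler hook, may be NULL */
};

enum model_path {
	MODEL_PATH_COMMON,	/* common layer finds/creates the group queue */
	MODEL_PATH_SCHED,	/* scheduler's own set_request hook runs */
	MODEL_PATH_NONE,	/* nothing to do */
};

/* Placeholder hook so a cfq-like elevator can be modelled. */
int model_dummy_set_req(void *rq)
{
	(void)rq;
	return 0;
}

static int model_single_ioq(const struct model_elevator *e)
{
	return (e->features & ELV_IOSCHED_SINGLE_IOQ) != 0;
}

/* Mirrors the order of checks added to elv_set_request() in this patch. */
enum model_path model_set_request_path(const struct model_elevator *e)
{
	assert(e);
	if (model_single_ioq(e))
		return MODEL_PATH_COMMON;
	if (e->set_req_fn)
		return MODEL_PATH_SCHED;
	return MODEL_PATH_NONE;
}
```

The point of the design is that a single queue scheduler only has to set a
feature flag; it does not need its own set_request/put_request hooks.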

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/as-iosched.c       |    2 +-
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |  206 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h      |   54 ++++++++++++
 block/elevator.c         |   37 ++++++++-
 block/noop-iosched.c     |    2 +-
 include/linux/elevator.h |   16 ++++-
 7 files changed, 312 insertions(+), 7 deletions(-)

diff --git a/block/as-iosched.c b/block/as-iosched.c
index cafc734..e3514eb 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1338,7 +1338,7 @@ static int as_may_queue(struct request_queue *q, int rw)
 
 /* Called with queue lock held */
 static void *as_alloc_as_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct as_queue *asq;
 	struct as_data *ad = eq->elevator_data;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 5bd5257..03d7208 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -341,7 +341,7 @@ dispatch_request:
 }
 
 static void *deadline_alloc_deadline_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct deadline_queue *dq;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7609579..25fdac6 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1031,6 +1031,12 @@ io_put_io_group_queues(struct elevator_queue *e, struct io_group *iog)
 
 	/* Free up async idle queue */
 	elv_release_ioq(e, &iog->async_idle_queue);
+
+#ifdef CONFIG_GROUP_IOSCHED
+	/* Optimization for io schedulers having single ioq */
+	if (elv_iosched_single_ioq(e))
+		elv_release_ioq(e, &iog->ioq);
+#endif
 }
 
 /* Mainly hierarchical grouping code */
@@ -1867,6 +1873,162 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
 
 	return (iog == __iog);
 }
+
+/*
+ * Find/Create the io queue the rq should go in. This is an optimization
+ * for the io schedulers (noop, deadline and AS) which maintain a single
+ * io queue per cgroup. In this case the common layer can just maintain a
+ * pointer in the group data structure and keep track of it.
+ *
+ * For io schedulers like cfq, which maintain multiple io queues per
+ * cgroup and decide the io queue of a request based on the process, this
+ * function is not invoked.
+ */
+int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask)
+{
+	struct elevator_queue *e = q->elevator;
+	unsigned long flags;
+	struct io_queue *ioq = NULL, *new_ioq = NULL;
+	struct io_group *iog;
+	void *sched_q = NULL, *new_sched_q = NULL;
+
+	if (!elv_iosched_fair_queuing_enabled(e))
+		return 0;
+
+	might_sleep_if(gfp_mask & __GFP_WAIT);
+	spin_lock_irqsave(q->queue_lock, flags);
+
+retry:
+	/* Determine the io group request belongs to */
+	iog = io_get_io_group(q, 1);
+	BUG_ON(!iog);
+
+	/* Get the iosched queue */
+	ioq = iog->ioq;
+	if (!ioq) {
+		/* io queue and sched_queue needs to be allocated */
+		BUG_ON(!e->ops->elevator_alloc_sched_queue_fn);
+
+		if (new_ioq) {
+			goto alloc_sched_q;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			new_ioq = elv_alloc_ioq(q, gfp_mask | __GFP_NOFAIL
+							| __GFP_ZERO);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			ioq = elv_alloc_ioq(q, gfp_mask | __GFP_ZERO);
+			if (!ioq)
+				goto queue_fail;
+		}
+
+alloc_sched_q:
+		if (new_sched_q) {
+			ioq = new_ioq;
+			new_ioq = NULL;
+			sched_q = new_sched_q;
+			new_sched_q = NULL;
+		} else if (gfp_mask & __GFP_WAIT) {
+			/*
+			 * Inform the allocator of the fact that we will
+			 * just repeat this allocation if it fails, to allow
+			 * the allocator to do whatever it needs to attempt to
+			 * free memory.
+			 */
+			spin_unlock_irq(q->queue_lock);
+			/* Call io scheduler to create scheduler queue */
+			new_sched_q = e->ops->elevator_alloc_sched_queue_fn(q,
+					e, gfp_mask | __GFP_NOFAIL
+					| __GFP_ZERO, new_ioq);
+			spin_lock_irq(q->queue_lock);
+			goto retry;
+		} else {
+			sched_q = e->ops->elevator_alloc_sched_queue_fn(q, e,
+						gfp_mask | __GFP_ZERO, ioq);
+			if (!sched_q) {
+				elv_free_ioq(ioq);
+				goto queue_fail;
+			}
+		}
+
+		elv_init_ioq(e, ioq, iog, sched_q, IOPRIO_CLASS_BE,
+					IOPRIO_NORM, 1);
+		io_group_set_ioq(iog, ioq);
+		elv_mark_ioq_sync(ioq);
+		elv_get_iog(iog);
+	}
+
+	if (new_sched_q)
+		e->ops->elevator_free_sched_queue_fn(q->elevator, new_sched_q);
+
+	if (new_ioq)
+		elv_free_ioq(new_ioq);
+
+	/* Request reference */
+	elv_get_ioq(ioq);
+	rq->ioq = ioq;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 0;
+
+queue_fail:
+	WARN_ON((gfp_mask & __GFP_WAIT) && !ioq);
+	elv_schedule_dispatch(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	return 1;
+}
+
+/*
+ * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	struct io_group *iog;
+
+	/* Determine the io group and io queue of the bio submitting task */
+	iog = io_get_io_group(q, 0);
+	if (!iog) {
+		/* Maybe the task belongs to a cgroup for which the io group
+		 * has not been set up yet. */
+		return NULL;
+	}
+	return iog->ioq;
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv_fq_unset_request_ioq(struct request_queue *q, struct request *rq)
+{
+	struct io_queue *ioq = rq->ioq;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return;
+
+	if (ioq) {
+		rq->ioq = NULL;
+		elv_put_ioq(ioq);
+	}
+}
+
+static inline int is_only_root_group(void)
+{
+	if (list_empty(&io_root_cgroup.css.cgroup->children))
+		return 1;
+
+	return 0;
+}
+
 #else /* GROUP_IOSCHED */
 static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
 {
@@ -1916,6 +2078,11 @@ struct io_group *io_get_io_group(struct request_queue *q, int create)
 	return q->elevator->efqd.root_group;
 }
 EXPORT_SYMBOL(io_get_io_group);
+
+static inline int is_only_root_group(void)
+{
+	return 1;
+}
 #endif /* GROUP_IOSCHED */
 
 /* Elevator fair queuing function */
@@ -2206,7 +2373,12 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
 	ioq->efqd = efqd;
 	elv_ioq_set_ioprio_class(ioq, ioprio_class);
 	elv_ioq_set_ioprio(ioq, ioprio);
-	ioq->pid = current->pid;
+
+	if (elv_iosched_single_ioq(eq))
+		ioq->pid = 0;
+	else
+		ioq->pid = current->pid;
+
 	ioq->sched_queue = sched_queue;
 	if (is_sync && !elv_ioq_class_idle(ioq))
 		elv_mark_ioq_idle_window(ioq);
@@ -2589,6 +2761,14 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 	struct io_entity *entity, *new_entity;
 	struct io_group *iog = NULL, *new_iog = NULL;
 
+	/*
+	 * Currently only CFQ has preemption logic. Other schedulers don't
+	 * have any notion of preemption across classes or preemption within
+	 * class etc.
+	 */
+	if (elv_iosched_single_ioq(eq))
+		return 0;
+
 	ioq = elv_active_ioq(eq);
 
 	if (!ioq)
@@ -2873,6 +3053,17 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/*
+	 * If there is only root group present, don't expire the queue for
+	 * single queue ioschedulers (noop, deadline, AS). It is unnecessary
+	 * overhead.
+	 */
+
+	if (is_only_root_group() && elv_iosched_single_ioq(q->elevator)) {
+		elv_log_ioq(efqd, ioq, "select: only root group, no expiry");
+		goto keep_queue;
+	}
+
 	/* We are waiting for this queue to become busy before it expires.*/
 	if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
 		ioq = NULL;
@@ -3112,6 +3303,19 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		}
 
 		/*
+		 * If there is only root group present, don't expire the queue
+		 * for single queue ioschedulers (noop, deadline, AS). It is
+		 * unnecessary overhead.
+		 */
+
+		if (is_only_root_group() &&
+			elv_iosched_single_ioq(q->elevator)) {
+			elv_log_ioq(efqd, ioq, "select: only root group,"
+					" no expiry");
+			goto done;
+		}
+
+		/*
 		 * If there are no requests waiting in this queue, and
 		 * there are other queues ready to issue requests, AND
 		 * those other queues are issuing requests within our
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index a414309..baa6cee 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -255,6 +255,9 @@ struct io_group {
 
 	/* The device MKDEV(major, minor), this group has been created for */
 	dev_t	dev;
+
+	/* Single ioq per group, used for noop, deadline, anticipatory */
+	struct io_queue *ioq;
 };
 
 /**
@@ -514,6 +517,21 @@ static inline int update_requeue(struct io_queue *ioq, int requeue)
 	return requeue;
 }
 
+extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
+					gfp_t gfp_mask);
+extern void elv_fq_unset_request_ioq(struct request_queue *q,
+					struct request *rq);
+extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+	BUG_ON(!iog);
+	/* io group reference. Will be dropped when group is destroyed. */
+	elv_get_ioq(ioq);
+	iog->ioq = ioq;
+}
+
 #else /* !GROUP_IOSCHED */
 static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 {
@@ -533,6 +551,26 @@ static inline int update_requeue(struct io_queue *ioq, int requeue)
 	return requeue;
 }
 
+static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
+{
+}
+
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+					struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* GROUP_IOSCHED */
 
 extern ssize_t elv_slice_idle_show(struct elevator_queue *q, char *name);
@@ -642,5 +680,21 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 {
 	return 1;
 }
+static inline int elv_fq_set_request_ioq(struct request_queue *q,
+					struct request *rq, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static inline void elv_fq_unset_request_ioq(struct request_queue *q,
+						struct request *rq)
+{
+}
+
+static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ELV_FAIR_QUEUING */
 #endif /* _BFQ_SCHED_H */
diff --git a/block/elevator.c b/block/elevator.c
index 76cfc3a..862be80 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -206,9 +206,17 @@ static void *elevator_alloc_sched_queue(struct request_queue *q,
 {
 	void *sched_queue = NULL;
 
+	/*
+	 * If fair queuing is enabled, then queue allocation takes place
+	 * during set_request() functions when request actually comes
+	 * in.
+	 */
+	if (elv_iosched_fair_queuing_enabled(eq))
+		return NULL;
+
 	if (eq->ops->elevator_alloc_sched_queue_fn) {
 		sched_queue = eq->ops->elevator_alloc_sched_queue_fn(q, eq,
-								GFP_KERNEL);
+							GFP_KERNEL, NULL);
 		if (!sched_queue)
 			return ERR_PTR(-ENOMEM);
 
@@ -829,6 +837,13 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e))
+		return elv_fq_set_request_ioq(q, rq, gfp_mask);
+
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
@@ -840,6 +855,15 @@ void elv_put_request(struct request_queue *q, struct request *rq)
 {
 	struct elevator_queue *e = q->elevator;
 
+	/*
+	 * Optimization for noop, deadline and AS which maintain only single
+	 * ioq per io group
+	 */
+	if (elv_iosched_single_ioq(e)) {
+		elv_fq_unset_request_ioq(q, rq);
+		return;
+	}
+
 	if (e->ops->elevator_put_req_fn)
 		e->ops->elevator_put_req_fn(rq);
 }
@@ -1224,9 +1248,18 @@ EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
  * Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of task and retrieve
+ * the ioq pointer from that. This is used by only single queue ioschedulers
+ * for retrieving the queue associated with the group to decide whether the
+ * new bio can do a front merge or not.
  */
 void *elv_get_sched_queue_current(struct request_queue *q)
 {
-	return q->elevator->sched_queue;
+	/* Fair queuing is not enabled. There is only one queue. */
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return q->elevator->sched_queue;
+
+	return ioq_sched_queue(elv_lookup_ioq_current(q));
 }
 EXPORT_SYMBOL(elv_get_sched_queue_current);
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index d587832..731dbf2 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -62,7 +62,7 @@ noop_latter_request(struct request_queue *q, struct request *rq)
 }
 
 static void *noop_alloc_noop_queue(struct request_queue *q,
-				struct elevator_queue *eq, gfp_t gfp_mask)
+		struct elevator_queue *eq, gfp_t gfp_mask, struct io_queue *ioq)
 {
 	struct noop_queue *nq;
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index e7048b9..6d2c8db 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -30,7 +30,7 @@ typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct reques
 
 typedef void *(elevator_init_fn) (struct request_queue *);
 typedef void (elevator_exit_fn) (struct elevator_queue *);
-typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t);
+typedef void* (elevator_alloc_sched_queue_fn) (struct request_queue *q, struct elevator_queue *eq, gfp_t, struct io_queue *ioq);
 typedef void (elevator_free_sched_queue_fn) (struct elevator_queue*, void *);
 #ifdef CONFIG_ELV_FAIR_QUEUING
 typedef void (elevator_active_ioq_set_fn) (struct request_queue*, void *, int);
@@ -247,17 +247,31 @@ enum {
 /* iosched wants to use fair queuing logic of elevator layer */
 #define	ELV_IOSCHED_NEED_FQ	1
 
+/* iosched maintains only single ioq per group.*/
+#define ELV_IOSCHED_SINGLE_IOQ        2
+
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return (e->elevator_type->elevator_features) & ELV_IOSCHED_NEED_FQ;
 }
 
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return (e->elevator_type->elevator_features) & ELV_IOSCHED_SINGLE_IOQ;
+}
+
 #else /* ELV_IOSCHED_FAIR_QUEUING */
 
 static inline int elv_iosched_fair_queuing_enabled(struct elevator_queue *e)
 {
 	return 0;
 }
+
+static inline int elv_iosched_single_ioq(struct elevator_queue *e)
+{
+	return 0;
+}
+
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 16/25] io-controller: noop changes for hierarchical fair queuing
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (14 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 15/25] io-controller: Prepare elevator layer for single queue schedulers Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 17/25] io-controller: deadline " Vivek Goyal
                     ` (11 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf
  Cc: akpm, snitzer, agk

This patch changes noop to use the queue scheduling code from the elevator
layer. One can go back to the old noop by deselecting
CONFIG_IOSCHED_NOOP_HIER.
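
A minimal userspace sketch of that compile-time switch follows. The flag
values come from patch 15; MODEL_NOOP_HIER is a hypothetical stand-in for
CONFIG_IOSCHED_NOOP_HIER, and the struct is a cut-down model rather than
the real struct elevator_type.

```c
#include <assert.h>

/* Flag values as defined in include/linux/elevator.h by this series. */
#define ELV_IOSCHED_NEED_FQ	1
#define ELV_IOSCHED_SINGLE_IOQ	2

/* Hypothetical, cut-down model of struct elevator_type. */
struct model_elevator_type {
	const char *name;
	unsigned int features;
};

/*
 * Build with -DMODEL_NOOP_HIER to get the hierarchical variant. Without
 * it, features stays zero and the scheduler behaves like the old noop.
 */
const struct model_elevator_type model_noop = {
	.name = "noop",
#ifdef MODEL_NOOP_HIER
	.features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
#endif
};

/* True when the common layer should keep one queue per cgroup. */
int model_uses_group_queues(const struct model_elevator_type *t)
{
	assert(t);
	return (t->features & ELV_IOSCHED_NEED_FQ) &&
	       (t->features & ELV_IOSCHED_SINGLE_IOQ);
}
```

Because an omitted designated initializer is zero, the non-hierarchical
build needs no #else branch.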

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |   11 +++++++++++
 block/noop-iosched.c  |   13 +++++++++++++
 2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index a91a807..9da6657 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,17 @@ config IOSCHED_NOOP
 	  that do their own scheduling and require only minimal assistance from
 	  the kernel.
 
+config IOSCHED_NOOP_HIER
+	bool "Noop Hierarchical Scheduling support"
+	depends on IOSCHED_NOOP && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in noop. In this mode noop keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_AS
 	tristate "Anticipatory I/O scheduler"
 	default y
diff --git a/block/noop-iosched.c b/block/noop-iosched.c
index 731dbf2..97ea41b 100644
--- a/block/noop-iosched.c
+++ b/block/noop-iosched.c
@@ -82,6 +82,15 @@ static void noop_free_noop_queue(struct elevator_queue *e, void *sched_queue)
 	kfree(nq);
 }
 
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+static struct elv_fs_entry noop_attrs[] = {
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+	__ATTR_NULL
+};
+#endif
+
 static struct elevator_type elevator_noop = {
 	.ops = {
 		.elevator_merge_req_fn		= noop_merged_requests,
@@ -92,6 +101,10 @@ static struct elevator_type elevator_noop = {
 		.elevator_alloc_sched_queue_fn	= noop_alloc_noop_queue,
 		.elevator_free_sched_queue_fn	= noop_free_noop_queue,
 	},
+#ifdef CONFIG_IOSCHED_NOOP_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+	.elevator_attrs = noop_attrs,
+#endif
 	.elevator_name = "noop",
 	.elevator_owner = THIS_MODULE,
 };
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 17/25] io-controller: deadline changes for hierarchical fair queuing
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (15 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 16/25] io-controller: noop changes for hierarchical fair queuing Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 18/25] io-controller: anticipatory " Vivek Goyal
                     ` (10 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf
  Cc: akpm, snitzer, agk

This patch changes deadline to use the queue scheduling code from the
elevator layer. One can go back to the old deadline by deselecting
CONFIG_IOSCHED_DEADLINE_HIER.
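
The diff below also extends the sysfs attribute table only when the option
is set. The userspace sketch here models that pattern: a sentinel-terminated
table with conditional entries. MODEL_DEADLINE_HIER is a hypothetical
stand-in for CONFIG_IOSCHED_DEADLINE_HIER, struct model_attr is not the
kernel's struct elv_fs_entry, and only the attributes visible in the hunk
are listed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an elevator sysfs attribute entry. */
struct model_attr {
	const char *name;
};

/* Build with -DMODEL_DEADLINE_HIER to expose the fair queuing tunables. */
const struct model_attr model_deadline_attrs[] = {
	{ "writes_starved" },
	{ "front_merges" },
	{ "fifo_batch" },
#ifdef MODEL_DEADLINE_HIER
	{ "fairness" },
	{ "slice_idle" },
	{ "slice_sync" },
#endif
	{ NULL }	/* sentinel terminating the table */
};

/* Walk the table up to the sentinel and report whether name is present. */
int model_has_attr(const struct model_attr *attrs, const char *name)
{
	assert(attrs && name);
	for (; attrs->name; attrs++)
		if (strcmp(attrs->name, name) == 0)
			return 1;
	return 0;
}
```

Keeping the conditional entries in front of the sentinel means code that
walks the table does not need to know whether the option is enabled.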

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   11 +++++++++++
 block/deadline-iosched.c |    8 ++++++++
 2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 9da6657..3a9e7d7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
 	  a disk at any one time, its behaviour is almost identical to the
 	  anticipatory I/O scheduler and so is a good choice.
 
+config IOSCHED_DEADLINE_HIER
+	bool "Deadline Hierarchical Scheduling support"
+	depends on IOSCHED_DEADLINE && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in deadline. In this mode deadline keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
 	select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 03d7208..ad63493 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -460,6 +460,11 @@ static struct elv_fs_entry deadline_attrs[] = {
 	DD_ATTR(writes_starved),
 	DD_ATTR(front_merges),
 	DD_ATTR(fifo_batch),
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+#endif
 	__ATTR_NULL
 };
 
@@ -477,6 +482,9 @@ static struct elevator_type iosched_deadline = {
 		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
 		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 17/25] io-controller: deadline changes for hierarchical fair queuing
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

This patch changes deadline to use queue scheduling code from elevator layer.
One can go back to old deadline by selecting CONFIG_IOSCHED_DEADLINE_HIER.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   11 +++++++++++
 block/deadline-iosched.c |    8 ++++++++
 2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 9da6657..3a9e7d7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
 	  a disk at any one time, its behaviour is almost identical to the
 	  anticipatory I/O scheduler and so is a good choice.
 
+config IOSCHED_DEADLINE_HIER
+	bool "Deadline Hierarchical Scheduling support"
+	depends on IOSCHED_DEADLINE && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarhical scheduling in deadline. In this mode deadline keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
 	select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 03d7208..ad63493 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -460,6 +460,11 @@ static struct elv_fs_entry deadline_attrs[] = {
 	DD_ATTR(writes_starved),
 	DD_ATTR(front_merges),
 	DD_ATTR(fifo_batch),
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+#endif
 	__ATTR_NULL
 };
 
@@ -477,6 +482,9 @@ static struct elevator_type iosched_deadline = {
 		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
 		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
-- 
1.6.0.6


^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 17/25] io-controller: deadline changes for hierarchical fair queuing
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

This patch changes deadline to use queue scheduling code from elevator layer.
One can go back to old deadline by selecting CONFIG_IOSCHED_DEADLINE_HIER.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   11 +++++++++++
 block/deadline-iosched.c |    8 ++++++++
 2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 9da6657..3a9e7d7 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -55,6 +55,17 @@ config IOSCHED_DEADLINE
 	  a disk at any one time, its behaviour is almost identical to the
 	  anticipatory I/O scheduler and so is a good choice.
 
+config IOSCHED_DEADLINE_HIER
+	bool "Deadline Hierarchical Scheduling support"
+	depends on IOSCHED_DEADLINE && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarhical scheduling in deadline. In this mode deadline keeps
+	  one IO queue per cgroup instead of a global queue. Elevator
+	  fair queuing logic ensures fairness among various queues.
+
 config IOSCHED_CFQ
 	tristate "CFQ I/O scheduler"
 	select ELV_FAIR_QUEUING
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 03d7208..ad63493 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -460,6 +460,11 @@ static struct elv_fs_entry deadline_attrs[] = {
 	DD_ATTR(writes_starved),
 	DD_ATTR(front_merges),
 	DD_ATTR(fifo_batch),
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+#endif
 	__ATTR_NULL
 };
 
@@ -477,6 +482,9 @@ static struct elevator_type iosched_deadline = {
 		.elevator_alloc_sched_queue_fn = deadline_alloc_deadline_queue,
 		.elevator_free_sched_queue_fn = deadline_free_deadline_queue,
 	},
+#ifdef CONFIG_IOSCHED_DEADLINE_HIER
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#endif
 	.elevator_attrs = deadline_attrs,
 	.elevator_name = "deadline",
 	.elevator_owner = THIS_MODULE,
-- 
1.6.0.6


* [PATCH 18/25] io-controller: anticipatory changes for hierarchical fair queuing
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (16 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 17/25] io-controller: deadline " Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 19/25] blkio_cgroup patches from Ryo to track async bios Vivek Goyal
                     ` (9 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf
  Cc: akpm, snitzer, agk

This patch changes the anticipatory scheduler (AS) to use the queue scheduling
code from the elevator layer. One can go back to the old AS by deselecting
CONFIG_IOSCHED_AS_HIER. Even with CONFIG_IOSCHED_AS_HIER=y, as long as no
other cgroup has been created, AS behavior should remain the same as before.

o AS is a single-queue ioscheduler, which means there is one AS queue per
  group.

o The common layer code selects the queue to dispatch from based on fairness,
  and then the AS code selects the request within the group.

o AS runs read and write batches within a group. So the common layer runs
  timed group queues and, within a group's time, AS runs timed batches of
  reads and writes.

o Note: Previously the AS write batch length was adjusted dynamically whenever
  a W->R batch data direction change took place and the first request from
  the read batch completed.

  Now the write batch update takes place when the last request from the write
  batch has finished during the W->R transition.

o AS runs its own anticipation logic to anticipate on reads. The common layer
  also anticipates on the group if the think time of the group is within
  slice_idle.

o Introduced a few debugging messages in AS.
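
The split described above (the common layer picks the group queue, AS may
refuse to give it up right away) can be modelled in isolation. The following
is only a minimal sketch under stated assumptions, not the code from this
patch: struct as_sketch, sketch_expire_ioq() and sketch_completed_request()
are hypothetical names, and the batch context save/restore, the anticipation
timer and all locking are left out.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for the AS state the patch consults. */
enum antic_status { ANTIC_OFF, ANTIC_WAIT_REQ, ANTIC_WAIT_NEXT, ANTIC_FINISHED };

struct as_sketch {
	int changed_batch;		/* batch changeover signalled, old batch draining */
	int nr_dispatched;		/* requests sent to the drive, not yet completed */
	enum antic_status antic_status;	/* current anticipation state */
	int switch_queue;		/* common layer asked for a switch; retry later */
};

/*
 * Models the decision the common layer asks AS to make.
 * Returns 1 if the queue may be expired now, 0 if AS keeps it for now.
 */
static int sketch_expire_ioq(struct as_sketch *ad, int force)
{
	if (force) {
		/* Forced expiry: no choice, drop anticipation and batch state. */
		ad->antic_status = ANTIC_OFF;
		ad->changed_batch = 0;
		ad->switch_queue = 0;
		return 1;
	}

	/*
	 * Keep the queue while a batch changeover is pending, while requests
	 * are still in flight, or while AS is anticipating the next request.
	 */
	if (ad->changed_batch || ad->nr_dispatched ||
	    ad->antic_status == ANTIC_WAIT_NEXT) {
		ad->switch_queue = 1;	/* remember the pending switch request */
		return 0;
	}

	ad->switch_queue = 0;
	return 1;
}

/*
 * Models one request completion. Returns nonzero when a switch is pending,
 * i.e. when the caller should kick the queue so the switch can be retried.
 */
static int sketch_completed_request(struct as_sketch *ad)
{
	assert(ad->nr_dispatched > 0);
	ad->nr_dispatched--;
	return ad->switch_queue;
}
```

In this model, with two requests in flight a non-forced sketch_expire_ioq()
returns 0 and sets switch_queue. Once both completions have been processed,
the next non-forced call returns 1 and clears the flag.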

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   12 ++
 block/as-iosched.c       |  280 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.c      |   93 +++++++++++++--
 block/elevator-fq.h      |    3 +
 include/linux/elevator.h |    2 +
 5 files changed, 372 insertions(+), 18 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3a9e7d7..77fc786 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,6 +45,18 @@ config IOSCHED_AS
 	  deadline I/O scheduler, it can also be slower in some cases
 	  especially some database loads.
 
+config IOSCHED_AS_HIER
+	bool "Anticipatory Hierarchical Scheduling support"
+	depends on IOSCHED_AS && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in anticipatory. In this mode
+	  anticipatory keeps one IO queue per cgroup instead of a global
+	  queue. Elevator fair queuing logic ensures fairness among various
+	  queues.
+
 config IOSCHED_DEADLINE
 	tristate "Deadline I/O scheduler"
 	default y
diff --git a/block/as-iosched.c b/block/as-iosched.c
index e3514eb..18a61bb 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -16,6 +16,7 @@
 #include <linux/compiler.h>
 #include <linux/rbtree.h>
 #include <linux/interrupt.h>
+#include <linux/blktrace_api.h>
 
 /*
  * See Documentation/block/as-iosched.txt
@@ -84,10 +85,24 @@ struct as_queue {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+
+	/*
+	 * If an as_queue is switched while a batch is running, then we
+	 * store the time left before current batch will expire
+	 */
+	long current_batch_time_left;
+
+	/*
+	 * batch data dir when queue was scheduled out. This will be used
+	 * to setup ad->batch_data_dir when queue is scheduled in.
+	 */
+	int saved_batch_data_dir;
+
 	unsigned long last_check_fifo[2];
 	int write_batch_count;		/* max # of reqs in a write batch */
 	int current_write_count;	/* how many requests left this batch */
 	int write_batch_idled;		/* has the write batch gone idle? */
+	int nr_queued[2];
 };
 
 struct as_data {
@@ -123,6 +138,9 @@ struct as_data {
 	unsigned long fifo_expire[2];
 	unsigned long batch_expire[2];
 	unsigned long antic_expire;
+
+	/* elevator requested a queue switch. */
+	int switch_queue;
 };
 
 /*
@@ -144,12 +162,174 @@ enum arq_state {
 #define RQ_STATE(rq)	((enum arq_state)(rq)->elevator_private2)
 #define RQ_SET_STATE(rq, state)	((rq)->elevator_private2 = (void *) state)
 
+#define as_log(ad, fmt, args...)        \
+	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
+
 static DEFINE_PER_CPU(unsigned long, ioc_count);
 static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq);
 static void as_antic_stop(struct as_data *ad);
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
+
+#ifdef CONFIG_IOSCHED_AS_HIER
+static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Save batch data dir */
+	asq->saved_batch_data_dir = ad->batch_data_dir;
+
+	if (ad->changed_batch) {
+		/*
+		 * In case of force expire, we come here. Batch changeover
+		 * has been signalled but we are waiting for all the
+		 * request to finish from previous batch and then start
+		 * the new batch. Can't wait now. Mark that full batch time
+		 * needs to be allocated when this queue is scheduled again.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->changed_batch = 0;
+		goto out;
+	}
+
+	if (ad->new_batch) {
+		/*
+		 * We should come here only when new_batch has been set
+		 * but no read request has been issued or if it is a forced
+		 * expiry.
+		 *
+		 * In both the cases, new batch has not started yet so
+		 * allocate full batch length for next scheduling opportunity.
+		 * We don't do write batch size adjustment in hierarchical
+		 * AS so that should not be an issue.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->new_batch = 0;
+		goto out;
+	}
+
+	/* Save how much time is left before current batch expires */
+	if (as_batch_expired(ad, asq))
+		asq->current_batch_time_left = 0;
+	else {
+		asq->current_batch_time_left = ad->current_batch_expires
+							- jiffies;
+		BUG_ON((asq->current_batch_time_left) < 0);
+	}
+
+	if (ad->io_context) {
+		put_io_context(ad->io_context);
+		ad->io_context = NULL;
+	}
+
+out:
+	as_log(ad, "save batch: dir=%c time_left=%ld changed_batch=%d"
+			" new_batch=%d, antic_status=%d",
+			ad->batch_data_dir ? 'R' : 'W',
+			asq->current_batch_time_left,
+			ad->changed_batch, ad->new_batch, ad->antic_status);
+	return;
+}
+
+/*
+ * FIXME: In original AS, a read batch's time accounting started only after
+ * its first request had completed (if the last batch was a write batch). But
+ * here we might be rescheduling a read batch right away, irrespective of the
+ * disk cache state.
+ */
+static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Adjust the batch expire time */
+	if (asq->current_batch_time_left)
+		ad->current_batch_expires = jiffies +
+						asq->current_batch_time_left;
+	/* restore asq batch_data_dir info */
+	ad->batch_data_dir = asq->saved_batch_data_dir;
+	as_log(ad, "restore batch: dir=%c time=%ld reads_q=%d writes_q=%d"
+			" ad->antic_status=%d",
+			ad->batch_data_dir ? 'R' : 'W',
+			asq->current_batch_time_left,
+			asq->nr_queued[1], asq->nr_queued[0],
+			ad->antic_status);
+}
+
+/* ioq has been set. */
+static void as_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
+{
+	struct as_queue *asq = sched_queue;
+	struct as_data *ad = q->elevator->elevator_data;
+
+	as_restore_batch_context(ad, asq);
+}
+
+/*
+ * This is a notification from common layer that it wishes to expire this
+ * io queue. AS decides whether queue can be expired, if yes, it also
+ * saves the batch context.
+ */
+static int as_expire_ioq(struct request_queue *q, void *sched_queue,
+				int slice_expired, int force)
+{
+	struct as_data *ad = q->elevator->elevator_data;
+	int status = ad->antic_status;
+	struct as_queue *asq = sched_queue;
+
+	as_log(ad, "as_expire_ioq slice_expired=%d, force=%d", slice_expired,
+		force);
+
+	/* Forced expiry. We don't have a choice */
+	if (force) {
+		as_antic_stop(ad);
+		/*
+		 * antic_stop() sets antic_status to FINISHED which signifies
+		 * that either we timed out or we found a close request but
+		 * that's not the case here. Start from scratch.
+		 */
+		ad->antic_status = ANTIC_OFF;
+		as_save_batch_context(ad, asq);
+		ad->switch_queue = 0;
+		return 1;
+	}
+
+	/*
+	 * We are waiting for requests to finish from last
+	 * batch. Don't expire the queue now
+	 */
+	if (ad->changed_batch)
+		goto keep_queue;
+
+	/*
+	 * Wait for all requests from existing batch to finish before we
+	 * switch the queue. New queue might change the batch direction
+	 * and this is to be consistent with AS philosophy of not dispatching
+	 * new requests to underlying drive till requests from the
+	 * previous batch are completed.
+	 */
+	if (ad->nr_dispatched)
+		goto keep_queue;
+
+	/*
+	 * If AS anticipation is ON, wait for it to finish.
+	 */
+	BUG_ON(status == ANTIC_WAIT_REQ);
+
+	if (status == ANTIC_WAIT_NEXT)
+		goto keep_queue;
+
+	/* We are good to expire the queue. Save batch context */
+	as_save_batch_context(ad, asq);
+	ad->switch_queue = 0;
+	return 1;
+
+keep_queue:
+	/* Mark that elevator requested for queue switch whenever possible */
+	ad->switch_queue = 1;
+	return 0;
+}
+#endif
 
 /*
  * IO Context helper functions
@@ -429,6 +609,7 @@ static void as_antic_waitnext(struct as_data *ad)
 	mod_timer(&ad->antic_timer, timeout);
 
 	ad->antic_status = ANTIC_WAIT_NEXT;
+	as_log(ad, "antic_waitnext set");
 }
 
 /*
@@ -442,8 +623,10 @@ static void as_antic_waitreq(struct as_data *ad)
 	if (ad->antic_status == ANTIC_OFF) {
 		if (!ad->io_context || ad->ioc_finished)
 			as_antic_waitnext(ad);
-		else
+		else {
 			ad->antic_status = ANTIC_WAIT_REQ;
+			as_log(ad, "antic_waitreq set");
+		}
 	}
 }
 
@@ -455,6 +638,8 @@ static void as_antic_stop(struct as_data *ad)
 {
 	int status = ad->antic_status;
 
+	as_log(ad, "as_antic_stop antic_status=%d", ad->antic_status);
+
 	if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
 		if (status == ANTIC_WAIT_NEXT)
 			del_timer(&ad->antic_timer);
@@ -474,6 +659,7 @@ static void as_antic_timeout(unsigned long data)
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
+	as_log(ad, "as_antic_timeout");
 	if (ad->antic_status == ANTIC_WAIT_REQ
 			|| ad->antic_status == ANTIC_WAIT_NEXT) {
 		struct as_io_context *aic;
@@ -652,6 +838,21 @@ static int as_can_break_anticipation(struct as_data *ad, struct request *rq)
 	struct io_context *ioc;
 	struct as_io_context *aic;
 
+#ifdef CONFIG_IOSCHED_AS_HIER
+	/*
+	 * If the active asq and rq's asq are not same, then one can not
+	 * break the anticipation. This primarily becomes useful when a
+	 * request is added to a queue which is not being served currently.
+	 */
+	if (rq) {
+		struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+		struct as_queue *curr_asq =
+				elv_active_sched_queue(ad->q->elevator);
+
+		if (asq != curr_asq)
+			return 0;
+	}
+#endif
 	ioc = ad->io_context;
 	BUG_ON(!ioc);
 	spin_lock(&ioc->lock);
@@ -810,16 +1011,20 @@ static void as_update_rq(struct as_data *ad, struct request *rq)
 /*
  * Gathers timings and resizes the write batch automatically
  */
-static void update_write_batch(struct as_data *ad)
+static void update_write_batch(struct as_data *ad, struct request *rq)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
-	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
+	as_log(ad, "upd write: write_time=%ld batch=%lu write_batch_idled=%d"
+			" current_write_count=%d", write_time, batch,
+			asq->write_batch_idled, asq->current_write_count);
+
 	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
 			asq->write_batch_count /= 2;
@@ -834,6 +1039,8 @@ static void update_write_batch(struct as_data *ad)
 
 	if (asq->write_batch_count < 1)
 		asq->write_batch_count = 1;
+
+	as_log(ad, "upd write count=%d", asq->write_batch_count);
 }
 
 /*
@@ -843,6 +1050,7 @@ static void update_write_batch(struct as_data *ad)
 static void as_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(!list_empty(&rq->queuelist));
 
@@ -851,7 +1059,24 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 		goto out;
 	}
 
+	as_log(ad, "complete: reads_q=%d writes_q=%d changed_batch=%d"
+		" new_batch=%d switch_queue=%d, dir=%c",
+		asq->nr_queued[1], asq->nr_queued[0], ad->changed_batch,
+		ad->new_batch, ad->switch_queue,
+		ad->batch_data_dir ? 'R' : 'W');
+
 	if (ad->changed_batch && ad->nr_dispatched == 1) {
+		/*
+		 * If this was write batch finishing, adjust the write batch
+		 * length.
+		 *
+		 * Note, write batch length is recalculated upon completion
+		 * of the last write request, and not upon completion of the
+		 * first read request in the next batch.
+		 */
+		if (ad->batch_data_dir == BLK_RW_SYNC)
+			update_write_batch(ad, rq);
+
 		ad->current_batch_expires = jiffies +
 					ad->batch_expire[ad->batch_data_dir];
 		kblockd_schedule_work(q, &ad->antic_work);
@@ -869,7 +1094,6 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	 * and writeback caches
 	 */
 	if (ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)) {
-		update_write_batch(ad);
 		ad->current_batch_expires = jiffies +
 				ad->batch_expire[BLK_RW_SYNC];
 		ad->new_batch = 0;
@@ -888,6 +1112,13 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	}
 
 	as_put_io_context(rq);
+
+	/*
+	 * If elevator requested a queue switch, kick the queue in the
+	 * hope that this is right time for switch.
+	 */
+	if (ad->switch_queue)
+		kblockd_schedule_work(q, &ad->antic_work);
 out:
 	RQ_SET_STATE(rq, AS_RQ_POSTSCHED);
 }
@@ -908,6 +1139,9 @@ static void as_remove_queued_request(struct request_queue *q,
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
+	BUG_ON(asq->nr_queued[data_dir] <= 0);
+	asq->nr_queued[data_dir]--;
+
 	ioc = RQ_IOC(rq);
 	if (ioc && ioc->aic) {
 		BUG_ON(!atomic_read(&ioc->aic->nr_queued));
@@ -1019,6 +1253,8 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 	if (RQ_IOC(rq) && RQ_IOC(rq)->aic)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 	ad->nr_dispatched++;
+	as_log(ad, "dispatch req dir=%c nr_dispatched = %d",
+			data_dir ? 'R' : 'W', ad->nr_dispatched);
 }
 
 /*
@@ -1066,6 +1302,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		}
 		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
+		as_log(ad, "forced dispatch");
 		return dispatched;
 	}
 
@@ -1078,8 +1315,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	if (!(reads || writes)
 		|| ad->antic_status == ANTIC_WAIT_REQ
 		|| ad->antic_status == ANTIC_WAIT_NEXT
-		|| ad->changed_batch)
+		|| ad->changed_batch) {
+		as_log(ad, "no dispatch. read_q=%d, writes_q=%d"
+			" ad->antic_status=%d, changed_batch=%d,"
+			" switch_queue=%d new_batch=%d", asq->nr_queued[1],
+			asq->nr_queued[0], ad->antic_status, ad->changed_batch,
+			ad->switch_queue, ad->new_batch);
 		return 0;
+	}
 
 	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
@@ -1092,6 +1335,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
+				as_log(ad, "can_anticipate = 1");
 				as_antic_waitreq(ad);
 				return 0;
 			}
@@ -1111,6 +1355,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 * data direction (read / write)
 	 */
 
+	as_log(ad, "select a fresh batch and request");
+
 	if (reads) {
 		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
@@ -1125,6 +1371,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
+		as_log(ad, "new batch dir is sync");
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
 		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
@@ -1149,6 +1396,7 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
+		as_log(ad, "new batch dir is async");
 		asq->current_write_count = asq->write_batch_count;
 		asq->write_batch_idled = 0;
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
@@ -1184,6 +1432,9 @@ fifo_expired:
 		ad->changed_batch = 0;
 	}
 
+	if (ad->switch_queue)
+		return 0;
+
 	/*
 	 * rq is the selected appropriate request.
 	 */
@@ -1207,6 +1458,11 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 
 	rq->elevator_private = as_get_io_context(q->node);
 
+	asq->nr_queued[data_dir]++;
+	as_log(ad, "add a %c request read_q=%d write_q=%d",
+			data_dir ? 'R' : 'W', asq->nr_queued[1],
+			asq->nr_queued[0]);
+
 	if (RQ_IOC(rq)) {
 		as_update_iohist(ad, RQ_IOC(rq)->aic, rq);
 		atomic_inc(&RQ_IOC(rq)->aic->nr_queued);
@@ -1408,6 +1664,7 @@ static void *as_init_queue(struct request_queue *q)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
+	ad->switch_queue = 0;
 
 	return ad;
 }
@@ -1493,6 +1750,11 @@ static struct elv_fs_entry as_attrs[] = {
 	AS_ATTR(antic_expire),
 	AS_ATTR(read_batch_expire),
 	AS_ATTR(write_batch_expire),
+#ifdef CONFIG_IOSCHED_AS_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+#endif
 	__ATTR_NULL
 };
 
@@ -1514,8 +1776,14 @@ static struct elevator_type iosched_as = {
 		.trim =				as_trim,
 		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
 		.elevator_free_sched_queue_fn = as_free_as_queue,
+#ifdef CONFIG_IOSCHED_AS_HIER
+		.elevator_expire_ioq_fn =       as_expire_ioq,
+		.elevator_active_ioq_set_fn =   as_active_ioq_set,
 	},
-
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#else
+	},
+#endif
 	.elevator_attrs = as_attrs,
 	.elevator_name = "anticipatory",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 25fdac6..c1d04b1 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2523,6 +2523,7 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 		elv_clear_ioq_must_dispatch(ioq);
 		elv_clear_ioq_wait_busy_done(ioq);
 		elv_mark_ioq_slice_new(ioq);
+		elv_clear_ioq_must_expire(ioq);
 
 		del_timer(&efqd->idle_slice_timer);
 	}
@@ -2646,6 +2647,49 @@ static inline unsigned long elv_disk_time_used(struct request_queue *q,
 }
 
 /*
+ * Call iosched to tell it that elevator wants to expire the queue. This lets
+ * an iosched like AS say no (if it is in the middle of batch changeover or
+ * it is anticipating). It also allows iosched to do some housekeeping.
+ *
+ * force--> it is force dispatch and iosched must clean up its state. This
+ * 	     is useful when elevator wants to drain iosched and wants to
+ * 	     expire current active queue.
+ *
+ * slice_expired--> if 1, ioq slice expired hence elevator fair queuing logic
+ * 		    wants to switch the queue. iosched should allow that until
+ * 		    and unless necessary. Currently AS can deny the switch if
+ * 		    in the middle of batch switch.
+ *
+ * 		    if 0, time slice is still remaining. It is up to the iosched
+ * 		    whether it wants to wait on this queue or just want to
+ * 		    expire it and move on to next queue.
+ *
+ */
+static int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+					int force)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	int ret = 1;
+
+	if (e->ops->elevator_expire_ioq_fn) {
+		ret = e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
+							slice_expired, force);
+		/*
+		 * AS denied expiration of queue right now. Mark that elevator
+		 * layer has requested ioscheduler (as) to expire this queue.
+		 * Now as will try to expire this queue as soon as it can.
+		 * Now don't try to dispatch from this queue even if we get
+		 * a new request and if time slice is left. Do expire it once.
+		 */
+		if (!ret)
+			elv_mark_ioq_must_expire(ioq);
+	}
+
+	return ret;
+}
+
+/*
  * Do the accounting. Determine how much service (in terms of time slices)
  * current queue used and adjust the start, finish time of queue and vtime
  * of the tree accordingly.
@@ -2683,6 +2727,7 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	elv_clear_ioq_wait_request(ioq);
 	elv_clear_ioq_wait_busy(ioq);
 	elv_clear_ioq_wait_busy_done(ioq);
+	elv_clear_ioq_must_expire(ioq);
 
 	/*
 	 * if ioq->slice_end = 0, that means a queue was expired before first
@@ -2821,16 +2866,18 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
 {
 	elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
-	elv_ioq_slice_expired(q);
+	if (elv_iosched_expire_ioq(q, 0, 1)) {
+		elv_ioq_slice_expired(q);
 
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
+		/*
+		 * Put the new queue at the front of the of the current list,
+		 * so we know that it will be selected next.
+		 */
 
-	elv_activate_ioq(ioq, 1);
-	ioq->slice_end = 0;
-	elv_mark_ioq_slice_new(ioq);
+		elv_activate_ioq(ioq, 1);
+		ioq->slice_end = 0;
+		elv_mark_ioq_slice_new(ioq);
+	}
 }
 
 void elv_ioq_request_add(struct request_queue *q, struct request *rq)
@@ -3035,6 +3082,8 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
 	struct io_group *iog;
+	struct elevator_type *e = q->elevator->elevator_type;
+	int slice_expired = 1;
 
 	if (!elv_nr_busy_ioq(q->elevator))
 		return NULL;
@@ -3053,6 +3102,10 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/* This queue has been marked for expiry. Try to expire it */
+	if (elv_ioq_must_expire(ioq))
+		goto expire;
+
 	/*
 	 * If there is only root group present, don't expire the queue for
 	 * single queue ioschedulers (noop, deadline, AS). It is unnecessary
@@ -3142,8 +3195,10 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 		goto keep_queue;
 	}
 
+	slice_expired = 0;
 expire:
-	if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched) {
+	if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched
+	    && strcmp(e->elevator_name, "anticipatory")) {
 		/*
 		 * If there are request dispatched from this queue, don't
 		 * dispatch requests from new queue till all the requests from
@@ -3154,6 +3209,11 @@ expire:
 		 *
 		 * Set ioq = NULL so that no more requests are dispatched from
 		 * this queue.
+		 *
+		 * Note: Anticipatory already has the behavior where queue
+		 * switch is not allowed until requests from previous queue
+		 * have finished. Hence we don't have to get into this loop
+		 * in case of AS.
 		 */
 		elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
 				" disp=%lu", ioq->dispatched);
@@ -3161,7 +3221,14 @@ expire:
 		goto keep_queue;
 	}
 
-	elv_ioq_slice_expired(q);
+	if (elv_iosched_expire_ioq(q, slice_expired, force))
+		elv_ioq_slice_expired(q);
+	else
+		/*
+		 * Not making ioq = NULL, as AS can deny queue expiration and
+		 * continue to dispatch from same queue
+		 */
+		goto keep_queue;
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
@@ -3298,7 +3365,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		}
 
 		if (elv_ioq_class_idle(ioq)) {
-			elv_ioq_slice_expired(q);
+			if (elv_iosched_expire_ioq(q, 1, 0))
+				elv_ioq_slice_expired(q);
 			goto done;
 		}
 
@@ -3343,7 +3411,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 					goto done;
 
 				/* Expire the queue */
-				elv_ioq_slice_expired(q);
+				if (elv_iosched_expire_ioq(q, 1, 0))
+					elv_ioq_slice_expired(q);
 			}
 		} else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
 			 && sync && (!rq_noidle(rq) || efqd->fairness))
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index baa6cee..c117d40 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -371,6 +371,8 @@ enum elv_queue_state_flags {
 	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
 	ELV_QUEUE_FLAG_wait_busy,	  /* wait for this queue to get busy */
 	ELV_QUEUE_FLAG_wait_busy_done,	  /* Have already waited on this queue*/
+	ELV_QUEUE_FLAG_must_expire,       /* Expire this queue even if it has
+					   * request and time slice left */
 };
 
 #define ELV_IO_QUEUE_FLAG_FNS(name)					\
@@ -395,6 +397,7 @@ ELV_IO_QUEUE_FLAG_FNS(idle_window)
 ELV_IO_QUEUE_FLAG_FNS(slice_new)
 ELV_IO_QUEUE_FLAG_FNS(wait_busy)
 ELV_IO_QUEUE_FLAG_FNS(wait_busy_done)
+ELV_IO_QUEUE_FLAG_FNS(must_expire)
 
 static inline struct io_service_tree *
 io_entity_service_tree(struct io_entity *entity)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 6d2c8db..dda7951 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -42,6 +42,7 @@ typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
 						struct request*);
 typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
 						void*, int probe);
+typedef int (elevator_expire_ioq_fn) (struct request_queue*, void *, int, int);
 #endif
 
 struct elevator_ops
@@ -81,6 +82,7 @@ struct elevator_ops
 	elevator_should_preempt_fn *elevator_should_preempt_fn;
 	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
 	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+	elevator_expire_ioq_fn  *elevator_expire_ioq_fn;
 #endif
 };
 
-- 
1.6.0.6


* [PATCH 18/25] io-controller: anticipatory changes for hierarchical fair queuing
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

This patch changes the anticipatory scheduler (AS) to use the queue scheduling
code from the elevator layer. One can go back to the old AS by deselecting
CONFIG_IOSCHED_AS_HIER. Even with CONFIG_IOSCHED_AS_HIER=y, as long as no
other cgroup has been created, AS behavior should remain the same as before.

o AS is a single-queue ioscheduler, which means there is one AS queue per
  group.

o The common layer code selects the queue to dispatch from based on fairness,
  and then the AS code selects the request within the group.

o AS runs read and write batches within a group. So the common layer runs
  timed group queues and, within a group's time, AS runs timed batches of
  reads and writes.

o Note: Previously the AS write batch length was adjusted dynamically whenever
  a W->R batch data direction change took place and the first request from
  the read batch completed.

  Now the write batch update takes place when the last request from the write
  batch has finished during the W->R transition.

o AS runs its own anticipation logic to anticipate on reads. The common layer
  also anticipates on the group if the think time of the group is within
  slice_idle.

o Introduced a few debugging messages in AS.

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   12 ++
 block/as-iosched.c       |  280 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.c      |   93 +++++++++++++--
 block/elevator-fq.h      |    3 +
 include/linux/elevator.h |    2 +
 5 files changed, 372 insertions(+), 18 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3a9e7d7..77fc786 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,6 +45,18 @@ config IOSCHED_AS
 	  deadline I/O scheduler, it can also be slower in some cases
 	  especially some database loads.
 
+config IOSCHED_AS_HIER
+	bool "Anticipatory Hierarchical Scheduling support"
+	depends on IOSCHED_AS && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in anticipatory. In this mode
+	  anticipatory keeps one IO queue per cgroup instead of a global
+	  queue. Elevator fair queuing logic ensures fairness among various
+	  queues.
+
 config IOSCHED_DEADLINE
 	tristate "Deadline I/O scheduler"
 	default y
diff --git a/block/as-iosched.c b/block/as-iosched.c
index e3514eb..18a61bb 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -16,6 +16,7 @@
 #include <linux/compiler.h>
 #include <linux/rbtree.h>
 #include <linux/interrupt.h>
+#include <linux/blktrace_api.h>
 
 /*
  * See Documentation/block/as-iosched.txt
@@ -84,10 +85,24 @@ struct as_queue {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+
+	/*
+	 * If an as_queue is switched while a batch is running, then we
+	 * store the time left before current batch will expire
+	 */
+	long current_batch_time_left;
+
+	/*
+	 * batch data dir when queue was scheduled out. This will be used
+	 * to setup ad->batch_data_dir when queue is scheduled in.
+	 */
+	int saved_batch_data_dir;
+
 	unsigned long last_check_fifo[2];
 	int write_batch_count;		/* max # of reqs in a write batch */
 	int current_write_count;	/* how many requests left this batch */
 	int write_batch_idled;		/* has the write batch gone idle? */
+	int nr_queued[2];
 };
 
 struct as_data {
@@ -123,6 +138,9 @@ struct as_data {
 	unsigned long fifo_expire[2];
 	unsigned long batch_expire[2];
 	unsigned long antic_expire;
+
+	/* elevator requested a queue switch. */
+	int switch_queue;
 };
 
 /*
@@ -144,12 +162,174 @@ enum arq_state {
 #define RQ_STATE(rq)	((enum arq_state)(rq)->elevator_private2)
 #define RQ_SET_STATE(rq, state)	((rq)->elevator_private2 = (void *) state)
 
+#define as_log(ad, fmt, args...)        \
+	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
+
 static DEFINE_PER_CPU(unsigned long, ioc_count);
 static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq);
 static void as_antic_stop(struct as_data *ad);
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
+
+#ifdef CONFIG_IOSCHED_AS_HIER
+static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Save batch data dir */
+	asq->saved_batch_data_dir = ad->batch_data_dir;
+
+	if (ad->changed_batch) {
+		/*
+		 * In case of force expire, we come here. Batch changeover
+		 * has been signalled but we are waiting for all the
+		 * request to finish from previous batch and then start
+		 * the new batch. Can't wait now. Mark that full batch time
+		 * needs to be allocated when this queue is scheduled again.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->changed_batch = 0;
+		goto out;
+	}
+
+	if (ad->new_batch) {
+		/*
+		 * We should come here only when new_batch has been set
+		 * but no read request has been issued or if it is a forced
+		 * expiry.
+		 *
+		 * In both the cases, new batch has not started yet so
+		 * allocate full batch length for next scheduling opportunity.
+		 * We don't do write batch size adjustment in hierarchical
+		 * AS so that should not be an issue.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->new_batch = 0;
+		goto out;
+	}
+
+	/* Save how much time is left before current batch expires */
+	if (as_batch_expired(ad, asq))
+		asq->current_batch_time_left = 0;
+	else {
+		asq->current_batch_time_left = ad->current_batch_expires
+							- jiffies;
+		BUG_ON((asq->current_batch_time_left) < 0);
+	}
+
+	if (ad->io_context) {
+		put_io_context(ad->io_context);
+		ad->io_context = NULL;
+	}
+
+out:
+	as_log(ad, "save batch: dir=%c time_left=%ld changed_batch=%d"
+			" new_batch=%d, antic_status=%d",
+			ad->batch_data_dir ? 'R' : 'W',
+			asq->current_batch_time_left,
+			ad->changed_batch, ad->new_batch, ad->antic_status);
+	return;
+}
+
+/*
+ * FIXME: In original AS, read batch's time account started only after when
+ * first request had completed (if last batch was a write batch). But here
+ * we might be rescheduling a read batch right away irrespective of the fact
+ * of disk cache state.
+ */
+static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Adjust the batch expire time */
+	if (asq->current_batch_time_left)
+		ad->current_batch_expires = jiffies +
+						asq->current_batch_time_left;
+	/* restore asq batch_data_dir info */
+	ad->batch_data_dir = asq->saved_batch_data_dir;
+	as_log(ad, "restore batch: dir=%c time=%ld reads_q=%d writes_q=%d"
+			" ad->antic_status=%d",
+			ad->batch_data_dir ? 'R' : 'W',
+			asq->current_batch_time_left,
+			asq->nr_queued[1], asq->nr_queued[0],
+			ad->antic_status);
+}
+
+/* ioq has been set. */
+static void as_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
+{
+	struct as_queue *asq = sched_queue;
+	struct as_data *ad = q->elevator->elevator_data;
+
+	as_restore_batch_context(ad, asq);
+}
+
+/*
+ * This is a notification from common layer that it wishes to expire this
+ * io queue. AS decides whether queue can be expired, if yes, it also
+ * saves the batch context.
+ */
+static int as_expire_ioq(struct request_queue *q, void *sched_queue,
+				int slice_expired, int force)
+{
+	struct as_data *ad = q->elevator->elevator_data;
+	int status = ad->antic_status;
+	struct as_queue *asq = sched_queue;
+
+	as_log(ad, "as_expire_ioq slice_expired=%d, force=%d", slice_expired,
+		force);
+
+	/* Forced expiry. We don't have a choice */
+	if (force) {
+		as_antic_stop(ad);
+		/*
+		 * antic_stop() sets antic_status to FINISHED which signifies
+		 * that either we timed out or we found a close request but
+		 * that's not the case here. Start from scratch.
+		 */
+		ad->antic_status = ANTIC_OFF;
+		as_save_batch_context(ad, asq);
+		ad->switch_queue = 0;
+		return 1;
+	}
+
+	/*
+	 * We are waiting for requests to finish from last
+	 * batch. Don't expire the queue now
+	 */
+	if (ad->changed_batch)
+		goto keep_queue;
+
+	/*
+	 * Wait for all requests from existing batch to finish before we
+	 * switch the queue. New queue might change the batch direction
+	 * and this is to be consistent with AS philosophy of not dispatching
+	 * new requests to underlying drive till requests
+	 * from previous batch are completed.
+	 */
+	if (ad->nr_dispatched)
+		goto keep_queue;
+
+	/*
+	 * If AS anticipation is ON, wait for it to finish.
+	 */
+	BUG_ON(status == ANTIC_WAIT_REQ);
+
+	if (status == ANTIC_WAIT_NEXT)
+		goto keep_queue;
+
+	/* We are good to expire the queue. Save batch context */
+	as_save_batch_context(ad, asq);
+	ad->switch_queue = 0;
+	return 1;
+
+keep_queue:
+	/* Mark that elevator requested for queue switch whenever possible */
+	ad->switch_queue = 1;
+	return 0;
+}
+#endif
 
 /*
  * IO Context helper functions
@@ -429,6 +609,7 @@ static void as_antic_waitnext(struct as_data *ad)
 	mod_timer(&ad->antic_timer, timeout);
 
 	ad->antic_status = ANTIC_WAIT_NEXT;
+	as_log(ad, "antic_waitnext set");
 }
 
 /*
@@ -442,8 +623,10 @@ static void as_antic_waitreq(struct as_data *ad)
 	if (ad->antic_status == ANTIC_OFF) {
 		if (!ad->io_context || ad->ioc_finished)
 			as_antic_waitnext(ad);
-		else
+		else {
 			ad->antic_status = ANTIC_WAIT_REQ;
+			as_log(ad, "antic_waitreq set");
+		}
 	}
 }
 
@@ -455,6 +638,8 @@ static void as_antic_stop(struct as_data *ad)
 {
 	int status = ad->antic_status;
 
+	as_log(ad, "as_antic_stop antic_status=%d", ad->antic_status);
+
 	if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
 		if (status == ANTIC_WAIT_NEXT)
 			del_timer(&ad->antic_timer);
@@ -474,6 +659,7 @@ static void as_antic_timeout(unsigned long data)
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
+	as_log(ad, "as_antic_timeout");
 	if (ad->antic_status == ANTIC_WAIT_REQ
 			|| ad->antic_status == ANTIC_WAIT_NEXT) {
 		struct as_io_context *aic;
@@ -652,6 +838,21 @@ static int as_can_break_anticipation(struct as_data *ad, struct request *rq)
 	struct io_context *ioc;
 	struct as_io_context *aic;
 
+#ifdef CONFIG_IOSCHED_AS_HIER
+	/*
+	 * If the active asq and rq's asq are not same, then one can not
+	 * break the anticipation. This primarily becomes useful when a
+	 * request is added to a queue which is not being served currently.
+	 */
+	if (rq) {
+		struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+		struct as_queue *curr_asq =
+				elv_active_sched_queue(ad->q->elevator);
+
+		if (asq != curr_asq)
+			return 0;
+	}
+#endif
 	ioc = ad->io_context;
 	BUG_ON(!ioc);
 	spin_lock(&ioc->lock);
@@ -810,16 +1011,20 @@ static void as_update_rq(struct as_data *ad, struct request *rq)
 /*
  * Gathers timings and resizes the write batch automatically
  */
-static void update_write_batch(struct as_data *ad)
+static void update_write_batch(struct as_data *ad, struct request *rq)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
-	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
+	as_log(ad, "upd write: write_time=%ld batch=%lu write_batch_idled=%d"
+			" current_write_count=%d", write_time, batch,
+			asq->write_batch_idled, asq->current_write_count);
+
 	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
 			asq->write_batch_count /= 2;
@@ -834,6 +1039,8 @@ static void update_write_batch(struct as_data *ad)
 
 	if (asq->write_batch_count < 1)
 		asq->write_batch_count = 1;
+
+	as_log(ad, "upd write count=%d", asq->write_batch_count);
 }
 
 /*
@@ -843,6 +1050,7 @@ static void update_write_batch(struct as_data *ad)
 static void as_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(!list_empty(&rq->queuelist));
 
@@ -851,7 +1059,24 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 		goto out;
 	}
 
+	as_log(ad, "complete: reads_q=%d writes_q=%d changed_batch=%d"
+		" new_batch=%d switch_queue=%d, dir=%c",
+		asq->nr_queued[1], asq->nr_queued[0], ad->changed_batch,
+		ad->new_batch, ad->switch_queue,
+		ad->batch_data_dir ? 'R' : 'W');
+
 	if (ad->changed_batch && ad->nr_dispatched == 1) {
+		/*
+		 * If this was write batch finishing, adjust the write batch
+		 * length.
+		 *
+		 * Note, write batch length is being calculated upon completion
+		 * of last write request finished and not completion of first
+		 * read request finished in the next batch.
+		 */
+		if (ad->batch_data_dir == BLK_RW_SYNC)
+			update_write_batch(ad, rq);
+
 		ad->current_batch_expires = jiffies +
 					ad->batch_expire[ad->batch_data_dir];
 		kblockd_schedule_work(q, &ad->antic_work);
@@ -869,7 +1094,6 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	 * and writeback caches
 	 */
 	if (ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)) {
-		update_write_batch(ad);
 		ad->current_batch_expires = jiffies +
 				ad->batch_expire[BLK_RW_SYNC];
 		ad->new_batch = 0;
@@ -888,6 +1112,13 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	}
 
 	as_put_io_context(rq);
+
+	/*
+	 * If elevator requested a queue switch, kick the queue in the
+	 * hope that this is right time for switch.
+	 */
+	if (ad->switch_queue)
+		kblockd_schedule_work(q, &ad->antic_work);
 out:
 	RQ_SET_STATE(rq, AS_RQ_POSTSCHED);
 }
@@ -908,6 +1139,9 @@ static void as_remove_queued_request(struct request_queue *q,
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
+	BUG_ON(asq->nr_queued[data_dir] <= 0);
+	asq->nr_queued[data_dir]--;
+
 	ioc = RQ_IOC(rq);
 	if (ioc && ioc->aic) {
 		BUG_ON(!atomic_read(&ioc->aic->nr_queued));
@@ -1019,6 +1253,8 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 	if (RQ_IOC(rq) && RQ_IOC(rq)->aic)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 	ad->nr_dispatched++;
+	as_log(ad, "dispatch req dir=%c nr_dispatched = %d",
+			data_dir ? 'R' : 'W', ad->nr_dispatched);
 }
 
 /*
@@ -1066,6 +1302,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		}
 		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
+		as_log(ad, "forced dispatch");
 		return dispatched;
 	}
 
@@ -1078,8 +1315,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	if (!(reads || writes)
 		|| ad->antic_status == ANTIC_WAIT_REQ
 		|| ad->antic_status == ANTIC_WAIT_NEXT
-		|| ad->changed_batch)
+		|| ad->changed_batch) {
+		as_log(ad, "no dispatch. read_q=%d, writes_q=%d"
+			" ad->antic_status=%d, changed_batch=%d,"
+			" switch_queue=%d new_batch=%d", asq->nr_queued[1],
+			asq->nr_queued[0], ad->antic_status, ad->changed_batch,
+			ad->switch_queue, ad->new_batch);
 		return 0;
+	}
 
 	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
@@ -1092,6 +1335,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
+				as_log(ad, "can_anticipate = 1");
 				as_antic_waitreq(ad);
 				return 0;
 			}
@@ -1111,6 +1355,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 * data direction (read / write)
 	 */
 
+	as_log(ad, "select a fresh batch and request");
+
 	if (reads) {
 		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
@@ -1125,6 +1371,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
+		as_log(ad, "new batch dir is sync");
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
 		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
@@ -1149,6 +1396,7 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
+		as_log(ad, "new batch dir is async");
 		asq->current_write_count = asq->write_batch_count;
 		asq->write_batch_idled = 0;
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
@@ -1184,6 +1432,9 @@ fifo_expired:
 		ad->changed_batch = 0;
 	}
 
+	if (ad->switch_queue)
+		return 0;
+
 	/*
 	 * rq is the selected appropriate request.
 	 */
@@ -1207,6 +1458,11 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 
 	rq->elevator_private = as_get_io_context(q->node);
 
+	asq->nr_queued[data_dir]++;
+	as_log(ad, "add a %c request read_q=%d write_q=%d",
+			data_dir ? 'R' : 'W', asq->nr_queued[1],
+			asq->nr_queued[0]);
+
 	if (RQ_IOC(rq)) {
 		as_update_iohist(ad, RQ_IOC(rq)->aic, rq);
 		atomic_inc(&RQ_IOC(rq)->aic->nr_queued);
@@ -1408,6 +1664,7 @@ static void *as_init_queue(struct request_queue *q)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
+	ad->switch_queue = 0;
 
 	return ad;
 }
@@ -1493,6 +1750,11 @@ static struct elv_fs_entry as_attrs[] = {
 	AS_ATTR(antic_expire),
 	AS_ATTR(read_batch_expire),
 	AS_ATTR(write_batch_expire),
+#ifdef CONFIG_IOSCHED_AS_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+#endif
 	__ATTR_NULL
 };
 
@@ -1514,8 +1776,14 @@ static struct elevator_type iosched_as = {
 		.trim =				as_trim,
 		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
 		.elevator_free_sched_queue_fn = as_free_as_queue,
+#ifdef CONFIG_IOSCHED_AS_HIER
+		.elevator_expire_ioq_fn =       as_expire_ioq,
+		.elevator_active_ioq_set_fn =   as_active_ioq_set,
 	},
-
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#else
+	},
+#endif
 	.elevator_attrs = as_attrs,
 	.elevator_name = "anticipatory",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 25fdac6..c1d04b1 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2523,6 +2523,7 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 		elv_clear_ioq_must_dispatch(ioq);
 		elv_clear_ioq_wait_busy_done(ioq);
 		elv_mark_ioq_slice_new(ioq);
+		elv_clear_ioq_must_expire(ioq);
 
 		del_timer(&efqd->idle_slice_timer);
 	}
@@ -2646,6 +2647,49 @@ static inline unsigned long elv_disk_time_used(struct request_queue *q,
 }
 
 /*
+ * Tell iosched that elevator wants to expire the queue. This lets an iosched
+ * like AS say no (if it is in the middle of batch changeover or it is
+ * anticipating). It also allows iosched to do some housekeeping.
+ *
+ * force--> it is force dispatch and iosched must clean up its state. This
+ * 	     is useful when elevator wants to drain iosched and wants to
+ * 	     expire current active queue.
+ *
+ * slice_expired--> if 1, ioq slice expired hence elevator fair queuing logic
+ * 		    wants to switch the queue. iosched should allow that unless
+ * 		    it really must refuse. Currently AS can deny the switch if
+ * 		    in the middle of batch switch.
+ *
+ * 		    if 0, time slice is still remaining. It is up to the iosched
+ * 		    whether it wants to wait on this queue or just want to
+ * 		    expire it and move on to next queue.
+ *
+ */
+static int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+					int force)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	int ret = 1;
+
+	if (e->ops->elevator_expire_ioq_fn) {
+		ret = e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
+							slice_expired, force);
+		/*
+		 * AS denied expiration of queue right now. Mark that elevator
+		 * layer has requested ioscheduler (as) to expire this queue.
+		 * Now as will try to expire this queue as soon as it can.
+		 * Now don't try to dispatch from this queue even if we get
+		 * a new request and if time slice is left. Do expire it once.
+		 */
+		if (!ret)
+			elv_mark_ioq_must_expire(ioq);
+	}
+
+	return ret;
+}
+
+/*
  * Do the accounting. Determine how much service (in terms of time slices)
  * current queue used and adjust the start, finish time of queue and vtime
  * of the tree accordingly.
@@ -2683,6 +2727,7 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	elv_clear_ioq_wait_request(ioq);
 	elv_clear_ioq_wait_busy(ioq);
 	elv_clear_ioq_wait_busy_done(ioq);
+	elv_clear_ioq_must_expire(ioq);
 
 	/*
 	 * if ioq->slice_end = 0, that means a queue was expired before first
@@ -2821,16 +2866,18 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
 {
 	elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
-	elv_ioq_slice_expired(q);
+	if (elv_iosched_expire_ioq(q, 0, 1)) {
+		elv_ioq_slice_expired(q);
 
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
+		/*
+		 * Put the new queue at the front of the of the current list,
+		 * so we know that it will be selected next.
+		 */
 
-	elv_activate_ioq(ioq, 1);
-	ioq->slice_end = 0;
-	elv_mark_ioq_slice_new(ioq);
+		elv_activate_ioq(ioq, 1);
+		ioq->slice_end = 0;
+		elv_mark_ioq_slice_new(ioq);
+	}
 }
 
 void elv_ioq_request_add(struct request_queue *q, struct request *rq)
@@ -3035,6 +3082,8 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
 	struct io_group *iog;
+	struct elevator_type *e = q->elevator->elevator_type;
+	int slice_expired = 1;
 
 	if (!elv_nr_busy_ioq(q->elevator))
 		return NULL;
@@ -3053,6 +3102,10 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/* This queue has been marked for expiry. Try to expire it */
+	if (elv_ioq_must_expire(ioq))
+		goto expire;
+
 	/*
 	 * If there is only root group present, don't expire the queue for
 	 * single queue ioschedulers (noop, deadline, AS). It is unnecessary
@@ -3142,8 +3195,10 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 		goto keep_queue;
 	}
 
+	slice_expired = 0;
 expire:
-	if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched) {
+	if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched
+	    && strcmp(e->elevator_name, "anticipatory")) {
 		/*
 		 * If there are request dispatched from this queue, don't
 		 * dispatch requests from new queue till all the requests from
@@ -3154,6 +3209,11 @@ expire:
 		 *
 		 * Set ioq = NULL so that no more requests are dispatched from
 		 * this queue.
+		 *
+		 * Note: Anticipatory already has the behavior where queue
+		 * switch is not allowed until requests from previous queue
+		 * have finished. Hence we don't have to get into this loop
+		 * in case of AS.
 		 */
 		elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
 				" disp=%lu", ioq->dispatched);
@@ -3161,7 +3221,14 @@ expire:
 		goto keep_queue;
 	}
 
-	elv_ioq_slice_expired(q);
+	if (elv_iosched_expire_ioq(q, slice_expired, force))
+		elv_ioq_slice_expired(q);
+	else
+		/*
+		 * Not making ioq = NULL, as AS can deny queue expiration and
+		 * continue to dispatch from same queue
+		 */
+		goto keep_queue;
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
@@ -3298,7 +3365,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		}
 
 		if (elv_ioq_class_idle(ioq)) {
-			elv_ioq_slice_expired(q);
+			if (elv_iosched_expire_ioq(q, 1, 0))
+				elv_ioq_slice_expired(q);
 			goto done;
 		}
 
@@ -3343,7 +3411,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 					goto done;
 
 				/* Expire the queue */
-				elv_ioq_slice_expired(q);
+				if (elv_iosched_expire_ioq(q, 1, 0))
+					elv_ioq_slice_expired(q);
 			}
 		} else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
 			 && sync && (!rq_noidle(rq) || efqd->fairness))
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index baa6cee..c117d40 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -371,6 +371,8 @@ enum elv_queue_state_flags {
 	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
 	ELV_QUEUE_FLAG_wait_busy,	  /* wait for this queue to get busy */
 	ELV_QUEUE_FLAG_wait_busy_done,	  /* Have already waited on this queue*/
+	ELV_QUEUE_FLAG_must_expire,       /* Expire this queue even if it has
+					   * request and time slice left */
 };
 
 #define ELV_IO_QUEUE_FLAG_FNS(name)					\
@@ -395,6 +397,7 @@ ELV_IO_QUEUE_FLAG_FNS(idle_window)
 ELV_IO_QUEUE_FLAG_FNS(slice_new)
 ELV_IO_QUEUE_FLAG_FNS(wait_busy)
 ELV_IO_QUEUE_FLAG_FNS(wait_busy_done)
+ELV_IO_QUEUE_FLAG_FNS(must_expire)
 
 static inline struct io_service_tree *
 io_entity_service_tree(struct io_entity *entity)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 6d2c8db..dda7951 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -42,6 +42,7 @@ typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
 						struct request*);
 typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
 						void*, int probe);
+typedef int (elevator_expire_ioq_fn) (struct request_queue*, void *, int, int);
 #endif
 
 struct elevator_ops
@@ -81,6 +82,7 @@ struct elevator_ops
 	elevator_should_preempt_fn *elevator_should_preempt_fn;
 	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
 	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+	elevator_expire_ioq_fn  *elevator_expire_ioq_fn;
 #endif
 };
 
-- 
1.6.0.6
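
As a rough illustration of the time accounting done by
as_save_batch_context() and as_restore_batch_context() in the patch above,
here is a small user-space sketch. It is not kernel code: batch_model,
queue_model, model_save() and model_restore() are hypothetical names, "now"
stands in for jiffies, and counter wraparound as well as any non-time
conditions checked by as_batch_expired() are ignored.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-device state (struct as_data). */
struct batch_model {
	long now;		/* stand-in for jiffies */
	long batch_expires;	/* absolute expiry, like current_batch_expires */
};

/* Hypothetical stand-in for the per-group state (struct as_queue). */
struct queue_model {
	long time_left;		/* like current_batch_time_left */
};

/* Queue is scheduled out: remember how much batch time was unused. */
static void model_save(const struct batch_model *ad, struct queue_model *asq)
{
	long left = ad->batch_expires - ad->now;

	/* An already expired batch has nothing left to carry over. */
	asq->time_left = left > 0 ? left : 0;
	assert(asq->time_left >= 0);
}

/* Queue is scheduled in: re-arm the expiry relative to the current time. */
static void model_restore(struct batch_model *ad,
			  const struct queue_model *asq)
{
	/* As in the patch, a zero remainder leaves the old expiry alone. */
	if (asq->time_left)
		ad->batch_expires = ad->now + asq->time_left;
}
```

The point of storing a relative remainder instead of the absolute expiry is
that the group may be scheduled back in at an arbitrary later time.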



* [PATCH 18/25] io-controller: anticipatory changes for hierarchical fair queuing
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

This patch changes the anticipatory scheduler (AS) to use the queue scheduling
code from the elevator layer. One can go back to the old AS by deselecting
CONFIG_IOSCHED_AS_HIER. Even with CONFIG_IOSCHED_AS_HIER=y, as long as no
other cgroup has been created, AS behavior should remain the same as before.

o AS is a single-queue ioscheduler, which means there is one AS queue per group.

o Common layer code selects the queue to dispatch from based on fairness, and
  then AS code selects the request within the group.

o AS runs read and write batches within a group. So the common layer runs timed
  group queues and, within a group's time, AS runs timed batches of reads and
  writes.

o Note: Previously the AS write batch length was adjusted dynamically whenever
  a W->R batch data direction change took place and the first request from
  the read batch completed.

  Now the write batch update takes place when the last request from the write
  batch has finished during the W->R transition.

o AS runs its own anticipation logic to anticipate on reads. The common layer
  also anticipates on the group if the think time of the group is within
  slice_idle.

o Introduced a few debugging messages in AS.
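
For illustration only, the expiry handshake between the common layer and AS
described above can be sketched as a small user-space C model. It is not part
of the patch: struct as_model and model_expire_ioq() are hypothetical names,
and the model keeps only the state that decides whether AS lets the common
layer switch queues (batch changeover pending, requests still in flight,
anticipation timer armed).

```c
#include <assert.h>

/* Simplified copy of the anticipation states used by AS. */
enum antic_status {
	ANTIC_OFF,
	ANTIC_WAIT_REQ,
	ANTIC_WAIT_NEXT,
	ANTIC_FINISHED
};

/* Hypothetical, reduced stand-in for struct as_data. */
struct as_model {
	enum antic_status antic_status;
	int changed_batch;	/* batch changeover signalled, not yet done */
	int nr_dispatched;	/* requests in flight at the drive */
	int switch_queue;	/* common layer asked for a switch */
};

/* Returns 1 if the queue may be expired now, 0 if AS denies the switch. */
static int model_expire_ioq(struct as_model *ad, int force)
{
	if (force) {
		/* Forced expiry: drop anticipation and pending changeover. */
		ad->antic_status = ANTIC_OFF;
		ad->changed_batch = 0;
		ad->switch_queue = 0;
		return 1;
	}

	/* Still draining the previous batch or requests in flight. */
	if (ad->changed_batch || ad->nr_dispatched)
		goto keep_queue;

	/* Mirrors the BUG_ON() in the patch for this path. */
	assert(ad->antic_status != ANTIC_WAIT_REQ);

	/* Anticipation timer is armed: wait for it to finish. */
	if (ad->antic_status == ANTIC_WAIT_NEXT)
		goto keep_queue;

	ad->switch_queue = 0;
	return 1;

keep_queue:
	/* Remember the request so the completion path can retry later. */
	ad->switch_queue = 1;
	return 0;
}
```

With force set the switch always succeeds. Otherwise a denied switch leaves
switch_queue set, which in the patch makes the completion path kick the queue
so the switch can be retried.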

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   12 ++
 block/as-iosched.c       |  280 +++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.c      |   93 +++++++++++++--
 block/elevator-fq.h      |    3 +
 include/linux/elevator.h |    2 +
 5 files changed, 372 insertions(+), 18 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 3a9e7d7..77fc786 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -45,6 +45,18 @@ config IOSCHED_AS
 	  deadline I/O scheduler, it can also be slower in some cases
 	  especially some database loads.
 
+config IOSCHED_AS_HIER
+	bool "Anticipatory Hierarchical Scheduling support"
+	depends on IOSCHED_AS && CGROUPS
+	select ELV_FAIR_QUEUING
+	select GROUP_IOSCHED
+	default n
+	---help---
+	  Enable hierarchical scheduling in anticipatory. In this mode
+	  anticipatory keeps one IO queue per cgroup instead of a global
+	  queue. Elevator fair queuing logic ensures fairness among various
+	  queues.
+
 config IOSCHED_DEADLINE
 	tristate "Deadline I/O scheduler"
 	default y
diff --git a/block/as-iosched.c b/block/as-iosched.c
index e3514eb..18a61bb 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -16,6 +16,7 @@
 #include <linux/compiler.h>
 #include <linux/rbtree.h>
 #include <linux/interrupt.h>
+#include <linux/blktrace_api.h>
 
 /*
  * See Documentation/block/as-iosched.txt
@@ -84,10 +85,24 @@ struct as_queue {
 	struct list_head fifo_list[2];
 
 	struct request *next_rq[2];	/* next in sort order */
+
+	/*
+	 * If an as_queue is switched while a batch is running, then we
+	 * store the time left before current batch will expire
+	 */
+	long current_batch_time_left;
+
+	/*
+	 * batch data dir when queue was scheduled out. This will be used
+	 * to setup ad->batch_data_dir when queue is scheduled in.
+	 */
+	int saved_batch_data_dir;
+
 	unsigned long last_check_fifo[2];
 	int write_batch_count;		/* max # of reqs in a write batch */
 	int current_write_count;	/* how many requests left this batch */
 	int write_batch_idled;		/* has the write batch gone idle? */
+	int nr_queued[2];
 };
 
 struct as_data {
@@ -123,6 +138,9 @@ struct as_data {
 	unsigned long fifo_expire[2];
 	unsigned long batch_expire[2];
 	unsigned long antic_expire;
+
+	/* elevator requested a queue switch. */
+	int switch_queue;
 };
 
 /*
@@ -144,12 +162,174 @@ enum arq_state {
 #define RQ_STATE(rq)	((enum arq_state)(rq)->elevator_private2)
 #define RQ_SET_STATE(rq, state)	((rq)->elevator_private2 = (void *) state)
 
+#define as_log(ad, fmt, args...)        \
+	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
+
 static DEFINE_PER_CPU(unsigned long, ioc_count);
 static struct completion *ioc_gone;
 static DEFINE_SPINLOCK(ioc_gone_lock);
 
 static void as_move_to_dispatch(struct as_data *ad, struct request *rq);
 static void as_antic_stop(struct as_data *ad);
+static inline int as_batch_expired(struct as_data *ad, struct as_queue *asq);
+
+#ifdef CONFIG_IOSCHED_AS_HIER
+static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Save batch data dir */
+	asq->saved_batch_data_dir = ad->batch_data_dir;
+
+	if (ad->changed_batch) {
+		/*
+		 * In case of force expire, we come here. Batch changeover
+		 * has been signalled but we are waiting for all the
+		 * request to finish from previous batch and then start
+		 * the new batch. Can't wait now. Mark that full batch time
+		 * needs to be allocated when this queue is scheduled again.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->changed_batch = 0;
+		goto out;
+	}
+
+	if (ad->new_batch) {
+		/*
+		 * We should come here only when new_batch has been set
+		 * but no read request has been issued or if it is a forced
+		 * expiry.
+		 *
+		 * In both the cases, new batch has not started yet so
+		 * allocate full batch length for next scheduling opportunity.
+		 * We don't do write batch size adjustment in hierarchical
+		 * AS so that should not be an issue.
+		 */
+		asq->current_batch_time_left =
+				ad->batch_expire[ad->batch_data_dir];
+		ad->new_batch = 0;
+		goto out;
+	}
+
+	/* Save how much time is left before current batch expires */
+	if (as_batch_expired(ad, asq))
+		asq->current_batch_time_left = 0;
+	else {
+		asq->current_batch_time_left = ad->current_batch_expires
+							- jiffies;
+		BUG_ON((asq->current_batch_time_left) < 0);
+	}
+
+	if (ad->io_context) {
+		put_io_context(ad->io_context);
+		ad->io_context = NULL;
+	}
+
+out:
+	as_log(ad, "save batch: dir=%c time_left=%ld changed_batch=%d"
+			" new_batch=%d, antic_status=%d",
+			ad->batch_data_dir ? 'R' : 'W',
+			asq->current_batch_time_left,
+			ad->changed_batch, ad->new_batch, ad->antic_status);
+	return;
+}
+
+/*
+ * FIXME: In original AS, read batch's time account started only after when
+ * first request had completed (if last batch was a write batch). But here
+ * we might be rescheduling a read batch right away irrespective of the fact
+ * of disk cache state.
+ */
+static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
+{
+	/* Adjust the batch expire time */
+	if (asq->current_batch_time_left)
+		ad->current_batch_expires = jiffies +
+						asq->current_batch_time_left;
+	/* restore asq batch_data_dir info */
+	ad->batch_data_dir = asq->saved_batch_data_dir;
+	as_log(ad, "restore batch: dir=%c time=%ld reads_q=%d writes_q=%d"
+			" ad->antic_status=%d",
+			ad->batch_data_dir ? 'R' : 'W',
+			asq->current_batch_time_left,
+			asq->nr_queued[1], asq->nr_queued[0],
+			ad->antic_status);
+}
+
+/* ioq has been set. */
+static void as_active_ioq_set(struct request_queue *q, void *sched_queue,
+				int coop)
+{
+	struct as_queue *asq = sched_queue;
+	struct as_data *ad = q->elevator->elevator_data;
+
+	as_restore_batch_context(ad, asq);
+}
+
+/*
+ * This is a notification from common layer that it wishes to expire this
+ * io queue. AS decides whether queue can be expired, if yes, it also
+ * saves the batch context.
+ */
+static int as_expire_ioq(struct request_queue *q, void *sched_queue,
+				int slice_expired, int force)
+{
+	struct as_data *ad = q->elevator->elevator_data;
+	int status = ad->antic_status;
+	struct as_queue *asq = sched_queue;
+
+	as_log(ad, "as_expire_ioq slice_expired=%d, force=%d", slice_expired,
+		force);
+
+	/* Forced expiry. We don't have a choice */
+	if (force) {
+		as_antic_stop(ad);
+		/*
+		 * antic_stop() sets antic_status to FINISHED which signifies
+		 * that either we timed out or we found a close request but
+		 * that's not the case here. Start from scratch.
+		 */
+		ad->antic_status = ANTIC_OFF;
+		as_save_batch_context(ad, asq);
+		ad->switch_queue = 0;
+		return 1;
+	}
+
+	/*
+	 * We are waiting for requests to finish from last
+	 * batch. Don't expire the queue now
+	 */
+	if (ad->changed_batch)
+		goto keep_queue;
+
+	/*
+	 * Wait for all requests from existing batch to finish before we
+	 * switch the queue. New queue might change the batch direction
+	 * and this is to be consistent with AS philosophy of not dispatching
+	 * new requests to underlying drive till requests
+	 * from previous batch are completed.
+	 */
+	if (ad->nr_dispatched)
+		goto keep_queue;
+
+	/*
+	 * If AS anticipation is ON, wait for it to finish.
+	 */
+	BUG_ON(status == ANTIC_WAIT_REQ);
+
+	if (status == ANTIC_WAIT_NEXT)
+		goto keep_queue;
+
+	/* We are good to expire the queue. Save batch context */
+	as_save_batch_context(ad, asq);
+	ad->switch_queue = 0;
+	return 1;
+
+keep_queue:
+	/* Mark that elevator requested for queue switch whenever possible */
+	ad->switch_queue = 1;
+	return 0;
+}
+#endif
 
 /*
  * IO Context helper functions
@@ -429,6 +609,7 @@ static void as_antic_waitnext(struct as_data *ad)
 	mod_timer(&ad->antic_timer, timeout);
 
 	ad->antic_status = ANTIC_WAIT_NEXT;
+	as_log(ad, "antic_waitnext set");
 }
 
 /*
@@ -442,8 +623,10 @@ static void as_antic_waitreq(struct as_data *ad)
 	if (ad->antic_status == ANTIC_OFF) {
 		if (!ad->io_context || ad->ioc_finished)
 			as_antic_waitnext(ad);
-		else
+		else {
 			ad->antic_status = ANTIC_WAIT_REQ;
+			as_log(ad, "antic_waitreq set");
+		}
 	}
 }
 
@@ -455,6 +638,8 @@ static void as_antic_stop(struct as_data *ad)
 {
 	int status = ad->antic_status;
 
+	as_log(ad, "as_antic_stop antic_status=%d", ad->antic_status);
+
 	if (status == ANTIC_WAIT_REQ || status == ANTIC_WAIT_NEXT) {
 		if (status == ANTIC_WAIT_NEXT)
 			del_timer(&ad->antic_timer);
@@ -474,6 +659,7 @@ static void as_antic_timeout(unsigned long data)
 	unsigned long flags;
 
 	spin_lock_irqsave(q->queue_lock, flags);
+	as_log(ad, "as_antic_timeout");
 	if (ad->antic_status == ANTIC_WAIT_REQ
 			|| ad->antic_status == ANTIC_WAIT_NEXT) {
 		struct as_io_context *aic;
@@ -652,6 +838,21 @@ static int as_can_break_anticipation(struct as_data *ad, struct request *rq)
 	struct io_context *ioc;
 	struct as_io_context *aic;
 
+#ifdef CONFIG_IOSCHED_AS_HIER
+	/*
+	 * If the active asq and rq's asq are not the same, then one cannot
+	 * break the anticipation. This is primarily useful when a
+	 * request is added to a queue which is not being served currently.
+	 */
+	if (rq) {
+		struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
+		struct as_queue *curr_asq =
+				elv_active_sched_queue(ad->q->elevator);
+
+		if (asq != curr_asq)
+			return 0;
+	}
+#endif
 	ioc = ad->io_context;
 	BUG_ON(!ioc);
 	spin_lock(&ioc->lock);
@@ -810,16 +1011,20 @@ static void as_update_rq(struct as_data *ad, struct request *rq)
 /*
  * Gathers timings and resizes the write batch automatically
  */
-static void update_write_batch(struct as_data *ad)
+static void update_write_batch(struct as_data *ad, struct request *rq)
 {
 	unsigned long batch = ad->batch_expire[BLK_RW_ASYNC];
 	long write_time;
-	struct as_queue *asq = elv_get_sched_queue(ad->q, NULL);
+	struct as_queue *asq = elv_get_sched_queue(ad->q, rq);
 
 	write_time = (jiffies - ad->current_batch_expires) + batch;
 	if (write_time < 0)
 		write_time = 0;
 
+	as_log(ad, "upd write: write_time=%ld batch=%lu write_batch_idled=%d"
+			" current_write_count=%d", write_time, batch,
+			asq->write_batch_idled, asq->current_write_count);
+
 	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
 			asq->write_batch_count /= 2;
@@ -834,6 +1039,8 @@ static void update_write_batch(struct as_data *ad)
 
 	if (asq->write_batch_count < 1)
 		asq->write_batch_count = 1;
+
+	as_log(ad, "upd write count=%d", asq->write_batch_count);
 }
 
 /*
@@ -843,6 +1050,7 @@ static void update_write_batch(struct as_data *ad)
 static void as_completed_request(struct request_queue *q, struct request *rq)
 {
 	struct as_data *ad = q->elevator->elevator_data;
+	struct as_queue *asq = elv_get_sched_queue(q, rq);
 
 	WARN_ON(!list_empty(&rq->queuelist));
 
@@ -851,7 +1059,24 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 		goto out;
 	}
 
+	as_log(ad, "complete: reads_q=%d writes_q=%d changed_batch=%d"
+		" new_batch=%d switch_queue=%d, dir=%c",
+		asq->nr_queued[1], asq->nr_queued[0], ad->changed_batch,
+		ad->new_batch, ad->switch_queue,
+		ad->batch_data_dir ? 'R' : 'W');
+
 	if (ad->changed_batch && ad->nr_dispatched == 1) {
+		/*
+		 * If this was write batch finishing, adjust the write batch
+		 * length.
+		 *
+		 * Note, write batch length is calculated upon completion of
+		 * the last write request of this batch, not upon completion
+		 * of the first read request of the next batch.
+		 */
+		if (ad->batch_data_dir == BLK_RW_SYNC)
+			update_write_batch(ad, rq);
+
 		ad->current_batch_expires = jiffies +
 					ad->batch_expire[ad->batch_data_dir];
 		kblockd_schedule_work(q, &ad->antic_work);
@@ -869,7 +1094,6 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	 * and writeback caches
 	 */
 	if (ad->new_batch && ad->batch_data_dir == rq_is_sync(rq)) {
-		update_write_batch(ad);
 		ad->current_batch_expires = jiffies +
 				ad->batch_expire[BLK_RW_SYNC];
 		ad->new_batch = 0;
@@ -888,6 +1112,13 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 	}
 
 	as_put_io_context(rq);
+
+	/*
+	 * If the elevator requested a queue switch, kick the queue in the
+	 * hope that this is the right time for the switch.
+	 */
+	if (ad->switch_queue)
+		kblockd_schedule_work(q, &ad->antic_work);
 out:
 	RQ_SET_STATE(rq, AS_RQ_POSTSCHED);
 }
@@ -908,6 +1139,9 @@ static void as_remove_queued_request(struct request_queue *q,
 
 	WARN_ON(RQ_STATE(rq) != AS_RQ_QUEUED);
 
+	BUG_ON(asq->nr_queued[data_dir] <= 0);
+	asq->nr_queued[data_dir]--;
+
 	ioc = RQ_IOC(rq);
 	if (ioc && ioc->aic) {
 		BUG_ON(!atomic_read(&ioc->aic->nr_queued));
@@ -1019,6 +1253,8 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 	if (RQ_IOC(rq) && RQ_IOC(rq)->aic)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 	ad->nr_dispatched++;
+	as_log(ad, "dispatch req dir=%c nr_dispatched = %d",
+			data_dir ? 'R' : 'W', ad->nr_dispatched);
 }
 
 /*
@@ -1066,6 +1302,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		}
 		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
+		as_log(ad, "forced dispatch");
 		return dispatched;
 	}
 
@@ -1078,8 +1315,14 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	if (!(reads || writes)
 		|| ad->antic_status == ANTIC_WAIT_REQ
 		|| ad->antic_status == ANTIC_WAIT_NEXT
-		|| ad->changed_batch)
+		|| ad->changed_batch) {
+		as_log(ad, "no dispatch. read_q=%d, writes_q=%d"
+			" ad->antic_status=%d, changed_batch=%d,"
+			" switch_queue=%d new_batch=%d", asq->nr_queued[1],
+			asq->nr_queued[0], ad->antic_status, ad->changed_batch,
+			ad->switch_queue, ad->new_batch);
 		return 0;
+	}
 
 	if (!(reads && writes && as_batch_expired(ad, asq))) {
 		/*
@@ -1092,6 +1335,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
+				as_log(ad, "can_anticipate = 1");
 				as_antic_waitreq(ad);
 				return 0;
 			}
@@ -1111,6 +1355,8 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 * data direction (read / write)
 	 */
 
+	as_log(ad, "select a fresh batch and request");
+
 	if (reads) {
 		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
 
@@ -1125,6 +1371,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
+		as_log(ad, "new batch dir is sync");
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
 		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
@@ -1149,6 +1396,7 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
+		as_log(ad, "new batch dir is async");
 		asq->current_write_count = asq->write_batch_count;
 		asq->write_batch_idled = 0;
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
@@ -1184,6 +1432,9 @@ fifo_expired:
 		ad->changed_batch = 0;
 	}
 
+	if (ad->switch_queue)
+		return 0;
+
 	/*
 	 * rq is the selected appropriate request.
 	 */
@@ -1207,6 +1458,11 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 
 	rq->elevator_private = as_get_io_context(q->node);
 
+	asq->nr_queued[data_dir]++;
+	as_log(ad, "add a %c request read_q=%d write_q=%d",
+			data_dir ? 'R' : 'W', asq->nr_queued[1],
+			asq->nr_queued[0]);
+
 	if (RQ_IOC(rq)) {
 		as_update_iohist(ad, RQ_IOC(rq)->aic, rq);
 		atomic_inc(&RQ_IOC(rq)->aic->nr_queued);
@@ -1408,6 +1664,7 @@ static void *as_init_queue(struct request_queue *q)
 	ad->batch_expire[BLK_RW_ASYNC] = default_write_batch_expire;
 
 	ad->current_batch_expires = jiffies + ad->batch_expire[BLK_RW_SYNC];
+	ad->switch_queue = 0;
 
 	return ad;
 }
@@ -1493,6 +1750,11 @@ static struct elv_fs_entry as_attrs[] = {
 	AS_ATTR(antic_expire),
 	AS_ATTR(read_batch_expire),
 	AS_ATTR(write_batch_expire),
+#ifdef CONFIG_IOSCHED_AS_HIER
+	ELV_ATTR(fairness),
+	ELV_ATTR(slice_idle),
+	ELV_ATTR(slice_sync),
+#endif
 	__ATTR_NULL
 };
 
@@ -1514,8 +1776,14 @@ static struct elevator_type iosched_as = {
 		.trim =				as_trim,
 		.elevator_alloc_sched_queue_fn = as_alloc_as_queue,
 		.elevator_free_sched_queue_fn = as_free_as_queue,
+#ifdef CONFIG_IOSCHED_AS_HIER
+		.elevator_expire_ioq_fn =       as_expire_ioq,
+		.elevator_active_ioq_set_fn =   as_active_ioq_set,
 	},
-
+	.elevator_features = ELV_IOSCHED_NEED_FQ | ELV_IOSCHED_SINGLE_IOQ,
+#else
+	},
+#endif
 	.elevator_attrs = as_attrs,
 	.elevator_name = "anticipatory",
 	.elevator_owner = THIS_MODULE,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 25fdac6..c1d04b1 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2523,6 +2523,7 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 		elv_clear_ioq_must_dispatch(ioq);
 		elv_clear_ioq_wait_busy_done(ioq);
 		elv_mark_ioq_slice_new(ioq);
+		elv_clear_ioq_must_expire(ioq);
 
 		del_timer(&efqd->idle_slice_timer);
 	}
@@ -2646,6 +2647,49 @@ static inline unsigned long elv_disk_time_used(struct request_queue *q,
 }
 
 /*
+ * Tell the iosched that the elevator wants to expire the queue. This lets an
+ * iosched like AS say no (if it is in the middle of a batch changeover or it
+ * is anticipating). It also allows the iosched to do some housekeeping.
+ *
+ * force--> it is a forced dispatch and the iosched must clean up its state.
+ * 	     This is useful when the elevator wants to drain the iosched and
+ * 	     expire the current active queue.
+ *
+ * slice_expired--> if 1, the ioq slice has expired, hence the elevator fair
+ * 		    queuing logic wants to switch the queue. The iosched should
+ * 		    allow that unless denial is necessary. Currently AS can
+ * 		    deny the switch if in the middle of a batch switch.
+ *
+ * 		    if 0, time slice is still remaining. It is up to the iosched
+ * 		    whether it wants to wait on this queue or just wants to
+ * 		    expire it and move on to the next queue.
+ *
+ */
+static int elv_iosched_expire_ioq(struct request_queue *q, int slice_expired,
+					int force)
+{
+	struct elevator_queue *e = q->elevator;
+	struct io_queue *ioq = elv_active_ioq(q->elevator);
+	int ret = 1;
+
+	if (e->ops->elevator_expire_ioq_fn) {
+		ret = e->ops->elevator_expire_ioq_fn(q, ioq->sched_queue,
+							slice_expired, force);
+		/*
+		 * AS denied expiration of the queue right now. Mark that the
+		 * elevator layer has asked the ioscheduler (AS) to expire this
+		 * queue; AS will then expire it as soon as it can. From now
+		 * on, don't dispatch from this queue even if a new request
+		 * arrives and time slice is left. Do expire it once.
+		 */
+		if (!ret)
+			elv_mark_ioq_must_expire(ioq);
+	}
+
+	return ret;
+}
+
+/*
  * Do the accounting. Determine how much service (in terms of time slices)
  * current queue used and adjust the start, finish time of queue and vtime
  * of the tree accordingly.
@@ -2683,6 +2727,7 @@ void __elv_ioq_slice_expired(struct request_queue *q, struct io_queue *ioq)
 	elv_clear_ioq_wait_request(ioq);
 	elv_clear_ioq_wait_busy(ioq);
 	elv_clear_ioq_wait_busy_done(ioq);
+	elv_clear_ioq_must_expire(ioq);
 
 	/*
 	 * if ioq->slice_end = 0, that means a queue was expired before first
@@ -2821,16 +2866,18 @@ static int elv_should_preempt(struct request_queue *q, struct io_queue *new_ioq,
 static void elv_preempt_queue(struct request_queue *q, struct io_queue *ioq)
 {
 	elv_log_ioq(&q->elevator->efqd, ioq, "preempt");
-	elv_ioq_slice_expired(q);
+	if (elv_iosched_expire_ioq(q, 0, 1)) {
+		elv_ioq_slice_expired(q);
 
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
+		/*
+		 * Put the new queue at the front of the current list,
+		 * so we know that it will be selected next.
+		 */
 
-	elv_activate_ioq(ioq, 1);
-	ioq->slice_end = 0;
-	elv_mark_ioq_slice_new(ioq);
+		elv_activate_ioq(ioq, 1);
+		ioq->slice_end = 0;
+		elv_mark_ioq_slice_new(ioq);
+	}
 }
 
 void elv_ioq_request_add(struct request_queue *q, struct request *rq)
@@ -3035,6 +3082,8 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 	struct io_queue *new_ioq = NULL, *ioq = elv_active_ioq(q->elevator);
 	struct io_group *iog;
+	struct elevator_type *e = q->elevator->elevator_type;
+	int slice_expired = 1;
 
 	if (!elv_nr_busy_ioq(q->elevator))
 		return NULL;
@@ -3053,6 +3102,10 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 			goto expire;
 	}
 
+	/* This queue has been marked for expiry. Try to expire it */
+	if (elv_ioq_must_expire(ioq))
+		goto expire;
+
 	/*
 	 * If there is only root group present, don't expire the queue for
 	 * single queue ioschedulers (noop, deadline, AS). It is unnecessary
@@ -3142,8 +3195,10 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 		goto keep_queue;
 	}
 
+	slice_expired = 0;
 expire:
-	if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched) {
+	if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched
+	    && strcmp(e->elevator_name, "anticipatory")) {
 		/*
 		 * If there are request dispatched from this queue, don't
 		 * dispatch requests from new queue till all the requests from
@@ -3154,6 +3209,11 @@ expire:
 		 *
 		 * Set ioq = NULL so that no more requests are dispatched from
 		 * this queue.
+		 *
+		 * Note: Anticipatory already has the behavior where queue
+		 * switch is not allowed until requests from previous queue
+		 * have finished. Hence we don't have to get into this loop
+		 * in case of AS.
 		 */
 		elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
 				" disp=%lu", ioq->dispatched);
@@ -3161,7 +3221,14 @@ expire:
 		goto keep_queue;
 	}
 
-	elv_ioq_slice_expired(q);
+	if (elv_iosched_expire_ioq(q, slice_expired, force))
+		elv_ioq_slice_expired(q);
+	else
+		/*
+		 * Not making ioq = NULL, as AS can deny queue expiration and
+		 * continue to dispatch from same queue
+		 */
+		goto keep_queue;
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
@@ -3298,7 +3365,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 		}
 
 		if (elv_ioq_class_idle(ioq)) {
-			elv_ioq_slice_expired(q);
+			if (elv_iosched_expire_ioq(q, 1, 0))
+				elv_ioq_slice_expired(q);
 			goto done;
 		}
 
@@ -3343,7 +3411,8 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 					goto done;
 
 				/* Expire the queue */
-				elv_ioq_slice_expired(q);
+				if (elv_iosched_expire_ioq(q, 1, 0))
+					elv_ioq_slice_expired(q);
 			}
 		} else if (!ioq->nr_queued && !elv_close_cooperator(q, ioq, 1)
 			 && sync && (!rq_noidle(rq) || efqd->fairness))
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index baa6cee..c117d40 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -371,6 +371,8 @@ enum elv_queue_state_flags {
 	ELV_QUEUE_FLAG_slice_new,	  /* no requests dispatched in slice */
 	ELV_QUEUE_FLAG_wait_busy,	  /* wait for this queue to get busy */
 	ELV_QUEUE_FLAG_wait_busy_done,	  /* Have already waited on this queue*/
+	ELV_QUEUE_FLAG_must_expire,       /* Expire this queue even if it has
+					   * requests and time slice left */
 };
 
 #define ELV_IO_QUEUE_FLAG_FNS(name)					\
@@ -395,6 +397,7 @@ ELV_IO_QUEUE_FLAG_FNS(idle_window)
 ELV_IO_QUEUE_FLAG_FNS(slice_new)
 ELV_IO_QUEUE_FLAG_FNS(wait_busy)
 ELV_IO_QUEUE_FLAG_FNS(wait_busy_done)
+ELV_IO_QUEUE_FLAG_FNS(must_expire)
 
 static inline struct io_service_tree *
 io_entity_service_tree(struct io_entity *entity)
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 6d2c8db..dda7951 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -42,6 +42,7 @@ typedef int (elevator_update_idle_window_fn) (struct elevator_queue*, void*,
 						struct request*);
 typedef struct io_queue* (elevator_close_cooperator_fn) (struct request_queue*,
 						void*, int probe);
+typedef int (elevator_expire_ioq_fn) (struct request_queue*, void *, int, int);
 #endif
 
 struct elevator_ops
@@ -81,6 +82,7 @@ struct elevator_ops
 	elevator_should_preempt_fn *elevator_should_preempt_fn;
 	elevator_update_idle_window_fn *elevator_update_idle_window_fn;
 	elevator_close_cooperator_fn *elevator_close_cooperator_fn;
+	elevator_expire_ioq_fn  *elevator_expire_ioq_fn;
 #endif
 };
 
-- 
1.6.0.6


* [PATCH 19/25] blkio_cgroup patches from Ryo to track async bios.
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (17 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 18/25] io-controller: anticipatory " Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 20/25] io-controller: map async requests to appropriate cgroup Vivek Goyal
                     ` (8 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o blkio_cgroup patches from Ryo to track async bios.

o Fernando is also working on another IO tracking mechanism. We are not
  tied to any particular IO tracking mechanism. This patchset can make use
  of whichever mechanism makes it upstream. For the time being, it uses
  Ryo's posting.
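
For readers new to the tracking scheme: as the biotrack.c header in this
patch notes, the blkio-cgroup ID is stored in part of page_cgroup->flags,
with the lower 16 bits kept for flags. The following is only a userspace
sketch of that packing, not the kernel code. The names pc_sketch,
sketch_get_id and sketch_set_id are hypothetical stand-ins, and the
page_cgroup locking the real helpers rely on is omitted.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-ins mirroring the patch's PCG_TRACKING_ID_* macros. */
#define SKETCH_ID_SHIFT	16
#define SKETCH_ID_BITS	(CHAR_BIT * sizeof(unsigned long) - SKETCH_ID_SHIFT)
#define SKETCH_FLAGS_MASK	((1UL << SKETCH_ID_SHIFT) - 1)

/* Userspace stand-in for struct page_cgroup; only the flags word matters. */
struct pc_sketch {
	unsigned long flags;
};

/* Read the tracking ID from the bits above the flag bits. */
static unsigned long sketch_get_id(const struct pc_sketch *pc)
{
	return pc->flags >> SKETCH_ID_SHIFT;
}

/* Store a tracking ID without disturbing the low flag bits. */
static void sketch_set_id(struct pc_sketch *pc, unsigned long id)
{
	assert(id < (1UL << SKETCH_ID_BITS));	/* ID must fit above the flags */
	pc->flags &= SKETCH_FLAGS_MASK;		/* keep flags, drop old ID */
	pc->flags |= id << SKETCH_ID_SHIFT;	/* install new ID */
}
```

Setting an ID and reading it back should return the same value while
leaving the low 16 flag bits untouched, which is what lets one word serve
both the memory controller flags and the owner ID.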

Based on 2.6.30-rc3-git3
Signed-off-by: Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
Signed-off-by: Ryo Tsuruta <ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
---
 block/blk-ioc.c               |   36 +++--
 fs/buffer.c                   |    2 +
 fs/direct-io.c                |    2 +
 include/linux/biotrack.h      |  103 +++++++++++++
 include/linux/cgroup_subsys.h |    6 +
 include/linux/iocontext.h     |    1 +
 include/linux/memcontrol.h    |    6 +
 include/linux/mmzone.h        |    4 +-
 include/linux/page_cgroup.h   |   31 ++++-
 init/Kconfig                  |   15 ++
 mm/Makefile                   |    4 +-
 mm/biotrack.c                 |  321 +++++++++++++++++++++++++++++++++++++++++
 mm/bounce.c                   |    2 +
 mm/filemap.c                  |    2 +
 mm/memcontrol.c               |    6 +
 mm/memory.c                   |    5 +
 mm/page-writeback.c           |    2 +
 mm/page_cgroup.c              |   17 ++-
 mm/swap_state.c               |    2 +
 19 files changed, 537 insertions(+), 30 deletions(-)
 create mode 100644 include/linux/biotrack.h
 create mode 100644 mm/biotrack.c

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 0d56336..890d475 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -84,27 +84,31 @@ void exit_io_context(void)
 	}
 }
 
+void init_io_context(struct io_context *ioc)
+{
+	atomic_long_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+	ioc->cgroup_changed = 0;
+#endif
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
 	struct io_context *ret;
 
 	ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_long_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-#ifdef CONFIG_GROUP_IOSCHED
-		ret->cgroup_changed = 0;
-#endif
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
+	if (ret)
+		init_io_context(ret);
 
 	return ret;
 }
diff --git a/fs/buffer.c b/fs/buffer.c
index a3ef091..cb68608 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
 #include <linux/buffer_head.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/bio.h>
+#include <linux/biotrack.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
+		blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 8b10b87..185ba0a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/biotrack.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
 			ret = PTR_ERR(page);
 			goto out;
 		}
+		blkio_cgroup_reset_owner(page, current->mm);
 
 		while (block_in_page < blocks_per_page) {
 			unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
new file mode 100644
index 0000000..0b4491a
--- /dev/null
+++ b/include/linux/biotrack.h
@@ -0,0 +1,103 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef	CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+	struct cgroup_subsys_state css;
+	struct io_context *io_context;	/* default io_context */
+/*	struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc:		page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, 0);
+	unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_disabled - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+	if (blkio_cgroup_subsys.disabled)
+		return true;
+	return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						 struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id_page(struct page *page);
+extern struct cgroup *blkio_cgroup_lookup(int id);
+
+#else	/* CONFIG_CGROUP_BLKIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+	return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	return 0;
+}
+
+static inline unsigned long get_blkio_cgroup_id_page(struct page *page)
+{
+	return 0;
+}
+
+#endif	/* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index baf544f..78504f3 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
 
 /* */
 
+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
 #ifdef CONFIG_CGROUP_DEVICE
 SUBSYS(devices)
 #endif
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index ccecf53..c595842 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -104,6 +104,7 @@ int put_io_context(struct io_context *ioc);
 void exit_io_context(void);
 struct io_context *get_io_context(gfp_t gfp_flags, int node);
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
 void copy_io_context(struct io_context **pdst, struct io_context **psrc);
 #else
 static inline void exit_io_context(void)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e46a073..eb45fe9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
  * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
  */
 
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
 extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 /* for swap handling */
@@ -121,6 +123,10 @@ void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
 static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8895985..c9d1ed4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -605,7 +605,7 @@ typedef struct pglist_data {
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	struct page_cgroup *node_page_cgroup;
 #endif
 #endif
@@ -956,7 +956,7 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	/*
 	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 13f126c..f470fd2 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
 #ifndef __LINUX_PAGE_CGROUP_H
 #define __LINUX_PAGE_CGROUP_H
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 #include <linux/bit_spinlock.h>
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -12,9 +12,11 @@
  */
 struct page_cgroup {
 	unsigned long flags;
-	struct mem_cgroup *mem_cgroup;
 	struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	struct mem_cgroup *mem_cgroup;
 	struct list_head lru;		/* per cgroup LRU list */
+#endif
 };
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -83,7 +85,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 struct page_cgroup;
 
 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
@@ -138,4 +140,27 @@ static inline void swap_cgroup_swapoff(int type)
 }
 
 #endif
+
+#ifdef CONFIG_CGROUP_BLKIO
+/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PCG_TRACKING_ID_SHIFT	(16)
+#define PCG_TRACKING_ID_BITS \
+	(8 * sizeof(unsigned long) - PCG_TRACKING_ID_SHIFT)
+
+/* NOTE: must be called with lock_page_cgroup() held */
+static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
+{
+	return pc->flags >> PCG_TRACKING_ID_SHIFT;
+}
+
+/* NOTE: must be called with lock_page_cgroup() held */
+static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
+{
+	WARN_ON(id >= (1UL << PCG_TRACKING_ID_BITS));
+	pc->flags &= (1UL << PCG_TRACKING_ID_SHIFT) - 1;
+	pc->flags |= (unsigned long)(id << PCG_TRACKING_ID_SHIFT);
+}
+#endif
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index eaa44db..1241018 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -622,6 +622,21 @@ config GROUP_IOSCHED
 
 endif # CGROUPS
 
+config CGROUP_BLKIO
+	bool "Block I/O cgroup subsystem"
+	depends on CGROUPS && BLOCK
+	select MM_OWNER
+	help
+	  Provides a Resource Controller which makes it possible to track
+	  the owner of every block I/O request.
+	  The information this subsystem provides can be used by any
+	  kind of module, such as the dm-ioband device mapper module or
+	  the cfq-scheduler.
+
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
+
 config MM_OWNER
 	bool
 
diff --git a/mm/Makefile b/mm/Makefile
index 5e0bd64..6208744 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,6 +39,8 @@ else
 obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..320f511
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,321 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
+ * Use part of page_cgroup->flags to store blkio-cgroup ID.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the blkio_cgroup that associates with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup that associates with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+	.io_context	= &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *biog;
+	struct page_cgroup *pc;
+	unsigned long id;
+
+	if (blkio_cgroup_disabled())
+		return;
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, 0);	/* 0: default blkio_cgroup id */
+	unlock_page_cgroup(pc);
+	if (!mm)
+		return;
+
+	rcu_read_lock();
+	biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!biog)) {
+		rcu_read_unlock();
+		return;
+	}
+	/*
+	 * css_get(&biog->css) isn't called to increment the reference
+	 * count of this blkio_cgroup "biog" so the css_id might turn
+	 * invalid even if this page is still active.
+	 * This approach is chosen to minimize the overhead.
+	 */
+	id = css_id(&biog->css);
+	rcu_read_unlock();
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, id);
+	unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+	blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+	if (!page_is_file_cache(page))
+		return;
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage:	the page where we want to copy the owner
+ * @opage:	the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+	struct page_cgroup *npc, *opc;
+	unsigned long id;
+
+	if (blkio_cgroup_disabled())
+		return;
+	npc = lookup_page_cgroup(npage);
+	if (unlikely(!npc))
+		return;
+	opc = lookup_page_cgroup(opage);
+	if (unlikely(!opc))
+		return;
+
+	lock_page_cgroup(opc);
+	lock_page_cgroup(npc);
+	id = page_cgroup_get_id(opc);
+	page_cgroup_set_id(npc, id);
+	unlock_page_cgroup(npc);
+	unlock_page_cgroup(opc);
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+
+	if (!cgrp->parent) {
+		biog = &default_blkio_cgroup;
+		init_io_context(biog->io_context);
+		/* Take an extra reference so that it is never released. */
+		atomic_long_inc(&biog->io_context->refcount);
+		return &biog->css;
+	}
+
+	biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+	if (!biog)
+		return ERR_PTR(-ENOMEM);
+	ioc = alloc_io_context(GFP_KERNEL, -1);
+	if (!ioc) {
+		kfree(biog);
+		return ERR_PTR(-ENOMEM);
+	}
+	biog->io_context = ioc;
+	return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	put_io_context(biog->io_context);
+	free_css_id(&blkio_cgroup_subsys, &biog->css);
+	kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	struct page_cgroup *pc;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc) {
+		lock_page_cgroup(pc);
+		id = page_cgroup_get_id(pc);
+		unlock_page_cgroup(pc);
+	}
+	return id;
+}
+
+/**
+ * get_blkio_cgroup_id_page() - determine the blkio-cgroup ID
+ * @page:	the &struct page which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given page. A return value zero
+ * means that the page associated with the IO belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc) {
+		lock_page_cgroup(pc);
+		id = page_cgroup_get_id(pc);
+		unlock_page_cgroup(pc);
+	}
+	return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	struct cgroup_subsys_state *css;
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+	unsigned long id;
+
+	id = get_blkio_cgroup_id(bio);
+	rcu_read_lock();
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (css)
+		biog = container_of(css, struct blkio_cgroup, css);
+	else
+		biog = &default_blkio_cgroup;
+	ioc = biog->io_context;	/* default io_context for this cgroup */
+	atomic_long_inc(&ioc->refcount);
+	rcu_read_unlock();
+	return ioc;
+}
+
+/**
+ * blkio_cgroup_lookup() - lookup a cgroup by blkio-cgroup ID
+ * @id:		blkio-cgroup ID
+ *
+ * Returns the cgroup associated with the specified ID, or NULL if lookup
+ * fails.
+ *
+ * Note:
+ * This function should be called under rcu_read_lock().
+ */
+struct cgroup *blkio_cgroup_lookup(int id)
+{
+	struct cgroup *cgrp;
+	struct cgroup_subsys_state *css;
+
+	if (blkio_cgroup_disabled())
+		return NULL;
+
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (!css)
+		return NULL;
+	cgrp = css->cgroup;
+	return cgrp;
+}
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(blkio_cgroup_lookup);
+
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+	unsigned long id;
+
+	rcu_read_lock();
+	id = css_id(&biog->css);
+	rcu_read_unlock();
+	return (u64)id;
+}
+
+
+static struct cftype blkio_files[] = {
+	{
+		.name = "id",
+		.read_u64 = blkio_id_read,
+	},
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return cgroup_add_files(cgrp, ss, blkio_files,
+					ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+	.name		= "blkio",
+	.create		= blkio_cgroup_create,
+	.destroy	= blkio_cgroup_destroy,
+	.populate	= blkio_cgroup_populate,
+	.subsys_id	= blkio_cgroup_subsys_id,
+	.use_id		= 1,
+};
diff --git a/mm/bounce.c b/mm/bounce.c
index a2b76a5..422d89c 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
 #include <linux/hash.h>
 #include <linux/highmem.h>
 #include <asm/tlbflush.h>
+#include <linux/biotrack.h>
 
 #include <trace/events/block.h>
 
@@ -210,6 +211,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 		to->bv_len = from->bv_len;
 		to->bv_offset = from->bv_offset;
 		inc_zone_page_state(to->bv_page, NR_BOUNCE);
+		blkio_cgroup_copy_owner(to->bv_page, page);
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 2239671..c2356cd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
+	blkio_cgroup_set_owner(page, current->mm);
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2fa20d..41896f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -129,6 +129,12 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+	pc->mem_cgroup = NULL;
+	INIT_LIST_HEAD(&pc->lru);
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
diff --git a/mm/memory.c b/mm/memory.c
index f46ac18..f1451d6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mmu_notifier.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
@@ -2117,6 +2118,7 @@ gotten:
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		blkio_cgroup_set_owner(new_page, mm);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		if (old_page) {
@@ -2582,6 +2584,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	blkio_cgroup_reset_owner(page, mm);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -2646,6 +2649,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	page_add_new_anon_rmap(page, vma, address);
+	blkio_cgroup_set_owner(page, mm);
 	set_pte_at(mm, address, page_table, entry);
 
 	/* No need to invalidate - it was non-present before */
@@ -2793,6 +2797,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			page_add_new_anon_rmap(page, vma, address);
+			blkio_cgroup_set_owner(page, mm);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7b0dcea..31d3675 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1244,6 +1245,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
+			blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index f22b4eb..2883bb7 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,14 +9,15 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <linux/biotrack.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
-	INIT_LIST_HEAD(&pc->lru);
+	__init_mem_page_cgroup(pc);
+	__init_blkio_page_cgroup(pc);
 }
 static unsigned long total_usage;
 
@@ -74,7 +75,7 @@ void __init page_cgroup_init_flatmem(void)
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -83,12 +84,12 @@ void __init page_cgroup_init_flatmem(void)
 			goto fail;
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
+	printk(KERN_INFO "please try cgroup_disable=memory,blkio option if you"
 	" don't want memory cgroups\n");
 	return;
 fail:
 	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
-	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+	printk(KERN_CRIT "please try cgroup_disable=memory,blkio boot options\n");
 	panic("Out of memory");
 }
 
@@ -245,7 +246,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -260,8 +261,8 @@ void __init page_cgroup_init(void)
 		hotplug_memory_notifier(page_cgroup_callback, 0);
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
-	" want memory cgroups\n");
+	printk(KERN_INFO "please try cgroup_disable=memory,blkio option"
+	" if you don't want memory and io cgroups\n");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 42cd38e..6eb96f1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
 
 #include <asm/pgtable.h>
 
@@ -307,6 +308,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*
-- 
1.6.0.6


* [PATCH 19/25] blkio_cgroup patches from Ryo to track async bios.
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

o blkio_cgroup patches from Ryo to track async bios.

o Fernando is also working on another IO tracking mechanism. We are not
  tied to any particular IO tracking mechanism; this patchset can use
  whichever mechanism makes it upstream. For the time being it uses
  Ryo's posting.

Based on 2.6.30-rc3-git3
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
---
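For readers skimming the series: the page_cgroup.h hunk of this patch keeps
the lower 16 bits of page_cgroup->flags for flag bits and stores the
blkio-cgroup ID in the remaining upper bits. The following is only a rough
userspace sketch of that packing, not kernel code. The names
TRACKING_ID_SHIFT, FLAGS_MASK, flags_get_id() and flags_set_id() are
hypothetical stand-ins for PCG_TRACKING_ID_SHIFT, page_cgroup_get_id() and
page_cgroup_set_id(), and the sketch returns the new word instead of
updating a locked structure in place.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical userspace model of the flags/ID packing; not kernel code. */
#define TRACKING_ID_SHIFT 16
#define TRACKING_ID_BITS  (CHAR_BIT * sizeof(unsigned long) - TRACKING_ID_SHIFT)
#define FLAGS_MASK        ((1UL << TRACKING_ID_SHIFT) - 1)

/* Extract the tracking ID stored above the low 16 flag bits. */
static unsigned long flags_get_id(unsigned long flags)
{
	return flags >> TRACKING_ID_SHIFT;
}

/* Return a new word with the same low flag bits and the given ID. */
static unsigned long flags_set_id(unsigned long flags, unsigned long id)
{
	/* The ID must fit in the bits left over after the flag bits. */
	assert(id < (1UL << TRACKING_ID_BITS));
	return (flags & FLAGS_MASK) | (id << TRACKING_ID_SHIFT);
}
```

Setting the ID to 0 therefore leaves only the flag bits behind, which is
how the patch marks a page as belonging to the default blkio_cgroup.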
 block/blk-ioc.c               |   36 +++--
 fs/buffer.c                   |    2 +
 fs/direct-io.c                |    2 +
 include/linux/biotrack.h      |  103 +++++++++++++
 include/linux/cgroup_subsys.h |    6 +
 include/linux/iocontext.h     |    1 +
 include/linux/memcontrol.h    |    6 +
 include/linux/mmzone.h        |    4 +-
 include/linux/page_cgroup.h   |   31 ++++-
 init/Kconfig                  |   15 ++
 mm/Makefile                   |    4 +-
 mm/biotrack.c                 |  321 +++++++++++++++++++++++++++++++++++++++++
 mm/bounce.c                   |    2 +
 mm/filemap.c                  |    2 +
 mm/memcontrol.c               |    6 +
 mm/memory.c                   |    5 +
 mm/page-writeback.c           |    2 +
 mm/page_cgroup.c              |   17 ++-
 mm/swap_state.c               |    2 +
 19 files changed, 537 insertions(+), 30 deletions(-)
 create mode 100644 include/linux/biotrack.h
 create mode 100644 mm/biotrack.c
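
A note on the lookup path in mm/biotrack.c: blkio_cgroup_set_owner() does
not take a css reference, so the ID stored in a page can outlive its
cgroup, and get_blkio_cgroup_iocontext() falls back to
default_blkio_cgroup when css_lookup() finds nothing. The sketch below
models only that fallback in plain userspace C. Everything in it
(struct group, the fixed-size groups[] table, group_create(),
group_destroy(), group_for_page_id()) is hypothetical; it stands in for
the css ID lookup and ignores RCU and reference counting entirely. It also
assumes, as the patch comments suggest, that ID 0 always means the default
group.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUPS 8

/* Hypothetical stand-in for a blkio_cgroup and its default io_context. */
struct group {
	int in_use;
	int io_context;
};

static struct group groups[MAX_GROUPS];		/* slot 0 is never used */
static struct group default_group = { 1, -1 };	/* models ID 0 */

static void group_create(unsigned long id, int io_context)
{
	assert(id > 0 && id < MAX_GROUPS);
	groups[id].in_use = 1;
	groups[id].io_context = io_context;
}

static void group_destroy(unsigned long id)
{
	assert(id > 0 && id < MAX_GROUPS);
	groups[id].in_use = 0;
}

/* Map the ID read from a page to a group, using the default if stale. */
static struct group *group_for_page_id(unsigned long id)
{
	if (id == 0 || id >= MAX_GROUPS || !groups[id].in_use)
		return &default_group;
	return &groups[id];
}
```

With this model, a page tagged with ID 3 resolves to group 3 while that
group exists and to the default group once it has been destroyed, which is
the trade-off the comment in blkio_cgroup_set_owner() describes.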

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 0d56336..890d475 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -84,27 +84,31 @@ void exit_io_context(void)
 	}
 }
 
+void init_io_context(struct io_context *ioc)
+{
+	atomic_long_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+	ioc->cgroup_changed = 0;
+#endif
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
 	struct io_context *ret;
 
 	ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_long_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-#ifdef CONFIG_GROUP_IOSCHED
-		ret->cgroup_changed = 0;
-#endif
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
+	if (ret)
+		init_io_context(ret);
 
 	return ret;
 }
diff --git a/fs/buffer.c b/fs/buffer.c
index a3ef091..cb68608 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
 #include <linux/buffer_head.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/bio.h>
+#include <linux/biotrack.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
+		blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 8b10b87..185ba0a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/biotrack.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
 			ret = PTR_ERR(page);
 			goto out;
 		}
+		blkio_cgroup_reset_owner(page, current->mm);
 
 		while (block_in_page < blocks_per_page) {
 			unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
new file mode 100644
index 0000000..0b4491a
--- /dev/null
+++ b/include/linux/biotrack.h
@@ -0,0 +1,103 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef	CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+	struct cgroup_subsys_state css;
+	struct io_context *io_context;	/* default io_context */
+/*	struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc:		page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, 0);
+	unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_disabled - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+	if (blkio_cgroup_subsys.disabled)
+		return true;
+	return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						 struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id_page(struct page *page);
+extern struct cgroup *blkio_cgroup_lookup(int id);
+
+#else	/* !CONFIG_CGROUP_BLKIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+	return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	return 0;
+}
+
+static inline unsigned long get_blkio_cgroup_id_page(struct page *page)
+{
+	return 0;
+}
+
+#endif	/* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index baf544f..78504f3 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
 
 /* */
 
+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
 #ifdef CONFIG_CGROUP_DEVICE
 SUBSYS(devices)
 #endif
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index ccecf53..c595842 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -104,6 +104,7 @@ int put_io_context(struct io_context *ioc);
 void exit_io_context(void);
 struct io_context *get_io_context(gfp_t gfp_flags, int node);
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
 void copy_io_context(struct io_context **pdst, struct io_context **psrc);
 #else
 static inline void exit_io_context(void)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e46a073..eb45fe9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
  * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
  */
 
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
 extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 /* for swap handling */
@@ -121,6 +123,10 @@ void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
 static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8895985..c9d1ed4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -605,7 +605,7 @@ typedef struct pglist_data {
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	struct page_cgroup *node_page_cgroup;
 #endif
 #endif
@@ -956,7 +956,7 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	/*
 	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 13f126c..f470fd2 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
 #ifndef __LINUX_PAGE_CGROUP_H
 #define __LINUX_PAGE_CGROUP_H
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 #include <linux/bit_spinlock.h>
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -12,9 +12,11 @@
  */
 struct page_cgroup {
 	unsigned long flags;
-	struct mem_cgroup *mem_cgroup;
 	struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	struct mem_cgroup *mem_cgroup;
 	struct list_head lru;		/* per cgroup LRU list */
+#endif
 };
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -83,7 +85,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 struct page_cgroup;
 
 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
@@ -138,4 +140,27 @@ static inline void swap_cgroup_swapoff(int type)
 }
 
 #endif
+
+#ifdef CONFIG_CGROUP_BLKIO
+/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PCG_TRACKING_ID_SHIFT	(16)
+#define PCG_TRACKING_ID_BITS \
+	(8 * sizeof(unsigned long) - PCG_TRACKING_ID_SHIFT)
+
+/* NOTE: must be called with lock_page_cgroup() held */
+static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
+{
+	return pc->flags >> PCG_TRACKING_ID_SHIFT;
+}
+
+/* NOTE: must be called with lock_page_cgroup() held */
+static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
+{
+	WARN_ON(id >= (1UL << PCG_TRACKING_ID_BITS));
+	pc->flags &= (1UL << PCG_TRACKING_ID_SHIFT) - 1;
+	pc->flags |= (unsigned long)(id << PCG_TRACKING_ID_SHIFT);
+}
+#endif
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index eaa44db..1241018 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -622,6 +622,21 @@ config GROUP_IOSCHED
 
 endif # CGROUPS
 
+config CGROUP_BLKIO
+	bool "Block I/O cgroup subsystem"
+	depends on CGROUPS && BLOCK
+	select MM_OWNER
+	help
+	  Provides a resource controller that tracks the owner of every
+	  block I/O request.
+	  The information this subsystem provides can be used by any
+	  kind of module, such as the dm-ioband device-mapper module or
+	  the cfq scheduler.
+
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
+
 config MM_OWNER
 	bool
 
diff --git a/mm/Makefile b/mm/Makefile
index 5e0bd64..6208744 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,6 +39,8 @@ else
 obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..320f511
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,321 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka@valinux.co.jp>
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea@gmail.com>
+ * Use part of page_cgroup->flags to store blkio-cgroup ID.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the blkio_cgroup that associates with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup that associates with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+	.io_context	= &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *biog;
+	struct page_cgroup *pc;
+	unsigned long id;
+
+	if (blkio_cgroup_disabled())
+		return;
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, 0);	/* 0: default blkio_cgroup id */
+	unlock_page_cgroup(pc);
+	if (!mm)
+		return;
+
+	rcu_read_lock();
+	biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!biog)) {
+		rcu_read_unlock();
+		return;
+	}
+	/*
+	 * css_get(&biog->css) isn't called to increment the reference
+	 * count of this blkio_cgroup "biog" so the css_id might turn
+	 * invalid even if this page is still active.
+	 * This approach is chosen to minimize the overhead.
+	 */
+	id = css_id(&biog->css);
+	rcu_read_unlock();
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, id);
+	unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+	blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+	if (!page_is_file_cache(page))
+		return;
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage:	the page where we want to copy the owner
+ * @opage:	the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+	struct page_cgroup *npc, *opc;
+	unsigned long id;
+
+	if (blkio_cgroup_disabled())
+		return;
+	npc = lookup_page_cgroup(npage);
+	if (unlikely(!npc))
+		return;
+	opc = lookup_page_cgroup(opage);
+	if (unlikely(!opc))
+		return;
+
+	lock_page_cgroup(opc);
+	lock_page_cgroup(npc);
+	id = page_cgroup_get_id(opc);
+	page_cgroup_set_id(npc, id);
+	unlock_page_cgroup(npc);
+	unlock_page_cgroup(opc);
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+
+	if (!cgrp->parent) {
+		biog = &default_blkio_cgroup;
+		init_io_context(biog->io_context);
+		/* Take an extra reference so that it is never released. */
+		atomic_long_inc(&biog->io_context->refcount);
+		return &biog->css;
+	}
+
+	biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+	if (!biog)
+		return ERR_PTR(-ENOMEM);
+	ioc = alloc_io_context(GFP_KERNEL, -1);
+	if (!ioc) {
+		kfree(biog);
+		return ERR_PTR(-ENOMEM);
+	}
+	biog->io_context = ioc;
+	return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	put_io_context(biog->io_context);
+	free_css_id(&blkio_cgroup_subsys, &biog->css);
+	kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	struct page_cgroup *pc;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc) {
+		lock_page_cgroup(pc);
+		id = page_cgroup_get_id(pc);
+		unlock_page_cgroup(pc);
+	}
+	return id;
+}
+
+/**
+ * get_blkio_cgroup_id_page() - determine the blkio-cgroup ID
+ * @page:	the &struct page which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given page. A return value zero
+ * means that the page associated with the IO belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc) {
+		lock_page_cgroup(pc);
+		id = page_cgroup_get_id(pc);
+		unlock_page_cgroup(pc);
+	}
+	return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	struct cgroup_subsys_state *css;
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+	unsigned long id;
+
+	id = get_blkio_cgroup_id(bio);
+	rcu_read_lock();
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (css)
+		biog = container_of(css, struct blkio_cgroup, css);
+	else
+		biog = &default_blkio_cgroup;
+	ioc = biog->io_context;	/* default io_context for this cgroup */
+	atomic_long_inc(&ioc->refcount);
+	rcu_read_unlock();
+	return ioc;
+}
+
+/**
+ * blkio_cgroup_lookup() - lookup a cgroup by blkio-cgroup ID
+ * @id:		blkio-cgroup ID
+ *
+ * Returns the cgroup associated with the specified ID, or NULL if lookup
+ * fails.
+ *
+ * Note:
+ * This function should be called under rcu_read_lock().
+ */
+struct cgroup *blkio_cgroup_lookup(int id)
+{
+	struct cgroup *cgrp;
+	struct cgroup_subsys_state *css;
+
+	if (blkio_cgroup_disabled())
+		return NULL;
+
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (!css)
+		return NULL;
+	cgrp = css->cgroup;
+	return cgrp;
+}
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(blkio_cgroup_lookup);
+
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+	unsigned long id;
+
+	rcu_read_lock();
+	id = css_id(&biog->css);
+	rcu_read_unlock();
+	return (u64)id;
+}
+
+
+static struct cftype blkio_files[] = {
+	{
+		.name = "id",
+		.read_u64 = blkio_id_read,
+	},
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return cgroup_add_files(cgrp, ss, blkio_files,
+					ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+	.name		= "blkio",
+	.create		= blkio_cgroup_create,
+	.destroy	= blkio_cgroup_destroy,
+	.populate	= blkio_cgroup_populate,
+	.subsys_id	= blkio_cgroup_subsys_id,
+	.use_id		= 1,
+};
diff --git a/mm/bounce.c b/mm/bounce.c
index a2b76a5..422d89c 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
 #include <linux/hash.h>
 #include <linux/highmem.h>
 #include <asm/tlbflush.h>
+#include <linux/biotrack.h>
 
 #include <trace/events/block.h>
 
@@ -210,6 +211,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 		to->bv_len = from->bv_len;
 		to->bv_offset = from->bv_offset;
 		inc_zone_page_state(to->bv_page, NR_BOUNCE);
+		blkio_cgroup_copy_owner(to->bv_page, page);
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 2239671..c2356cd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
+	blkio_cgroup_set_owner(page, current->mm);
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2fa20d..41896f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -129,6 +129,12 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+	pc->mem_cgroup = NULL;
+	INIT_LIST_HEAD(&pc->lru);
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
diff --git a/mm/memory.c b/mm/memory.c
index f46ac18..f1451d6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mmu_notifier.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
@@ -2117,6 +2118,7 @@ gotten:
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		blkio_cgroup_set_owner(new_page, mm);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		if (old_page) {
@@ -2582,6 +2584,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	blkio_cgroup_reset_owner(page, mm);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -2646,6 +2649,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	page_add_new_anon_rmap(page, vma, address);
+	blkio_cgroup_set_owner(page, mm);
 	set_pte_at(mm, address, page_table, entry);
 
 	/* No need to invalidate - it was non-present before */
@@ -2793,6 +2797,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			page_add_new_anon_rmap(page, vma, address);
+			blkio_cgroup_set_owner(page, mm);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7b0dcea..31d3675 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1244,6 +1245,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
+			blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index f22b4eb..2883bb7 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,14 +9,15 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <linux/biotrack.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
-	INIT_LIST_HEAD(&pc->lru);
+	__init_mem_page_cgroup(pc);
+	__init_blkio_page_cgroup(pc);
 }
 static unsigned long total_usage;
 
@@ -74,7 +75,7 @@ void __init page_cgroup_init_flatmem(void)
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -83,12 +84,12 @@ void __init page_cgroup_init_flatmem(void)
 			goto fail;
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
+	printk(KERN_INFO "please try cgroup_disable=memory,blkio option if you"
 	" don't want memory cgroups\n");
 	return;
 fail:
 	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
-	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+	printk(KERN_CRIT "please try cgroup_disable=memory,blkio boot options\n");
 	panic("Out of memory");
 }
 
@@ -245,7 +246,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -260,8 +261,8 @@ void __init page_cgroup_init(void)
 		hotplug_memory_notifier(page_cgroup_callback, 0);
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
-	" want memory cgroups\n");
+	printk(KERN_INFO "please try cgroup_disable=memory,blkio option"
+	" if you don't want memory and io cgroups\n");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 42cd38e..6eb96f1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
 
 #include <asm/pgtable.h>
 
@@ -307,6 +308,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*
-- 
1.6.0.6



* [PATCH 19/25] blkio_cgroup patches from Ryo to track async bios.
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

o blkio_cgroup patches from Ryo to track async bios.

o Fernando is also working on another IO tracking mechanism. We are not
  tied to any particular IO tracking mechanism; this patchset can use
  whichever mechanism makes it upstream. For the time being it uses
  Ryo's posting.

Based on 2.6.30-rc3-git3
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
---
 block/blk-ioc.c               |   36 +++--
 fs/buffer.c                   |    2 +
 fs/direct-io.c                |    2 +
 include/linux/biotrack.h      |  103 +++++++++++++
 include/linux/cgroup_subsys.h |    6 +
 include/linux/iocontext.h     |    1 +
 include/linux/memcontrol.h    |    6 +
 include/linux/mmzone.h        |    4 +-
 include/linux/page_cgroup.h   |   31 ++++-
 init/Kconfig                  |   15 ++
 mm/Makefile                   |    4 +-
 mm/biotrack.c                 |  321 +++++++++++++++++++++++++++++++++++++++++
 mm/bounce.c                   |    2 +
 mm/filemap.c                  |    2 +
 mm/memcontrol.c               |    6 +
 mm/memory.c                   |    5 +
 mm/page-writeback.c           |    2 +
 mm/page_cgroup.c              |   17 ++-
 mm/swap_state.c               |    2 +
 19 files changed, 537 insertions(+), 30 deletions(-)
 create mode 100644 include/linux/biotrack.h
 create mode 100644 mm/biotrack.c
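
Note for reviewers: the ID packing that the page_cgroup.h hunk below
introduces can be modelled in userspace. The following is a minimal sketch
under stated assumptions, not code from the patch: struct pc_model and the
model_* helpers are hypothetical stand-ins for struct page_cgroup,
page_cgroup_get_id() and page_cgroup_set_id(), and locking is left out
entirely. Only the two PCG_TRACKING_ID_* constants are taken from the patch.

```c
#include <assert.h>

/* Same values as the constants added to include/linux/page_cgroup.h. */
#define PCG_TRACKING_ID_SHIFT	(16)
#define PCG_TRACKING_ID_BITS \
	(8 * sizeof(unsigned long) - PCG_TRACKING_ID_SHIFT)

/* Hypothetical stand-in for struct page_cgroup; only flags matters here. */
struct pc_model {
	unsigned long flags;	/* low 16 bits: flags, remaining bits: ID */
};

/* Extract the tracking ID stored above the flag bits. */
static inline unsigned long model_get_id(const struct pc_model *pc)
{
	return pc->flags >> PCG_TRACKING_ID_SHIFT;
}

/* Store a tracking ID while leaving the low 16 flag bits untouched. */
static inline void model_set_id(struct pc_model *pc, unsigned long id)
{
	/* The kernel code only warns (WARN_ON); the model asserts instead. */
	assert(id < (1UL << PCG_TRACKING_ID_BITS));
	pc->flags &= (1UL << PCG_TRACKING_ID_SHIFT) - 1;	/* clear old ID */
	pc->flags |= id << PCG_TRACKING_ID_SHIFT;		/* install new ID */
}
```

The point of the layout is that setting or clearing an ID never disturbs
the existing PCG_* flag bits, so no extra field is needed in struct
page_cgroup. An ID of 0 is what the patch treats as the default
blkio_cgroup.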

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 0d56336..890d475 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -84,27 +84,31 @@ void exit_io_context(void)
 	}
 }
 
+void init_io_context(struct io_context *ioc)
+{
+	atomic_long_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+#ifdef CONFIG_GROUP_IOSCHED
+	ioc->cgroup_changed = 0;
+#endif
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
 	struct io_context *ret;
 
 	ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_long_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-#ifdef CONFIG_GROUP_IOSCHED
-		ret->cgroup_changed = 0;
-#endif
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
+	if (ret)
+		init_io_context(ret);
 
 	return ret;
 }
diff --git a/fs/buffer.c b/fs/buffer.c
index a3ef091..cb68608 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -36,6 +36,7 @@
 #include <linux/buffer_head.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/bio.h>
+#include <linux/biotrack.h>
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
+		blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 8b10b87..185ba0a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -33,6 +33,7 @@
 #include <linux/err.h>
 #include <linux/blkdev.h>
 #include <linux/buffer_head.h>
+#include <linux/biotrack.h>
 #include <linux/rwsem.h>
 #include <linux/uio.h>
 #include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do_direct_IO(struct dio *dio)
 			ret = PTR_ERR(page);
 			goto out;
 		}
+		blkio_cgroup_reset_owner(page, current->mm);
 
 		while (block_in_page < blocks_per_page) {
 			unsigned offset_in_page = block_in_page << blkbits;
diff --git a/include/linux/biotrack.h b/include/linux/biotrack.h
new file mode 100644
index 0000000..0b4491a
--- /dev/null
+++ b/include/linux/biotrack.h
@@ -0,0 +1,103 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/page_cgroup.h>
+
+#ifndef _LINUX_BIOTRACK_H
+#define _LINUX_BIOTRACK_H
+
+#ifdef	CONFIG_CGROUP_BLKIO
+
+struct io_context;
+struct block_device;
+
+struct blkio_cgroup {
+	struct cgroup_subsys_state css;
+	struct io_context *io_context;	/* default io_context */
+/*	struct radix_tree_root io_context_root; per device io_context */
+};
+
+/**
+ * __init_blkio_page_cgroup() - initialize a blkio_page_cgroup
+ * @pc:		page_cgroup of the page
+ *
+ * Reset the owner ID of a page.
+ */
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, 0);
+	unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_disabled() - check whether blkio_cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio_cgroup_disabled(void)
+{
+	if (blkio_cgroup_subsys.disabled)
+		return true;
+	return false;
+}
+
+extern void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm);
+extern void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						 struct mm_struct *mm);
+extern void blkio_cgroup_copy_owner(struct page *page, struct page *opage);
+
+extern struct io_context *get_blkio_cgroup_iocontext(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id(struct bio *bio);
+extern unsigned long get_blkio_cgroup_id_page(struct page *page);
+extern struct cgroup *blkio_cgroup_lookup(int id);
+
+#else	/* CONFIG_CGROUP_BLKIO */
+
+struct blkio_cgroup;
+
+static inline void __init_blkio_page_cgroup(struct page_cgroup *pc)
+{
+}
+
+static inline bool blkio_cgroup_disabled(void)
+{
+	return true;
+}
+
+static inline void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_reset_owner_pagedirty(struct page *page,
+						struct mm_struct *mm)
+{
+}
+
+static inline void blkio_cgroup_copy_owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	return NULL;
+}
+
+static inline unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	return 0;
+}
+
+static inline unsigned long get_blkio_cgroup_id_page(struct page *page)
+{
+	return 0;
+}
+
+#endif	/* CONFIG_CGROUP_BLKIO */
+
+#endif /* _LINUX_BIOTRACK_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index baf544f..78504f3 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -43,6 +43,12 @@ SUBSYS(mem_cgroup)
 
 /* */
 
+#ifdef CONFIG_CGROUP_BLKIO
+SUBSYS(blkio_cgroup)
+#endif
+
+/* */
+
 #ifdef CONFIG_CGROUP_DEVICE
 SUBSYS(devices)
 #endif
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index ccecf53..c595842 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -104,6 +104,7 @@ int put_io_context(struct io_context *ioc);
 void exit_io_context(void);
 struct io_context *get_io_context(gfp_t gfp_flags, int node);
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node);
+void init_io_context(struct io_context *ioc);
 void copy_io_context(struct io_context **pdst, struct io_context **psrc);
 #else
 static inline void exit_io_context(void)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e46a073..eb45fe9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -37,6 +37,8 @@ struct mm_struct;
  * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
  */
 
+extern void __init_mem_page_cgroup(struct page_cgroup *pc);
+
 extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 /* for swap handling */
@@ -121,6 +123,10 @@ void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
+static inline void __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+}
+
 static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8895985..c9d1ed4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -605,7 +605,7 @@ typedef struct pglist_data {
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	struct page_cgroup *node_page_cgroup;
 #endif
 #endif
@@ -956,7 +956,7 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 	/*
 	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
 	 * section. (see memcontrol.h/page_cgroup.h about this.)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 13f126c..f470fd2 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -1,7 +1,7 @@
 #ifndef __LINUX_PAGE_CGROUP_H
 #define __LINUX_PAGE_CGROUP_H
 
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+#ifdef CONFIG_CGROUP_PAGE
 #include <linux/bit_spinlock.h>
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -12,9 +12,11 @@
  */
 struct page_cgroup {
 	unsigned long flags;
-	struct mem_cgroup *mem_cgroup;
 	struct page *page;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	struct mem_cgroup *mem_cgroup;
 	struct list_head lru;		/* per cgroup LRU list */
+#endif
 };
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
@@ -83,7 +85,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 struct page_cgroup;
 
 static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
@@ -138,4 +140,27 @@ static inline void swap_cgroup_swapoff(int type)
 }
 
 #endif
+
+#ifdef CONFIG_CGROUP_BLKIO
+/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PCG_TRACKING_ID_SHIFT	(16)
+#define PCG_TRACKING_ID_BITS \
+	(8 * sizeof(unsigned long) - PCG_TRACKING_ID_SHIFT)
+
+/* NOTE: must be called with lock_page_cgroup() held */
+static inline unsigned long page_cgroup_get_id(struct page_cgroup *pc)
+{
+	return pc->flags >> PCG_TRACKING_ID_SHIFT;
+}
+
+/* NOTE: must be called with lock_page_cgroup() held */
+static inline void page_cgroup_set_id(struct page_cgroup *pc, unsigned long id)
+{
+	WARN_ON(id >= (1UL << PCG_TRACKING_ID_BITS));
+	pc->flags &= (1UL << PCG_TRACKING_ID_SHIFT) - 1;
+	pc->flags |= (unsigned long)(id << PCG_TRACKING_ID_SHIFT);
+}
+#endif
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index eaa44db..1241018 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -622,6 +622,21 @@ config GROUP_IOSCHED
 
 endif # CGROUPS
 
+config CGROUP_BLKIO
+	bool "Block I/O cgroup subsystem"
+	depends on CGROUPS && BLOCK
+	select MM_OWNER
+	help
+	  Provides a resource controller that tracks the owner of
+	  every block I/O request.
+	  The information this subsystem provides can be used by any
+	  kind of module, such as the dm-ioband device-mapper module or
+	  the CFQ scheduler.
+
+config CGROUP_PAGE
+	def_bool y
+	depends on CGROUP_MEM_RES_CTLR || CGROUP_BLKIO
+
 config MM_OWNER
 	bool
 
diff --git a/mm/Makefile b/mm/Makefile
index 5e0bd64..6208744 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,6 +39,8 @@ else
 obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_PAGE) += page_cgroup.o
+obj-$(CONFIG_CGROUP_BLKIO) += biotrack.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
diff --git a/mm/biotrack.c b/mm/biotrack.c
new file mode 100644
index 0000000..320f511
--- /dev/null
+++ b/mm/biotrack.c
@@ -0,0 +1,321 @@
+/* biotrack.c - Block I/O Tracking
+ *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka@valinux.co.jp>
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea@gmail.com>
+ * Use part of page_cgroup->flags to store blkio-cgroup ID.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm_inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the blkio_cgroup associated with a cgroup. */
+static inline struct blkio_cgroup *cgroup_blkio(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+/* Return the blkio_cgroup associated with a process. */
+static inline struct blkio_cgroup *blkio_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, blkio_cgroup_subsys_id),
+					struct blkio_cgroup, css);
+}
+
+static struct io_context default_blkio_io_context;
+static struct blkio_cgroup default_blkio_cgroup = {
+	.io_context	= &default_blkio_io_context,
+};
+
+/**
+ * blkio_cgroup_set_owner() - set the owner ID of a page.
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio_cgroup_set_owner(struct page *page, struct mm_struct *mm)
+{
+	struct blkio_cgroup *biog;
+	struct page_cgroup *pc;
+	unsigned long id;
+
+	if (blkio_cgroup_disabled())
+		return;
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, 0);	/* 0: default blkio_cgroup id */
+	unlock_page_cgroup(pc);
+	if (!mm)
+		return;
+
+	rcu_read_lock();
+	biog = blkio_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!biog)) {
+		rcu_read_unlock();
+		return;
+	}
+	/*
+	 * css_get(&biog->css) isn't called to increment the reference
+	 * count of this blkio_cgroup "biog", so the css_id might become
+	 * invalid even while this page is still active.
+	 * This approach was chosen to minimize the overhead.
+	 */
+	id = css_id(&biog->css);
+	rcu_read_unlock();
+	lock_page_cgroup(pc);
+	page_cgroup_set_id(pc, id);
+	unlock_page_cgroup(pc);
+}
+
+/**
+ * blkio_cgroup_reset_owner() - reset the owner ID of a page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio_cgroup_reset_owner(struct page *page, struct mm_struct *mm)
+{
+	blkio_cgroup_set_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_reset_owner_pagedirty() - reset the owner ID of a pagecache page
+ * @page:	the page we want to tag
+ * @mm:		the mm_struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio_cgroup_reset_owner_pagedirty(struct page *page, struct mm_struct *mm)
+{
+	if (!page_is_file_cache(page))
+		return;
+	if (current->flags & PF_MEMALLOC)
+		return;
+
+	blkio_cgroup_reset_owner(page, mm);
+}
+
+/**
+ * blkio_cgroup_copy_owner() - copy the owner ID of a page into another page
+ * @npage:	the page where we want to copy the owner
+ * @opage:	the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio_cgroup_copy_owner(struct page *npage, struct page *opage)
+{
+	struct page_cgroup *npc, *opc;
+	unsigned long id;
+
+	if (blkio_cgroup_disabled())
+		return;
+	npc = lookup_page_cgroup(npage);
+	if (unlikely(!npc))
+		return;
+	opc = lookup_page_cgroup(opage);
+	if (unlikely(!opc))
+		return;
+
+	lock_page_cgroup(opc);
+	lock_page_cgroup(npc);
+	id = page_cgroup_get_id(opc);
+	page_cgroup_set_id(npc, id);
+	unlock_page_cgroup(npc);
+	unlock_page_cgroup(opc);
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup_subsys_state *
+blkio_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+
+	if (!cgrp->parent) {
+		biog = &default_blkio_cgroup;
+		init_io_context(biog->io_context);
+		/* Increment the reference count so that it is never released. */
+		atomic_long_inc(&biog->io_context->refcount);
+		return &biog->css;
+	}
+
+	biog = kzalloc(sizeof(*biog), GFP_KERNEL);
+	if (!biog)
+		return ERR_PTR(-ENOMEM);
+	ioc = alloc_io_context(GFP_KERNEL, -1);
+	if (!ioc) {
+		kfree(biog);
+		return ERR_PTR(-ENOMEM);
+	}
+	biog->io_context = ioc;
+	return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+
+	put_io_context(biog->io_context);
+	free_css_id(&blkio_cgroup_subsys, &biog->css);
+	kfree(biog);
+}
+
+/**
+ * get_blkio_cgroup_id() - determine the blkio-cgroup ID
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value of zero
+ * means that the page associated with the bio belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id(struct bio *bio)
+{
+	struct page_cgroup *pc;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc) {
+		lock_page_cgroup(pc);
+		id = page_cgroup_get_id(pc);
+		unlock_page_cgroup(pc);
+	}
+	return id;
+}
+
+/**
+ * get_blkio_cgroup_id_page() - determine the blkio-cgroup ID
+ * @page:	the &struct page which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given page. A return value of zero
+ * means that the page associated with the IO belongs to default_blkio_cgroup.
+ */
+unsigned long get_blkio_cgroup_id_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	unsigned long id = 0;
+
+	pc = lookup_page_cgroup(page);
+	if (pc) {
+		lock_page_cgroup(pc);
+		id = page_cgroup_get_id(pc);
+		unlock_page_cgroup(pc);
+	}
+	return id;
+}
+
+/**
+ * get_blkio_cgroup_iocontext() - determine the blkio-cgroup iocontext
+ * @bio:	the &struct bio which describes the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io_context *get_blkio_cgroup_iocontext(struct bio *bio)
+{
+	struct cgroup_subsys_state *css;
+	struct blkio_cgroup *biog;
+	struct io_context *ioc;
+	unsigned long id;
+
+	id = get_blkio_cgroup_id(bio);
+	rcu_read_lock();
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (css)
+		biog = container_of(css, struct blkio_cgroup, css);
+	else
+		biog = &default_blkio_cgroup;
+	ioc = biog->io_context;	/* default io_context for this cgroup */
+	atomic_long_inc(&ioc->refcount);
+	rcu_read_unlock();
+	return ioc;
+}
+
+/**
+ * blkio_cgroup_lookup() - lookup a cgroup by blkio-cgroup ID
+ * @id:		blkio-cgroup ID
+ *
+ * Returns the cgroup associated with the specified ID, or NULL if lookup
+ * fails.
+ *
+ * Note:
+ * This function should be called under rcu_read_lock().
+ */
+struct cgroup *blkio_cgroup_lookup(int id)
+{
+	struct cgroup *cgrp;
+	struct cgroup_subsys_state *css;
+
+	if (blkio_cgroup_disabled())
+		return NULL;
+
+	css = css_lookup(&blkio_cgroup_subsys, id);
+	if (!css)
+		return NULL;
+	cgrp = css->cgroup;
+	return cgrp;
+}
+EXPORT_SYMBOL(get_blkio_cgroup_iocontext);
+EXPORT_SYMBOL(get_blkio_cgroup_id);
+EXPORT_SYMBOL(blkio_cgroup_lookup);
+
+static u64 blkio_id_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct blkio_cgroup *biog = cgroup_blkio(cgrp);
+	unsigned long id;
+
+	rcu_read_lock();
+	id = css_id(&biog->css);
+	rcu_read_unlock();
+	return (u64)id;
+}
+
+
+static struct cftype blkio_files[] = {
+	{
+		.name = "id",
+		.read_u64 = blkio_id_read,
+	},
+};
+
+static int blkio_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return cgroup_add_files(cgrp, ss, blkio_files,
+					ARRAY_SIZE(blkio_files));
+}
+
+struct cgroup_subsys blkio_cgroup_subsys = {
+	.name		= "blkio",
+	.create		= blkio_cgroup_create,
+	.destroy	= blkio_cgroup_destroy,
+	.populate	= blkio_cgroup_populate,
+	.subsys_id	= blkio_cgroup_subsys_id,
+	.use_id		= 1,
+};
diff --git a/mm/bounce.c b/mm/bounce.c
index a2b76a5..422d89c 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -14,6 +14,7 @@
 #include <linux/hash.h>
 #include <linux/highmem.h>
 #include <asm/tlbflush.h>
+#include <linux/biotrack.h>
 
 #include <trace/events/block.h>
 
@@ -210,6 +211,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 		to->bv_len = from->bv_len;
 		to->bv_offset = from->bv_offset;
 		inc_zone_page_state(to->bv_page, NR_BOUNCE);
+		blkio_cgroup_copy_owner(to->bv_page, page);
 
 		if (rw == WRITE) {
 			char *vto, *vfrom;
diff --git a/mm/filemap.c b/mm/filemap.c
index 2239671..c2356cd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
 #include <linux/cpuset.h>
 #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include "internal.h"
 
@@ -464,6 +465,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 					gfp_mask & GFP_RECLAIM_MASK);
 	if (error)
 		goto out;
+	blkio_cgroup_set_owner(page, current->mm);
 
 	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error == 0) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2fa20d..41896f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -129,6 +129,12 @@ struct mem_cgroup_lru_info {
 	struct mem_cgroup_per_node *nodeinfo[MAX_NUMNODES];
 };
 
+void __meminit __init_mem_page_cgroup(struct page_cgroup *pc)
+{
+	pc->mem_cgroup = NULL;
+	INIT_LIST_HEAD(&pc->lru);
+}
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
diff --git a/mm/memory.c b/mm/memory.c
index f46ac18..f1451d6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -51,6 +51,7 @@
 #include <linux/init.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
 #include <linux/mmu_notifier.h>
 #include <linux/kallsyms.h>
 #include <linux/swapops.h>
@@ -2117,6 +2118,7 @@ gotten:
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		blkio_cgroup_set_owner(new_page, mm);
 		set_pte_at(mm, address, page_table, entry);
 		update_mmu_cache(vma, address, entry);
 		if (old_page) {
@@ -2582,6 +2584,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	blkio_cgroup_reset_owner(page, mm);
 	/* It's better to call commit-charge after rmap is established */
 	mem_cgroup_commit_charge_swapin(page, ptr);
 
@@ -2646,6 +2649,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto release;
 	inc_mm_counter(mm, anon_rss);
 	page_add_new_anon_rmap(page, vma, address);
+	blkio_cgroup_set_owner(page, mm);
 	set_pte_at(mm, address, page_table, entry);
 
 	/* No need to invalidate - it was non-present before */
@@ -2793,6 +2797,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
 			page_add_new_anon_rmap(page, vma, address);
+			blkio_cgroup_set_owner(page, mm);
 		} else {
 			inc_mm_counter(mm, file_rss);
 			page_add_file_rmap(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7b0dcea..31d3675 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -23,6 +23,7 @@
 #include <linux/init.h>
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
+#include <linux/biotrack.h>
 #include <linux/blkdev.h>
 #include <linux/mpage.h>
 #include <linux/rmap.h>
@@ -1244,6 +1245,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			BUG_ON(mapping2 != mapping);
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			account_page_dirtied(page, mapping);
+			blkio_cgroup_reset_owner_pagedirty(page, current->mm);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index f22b4eb..2883bb7 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -9,14 +9,15 @@
 #include <linux/vmalloc.h>
 #include <linux/cgroup.h>
 #include <linux/swapops.h>
+#include <linux/biotrack.h>
 
 static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
 	pc->page = pfn_to_page(pfn);
-	INIT_LIST_HEAD(&pc->lru);
+	__init_mem_page_cgroup(pc);
+	__init_blkio_page_cgroup(pc);
 }
 static unsigned long total_usage;
 
@@ -74,7 +75,7 @@ void __init page_cgroup_init_flatmem(void)
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -83,12 +84,12 @@ void __init page_cgroup_init_flatmem(void)
 			goto fail;
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
+	printk(KERN_INFO "please try cgroup_disable=memory,blkio option if you"
 	" don't want memory cgroups\n");
 	return;
 fail:
 	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
-	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
+	printk(KERN_CRIT "please try cgroup_disable=memory,blkio boot options\n");
 	panic("Out of memory");
 }
 
@@ -245,7 +246,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int fail = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() && blkio_cgroup_disabled())
 		return;
 
 	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
@@ -260,8 +261,8 @@ void __init page_cgroup_init(void)
 		hotplug_memory_notifier(page_cgroup_callback, 0);
 	}
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
-	" want memory cgroups\n");
+	printk(KERN_INFO "please try cgroup_disable=memory,blkio option"
+	" if you don't want memory and io cgroups\n");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 42cd38e..6eb96f1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
+#include <linux/biotrack.h>
 
 #include <asm/pgtable.h>
 
@@ -307,6 +308,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
+		blkio_cgroup_set_owner(new_page, current->mm);
 		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/*
-- 
1.6.0.6

* [PATCH 20/25] io-controller: map async requests to appropriate cgroup
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (18 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 19/25] blkio_cgroup patches from Ryo to track async bios Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 21/25] io-controller: Per cgroup request descriptor support Vivek Goyal
                     ` (7 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o So far we assumed that a bio/rq belongs to the task that submits it.
  This does not hold for async writes. This patch makes use of the
  blkio_cgroup patches to attribute async writes to the right group instead
  of to the task submitting the bio.

o For sync requests, we continue to assume that the io belongs to the task
  submitting it. Only for async requests do we use the io tracking patches
  to determine the owner cgroup.

o So far cfq has always cached the async queue pointer. With async requests
  no longer necessarily tied to the submitting task's io context, caching
  the pointer does not help for async queues. This patch introduces a new
  config option, CONFIG_TRACK_ASYNC_CONTEXT. If this option is not set, cfq
  retains the old behavior, where the async queue pointer is cached in the
  task context. If it is set, the async queue pointer is not cached; instead
  we use the bio tracking patches to determine the group the bio belongs to
  and then map it to the async queue of that group.
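
The routing rules described above can be modeled in a few lines of
stand-alone C. This is only a sketch under stated assumptions, not the
kernel code: every toy_* name is hypothetical, the structures are
simplified stand-ins for bio and io_group, and the "untracked page maps
to root group" case corresponds to the create=1 path only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's structures. */
struct toy_group { const char *name; };

struct toy_bio {
	bool is_read;                 /* reads are always treated as sync */
	bool sync_flag;               /* write carrying the sync bit */
	bool barrier;                 /* barrier bios go to the root group */
	struct toy_group *page_owner; /* owner recorded by io tracking, or NULL */
};

static struct toy_group root_group = { "root" };

/* Same test as the sync check in the patch: a read, or a sync write. */
static bool toy_bio_sync(const struct toy_bio *bio)
{
	return bio->is_read || bio->sync_flag;
}

/*
 * Pick the group a bio is charged to.
 * track_async models CONFIG_TRACK_ASYNC_CONTEXT being set.
 */
static struct toy_group *toy_pick_group(const struct toy_bio *bio,
					struct toy_group *submitter,
					bool track_async)
{
	assert(submitter != NULL);

	if (!bio)                     /* no bio info: use task context */
		return submitter;
	if (bio->barrier)             /* barriers map to the root group */
		return &root_group;
	if (toy_bio_sync(bio))        /* sync io belongs to the submitter */
		return submitter;
	if (track_async)              /* async io: use the tracked page owner */
		return bio->page_owner ? bio->page_owner : &root_group;
	return submitter;             /* old behavior without tracking */
}

/* The per-task context may cache a queue pointer only in these cases. */
static bool toy_may_cache_queue(bool is_sync, bool track_async)
{
	return is_sync || !track_async;
}
```

With tracking on, an async write whose page is owned by group B is
charged to B even when a task in group A (for example a flusher thread)
submits it, and the async queue pointer is looked up per request rather
than cached in the submitter's context.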

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched    |   16 +++++
 block/as-iosched.c       |    2 +-
 block/blk-core.c         |    7 +-
 block/cfq-iosched.c      |  152 ++++++++++++++++++++++++++++++++++++----------
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |  123 ++++++++++++++++++++++++++++++++-----
 block/elevator-fq.h      |   20 ++++--
 block/elevator.c         |   15 +++--
 include/linux/elevator.h |   21 ++++++-
 9 files changed, 287 insertions(+), 71 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 77fc786..0677099 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -124,6 +124,22 @@ config DEFAULT_IOSCHED
 	default "cfq" if DEFAULT_CFQ
 	default "noop" if DEFAULT_NOOP
 
+config TRACK_ASYNC_CONTEXT
+	bool "Determine async request context from bio"
+	depends on GROUP_IOSCHED
+	select CGROUP_BLKIO
+	default n
+	---help---
+	  Normally async request is attributed to the task submitting the
+	  request. With group ioscheduling, for accurate accounting of
+	  async writes, one needs to map the request to original task/cgroup
+	  which originated the request and not the submitter of the request.
+
+	  Currently there are generic io tracking patches to provide facility
+	  to map bio to original owner. If this option is set, for async
+	  request, original owner of the bio is decided by using io tracking
+	  patches otherwise we continue to attribute the request to the
+	  submitting thread.
 endmenu
 
 endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 18a61bb..213f3e3 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1501,7 +1501,7 @@ as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
-	struct as_queue *asq = elv_get_sched_queue_current(q);
+	struct as_queue *asq = elv_get_sched_queue_bio(q, bio);
 
 	if (!asq)
 		return ELEVATOR_NO_MERGE;
diff --git a/block/blk-core.c b/block/blk-core.c
index b06cf5c..6d8b4dd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -628,7 +628,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 }
 
 static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+					gfp_t gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 
@@ -640,7 +641,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
 	rq->cmd_flags = flags | REQ_ALLOCED;
 
 	if (priv) {
-		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
 			mempool_free(rq, q->rq.rq_pool);
 			return NULL;
 		}
@@ -781,7 +782,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 98a35fd..a40a2fa 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -160,8 +160,8 @@ CFQ_CFQQ_FNS(coop);
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
-				       struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio *bio,
+					int, struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 
@@ -171,22 +171,56 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 	return cic->cfqq[!!is_sync];
 }
 
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
-				struct cfq_queue *cfqq, int is_sync)
-{
-	cic->cfqq[!!is_sync] = cfqq;
-}
-
 /*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
+ * Determine the cfq queue bio should go in. This is primarily used by
+ * front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and ioprio_class from the task
+ * submitting the async bio. Later, save the task information in the
+ * page_cgroup and retrieve the task's ioprio and class from there.
  */
-static inline int cfq_bio_sync(struct bio *bio)
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+		struct cfq_io_context *cic, struct bio *bio, int is_sync)
 {
-	if (bio_data_dir(bio) == READ || bio_sync(bio))
-		return 1;
+	struct cfq_queue *cfqq = NULL;
 
-	return 0;
+	cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+		struct io_group *iog;
+		/*
+		 * async bio tracking is enabled and we are not caching
+		 * async queue pointer in cic.
+		 */
+		iog = io_get_io_group_bio(cfqd->queue, bio, 0);
+		if (!iog) {
+			/*
+			 * May be this is first rq/bio and io group has not
+			 * been setup yet.
+			 */
+			return NULL;
+		}
+		return io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
+	return cfqq;
+}
+
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+				struct cfq_queue *cfqq, int is_sync)
+{
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * Don't cache async queue pointer as now one io context might
+	 * be submitting async io for various different async queues
+	 */
+	if (!is_sync)
+		return;
+#endif
+	cic->cfqq[!!is_sync] = cfqq;
 }
 
 static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
@@ -499,7 +533,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 	if (!cic)
 		return NULL;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
@@ -581,7 +615,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	/*
 	 * Disallow merge of a sync bio into an async request.
 	 */
-	if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+	if (elv_bio_sync(bio) && !rq_is_sync(rq))
 		return 0;
 
 	/*
@@ -592,7 +626,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	if (!cic)
 		return 0;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq == RQ_CFQQ(rq))
 		return 1;
 
@@ -1199,14 +1233,28 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
+
 	if (cfqq) {
 		struct cfq_queue *new_cfqq;
-		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+
+		/*
+		 * Drop the reference to old queue unconditionally. Don't
+		 * worry whether new async prio queue has been allocated
+		 * or not.
+		 */
+		cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+		cfq_put_queue(cfqq);
+
+		/*
+		 * Why to allocate new queue now? Will it not be automatically
+		 * allocated whenever another async request from same context
+		 * comes? Keeping it for the time being because existing cfq
+		 * code allocates the new queue immediately upon prio change
+		 */
+		new_cfqq = cfq_get_queue(cfqd, NULL, BLK_RW_ASYNC, cic->ioc,
 						GFP_ATOMIC);
-		if (new_cfqq) {
-			cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
-			cfq_put_queue(cfqq);
-		}
+		if (new_cfqq)
+			cic_set_cfqq(cic, new_cfqq, BLK_RW_ASYNC);
 	}
 
 	cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -1239,7 +1287,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
-	iog = io_get_io_group(q, 0);
+	iog = io_get_io_group(q, NULL, 0);
 
 	if (async_cfqq != NULL) {
 		__iog = cfqq_to_io_group(async_cfqq);
@@ -1277,7 +1325,7 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
 #endif  /* CONFIG_IOSCHED_CFQ_HIER */
 
 static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
 				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
@@ -1286,12 +1334,28 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 	struct io_queue *ioq = NULL, *new_ioq = NULL;
 	struct io_group *iog = NULL;
 retry:
-	iog = io_get_io_group(q, 1);
+	iog = io_get_io_group_bio(q, bio, 1);
 
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
 
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+
+		/*
+		 * We have not cached async queue pointer as bio tracking
+		 * is enabled. Look into group async queue array using ioc
+		 * class and prio to see if somebody already allocated the
+		 * queue.
+		 */
+
+		cfqq = io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
+
 	if (!cfqq) {
 		if (new_cfqq) {
 			goto alloc_ioq;
@@ -1381,14 +1445,14 @@ out:
 }
 
 static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-					gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
+		struct io_context *ioc, gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
-	struct io_group *iog = io_get_io_group(cfqd->queue, 1);
+	struct io_group *iog = io_get_io_group_bio(cfqd->queue, bio, 1);
 
 	if (!is_sync) {
 		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
@@ -1397,7 +1461,7 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	}
 
 	if (!cfqq) {
-		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, bio, is_sync, ioc, gfp_mask);
 		if (!cfqq)
 			return NULL;
 	}
@@ -1405,8 +1469,30 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	if (!is_sync && !async_cfqq)
 		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	/* ioc reference */
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * ioc reference. If async request queue/group is determined from the
+	 * original task/cgroup and not from submitter task, io context can
+	 * not cache the pointer to async queue and everytime a request comes,
+	 * it will be determined by going through the async queue array.
+	 *
+	 * This comes from the fact that we might be getting async requests
+	 * which belong to a different cgroup altogether than the cgroup
+	 * iocontext belongs to. And this thread might be submitting bios
+	 * from various cgroups. So every time async queue will be different
+	 * based on the cgroup of the bio/rq. Can't cache the async cfqq
+	 * pointer in cic.
+	 */
+	if (is_sync)
+		elv_get_ioq(cfqq->ioq);
+#else
+	/*
+	 * async requests are being attributed to task submitting
+	 * it, hence cic can cache async cfqq pointer. Take the
+	 * queue reference even for async queue.
+	 */
 	elv_get_ioq(cfqq->ioq);
+#endif
 	return cfqq;
 }
 
@@ -1802,7 +1888,8 @@ static void cfq_put_request(struct request *rq)
  * Allocate cfq data structures associated with this request.
  */
 static int
-cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
+				gfp_t gfp_mask)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_io_context *cic;
@@ -1822,7 +1909,8 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq) {
-		cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc,
+						gfp_mask);
 
 		if (!cfqq)
 			goto queue_fail;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index ad63493..e5a94e3 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -133,7 +133,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	int ret;
 	struct deadline_queue *dq;
 
-	dq = elv_get_sched_queue_current(q);
+	dq = elv_get_sched_queue_bio(q, bio);
 	if (!dq)
 		return ELEVATOR_NO_MERGE;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index c1d04b1..899972c 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -16,6 +16,7 @@
 #include "elevator-fq.h"
 #include <linux/blktrace_api.h>
 #include <linux/seq_file.h>
+#include <linux/biotrack.h>
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -1084,6 +1085,9 @@ struct io_cgroup io_root_cgroup = {
 
 static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 {
+	if (!cgroup)
+		return &io_root_cgroup;
+
 	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
 			    struct io_cgroup, css);
 }
@@ -1517,9 +1521,60 @@ end:
 	return iog;
 }
 
+/* Map a page to respective cgroup. Null return means, map it to root cgroup */
+static inline struct cgroup *get_cgroup_from_page(struct page *page)
+{
+	unsigned long bio_cgroup_id;
+	struct cgroup *cgroup;
+
+	bio_cgroup_id = get_blkio_cgroup_id_page(page);
+
+	if (!bio_cgroup_id)
+		return NULL;
+
+	cgroup = blkio_cgroup_lookup(bio_cgroup_id);
+	return cgroup;
+}
+
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+					int create)
+{
+	struct page *page = NULL;
+
+	/*
+	 * Determine the group from task context. Even calls from
+	 * blk_get_request() which don't have any bio info will be mapped
+	 * to the task's group
+	 */
+	if (!bio)
+		goto sync;
+
+	if (bio_barrier(bio)) {
+		/*
+		 * Map barrier requests to root group. May be more special
+		 * bio cases should come here
+		 */
+		return q->elevator->efqd.root_group;
+	}
+
+	/* Map the sync bio to the right group using task context */
+	if (elv_bio_sync(bio))
+		goto sync;
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/* Determine the group from info stored in page */
+	page = bio_iovec_idx(bio, 0)->bv_page;
+	return io_get_io_group(q, page, create);
+#endif
+
+sync:
+	return io_get_io_group(q, NULL, create);
+}
+EXPORT_SYMBOL(io_get_io_group_bio);
+
 /*
- * Search for the io group current task belongs to. If create=1, then also
- * create the io group if it is not already there.
+ * Find the io group page belongs to.
+ * If "create" is set, io group is created if it is not already present.
  *
  * Note: This function should be called with queue lock held. It returns
  * a pointer to io group without taking any reference. That group will
@@ -1527,28 +1582,48 @@ end:
  * needs to get hold of queue lock). So if somebody needs to use group
  * pointer even after dropping queue lock, take a reference to the group
  * before dropping queue lock.
+ *
+ * One can call it without queue lock with rcu read lock held for browsing
+ * through the groups.
  */
-struct io_group *io_get_io_group(struct request_queue *q, int create)
+struct io_group *io_get_io_group(struct request_queue *q, struct page *page,
+					int create)
 {
 	struct cgroup *cgroup;
 	struct io_group *iog;
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 
-	assert_spin_locked(q->queue_lock);
+	if (create)
+		assert_spin_locked(q->queue_lock);
 
 	rcu_read_lock();
-	cgroup = task_cgroup(current, io_subsys_id);
-	iog = io_find_alloc_group(q, cgroup, efqd, create);
-	if (!iog) {
+
+	if (!page)
+		cgroup = task_cgroup(current, io_subsys_id);
+	else
+		cgroup = get_cgroup_from_page(page);
+
+	if (!cgroup) {
 		if (create)
 			iog = efqd->root_group;
-		else
+		else {
 			/*
 			 * bio merge functions doing lookup don't want to
 			 * map bio to root group by default
 			 */
 			iog = NULL;
+		}
+		goto out;
 	}
+
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			iog = NULL;
+	}
+out:
 	rcu_read_unlock();
 	return iog;
 }
@@ -1861,7 +1936,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
 		return 1;
 
 	/* Determine the io group of the bio submitting task */
-	iog = io_get_io_group(q, 0);
+	iog = io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
 		/* May be task belongs to a differet cgroup for which io
 		 * group has not been setup yet. */
@@ -1885,7 +1960,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
  * function is not invoked.
  */
 int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
-					gfp_t gfp_mask)
+				struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 	unsigned long flags;
@@ -1901,7 +1976,7 @@ int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
 
 retry:
 	/* Determine the io group request belongs to */
-	iog = io_get_io_group(q, 1);
+	iog = io_get_io_group_bio(q, bio, 1);
 	BUG_ON(!iog);
 
 	/* Get the iosched queue */
@@ -1986,17 +2061,17 @@ queue_fail:
 }
 
 /*
- * Find out the io queue of current task. Optimization for single ioq
+ * Find out the io queue of bio belongs to. Optimization for single ioq
  * per io group io schedulers.
  */
-struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 {
 	struct io_group *iog;
 
 	/* Determine the io group and io queue of the bio submitting task */
-	iog = io_get_io_group(q, 0);
+	iog = io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
-		/* May be task belongs to a cgroup for which io group has
+		/* May be bio belongs to a cgroup for which io group has
 		 * not been setup yet. */
 		return NULL;
 	}
@@ -2072,13 +2147,21 @@ static void io_free_root_group(struct elevator_queue *e)
 	kfree(iog);
 }
 
-struct io_group *io_get_io_group(struct request_queue *q, int create)
+struct io_group *io_get_io_group(struct request_queue *q, struct page *page,
+						int create)
 {
 	/* In flat mode, there is only root group */
 	return q->elevator->efqd.root_group;
 }
 EXPORT_SYMBOL(io_get_io_group);
 
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+						int create)
+{
+	return q->elevator->efqd.root_group;
+}
+EXPORT_SYMBOL(io_get_io_group_bio);
+
 static inline int is_only_root_group(void)
 {
 	return 1;
@@ -3232,6 +3315,10 @@ expire:
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
+	if (ioq)
+		elv_log_ioq(efqd, ioq, "select busy=%d qued=%d disp=%d",
+				elv_nr_busy_ioq(q->elevator), ioq->nr_queued,
+				elv_ioq_nr_dispatched(ioq));
 	return ioq;
 }
 
@@ -3339,7 +3426,9 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 	ioq = rq->ioq;
 	iog = ioq_to_io_group(ioq);
 
-	elv_log_ioq(efqd, ioq, "complete");
+	elv_log_ioq(efqd, ioq, "complete rq_queued=%d drv=%d disp=%d",
+				ioq->nr_queued, efqd->rq_in_driver,
+				elv_ioq_nr_dispatched(ioq));
 
 	elv_update_hw_tag(efqd);
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index c117d40..bb43444 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -521,10 +521,11 @@ static inline int update_requeue(struct io_queue *ioq, int requeue)
 }
 
 extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
-					gfp_t gfp_mask);
+					struct bio *bio, gfp_t gfp_mask);
 extern void elv_fq_unset_request_ioq(struct request_queue *q,
 					struct request *rq);
-extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio);
 
 /* Sets the single ioq associated with the io group. (noop, deadline, AS) */
 static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
@@ -559,7 +560,7 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
 }
 
 static inline int elv_fq_set_request_ioq(struct request_queue *q,
-					struct request *rq, gfp_t gfp_mask)
+			struct request *rq, struct bio *bio, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -569,7 +570,8 @@ static inline void elv_fq_unset_request_ioq(struct request_queue *q,
 {
 }
 
-static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio)
 {
 	return NULL;
 }
@@ -626,7 +628,10 @@ extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio);
 extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 					int ioprio, struct io_queue *ioq);
-extern struct io_group *io_get_io_group(struct request_queue *q, int create);
+extern struct io_group *io_get_io_group(struct request_queue *q,
+					struct page *page, int create);
+extern struct io_group *io_get_io_group_bio(struct request_queue *q,
+					struct bio *bio, int create);
 extern int elv_nr_busy_ioq(struct elevator_queue *e);
 extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
 extern void elv_free_ioq(struct io_queue *ioq);
@@ -684,7 +689,7 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 	return 1;
 }
 static inline int elv_fq_set_request_ioq(struct request_queue *q,
-					struct request *rq, gfp_t gfp_mask)
+			struct request *rq, struct bio *bio, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -694,7 +699,8 @@ static inline void elv_fq_unset_request_ioq(struct request_queue *q,
 {
 }
 
-static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio)
 {
 	return NULL;
 }
diff --git a/block/elevator.c b/block/elevator.c
index 862be80..68d5a80 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -833,7 +833,8 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 	return NULL;
 }
 
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+			struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
@@ -842,10 +843,10 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	 * ioq per io group
 	 */
 	if (elv_iosched_single_ioq(e))
-		return elv_fq_set_request_ioq(q, rq, gfp_mask);
+		return elv_fq_set_request_ioq(q, rq, bio, gfp_mask);
 
 	if (e->ops->elevator_set_req_fn)
-		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
+		return e->ops->elevator_set_req_fn(q, rq, bio, gfp_mask);
 
 	rq->elevator_private = NULL;
 	return 0;
@@ -1247,19 +1248,19 @@ void *elv_select_sched_queue(struct request_queue *q, int force)
 EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
- * Get the io scheduler queue pointer for current task.
+ * Get the io scheduler queue pointer for the group bio belongs to.
  *
  * If fair queuing is enabled, determine the io group of task and retrieve
  * the ioq pointer from that. This is used by only single queue ioschedulers
  * for retrieving the queue associated with the group to decide whether the
  * new bio can do a front merge or not.
  */
-void *elv_get_sched_queue_current(struct request_queue *q)
+void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio)
 {
 	/* Fair queuing is not enabled. There is only one queue. */
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return q->elevator->sched_queue;
 
-	return ioq_sched_queue(elv_lookup_ioq_current(q));
+	return ioq_sched_queue(elv_lookup_ioq_bio(q, bio));
 }
-EXPORT_SYMBOL(elv_get_sched_queue_current);
+EXPORT_SYMBOL(elv_get_sched_queue_bio);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index dda7951..cf6b752 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -23,7 +23,7 @@ typedef struct request *(elevator_request_list_fn) (struct request_queue *, stru
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_may_queue_fn) (struct request_queue *, int);
 
-typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
+typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, struct bio *bio, gfp_t);
 typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
@@ -148,7 +148,8 @@ extern void elv_unregister_queue(struct request_queue *q);
 extern int elv_may_queue(struct request_queue *, int);
 extern void elv_abort_queue(struct request_queue *);
 extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+					struct bio *bio, gfp_t);
 extern void elv_put_request(struct request_queue *, struct request *);
 extern void elv_drain_elevator(struct request_queue *);
 
@@ -277,6 +278,20 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-extern void *elv_get_sched_queue_current(struct request_queue *q);
+extern void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio);
+
+/*
+ * This is the equivalent of the rq_is_sync()/cfq_bio_sync() functions, where
+ * we determine whether an rq/bio is sync or not. There are cases, such as
+ * during merging and during request allocation, where we have a bio but no
+ * rq and need to find out whether this bio will be considered sync or async
+ * by the elevator/iosched. This function is useful in such cases.
+ */
+static inline int elv_bio_sync(struct bio *bio)
+{
+	if ((bio_data_dir(bio) == READ) || bio_sync(bio))
+		return 1;
+	return 0;
+}
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6

* [PATCH 20/25] io-controller: map async requests to appropriate cgroup
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

o So far we have assumed that a bio/rq belongs to the task submitting it.
  That does not hold for async writes. This patch uses the blkio_cgroup
  patches to attribute async writes to the right group instead of to the
  task submitting the bio.

o For sync requests, we continue to assume that io belongs to the task
  submitting it. Only in case of async requests, we make use of io tracking
  patches to track the owner cgroup.

o So far cfq always caches the async queue pointer. With async requests no
  longer necessarily tied to the submitting task's io context, caching the
  pointer does not help for async queues. This patch introduces a new config
  option, CONFIG_TRACK_ASYNC_CONTEXT. If this option is not set, cfq retains
  the old behavior, where the async queue pointer is cached in the task
  context. If it is set, the async queue pointer is not cached; the bio
  tracking patches are used to determine the group the bio belongs to, and
  the bio is then mapped to the async queue of that group.
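
For illustration only, the mapping rules described above can be sketched
in user space as follows. This is a minimal sketch, not the patch code:
struct mock_bio, enum group_source and pick_group_source() are
hypothetical stand-ins, and the track_async argument stands in for the
compile-time CONFIG_TRACK_ASYNC_CONTEXT option. The order of the checks
follows io_get_io_group_bio() in this patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel definitions. */
enum group_source {
	FROM_TASK,	/* group of the submitting task */
	FROM_PAGE,	/* group recorded for the bio's first page */
	FROM_ROOT,	/* root group */
};

struct mock_bio {
	int is_read;	/* data direction is READ */
	int sync_flag;	/* bio carries the sync bit */
	int barrier;	/* bio is a barrier */
};

/* Same rule as elv_bio_sync(): reads and sync-flagged bios are sync. */
static int mock_bio_sync(const struct mock_bio *bio)
{
	return bio->is_read || bio->sync_flag;
}

/*
 * Decide where the io group should come from. track_async models
 * CONFIG_TRACK_ASYNC_CONTEXT being enabled (assumption: a run-time
 * flag here, a compile-time option in the patch).
 */
static enum group_source pick_group_source(const struct mock_bio *bio,
					    int track_async)
{
	if (!bio)			/* e.g. callers with no bio info */
		return FROM_TASK;
	if (bio->barrier)		/* barriers go to the root group */
		return FROM_ROOT;
	if (mock_bio_sync(bio))		/* sync io stays with the submitter */
		return FROM_TASK;
	return track_async ? FROM_PAGE : FROM_TASK;
}
```

Note that only an async bio with tracking enabled is attributed to the
page owner; every other case falls back to the submitting task or, for
barriers, to the root group.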

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   16 +++++
 block/as-iosched.c       |    2 +-
 block/blk-core.c         |    7 +-
 block/cfq-iosched.c      |  152 ++++++++++++++++++++++++++++++++++++----------
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |  123 ++++++++++++++++++++++++++++++++-----
 block/elevator-fq.h      |   20 ++++--
 block/elevator.c         |   15 +++--
 include/linux/elevator.h |   21 ++++++-
 9 files changed, 287 insertions(+), 71 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 77fc786..0677099 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -124,6 +124,22 @@ config DEFAULT_IOSCHED
 	default "cfq" if DEFAULT_CFQ
 	default "noop" if DEFAULT_NOOP
 
+config TRACK_ASYNC_CONTEXT
+	bool "Determine async request context from bio"
+	depends on GROUP_IOSCHED
+	select CGROUP_BLKIO
+	default n
+	---help---
+	  Normally an async request is attributed to the task submitting the
+	  request. With group ioscheduling, accurate accounting of async
+	  writes requires mapping the request to the task/cgroup that
+	  originated it rather than to the submitter of the request.
+
+	  Generic io tracking patches provide a facility to map a bio to its
+	  original owner. If this option is set, the original owner of an
+	  async request's bio is determined using the io tracking patches;
+	  otherwise the request continues to be attributed to the
+	  submitting thread.
 endmenu
 
 endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 18a61bb..213f3e3 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1501,7 +1501,7 @@ as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
-	struct as_queue *asq = elv_get_sched_queue_current(q);
+	struct as_queue *asq = elv_get_sched_queue_bio(q, bio);
 
 	if (!asq)
 		return ELEVATOR_NO_MERGE;
diff --git a/block/blk-core.c b/block/blk-core.c
index b06cf5c..6d8b4dd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -628,7 +628,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 }
 
 static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+					gfp_t gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 
@@ -640,7 +641,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
 	rq->cmd_flags = flags | REQ_ALLOCED;
 
 	if (priv) {
-		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
 			mempool_free(rq, q->rq.rq_pool);
 			return NULL;
 		}
@@ -781,7 +782,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 98a35fd..a40a2fa 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -160,8 +160,8 @@ CFQ_CFQQ_FNS(coop);
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
-				       struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio *bio,
+					int, struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 
@@ -171,22 +171,56 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 	return cic->cfqq[!!is_sync];
 }
 
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
-				struct cfq_queue *cfqq, int is_sync)
-{
-	cic->cfqq[!!is_sync] = cfqq;
-}
-
 /*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
+ * Determine the cfq queue bio should go in. This is primarily used by
+ * front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and ioprio_class from the task
+ * submitting the async bio. Later, save the task information in the
+ * page_cgroup and retrieve the task's ioprio and class from there.
  */
-static inline int cfq_bio_sync(struct bio *bio)
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+		struct cfq_io_context *cic, struct bio *bio, int is_sync)
 {
-	if (bio_data_dir(bio) == READ || bio_sync(bio))
-		return 1;
+	struct cfq_queue *cfqq = NULL;
 
-	return 0;
+	cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+		struct io_group *iog;
+		/*
+		 * async bio tracking is enabled and we are not caching
+		 * async queue pointer in cic.
+		 */
+		iog = io_get_io_group_bio(cfqd->queue, bio, 0);
+		if (!iog) {
+			/*
+			 * May be this is first rq/bio and io group has not
+			 * been setup yet.
+			 */
+			return NULL;
+		}
+		return io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
+	return cfqq;
+}
+
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+				struct cfq_queue *cfqq, int is_sync)
+{
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * Don't cache async queue pointer as now one io context might
+	 * be submitting async io for various different async queues
+	 */
+	if (!is_sync)
+		return;
+#endif
+	cic->cfqq[!!is_sync] = cfqq;
 }
 
 static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
@@ -499,7 +533,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 	if (!cic)
 		return NULL;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
@@ -581,7 +615,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	/*
 	 * Disallow merge of a sync bio into an async request.
 	 */
-	if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+	if (elv_bio_sync(bio) && !rq_is_sync(rq))
 		return 0;
 
 	/*
@@ -592,7 +626,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	if (!cic)
 		return 0;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq == RQ_CFQQ(rq))
 		return 1;
 
@@ -1199,14 +1233,28 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
+
 	if (cfqq) {
 		struct cfq_queue *new_cfqq;
-		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+
+		/*
+		 * Drop the reference to old queue unconditionally. Don't
+		 * worry whether new async prio queue has been allocated
+		 * or not.
+		 */
+		cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+		cfq_put_queue(cfqq);
+
+		/*
+		 * Why to allocate new queue now? Will it not be automatically
+		 * allocated whenever another async request from same context
+		 * comes? Keeping it for the time being because existing cfq
+		 * code allocates the new queue immediately upon prio change
+		 */
+		new_cfqq = cfq_get_queue(cfqd, NULL, BLK_RW_ASYNC, cic->ioc,
 						GFP_ATOMIC);
-		if (new_cfqq) {
-			cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
-			cfq_put_queue(cfqq);
-		}
+		if (new_cfqq)
+			cic_set_cfqq(cic, new_cfqq, BLK_RW_ASYNC);
 	}
 
 	cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -1239,7 +1287,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
-	iog = io_get_io_group(q, 0);
+	iog = io_get_io_group(q, NULL, 0);
 
 	if (async_cfqq != NULL) {
 		__iog = cfqq_to_io_group(async_cfqq);
@@ -1277,7 +1325,7 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
 #endif  /* CONFIG_IOSCHED_CFQ_HIER */
 
 static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
 				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
@@ -1286,12 +1334,28 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 	struct io_queue *ioq = NULL, *new_ioq = NULL;
 	struct io_group *iog = NULL;
 retry:
-	iog = io_get_io_group(q, 1);
+	iog = io_get_io_group_bio(q, bio, 1);
 
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
 
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+
+		/*
+		 * We have not cached async queue pointer as bio tracking
+		 * is enabled. Look into group async queue array using ioc
+		 * class and prio to see if somebody already allocated the
+		 * queue.
+		 */
+
+		cfqq = io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
+
 	if (!cfqq) {
 		if (new_cfqq) {
 			goto alloc_ioq;
@@ -1381,14 +1445,14 @@ out:
 }
 
 static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-					gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
+		struct io_context *ioc, gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
-	struct io_group *iog = io_get_io_group(cfqd->queue, 1);
+	struct io_group *iog = io_get_io_group_bio(cfqd->queue, bio, 1);
 
 	if (!is_sync) {
 		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
@@ -1397,7 +1461,7 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	}
 
 	if (!cfqq) {
-		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, bio, is_sync, ioc, gfp_mask);
 		if (!cfqq)
 			return NULL;
 	}
@@ -1405,8 +1469,30 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	if (!is_sync && !async_cfqq)
 		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	/* ioc reference */
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * ioc reference. If async request queue/group is determined from the
+	 * original task/cgroup and not from submitter task, io context can
+	 * not cache the pointer to async queue and every time a request comes,
+	 * it will be determined by going through the async queue array.
+	 *
+	 * This comes from the fact that we might be getting async requests
+	 * which belong to a different cgroup altogether than the cgroup
+	 * iocontext belongs to. And this thread might be submitting bios
+	 * from various cgroups. So every time async queue will be different
+	 * based on the cgroup of the bio/rq. Can't cache the async cfqq
+	 * pointer in cic.
+	 */
+	if (is_sync)
+		elv_get_ioq(cfqq->ioq);
+#else
+	/*
+	 * async requests are being attributed to task submitting
+	 * it, hence cic can cache async cfqq pointer. Take the
+	 * queue reference even for async queue.
+	 */
 	elv_get_ioq(cfqq->ioq);
+#endif
 	return cfqq;
 }
 
@@ -1802,7 +1888,8 @@ static void cfq_put_request(struct request *rq)
  * Allocate cfq data structures associated with this request.
  */
 static int
-cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
+				gfp_t gfp_mask)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_io_context *cic;
@@ -1822,7 +1909,8 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq) {
-		cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc,
+						gfp_mask);
 
 		if (!cfqq)
 			goto queue_fail;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index ad63493..e5a94e3 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -133,7 +133,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	int ret;
 	struct deadline_queue *dq;
 
-	dq = elv_get_sched_queue_current(q);
+	dq = elv_get_sched_queue_bio(q, bio);
 	if (!dq)
 		return ELEVATOR_NO_MERGE;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index c1d04b1..899972c 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -16,6 +16,7 @@
 #include "elevator-fq.h"
 #include <linux/blktrace_api.h>
 #include <linux/seq_file.h>
+#include <linux/biotrack.h>
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -1084,6 +1085,9 @@ struct io_cgroup io_root_cgroup = {
 
 static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 {
+	if (!cgroup)
+		return &io_root_cgroup;
+
 	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
 			    struct io_cgroup, css);
 }
@@ -1517,9 +1521,60 @@ end:
 	return iog;
 }
 
+/* Map a page to respective cgroup. Null return means, map it to root cgroup */
+static inline struct cgroup *get_cgroup_from_page(struct page *page)
+{
+	unsigned long bio_cgroup_id;
+	struct cgroup *cgroup;
+
+	bio_cgroup_id = get_blkio_cgroup_id_page(page);
+
+	if (!bio_cgroup_id)
+		return NULL;
+
+	cgroup = blkio_cgroup_lookup(bio_cgroup_id);
+	return cgroup;
+}
+
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+					int create)
+{
+	struct page *page = NULL;
+
+	/*
+	 * Determine the group from task context. Even calls from
+	 * blk_get_request() which don't have any bio info will be mapped
+	 * to the task's group
+	 */
+	if (!bio)
+		goto sync;
+
+	if (bio_barrier(bio)) {
+		/*
+		 * Map barrier requests to root group. May be more special
+		 * bio cases should come here
+		 */
+		return q->elevator->efqd.root_group;
+	}
+
+	/* Map the sync bio to the right group using task context */
+	if (elv_bio_sync(bio))
+		goto sync;
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/* Determine the group from info stored in page */
+	page = bio_iovec_idx(bio, 0)->bv_page;
+	return io_get_io_group(q, page, create);
+#endif
+
+sync:
+	return io_get_io_group(q, NULL, create);
+}
+EXPORT_SYMBOL(io_get_io_group_bio);
+
 /*
- * Search for the io group current task belongs to. If create=1, then also
- * create the io group if it is not already there.
+ * Find the io group page belongs to.
+ * If "create" is set, io group is created if it is not already present.
  *
  * Note: This function should be called with queue lock held. It returns
  * a pointer to io group without taking any reference. That group will
@@ -1527,28 +1582,48 @@ end:
  * needs to get hold of queue lock). So if somebody needs to use group
  * pointer even after dropping queue lock, take a reference to the group
  * before dropping queue lock.
+ *
+ * One can call it without queue lock with rcu read lock held for browsing
+ * through the groups.
  */
-struct io_group *io_get_io_group(struct request_queue *q, int create)
+struct io_group *io_get_io_group(struct request_queue *q, struct page *page,
+					int create)
 {
 	struct cgroup *cgroup;
 	struct io_group *iog;
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 
-	assert_spin_locked(q->queue_lock);
+	if (create)
+		assert_spin_locked(q->queue_lock);
 
 	rcu_read_lock();
-	cgroup = task_cgroup(current, io_subsys_id);
-	iog = io_find_alloc_group(q, cgroup, efqd, create);
-	if (!iog) {
+
+	if (!page)
+		cgroup = task_cgroup(current, io_subsys_id);
+	else
+		cgroup = get_cgroup_from_page(page);
+
+	if (!cgroup) {
 		if (create)
 			iog = efqd->root_group;
-		else
+		else {
 			/*
 			 * bio merge functions doing lookup don't want to
 			 * map bio to root group by default
 			 */
 			iog = NULL;
+		}
+		goto out;
 	}
+
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			iog = NULL;
+	}
+out:
 	rcu_read_unlock();
 	return iog;
 }
@@ -1861,7 +1936,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
 		return 1;
 
 	/* Determine the io group of the bio submitting task */
-	iog = io_get_io_group(q, 0);
+	iog = io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
 		/* May be task belongs to a differet cgroup for which io
 		 * group has not been setup yet. */
@@ -1885,7 +1960,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
  * function is not invoked.
  */
 int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
-					gfp_t gfp_mask)
+				struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 	unsigned long flags;
@@ -1901,7 +1976,7 @@ int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
 
 retry:
 	/* Determine the io group request belongs to */
-	iog = io_get_io_group(q, 1);
+	iog = io_get_io_group_bio(q, bio, 1);
 	BUG_ON(!iog);
 
 	/* Get the iosched queue */
@@ -1986,17 +2061,17 @@ queue_fail:
 }
 
 /*
- * Find out the io queue of current task. Optimization for single ioq
+ * Find out the io queue of bio belongs to. Optimization for single ioq
  * per io group io schedulers.
  */
-struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 {
 	struct io_group *iog;
 
 	/* Determine the io group and io queue of the bio submitting task */
-	iog = io_get_io_group(q, 0);
+	iog = io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
-		/* May be task belongs to a cgroup for which io group has
+		/* May be bio belongs to a cgroup for which io group has
 		 * not been setup yet. */
 		return NULL;
 	}
@@ -2072,13 +2147,21 @@ static void io_free_root_group(struct elevator_queue *e)
 	kfree(iog);
 }
 
-struct io_group *io_get_io_group(struct request_queue *q, int create)
+struct io_group *io_get_io_group(struct request_queue *q, struct page *page,
+						int create)
 {
 	/* In flat mode, there is only root group */
 	return q->elevator->efqd.root_group;
 }
 EXPORT_SYMBOL(io_get_io_group);
 
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+						int create)
+{
+	return q->elevator->efqd.root_group;
+}
+EXPORT_SYMBOL(io_get_io_group_bio);
+
 static inline int is_only_root_group(void)
 {
 	return 1;
@@ -3232,6 +3315,10 @@ expire:
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
+	if (ioq)
+		elv_log_ioq(efqd, ioq, "select busy=%d qued=%d disp=%d",
+				elv_nr_busy_ioq(q->elevator), ioq->nr_queued,
+				elv_ioq_nr_dispatched(ioq));
 	return ioq;
 }
 
@@ -3339,7 +3426,9 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 	ioq = rq->ioq;
 	iog = ioq_to_io_group(ioq);
 
-	elv_log_ioq(efqd, ioq, "complete");
+	elv_log_ioq(efqd, ioq, "complete rq_queued=%d drv=%d disp=%d",
+				ioq->nr_queued, efqd->rq_in_driver,
+				elv_ioq_nr_dispatched(ioq));
 
 	elv_update_hw_tag(efqd);
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index c117d40..bb43444 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -521,10 +521,11 @@ static inline int update_requeue(struct io_queue *ioq, int requeue)
 }
 
 extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
-					gfp_t gfp_mask);
+					struct bio *bio, gfp_t gfp_mask);
 extern void elv_fq_unset_request_ioq(struct request_queue *q,
 					struct request *rq);
-extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio);
 
 /* Sets the single ioq associated with the io group. (noop, deadline, AS) */
 static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
@@ -559,7 +560,7 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
 }
 
 static inline int elv_fq_set_request_ioq(struct request_queue *q,
-					struct request *rq, gfp_t gfp_mask)
+			struct request *rq, struct bio *bio, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -569,7 +570,8 @@ static inline void elv_fq_unset_request_ioq(struct request_queue *q,
 {
 }
 
-static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio)
 {
 	return NULL;
 }
@@ -626,7 +628,10 @@ extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio);
 extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 					int ioprio, struct io_queue *ioq);
-extern struct io_group *io_get_io_group(struct request_queue *q, int create);
+extern struct io_group *io_get_io_group(struct request_queue *q,
+					struct page *page, int create);
+extern struct io_group *io_get_io_group_bio(struct request_queue *q,
+					struct bio *bio, int create);
 extern int elv_nr_busy_ioq(struct elevator_queue *e);
 extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
 extern void elv_free_ioq(struct io_queue *ioq);
@@ -684,7 +689,7 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 	return 1;
 }
 static inline int elv_fq_set_request_ioq(struct request_queue *q,
-					struct request *rq, gfp_t gfp_mask)
+			struct request *rq, struct bio *bio, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -694,7 +699,8 @@ static inline void elv_fq_unset_request_ioq(struct request_queue *q,
 {
 }
 
-static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio)
 {
 	return NULL;
 }
diff --git a/block/elevator.c b/block/elevator.c
index 862be80..68d5a80 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -833,7 +833,8 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 	return NULL;
 }
 
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+			struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
@@ -842,10 +843,10 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	 * ioq per io group
 	 */
 	if (elv_iosched_single_ioq(e))
-		return elv_fq_set_request_ioq(q, rq, gfp_mask);
+		return elv_fq_set_request_ioq(q, rq, bio, gfp_mask);
 
 	if (e->ops->elevator_set_req_fn)
-		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
+		return e->ops->elevator_set_req_fn(q, rq, bio, gfp_mask);
 
 	rq->elevator_private = NULL;
 	return 0;
@@ -1247,19 +1248,19 @@ void *elv_select_sched_queue(struct request_queue *q, int force)
 EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
- * Get the io scheduler queue pointer for current task.
+ * Get the io scheduler queue pointer for the group bio belongs to.
  *
  * If fair queuing is enabled, determine the io group of task and retrieve
  * the ioq pointer from that. This is used by only single queue ioschedulers
  * for retrieving the queue associated with the group to decide whether the
  * new bio can do a front merge or not.
  */
-void *elv_get_sched_queue_current(struct request_queue *q)
+void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio)
 {
 	/* Fair queuing is not enabled. There is only one queue. */
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return q->elevator->sched_queue;
 
-	return ioq_sched_queue(elv_lookup_ioq_current(q));
+	return ioq_sched_queue(elv_lookup_ioq_bio(q, bio));
 }
-EXPORT_SYMBOL(elv_get_sched_queue_current);
+EXPORT_SYMBOL(elv_get_sched_queue_bio);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index dda7951..cf6b752 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -23,7 +23,7 @@ typedef struct request *(elevator_request_list_fn) (struct request_queue *, stru
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_may_queue_fn) (struct request_queue *, int);
 
-typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
+typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, struct bio *bio, gfp_t);
 typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
@@ -148,7 +148,8 @@ extern void elv_unregister_queue(struct request_queue *q);
 extern int elv_may_queue(struct request_queue *, int);
 extern void elv_abort_queue(struct request_queue *);
 extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+					struct bio *bio, gfp_t);
 extern void elv_put_request(struct request_queue *, struct request *);
 extern void elv_drain_elevator(struct request_queue *);
 
@@ -277,6 +278,20 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-extern void *elv_get_sched_queue_current(struct request_queue *q);
+extern void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio);
+
+/*
+ * This is the equivalent of the rq_is_sync()/cfq_bio_sync() functions, where
+ * we determine whether an rq/bio is sync or not. There are cases, such as
+ * during merging and during request allocation, where we have a bio but no
+ * rq and need to find out whether this bio will be considered sync or async
+ * by the elevator/iosched. This function is useful in such cases.
+ */
+static inline int elv_bio_sync(struct bio *bio)
+{
+	if ((bio_data_dir(bio) == READ) || bio_sync(bio))
+		return 1;
+	return 0;
+}
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6


* [PATCH 20/25] io-controller: map async requests to appropriate cgroup
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

o So far we have assumed that a bio/rq belongs to the task submitting it.
  That does not hold for async writes. This patch uses the blkio_cgroup
  patches to attribute async writes to the right group instead of to the
  task submitting the bio.

o For sync requests, we continue to assume that io belongs to the task
  submitting it. Only in case of async requests, we make use of io tracking
  patches to track the owner cgroup.

o So far cfq always caches the async queue pointer. With async requests no
  longer necessarily tied to the submitting task's io context, caching the
  pointer does not help for async queues. This patch introduces a new config
  option, CONFIG_TRACK_ASYNC_CONTEXT. If this option is not set, cfq retains
  the old behavior, where the async queue pointer is cached in the task
  context. If it is set, the async queue pointer is not cached; the bio
  tracking patches are used to determine the group the bio belongs to, and
  the bio is then mapped to the async queue of that group.
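
As a rough illustration of the caching change, the sketch below models
an io context that refuses to cache the async queue when tracking is
enabled and instead looks the queue up in the bio's group by priority
class and level. It is a user-space sketch under stated assumptions:
all mock_* names, the NR_PRIO constant and the array layout inside
struct mock_group are hypothetical and are not taken from the patch;
only the "do not cache async, look up per group" behavior mirrors
cic_set_cfqq() and cic_bio_to_cfqq().

```c
#include <stddef.h>

#define NR_PRIO 8	/* assumed number of priority levels per class */

enum mock_class { CLASS_RT, CLASS_BE, CLASS_IDLE };

struct mock_queue { int id; };

/* Assumed layout: one async queue per (class, prio), plus one idle queue. */
struct mock_group {
	struct mock_queue *async[2][NR_PRIO];
	struct mock_queue *async_idle;
};

/* Per-task cache: index 0 is the async queue, index 1 the sync queue. */
struct mock_cic {
	struct mock_queue *cfqq[2];
};

static struct mock_queue *group_async_queue(struct mock_group *g,
					    int class, int prio)
{
	if (prio < 0 || prio >= NR_PRIO)
		return NULL;
	switch (class) {
	case CLASS_RT:
		return g->async[0][prio];
	case CLASS_BE:
		return g->async[1][prio];
	case CLASS_IDLE:
		return g->async_idle;
	}
	return NULL;
}

/* With tracking enabled, async queue pointers are never cached. */
static void cic_set_queue(struct mock_cic *cic, struct mock_queue *q,
			  int is_sync, int track_async)
{
	if (track_async && !is_sync)
		return;
	cic->cfqq[!!is_sync] = q;
}

/*
 * Look up the queue for a bio. bio_group is the group the bio was
 * mapped to, or NULL if that group has not been set up yet.
 */
static struct mock_queue *lookup_queue(struct mock_cic *cic,
				       struct mock_group *bio_group,
				       int class, int prio,
				       int is_sync, int track_async)
{
	struct mock_queue *q = cic->cfqq[!!is_sync];

	if (!q && !is_sync && track_async && bio_group)
		q = group_async_queue(bio_group, class, prio);
	return q;
}
```

With tracking enabled, the same io context can therefore resolve to
different async queues depending on which group each bio maps to, which
is why a single cached pointer would be wrong.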

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched    |   16 +++++
 block/as-iosched.c       |    2 +-
 block/blk-core.c         |    7 +-
 block/cfq-iosched.c      |  152 ++++++++++++++++++++++++++++++++++++----------
 block/deadline-iosched.c |    2 +-
 block/elevator-fq.c      |  123 ++++++++++++++++++++++++++++++++-----
 block/elevator-fq.h      |   20 ++++--
 block/elevator.c         |   15 +++--
 include/linux/elevator.h |   21 ++++++-
 9 files changed, 287 insertions(+), 71 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 77fc786..0677099 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -124,6 +124,22 @@ config DEFAULT_IOSCHED
 	default "cfq" if DEFAULT_CFQ
 	default "noop" if DEFAULT_NOOP
 
+config TRACK_ASYNC_CONTEXT
+	bool "Determine async request context from bio"
+	depends on GROUP_IOSCHED
+	select CGROUP_BLKIO
+	default n
+	---help---
+	  Normally an async request is attributed to the task submitting the
+	  request. With group ioscheduling, accurate accounting of async
+	  writes requires mapping the request to the task/cgroup that
+	  originated it rather than to the submitter of the request.
+
+	  Generic io tracking patches provide a facility to map a bio to its
+	  original owner. If this option is set, the original owner of an
+	  async request's bio is determined using the io tracking patches;
+	  otherwise the request continues to be attributed to the
+	  submitting thread.
 endmenu
 
 endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 18a61bb..213f3e3 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -1501,7 +1501,7 @@ as_merge(struct request_queue *q, struct request **req, struct bio *bio)
 {
 	sector_t rb_key = bio->bi_sector + bio_sectors(bio);
 	struct request *__rq;
-	struct as_queue *asq = elv_get_sched_queue_current(q);
+	struct as_queue *asq = elv_get_sched_queue_bio(q, bio);
 
 	if (!asq)
 		return ELEVATOR_NO_MERGE;
diff --git a/block/blk-core.c b/block/blk-core.c
index b06cf5c..6d8b4dd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -628,7 +628,8 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 }
 
 static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
+					gfp_t gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 
@@ -640,7 +641,7 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
 	rq->cmd_flags = flags | REQ_ALLOCED;
 
 	if (priv) {
-		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
 			mempool_free(rq, q->rq.rq_pool);
 			return NULL;
 		}
@@ -781,7 +782,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 98a35fd..a40a2fa 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -160,8 +160,8 @@ CFQ_CFQQ_FNS(coop);
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
-				       struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *, struct bio *bio,
+					int, struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 
@@ -171,22 +171,56 @@ static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 	return cic->cfqq[!!is_sync];
 }
 
-static inline void cic_set_cfqq(struct cfq_io_context *cic,
-				struct cfq_queue *cfqq, int is_sync)
-{
-	cic->cfqq[!!is_sync] = cfqq;
-}
-
 /*
- * We regard a request as SYNC, if it's either a read or has the SYNC bit
- * set (in which case it could also be direct WRITE).
+ * Determine the cfq queue bio should go in. This is primarily used by
+ * front merge and allow merge functions.
+ *
+ * Currently this function takes the ioprio and ioprio_class from task
+ * submitting async bio. Later save the task information in the page_cgroup
+ * and retrieve task's ioprio and class from there.
  */
-static inline int cfq_bio_sync(struct bio *bio)
+static struct cfq_queue *cic_bio_to_cfqq(struct cfq_data *cfqd,
+		struct cfq_io_context *cic, struct bio *bio, int is_sync)
 {
-	if (bio_data_dir(bio) == READ || bio_sync(bio))
-		return 1;
+	struct cfq_queue *cfqq = NULL;
 
-	return 0;
+	cfqq = cic_to_cfqq(cic, is_sync);
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+		struct io_group *iog;
+		/*
+		 * async bio tracking is enabled and we are not caching
+		 * async queue pointer in cic.
+		 */
+		iog = io_get_io_group_bio(cfqd->queue, bio, 0);
+		if (!iog) {
+			/*
+			 * May be this is first rq/bio and io group has not
+			 * been setup yet.
+			 */
+			return NULL;
+		}
+		return io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
+	return cfqq;
+}
+
+static inline void cic_set_cfqq(struct cfq_io_context *cic,
+				struct cfq_queue *cfqq, int is_sync)
+{
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * Don't cache async queue pointer as now one io context might
+	 * be submitting async io for various different async queues
+	 */
+	if (!is_sync)
+		return;
+#endif
+	cic->cfqq[!!is_sync] = cfqq;
 }
 
 static inline struct io_group *cfqq_to_io_group(struct cfq_queue *cfqq)
@@ -499,7 +533,7 @@ cfq_find_rq_fmerge(struct cfq_data *cfqd, struct bio *bio)
 	if (!cic)
 		return NULL;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq) {
 		sector_t sector = bio->bi_sector + bio_sectors(bio);
 
@@ -581,7 +615,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	/*
 	 * Disallow merge of a sync bio into an async request.
 	 */
-	if (cfq_bio_sync(bio) && !rq_is_sync(rq))
+	if (elv_bio_sync(bio) && !rq_is_sync(rq))
 		return 0;
 
 	/*
@@ -592,7 +626,7 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	if (!cic)
 		return 0;
 
-	cfqq = cic_to_cfqq(cic, cfq_bio_sync(bio));
+	cfqq = cic_bio_to_cfqq(cfqd, cic, bio, elv_bio_sync(bio));
 	if (cfqq == RQ_CFQQ(rq))
 		return 1;
 
@@ -1199,14 +1233,28 @@ static void changed_ioprio(struct io_context *ioc, struct cfq_io_context *cic)
 	spin_lock_irqsave(q->queue_lock, flags);
 
 	cfqq = cic->cfqq[BLK_RW_ASYNC];
+
 	if (cfqq) {
 		struct cfq_queue *new_cfqq;
-		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->ioc,
+
+		/*
+		 * Drop the reference to old queue unconditionally. Don't
+		 * worry whether new async prio queue has been allocated
+		 * or not.
+		 */
+		cic_set_cfqq(cic, NULL, BLK_RW_ASYNC);
+		cfq_put_queue(cfqq);
+
+		/*
+		 * Why to allocate new queue now? Will it not be automatically
+		 * allocated whenever another async request from same context
+		 * comes? Keeping it for the time being because existing cfq
+		 * code allocates the new queue immediately upon prio change
+		 */
+		new_cfqq = cfq_get_queue(cfqd, NULL, BLK_RW_ASYNC, cic->ioc,
 						GFP_ATOMIC);
-		if (new_cfqq) {
-			cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
-			cfq_put_queue(cfqq);
-		}
+		if (new_cfqq)
+			cic_set_cfqq(cic, new_cfqq, BLK_RW_ASYNC);
 	}
 
 	cfqq = cic->cfqq[BLK_RW_SYNC];
@@ -1239,7 +1287,7 @@ static void changed_cgroup(struct io_context *ioc, struct cfq_io_context *cic)
 
 	spin_lock_irqsave(q->queue_lock, flags);
 
-	iog = io_get_io_group(q, 0);
+	iog = io_get_io_group(q, NULL, 0);
 
 	if (async_cfqq != NULL) {
 		__iog = cfqq_to_io_group(async_cfqq);
@@ -1277,7 +1325,7 @@ static void cfq_ioc_set_cgroup(struct io_context *ioc)
 #endif  /* CONFIG_IOSCHED_CFQ_HIER */
 
 static struct cfq_queue *
-cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
+cfq_find_alloc_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
 				struct io_context *ioc, gfp_t gfp_mask)
 {
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
@@ -1286,12 +1334,28 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
 	struct io_queue *ioq = NULL, *new_ioq = NULL;
 	struct io_group *iog = NULL;
 retry:
-	iog = io_get_io_group(q, 1);
+	iog = io_get_io_group_bio(q, bio, 1);
 
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
 
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	if (!cfqq && !is_sync) {
+		const int ioprio = task_ioprio(cic->ioc);
+		const int ioprio_class = task_ioprio_class(cic->ioc);
+
+		/*
+		 * We have not cached async queue pointer as bio tracking
+		 * is enabled. Look into group async queue array using ioc
+		 * class and prio to see if somebody already allocated the
+		 * queue.
+		 */
+
+		cfqq = io_group_async_queue_prio(iog, ioprio_class, ioprio);
+	}
+#endif
+
 	if (!cfqq) {
 		if (new_cfqq) {
 			goto alloc_ioq;
@@ -1381,14 +1445,14 @@ out:
 }
 
 static struct cfq_queue *
-cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
-					gfp_t gfp_mask)
+cfq_get_queue(struct cfq_data *cfqd, struct bio *bio, int is_sync,
+		struct io_context *ioc, gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
 	struct cfq_queue *async_cfqq = NULL;
 	struct cfq_queue *cfqq = NULL;
-	struct io_group *iog = io_get_io_group(cfqd->queue, 1);
+	struct io_group *iog = io_get_io_group_bio(cfqd->queue, bio, 1);
 
 	if (!is_sync) {
 		async_cfqq = io_group_async_queue_prio(iog, ioprio_class,
@@ -1397,7 +1461,7 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	}
 
 	if (!cfqq) {
-		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, bio, is_sync, ioc, gfp_mask);
 		if (!cfqq)
 			return NULL;
 	}
@@ -1405,8 +1469,30 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
 	if (!is_sync && !async_cfqq)
 		io_group_set_async_queue(iog, ioprio_class, ioprio, cfqq->ioq);
 
-	/* ioc reference */
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/*
+	 * ioc reference. If async request queue/group is determined from the
+	 * original task/cgroup and not from submitter task, io context can
+	 * not cache the pointer to async queue and every time a request comes,
+	 * it will be determined by going through the async queue array.
+	 *
+	 * This comes from the fact that we might be getting async requests
+	 * which belong to a different cgroup altogether than the cgroup
+	 * iocontext belongs to. And this thread might be submitting bios
+	 * from various cgroups. So every time async queue will be different
+	 * based on the cgroup of the bio/rq. Can't cache the async cfqq
+	 * pointer in cic.
+	 */
+	if (is_sync)
+		elv_get_ioq(cfqq->ioq);
+#else
+	/*
+	 * async requests are being attributed to task submitting
+	 * it, hence cic can cache async cfqq pointer. Take the
+	 * queue reference even for async queue.
+	 */
 	elv_get_ioq(cfqq->ioq);
+#endif
 	return cfqq;
 }
 
@@ -1802,7 +1888,8 @@ static void cfq_put_request(struct request *rq)
  * Allocate cfq data structures associated with this request.
  */
 static int
-cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
+				gfp_t gfp_mask)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_io_context *cic;
@@ -1822,7 +1909,8 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq) {
-		cfqq = cfq_get_queue(cfqd, is_sync, cic->ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, bio, is_sync, cic->ioc,
+						gfp_mask);
 
 		if (!cfqq)
 			goto queue_fail;
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index ad63493..e5a94e3 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -133,7 +133,7 @@ deadline_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	int ret;
 	struct deadline_queue *dq;
 
-	dq = elv_get_sched_queue_current(q);
+	dq = elv_get_sched_queue_bio(q, bio);
 	if (!dq)
 		return ELEVATOR_NO_MERGE;
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index c1d04b1..899972c 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -16,6 +16,7 @@
 #include "elevator-fq.h"
 #include <linux/blktrace_api.h>
 #include <linux/seq_file.h>
+#include <linux/biotrack.h>
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -1084,6 +1085,9 @@ struct io_cgroup io_root_cgroup = {
 
 static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 {
+	if (!cgroup)
+		return &io_root_cgroup;
+
 	return container_of(cgroup_subsys_state(cgroup, io_subsys_id),
 			    struct io_cgroup, css);
 }
@@ -1517,9 +1521,60 @@ end:
 	return iog;
 }
 
+/* Map a page to respective cgroup. Null return means, map it to root cgroup */
+static inline struct cgroup *get_cgroup_from_page(struct page *page)
+{
+	unsigned long bio_cgroup_id;
+	struct cgroup *cgroup;
+
+	bio_cgroup_id = get_blkio_cgroup_id_page(page);
+
+	if (!bio_cgroup_id)
+		return NULL;
+
+	cgroup = blkio_cgroup_lookup(bio_cgroup_id);
+	return cgroup;
+}
+
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+					int create)
+{
+	struct page *page = NULL;
+
+	/*
+	 * Determine the group from task context. Even calls from
+	 * blk_get_request() which don't have any bio info will be mapped
+	 * to the task's group
+	 */
+	if (!bio)
+		goto sync;
+
+	if (bio_barrier(bio)) {
+		/*
+		 * Map barrier requests to root group. May be more special
+		 * bio cases should come here
+		 */
+		return q->elevator->efqd.root_group;
+	}
+
+	/* Map the sync bio to the right group using task context */
+	if (elv_bio_sync(bio))
+		goto sync;
+
+#ifdef CONFIG_TRACK_ASYNC_CONTEXT
+	/* Determine the group from info stored in page */
+	page = bio_iovec_idx(bio, 0)->bv_page;
+	return io_get_io_group(q, page, create);
+#endif
+
+sync:
+	return io_get_io_group(q, NULL, create);
+}
+EXPORT_SYMBOL(io_get_io_group_bio);
+
 /*
- * Search for the io group current task belongs to. If create=1, then also
- * create the io group if it is not already there.
+ * Find the io group page belongs to.
+ * If "create" is set, io group is created if it is not already present.
  *
  * Note: This function should be called with queue lock held. It returns
  * a pointer to io group without taking any reference. That group will
@@ -1527,28 +1582,48 @@ end:
  * needs to get hold of queue lock). So if somebody needs to use group
  * pointer even after dropping queue lock, take a reference to the group
  * before dropping queue lock.
+ *
+ * One can call it without queue lock with rcu read lock held for browsing
+ * through the groups.
  */
-struct io_group *io_get_io_group(struct request_queue *q, int create)
+struct io_group *io_get_io_group(struct request_queue *q, struct page *page,
+					int create)
 {
 	struct cgroup *cgroup;
 	struct io_group *iog;
 	struct elv_fq_data *efqd = &q->elevator->efqd;
 
-	assert_spin_locked(q->queue_lock);
+	if (create)
+		assert_spin_locked(q->queue_lock);
 
 	rcu_read_lock();
-	cgroup = task_cgroup(current, io_subsys_id);
-	iog = io_find_alloc_group(q, cgroup, efqd, create);
-	if (!iog) {
+
+	if (!page)
+		cgroup = task_cgroup(current, io_subsys_id);
+	else
+		cgroup = get_cgroup_from_page(page);
+
+	if (!cgroup) {
 		if (create)
 			iog = efqd->root_group;
-		else
+		else {
 			/*
 			 * bio merge functions doing lookup don't want to
 			 * map bio to root group by default
 			 */
 			iog = NULL;
+		}
+		goto out;
 	}
+
+	iog = io_find_alloc_group(q, cgroup, efqd, create);
+	if (!iog) {
+		if (create)
+			iog = efqd->root_group;
+		else
+			iog = NULL;
+	}
+out:
 	rcu_read_unlock();
 	return iog;
 }
@@ -1861,7 +1936,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
 		return 1;
 
 	/* Determine the io group of the bio submitting task */
-	iog = io_get_io_group(q, 0);
+	iog = io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
 		/* May be task belongs to a differet cgroup for which io
 		 * group has not been setup yet. */
@@ -1885,7 +1960,7 @@ int io_group_allow_merge(struct request *rq, struct bio *bio)
  * function is not invoked.
  */
 int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
-					gfp_t gfp_mask)
+				struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 	unsigned long flags;
@@ -1901,7 +1976,7 @@ int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
 
 retry:
 	/* Determine the io group request belongs to */
-	iog = io_get_io_group(q, 1);
+	iog = io_get_io_group_bio(q, bio, 1);
 	BUG_ON(!iog);
 
 	/* Get the iosched queue */
@@ -1986,17 +2061,17 @@ queue_fail:
 }
 
 /*
- * Find out the io queue of current task. Optimization for single ioq
+ * Find out the io queue of bio belongs to. Optimization for single ioq
  * per io group io schedulers.
  */
-struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+struct io_queue *elv_lookup_ioq_bio(struct request_queue *q, struct bio *bio)
 {
 	struct io_group *iog;
 
 	/* Determine the io group and io queue of the bio submitting task */
-	iog = io_get_io_group(q, 0);
+	iog = io_get_io_group_bio(q, bio, 0);
 	if (!iog) {
-		/* May be task belongs to a cgroup for which io group has
+		/* May be bio belongs to a cgroup for which io group has
 		 * not been setup yet. */
 		return NULL;
 	}
@@ -2072,13 +2147,21 @@ static void io_free_root_group(struct elevator_queue *e)
 	kfree(iog);
 }
 
-struct io_group *io_get_io_group(struct request_queue *q, int create)
+struct io_group *io_get_io_group(struct request_queue *q, struct page *page,
+						int create)
 {
 	/* In flat mode, there is only root group */
 	return q->elevator->efqd.root_group;
 }
 EXPORT_SYMBOL(io_get_io_group);
 
+struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
+						int create)
+{
+	return q->elevator->efqd.root_group;
+}
+EXPORT_SYMBOL(io_get_io_group_bio);
+
 static inline int is_only_root_group(void)
 {
 	return 1;
@@ -3232,6 +3315,10 @@ expire:
 new_queue:
 	ioq = elv_set_active_ioq(q, new_ioq);
 keep_queue:
+	if (ioq)
+		elv_log_ioq(efqd, ioq, "select busy=%d qued=%d disp=%d",
+				elv_nr_busy_ioq(q->elevator), ioq->nr_queued,
+				elv_ioq_nr_dispatched(ioq));
 	return ioq;
 }
 
@@ -3339,7 +3426,9 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 	ioq = rq->ioq;
 	iog = ioq_to_io_group(ioq);
 
-	elv_log_ioq(efqd, ioq, "complete");
+	elv_log_ioq(efqd, ioq, "complete rq_queued=%d drv=%d disp=%d",
+				ioq->nr_queued, efqd->rq_in_driver,
+				elv_ioq_nr_dispatched(ioq));
 
 	elv_update_hw_tag(efqd);
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index c117d40..bb43444 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -521,10 +521,11 @@ static inline int update_requeue(struct io_queue *ioq, int requeue)
 }
 
 extern int elv_fq_set_request_ioq(struct request_queue *q, struct request *rq,
-					gfp_t gfp_mask);
+					struct bio *bio, gfp_t gfp_mask);
 extern void elv_fq_unset_request_ioq(struct request_queue *q,
 					struct request *rq);
-extern struct io_queue *elv_lookup_ioq_current(struct request_queue *q);
+extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio);
 
 /* Sets the single ioq associated with the io group. (noop, deadline, AS) */
 static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
@@ -559,7 +560,7 @@ static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
 }
 
 static inline int elv_fq_set_request_ioq(struct request_queue *q,
-					struct request *rq, gfp_t gfp_mask)
+			struct request *rq, struct bio *bio, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -569,7 +570,8 @@ static inline void elv_fq_unset_request_ioq(struct request_queue *q,
 {
 }
 
-static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio)
 {
 	return NULL;
 }
@@ -626,7 +628,10 @@ extern void *io_group_async_queue_prio(struct io_group *iog, int ioprio_class,
 					int ioprio);
 extern void io_group_set_async_queue(struct io_group *iog, int ioprio_class,
 					int ioprio, struct io_queue *ioq);
-extern struct io_group *io_get_io_group(struct request_queue *q, int create);
+extern struct io_group *io_get_io_group(struct request_queue *q,
+					struct page *page, int create);
+extern struct io_group *io_get_io_group_bio(struct request_queue *q,
+					struct bio *bio, int create);
 extern int elv_nr_busy_ioq(struct elevator_queue *e);
 extern struct io_queue *elv_alloc_ioq(struct request_queue *q, gfp_t gfp_mask);
 extern void elv_free_ioq(struct io_queue *ioq);
@@ -684,7 +689,7 @@ static inline int io_group_allow_merge(struct request *rq, struct bio *bio)
 	return 1;
 }
 static inline int elv_fq_set_request_ioq(struct request_queue *q,
-					struct request *rq, gfp_t gfp_mask)
+			struct request *rq, struct bio *bio, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -694,7 +699,8 @@ static inline void elv_fq_unset_request_ioq(struct request_queue *q,
 {
 }
 
-static inline struct io_queue *elv_lookup_ioq_current(struct request_queue *q)
+static inline struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
+						struct bio *bio)
 {
 	return NULL;
 }
diff --git a/block/elevator.c b/block/elevator.c
index 862be80..68d5a80 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -833,7 +833,8 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 	return NULL;
 }
 
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+			struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
@@ -842,10 +843,10 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	 * ioq per io group
 	 */
 	if (elv_iosched_single_ioq(e))
-		return elv_fq_set_request_ioq(q, rq, gfp_mask);
+		return elv_fq_set_request_ioq(q, rq, bio, gfp_mask);
 
 	if (e->ops->elevator_set_req_fn)
-		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
+		return e->ops->elevator_set_req_fn(q, rq, bio, gfp_mask);
 
 	rq->elevator_private = NULL;
 	return 0;
@@ -1247,19 +1248,19 @@ void *elv_select_sched_queue(struct request_queue *q, int force)
 EXPORT_SYMBOL(elv_select_sched_queue);
 
 /*
- * Get the io scheduler queue pointer for current task.
+ * Get the io scheduler queue pointer for the group bio belongs to.
  *
  * If fair queuing is enabled, determine the io group of task and retrieve
  * the ioq pointer from that. This is used by only single queue ioschedulers
  * for retrieving the queue associated with the group to decide whether the
  * new bio can do a front merge or not.
  */
-void *elv_get_sched_queue_current(struct request_queue *q)
+void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio)
 {
 	/* Fair queuing is not enabled. There is only one queue. */
 	if (!elv_iosched_fair_queuing_enabled(q->elevator))
 		return q->elevator->sched_queue;
 
-	return ioq_sched_queue(elv_lookup_ioq_current(q));
+	return ioq_sched_queue(elv_lookup_ioq_bio(q, bio));
 }
-EXPORT_SYMBOL(elv_get_sched_queue_current);
+EXPORT_SYMBOL(elv_get_sched_queue_bio);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index dda7951..cf6b752 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -23,7 +23,7 @@ typedef struct request *(elevator_request_list_fn) (struct request_queue *, stru
 typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
 typedef int (elevator_may_queue_fn) (struct request_queue *, int);
 
-typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
+typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, struct bio *bio, gfp_t);
 typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
@@ -148,7 +148,8 @@ extern void elv_unregister_queue(struct request_queue *q);
 extern int elv_may_queue(struct request_queue *, int);
 extern void elv_abort_queue(struct request_queue *);
 extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *, struct request *,
+					struct bio *bio, gfp_t);
 extern void elv_put_request(struct request_queue *, struct request *);
 extern void elv_drain_elevator(struct request_queue *);
 
@@ -277,6 +278,20 @@ static inline int elv_iosched_single_ioq(struct elevator_queue *e)
 #endif /* ELV_IOSCHED_FAIR_QUEUING */
 extern void *elv_get_sched_queue(struct request_queue *q, struct request *rq);
 extern void *elv_select_sched_queue(struct request_queue *q, int force);
-extern void *elv_get_sched_queue_current(struct request_queue *q);
+extern void *elv_get_sched_queue_bio(struct request_queue *q, struct bio *bio);
+
+/*
+ * This is the equivalent of the rq_is_sync()/cfq_bio_sync() functions, which
+ * determine whether an rq/bio is sync or not. There are cases, such as during
+ * merging and during request allocation, where we have a bio but no rq and
+ * need to find out whether the bio will be considered sync or async by the
+ * elevator/iosched. This function is useful in such cases.
+ */
+static inline int elv_bio_sync(struct bio *bio)
+{
+	if ((bio_data_dir(bio) == READ) || bio_sync(bio))
+		return 1;
+	return 0;
+}
 #endif /* CONFIG_BLOCK */
 #endif
-- 
1.6.0.6


* [PATCH 21/25] io-controller: Per cgroup request descriptor support
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (19 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 20/25] io-controller: map async requests to appropriate cgroup Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 22/25] io-controller: Per io group bdi congestion interface Vivek Goyal
                     ` (6 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o Currently a request queue has a fixed number of request descriptors for
  sync and async requests. Once the request descriptors are consumed, new
  processes are put to sleep and they effectively become serialized. Because
  sync and async queues are separate, async requests don't impact sync ones,
  but fairness between async requests is not achievable if request queue
  descriptors become the bottleneck.

o Make request descriptors per io group so that if there is lots of IO
  going on in one cgroup, it does not impact the IO of other groups.

o This is just one relatively simple way of doing things. This patch will
  probably change after feedback. Folks have raised concerns that in a
  hierarchical setup, a child's request descriptors should be capped by the
  parent's request descriptors. Maybe we need per cgroup per device files
  in cgroups where one can specify the upper limit of request descriptors,
  and whenever a cgroup is created one needs to assign a request descriptor
  limit, making sure the total of the children's request descriptors is not
  more than the parent's.

  I guess something like the memory controller. Anyway, that would be the next
  step. For the time being, we have implemented something simpler as follows.

o This patch implements per cgroup request descriptors. The request pool per
  queue is still common, but every group has its own wait list and its own
  count of request descriptors allocated to that group for sync and async
  queues. So effectively request_list becomes a per io group property and not
  a global request queue feature.

o Currently one can define q->nr_requests to limit request descriptors
  allocated for the queue. Now there is another tunable, q->nr_group_requests,
  which controls the request descriptor limit per group. q->nr_requests
  supersedes q->nr_group_requests to make sure that if there are lots of
  groups present, we don't end up allocating too many request descriptors on
  the queue.
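
  The two-level accounting described above can be sketched in userspace as
  follows. This is a simplified model under assumptions, not the patch's
  get_request() path: struct model_queue, struct model_group and the two
  helpers are hypothetical names, and batching, congestion thresholds, wait
  lists and starvation handling are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the queue-wide state (q->rq_data in the patch). */
struct model_queue {
	unsigned int nr_requests;	/* queue-wide descriptor limit */
	unsigned int nr_group_requests;	/* per-group descriptor limit */
	unsigned int count[2];		/* allocated, indexed by sync/async */
};

/* Hypothetical model of the per-group request list counters. */
struct model_group {
	unsigned int count[2];		/* allocated to this group */
};

/*
 * Try to account one descriptor to group g. The queue-wide limit is
 * checked first, so it wins over the per-group limit.
 */
static bool model_get_request(struct model_queue *q, struct model_group *g,
			      int dir)
{
	if (q->count[dir] >= q->nr_requests)
		return false;		/* queue as a whole is full */
	if (g->count[dir] >= q->nr_group_requests)
		return false;		/* this group used up its share */
	q->count[dir]++;
	g->count[dir]++;
	return true;
}

/* Release one descriptor previously accounted to group g. */
static void model_put_request(struct model_queue *q, struct model_group *g,
			      int dir)
{
	/* Models the BUG_ON() on the queue-wide count in freed_request(). */
	assert(q->count[dir] > 0);
	/* The group count is only decremented while it is above zero. */
	if (g->count[dir] > 0)
		g->count[dir]--;
	q->count[dir]--;
}
```

  In this model, with nr_requests = 3 and nr_group_requests = 2, one group
  can take at most two descriptors per direction, and a second group is
  refused once the queue-wide total reaches three, even though it is still
  under its own limit.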

Signed-off-by: Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-core.c       |  305 +++++++++++++++++++++++++++++++++++++----------
 block/blk-settings.c   |    1 +
 block/blk-sysfs.c      |   58 +++++++--
 block/elevator-fq.c    |   14 +++
 block/elevator-fq.h    |    5 +
 block/elevator.c       |    7 +-
 include/linux/blkdev.h |   87 +++++++++++++-
 7 files changed, 395 insertions(+), 82 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d8b4dd..2035c20 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -460,20 +460,30 @@ void blk_cleanup_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
 
 	rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
-	rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
-	rl->elvpriv = 0;
 	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
 	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
 
-	rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
-				mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+	/*
+	 * Initialize the queue request list in case there are non-hierarchical
+	 * io schedulers not making use of fair queuing infrastructure.
+	 *
+	 * For ioschedulers making use of fair queuing infrastructure, request
+	 * list is inside the associated group and when that group is
+	 * instantiated, it takes care of initializing the request list also.
+	 */
+	blk_init_request_list(&q->rq);
+	q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+				mempool_alloc_slab, mempool_free_slab,
+				request_cachep, q->node);
 
-	if (!rl->rq_pool)
+	if (!q->rq_data.rq_pool)
 		return -ENOMEM;
 
 	return 0;
@@ -575,6 +585,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 		return NULL;
 	}
 
+	/* init starved waiter wait queue */
+	init_waitqueue_head(&q->rq_data.starved_wait);
+
 	/*
 	 * if caller didn't supply a lock, they get per-queue locking with
 	 * our embedded lock
@@ -624,14 +637,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 {
 	if (rq->cmd_flags & REQ_ELVPRIV)
 		elv_put_request(q, rq);
-	mempool_free(rq, q->rq.rq_pool);
+	mempool_free(rq, q->rq_data.rq_pool);
 }
 
 static struct request *
 blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 					gfp_t gfp_mask)
 {
-	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
 
 	if (!rq)
 		return NULL;
@@ -642,7 +655,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 
 	if (priv) {
 		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
+			mempool_free(rq, q->rq_data.rq_pool);
 			return NULL;
 		}
 		rq->cmd_flags |= REQ_ELVPRIV;
@@ -685,18 +698,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
 	ioc->last_waited = jiffies;
 }
 
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
-	if (rl->count[sync] < queue_congestion_off_threshold(q))
+	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
-	if (rl->count[sync] + 1 <= q->nr_requests) {
+	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+		blk_clear_queue_full(q, sync);
+
+	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
-
-		blk_clear_queue_full(q, sync);
 	}
 }
 
@@ -704,63 +717,133 @@ static void __freed_request(struct request_queue *q, int sync)
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, int sync, int priv)
-{
-	struct request_list *rl = &q->rq;
+static void freed_request(struct request_queue *q, int sync, int priv,
+					struct request_list *rl)
+{
+	/* There is a window during request allocation where request is
+	 * mapped to one group but by the time a queue for the group is
+	 * allocated, it is possible that original cgroup/io group has been
+	 * deleted and now io queue is allocated in a different group (root)
+	 * altogether.
+	 *
+	 * One solution to the problem is that rq should take io group
+	 * reference. But it looks too much to do that to solve this issue.
+	 * The only side effect of this hard-to-hit issue seems to be that
+	 * we will try to decrement the rl->count for a request list which
+	 * did not allocate that request. Check for rl->count going less than
+	 * zero and do not decrement it if that's the case.
+	 */
+
+	if (priv && rl->count[sync] > 0)
+		rl->count[sync]--;
+
+	BUG_ON(!q->rq_data.count[sync]);
+	q->rq_data.count[sync]--;
 
-	rl->count[sync]--;
 	if (priv)
-		rl->elvpriv--;
+		q->rq_data.elvpriv--;
 
-	__freed_request(q, sync);
+	__freed_request(q, sync, rl);
 
 	if (unlikely(rl->starved[sync ^ 1]))
-		__freed_request(q, sync ^ 1);
+		__freed_request(q, sync ^ 1, rl);
+
+	/* Wake up the starved process on global list, if any */
+	if (unlikely(q->rq_data.starved)) {
+		if (waitqueue_active(&q->rq_data.starved_wait))
+			wake_up(&q->rq_data.starved_wait);
+		q->rq_data.starved--;
+	}
+}
+
+/*
+ * Returns whether one can sleep on this request list or not. There are
+ * cases (elevator switch) where request list might not have allocated
+ * any request descriptor but we deny request allocation due to global
+ * limits. In that case one should sleep on global list as on this request
+ * list no wakeup will take place.
+ *
+ * Also sets the request list starved flag if there are no requests pending
+ * in the direction of rq.
+ *
+ * Return 1 --> sleep on request list, 0 --> sleep on global list
+ */
+static int can_sleep_on_request_list(struct request_list *rl, int is_sync)
+{
+	if (unlikely(rl->count[is_sync] == 0)) {
+		/*
+		 * If there is a request pending in other direction
+		 * in same io group, then set the starved flag of
+		 * the group request list. Otherwise, we need to
+		 * make this process sleep in global starved list
+		 * to make sure it will not sleep indefinitely.
+		 */
+		if (rl->count[is_sync ^ 1] != 0) {
+			rl->starved[is_sync] = 1;
+			return 1;
+		} else
+			return 0;
+	}
+
+	return 1;
 }
 
 /*
  * Get a free request, queue_lock must be held.
- * Returns NULL on failure, with queue_lock held.
+ * Returns NULL on failure, with queue_lock held. Also sets the "reason" field
+ * in case of failure. This reason field helps caller decide whether to sleep
+ * on per group list or global per queue list.
+ * reason = 0 sleep on per group list
+ * reason = 1 sleep on global list
+ *
  * Returns !NULL on success, with queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, int rw_flags,
-				   struct bio *bio, gfp_t gfp_mask)
+					struct bio *bio, gfp_t gfp_mask,
+					struct request_list *rl, int *reason)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
+	int sleep_on_global = 0;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
-	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
-		if (rl->count[is_sync]+1 >= q->nr_requests) {
-			ioc = current_io_context(GFP_ATOMIC, q->node);
-			/*
-			 * The queue will fill after this allocation, so set
-			 * it as full, and mark this process as "batching".
-			 * This process will be allowed to complete a batch of
-			 * requests, others will be blocked.
-			 */
-			if (!blk_queue_full(q, is_sync)) {
-				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
-			} else {
-				if (may_queue != ELV_MQUEUE_MUST
-						&& !ioc_batching(q, ioc)) {
-					/*
-					 * The queue is full and the allocating
-					 * process is not a "batcher", and not
-					 * exempted by the IO scheduler
-					 */
-					goto out;
-				}
+	if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+		blk_set_queue_congested(q, is_sync);
+
+	/*
+	 * Looks like there is no user of queue full now.
+	 * Keeping it for time being.
+	 */
+	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+		blk_set_queue_full(q, is_sync);
+
+	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+		ioc = current_io_context(GFP_ATOMIC, q->node);
+		/*
+		 * The queue request descriptor group will fill after this
+		 * allocation, so set
+		 * it as full, and mark this process as "batching".
+		 * This process will be allowed to complete a batch of
+		 * requests, others will be blocked.
+		 */
+		if (rl->count[is_sync] <= q->nr_group_requests)
+			ioc_set_batching(q, ioc);
+		else {
+			if (may_queue != ELV_MQUEUE_MUST
+					&& !ioc_batching(q, ioc)) {
+				/*
+				 * The queue is full and the allocating
+				 * process is not a "batcher", and not
+				 * exempted by the IO scheduler
+				 */
+				goto out;
 			}
 		}
-		blk_set_queue_congested(q, is_sync);
 	}
 
 	/*
@@ -768,21 +851,60 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	 * limit of requests, otherwise we could have thousands of requests
 	 * allocated with any setting of ->nr_requests
 	 */
-	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2)) {
+		/*
+		 * Queue is too full for allocation. On which request queue
+		 * the task should sleep? Generally it should sleep on its
+		 * request list but if elevator switch is happening, in that
+		 * window, request descriptors are allocated from global
+		 * pool and are not accounted against any particular request
+		 * list as group is going away.
+		 *
+		 * So it might happen that request list does not have any
+		 * requests allocated at all and if process sleeps on per
+		 * group request list, it will not be woken up. In such case,
+		 * make it sleep on global starved list.
+		 */
+		if (test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags)
+		    || !can_sleep_on_request_list(rl, is_sync))
+			sleep_on_global = 1;
+		goto out;
+	}
+
+	/*
+	 * Allocation of request is allowed from queue perspective. Now check
+	 * from per group request list
+	 */
+
+	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
 		goto out;
 
-	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
+	q->rq_data.count[is_sync]++;
+
 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	if (priv) {
+		q->rq_data.elvpriv++;
+		/*
+		 * Account the request to request list only if request is
+		 * going to elevator. During elevator switch, there will
+		 * be small window where group is going away and new group
+		 * will not be allocated till elevator switch is complete.
+		 * So till then instead of slowing down the application,
+		 * we will continue to allocate request from total common
+		 * pool instead of per group limit
+		 */
+		rl->count[is_sync]++;
+	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
 	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
@@ -792,7 +914,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 
 		/*
 		 * in the very unlikely event that allocation failed and no
@@ -802,9 +924,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * rq mempool into READ and WRITE
 		 */
 rq_starved:
-		if (unlikely(rl->count[is_sync] == 0))
-			rl->starved[is_sync] = 1;
-
+		if (!can_sleep_on_request_list(rl, is_sync))
+			sleep_on_global = 1;
 		goto out;
 	}
 
@@ -819,6 +940,8 @@ rq_starved:
 
 	trace_block_getrq(q, bio, rw_flags & 1);
 out:
+	if (reason && sleep_on_global)
+		*reason = 1;
 	return rq;
 }
 
@@ -832,16 +955,44 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 					struct bio *bio)
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
+	int sleep_on_global = 0;
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, bio);
+	struct io_group *iog = NULL;
 
-	rq = get_request(q, rw_flags, bio, GFP_NOIO);
+	rq = get_request(q, rw_flags, bio, GFP_NOIO, rl, &sleep_on_global);
 	while (!rq) {
 		DEFINE_WAIT(wait);
 		struct io_context *ioc;
-		struct request_list *rl = &q->rq;
 
-		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
-				TASK_UNINTERRUPTIBLE);
+		if (sleep_on_global) {
+			/*
+			 * Task failed allocation and needs to wait and
+			 * try again. There are no requests pending from
+			 * the io group hence need to sleep on global
+			 * wait queue. Most likely the allocation failed
+			 * because of memory issues.
+			 */
+
+			q->rq_data.starved++;
+			prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+					&wait, TASK_UNINTERRUPTIBLE);
+		} else {
+			/*
+			 * We are about to sleep on a request list and we
+			 * drop queue lock. After waking up, we will do
+			 * finish_wait() on request list and in the mean
+			 * time group might be gone. Take a reference to
+			 * the group now.
+			 */
+			prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+					TASK_UNINTERRUPTIBLE);
+#ifdef CONFIG_GROUP_IOSCHED
+			iog = rl_iog(rl);
+			if (iog)
+				elv_get_iog(iog);
+#endif
+		}
 
 		trace_block_sleeprq(q, bio, rw_flags & 1);
 
@@ -859,9 +1010,30 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		ioc_set_batching(q, ioc);
 
 		spin_lock_irq(q->queue_lock);
-		finish_wait(&rl->wait[is_sync], &wait);
 
-		rq = get_request(q, rw_flags, bio, GFP_NOIO);
+		if (sleep_on_global) {
+			finish_wait(&q->rq_data.starved_wait, &wait);
+			sleep_on_global = 0;
+		} else {
+			finish_wait(&rl->wait[is_sync], &wait);
+#ifdef CONFIG_GROUP_IOSCHED
+			/*
+			 * We had taken a reference to the rl/iog.
+			 * Put that now
+			 */
+			iog = rl_iog(rl);
+			if (iog)
+				elv_put_iog(iog);
+#endif
+		}
+
+		/*
+		 * After the sleep check the rl again in case cgroup bio
+		 * belonged to is gone and it is mapped to root group now
+		 */
+		rl = blk_get_request_list(q, bio);
+		rq = get_request(q, rw_flags, bio, GFP_NOIO, rl,
+					&sleep_on_global);
 	};
 
 	return rq;
@@ -870,14 +1042,16 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	struct request *rq;
+	struct request_list *rl;
 
 	BUG_ON(rw != READ && rw != WRITE);
 
 	spin_lock_irq(q->queue_lock);
+	rl = blk_get_request_list(q, NULL);
 	if (gfp_mask & __GFP_WAIT) {
 		rq = get_request_wait(q, rw, NULL);
 	} else {
-		rq = get_request(q, rw, NULL, gfp_mask);
+		rq = get_request(q, rw, NULL, gfp_mask, rl, NULL);
 		if (!rq)
 			spin_unlock_irq(q->queue_lock);
 	}
@@ -1094,12 +1268,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	if (req->cmd_flags & REQ_ALLOCED) {
 		int is_sync = rq_is_sync(req) != 0;
 		int priv = req->cmd_flags & REQ_ELVPRIV;
+		struct request_list *rl = rq_rl(q, req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
 		blk_free_request(q, req);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index bd582a7..78b8aec 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -148,6 +148,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	 * set defaults
 	 */
 	q->nr_requests = BLKDEV_MAX_RQ;
+	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
 
 	q->make_request_fn = mfn;
 	blk_queue_dma_alignment(q, 511);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b1cd040..577ed42 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -38,42 +38,66 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
-	struct request_list *rl = &q->rq;
+	struct request_list *rl;
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
 	if (nr < BLKDEV_MIN_RQ)
 		nr = BLKDEV_MIN_RQ;
 
 	spin_lock_irq(q->queue_lock);
+	rl = blk_get_request_list(q, NULL);
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
 
-	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_SYNC);
-	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_SYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_SYNC);
 
-	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_ASYNC);
-	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_ASYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
-	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_SYNC);
-	} else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_SYNC);
 		wake_up(&rl->wait[BLK_RW_SYNC]);
 	}
 
-	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_ASYNC);
-	} else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_ASYNC);
 		wake_up(&rl->wait[BLK_RW_ASYNC]);
 	}
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+					size_t count)
+{
+	unsigned long nr;
+	int ret = queue_var_store(&nr, page, count);
+	if (nr < BLKDEV_MIN_RQ)
+		nr = BLKDEV_MIN_RQ;
+
+	spin_lock_irq(q->queue_lock);
+	q->nr_group_requests = nr;
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+#endif
 
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
 {
@@ -239,6 +263,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
 	.store = queue_requests_store,
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_group_requests_show,
+	.store = queue_group_requests_store,
+};
+#endif
+
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -313,6 +345,9 @@ static struct queue_sysfs_entry queue_iostats_entry = {
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+	&queue_group_requests_entry.attr,
+#endif
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
@@ -392,12 +427,11 @@ static void blk_release_queue(struct kobject *kobj)
 {
 	struct request_queue *q =
 		container_of(kobj, struct request_queue, kobj);
-	struct request_list *rl = &q->rq;
 
 	blk_sync_queue(q);
 
-	if (rl->rq_pool)
-		mempool_destroy(rl->rq_pool);
+	if (q->rq_data.rq_pool)
+		mempool_destroy(q->rq_data.rq_pool);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 899972c..c4a2b1e 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1092,6 +1092,16 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+	struct io_group *iog;
+
+	iog = io_get_io_group_bio(q, bio, 1);
+	BUG_ON(!iog);
+	return &iog->rl;
+}
+
 /*
  * Search the io_group for efqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1390,6 +1400,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		 */
 		elv_get_iog(iog);
 
+		blk_init_request_list(&iog->rl);
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -1667,6 +1679,8 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
 
+	blk_init_request_list(&iog->rl);
+
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index bb43444..74a7393 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -258,6 +258,9 @@ struct io_group {
 
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
+
+	/* request list associated with the group */
+	struct request_list rl;
 };
 
 /**
@@ -526,6 +529,8 @@ extern void elv_fq_unset_request_ioq(struct request_queue *q,
 					struct request *rq);
 extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 						struct bio *bio);
+extern struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio);
 
 /* Sets the single ioq associated with the io group. (noop, deadline, AS) */
 static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
diff --git a/block/elevator.c b/block/elevator.c
index 68d5a80..38a118b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -646,7 +646,7 @@ void elv_quiesce_start(struct request_queue *q)
 	 * make sure we don't have any requests in flight
 	 */
 	elv_drain_elevator(q);
-	while (q->rq.elvpriv) {
+	while (q->rq_data.elvpriv) {
 		__blk_run_queue(q);
 		spin_unlock_irq(q->queue_lock);
 		msleep(10);
@@ -745,8 +745,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 	}
 
 	if (unplug_it && blk_queue_plugged(q)) {
-		int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
-				- queue_in_flight(q);
+		int nrq = q->rq_data.count[BLK_RW_SYNC] +
+				q->rq_data.count[BLK_RW_ASYNC] -
+				queue_in_flight(q);
 
 		if (nrq >= q->unplug_thresh)
 			__generic_unplug_device(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 551e17d..655440c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,21 +32,51 @@ struct request;
 struct sg_io_hdr;
 
 #define BLKDEV_MIN_RQ	4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ	512	/* Default maximum for queue */
+#define BLKDEV_MAX_GROUP_RQ    128      /* Default maximum per group*/
+#else
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+/*
+ * This is equivalent to case of only one group present (root group). Let
+ * it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
+#endif
 
 struct request;
 typedef void (rq_end_io_fn)(struct request *, int);
 
 struct request_list {
 	/*
-	 * count[], starved[], and wait[] are indexed by
+	 * count[], starved and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
 	 */
 	int count[2];
 	int starved[2];
+	wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+	/*
+	 * Per queue request descriptor count. This is in addition to per
+	 * cgroup count
+	 */
+	int count[2];
 	int elvpriv;
 	mempool_t *rq_pool;
-	wait_queue_head_t wait[2];
+	int starved;
+	/*
+	 * Global list for starved tasks. A task will be queued here if
+	 * it could not allocate request descriptor and the associated
+	 * group request list does not have any requests pending.
+	 */
+	wait_queue_head_t starved_wait;
 };
 
 /*
@@ -355,6 +385,9 @@ struct request_queue
 	 */
 	struct request_list	rq;
 
+	/* Contains request pool and other data like starved data */
+	struct request_data	rq_data;
+
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
 	prep_rq_fn		*prep_rq_fn;
@@ -416,6 +449,8 @@ struct request_queue
 	 * queue settings
 	 */
 	unsigned long		nr_requests;	/* Max # of requests */
+	/* Max # of per io group requests */
+	unsigned long		nr_group_requests;
 	unsigned int		nr_congestion_on;
 	unsigned int		nr_congestion_off;
 	unsigned int		nr_batching;
@@ -795,6 +830,54 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern void blk_init_request_list(struct request_list *rl);
+
+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return &q->rq;
+
+	return io_group_get_request_list(q, bio);
+#else
+	return &q->rq;
+#endif
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+						struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	struct io_group *iog;
+	int priv = rq->cmd_flags & REQ_ELVPRIV;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return &q->rq;
+
+	BUG_ON(priv && !rq->ioq);
+
+	if (priv)
+		iog = ioq_to_io_group(rq->ioq);
+	else
+		iog = q->elevator->efqd.root_group;
+
+	BUG_ON(!iog);
+	return &iog->rl;
+#else
+	return &q->rq;
+#endif
+}
+
+static inline struct io_group *rl_iog(struct request_list *rl)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	return container_of(rl, struct io_group, rl);
+#else
+	return NULL;
+#endif
+}
+
 /*
  * A queue has just exitted congestion.  Note this in the global counter of
  * congested queues, and wake up anyone who was waiting for requests to be
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 21/25] io-controller: Per cgroup request descriptor support
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

o Currently a request queue has a fixed number of request descriptors for
  sync and async requests. Once the request descriptors are consumed, new
  processes are put to sleep and they effectively become serialized. Because
  sync and async queues are separate, async requests don't impact sync ones,
  but if one is looking for fairness between async requests, that is not
  achievable if request queue descriptors become the bottleneck.

o Make request descriptors per io group so that if there is lots of IO
  going on in one cgroup, it does not impact the IO of other groups.

o This is just one relatively simple way of doing things. This patch will
  probably change after the feedback. Folks have raised concerns that in
  a hierarchical setup, a child's request descriptors should be capped by the
  parent's request descriptors. Maybe we need to have per cgroup per device
  files in cgroups where one can specify the upper limit of request
  descriptors, and whenever a cgroup is created one needs to assign a request
  descriptor limit, making sure the total sum of the children's request
  descriptors is not more than that of the parent.

  I guess something like the memory controller. Anyway, that would be the next
  step. For the time being, we have implemented something simpler as follows.

o This patch implements the per cgroup request descriptors. The request pool
  per queue is still common, but every group will have its own wait list and
  its own count of request descriptors allocated to that group for sync and
  async queues. So effectively request_list becomes a per io group property
  and not a global request queue feature.

o Currently one can define q->nr_requests to limit request descriptors
  allocated for the queue. Now there is another tunable q->nr_group_requests
  which controls the request descriptor limit per group. q->nr_requests
  supersedes q->nr_group_requests to make sure that if there are lots of
  groups present, we don't end up allocating too many request descriptors on
  the queue.
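
  As a rough illustration of how the two limits interact, here is a minimal
  userspace sketch. It is not the kernel code: the names queue_model,
  group_rl, model_try_get and model_put are hypothetical, and it ignores the
  sync/async split, batching, the elevator-switch case and the wait queues.
  It only models the hard caps that get_request() applies in this patch
  (3 * limit / 2 on the queue-wide count first, then on the per group count).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one io group's request list (one direction only). */
struct group_rl {
	int count;			/* descriptors charged to this group */
};

/* Hypothetical model of the per queue data and tunables. */
struct queue_model {
	int count;			/* descriptors charged to the whole queue */
	int nr_requests;		/* queue-wide limit */
	int nr_group_requests;		/* per group limit */
};

/*
 * Try to charge one descriptor. The queue-wide cap is checked first, so it
 * takes precedence over the per group cap when many groups are active.
 */
bool model_try_get(struct queue_model *q, struct group_rl *rl)
{
	if (q->count >= 3 * q->nr_requests / 2)
		return false;		/* queue as a whole is too full */

	if (rl->count >= 3 * q->nr_group_requests / 2)
		return false;		/* this group is too full */

	q->count++;
	rl->count++;
	return true;
}

/* Release one descriptor, mirroring the guarded decrement in freed_request(). */
void model_put(struct queue_model *q, struct group_rl *rl)
{
	assert(q->count > 0);
	q->count--;

	/* Do not let the group count go below zero. */
	if (rl->count > 0)
		rl->count--;
}
```

  With nr_requests = 8 and nr_group_requests = 4 in this model, a single group
  is refused after 6 descriptors, and once two groups hold 6 each, a third
  group is refused by the queue-wide cap even though its own count is zero.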

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c       |  305 +++++++++++++++++++++++++++++++++++++----------
 block/blk-settings.c   |    1 +
 block/blk-sysfs.c      |   58 +++++++--
 block/elevator-fq.c    |   14 +++
 block/elevator-fq.h    |    5 +
 block/elevator.c       |    7 +-
 include/linux/blkdev.h |   87 +++++++++++++-
 7 files changed, 395 insertions(+), 82 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d8b4dd..2035c20 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -460,20 +460,30 @@ void blk_cleanup_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
 
 	rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
-	rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
-	rl->elvpriv = 0;
 	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
 	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
 
-	rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
-				mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+	/*
+	 * Initialize the queue request list in case there are non-hierarchical
+	 * io schedulers not making use of fair queuing infrastructure.
+	 *
+	 * For ioschedulers making use of fair queuing infrastructure, request
+	 * list is inside the associated group and when that group is
+	 * instantiated, it takes care of initializing the request list also.
+	 */
+	blk_init_request_list(&q->rq);
+	q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+				mempool_alloc_slab, mempool_free_slab,
+				request_cachep, q->node);
 
-	if (!rl->rq_pool)
+	if (!q->rq_data.rq_pool)
 		return -ENOMEM;
 
 	return 0;
@@ -575,6 +585,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 		return NULL;
 	}
 
+	/* init starved waiter wait queue */
+	init_waitqueue_head(&q->rq_data.starved_wait);
+
 	/*
 	 * if caller didn't supply a lock, they get per-queue locking with
 	 * our embedded lock
@@ -624,14 +637,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 {
 	if (rq->cmd_flags & REQ_ELVPRIV)
 		elv_put_request(q, rq);
-	mempool_free(rq, q->rq.rq_pool);
+	mempool_free(rq, q->rq_data.rq_pool);
 }
 
 static struct request *
 blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 					gfp_t gfp_mask)
 {
-	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
 
 	if (!rq)
 		return NULL;
@@ -642,7 +655,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 
 	if (priv) {
 		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
+			mempool_free(rq, q->rq_data.rq_pool);
 			return NULL;
 		}
 		rq->cmd_flags |= REQ_ELVPRIV;
@@ -685,18 +698,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
 	ioc->last_waited = jiffies;
 }
 
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
-	if (rl->count[sync] < queue_congestion_off_threshold(q))
+	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
-	if (rl->count[sync] + 1 <= q->nr_requests) {
+	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+		blk_clear_queue_full(q, sync);
+
+	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
-
-		blk_clear_queue_full(q, sync);
 	}
 }
 
@@ -704,63 +717,133 @@ static void __freed_request(struct request_queue *q, int sync)
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, int sync, int priv)
-{
-	struct request_list *rl = &q->rq;
+static void freed_request(struct request_queue *q, int sync, int priv,
+					struct request_list *rl)
+{
+	/* There is a window during request allocation where request is
+	 * mapped to one group but by the time a queue for the group is
+	 * allocated, it is possible that original cgroup/io group has been
+	 * deleted and now io queue is allocated in a different group (root)
+	 * altogether.
+	 *
+	 * One solution to the problem is that rq should take io group
+	 * reference. But it looks too much to do that to solve this issue.
+	 * The only side effect of this hard-to-hit issue seems to be that
+	 * we will try to decrement the rl->count for a request list which
+	 * did not allocate that request. Check for rl->count going less than
+	 * zero and do not decrement it if that's the case.
+	 */
+
+	if (priv && rl->count[sync] > 0)
+		rl->count[sync]--;
+
+	BUG_ON(!q->rq_data.count[sync]);
+	q->rq_data.count[sync]--;
 
-	rl->count[sync]--;
 	if (priv)
-		rl->elvpriv--;
+		q->rq_data.elvpriv--;
 
-	__freed_request(q, sync);
+	__freed_request(q, sync, rl);
 
 	if (unlikely(rl->starved[sync ^ 1]))
-		__freed_request(q, sync ^ 1);
+		__freed_request(q, sync ^ 1, rl);
+
+	/* Wake up the starved process on global list, if any */
+	if (unlikely(q->rq_data.starved)) {
+		if (waitqueue_active(&q->rq_data.starved_wait))
+			wake_up(&q->rq_data.starved_wait);
+		q->rq_data.starved--;
+	}
+}
+
+/*
+ * Returns whether one can sleep on this request list or not. There are
+ * cases (elevator switch) where request list might not have allocated
+ * any request descriptor but we deny request allocation due to global
+ * limits. In that case one should sleep on global list as on this request
+ * list no wakeup will take place.
+ *
+ * Also sets the request list starved flag if there are no requests pending
+ * in the direction of rq.
+ *
+ * Return 1 --> sleep on request list, 0 --> sleep on global list
+ */
+static int can_sleep_on_request_list(struct request_list *rl, int is_sync)
+{
+	if (unlikely(rl->count[is_sync] == 0)) {
+		/*
+		 * If there is a request pending in other direction
+		 * in same io group, then set the starved flag of
+		 * the group request list. Otherwise, we need to
+		 * make this process sleep in global starved list
+		 * to make sure it will not sleep indefinitely.
+		 */
+		if (rl->count[is_sync ^ 1] != 0) {
+			rl->starved[is_sync] = 1;
+			return 1;
+		} else
+			return 0;
+	}
+
+	return 1;
 }
 
 /*
  * Get a free request, queue_lock must be held.
- * Returns NULL on failure, with queue_lock held.
+ * Returns NULL on failure, with queue_lock held. Also sets the "reason" field
+ * in case of failure. This reason field helps caller decide whether to sleep
+ * on per group list or global per queue list.
+ * reason = 0 sleep on per group list
+ * reason = 1 sleep on global list
+ *
  * Returns !NULL on success, with queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, int rw_flags,
-				   struct bio *bio, gfp_t gfp_mask)
+					struct bio *bio, gfp_t gfp_mask,
+					struct request_list *rl, int *reason)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
+	int sleep_on_global = 0;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
-	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
-		if (rl->count[is_sync]+1 >= q->nr_requests) {
-			ioc = current_io_context(GFP_ATOMIC, q->node);
-			/*
-			 * The queue will fill after this allocation, so set
-			 * it as full, and mark this process as "batching".
-			 * This process will be allowed to complete a batch of
-			 * requests, others will be blocked.
-			 */
-			if (!blk_queue_full(q, is_sync)) {
-				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
-			} else {
-				if (may_queue != ELV_MQUEUE_MUST
-						&& !ioc_batching(q, ioc)) {
-					/*
-					 * The queue is full and the allocating
-					 * process is not a "batcher", and not
-					 * exempted by the IO scheduler
-					 */
-					goto out;
-				}
+	if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+		blk_set_queue_congested(q, is_sync);
+
+	/*
+	 * Looks like there is no user of queue full now.
+	 * Keeping it for time being.
+	 */
+	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+		blk_set_queue_full(q, is_sync);
+
+	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+		ioc = current_io_context(GFP_ATOMIC, q->node);
+		/*
+		 * The queue request descriptor group will fill after this
+		 * allocation, so set
+		 * it as full, and mark this process as "batching".
+		 * This process will be allowed to complete a batch of
+		 * requests, others will be blocked.
+		 */
+		if (rl->count[is_sync] <= q->nr_group_requests)
+			ioc_set_batching(q, ioc);
+		else {
+			if (may_queue != ELV_MQUEUE_MUST
+					&& !ioc_batching(q, ioc)) {
+				/*
+				 * The queue is full and the allocating
+				 * process is not a "batcher", and not
+				 * exempted by the IO scheduler
+				 */
+				goto out;
 			}
 		}
-		blk_set_queue_congested(q, is_sync);
 	}
 
 	/*
@@ -768,21 +851,60 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	 * limit of requests, otherwise we could have thousands of requests
 	 * allocated with any setting of ->nr_requests
 	 */
-	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2)) {
+		/*
+		 * Queue is too full for allocation. On which request queue
+		 * the task should sleep? Generally it should sleep on its
+		 * request list but if elevator switch is happening, in that
+		 * window, request descriptors are allocated from global
+		 * pool and are not accounted against any particular request
+		 * list as group is going away.
+		 *
+		 * So it might happen that request list does not have any
+		 * requests allocated at all and if process sleeps on per
+		 * group request list, it will not be woken up. In such case,
+		 * make it sleep on global starved list.
+		 */
+		if (test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags)
+		    || !can_sleep_on_request_list(rl, is_sync))
+			sleep_on_global = 1;
+		goto out;
+	}
+
+	/*
+	 * Allocation of request is allowed from queue perspective. Now check
+	 * from per group request list
+	 */
+
+	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
 		goto out;
 
-	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
+	q->rq_data.count[is_sync]++;
+
 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	if (priv) {
+		q->rq_data.elvpriv++;
+		/*
+		 * Account the request to request list only if request is
+		 * going to elevator. During elevator switch, there will
+		 * be small window where group is going away and new group
+		 * will not be allocated till elevator switch is complete.
+		 * So till then instead of slowing down the application,
+		 * we will continue to allocate request from total common
+		 * pool instead of per group limit
+		 */
+		rl->count[is_sync]++;
+	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
 	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
@@ -792,7 +914,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 
 		/*
 		 * in the very unlikely event that allocation failed and no
@@ -802,9 +924,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * rq mempool into READ and WRITE
 		 */
 rq_starved:
-		if (unlikely(rl->count[is_sync] == 0))
-			rl->starved[is_sync] = 1;
-
+		if (!can_sleep_on_request_list(rl, is_sync))
+			sleep_on_global = 1;
 		goto out;
 	}
 
@@ -819,6 +940,8 @@ rq_starved:
 
 	trace_block_getrq(q, bio, rw_flags & 1);
 out:
+	if (reason && sleep_on_global)
+		*reason = 1;
 	return rq;
 }
 
@@ -832,16 +955,44 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 					struct bio *bio)
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
+	int sleep_on_global = 0;
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, bio);
+	struct io_group *iog = NULL;
 
-	rq = get_request(q, rw_flags, bio, GFP_NOIO);
+	rq = get_request(q, rw_flags, bio, GFP_NOIO, rl, &sleep_on_global);
 	while (!rq) {
 		DEFINE_WAIT(wait);
 		struct io_context *ioc;
-		struct request_list *rl = &q->rq;
 
-		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
-				TASK_UNINTERRUPTIBLE);
+		if (sleep_on_global) {
+			/*
+			 * Task failed allocation and needs to wait and
+			 * try again. There are no requests pending from
+			 * the io group hence need to sleep on global
+			 * wait queue. Most likely the allocation failed
+			 * because of memory issues.
+			 */
+
+			q->rq_data.starved++;
+			prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+					&wait, TASK_UNINTERRUPTIBLE);
+		} else {
+			/*
+			 * We are about to sleep on a request list and we
+			 * drop queue lock. After waking up, we will do
+			 * finish_wait() on request list and in the mean
+			 * time group might be gone. Take a reference to
+			 * the group now.
+			 */
+			prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+					TASK_UNINTERRUPTIBLE);
+#ifdef CONFIG_GROUP_IOSCHED
+			iog = rl_iog(rl);
+			if (iog)
+				elv_get_iog(iog);
+#endif
+		}
 
 		trace_block_sleeprq(q, bio, rw_flags & 1);
 
@@ -859,9 +1010,30 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		ioc_set_batching(q, ioc);
 
 		spin_lock_irq(q->queue_lock);
-		finish_wait(&rl->wait[is_sync], &wait);
 
-		rq = get_request(q, rw_flags, bio, GFP_NOIO);
+		if (sleep_on_global) {
+			finish_wait(&q->rq_data.starved_wait, &wait);
+			sleep_on_global = 0;
+		} else {
+			finish_wait(&rl->wait[is_sync], &wait);
+#ifdef CONFIG_GROUP_IOSCHED
+			/*
+			 * We had taken a reference to the rl/iog.
+			 * Put that now
+			 */
+			iog = rl_iog(rl);
+			if (iog)
+				elv_put_iog(iog);
+#endif
+		}
+
+		/*
+		 * After the sleep check the rl again in case cgroup bio
+		 * belonged to is gone and it is mapped to root group now
+		 */
+		rl = blk_get_request_list(q, bio);
+		rq = get_request(q, rw_flags, bio, GFP_NOIO, rl,
+					&sleep_on_global);
 	};
 
 	return rq;
@@ -870,14 +1042,16 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	struct request *rq;
+	struct request_list *rl;
 
 	BUG_ON(rw != READ && rw != WRITE);
 
 	spin_lock_irq(q->queue_lock);
+	rl = blk_get_request_list(q, NULL);
 	if (gfp_mask & __GFP_WAIT) {
 		rq = get_request_wait(q, rw, NULL);
 	} else {
-		rq = get_request(q, rw, NULL, gfp_mask);
+		rq = get_request(q, rw, NULL, gfp_mask, rl, NULL);
 		if (!rq)
 			spin_unlock_irq(q->queue_lock);
 	}
@@ -1094,12 +1268,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	if (req->cmd_flags & REQ_ALLOCED) {
 		int is_sync = rq_is_sync(req) != 0;
 		int priv = req->cmd_flags & REQ_ELVPRIV;
+		struct request_list *rl = rq_rl(q, req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
 		blk_free_request(q, req);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index bd582a7..78b8aec 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -148,6 +148,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	 * set defaults
 	 */
 	q->nr_requests = BLKDEV_MAX_RQ;
+	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
 
 	q->make_request_fn = mfn;
 	blk_queue_dma_alignment(q, 511);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b1cd040..577ed42 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -38,42 +38,66 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
-	struct request_list *rl = &q->rq;
+	struct request_list *rl;
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
 	if (nr < BLKDEV_MIN_RQ)
 		nr = BLKDEV_MIN_RQ;
 
 	spin_lock_irq(q->queue_lock);
+	rl = blk_get_request_list(q, NULL);
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
 
-	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_SYNC);
-	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_SYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_SYNC);
 
-	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_ASYNC);
-	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_ASYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
-	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_SYNC);
-	} else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_SYNC);
 		wake_up(&rl->wait[BLK_RW_SYNC]);
 	}
 
-	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_ASYNC);
-	} else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_ASYNC);
 		wake_up(&rl->wait[BLK_RW_ASYNC]);
 	}
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+					size_t count)
+{
+	unsigned long nr;
+	int ret = queue_var_store(&nr, page, count);
+	if (nr < BLKDEV_MIN_RQ)
+		nr = BLKDEV_MIN_RQ;
+
+	spin_lock_irq(q->queue_lock);
+	q->nr_group_requests = nr;
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+#endif
 
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
 {
@@ -239,6 +263,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
 	.store = queue_requests_store,
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_group_requests_show,
+	.store = queue_group_requests_store,
+};
+#endif
+
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -313,6 +345,9 @@ static struct queue_sysfs_entry queue_iostats_entry = {
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+	&queue_group_requests_entry.attr,
+#endif
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
@@ -392,12 +427,11 @@ static void blk_release_queue(struct kobject *kobj)
 {
 	struct request_queue *q =
 		container_of(kobj, struct request_queue, kobj);
-	struct request_list *rl = &q->rq;
 
 	blk_sync_queue(q);
 
-	if (rl->rq_pool)
-		mempool_destroy(rl->rq_pool);
+	if (q->rq_data.rq_pool)
+		mempool_destroy(q->rq_data.rq_pool);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 899972c..c4a2b1e 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1092,6 +1092,16 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+	struct io_group *iog;
+
+	iog = io_get_io_group_bio(q, bio, 1);
+	BUG_ON(!iog);
+	return &iog->rl;
+}
+
 /*
  * Search the io_group for efqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1390,6 +1400,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		 */
 		elv_get_iog(iog);
 
+		blk_init_request_list(&iog->rl);
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -1667,6 +1679,8 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
 
+	blk_init_request_list(&iog->rl);
+
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index bb43444..74a7393 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -258,6 +258,9 @@ struct io_group {
 
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
+
+	/* request list associated with the group */
+	struct request_list rl;
 };
 
 /**
@@ -526,6 +529,8 @@ extern void elv_fq_unset_request_ioq(struct request_queue *q,
 					struct request *rq);
 extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 						struct bio *bio);
+extern struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio);
 
 /* Sets the single ioq associated with the io group. (noop, deadline, AS) */
 static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
diff --git a/block/elevator.c b/block/elevator.c
index 68d5a80..38a118b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -646,7 +646,7 @@ void elv_quiesce_start(struct request_queue *q)
 	 * make sure we don't have any requests in flight
 	 */
 	elv_drain_elevator(q);
-	while (q->rq.elvpriv) {
+	while (q->rq_data.elvpriv) {
 		__blk_run_queue(q);
 		spin_unlock_irq(q->queue_lock);
 		msleep(10);
@@ -745,8 +745,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 	}
 
 	if (unplug_it && blk_queue_plugged(q)) {
-		int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
-				- queue_in_flight(q);
+		int nrq = q->rq_data.count[BLK_RW_SYNC] +
+				q->rq_data.count[BLK_RW_ASYNC] -
+				queue_in_flight(q);
 
 		if (nrq >= q->unplug_thresh)
 			__generic_unplug_device(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 551e17d..655440c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,21 +32,51 @@ struct request;
 struct sg_io_hdr;
 
 #define BLKDEV_MIN_RQ	4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ	512	/* Default maximum for queue */
+#define BLKDEV_MAX_GROUP_RQ    128      /* Default maximum per group*/
+#else
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+/*
+ * This is equivalent to case of only one group present (root group). Let
+ * it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
+#endif
 
 struct request;
 typedef void (rq_end_io_fn)(struct request *, int);
 
 struct request_list {
 	/*
-	 * count[], starved[], and wait[] are indexed by
+	 * count[], starved and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
 	 */
 	int count[2];
 	int starved[2];
+	wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of the mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+	/*
+	 * Per queue request descriptor count. This is in addition to per
+	 * cgroup count
+	 */
+	int count[2];
 	int elvpriv;
 	mempool_t *rq_pool;
-	wait_queue_head_t wait[2];
+	int starved;
+	/*
+	 * Global list for starved tasks. A task will be queued here if
+	 * it could not allocate request descriptor and the associated
+	 * group request list does not have any requests pending.
+	 */
+	wait_queue_head_t starved_wait;
 };
 
 /*
@@ -355,6 +385,9 @@ struct request_queue
 	 */
 	struct request_list	rq;
 
+	/* Contains request pool and other data like starved data */
+	struct request_data	rq_data;
+
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
 	prep_rq_fn		*prep_rq_fn;
@@ -416,6 +449,8 @@ struct request_queue
 	 * queue settings
 	 */
 	unsigned long		nr_requests;	/* Max # of requests */
+	/* Max # of per io group requests */
+	unsigned long		nr_group_requests;
 	unsigned int		nr_congestion_on;
 	unsigned int		nr_congestion_off;
 	unsigned int		nr_batching;
@@ -795,6 +830,54 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern void blk_init_request_list(struct request_list *rl);
+
+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return &q->rq;
+
+	return io_group_get_request_list(q, bio);
+#else
+	return &q->rq;
+#endif
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+						struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	struct io_group *iog;
+	int priv = rq->cmd_flags & REQ_ELVPRIV;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return &q->rq;
+
+	BUG_ON(priv && !rq->ioq);
+
+	if (priv)
+		iog = ioq_to_io_group(rq->ioq);
+	else
+		iog = q->elevator->efqd.root_group;
+
+	BUG_ON(!iog);
+	return &iog->rl;
+#else
+	return &q->rq;
+#endif
+}
+
+static inline struct io_group *rl_iog(struct request_list *rl)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	return container_of(rl, struct io_group, rl);
+#else
+	return NULL;
+#endif
+}
+
 /*
  * A queue has just exitted congestion.  Note this in the global counter of
  * congested queues, and wake up anyone who was waiting for requests to be
-- 
1.6.0.6
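
For readers following the wait-list selection in can_sleep_on_request_list()
above, here is a minimal user-space sketch of the same decision. It is only a
model: struct model_rl, enum sleep_target and model_sleep_target() are
hypothetical names that do not appear in the patch, and locking, wait queues
and the elevator-switch check are deliberately left out.

```c
#include <assert.h>

/* Hypothetical stand-in for the count[] and starved[] fields of request_list. */
struct model_rl {
	int count[2];	/* requests allocated, indexed by async(0)/sync(1) */
	int starved[2];	/* set when a direction has waiters but no requests */
};

/* Values mirror the return convention of can_sleep_on_request_list(). */
enum sleep_target {
	SLEEP_ON_GLOBAL = 0,	/* queue-wide starved wait queue */
	SLEEP_ON_GROUP = 1	/* per-group request list wait queue */
};

static enum sleep_target model_sleep_target(struct model_rl *rl, int is_sync)
{
	assert(is_sync == 0 || is_sync == 1);

	/* Requests pending in this direction: a completion will wake us. */
	if (rl->count[is_sync] != 0)
		return SLEEP_ON_GROUP;

	/*
	 * Nothing pending in this direction, but the other direction has
	 * requests. Mark this direction starved so that a completion in
	 * the other direction also wakes waiters here.
	 */
	if (rl->count[is_sync ^ 1] != 0) {
		rl->starved[is_sync] = 1;
		return SLEEP_ON_GROUP;
	}

	/* The group owns no requests at all, so only the global list is safe. */
	return SLEEP_ON_GLOBAL;
}
```

In this model a group with no requests in either direction always yields
SLEEP_ON_GLOBAL, which is the case the patch handles by setting *reason to 1
in get_request().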



* [PATCH 21/25] io-controller: Per cgroup request descriptor support
@ 2009-07-02 20:01   ` Vivek Goyal
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

o Currently a request queue has a fixed number of request descriptors for
  sync and async requests. Once the request descriptors are consumed, new
  processes are put to sleep and they effectively become serialized. Because
  sync and async queues are separate, async requests don't impact sync ones,
  but if one is looking for fairness between async requests, that is not
  achievable if request queue descriptors become the bottleneck.

o Make request descriptors per io group so that if there is lots of IO
  going on in one cgroup, it does not impact the IO of other groups.

o This is just one relatively simple way of doing things. This patch will
  probably change after the feedback. Folks have raised concerns that in a
  hierarchical setup, a child's request descriptors should be capped by the
  parent's request descriptors. Maybe we need to have per cgroup per device
  files in cgroups where one can specify the upper limit of request
  descriptors, and whenever a cgroup is created one needs to assign a request
  descriptor limit, making sure the total sum of the children's request
  descriptors is not more than that of the parent.

  I guess something like the memory controller. Anyway, that would be the next
  step. For the time being, we have implemented something simpler as follows.

o This patch implements the per cgroup request descriptors. The request pool
  per queue is still common, but every group will have its own wait list and
  its own count of request descriptors allocated to that group for sync and
  async queues. So effectively request_list becomes a per io group property
  and not a global request queue feature.

o Currently one can define q->nr_requests to limit request descriptors
  allocated for the queue. Now there is another tunable q->nr_group_requests
  which controls the request descriptor limit per group. q->nr_requests
  supersedes q->nr_group_requests to make sure that if there are lots of
  groups present, we don't end up allocating too many request descriptors on
  the queue.
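
As a rough illustration of how the two limits interact, the following
user-space sketch models only the hard caps that get_request() applies in the
diff below (1.5 times q->nr_requests for the queue, then 1.5 times
q->nr_group_requests for the group). The names model_queue, model_group and
model_admit() are hypothetical and not part of the patch. Batching, congestion
flags and the elevator-switch window (where the group count is not
incremented) are left out.

```c
#include <assert.h>

/* Hypothetical stand-in for one io group's request_list counters. */
struct model_group {
	int count[2];			/* indexed by async(0)/sync(1) */
};

/* Hypothetical stand-in for the per-queue counters and tunables. */
struct model_queue {
	int count[2];			/* queue-wide allocated requests */
	unsigned long nr_requests;	/* queue-wide limit */
	unsigned long nr_group_requests;/* per-group limit */
};

enum admit {
	ADMIT_OK,
	ADMIT_QUEUE_LIMIT,	/* denied by the queue-wide cap */
	ADMIT_GROUP_LIMIT	/* denied by the per-group cap */
};

static enum admit model_admit(struct model_queue *q, struct model_group *g,
			      int is_sync)
{
	assert(is_sync == 0 || is_sync == 1);

	/* The queue-wide cap is checked first, so it takes precedence. */
	if ((unsigned long)q->count[is_sync] >= 3 * q->nr_requests / 2)
		return ADMIT_QUEUE_LIMIT;

	/* Only then is the group's own cap consulted. */
	if ((unsigned long)g->count[is_sync] >= 3 * q->nr_group_requests / 2)
		return ADMIT_GROUP_LIMIT;

	q->count[is_sync]++;
	g->count[is_sync]++;
	return ADMIT_OK;
}
```

With nr_requests = 4 and nr_group_requests = 2 in this model, one group is
stopped by its own cap after three sync requests while a second group can
still allocate, and a third group is refused once the queue-wide count
reaches six.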

Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c       |  305 +++++++++++++++++++++++++++++++++++++----------
 block/blk-settings.c   |    1 +
 block/blk-sysfs.c      |   58 +++++++--
 block/elevator-fq.c    |   14 +++
 block/elevator-fq.h    |    5 +
 block/elevator.c       |    7 +-
 include/linux/blkdev.h |   87 +++++++++++++-
 7 files changed, 395 insertions(+), 82 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d8b4dd..2035c20 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -460,20 +460,30 @@ void blk_cleanup_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_cleanup_queue);
 
-static int blk_init_free_list(struct request_queue *q)
+void blk_init_request_list(struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
 
 	rl->count[BLK_RW_SYNC] = rl->count[BLK_RW_ASYNC] = 0;
-	rl->starved[BLK_RW_SYNC] = rl->starved[BLK_RW_ASYNC] = 0;
-	rl->elvpriv = 0;
 	init_waitqueue_head(&rl->wait[BLK_RW_SYNC]);
 	init_waitqueue_head(&rl->wait[BLK_RW_ASYNC]);
+}
 
-	rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ, mempool_alloc_slab,
-				mempool_free_slab, request_cachep, q->node);
+static int blk_init_free_list(struct request_queue *q)
+{
+	/*
+	 * Initialize the queue request list in case there are non-hierarchical
+	 * io schedulers not making use of fair queuing infrastructure.
+	 *
+	 * For ioschedulers making use of fair queuing infrastructure, request
+	 * list is inside the associated group and when that group is
+	 * instantiated, it takes care of initializing the request list also.
+	 */
+	blk_init_request_list(&q->rq);
+	q->rq_data.rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
+				mempool_alloc_slab, mempool_free_slab,
+				request_cachep, q->node);
 
-	if (!rl->rq_pool)
+	if (!q->rq_data.rq_pool)
 		return -ENOMEM;
 
 	return 0;
@@ -575,6 +585,9 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 		return NULL;
 	}
 
+	/* init starved waiter wait queue */
+	init_waitqueue_head(&q->rq_data.starved_wait);
+
 	/*
 	 * if caller didn't supply a lock, they get per-queue locking with
 	 * our embedded lock
@@ -624,14 +637,14 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 {
 	if (rq->cmd_flags & REQ_ELVPRIV)
 		elv_put_request(q, rq);
-	mempool_free(rq, q->rq.rq_pool);
+	mempool_free(rq, q->rq_data.rq_pool);
 }
 
 static struct request *
 blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 					gfp_t gfp_mask)
 {
-	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
+	struct request *rq = mempool_alloc(q->rq_data.rq_pool, gfp_mask);
 
 	if (!rq)
 		return NULL;
@@ -642,7 +655,7 @@ blk_alloc_request(struct request_queue *q, struct bio *bio, int flags, int priv,
 
 	if (priv) {
 		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
+			mempool_free(rq, q->rq_data.rq_pool);
 			return NULL;
 		}
 		rq->cmd_flags |= REQ_ELVPRIV;
@@ -685,18 +698,18 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
 	ioc->last_waited = jiffies;
 }
 
-static void __freed_request(struct request_queue *q, int sync)
+static void __freed_request(struct request_queue *q, int sync,
+					struct request_list *rl)
 {
-	struct request_list *rl = &q->rq;
-
-	if (rl->count[sync] < queue_congestion_off_threshold(q))
+	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
-	if (rl->count[sync] + 1 <= q->nr_requests) {
+	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
+		blk_clear_queue_full(q, sync);
+
+	if (rl->count[sync] + 1 <= q->nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
-
-		blk_clear_queue_full(q, sync);
 	}
 }
 
@@ -704,63 +717,133 @@ static void __freed_request(struct request_queue *q, int sync)
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, int sync, int priv)
-{
-	struct request_list *rl = &q->rq;
+static void freed_request(struct request_queue *q, int sync, int priv,
+					struct request_list *rl)
+{
+	/* There is a window during request allocation where request is
+	 * mapped to one group but by the time a queue for the group is
+	 * allocated, it is possible that original cgroup/io group has been
+	 * deleted and now io queue is allocated in a different group (root)
+	 * altogether.
+	 *
+	 * One solution to the problem is that rq should take io group
+	 * reference. But it looks too much to do that to solve this issue.
+	 * The only side effect of the hard to hit issue seems to be that
+	 * we will try to decrement the rl->count for a request list which
+	 * did not allocate that request. Check for rl->count going less than
+	 * zero and do not decrement it if that's the case.
+	 */
+
+	if (priv && rl->count[sync] > 0)
+		rl->count[sync]--;
+
+	BUG_ON(!q->rq_data.count[sync]);
+	q->rq_data.count[sync]--;
 
-	rl->count[sync]--;
 	if (priv)
-		rl->elvpriv--;
+		q->rq_data.elvpriv--;
 
-	__freed_request(q, sync);
+	__freed_request(q, sync, rl);
 
 	if (unlikely(rl->starved[sync ^ 1]))
-		__freed_request(q, sync ^ 1);
+		__freed_request(q, sync ^ 1, rl);
+
+	/* Wake up the starved process on global list, if any */
+	if (unlikely(q->rq_data.starved)) {
+		if (waitqueue_active(&q->rq_data.starved_wait))
+			wake_up(&q->rq_data.starved_wait);
+		q->rq_data.starved--;
+	}
+}
+
+/*
+ * Returns whether one can sleep on this request list or not. There are
+ * cases (elevator switch) where request list might not have allocated
+ * any request descriptor but we deny request allocation due to global
+ * limits. In that case one should sleep on global list as on this request
+ * list no wakeup will take place.
+ *
+ * Also sets the request list starved flag if there are no requests pending
+ * in the direction of rq.
+ *
+ * Return 1 --> sleep on request list, 0 --> sleep on global list
+ */
+static int can_sleep_on_request_list(struct request_list *rl, int is_sync)
+{
+	if (unlikely(rl->count[is_sync] == 0)) {
+		/*
+		 * If there is a request pending in other direction
+		 * in same io group, then set the starved flag of
+		 * the group request list. Otherwise, we need to
+		 * make this process sleep in global starved list
+		 * to make sure it will not sleep indefinitely.
+		 */
+		if (rl->count[is_sync ^ 1] != 0) {
+			rl->starved[is_sync] = 1;
+			return 1;
+		} else
+			return 0;
+	}
+
+	return 1;
 }
 
 /*
  * Get a free request, queue_lock must be held.
- * Returns NULL on failure, with queue_lock held.
+ * Returns NULL on failure, with queue_lock held. Also sets the "reason" field
+ * in case of failure. This reason field helps caller decide whether to sleep
+ * on per group list or global per queue list.
+ * reason = 0 sleep on per group list
+ * reason = 1 sleep on global list
+ *
  * Returns !NULL on success, with queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, int rw_flags,
-				   struct bio *bio, gfp_t gfp_mask)
+					struct bio *bio, gfp_t gfp_mask,
+					struct request_list *rl, int *reason)
 {
 	struct request *rq = NULL;
-	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
+	int sleep_on_global = 0;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
 
-	if (rl->count[is_sync]+1 >= queue_congestion_on_threshold(q)) {
-		if (rl->count[is_sync]+1 >= q->nr_requests) {
-			ioc = current_io_context(GFP_ATOMIC, q->node);
-			/*
-			 * The queue will fill after this allocation, so set
-			 * it as full, and mark this process as "batching".
-			 * This process will be allowed to complete a batch of
-			 * requests, others will be blocked.
-			 */
-			if (!blk_queue_full(q, is_sync)) {
-				ioc_set_batching(q, ioc);
-				blk_set_queue_full(q, is_sync);
-			} else {
-				if (may_queue != ELV_MQUEUE_MUST
-						&& !ioc_batching(q, ioc)) {
-					/*
-					 * The queue is full and the allocating
-					 * process is not a "batcher", and not
-					 * exempted by the IO scheduler
-					 */
-					goto out;
-				}
+	if (q->rq_data.count[is_sync]+1 >= queue_congestion_on_threshold(q))
+		blk_set_queue_congested(q, is_sync);
+
+	/*
+	 * Looks like there is no user of queue full now.
+	 * Keeping it for time being.
+	 */
+	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
+		blk_set_queue_full(q, is_sync);
+
+	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+		ioc = current_io_context(GFP_ATOMIC, q->node);
+		/*
+		 * The queue request descriptor group will fill after this
+		 * allocation, so set
+		 * it as full, and mark this process as "batching".
+		 * This process will be allowed to complete a batch of
+		 * requests, others will be blocked.
+		 */
+		if (rl->count[is_sync] <= q->nr_group_requests)
+			ioc_set_batching(q, ioc);
+		else {
+			if (may_queue != ELV_MQUEUE_MUST
+					&& !ioc_batching(q, ioc)) {
+				/*
+				 * The queue is full and the allocating
+				 * process is not a "batcher", and not
+				 * exempted by the IO scheduler
+				 */
+				goto out;
 			}
 		}
-		blk_set_queue_congested(q, is_sync);
 	}
 
 	/*
@@ -768,21 +851,60 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	 * limit of requests, otherwise we could have thousands of requests
 	 * allocated with any setting of ->nr_requests
 	 */
-	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+
+	if (q->rq_data.count[is_sync] >= (3 * q->nr_requests / 2)) {
+		/*
+		 * Queue is too full for allocation. On which request queue
+		 * the task should sleep? Generally it should sleep on its
+		 * request list but if elevator switch is happening, in that
+		 * window, request descriptors are allocated from global
+		 * pool and are not accounted against any particular request
+		 * list as group is going away.
+		 *
+		 * So it might happen that request list does not have any
+		 * requests allocated at all and if process sleeps on per
+		 * group request list, it will not be woken up. In such case,
+		 * make it sleep on global starved list.
+		 */
+		if (test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags)
+		    || !can_sleep_on_request_list(rl, is_sync))
+			sleep_on_global = 1;
+		goto out;
+	}
+
+	/*
+	 * Allocation of request is allowed from queue perspective. Now check
+	 * from per group request list
+	 */
+
+	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
 		goto out;
 
-	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
+	q->rq_data.count[is_sync]++;
+
 	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	if (priv) {
+		q->rq_data.elvpriv++;
+		/*
+		 * Account the request to request list only if request is
+		 * going to elevator. During elevator switch, there will
+		 * be small window where group is going away and new group
+		 * will not be allocated till elevator switch is complete.
+		 * So till then instead of slowing down the application,
+		 * we will continue to allocate request from total common
+		 * pool instead of per group limit
+		 */
+		rl->count[is_sync]++;
+	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
 	rq = blk_alloc_request(q, bio, rw_flags, priv, gfp_mask);
+
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
@@ -792,7 +914,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 
 		/*
 		 * in the very unlikely event that allocation failed and no
@@ -802,9 +924,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * rq mempool into READ and WRITE
 		 */
 rq_starved:
-		if (unlikely(rl->count[is_sync] == 0))
-			rl->starved[is_sync] = 1;
-
+		if (!can_sleep_on_request_list(rl, is_sync))
+			sleep_on_global = 1;
 		goto out;
 	}
 
@@ -819,6 +940,8 @@ rq_starved:
 
 	trace_block_getrq(q, bio, rw_flags & 1);
 out:
+	if (reason && sleep_on_global)
+		*reason = 1;
 	return rq;
 }
 
@@ -832,16 +955,44 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 					struct bio *bio)
 {
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
+	int sleep_on_global = 0;
 	struct request *rq;
+	struct request_list *rl = blk_get_request_list(q, bio);
+	struct io_group *iog = NULL;
 
-	rq = get_request(q, rw_flags, bio, GFP_NOIO);
+	rq = get_request(q, rw_flags, bio, GFP_NOIO, rl, &sleep_on_global);
 	while (!rq) {
 		DEFINE_WAIT(wait);
 		struct io_context *ioc;
-		struct request_list *rl = &q->rq;
 
-		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
-				TASK_UNINTERRUPTIBLE);
+		if (sleep_on_global) {
+			/*
+			 * Task failed allocation and needs to wait and
+			 * try again. There are no requests pending from
+			 * the io group hence need to sleep on global
+			 * wait queue. Most likely the allocation failed
+			 * because of memory issues.
+			 */
+
+			q->rq_data.starved++;
+			prepare_to_wait_exclusive(&q->rq_data.starved_wait,
+					&wait, TASK_UNINTERRUPTIBLE);
+		} else {
+			/*
+			 * We are about to sleep on a request list and we
+			 * drop queue lock. After waking up, we will do
+			 * finish_wait() on request list and in the mean
+			 * time group might be gone. Take a reference to
+			 * the group now.
+			 */
+			prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
+					TASK_UNINTERRUPTIBLE);
+#ifdef CONFIG_GROUP_IOSCHED
+			iog = rl_iog(rl);
+			if (iog)
+				elv_get_iog(iog);
+#endif
+		}
 
 		trace_block_sleeprq(q, bio, rw_flags & 1);
 
@@ -859,9 +1010,30 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		ioc_set_batching(q, ioc);
 
 		spin_lock_irq(q->queue_lock);
-		finish_wait(&rl->wait[is_sync], &wait);
 
-		rq = get_request(q, rw_flags, bio, GFP_NOIO);
+		if (sleep_on_global) {
+			finish_wait(&q->rq_data.starved_wait, &wait);
+			sleep_on_global = 0;
+		} else {
+			finish_wait(&rl->wait[is_sync], &wait);
+#ifdef CONFIG_GROUP_IOSCHED
+			/*
+			 * We had taken a reference to the rl/iog.
+			 * Put that now
+			 */
+			iog = rl_iog(rl);
+			if (iog)
+				elv_put_iog(iog);
+#endif
+		}
+
+		/*
+		 * After the sleep, check the rl again in case the cgroup the bio
+		 * belonged to is gone and the bio is mapped to the root group now
+		 */
+		rl = blk_get_request_list(q, bio);
+		rq = get_request(q, rw_flags, bio, GFP_NOIO, rl,
+					&sleep_on_global);
 	};
 
 	return rq;
@@ -870,14 +1042,16 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	struct request *rq;
+	struct request_list *rl;
 
 	BUG_ON(rw != READ && rw != WRITE);
 
 	spin_lock_irq(q->queue_lock);
+	rl = blk_get_request_list(q, NULL);
 	if (gfp_mask & __GFP_WAIT) {
 		rq = get_request_wait(q, rw, NULL);
 	} else {
-		rq = get_request(q, rw, NULL, gfp_mask);
+		rq = get_request(q, rw, NULL, gfp_mask, rl, NULL);
 		if (!rq)
 			spin_unlock_irq(q->queue_lock);
 	}
@@ -1094,12 +1268,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	if (req->cmd_flags & REQ_ALLOCED) {
 		int is_sync = rq_is_sync(req) != 0;
 		int priv = req->cmd_flags & REQ_ELVPRIV;
+		struct request_list *rl = rq_rl(q, req);
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
 		blk_free_request(q, req);
-		freed_request(q, is_sync, priv);
+		freed_request(q, is_sync, priv, rl);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index bd582a7..78b8aec 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -148,6 +148,7 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	 * set defaults
 	 */
 	q->nr_requests = BLKDEV_MAX_RQ;
+	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
 
 	q->make_request_fn = mfn;
 	blk_queue_dma_alignment(q, 511);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b1cd040..577ed42 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -38,42 +38,66 @@ static ssize_t queue_requests_show(struct request_queue *q, char *page)
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
-	struct request_list *rl = &q->rq;
+	struct request_list *rl;
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
 	if (nr < BLKDEV_MIN_RQ)
 		nr = BLKDEV_MIN_RQ;
 
 	spin_lock_irq(q->queue_lock);
+	rl = blk_get_request_list(q, NULL);
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
 
-	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_SYNC);
-	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_SYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_SYNC);
 
-	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
+	if (q->rq_data.count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
 		blk_set_queue_congested(q, BLK_RW_ASYNC);
-	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
+	else if (q->rq_data.count[BLK_RW_ASYNC] <
+				queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, BLK_RW_ASYNC);
 
-	if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_SYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_SYNC);
-	} else if (rl->count[BLK_RW_SYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_SYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_SYNC);
 		wake_up(&rl->wait[BLK_RW_SYNC]);
 	}
 
-	if (rl->count[BLK_RW_ASYNC] >= q->nr_requests) {
+	if (q->rq_data.count[BLK_RW_ASYNC] >= q->nr_requests) {
 		blk_set_queue_full(q, BLK_RW_ASYNC);
-	} else if (rl->count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
+	} else if (q->rq_data.count[BLK_RW_ASYNC]+1 <= q->nr_requests) {
 		blk_clear_queue_full(q, BLK_RW_ASYNC);
 		wake_up(&rl->wait[BLK_RW_ASYNC]);
 	}
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
+#ifdef CONFIG_GROUP_IOSCHED
+static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->nr_group_requests, (page));
+}
+
+static ssize_t
+queue_group_requests_store(struct request_queue *q, const char *page,
+					size_t count)
+{
+	unsigned long nr;
+	int ret = queue_var_store(&nr, page, count);
+	if (nr < BLKDEV_MIN_RQ)
+		nr = BLKDEV_MIN_RQ;
+
+	spin_lock_irq(q->queue_lock);
+	q->nr_group_requests = nr;
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+#endif
 
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
 {
@@ -239,6 +263,14 @@ static struct queue_sysfs_entry queue_requests_entry = {
 	.store = queue_requests_store,
 };
 
+#ifdef CONFIG_GROUP_IOSCHED
+static struct queue_sysfs_entry queue_group_requests_entry = {
+	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_group_requests_show,
+	.store = queue_group_requests_store,
+};
+#endif
+
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -313,6 +345,9 @@ static struct queue_sysfs_entry queue_iostats_entry = {
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
+#ifdef CONFIG_GROUP_IOSCHED
+	&queue_group_requests_entry.attr,
+#endif
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
@@ -392,12 +427,11 @@ static void blk_release_queue(struct kobject *kobj)
 {
 	struct request_queue *q =
 		container_of(kobj, struct request_queue, kobj);
-	struct request_list *rl = &q->rq;
 
 	blk_sync_queue(q);
 
-	if (rl->rq_pool)
-		mempool_destroy(rl->rq_pool);
+	if (q->rq_data.rq_pool)
+		mempool_destroy(q->rq_data.rq_pool);
 
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 899972c..c4a2b1e 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1092,6 +1092,16 @@ static struct io_cgroup *cgroup_to_io_cgroup(struct cgroup *cgroup)
 			    struct io_cgroup, css);
 }
 
+struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+	struct io_group *iog;
+
+	iog = io_get_io_group_bio(q, bio, 1);
+	BUG_ON(!iog);
+	return &iog->rl;
+}
+
 /*
  * Search the io_group for efqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1390,6 +1400,8 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		 */
 		elv_get_iog(iog);
 
+		blk_init_request_list(&iog->rl);
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -1667,6 +1679,8 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	for (i = 0; i < IO_IOPRIO_CLASSES; i++)
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
 
+	blk_init_request_list(&iog->rl);
+
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index bb43444..74a7393 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -258,6 +258,9 @@ struct io_group {
 
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
+
+	/* request list associated with the group */
+	struct request_list rl;
 };
 
 /**
@@ -526,6 +529,8 @@ extern void elv_fq_unset_request_ioq(struct request_queue *q,
 					struct request *rq);
 extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 						struct bio *bio);
+extern struct request_list *io_group_get_request_list(struct request_queue *q,
+						struct bio *bio);
 
 /* Sets the single ioq associated with the io group. (noop, deadline, AS) */
 static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
diff --git a/block/elevator.c b/block/elevator.c
index 68d5a80..38a118b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -646,7 +646,7 @@ void elv_quiesce_start(struct request_queue *q)
 	 * make sure we don't have any requests in flight
 	 */
 	elv_drain_elevator(q);
-	while (q->rq.elvpriv) {
+	while (q->rq_data.elvpriv) {
 		__blk_run_queue(q);
 		spin_unlock_irq(q->queue_lock);
 		msleep(10);
@@ -745,8 +745,9 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 	}
 
 	if (unplug_it && blk_queue_plugged(q)) {
-		int nrq = q->rq.count[BLK_RW_SYNC] + q->rq.count[BLK_RW_ASYNC]
-				- queue_in_flight(q);
+		int nrq = q->rq_data.count[BLK_RW_SYNC] +
+				q->rq_data.count[BLK_RW_ASYNC] -
+				queue_in_flight(q);
 
 		if (nrq >= q->unplug_thresh)
 			__generic_unplug_device(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 551e17d..655440c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -32,21 +32,51 @@ struct request;
 struct sg_io_hdr;
 
 #define BLKDEV_MIN_RQ	4
+
+#ifdef CONFIG_GROUP_IOSCHED
+#define BLKDEV_MAX_RQ	512	/* Default maximum for queue */
+#define BLKDEV_MAX_GROUP_RQ    128      /* Default maximum per group */
+#else
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+/*
+ * This is equivalent to the case of only one group (root group) being
+ * present. Let it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV_MAX_GROUP_RQ    BLKDEV_MAX_RQ      /* Default maximum */
+#endif
 
 struct request;
 typedef void (rq_end_io_fn)(struct request *, int);
 
 struct request_list {
 	/*
-	 * count[], starved[], and wait[] are indexed by
+	 * count[], starved and wait[] are indexed by
 	 * BLK_RW_SYNC/BLK_RW_ASYNC
 	 */
 	int count[2];
 	int starved[2];
+	wait_queue_head_t wait[2];
+};
+
+/*
+ * This data structure keeps track of mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request_data {
+	/*
+	 * Per queue request descriptor count. This is in addition to per
+	 * cgroup count
+	 */
+	int count[2];
 	int elvpriv;
 	mempool_t *rq_pool;
-	wait_queue_head_t wait[2];
+	int starved;
+	/*
+	 * Global list for starved tasks. A task will be queued here if
+	 * it could not allocate request descriptor and the associated
+	 * group request list does not have any requests pending.
+	 */
+	wait_queue_head_t starved_wait;
 };
 
 /*
@@ -355,6 +385,9 @@ struct request_queue
 	 */
 	struct request_list	rq;
 
+	/* Contains request pool and other data like starved data */
+	struct request_data	rq_data;
+
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
 	prep_rq_fn		*prep_rq_fn;
@@ -416,6 +449,8 @@ struct request_queue
 	 * queue settings
 	 */
 	unsigned long		nr_requests;	/* Max # of requests */
+	/* Max # of per io group requests */
+	unsigned long		nr_group_requests;
 	unsigned int		nr_congestion_on;
 	unsigned int		nr_congestion_off;
 	unsigned int		nr_batching;
@@ -795,6 +830,54 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern void blk_init_request_list(struct request_list *rl);
+
+static inline struct request_list *blk_get_request_list(struct request_queue *q,
+						struct bio *bio)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return &q->rq;
+
+	return io_group_get_request_list(q, bio);
+#else
+	return &q->rq;
+#endif
+}
+
+static inline struct request_list *rq_rl(struct request_queue *q,
+						struct request *rq)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	struct io_group *iog;
+	int priv = rq->cmd_flags & REQ_ELVPRIV;
+
+	if (!elv_iosched_fair_queuing_enabled(q->elevator))
+		return &q->rq;
+
+	BUG_ON(priv && !rq->ioq);
+
+	if (priv)
+		iog = ioq_to_io_group(rq->ioq);
+	else
+		iog = q->elevator->efqd.root_group;
+
+	BUG_ON(!iog);
+	return &iog->rl;
+#else
+	return &q->rq;
+#endif
+}
+
+static inline struct io_group *rl_iog(struct request_list *rl)
+{
+#ifdef CONFIG_GROUP_IOSCHED
+	return container_of(rl, struct io_group, rl);
+#else
+	return NULL;
+#endif
+}
+
 /*
  * A queue has just exitted congestion.  Note this in the global counter of
  * congested queues, and wake up anyone who was waiting for requests to be
-- 
1.6.0.6
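
For readers following the get_request() changes in the patch above, the
two hard limits it applies can be modelled in userspace roughly as below.
This is only a sketch under stated assumptions: struct rq_limits, enum
admit_result and admit_request() are hypothetical names that do not exist
in the patch, and the batching and ELV_MQUEUE_MUST handling is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only; field names mirror the patch, the types are made up. */
struct rq_limits {
	unsigned long nr_requests;       /* per-queue limit (q->nr_requests) */
	unsigned long nr_group_requests; /* per-group limit (q->nr_group_requests) */
};

enum admit_result {
	ADMIT_OK,
	ADMIT_QUEUE_FULL,	/* queue-wide count reached 1.5x its limit */
	ADMIT_GROUP_FULL,	/* group count reached 1.5x its limit */
};

/*
 * Same order as the two hard checks in get_request(): the queue-wide
 * count is tested first, then the per-group request list count.
 */
static enum admit_result admit_request(const struct rq_limits *lim,
				       unsigned long queue_count,
				       unsigned long group_count)
{
	assert(lim != NULL);

	if (queue_count >= 3 * lim->nr_requests / 2)
		return ADMIT_QUEUE_FULL;

	if (group_count >= 3 * lim->nr_group_requests / 2)
		return ADMIT_GROUP_FULL;

	return ADMIT_OK;
}
```

With the CONFIG_GROUP_IOSCHED defaults from this patch (512 per queue, 128
per group) the hard limits work out to 768 and 192 respectively.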


* [PATCH 22/25] io-controller: Per io group bdi congestion interface
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (20 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 21/25] io-controller: Per cgroup request descriptor support Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 23/25] io-controller: Support per cgroup per device weights and io class Vivek Goyal
                     ` (5 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o So far there has been only one pair of request descriptor queues (one for
  sync and one for async) per device, and the number of requests allocated
  decided whether the associated bdi is congested or not.

  Now, with the per io group request descriptor infrastructure, there is a
  pair of request descriptor queues per io group per device. So it might
  happen that the overall request queue is not congested but the particular
  io group a bio belongs to is congested.

  Or it could be the other way around: the group is not congested but the
  overall queue is. This can happen if the user has not set the request
  descriptor limits for the queue and the groups properly.
  (q->nr_requests < nr_groups * q->nr_group_requests)

  Hence there is a need for a new interface which can query device congestion
  status per group. This group is determined by the "struct page" the IO will
  be done for. If page is NULL, then the group is determined from the current
  task context.

o This patch introduces a new set of functions, bdi_*_congested_group(), which
  take "struct page" as an additional argument. These functions call the
  block layer and in turn the elevator to find out whether the io group the
  page will go into is congested or not.

o Currently I have introduced the core functions and migrated most of the users.
  But there might still be some left. This is an ongoing TODO item.

o There are some io_get_io_group() related changes which should be pushed up
  into earlier patches in the series. Still testing this patch. Will push
  these changes up in the next posting.
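
  The limit condition mentioned in the first item can be checked with a
  helper along these lines. It is a userspace sketch, not part of the
  patch: group_limits_oversubscribed() is a hypothetical name, and the
  number of groups is assumed to be known to the caller.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when nr_groups * nr_group_requests > nr_requests, i.e. when
 * the groups together may hold more request descriptors than the queue
 * allows, so the queue can be congested while no single group is.
 * Written as a division so the product cannot overflow.
 */
static bool group_limits_oversubscribed(unsigned long nr_requests,
					unsigned long nr_group_requests,
					unsigned long nr_groups)
{
	assert(nr_group_requests > 0);
	return nr_groups > nr_requests / nr_group_requests;
}
```

  With the defaults of 512 and 128, four groups fit exactly and a fifth
  group oversubscribes the queue.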

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
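A note for reviewers: the per group congestion decision in this patch can
be modelled in userspace as below. This is a sketch only. struct iog_model,
iog_set_thresholds() and iog_congested() are hypothetical stand-ins for
struct io_group, elv_io_group_congestion_threshold() and
elv_is_iog_congested(); the arithmetic is copied from the patch, and index
1 is assumed to be sync and 0 async, as in the calls from
blk_queue_io_group_congested().

```c
#include <assert.h>

/* Userspace stand-in for the fields of struct io_group used here. */
struct iog_model {
	unsigned int nr_congestion_on;
	unsigned int nr_congestion_off;
	int count[2];		/* allocated requests: [0] async, [1] sync */
};

/* Same formulas as elv_io_group_congestion_threshold() in the patch. */
static void iog_set_thresholds(struct iog_model *iog,
			       unsigned long nr_group_requests)
{
	long n = (long)nr_group_requests;
	long nr;

	nr = n - (n / 8) + 1;
	if (nr > n)
		nr = n;
	iog->nr_congestion_on = (unsigned int)nr;

	nr = n - (n / 8) - (n / 16) - 1;
	if (nr < 1)
		nr = 1;
	iog->nr_congestion_off = (unsigned int)nr;
}

/* A group is congested once its count reaches the "on" threshold. */
static int iog_congested(const struct iog_model *iog, int sync)
{
	assert(sync == 0 || sync == 1);
	return iog->count[sync] >= (int)iog->nr_congestion_on;
}
```

For nr_group_requests = 128 this gives an "on" threshold of 113 and an
"off" threshold of 103. Only the "on" threshold is tested by
elv_is_iog_congested() as posted.
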
 block/blk-core.c            |   21 ++++++++++++++
 block/elevator-fq.c         |   58 ++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h         |    6 ++++
 drivers/md/dm-table.c       |   11 +++++---
 drivers/md/dm.c             |    7 +++--
 drivers/md/dm.h             |    3 +-
 drivers/md/linear.c         |    7 +++-
 drivers/md/multipath.c      |    7 +++-
 drivers/md/raid0.c          |    6 +++-
 drivers/md/raid1.c          |    9 ++++--
 drivers/md/raid10.c         |    6 +++-
 drivers/md/raid5.c          |    2 +-
 fs/afs/write.c              |    8 +++++-
 fs/btrfs/disk-io.c          |    6 +++-
 fs/btrfs/extent_io.c        |   12 ++++++++
 fs/btrfs/volumes.c          |    8 ++++-
 fs/cifs/file.c              |   11 +++++++
 fs/ext2/ialloc.c            |    2 +-
 fs/gfs2/aops.c              |   12 ++++++++
 fs/nilfs2/segbuf.c          |    3 +-
 fs/xfs/linux-2.6/xfs_aops.c |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   61 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/blkdev.h      |    5 +++
 mm/backing-dev.c            |   62 +++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c         |   11 +++++++
 mm/readahead.c              |    2 +-
 27 files changed, 318 insertions(+), 32 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2035c20..79fe6a9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -90,6 +90,27 @@ void blk_queue_congestion_threshold(struct request_queue *q)
 	q->nr_congestion_off = nr;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+int blk_queue_io_group_congested(struct backing_dev_info *bdi, int bdi_bits,
+					struct page *page)
+{
+	int ret = 0;
+	struct request_queue *q = bdi->unplug_io_data;
+
+	if (!q || !q->elevator)
+		return bdi_congested(bdi, bdi_bits);
+
+	/* Do we need to hold queue lock? */
+	if (bdi_bits & (1 << BDI_sync_congested))
+		ret |= elv_io_group_congested(q, page, 1);
+
+	if (bdi_bits & (1 << BDI_async_congested))
+		ret |= elv_io_group_congested(q, page, 0);
+
+	return ret;
+}
+#endif
+
 /**
  * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
  * @bdev:	device
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index c4a2b1e..2a2b68d 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1102,6 +1102,58 @@ struct request_list *io_group_get_request_list(struct request_queue *q,
 	return &iog->rl;
 }
 
+/* Set io group congestion on and off thresholds */
+void elv_io_group_congestion_threshold(struct request_queue *q,
+						struct io_group *iog)
+{
+	int nr;
+
+	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
+	if (nr > q->nr_group_requests)
+		nr = q->nr_group_requests;
+	iog->nr_congestion_on = nr;
+
+	nr = q->nr_group_requests - (q->nr_group_requests / 8)
+			- (q->nr_group_requests / 16) - 1;
+	if (nr < 1)
+		nr = 1;
+	iog->nr_congestion_off = nr;
+}
+
+static inline int elv_is_iog_congested(struct request_queue *q,
+					struct io_group *iog, int sync)
+{
+	if (iog->rl.count[sync] >= iog->nr_congestion_on)
+		return 1;
+	return 0;
+}
+
+/* Determine whether the io group the page maps to is congested or not */
+int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
+{
+	struct io_group *iog;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	iog = io_get_io_group(q, page, 0);
+
+	if (!iog) {
+		/*
+		 * Either cgroup got deleted or this is first request in the
+		 * group and associated io group object has not been created
+		 * yet. Map it to root group.
+		 *
+		 * TODO: Fix the case of group not created yet.
+		 */
+		iog = q->elevator->efqd.root_group;
+	}
+
+	ret = elv_is_iog_congested(q, iog, sync);
+	rcu_read_unlock();
+	return ret;
+}
+
 /*
  * Search the io_group for efqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1401,6 +1453,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		elv_get_iog(iog);
 
 		blk_init_request_list(&iog->rl);
+		elv_io_group_congestion_threshold(q, iog);
 
 		if (leaf == NULL) {
 			leaf = iog;
@@ -1680,6 +1733,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
 
 	blk_init_request_list(&iog->rl);
+	elv_io_group_congestion_threshold(q, iog);
 
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
@@ -1688,6 +1742,10 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	iog->iocg_id = css_id(&iocg->css);
 	spin_unlock_irq(&iocg->lock);
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	io_group_path(iog, iog->path, sizeof(iog->path));
+#endif
+
 	return iog;
 }
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 74a7393..214fb61 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -259,6 +259,10 @@ struct io_group {
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
 
+	/* io group congestion on and off threshold for request descriptors */
+	unsigned int nr_congestion_on;
+	unsigned int nr_congestion_off;
+
 	/* request list associated with the group */
 	struct request_list rl;
 };
@@ -531,6 +535,8 @@ extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 						struct bio *bio);
 extern struct request_list *io_group_get_request_list(struct request_queue *q,
 						struct bio *bio);
+extern int elv_io_group_congested(struct request_queue *q, struct page *page,
+					int sync);
 
 /* Sets the single ioq associated with the io group. (noop, deadline, AS) */
 static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 4899ebe..e3fd6d8 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1175,7 +1175,8 @@ int dm_table_resume_targets(struct dm_table *t)
 	return 0;
 }
 
-int dm_table_any_congested(struct dm_table *t, int bdi_bits)
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+				int group)
 {
 	struct dm_dev_internal *dd;
 	struct list_head *devices = dm_table_get_devices(t);
@@ -1185,9 +1186,11 @@ int dm_table_any_congested(struct dm_table *t, int bdi_bits)
 		struct request_queue *q = bdev_get_queue(dd->dm_dev.bdev);
 		char b[BDEVNAME_SIZE];
 
-		if (likely(q))
-			r |= bdi_congested(&q->backing_dev_info, bdi_bits);
-		else
+		if (likely(q)) {
+			struct backing_dev_info *bdi = &q->backing_dev_info;
+			r |= group ? bdi_congested_group(bdi, bdi_bits, page)
+				: bdi_congested(bdi, bdi_bits);
+		} else
 			DMWARN_LIMIT("%s: any_congested: nonexistent device %s",
 				     dm_device_name(t->md),
 				     bdevname(dd->dm_dev.bdev, b));
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c6d4ee..320ef4c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1608,7 +1608,8 @@ static void dm_unplug_all(struct request_queue *q)
 	}
 }
 
-static int dm_any_congested(void *congested_data, int bdi_bits)
+static int dm_any_congested(void *congested_data, int bdi_bits,
+					struct page *page, int group)
 {
 	int r = bdi_bits;
 	struct mapped_device *md = congested_data;
@@ -1625,8 +1626,8 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
 				r = md->queue->backing_dev_info.state &
 				    bdi_bits;
 			else
-				r = dm_table_any_congested(map, bdi_bits);
-
+				r = dm_table_any_congested(map, bdi_bits, page,
+								 group);
 			dm_table_put(map);
 		}
 	}
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 23278ae..9c4c5a5 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -57,7 +57,8 @@ struct list_head *dm_table_get_devices(struct dm_table *t);
 void dm_table_presuspend_targets(struct dm_table *t);
 void dm_table_postsuspend_targets(struct dm_table *t);
 int dm_table_resume_targets(struct dm_table *t);
-int dm_table_any_congested(struct dm_table *t, int bdi_bits);
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+				int group);
 int dm_table_any_busy_target(struct dm_table *t);
 int dm_table_set_type(struct dm_table *t);
 unsigned dm_table_get_type(struct dm_table *t);
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 15c8b7b..f227075 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -102,7 +102,7 @@ static void linear_unplug(struct request_queue *q)
 	rcu_read_unlock();
 }
 
-static int linear_congested(void *data, int bits)
+static int linear_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	linear_conf_t *conf;
@@ -113,7 +113,10 @@ static int linear_congested(void *data, int bits)
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(conf->disks[i].rdev->bdev);
-		ret |= bdi_congested(&q->backing_dev_info, bits);
+		struct backing_dev_info *bdi = &q->backing_dev_info;
+
+		ret |= group ? bdi_congested_group(bdi, bits, page) :
+			bdi_congested(bdi, bits);
 	}
 
 	rcu_read_unlock();
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index cbe368f..87bb1dd 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -192,7 +192,8 @@ static void multipath_status (struct seq_file *seq, mddev_t *mddev)
 	seq_printf (seq, "]");
 }
 
-static int multipath_congested(void *data, int bits)
+static int multipath_congested(void *data, int bits, struct page *page,
+					int group)
 {
 	mddev_t *mddev = data;
 	multipath_conf_t *conf = mddev->private;
@@ -203,8 +204,10 @@ static int multipath_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->multipaths[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
-			ret |= bdi_congested(&q->backing_dev_info, bits);
+			ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 			/* Just like multipath_map, we just check the
 			 * first available device
 			 */
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index ab4a489..f7813e7 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -37,7 +37,7 @@ static void raid0_unplug(struct request_queue *q)
 	}
 }
 
-static int raid0_congested(void *data, int bits)
+static int raid0_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	raid0_conf_t *conf = mddev->private;
@@ -46,8 +46,10 @@ static int raid0_congested(void *data, int bits)
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(devlist[i]->bdev);
+		struct backing_dev_info *bdi = &q->backing_dev_info;
 
-		ret |= bdi_congested(&q->backing_dev_info, bits);
+		ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 	}
 	return ret;
 }
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 89939a7..6132848 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -570,7 +570,7 @@ static void raid1_unplug(struct request_queue *q)
 	md_wakeup_thread(mddev->thread);
 }
 
-static int raid1_congested(void *data, int bits)
+static int raid1_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
@@ -581,14 +581,17 @@ static int raid1_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
 			/* Note the '|| 1' - when read_balance prefers
 			 * non-congested targets, it can be removed
 			 */
 			if ((bits & (1<<BDI_async_congested)) || 1)
-				ret |= bdi_congested(&q->backing_dev_info, bits);
+				ret |= group ? bdi_congested_group(bdi, bits,
+					page) : bdi_congested(bdi, bits);
 			else
-				ret &= bdi_congested(&q->backing_dev_info, bits);
+				ret &= group ? bdi_congested_group(bdi, bits,
+					page) : bdi_congested(bdi, bits);
 		}
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ae12cea..3d9c6b0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -625,7 +625,7 @@ static void raid10_unplug(struct request_queue *q)
 	md_wakeup_thread(mddev->thread);
 }
 
-static int raid10_congested(void *data, int bits)
+static int raid10_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
@@ -636,8 +636,10 @@ static int raid10_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
-			ret |= bdi_congested(&q->backing_dev_info, bits);
+			ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 		}
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f9f991e..16e4d1a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3323,7 +3323,7 @@ static void raid5_unplug_device(struct request_queue *q)
 	unplug_slaves(mddev);
 }
 
-static int raid5_congested(void *data, int bits)
+static int raid5_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	raid5_conf_t *conf = mddev->private;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index c2e7a7f..aa8b359 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -455,7 +455,7 @@ int afs_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	wbc->nr_to_write -= ret;
-	if (wbc->nonblocking && bdi_write_congested(bdi))
+	if (wbc->nonblocking && bdi_or_group_write_congested(bdi, page))
 		wbc->encountered_congestion = 1;
 
 	_leave(" = 0");
@@ -491,6 +491,12 @@ static int afs_writepages_region(struct address_space *mapping,
 			return 0;
 		}
 
+		if (wbc->nonblocking && bdi_write_congested_group(bdi, page)) {
+			wbc->encountered_congestion = 1;
+			page_cache_release(page);
+			break;
+		}
+
 		/* at this point we hold neither mapping->tree_lock nor lock on
 		 * the page itself: the page may be truncated or invalidated
 		 * (changing page->mapping to NULL), or even swizzled back from
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d28d29c..cd7cf6c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1249,7 +1249,8 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
 	return root;
 }
 
-static int btrfs_congested_fn(void *congested_data, int bdi_bits)
+static int btrfs_congested_fn(void *congested_data, int bdi_bits,
+					struct page *page, int group)
 {
 	struct btrfs_fs_info *info = (struct btrfs_fs_info *)congested_data;
 	int ret = 0;
@@ -1260,7 +1261,8 @@ static int btrfs_congested_fn(void *congested_data, int bdi_bits)
 		if (!device->bdev)
 			continue;
 		bdi = blk_get_backing_dev_info(device->bdev);
-		if (bdi && bdi_congested(bdi, bdi_bits)) {
+		if (bdi && (group ? bdi_congested_group(bdi, bdi_bits, page) :
+		    bdi_congested(bdi, bdi_bits))) {
 			ret = 1;
 			break;
 		}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6826018..fd7d53f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2368,6 +2368,18 @@ retry:
 		unsigned i;
 
 		scanned = 1;
+
+		/*
+		 * If the io group page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3ab80e9..7ab5dea 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -165,6 +165,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
 	unsigned long limit;
 	unsigned long last_waited = 0;
 	int force_reg = 0;
+	struct page *page;
 
 	bdi = blk_get_backing_dev_info(device->bdev);
 	fs_info = device->dev_root->fs_info;
@@ -276,8 +277,11 @@ loop_lock:
 		 * is now congested.  Back off and let other work structs
 		 * run instead
 		 */
-		if (pending && bdi_write_congested(bdi) && batch_run > 32 &&
-		    fs_info->fs_devices->open_devices > 1) {
+		if (pending)
+			page = bio_iovec_idx(pending, 0)->bv_page;
+
+		if (pending && bdi_or_group_write_congested(bdi, page) &&
+		    batch_run > 32 && fs_info->fs_devices->open_devices > 1) {
 			struct io_context *ioc;
 
 			ioc = current->io_context;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0686684..365ca1b 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1466,6 +1466,17 @@ retry:
 		n_iov = 0;
 		bytes_to_write = 0;
 
+		/*
+		 * If the io group page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking &&
+		    bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			page = pvec.pages[i];
 			/*
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index 15387c9..090a961 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -179,7 +179,7 @@ static void ext2_preread_inode(struct inode *inode)
 	struct backing_dev_info *bdi;
 
 	bdi = inode->i_mapping->backing_dev_info;
-	if (bdi_read_congested(bdi))
+	if (bdi_or_group_read_congested(bdi, NULL))
 		return;
 	if (bdi_write_congested(bdi))
 		return;
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 03ebb43..5b9c93b 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -371,6 +371,18 @@ retry:
 					       PAGECACHE_TAG_DIRTY,
 					       min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		scanned = 1;
+
+		/*
+		 * If the io group the page belongs to is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end);
 		if (ret)
 			done = 1;
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 9e3fe17..aa29612 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -266,8 +266,9 @@ static int nilfs_submit_seg_bio(struct nilfs_write_info *wi, int mode)
 {
 	struct bio *bio = wi->bio;
 	int err;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
 
-	if (wi->nbio > 0 && bdi_write_congested(wi->bdi)) {
+	if (wi->nbio > 0 && bdi_or_group_write_congested(wi->bdi, page)) {
 		wait_for_completion(&wi->bio_event);
 		wi->nbio--;
 		if (unlikely(atomic_read(&wi->err))) {
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 7ec89fc..2a515ab 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -891,7 +891,7 @@ xfs_convert_page(
 
 			bdi = inode->i_mapping->backing_dev_info;
 			wbc->nr_to_write--;
-			if (bdi_write_congested(bdi)) {
+			if (bdi_or_group_write_congested(bdi, page)) {
 				wbc->encountered_congestion = 1;
 				done = 1;
 			} else if (wbc->nr_to_write <= 0) {
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 1418b91..e95c97e 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -714,7 +714,7 @@ xfs_buf_readahead(
 	struct backing_dev_info *bdi;
 
 	bdi = target->bt_mapping->backing_dev_info;
-	if (bdi_read_congested(bdi))
+	if (bdi_or_group_read_congested(bdi, NULL))
 		return;
 
 	flags |= (XBF_TRYLOCK|XBF_ASYNC|XBF_READ_AHEAD);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 0ec2c59..f06fdbf 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -29,7 +29,7 @@ enum bdi_state {
 	BDI_unused,		/* Available bits start here */
 };
 
-typedef int (congested_fn)(void *, int);
+typedef int (congested_fn)(void *, int, struct page *, int);
 
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
@@ -209,7 +209,7 @@ int writeback_in_progress(struct backing_dev_info *bdi);
 static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
 {
 	if (bdi->congested_fn)
-		return bdi->congested_fn(bdi->congested_data, bdi_bits);
+		return bdi->congested_fn(bdi->congested_data, bdi_bits, NULL, 0);
 	return (bdi->state & bdi_bits);
 }
 
@@ -229,6 +229,63 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
 				  (1 << BDI_async_congested));
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+				struct page *page);
+
+extern int bdi_read_congested_group(struct backing_dev_info *bdi,
+						struct page *page);
+
+extern int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_write_congested_group(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_rw_congested_group(struct backing_dev_info *bdi,
+					struct page *page);
+#else /* CONFIG_GROUP_IOSCHED */
+static inline int bdi_congested_group(struct backing_dev_info *bdi,
+					int bdi_bits, struct page *page)
+{
+	return bdi_congested(bdi, bdi_bits);
+}
+
+static inline int bdi_read_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi);
+}
+
+static inline int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi);
+}
+
+static inline int bdi_write_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi);
+}
+
+static inline int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi);
+}
+
+static inline int bdi_rw_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_rw_congested(bdi);
+}
+
+#endif /* CONFIG_GROUP_IOSCHED */
+
 void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
 void set_bdi_congested(struct backing_dev_info *bdi, int rw);
 long congestion_wait(int rw, long timeout);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 655440c..e8565b1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -897,6 +897,11 @@ static inline void blk_set_queue_congested(struct request_queue *q, int rw)
 	set_bdi_congested(&q->backing_dev_info, rw);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int blk_queue_io_group_congested(struct backing_dev_info *bdi,
+					int bdi_bits, struct page *page);
+#endif
+
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 493b468..cef038d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -7,6 +7,7 @@
 #include <linux/module.h>
 #include <linux/writeback.h>
 #include <linux/device.h>
+#include "../block/elevator-fq.h"
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -328,3 +329,64 @@ long congestion_wait(int rw, long timeout)
 }
 EXPORT_SYMBOL(congestion_wait);
 
+/*
+ * With group IO scheduling, there are request descriptors per io group per
+ * queue. So generic notion of whether queue is congested or not is not
+ * very accurate. Queue might not be congested but the io group in which
+ * request will go might actually be congested.
+ *
+ * Hence to get the correct idea about congestion level, one should query
+ * the io group congestion status on the queue. Pass in the page information
+ * which can be used to determine the io group of the page and congestion
+ * status can be determined accordingly.
+ *
+ * If page info is not passed, io group is determined from the current task
+ * context.
+ */
+#ifdef CONFIG_GROUP_IOSCHED
+int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+				struct page *page)
+{
+	if (bdi->congested_fn)
+		return bdi->congested_fn(bdi->congested_data, bdi_bits, page, 1);
+
+	return blk_queue_io_group_congested(bdi, bdi_bits, page);
+}
+EXPORT_SYMBOL(bdi_congested_group);
+
+int bdi_read_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, 1 << BDI_sync_congested, page);
+}
+EXPORT_SYMBOL(bdi_read_congested_group);
+
+/* Checks if either bdi or associated group is read congested */
+int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi) || bdi_read_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_read_congested);
+
+int bdi_write_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, 1 << BDI_async_congested, page);
+}
+EXPORT_SYMBOL(bdi_write_congested_group);
+
+/* Checks if either bdi or associated group is write congested */
+int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi) || bdi_write_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_write_congested);
+
+int bdi_rw_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, (1 << BDI_sync_congested) |
+				  (1 << BDI_async_congested), page);
+}
+EXPORT_SYMBOL(bdi_rw_congested_group);
+
+#endif /* CONFIG_GROUP_IOSCHED */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 31d3675..5ad9453 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -982,6 +982,17 @@ retry:
 		if (nr_pages == 0)
 			break;
 
+		/*
+		 * If the io group page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
diff --git a/mm/readahead.c b/mm/readahead.c
index aa1aa23..22e0639 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -542,7 +542,7 @@ page_cache_async_readahead(struct address_space *mapping,
 	/*
 	 * Defer asynchronous read-ahead on IO congestion.
 	 */
-	if (bdi_read_congested(mapping->backing_dev_info))
+	if (bdi_or_group_read_congested(mapping->backing_dev_info, NULL))
 		return;
 
 	/* do read-ahead */
-- 
1.6.0.6

* [PATCH 22/25] io-controller: Per io group bdi congestion interface
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

o So far there has been only one pair of request descriptor queues (one
  for sync and one for async) per device, and the number of requests
  allocated from them decided whether the associated bdi is congested.

  Now, with the per io group request descriptor infrastructure, there is a
  pair of request descriptor queues per io group per device. So it might
  happen that the overall request queue is not congested but the particular
  io group a bio belongs to is congested.

  Or it could be the other way around: the group is not congested but the
  overall queue is. This can happen if the user has not set the request
  descriptor limits for the queue and the groups consistently
  (q->nr_requests < nr_groups * q->nr_group_requests).

  Hence there is a need for a new interface which can query device congestion
  status per group. The group is determined by the "struct page" the IO will
  be done for. If page is NULL, the group is determined from the current
  task context.

o This patch introduces a new set of functions, bdi_*_congested_group(), which
  take a "struct page" as an additional argument. These functions call the
  block layer, and in turn the elevator, to find out whether the io group the
  page will go into is congested.

o Currently I have introduced the core functions and migrated most of the
  users, but there might still be some left. This is an ongoing TODO item.

o There are some io_get_io_group() related changes which should be pushed into
  earlier patches in the series. I am still testing this patch and will push
  these changes up in the next posting.
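
The queue-versus-group distinction described above can be illustrated with a
small userspace sketch. This is not kernel code: the toy_* structures and
helpers are hypothetical stand-ins, and they only model the idea that a queue
and one of its io groups can disagree about congestion, plus the limit
condition q->nr_requests < nr_groups * q->nr_group_requests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct toy_group {
	unsigned int allocated;     /* request descriptors in use by this group */
	unsigned int congestion_on; /* group is congested at or above this count */
};

struct toy_queue {
	unsigned int allocated;     /* descriptors in use across all groups */
	unsigned int congestion_on; /* queue is congested at or above this count */
};

/* Queue-wide answer, in the spirit of bdi_write_congested(). */
static bool toy_queue_congested(const struct toy_queue *q)
{
	return q->allocated >= q->congestion_on;
}

/* Per-group answer, in the spirit of bdi_write_congested_group(). */
static bool toy_group_congested(const struct toy_group *g)
{
	return g->allocated >= g->congestion_on;
}

/* Either-or check, in the spirit of bdi_or_group_write_congested(). */
static bool toy_queue_or_group_congested(const struct toy_queue *q,
					 const struct toy_group *g)
{
	return toy_queue_congested(q) || toy_group_congested(g);
}

/*
 * True when the groups together may allocate more descriptors than the
 * queue allows, i.e. nr_requests < nr_groups * nr_group_requests. In that
 * case the queue can become congested while no single group is.
 */
static bool toy_limits_oversubscribed(unsigned int nr_requests,
				      unsigned int nr_groups,
				      unsigned int nr_group_requests)
{
	assert(nr_group_requests > 0);
	return nr_requests < nr_groups * nr_group_requests;
}
```

With a queue at 40 of 113 descriptors and a group at 60 of 57, the queue-wide
check says "not congested" while the per-group and either-or checks say
"congested", which is the case the new interface is meant to catch.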

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c            |   21 ++++++++++++++
 block/elevator-fq.c         |   58 ++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h         |    6 ++++
 drivers/md/dm-table.c       |   11 +++++---
 drivers/md/dm.c             |    7 +++--
 drivers/md/dm.h             |    3 +-
 drivers/md/linear.c         |    7 +++-
 drivers/md/multipath.c      |    7 +++-
 drivers/md/raid0.c          |    6 +++-
 drivers/md/raid1.c          |    9 ++++--
 drivers/md/raid10.c         |    6 +++-
 drivers/md/raid5.c          |    2 +-
 fs/afs/write.c              |    8 +++++-
 fs/btrfs/disk-io.c          |    6 +++-
 fs/btrfs/extent_io.c        |   12 ++++++++
 fs/btrfs/volumes.c          |    8 ++++-
 fs/cifs/file.c              |   11 +++++++
 fs/ext2/ialloc.c            |    2 +-
 fs/gfs2/aops.c              |   12 ++++++++
 fs/nilfs2/segbuf.c          |    3 +-
 fs/xfs/linux-2.6/xfs_aops.c |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   61 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/blkdev.h      |    5 +++
 mm/backing-dev.c            |   62 +++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c         |   11 +++++++
 mm/readahead.c              |    2 +-
 27 files changed, 318 insertions(+), 32 deletions(-)
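
For reviewers who want to check the numbers: the per-group thresholds set by
elv_io_group_congestion_threshold() in the diff below can be reproduced in
userspace. The sketch mirrors the arithmetic of that function, but the
toy_thresholds type and the helper names are assumptions made for
illustration and are not part of the patch.

```c
#include <assert.h>

/* Hypothetical container for the two per-group thresholds. */
struct toy_thresholds {
	int on;  /* group counts as congested at or above this many requests */
	int off; /* lower mark, clamped to at least 1 */
};

/* Same integer arithmetic as the patch, on a plain int limit. */
static struct toy_thresholds toy_group_thresholds(int nr_group_requests)
{
	struct toy_thresholds t;
	int nr;

	assert(nr_group_requests > 0);

	/* "on" mark: limit minus one eighth, plus one, capped at the limit */
	nr = nr_group_requests - (nr_group_requests / 8) + 1;
	if (nr > nr_group_requests)
		nr = nr_group_requests;
	t.on = nr;

	/* "off" mark: a further sixteenth (and one more) below the "on" base */
	nr = nr_group_requests - (nr_group_requests / 8)
			- (nr_group_requests / 16) - 1;
	if (nr < 1)
		nr = 1;
	t.off = nr;

	return t;
}

/* Mirrors the comparison in elv_is_iog_congested(): only "on" is used. */
static int toy_group_is_congested(int count, struct toy_thresholds t)
{
	return count >= t.on;
}
```

For a limit of 64 requests per group this gives on = 57 and off = 51; for 128
it gives 113 and 103. Note that the congestion check in this patch compares
only against the "on" mark; the "off" mark is stored but not consulted there.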

diff --git a/block/blk-core.c b/block/blk-core.c
index 2035c20..79fe6a9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -90,6 +90,27 @@ void blk_queue_congestion_threshold(struct request_queue *q)
 	q->nr_congestion_off = nr;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+int blk_queue_io_group_congested(struct backing_dev_info *bdi, int bdi_bits,
+					struct page *page)
+{
+	int ret = 0;
+	struct request_queue *q = bdi->unplug_io_data;
+
+	if (!q || !q->elevator)
+		return bdi_congested(bdi, bdi_bits);
+
+	/* Do we need to hold queue lock? */
+	if (bdi_bits & (1 << BDI_sync_congested))
+		ret |= elv_io_group_congested(q, page, 1);
+
+	if (bdi_bits & (1 << BDI_async_congested))
+		ret |= elv_io_group_congested(q, page, 0);
+
+	return ret;
+}
+#endif
+
 /**
  * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
  * @bdev:	device
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index c4a2b1e..2a2b68d 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1102,6 +1102,58 @@ struct request_list *io_group_get_request_list(struct request_queue *q,
 	return &iog->rl;
 }
 
+/* Set io group congestion on and off thresholds */
+void elv_io_group_congestion_threshold(struct request_queue *q,
+						struct io_group *iog)
+{
+	int nr;
+
+	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
+	if (nr > q->nr_group_requests)
+		nr = q->nr_group_requests;
+	iog->nr_congestion_on = nr;
+
+	nr = q->nr_group_requests - (q->nr_group_requests / 8)
+			- (q->nr_group_requests / 16) - 1;
+	if (nr < 1)
+		nr = 1;
+	iog->nr_congestion_off = nr;
+}
+
+static inline int elv_is_iog_congested(struct request_queue *q,
+					struct io_group *iog, int sync)
+{
+	if (iog->rl.count[sync] >= iog->nr_congestion_on)
+		return 1;
+	return 0;
+}
+
+/* Determine if io group page maps to is congested or not */
+int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
+{
+	struct io_group *iog;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	iog = io_get_io_group(q, page, 0);
+
+	if (!iog) {
+		/*
+		 * Either cgroup got deleted or this is first request in the
+		 * group and associated io group object has not been created
+		 * yet. Map it to root group.
+		 *
+		 * TODO: Fix the case of group not created yet.
+		 */
+		iog = q->elevator->efqd.root_group;
+	}
+
+	ret = elv_is_iog_congested(q, iog, sync);
+	rcu_read_unlock();
+	return ret;
+}
+
 /*
  * Search the io_group for efqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1401,6 +1453,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		elv_get_iog(iog);
 
 		blk_init_request_list(&iog->rl);
+		elv_io_group_congestion_threshold(q, iog);
 
 		if (leaf == NULL) {
 			leaf = iog;
@@ -1680,6 +1733,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
 
 	blk_init_request_list(&iog->rl);
+	elv_io_group_congestion_threshold(q, iog);
 
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
@@ -1688,6 +1742,10 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	iog->iocg_id = css_id(&iocg->css);
 	spin_unlock_irq(&iocg->lock);
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	io_group_path(iog, iog->path, sizeof(iog->path));
+#endif
+
 	return iog;
 }
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 74a7393..214fb61 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -259,6 +259,10 @@ struct io_group {
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
 
+	/* io group congestion on and off threshold for request descriptors */
+	unsigned int nr_congestion_on;
+	unsigned int nr_congestion_off;
+
 	/* request list associated with the group */
 	struct request_list rl;
 };
@@ -531,6 +535,8 @@ extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 						struct bio *bio);
 extern struct request_list *io_group_get_request_list(struct request_queue *q,
 						struct bio *bio);
+extern int elv_io_group_congested(struct request_queue *q, struct page *page,
+					int sync);
 
 /* Sets the single ioq associated with the io group. (noop, deadline, AS) */
 static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 4899ebe..e3fd6d8 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1175,7 +1175,8 @@ int dm_table_resume_targets(struct dm_table *t)
 	return 0;
 }
 
-int dm_table_any_congested(struct dm_table *t, int bdi_bits)
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+				int group)
 {
 	struct dm_dev_internal *dd;
 	struct list_head *devices = dm_table_get_devices(t);
@@ -1185,9 +1186,11 @@ int dm_table_any_congested(struct dm_table *t, int bdi_bits)
 		struct request_queue *q = bdev_get_queue(dd->dm_dev.bdev);
 		char b[BDEVNAME_SIZE];
 
-		if (likely(q))
-			r |= bdi_congested(&q->backing_dev_info, bdi_bits);
-		else
+		if (likely(q)) {
+			struct backing_dev_info *bdi = &q->backing_dev_info;
+			r |= group ? bdi_congested_group(bdi, bdi_bits, page)
+				: bdi_congested(bdi, bdi_bits);
+		} else
 			DMWARN_LIMIT("%s: any_congested: nonexistent device %s",
 				     dm_device_name(t->md),
 				     bdevname(dd->dm_dev.bdev, b));
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c6d4ee..320ef4c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1608,7 +1608,8 @@ static void dm_unplug_all(struct request_queue *q)
 	}
 }
 
-static int dm_any_congested(void *congested_data, int bdi_bits)
+static int dm_any_congested(void *congested_data, int bdi_bits,
+					struct page *page, int group)
 {
 	int r = bdi_bits;
 	struct mapped_device *md = congested_data;
@@ -1625,8 +1626,8 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
 				r = md->queue->backing_dev_info.state &
 				    bdi_bits;
 			else
-				r = dm_table_any_congested(map, bdi_bits);
-
+				r = dm_table_any_congested(map, bdi_bits, page,
+								 group);
 			dm_table_put(map);
 		}
 	}
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 23278ae..9c4c5a5 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -57,7 +57,8 @@ struct list_head *dm_table_get_devices(struct dm_table *t);
 void dm_table_presuspend_targets(struct dm_table *t);
 void dm_table_postsuspend_targets(struct dm_table *t);
 int dm_table_resume_targets(struct dm_table *t);
-int dm_table_any_congested(struct dm_table *t, int bdi_bits);
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+				int group);
 int dm_table_any_busy_target(struct dm_table *t);
 int dm_table_set_type(struct dm_table *t);
 unsigned dm_table_get_type(struct dm_table *t);
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 15c8b7b..f227075 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -102,7 +102,7 @@ static void linear_unplug(struct request_queue *q)
 	rcu_read_unlock();
 }
 
-static int linear_congested(void *data, int bits)
+static int linear_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	linear_conf_t *conf;
@@ -113,7 +113,10 @@ static int linear_congested(void *data, int bits)
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(conf->disks[i].rdev->bdev);
-		ret |= bdi_congested(&q->backing_dev_info, bits);
+		struct backing_dev_info *bdi = &q->backing_dev_info;
+
+		ret |= group ? bdi_congested_group(bdi, bits, page) :
+			bdi_congested(bdi, bits);
 	}
 
 	rcu_read_unlock();
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index cbe368f..87bb1dd 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -192,7 +192,8 @@ static void multipath_status (struct seq_file *seq, mddev_t *mddev)
 	seq_printf (seq, "]");
 }
 
-static int multipath_congested(void *data, int bits)
+static int multipath_congested(void *data, int bits, struct page *page,
+					int group)
 {
 	mddev_t *mddev = data;
 	multipath_conf_t *conf = mddev->private;
@@ -203,8 +204,10 @@ static int multipath_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->multipaths[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
-			ret |= bdi_congested(&q->backing_dev_info, bits);
+			ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 			/* Just like multipath_map, we just check the
 			 * first available device
 			 */
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index ab4a489..f7813e7 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -37,7 +37,7 @@ static void raid0_unplug(struct request_queue *q)
 	}
 }
 
-static int raid0_congested(void *data, int bits)
+static int raid0_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	raid0_conf_t *conf = mddev->private;
@@ -46,8 +46,10 @@ static int raid0_congested(void *data, int bits)
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(devlist[i]->bdev);
+		struct backing_dev_info *bdi = &q->backing_dev_info;
 
-		ret |= bdi_congested(&q->backing_dev_info, bits);
+		ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 	}
 	return ret;
 }
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 89939a7..6132848 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -570,7 +570,7 @@ static void raid1_unplug(struct request_queue *q)
 	md_wakeup_thread(mddev->thread);
 }
 
-static int raid1_congested(void *data, int bits)
+static int raid1_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
@@ -581,14 +581,17 @@ static int raid1_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
 			/* Note the '|| 1' - when read_balance prefers
 			 * non-congested targets, it can be removed
 			 */
 			if ((bits & (1<<BDI_async_congested)) || 1)
-				ret |= bdi_congested(&q->backing_dev_info, bits);
+				ret |= group ? bdi_congested_group(bdi, bits,
+					page) : bdi_congested(bdi, bits);
 			else
-				ret &= bdi_congested(&q->backing_dev_info, bits);
+				ret &= group ? bdi_congested_group(bdi, bits,
+					page) : bdi_congested(bdi, bits);
 		}
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ae12cea..3d9c6b0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -625,7 +625,7 @@ static void raid10_unplug(struct request_queue *q)
 	md_wakeup_thread(mddev->thread);
 }
 
-static int raid10_congested(void *data, int bits)
+static int raid10_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
@@ -636,8 +636,10 @@ static int raid10_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
-			ret |= bdi_congested(&q->backing_dev_info, bits);
+			ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 		}
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f9f991e..16e4d1a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3323,7 +3323,7 @@ static void raid5_unplug_device(struct request_queue *q)
 	unplug_slaves(mddev);
 }
 
-static int raid5_congested(void *data, int bits)
+static int raid5_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	raid5_conf_t *conf = mddev->private;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index c2e7a7f..aa8b359 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -455,7 +455,7 @@ int afs_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	wbc->nr_to_write -= ret;
-	if (wbc->nonblocking && bdi_write_congested(bdi))
+	if (wbc->nonblocking && bdi_or_group_write_congested(bdi, page))
 		wbc->encountered_congestion = 1;
 
 	_leave(" = 0");
@@ -491,6 +491,12 @@ static int afs_writepages_region(struct address_space *mapping,
 			return 0;
 		}
 
+		if (wbc->nonblocking && bdi_write_congested_group(bdi, page)) {
+			wbc->encountered_congestion = 1;
+			page_cache_release(page);
+			break;
+		}
+
 		/* at this point we hold neither mapping->tree_lock nor lock on
 		 * the page itself: the page may be truncated or invalidated
 		 * (changing page->mapping to NULL), or even swizzled back from
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d28d29c..cd7cf6c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1249,7 +1249,8 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
 	return root;
 }
 
-static int btrfs_congested_fn(void *congested_data, int bdi_bits)
+static int btrfs_congested_fn(void *congested_data, int bdi_bits,
+					struct page *page, int group)
 {
 	struct btrfs_fs_info *info = (struct btrfs_fs_info *)congested_data;
 	int ret = 0;
@@ -1260,7 +1261,8 @@ static int btrfs_congested_fn(void *congested_data, int bdi_bits)
 		if (!device->bdev)
 			continue;
 		bdi = blk_get_backing_dev_info(device->bdev);
-		if (bdi && bdi_congested(bdi, bdi_bits)) {
+		if (bdi && (group ? bdi_congested_group(bdi, bdi_bits, page) :
+		    bdi_congested(bdi, bdi_bits))) {
 			ret = 1;
 			break;
 		}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6826018..fd7d53f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2368,6 +2368,18 @@ retry:
 		unsigned i;
 
 		scanned = 1;
+
+		/*
+		 * If the io group page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3ab80e9..7ab5dea 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -165,6 +165,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
 	unsigned long limit;
 	unsigned long last_waited = 0;
 	int force_reg = 0;
+	struct page *page;
 
 	bdi = blk_get_backing_dev_info(device->bdev);
 	fs_info = device->dev_root->fs_info;
@@ -276,8 +277,11 @@ loop_lock:
 		 * is now congested.  Back off and let other work structs
 		 * run instead
 		 */
-		if (pending && bdi_write_congested(bdi) && batch_run > 32 &&
-		    fs_info->fs_devices->open_devices > 1) {
+		if (pending)
+			page = bio_iovec_idx(pending, 0)->bv_page;
+
+		if (pending && bdi_or_group_write_congested(bdi, page) &&
+		    batch_run > 32 && fs_info->fs_devices->open_devices > 1) {
 			struct io_context *ioc;
 
 			ioc = current->io_context;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0686684..365ca1b 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1466,6 +1466,17 @@ retry:
 		n_iov = 0;
 		bytes_to_write = 0;
 
+		/*
+		 * If the io group page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking &&
+		    bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			page = pvec.pages[i];
 			/*
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index 15387c9..090a961 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -179,7 +179,7 @@ static void ext2_preread_inode(struct inode *inode)
 	struct backing_dev_info *bdi;
 
 	bdi = inode->i_mapping->backing_dev_info;
-	if (bdi_read_congested(bdi))
+	if (bdi_or_group_read_congested(bdi, NULL))
 		return;
 	if (bdi_write_congested(bdi))
 		return;
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 03ebb43..5b9c93b 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -371,6 +371,18 @@ retry:
 					       PAGECACHE_TAG_DIRTY,
 					       min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		scanned = 1;
+
+		/*
+		 * If the io group the page belongs to is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end);
 		if (ret)
 			done = 1;
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 9e3fe17..aa29612 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -266,8 +266,9 @@ static int nilfs_submit_seg_bio(struct nilfs_write_info *wi, int mode)
 {
 	struct bio *bio = wi->bio;
 	int err;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
 
-	if (wi->nbio > 0 && bdi_write_congested(wi->bdi)) {
+	if (wi->nbio > 0 && bdi_or_group_write_congested(wi->bdi, page)) {
 		wait_for_completion(&wi->bio_event);
 		wi->nbio--;
 		if (unlikely(atomic_read(&wi->err))) {
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 7ec89fc..2a515ab 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -891,7 +891,7 @@ xfs_convert_page(
 
 			bdi = inode->i_mapping->backing_dev_info;
 			wbc->nr_to_write--;
-			if (bdi_write_congested(bdi)) {
+			if (bdi_or_group_write_congested(bdi, page)) {
 				wbc->encountered_congestion = 1;
 				done = 1;
 			} else if (wbc->nr_to_write <= 0) {
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 1418b91..e95c97e 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -714,7 +714,7 @@ xfs_buf_readahead(
 	struct backing_dev_info *bdi;
 
 	bdi = target->bt_mapping->backing_dev_info;
-	if (bdi_read_congested(bdi))
+	if (bdi_or_group_read_congested(bdi, NULL))
 		return;
 
 	flags |= (XBF_TRYLOCK|XBF_ASYNC|XBF_READ_AHEAD);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 0ec2c59..f06fdbf 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -29,7 +29,7 @@ enum bdi_state {
 	BDI_unused,		/* Available bits start here */
 };
 
-typedef int (congested_fn)(void *, int);
+typedef int (congested_fn)(void *, int, struct page *, int);
 
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
@@ -209,7 +209,7 @@ int writeback_in_progress(struct backing_dev_info *bdi);
 static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
 {
 	if (bdi->congested_fn)
-		return bdi->congested_fn(bdi->congested_data, bdi_bits);
+		return bdi->congested_fn(bdi->congested_data, bdi_bits, NULL, 0);
 	return (bdi->state & bdi_bits);
 }
 
@@ -229,6 +229,63 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
 				  (1 << BDI_async_congested));
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+				struct page *page);
+
+extern int bdi_read_congested_group(struct backing_dev_info *bdi,
+						struct page *page);
+
+extern int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_write_congested_group(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_rw_congested_group(struct backing_dev_info *bdi,
+					struct page *page);
+#else /* CONFIG_GROUP_IOSCHED */
+static inline int bdi_congested_group(struct backing_dev_info *bdi,
+					int bdi_bits, struct page *page)
+{
+	return bdi_congested(bdi, bdi_bits);
+}
+
+static inline int bdi_read_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi);
+}
+
+static inline int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi);
+}
+
+static inline int bdi_write_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi);
+}
+
+static inline int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi);
+}
+
+static inline int bdi_rw_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_rw_congested(bdi);
+}
+
+#endif /* CONFIG_GROUP_IOSCHED */
+
 void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
 void set_bdi_congested(struct backing_dev_info *bdi, int rw);
 long congestion_wait(int rw, long timeout);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 655440c..e8565b1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -897,6 +897,11 @@ static inline void blk_set_queue_congested(struct request_queue *q, int rw)
 	set_bdi_congested(&q->backing_dev_info, rw);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int blk_queue_io_group_congested(struct backing_dev_info *bdi,
+					int bdi_bits, struct page *page);
+#endif
+
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 493b468..cef038d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -7,6 +7,7 @@
 #include <linux/module.h>
 #include <linux/writeback.h>
 #include <linux/device.h>
+#include "../block/elevator-fq.h"
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -328,3 +329,64 @@ long congestion_wait(int rw, long timeout)
 }
 EXPORT_SYMBOL(congestion_wait);
 
+/*
+ * With group IO scheduling, there are request descriptors per io group per
+ * queue. So generic notion of whether queue is congested or not is not
+ * very accurate. Queue might not be congested but the io group in which
+ * request will go might actually be congested.
+ *
+ * Hence to get the correct idea about congestion level, one should query
+ * the io group congestion status on the queue. Pass in the page information
+ * which can be used to determine the io group of the page and congestion
+ * status can be determined accordingly.
+ *
+ * If page info is not passed, io group is determined from the current task
+ * context.
+ */
+#ifdef CONFIG_GROUP_IOSCHED
+int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+				struct page *page)
+{
+	if (bdi->congested_fn)
+		return bdi->congested_fn(bdi->congested_data, bdi_bits, page, 1);
+
+	return blk_queue_io_group_congested(bdi, bdi_bits, page);
+}
+EXPORT_SYMBOL(bdi_congested_group);
+
+int bdi_read_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, 1 << BDI_sync_congested, page);
+}
+EXPORT_SYMBOL(bdi_read_congested_group);
+
+/* Checks if either bdi or associated group is read congested */
+int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi) || bdi_read_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_read_congested);
+
+int bdi_write_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, 1 << BDI_async_congested, page);
+}
+EXPORT_SYMBOL(bdi_write_congested_group);
+
+/* Checks if either bdi or associated group is write congested */
+int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi) || bdi_write_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_write_congested);
+
+int bdi_rw_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, (1 << BDI_sync_congested) |
+				  (1 << BDI_async_congested), page);
+}
+EXPORT_SYMBOL(bdi_rw_congested_group);
+
+#endif /* CONFIG_GROUP_IOSCHED */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 31d3675..5ad9453 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -982,6 +982,17 @@ retry:
 		if (nr_pages == 0)
 			break;
 
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
diff --git a/mm/readahead.c b/mm/readahead.c
index aa1aa23..22e0639 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -542,7 +542,7 @@ page_cache_async_readahead(struct address_space *mapping,
 	/*
 	 * Defer asynchronous read-ahead on IO congestion.
 	 */
-	if (bdi_read_congested(mapping->backing_dev_info))
+	if (bdi_or_group_read_congested(mapping->backing_dev_info, NULL))
 		return;
 
 	/* do read-ahead */
-- 
1.6.0.6



* [PATCH 22/25] io-controller: Per io group bdi congestion interface
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

o Until now there was only one pair of request descriptor queues (one for
  sync and one for async) per device, and the number of requests allocated
  decided whether the associated bdi was congested or not.

  Now, with the per io group request descriptor infrastructure, there is a
  pair of request descriptor queues per io group per device. So it can happen
  that the overall request queue is not congested but the particular io group
  a bio belongs to is congested.

  Or it could be the other way around: the group is not congested but the
  overall queue is. This can happen if the user has not set the request
  descriptor limits for the queue and groups properly
  (q->nr_requests < nr_groups * q->nr_group_requests).

  Hence there is a need for a new interface which can query device congestion
  status per group. The group is determined by the "struct page" the IO will
  be done for. If page is NULL, then the group is determined from the current
  task context.

o This patch introduces a new set of functions, bdi_*_congested_group(), which
  take "struct page" as an additional argument. These functions call into the
  block layer, and in turn the elevator, to find out whether the io group the
  page will go into is congested or not.

o Currently I have introduced the core functions and migrated most of the
  users, but there might still be some left. This is an ongoing TODO item.

o There are some io_get_io_group() related changes which should be pushed into
  higher patches. Still testing this patch. Will push these changes up in next
  posting.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
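
The two mismatch cases described in the changelog (group congested while the
queue is not, and the reverse) can be illustrated with a small userspace
model. This is only a sketch: struct model_queue and the model_*() helpers
are hypothetical names invented for illustration, and "congested" is
simplified to "limit reached", whereas the patch uses separate on/off
thresholds just below the limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in; not the kernel's request_queue. */
struct model_queue {
	unsigned int nr_requests;       /* per-queue descriptor limit */
	unsigned int nr_group_requests; /* per-group descriptor limit */
};

/* Simplification: the queue is congested once its limit is reached. */
static bool model_queue_congested(const struct model_queue *q,
				  unsigned int total_allocated)
{
	assert(q->nr_requests > 0);
	return total_allocated >= q->nr_requests;
}

/* Simplification: a group is congested once its own limit is reached. */
static bool model_group_congested(const struct model_queue *q,
				  unsigned int group_allocated)
{
	assert(q->nr_group_requests > 0);
	return group_allocated >= q->nr_group_requests;
}

/*
 * Mirrors the idea behind bdi_or_group_write_congested(): back off if
 * either the whole queue or the submitting group is congested.
 */
static bool model_queue_or_group_congested(const struct model_queue *q,
					   unsigned int total_allocated,
					   unsigned int group_allocated)
{
	return model_queue_congested(q, total_allocated) ||
	       model_group_congested(q, group_allocated);
}
```

With nr_requests = 128 and nr_group_requests = 64, four groups holding 32
descriptors each congest the queue without congesting any group, while a
single group holding 64 descriptors is congested although the queue is not.
In both cases the combined check reports congestion.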
 block/blk-core.c            |   21 ++++++++++++++
 block/elevator-fq.c         |   58 ++++++++++++++++++++++++++++++++++++++++
 block/elevator-fq.h         |    6 ++++
 drivers/md/dm-table.c       |   11 +++++---
 drivers/md/dm.c             |    7 +++--
 drivers/md/dm.h             |    3 +-
 drivers/md/linear.c         |    7 +++-
 drivers/md/multipath.c      |    7 +++-
 drivers/md/raid0.c          |    6 +++-
 drivers/md/raid1.c          |    9 ++++--
 drivers/md/raid10.c         |    6 +++-
 drivers/md/raid5.c          |    2 +-
 fs/afs/write.c              |    8 +++++-
 fs/btrfs/disk-io.c          |    6 +++-
 fs/btrfs/extent_io.c        |   12 ++++++++
 fs/btrfs/volumes.c          |    8 ++++-
 fs/cifs/file.c              |   11 +++++++
 fs/ext2/ialloc.c            |    2 +-
 fs/gfs2/aops.c              |   12 ++++++++
 fs/nilfs2/segbuf.c          |    3 +-
 fs/xfs/linux-2.6/xfs_aops.c |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   61 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/blkdev.h      |    5 +++
 mm/backing-dev.c            |   62 +++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c         |   11 +++++++
 mm/readahead.c              |    2 +-
 27 files changed, 318 insertions(+), 32 deletions(-)
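
For reference, the per-group congestion thresholds that the elevator-fq.c
hunk computes from q->nr_group_requests can be reproduced in userspace. The
arithmetic below follows elv_io_group_congestion_threshold() from this patch;
struct iog_thresholds and iog_congestion_thresholds() are hypothetical names
used only for this sketch, and plain int is assumed for the limit.

```c
#include <assert.h>

/* Hypothetical container for the two thresholds; illustration only. */
struct iog_thresholds {
	int on;  /* group is treated as congested at or above this count */
	int off; /* lower mark, analogous to the queue's nr_congestion_off */
};

/* Same arithmetic as elv_io_group_congestion_threshold() in the patch. */
static struct iog_thresholds iog_congestion_thresholds(int nr_group_requests)
{
	struct iog_thresholds t;
	int nr;

	assert(nr_group_requests > 0);

	/* "on" mark: 7/8 of the limit plus one, capped at the limit. */
	nr = nr_group_requests - (nr_group_requests / 8) + 1;
	if (nr > nr_group_requests)
		nr = nr_group_requests;
	t.on = nr;

	/* "off" mark: a further 1/16 of the limit lower, at least 1. */
	nr = nr_group_requests - (nr_group_requests / 8)
			- (nr_group_requests / 16) - 1;
	if (nr < 1)
		nr = 1;
	t.off = nr;

	return t;
}
```

For a group limit of 128 this gives on = 113 and off = 103; for a limit of 4
the cap applies and the result is on = 4, off = 3.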

diff --git a/block/blk-core.c b/block/blk-core.c
index 2035c20..79fe6a9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -90,6 +90,27 @@ void blk_queue_congestion_threshold(struct request_queue *q)
 	q->nr_congestion_off = nr;
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+int blk_queue_io_group_congested(struct backing_dev_info *bdi, int bdi_bits,
+					struct page *page)
+{
+	int ret = 0;
+	struct request_queue *q = bdi->unplug_io_data;
+
+	if (!q || !q->elevator)
+		return bdi_congested(bdi, bdi_bits);
+
+	/* Do we need to hold queue lock? */
+	if (bdi_bits & (1 << BDI_sync_congested))
+		ret |= elv_io_group_congested(q, page, 1);
+
+	if (bdi_bits & (1 << BDI_async_congested))
+		ret |= elv_io_group_congested(q, page, 0);
+
+	return ret;
+}
+#endif
+
 /**
  * blk_get_backing_dev_info - get the address of a queue's backing_dev_info
  * @bdev:	device
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index c4a2b1e..2a2b68d 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1102,6 +1102,58 @@ struct request_list *io_group_get_request_list(struct request_queue *q,
 	return &iog->rl;
 }
 
+/* Set io group congestion on and off thresholds */
+void elv_io_group_congestion_threshold(struct request_queue *q,
+						struct io_group *iog)
+{
+	int nr;
+
+	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
+	if (nr > q->nr_group_requests)
+		nr = q->nr_group_requests;
+	iog->nr_congestion_on = nr;
+
+	nr = q->nr_group_requests - (q->nr_group_requests / 8)
+			- (q->nr_group_requests / 16) - 1;
+	if (nr < 1)
+		nr = 1;
+	iog->nr_congestion_off = nr;
+}
+
+static inline int elv_is_iog_congested(struct request_queue *q,
+					struct io_group *iog, int sync)
+{
+	if (iog->rl.count[sync] >= iog->nr_congestion_on)
+		return 1;
+	return 0;
+}
+
+/* Determine whether the io group the page maps to is congested */
+int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
+{
+	struct io_group *iog;
+	int ret = 0;
+
+	rcu_read_lock();
+
+	iog = io_get_io_group(q, page, 0);
+
+	if (!iog) {
+		/*
+		 * Either cgroup got deleted or this is first request in the
+		 * group and associated io group object has not been created
+		 * yet. Map it to root group.
+		 *
+		 * TODO: Fix the case of group not created yet.
+		 */
+		iog = q->elevator->efqd.root_group;
+	}
+
+	ret = elv_is_iog_congested(q, iog, sync);
+	rcu_read_unlock();
+	return ret;
+}
+
 /*
  * Search the io_group for efqd into the hash table (by now only a list)
  * of bgrp.  Must be called under rcu_read_lock().
@@ -1401,6 +1453,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		elv_get_iog(iog);
 
 		blk_init_request_list(&iog->rl);
+		elv_io_group_congestion_threshold(q, iog);
 
 		if (leaf == NULL) {
 			leaf = iog;
@@ -1680,6 +1733,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
 
 	blk_init_request_list(&iog->rl);
+	elv_io_group_congestion_threshold(q, iog);
 
 	iocg = &io_root_cgroup;
 	spin_lock_irq(&iocg->lock);
@@ -1688,6 +1742,10 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	iog->iocg_id = css_id(&iocg->css);
 	spin_unlock_irq(&iocg->lock);
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	io_group_path(iog, iog->path, sizeof(iog->path));
+#endif
+
 	return iog;
 }
 
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 74a7393..214fb61 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -259,6 +259,10 @@ struct io_group {
 	/* Single ioq per group, used for noop, deadline, anticipatory */
 	struct io_queue *ioq;
 
+	/* io group congestion on and off threshold for request descriptors */
+	unsigned int nr_congestion_on;
+	unsigned int nr_congestion_off;
+
 	/* request list associated with the group */
 	struct request_list rl;
 };
@@ -531,6 +535,8 @@ extern struct io_queue *elv_lookup_ioq_bio(struct request_queue *q,
 						struct bio *bio);
 extern struct request_list *io_group_get_request_list(struct request_queue *q,
 						struct bio *bio);
+extern int elv_io_group_congested(struct request_queue *q, struct page *page,
+					int sync);
 
 /* Sets the single ioq associated with the io group. (noop, deadline, AS) */
 static inline void io_group_set_ioq(struct io_group *iog, struct io_queue *ioq)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 4899ebe..e3fd6d8 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1175,7 +1175,8 @@ int dm_table_resume_targets(struct dm_table *t)
 	return 0;
 }
 
-int dm_table_any_congested(struct dm_table *t, int bdi_bits)
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+				int group)
 {
 	struct dm_dev_internal *dd;
 	struct list_head *devices = dm_table_get_devices(t);
@@ -1185,9 +1186,11 @@ int dm_table_any_congested(struct dm_table *t, int bdi_bits)
 		struct request_queue *q = bdev_get_queue(dd->dm_dev.bdev);
 		char b[BDEVNAME_SIZE];
 
-		if (likely(q))
-			r |= bdi_congested(&q->backing_dev_info, bdi_bits);
-		else
+		if (likely(q)) {
+			struct backing_dev_info *bdi = &q->backing_dev_info;
+			r |= group ? bdi_congested_group(bdi, bdi_bits, page)
+				: bdi_congested(bdi, bdi_bits);
+		} else
 			DMWARN_LIMIT("%s: any_congested: nonexistent device %s",
 				     dm_device_name(t->md),
 				     bdevname(dd->dm_dev.bdev, b));
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c6d4ee..320ef4c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1608,7 +1608,8 @@ static void dm_unplug_all(struct request_queue *q)
 	}
 }
 
-static int dm_any_congested(void *congested_data, int bdi_bits)
+static int dm_any_congested(void *congested_data, int bdi_bits,
+					struct page *page, int group)
 {
 	int r = bdi_bits;
 	struct mapped_device *md = congested_data;
@@ -1625,8 +1626,8 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
 				r = md->queue->backing_dev_info.state &
 				    bdi_bits;
 			else
-				r = dm_table_any_congested(map, bdi_bits);
-
+				r = dm_table_any_congested(map, bdi_bits, page,
+								 group);
 			dm_table_put(map);
 		}
 	}
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 23278ae..9c4c5a5 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -57,7 +57,8 @@ struct list_head *dm_table_get_devices(struct dm_table *t);
 void dm_table_presuspend_targets(struct dm_table *t);
 void dm_table_postsuspend_targets(struct dm_table *t);
 int dm_table_resume_targets(struct dm_table *t);
-int dm_table_any_congested(struct dm_table *t, int bdi_bits);
+int dm_table_any_congested(struct dm_table *t, int bdi_bits, struct page *page,
+				int group);
 int dm_table_any_busy_target(struct dm_table *t);
 int dm_table_set_type(struct dm_table *t);
 unsigned dm_table_get_type(struct dm_table *t);
diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 15c8b7b..f227075 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -102,7 +102,7 @@ static void linear_unplug(struct request_queue *q)
 	rcu_read_unlock();
 }
 
-static int linear_congested(void *data, int bits)
+static int linear_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	linear_conf_t *conf;
@@ -113,7 +113,10 @@ static int linear_congested(void *data, int bits)
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(conf->disks[i].rdev->bdev);
-		ret |= bdi_congested(&q->backing_dev_info, bits);
+		struct backing_dev_info *bdi = &q->backing_dev_info;
+
+		ret |= group ? bdi_congested_group(bdi, bits, page) :
+			bdi_congested(bdi, bits);
 	}
 
 	rcu_read_unlock();
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index cbe368f..87bb1dd 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -192,7 +192,8 @@ static void multipath_status (struct seq_file *seq, mddev_t *mddev)
 	seq_printf (seq, "]");
 }
 
-static int multipath_congested(void *data, int bits)
+static int multipath_congested(void *data, int bits, struct page *page,
+					int group)
 {
 	mddev_t *mddev = data;
 	multipath_conf_t *conf = mddev->private;
@@ -203,8 +204,10 @@ static int multipath_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->multipaths[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
-			ret |= bdi_congested(&q->backing_dev_info, bits);
+			ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 			/* Just like multipath_map, we just check the
 			 * first available device
 			 */
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index ab4a489..f7813e7 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -37,7 +37,7 @@ static void raid0_unplug(struct request_queue *q)
 	}
 }
 
-static int raid0_congested(void *data, int bits)
+static int raid0_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	raid0_conf_t *conf = mddev->private;
@@ -46,8 +46,10 @@ static int raid0_congested(void *data, int bits)
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(devlist[i]->bdev);
+		struct backing_dev_info *bdi = &q->backing_dev_info;
 
-		ret |= bdi_congested(&q->backing_dev_info, bits);
+		ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 	}
 	return ret;
 }
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 89939a7..6132848 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -570,7 +570,7 @@ static void raid1_unplug(struct request_queue *q)
 	md_wakeup_thread(mddev->thread);
 }
 
-static int raid1_congested(void *data, int bits)
+static int raid1_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
@@ -581,14 +581,17 @@ static int raid1_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
 			/* Note the '|| 1' - when read_balance prefers
 			 * non-congested targets, it can be removed
 			 */
 			if ((bits & (1<<BDI_async_congested)) || 1)
-				ret |= bdi_congested(&q->backing_dev_info, bits);
+				ret |= group ? bdi_congested_group(bdi, bits,
+					page) : bdi_congested(bdi, bits);
 			else
-				ret &= bdi_congested(&q->backing_dev_info, bits);
+				ret &= group ? bdi_congested_group(bdi, bits,
+					page) : bdi_congested(bdi, bits);
 		}
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ae12cea..3d9c6b0 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -625,7 +625,7 @@ static void raid10_unplug(struct request_queue *q)
 	md_wakeup_thread(mddev->thread);
 }
 
-static int raid10_congested(void *data, int bits)
+static int raid10_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	conf_t *conf = mddev->private;
@@ -636,8 +636,10 @@ static int raid10_congested(void *data, int bits)
 		mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
 		if (rdev && !test_bit(Faulty, &rdev->flags)) {
 			struct request_queue *q = bdev_get_queue(rdev->bdev);
+			struct backing_dev_info *bdi = &q->backing_dev_info;
 
-			ret |= bdi_congested(&q->backing_dev_info, bits);
+			ret |= group ? bdi_congested_group(bdi, bits, page)
+				: bdi_congested(bdi, bits);
 		}
 	}
 	rcu_read_unlock();
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f9f991e..16e4d1a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3323,7 +3323,7 @@ static void raid5_unplug_device(struct request_queue *q)
 	unplug_slaves(mddev);
 }
 
-static int raid5_congested(void *data, int bits)
+static int raid5_congested(void *data, int bits, struct page *page, int group)
 {
 	mddev_t *mddev = data;
 	raid5_conf_t *conf = mddev->private;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index c2e7a7f..aa8b359 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -455,7 +455,7 @@ int afs_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	wbc->nr_to_write -= ret;
-	if (wbc->nonblocking && bdi_write_congested(bdi))
+	if (wbc->nonblocking && bdi_or_group_write_congested(bdi, page))
 		wbc->encountered_congestion = 1;
 
 	_leave(" = 0");
@@ -491,6 +491,12 @@ static int afs_writepages_region(struct address_space *mapping,
 			return 0;
 		}
 
+		if (wbc->nonblocking && bdi_write_congested_group(bdi, page)) {
+			wbc->encountered_congestion = 1;
+			page_cache_release(page);
+			break;
+		}
+
 		/* at this point we hold neither mapping->tree_lock nor lock on
 		 * the page itself: the page may be truncated or invalidated
 		 * (changing page->mapping to NULL), or even swizzled back from
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d28d29c..cd7cf6c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1249,7 +1249,8 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
 	return root;
 }
 
-static int btrfs_congested_fn(void *congested_data, int bdi_bits)
+static int btrfs_congested_fn(void *congested_data, int bdi_bits,
+					struct page *page, int group)
 {
 	struct btrfs_fs_info *info = (struct btrfs_fs_info *)congested_data;
 	int ret = 0;
@@ -1260,7 +1261,8 @@ static int btrfs_congested_fn(void *congested_data, int bdi_bits)
 		if (!device->bdev)
 			continue;
 		bdi = blk_get_backing_dev_info(device->bdev);
-		if (bdi && bdi_congested(bdi, bdi_bits)) {
+		if (bdi && (group ? bdi_congested_group(bdi, bdi_bits, page) :
+		    bdi_congested(bdi, bdi_bits))) {
 			ret = 1;
 			break;
 		}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6826018..fd7d53f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2368,6 +2368,18 @@ retry:
 		unsigned i;
 
 		scanned = 1;
+
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3ab80e9..7ab5dea 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -165,6 +165,7 @@ static noinline int run_scheduled_bios(struct btrfs_device *device)
 	unsigned long limit;
 	unsigned long last_waited = 0;
 	int force_reg = 0;
+	struct page *page;
 
 	bdi = blk_get_backing_dev_info(device->bdev);
 	fs_info = device->dev_root->fs_info;
@@ -276,8 +277,11 @@ loop_lock:
 		 * is now congested.  Back off and let other work structs
 		 * run instead
 		 */
-		if (pending && bdi_write_congested(bdi) && batch_run > 32 &&
-		    fs_info->fs_devices->open_devices > 1) {
+		if (pending)
+			page = bio_iovec_idx(pending, 0)->bv_page;
+
+		if (pending && bdi_or_group_write_congested(bdi, page) &&
+		    num_run > 32 && fs_info->fs_devices->open_devices > 1) {
 			struct io_context *ioc;
 
 			ioc = current->io_context;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0686684..365ca1b 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1466,6 +1466,17 @@ retry:
 		n_iov = 0;
 		bytes_to_write = 0;
 
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking &&
+		    bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			page = pvec.pages[i];
 			/*
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index 15387c9..090a961 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -179,7 +179,7 @@ static void ext2_preread_inode(struct inode *inode)
 	struct backing_dev_info *bdi;
 
 	bdi = inode->i_mapping->backing_dev_info;
-	if (bdi_read_congested(bdi))
+	if (bdi_or_group_read_congested(bdi, NULL))
 		return;
 	if (bdi_write_congested(bdi))
 		return;
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 03ebb43..5b9c93b 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -371,6 +371,18 @@ retry:
 					       PAGECACHE_TAG_DIRTY,
 					       min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		scanned = 1;
+
+		/*
+		 * If the io group the page belongs to is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end);
 		if (ret)
 			done = 1;
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 9e3fe17..aa29612 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -266,8 +266,9 @@ static int nilfs_submit_seg_bio(struct nilfs_write_info *wi, int mode)
 {
 	struct bio *bio = wi->bio;
 	int err;
+	struct page *page = bio_iovec_idx(bio, 0)->bv_page;
 
-	if (wi->nbio > 0 && bdi_write_congested(wi->bdi)) {
+	if (wi->nbio > 0 && bdi_or_group_write_congested(wi->bdi, page)) {
 		wait_for_completion(&wi->bio_event);
 		wi->nbio--;
 		if (unlikely(atomic_read(&wi->err))) {
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 7ec89fc..2a515ab 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -891,7 +891,7 @@ xfs_convert_page(
 
 			bdi = inode->i_mapping->backing_dev_info;
 			wbc->nr_to_write--;
-			if (bdi_write_congested(bdi)) {
+			if (bdi_or_group_write_congested(bdi, page)) {
 				wbc->encountered_congestion = 1;
 				done = 1;
 			} else if (wbc->nr_to_write <= 0) {
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 1418b91..e95c97e 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -714,7 +714,7 @@ xfs_buf_readahead(
 	struct backing_dev_info *bdi;
 
 	bdi = target->bt_mapping->backing_dev_info;
-	if (bdi_read_congested(bdi))
+	if (bdi_or_group_read_congested(bdi, NULL))
 		return;
 
 	flags |= (XBF_TRYLOCK|XBF_ASYNC|XBF_READ_AHEAD);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 0ec2c59..f06fdbf 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -29,7 +29,7 @@ enum bdi_state {
 	BDI_unused,		/* Available bits start here */
 };
 
-typedef int (congested_fn)(void *, int);
+typedef int (congested_fn)(void *, int, struct page *, int);
 
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
@@ -209,7 +209,7 @@ int writeback_in_progress(struct backing_dev_info *bdi);
 static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
 {
 	if (bdi->congested_fn)
-		return bdi->congested_fn(bdi->congested_data, bdi_bits);
+		return bdi->congested_fn(bdi->congested_data, bdi_bits, NULL, 0);
 	return (bdi->state & bdi_bits);
 }
 
@@ -229,6 +229,63 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
 				  (1 << BDI_async_congested));
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+				struct page *page);
+
+extern int bdi_read_congested_group(struct backing_dev_info *bdi,
+						struct page *page);
+
+extern int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_write_congested_group(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+					struct page *page);
+
+extern int bdi_rw_congested_group(struct backing_dev_info *bdi,
+					struct page *page);
+#else /* CONFIG_GROUP_IOSCHED */
+static inline int bdi_congested_group(struct backing_dev_info *bdi,
+					int bdi_bits, struct page *page)
+{
+	return bdi_congested(bdi, bdi_bits);
+}
+
+static inline int bdi_read_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi);
+}
+
+static inline int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi);
+}
+
+static inline int bdi_write_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi);
+}
+
+static inline int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi);
+}
+
+static inline int bdi_rw_congested_group(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_rw_congested(bdi);
+}
+
+#endif /* CONFIG_GROUP_IOSCHED */
+
 void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
 void set_bdi_congested(struct backing_dev_info *bdi, int rw);
 long congestion_wait(int rw, long timeout);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 655440c..e8565b1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -897,6 +897,11 @@ static inline void blk_set_queue_congested(struct request_queue *q, int rw)
 	set_bdi_congested(&q->backing_dev_info, rw);
 }
 
+#ifdef CONFIG_GROUP_IOSCHED
+extern int blk_queue_io_group_congested(struct backing_dev_info *bdi,
+					int bdi_bits, struct page *page);
+#endif
+
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 493b468..cef038d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -7,6 +7,7 @@
 #include <linux/module.h>
 #include <linux/writeback.h>
 #include <linux/device.h>
+#include "../block/elevator-fq.h"
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -328,3 +329,64 @@ long congestion_wait(int rw, long timeout)
 }
 EXPORT_SYMBOL(congestion_wait);
 
+/*
+ * With group IO scheduling, request descriptors are allocated per io group
+ * per queue, so the generic notion of whether a queue is congested is not
+ * very accurate. The queue might not be congested while the io group that
+ * the request will go into actually is.
+ *
+ * Hence, to get an accurate picture of the congestion level, one should
+ * query the io group congestion status on the queue. Pass in the page,
+ * which is used to determine the io group the page belongs to, and the
+ * congestion status is determined accordingly.
+ *
+ * If no page is passed, the io group is determined from the current task
+ * context.
+ */
+#ifdef CONFIG_GROUP_IOSCHED
+int bdi_congested_group(struct backing_dev_info *bdi, int bdi_bits,
+				struct page *page)
+{
+	if (bdi->congested_fn)
+		return bdi->congested_fn(bdi->congested_data, bdi_bits, page, 1);
+
+	return blk_queue_io_group_congested(bdi, bdi_bits, page);
+}
+EXPORT_SYMBOL(bdi_congested_group);
+
+int bdi_read_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, 1 << BDI_sync_congested, page);
+}
+EXPORT_SYMBOL(bdi_read_congested_group);
+
+/* Checks if either bdi or associated group is read congested */
+int bdi_or_group_read_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_read_congested(bdi) || bdi_read_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_read_congested);
+
+int bdi_write_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, 1 << BDI_async_congested, page);
+}
+EXPORT_SYMBOL(bdi_write_congested_group);
+
+/* Checks if either bdi or associated group is write congested */
+int bdi_or_group_write_congested(struct backing_dev_info *bdi,
+						struct page *page)
+{
+	return bdi_write_congested(bdi) || bdi_write_congested_group(bdi, page);
+}
+EXPORT_SYMBOL(bdi_or_group_write_congested);
+
+int bdi_rw_congested_group(struct backing_dev_info *bdi, struct page *page)
+{
+	return bdi_congested_group(bdi, (1 << BDI_sync_congested) |
+				  (1 << BDI_async_congested), page);
+}
+EXPORT_SYMBOL(bdi_rw_congested_group);
+
+#endif /* CONFIG_GROUP_IOSCHED */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 31d3675..5ad9453 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -982,6 +982,17 @@ retry:
 		if (nr_pages == 0)
 			break;
 
+		/*
+		 * If the io group the page will go into is congested, bail out.
+		 */
+		if (wbc->nonblocking
+		    && bdi_write_congested_group(bdi, pvec.pages[0])) {
+			wbc->encountered_congestion = 1;
+			done = 1;
+			pagevec_release(&pvec);
+			break;
+		}
+
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];
 
diff --git a/mm/readahead.c b/mm/readahead.c
index aa1aa23..22e0639 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -542,7 +542,7 @@ page_cache_async_readahead(struct address_space *mapping,
 	/*
 	 * Defer asynchronous read-ahead on IO congestion.
 	 */
-	if (bdi_read_congested(mapping->backing_dev_info))
+	if (bdi_or_group_read_congested(mapping->backing_dev_info, NULL))
 		return;
 
 	/* do read-ahead */
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 23/25] io-controller: Support per cgroup per device weights and io class
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (21 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 22/25] io-controller: Per io group bdi congestion interface Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 24/25] io-controller: Debug hierarchical IO scheduling Vivek Goyal
                     ` (4 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this
file to configure weight and ioprio_class for each device in a given cgroup.
The original "weight" and "ioprio_class" files are still available. If a
device has no entry in "policy", the cgroup's "weight" and "ioprio_class"
values are used as the defaults for that device.
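
As a rough illustration of the fallback described above, here is a small
userspace sketch. It is not the kernel code from this patch: struct
policy_node, struct cgroup_cfg, policy_search(), effective_weight() and
effective_class() are hypothetical stand-ins for io_policy_node, io_cgroup
and policy_search_node(), and locking is left out entirely.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for one entry of the "policy" file. */
struct policy_node {
	unsigned int major, minor;	/* device number */
	unsigned int weight;
	unsigned short ioprio_class;
	struct policy_node *next;	/* singly linked for brevity */
};

/* Hypothetical stand-in for the per-cgroup settings. */
struct cgroup_cfg {
	unsigned int weight;		/* value of the "weight" file */
	unsigned short ioprio_class;	/* value of the "ioprio_class" file */
	struct policy_node *policies;	/* per-device overrides, may be NULL */
};

/* Return the per-device rule, or NULL if the device has none. */
static const struct policy_node *
policy_search(const struct cgroup_cfg *cg, unsigned int major,
	      unsigned int minor)
{
	const struct policy_node *pn;

	for (pn = cg->policies; pn; pn = pn->next)
		if (pn->major == major && pn->minor == minor)
			return pn;
	return NULL;
}

/* A per-device weight wins; otherwise fall back to the cgroup default. */
static unsigned int effective_weight(const struct cgroup_cfg *cg,
				     unsigned int major, unsigned int minor)
{
	const struct policy_node *pn = policy_search(cg, major, minor);

	return pn ? pn->weight : cg->weight;
}

/* Same fallback rule for the io class. */
static unsigned short effective_class(const struct cgroup_cfg *cg,
				      unsigned int major, unsigned int minor)
{
	const struct policy_node *pn = policy_search(cg, major, minor);

	return pn ? pn->ioprio_class : cg->ioprio_class;
}
```

With a cgroup default of weight 500 and a single rule "8:16 300 2", a
lookup for 8:16 yields weight 300, while a lookup for any other device
yields the default of 500.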

Write to the new interface in the following format:
# echo "dev_major:dev_minor weight ioprio_class" > /path/to/cgroup/io.policy
Writing weight=0 removes the policy for that device.

Examples:
Configure weight=300 ioprio_class=2 on /dev/sdb (8:16) in this cgroup
# echo "8:16 300 2" > io.policy
# cat io.policy
dev	weight	class
8:16	300	2

Configure weight=500 ioprio_class=1 on /dev/sda (8:0) in this cgroup
# echo "8:0 500 1" > io.policy
# cat io.policy
dev	weight	class
8:0	500	1
8:16	300	2

Remove the policy for /dev/sda in this cgroup
# echo 8:0 0 1 > io.policy
# cat io.policy
dev	weight	class
8:16	300	2
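
The input format above can be sketched in userspace C as follows. This is
only an approximation of the checks in policy_parse_and_set(): it uses
sscanf() rather than strsep()/strict_strtoul(), it does not verify that the
device exists, and parse_policy(), struct policy_rule and the SKETCH_*
constants are made-up names. In particular, SKETCH_WEIGHT_MAX is an assumed
placeholder, not the real WEIGHT_MAX.

```c
#include <stdio.h>

#define SKETCH_WEIGHT_MAX	1000u	/* assumed upper bound, see above */
#define SKETCH_CLASS_RT		1u	/* mirrors IOPRIO_CLASS_RT */
#define SKETCH_CLASS_IDLE	3u	/* mirrors IOPRIO_CLASS_IDLE */

/* Hypothetical parsed form of one "major:minor weight class" line. */
struct policy_rule {
	unsigned int major, minor;
	unsigned int weight;		/* 0 requests removal of the rule */
	unsigned int ioprio_class;
};

/*
 * Parse "major:minor weight class". Returns 0 and fills *out on success,
 * -1 if a field is missing, an extra field is present, or a value is out
 * of range. *out is left untouched on failure.
 */
static int parse_policy(const char *line, struct policy_rule *out)
{
	struct policy_rule r;
	char extra;
	int n;

	/* The trailing " %c" only matches if a fourth token follows. */
	n = sscanf(line, "%u:%u %u %u %c", &r.major, &r.minor,
		   &r.weight, &r.ioprio_class, &extra);
	if (n != 4)
		return -1;

	if (r.weight > SKETCH_WEIGHT_MAX)
		return -1;

	if (r.ioprio_class < SKETCH_CLASS_RT ||
	    r.ioprio_class > SKETCH_CLASS_IDLE)
		return -1;

	*out = r;
	return 0;
}
```

A caller would treat a successfully parsed rule with weight 0 as a request
to delete the entry for that device, and any other weight as an insert or
update, matching the three examples above.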

Changelog (v1 -> v2)
- Rename some structures
- Use spin_lock_irqsave() and spin_unlock_irqrestore() to avoid enabling
  interrupts unconditionally.
- Fix policy setup bug when switching to another io scheduler.
- If a policy is available for a specific device, don't update weight and
  io class when writing "weight" and "ioprio_class".
- Fix a bug when parsing policy string.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |  266 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h |   10 ++
 2 files changed, 272 insertions(+), 4 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 2a2b68d..31b066d 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -17,6 +17,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/seq_file.h>
 #include <linux/biotrack.h>
+#include <linux/genhd.h>
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -1053,12 +1054,31 @@ static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
 	entity->sched_data = &iog->sched_data;
 }
 
-static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+						 dev_t dev);
+
+static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
+			  dev_t dev)
 {
 	struct io_entity *entity = &iog->entity;
+	struct io_policy_node *pn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iocg->lock, flags);
+	pn = policy_search_node(iocg, dev);
+	if (pn) {
+		entity->weight = pn->weight;
+		entity->new_weight = pn->weight;
+		entity->ioprio_class = pn->ioprio_class;
+		entity->new_ioprio_class = pn->ioprio_class;
+	} else {
+		entity->weight = iocg->weight;
+		entity->new_weight = iocg->weight;
+		entity->ioprio_class = iocg->ioprio_class;
+		entity->new_ioprio_class = iocg->ioprio_class;
+	}
+	spin_unlock_irqrestore(&iocg->lock, flags);
 
-	entity->weight = entity->new_weight = iocg->weight;
-	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
 	entity->ioprio_changed = 1;
 	entity->my_sched_data = &iog->sched_data;
 }
@@ -1174,6 +1194,227 @@ io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
 	return NULL;
 }
 
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+				  struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *pn;
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+
+	if (list_empty(&iocg->policy_list))
+		goto out;
+
+	seq_printf(m, "dev\tweight\tclass\n");
+
+	spin_lock_irq(&iocg->lock);
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		seq_printf(m, "%u:%u\t%u\t%hu\n", MAJOR(pn->dev),
+			   MINOR(pn->dev), pn->weight, pn->ioprio_class);
+	}
+	spin_unlock_irq(&iocg->lock);
+out:
+	return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+					  struct io_policy_node *pn)
+{
+	list_add(&pn->node, &iocg->policy_list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct io_policy_node *pn)
+{
+	list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+						 dev_t dev)
+{
+	struct io_policy_node *pn;
+
+	if (list_empty(&iocg->policy_list))
+		return NULL;
+
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		if (pn->dev == dev)
+			return pn;
+	}
+
+	return NULL;
+}
+
+static int check_dev_num(dev_t dev)
+{
+	int part = 0;
+	struct gendisk *disk;
+
+	disk = get_gendisk(dev, &part);
+	if (!disk || part)
+		return -ENODEV;
+
+	return 0;
+}
+
+static int policy_parse_and_set(char *buf, struct io_policy_node *newpn)
+{
+	char *s[4], *p, *major_s = NULL, *minor_s = NULL;
+	int ret;
+	unsigned long major, minor, temp;
+	int i = 0;
+	dev_t dev;
+
+	memset(s, 0, sizeof(s));
+	while ((p = strsep(&buf, " ")) != NULL) {
+		if (!*p)
+			continue;
+		s[i++] = p;
+
+		/* Reject input with too many fields */
+		if (i == 4)
+			break;
+	}
+
+	if (i != 3)
+		return -EINVAL;
+
+	p = strsep(&s[0], ":");
+	if (p != NULL)
+		major_s = p;
+	else
+		return -EINVAL;
+
+	minor_s = s[0];
+	if (!minor_s)
+		return -EINVAL;
+
+	ret = strict_strtoul(major_s, 10, &major);
+	if (ret)
+		return -EINVAL;
+
+	ret = strict_strtoul(minor_s, 10, &minor);
+	if (ret)
+		return -EINVAL;
+
+	dev = MKDEV(major, minor);
+
+	ret = check_dev_num(dev);
+	if (ret)
+		return ret;
+
+	newpn->dev = dev;
+
+	if (s[1] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[1], 10, &temp);
+	if (ret || temp > WEIGHT_MAX)
+		return -EINVAL;
+
+	newpn->weight =  temp;
+
+	if (s[2] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[2], 10, &temp);
+	if (ret || temp < IOPRIO_CLASS_RT || temp > IOPRIO_CLASS_IDLE)
+		return -EINVAL;
+	newpn->ioprio_class = temp;
+
+	return 0;
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+			    const char *buffer)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *newpn, *pn;
+	char *buf;
+	int ret = 0;
+	int keep_newpn = 0;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	buf = kstrdup(buffer, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+	if (!newpn) {
+		ret = -ENOMEM;
+		goto free_buf;
+	}
+
+	ret = policy_parse_and_set(buf, newpn);
+	if (ret)
+		goto free_newpn;
+
+	if (!cgroup_lock_live_group(cgrp)) {
+		ret = -ENODEV;
+		goto free_newpn;
+	}
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+	spin_lock_irq(&iocg->lock);
+
+	pn = policy_search_node(iocg, newpn->dev);
+	if (!pn) {
+		if (newpn->weight != 0) {
+			policy_insert_node(iocg, newpn);
+			keep_newpn = 1;
+		}
+		goto update_io_group;
+	}
+
+	if (newpn->weight == 0) {
+		/* weight == 0 means deleting a policy */
+		policy_delete_node(pn);
+		goto update_io_group;
+	}
+
+	pn->weight = newpn->weight;
+	pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+		if (iog->dev == newpn->dev) {
+			if (newpn->weight) {
+				iog->entity.new_weight = newpn->weight;
+				iog->entity.new_ioprio_class =
+					newpn->ioprio_class;
+				/*
+				 * iog weight and ioprio_class updating
+				 * actually happens if ioprio_changed is set.
+				 * So ensure ioprio_changed is not set until
+				 * new weight and new ioprio_class are updated.
+				 */
+				smp_wmb();
+				iog->entity.ioprio_changed = 1;
+			} else {
+				iog->entity.new_weight = iocg->weight;
+				iog->entity.new_ioprio_class =
+					iocg->ioprio_class;
+
+				/* The same as above */
+				smp_wmb();
+				iog->entity.ioprio_changed = 1;
+			}
+		}
+	}
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+free_newpn:
+	if (!keep_newpn)
+		kfree(newpn);
+free_buf:
+	kfree(buf);
+	return ret;
+}
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -1206,6 +1447,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	struct io_cgroup *iocg;					\
 	struct io_group *iog;						\
 	struct hlist_node *n;						\
+	struct io_policy_node *pn;					\
 									\
 	if (val < (__MIN) || val > (__MAX))				\
 		return -EINVAL;						\
@@ -1218,6 +1460,9 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	spin_lock_irq(&iocg->lock);					\
 	iocg->__VAR = (unsigned long)val;				\
 	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		pn = policy_search_node(iocg, iog->dev);		\
+		if (pn)							\
+			continue;					\
 		iog->entity.new_##__VAR = (unsigned long)val;		\
 		smp_wmb();						\
 		iog->entity.ioprio_changed = 1;				\
@@ -1295,6 +1540,12 @@ static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
 
 struct cftype bfqio_files[] = {
 	{
+		.name = "policy",
+		.read_seq_string = io_cgroup_policy_read,
+		.write_string = io_cgroup_policy_write,
+		.max_write_len = 256,
+	},
+	{
 		.name = "weight",
 		.read_u64 = io_cgroup_weight_read,
 		.write_u64 = io_cgroup_weight_write,
@@ -1336,6 +1587,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
 	INIT_HLIST_HEAD(&iocg->group_data);
 	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
 	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+	INIT_LIST_HEAD(&iocg->policy_list);
 
 	return &iocg->css;
 }
@@ -1438,7 +1690,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
 		iog->dev = MKDEV(major, minor);
 
-		io_group_init_entity(iocg, iog);
+		io_group_init_entity(iocg, iog, iog->dev);
 		iog->my_entity = &iog->entity;
 
 		atomic_set(&iog->ref, 0);
@@ -1904,6 +2156,7 @@ static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 	struct io_group *iog;
 	struct elv_fq_data *efqd;
 	unsigned long uninitialized_var(flags);
+	struct io_policy_node *pn, *pntmp;
 
 	/*
 	 * io groups are linked in two lists. One list is maintained
@@ -1943,6 +2196,11 @@ remove_entry:
 	goto remove_entry;
 
 done:
+	list_for_each_entry_safe(pn, pntmp, &iocg->policy_list, node) {
+		policy_delete_node(pn);
+		kfree(pn);
+	}
+
 	free_css_id(&io_subsys, &iocg->css);
 	rcu_read_unlock();
 	BUG_ON(!hlist_empty(&iocg->group_data));
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 214fb61..58c650b 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -267,6 +267,13 @@ struct io_group {
 	struct request_list rl;
 };
 
+struct io_policy_node {
+	struct list_head node;
+	dev_t dev;
+	unsigned int weight;
+	unsigned short ioprio_class;
+};
+
 /**
  * struct io_cgroup - io cgroup data structure.
  * @css: subsystem state for io in the containing cgroup.
@@ -284,6 +291,9 @@ struct io_cgroup {
 	unsigned int weight;
 	unsigned short ioprio_class;
 
+	/* list of io_policy_node */
+	struct list_head policy_list;
+
 	spinlock_t lock;
 	struct hlist_head group_data;
 };
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* [PATCH 23/25] io-controller: Support per cgroup per device weights and io class
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

This patch enables per-cgroup per-device weight and ioprio_class handling.
A new cgroup interface "policy" is introduced. You can make use of this
file to configure weight and ioprio_class for each device in a given cgroup.
The original "weight" and "ioprio_class" files are still available. If a
device has no entry in "policy", the cgroup's "weight" and "ioprio_class"
values are used as the defaults for that device.

Write to the new interface in the following format:
# echo "dev_major:dev_minor weight ioprio_class" > /path/to/cgroup/io.policy
Writing weight=0 removes the policy for that device.

Examples:
Configure weight=300 ioprio_class=2 on /dev/sdb (8:16) in this cgroup
# echo "8:16 300 2" > io.policy
# cat io.policy
dev	weight	class
8:16	300	2

Configure weight=500 ioprio_class=1 on /dev/sda (8:0) in this cgroup
# echo "8:0 500 1" > io.policy
# cat io.policy
dev	weight	class
8:0	500	1
8:16	300	2

Remove the policy for /dev/sda in this cgroup
# echo 8:0 0 1 > io.policy
# cat io.policy
dev	weight	class
8:16	300	2

Changelog (v1 -> v2)
- Rename some structures
- Use spin_lock_irqsave() and spin_unlock_irqrestore() to avoid enabling
  interrupts unconditionally.
- Fix policy setup bug when switching to another io scheduler.
- If a policy is available for a specific device, don't update weight and
  io class when writing "weight" and "ioprio_class".
- Fix a bug when parsing policy string.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/elevator-fq.c |  266 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h |   10 ++
 2 files changed, 272 insertions(+), 4 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 2a2b68d..31b066d 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -17,6 +17,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/seq_file.h>
 #include <linux/biotrack.h>
+#include <linux/genhd.h>
 
 /* Values taken from cfq */
 const int elv_slice_sync = HZ / 10;
@@ -1053,12 +1054,31 @@ static void bfq_init_entity(struct io_entity *entity, struct io_group *iog)
 	entity->sched_data = &iog->sched_data;
 }
 
-static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog)
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+						 dev_t dev);
+
+static void io_group_init_entity(struct io_cgroup *iocg, struct io_group *iog,
+			  dev_t dev)
 {
 	struct io_entity *entity = &iog->entity;
+	struct io_policy_node *pn;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iocg->lock, flags);
+	pn = policy_search_node(iocg, dev);
+	if (pn) {
+		entity->weight = pn->weight;
+		entity->new_weight = pn->weight;
+		entity->ioprio_class = pn->ioprio_class;
+		entity->new_ioprio_class = pn->ioprio_class;
+	} else {
+		entity->weight = iocg->weight;
+		entity->new_weight = iocg->weight;
+		entity->ioprio_class = iocg->ioprio_class;
+		entity->new_ioprio_class = iocg->ioprio_class;
+	}
+	spin_unlock_irqrestore(&iocg->lock, flags);
 
-	entity->weight = entity->new_weight = iocg->weight;
-	entity->ioprio_class = entity->new_ioprio_class = iocg->ioprio_class;
 	entity->ioprio_changed = 1;
 	entity->my_sched_data = &iog->sched_data;
 }
@@ -1174,6 +1194,227 @@ io_cgroup_lookup_group(struct io_cgroup *iocg, void *key)
 	return NULL;
 }
 
+static int io_cgroup_policy_read(struct cgroup *cgrp, struct cftype *cft,
+				  struct seq_file *m)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *pn;
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+
+	if (list_empty(&iocg->policy_list))
+		goto out;
+
+	seq_printf(m, "dev\tweight\tclass\n");
+
+	spin_lock_irq(&iocg->lock);
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		seq_printf(m, "%u:%u\t%u\t%hu\n", MAJOR(pn->dev),
+			   MINOR(pn->dev), pn->weight, pn->ioprio_class);
+	}
+	spin_unlock_irq(&iocg->lock);
+out:
+	return 0;
+}
+
+static inline void policy_insert_node(struct io_cgroup *iocg,
+					  struct io_policy_node *pn)
+{
+	list_add(&pn->node, &iocg->policy_list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy_delete_node(struct io_policy_node *pn)
+{
+	list_del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct io_policy_node *policy_search_node(const struct io_cgroup *iocg,
+						 dev_t dev)
+{
+	struct io_policy_node *pn;
+
+	if (list_empty(&iocg->policy_list))
+		return NULL;
+
+	list_for_each_entry(pn, &iocg->policy_list, node) {
+		if (pn->dev == dev)
+			return pn;
+	}
+
+	return NULL;
+}
+
+static int check_dev_num(dev_t dev)
+{
+	int part = 0;
+	struct gendisk *disk;
+
+	disk = get_gendisk(dev, &part);
+	if (!disk || part)
+		return -ENODEV;
+
+	return 0;
+}
+
+static int policy_parse_and_set(char *buf, struct io_policy_node *newpn)
+{
+	char *s[4], *p, *major_s = NULL, *minor_s = NULL;
+	int ret;
+	unsigned long major, minor, temp;
+	int i = 0;
+	dev_t dev;
+
+	memset(s, 0, sizeof(s));
+	while ((p = strsep(&buf, " ")) != NULL) {
+		if (!*p)
+			continue;
+		s[i++] = p;
+
+		/* Stop once too many fields have been supplied */
+		if (i == 4)
+			break;
+	}
+
+	if (i != 3)
+		return -EINVAL;
+
+	p = strsep(&s[0], ":");
+	if (p != NULL)
+		major_s = p;
+	else
+		return -EINVAL;
+
+	minor_s = s[0];
+	if (!minor_s)
+		return -EINVAL;
+
+	ret = strict_strtoul(major_s, 10, &major);
+	if (ret)
+		return -EINVAL;
+
+	ret = strict_strtoul(minor_s, 10, &minor);
+	if (ret)
+		return -EINVAL;
+
+	dev = MKDEV(major, minor);
+
+	ret = check_dev_num(dev);
+	if (ret)
+		return ret;
+
+	newpn->dev = dev;
+
+	if (s[1] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[1], 10, &temp);
+	if (ret || temp > WEIGHT_MAX)
+		return -EINVAL;
+
+	newpn->weight =  temp;
+
+	if (s[2] == NULL)
+		return -EINVAL;
+
+	ret = strict_strtoul(s[2], 10, &temp);
+	if (ret || temp < IOPRIO_CLASS_RT || temp > IOPRIO_CLASS_IDLE)
+		return -EINVAL;
+	newpn->ioprio_class = temp;
+
+	return 0;
+}
+
+static int io_cgroup_policy_write(struct cgroup *cgrp, struct cftype *cft,
+			    const char *buffer)
+{
+	struct io_cgroup *iocg;
+	struct io_policy_node *newpn, *pn;
+	char *buf;
+	int ret = 0;
+	int keep_newpn = 0;
+	struct hlist_node *n;
+	struct io_group *iog;
+
+	buf = kstrdup(buffer, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	newpn = kzalloc(sizeof(*newpn), GFP_KERNEL);
+	if (!newpn) {
+		ret = -ENOMEM;
+		goto free_buf;
+	}
+
+	ret = policy_parse_and_set(buf, newpn);
+	if (ret)
+		goto free_newpn;
+
+	if (!cgroup_lock_live_group(cgrp)) {
+		ret = -ENODEV;
+		goto free_newpn;
+	}
+
+	iocg = cgroup_to_io_cgroup(cgrp);
+	spin_lock_irq(&iocg->lock);
+
+	pn = policy_search_node(iocg, newpn->dev);
+	if (!pn) {
+		if (newpn->weight != 0) {
+			policy_insert_node(iocg, newpn);
+			keep_newpn = 1;
+		}
+		goto update_io_group;
+	}
+
+	if (newpn->weight == 0) {
+		/* weight == 0 means deleting a policy */
+		policy_delete_node(pn);
+		goto update_io_group;
+	}
+
+	pn->weight = newpn->weight;
+	pn->ioprio_class = newpn->ioprio_class;
+
+update_io_group:
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+		if (iog->dev == newpn->dev) {
+			if (newpn->weight) {
+				iog->entity.new_weight = newpn->weight;
+				iog->entity.new_ioprio_class =
+					newpn->ioprio_class;
+				/*
+				 * iog weight and ioprio_class updating
+				 * actually happens if ioprio_changed is set.
+				 * So ensure ioprio_changed is not set until
+				 * new weight and new ioprio_class are updated.
+				 */
+				smp_wmb();
+				iog->entity.ioprio_changed = 1;
+			} else {
+				iog->entity.new_weight = iocg->weight;
+				iog->entity.new_ioprio_class =
+					iocg->ioprio_class;
+
+				/* The same as above */
+				smp_wmb();
+				iog->entity.ioprio_changed = 1;
+			}
+		}
+	}
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+free_newpn:
+	if (!keep_newpn)
+		kfree(newpn);
+free_buf:
+	kfree(buf);
+	return ret;
+}
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -1206,6 +1447,7 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	struct io_cgroup *iocg;					\
 	struct io_group *iog;						\
 	struct hlist_node *n;						\
+	struct io_policy_node *pn;					\
 									\
 	if (val < (__MIN) || val > (__MAX))				\
 		return -EINVAL;						\
@@ -1218,6 +1460,9 @@ static int io_cgroup_##__VAR##_write(struct cgroup *cgroup,		\
 	spin_lock_irq(&iocg->lock);					\
 	iocg->__VAR = (unsigned long)val;				\
 	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {	\
+		pn = policy_search_node(iocg, iog->dev);		\
+		if (pn)							\
+			continue;					\
 		iog->entity.new_##__VAR = (unsigned long)val;		\
 		smp_wmb();						\
 		iog->entity.ioprio_changed = 1;				\
@@ -1295,6 +1540,12 @@ static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
 
 struct cftype bfqio_files[] = {
 	{
+		.name = "policy",
+		.read_seq_string = io_cgroup_policy_read,
+		.write_string = io_cgroup_policy_write,
+		.max_write_len = 256,
+	},
+	{
 		.name = "weight",
 		.read_u64 = io_cgroup_weight_read,
 		.write_u64 = io_cgroup_weight_write,
@@ -1336,6 +1587,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
 	INIT_HLIST_HEAD(&iocg->group_data);
 	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
 	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
+	INIT_LIST_HEAD(&iocg->policy_list);
 
 	return &iocg->css;
 }
@@ -1438,7 +1690,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
 		iog->dev = MKDEV(major, minor);
 
-		io_group_init_entity(iocg, iog);
+		io_group_init_entity(iocg, iog, iog->dev);
 		iog->my_entity = &iog->entity;
 
 		atomic_set(&iog->ref, 0);
@@ -1904,6 +2156,7 @@ static void iocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 	struct io_group *iog;
 	struct elv_fq_data *efqd;
 	unsigned long uninitialized_var(flags);
+	struct io_policy_node *pn, *pntmp;
 
 	/*
 	 * io groups are linked in two lists. One list is maintained
@@ -1943,6 +2196,11 @@ remove_entry:
 	goto remove_entry;
 
 done:
+	list_for_each_entry_safe(pn, pntmp, &iocg->policy_list, node) {
+		policy_delete_node(pn);
+		kfree(pn);
+	}
+
 	free_css_id(&io_subsys, &iocg->css);
 	rcu_read_unlock();
 	BUG_ON(!hlist_empty(&iocg->group_data));
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 214fb61..58c650b 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -267,6 +267,13 @@ struct io_group {
 	struct request_list rl;
 };
 
+struct io_policy_node {
+	struct list_head node;
+	dev_t dev;
+	unsigned int weight;
+	unsigned short ioprio_class;
+};
+
 /**
  * struct io_cgroup - io cgroup data structure.
  * @css: subsystem state for io in the containing cgroup.
@@ -284,6 +291,9 @@ struct io_cgroup {
 	unsigned int weight;
 	unsigned short ioprio_class;
 
+	/* list of io_policy_node */
+	struct list_head policy_list;
+
 	spinlock_t lock;
 	struct hlist_head group_data;
 };
-- 
1.6.0.6

* [PATCH 24/25] io-controller: Debug hierarchical IO scheduling
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (22 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 23/25] io-controller: Support per cgroup per device weights and io class Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-02 20:01   ` [PATCH 25/25] io-controller: experimental debug patch for async queue wait before expiry Vivek Goyal
                     ` (3 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o Little debugging aid for hierarchical IO scheduling.

o Enabled under CONFIG_DEBUG_GROUP_IOSCHED

o Currently it adds more debug messages to the blktrace output, which helps
  a great deal when debugging a hierarchical setup. It also creates the
  additional cgroup interfaces io.disk_queue and io.disk_dequeue, which
  output some more debugging data.
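
The effect of the debug option on the log messages, prefixing each message
with the cgroup path of the group it belongs to, can be sketched in
userspace C as below. This is an illustration under stated assumptions, not
the patch's macros: log_with_path(), log_plain(), struct group and
LOG_BUF_LEN are hypothetical names, snprintf() into a buffer stands in for
blk_add_trace_msg(), and the GNU "##__VA_ARGS__" extension is used (as the
kernel macros use the GNU "args..." form).

```c
#include <stdio.h>

#define LOG_BUF_LEN 256

/* Hypothetical stand-in for io_group; only the stored path matters here. */
struct group {
	char path[128];
};

/* Debug variant: prepend the group's cgroup path to the message. */
#define log_with_path(buf, grp, fmt, ...) \
	snprintf((buf), LOG_BUF_LEN, "elv %s " fmt, (grp)->path, ##__VA_ARGS__)

/* Non-debug variant: same call signature, but the group is ignored. */
#define log_plain(buf, grp, fmt, ...) \
	snprintf((buf), LOG_BUF_LEN, "elv " fmt, ##__VA_ARGS__)
```

In the patch the choice between the two forms is made at compile time by
CONFIG_DEBUG_GROUP_IOSCHED, so call sites pass the group unconditionally and
do not need their own #ifdef.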

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/Kconfig.iosched |   10 ++-
 block/as-iosched.c    |   50 ++++++---
 block/elevator-fq.c   |  280 ++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h   |   36 +++++++
 4 files changed, 354 insertions(+), 22 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 0677099..79f188c 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -140,6 +140,14 @@ config TRACK_ASYNC_CONTEXT
 	  request, original owner of the bio is decided by using io tracking
 	  patches otherwise we continue to attribute the request to the
 	  submitting thread.
-endmenu
 
+config DEBUG_GROUP_IOSCHED
+	bool "Debug Hierarchical Scheduling support"
+	depends on CGROUPS && GROUP_IOSCHED
+	default n
+	---help---
+	  Enable some debugging hooks for hierarchical scheduling support.
+	  Currently it just outputs more information in blktrace output.
+
+endmenu
 endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 213f3e3..9ad96ee 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -78,6 +78,7 @@ enum anticipation_status {
 };
 
 struct as_queue {
+	struct io_queue *ioq;
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -162,6 +163,17 @@ enum arq_state {
 #define RQ_STATE(rq)	((enum arq_state)(rq)->elevator_private2)
 #define RQ_SET_STATE(rq, state)	((rq)->elevator_private2 = (void *) state)
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+#define as_log_asq(ad, asq, fmt, args...)				\
+{									\
+	blk_add_trace_msg((ad)->q, "as %s " fmt,			\
+			ioq_to_io_group((asq)->ioq)->path, ##args);	\
+}
+#else
+#define as_log_asq(ad, asq, fmt, args...) \
+	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
+#endif
+
 #define as_log(ad, fmt, args...)        \
 	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
 
@@ -225,7 +237,7 @@ static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
 	}
 
 out:
-	as_log(ad, "save batch: dir=%c time_left=%d changed_batch=%d"
+	as_log_asq(ad, asq, "save batch: dir=%c time_left=%d changed_batch=%d"
 			" new_batch=%d, antic_status=%d",
 			ad->batch_data_dir ? 'R' : 'W',
 			asq->current_batch_time_left,
@@ -247,8 +259,8 @@ static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
 						asq->current_batch_time_left;
 	/* restore asq batch_data_dir info */
 	ad->batch_data_dir = asq->saved_batch_data_dir;
-	as_log(ad, "restore batch: dir=%c time=%d reads_q=%d writes_q=%d"
-			" ad->antic_status=%d",
+	as_log_asq(ad, asq, "restore batch: dir=%c time=%d reads_q=%d"
+			" writes_q=%d ad->antic_status=%d",
 			ad->batch_data_dir ? 'R' : 'W',
 			asq->current_batch_time_left,
 			asq->nr_queued[1], asq->nr_queued[0],
@@ -277,8 +289,8 @@ static int as_expire_ioq(struct request_queue *q, void *sched_queue,
 	int status = ad->antic_status;
 	struct as_queue *asq = sched_queue;
 
-	as_log(ad, "as_expire_ioq slice_expired=%d, force=%d", slice_expired,
-		force);
+	as_log_asq(ad, asq, "as_expire_ioq slice_expired=%d, force=%d",
+			slice_expired, force);
 
 	/* Forced expiry. We don't have a choice */
 	if (force) {
@@ -1021,9 +1033,10 @@ static void update_write_batch(struct as_data *ad, struct request *rq)
 	if (write_time < 0)
 		write_time = 0;
 
-	as_log(ad, "upd write: write_time=%d batch=%d write_batch_idled=%d"
-			" current_write_count=%d", write_time, batch,
-			asq->write_batch_idled, asq->current_write_count);
+	as_log_asq(ad, asq, "upd write: write_time=%d batch=%d"
+			" write_batch_idled=%d current_write_count=%d",
+			write_time, batch, asq->write_batch_idled,
+			asq->current_write_count);
 
 	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
@@ -1040,7 +1053,7 @@ static void update_write_batch(struct as_data *ad, struct request *rq)
 	if (asq->write_batch_count < 1)
 		asq->write_batch_count = 1;
 
-	as_log(ad, "upd write count=%d", asq->write_batch_count);
+	as_log_asq(ad, asq, "upd write count=%d", asq->write_batch_count);
 }
 
 /*
@@ -1059,7 +1072,7 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 		goto out;
 	}
 
-	as_log(ad, "complete: reads_q=%d writes_q=%d changed_batch=%d"
+	as_log_asq(ad, asq, "complete: reads_q=%d writes_q=%d changed_batch=%d"
 		" new_batch=%d switch_queue=%d, dir=%c",
 		asq->nr_queued[1], asq->nr_queued[0], ad->changed_batch,
 		ad->new_batch, ad->switch_queue,
@@ -1253,7 +1266,7 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 	if (RQ_IOC(rq) && RQ_IOC(rq)->aic)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 	ad->nr_dispatched++;
-	as_log(ad, "dispatch req dir=%c nr_dispatched = %d",
+	as_log_asq(ad, asq, "dispatch req dir=%c nr_dispatched = %d",
 			data_dir ? 'R' : 'W', ad->nr_dispatched);
 }
 
@@ -1302,7 +1315,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		}
 		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
-		as_log(ad, "forced dispatch");
+		as_log_asq(ad, asq, "forced dispatch");
 		return dispatched;
 	}
 
@@ -1316,7 +1329,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->antic_status == ANTIC_WAIT_REQ
 		|| ad->antic_status == ANTIC_WAIT_NEXT
 		|| ad->changed_batch) {
-		as_log(ad, "no dispatch. read_q=%d, writes_q=%d"
+		as_log_asq(ad, asq, "no dispatch. read_q=%d, writes_q=%d"
 			" ad->antic_status=%d, changed_batch=%d,"
 			" switch_queue=%d new_batch=%d", asq->nr_queued[1],
 			asq->nr_queued[0], ad->antic_status, ad->changed_batch,
@@ -1335,7 +1348,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
-				as_log(ad, "can_anticipate = 1");
+				as_log_asq(ad, asq, "can_anticipate = 1");
 				as_antic_waitreq(ad);
 				return 0;
 			}
@@ -1355,7 +1368,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 * data direction (read / write)
 	 */
 
-	as_log(ad, "select a fresh batch and request");
+	as_log_asq(ad, asq, "select a fresh batch and request");
 
 	if (reads) {
 		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
@@ -1371,7 +1384,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
-		as_log(ad, "new batch dir is sync");
+		as_log_asq(ad, asq, "new batch dir is sync");
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
 		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
@@ -1396,7 +1409,7 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
-		as_log(ad, "new batch dir is async");
+		as_log_asq(ad, asq, "new batch dir is async");
 		asq->current_write_count = asq->write_batch_count;
 		asq->write_batch_idled = 0;
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
@@ -1459,7 +1472,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	rq->elevator_private = as_get_io_context(q->node);
 
 	asq->nr_queued[data_dir]++;
-	as_log(ad, "add a %c request read_q=%d write_q=%d",
+	as_log_asq(ad, asq, "add a %c request read_q=%d write_q=%d",
 			data_dir ? 'R' : 'W', asq->nr_queued[1],
 			asq->nr_queued[0]);
 
@@ -1614,6 +1627,7 @@ static void *as_alloc_as_queue(struct request_queue *q,
 
 	if (asq->write_batch_count < 2)
 		asq->write_batch_count = 2;
+	asq->ioq = ioq;
 out:
 	return asq;
 }
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 31b066d..5b3f068 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -159,6 +159,119 @@ static void bfq_find_matching_entity(struct io_entity **entity,
 		*new_entity = parent_entity(*new_entity);
 	}
 }
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+	struct io_group *iog = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data != NULL)
+		iog = container_of(entity, struct io_group, entity);
+	return iog;
+}
+
+/* Returns parent group of io group */
+static inline struct io_group *iog_parent(struct io_group *iog)
+{
+	struct io_group *piog;
+
+	if (!iog->entity.sched_data)
+		return NULL;
+
+	/*
+	 * Not following entity->parent pointer as for top level groups
+	 * this pointer is NULL.
+	 */
+	piog = container_of(iog->entity.sched_data, struct io_group,
+					sched_data);
+	return piog;
+}
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static void io_group_path(struct io_group *iog, char *buf, int buflen)
+{
+	unsigned short id = iog->iocg_id;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+
+	if (!id)
+		goto out;
+
+	css = css_lookup(&io_subsys, id);
+	if (!css)
+		goto out;
+
+	if (!css_tryget(css))
+		goto out;
+
+	cgroup_path(css->cgroup, buf, buflen);
+
+	css_put(css);
+
+	rcu_read_unlock();
+	return;
+out:
+	rcu_read_unlock();
+	buf[0] = '\0';
+	return;
+}
+
+/*
+ * An entity has been freshly added to active tree. Either it came from
+ * idle tree or it was not on any of the trees. Do the accounting.
+ */
+static inline void bfq_account_for_entity_addition(struct io_entity *entity)
+{
+	struct io_group *iog = io_entity_to_iog(entity);
+
+	if (iog) {
+		struct elv_fq_data *efqd;
+
+		/*
+		 * Keep track of how many times a group has been added
+		 * to active tree.
+		 */
+		iog->queue++;
+		iog->queue_start = jiffies;
+
+		/* Log group addition event */
+		rcu_read_lock();
+		efqd = rcu_dereference(iog->key);
+		if (efqd)
+			elv_log_iog(efqd, iog, "add group weight=%u",
+					iog->entity.weight);
+		rcu_read_unlock();
+	}
+}
+
+/*
+ * An entity got removed from active tree and either went to idle tree or
+ * is not on any of the trees. Do the accounting.
+ */
+static inline void bfq_account_for_entity_deletion(struct io_entity *entity)
+{
+	struct io_group *iog = io_entity_to_iog(entity);
+
+	if (iog) {
+		struct elv_fq_data *efqd;
+
+		iog->dequeue++;
+		/* Keep a track of how long group was on active tree */
+		iog->queue_duration += jiffies_to_msecs(jiffies -
+						iog->queue_start);
+		iog->queue_start = 0;
+
+		/* Log group deletion event */
+		rcu_read_lock();
+		efqd = rcu_dereference(iog->key);
+		if (efqd)
+			elv_log_iog(efqd, iog, "del group weight=%u",
+					iog->entity.weight);
+		rcu_read_unlock();
+	}
+}
+#endif /* DEBUG_GROUP_IOSCHED */
 #else /* GROUP_IOSCHED */
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
@@ -186,6 +299,11 @@ static void bfq_find_matching_entity(struct io_entity **entity,
 					struct io_entity **new_entity)
 {
 }
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+	return NULL;
+}
 #endif /* GROUP_IOSCHED */
 
 static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
@@ -769,6 +887,7 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 {
 	struct io_sched_data *sd = entity->sched_data;
 	struct io_service_tree *st = io_entity_service_tree(entity);
+	int newly_added = 0;
 
 	if (entity == sd->active_entity) {
 		BUG_ON(entity->tree != NULL);
@@ -795,6 +914,7 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 		bfq_idle_remove(st, entity);
 		entity->start = bfq_gt(st->vtime, entity->finish) ?
 				       st->vtime : entity->finish;
+		newly_added = 1;
 	} else {
 		/*
 		 * The finish time of the entity may be invalid, and
@@ -807,6 +927,7 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 
 		BUG_ON(entity->on_st);
 		entity->on_st = 1;
+		newly_added = 1;
 	}
 
 	st = __bfq_entity_update_prio(st, entity);
@@ -844,6 +965,11 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 		bfq_calc_finish(entity, entity->budget);
 	}
 	bfq_active_insert(st, entity);
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	if (newly_added)
+		bfq_account_for_entity_addition(entity);
+#endif
 }
 
 /**
@@ -912,6 +1038,9 @@ static int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 	BUG_ON(sd->active_entity == entity);
 	BUG_ON(sd->next_active == entity);
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	bfq_account_for_entity_deletion(entity);
+#endif
 	return ret;
 }
 
@@ -1170,6 +1299,10 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
 	}
 
 	ret = elv_is_iog_congested(q, iog, sync);
+	if (ret)
+		elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
+			" rl.count[sync]=%d nr_group_requests=%d",
+			ret, sync, iog->rl.count[sync], q->nr_group_requests);
 	rcu_read_unlock();
 	return ret;
 }
@@ -1538,6 +1671,67 @@ static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
 	return 0;
 }
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static int io_cgroup_disk_queue_read(struct cgroup *cgroup,
+			struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg = NULL;
+	struct io_group *iog = NULL;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+	rcu_read_lock();
+	/* Loop through all the io groups and print statistics */
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgroup deletion.
+		 */
+		if (iog->key) {
+			seq_printf(m, "%u %u %lu %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev), iog->queue,
+					iog->queue_duration);
+		}
+	}
+	rcu_read_unlock();
+	cgroup_unlock();
+
+	return 0;
+}
+
+static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
+			struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg = NULL;
+	struct io_group *iog = NULL;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+	spin_lock_irq(&iocg->lock);
+	/* Loop through all the io groups and print statistics */
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgroup deletion.
+		 */
+		if (iog->key) {
+			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev), iog->dequeue);
+		}
+	}
+	spin_unlock_irq(&iocg->lock);
+	cgroup_unlock();
+
+	return 0;
+}
+#endif
+
 struct cftype bfqio_files[] = {
 	{
 		.name = "policy",
@@ -1563,6 +1757,16 @@ struct cftype bfqio_files[] = {
 		.name = "disk_sectors",
 		.read_seq_string = io_cgroup_disk_sectors_read,
 	},
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		.name = "disk_queue",
+		.read_seq_string = io_cgroup_disk_queue_read,
+	},
+	{
+		.name = "disk_dequeue",
+		.read_seq_string = io_cgroup_disk_dequeue_read,
+	},
+#endif
 };
 
 static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1707,6 +1911,11 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		blk_init_request_list(&iog->rl);
 		elv_io_group_congestion_threshold(q, iog);
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		io_group_path(iog, iog->path, sizeof(iog->path));
+#endif
+
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -2548,6 +2757,22 @@ EXPORT_SYMBOL(elv_get_slice_idle);
 static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
 {
 	entity_served(&ioq->entity, served, ioq->nr_sectors);
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		{
+			struct elv_fq_data *efqd = ioq->efqd;
+			struct io_group *iog = ioq_to_io_group(ioq);
+			elv_log_ioq(efqd, ioq, "ioq served: QSt=0x%lx QSs=0x%lx"
+				" QTt=0x%lx QTs=0x%lx GTt=0x%lx "
+				" GTs=0x%lx rq_queued=%d",
+				served, ioq->nr_sectors,
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+		}
+#endif
 }
 
 /* Tells whether ioq is queued in root group or not */
@@ -2926,10 +3151,29 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 	if (ioq) {
 		struct io_group *iog = ioq_to_io_group(ioq);
 		elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
-				" weight=%u group_weight=%u",
+				" weight=%u rq_queued=%d group_weight=%u",
 				efqd->busy_queues,
 				ioq->entity.ioprio, ioq->entity.weight,
-				iog_weight(iog));
+				ioq->nr_queued, iog_weight(iog));
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+			{
+				int nr_active = 0;
+				struct io_group *parent = NULL;
+
+				parent = iog_parent(iog);
+				if (parent)
+					nr_active = elv_iog_nr_active(parent);
+
+				elv_log_ioq(efqd, ioq, "set_active, ioq"
+				" nrgrps=%d QTt=0x%lx QTs=0x%lx GTt=0x%lx "
+				" GTs=0x%lx rq_queued=%d", nr_active,
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+			}
+#endif
 		ioq->slice_end = 0;
 
 		elv_clear_ioq_wait_request(ioq);
@@ -3010,6 +3254,21 @@ static void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
 		struct io_group *iog = ioq_to_io_group(ioq);
 		iog->busy_rt_queues++;
 	}
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "add to busy: QTt=0x%lx QTs=0x%lx"
+			" GTt=0x%lx GTs=0x%lx rq_queued=%d",
+			ioq->entity.total_service,
+			ioq->entity.total_sector_service,
+			iog->entity.total_service,
+			iog->entity.total_sector_service,
+			ioq->nr_queued);
+	}
+#else
+	elv_log_ioq(efqd, ioq, "add to busy");
+#endif
 }
 
 static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
@@ -3019,7 +3278,21 @@ static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
 
 	BUG_ON(!elv_ioq_busy(ioq));
 	BUG_ON(ioq->nr_queued);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "del from busy: QTt=0x%lx "
+			"QTs=0x%lx ioq GTt=0x%lx GTs=0x%lx "
+			"rq_queued=%d",
+			ioq->entity.total_service,
+			ioq->entity.total_sector_service,
+			iog->entity.total_service,
+			iog->entity.total_sector_service,
+			ioq->nr_queued);
+	}
+#else
 	elv_log_ioq(efqd, ioq, "del from busy");
+#endif
 	elv_clear_ioq_busy(ioq);
 	BUG_ON(efqd->busy_queues == 0);
 	efqd->busy_queues--;
@@ -3311,6 +3584,7 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 
 	elv_ioq_update_io_thinktime(ioq);
 	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+	elv_log_ioq(efqd, ioq, "add rq: rq_queued=%d", ioq->nr_queued);
 
 	if (ioq == elv_active_ioq(q->elevator)) {
 		/*
@@ -3531,7 +3805,7 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 	}
 
 	/* We are waiting for this queue to become busy before it expires.*/
-	if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
+	if (elv_ioq_wait_busy(ioq)) {
 		ioq = NULL;
 		goto keep_queue;
 	}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 58c650b..19ac8ca 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -265,6 +265,23 @@ struct io_group {
 
 	/* request list associated with the group */
 	struct request_list rl;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	/* How many times this group has been added to active tree */
+	unsigned long queue;
+
+	/* How long this group remained on active tree, in ms */
+	unsigned long queue_duration;
+
+	/* When was this group added to active tree */
+	unsigned long queue_start;
+
+	/* How many times this group has been removed from active tree */
+	unsigned long dequeue;
+
+	/* Store cgroup path */
+	char path[128];
+#endif
 };
 
 struct io_policy_node {
@@ -368,10 +385,29 @@ struct elv_fq_data {
 };
 
 /* Logging facilities. */
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+{								\
+	blk_add_trace_msg((efqd)->queue, "elv%d%c %s " fmt, (ioq)->pid,	\
+			elv_ioq_sync(ioq) ? 'S' : 'A', \
+			ioq_to_io_group(ioq)->path, ##args); \
+}
+
+#define elv_log_iog(efqd, iog, fmt, args...) \
+{                                                                      \
+	blk_add_trace_msg((efqd)->queue, "elv %s " fmt, (iog)->path, ##args); \
+}
+
+#else
 #define elv_log_ioq(efqd, ioq, fmt, args...) \
 	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
 				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
 
+#define elv_log_iog(efqd, iog, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#endif
+
 #define elv_log(efqd, fmt, args...) \
 	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
 
-- 
1.6.0.6

* [PATCH 24/25] io-controller: Debug hierarchical IO scheduling
  2009-07-02 20:01 ` Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron
  Cc: agk, snitzer, vgoyal, akpm, peterz

o Little debugging aid for hierarchical IO scheduling.

o Enabled under CONFIG_DEBUG_GROUP_IOSCHED

o Currently it adds more debug messages to the blktrace output, which helps
  a great deal when debugging a hierarchical setup. It also creates the
  additional cgroup interfaces io.disk_queue and io.disk_dequeue, which
  output some more debugging data.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |   10 ++-
 block/as-iosched.c    |   50 ++++++---
 block/elevator-fq.c   |  280 ++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h   |   36 +++++++
 4 files changed, 354 insertions(+), 22 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 0677099..79f188c 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -140,6 +140,14 @@ config TRACK_ASYNC_CONTEXT
 	  request, original owner of the bio is decided by using io tracking
 	  patches otherwise we continue to attribute the request to the
 	  submitting thread.
-endmenu
 
+config DEBUG_GROUP_IOSCHED
+	bool "Debug Hierarchical Scheduling support"
+	depends on CGROUPS && GROUP_IOSCHED
+	default n
+	---help---
+	  Enable some debugging hooks for hierarchical scheduling support.
+	  Currently it just outputs more information in blktrace output.
+
+endmenu
 endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 213f3e3..9ad96ee 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -78,6 +78,7 @@ enum anticipation_status {
 };
 
 struct as_queue {
+	struct io_queue *ioq;
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -162,6 +163,17 @@ enum arq_state {
 #define RQ_STATE(rq)	((enum arq_state)(rq)->elevator_private2)
 #define RQ_SET_STATE(rq, state)	((rq)->elevator_private2 = (void *) state)
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+#define as_log_asq(ad, asq, fmt, args...)				\
+{									\
+	blk_add_trace_msg((ad)->q, "as %s " fmt,			\
+			ioq_to_io_group((asq)->ioq)->path, ##args);	\
+}
+#else
+#define as_log_asq(ad, asq, fmt, args...) \
+	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
+#endif
+
 #define as_log(ad, fmt, args...)        \
 	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
 
@@ -225,7 +237,7 @@ static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
 	}
 
 out:
-	as_log(ad, "save batch: dir=%c time_left=%d changed_batch=%d"
+	as_log_asq(ad, asq, "save batch: dir=%c time_left=%d changed_batch=%d"
 			" new_batch=%d, antic_status=%d",
 			ad->batch_data_dir ? 'R' : 'W',
 			asq->current_batch_time_left,
@@ -247,8 +259,8 @@ static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
 						asq->current_batch_time_left;
 	/* restore asq batch_data_dir info */
 	ad->batch_data_dir = asq->saved_batch_data_dir;
-	as_log(ad, "restore batch: dir=%c time=%d reads_q=%d writes_q=%d"
-			" ad->antic_status=%d",
+	as_log_asq(ad, asq, "restore batch: dir=%c time=%d reads_q=%d"
+			" writes_q=%d ad->antic_status=%d",
 			ad->batch_data_dir ? 'R' : 'W',
 			asq->current_batch_time_left,
 			asq->nr_queued[1], asq->nr_queued[0],
@@ -277,8 +289,8 @@ static int as_expire_ioq(struct request_queue *q, void *sched_queue,
 	int status = ad->antic_status;
 	struct as_queue *asq = sched_queue;
 
-	as_log(ad, "as_expire_ioq slice_expired=%d, force=%d", slice_expired,
-		force);
+	as_log_asq(ad, asq, "as_expire_ioq slice_expired=%d, force=%d",
+			slice_expired, force);
 
 	/* Forced expiry. We don't have a choice */
 	if (force) {
@@ -1021,9 +1033,10 @@ static void update_write_batch(struct as_data *ad, struct request *rq)
 	if (write_time < 0)
 		write_time = 0;
 
-	as_log(ad, "upd write: write_time=%d batch=%d write_batch_idled=%d"
-			" current_write_count=%d", write_time, batch,
-			asq->write_batch_idled, asq->current_write_count);
+	as_log_asq(ad, asq, "upd write: write_time=%d batch=%d"
+			" write_batch_idled=%d current_write_count=%d",
+			write_time, batch, asq->write_batch_idled,
+			asq->current_write_count);
 
 	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
@@ -1040,7 +1053,7 @@ static void update_write_batch(struct as_data *ad, struct request *rq)
 	if (asq->write_batch_count < 1)
 		asq->write_batch_count = 1;
 
-	as_log(ad, "upd write count=%d", asq->write_batch_count);
+	as_log_asq(ad, asq, "upd write count=%d", asq->write_batch_count);
 }
 
 /*
@@ -1059,7 +1072,7 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 		goto out;
 	}
 
-	as_log(ad, "complete: reads_q=%d writes_q=%d changed_batch=%d"
+	as_log_asq(ad, asq, "complete: reads_q=%d writes_q=%d changed_batch=%d"
 		" new_batch=%d switch_queue=%d, dir=%c",
 		asq->nr_queued[1], asq->nr_queued[0], ad->changed_batch,
 		ad->new_batch, ad->switch_queue,
@@ -1253,7 +1266,7 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 	if (RQ_IOC(rq) && RQ_IOC(rq)->aic)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 	ad->nr_dispatched++;
-	as_log(ad, "dispatch req dir=%c nr_dispatched = %d",
+	as_log_asq(ad, asq, "dispatch req dir=%c nr_dispatched = %d",
 			data_dir ? 'R' : 'W', ad->nr_dispatched);
 }
 
@@ -1302,7 +1315,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		}
 		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
-		as_log(ad, "forced dispatch");
+		as_log_asq(ad, asq, "forced dispatch");
 		return dispatched;
 	}
 
@@ -1316,7 +1329,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->antic_status == ANTIC_WAIT_REQ
 		|| ad->antic_status == ANTIC_WAIT_NEXT
 		|| ad->changed_batch) {
-		as_log(ad, "no dispatch. read_q=%d, writes_q=%d"
+		as_log_asq(ad, asq, "no dispatch. read_q=%d, writes_q=%d"
 			" ad->antic_status=%d, changed_batch=%d,"
 			" switch_queue=%d new_batch=%d", asq->nr_queued[1],
 			asq->nr_queued[0], ad->antic_status, ad->changed_batch,
@@ -1335,7 +1348,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
-				as_log(ad, "can_anticipate = 1");
+				as_log_asq(ad, asq, "can_anticipate = 1");
 				as_antic_waitreq(ad);
 				return 0;
 			}
@@ -1355,7 +1368,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 * data direction (read / write)
 	 */
 
-	as_log(ad, "select a fresh batch and request");
+	as_log_asq(ad, asq, "select a fresh batch and request");
 
 	if (reads) {
 		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
@@ -1371,7 +1384,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
-		as_log(ad, "new batch dir is sync");
+		as_log_asq(ad, asq, "new batch dir is sync");
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
 		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
@@ -1396,7 +1409,7 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
-		as_log(ad, "new batch dir is async");
+		as_log_asq(ad, asq, "new batch dir is async");
 		asq->current_write_count = asq->write_batch_count;
 		asq->write_batch_idled = 0;
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
@@ -1459,7 +1472,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	rq->elevator_private = as_get_io_context(q->node);
 
 	asq->nr_queued[data_dir]++;
-	as_log(ad, "add a %c request read_q=%d write_q=%d",
+	as_log_asq(ad, asq, "add a %c request read_q=%d write_q=%d",
 			data_dir ? 'R' : 'W', asq->nr_queued[1],
 			asq->nr_queued[0]);
 
@@ -1614,6 +1627,7 @@ static void *as_alloc_as_queue(struct request_queue *q,
 
 	if (asq->write_batch_count < 2)
 		asq->write_batch_count = 2;
+	asq->ioq = ioq;
 out:
 	return asq;
 }
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 31b066d..5b3f068 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -159,6 +159,119 @@ static void bfq_find_matching_entity(struct io_entity **entity,
 		*new_entity = parent_entity(*new_entity);
 	}
 }
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+	struct io_group *iog = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data != NULL)
+		iog = container_of(entity, struct io_group, entity);
+	return iog;
+}
+
+/* Returns parent group of io group */
+static inline struct io_group *iog_parent(struct io_group *iog)
+{
+	struct io_group *piog;
+
+	if (!iog->entity.sched_data)
+		return NULL;
+
+	/*
+	 * Not following entity->parent pointer as for top level groups
+	 * this pointer is NULL.
+	 */
+	piog = container_of(iog->entity.sched_data, struct io_group,
+					sched_data);
+	return piog;
+}
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static void io_group_path(struct io_group *iog, char *buf, int buflen)
+{
+	unsigned short id = iog->iocg_id;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+
+	if (!id)
+		goto out;
+
+	css = css_lookup(&io_subsys, id);
+	if (!css)
+		goto out;
+
+	if (!css_tryget(css))
+		goto out;
+
+	cgroup_path(css->cgroup, buf, buflen);
+
+	css_put(css);
+
+	rcu_read_unlock();
+	return;
+out:
+	rcu_read_unlock();
+	buf[0] = '\0';
+	return;
+}
+
+/*
+ * An entity has been freshly added to active tree. Either it came from
+ * idle tree or it was not on any of the trees. Do the accounting.
+ */
+static inline void bfq_account_for_entity_addition(struct io_entity *entity)
+{
+	struct io_group *iog = io_entity_to_iog(entity);
+
+	if (iog) {
+		struct elv_fq_data *efqd;
+
+		/*
+		 * Keep track of how many times a group has been added
+		 * to active tree.
+		 */
+		iog->queue++;
+		iog->queue_start = jiffies;
+
+		/* Log group addition event */
+		rcu_read_lock();
+		efqd = rcu_dereference(iog->key);
+		if (efqd)
+			elv_log_iog(efqd, iog, "add group weight=%u",
+					iog->entity.weight);
+		rcu_read_unlock();
+	}
+}
+
+/*
+ * An entity was removed from the active tree and either went to the idle
+ * tree or is not on any tree. Do the accounting.
+ */
+static inline void bfq_account_for_entity_deletion(struct io_entity *entity)
+{
+	struct io_group *iog = io_entity_to_iog(entity);
+
+	if (iog) {
+		struct elv_fq_data *efqd;
+
+		iog->dequeue++;
+		/* Keep track of how long the group was on the active tree */
+		iog->queue_duration += jiffies_to_msecs(jiffies -
+						iog->queue_start);
+		iog->queue_start = 0;
+
+		/* Log group deletion event */
+		rcu_read_lock();
+		efqd = rcu_dereference(iog->key);
+		if (efqd)
+			elv_log_iog(efqd, iog, "del group weight=%u",
+					iog->entity.weight);
+		rcu_read_unlock();
+	}
+}
+#endif /* CONFIG_DEBUG_GROUP_IOSCHED */
 #else /* GROUP_IOSCHED */
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
@@ -186,6 +299,11 @@ static void bfq_find_matching_entity(struct io_entity **entity,
 					struct io_entity **new_entity)
 {
 }
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+	return NULL;
+}
 #endif /* GROUP_IOSCHED */
 
 static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
@@ -769,6 +887,7 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 {
 	struct io_sched_data *sd = entity->sched_data;
 	struct io_service_tree *st = io_entity_service_tree(entity);
+	int newly_added = 0;
 
 	if (entity == sd->active_entity) {
 		BUG_ON(entity->tree != NULL);
@@ -795,6 +914,7 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 		bfq_idle_remove(st, entity);
 		entity->start = bfq_gt(st->vtime, entity->finish) ?
 				       st->vtime : entity->finish;
+		newly_added = 1;
 	} else {
 		/*
 		 * The finish time of the entity may be invalid, and
@@ -807,6 +927,7 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 
 		BUG_ON(entity->on_st);
 		entity->on_st = 1;
+		newly_added = 1;
 	}
 
 	st = __bfq_entity_update_prio(st, entity);
@@ -844,6 +965,11 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 		bfq_calc_finish(entity, entity->budget);
 	}
 	bfq_active_insert(st, entity);
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	if (newly_added)
+		bfq_account_for_entity_addition(entity);
+#endif
 }
 
 /**
@@ -912,6 +1038,9 @@ static int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 	BUG_ON(sd->active_entity == entity);
 	BUG_ON(sd->next_active == entity);
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	bfq_account_for_entity_deletion(entity);
+#endif
 	return ret;
 }
 
@@ -1170,6 +1299,10 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
 	}
 
 	ret = elv_is_iog_congested(q, iog, sync);
+	if (ret)
+		elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
+			" rl.count[sync]=%d nr_group_requests=%d",
+			ret, sync, iog->rl.count[sync], q->nr_group_requests);
 	rcu_read_unlock();
 	return ret;
 }
@@ -1538,6 +1671,67 @@ static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
 	return 0;
 }
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static int io_cgroup_disk_queue_read(struct cgroup *cgroup,
+			struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg = NULL;
+	struct io_group *iog = NULL;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+	rcu_read_lock();
+	/* Loop through all the io groups and print statistics */
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgroup deletion.
+		 */
+		if (iog->key) {
+			seq_printf(m, "%u %u %lu %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev), iog->queue,
+					iog->queue_duration);
+		}
+	}
+	rcu_read_unlock();
+	cgroup_unlock();
+
+	return 0;
+}
+
+static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
+			struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg = NULL;
+	struct io_group *iog = NULL;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+	spin_lock_irq(&iocg->lock);
+	/* Loop through all the io groups and print statistics */
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgroup deletion.
+		 */
+		if (iog->key) {
+			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev), iog->dequeue);
+		}
+	}
+	spin_unlock_irq(&iocg->lock);
+	cgroup_unlock();
+
+	return 0;
+}
+#endif
+
 struct cftype bfqio_files[] = {
 	{
 		.name = "policy",
@@ -1563,6 +1757,16 @@ struct cftype bfqio_files[] = {
 		.name = "disk_sectors",
 		.read_seq_string = io_cgroup_disk_sectors_read,
 	},
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		.name = "disk_queue",
+		.read_seq_string = io_cgroup_disk_queue_read,
+	},
+	{
+		.name = "disk_dequeue",
+		.read_seq_string = io_cgroup_disk_dequeue_read,
+	},
+#endif
 };
 
 static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1707,6 +1911,11 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		blk_init_request_list(&iog->rl);
 		elv_io_group_congestion_threshold(q, iog);
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		io_group_path(iog, iog->path, sizeof(iog->path));
+#endif
+
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -2548,6 +2757,22 @@ EXPORT_SYMBOL(elv_get_slice_idle);
 static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
 {
 	entity_served(&ioq->entity, served, ioq->nr_sectors);
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		{
+			struct elv_fq_data *efqd = ioq->efqd;
+			struct io_group *iog = ioq_to_io_group(ioq);
+			elv_log_ioq(efqd, ioq, "ioq served: QSt=0x%lx QSs=0x%lx"
+				" QTt=0x%lx QTs=0x%lx GTt=0x%lx "
+				" GTs=0x%lx rq_queued=%d",
+				served, ioq->nr_sectors,
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+		}
+#endif
 }
 
 /* Tells whether ioq is queued in root group or not */
@@ -2926,10 +3151,29 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 	if (ioq) {
 		struct io_group *iog = ioq_to_io_group(ioq);
 		elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
-				" weight=%u group_weight=%u",
+				" weight=%u rq_queued=%d group_weight=%u",
 				efqd->busy_queues,
 				ioq->entity.ioprio, ioq->entity.weight,
-				iog_weight(iog));
+				ioq->nr_queued, iog_weight(iog));
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+			{
+				int nr_active = 0;
+				struct io_group *parent = NULL;
+
+				parent = iog_parent(iog);
+				if (parent)
+					nr_active = elv_iog_nr_active(parent);
+
+				elv_log_ioq(efqd, ioq, "set_active, ioq"
+				" nrgrps=%d QTt=0x%lx QTs=0x%lx GTt=0x%lx "
+				" GTs=0x%lx rq_queued=%d", nr_active,
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+			}
+#endif
 		ioq->slice_end = 0;
 
 		elv_clear_ioq_wait_request(ioq);
@@ -3010,6 +3254,21 @@ static void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
 		struct io_group *iog = ioq_to_io_group(ioq);
 		iog->busy_rt_queues++;
 	}
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "add to busy: QTt=0x%lx QTs=0x%lx"
+			" GTt=0x%lx GTs=0x%lx rq_queued=%d",
+			ioq->entity.total_service,
+			ioq->entity.total_sector_service,
+			iog->entity.total_service,
+			iog->entity.total_sector_service,
+			ioq->nr_queued);
+	}
+#else
+	elv_log_ioq(efqd, ioq, "add to busy");
+#endif
 }
 
 static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
@@ -3019,7 +3278,21 @@ static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
 
 	BUG_ON(!elv_ioq_busy(ioq));
 	BUG_ON(ioq->nr_queued);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "del from busy: QTt=0x%lx "
+			"QTs=0x%lx ioq GTt=0x%lx GTs=0x%lx "
+			"rq_queued=%d",
+			ioq->entity.total_service,
+			ioq->entity.total_sector_service,
+			iog->entity.total_service,
+			iog->entity.total_sector_service,
+			ioq->nr_queued);
+	}
+#else
 	elv_log_ioq(efqd, ioq, "del from busy");
+#endif
 	elv_clear_ioq_busy(ioq);
 	BUG_ON(efqd->busy_queues == 0);
 	efqd->busy_queues--;
@@ -3311,6 +3584,7 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 
 	elv_ioq_update_io_thinktime(ioq);
 	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+	elv_log_ioq(efqd, ioq, "add rq: rq_queued=%d", ioq->nr_queued);
 
 	if (ioq == elv_active_ioq(q->elevator)) {
 		/*
@@ -3531,7 +3805,7 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 	}
 
 	/* We are waiting for this queue to become busy before it expires.*/
-	if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
+	if (elv_ioq_wait_busy(ioq)) {
 		ioq = NULL;
 		goto keep_queue;
 	}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 58c650b..19ac8ca 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -265,6 +265,23 @@ struct io_group {
 
 	/* request list associated with the group */
 	struct request_list rl;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	/* How many times this group has been added to active tree */
+	unsigned long queue;
+
+	/* How long this group remained on active tree, in ms */
+	unsigned long queue_duration;
+
+	/* When was this group added to active tree */
+	unsigned long queue_start;
+
+	/* How many times this group has been removed from active tree */
+	unsigned long dequeue;
+
+	/* Store cgroup path */
+	char path[128];
+#endif
 };
 
 struct io_policy_node {
@@ -368,10 +385,29 @@ struct elv_fq_data {
 };
 
 /* Logging facilities. */
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+{								\
+	blk_add_trace_msg((efqd)->queue, "elv%d%c %s " fmt, (ioq)->pid,	\
+			elv_ioq_sync(ioq) ? 'S' : 'A', \
+			ioq_to_io_group(ioq)->path, ##args); \
+}
+
+#define elv_log_iog(efqd, iog, fmt, args...) \
+{                                                                      \
+	blk_add_trace_msg((efqd)->queue, "elv %s " fmt, (iog)->path, ##args); \
+}
+
+#else
 #define elv_log_ioq(efqd, ioq, fmt, args...) \
 	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
 				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
 
+#define elv_log_iog(efqd, iog, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#endif
+
 #define elv_log(efqd, fmt, args...) \
 	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
 
-- 
1.6.0.6



* [PATCH 24/25] io-controller: Debug hierarchical IO scheduling
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

o A small debugging aid for hierarchical IO scheduling.

o Enabled under CONFIG_DEBUG_GROUP_IOSCHED

o Currently it adds more debug messages to the blktrace output, which helps
  a great deal when debugging a hierarchical setup. It also creates two
  additional cgroup files, io.disk_queue and io.disk_dequeue, which export
  some more debugging data.
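
With CONFIG_DEBUG_GROUP_IOSCHED enabled, the elv_log_ioq() macro in this patch
prefixes each trace message with "elv%d%c %s " (pid, 'S' or 'A', cgroup path).
The following is a minimal sketch of how a post-processing tool might split
that prefix from the event text. The struct and function names are
hypothetical, and the sketch assumes a non-empty path without spaces; it does
not handle the non-debug format, which has no path field.

```c
#include <stdio.h>

struct elv_ioq_msg {
	int pid;
	char sync;		/* 'S' for a sync queue, 'A' for async */
	char path[128];		/* same size as io_group.path in the patch */
	const char *rest;	/* points into the input at the event text */
};

/* Split "elv<pid><S|A> <path> <event>"; returns 0 on success, -1 on error. */
static int parse_elv_ioq_msg(const char *msg, struct elv_ioq_msg *out)
{
	int consumed = 0;
	int n = sscanf(msg, "elv%d%c %127s %n",
		       &out->pid, &out->sync, out->path, &consumed);

	if (n != 3 || (out->sync != 'S' && out->sync != 'A'))
		return -1;
	out->rest = msg + consumed;
	return 0;
}
```

For example, "elv1234S /grp1 add rq: rq_queued=1" would yield pid 1234, sync
'S', path "/grp1", and an event text starting at "add rq:".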

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/Kconfig.iosched |   10 ++-
 block/as-iosched.c    |   50 ++++++---
 block/elevator-fq.c   |  280 ++++++++++++++++++++++++++++++++++++++++++++++++-
 block/elevator-fq.h   |   36 +++++++
 4 files changed, 354 insertions(+), 22 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 0677099..79f188c 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -140,6 +140,14 @@ config TRACK_ASYNC_CONTEXT
 	  request, original owner of the bio is decided by using io tracking
 	  patches otherwise we continue to attribute the request to the
 	  submitting thread.
-endmenu
 
+config DEBUG_GROUP_IOSCHED
+	bool "Debug Hierarchical Scheduling support"
+	depends on CGROUPS && GROUP_IOSCHED
+	default n
+	---help---
+	  Enable some debugging hooks for hierarchical scheduling support.
+	  Currently it just outputs more information in blktrace output.
+
+endmenu
 endif
diff --git a/block/as-iosched.c b/block/as-iosched.c
index 213f3e3..9ad96ee 100644
--- a/block/as-iosched.c
+++ b/block/as-iosched.c
@@ -78,6 +78,7 @@ enum anticipation_status {
 };
 
 struct as_queue {
+	struct io_queue *ioq;
 	/*
 	 * requests (as_rq s) are present on both sort_list and fifo_list
 	 */
@@ -162,6 +163,17 @@ enum arq_state {
 #define RQ_STATE(rq)	((enum arq_state)(rq)->elevator_private2)
 #define RQ_SET_STATE(rq, state)	((rq)->elevator_private2 = (void *) state)
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+#define as_log_asq(ad, asq, fmt, args...)				\
+{									\
+	blk_add_trace_msg((ad)->q, "as %s " fmt,			\
+			ioq_to_io_group((asq)->ioq)->path, ##args);	\
+}
+#else
+#define as_log_asq(ad, asq, fmt, args...) \
+	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
+#endif
+
 #define as_log(ad, fmt, args...)        \
 	blk_add_trace_msg((ad)->q, "as " fmt, ##args)
 
@@ -225,7 +237,7 @@ static void as_save_batch_context(struct as_data *ad, struct as_queue *asq)
 	}
 
 out:
-	as_log(ad, "save batch: dir=%c time_left=%d changed_batch=%d"
+	as_log_asq(ad, asq, "save batch: dir=%c time_left=%d changed_batch=%d"
 			" new_batch=%d, antic_status=%d",
 			ad->batch_data_dir ? 'R' : 'W',
 			asq->current_batch_time_left,
@@ -247,8 +259,8 @@ static void as_restore_batch_context(struct as_data *ad, struct as_queue *asq)
 						asq->current_batch_time_left;
 	/* restore asq batch_data_dir info */
 	ad->batch_data_dir = asq->saved_batch_data_dir;
-	as_log(ad, "restore batch: dir=%c time=%d reads_q=%d writes_q=%d"
-			" ad->antic_status=%d",
+	as_log_asq(ad, asq, "restore batch: dir=%c time=%d reads_q=%d"
+			" writes_q=%d ad->antic_status=%d",
 			ad->batch_data_dir ? 'R' : 'W',
 			asq->current_batch_time_left,
 			asq->nr_queued[1], asq->nr_queued[0],
@@ -277,8 +289,8 @@ static int as_expire_ioq(struct request_queue *q, void *sched_queue,
 	int status = ad->antic_status;
 	struct as_queue *asq = sched_queue;
 
-	as_log(ad, "as_expire_ioq slice_expired=%d, force=%d", slice_expired,
-		force);
+	as_log_asq(ad, asq, "as_expire_ioq slice_expired=%d, force=%d",
+			slice_expired, force);
 
 	/* Forced expiry. We don't have a choice */
 	if (force) {
@@ -1021,9 +1033,10 @@ static void update_write_batch(struct as_data *ad, struct request *rq)
 	if (write_time < 0)
 		write_time = 0;
 
-	as_log(ad, "upd write: write_time=%d batch=%d write_batch_idled=%d"
-			" current_write_count=%d", write_time, batch,
-			asq->write_batch_idled, asq->current_write_count);
+	as_log_asq(ad, asq, "upd write: write_time=%d batch=%d"
+			" write_batch_idled=%d current_write_count=%d",
+			write_time, batch, asq->write_batch_idled,
+			asq->current_write_count);
 
 	if (write_time > batch && !asq->write_batch_idled) {
 		if (write_time > batch * 3)
@@ -1040,7 +1053,7 @@ static void update_write_batch(struct as_data *ad, struct request *rq)
 	if (asq->write_batch_count < 1)
 		asq->write_batch_count = 1;
 
-	as_log(ad, "upd write count=%d", asq->write_batch_count);
+	as_log_asq(ad, asq, "upd write count=%d", asq->write_batch_count);
 }
 
 /*
@@ -1059,7 +1072,7 @@ static void as_completed_request(struct request_queue *q, struct request *rq)
 		goto out;
 	}
 
-	as_log(ad, "complete: reads_q=%d writes_q=%d changed_batch=%d"
+	as_log_asq(ad, asq, "complete: reads_q=%d writes_q=%d changed_batch=%d"
 		" new_batch=%d switch_queue=%d, dir=%c",
 		asq->nr_queued[1], asq->nr_queued[0], ad->changed_batch,
 		ad->new_batch, ad->switch_queue,
@@ -1253,7 +1266,7 @@ static void as_move_to_dispatch(struct as_data *ad, struct request *rq)
 	if (RQ_IOC(rq) && RQ_IOC(rq)->aic)
 		atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched);
 	ad->nr_dispatched++;
-	as_log(ad, "dispatch req dir=%c nr_dispatched = %d",
+	as_log_asq(ad, asq, "dispatch req dir=%c nr_dispatched = %d",
 			data_dir ? 'R' : 'W', ad->nr_dispatched);
 }
 
@@ -1302,7 +1315,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		}
 		asq->last_check_fifo[BLK_RW_ASYNC] = jiffies;
 
-		as_log(ad, "forced dispatch");
+		as_log_asq(ad, asq, "forced dispatch");
 		return dispatched;
 	}
 
@@ -1316,7 +1329,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 		|| ad->antic_status == ANTIC_WAIT_REQ
 		|| ad->antic_status == ANTIC_WAIT_NEXT
 		|| ad->changed_batch) {
-		as_log(ad, "no dispatch. read_q=%d, writes_q=%d"
+		as_log_asq(ad, asq, "no dispatch. read_q=%d, writes_q=%d"
 			" ad->antic_status=%d, changed_batch=%d,"
 			" switch_queue=%d new_batch=%d", asq->nr_queued[1],
 			asq->nr_queued[0], ad->antic_status, ad->changed_batch,
@@ -1335,7 +1348,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 				goto fifo_expired;
 
 			if (as_can_anticipate(ad, rq)) {
-				as_log(ad, "can_anticipate = 1");
+				as_log_asq(ad, asq, "can_anticipate = 1");
 				as_antic_waitreq(ad);
 				return 0;
 			}
@@ -1355,7 +1368,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 	 * data direction (read / write)
 	 */
 
-	as_log(ad, "select a fresh batch and request");
+	as_log_asq(ad, asq, "select a fresh batch and request");
 
 	if (reads) {
 		BUG_ON(RB_EMPTY_ROOT(&asq->sort_list[BLK_RW_SYNC]));
@@ -1371,7 +1384,7 @@ static int as_dispatch_request(struct request_queue *q, int force)
 			ad->changed_batch = 1;
 		}
 		ad->batch_data_dir = BLK_RW_SYNC;
-		as_log(ad, "new batch dir is sync");
+		as_log_asq(ad, asq, "new batch dir is sync");
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_SYNC].next);
 		asq->last_check_fifo[ad->batch_data_dir] = jiffies;
 		goto dispatch_request;
@@ -1396,7 +1409,7 @@ dispatch_writes:
 			ad->new_batch = 0;
 		}
 		ad->batch_data_dir = BLK_RW_ASYNC;
-		as_log(ad, "new batch dir is async");
+		as_log_asq(ad, asq, "new batch dir is async");
 		asq->current_write_count = asq->write_batch_count;
 		asq->write_batch_idled = 0;
 		rq = rq_entry_fifo(asq->fifo_list[BLK_RW_ASYNC].next);
@@ -1459,7 +1472,7 @@ static void as_add_request(struct request_queue *q, struct request *rq)
 	rq->elevator_private = as_get_io_context(q->node);
 
 	asq->nr_queued[data_dir]++;
-	as_log(ad, "add a %c request read_q=%d write_q=%d",
+	as_log_asq(ad, asq, "add a %c request read_q=%d write_q=%d",
 			data_dir ? 'R' : 'W', asq->nr_queued[1],
 			asq->nr_queued[0]);
 
@@ -1614,6 +1627,7 @@ static void *as_alloc_as_queue(struct request_queue *q,
 
 	if (asq->write_batch_count < 2)
 		asq->write_batch_count = 2;
+	asq->ioq = ioq;
 out:
 	return asq;
 }
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 31b066d..5b3f068 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -159,6 +159,119 @@ static void bfq_find_matching_entity(struct io_entity **entity,
 		*new_entity = parent_entity(*new_entity);
 	}
 }
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+	struct io_group *iog = NULL;
+
+	BUG_ON(entity == NULL);
+	if (entity->my_sched_data != NULL)
+		iog = container_of(entity, struct io_group, entity);
+	return iog;
+}
+
+/* Returns parent group of io group */
+static inline struct io_group *iog_parent(struct io_group *iog)
+{
+	struct io_group *piog;
+
+	if (!iog->entity.sched_data)
+		return NULL;
+
+	/*
+	 * Not following entity->parent pointer as for top level groups
+	 * this pointer is NULL.
+	 */
+	piog = container_of(iog->entity.sched_data, struct io_group,
+					sched_data);
+	return piog;
+}
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static void io_group_path(struct io_group *iog, char *buf, int buflen)
+{
+	unsigned short id = iog->iocg_id;
+	struct cgroup_subsys_state *css;
+
+	rcu_read_lock();
+
+	if (!id)
+		goto out;
+
+	css = css_lookup(&io_subsys, id);
+	if (!css)
+		goto out;
+
+	if (!css_tryget(css))
+		goto out;
+
+	cgroup_path(css->cgroup, buf, buflen);
+
+	css_put(css);
+
+	rcu_read_unlock();
+	return;
+out:
+	rcu_read_unlock();
+	buf[0] = '\0';
+	return;
+}
+
+/*
+ * An entity has been freshly added to active tree. Either it came from
+ * idle tree or it was not on any of the trees. Do the accounting.
+ */
+static inline void bfq_account_for_entity_addition(struct io_entity *entity)
+{
+	struct io_group *iog = io_entity_to_iog(entity);
+
+	if (iog) {
+		struct elv_fq_data *efqd;
+
+		/*
+		 * Keep track of how many times a group has been added
+		 * to active tree.
+		 */
+		iog->queue++;
+		iog->queue_start = jiffies;
+
+		/* Log group addition event */
+		rcu_read_lock();
+		efqd = rcu_dereference(iog->key);
+		if (efqd)
+			elv_log_iog(efqd, iog, "add group weight=%u",
+					iog->entity.weight);
+		rcu_read_unlock();
+	}
+}
+
+/*
+ * An entity was removed from the active tree and either went to the idle
+ * tree or is not on any tree. Do the accounting.
+ */
+static inline void bfq_account_for_entity_deletion(struct io_entity *entity)
+{
+	struct io_group *iog = io_entity_to_iog(entity);
+
+	if (iog) {
+		struct elv_fq_data *efqd;
+
+		iog->dequeue++;
+		/* Keep track of how long the group was on the active tree */
+		iog->queue_duration += jiffies_to_msecs(jiffies -
+						iog->queue_start);
+		iog->queue_start = 0;
+
+		/* Log group deletion event */
+		rcu_read_lock();
+		efqd = rcu_dereference(iog->key);
+		if (efqd)
+			elv_log_iog(efqd, iog, "del group weight=%u",
+					iog->entity.weight);
+		rcu_read_unlock();
+	}
+}
+#endif /* CONFIG_DEBUG_GROUP_IOSCHED */
 #else /* GROUP_IOSCHED */
 #define for_each_entity(entity)	\
 	for (; entity != NULL; entity = NULL)
@@ -186,6 +299,11 @@ static void bfq_find_matching_entity(struct io_entity **entity,
 					struct io_entity **new_entity)
 {
 }
+
+static inline struct io_group *io_entity_to_iog(struct io_entity *entity)
+{
+	return NULL;
+}
 #endif /* GROUP_IOSCHED */
 
 static inline int elv_prio_slice(struct elv_fq_data *efqd, int sync,
@@ -769,6 +887,7 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 {
 	struct io_sched_data *sd = entity->sched_data;
 	struct io_service_tree *st = io_entity_service_tree(entity);
+	int newly_added = 0;
 
 	if (entity == sd->active_entity) {
 		BUG_ON(entity->tree != NULL);
@@ -795,6 +914,7 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 		bfq_idle_remove(st, entity);
 		entity->start = bfq_gt(st->vtime, entity->finish) ?
 				       st->vtime : entity->finish;
+		newly_added = 1;
 	} else {
 		/*
 		 * The finish time of the entity may be invalid, and
@@ -807,6 +927,7 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 
 		BUG_ON(entity->on_st);
 		entity->on_st = 1;
+		newly_added = 1;
 	}
 
 	st = __bfq_entity_update_prio(st, entity);
@@ -844,6 +965,11 @@ static void __bfq_activate_entity(struct io_entity *entity, int add_front)
 		bfq_calc_finish(entity, entity->budget);
 	}
 	bfq_active_insert(st, entity);
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	if (newly_added)
+		bfq_account_for_entity_addition(entity);
+#endif
 }
 
 /**
@@ -912,6 +1038,9 @@ static int __bfq_deactivate_entity(struct io_entity *entity, int requeue)
 	BUG_ON(sd->active_entity == entity);
 	BUG_ON(sd->next_active == entity);
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	bfq_account_for_entity_deletion(entity);
+#endif
 	return ret;
 }
 
@@ -1170,6 +1299,10 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
 	}
 
 	ret = elv_is_iog_congested(q, iog, sync);
+	if (ret)
+		elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
+			" rl.count[sync]=%d nr_group_requests=%d",
+			ret, sync, iog->rl.count[sync], q->nr_group_requests);
 	rcu_read_unlock();
 	return ret;
 }
@@ -1538,6 +1671,67 @@ static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
 	return 0;
 }
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+static int io_cgroup_disk_queue_read(struct cgroup *cgroup,
+			struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg = NULL;
+	struct io_group *iog = NULL;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+	rcu_read_lock();
+	/* Loop through all the io groups and print statistics */
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgroup deletion.
+		 */
+		if (iog->key) {
+			seq_printf(m, "%u %u %lu %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev), iog->queue,
+					iog->queue_duration);
+		}
+	}
+	rcu_read_unlock();
+	cgroup_unlock();
+
+	return 0;
+}
+
+static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
+			struct cftype *cftype, struct seq_file *m)
+{
+	struct io_cgroup *iocg = NULL;
+	struct io_group *iog = NULL;
+	struct hlist_node *n;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+	spin_lock_irq(&iocg->lock);
+	/* Loop through all the io groups and print statistics */
+	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
+		/*
+		 * There might be groups which are not functional and
+		 * waiting to be reclaimed upon cgroup deletion.
+		 */
+		if (iog->key) {
+			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+					MINOR(iog->dev), iog->dequeue);
+		}
+	}
+	spin_unlock_irq(&iocg->lock);
+	cgroup_unlock();
+
+	return 0;
+}
+#endif
+
 struct cftype bfqio_files[] = {
 	{
 		.name = "policy",
@@ -1563,6 +1757,16 @@ struct cftype bfqio_files[] = {
 		.name = "disk_sectors",
 		.read_seq_string = io_cgroup_disk_sectors_read,
 	},
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		.name = "disk_queue",
+		.read_seq_string = io_cgroup_disk_queue_read,
+	},
+	{
+		.name = "disk_dequeue",
+		.read_seq_string = io_cgroup_disk_dequeue_read,
+	},
+#endif
 };
 
 static int iocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
@@ -1707,6 +1911,11 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		blk_init_request_list(&iog->rl);
 		elv_io_group_congestion_threshold(q, iog);
 
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		io_group_path(iog, iog->path, sizeof(iog->path));
+#endif
+
+
 		if (leaf == NULL) {
 			leaf = iog;
 			prev = leaf;
@@ -2548,6 +2757,22 @@ EXPORT_SYMBOL(elv_get_slice_idle);
 static void elv_ioq_served(struct io_queue *ioq, unsigned long served)
 {
 	entity_served(&ioq->entity, served, ioq->nr_sectors);
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		struct elv_fq_data *efqd = ioq->efqd;
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "ioq served: QSt=0x%lx QSs=0x%lx"
+			" QTt=0x%lx QTs=0x%lx GTt=0x%lx "
+			" GTs=0x%lx rq_queued=%d",
+			served, ioq->nr_sectors,
+			ioq->entity.total_service,
+			ioq->entity.total_sector_service,
+			iog->entity.total_service,
+			iog->entity.total_sector_service,
+			ioq->nr_queued);
+	}
+#endif
 }
 
 /* Tells whether ioq is queued in root group or not */
@@ -2926,10 +3151,29 @@ static void __elv_set_active_ioq(struct elv_fq_data *efqd, struct io_queue *ioq,
 	if (ioq) {
 		struct io_group *iog = ioq_to_io_group(ioq);
 		elv_log_ioq(efqd, ioq, "set_active, busy=%d ioprio=%d"
-				" weight=%u group_weight=%u",
+				" weight=%u rq_queued=%d group_weight=%u",
 				efqd->busy_queues,
 				ioq->entity.ioprio, ioq->entity.weight,
-				iog_weight(iog));
+				ioq->nr_queued, iog_weight(iog));
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+		{
+			int nr_active = 0;
+			struct io_group *parent = NULL;
+
+			parent = iog_parent(iog);
+			if (parent)
+				nr_active = elv_iog_nr_active(parent);
+
+			elv_log_ioq(efqd, ioq, "set_active, ioq"
+				" nrgrps=%d QTt=0x%lx QTs=0x%lx GTt=0x%lx "
+				" GTs=0x%lx rq_queued=%d", nr_active,
+				ioq->entity.total_service,
+				ioq->entity.total_sector_service,
+				iog->entity.total_service,
+				iog->entity.total_sector_service,
+				ioq->nr_queued);
+		}
+#endif
 		ioq->slice_end = 0;
 
 		elv_clear_ioq_wait_request(ioq);
@@ -3010,6 +3254,21 @@ static void elv_add_ioq_busy(struct elv_fq_data *efqd, struct io_queue *ioq)
 		struct io_group *iog = ioq_to_io_group(ioq);
 		iog->busy_rt_queues++;
 	}
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "add to busy: QTt=0x%lx QTs=0x%lx"
+			" GTt=0x%lx GTs=0x%lx rq_queued=%d",
+			ioq->entity.total_service,
+			ioq->entity.total_sector_service,
+			iog->entity.total_service,
+			iog->entity.total_sector_service,
+			ioq->nr_queued);
+	}
+#else
+	elv_log_ioq(efqd, ioq, "add to busy");
+#endif
 }
 
 static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
@@ -3019,7 +3278,21 @@ static void elv_del_ioq_busy(struct elevator_queue *e, struct io_queue *ioq,
 
 	BUG_ON(!elv_ioq_busy(ioq));
 	BUG_ON(ioq->nr_queued);
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	{
+		struct io_group *iog = ioq_to_io_group(ioq);
+		elv_log_ioq(efqd, ioq, "del from busy: QTt=0x%lx "
+			"QTs=0x%lx ioq GTt=0x%lx GTs=0x%lx "
+			"rq_queued=%d",
+			ioq->entity.total_service,
+			ioq->entity.total_sector_service,
+			iog->entity.total_service,
+			iog->entity.total_sector_service,
+			ioq->nr_queued);
+	}
+#else
 	elv_log_ioq(efqd, ioq, "del from busy");
+#endif
 	elv_clear_ioq_busy(ioq);
 	BUG_ON(efqd->busy_queues == 0);
 	efqd->busy_queues--;
@@ -3311,6 +3584,7 @@ void elv_ioq_request_add(struct request_queue *q, struct request *rq)
 
 	elv_ioq_update_io_thinktime(ioq);
 	elv_ioq_update_idle_window(q->elevator, ioq, rq);
+	elv_log_ioq(efqd, ioq, "add rq: rq_queued=%d", ioq->nr_queued);
 
 	if (ioq == elv_active_ioq(q->elevator)) {
 		/*
@@ -3531,7 +3805,7 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 	}
 
 	/* We are waiting for this queue to become busy before it expires.*/
-	if (efqd->fairness && elv_ioq_wait_busy(ioq)) {
+	if (elv_ioq_wait_busy(ioq)) {
 		ioq = NULL;
 		goto keep_queue;
 	}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 58c650b..19ac8ca 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -265,6 +265,23 @@ struct io_group {
 
 	/* request list associated with the group */
 	struct request_list rl;
+
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+	/* How many times this group has been added to active tree */
+	unsigned long queue;
+
+	/* How long this group remained on active tree, in ms */
+	unsigned long queue_duration;
+
+	/* When was this group added to active tree */
+	unsigned long queue_start;
+
+	/* How many times this group has been removed from active tree */
+	unsigned long dequeue;
+
+	/* Store cgroup path */
+	char path[128];
+#endif
 };
 
 struct io_policy_node {
@@ -368,10 +385,29 @@ struct elv_fq_data {
 };
 
 /* Logging facilities. */
+#ifdef CONFIG_DEBUG_GROUP_IOSCHED
+#define elv_log_ioq(efqd, ioq, fmt, args...) \
+do {								\
+	blk_add_trace_msg((efqd)->queue, "elv%d%c %s " fmt, (ioq)->pid,	\
+			elv_ioq_sync(ioq) ? 'S' : 'A', \
+			ioq_to_io_group(ioq)->path, ##args); \
+} while (0)
+
+#define elv_log_iog(efqd, iog, fmt, args...) \
+do {									\
+	blk_add_trace_msg((efqd)->queue, "elv %s " fmt, (iog)->path, ##args); \
+} while (0)
+
+#else
 #define elv_log_ioq(efqd, ioq, fmt, args...) \
 	blk_add_trace_msg((efqd)->queue, "elv%d%c " fmt, (ioq)->pid,	\
 				elv_ioq_sync(ioq) ? 'S' : 'A', ##args)
 
+#define elv_log_iog(efqd, iog, fmt, args...) \
+	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
+
+#endif
+
 #define elv_log(efqd, fmt, args...) \
 	blk_add_trace_msg((efqd)->queue, "elv " fmt, ##args)
 
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread
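
For readers following the debug accounting in patch 24 above, here is a
minimal user-space sketch of the per-group counters (queue, dequeue,
queue_duration, queue_start). It is not the kernel code. struct group_stats,
TICKS_PER_SEC, ticks_to_msecs() and format_disk_queue() are hypothetical
stand-ins, and the body of account_addition() is an assumption based on the
field comments in struct io_group, since bfq_account_for_entity_addition()
is not shown in this part of the thread.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the debug fields added to struct io_group. */
struct group_stats {
	unsigned long queue;          /* times added to the active tree */
	unsigned long dequeue;        /* times removed from the active tree */
	unsigned long queue_duration; /* total time on the active tree, in ms */
	unsigned long queue_start;    /* tick of the last addition, 0 if off-tree */
};

/* Assumed tick rate; the patch converts with jiffies_to_msecs() instead. */
#define TICKS_PER_SEC 250UL

static unsigned long ticks_to_msecs(unsigned long ticks)
{
	return ticks * 1000UL / TICKS_PER_SEC;
}

/* Assumed counterpart of the deletion path: count and stamp the addition. */
static void account_addition(struct group_stats *gs, unsigned long now)
{
	gs->queue++;
	gs->queue_start = now;
}

/* Mirrors bfq_account_for_entity_deletion(): count, add duration, reset. */
static void account_deletion(struct group_stats *gs, unsigned long now)
{
	gs->dequeue++;
	gs->queue_duration += ticks_to_msecs(now - gs->queue_start);
	gs->queue_start = 0;
}

/* Same field order as the disk_queue file: major minor queue duration. */
static int format_disk_queue(char *buf, size_t len, unsigned int major,
			     unsigned int minor, const struct group_stats *gs)
{
	return snprintf(buf, len, "%u %u %lu %lu\n", major, minor,
			gs->queue, gs->queue_duration);
}
```

Under the assumed 250 ticks per second, a group added at tick 100 and
removed at tick 350 accumulates 1000 ms of queue_duration, with queue and
dequeue both at 1.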

* [PATCH 25/25] io-controller: experimental debug patch for async queue wait before expiry
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (23 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 24/25] io-controller: Debug hierarchical IO scheduling Vivek Goyal
@ 2009-07-02 20:01   ` Vivek Goyal
  2009-07-08  3:56   ` [RFC] IO scheduler based IO controller V6 Balbir Singh
                     ` (2 subsequent siblings)
  27 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, nauman-hpIqsD4AKlfQT0dZR+AlfA,
	dpshah-hpIqsD4AKlfQT0dZR+AlfA, lizf-BthXqXjhjHXQFUHtdCDX3A
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA

o A debug patch which waits for the next IO from an async queue once it
  becomes empty.

o For async writes, the traffic seen by the IO scheduler is not in proportion
  to the weight of the cgroup the task/page belongs to. So if there are two
  processes doing heavy writeouts in two cgroups with weights 1000 and 500
  respectively, the IO scheduler does not see more traffic/IO from the higher
  weight cgroup even if it tries to give that cgroup more disk time.
  Effectively, the async queue belonging to the higher weight cgroup becomes
  empty and drops out of contention for the disk, and the lower weight cgroup
  gets to use the disk, giving the impression in user space that the higher
  weight cgroup did not get more disk time.

o This is more of a problem at the page cache level, where a higher weight
  process might be writing out the pages of a lower weight process etc., and
  it should be fixed there.

o While we fix those issues, I am introducing this debug patch, which allows
  one to idle on an async queue (tunable via
  /sys/block/<disk>/queue/async_slice_idle), so that once a higher weight
  queue becomes empty, instead of expiring it we wait for the next request
  to come from that queue, hence giving it more disk time. A higher value of
  async_slice_idle, around 300ms, helps me get the right numbers for my
  setup. Note: more disk time does not necessarily translate into more IO
  done, as the higher weight group is not pushing enough IO to the IO
  scheduler. It is just a debugging aid to prove the correctness of the IO
  controller by providing more disk time to the higher weight cgroup.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/cfq-iosched.c |    1 +
 block/elevator-fq.c |   39 ++++++++++++++++++++++++++++++++++++---
 block/elevator-fq.h |    5 +++++
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a40a2fa..fbe56a9 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2093,6 +2093,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	ELV_ATTR(slice_sync),
 	ELV_ATTR(slice_async),
 	ELV_ATTR(fairness),
+	ELV_ATTR(async_slice_idle),
 	__ATTR_NULL
 };
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 5b3f068..7c83d1e 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -24,6 +24,7 @@ const int elv_slice_sync = HZ / 10;
 int elv_slice_async = HZ / 25;
 const int elv_slice_async_rq = 2;
 int elv_slice_idle = HZ / 125;
+int elv_async_slice_idle = 0;
 static struct kmem_cache *elv_ioq_pool;
 
 /* Maximum Window length for updating average disk rate */
@@ -2819,6 +2820,8 @@ SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
 EXPORT_SYMBOL(elv_slice_async_show);
 SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
 EXPORT_SYMBOL(elv_fairness_show);
+SHOW_FUNCTION(elv_async_slice_idle_show, efqd->elv_async_slice_idle, 1);
+EXPORT_SYMBOL(elv_async_slice_idle_show);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2845,6 +2848,8 @@ STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_async_store);
 STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 2, 0);
 EXPORT_SYMBOL(elv_fairness_store);
+STORE_FUNCTION(elv_async_slice_idle_store, &efqd->elv_async_slice_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_async_slice_idle_store);
 #undef STORE_FUNCTION
 
 void elv_schedule_dispatch(struct request_queue *q)
@@ -3018,7 +3023,7 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
 		ioq->pid = current->pid;
 
 	ioq->sched_queue = sched_queue;
-	if (is_sync && !elv_ioq_class_idle(ioq))
+	if (!elv_ioq_class_idle(ioq) && (is_sync || efqd->fairness))
 		elv_mark_ioq_idle_window(ioq);
 	bfq_init_entity(&ioq->entity, iog);
 	ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
@@ -3699,7 +3704,12 @@ static void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
 	/*
 	 * idle is disabled, either manually or by past process history
 	 */
-	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+	if ((elv_ioq_sync(ioq) && !efqd->elv_slice_idle) ||
+			!elv_ioq_idle_window(ioq))
+		return;
+
+	/* If this is async queue and async_slice_idle is disabled, return */
+	if (!elv_ioq_sync(ioq) && !efqd->elv_async_slice_idle)
 		return;
 
 	/*
@@ -3708,7 +3718,10 @@ static void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
 	 */
 	if (wait_for_busy) {
 		elv_mark_ioq_wait_busy(ioq);
-		sl = efqd->elv_slice_idle;
+		if (elv_ioq_sync(ioq))
+			sl = efqd->elv_slice_idle;
+		else
+			sl = efqd->elv_async_slice_idle;
 		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
 		elv_log_ioq(efqd, ioq, "arm idle: %lu wait busy=1", sl);
 		return;
@@ -3882,6 +3895,18 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 		goto keep_queue;
 	}
 
+	/*
+	 * This is an async queue which has time slice left but no
+	 * requests. Wait busy is also not on (maybe because the ioq was
+	 * not empty when the last request completed). Wait for request
+	 * completion; maybe the completion will turn wait busy on.
+	 */
+	if (efqd->fairness && efqd->elv_async_slice_idle && !elv_ioq_sync(ioq)
+	    && elv_ioq_nr_dispatched(ioq)) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
 	slice_expired = 0;
 expire:
 	if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched
@@ -4076,6 +4101,13 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 			goto done;
 		}
 
+		/* For async queue try to do wait busy */
+		if (efqd->fairness && !elv_ioq_sync(ioq) && !ioq->nr_queued
+		    && (elv_iog_nr_active(iog) <= 1)) {
+			elv_ioq_arm_slice_timer(q, 1);
+			goto done;
+		}
+
 		/*
 		 * If there are no requests waiting in this queue, and
 		 * there are other queues ready to issue requests, AND
@@ -4215,6 +4247,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
 	efqd->elv_slice_idle = elv_slice_idle;
+	efqd->elv_async_slice_idle = elv_async_slice_idle;
 	efqd->hw_tag = 1;
 
 	/* For the time being keep fairness enabled by default */
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 19ac8ca..f089a55 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -362,6 +362,8 @@ struct elv_fq_data {
 	 * users of this functionality.
 	 */
 	unsigned int elv_slice_idle;
+	/* idle slice for async queue */
+	unsigned int elv_async_slice_idle;
 	struct timer_list idle_slice_timer;
 	struct work_struct unplug_work;
 
@@ -647,6 +649,9 @@ extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
 extern ssize_t elv_fairness_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_fairness_store(struct elevator_queue *q, const char *name,
 						size_t count);
+extern ssize_t elv_async_slice_idle_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_async_slice_idle_store(struct elevator_queue *q,
+					const char *name, size_t count);
 
 /* Functions used by elevator.c */
 extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
-- 
1.6.0.6

^ permalink raw reply related	[flat|nested] 191+ messages in thread
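
The idle decision that patch 25 adds to elv_ioq_arm_slice_timer() can be
sketched in user space as below. This is only an illustration of the logic in
the hunks above, not the kernel code. struct idle_tunables, idle_ticks() and
msecs_to_ticks() are hypothetical names, and the conversion helper is a
simplified assumption (the kernel's own conversion handles rounding
differently).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two idle tunables in struct elv_fq_data. */
struct idle_tunables {
	unsigned int slice_idle;       /* idle window for sync queues, in ticks */
	unsigned int async_slice_idle; /* idle window for async queues, 0 = off */
};

/* Simplified ms to tick conversion for a given tick rate (assumption). */
static unsigned int msecs_to_ticks(unsigned int msecs, unsigned int hz)
{
	return msecs * hz / 1000;
}

/*
 * Return how many ticks to idle on a queue that just became empty, or 0 if
 * the idle timer should not be armed at all. Sync queues use slice_idle,
 * async queues use async_slice_idle, and a queue without the idle window
 * flag never idles.
 */
static unsigned int idle_ticks(const struct idle_tunables *t, bool is_sync,
			       bool idle_window)
{
	if (!idle_window)
		return 0;

	return is_sync ? t->slice_idle : t->async_slice_idle;
}
```

With async_slice_idle left at its default of 0, async queues never idle and
behave as before the patch. Assuming 250 ticks per second, the 300ms value
mentioned in the changelog corresponds to 75 ticks.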

* [PATCH 25/25] io-controller: experimental debug patch for async queue wait before expiry
@ 2009-07-02 20:01   ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:01 UTC (permalink / raw)
  To: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew
  Cc: peterz, akpm, snitzer, agk, vgoyal

o A debug patch which does wait for next IO from async queue once it
  becomes empty.

o For async writes, traffic seen by IO scheduler is not in proportion to
  the weight of the cgroup task/page belongs to. So if there are two processes
  doing heavy writeouts in two cgroups with weights 1000 and 500 respectively,
  then IO scheduler does not see more traffic/IO from higher weight cgroup
  even if IO scheduler tries to give it higher disk time. Effectively, the
  async queue belonging to higher weight cgroup becomes empty, and gets out
  of contention for disk and lower weight cgroup gets to use disk giving
  an impression in user space that higher weight cgroup did not get higher
  time to disk.

o This is more of a problem at page cache level where a higher weight
  process might be writing out the pages of lower weight process etc and
  should be fixed there.

o While we fix those issues, introducing this debug patch which allows one
  to idle on async queue (tunable via /sys/blolc/<disk>/queue/async_slice_idle)  so that once a higher weight queue becomes empty, instead of expiring it
  we try to wait for next request to come from that queue hence giving it
  higher disk time. A higher value of async_slice_idle, around 300ms, helps
  me get some right numbers for my setup. Note: higher disk time would not
  necessarily translate in more IO done as higher weight group is not pushing
  enough IO to io scheduler. It is just a debugging aid to prove correctness
  of IO controller by providing higher disk times to higher weight cgroup.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    1 +
 block/elevator-fq.c |   39 ++++++++++++++++++++++++++++++++++++---
 block/elevator-fq.h |    5 +++++
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a40a2fa..fbe56a9 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2093,6 +2093,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	ELV_ATTR(slice_sync),
 	ELV_ATTR(slice_async),
 	ELV_ATTR(fairness),
+	ELV_ATTR(async_slice_idle),
 	__ATTR_NULL
 };
 
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 5b3f068..7c83d1e 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -24,6 +24,7 @@ const int elv_slice_sync = HZ / 10;
 int elv_slice_async = HZ / 25;
 const int elv_slice_async_rq = 2;
 int elv_slice_idle = HZ / 125;
+int elv_async_slice_idle = 0;
 static struct kmem_cache *elv_ioq_pool;
 
 /* Maximum Window length for updating average disk rate */
@@ -2819,6 +2820,8 @@ SHOW_FUNCTION(elv_slice_async_show, efqd->elv_slice[0], 1);
 EXPORT_SYMBOL(elv_slice_async_show);
 SHOW_FUNCTION(elv_fairness_show, efqd->fairness, 0);
 EXPORT_SYMBOL(elv_fairness_show);
+SHOW_FUNCTION(elv_async_slice_idle_show, efqd->elv_async_slice_idle, 1);
+EXPORT_SYMBOL(elv_async_slice_idle_show);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2845,6 +2848,8 @@ STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
 EXPORT_SYMBOL(elv_slice_async_store);
 STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 2, 0);
 EXPORT_SYMBOL(elv_fairness_store);
+STORE_FUNCTION(elv_async_slice_idle_store, &efqd->elv_async_slice_idle, 0, UINT_MAX, 1);
+EXPORT_SYMBOL(elv_async_slice_idle_store);
 #undef STORE_FUNCTION
 
 void elv_schedule_dispatch(struct request_queue *q)
@@ -3018,7 +3023,7 @@ int elv_init_ioq(struct elevator_queue *eq, struct io_queue *ioq,
 		ioq->pid = current->pid;
 
 	ioq->sched_queue = sched_queue;
-	if (is_sync && !elv_ioq_class_idle(ioq))
+	if (!elv_ioq_class_idle(ioq) && (is_sync || efqd->fairness))
 		elv_mark_ioq_idle_window(ioq);
 	bfq_init_entity(&ioq->entity, iog);
 	ioq->entity.budget = elv_prio_to_slice(efqd, ioq);
@@ -3699,7 +3704,12 @@ static void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
 	/*
 	 * idle is disabled, either manually or by past process history
 	 */
-	if (!efqd->elv_slice_idle || !elv_ioq_idle_window(ioq))
+	if ((elv_ioq_sync(ioq) && !efqd->elv_slice_idle) ||
+			!elv_ioq_idle_window(ioq))
+		return;
+
+	/* If this is async queue and async_slice_idle is disabled, return */
+	if (!elv_ioq_sync(ioq) && !efqd->elv_async_slice_idle)
 		return;
 
 	/*
@@ -3708,7 +3718,10 @@ static void elv_ioq_arm_slice_timer(struct request_queue *q, int wait_for_busy)
 	 */
 	if (wait_for_busy) {
 		elv_mark_ioq_wait_busy(ioq);
-		sl = efqd->elv_slice_idle;
+		if (elv_ioq_sync(ioq))
+			sl = efqd->elv_slice_idle;
+		else
+			sl = efqd->elv_async_slice_idle;
 		mod_timer(&efqd->idle_slice_timer, jiffies + sl);
 		elv_log_ioq(efqd, ioq, "arm idle: %lu wait busy=1", sl);
 		return;
@@ -3882,6 +3895,18 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
 		goto keep_queue;
 	}
 
+	/*
+	 * If this is an async queue which has time slice left but not
+	 * requests. Wait busy is also not on (may be because when last
+	 * request completed, ioq was not empty). Wait for the request
+	 * completion. May be completion will turn wait busy on.
+	 */
+	if (efqd->fairness && efqd->elv_async_slice_idle && !elv_ioq_sync(ioq)
+	    && elv_ioq_nr_dispatched(ioq)) {
+		ioq = NULL;
+		goto keep_queue;
+	}
+
 	slice_expired = 0;
 expire:
 	if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched
@@ -4076,6 +4101,13 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
 			goto done;
 		}
 
+		/* For async queue try to do wait busy */
+		if (efqd->fairness && !elv_ioq_sync(ioq) && !ioq->nr_queued
+		    && (elv_iog_nr_active(iog) <= 1)) {
+			elv_ioq_arm_slice_timer(q, 1);
+			goto done;
+		}
+
 		/*
 		 * If there are no requests waiting in this queue, and
 		 * there are other queues ready to issue requests, AND
@@ -4215,6 +4247,7 @@ int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e)
 	efqd->elv_slice[0] = elv_slice_async;
 	efqd->elv_slice[1] = elv_slice_sync;
 	efqd->elv_slice_idle = elv_slice_idle;
+	efqd->elv_async_slice_idle = elv_async_slice_idle;
 	efqd->hw_tag = 1;
 
 	/* For the time being keep fairness enabled by default */
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index 19ac8ca..f089a55 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -362,6 +362,8 @@ struct elv_fq_data {
 	 * users of this functionality.
 	 */
 	unsigned int elv_slice_idle;
+	/* idle slice for async queue */
+	unsigned int elv_async_slice_idle;
 	struct timer_list idle_slice_timer;
 	struct work_struct unplug_work;
 
@@ -647,6 +649,9 @@ extern ssize_t elv_slice_async_store(struct elevator_queue *q, const char *name,
 extern ssize_t elv_fairness_show(struct elevator_queue *q, char *name);
 extern ssize_t elv_fairness_store(struct elevator_queue *q, const char *name,
 						size_t count);
+extern ssize_t elv_async_slice_idle_show(struct elevator_queue *q, char *name);
+extern ssize_t elv_async_slice_idle_store(struct elevator_queue *q,
+					const char *name, size_t count);
 
 /* Functions used by elevator.c */
 extern int elv_init_fq_data(struct request_queue *q, struct elevator_queue *e);
-- 
1.6.0.6
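
For readers following the hunk in elv_ioq_completed_request() above, the
async wait-busy condition can be modeled in user space as a plain predicate.
This is only a sketch: struct wait_busy_state and async_should_wait_busy()
are hypothetical names, and the field types are assumptions. Only the
condition itself is taken from the diff.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state the hunk consults; not a kernel type. */
struct wait_busy_state {
	unsigned int fairness;		/* stands in for efqd->fairness */
	bool sync;			/* stands in for elv_ioq_sync(ioq) */
	unsigned int nr_queued;		/* stands in for ioq->nr_queued */
	unsigned int group_nr_active;	/* stands in for elv_iog_nr_active(iog) */
};

/*
 * True when the completion path would arm the idle timer for an async
 * queue instead of expiring it: fairness is enabled, the queue is async,
 * it has nothing queued, and at most one queue is active in its group.
 */
static bool async_should_wait_busy(const struct wait_busy_state *s)
{
	return s->fairness && !s->sync && s->nr_queued == 0 &&
	       s->group_nr_active <= 1;
}
```

In the patch itself, a true result corresponds to calling
elv_ioq_arm_slice_timer(q, 1) and jumping to done.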

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* Re: [PATCH 13/25] io-controller: Wait for requests to complete from last queue before new queue is scheduled
       [not found]   ` <1246564917-19603-14-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-02 20:09     ` Nauman Rafique
  0 siblings, 0 replies; 191+ messages in thread
From: Nauman Rafique @ 2009-07-02 20:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Thu, Jul 2, 2009 at 1:01 PM, Vivek Goyal<vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> o Currently one can dispatch requests from multiple queues to the disk. This
>  is true for hardware which supports queuing. So if a disk supports a queue
>  depth of 31, it is possible that 20 requests are dispatched from queue 1
>  and then the next queue is scheduled in, which dispatches more requests.
>
> o This multiple queue dispatch introduces issues for accurate accounting of
>  disk time consumed by a particular queue. For example, if one async queue
>  is scheduled in, it can dispatch 31 requests to the disk and then it will
>  be expired and a new sync queue might get scheduled in. These 31 requests
>  might take a long time to finish but this time is never accounted to the
>  async queue which dispatched these requests.
>
> o This patch introduces the functionality where we wait for all the requests
>  to finish from previous queue before next queue is scheduled in. That way
>  a queue is more accurately accounted for disk time it has consumed. Note
>  this still does not take care of errors introduced by disk write caching.
>
> o Because above behavior can result in reduced throughput, this behavior will
>  be enabled only if user sets "fairness" tunable to 2 or higher.

Vivek,
Did you collect any numbers for the impact on throughput from this
patch? It seems like with this change, we can even support NCQ.

>
> o This patch helps in achieving more isolation between reads and buffered
>  writes in different cgroups. Buffered writes typically utilize the full queue
>  depth and then expire the queue. On the contrary, sequential reads
>  typically drive a queue depth of 1. So despite the fact that writes are
>  using more disk time it is never accounted to write queue because we don't
>  wait for requests to finish after dispatching these. This patch helps
>  do more accurate accounting of disk time, especially for buffered writes
>  hence providing better fairness hence better isolation between two cgroups
>  running read and write workloads.
>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  block/elevator-fq.c |   31 ++++++++++++++++++++++++++++++-
>  1 files changed, 30 insertions(+), 1 deletions(-)
>
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 68be1dc..7609579 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -2038,7 +2038,7 @@ STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
>  EXPORT_SYMBOL(elv_slice_sync_store);
>  STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
>  EXPORT_SYMBOL(elv_slice_async_store);
> -STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
> +STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 2, 0);
>  EXPORT_SYMBOL(elv_fairness_store);
>  #undef STORE_FUNCTION
>
> @@ -2952,6 +2952,24 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
>        }
>
>  expire:
> +       if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched) {
> +               /*
> +                * If there are request dispatched from this queue, don't
> +                * dispatch requests from new queue till all the requests from
> +                * this queue have completed.
> +                *
> +                * This helps in attributing right amount of disk time consumed
> +                * by a particular queue when hardware allows queuing.
> +                *
> +                * Set ioq = NULL so that no more requests are dispatched from
> +                * this queue.
> +                */
> +               elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
> +                               " disp=%lu", ioq->dispatched);
> +               ioq = NULL;
> +               goto keep_queue;
> +       }
> +
>        elv_ioq_slice_expired(q);
>  new_queue:
>        ioq = elv_set_active_ioq(q, new_ioq);
> @@ -3109,6 +3127,17 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
>                                 */
>                                elv_ioq_arm_slice_timer(q, 1);
>                        } else {
> +                               /* If fairness >=2 and there are requests
> +                                * dispatched from this queue, don't dispatch
> +                                * new requests from a different queue till
> +                                * all requests from this queue have finished.
> +                                * This helps in attributing right disk time
> +                                * to a queue when hardware supports queuing.
> +                                */
> +
> +                               if (efqd->fairness >= 2 && ioq->dispatched)
> +                                       goto done;
> +
>                                /* Expire the queue */
>                                elv_ioq_slice_expired(q);
>                        }
> --
> 1.6.0.6
>
>
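
For illustration, the check that the quoted patch adds at the expire: label
of elv_fq_select_ioq() can be modeled outside the kernel as below. This is a
minimal sketch, not the scheduler code: struct model_ioq, struct model_efqd
and must_wait_for_drain() are hypothetical names, and only the condition
(fairness >= 2, not forced, an active queue with requests still dispatched)
comes from the diff.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the patch's structures. */
struct model_ioq {
	unsigned long dispatched;	/* sent to the disk, not yet completed */
};

struct model_efqd {
	unsigned int fairness;		/* the "fairness" tunable, 0..2 */
};

/*
 * Returns true when switching to a new queue should be deferred because
 * the active queue still has requests in flight. While this holds, the
 * patch keeps the current queue and dispatches nothing, so the disk time
 * of the in-flight requests is charged to the queue that issued them.
 */
static bool must_wait_for_drain(const struct model_efqd *efqd,
				const struct model_ioq *ioq, bool force)
{
	return efqd->fairness >= 2 && !force && ioq != NULL &&
	       ioq->dispatched > 0;
}
```

With fairness set to 2 and an async queue that has dispatched 31 requests,
the predicate stays true until the last completion brings the dispatched
count to 0; with fairness at 0 or 1 it is never true, which matches the
patch description's point that the throughput cost is opt-in.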

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 13/25] io-controller: Wait for requests to complete from last queue before new queue is scheduled
       [not found]     ` <e98e18940907021309u1f784b3at409b55ba46ed108c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-07-02 20:17       ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:17 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Thu, Jul 02, 2009 at 01:09:14PM -0700, Nauman Rafique wrote:
> On Thu, Jul 2, 2009 at 1:01 PM, Vivek Goyal<vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > o Currently one can dispatch requests from multiple queues to the disk. This
> >  is true for hardware which supports queuing. So if a disk supports a queue
> >  depth of 31, it is possible that 20 requests are dispatched from queue 1
> >  and then the next queue is scheduled in, which dispatches more requests.
> >
> > o This multiple queue dispatch introduces issues for accurate accounting of
> >  disk time consumed by a particular queue. For example, if one async queue
> >  is scheduled in, it can dispatch 31 requests to the disk and then it will
> >  be expired and a new sync queue might get scheduled in. These 31 requests
> >  might take a long time to finish but this time is never accounted to the
> >  async queue which dispatched these requests.
> >
> > o This patch introduces the functionality where we wait for all the requests
> >  to finish from previous queue before next queue is scheduled in. That way
> >  a queue is more accurately accounted for disk time it has consumed. Note
> >  this still does not take care of errors introduced by disk write caching.
> >
> > o Because above behavior can result in reduced throughput, this behavior will
> >  be enabled only if user sets "fairness" tunable to 2 or higher.
> 
> Vivek,
> Did you collect any numbers for the impact on throughput from this
> patch? It seems like with this change, we can even support NCQ.
> 

Hi Nauman,

Not yet. I will try to do some impact analysis of this change and post the
results.

Thanks
Vivek

> >
> > o This patch helps in achieving more isolation between reads and buffered
> >  writes in different cgroups. Buffered writes typically utilize the full queue
> >  depth and then expire the queue. On the contrary, sequential reads
> >  typically drive a queue depth of 1. So despite the fact that writes are
> >  using more disk time it is never accounted to write queue because we don't
> >  wait for requests to finish after dispatching these. This patch helps
> >  do more accurate accounting of disk time, especially for buffered writes
> >  hence providing better fairness hence better isolation between two cgroups
> >  running read and write workloads.
> >
> > Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  block/elevator-fq.c |   31 ++++++++++++++++++++++++++++++-
> >  1 files changed, 30 insertions(+), 1 deletions(-)
> >
> > diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> > index 68be1dc..7609579 100644
> > --- a/block/elevator-fq.c
> > +++ b/block/elevator-fq.c
> > @@ -2038,7 +2038,7 @@ STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
> >  EXPORT_SYMBOL(elv_slice_sync_store);
> >  STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
> >  EXPORT_SYMBOL(elv_slice_async_store);
> > -STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
> > +STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 2, 0);
> >  EXPORT_SYMBOL(elv_fairness_store);
> >  #undef STORE_FUNCTION
> >
> > @@ -2952,6 +2952,24 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
> >        }
> >
> >  expire:
> > +       if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched) {
> > +               /*
> > +                * If there are request dispatched from this queue, don't
> > +                * dispatch requests from new queue till all the requests from
> > +                * this queue have completed.
> > +                *
> > +                * This helps in attributing right amount of disk time consumed
> > +                * by a particular queue when hardware allows queuing.
> > +                *
> > +                * Set ioq = NULL so that no more requests are dispatched from
> > +                * this queue.
> > +                */
> > +               elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
> > +                               " disp=%lu", ioq->dispatched);
> > +               ioq = NULL;
> > +               goto keep_queue;
> > +       }
> > +
> >        elv_ioq_slice_expired(q);
> >  new_queue:
> >        ioq = elv_set_active_ioq(q, new_ioq);
> > @@ -3109,6 +3127,17 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> >                                 */
> >                                elv_ioq_arm_slice_timer(q, 1);
> >                        } else {
> > +                               /* If fairness >=2 and there are requests
> > +                                * dispatched from this queue, don't dispatch
> > +                                * new requests from a different queue till
> > +                                * all requests from this queue have finished.
> > +                                * This helps in attributing right disk time
> > +                                * to a queue when hardware supports queuing.
> > +                                */
> > +
> > +                               if (efqd->fairness >= 2 && ioq->dispatched)
> > +                                       goto done;
> > +
> >                                /* Expire the queue */
> >                                elv_ioq_slice_expired(q);
> >                        }
> > --
> > 1.6.0.6
> >
> >

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 13/25] io-controller: Wait for requests to complete from last queue before new queue is scheduled
@ 2009-07-02 20:17       ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-02 20:17 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: dhaval, snitzer, peterz, dm-devel, dpshah, jens.axboe, agk,
	balbir, paolo.valente, guijianfeng, fernando, mikew, jmoyer,
	m-ikeda, lizf, fchecconi, akpm, containers, linux-kernel,
	s-uchida, righi.andrea, jbaron

On Thu, Jul 02, 2009 at 01:09:14PM -0700, Nauman Rafique wrote:
> On Thu, Jul 2, 2009 at 1:01 PM, Vivek Goyal<vgoyal@redhat.com> wrote:
> > o Currently one can dispatch requests from multiple queues to the disk. This
> >  is true for hardware which supports queuing. So if a disk supports a queue
> >  depth of 31, it is possible that 20 requests are dispatched from queue 1
> >  and then the next queue is scheduled in, which dispatches more requests.
> >
> > o This multiple queue dispatch introduces issues for accurate accounting of
> >  disk time consumed by a particular queue. For example, if one async queue
> >  is scheduled in, it can dispatch 31 requests to the disk and then it will
> >  be expired and a new sync queue might get scheduled in. These 31 requests
> >  might take a long time to finish but this time is never accounted to the
> >  async queue which dispatched these requests.
> >
> > o This patch introduces the functionality where we wait for all the requests
> >  to finish from previous queue before next queue is scheduled in. That way
> >  a queue is more accurately accounted for disk time it has consumed. Note
> >  this still does not take care of errors introduced by disk write caching.
> >
> > o Because above behavior can result in reduced throughput, this behavior will
> >  be enabled only if user sets "fairness" tunable to 2 or higher.
> 
> Vivek,
> Did you collect any numbers for the impact on throughput from this
> patch? It seems like with this change, we can even support NCQ.
> 

Hi Nauman,

Not yet. I will try to do some impact analysis of this change and post the
results.

Thanks
Vivek

> >
> > o This patch helps in achieving more isolation between reads and buffered
> >  writes in different cgroups. Buffered writes typically utilize the full queue
> >  depth and then expire the queue. On the contrary, sequential reads
> >  typically drive a queue depth of 1. So despite the fact that writes are
> >  using more disk time it is never accounted to write queue because we don't
> >  wait for requests to finish after dispatching these. This patch helps
> >  do more accurate accounting of disk time, especially for buffered writes
> >  hence providing better fairness hence better isolation between two cgroups
> >  running read and write workloads.
> >
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  block/elevator-fq.c |   31 ++++++++++++++++++++++++++++++-
> >  1 files changed, 30 insertions(+), 1 deletions(-)
> >
> > diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> > index 68be1dc..7609579 100644
> > --- a/block/elevator-fq.c
> > +++ b/block/elevator-fq.c
> > @@ -2038,7 +2038,7 @@ STORE_FUNCTION(elv_slice_sync_store, &efqd->elv_slice[1], 1, UINT_MAX, 1);
> >  EXPORT_SYMBOL(elv_slice_sync_store);
> >  STORE_FUNCTION(elv_slice_async_store, &efqd->elv_slice[0], 1, UINT_MAX, 1);
> >  EXPORT_SYMBOL(elv_slice_async_store);
> > -STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 1, 0);
> > +STORE_FUNCTION(elv_fairness_store, &efqd->fairness, 0, 2, 0);
> >  EXPORT_SYMBOL(elv_fairness_store);
> >  #undef STORE_FUNCTION
> >
> > @@ -2952,6 +2952,24 @@ void *elv_fq_select_ioq(struct request_queue *q, int force)
> >        }
> >
> >  expire:
> > +       if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched) {
> > +               /*
> > +                * If there are requests dispatched from this queue, don't
> > +                * dispatch requests from new queue till all the requests from
> > +                * this queue have completed.
> > +                *
> > +                * This helps in attributing right amount of disk time consumed
> > +                * by a particular queue when hardware allows queuing.
> > +                *
> > +                * Set ioq = NULL so that no more requests are dispatched from
> > +                * this queue.
> > +                */
> > +               elv_log_ioq(efqd, ioq, "select: wait for requests to finish"
> > +                               " disp=%lu", ioq->dispatched);
> > +               ioq = NULL;
> > +               goto keep_queue;
> > +       }
> > +
> >        elv_ioq_slice_expired(q);
> >  new_queue:
> >        ioq = elv_set_active_ioq(q, new_ioq);
> > @@ -3109,6 +3127,17 @@ void elv_ioq_completed_request(struct request_queue *q, struct request *rq)
> >                                 */
> >                                elv_ioq_arm_slice_timer(q, 1);
> >                        } else {
> > +                               /* If fairness >=2 and there are requests
> > +                                * dispatched from this queue, don't dispatch
> > +                                * new requests from a different queue till
> > +                                * all requests from this queue have finished.
> > +                                * This helps in attributing right disk time
> > +                                * to a queue when hardware supports queuing.
> > +                                */
> > +
> > +                               if (efqd->fairness >= 2 && ioq->dispatched)
> > +                                       goto done;
> > +
> >                                /* Expire the queue */
> >                                elv_ioq_slice_expired(q);
> >                        }
> > --
> > 1.6.0.6
> >
> >
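
The check the quoted patch adds at the "expire:" label can be modelled
outside the kernel. What follows is a minimal sketch, not the kernel code:
struct model_ioq, model_must_wait() and model_complete_one() are
hypothetical names, and only the fairness / force / dispatched condition is
taken from the diff above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the active io queue; only the in-flight
 * counter from the patch is modelled. */
struct model_ioq {
	unsigned long dispatched;	/* requests sent to the disk, not yet completed */
};

/*
 * Same condition as the check added at "expire:": with fairness >= 2, a
 * non-forced queue switch is deferred while the active queue still has
 * requests in flight. Returns 1 to wait, 0 to expire the queue.
 */
int model_must_wait(unsigned int fairness, int force, const struct model_ioq *ioq)
{
	return fairness >= 2 && !force && ioq != NULL && ioq->dispatched > 0;
}

/* One request from the queue has completed. */
void model_complete_one(struct model_ioq *ioq)
{
	assert(ioq != NULL && ioq->dispatched > 0);
	ioq->dispatched--;
}
```

With fairness set to 2 and two requests in flight, model_must_wait() keeps
returning 1 until both completions have been reported. A forced selection
or a fairness value below 2 skips the wait.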

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 09/25] io-controller: Common hierarchical fair queuing code in elevaotor layer
       [not found]   ` <1246564917-19603-10-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-06  2:46     ` Gui Jianfeng
  0 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-06  2:46 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Vivek Goyal wrote:
...
> +static struct io_group *
> +io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
> +{
> +	struct io_cgroup *iocg;
> +	struct io_group *iog, *leaf = NULL, *prev = NULL;
> +	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
> +
> +	for (; cgroup != NULL; cgroup = cgroup->parent) {
> +		iocg = cgroup_to_io_cgroup(cgroup);
> +
> +		iog = io_cgroup_lookup_group(iocg, key);
> +		if (iog != NULL) {
> +			/*
> +			 * All the cgroups in the path from there to the
> +			 * root must have a io_group for efqd, so we don't
> +			 * need any more allocations.
> +			 */
> +			break;
> +		}
> +
> +		iog = kzalloc_node(sizeof(*iog), flags, q->node);
> +		if (!iog)
> +			goto cleanup;
> +
> +		iog->iocg_id = css_id(&iocg->css);

  Hi Vivek,

  IMHO, the io_cgroup id does nothing more than keep track of the corresponding iocg.
  So why not just store the iocg pointer in io_group and get rid of this complexity?
  I'd like to post a patch to make this change. What's your opinion?

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 09/25] io-controller: Common hierarchical fair queuing code in elevaotor layer
       [not found]     ` <4A51657B.7000008-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-07-06 14:16       ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-06 14:16 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Mon, Jul 06, 2009 at 10:46:19AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > +static struct io_group *
> > +io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
> > +{
> > +	struct io_cgroup *iocg;
> > +	struct io_group *iog, *leaf = NULL, *prev = NULL;
> > +	gfp_t flags = GFP_ATOMIC |  __GFP_ZERO;
> > +
> > +	for (; cgroup != NULL; cgroup = cgroup->parent) {
> > +		iocg = cgroup_to_io_cgroup(cgroup);
> > +
> > +		iog = io_cgroup_lookup_group(iocg, key);
> > +		if (iog != NULL) {
> > +			/*
> > +			 * All the cgroups in the path from there to the
> > +			 * root must have a io_group for efqd, so we don't
> > +			 * need any more allocations.
> > +			 */
> > +			break;
> > +		}
> > +
> > +		iog = kzalloc_node(sizeof(*iog), flags, q->node);
> > +		if (!iog)
> > +			goto cleanup;
> > +
> > +		iog->iocg_id = css_id(&iocg->css);
> 
>   Hi Vivek,
> 
>   IMHO, the io_cgroup id does nothing more than keep track of the corresponding iocg.
>   So why not just store the iocg pointer in io_group and get rid of this complexity?
>   I'd like to post a patch to make this change. What's your opinion?
> 

Hi Gui,

You can try that but I suspect that there is not much to be gained in
terms of number of lines of code or code complexity. Do try it out though
and we can then have a look at the patch.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 191+ messages in thread

* [PATCH] io-controller: Get rid of css id from io cgroup
  2009-07-06 14:16       ` Vivek Goyal
@ 2009-07-07  1:40           ` Gui Jianfeng
  -1 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-07  1:40 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Get rid of the css id from io cgroup since it does nothing
more than keep track of the iocg. An alternative is to cache
the iocg pointer in the io group, which removes the
complexity.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 block/elevator-fq.c |   36 ++++++++++++------------------------
 block/elevator-fq.h |    2 +-
 2 files changed, 13 insertions(+), 25 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7c83d1e..f499b54 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -191,25 +191,19 @@ static inline struct io_group *iog_parent(struct io_group *iog)
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 static void io_group_path(struct io_group *iog, char *buf, int buflen)
 {
-	unsigned short id = iog->iocg_id;
-	struct cgroup_subsys_state *css;
+	struct io_cgroup *iocg;
+	int ret;
 
 	rcu_read_lock();
 
-	if (!id)
+	iocg = iog->iocg;
+	if (!iocg)
 		goto out;
 
-	css = css_lookup(&io_subsys, id);
-	if (!css)
-		goto out;
-
-	if (!css_tryget(css))
+	ret = cgroup_path(iocg->css.cgroup, buf, buflen);
+	if (ret)
 		goto out;
 
-	cgroup_path(css->cgroup, buf, buflen);
-
-	css_put(css);
-
 	rcu_read_unlock();
 	return;
 out:
@@ -1847,7 +1841,6 @@ struct cgroup_subsys io_subsys = {
 	.destroy = iocg_destroy,
 	.populate = iocg_populate,
 	.subsys_id = io_subsys_id,
-	.use_id = 1,
 };
 
 static inline unsigned int iog_weight(struct io_group *iog)
@@ -1890,7 +1883,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
 		if (!iog)
 			goto cleanup;
 
-		iog->iocg_id = css_id(&iocg->css);
+		iog->iocg = iocg;
 
 		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
 		iog->dev = MKDEV(major, minor);
@@ -2201,7 +2194,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
 	spin_lock_irq(&iocg->lock);
 	rcu_assign_pointer(iog->key, key);
 	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
-	iog->iocg_id = css_id(&iocg->css);
+	iog->iocg = iocg;
 	spin_unlock_irq(&iocg->lock);
 
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
@@ -2397,7 +2390,7 @@ remove_entry:
 			  group_node);
 	efqd = rcu_dereference(iog->key);
 	hlist_del_rcu(&iog->group_node);
-	iog->iocg_id = 0;
+	iog->iocg = NULL;
 	spin_unlock_irqrestore(&iocg->lock, flags);
 
 	spin_lock_irqsave(efqd->queue->queue_lock, flags);
@@ -2411,7 +2404,6 @@ done:
 		kfree(pn);
 	}
 
-	free_css_id(&io_subsys, &iocg->css);
 	rcu_read_unlock();
 	BUG_ON(!hlist_empty(&iocg->group_data));
 	kfree(iocg);
@@ -2427,20 +2419,16 @@ static void io_group_check_and_destroy(struct elv_fq_data *efqd,
 {
 	struct io_cgroup *iocg;
 	unsigned long flags;
-	struct cgroup_subsys_state *css;
 
 	rcu_read_lock();
 
-	css = css_lookup(&io_subsys, iog->iocg_id);
-
-	if (!css)
+	iocg = iog->iocg;
+	if (!iocg)
 		goto out;
 
-	iocg = container_of(css, struct io_cgroup, css);
-
 	spin_lock_irqsave(&iocg->lock, flags);
 
-	if (iog->iocg_id) {
+	if (iog->iocg) {
 		hlist_del_rcu(&iog->group_node);
 		__io_destroy_group(efqd, iog);
 	}
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index f089a55..75fee82 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -251,7 +251,7 @@ struct io_group {
 	unsigned int busy_rt_queues;
 
 	int deleting;
-	unsigned short iocg_id;
+	struct io_cgroup *iocg;
 
 	/* The device MKDEV(major, minor), this group has been created for */
 	dev_t	dev;
-- 
1.5.4.rc3 
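
The trade-off in the patch above, an id resolved through a lookup versus a
cached pointer, can be illustrated with a small userspace sketch. All
names here (model_iocg, model_iog_by_id, model_iog_by_ptr, model_lookup)
are hypothetical, and the plain array stands in for whatever css_lookup()
consults. The sketch does not model RCU or the question of how long the
io_cgroup stays valid, which a cached pointer depends on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io_cgroup. */
struct model_iocg {
	unsigned int weight;
};

/* Before the patch: the group stores a small id; 0 means "no cgroup". */
struct model_iog_by_id {
	unsigned short iocg_id;
};

/* After the patch: the group caches the pointer; NULL means "no cgroup". */
struct model_iog_by_ptr {
	struct model_iocg *iocg;
};

/* Resolve an id through a table, as an id-based scheme has to do on
 * every access. Returns NULL for id 0 or an out-of-range id. */
struct model_iocg *model_lookup(struct model_iocg *const table[], size_t n,
				unsigned short id)
{
	assert(table != NULL);
	if (id == 0 || id >= n)
		return NULL;
	return table[id];
}
```

With the pointer form, each lookup-then-check sequence in the diff reduces
to a single NULL test on the cached field, which is where the removed lines
in the diffstat come from.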

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* Re: [PATCH 11/25] io-controller: Export disk time used and nr sectors dipatched through cgroups
       [not found]   ` <1246564917-19603-12-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-08  2:16     ` Gui Jianfeng
  0 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-08  2:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Vivek Goyal wrote:
...
>  
> +static int io_cgroup_disk_time_read(struct cgroup *cgroup,
> +				struct cftype *cftype, struct seq_file *m)
> +{
> +	struct io_cgroup *iocg;
> +	struct io_group *iog;
> +	struct hlist_node *n;
> +
> +	if (!cgroup_lock_live_group(cgroup))
> +		return -ENODEV;
> +
> +	iocg = cgroup_to_io_cgroup(cgroup);
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(iog, n, &iocg->group_data, group_node) {
> +		/*
> +		 * There might be groups which are not functional and
> +		 * waiting to be reclaimed upon cgoup deletion.
> +		 */
> +		if (iog->key) {
> +			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
> +					MINOR(iog->dev),
> +					iog->entity.total_service);

Hi Vivek,

Make the output of the io.disk_* files conform to the format used by io.policy.

Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
---
 block/elevator-fq.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7c83d1e..29392e7 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1631,7 +1631,7 @@ static int io_cgroup_disk_time_read(struct cgroup *cgroup,
 		 * waiting to be reclaimed upon cgoup deletion.
 		 */
 		if (iog->key) {
-			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
 					MINOR(iog->dev),
 					iog->entity.total_service);
 		}
@@ -1661,7 +1661,7 @@ static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
 		 * waiting to be reclaimed upon cgoup deletion.
 		 */
 		if (iog->key) {
-			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
 					MINOR(iog->dev),
 					iog->entity.total_sector_service);
 		}
@@ -1692,7 +1692,7 @@ static int io_cgroup_disk_queue_read(struct cgroup *cgroup,
 		 * waiting to be reclaimed upon cgoup deletion.
 		 */
 		if (iog->key) {
-			seq_printf(m, "%u %u %lu %lu\n", MAJOR(iog->dev),
+			seq_printf(m, "%u:%u %lu %lu\n", MAJOR(iog->dev),
 					MINOR(iog->dev), iog->queue,
 					iog->queue_duration);
 		}
@@ -1722,7 +1722,7 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
 		 * waiting to be reclaimed upon cgoup deletion.
 		 */
 		if (iog->key) {
-			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
 					MINOR(iog->dev), iog->dequeue);
 		}
 	}
-- 
1.5.4.rc3
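
The effect of the format change can be checked in userspace. This is a
minimal sketch that assumes a reader wants to parse one line of the new
"major:minor value" output; model_format_stat() and model_parse_stat() are
hypothetical helpers, and only the "%u:%u %lu\n" format string comes from
the patch above.

```c
#include <assert.h>
#include <stdio.h>

/* Format one statistics line the way the patched seq_printf() calls do.
 * Returns the number of characters snprintf() reports. */
int model_format_stat(char *buf, size_t len, unsigned int major,
		      unsigned int minor, unsigned long value)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%u:%u %lu\n", major, minor, value);
}

/* Parse a line in the same format. Returns 1 on success, 0 otherwise. */
int model_parse_stat(const char *line, unsigned int *major,
		     unsigned int *minor, unsigned long *value)
{
	assert(line != NULL && major != NULL && minor != NULL && value != NULL);
	return sscanf(line, "%u:%u %lu", major, minor, value) == 3;
}
```

A consumer that already reads "major:minor" device keys can then use the
same parsing for the io.disk_* files instead of handling two layouts.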

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* Re: [PATCH 21/25] io-controller: Per cgroup request descriptor support
       [not found]   ` <1246564917-19603-22-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-08  3:27     ` Gui Jianfeng
  2009-07-21  5:37     ` Gui Jianfeng
  1 sibling, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-08  3:27 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Vivek Goyal wrote:
...
>  }
> +#ifdef CONFIG_GROUP_IOSCHED
> +static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
> +{
> +	return queue_var_show(q->nr_group_requests, (page));
> +}
> +
> +static ssize_t
> +queue_group_requests_store(struct request_queue *q, const char *page,
> +					size_t count)
> +{
> +	unsigned long nr;
> +	int ret = queue_var_store(&nr, page, count);
> +	if (nr < BLKDEV_MIN_RQ)
> +		nr = BLKDEV_MIN_RQ;
> +
> +	spin_lock_irq(q->queue_lock);
> +	q->nr_group_requests = nr;
> +	spin_unlock_irq(q->queue_lock);
> +	return ret;
> +}
> +#endif

Hi Vivek,

Do we need to update the congestion thresholds for allocated io groups?

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/blk-sysfs.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 577ed42..92b9f25 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -83,17 +83,32 @@ static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
 	return queue_var_show(q->nr_group_requests, (page));
 }
 
+extern void elv_io_group_congestion_threshold(struct request_queue *q,
+					      struct io_group *iog);
+
 static ssize_t
 queue_group_requests_store(struct request_queue *q, const char *page,
 					size_t count)
 {
+	struct hlist_node *n;
+	struct io_group *iog;
+	struct elv_fq_data *efqd;
 	unsigned long nr;
 	int ret = queue_var_store(&nr, page, count);
+
 	if (nr < BLKDEV_MIN_RQ)
 		nr = BLKDEV_MIN_RQ;
 
 	spin_lock_irq(q->queue_lock);
+
 	q->nr_group_requests = nr;
+
+	efqd = &q->elevator->efqd;
+
+	hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
+		elv_io_group_congestion_threshold(q, iog);
+	}
+
 	spin_unlock_irq(q->queue_lock);
 	return ret;
 }
-- 
1.5.4.rc3 
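
To make the intent of the added loop concrete, here is a minimal userspace sketch in C of the same idea: when the per-group request limit changes, recompute the thresholds of every group that already exists. All names below (struct group, struct queue, set_group_requests(), MIN_RQ) are hypothetical stand-ins, and locking is omitted. The body of elv_io_group_congestion_threshold() is not shown in this thread, so the threshold arithmetic is an assumption modelled on the kernel's per-queue blk_queue_congestion_threshold().

```c
#include <stddef.h>

#define MIN_RQ 4	/* assumption: stands in for BLKDEV_MIN_RQ */

/* Hypothetical per-group state: only the two congestion thresholds. */
struct group {
	unsigned long congestion_on;
	unsigned long congestion_off;
};

/* Hypothetical queue: the shared limit plus an array of existing groups. */
struct queue {
	unsigned long nr_group_requests;
	struct group *groups;
	size_t nr_groups;
};

/* Derive one group's thresholds from the queue-wide per-group limit. */
void group_congestion_threshold(const struct queue *q, struct group *g)
{
	long limit = (long)q->nr_group_requests;
	long on = limit - limit / 8 + 1;
	long off = limit - limit / 8 - limit / 16 - 1;

	if (on > limit)
		on = limit;
	if (off < 1)
		off = 1;

	g->congestion_on = (unsigned long)on;
	g->congestion_off = (unsigned long)off;
}

/* Store a new limit, then refresh every group that already exists. */
void set_group_requests(struct queue *q, unsigned long nr)
{
	size_t i;

	if (nr < MIN_RQ)
		nr = MIN_RQ;

	q->nr_group_requests = nr;

	for (i = 0; i < q->nr_groups; i++)
		group_congestion_threshold(q, &q->groups[i]);
}
```

Without the loop, groups created before the write would presumably keep thresholds derived from the old limit, which appears to be the gap the patch closes.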

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* Re: [RFC] IO scheduler based IO controller V6
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (24 preceding siblings ...)
  2009-07-02 20:01   ` [PATCH 25/25] io-controller: experimental debug patch for async queue wait before expiry Vivek Goyal
@ 2009-07-08  3:56   ` Balbir Singh
  2009-07-10  1:56   ` [PATCH] io-controller: implement per group request allocation limitation Gui Jianfeng
  2009-07-27  2:10   ` [RFC] IO scheduler based IO controller V6 Gui Jianfeng
  27 siblings, 0 replies; 191+ messages in thread
From: Balbir Singh @ 2009-07-08  3:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

* Vivek Goyal <vgoyal@redhat.com> [2009-07-02 16:01:32]:

> 
> Hi All,
> 
> Here is the V6 of the IO controller patches generated on top of 2.6.31-rc1.
> 
> Previous versions of the patches were posted here.
> 
> (V1) http://lkml.org/lkml/2009/3/11/486
> (V2) http://lkml.org/lkml/2009/5/5/275
> (V3) http://lkml.org/lkml/2009/5/26/472
> (V4) http://lkml.org/lkml/2009/6/8/580
> (V5) http://lkml.org/lkml/2009/6/19/279
> 
> This patchset is still a work in progress, but I want to keep posting
> snapshots of my tree at regular intervals to get feedback, hence V6.
>

Hi, Vivek,

I was able to compile and boot a 2.6.31-rc1 kernel with this patchset.
I have a request: could you fold all the patches into one consolidated
patch and make it available somewhere (it makes testing easier), maybe
in a git tree?

I did some quick tests with some I/O benchmarks and found that, in a
simple scenario, the scheduler worked as expected, except that it took
a very long time. I'll investigate further and report back.
 
-- 
	Balbir

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [RFC] IO scheduler based IO controller V6
       [not found]   ` <20090708035621.GB3215-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
@ 2009-07-08 13:41     ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-08 13:41 UTC (permalink / raw)
  To: Balbir Singh
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Wed, Jul 08, 2009 at 09:26:21AM +0530, Balbir Singh wrote:
> * Vivek Goyal <vgoyal@redhat.com> [2009-07-02 16:01:32]:
> 
> > 
> > Hi All,
> > 
> > Here is the V6 of the IO controller patches generated on top of 2.6.31-rc1.
> > 
> > Previous versions of the patches were posted here.
> > 
> > (V1) http://lkml.org/lkml/2009/3/11/486
> > (V2) http://lkml.org/lkml/2009/5/5/275
> > (V3) http://lkml.org/lkml/2009/5/26/472
> > (V4) http://lkml.org/lkml/2009/6/8/580
> > (V5) http://lkml.org/lkml/2009/6/19/279
> > 
> > This patchset is still a work in progress, but I want to keep posting
> > snapshots of my tree at regular intervals to get feedback, hence V6.
> >
> 
> Hi, Vivek,
> 
> I was able to compile and boot a 2.6.31-rc1 kernel with this patchset.
> I have a request: could you fold all the patches into one consolidated
> patch and make it available somewhere (it makes testing easier), maybe
> in a git tree?
> 

Thanks for trying it out, Balbir. OK, for ease of patching and testing, I
will also maintain a consolidated patch. For V6, you can download the patch
from here.

http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v6.patch

> I did some quick tests with some I/O benchmarks and found that, in a
> simple scenario, the scheduler worked as expected, except that it took
> a very long time. I'll investigate further and report back.

Thanks. I will wait for details.

Vivek

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 21/25] io-controller: Per cgroup request descriptor support
       [not found]     ` <4A54121D.5090008-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-07-08 13:57       ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-08 13:57 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Wed, Jul 08, 2009 at 11:27:25AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> >  }
> > +#ifdef CONFIG_GROUP_IOSCHED
> > +static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
> > +{
> > +	return queue_var_show(q->nr_group_requests, (page));
> > +}
> > +
> > +static ssize_t
> > +queue_group_requests_store(struct request_queue *q, const char *page,
> > +					size_t count)
> > +{
> > +	unsigned long nr;
> > +	int ret = queue_var_store(&nr, page, count);
> > +	if (nr < BLKDEV_MIN_RQ)
> > +		nr = BLKDEV_MIN_RQ;
> > +
> > +	spin_lock_irq(q->queue_lock);
> > +	q->nr_group_requests = nr;
> > +	spin_unlock_irq(q->queue_lock);
> > +	return ret;
> > +}
> > +#endif
> 
> Hi Vivek,
> 
> Do we need to update the congestion thresholds for allocated io groups?
> 

Good catch, Gui. Thanks. I will test the patch and queue it up for the next posting.

Vivek

> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/blk-sysfs.c |   15 +++++++++++++++
>  1 files changed, 15 insertions(+), 0 deletions(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 577ed42..92b9f25 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -83,17 +83,32 @@ static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
>  	return queue_var_show(q->nr_group_requests, (page));
>  }
>  
> +extern void elv_io_group_congestion_threshold(struct request_queue *q,
> +					      struct io_group *iog);
> +
>  static ssize_t
>  queue_group_requests_store(struct request_queue *q, const char *page,
>  					size_t count)
>  {
> +	struct hlist_node *n;
> +	struct io_group *iog;
> +	struct elv_fq_data *efqd;
>  	unsigned long nr;
>  	int ret = queue_var_store(&nr, page, count);
> +
>  	if (nr < BLKDEV_MIN_RQ)
>  		nr = BLKDEV_MIN_RQ;
>  
>  	spin_lock_irq(q->queue_lock);
> +
>  	q->nr_group_requests = nr;
> +
> +	efqd = &q->elevator->efqd;
> +
> +	hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
> +		elv_io_group_congestion_threshold(q, iog);
> +	}
> +
>  	spin_unlock_irq(q->queue_lock);
>  	return ret;
>  }
> -- 
> 1.5.4.rc3 

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 11/25] io-controller: Export disk time used and nr sectors dipatched through cgroups
       [not found]     ` <4A54018C.5090804-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-07-08 14:00       ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-08 14:00 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Wed, Jul 08, 2009 at 10:16:44AM +0800, Gui Jianfeng wrote:
[..]
> 
> Hi Vivek,
> 
> Make the output of the io.disk_* files conform to the io.policy format (device printed as major:minor).
> 

Sure. Queued up for the next posting.

Thanks
Vivek

> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/elevator-fq.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 7c83d1e..29392e7 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1631,7 +1631,7 @@ static int io_cgroup_disk_time_read(struct cgroup *cgroup,
>  		 * waiting to be reclaimed upon cgoup deletion.
>  		 */
>  		if (iog->key) {
> -			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
> +			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
>  					MINOR(iog->dev),
>  					iog->entity.total_service);
>  		}
> @@ -1661,7 +1661,7 @@ static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
>  		 * waiting to be reclaimed upon cgoup deletion.
>  		 */
>  		if (iog->key) {
> -			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
> +			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
>  					MINOR(iog->dev),
>  					iog->entity.total_sector_service);
>  		}
> @@ -1692,7 +1692,7 @@ static int io_cgroup_disk_queue_read(struct cgroup *cgroup,
>  		 * waiting to be reclaimed upon cgoup deletion.
>  		 */
>  		if (iog->key) {
> -			seq_printf(m, "%u %u %lu %lu\n", MAJOR(iog->dev),
> +			seq_printf(m, "%u:%u %lu %lu\n", MAJOR(iog->dev),
>  					MINOR(iog->dev), iog->queue,
>  					iog->queue_duration);
>  		}
> @@ -1722,7 +1722,7 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
>  		 * waiting to be reclaimed upon cgoup deletion.
>  		 */
>  		if (iog->key) {
> -			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
> +			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
>  					MINOR(iog->dev), iog->dequeue);
>  		}
>  	}
> -- 
> 1.5.4.rc3
> 

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 11/25] io-controller: Export disk time used and nr sectors dipatched through cgroups
  2009-07-08  2:16     ` Gui Jianfeng
@ 2009-07-08 14:00       ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-08 14:00 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, jbaron, agk,
	snitzer, akpm, peterz

On Wed, Jul 08, 2009 at 10:16:44AM +0800, Gui Jianfeng wrote:
[..]
> 
> Hi Vivek,
> 
> Let io.disk_*'s outputs conform with io.policy's.
> 

Sure. Queued up for next posting.

Thanks
Vivek

> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/elevator-fq.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 7c83d1e..29392e7 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -1631,7 +1631,7 @@ static int io_cgroup_disk_time_read(struct cgroup *cgroup,
>  		 * waiting to be reclaimed upon cgoup deletion.
>  		 */
>  		if (iog->key) {
> -			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
> +			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
>  					MINOR(iog->dev),
>  					iog->entity.total_service);
>  		}
> @@ -1661,7 +1661,7 @@ static int io_cgroup_disk_sectors_read(struct cgroup *cgroup,
>  		 * waiting to be reclaimed upon cgoup deletion.
>  		 */
>  		if (iog->key) {
> -			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
> +			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
>  					MINOR(iog->dev),
>  					iog->entity.total_sector_service);
>  		}
> @@ -1692,7 +1692,7 @@ static int io_cgroup_disk_queue_read(struct cgroup *cgroup,
>  		 * waiting to be reclaimed upon cgoup deletion.
>  		 */
>  		if (iog->key) {
> -			seq_printf(m, "%u %u %lu %lu\n", MAJOR(iog->dev),
> +			seq_printf(m, "%u:%u %lu %lu\n", MAJOR(iog->dev),
>  					MINOR(iog->dev), iog->queue,
>  					iog->queue_duration);
>  		}
> @@ -1722,7 +1722,7 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
>  		 * waiting to be reclaimed upon cgoup deletion.
>  		 */
>  		if (iog->key) {
> -			seq_printf(m, "%u %u %lu\n", MAJOR(iog->dev),
> +			seq_printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
>  					MINOR(iog->dev), iog->dequeue);
>  		}
>  	}
> -- 
> 1.5.4.rc3
> 
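
The format change quoted above makes the io.disk_* files report the
device as "major:minor" followed by the value, matching io.policy. A
minimal sketch of how a user-space reader might parse one such line is
shown here. The struct and function names are hypothetical and not part
of the patchset, and the sketch covers only the single-value files
(io.disk_queue prints two values per device).

```c
#include <stdio.h>

/* Hypothetical record for one line of an io.disk_* file. */
struct disk_stat {
	unsigned int major;
	unsigned int minor;
	unsigned long value;
};

/*
 * Parse one "major:minor value" line. Returns 0 on success, -1 if the
 * line does not match the new format (for example the old
 * "major minor value" layout, which has no colon).
 */
static int parse_disk_stat(const char *line, struct disk_stat *st)
{
	if (sscanf(line, "%u:%u %lu", &st->major, &st->minor, &st->value) != 3)
		return -1;
	return 0;
}
```

A reader written this way rejects lines in the old space-separated
layout instead of silently misreading them.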

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: Get rid of css id from io cgroup
  2009-07-07  1:40           ` Gui Jianfeng
@ 2009-07-08 14:04             ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-08 14:04 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, jbaron, agk,
	snitzer, akpm, peterz

On Tue, Jul 07, 2009 at 09:40:14AM +0800, Gui Jianfeng wrote:
> Get rid of css id from io cgroup since it's nothing
> more than keeping track of iocg. An alternative is
> caching iocg pointer in io group, just remove the
> complexity.
> 

Gui, one advantage of using css_id is that we store only 2 bytes of id
instead of 8 bytes of iocg* pointer (on 64bit), a saving of 6 bytes per
group. Maybe it is not a bad idea to keep using css id, because we don't
seem to gain much by getting rid of it anyway.

So for the time being I tend to think that we should continue using css id.

Thanks
Vivek
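
The size argument above can be checked with a small user-space sketch.
The two structs are simplified stand-ins for the affected part of
struct io_group, not its real layout, and "unsigned int" stands in for
dev_t. The per-field difference is 6 bytes where pointers are 8 bytes
wide, but the difference the allocator actually sees also depends on
alignment padding around the field, so it can be larger or smaller in
the real structure.

```c
#include <stddef.h>

/* Simplified stand-ins for part of struct io_group; not the real layout. */
struct iog_with_id {
	int deleting;
	unsigned short iocg_id;	/* css id */
	unsigned int dev;	/* stand-in for dev_t */
};

struct iog_with_ptr {
	int deleting;
	void *iocg;		/* stand-in for struct io_cgroup * */
	unsigned int dev;	/* stand-in for dev_t */
};

/* Raw per-field difference between a pointer and a css id. */
static size_t field_saving(void)
{
	return sizeof(void *) - sizeof(unsigned short);
}

/* Difference between the two structs once padding is included. */
static size_t struct_saving(void)
{
	return sizeof(struct iog_with_ptr) - sizeof(struct iog_with_id);
}
```

Printing both values on the target architecture shows whether padding
absorbs or amplifies the 6 bytes for a given field ordering.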

> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/elevator-fq.c |   36 ++++++++++++------------------------
>  block/elevator-fq.h |    2 +-
>  2 files changed, 13 insertions(+), 25 deletions(-)
> 
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 7c83d1e..f499b54 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -191,25 +191,19 @@ static inline struct io_group *iog_parent(struct io_group *iog)
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>  static void io_group_path(struct io_group *iog, char *buf, int buflen)
>  {
> -	unsigned short id = iog->iocg_id;
> -	struct cgroup_subsys_state *css;
> +	struct io_cgroup *iocg;
> +	int ret;
>  
>  	rcu_read_lock();
>  
> -	if (!id)
> +	iocg = iog->iocg;
> +	if (!iocg)
>  		goto out;
>  
> -	css = css_lookup(&io_subsys, id);
> -	if (!css)
> -		goto out;
> -
> -	if (!css_tryget(css))
> +	ret = cgroup_path(iocg->css.cgroup, buf, buflen);
> +	if (ret)
>  		goto out;
>  
> -	cgroup_path(css->cgroup, buf, buflen);
> -
> -	css_put(css);
> -
>  	rcu_read_unlock();
>  	return;
>  out:
> @@ -1847,7 +1841,6 @@ struct cgroup_subsys io_subsys = {
>  	.destroy = iocg_destroy,
>  	.populate = iocg_populate,
>  	.subsys_id = io_subsys_id,
> -	.use_id = 1,
>  };
>  
>  static inline unsigned int iog_weight(struct io_group *iog)
> @@ -1890,7 +1883,7 @@ io_group_chain_alloc(struct request_queue *q, void *key, struct cgroup *cgroup)
>  		if (!iog)
>  			goto cleanup;
>  
> -		iog->iocg_id = css_id(&iocg->css);
> +		iog->iocg = iocg;
>  
>  		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>  		iog->dev = MKDEV(major, minor);
> @@ -2201,7 +2194,7 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
>  	spin_lock_irq(&iocg->lock);
>  	rcu_assign_pointer(iog->key, key);
>  	hlist_add_head_rcu(&iog->group_node, &iocg->group_data);
> -	iog->iocg_id = css_id(&iocg->css);
> +	iog->iocg = iocg;
>  	spin_unlock_irq(&iocg->lock);
>  
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
> @@ -2397,7 +2390,7 @@ remove_entry:
>  			  group_node);
>  	efqd = rcu_dereference(iog->key);
>  	hlist_del_rcu(&iog->group_node);
> -	iog->iocg_id = 0;
> +	iog->iocg = NULL;
>  	spin_unlock_irqrestore(&iocg->lock, flags);
>  
>  	spin_lock_irqsave(efqd->queue->queue_lock, flags);
> @@ -2411,7 +2404,6 @@ done:
>  		kfree(pn);
>  	}
>  
> -	free_css_id(&io_subsys, &iocg->css);
>  	rcu_read_unlock();
>  	BUG_ON(!hlist_empty(&iocg->group_data));
>  	kfree(iocg);
> @@ -2427,20 +2419,16 @@ static void io_group_check_and_destroy(struct elv_fq_data *efqd,
>  {
>  	struct io_cgroup *iocg;
>  	unsigned long flags;
> -	struct cgroup_subsys_state *css;
>  
>  	rcu_read_lock();
>  
> -	css = css_lookup(&io_subsys, iog->iocg_id);
> -
> -	if (!css)
> +	iocg = iog->iocg;
> +	if (!iocg)
>  		goto out;
>  
> -	iocg = container_of(css, struct io_cgroup, css);
> -
>  	spin_lock_irqsave(&iocg->lock, flags);
>  
> -	if (iog->iocg_id) {
> +	if (iog->iocg) {
>  		hlist_del_rcu(&iog->group_node);
>  		__io_destroy_group(efqd, iog);
>  	}
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index f089a55..75fee82 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -251,7 +251,7 @@ struct io_group {
>  	unsigned int busy_rt_queues;
>  
>  	int deleting;
> -	unsigned short iocg_id;
> +	struct io_cgroup *iocg;
>  
>  	/* The device MKDEV(major, minor), this group has been created for */
>  	dev_t	dev;
> -- 
> 1.5.4.rc3 
> 

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [RFC] IO scheduler based IO controller V6
  2009-07-08 13:41     ` Vivek Goyal
@ 2009-07-08 14:39       ` Balbir Singh
  -1 siblings, 0 replies; 191+ messages in thread
From: Balbir Singh @ 2009-07-08 14:39 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, snitzer, dm-devel, jens.axboe, agk, paolo.valente,
	fernando, jmoyer, fchecconi, akpm, containers, linux-kernel,
	righi.andrea

* Vivek Goyal <vgoyal@redhat.com> [2009-07-08 09:41:14]:

> On Wed, Jul 08, 2009 at 09:26:21AM +0530, Balbir Singh wrote:
> > * Vivek Goyal <vgoyal@redhat.com> [2009-07-02 16:01:32]:
> > 
> > > 
> > > Hi All,
> > > 
> > > Here is the V6 of the IO controller patches generated on top of 2.6.31-rc1.
> > > 
> > > Previous versions of the patches was posted here.
> > > 
> > > (V1) http://lkml.org/lkml/2009/3/11/486
> > > (V2) http://lkml.org/lkml/2009/5/5/275
> > > (V3) http://lkml.org/lkml/2009/5/26/472
> > > (V4) http://lkml.org/lkml/2009/6/8/580
> > > (V5) http://lkml.org/lkml/2009/6/19/279
> > > 
> > > This patchset is still work in progress but I want to keep on getting the
> > > snapshot of my tree out at regular intervals to get the feedback hence V6.
> > >
> > 
> > Hi, Vivek,
> > 
> > I was able to compile and boot a 2.6.31-rc1 kernel with this patchset.
> > I have a request could you fold up all patches and make one
> > consolidated patch available somewhere (makes it easier to test), may
> > be a git tree?
> > 
> 
> Thanks for trying it out balbir. Ok, for ease of patching and testing, I 
> will also maintain a consolidated patch. For V6 you can download the patch
> from here.
> 
> http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v6.patch
>

Thanks, this will definitely help me get more testing done!
 
> > I did some quick tests with some io benchmarks and found in a simple
> > scenario that the scheduler worked as expected, except that it took
> > very long. I'll investigate further and revert back.
> 
> Thanks. I will wait for details.
>

I'll try and send something out by Friday, but for now I am not even
very sure if it is a real problem. I ran iozone on two groups with 500
and 1000 as weights on the same partition and set fairness to 1 in
sysfs for the partition. I used a record size of 4 (default) and tried
to run it on a file size of 1G.
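
For reference, the share of disk time one would expect for the two
groups above can be worked out from the weights alone, assuming both
groups keep the disk continuously busy and that disk time is divided in
proportion to weight. The helper below is a hypothetical illustration,
not code from the patchset.

```c
#include <stddef.h>

/*
 * Expected fraction of disk time for group idx among n groups, under
 * the assumption that all of them are continuously backlogged and the
 * disk is shared in proportion to weight. Returns 0.0 for bad input.
 */
static double expected_share(const unsigned int *weights, size_t n, size_t idx)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += weights[i];
	if (total == 0 || idx >= n)
		return 0.0;
	return (double)weights[idx] / (double)total;
}
```

With weights 500 and 1000 this gives one third and two thirds of the
disk time respectively.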

BTW, I don't see anything about weights being multiple of an expected
figure documented anywhere. I tried weights of 1024 (similar to the
scheduler) and got shouted back at :). Does the documentation patch
specify the expected range for weights?

-- 
	Balbir

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [RFC] IO scheduler based IO controller V6
  2009-07-08 14:39       ` Balbir Singh
@ 2009-07-09  1:58         ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-09  1:58 UTC (permalink / raw)
  To: Balbir Singh
  Cc: dhaval, snitzer, dm-devel, jens.axboe, agk, paolo.valente,
	fernando, jmoyer, fchecconi, akpm, containers, linux-kernel,
	righi.andrea

On Wed, Jul 08, 2009 at 08:09:25PM +0530, Balbir Singh wrote:
> * Vivek Goyal <vgoyal@redhat.com> [2009-07-08 09:41:14]:
> 
> > On Wed, Jul 08, 2009 at 09:26:21AM +0530, Balbir Singh wrote:
> > > * Vivek Goyal <vgoyal@redhat.com> [2009-07-02 16:01:32]:
> > > 
> > > > 
> > > > Hi All,
> > > > 
> > > > Here is the V6 of the IO controller patches generated on top of 2.6.31-rc1.
> > > > 
> > > > Previous versions of the patches was posted here.
> > > > 
> > > > (V1) http://lkml.org/lkml/2009/3/11/486
> > > > (V2) http://lkml.org/lkml/2009/5/5/275
> > > > (V3) http://lkml.org/lkml/2009/5/26/472
> > > > (V4) http://lkml.org/lkml/2009/6/8/580
> > > > (V5) http://lkml.org/lkml/2009/6/19/279
> > > > 
> > > > This patchset is still work in progress but I want to keep on getting the
> > > > snapshot of my tree out at regular intervals to get the feedback hence V6.
> > > >
> > > 
> > > Hi, Vivek,
> > > 
> > > I was able to compile and boot a 2.6.31-rc1 kernel with this patchset.
> > > I have a request could you fold up all patches and make one
> > > consolidated patch available somewhere (makes it easier to test), may
> > > be a git tree?
> > > 
> > 
> > Thanks for trying it out balbir. Ok, for ease of patching and testing, I 
> > will also maintain a consolidated patch. For V6 you can download the patch
> > from here.
> > 
> > http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v6.patch
> >
> 
> Thanks, this will definitely help me get more testing done!
>  
> > > I did some quick tests with some io benchmarks and found in a simple
> > > scenario that the scheduler worked as expected, except that it took
> > > very long. I'll investigate further and revert back.
> > 
> > Thanks. I will wait for details.
> >
> 
> I'll try and send something out by Friday, but for now I am not even
> very sure if it is a real problem. I ran iozone on two groups with 500
> and 1000 as weights on the same partition and set fairness to 1 in
> sysfs for the partition. I used a record size of 4 (default) and tried
> to run it on a file size of 1G.
> 

Hi Balbir,

Trying iozone might be a good idea for analyzing the performance impact
of the io controller patches, but it might not be the best way to test
fairness.

The biggest reason is that the IO controller provides fairness only if
there is constant contention between the groups. If one group goes away
for some time, the other gets full use of the disk. While running the
above benchmark, there are numerous occasions where the disk is not
contended for, so we don't see fairness numbers in user space.

I would recommend starting with fio or small tests that can create
continuously backlogged queues at the disk, to see how accurate the
io-controller is.
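
To make the contention point concrete, here is a minimal user-space sketch
(not code from the patchset). It assumes that, among continuously backlogged
groups, the expected share of disk time is a group's weight divided by the
sum of the weights of all backlogged groups. The function name
expected_share_pct and the integer-percent rounding are assumptions made for
illustration only.

```c
/*
 * Expected share of disk time, in whole percent, for group 'idx'.
 * Only groups that are currently backlogged compete for the disk;
 * a group that has gone idle gets nothing and frees its share.
 * Hypothetical helper, not part of the io-controller patches.
 */
unsigned int expected_share_pct(const unsigned int *weights,
				const int *backlogged,
				int ngroups, int idx)
{
	unsigned long total = 0;
	int i;

	if (!backlogged[idx])
		return 0;

	for (i = 0; i < ngroups; i++)
		if (backlogged[i])
			total += weights[i];

	if (!total)
		return 0;

	/* Integer division: the result is rounded down. */
	return (unsigned int)(100UL * weights[idx] / total);
}
```

With weights of 500 and 1000 and both groups backlogged, this gives roughly
a 1:2 split. As soon as the first group stops issuing IO, the second group's
share becomes 100%, which is why a benchmark with idle phases will not show
the configured ratio.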

> BTW, I don't see anything documented about weights needing to be a
> multiple of an expected figure. I tried weights of 1024 (similar to the
> scheduler) and got shouted back at :). Does the documentation patch
> specify the expected range for weights?
> 

The weight range is 1-1000. I will update the documentation to reflect this.
Thanks for pointing it out.
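
For illustration, a range check along these lines would reject a weight of
1024. This is only a sketch: the macro names IO_WEIGHT_MIN and IO_WEIGHT_MAX
and the function io_weight_check are hypothetical, and returning -EINVAL
(rather than clamping the value) is an assumption, not something taken from
the patches.

```c
#include <errno.h>

/* Assumed names for the documented 1-1000 range. */
#define IO_WEIGHT_MIN	1UL
#define IO_WEIGHT_MAX	1000UL

/*
 * Return 0 if 'weight' lies in the documented range,
 * -EINVAL otherwise. Hypothetical helper for illustration.
 */
int io_weight_check(unsigned long weight)
{
	if (weight < IO_WEIGHT_MIN || weight > IO_WEIGHT_MAX)
		return -EINVAL;
	return 0;
}
```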

Vivek


* [PATCH] io-controller: implement per group request allocation limitation
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (25 preceding siblings ...)
  2009-07-08  3:56   ` [RFC] IO scheduler based IO controller V6 Balbir Singh
@ 2009-07-10  1:56   ` Gui Jianfeng
  2009-07-27  2:10   ` [RFC] IO scheduler based IO controller V6 Gui Jianfeng
  27 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-10  1:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, snitzer, dm-devel, jens.axboe, agk, balbir, paolo.valente,
	fernando, jmoyer, fchecconi, akpm, linux-kernel, righi.andrea,
	containers

Hi Vivek,

This patch exports a cgroup-based, per-group request limit interface
and removes the global one. We can now use this interface to set
different request allocation limits for different groups.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/blk-core.c     |   23 ++++++++++--
 block/blk-settings.c |    1 -
 block/blk-sysfs.c    |   43 -----------------------
 block/elevator-fq.c  |   94 ++++++++++++++++++++++++++++++++++++++++++++++---
 block/elevator-fq.h  |    4 ++
 5 files changed, 111 insertions(+), 54 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 79fe6a9..7010b76 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -722,13 +722,20 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
 static void __freed_request(struct request_queue *q, int sync,
 					struct request_list *rl)
 {
+	struct io_group *iog;
+	unsigned long nr_group_requests;
+
 	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
 	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
 		blk_clear_queue_full(q, sync);
 
-	if (rl->count[sync] + 1 <= q->nr_group_requests) {
+	iog = rl_iog(rl);
+
+	nr_group_requests = get_group_requests(q, iog);
+
+	if (nr_group_requests && rl->count[sync] + 1 <= nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
 	}
@@ -828,6 +835,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
 	int sleep_on_global = 0;
+	struct io_group *iog;
+	unsigned long nr_group_requests;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
@@ -843,7 +852,12 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
 		blk_set_queue_full(q, is_sync);
 
-	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+	iog = rl_iog(rl);
+
+	nr_group_requests = get_group_requests(q, iog);
+
+	if (nr_group_requests &&
+	    rl->count[is_sync]+1 >= nr_group_requests) {
 		ioc = current_io_context(GFP_ATOMIC, q->node);
 		/*
 		 * The queue request descriptor group will fill after this
@@ -852,7 +866,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * This process will be allowed to complete a batch of
 		 * requests, others will be blocked.
 		 */
-		if (rl->count[is_sync] <= q->nr_group_requests)
+		if (rl->count[is_sync] <= nr_group_requests)
 			ioc_set_batching(q, ioc);
 		else {
 			if (may_queue != ELV_MQUEUE_MUST
@@ -898,7 +912,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	 * from per group request list
 	 */
 
-	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
+	if (nr_group_requests &&
+	    rl->count[is_sync] >= (3 * nr_group_requests / 2))
 		goto out;
 
 	rl->starved[is_sync] = 0;
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 78b8aec..bd582a7 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -148,7 +148,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	 * set defaults
 	 */
 	q->nr_requests = BLKDEV_MAX_RQ;
-	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
 
 	q->make_request_fn = mfn;
 	blk_queue_dma_alignment(q, 511);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 92b9f25..706d852 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -78,40 +78,8 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 	return ret;
 }
 #ifdef CONFIG_GROUP_IOSCHED
-static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
-{
-	return queue_var_show(q->nr_group_requests, (page));
-}
-
 extern void elv_io_group_congestion_threshold(struct request_queue *q,
 					      struct io_group *iog);
-
-static ssize_t
-queue_group_requests_store(struct request_queue *q, const char *page,
-					size_t count)
-{
-	struct hlist_node *n;
-	struct io_group *iog;
-	struct elv_fq_data *efqd;
-	unsigned long nr;
-	int ret = queue_var_store(&nr, page, count);
-
-	if (nr < BLKDEV_MIN_RQ)
-		nr = BLKDEV_MIN_RQ;
-
-	spin_lock_irq(q->queue_lock);
-
-	q->nr_group_requests = nr;
-
-	efqd = &q->elevator->efqd;
-
-	hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
-		elv_io_group_congestion_threshold(q, iog);
-	}
-
-	spin_unlock_irq(q->queue_lock);
-	return ret;
-}
 #endif
 
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
@@ -278,14 +246,6 @@ static struct queue_sysfs_entry queue_requests_entry = {
 	.store = queue_requests_store,
 };
 
-#ifdef CONFIG_GROUP_IOSCHED
-static struct queue_sysfs_entry queue_group_requests_entry = {
-	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
-	.show = queue_group_requests_show,
-	.store = queue_group_requests_store,
-};
-#endif
-
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -360,9 +320,6 @@ static struct queue_sysfs_entry queue_iostats_entry = {
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
-#ifdef CONFIG_GROUP_IOSCHED
-	&queue_group_requests_entry.attr,
-#endif
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 29392e7..bfb0210 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -59,6 +59,35 @@ elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
 #define for_each_entity_safe(entity, parent) \
 	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
 
+unsigned short get_group_requests(struct request_queue *q,
+				  struct io_group *iog)
+{
+	struct cgroup_subsys_state *css;
+	struct io_cgroup *iocg;
+	unsigned long nr_group_requests;
+
+	if (!iog)
+		return q->nr_requests;
+
+	rcu_read_lock();
+
+	if (!iog->iocg_id) {
+		nr_group_requests = 0;
+		goto out;
+	}
+
+	css = css_lookup(&io_subsys, iog->iocg_id);
+	if (!css) {
+		nr_group_requests = 0;
+		goto out;
+	}
+
+	iocg = container_of(css, struct io_cgroup, css);
+	nr_group_requests = iocg->nr_group_requests;
+out:
+	rcu_read_unlock();
+	return nr_group_requests;
+}
 
 static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 						 int extract);
@@ -1257,14 +1286,17 @@ void elv_io_group_congestion_threshold(struct request_queue *q,
 						struct io_group *iog)
 {
 	int nr;
+	unsigned long nr_group_requests;
 
-	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
-	if (nr > q->nr_group_requests)
-		nr = q->nr_group_requests;
+	nr_group_requests = get_group_requests(q, iog);
+
+	nr = nr_group_requests - (nr_group_requests / 8) + 1;
+	if (nr > nr_group_requests)
+		nr = nr_group_requests;
 	iog->nr_congestion_on = nr;
 
-	nr = q->nr_group_requests - (q->nr_group_requests / 8)
-			- (q->nr_group_requests / 16) - 1;
+	nr = nr_group_requests - (nr_group_requests / 8)
+			- (nr_group_requests / 16) - 1;
 	if (nr < 1)
 		nr = 1;
 	iog->nr_congestion_off = nr;
@@ -1283,6 +1315,7 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
 {
 	struct io_group *iog;
 	int ret = 0;
+	unsigned long nr_group_requests;
 
 	rcu_read_lock();
 
@@ -1300,10 +1333,11 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
 	}
 
 	ret = elv_is_iog_congested(q, iog, sync);
+	nr_group_requests = get_group_requests(q, iog);
 	if (ret)
 		elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
 			" rl.count[sync]=%d nr_group_requests=%d",
-			ret, sync, iog->rl.count[sync], q->nr_group_requests);
+			ret, sync, iog->rl.count[sync], nr_group_requests);
 	rcu_read_unlock();
 	return ret;
 }
@@ -1549,6 +1583,48 @@ free_buf:
 	return ret;
 }
 
+static u64 io_cgroup_nr_requests_read(struct cgroup *cgroup,
+				       struct cftype *cftype)
+{
+	struct io_cgroup *iocg;
+	u64 ret;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+	spin_lock_irq(&iocg->lock);
+	ret = iocg->nr_group_requests;
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+	return ret;
+}
+
+static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
+					struct cftype *cftype,
+					u64 val)
+{
+	struct io_cgroup *iocg;
+
+	if (val < BLKDEV_MIN_RQ)
+		val = BLKDEV_MIN_RQ;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+
+	spin_lock_irq(&iocg->lock);
+	iocg->nr_group_requests = (unsigned long)val;
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+	return 0;
+}
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -1735,6 +1811,11 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
 
 struct cftype bfqio_files[] = {
 	{
+		.name = "nr_group_requests",
+		.read_u64 = io_cgroup_nr_requests_read,
+		.write_u64 = io_cgroup_nr_requests_write,
+	},
+	{
 		.name = "policy",
 		.read_seq_string = io_cgroup_policy_read,
 		.write_string = io_cgroup_policy_write,
@@ -1790,6 +1871,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
 
 	spin_lock_init(&iocg->lock);
 	INIT_HLIST_HEAD(&iocg->group_data);
+	iocg->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
 	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
 	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
 	INIT_LIST_HEAD(&iocg->policy_list);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index f089a55..df077d0 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -308,6 +308,7 @@ struct io_cgroup {
 	unsigned int weight;
 	unsigned short ioprio_class;
 
+	unsigned long nr_group_requests;
 	/* list of io_policy_node */
 	struct list_head policy_list;
 
@@ -386,6 +387,9 @@ struct elv_fq_data {
 	unsigned int fairness;
 };
 
+extern unsigned short get_group_requests(struct request_queue *q,
+					 struct io_group *iog);
+
 /* Logging facilities. */
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 #define elv_log_ioq(efqd, ioq, fmt, args...) \
-- 
1.5.4.rc3 
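
A minimal user-space sketch of how the per-group limit feeds the arithmetic
touched by this patch. The names group_thresholds,
group_congestion_thresholds and group_rq_check are hypothetical. The
threshold formulas follow elv_io_group_congestion_threshold() as shown in the
diff. The admission check is a simplification: it keeps only the "close to
the limit" and "3/2 of the limit" comparisons from get_request() and ignores
batching. A limit of 0 stands for the case where get_group_requests() could
not find the cgroup, in which the patched code skips the per-group checks.
Note that the patch declares get_group_requests() as returning unsigned short
although the stored limit is unsigned long; the sketch uses unsigned long
throughout.

```c
/* Congestion on/off thresholds derived from a per-group limit. */
struct group_thresholds {
	long on;	/* mark the group congested at this count */
	long off;	/* clear congestion below this count */
};

struct group_thresholds group_congestion_thresholds(unsigned long limit)
{
	struct group_thresholds t;
	long n = (long)limit;

	/* Congestion starts at roughly 7/8 of the limit, capped at it. */
	t.on = n - n / 8 + 1;
	if (t.on > n)
		t.on = n;

	/* Congestion clears at roughly 13/16 of the limit, at least 1. */
	t.off = n - n / 8 - n / 16 - 1;
	if (t.off < 1)
		t.off = 1;

	return t;
}

enum group_rq_state {
	GROUP_RQ_OK,		/* well below the per-group limit */
	GROUP_RQ_NEAR_LIMIT,	/* this allocation fills the group */
	GROUP_RQ_DENY		/* hard cap at 3/2 of the limit */
};

/*
 * Simplified admission check for a group that already holds
 * 'count' requests. Batching is deliberately left out.
 */
enum group_rq_state group_rq_check(unsigned long count, unsigned long limit)
{
	if (!limit)
		return GROUP_RQ_OK;	/* no per-group limit known */
	if (count >= 3 * limit / 2)
		return GROUP_RQ_DENY;
	if (count + 1 >= limit)
		return GROUP_RQ_NEAR_LIMIT;
	return GROUP_RQ_OK;
}
```

With a limit of 128, this gives on/off thresholds of 113 and 103, and the
simplified check refuses allocation once 192 requests are outstanding in the
group.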


* [PATCH] io-controller: implement per group request allocation limitation
@ 2009-07-10  1:56   ` Gui Jianfeng
  0 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-10  1:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, snitzer, peterz, dm-devel, dpshah, jens.axboe, agk,
	balbir, paolo.valente, fernando, mikew, jmoyer, nauman, m-ikeda,
	lizf, fchecconi, akpm, jbaron, linux-kernel, s-uchida,
	righi.andrea, containers

Hi Vivek,

This patch exports a cgroup-based, per-group request limit interface
and removes the global one. We can now use this interface to set
different request allocation limits for different groups.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/blk-core.c     |   23 ++++++++++--
 block/blk-settings.c |    1 -
 block/blk-sysfs.c    |   43 -----------------------
 block/elevator-fq.c  |   94 ++++++++++++++++++++++++++++++++++++++++++++++---
 block/elevator-fq.h  |    4 ++
 5 files changed, 111 insertions(+), 54 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 79fe6a9..7010b76 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -722,13 +722,20 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
 static void __freed_request(struct request_queue *q, int sync,
 					struct request_list *rl)
 {
+	struct io_group *iog;
+	unsigned long nr_group_requests;
+
 	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
 		blk_clear_queue_congested(q, sync);
 
 	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
 		blk_clear_queue_full(q, sync);
 
-	if (rl->count[sync] + 1 <= q->nr_group_requests) {
+	iog = rl_iog(rl);
+
+	nr_group_requests = get_group_requests(q, iog);
+
+	if (nr_group_requests && rl->count[sync] + 1 <= nr_group_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
 			wake_up(&rl->wait[sync]);
 	}
@@ -828,6 +835,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue, priv;
 	int sleep_on_global = 0;
+	struct io_group *iog;
+	unsigned long nr_group_requests;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
@@ -843,7 +852,12 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
 		blk_set_queue_full(q, is_sync);
 
-	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
+	iog = rl_iog(rl);
+
+	nr_group_requests = get_group_requests(q, iog);
+
+	if (nr_group_requests &&
+	    rl->count[is_sync]+1 >= nr_group_requests) {
 		ioc = current_io_context(GFP_ATOMIC, q->node);
 		/*
 		 * The queue request descriptor group will fill after this
@@ -852,7 +866,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * This process will be allowed to complete a batch of
 		 * requests, others will be blocked.
 		 */
-		if (rl->count[is_sync] <= q->nr_group_requests)
+		if (rl->count[is_sync] <= nr_group_requests)
 			ioc_set_batching(q, ioc);
 		else {
 			if (may_queue != ELV_MQUEUE_MUST
@@ -898,7 +912,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	 * from per group request list
 	 */
 
-	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
+	if (nr_group_requests &&
+	    rl->count[is_sync] >= (3 * nr_group_requests / 2))
 		goto out;
 
 	rl->starved[is_sync] = 0;
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 78b8aec..bd582a7 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -148,7 +148,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	 * set defaults
 	 */
 	q->nr_requests = BLKDEV_MAX_RQ;
-	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
 
 	q->make_request_fn = mfn;
 	blk_queue_dma_alignment(q, 511);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 92b9f25..706d852 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -78,40 +78,8 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
 	return ret;
 }
 #ifdef CONFIG_GROUP_IOSCHED
-static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
-{
-	return queue_var_show(q->nr_group_requests, (page));
-}
-
 extern void elv_io_group_congestion_threshold(struct request_queue *q,
 					      struct io_group *iog);
-
-static ssize_t
-queue_group_requests_store(struct request_queue *q, const char *page,
-					size_t count)
-{
-	struct hlist_node *n;
-	struct io_group *iog;
-	struct elv_fq_data *efqd;
-	unsigned long nr;
-	int ret = queue_var_store(&nr, page, count);
-
-	if (nr < BLKDEV_MIN_RQ)
-		nr = BLKDEV_MIN_RQ;
-
-	spin_lock_irq(q->queue_lock);
-
-	q->nr_group_requests = nr;
-
-	efqd = &q->elevator->efqd;
-
-	hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
-		elv_io_group_congestion_threshold(q, iog);
-	}
-
-	spin_unlock_irq(q->queue_lock);
-	return ret;
-}
 #endif
 
 static ssize_t queue_ra_show(struct request_queue *q, char *page)
@@ -278,14 +246,6 @@ static struct queue_sysfs_entry queue_requests_entry = {
 	.store = queue_requests_store,
 };
 
-#ifdef CONFIG_GROUP_IOSCHED
-static struct queue_sysfs_entry queue_group_requests_entry = {
-	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
-	.show = queue_group_requests_show,
-	.store = queue_group_requests_store,
-};
-#endif
-
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -360,9 +320,6 @@ static struct queue_sysfs_entry queue_iostats_entry = {
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
-#ifdef CONFIG_GROUP_IOSCHED
-	&queue_group_requests_entry.attr,
-#endif
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 29392e7..bfb0210 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -59,6 +59,35 @@ elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
 #define for_each_entity_safe(entity, parent) \
 	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
 
+unsigned short get_group_requests(struct request_queue *q,
+				  struct io_group *iog)
+{
+	struct cgroup_subsys_state *css;
+	struct io_cgroup *iocg;
+	unsigned long nr_group_requests;
+
+	if (!iog)
+		return q->nr_requests;
+
+	rcu_read_lock();
+
+	if (!iog->iocg_id) {
+		nr_group_requests = 0;
+		goto out;
+	}
+
+	css = css_lookup(&io_subsys, iog->iocg_id);
+	if (!css) {
+		nr_group_requests = 0;
+		goto out;
+	}
+
+	iocg = container_of(css, struct io_cgroup, css);
+	nr_group_requests = iocg->nr_group_requests;
+out:
+	rcu_read_unlock();
+	return nr_group_requests;
+}
 
 static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
 						 int extract);
@@ -1257,14 +1286,17 @@ void elv_io_group_congestion_threshold(struct request_queue *q,
 						struct io_group *iog)
 {
 	int nr;
+	unsigned long nr_group_requests;
 
-	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
-	if (nr > q->nr_group_requests)
-		nr = q->nr_group_requests;
+	nr_group_requests = get_group_requests(q, iog);
+
+	nr = nr_group_requests - (nr_group_requests / 8) + 1;
+	if (nr > nr_group_requests)
+		nr = nr_group_requests;
 	iog->nr_congestion_on = nr;
 
-	nr = q->nr_group_requests - (q->nr_group_requests / 8)
-			- (q->nr_group_requests / 16) - 1;
+	nr = nr_group_requests - (nr_group_requests / 8)
+			- (nr_group_requests / 16) - 1;
 	if (nr < 1)
 		nr = 1;
 	iog->nr_congestion_off = nr;
@@ -1283,6 +1315,7 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
 {
 	struct io_group *iog;
 	int ret = 0;
+	unsigned long nr_group_requests;
 
 	rcu_read_lock();
 
@@ -1300,10 +1333,11 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
 	}
 
 	ret = elv_is_iog_congested(q, iog, sync);
+	nr_group_requests = get_group_requests(q, iog);
 	if (ret)
 		elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
 			" rl.count[sync]=%d nr_group_requests=%d",
-			ret, sync, iog->rl.count[sync], q->nr_group_requests);
+			ret, sync, iog->rl.count[sync], nr_group_requests);
 	rcu_read_unlock();
 	return ret;
 }
@@ -1549,6 +1583,48 @@ free_buf:
 	return ret;
 }
 
+static u64 io_cgroup_nr_requests_read(struct cgroup *cgroup,
+				       struct cftype *cftype)
+{
+	struct io_cgroup *iocg;
+	u64 ret;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+	spin_lock_irq(&iocg->lock);
+	ret = iocg->nr_group_requests;
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+	return ret;
+}
+
+static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
+					struct cftype *cftype,
+					u64 val)
+{
+	struct io_cgroup *iocg;
+
+	if (val < BLKDEV_MIN_RQ)
+		val = BLKDEV_MIN_RQ;
+
+	if (!cgroup_lock_live_group(cgroup))
+		return -ENODEV;
+
+	iocg = cgroup_to_io_cgroup(cgroup);
+
+	spin_lock_irq(&iocg->lock);
+	iocg->nr_group_requests = (unsigned long)val;
+	spin_unlock_irq(&iocg->lock);
+
+	cgroup_unlock();
+
+	return 0;
+}
+
 #define SHOW_FUNCTION(__VAR)						\
 static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
 				       struct cftype *cftype)		\
@@ -1735,6 +1811,11 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
 
 struct cftype bfqio_files[] = {
 	{
+		.name = "nr_group_requests",
+		.read_u64 = io_cgroup_nr_requests_read,
+		.write_u64 = io_cgroup_nr_requests_write,
+	},
+	{
 		.name = "policy",
 		.read_seq_string = io_cgroup_policy_read,
 		.write_string = io_cgroup_policy_write,
@@ -1790,6 +1871,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
 
 	spin_lock_init(&iocg->lock);
 	INIT_HLIST_HEAD(&iocg->group_data);
+	iocg->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
 	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
 	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
 	INIT_LIST_HEAD(&iocg->policy_list);
diff --git a/block/elevator-fq.h b/block/elevator-fq.h
index f089a55..df077d0 100644
--- a/block/elevator-fq.h
+++ b/block/elevator-fq.h
@@ -308,6 +308,7 @@ struct io_cgroup {
 	unsigned int weight;
 	unsigned short ioprio_class;
 
+	unsigned long nr_group_requests;
 	/* list of io_policy_node */
 	struct list_head policy_list;
 
@@ -386,6 +387,9 @@ struct elv_fq_data {
 	unsigned int fairness;
 };
 
+extern unsigned short get_group_requests(struct request_queue *q,
+					 struct io_group *iog);
+
 /* Logging facilities. */
 #ifdef CONFIG_DEBUG_GROUP_IOSCHED
 #define elv_log_ioq(efqd, ioq, fmt, args...) \
-- 
1.5.4.rc3 
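
For readers following the threshold arithmetic that
elv_io_group_congestion_threshold() now performs on the per-cgroup limit,
here is a minimal userspace sketch of the same computation. The names
iog_thresholds and group_thresholds are made up for illustration and are
not part of the patch. The sketch also uses signed arithmetic throughout,
whereas the patch mixes an int with an unsigned long, so treat it as an
approximation of the intent rather than the kernel code.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical container for the two per-group thresholds. */
struct iog_thresholds {
	long on;	/* mirrors iog->nr_congestion_on */
	long off;	/* mirrors iog->nr_congestion_off */
};

static struct iog_thresholds group_thresholds(unsigned long nr_group_requests)
{
	struct iog_thresholds t;
	long limit, nr;

	/* Assumption of this sketch: the limit fits in a signed long. */
	assert(nr_group_requests <= LONG_MAX);
	limit = (long)nr_group_requests;

	/* "on": the limit minus one eighth, plus one, capped at the limit */
	nr = limit - (limit / 8) + 1;
	if (nr > limit)
		nr = limit;
	t.on = nr;

	/* "off": the limit minus 1/8 and 1/16, minus one, but at least 1 */
	nr = limit - (limit / 8) - (limit / 16) - 1;
	if (nr < 1)
		nr = 1;
	t.off = nr;

	return t;
}
```

With an illustrative limit of 128 this gives on = 113 and off = 103. With
a limit of 0, which is what get_group_requests() returns when the cgroup
id is unset or the css lookup fails, it gives on = 0 and off = 1, so the
"on" threshold ends up below the "off" threshold.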

* Re: [PATCH] io-controller: implement per group request allocation limitation
       [not found]   ` <4A569FC5.7090801-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-07-13 16:03     ` Vivek Goyal
  2009-08-04  2:02     ` Munehiro Ikeda
  2009-08-04  2:04     ` Munehiro Ikeda
  2 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-13 16:03 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch exports a cgroup-based, per-group request limit interface
> and removes the global one. Now we can use this interface to apply
> different request allocation limits to different groups.
> 

Thanks Gui. A few points come to mind.

- You seem to be making this a per-cgroup limit that applies to all
  devices. I guess that different devices in the system can have
  different settings of q->nr_requests and hence will probably want
  different per-group limits. So we might have to make it a per-cgroup,
  per-device limit.

- There do not seem to be any checks to make sure that child cgroups
  don't have more request descriptors allocated than the parent group.

- I am rethinking what the advantage is of configuring request
  descriptors through cgroups as well. It does bring in additional
  complexity, and the advantages should justify it. Can you think of
  some?

  Unless we can come up with some significant advantages, I would
  prefer to continue to use the per-group limit through the
  q->nr_group_requests interface instead of cgroup. Once things
  stabilize, we can revisit it and see how this interface can be
  improved.

Thanks
Vivek
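
To make the second point above concrete, here is a minimal userspace
sketch of what a parent/child check could look like. It is only one
possible policy and nothing like it exists in the patch. The struct grp
type and both helper names are hypothetical, and a real implementation
would have to walk the cgroup hierarchy under the appropriate locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a group with a request descriptor limit. */
struct grp {
	struct grp *parent;		/* NULL for the root group */
	unsigned long nr_group_requests;
};

/* Return true if new_limit does not exceed the limit of any ancestor. */
static bool limit_allowed(const struct grp *g, unsigned long new_limit)
{
	const struct grp *p;

	assert(g != NULL);
	for (p = g->parent; p; p = p->parent)
		if (new_limit > p->nr_group_requests)
			return false;
	return true;
}

/* Alternative policy: accept any value, but enforce the smallest limit
 * found on the path from the group up to the root. */
static unsigned long effective_limit(const struct grp *g)
{
	unsigned long limit;

	assert(g != NULL);
	limit = g->nr_group_requests;
	for (g = g->parent; g; g = g->parent)
		if (g->nr_group_requests < limit)
			limit = g->nr_group_requests;
	return limit;
}
```

The first helper rejects a write up front. The second leaves the written
value alone and clamps at lookup time, which avoids having to revalidate
children whenever a parent's limit is lowered.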

> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
>  block/blk-core.c     |   23 ++++++++++--
>  block/blk-settings.c |    1 -
>  block/blk-sysfs.c    |   43 -----------------------
>  block/elevator-fq.c  |   94 ++++++++++++++++++++++++++++++++++++++++++++++---
>  block/elevator-fq.h  |    4 ++
>  5 files changed, 111 insertions(+), 54 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 79fe6a9..7010b76 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -722,13 +722,20 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
>  static void __freed_request(struct request_queue *q, int sync,
>  					struct request_list *rl)
>  {
> +	struct io_group *iog;
> +	unsigned long nr_group_requests;
> +
>  	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
>  		blk_clear_queue_congested(q, sync);
>  
>  	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
>  		blk_clear_queue_full(q, sync);
>  
> -	if (rl->count[sync] + 1 <= q->nr_group_requests) {
> +	iog = rl_iog(rl);
> +
> +	nr_group_requests = get_group_requests(q, iog);
> +
> +	if (nr_group_requests && rl->count[sync] + 1 <= nr_group_requests) {
>  		if (waitqueue_active(&rl->wait[sync]))
>  			wake_up(&rl->wait[sync]);
>  	}
> @@ -828,6 +835,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	const bool is_sync = rw_is_sync(rw_flags) != 0;
>  	int may_queue, priv;
>  	int sleep_on_global = 0;
> +	struct io_group *iog;
> +	unsigned long nr_group_requests;
>  
>  	may_queue = elv_may_queue(q, rw_flags);
>  	if (may_queue == ELV_MQUEUE_NO)
> @@ -843,7 +852,12 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
>  		blk_set_queue_full(q, is_sync);
>  
> -	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
> +	iog = rl_iog(rl);
> +
> +	nr_group_requests = get_group_requests(q, iog);
> +
> +	if (nr_group_requests &&
> +	    rl->count[is_sync]+1 >= nr_group_requests) {
>  		ioc = current_io_context(GFP_ATOMIC, q->node);
>  		/*
>  		 * The queue request descriptor group will fill after this
> @@ -852,7 +866,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  		 * This process will be allowed to complete a batch of
>  		 * requests, others will be blocked.
>  		 */
> -		if (rl->count[is_sync] <= q->nr_group_requests)
> +		if (rl->count[is_sync] <= nr_group_requests)
>  			ioc_set_batching(q, ioc);
>  		else {
>  			if (may_queue != ELV_MQUEUE_MUST
> @@ -898,7 +912,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	 * from per group request list
>  	 */
>  
> -	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
> +	if (nr_group_requests &&
> +	    rl->count[is_sync] >= (3 * nr_group_requests / 2))
>  		goto out;
>  
>  	rl->starved[is_sync] = 0;
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 78b8aec..bd582a7 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -148,7 +148,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>  	 * set defaults
>  	 */
>  	q->nr_requests = BLKDEV_MAX_RQ;
> -	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>  
>  	q->make_request_fn = mfn;
>  	blk_queue_dma_alignment(q, 511);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 92b9f25..706d852 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -78,40 +78,8 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
>  	return ret;
>  }
>  #ifdef CONFIG_GROUP_IOSCHED
> -static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
> -{
> -	return queue_var_show(q->nr_group_requests, (page));
> -}
> -
>  extern void elv_io_group_congestion_threshold(struct request_queue *q,
>  					      struct io_group *iog);
> -
> -static ssize_t
> -queue_group_requests_store(struct request_queue *q, const char *page,
> -					size_t count)
> -{
> -	struct hlist_node *n;
> -	struct io_group *iog;
> -	struct elv_fq_data *efqd;
> -	unsigned long nr;
> -	int ret = queue_var_store(&nr, page, count);
> -
> -	if (nr < BLKDEV_MIN_RQ)
> -		nr = BLKDEV_MIN_RQ;
> -
> -	spin_lock_irq(q->queue_lock);
> -
> -	q->nr_group_requests = nr;
> -
> -	efqd = &q->elevator->efqd;
> -
> -	hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
> -		elv_io_group_congestion_threshold(q, iog);
> -	}
> -
> -	spin_unlock_irq(q->queue_lock);
> -	return ret;
> -}
>  #endif
>  
>  static ssize_t queue_ra_show(struct request_queue *q, char *page)
> @@ -278,14 +246,6 @@ static struct queue_sysfs_entry queue_requests_entry = {
>  	.store = queue_requests_store,
>  };
>  
> -#ifdef CONFIG_GROUP_IOSCHED
> -static struct queue_sysfs_entry queue_group_requests_entry = {
> -	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
> -	.show = queue_group_requests_show,
> -	.store = queue_group_requests_store,
> -};
> -#endif
> -
>  static struct queue_sysfs_entry queue_ra_entry = {
>  	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>  	.show = queue_ra_show,
> @@ -360,9 +320,6 @@ static struct queue_sysfs_entry queue_iostats_entry = {
>  
>  static struct attribute *default_attrs[] = {
>  	&queue_requests_entry.attr,
> -#ifdef CONFIG_GROUP_IOSCHED
> -	&queue_group_requests_entry.attr,
> -#endif
>  	&queue_ra_entry.attr,
>  	&queue_max_hw_sectors_entry.attr,
>  	&queue_max_sectors_entry.attr,
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 29392e7..bfb0210 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -59,6 +59,35 @@ elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
>  #define for_each_entity_safe(entity, parent) \
>  	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
>  
> +unsigned short get_group_requests(struct request_queue *q,
> +				  struct io_group *iog)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct io_cgroup *iocg;
> +	unsigned long nr_group_requests;
> +
> +	if (!iog)
> +		return q->nr_requests;
> +
> +	rcu_read_lock();
> +
> +	if (!iog->iocg_id) {
> +		nr_group_requests = 0;
> +		goto out;
> +	}
> +
> +	css = css_lookup(&io_subsys, iog->iocg_id);
> +	if (!css) {
> +		nr_group_requests = 0;
> +		goto out;
> +	}
> +
> +	iocg = container_of(css, struct io_cgroup, css);
> +	nr_group_requests = iocg->nr_group_requests;
> +out:
> +	rcu_read_unlock();
> +	return nr_group_requests;
> +}
>  
>  static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
>  						 int extract);
> @@ -1257,14 +1286,17 @@ void elv_io_group_congestion_threshold(struct request_queue *q,
>  						struct io_group *iog)
>  {
>  	int nr;
> +	unsigned long nr_group_requests;
>  
> -	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
> -	if (nr > q->nr_group_requests)
> -		nr = q->nr_group_requests;
> +	nr_group_requests = get_group_requests(q, iog);
> +
> +	nr = nr_group_requests - (nr_group_requests / 8) + 1;
> +	if (nr > nr_group_requests)
> +		nr = nr_group_requests;
>  	iog->nr_congestion_on = nr;
>  
> -	nr = q->nr_group_requests - (q->nr_group_requests / 8)
> -			- (q->nr_group_requests / 16) - 1;
> +	nr = nr_group_requests - (nr_group_requests / 8)
> +			- (nr_group_requests / 16) - 1;
>  	if (nr < 1)
>  		nr = 1;
>  	iog->nr_congestion_off = nr;
> @@ -1283,6 +1315,7 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>  {
>  	struct io_group *iog;
>  	int ret = 0;
> +	unsigned long nr_group_requests;
>  
>  	rcu_read_lock();
>  
> @@ -1300,10 +1333,11 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>  	}
>  
>  	ret = elv_is_iog_congested(q, iog, sync);
> +	nr_group_requests = get_group_requests(q, iog);
>  	if (ret)
>  		elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
>  			" rl.count[sync]=%d nr_group_requests=%d",
> -			ret, sync, iog->rl.count[sync], q->nr_group_requests);
> +			ret, sync, iog->rl.count[sync], nr_group_requests);
>  	rcu_read_unlock();
>  	return ret;
>  }
> @@ -1549,6 +1583,48 @@ free_buf:
>  	return ret;
>  }
>  
> +static u64 io_cgroup_nr_requests_read(struct cgroup *cgroup,
> +				       struct cftype *cftype)
> +{
> +	struct io_cgroup *iocg;
> +	u64 ret;
> +
> +	if (!cgroup_lock_live_group(cgroup))
> +		return -ENODEV;
> +
> +	iocg = cgroup_to_io_cgroup(cgroup);
> +	spin_lock_irq(&iocg->lock);
> +	ret = iocg->nr_group_requests;
> +	spin_unlock_irq(&iocg->lock);
> +
> +	cgroup_unlock();
> +
> +	return ret;
> +}
> +
> +static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
> +					struct cftype *cftype,
> +					u64 val)
> +{
> +	struct io_cgroup *iocg;
> +
> +	if (val < BLKDEV_MIN_RQ)
> +		val = BLKDEV_MIN_RQ;
> +
> +	if (!cgroup_lock_live_group(cgroup))
> +		return -ENODEV;
> +
> +	iocg = cgroup_to_io_cgroup(cgroup);
> +
> +	spin_lock_irq(&iocg->lock);
> +	iocg->nr_group_requests = (unsigned long)val;
> +	spin_unlock_irq(&iocg->lock);
> +
> +	cgroup_unlock();
> +
> +	return 0;
> +}
> +
>  #define SHOW_FUNCTION(__VAR)						\
>  static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
>  				       struct cftype *cftype)		\
> @@ -1735,6 +1811,11 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
>  
>  struct cftype bfqio_files[] = {
>  	{
> +		.name = "nr_group_requests",
> +		.read_u64 = io_cgroup_nr_requests_read,
> +		.write_u64 = io_cgroup_nr_requests_write,
> +	},
> +	{
>  		.name = "policy",
>  		.read_seq_string = io_cgroup_policy_read,
>  		.write_string = io_cgroup_policy_write,
> @@ -1790,6 +1871,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>  
>  	spin_lock_init(&iocg->lock);
>  	INIT_HLIST_HEAD(&iocg->group_data);
> +	iocg->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>  	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>  	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>  	INIT_LIST_HEAD(&iocg->policy_list);
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index f089a55..df077d0 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -308,6 +308,7 @@ struct io_cgroup {
>  	unsigned int weight;
>  	unsigned short ioprio_class;
>  
> +	unsigned long nr_group_requests;
>  	/* list of io_policy_node */
>  	struct list_head policy_list;
>  
> @@ -386,6 +387,9 @@ struct elv_fq_data {
>  	unsigned int fairness;
>  };
>  
> +extern unsigned short get_group_requests(struct request_queue *q,
> +					 struct io_group *iog);
> +
>  /* Logging facilities. */
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>  #define elv_log_ioq(efqd, ioq, fmt, args...) \
> -- 
> 1.5.4.rc3 
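
One detail in the quoted patch that a small example makes visible:
get_group_requests() is declared to return unsigned short, while
iocg->nr_group_requests is an unsigned long and the cgroup write handler
only clamps the value from below. The userspace sketch below reproduces
just that narrowing. The function names are hypothetical and this is not
the kernel code, only an illustration of the C conversion rule.

```c
#include <assert.h>
#include <limits.h>

/* Same narrowing as the patch: unsigned long in, unsigned short out. */
static unsigned short limit_as_short(unsigned long nr_group_requests)
{
	/* The value is reduced modulo USHRT_MAX + 1. */
	return (unsigned short)nr_group_requests;
}

/* What a return type matching the stored field would give instead. */
static unsigned long limit_as_long(unsigned long nr_group_requests)
{
	return nr_group_requests;
}

/* Small self-check showing the wraparound at USHRT_MAX + 1. */
static void check_narrowing(void)
{
	assert(limit_as_short(USHRT_MAX) == USHRT_MAX);
	assert(limit_as_short((unsigned long)USHRT_MAX + 1) == 0);
	assert(limit_as_long((unsigned long)USHRT_MAX + 1) != 0);
}
```

If I read the patch correctly, a limit of USHRT_MAX + 1 would therefore
come back as 0, and the callers in blk-core.c skip the per-group checks
when the value is 0 because they test "nr_group_requests &&" first.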

* Re: [PATCH] io-controller: implement per group request allocation limitation
  2009-07-10  1:56   ` Gui Jianfeng
@ 2009-07-13 16:03     ` Vivek Goyal
  -1 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-13 16:03 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: linux-kernel, containers, dm-devel, jens.axboe, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, jbaron, agk,
	snitzer, akpm, peterz

On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch exports a cgroup-based, per-group request limit interface
> and removes the global one. Now we can use this interface to apply
> different request allocation limits to different groups.
> 

Thanks Gui. A few points come to mind.

- You seem to be making this a per-cgroup limit that applies to all
  devices. I guess that different devices in the system can have
  different settings of q->nr_requests and hence will probably want
  different per-group limits. So we might have to make it a per-cgroup,
  per-device limit.

- There do not seem to be any checks to make sure that child cgroups
  don't have more request descriptors allocated than the parent group.

- I am rethinking what the advantage is of configuring request
  descriptors through cgroups as well. It does bring in additional
  complexity, and the advantages should justify it. Can you think of
  some?

  Unless we can come up with some significant advantages, I would
  prefer to continue to use the per-group limit through the
  q->nr_group_requests interface instead of cgroup. Once things
  stabilize, we can revisit it and see how this interface can be
  improved.

Thanks
Vivek
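
As an illustration of the first point above, a per-cgroup, per-device
limit could be modelled as a small rule table with a per-cgroup default
as the fallback. This is a userspace sketch under my own assumptions:
the dev_limit and grp_limits types, the use of a plain device number as
the key, and the limit_for_dev() helper are all hypothetical and do not
come from the patch set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device rule: one limit for one device number. */
struct dev_limit {
	unsigned int dev;
	unsigned long nr_group_requests;
};

/* Hypothetical per-cgroup state: a default plus optional device rules. */
struct grp_limits {
	unsigned long default_limit;	/* used when no rule matches */
	const struct dev_limit *rules;
	size_t nr_rules;
};

/* Look up the limit for one device, falling back to the default. */
static unsigned long limit_for_dev(const struct grp_limits *gl,
				   unsigned int dev)
{
	size_t i;

	assert(gl != NULL);
	for (i = 0; i < gl->nr_rules; i++)
		if (gl->rules[i].dev == dev)
			return gl->rules[i].nr_group_requests;
	return gl->default_limit;
}
```

A linear scan is enough for a sketch because a group is unlikely to carry
many rules. The design question that remains is how such a limit should
relate to each device's own q->nr_requests.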

> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/blk-core.c     |   23 ++++++++++--
>  block/blk-settings.c |    1 -
>  block/blk-sysfs.c    |   43 -----------------------
>  block/elevator-fq.c  |   94 ++++++++++++++++++++++++++++++++++++++++++++++---
>  block/elevator-fq.h  |    4 ++
>  5 files changed, 111 insertions(+), 54 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 79fe6a9..7010b76 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -722,13 +722,20 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
>  static void __freed_request(struct request_queue *q, int sync,
>  					struct request_list *rl)
>  {
> +	struct io_group *iog;
> +	unsigned long nr_group_requests;
> +
>  	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
>  		blk_clear_queue_congested(q, sync);
>  
>  	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
>  		blk_clear_queue_full(q, sync);
>  
> -	if (rl->count[sync] + 1 <= q->nr_group_requests) {
> +	iog = rl_iog(rl);
> +
> +	nr_group_requests = get_group_requests(q, iog);
> +
> +	if (nr_group_requests && rl->count[sync] + 1 <= nr_group_requests) {
>  		if (waitqueue_active(&rl->wait[sync]))
>  			wake_up(&rl->wait[sync]);
>  	}
> @@ -828,6 +835,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	const bool is_sync = rw_is_sync(rw_flags) != 0;
>  	int may_queue, priv;
>  	int sleep_on_global = 0;
> +	struct io_group *iog;
> +	unsigned long nr_group_requests;
>  
>  	may_queue = elv_may_queue(q, rw_flags);
>  	if (may_queue == ELV_MQUEUE_NO)
> @@ -843,7 +852,12 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
>  		blk_set_queue_full(q, is_sync);
>  
> -	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
> +	iog = rl_iog(rl);
> +
> +	nr_group_requests = get_group_requests(q, iog);
> +
> +	if (nr_group_requests &&
> +	    rl->count[is_sync]+1 >= nr_group_requests) {
>  		ioc = current_io_context(GFP_ATOMIC, q->node);
>  		/*
>  		 * The queue request descriptor group will fill after this
> @@ -852,7 +866,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  		 * This process will be allowed to complete a batch of
>  		 * requests, others will be blocked.
>  		 */
> -		if (rl->count[is_sync] <= q->nr_group_requests)
> +		if (rl->count[is_sync] <= nr_group_requests)
>  			ioc_set_batching(q, ioc);
>  		else {
>  			if (may_queue != ELV_MQUEUE_MUST
> @@ -898,7 +912,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	 * from per group request list
>  	 */
>  
> -	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
> +	if (nr_group_requests &&
> +	    rl->count[is_sync] >= (3 * nr_group_requests / 2))
>  		goto out;
>  
>  	rl->starved[is_sync] = 0;
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 78b8aec..bd582a7 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -148,7 +148,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>  	 * set defaults
>  	 */
>  	q->nr_requests = BLKDEV_MAX_RQ;
> -	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>  
>  	q->make_request_fn = mfn;
>  	blk_queue_dma_alignment(q, 511);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 92b9f25..706d852 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -78,40 +78,8 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
>  	return ret;
>  }
>  #ifdef CONFIG_GROUP_IOSCHED
> -static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
> -{
> -	return queue_var_show(q->nr_group_requests, (page));
> -}
> -
>  extern void elv_io_group_congestion_threshold(struct request_queue *q,
>  					      struct io_group *iog);
> -
> -static ssize_t
> -queue_group_requests_store(struct request_queue *q, const char *page,
> -					size_t count)
> -{
> -	struct hlist_node *n;
> -	struct io_group *iog;
> -	struct elv_fq_data *efqd;
> -	unsigned long nr;
> -	int ret = queue_var_store(&nr, page, count);
> -
> -	if (nr < BLKDEV_MIN_RQ)
> -		nr = BLKDEV_MIN_RQ;
> -
> -	spin_lock_irq(q->queue_lock);
> -
> -	q->nr_group_requests = nr;
> -
> -	efqd = &q->elevator->efqd;
> -
> -	hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
> -		elv_io_group_congestion_threshold(q, iog);
> -	}
> -
> -	spin_unlock_irq(q->queue_lock);
> -	return ret;
> -}
>  #endif
>  
>  static ssize_t queue_ra_show(struct request_queue *q, char *page)
> @@ -278,14 +246,6 @@ static struct queue_sysfs_entry queue_requests_entry = {
>  	.store = queue_requests_store,
>  };
>  
> -#ifdef CONFIG_GROUP_IOSCHED
> -static struct queue_sysfs_entry queue_group_requests_entry = {
> -	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
> -	.show = queue_group_requests_show,
> -	.store = queue_group_requests_store,
> -};
> -#endif
> -
>  static struct queue_sysfs_entry queue_ra_entry = {
>  	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>  	.show = queue_ra_show,
> @@ -360,9 +320,6 @@ static struct queue_sysfs_entry queue_iostats_entry = {
>  
>  static struct attribute *default_attrs[] = {
>  	&queue_requests_entry.attr,
> -#ifdef CONFIG_GROUP_IOSCHED
> -	&queue_group_requests_entry.attr,
> -#endif
>  	&queue_ra_entry.attr,
>  	&queue_max_hw_sectors_entry.attr,
>  	&queue_max_sectors_entry.attr,
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 29392e7..bfb0210 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -59,6 +59,35 @@ elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
>  #define for_each_entity_safe(entity, parent) \
>  	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
>  
> +unsigned short get_group_requests(struct request_queue *q,
> +				  struct io_group *iog)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct io_cgroup *iocg;
> +	unsigned long nr_group_requests;
> +
> +	if (!iog)
> +		return q->nr_requests;
> +
> +	rcu_read_lock();
> +
> +	if (!iog->iocg_id) {
> +		nr_group_requests = 0;
> +		goto out;
> +	}
> +
> +	css = css_lookup(&io_subsys, iog->iocg_id);
> +	if (!css) {
> +		nr_group_requests = 0;
> +		goto out;
> +	}
> +
> +	iocg = container_of(css, struct io_cgroup, css);
> +	nr_group_requests = iocg->nr_group_requests;
> +out:
> +	rcu_read_unlock();
> +	return nr_group_requests;
> +}
>  
>  static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
>  						 int extract);
> @@ -1257,14 +1286,17 @@ void elv_io_group_congestion_threshold(struct request_queue *q,
>  						struct io_group *iog)
>  {
>  	int nr;
> +	unsigned long nr_group_requests;
>  
> -	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
> -	if (nr > q->nr_group_requests)
> -		nr = q->nr_group_requests;
> +	nr_group_requests = get_group_requests(q, iog);
> +
> +	nr = nr_group_requests - (nr_group_requests / 8) + 1;
> +	if (nr > nr_group_requests)
> +		nr = nr_group_requests;
>  	iog->nr_congestion_on = nr;
>  
> -	nr = q->nr_group_requests - (q->nr_group_requests / 8)
> -			- (q->nr_group_requests / 16) - 1;
> +	nr = nr_group_requests - (nr_group_requests / 8)
> +			- (nr_group_requests / 16) - 1;
>  	if (nr < 1)
>  		nr = 1;
>  	iog->nr_congestion_off = nr;
> @@ -1283,6 +1315,7 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>  {
>  	struct io_group *iog;
>  	int ret = 0;
> +	unsigned long nr_group_requests;
>  
>  	rcu_read_lock();
>  
> @@ -1300,10 +1333,11 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>  	}
>  
>  	ret = elv_is_iog_congested(q, iog, sync);
> +	nr_group_requests = get_group_requests(q, iog);
>  	if (ret)
>  		elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
>  			" rl.count[sync]=%d nr_group_requests=%d",
> -			ret, sync, iog->rl.count[sync], q->nr_group_requests);
> +			ret, sync, iog->rl.count[sync], nr_group_requests);
>  	rcu_read_unlock();
>  	return ret;
>  }
> @@ -1549,6 +1583,48 @@ free_buf:
>  	return ret;
>  }
>  
> +static u64 io_cgroup_nr_requests_read(struct cgroup *cgroup,
> +				       struct cftype *cftype)
> +{
> +	struct io_cgroup *iocg;
> +	u64 ret;
> +
> +	if (!cgroup_lock_live_group(cgroup))
> +		return -ENODEV;
> +
> +	iocg = cgroup_to_io_cgroup(cgroup);
> +	spin_lock_irq(&iocg->lock);
> +	ret = iocg->nr_group_requests;
> +	spin_unlock_irq(&iocg->lock);
> +
> +	cgroup_unlock();
> +
> +	return ret;
> +}
> +
> +static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
> +					struct cftype *cftype,
> +					u64 val)
> +{
> +	struct io_cgroup *iocg;
> +
> +	if (val < BLKDEV_MIN_RQ)
> +		val = BLKDEV_MIN_RQ;
> +
> +	if (!cgroup_lock_live_group(cgroup))
> +		return -ENODEV;
> +
> +	iocg = cgroup_to_io_cgroup(cgroup);
> +
> +	spin_lock_irq(&iocg->lock);
> +	iocg->nr_group_requests = (unsigned long)val;
> +	spin_unlock_irq(&iocg->lock);
> +
> +	cgroup_unlock();
> +
> +	return 0;
> +}
> +
>  #define SHOW_FUNCTION(__VAR)						\
>  static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
>  				       struct cftype *cftype)		\
> @@ -1735,6 +1811,11 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
>  
>  struct cftype bfqio_files[] = {
>  	{
> +		.name = "nr_group_requests",
> +		.read_u64 = io_cgroup_nr_requests_read,
> +		.write_u64 = io_cgroup_nr_requests_write,
> +	},
> +	{
>  		.name = "policy",
>  		.read_seq_string = io_cgroup_policy_read,
>  		.write_string = io_cgroup_policy_write,
> @@ -1790,6 +1871,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>  
>  	spin_lock_init(&iocg->lock);
>  	INIT_HLIST_HEAD(&iocg->group_data);
> +	iocg->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>  	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>  	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>  	INIT_LIST_HEAD(&iocg->policy_list);
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index f089a55..df077d0 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -308,6 +308,7 @@ struct io_cgroup {
>  	unsigned int weight;
>  	unsigned short ioprio_class;
>  
> +	unsigned long nr_group_requests;
>  	/* list of io_policy_node */
>  	struct list_head policy_list;
>  
> @@ -386,6 +387,9 @@ struct elv_fq_data {
>  	unsigned int fairness;
>  };
>  
> +extern unsigned short get_group_requests(struct request_queue *q,
> +					 struct io_group *iog);
> +
>  /* Logging facilities. */
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>  #define elv_log_ioq(efqd, ioq, fmt, args...) \
> -- 
> 1.5.4.rc3 
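
The blk-core.c hunks in the quoted patch use the per-group limit in four
comparisons. The sketch below restates only those visible comparisons as
standalone userspace predicates, so that the boundary values are easy to
check. The function names are hypothetical, the counter is an unsigned
long here purely for simplicity, and the parts of get_request() that the
hunks do not show are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* get_request(): the group is about to fill (a limit of 0 skips this). */
static bool group_near_limit(unsigned long count, unsigned long limit)
{
	return limit && count + 1 >= limit;
}

/* get_request(): inside the branch above, the task may start a batch. */
static bool group_may_start_batch(unsigned long count, unsigned long limit)
{
	return count <= limit;
}

/* get_request(): hard cap at one and a half times the limit. */
static bool group_over_hard_cap(unsigned long count, unsigned long limit)
{
	return limit && count >= 3 * limit / 2;
}

/* __freed_request(): waiters on the group's request list may be woken. */
static bool group_may_wake_waiters(unsigned long count, unsigned long limit)
{
	return limit && count + 1 <= limit;
}
```

With an illustrative limit of 128, the group counts as nearly full from
127 allocated requests onward and allocation is refused outright at 192.
With a limit of 0 the first, third and fourth predicates are always false.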

* Re: [PATCH] io-controller: implement per group request allocation limitation
@ 2009-07-13 16:03     ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-13 16:03 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval, snitzer, peterz, dm-devel, dpshah, jens.axboe, agk,
	balbir, paolo.valente, fernando, mikew, jmoyer, nauman, m-ikeda,
	lizf, fchecconi, akpm, jbaron, linux-kernel, s-uchida,
	righi.andrea, containers

On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
> Hi Vivek,
> 
> This patch exports a cgroup-based, per-group request limit interface
> and removes the global one. Now we can use this interface to apply
> different request allocation limits to different groups.
> 

Thanks Gui. A few points come to mind.

- You seem to be making this a per-cgroup limit that applies to all
  devices. Different devices in the system can have different settings of
  q->nr_requests and hence will probably want different per-group limits,
  so we might have to make it a per-cgroup, per-device limit.

- There do not seem to be any checks to make sure that child cgroups
  don't have more request descriptors allocated than their parent group.

- I am re-thinking what the advantage is of also configuring request
  descriptors through cgroups. It brings in additional complexity, and
  the advantages should justify that. Can you think of some?

  Unless we can come up with some significant advantages, I would prefer
  to continue to use the per-group limit through the q->nr_group_requests
  interface instead of cgroup. Once things stabilize, we can revisit it
  and see how this interface can be improved.
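
To illustrate the second point, a check along the following lines is
roughly what seems to be missing. This is only a minimal sketch, not code
from the patch: struct demo_iocg, its parent pointer, and both demo_*
helpers are hypothetical stand-ins introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io_cgroup; only the fields needed here. */
struct demo_iocg {
	struct demo_iocg *parent;        /* assumed link; NULL for the root */
	unsigned long nr_group_requests; /* configured per-group limit */
};

/*
 * Return 1 if new_limit could be stored in iocg without exceeding the
 * limit of any ancestor, 0 otherwise. A write handler could use this to
 * reject an oversized value.
 */
static int demo_limit_fits_parent(const struct demo_iocg *iocg,
				  unsigned long new_limit)
{
	const struct demo_iocg *p;

	assert(iocg != NULL);
	for (p = iocg->parent; p; p = p->parent)
		if (new_limit > p->nr_group_requests)
			return 0;
	return 1;
}

/*
 * Alternative policy: accept any value on write, but clamp at lookup time
 * to the smallest limit found on the path from iocg up to the root.
 */
static unsigned long demo_effective_limit(const struct demo_iocg *iocg)
{
	unsigned long limit;

	assert(iocg != NULL);
	limit = iocg->nr_group_requests;
	for (iocg = iocg->parent; iocg; iocg = iocg->parent)
		if (iocg->nr_group_requests < limit)
			limit = iocg->nr_group_requests;
	return limit;
}
```

Rejecting on write keeps the stored values consistent but has to be
re-checked when a parent is lowered later; clamping on lookup avoids that
at the cost of walking the hierarchy on each call.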

Thanks
Vivek

> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
>  block/blk-core.c     |   23 ++++++++++--
>  block/blk-settings.c |    1 -
>  block/blk-sysfs.c    |   43 -----------------------
>  block/elevator-fq.c  |   94 ++++++++++++++++++++++++++++++++++++++++++++++---
>  block/elevator-fq.h  |    4 ++
>  5 files changed, 111 insertions(+), 54 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 79fe6a9..7010b76 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -722,13 +722,20 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
>  static void __freed_request(struct request_queue *q, int sync,
>  					struct request_list *rl)
>  {
> +	struct io_group *iog;
> +	unsigned long nr_group_requests;
> +
>  	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
>  		blk_clear_queue_congested(q, sync);
>  
>  	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
>  		blk_clear_queue_full(q, sync);
>  
> -	if (rl->count[sync] + 1 <= q->nr_group_requests) {
> +	iog = rl_iog(rl);
> +
> +	nr_group_requests = get_group_requests(q, iog);
> +
> +	if (nr_group_requests && rl->count[sync] + 1 <= nr_group_requests) {
>  		if (waitqueue_active(&rl->wait[sync]))
>  			wake_up(&rl->wait[sync]);
>  	}
> @@ -828,6 +835,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	const bool is_sync = rw_is_sync(rw_flags) != 0;
>  	int may_queue, priv;
>  	int sleep_on_global = 0;
> +	struct io_group *iog;
> +	unsigned long nr_group_requests;
>  
>  	may_queue = elv_may_queue(q, rw_flags);
>  	if (may_queue == ELV_MQUEUE_NO)
> @@ -843,7 +852,12 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
>  		blk_set_queue_full(q, is_sync);
>  
> -	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
> +	iog = rl_iog(rl);
> +
> +	nr_group_requests = get_group_requests(q, iog);
> +
> +	if (nr_group_requests &&
> +	    rl->count[is_sync]+1 >= nr_group_requests) {
>  		ioc = current_io_context(GFP_ATOMIC, q->node);
>  		/*
>  		 * The queue request descriptor group will fill after this
> @@ -852,7 +866,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  		 * This process will be allowed to complete a batch of
>  		 * requests, others will be blocked.
>  		 */
> -		if (rl->count[is_sync] <= q->nr_group_requests)
> +		if (rl->count[is_sync] <= nr_group_requests)
>  			ioc_set_batching(q, ioc);
>  		else {
>  			if (may_queue != ELV_MQUEUE_MUST
> @@ -898,7 +912,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	 * from per group request list
>  	 */
>  
> -	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
> +	if (nr_group_requests &&
> +	    rl->count[is_sync] >= (3 * nr_group_requests / 2))
>  		goto out;
>  
>  	rl->starved[is_sync] = 0;
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 78b8aec..bd582a7 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -148,7 +148,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>  	 * set defaults
>  	 */
>  	q->nr_requests = BLKDEV_MAX_RQ;
> -	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>  
>  	q->make_request_fn = mfn;
>  	blk_queue_dma_alignment(q, 511);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 92b9f25..706d852 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -78,40 +78,8 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
>  	return ret;
>  }
>  #ifdef CONFIG_GROUP_IOSCHED
> -static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
> -{
> -	return queue_var_show(q->nr_group_requests, (page));
> -}
> -
>  extern void elv_io_group_congestion_threshold(struct request_queue *q,
>  					      struct io_group *iog);
> -
> -static ssize_t
> -queue_group_requests_store(struct request_queue *q, const char *page,
> -					size_t count)
> -{
> -	struct hlist_node *n;
> -	struct io_group *iog;
> -	struct elv_fq_data *efqd;
> -	unsigned long nr;
> -	int ret = queue_var_store(&nr, page, count);
> -
> -	if (nr < BLKDEV_MIN_RQ)
> -		nr = BLKDEV_MIN_RQ;
> -
> -	spin_lock_irq(q->queue_lock);
> -
> -	q->nr_group_requests = nr;
> -
> -	efqd = &q->elevator->efqd;
> -
> -	hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
> -		elv_io_group_congestion_threshold(q, iog);
> -	}
> -
> -	spin_unlock_irq(q->queue_lock);
> -	return ret;
> -}
>  #endif
>  
>  static ssize_t queue_ra_show(struct request_queue *q, char *page)
> @@ -278,14 +246,6 @@ static struct queue_sysfs_entry queue_requests_entry = {
>  	.store = queue_requests_store,
>  };
>  
> -#ifdef CONFIG_GROUP_IOSCHED
> -static struct queue_sysfs_entry queue_group_requests_entry = {
> -	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
> -	.show = queue_group_requests_show,
> -	.store = queue_group_requests_store,
> -};
> -#endif
> -
>  static struct queue_sysfs_entry queue_ra_entry = {
>  	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>  	.show = queue_ra_show,
> @@ -360,9 +320,6 @@ static struct queue_sysfs_entry queue_iostats_entry = {
>  
>  static struct attribute *default_attrs[] = {
>  	&queue_requests_entry.attr,
> -#ifdef CONFIG_GROUP_IOSCHED
> -	&queue_group_requests_entry.attr,
> -#endif
>  	&queue_ra_entry.attr,
>  	&queue_max_hw_sectors_entry.attr,
>  	&queue_max_sectors_entry.attr,
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 29392e7..bfb0210 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -59,6 +59,35 @@ elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
>  #define for_each_entity_safe(entity, parent) \
>  	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
>  
> +unsigned short get_group_requests(struct request_queue *q,
> +				  struct io_group *iog)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct io_cgroup *iocg;
> +	unsigned long nr_group_requests;
> +
> +	if (!iog)
> +		return q->nr_requests;
> +
> +	rcu_read_lock();
> +
> +	if (!iog->iocg_id) {
> +		nr_group_requests = 0;
> +		goto out;
> +	}
> +
> +	css = css_lookup(&io_subsys, iog->iocg_id);
> +	if (!css) {
> +		nr_group_requests = 0;
> +		goto out;
> +	}
> +
> +	iocg = container_of(css, struct io_cgroup, css);
> +	nr_group_requests = iocg->nr_group_requests;
> +out:
> +	rcu_read_unlock();
> +	return nr_group_requests;
> +}
>  
>  static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
>  						 int extract);
> @@ -1257,14 +1286,17 @@ void elv_io_group_congestion_threshold(struct request_queue *q,
>  						struct io_group *iog)
>  {
>  	int nr;
> +	unsigned long nr_group_requests;
>  
> -	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
> -	if (nr > q->nr_group_requests)
> -		nr = q->nr_group_requests;
> +	nr_group_requests = get_group_requests(q, iog);
> +
> +	nr = nr_group_requests - (nr_group_requests / 8) + 1;
> +	if (nr > nr_group_requests)
> +		nr = nr_group_requests;
>  	iog->nr_congestion_on = nr;
>  
> -	nr = q->nr_group_requests - (q->nr_group_requests / 8)
> -			- (q->nr_group_requests / 16) - 1;
> +	nr = nr_group_requests - (nr_group_requests / 8)
> +			- (nr_group_requests / 16) - 1;
>  	if (nr < 1)
>  		nr = 1;
>  	iog->nr_congestion_off = nr;
> @@ -1283,6 +1315,7 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>  {
>  	struct io_group *iog;
>  	int ret = 0;
> +	unsigned long nr_group_requests;
>  
>  	rcu_read_lock();
>  
> @@ -1300,10 +1333,11 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>  	}
>  
>  	ret = elv_is_iog_congested(q, iog, sync);
> +	nr_group_requests = get_group_requests(q, iog);
>  	if (ret)
>  		elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
>  			" rl.count[sync]=%d nr_group_requests=%d",
> -			ret, sync, iog->rl.count[sync], q->nr_group_requests);
> +			ret, sync, iog->rl.count[sync], nr_group_requests);
>  	rcu_read_unlock();
>  	return ret;
>  }
> @@ -1549,6 +1583,48 @@ free_buf:
>  	return ret;
>  }
>  
> +static u64 io_cgroup_nr_requests_read(struct cgroup *cgroup,
> +				       struct cftype *cftype)
> +{
> +	struct io_cgroup *iocg;
> +	u64 ret;
> +
> +	if (!cgroup_lock_live_group(cgroup))
> +		return -ENODEV;
> +
> +	iocg = cgroup_to_io_cgroup(cgroup);
> +	spin_lock_irq(&iocg->lock);
> +	ret = iocg->nr_group_requests;
> +	spin_unlock_irq(&iocg->lock);
> +
> +	cgroup_unlock();
> +
> +	return ret;
> +}
> +
> +static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
> +					struct cftype *cftype,
> +					u64 val)
> +{
> +	struct io_cgroup *iocg;
> +
> +	if (val < BLKDEV_MIN_RQ)
> +		val = BLKDEV_MIN_RQ;
> +
> +	if (!cgroup_lock_live_group(cgroup))
> +		return -ENODEV;
> +
> +	iocg = cgroup_to_io_cgroup(cgroup);
> +
> +	spin_lock_irq(&iocg->lock);
> +	iocg->nr_group_requests = (unsigned long)val;
> +	spin_unlock_irq(&iocg->lock);
> +
> +	cgroup_unlock();
> +
> +	return 0;
> +}
> +
>  #define SHOW_FUNCTION(__VAR)						\
>  static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
>  				       struct cftype *cftype)		\
> @@ -1735,6 +1811,11 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
>  
>  struct cftype bfqio_files[] = {
>  	{
> +		.name = "nr_group_requests",
> +		.read_u64 = io_cgroup_nr_requests_read,
> +		.write_u64 = io_cgroup_nr_requests_write,
> +	},
> +	{
>  		.name = "policy",
>  		.read_seq_string = io_cgroup_policy_read,
>  		.write_string = io_cgroup_policy_write,
> @@ -1790,6 +1871,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>  
>  	spin_lock_init(&iocg->lock);
>  	INIT_HLIST_HEAD(&iocg->group_data);
> +	iocg->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>  	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>  	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>  	INIT_LIST_HEAD(&iocg->policy_list);
> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
> index f089a55..df077d0 100644
> --- a/block/elevator-fq.h
> +++ b/block/elevator-fq.h
> @@ -308,6 +308,7 @@ struct io_cgroup {
>  	unsigned int weight;
>  	unsigned short ioprio_class;
>  
> +	unsigned long nr_group_requests;
>  	/* list of io_policy_node */
>  	struct list_head policy_list;
>  
> @@ -386,6 +387,9 @@ struct elv_fq_data {
>  	unsigned int fairness;
>  };
>  
> +extern unsigned short get_group_requests(struct request_queue *q,
> +					 struct io_group *iog);
> +
>  /* Logging facilities. */
>  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>  #define elv_log_ioq(efqd, ioq, fmt, args...) \
> -- 
> 1.5.4.rc3 
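
For reference, the threshold arithmetic that the quoted
elv_io_group_congestion_threshold() hunk applies to the per-group limit
can be worked through in isolation. The sketch below is a standalone
illustration, not the kernel code: struct demo_thresholds and
demo_congestion_thresholds() are hypothetical names, and plain signed
long is used throughout (an assumption; the patch mixes int and
unsigned long).

```c
#include <assert.h>

/* Hypothetical holder; in the patch these two values live in struct io_group. */
struct demo_thresholds {
	long nr_congestion_on;
	long nr_congestion_off;
};

static struct demo_thresholds demo_congestion_thresholds(long nr_group_requests)
{
	struct demo_thresholds t;
	long nr;

	assert(nr_group_requests >= 0);

	/* "On" threshold: limit minus 1/8 of it, plus one, capped at the limit. */
	nr = nr_group_requests - (nr_group_requests / 8) + 1;
	if (nr > nr_group_requests)
		nr = nr_group_requests;
	t.nr_congestion_on = nr;

	/* "Off" threshold: a further 1/16 lower, minus one, but at least 1. */
	nr = nr_group_requests - (nr_group_requests / 8)
			- (nr_group_requests / 16) - 1;
	if (nr < 1)
		nr = 1;
	t.nr_congestion_off = nr;

	return t;
}
```

With a limit of 128 this gives an "on" threshold of 113 and an "off"
threshold of 103; with a limit of 4 the cap applies and the pair is 4
and 3.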

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
       [not found]     ` <20090713160352.GA3714-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-13 21:08       ` Munehiro Ikeda
  2009-07-14  7:37         ` Gui Jianfeng
  1 sibling, 0 replies; 191+ messages in thread
From: Munehiro Ikeda @ 2009-07-13 21:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Vivek Goyal wrote, on 07/13/2009 12:03 PM:
> On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch exports a cgroup-based, per-group request limit interface
>> and removes the global one. Now we can use this interface to apply
>> different request allocation limits to different groups.
>>
>
> Thanks Gui. A few points come to mind.
>
> - You seem to be making this a per-cgroup limit that applies to all
>    devices. Different devices in the system can have different settings of
>    q->nr_requests and hence will probably want different per-group limits,
>    so we might have to make it a per-cgroup, per-device limit.

From an implementation viewpoint, I see a difficulty with a per-cgroup,
per-device limit: an io_group is allocated only when the associated device
is first used.  I guess Gui chose the per-cgroup limit on all devices
approach because of this, right?
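
One conceivable way around this, sketched below under stated assumptions,
is to keep the per-device limits in the cgroup itself, keyed by device
number, so that a limit can be stored before any io_group for that device
exists and looked up when the io_group is eventually allocated. All names
here (demo_iocg_rules, demo_set_limit, demo_get_limit,
DEMO_DEFAULT_GROUP_RQ and its value) are hypothetical and not taken from
the patch.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_DEFAULT_GROUP_RQ 64UL /* placeholder default, value assumed */
#define DEMO_MAX_RULES 8           /* fixed table keeps the sketch simple */

/* One stored rule: a device number and the limit configured for it. */
struct demo_dev_rule {
	unsigned int dev;
	unsigned long nr_group_requests;
};

/* Hypothetical per-cgroup rule table, independent of any io_group. */
struct demo_iocg_rules {
	struct demo_dev_rule rules[DEMO_MAX_RULES];
	size_t nr_rules;
};

/* Store or update the limit for dev. Returns 0 on success, -1 if full. */
static int demo_set_limit(struct demo_iocg_rules *iocg, unsigned int dev,
			  unsigned long limit)
{
	size_t i;

	assert(iocg != NULL);
	for (i = 0; i < iocg->nr_rules; i++) {
		if (iocg->rules[i].dev == dev) {
			iocg->rules[i].nr_group_requests = limit;
			return 0;
		}
	}
	if (iocg->nr_rules == DEMO_MAX_RULES)
		return -1;
	iocg->rules[iocg->nr_rules].dev = dev;
	iocg->rules[iocg->nr_rules].nr_group_requests = limit;
	iocg->nr_rules++;
	return 0;
}

/* Look up the limit for dev, falling back to the default if no rule exists. */
static unsigned long demo_get_limit(const struct demo_iocg_rules *iocg,
				    unsigned int dev)
{
	size_t i;

	assert(iocg != NULL);
	for (i = 0; i < iocg->nr_rules; i++)
		if (iocg->rules[i].dev == dev)
			return iocg->rules[i].nr_group_requests;
	return DEMO_DEFAULT_GROUP_RQ;
}
```

The lookup would be called at the point where an io_group is first set up
for a device, so devices without a rule simply get the default.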


> - There do not seem to be any checks to make sure that child cgroups
>    don't have more request descriptors allocated than their parent group.
>
> - I am re-thinking what the advantage is of also configuring request
>    descriptors through cgroups. It brings in additional complexity, and
>    the advantages should justify that. Can you think of some?
>
>    Unless we can come up with some significant advantages, I would prefer
>    to continue to use the per-group limit through the q->nr_group_requests
>    interface instead of cgroup. Once things stabilize, we can revisit it
>    and see how this interface can be improved.

I agree.  I will try to clarify through some tests whether a per-group,
per-device limit is needed (or whether its advantages outweigh the
complexity).



Thanks a lot,
Muuhh


> Thanks
> Vivek
>
>> Signed-off-by: Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> ---
>>   block/blk-core.c     |   23 ++++++++++--
>>   block/blk-settings.c |    1 -
>>   block/blk-sysfs.c    |   43 -----------------------
>>   block/elevator-fq.c  |   94 ++++++++++++++++++++++++++++++++++++++++++++++---
>>   block/elevator-fq.h  |    4 ++
>>   5 files changed, 111 insertions(+), 54 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 79fe6a9..7010b76 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -722,13 +722,20 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
>>   static void __freed_request(struct request_queue *q, int sync,
>>   					struct request_list *rl)
>>   {
>> +	struct io_group *iog;
>> +	unsigned long nr_group_requests;
>> +
>>   	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
>>   		blk_clear_queue_congested(q, sync);
>>
>>   	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
>>   		blk_clear_queue_full(q, sync);
>>
>> -	if (rl->count[sync] + 1 <= q->nr_group_requests) {
>> +	iog = rl_iog(rl);
>> +
>> +	nr_group_requests = get_group_requests(q, iog);
>> +
>> +	if (nr_group_requests && rl->count[sync] + 1 <= nr_group_requests) {
>>   		if (waitqueue_active(&rl->wait[sync]))
>>   			wake_up(&rl->wait[sync]);
>>   	}
>> @@ -828,6 +835,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>   	const bool is_sync = rw_is_sync(rw_flags) != 0;
>>   	int may_queue, priv;
>>   	int sleep_on_global = 0;
>> +	struct io_group *iog;
>> +	unsigned long nr_group_requests;
>>
>>   	may_queue = elv_may_queue(q, rw_flags);
>>   	if (may_queue == ELV_MQUEUE_NO)
>> @@ -843,7 +852,12 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>   	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
>>   		blk_set_queue_full(q, is_sync);
>>
>> -	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
>> +	iog = rl_iog(rl);
>> +
>> +	nr_group_requests = get_group_requests(q, iog);
>> +
>> +	if (nr_group_requests &&
>> +	    rl->count[is_sync]+1 >= nr_group_requests) {
>>   		ioc = current_io_context(GFP_ATOMIC, q->node);
>>   		/*
>>   		 * The queue request descriptor group will fill after this
>> @@ -852,7 +866,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>   		 * This process will be allowed to complete a batch of
>>   		 * requests, others will be blocked.
>>   		 */
>> -		if (rl->count[is_sync] <= q->nr_group_requests)
>> +		if (rl->count[is_sync] <= nr_group_requests)
>>   			ioc_set_batching(q, ioc);
>>   		else {
>>   			if (may_queue != ELV_MQUEUE_MUST
>> @@ -898,7 +912,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>   	 * from per group request list
>>   	 */
>>
>> -	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
>> +	if (nr_group_requests &&
>> +	    rl->count[is_sync] >= (3 * nr_group_requests / 2))
>>   		goto out;
>>
>>   	rl->starved[is_sync] = 0;
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 78b8aec..bd582a7 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -148,7 +148,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>>   	 * set defaults
>>   	 */
>>   	q->nr_requests = BLKDEV_MAX_RQ;
>> -	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>>
>>   	q->make_request_fn = mfn;
>>   	blk_queue_dma_alignment(q, 511);
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 92b9f25..706d852 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -78,40 +78,8 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
>>   	return ret;
>>   }
>>   #ifdef CONFIG_GROUP_IOSCHED
>> -static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
>> -{
>> -	return queue_var_show(q->nr_group_requests, (page));
>> -}
>> -
>>   extern void elv_io_group_congestion_threshold(struct request_queue *q,
>>   					      struct io_group *iog);
>> -
>> -static ssize_t
>> -queue_group_requests_store(struct request_queue *q, const char *page,
>> -					size_t count)
>> -{
>> -	struct hlist_node *n;
>> -	struct io_group *iog;
>> -	struct elv_fq_data *efqd;
>> -	unsigned long nr;
>> -	int ret = queue_var_store(&nr, page, count);
>> -
>> -	if (nr < BLKDEV_MIN_RQ)
>> -		nr = BLKDEV_MIN_RQ;
>> -
>> -	spin_lock_irq(q->queue_lock);
>> -
>> -	q->nr_group_requests = nr;
>> -
>> -	efqd = &q->elevator->efqd;
>> -
>> -	hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
>> -		elv_io_group_congestion_threshold(q, iog);
>> -	}
>> -
>> -	spin_unlock_irq(q->queue_lock);
>> -	return ret;
>> -}
>>   #endif
>>
>>   static ssize_t queue_ra_show(struct request_queue *q, char *page)
>> @@ -278,14 +246,6 @@ static struct queue_sysfs_entry queue_requests_entry = {
>>   	.store = queue_requests_store,
>>   };
>>
>> -#ifdef CONFIG_GROUP_IOSCHED
>> -static struct queue_sysfs_entry queue_group_requests_entry = {
>> -	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
>> -	.show = queue_group_requests_show,
>> -	.store = queue_group_requests_store,
>> -};
>> -#endif
>> -
>>   static struct queue_sysfs_entry queue_ra_entry = {
>>   	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>>   	.show = queue_ra_show,
>> @@ -360,9 +320,6 @@ static struct queue_sysfs_entry queue_iostats_entry = {
>>
>>   static struct attribute *default_attrs[] = {
>>   	&queue_requests_entry.attr,
>> -#ifdef CONFIG_GROUP_IOSCHED
>> -	&queue_group_requests_entry.attr,
>> -#endif
>>   	&queue_ra_entry.attr,
>>   	&queue_max_hw_sectors_entry.attr,
>>   	&queue_max_sectors_entry.attr,
>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>> index 29392e7..bfb0210 100644
>> --- a/block/elevator-fq.c
>> +++ b/block/elevator-fq.c
>> @@ -59,6 +59,35 @@ elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
>>   #define for_each_entity_safe(entity, parent) \
>>   	for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
>>
>> +unsigned short get_group_requests(struct request_queue *q,
>> +				  struct io_group *iog)
>> +{
>> +	struct cgroup_subsys_state *css;
>> +	struct io_cgroup *iocg;
>> +	unsigned long nr_group_requests;
>> +
>> +	if (!iog)
>> +		return q->nr_requests;
>> +
>> +	rcu_read_lock();
>> +
>> +	if (!iog->iocg_id) {
>> +		nr_group_requests = 0;
>> +		goto out;
>> +	}
>> +
>> +	css = css_lookup(&io_subsys, iog->iocg_id);
>> +	if (!css) {
>> +		nr_group_requests = 0;
>> +		goto out;
>> +	}
>> +
>> +	iocg = container_of(css, struct io_cgroup, css);
>> +	nr_group_requests = iocg->nr_group_requests;
>> +out:
>> +	rcu_read_unlock();
>> +	return nr_group_requests;
>> +}
>>
>>   static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
>>   						 int extract);
>> @@ -1257,14 +1286,17 @@ void elv_io_group_congestion_threshold(struct request_queue *q,
>>   						struct io_group *iog)
>>   {
>>   	int nr;
>> +	unsigned long nr_group_requests;
>>
>> -	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
>> -	if (nr > q->nr_group_requests)
>> -		nr = q->nr_group_requests;
>> +	nr_group_requests = get_group_requests(q, iog);
>> +
>> +	nr = nr_group_requests - (nr_group_requests / 8) + 1;
>> +	if (nr > nr_group_requests)
>> +		nr = nr_group_requests;
>>   	iog->nr_congestion_on = nr;
>>
>> -	nr = q->nr_group_requests - (q->nr_group_requests / 8)
>> -			- (q->nr_group_requests / 16) - 1;
>> +	nr = nr_group_requests - (nr_group_requests / 8)
>> +			- (nr_group_requests / 16) - 1;
>>   	if (nr < 1)
>>   		nr = 1;
>>   	iog->nr_congestion_off = nr;
>> @@ -1283,6 +1315,7 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>>   {
>>   	struct io_group *iog;
>>   	int ret = 0;
>> +	unsigned long nr_group_requests;
>>
>>   	rcu_read_lock();
>>
>> @@ -1300,10 +1333,11 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>>   	}
>>
>>   	ret = elv_is_iog_congested(q, iog, sync);
>> +	nr_group_requests = get_group_requests(q, iog);
>>   	if (ret)
>>   		elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
>>   			" rl.count[sync]=%d nr_group_requests=%d",
>> -			ret, sync, iog->rl.count[sync], q->nr_group_requests);
>> +			ret, sync, iog->rl.count[sync], nr_group_requests);
>>   	rcu_read_unlock();
>>   	return ret;
>>   }
>> @@ -1549,6 +1583,48 @@ free_buf:
>>   	return ret;
>>   }
>>
>> +static u64 io_cgroup_nr_requests_read(struct cgroup *cgroup,
>> +				       struct cftype *cftype)
>> +{
>> +	struct io_cgroup *iocg;
>> +	u64 ret;
>> +
>> +	if (!cgroup_lock_live_group(cgroup))
>> +		return -ENODEV;
>> +
>> +	iocg = cgroup_to_io_cgroup(cgroup);
>> +	spin_lock_irq(&iocg->lock);
>> +	ret = iocg->nr_group_requests;
>> +	spin_unlock_irq(&iocg->lock);
>> +
>> +	cgroup_unlock();
>> +
>> +	return ret;
>> +}
>> +
>> +static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
>> +					struct cftype *cftype,
>> +					u64 val)
>> +{
>> +	struct io_cgroup *iocg;
>> +
>> +	if (val < BLKDEV_MIN_RQ)
>> +		val = BLKDEV_MIN_RQ;
>> +
>> +	if (!cgroup_lock_live_group(cgroup))
>> +		return -ENODEV;
>> +
>> +	iocg = cgroup_to_io_cgroup(cgroup);
>> +
>> +	spin_lock_irq(&iocg->lock);
>> +	iocg->nr_group_requests = (unsigned long)val;
>> +	spin_unlock_irq(&iocg->lock);
>> +
>> +	cgroup_unlock();
>> +
>> +	return 0;
>> +}
>> +
>>   #define SHOW_FUNCTION(__VAR)						\
>>   static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
>>   				       struct cftype *cftype)		\
>> @@ -1735,6 +1811,11 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
>>
>>   struct cftype bfqio_files[] = {
>>   	{
>> +		.name = "nr_group_requests",
>> +		.read_u64 = io_cgroup_nr_requests_read,
>> +		.write_u64 = io_cgroup_nr_requests_write,
>> +	},
>> +	{
>>   		.name = "policy",
>>   		.read_seq_string = io_cgroup_policy_read,
>>   		.write_string = io_cgroup_policy_write,
>> @@ -1790,6 +1871,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>>
>>   	spin_lock_init(&iocg->lock);
>>   	INIT_HLIST_HEAD(&iocg->group_data);
>> +	iocg->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>>   	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>>   	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>>   	INIT_LIST_HEAD(&iocg->policy_list);
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index f089a55..df077d0 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -308,6 +308,7 @@ struct io_cgroup {
>>   	unsigned int weight;
>>   	unsigned short ioprio_class;
>>
>> +	unsigned long nr_group_requests;
>>   	/* list of io_policy_node */
>>   	struct list_head policy_list;
>>
>> @@ -386,6 +387,9 @@ struct elv_fq_data {
>>   	unsigned int fairness;
>>   };
>>
>> +extern unsigned short get_group_requests(struct request_queue *q,
>> +					 struct io_group *iog);
>> +
>>   /* Logging facilities. */
>>   #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>>   #define elv_log_ioq(efqd, ioq, fmt, args...) \
>> --
>> 1.5.4.rc3
>

-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
  2009-07-13 16:03     ` Vivek Goyal
@ 2009-07-13 21:08       ` Munehiro Ikeda
  -1 siblings, 0 replies; 191+ messages in thread
From: Munehiro Ikeda @ 2009-07-13 21:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Gui Jianfeng, linux-kernel, containers, dm-devel, jens.axboe,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, s-uchida, taka, jmoyer, dhaval, balbir, righi.andrea,
	jbaron, agk, snitzer, akpm, peterz

Vivek Goyal wrote, on 07/13/2009 12:03 PM:
> On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch exports a cgroup-based, per-group request limit interface
>> and removes the global one. Now we can use this interface to apply
>> different request allocation limits to different groups.
>>
>
> Thanks Gui. A few points come to mind.
>
> - You seem to be making this a per-cgroup limit that applies to all
>    devices. Different devices in the system can have different settings of
>    q->nr_requests and hence will probably want different per-group limits,
>    so we might have to make it a per-cgroup, per-device limit.

From an implementation viewpoint, I see a difficulty with a per-cgroup,
per-device limit: an io_group is allocated only when the associated device
is first used.  I guess Gui chose the per-cgroup limit on all devices
approach because of this, right?


> - There do not seem to be any checks to make sure that child cgroups
>    don't have more request descriptors allocated than their parent group.
>
> - I am re-thinking what the advantage is of also configuring request
>    descriptors through cgroups. It brings in additional complexity, and
>    the advantages should justify that. Can you think of some?
>
>    Unless we can come up with some significant advantages, I would prefer
>    to continue to use the per-group limit through the q->nr_group_requests
>    interface instead of cgroup. Once things stabilize, we can revisit it
>    and see how this interface can be improved.

I agree.  I will try to clarify through some tests whether a per-group,
per-device limit is needed (or whether its advantages outweigh the
complexity).



Thanks a lot,
Muuhh


> Thanks
> Vivek
>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>>   block/blk-core.c     |   23 ++++++++++--
>>   block/blk-settings.c |    1 -
>>   block/blk-sysfs.c    |   43 -----------------------
>>   block/elevator-fq.c  |   94 ++++++++++++++++++++++++++++++++++++++++++++++---
>>   block/elevator-fq.h  |    4 ++
>>   5 files changed, 111 insertions(+), 54 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 79fe6a9..7010b76 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -722,13 +722,20 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
>>   static void __freed_request(struct request_queue *q, int sync,
>>   					struct request_list *rl)
>>   {
>> +	struct io_group *iog;
>> +	unsigned long nr_group_requests;
>> +
>>   	if (q->rq_data.count[sync] < queue_congestion_off_threshold(q))
>>   		blk_clear_queue_congested(q, sync);
>>
>>   	if (q->rq_data.count[sync] + 1 <= q->nr_requests)
>>   		blk_clear_queue_full(q, sync);
>>
>> -	if (rl->count[sync] + 1 <= q->nr_group_requests) {
>> +	iog = rl_iog(rl);
>> +
>> +	nr_group_requests = get_group_requests(q, iog);
>> +
>> +	if (nr_group_requests && rl->count[sync] + 1 <= nr_group_requests) {
>>   		if (waitqueue_active(&rl->wait[sync]))
>>   			wake_up(&rl->wait[sync]);
>>   	}
>> @@ -828,6 +835,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>   	const bool is_sync = rw_is_sync(rw_flags) != 0;
>>   	int may_queue, priv;
>>   	int sleep_on_global = 0;
>> +	struct io_group *iog;
>> +	unsigned long nr_group_requests;
>>
>>   	may_queue = elv_may_queue(q, rw_flags);
>>   	if (may_queue == ELV_MQUEUE_NO)
>> @@ -843,7 +852,12 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>   	if (q->rq_data.count[is_sync]+1 >= q->nr_requests)
>>   		blk_set_queue_full(q, is_sync);
>>
>> -	if (rl->count[is_sync]+1 >= q->nr_group_requests) {
>> +	iog = rl_iog(rl);
>> +
>> +	nr_group_requests = get_group_requests(q, iog);
>> +
>> +	if (nr_group_requests &&
>> +	    rl->count[is_sync]+1 >= nr_group_requests) {
>>   		ioc = current_io_context(GFP_ATOMIC, q->node);
>>   		/*
>>   		 * The queue request descriptor group will fill after this
>> @@ -852,7 +866,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>   		 * This process will be allowed to complete a batch of
>>   		 * requests, others will be blocked.
>>   		 */
>> -		if (rl->count[is_sync] <= q->nr_group_requests)
>> +		if (rl->count[is_sync] <= nr_group_requests)
>>   			ioc_set_batching(q, ioc);
>>   		else {
>>   			if (may_queue != ELV_MQUEUE_MUST
>> @@ -898,7 +912,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>   	 * from per group request list
>>   	 */
>>
>> -	if (rl->count[is_sync] >= (3 * q->nr_group_requests / 2))
>> +	if (nr_group_requests &&
>> +	    rl->count[is_sync] >= (3 * nr_group_requests / 2))
>>   		goto out;
>>
>>   	rl->starved[is_sync] = 0;
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 78b8aec..bd582a7 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -148,7 +148,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>>   	 * set defaults
>>   	 */
>>   	q->nr_requests = BLKDEV_MAX_RQ;
>> -	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>>
>>   	q->make_request_fn = mfn;
>>   	blk_queue_dma_alignment(q, 511);
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 92b9f25..706d852 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -78,40 +78,8 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
>>   	return ret;
>>   }
>>   #ifdef CONFIG_GROUP_IOSCHED
>> -static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
>> -{
>> -	return queue_var_show(q->nr_group_requests, (page));
>> -}
>> -
>>   extern void elv_io_group_congestion_threshold(struct request_queue *q,
>>   					      struct io_group *iog);
>> -
>> -static ssize_t
>> -queue_group_requests_store(struct request_queue *q, const char *page,
>> -					size_t count)
>> -{
>> -	struct hlist_node *n;
>> -	struct io_group *iog;
>> -	struct elv_fq_data *efqd;
>> -	unsigned long nr;
>> -	int ret = queue_var_store(&nr, page, count);
>> -
>> -	if (nr < BLKDEV_MIN_RQ)
>> -		nr = BLKDEV_MIN_RQ;
>> -
>> -	spin_lock_irq(q->queue_lock);
>> -
>> -	q->nr_group_requests = nr;
>> -
>> -	efqd = &q->elevator->efqd;
>> -
>> -	hlist_for_each_entry(iog, n, &efqd->group_list, elv_data_node) {
>> -		elv_io_group_congestion_threshold(q, iog);
>> -	}
>> -
>> -	spin_unlock_irq(q->queue_lock);
>> -	return ret;
>> -}
>>   #endif
>>
>>   static ssize_t queue_ra_show(struct request_queue *q, char *page)
>> @@ -278,14 +246,6 @@ static struct queue_sysfs_entry queue_requests_entry = {
>>   	.store = queue_requests_store,
>>   };
>>
>> -#ifdef CONFIG_GROUP_IOSCHED
>> -static struct queue_sysfs_entry queue_group_requests_entry = {
>> -	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
>> -	.show = queue_group_requests_show,
>> -	.store = queue_group_requests_store,
>> -};
>> -#endif
>> -
>>   static struct queue_sysfs_entry queue_ra_entry = {
>>   	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>>   	.show = queue_ra_show,
>> @@ -360,9 +320,6 @@ static struct queue_sysfs_entry queue_iostats_entry = {
>>
>>   static struct attribute *default_attrs[] = {
>>   	&queue_requests_entry.attr,
>> -#ifdef CONFIG_GROUP_IOSCHED
>> -	&queue_group_requests_entry.attr,
>> -#endif
>>   	&queue_ra_entry.attr,
>>   	&queue_max_hw_sectors_entry.attr,
>>   	&queue_max_sectors_entry.attr,
>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>> index 29392e7..bfb0210 100644
>> --- a/block/elevator-fq.c
>> +++ b/block/elevator-fq.c
>> @@ -59,6 +59,35 @@ elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
>>   #define for_each_entity_safe(entity, parent) \
>>   	for (; entity&&  ({ parent = entity->parent; 1; }); entity = parent)
>>
>> +unsigned short get_group_requests(struct request_queue *q,
>> +				  struct io_group *iog)
>> +{
>> +	struct cgroup_subsys_state *css;
>> +	struct io_cgroup *iocg;
>> +	unsigned long nr_group_requests;
>> +
>> +	if (!iog)
>> +		return q->nr_requests;
>> +
>> +	rcu_read_lock();
>> +
>> +	if (!iog->iocg_id) {
>> +		nr_group_requests = 0;
>> +		goto out;
>> +	}
>> +
>> +	css = css_lookup(&io_subsys, iog->iocg_id);
>> +	if (!css) {
>> +		nr_group_requests = 0;
>> +		goto out;
>> +	}
>> +
>> +	iocg = container_of(css, struct io_cgroup, css);
>> +	nr_group_requests = iocg->nr_group_requests;
>> +out:
>> +	rcu_read_unlock();
>> +	return nr_group_requests;
>> +}
>>
>>   static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
>>   						 int extract);
>> @@ -1257,14 +1286,17 @@ void elv_io_group_congestion_threshold(struct request_queue *q,
>>   						struct io_group *iog)
>>   {
>>   	int nr;
>> +	unsigned long nr_group_requests;
>>
>> -	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
>> -	if (nr>  q->nr_group_requests)
>> -		nr = q->nr_group_requests;
>> +	nr_group_requests = get_group_requests(q, iog);
>> +
>> +	nr = nr_group_requests - (nr_group_requests / 8) + 1;
>> +	if (nr>  nr_group_requests)
>> +		nr = nr_group_requests;
>>   	iog->nr_congestion_on = nr;
>>
>> -	nr = q->nr_group_requests - (q->nr_group_requests / 8)
>> -			- (q->nr_group_requests / 16) - 1;
>> +	nr = nr_group_requests - (nr_group_requests / 8)
>> +			- (nr_group_requests / 16) - 1;
>>   	if (nr<  1)
>>   		nr = 1;
>>   	iog->nr_congestion_off = nr;
>> @@ -1283,6 +1315,7 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>>   {
>>   	struct io_group *iog;
>>   	int ret = 0;
>> +	unsigned long nr_group_requests;
>>
>>   	rcu_read_lock();
>>
>> @@ -1300,10 +1333,11 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>>   	}
>>
>>   	ret = elv_is_iog_congested(q, iog, sync);
>> +	nr_group_requests = get_group_requests(q, iog);
>>   	if (ret)
>>   		elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
>>   			" rl.count[sync]=%d nr_group_requests=%d",
>> -			ret, sync, iog->rl.count[sync], q->nr_group_requests);
>> +			ret, sync, iog->rl.count[sync], nr_group_requests);
>>   	rcu_read_unlock();
>>   	return ret;
>>   }
>> @@ -1549,6 +1583,48 @@ free_buf:
>>   	return ret;
>>   }
>>
>> +static u64 io_cgroup_nr_requests_read(struct cgroup *cgroup,
>> +				       struct cftype *cftype)
>> +{
>> +	struct io_cgroup *iocg;
>> +	u64 ret;
>> +
>> +	if (!cgroup_lock_live_group(cgroup))
>> +		return -ENODEV;
>> +
>> +	iocg = cgroup_to_io_cgroup(cgroup);
>> +	spin_lock_irq(&iocg->lock);
>> +	ret = iocg->nr_group_requests;
>> +	spin_unlock_irq(&iocg->lock);
>> +
>> +	cgroup_unlock();
>> +
>> +	return ret;
>> +}
>> +
>> +static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
>> +					struct cftype *cftype,
>> +					u64 val)
>> +{
>> +	struct io_cgroup *iocg;
>> +
>> +	if (val<  BLKDEV_MIN_RQ)
>> +		val = BLKDEV_MIN_RQ;
>> +
>> +	if (!cgroup_lock_live_group(cgroup))
>> +		return -ENODEV;
>> +
>> +	iocg = cgroup_to_io_cgroup(cgroup);
>> +
>> +	spin_lock_irq(&iocg->lock);
>> +	iocg->nr_group_requests = (unsigned long)val;
>> +	spin_unlock_irq(&iocg->lock);
>> +
>> +	cgroup_unlock();
>> +
>> +	return 0;
>> +}
>> +
>>   #define SHOW_FUNCTION(__VAR)						\
>>   static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
>>   				       struct cftype *cftype)		\
>> @@ -1735,6 +1811,11 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
>>
>>   struct cftype bfqio_files[] = {
>>   	{
>> +		.name = "nr_group_requests",
>> +		.read_u64 = io_cgroup_nr_requests_read,
>> +		.write_u64 = io_cgroup_nr_requests_write,
>> +	},
>> +	{
>>   		.name = "policy",
>>   		.read_seq_string = io_cgroup_policy_read,
>>   		.write_string = io_cgroup_policy_write,
>> @@ -1790,6 +1871,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>>
>>   	spin_lock_init(&iocg->lock);
>>   	INIT_HLIST_HEAD(&iocg->group_data);
>> +	iocg->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>>   	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>>   	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>>   	INIT_LIST_HEAD(&iocg->policy_list);
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index f089a55..df077d0 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -308,6 +308,7 @@ struct io_cgroup {
>>   	unsigned int weight;
>>   	unsigned short ioprio_class;
>>
>> +	unsigned long nr_group_requests;
>>   	/* list of io_policy_node */
>>   	struct list_head policy_list;
>>
>> @@ -386,6 +387,9 @@ struct elv_fq_data {
>>   	unsigned int fairness;
>>   };
>>
>> +extern unsigned short get_group_requests(struct request_queue *q,
>> +					 struct io_group *iog);
>> +
>>   /* Logging facilities. */
>>   #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>>   #define elv_log_ioq(efqd, ioq, fmt, args...) \
>> --
>> 1.5.4.rc3
>

-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda@ds.jp.nec.com


^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
@ 2009-07-13 21:08       ` Munehiro Ikeda
  0 siblings, 0 replies; 191+ messages in thread
From: Munehiro Ikeda @ 2009-07-13 21:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, snitzer, peterz, dm-devel, dpshah, jens.axboe, agk,
	balbir, paolo.valente, Gui Jianfeng, fernando, mikew, jmoyer,
	nauman, righi.andrea, lizf, fchecconi, akpm, containers,
	linux-kernel, s-uchida, jbaron

Vivek Goyal wrote, on 07/13/2009 12:03 PM:
> On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch exports a cgroup based per group request limits interface
>> and removes the global one. Now we can use this interface to perform
>> different request allocation limitation for different groups.
>>
>
> Thanks Gui. Few points come to mind.
>
> - You seem to be making this as per cgroup limit on all devices. I guess
>    that different devices in the system can have different settings of
>    q->nr_requests and hence will probably want different per group limit.
>    So we might have to make it per cgroup per device limit.

From an implementation point of view, I see one difficulty with a per cgroup
per device limit: an io_group is allocated only when the associated device is
first used.  I guess Gui chose the per cgroup limit on all devices approach
because of this, right?
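
To make the per cgroup per device idea concrete, here is a minimal user-space
sketch. It is not code from the patch: `iocg_dev_limit`, `iocg_limits` and
`iocg_lookup_limit` are made-up names, and the device number is just an
integer key. The assumption is that the rule is stored in the cgroup rather
than in the io_group, so it can be set before the io_group for that device
exists, and that a cgroup-wide default is used when no rule matches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device rule; nothing like this is in the posted patch. */
struct iocg_dev_limit {
	unsigned int dev;			/* device number used as the key */
	unsigned long nr_group_requests;	/* limit for this device */
	struct iocg_dev_limit *next;		/* next rule, NULL at the end */
};

/* Hypothetical stand-in for the relevant part of struct io_cgroup. */
struct iocg_limits {
	unsigned long default_nr_group_requests;	/* cgroup-wide default */
	struct iocg_dev_limit *rules;			/* per-device overrides */
};

/*
 * Return the request limit for (cgroup, device): the per-device rule if
 * one exists, otherwise the cgroup-wide default.
 */
unsigned long iocg_lookup_limit(const struct iocg_limits *l, unsigned int dev)
{
	const struct iocg_dev_limit *r;

	assert(l != NULL);

	for (r = l->rules; r; r = r->next)
		if (r->dev == dev)
			return r->nr_group_requests;

	return l->default_nr_group_requests;
}
```

The lookup returns `unsigned long` so that it matches the width of the stored
limit; a narrower return type would silently truncate large values.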


> - There does not seem to be any checks for making sure that children
>    cgroups don't have more request descriptors allocated than parent group.
>
> - I am re-thinking that what's the advantage of configuring request
>    descriptors also through cgroups. It does bring in additional complexity
>    with it and it should justify the advantages. Can you think of some?
>
>    Until and unless we can come up with some significant advantages, I will
>    prefer to continue to use per group limit through q->nr_group_requests
>    interface instead of cgroup. Once things stabilize, we can revisit it and
>    see how this interface can be improved.

I agree.  I will run some tests to clarify whether a per group per device
limit is needed (or whether its advantage outweighs the complexity).



Thanks a lot,
Muuhh


> Thanks
> Vivek
>
>> Signed-off-by: Gui Jianfeng<guijianfeng@cn.fujitsu.com>
>> ---
>>   block/blk-core.c     |   23 ++++++++++--
>>   block/blk-settings.c |    1 -
>>   block/blk-sysfs.c    |   43 -----------------------
>>   block/elevator-fq.c  |   94 ++++++++++++++++++++++++++++++++++++++++++++++---
>>   block/elevator-fq.h  |    4 ++
>>   5 files changed, 111 insertions(+), 54 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 79fe6a9..7010b76 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -722,13 +722,20 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
>>   static void __freed_request(struct request_queue *q, int sync,
>>   					struct request_list *rl)
>>   {
>> +	struct io_group *iog;
>> +	unsigned long nr_group_requests;
>> +
>>   	if (q->rq_data.count[sync]<  queue_congestion_off_threshold(q))
>>   		blk_clear_queue_congested(q, sync);
>>
>>   	if (q->rq_data.count[sync] + 1<= q->nr_requests)
>>   		blk_clear_queue_full(q, sync);
>>
>> -	if (rl->count[sync] + 1<= q->nr_group_requests) {
>> +	iog = rl_iog(rl);
>> +
>> +	nr_group_requests = get_group_requests(q, iog);
>> +
>> +	if (nr_group_requests&&  rl->count[sync] + 1<= nr_group_requests) {
>>   		if (waitqueue_active(&rl->wait[sync]))
>>   			wake_up(&rl->wait[sync]);
>>   	}
>> @@ -828,6 +835,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>   	const bool is_sync = rw_is_sync(rw_flags) != 0;
>>   	int may_queue, priv;
>>   	int sleep_on_global = 0;
>> +	struct io_group *iog;
>> +	unsigned long nr_group_requests;
>>
>>   	may_queue = elv_may_queue(q, rw_flags);
>>   	if (may_queue == ELV_MQUEUE_NO)
>> @@ -843,7 +852,12 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>   	if (q->rq_data.count[is_sync]+1>= q->nr_requests)
>>   		blk_set_queue_full(q, is_sync);
>>
>> -	if (rl->count[is_sync]+1>= q->nr_group_requests) {
>> +	iog = rl_iog(rl);
>> +
>> +	nr_group_requests = get_group_requests(q, iog);
>> +
>> +	if (nr_group_requests&&
>> +	    rl->count[is_sync]+1>= nr_group_requests) {
>>   		ioc = current_io_context(GFP_ATOMIC, q->node);
>>   		/*
>>   		 * The queue request descriptor group will fill after this
>> @@ -852,7 +866,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>   		 * This process will be allowed to complete a batch of
>>   		 * requests, others will be blocked.
>>   		 */
>> -		if (rl->count[is_sync]<= q->nr_group_requests)
>> +		if (rl->count[is_sync]<= nr_group_requests)
>>   			ioc_set_batching(q, ioc);
>>   		else {
>>   			if (may_queue != ELV_MQUEUE_MUST
>> @@ -898,7 +912,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>   	 * from per group request list
>>   	 */
>>
>> -	if (rl->count[is_sync]>= (3 * q->nr_group_requests / 2))
>> +	if (nr_group_requests&&
>> +	    rl->count[is_sync]>= (3 * nr_group_requests / 2))
>>   		goto out;
>>
>>   	rl->starved[is_sync] = 0;
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 78b8aec..bd582a7 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -148,7 +148,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>>   	 * set defaults
>>   	 */
>>   	q->nr_requests = BLKDEV_MAX_RQ;
>> -	q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>>
>>   	q->make_request_fn = mfn;
>>   	blk_queue_dma_alignment(q, 511);
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 92b9f25..706d852 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -78,40 +78,8 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
>>   	return ret;
>>   }
>>   #ifdef CONFIG_GROUP_IOSCHED
>> -static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
>> -{
>> -	return queue_var_show(q->nr_group_requests, (page));
>> -}
>> -
>>   extern void elv_io_group_congestion_threshold(struct request_queue *q,
>>   					      struct io_group *iog);
>> -
>> -static ssize_t
>> -queue_group_requests_store(struct request_queue *q, const char *page,
>> -					size_t count)
>> -{
>> -	struct hlist_node *n;
>> -	struct io_group *iog;
>> -	struct elv_fq_data *efqd;
>> -	unsigned long nr;
>> -	int ret = queue_var_store(&nr, page, count);
>> -
>> -	if (nr<  BLKDEV_MIN_RQ)
>> -		nr = BLKDEV_MIN_RQ;
>> -
>> -	spin_lock_irq(q->queue_lock);
>> -
>> -	q->nr_group_requests = nr;
>> -
>> -	efqd =&q->elevator->efqd;
>> -
>> -	hlist_for_each_entry(iog, n,&efqd->group_list, elv_data_node) {
>> -		elv_io_group_congestion_threshold(q, iog);
>> -	}
>> -
>> -	spin_unlock_irq(q->queue_lock);
>> -	return ret;
>> -}
>>   #endif
>>
>>   static ssize_t queue_ra_show(struct request_queue *q, char *page)
>> @@ -278,14 +246,6 @@ static struct queue_sysfs_entry queue_requests_entry = {
>>   	.store = queue_requests_store,
>>   };
>>
>> -#ifdef CONFIG_GROUP_IOSCHED
>> -static struct queue_sysfs_entry queue_group_requests_entry = {
>> -	.attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
>> -	.show = queue_group_requests_show,
>> -	.store = queue_group_requests_store,
>> -};
>> -#endif
>> -
>>   static struct queue_sysfs_entry queue_ra_entry = {
>>   	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>>   	.show = queue_ra_show,
>> @@ -360,9 +320,6 @@ static struct queue_sysfs_entry queue_iostats_entry = {
>>
>>   static struct attribute *default_attrs[] = {
>>   	&queue_requests_entry.attr,
>> -#ifdef CONFIG_GROUP_IOSCHED
>> -	&queue_group_requests_entry.attr,
>> -#endif
>>   	&queue_ra_entry.attr,
>>   	&queue_max_hw_sectors_entry.attr,
>>   	&queue_max_sectors_entry.attr,
>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>> index 29392e7..bfb0210 100644
>> --- a/block/elevator-fq.c
>> +++ b/block/elevator-fq.c
>> @@ -59,6 +59,35 @@ elv_release_ioq(struct elevator_queue *eq, struct io_queue **ioq_ptr);
>>   #define for_each_entity_safe(entity, parent) \
>>   	for (; entity&&  ({ parent = entity->parent; 1; }); entity = parent)
>>
>> +unsigned short get_group_requests(struct request_queue *q,
>> +				  struct io_group *iog)
>> +{
>> +	struct cgroup_subsys_state *css;
>> +	struct io_cgroup *iocg;
>> +	unsigned long nr_group_requests;
>> +
>> +	if (!iog)
>> +		return q->nr_requests;
>> +
>> +	rcu_read_lock();
>> +
>> +	if (!iog->iocg_id) {
>> +		nr_group_requests = 0;
>> +		goto out;
>> +	}
>> +
>> +	css = css_lookup(&io_subsys, iog->iocg_id);
>> +	if (!css) {
>> +		nr_group_requests = 0;
>> +		goto out;
>> +	}
>> +
>> +	iocg = container_of(css, struct io_cgroup, css);
>> +	nr_group_requests = iocg->nr_group_requests;
>> +out:
>> +	rcu_read_unlock();
>> +	return nr_group_requests;
>> +}
>>
>>   static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
>>   						 int extract);
>> @@ -1257,14 +1286,17 @@ void elv_io_group_congestion_threshold(struct request_queue *q,
>>   						struct io_group *iog)
>>   {
>>   	int nr;
>> +	unsigned long nr_group_requests;
>>
>> -	nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
>> -	if (nr>  q->nr_group_requests)
>> -		nr = q->nr_group_requests;
>> +	nr_group_requests = get_group_requests(q, iog);
>> +
>> +	nr = nr_group_requests - (nr_group_requests / 8) + 1;
>> +	if (nr>  nr_group_requests)
>> +		nr = nr_group_requests;
>>   	iog->nr_congestion_on = nr;
>>
>> -	nr = q->nr_group_requests - (q->nr_group_requests / 8)
>> -			- (q->nr_group_requests / 16) - 1;
>> +	nr = nr_group_requests - (nr_group_requests / 8)
>> +			- (nr_group_requests / 16) - 1;
>>   	if (nr<  1)
>>   		nr = 1;
>>   	iog->nr_congestion_off = nr;
>> @@ -1283,6 +1315,7 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>>   {
>>   	struct io_group *iog;
>>   	int ret = 0;
>> +	unsigned long nr_group_requests;
>>
>>   	rcu_read_lock();
>>
>> @@ -1300,10 +1333,11 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>>   	}
>>
>>   	ret = elv_is_iog_congested(q, iog, sync);
>> +	nr_group_requests = get_group_requests(q, iog);
>>   	if (ret)
>>   		elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
>>   			" rl.count[sync]=%d nr_group_requests=%d",
>> -			ret, sync, iog->rl.count[sync], q->nr_group_requests);
>> +			ret, sync, iog->rl.count[sync], nr_group_requests);
>>   	rcu_read_unlock();
>>   	return ret;
>>   }
>> @@ -1549,6 +1583,48 @@ free_buf:
>>   	return ret;
>>   }
>>
>> +static u64 io_cgroup_nr_requests_read(struct cgroup *cgroup,
>> +				       struct cftype *cftype)
>> +{
>> +	struct io_cgroup *iocg;
>> +	u64 ret;
>> +
>> +	if (!cgroup_lock_live_group(cgroup))
>> +		return -ENODEV;
>> +
>> +	iocg = cgroup_to_io_cgroup(cgroup);
>> +	spin_lock_irq(&iocg->lock);
>> +	ret = iocg->nr_group_requests;
>> +	spin_unlock_irq(&iocg->lock);
>> +
>> +	cgroup_unlock();
>> +
>> +	return ret;
>> +}
>> +
>> +static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
>> +					struct cftype *cftype,
>> +					u64 val)
>> +{
>> +	struct io_cgroup *iocg;
>> +
>> +	if (val<  BLKDEV_MIN_RQ)
>> +		val = BLKDEV_MIN_RQ;
>> +
>> +	if (!cgroup_lock_live_group(cgroup))
>> +		return -ENODEV;
>> +
>> +	iocg = cgroup_to_io_cgroup(cgroup);
>> +
>> +	spin_lock_irq(&iocg->lock);
>> +	iocg->nr_group_requests = (unsigned long)val;
>> +	spin_unlock_irq(&iocg->lock);
>> +
>> +	cgroup_unlock();
>> +
>> +	return 0;
>> +}
>> +
>>   #define SHOW_FUNCTION(__VAR)						\
>>   static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,		\
>>   				       struct cftype *cftype)		\
>> @@ -1735,6 +1811,11 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
>>
>>   struct cftype bfqio_files[] = {
>>   	{
>> +		.name = "nr_group_requests",
>> +		.read_u64 = io_cgroup_nr_requests_read,
>> +		.write_u64 = io_cgroup_nr_requests_write,
>> +	},
>> +	{
>>   		.name = "policy",
>>   		.read_seq_string = io_cgroup_policy_read,
>>   		.write_string = io_cgroup_policy_write,
>> @@ -1790,6 +1871,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>>
>>   	spin_lock_init(&iocg->lock);
>>   	INIT_HLIST_HEAD(&iocg->group_data);
>> +	iocg->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>>   	iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>>   	iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>>   	INIT_LIST_HEAD(&iocg->policy_list);
>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>> index f089a55..df077d0 100644
>> --- a/block/elevator-fq.h
>> +++ b/block/elevator-fq.h
>> @@ -308,6 +308,7 @@ struct io_cgroup {
>>   	unsigned int weight;
>>   	unsigned short ioprio_class;
>>
>> +	unsigned long nr_group_requests;
>>   	/* list of io_policy_node */
>>   	struct list_head policy_list;
>>
>> @@ -386,6 +387,9 @@ struct elv_fq_data {
>>   	unsigned int fairness;
>>   };
>>
>> +extern unsigned short get_group_requests(struct request_queue *q,
>> +					 struct io_group *iog);
>> +
>>   /* Logging facilities. */
>>   #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>>   #define elv_log_ioq(efqd, ioq, fmt, args...) \
>> --
>> 1.5.4.rc3
>

-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda@ds.jp.nec.com

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
  2009-07-13 16:03     ` Vivek Goyal
@ 2009-07-14  7:37         ` Gui Jianfeng
  -1 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-14  7:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, snitzer, dm-devel, jens.axboe, agk, balbir, paolo.valente,
	fernando, jmoyer, fchecconi, akpm, linux-kernel, righi.andrea,
	containers

Vivek Goyal wrote:
> On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> This patch exports a cgroup based per group request limits interface
>> and removes the global one. Now we can use this interface to perform
>> different request allocation limitation for different groups. 
>>
> 
> Thanks Gui. Few points come to mind.
> 
> - You seem to be making this as per cgroup limit on all devices. I guess
>   that different devices in the system can have different settings of
>   q->nr_requests and hence will probably want different per group limit.
>   So we might have to make it per cgroup per device limit.

  Yes, a per cgroup per device limit seems more reasonable. I'll see what
  I can do.

> 
> - There does not seem to be any checks for making sure that children
>   cgroups don't have more request descriptors allocated than parent group.

  Do we really need to make it hierarchical? IMHO, maintaining this limit
  for each cgroup independently is enough.
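
For comparison, a hierarchical variant could be sketched roughly as below.
This is only an illustration in user space: `limit_node` and
`effective_limit` are made-up names and nothing like this is in the posted
patch. It assumes the simplest possible rule, namely that a group's effective
limit is the minimum of its own setting and the settings of all its
ancestors, so a child can never get more request descriptors than its parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node in a cgroup-like hierarchy. */
struct limit_node {
	unsigned long nr_group_requests;	/* value configured for this group */
	const struct limit_node *parent;	/* NULL for the root group */
};

/*
 * Walk up to the root and return the smallest limit seen on the way,
 * including the group's own setting.
 */
unsigned long effective_limit(const struct limit_node *n)
{
	unsigned long limit;

	assert(n != NULL);

	limit = n->nr_group_requests;
	for (n = n->parent; n; n = n->parent)
		if (n->nr_group_requests < limit)
			limit = n->nr_group_requests;

	return limit;
}
```

With independent limits, as proposed in the patch, no such walk is needed and
each group simply uses its own value.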

> 
> - I am re-thinking that what's the advantage of configuring request
>   descriptors also through cgroups. It does bring in additional complexity
>   with it and it should justify the advantages. Can you think of some?

  I'll try, but at least this feature lets us set limits more
  accurately. :)
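
As an illustration of what a per group value changes, the sketch below
repeats the arithmetic of elv_io_group_congestion_threshold() from the patch
in user space. Only the two formulas are taken from the patch; the struct and
function names are made up, and plain `long` is used instead of the kernel
types to keep the example self-contained.

```c
#include <assert.h>

/* Hypothetical container for the two thresholds of one group. */
struct group_thresholds {
	long on;	/* corresponds to iog->nr_congestion_on */
	long off;	/* corresponds to iog->nr_congestion_off */
};

/*
 * Derive the congestion on/off thresholds from a group's request limit,
 * using the same formulas and clamping as the patch.
 */
struct group_thresholds group_congestion_thresholds(long nr_group_requests)
{
	struct group_thresholds t;
	long nr;

	assert(nr_group_requests >= 0);

	/* "on" threshold: limit minus one eighth, plus one, capped at the limit */
	nr = nr_group_requests - (nr_group_requests / 8) + 1;
	if (nr > nr_group_requests)
		nr = nr_group_requests;
	t.on = nr;

	/* "off" threshold: a further sixteenth lower, minus one, at least 1 */
	nr = nr_group_requests - (nr_group_requests / 8)
			- (nr_group_requests / 16) - 1;
	if (nr < 1)
		nr = 1;
	t.off = nr;

	return t;
}
```

For a limit of 128 this gives thresholds of 113 (on) and 103 (off); for a
limit of 4 the values are clamped to 4 and 3. Two groups with different
limits therefore get different thresholds on the same queue.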

> 
>   Until and unless we can come up with some significant advantages, I will
>   prefer to continue to use per group limit through q->nr_group_requests
>   interface instead of cgroup. Once things stabilize, we can revisit it and
>   see how this interface can be improved.

  I agree.

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
       [not found]       ` <4A5BA238.3030902-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
@ 2009-07-14  7:45         ` Gui Jianfeng
  0 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-14  7:45 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: dhaval, snitzer, dm-devel, jens.axboe, agk, balbir, paolo.valente,
	fernando, jmoyer, righi.andrea, fchecconi, akpm, linux-kernel,
	containers

Munehiro Ikeda wrote:
> Vivek Goyal wrote, on 07/13/2009 12:03 PM:
>> On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
>>> Hi Vivek,
>>>
>>> This patch exports a cgroup based per group request limits interface
>>> and removes the global one. Now we can use this interface to perform
>>> different request allocation limitation for different groups.
>>>
>>
>> Thanks Gui. Few points come to mind.
>>
>> - You seem to be making this as per cgroup limit on all devices. I guess
>>    that different devices in the system can have different settings of
>>    q->nr_requests and hence will probably want different per group limit.
>>    So we might have to make it per cgroup per device limit.
> 
> From an implementation point of view, I see one difficulty with a per cgroup
> per device limit: an io_group is allocated only when the associated device is
> first used.  I guess Gui chose the per cgroup limit on all devices approach
> because of this, right?

  Yes, I chose this solution for simplicity; the code gets more complicated
  with a per cgroup per device limit. But a per cgroup per device limit does
  seem more reasonable.

> 
> 
>> - There does not seem to be any checks for making sure that children
>>    cgroups don't have more request descriptors allocated than parent group.
>>
>> - I am re-thinking that what's the advantage of configuring request
>>    descriptors also through cgroups. It does bring in additional complexity
>>    with it and it should justify the advantages. Can you think of some?
>>
>>    Until and unless we can come up with some significant advantages, I will
>>    prefer to continue to use per group limit through q->nr_group_requests
>>    interface instead of cgroup. Once things stabilize, we can revisit it and
>>    see how this interface can be improved.
> 
> I agree.  I will run some tests to clarify whether a per group per device
> limit is needed (or whether its advantage outweighs the complexity).

  Great, hope to hear from you soon.

-- 
Regards
Gui Jianfeng

> 
> 
> 
> Thanks a lot,
> Muuhh
> 
> 
>> Thanks
>> Vivek
>>
>>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>>> ---
>>>   block/blk-core.c     |   23 ++++++++++--
>>>   block/blk-settings.c |    1 -
>>>   block/blk-sysfs.c    |   43 -----------------------
>>>   block/elevator-fq.c  |   94 ++++++++++++++++++++++++++++++++++++++++++++++---
>>>   block/elevator-fq.h  |    4 ++
>>>   5 files changed, 111 insertions(+), 54 deletions(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 79fe6a9..7010b76 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -722,13 +722,20 @@ static void ioc_set_batching(struct request_queue *q, struct io_context *ioc)
>>>   static void __freed_request(struct request_queue *q, int sync,
>>>                       struct request_list *rl)
>>>   {
>>> +    struct io_group *iog;
>>> +    unsigned long nr_group_requests;
>>> +
>>>       if (q->rq_data.count[sync]<  queue_congestion_off_threshold(q))
>>>           blk_clear_queue_congested(q, sync);
>>>
>>>       if (q->rq_data.count[sync] + 1<= q->nr_requests)
>>>           blk_clear_queue_full(q, sync);
>>>
>>> -    if (rl->count[sync] + 1<= q->nr_group_requests) {
>>> +    iog = rl_iog(rl);
>>> +
>>> +    nr_group_requests = get_group_requests(q, iog);
>>> +
>>> +    if (nr_group_requests&&  rl->count[sync] + 1<= nr_group_requests) {
>>>           if (waitqueue_active(&rl->wait[sync]))
>>>               wake_up(&rl->wait[sync]);
>>>       }
>>> @@ -828,6 +835,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>>       const bool is_sync = rw_is_sync(rw_flags) != 0;
>>>       int may_queue, priv;
>>>       int sleep_on_global = 0;
>>> +    struct io_group *iog;
>>> +    unsigned long nr_group_requests;
>>>
>>>       may_queue = elv_may_queue(q, rw_flags);
>>>       if (may_queue == ELV_MQUEUE_NO)
>>> @@ -843,7 +852,12 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>>       if (q->rq_data.count[is_sync]+1>= q->nr_requests)
>>>           blk_set_queue_full(q, is_sync);
>>>
>>> -    if (rl->count[is_sync]+1>= q->nr_group_requests) {
>>> +    iog = rl_iog(rl);
>>> +
>>> +    nr_group_requests = get_group_requests(q, iog);
>>> +
>>> +    if (nr_group_requests&&
>>> +        rl->count[is_sync]+1>= nr_group_requests) {
>>>           ioc = current_io_context(GFP_ATOMIC, q->node);
>>>           /*
>>>            * The queue request descriptor group will fill after this
>>> @@ -852,7 +866,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>>            * This process will be allowed to complete a batch of
>>>            * requests, others will be blocked.
>>>            */
>>> -        if (rl->count[is_sync]<= q->nr_group_requests)
>>> +        if (rl->count[is_sync]<= nr_group_requests)
>>>               ioc_set_batching(q, ioc);
>>>           else {
>>>               if (may_queue != ELV_MQUEUE_MUST
>>> @@ -898,7 +912,8 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>>>        * from per group request list
>>>        */
>>>
>>> -    if (rl->count[is_sync]>= (3 * q->nr_group_requests / 2))
>>> +    if (nr_group_requests&&
>>> +        rl->count[is_sync]>= (3 * nr_group_requests / 2))
>>>           goto out;
>>>
>>>       rl->starved[is_sync] = 0;
>>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>>> index 78b8aec..bd582a7 100644
>>> --- a/block/blk-settings.c
>>> +++ b/block/blk-settings.c
>>> @@ -148,7 +148,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
>>>        * set defaults
>>>        */
>>>       q->nr_requests = BLKDEV_MAX_RQ;
>>> -    q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>>>
>>>       q->make_request_fn = mfn;
>>>       blk_queue_dma_alignment(q, 511);
>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>> index 92b9f25..706d852 100644
>>> --- a/block/blk-sysfs.c
>>> +++ b/block/blk-sysfs.c
>>> @@ -78,40 +78,8 @@ queue_requests_store(struct request_queue *q, const char *page, size_t count)
>>>       return ret;
>>>   }
>>>   #ifdef CONFIG_GROUP_IOSCHED
>>> -static ssize_t queue_group_requests_show(struct request_queue *q, char *page)
>>> -{
>>> -    return queue_var_show(q->nr_group_requests, (page));
>>> -}
>>> -
>>>   extern void elv_io_group_congestion_threshold(struct request_queue *q,
>>>                             struct io_group *iog);
>>> -
>>> -static ssize_t
>>> -queue_group_requests_store(struct request_queue *q, const char *page,
>>> -                    size_t count)
>>> -{
>>> -    struct hlist_node *n;
>>> -    struct io_group *iog;
>>> -    struct elv_fq_data *efqd;
>>> -    unsigned long nr;
>>> -    int ret = queue_var_store(&nr, page, count);
>>> -
>>> -    if (nr<  BLKDEV_MIN_RQ)
>>> -        nr = BLKDEV_MIN_RQ;
>>> -
>>> -    spin_lock_irq(q->queue_lock);
>>> -
>>> -    q->nr_group_requests = nr;
>>> -
>>> -    efqd =&q->elevator->efqd;
>>> -
>>> -    hlist_for_each_entry(iog, n,&efqd->group_list, elv_data_node) {
>>> -        elv_io_group_congestion_threshold(q, iog);
>>> -    }
>>> -
>>> -    spin_unlock_irq(q->queue_lock);
>>> -    return ret;
>>> -}
>>>   #endif
>>>
>>>   static ssize_t queue_ra_show(struct request_queue *q, char *page)
>>> @@ -278,14 +246,6 @@ static struct queue_sysfs_entry
>>> queue_requests_entry = {
>>>       .store = queue_requests_store,
>>>   };
>>>
>>> -#ifdef CONFIG_GROUP_IOSCHED
>>> -static struct queue_sysfs_entry queue_group_requests_entry = {
>>> -    .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
>>> -    .show = queue_group_requests_show,
>>> -    .store = queue_group_requests_store,
>>> -};
>>> -#endif
>>> -
>>>   static struct queue_sysfs_entry queue_ra_entry = {
>>>       .attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>>>       .show = queue_ra_show,
>>> @@ -360,9 +320,6 @@ static struct queue_sysfs_entry
>>> queue_iostats_entry = {
>>>
>>>   static struct attribute *default_attrs[] = {
>>>       &queue_requests_entry.attr,
>>> -#ifdef CONFIG_GROUP_IOSCHED
>>> -    &queue_group_requests_entry.attr,
>>> -#endif
>>>       &queue_ra_entry.attr,
>>>       &queue_max_hw_sectors_entry.attr,
>>>       &queue_max_sectors_entry.attr,
>>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>>> index 29392e7..bfb0210 100644
>>> --- a/block/elevator-fq.c
>>> +++ b/block/elevator-fq.c
>>> @@ -59,6 +59,35 @@ elv_release_ioq(struct elevator_queue *eq, struct
>>> io_queue **ioq_ptr);
>>>   #define for_each_entity_safe(entity, parent) \
>>>       for (; entity&&  ({ parent = entity->parent; 1; }); entity =
>>> parent)
>>>
>>> +unsigned short get_group_requests(struct request_queue *q,
>>> +                  struct io_group *iog)
>>> +{
>>> +    struct cgroup_subsys_state *css;
>>> +    struct io_cgroup *iocg;
>>> +    unsigned long nr_group_requests;
>>> +
>>> +    if (!iog)
>>> +        return q->nr_requests;
>>> +
>>> +    rcu_read_lock();
>>> +
>>> +    if (!iog->iocg_id) {
>>> +        nr_group_requests = 0;
>>> +        goto out;
>>> +    }
>>> +
>>> +    css = css_lookup(&io_subsys, iog->iocg_id);
>>> +    if (!css) {
>>> +        nr_group_requests = 0;
>>> +        goto out;
>>> +    }
>>> +
>>> +    iocg = container_of(css, struct io_cgroup, css);
>>> +    nr_group_requests = iocg->nr_group_requests;
>>> +out:
>>> +    rcu_read_unlock();
>>> +    return nr_group_requests;
>>> +}
>>>
>>>   static struct io_entity *bfq_lookup_next_entity(struct
>>> io_sched_data *sd,
>>>                            int extract);
>>> @@ -1257,14 +1286,17 @@ void elv_io_group_congestion_threshold(struct
>>> request_queue *q,
>>>                           struct io_group *iog)
>>>   {
>>>       int nr;
>>> +    unsigned long nr_group_requests;
>>>
>>> -    nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
>>> -    if (nr>  q->nr_group_requests)
>>> -        nr = q->nr_group_requests;
>>> +    nr_group_requests = get_group_requests(q, iog);
>>> +
>>> +    nr = nr_group_requests - (nr_group_requests / 8) + 1;
>>> +    if (nr>  nr_group_requests)
>>> +        nr = nr_group_requests;
>>>       iog->nr_congestion_on = nr;
>>>
>>> -    nr = q->nr_group_requests - (q->nr_group_requests / 8)
>>> -            - (q->nr_group_requests / 16) - 1;
>>> +    nr = nr_group_requests - (nr_group_requests / 8)
>>> +            - (nr_group_requests / 16) - 1;
>>>       if (nr<  1)
>>>           nr = 1;
>>>       iog->nr_congestion_off = nr;
>>> @@ -1283,6 +1315,7 @@ int elv_io_group_congested(struct request_queue
>>> *q, struct page *page, int sync)
>>>   {
>>>       struct io_group *iog;
>>>       int ret = 0;
>>> +    unsigned long nr_group_requests;
>>>
>>>       rcu_read_lock();
>>>
>>> @@ -1300,10 +1333,11 @@ int elv_io_group_congested(struct
>>> request_queue *q, struct page *page, int sync)
>>>       }
>>>
>>>       ret = elv_is_iog_congested(q, iog, sync);
>>> +    nr_group_requests = get_group_requests(q, iog);
>>>       if (ret)
>>>           elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d
>>> sync=%d"
>>>               " rl.count[sync]=%d nr_group_requests=%d",
>>> -            ret, sync, iog->rl.count[sync], q->nr_group_requests);
>>> +            ret, sync, iog->rl.count[sync], nr_group_requests);
>>>       rcu_read_unlock();
>>>       return ret;
>>>   }
>>> @@ -1549,6 +1583,48 @@ free_buf:
>>>       return ret;
>>>   }
>>>
>>> +static u64 io_cgroup_nr_requests_read(struct cgroup *cgroup,
>>> +                       struct cftype *cftype)
>>> +{
>>> +    struct io_cgroup *iocg;
>>> +    u64 ret;
>>> +
>>> +    if (!cgroup_lock_live_group(cgroup))
>>> +        return -ENODEV;
>>> +
>>> +    iocg = cgroup_to_io_cgroup(cgroup);
>>> +    spin_lock_irq(&iocg->lock);
>>> +    ret = iocg->nr_group_requests;
>>> +    spin_unlock_irq(&iocg->lock);
>>> +
>>> +    cgroup_unlock();
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
>>> +                    struct cftype *cftype,
>>> +                    u64 val)
>>> +{
>>> +    struct io_cgroup *iocg;
>>> +
>>> +    if (val<  BLKDEV_MIN_RQ)
>>> +        val = BLKDEV_MIN_RQ;
>>> +
>>> +    if (!cgroup_lock_live_group(cgroup))
>>> +        return -ENODEV;
>>> +
>>> +    iocg = cgroup_to_io_cgroup(cgroup);
>>> +
>>> +    spin_lock_irq(&iocg->lock);
>>> +    iocg->nr_group_requests = (unsigned long)val;
>>> +    spin_unlock_irq(&iocg->lock);
>>> +
>>> +    cgroup_unlock();
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   #define SHOW_FUNCTION(__VAR)                        \
>>>   static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,        \
>>>                          struct cftype *cftype)        \
>>> @@ -1735,6 +1811,11 @@ static int io_cgroup_disk_dequeue_read(struct
>>> cgroup *cgroup,
>>>
>>>   struct cftype bfqio_files[] = {
>>>       {
>>> +        .name = "nr_group_requests",
>>> +        .read_u64 = io_cgroup_nr_requests_read,
>>> +        .write_u64 = io_cgroup_nr_requests_write,
>>> +    },
>>> +    {
>>>           .name = "policy",
>>>           .read_seq_string = io_cgroup_policy_read,
>>>           .write_string = io_cgroup_policy_write,
>>> @@ -1790,6 +1871,7 @@ static struct cgroup_subsys_state
>>> *iocg_create(struct cgroup_subsys *subsys,
>>>
>>>       spin_lock_init(&iocg->lock);
>>>       INIT_HLIST_HEAD(&iocg->group_data);
>>> +    iocg->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>>>       iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>>>       iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>>>       INIT_LIST_HEAD(&iocg->policy_list);
>>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>>> index f089a55..df077d0 100644
>>> --- a/block/elevator-fq.h
>>> +++ b/block/elevator-fq.h
>>> @@ -308,6 +308,7 @@ struct io_cgroup {
>>>       unsigned int weight;
>>>       unsigned short ioprio_class;
>>>
>>> +    unsigned long nr_group_requests;
>>>       /* list of io_policy_node */
>>>       struct list_head policy_list;
>>>
>>> @@ -386,6 +387,9 @@ struct elv_fq_data {
>>>       unsigned int fairness;
>>>   };
>>>
>>> +extern unsigned short get_group_requests(struct request_queue *q,
>>> +                     struct io_group *iog);
>>> +
>>>   /* Logging facilities. */
>>>   #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>>>   #define elv_log_ioq(efqd, ioq, fmt, args...) \
>>> -- 
>>> 1.5.4.rc3
>>
> 

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
  2009-07-13 21:08       ` Munehiro Ikeda
@ 2009-07-14  7:45         ` Gui Jianfeng
  -1 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-14  7:45 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: Vivek Goyal, linux-kernel, containers, dm-devel, jens.axboe,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, s-uchida, taka, jmoyer, dhaval, balbir, righi.andrea,
	jbaron, agk, snitzer, akpm, peterz

Munehiro Ikeda wrote:
> Vivek Goyal wrote, on 07/13/2009 12:03 PM:
>> On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
>>> Hi Vivek,
>>>
>>> This patch exports a cgroup based per group request limit interface
>>> and removes the global one. Now we can use this interface to apply
>>> different request allocation limits to different groups.
>>>
>>
>> Thanks Gui. Few points come to mind.
>>
>> - You seem to be making this as per cgroup limit on all devices. I guess
>>    that different devices in the system can have different settings of
>>    q->nr_requests and hence will probably want different per group limit.
>>    So we might have to make it per cgroup per device limit.
> 
> From the viewpoint of implementation, I see a difficulty in implementing a
> per cgroup per device limit, arising from the fact that io_group is allocated
> when the associated device is first used.  I guess Gui chose the per cgroup
> limit on all devices approach because of this, right?

  Yes, I chose this solution for simplicity; the code gets complicated with a
  per cgroup per device limit. But a per cgroup per device limit does seem
  more reasonable.
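
  To make the trade-off concrete, here is a minimal user-space sketch of what
  a per cgroup per device limit could look like: a per-device rule wins when
  one exists, otherwise the cgroup-wide value applies. This is only an
  illustration under assumptions. The names dev_limit, cgroup_limits and
  lookup_group_limit are hypothetical and do not appear in the patch, and
  locking/RCU is left out entirely.

```c
#include <stddef.h>

/* Hypothetical per-device override; not a structure from the patch. */
struct dev_limit {
	unsigned int dev;                 /* device number the rule applies to */
	unsigned long nr_group_requests;  /* limit for this cgroup on that device */
	struct dev_limit *next;
};

/* Hypothetical stand-in for the per-cgroup state. */
struct cgroup_limits {
	unsigned long default_nr_group_requests; /* cgroup-wide value */
	struct dev_limit *overrides;             /* optional per-device rules */
};

/*
 * Return the limit for one device: a matching per-device rule wins,
 * otherwise the cgroup-wide default applies.
 */
unsigned long lookup_group_limit(const struct cgroup_limits *cg,
				 unsigned int dev)
{
	const struct dev_limit *d;

	for (d = cg->overrides; d != NULL; d = d->next)
		if (d->dev == dev)
			return d->nr_group_requests;

	return cg->default_nr_group_requests;
}
```

  With a cgroup-wide value of 128 and a single rule of 32 for one device, the
  lookup returns 32 for that device and 128 for every other one. The extra
  list walk and its lifetime handling are the kind of complexity I meant above.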

> 
> 
>> - There does not seem to be any checks for making sure that children
>>    cgroups don't have more request descriptors allocated than parent group.
>>
>> - I am re-thinking what the advantage is of configuring request
>>    descriptors also through cgroups. It does bring in additional complexity
>>    with it and the advantages should justify that. Can you think of some?
>>
>>    Until and unless we can come up with some significant advantages, I will
>>    prefer to continue to use per group limit through q->nr_group_requests
>>    interface instead of cgroup. Once things stabilize, we can revisit it and
>>    see how this interface can be improved.
> 
> I agree.  I will try to clarify through some tests whether a per group per
> device limit is needed or not (or whether its advantage outweighs the
> complexity).

  Great, hope to hear from you soon.
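
  For picking values in such tests, it may help to see how a group's limit
  maps to the congestion thresholds. The sketch below mirrors, in user space,
  the arithmetic of elv_io_group_congestion_threshold() in the quoted patch.
  The function names congestion_on and congestion_off are made up for
  illustration; only the formulas come from the patch.

```c
/*
 * User-space sketch of the threshold arithmetic in
 * elv_io_group_congestion_threshold() from the quoted patch.
 * Function names are illustrative only.
 */

/* Value the patch stores in iog->nr_congestion_on, capped at the limit. */
unsigned long congestion_on(unsigned long limit)
{
	unsigned long nr = limit - limit / 8 + 1;

	return nr > limit ? limit : nr;
}

/* Value the patch stores in iog->nr_congestion_off, never below 1. */
unsigned long congestion_off(unsigned long limit)
{
	unsigned long sub = limit / 8 + limit / 16 + 1;

	/* Clamp to 1, as the patch does when the result falls below 1. */
	return limit > sub ? limit - sub : 1;
}
```

  For a limit of 128 this gives 113 and 103; for a limit of 4 it gives 4 and 3.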

-- 
Regards
Gui Jianfeng

> 
> 
> 
> Thanks a lot,
> Muuhh
> 
> 
>> Thanks
>> Vivek
>>
>>> Signed-off-by: Gui Jianfeng<guijianfeng@cn.fujitsu.com>
>>> ---
>>>   block/blk-core.c     |   23 ++++++++++--
>>>   block/blk-settings.c |    1 -
>>>   block/blk-sysfs.c    |   43 -----------------------
>>>   block/elevator-fq.c  |   94
>>> ++++++++++++++++++++++++++++++++++++++++++++++---
>>>   block/elevator-fq.h  |    4 ++
>>>   5 files changed, 111 insertions(+), 54 deletions(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 79fe6a9..7010b76 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -722,13 +722,20 @@ static void ioc_set_batching(struct
>>> request_queue *q, struct io_context *ioc)
>>>   static void __freed_request(struct request_queue *q, int sync,
>>>                       struct request_list *rl)
>>>   {
>>> +    struct io_group *iog;
>>> +    unsigned long nr_group_requests;
>>> +
>>>       if (q->rq_data.count[sync]<  queue_congestion_off_threshold(q))
>>>           blk_clear_queue_congested(q, sync);
>>>
>>>       if (q->rq_data.count[sync] + 1<= q->nr_requests)
>>>           blk_clear_queue_full(q, sync);
>>>
>>> -    if (rl->count[sync] + 1<= q->nr_group_requests) {
>>> +    iog = rl_iog(rl);
>>> +
>>> +    nr_group_requests = get_group_requests(q, iog);
>>> +
>>> +    if (nr_group_requests&&  rl->count[sync] + 1<= nr_group_requests) {
>>>           if (waitqueue_active(&rl->wait[sync]))
>>>               wake_up(&rl->wait[sync]);
>>>       }
>>> @@ -828,6 +835,8 @@ static struct request *get_request(struct
>>> request_queue *q, int rw_flags,
>>>       const bool is_sync = rw_is_sync(rw_flags) != 0;
>>>       int may_queue, priv;
>>>       int sleep_on_global = 0;
>>> +    struct io_group *iog;
>>> +    unsigned long nr_group_requests;
>>>
>>>       may_queue = elv_may_queue(q, rw_flags);
>>>       if (may_queue == ELV_MQUEUE_NO)
>>> @@ -843,7 +852,12 @@ static struct request *get_request(struct
>>> request_queue *q, int rw_flags,
>>>       if (q->rq_data.count[is_sync]+1>= q->nr_requests)
>>>           blk_set_queue_full(q, is_sync);
>>>
>>> -    if (rl->count[is_sync]+1>= q->nr_group_requests) {
>>> +    iog = rl_iog(rl);
>>> +
>>> +    nr_group_requests = get_group_requests(q, iog);
>>> +
>>> +    if (nr_group_requests&&
>>> +        rl->count[is_sync]+1>= nr_group_requests) {
>>>           ioc = current_io_context(GFP_ATOMIC, q->node);
>>>           /*
>>>            * The queue request descriptor group will fill after this
>>> @@ -852,7 +866,7 @@ static struct request *get_request(struct
>>> request_queue *q, int rw_flags,
>>>            * This process will be allowed to complete a batch of
>>>            * requests, others will be blocked.
>>>            */
>>> -        if (rl->count[is_sync]<= q->nr_group_requests)
>>> +        if (rl->count[is_sync]<= nr_group_requests)
>>>               ioc_set_batching(q, ioc);
>>>           else {
>>>               if (may_queue != ELV_MQUEUE_MUST
>>> @@ -898,7 +912,8 @@ static struct request *get_request(struct
>>> request_queue *q, int rw_flags,
>>>        * from per group request list
>>>        */
>>>
>>> -    if (rl->count[is_sync]>= (3 * q->nr_group_requests / 2))
>>> +    if (nr_group_requests&&
>>> +        rl->count[is_sync]>= (3 * nr_group_requests / 2))
>>>           goto out;
>>>
>>>       rl->starved[is_sync] = 0;
>>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>>> index 78b8aec..bd582a7 100644
>>> --- a/block/blk-settings.c
>>> +++ b/block/blk-settings.c
>>> @@ -148,7 +148,6 @@ void blk_queue_make_request(struct request_queue
>>> *q, make_request_fn *mfn)
>>>        * set defaults
>>>        */
>>>       q->nr_requests = BLKDEV_MAX_RQ;
>>> -    q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>>>
>>>       q->make_request_fn = mfn;
>>>       blk_queue_dma_alignment(q, 511);
>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>> index 92b9f25..706d852 100644
>>> --- a/block/blk-sysfs.c
>>> +++ b/block/blk-sysfs.c
>>> @@ -78,40 +78,8 @@ queue_requests_store(struct request_queue *q,
>>> const char *page, size_t count)
>>>       return ret;
>>>   }
>>>   #ifdef CONFIG_GROUP_IOSCHED
>>> -static ssize_t queue_group_requests_show(struct request_queue *q,
>>> char *page)
>>> -{
>>> -    return queue_var_show(q->nr_group_requests, (page));
>>> -}
>>> -
>>>   extern void elv_io_group_congestion_threshold(struct request_queue *q,
>>>                             struct io_group *iog);
>>> -
>>> -static ssize_t
>>> -queue_group_requests_store(struct request_queue *q, const char *page,
>>> -                    size_t count)
>>> -{
>>> -    struct hlist_node *n;
>>> -    struct io_group *iog;
>>> -    struct elv_fq_data *efqd;
>>> -    unsigned long nr;
>>> -    int ret = queue_var_store(&nr, page, count);
>>> -
>>> -    if (nr<  BLKDEV_MIN_RQ)
>>> -        nr = BLKDEV_MIN_RQ;
>>> -
>>> -    spin_lock_irq(q->queue_lock);
>>> -
>>> -    q->nr_group_requests = nr;
>>> -
>>> -    efqd =&q->elevator->efqd;
>>> -
>>> -    hlist_for_each_entry(iog, n,&efqd->group_list, elv_data_node) {
>>> -        elv_io_group_congestion_threshold(q, iog);
>>> -    }
>>> -
>>> -    spin_unlock_irq(q->queue_lock);
>>> -    return ret;
>>> -}
>>>   #endif
>>>
>>>   static ssize_t queue_ra_show(struct request_queue *q, char *page)
>>> @@ -278,14 +246,6 @@ static struct queue_sysfs_entry
>>> queue_requests_entry = {
>>>       .store = queue_requests_store,
>>>   };
>>>
>>> -#ifdef CONFIG_GROUP_IOSCHED
>>> -static struct queue_sysfs_entry queue_group_requests_entry = {
>>> -    .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
>>> -    .show = queue_group_requests_show,
>>> -    .store = queue_group_requests_store,
>>> -};
>>> -#endif
>>> -
>>>   static struct queue_sysfs_entry queue_ra_entry = {
>>>       .attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>>>       .show = queue_ra_show,
>>> @@ -360,9 +320,6 @@ static struct queue_sysfs_entry
>>> queue_iostats_entry = {
>>>
>>>   static struct attribute *default_attrs[] = {
>>>       &queue_requests_entry.attr,
>>> -#ifdef CONFIG_GROUP_IOSCHED
>>> -    &queue_group_requests_entry.attr,
>>> -#endif
>>>       &queue_ra_entry.attr,
>>>       &queue_max_hw_sectors_entry.attr,
>>>       &queue_max_sectors_entry.attr,
>>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>>> index 29392e7..bfb0210 100644
>>> --- a/block/elevator-fq.c
>>> +++ b/block/elevator-fq.c
>>> @@ -59,6 +59,35 @@ elv_release_ioq(struct elevator_queue *eq, struct
>>> io_queue **ioq_ptr);
>>>   #define for_each_entity_safe(entity, parent) \
>>>       for (; entity&&  ({ parent = entity->parent; 1; }); entity =
>>> parent)
>>>
>>> +unsigned short get_group_requests(struct request_queue *q,
>>> +                  struct io_group *iog)
>>> +{
>>> +    struct cgroup_subsys_state *css;
>>> +    struct io_cgroup *iocg;
>>> +    unsigned long nr_group_requests;
>>> +
>>> +    if (!iog)
>>> +        return q->nr_requests;
>>> +
>>> +    rcu_read_lock();
>>> +
>>> +    if (!iog->iocg_id) {
>>> +        nr_group_requests = 0;
>>> +        goto out;
>>> +    }
>>> +
>>> +    css = css_lookup(&io_subsys, iog->iocg_id);
>>> +    if (!css) {
>>> +        nr_group_requests = 0;
>>> +        goto out;
>>> +    }
>>> +
>>> +    iocg = container_of(css, struct io_cgroup, css);
>>> +    nr_group_requests = iocg->nr_group_requests;
>>> +out:
>>> +    rcu_read_unlock();
>>> +    return nr_group_requests;
>>> +}
>>>
>>>   static struct io_entity *bfq_lookup_next_entity(struct
>>> io_sched_data *sd,
>>>                            int extract);
>>> @@ -1257,14 +1286,17 @@ void elv_io_group_congestion_threshold(struct
>>> request_queue *q,
>>>                           struct io_group *iog)
>>>   {
>>>       int nr;
>>> +    unsigned long nr_group_requests;
>>>
>>> -    nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
>>> -    if (nr>  q->nr_group_requests)
>>> -        nr = q->nr_group_requests;
>>> +    nr_group_requests = get_group_requests(q, iog);
>>> +
>>> +    nr = nr_group_requests - (nr_group_requests / 8) + 1;
>>> +    if (nr>  nr_group_requests)
>>> +        nr = nr_group_requests;
>>>       iog->nr_congestion_on = nr;
>>>
>>> -    nr = q->nr_group_requests - (q->nr_group_requests / 8)
>>> -            - (q->nr_group_requests / 16) - 1;
>>> +    nr = nr_group_requests - (nr_group_requests / 8)
>>> +            - (nr_group_requests / 16) - 1;
>>>       if (nr<  1)
>>>           nr = 1;
>>>       iog->nr_congestion_off = nr;
>>> @@ -1283,6 +1315,7 @@ int elv_io_group_congested(struct request_queue
>>> *q, struct page *page, int sync)
>>>   {
>>>       struct io_group *iog;
>>>       int ret = 0;
>>> +    unsigned long nr_group_requests;
>>>
>>>       rcu_read_lock();
>>>
>>> @@ -1300,10 +1333,11 @@ int elv_io_group_congested(struct
>>> request_queue *q, struct page *page, int sync)
>>>       }
>>>
>>>       ret = elv_is_iog_congested(q, iog, sync);
>>> +    nr_group_requests = get_group_requests(q, iog);
>>>       if (ret)
>>>           elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d
>>> sync=%d"
>>>               " rl.count[sync]=%d nr_group_requests=%d",
>>> -            ret, sync, iog->rl.count[sync], q->nr_group_requests);
>>> +            ret, sync, iog->rl.count[sync], nr_group_requests);
>>>       rcu_read_unlock();
>>>       return ret;
>>>   }
>>> @@ -1549,6 +1583,48 @@ free_buf:
>>>       return ret;
>>>   }
>>>
>>> +static u64 io_cgroup_nr_requests_read(struct cgroup *cgroup,
>>> +                       struct cftype *cftype)
>>> +{
>>> +    struct io_cgroup *iocg;
>>> +    u64 ret;
>>> +
>>> +    if (!cgroup_lock_live_group(cgroup))
>>> +        return -ENODEV;
>>> +
>>> +    iocg = cgroup_to_io_cgroup(cgroup);
>>> +    spin_lock_irq(&iocg->lock);
>>> +    ret = iocg->nr_group_requests;
>>> +    spin_unlock_irq(&iocg->lock);
>>> +
>>> +    cgroup_unlock();
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
>>> +                    struct cftype *cftype,
>>> +                    u64 val)
>>> +{
>>> +    struct io_cgroup *iocg;
>>> +
>>> +    if (val<  BLKDEV_MIN_RQ)
>>> +        val = BLKDEV_MIN_RQ;
>>> +
>>> +    if (!cgroup_lock_live_group(cgroup))
>>> +        return -ENODEV;
>>> +
>>> +    iocg = cgroup_to_io_cgroup(cgroup);
>>> +
>>> +    spin_lock_irq(&iocg->lock);
>>> +    iocg->nr_group_requests = (unsigned long)val;
>>> +    spin_unlock_irq(&iocg->lock);
>>> +
>>> +    cgroup_unlock();
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   #define SHOW_FUNCTION(__VAR)                        \
>>>   static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,        \
>>>                          struct cftype *cftype)        \
>>> @@ -1735,6 +1811,11 @@ static int io_cgroup_disk_dequeue_read(struct
>>> cgroup *cgroup,
>>>
>>>   struct cftype bfqio_files[] = {
>>>       {
>>> +        .name = "nr_group_requests",
>>> +        .read_u64 = io_cgroup_nr_requests_read,
>>> +        .write_u64 = io_cgroup_nr_requests_write,
>>> +    },
>>> +    {
>>>           .name = "policy",
>>>           .read_seq_string = io_cgroup_policy_read,
>>>           .write_string = io_cgroup_policy_write,
>>> @@ -1790,6 +1871,7 @@ static struct cgroup_subsys_state
>>> *iocg_create(struct cgroup_subsys *subsys,
>>>
>>>       spin_lock_init(&iocg->lock);
>>>       INIT_HLIST_HEAD(&iocg->group_data);
>>> +    iocg->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>>>       iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>>>       iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>>>       INIT_LIST_HEAD(&iocg->policy_list);
>>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>>> index f089a55..df077d0 100644
>>> --- a/block/elevator-fq.h
>>> +++ b/block/elevator-fq.h
>>> @@ -308,6 +308,7 @@ struct io_cgroup {
>>>       unsigned int weight;
>>>       unsigned short ioprio_class;
>>>
>>> +    unsigned long nr_group_requests;
>>>       /* list of io_policy_node */
>>>       struct list_head policy_list;
>>>
>>> @@ -386,6 +387,9 @@ struct elv_fq_data {
>>>       unsigned int fairness;
>>>   };
>>>
>>> +extern unsigned short get_group_requests(struct request_queue *q,
>>> +                     struct io_group *iog);
>>> +
>>>   /* Logging facilities. */
>>>   #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>>>   #define elv_log_ioq(efqd, ioq, fmt, args...) \
>>> -- 
>>> 1.5.4.rc3
>>
> 



^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
@ 2009-07-14  7:45         ` Gui Jianfeng
  0 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-14  7:45 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: dhaval, snitzer, peterz, dm-devel, dpshah, jens.axboe, agk,
	balbir, paolo.valente, fernando, mikew, jmoyer, nauman,
	Vivek Goyal, righi.andrea, lizf, fchecconi, akpm, jbaron,
	linux-kernel, s-uchida, containers

Munehiro Ikeda wrote:
> Vivek Goyal wrote, on 07/13/2009 12:03 PM:
>> On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
>>> Hi Vivek,
>>>
>>> This patch exports a cgroup based per group request limit interface
>>> and removes the global one. Now we can use this interface to apply
>>> different request allocation limits to different groups.
>>>
>>
>> Thanks Gui. Few points come to mind.
>>
>> - You seem to be making this as per cgroup limit on all devices. I guess
>>    that different devices in the system can have different settings of
>>    q->nr_requests and hence will probably want different per group limit.
>>    So we might have to make it per cgroup per device limit.
> 
> From the viewpoint of implementation, I see a difficulty in implementing a
> per cgroup per device limit, arising from the fact that io_group is allocated
> when the associated device is first used.  I guess Gui chose the per cgroup
> limit on all devices approach because of this, right?

  Yes, I chose this solution for simplicity; the code gets complicated with a
  per cgroup per device limit. But a per cgroup per device limit does seem
  more reasonable.

> 
> 
>> - There does not seem to be any checks for making sure that children
>>    cgroups don't have more request descriptors allocated than parent group.
>>
>> - I am re-thinking what the advantage is of configuring request
>>    descriptors also through cgroups. It does bring in additional complexity
>>    with it and the advantages should justify that. Can you think of some?
>>
>>    Until and unless we can come up with some significant advantages, I will
>>    prefer to continue to use per group limit through q->nr_group_requests
>>    interface instead of cgroup. Once things stabilize, we can revisit it and
>>    see how this interface can be improved.
> 
> I agree.  I will try to clarify through some tests whether a per group per
> device limit is needed or not (or whether its advantage outweighs the
> complexity).

  Great, hope to hear from you soon.

-- 
Regards
Gui Jianfeng

> 
> 
> 
> Thanks a lot,
> Muuhh
> 
> 
>> Thanks
>> Vivek
>>
>>> Signed-off-by: Gui Jianfeng<guijianfeng@cn.fujitsu.com>
>>> ---
>>>   block/blk-core.c     |   23 ++++++++++--
>>>   block/blk-settings.c |    1 -
>>>   block/blk-sysfs.c    |   43 -----------------------
>>>   block/elevator-fq.c  |   94
>>> ++++++++++++++++++++++++++++++++++++++++++++++---
>>>   block/elevator-fq.h  |    4 ++
>>>   5 files changed, 111 insertions(+), 54 deletions(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 79fe6a9..7010b76 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -722,13 +722,20 @@ static void ioc_set_batching(struct
>>> request_queue *q, struct io_context *ioc)
>>>   static void __freed_request(struct request_queue *q, int sync,
>>>                       struct request_list *rl)
>>>   {
>>> +    struct io_group *iog;
>>> +    unsigned long nr_group_requests;
>>> +
>>>       if (q->rq_data.count[sync]<  queue_congestion_off_threshold(q))
>>>           blk_clear_queue_congested(q, sync);
>>>
>>>       if (q->rq_data.count[sync] + 1<= q->nr_requests)
>>>           blk_clear_queue_full(q, sync);
>>>
>>> -    if (rl->count[sync] + 1<= q->nr_group_requests) {
>>> +    iog = rl_iog(rl);
>>> +
>>> +    nr_group_requests = get_group_requests(q, iog);
>>> +
>>> +    if (nr_group_requests&&  rl->count[sync] + 1<= nr_group_requests) {
>>>           if (waitqueue_active(&rl->wait[sync]))
>>>               wake_up(&rl->wait[sync]);
>>>       }
>>> @@ -828,6 +835,8 @@ static struct request *get_request(struct
>>> request_queue *q, int rw_flags,
>>>       const bool is_sync = rw_is_sync(rw_flags) != 0;
>>>       int may_queue, priv;
>>>       int sleep_on_global = 0;
>>> +    struct io_group *iog;
>>> +    unsigned long nr_group_requests;
>>>
>>>       may_queue = elv_may_queue(q, rw_flags);
>>>       if (may_queue == ELV_MQUEUE_NO)
>>> @@ -843,7 +852,12 @@ static struct request *get_request(struct
>>> request_queue *q, int rw_flags,
>>>       if (q->rq_data.count[is_sync]+1>= q->nr_requests)
>>>           blk_set_queue_full(q, is_sync);
>>>
>>> -    if (rl->count[is_sync]+1>= q->nr_group_requests) {
>>> +    iog = rl_iog(rl);
>>> +
>>> +    nr_group_requests = get_group_requests(q, iog);
>>> +
>>> +    if (nr_group_requests&&
>>> +        rl->count[is_sync]+1>= nr_group_requests) {
>>>           ioc = current_io_context(GFP_ATOMIC, q->node);
>>>           /*
>>>            * The queue request descriptor group will fill after this
>>> @@ -852,7 +866,7 @@ static struct request *get_request(struct
>>> request_queue *q, int rw_flags,
>>>            * This process will be allowed to complete a batch of
>>>            * requests, others will be blocked.
>>>            */
>>> -        if (rl->count[is_sync]<= q->nr_group_requests)
>>> +        if (rl->count[is_sync]<= nr_group_requests)
>>>               ioc_set_batching(q, ioc);
>>>           else {
>>>               if (may_queue != ELV_MQUEUE_MUST
>>> @@ -898,7 +912,8 @@ static struct request *get_request(struct
>>> request_queue *q, int rw_flags,
>>>        * from per group request list
>>>        */
>>>
>>> -    if (rl->count[is_sync]>= (3 * q->nr_group_requests / 2))
>>> +    if (nr_group_requests&&
>>> +        rl->count[is_sync]>= (3 * nr_group_requests / 2))
>>>           goto out;
>>>
>>>       rl->starved[is_sync] = 0;
>>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>>> index 78b8aec..bd582a7 100644
>>> --- a/block/blk-settings.c
>>> +++ b/block/blk-settings.c
>>> @@ -148,7 +148,6 @@ void blk_queue_make_request(struct request_queue
>>> *q, make_request_fn *mfn)
>>>        * set defaults
>>>        */
>>>       q->nr_requests = BLKDEV_MAX_RQ;
>>> -    q->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>>>
>>>       q->make_request_fn = mfn;
>>>       blk_queue_dma_alignment(q, 511);
>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>> index 92b9f25..706d852 100644
>>> --- a/block/blk-sysfs.c
>>> +++ b/block/blk-sysfs.c
>>> @@ -78,40 +78,8 @@ queue_requests_store(struct request_queue *q,
>>> const char *page, size_t count)
>>>       return ret;
>>>   }
>>>   #ifdef CONFIG_GROUP_IOSCHED
>>> -static ssize_t queue_group_requests_show(struct request_queue *q,
>>> char *page)
>>> -{
>>> -    return queue_var_show(q->nr_group_requests, (page));
>>> -}
>>> -
>>>   extern void elv_io_group_congestion_threshold(struct request_queue *q,
>>>                             struct io_group *iog);
>>> -
>>> -static ssize_t
>>> -queue_group_requests_store(struct request_queue *q, const char *page,
>>> -                    size_t count)
>>> -{
>>> -    struct hlist_node *n;
>>> -    struct io_group *iog;
>>> -    struct elv_fq_data *efqd;
>>> -    unsigned long nr;
>>> -    int ret = queue_var_store(&nr, page, count);
>>> -
>>> -    if (nr<  BLKDEV_MIN_RQ)
>>> -        nr = BLKDEV_MIN_RQ;
>>> -
>>> -    spin_lock_irq(q->queue_lock);
>>> -
>>> -    q->nr_group_requests = nr;
>>> -
>>> -    efqd =&q->elevator->efqd;
>>> -
>>> -    hlist_for_each_entry(iog, n,&efqd->group_list, elv_data_node) {
>>> -        elv_io_group_congestion_threshold(q, iog);
>>> -    }
>>> -
>>> -    spin_unlock_irq(q->queue_lock);
>>> -    return ret;
>>> -}
>>>   #endif
>>>
>>>   static ssize_t queue_ra_show(struct request_queue *q, char *page)
>>> @@ -278,14 +246,6 @@ static struct queue_sysfs_entry
>>> queue_requests_entry = {
>>>       .store = queue_requests_store,
>>>   };
>>>
>>> -#ifdef CONFIG_GROUP_IOSCHED
>>> -static struct queue_sysfs_entry queue_group_requests_entry = {
>>> -    .attr = {.name = "nr_group_requests", .mode = S_IRUGO | S_IWUSR },
>>> -    .show = queue_group_requests_show,
>>> -    .store = queue_group_requests_store,
>>> -};
>>> -#endif
>>> -
>>>   static struct queue_sysfs_entry queue_ra_entry = {
>>>       .attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
>>>       .show = queue_ra_show,
>>> @@ -360,9 +320,6 @@ static struct queue_sysfs_entry
>>> queue_iostats_entry = {
>>>
>>>   static struct attribute *default_attrs[] = {
>>>       &queue_requests_entry.attr,
>>> -#ifdef CONFIG_GROUP_IOSCHED
>>> -    &queue_group_requests_entry.attr,
>>> -#endif
>>>       &queue_ra_entry.attr,
>>>       &queue_max_hw_sectors_entry.attr,
>>>       &queue_max_sectors_entry.attr,
>>> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
>>> index 29392e7..bfb0210 100644
>>> --- a/block/elevator-fq.c
>>> +++ b/block/elevator-fq.c
>>> @@ -59,6 +59,35 @@ elv_release_ioq(struct elevator_queue *eq, struct
>>> io_queue **ioq_ptr);
>>>   #define for_each_entity_safe(entity, parent) \
>>>       for (; entity&&  ({ parent = entity->parent; 1; }); entity =
>>> parent)
>>>
>>> +unsigned short get_group_requests(struct request_queue *q,
>>> +                  struct io_group *iog)
>>> +{
>>> +    struct cgroup_subsys_state *css;
>>> +    struct io_cgroup *iocg;
>>> +    unsigned long nr_group_requests;
>>> +
>>> +    if (!iog)
>>> +        return q->nr_requests;
>>> +
>>> +    rcu_read_lock();
>>> +
>>> +    if (!iog->iocg_id) {
>>> +        nr_group_requests = 0;
>>> +        goto out;
>>> +    }
>>> +
>>> +    css = css_lookup(&io_subsys, iog->iocg_id);
>>> +    if (!css) {
>>> +        nr_group_requests = 0;
>>> +        goto out;
>>> +    }
>>> +
>>> +    iocg = container_of(css, struct io_cgroup, css);
>>> +    nr_group_requests = iocg->nr_group_requests;
>>> +out:
>>> +    rcu_read_unlock();
>>> +    return nr_group_requests;
>>> +}
>>>
>>>   static struct io_entity *bfq_lookup_next_entity(struct io_sched_data *sd,
>>>                            int extract);
>>> @@ -1257,14 +1286,17 @@ void elv_io_group_congestion_threshold(struct request_queue *q,
>>>                           struct io_group *iog)
>>>   {
>>>       int nr;
>>> +    unsigned long nr_group_requests;
>>>
>>> -    nr = q->nr_group_requests - (q->nr_group_requests / 8) + 1;
>>> -    if (nr > q->nr_group_requests)
>>> -        nr = q->nr_group_requests;
>>> +    nr_group_requests = get_group_requests(q, iog);
>>> +
>>> +    nr = nr_group_requests - (nr_group_requests / 8) + 1;
>>> +    if (nr > nr_group_requests)
>>> +        nr = nr_group_requests;
>>>       iog->nr_congestion_on = nr;
>>>
>>> -    nr = q->nr_group_requests - (q->nr_group_requests / 8)
>>> -            - (q->nr_group_requests / 16) - 1;
>>> +    nr = nr_group_requests - (nr_group_requests / 8)
>>> +            - (nr_group_requests / 16) - 1;
>>>       if (nr < 1)
>>>           nr = 1;
>>>       iog->nr_congestion_off = nr;
>>> @@ -1283,6 +1315,7 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>>>   {
>>>       struct io_group *iog;
>>>       int ret = 0;
>>> +    unsigned long nr_group_requests;
>>>
>>>       rcu_read_lock();
>>>
>>> @@ -1300,10 +1333,11 @@ int elv_io_group_congested(struct request_queue *q, struct page *page, int sync)
>>>       }
>>>
>>>       ret = elv_is_iog_congested(q, iog, sync);
>>> +    nr_group_requests = get_group_requests(q, iog);
>>>       if (ret)
>>>           elv_log_iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
>>>               " rl.count[sync]=%d nr_group_requests=%d",
>>> -            ret, sync, iog->rl.count[sync], q->nr_group_requests);
>>> +            ret, sync, iog->rl.count[sync], nr_group_requests);
>>>       rcu_read_unlock();
>>>       return ret;
>>>   }
>>> @@ -1549,6 +1583,48 @@ free_buf:
>>>       return ret;
>>>   }
>>>
>>> +static u64 io_cgroup_nr_requests_read(struct cgroup *cgroup,
>>> +                       struct cftype *cftype)
>>> +{
>>> +    struct io_cgroup *iocg;
>>> +    u64 ret;
>>> +
>>> +    if (!cgroup_lock_live_group(cgroup))
>>> +        return -ENODEV;
>>> +
>>> +    iocg = cgroup_to_io_cgroup(cgroup);
>>> +    spin_lock_irq(&iocg->lock);
>>> +    ret = iocg->nr_group_requests;
>>> +    spin_unlock_irq(&iocg->lock);
>>> +
>>> +    cgroup_unlock();
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
>>> +                    struct cftype *cftype,
>>> +                    u64 val)
>>> +{
>>> +    struct io_cgroup *iocg;
>>> +
>>> +    if (val < BLKDEV_MIN_RQ)
>>> +        val = BLKDEV_MIN_RQ;
>>> +
>>> +    if (!cgroup_lock_live_group(cgroup))
>>> +        return -ENODEV;
>>> +
>>> +    iocg = cgroup_to_io_cgroup(cgroup);
>>> +
>>> +    spin_lock_irq(&iocg->lock);
>>> +    iocg->nr_group_requests = (unsigned long)val;
>>> +    spin_unlock_irq(&iocg->lock);
>>> +
>>> +    cgroup_unlock();
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   #define SHOW_FUNCTION(__VAR)                        \
>>>   static u64 io_cgroup_##__VAR##_read(struct cgroup *cgroup,        \
>>>                          struct cftype *cftype)        \
>>> @@ -1735,6 +1811,11 @@ static int io_cgroup_disk_dequeue_read(struct cgroup *cgroup,
>>>
>>>   struct cftype bfqio_files[] = {
>>>       {
>>> +        .name = "nr_group_requests",
>>> +        .read_u64 = io_cgroup_nr_requests_read,
>>> +        .write_u64 = io_cgroup_nr_requests_write,
>>> +    },
>>> +    {
>>>           .name = "policy",
>>>           .read_seq_string = io_cgroup_policy_read,
>>>           .write_string = io_cgroup_policy_write,
>>> @@ -1790,6 +1871,7 @@ static struct cgroup_subsys_state *iocg_create(struct cgroup_subsys *subsys,
>>>
>>>       spin_lock_init(&iocg->lock);
>>>       INIT_HLIST_HEAD(&iocg->group_data);
>>> +    iocg->nr_group_requests = BLKDEV_MAX_GROUP_RQ;
>>>       iocg->weight = IO_DEFAULT_GRP_WEIGHT;
>>>       iocg->ioprio_class = IO_DEFAULT_GRP_CLASS;
>>>       INIT_LIST_HEAD(&iocg->policy_list);
>>> diff --git a/block/elevator-fq.h b/block/elevator-fq.h
>>> index f089a55..df077d0 100644
>>> --- a/block/elevator-fq.h
>>> +++ b/block/elevator-fq.h
>>> @@ -308,6 +308,7 @@ struct io_cgroup {
>>>       unsigned int weight;
>>>       unsigned short ioprio_class;
>>>
>>> +    unsigned long nr_group_requests;
>>>       /* list of io_policy_node */
>>>       struct list_head policy_list;
>>>
>>> @@ -386,6 +387,9 @@ struct elv_fq_data {
>>>       unsigned int fairness;
>>>   };
>>>
>>> +extern unsigned short get_group_requests(struct request_queue *q,
>>> +                     struct io_group *iog);
>>> +
>>>   /* Logging facilities. */
>>>   #ifdef CONFIG_DEBUG_GROUP_IOSCHED
>>>   #define elv_log_ioq(efqd, ioq, fmt, args...) \
>>> -- 
>>> 1.5.4.rc3
>>
> 
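
For reference, the threshold arithmetic in elv_io_group_congestion_threshold()
and the lower clamp in io_cgroup_nr_requests_write() from the quoted patch can
be worked through in isolation. This is only a user-space sketch: MIN_RQ_MODEL
is an assumed stand-in for BLKDEV_MIN_RQ, struct thresholds and both function
names are made up for illustration, and signed longs are used so the
subtraction cannot wrap.

```c
/* Assumed stand-in for BLKDEV_MIN_RQ; the value is an assumption. */
#define MIN_RQ_MODEL 4

/* Hypothetical container for the two per-group thresholds. */
struct thresholds {
	long on;	/* group counts as congested at or above this */
	long off;	/* congestion clears once the count drops below this */
};

/* Mirrors the lower clamp applied when the cgroup file is written. */
static unsigned long clamp_group_requests(unsigned long val)
{
	return val < MIN_RQ_MODEL ? MIN_RQ_MODEL : val;
}

/* Same arithmetic as the patch, applied to a per-group limit. */
static struct thresholds congestion_thresholds(long nr_group_requests)
{
	struct thresholds t;
	long nr;

	/* "on" is 7/8 of the limit plus one, capped at the limit itself */
	nr = nr_group_requests - (nr_group_requests / 8) + 1;
	if (nr > nr_group_requests)
		nr = nr_group_requests;
	t.on = nr;

	/* "off" sits a further 1/16 lower, but never below 1 */
	nr = nr_group_requests - (nr_group_requests / 8)
			- (nr_group_requests / 16) - 1;
	if (nr < 1)
		nr = 1;
	t.off = nr;

	return t;
}
```

With a limit of 128 this gives on = 113 and off = 103; with the minimum of 4
the "on" value is capped to 4 and "off" becomes 3.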

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 22/25] io-controller: Per io group bdi congestion interface
       [not found]   ` <1246564917-19603-23-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-17  0:16     ` Munehiro Ikeda
  0 siblings, 0 replies; 191+ messages in thread
From: Munehiro Ikeda @ 2009-07-17  0:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi,

Vivek Goyal wrote, on 07/02/2009 04:01 PM:
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2035c20..79fe6a9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -90,6 +90,27 @@ void blk_queue_congestion_threshold(struct request_queue *q)
>   	q->nr_congestion_off = nr;
>   }
> 
> +#ifdef CONFIG_GROUP_IOSCHED
> +int blk_queue_io_group_congested(struct backing_dev_info *bdi, int bdi_bits,
> +					struct page *page)
> +{
> +	int ret = 0;
> +	struct request_queue *q = bdi->unplug_io_data;
> +
> +	if (!q && !q->elevator)
> +		return bdi_congested(bdi, bdi_bits);

This causes a NULL pointer dereference for brd etc.

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
---
 block/blk-core.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 79fe6a9..39fab66 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -97,7 +97,7 @@ int blk_queue_io_group_congested(struct backing_dev_info *bdi, int bdi_bits,
 	int ret = 0;
 	struct request_queue *q = bdi->unplug_io_data;
 
-	if (!q && !q->elevator)
+	if (!q || !q->elevator)
 		return bdi_congested(bdi, bdi_bits);
 
 	/* Do we need to hold queue lock? */
-- 
1.6.2.5
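
To spell out why the operator matters: with "&&" the right-hand side is
evaluated only when !q is already true, so q->elevator is read exactly when
q is NULL, and a valid queue without an elevator is never caught. With "||"
the right-hand side is evaluated only when q is non-NULL. The sketch below
models this in user space; struct queue_model, struct elevator_model and the
function names are hypothetical and are not the kernel types. The buggy
condition is described by two helper predicates rather than executed, since
running it on NULL would crash.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct request_queue and its elevator. */
struct elevator_model {
	int unused;
};

struct queue_model {
	struct elevator_model *elevator;
};

/* Fixed guard: q->elevator is only read once q is known to be non-NULL. */
static int must_fall_back(const struct queue_model *q)
{
	return !q || !q->elevator;
}

/*
 * The original "!q && !q->elevator" evaluates q->elevator only when
 * !q is true, i.e. exactly when q is NULL.
 */
static int buggy_guard_derefs_null(const struct queue_model *q)
{
	return q == NULL;
}

/*
 * For a non-NULL q the original guard short-circuits to false, so a
 * queue without an elevator is not sent down the fallback path.
 */
static int buggy_guard_misses(const struct queue_model *q)
{
	return q != NULL && q->elevator == NULL;
}
```

Either failure mode of the "&&" form ends in a NULL dereference: directly in
the guard when q is NULL, or later when code after the guard uses an elevator
that is not there.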


-- 
IKEDA, Munehiro
  NEC Corporation of America
    m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* Re: [PATCH 22/25] io-controller: Per io group bdi congestion interface
       [not found]     ` <4A5FC2CA.1040609-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
@ 2009-07-17 13:52       ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-17 13:52 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Thu, Jul 16, 2009 at 08:16:10PM -0400, Munehiro Ikeda wrote:
> Hi,
> 
> Vivek Goyal wrote, on 07/02/2009 04:01 PM:
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 2035c20..79fe6a9 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -90,6 +90,27 @@ void blk_queue_congestion_threshold(struct request_queue *q)
> >   	q->nr_congestion_off = nr;
> >   }
> > 
> > +#ifdef CONFIG_GROUP_IOSCHED
> > +int blk_queue_io_group_congested(struct backing_dev_info *bdi, int bdi_bits,
> > +					struct page *page)
> > +{
> > +	int ret = 0;
> > +	struct request_queue *q = bdi->unplug_io_data;
> > +
> > +	if (!q && !q->elevator)
> > +		return bdi_congested(bdi, bdi_bits);
> 
> This causes a NULL pointer dereference for brd etc.
> 
> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
> ---
>  block/blk-core.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 79fe6a9..39fab66 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -97,7 +97,7 @@ int blk_queue_io_group_congested(struct backing_dev_info *bdi, int bdi_bits,
>  	int ret = 0;
>  	struct request_queue *q = bdi->unplug_io_data;
>  
> -	if (!q && !q->elevator)
> +	if (!q || !q->elevator)
>  		return bdi_congested(bdi, bdi_bits);
>  

Hi,

Thanks for the patch. I also noticed this recently and have fixed it for the
next version to be posted.

Thanks
Vivek

>  	/* Do we need to hold queue lock? */
> -- 
> 1.6.2.5
> 
> 
> -- 
> IKEDA, Munehiro
>   NEC Corporation of America
>     m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 21/25] io-controller: Per cgroup request descriptor support
       [not found]   ` <1246564917-19603-22-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-07-08  3:27     ` Gui Jianfeng
@ 2009-07-21  5:37     ` Gui Jianfeng
  1 sibling, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-21  5:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Vivek Goyal wrote:
> o Currently a request queue has a fixed number of request descriptors for
>   sync and async requests. Once the request descriptors are consumed, new
>   processes are put to sleep and they effectively become serialized. Because
>   sync and async queues are separate, async requests don't impact sync ones,
>   but if one is looking for fairness between async requests, that is not
>   achievable if request queue descriptors become the bottleneck.
> 
> o Make request descriptors per io group so that if there is lots of IO
>   going on in one cgroup, it does not impact the IO of other groups.
> 
> o This is just one relatively simple way of doing things. This patch will
>   probably change after the feedback. Folks have raised concerns that in a
>   hierarchical setup, a child's request descriptors should be capped by the
>   parent's request descriptors. Maybe we need to have per cgroup per device
>   files in cgroups where one can specify the upper limit of request
>   descriptors, and whenever a cgroup is created one needs to assign a
>   request descriptor limit, making sure the total sum of the children's
>   request descriptors is not more than that of the parent.
> 
>   I guess something like the memory controller. Anyway, that would be the
>   next step. For the time being, we have implemented something simpler as
>   follows.
> 
> o This patch implements the per cgroup request descriptors. The request pool
>   per queue is still common, but every group will have its own wait list and
>   its own count of request descriptors allocated to that group for sync and
>   async queues. So effectively request_list becomes a per io group property
>   and not a global request queue feature.
> 
> o Currently one can define q->nr_requests to limit request descriptors
>   allocated for the queue. Now there is another tunable,
>   q->nr_group_requests, which controls the request descriptor limit per
>   group. q->nr_requests supersedes q->nr_group_requests to make sure that if
>   there are lots of groups present, we don't end up allocating too many
>   request descriptors on the queue.
> 
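
  As a rough illustration of the two limits described above, here is a
  user-space sketch of the admission check. It is not the code from the
  patch: struct group_model, struct queue_model and both functions are
  hypothetical, sync and async are not distinguished, and a refused
  allocation simply returns false where the kernel would put the process to
  sleep on the group's wait list.

```c
#include <stdbool.h>

/* Hypothetical per-group state: descriptors currently held by the group. */
struct group_model {
	unsigned long count;
};

/* Hypothetical per-queue state shared by all groups. */
struct queue_model {
	unsigned long nr_requests;		/* queue-wide cap */
	unsigned long nr_group_requests;	/* per-group cap */
	unsigned long total;			/* held by all groups together */
};

/* The queue-wide cap is checked first, so it wins over the group cap. */
static bool may_allocate(const struct queue_model *q,
			 const struct group_model *g)
{
	if (q->total >= q->nr_requests)
		return false;
	return g->count < q->nr_group_requests;
}

/* Take one descriptor for group g if both limits allow it. */
static bool allocate(struct queue_model *q, struct group_model *g)
{
	if (!may_allocate(q, g))
		return false;
	g->count++;
	q->total++;
	return true;
}
```

  With nr_requests = 4 and nr_group_requests = 3, one busy group is stopped
  at 3 descriptors, and a second group gets only 1 before the queue-wide cap
  is hit even though it is still below its own limit.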

  Hi Vivek,

  To keep q->nr_requests from becoming the bottleneck when allocating
  requests, could we update nr_requests accordingly whenever a cgroup is
  created or removed?
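
  One possible reading of this idea, as a user-space sketch only: the
  queue-wide limit is recomputed as (number of groups) x (per-group limit)
  on every create and remove. struct scaled_queue and the function names are
  hypothetical, and keeping room for at least one group when none remain is
  an assumption of the sketch, not something proposed in the thread.

```c
/* Hypothetical model of a queue whose global limit follows the group count. */
struct scaled_queue {
	unsigned long nr_requests;		/* queue-wide cap, derived */
	unsigned long nr_group_requests;	/* per-group cap */
	unsigned long nr_groups;		/* groups currently present */
};

/* Recompute the queue-wide cap; always leave room for at least one group. */
static void resize_queue_limit(struct scaled_queue *q)
{
	unsigned long groups = q->nr_groups ? q->nr_groups : 1;

	q->nr_requests = groups * q->nr_group_requests;
}

/* Called when a cgroup is created. */
static void group_created(struct scaled_queue *q)
{
	q->nr_groups++;
	resize_queue_limit(q);
}

/* Called when a cgroup is removed; the count never goes below zero. */
static void group_removed(struct scaled_queue *q)
{
	if (q->nr_groups)
		q->nr_groups--;
	resize_queue_limit(q);
}
```

  The trade-off is that the total number of descriptors on the queue is then
  no longer bounded independently of how many groups exist.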

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 21/25] io-controller: Per cgroup request descriptor support
       [not found]     ` <4A655434.5060404-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-07-21  5:55       ` Nauman Rafique
  0 siblings, 0 replies; 191+ messages in thread
From: Nauman Rafique @ 2009-07-21  5:55 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Mon, Jul 20, 2009 at 10:37 PM, Gui Jianfeng <guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
> Vivek Goyal wrote:
>> o Currently a request queue has a fixed number of request descriptors for
>>   sync and async requests. Once the request descriptors are consumed, new
>>   processes are put to sleep and they effectively become serialized. Because
>>   sync and async queues are separate, async requests don't impact sync ones,
>>   but if one is looking for fairness between async requests, that is not
>>   achievable if request queue descriptors become the bottleneck.
>>
>> o Make request descriptors per io group so that if there is lots of IO
>>   going on in one cgroup, it does not impact the IO of other groups.
>>
>> o This is just one relatively simple way of doing things. This patch will
>>   probably change after the feedback. Folks have raised concerns that in a
>>   hierarchical setup, a child's request descriptors should be capped by the
>>   parent's request descriptors. Maybe we need to have per cgroup per device
>>   files in cgroups where one can specify the upper limit of request
>>   descriptors, and whenever a cgroup is created one needs to assign a
>>   request descriptor limit, making sure the total sum of the children's
>>   request descriptors is not more than that of the parent.
>>
>>   I guess something like the memory controller. Anyway, that would be the
>>   next step. For the time being, we have implemented something simpler as
>>   follows.
>>
>> o This patch implements the per cgroup request descriptors. The request pool
>>   per queue is still common, but every group will have its own wait list and
>>   its own count of request descriptors allocated to that group for sync and
>>   async queues. So effectively request_list becomes a per io group property
>>   and not a global request queue feature.
>>
>> o Currently one can define q->nr_requests to limit request descriptors
>>   allocated for the queue. Now there is another tunable,
>>   q->nr_group_requests, which controls the request descriptor limit per
>>   group. q->nr_requests supersedes q->nr_group_requests to make sure that if
>>   there are lots of groups present, we don't end up allocating too many
>>   request descriptors on the queue.
>>
>
>  Hi Vivek,
>
>  To keep q->nr_requests from becoming the bottleneck when allocating
>  requests, could we update nr_requests accordingly whenever a cgroup is
>  created or removed?

Vivek,
I agree with Gui here. In fact, it does not make much sense to keep
the nr_requests limit if we already have a per-cgroup limit in place.
This change also simplifies the code quite a bit, as we can get rid of all
that sleep_on_global logic.

>
> --
> Regards
> Gui Jianfeng
>
>

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 21/25] io-controller: Per cgroup request descriptor support
@ 2009-07-21  5:55       ` Nauman Rafique
  0 siblings, 0 replies; 191+ messages in thread
From: Nauman Rafique @ 2009-07-21  5:55 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: Vivek Goyal, linux-kernel, containers, dm-devel, jens.axboe,
	dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando,
	s-uchida, taka, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	jbaron, agk, snitzer, akpm, peterz

On Mon, Jul 20, 2009 at 10:37 PM, Gui Jianfeng <guijianfeng@cn.fujitsu.com> wrote:
> Vivek Goyal wrote:
>> o Currently a request queue has a fixed number of request descriptors for
>>   sync and async requests. Once the request descriptors are consumed, new
>>   processes are put to sleep and they effectively become serialized. Because
>>   sync and async queues are separate, async requests don't impact sync ones,
>>   but if one is looking for fairness between async requests, that is not
>>   achievable if request queue descriptors become the bottleneck.
>>
>> o Make request descriptors per io group so that if there is lots of IO
>>   going on in one cgroup, it does not impact the IO of other groups.
>>
>> o This is just one relatively simple way of doing things. This patch will
>>   probably change after the feedback. Folks have raised concerns that in a
>>   hierarchical setup, a child's request descriptors should be capped by the
>>   parent's request descriptors. Maybe we need to have per cgroup per device
>>   files in cgroups where one can specify the upper limit of request
>>   descriptors, and whenever a cgroup is created one needs to assign a
>>   request descriptor limit, making sure the total sum of the children's
>>   request descriptors is not more than that of the parent.
>>
>>   I guess something like the memory controller. Anyway, that would be the
>>   next step. For the time being, we have implemented something simpler as
>>   follows.
>>
>> o This patch implements the per cgroup request descriptors. The request pool
>>   per queue is still common, but every group will have its own wait list and
>>   its own count of request descriptors allocated to that group for sync and
>>   async queues. So effectively request_list becomes a per io group property
>>   and not a global request queue feature.
>>
>> o Currently one can define q->nr_requests to limit request descriptors
>>   allocated for the queue. Now there is another tunable,
>>   q->nr_group_requests, which controls the request descriptor limit per
>>   group. q->nr_requests supersedes q->nr_group_requests to make sure that if
>>   there are lots of groups present, we don't end up allocating too many
>>   request descriptors on the queue.
>>
>
>  Hi Vivek,
>
>  In order to prevent q->nr_requests from becoming the bottle-neck of allocating
>  requests, whether we can update nr_requests accordingly when allocating or removing
>  a cgroup?

Vivek,
I agree with Gui here. In fact, it does not make much sense to keep
the nr_requests limit if we already have per cgroup limit in place.
This change also simplifies code quite a bit, as we can get rid of all
that sleep_on_global logic.

>
> --
> Regards
> Gui Jianfeng
>
>

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 21/25] io-controller: Per cgroup request descriptor support
       [not found]       ` <e98e18940907202255y5c7c546ei95d87e5a451ad0c2-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-07-21 14:01         ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-21 14:01 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Mon, Jul 20, 2009 at 10:55:31PM -0700, Nauman Rafique wrote:
> On Mon, Jul 20, 2009 at 10:37 PM, Gui
> Jianfeng<guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
> > Vivek Goyal wrote:
> >> o Currently a request queue has got fixed number of request descriptors for
> >>   sync and async requests. Once the request descriptors are consumed, new
> >>   processes are put to sleep and they effectively become serialized. Because
> >>   sync and async queues are separate, async requests don't impact sync ones
> >>   but if one is looking for fairness between async requests, that is not
> >>   achievable if request queue descriptors become bottleneck.
> >>
> >> o Make request descriptors per io group so that if there is lots of IO
> >>   going on in one cgroup, it does not impact the IO of other group.
> >>
> >> o This is just one relatively simple way of doing things. This patch will
> >>   probably change after the feedback. Folks have raised concerns that in
> >>   hierarchical setup, child's request descriptors should be capped by parent's
> >>   request descriptors. Maybe we need to have per cgroup per device files
> >>   in cgroups where one can specify the upper limit of request descriptors
> >>   and whenever a cgroup is created one needs to assign request descriptor
> >>   limit making sure total sum of child's request descriptor is not more than
> >>   of parent.
> >>
> >>   I guess something like memory controller. Anyway, that would be the next
> >>   step. For the time being, we have implemented something simpler as follows.
> >>
> >> o This patch implements the per cgroup request descriptors. request pool per
> >>   queue is still common but every group will have its own wait list and its
> >>   own count of request descriptors allocated to that group for sync and async
> >>   queues. So effectively request_list becomes per io group property and not a
> >>   global request queue feature.
> >>
> >> o Currently one can define q->nr_requests to limit request descriptors
> >>   allocated for the queue. Now there is another tunable q->nr_group_requests
> >>   which controls the request descriptor limit per group. q->nr_requests
> >>   supersedes q->nr_group_requests to make sure if there are lots of groups
> >>   present, we don't end up allocating too many request descriptors on the
> >>   queue.
> >>
> >
> >  Hi Vivek,
> >
> >  In order to prevent q->nr_requests from becoming the bottle-neck of allocating
> >  requests, whether we can update nr_requests accordingly when allocating or removing
> >  a cgroup?
> 
> Vivek,
> I agree with Gui here. In fact, it does not make much sense to keep
> the nr_requests limit if we already have per cgroup limit in place.
> This change also simplifies code quite a bit, as we can get rid of all
> that sleep_on_global logic.
> 

Hi Nauman, Gui,

There were a few reasons to keep a total limit on the number of request
descriptors (q->nr_requests) in addition to the per group limit.

- We have this notion of a queue being congested or not, depending on how
  many of the q->nr_requests descriptors are currently in use. Writeback
  threads, some filesystems and other places use this information either
  to avoid blocking or to avoid pushing too much data to the device if the
  queue is congested.

  With q->nr_requests removed, how do you define queue full and congested
  semantics?

- I think the sleep_on_global logic makes sense even without
  q->nr_requests. Assume that a group allows request descriptor allocation
  but, due to lack of memory, the allocation fails. Where do you make this
  process wait before it attempts the allocation again? Putting all such
  failed processes on a global list on the queue instead of a per group
  list makes more sense to me for the following reasons.

	- If this is the first request allocation from the group and we
	  make the process sleep on the group list, it will never be woken
	  up as no request from that group will complete.

	- If there are many processes which failed request descriptor
	  allocation, then when some request completes, I think it is more
	  fair to wake these up in FIFO order to retry the allocation
	  instead of waiting for a request to complete from the group the
	  process belongs to. The reason is that the io controller did
	  not fail the request descriptor allocation.

  So even if you get rid of q->nr_requests, you will still need some
  global wait list logic where failed allocations can wait.

- It is backward compatible and there is less chance of higher layers
  being broken due to this.
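
To make the two kinds of waiting concrete, here is a minimal userspace
sketch of the decision described above. It is only a model under stated
assumptions: the struct names, the enum and the order of the checks are
hypothetical and are not taken from the patch. It shows that a process
blocked by its own group's limit can wait on the group list (that group
has requests in flight whose completion will wake it), while a process
blocked by the queue-wide limit or by a failed memory allocation has to
wait on the queue's global list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; these names are illustrative, not the patch's. */
enum rq_verdict {
	RQ_ALLOC_OK,       /* descriptor granted */
	RQ_WAIT_ON_GROUP,  /* group over its own limit: sleep on the group list */
	RQ_WAIT_ON_GLOBAL, /* queue limit hit or no memory: sleep on the queue list */
};

struct io_group_model {
	unsigned int count;             /* descriptors held by this group */
};

struct queue_model {
	unsigned int nr_requests;       /* total limit for the queue */
	unsigned int nr_group_requests; /* limit for each group */
	unsigned int count;             /* descriptors held by all groups */
};

static enum rq_verdict try_get_request(struct queue_model *q,
				       struct io_group_model *g,
				       bool mem_available)
{
	/* The group has requests in flight, so a completion will wake us. */
	if (g->count >= q->nr_group_requests)
		return RQ_WAIT_ON_GROUP;

	/*
	 * The io controller did not refuse the request. The group may have
	 * nothing in flight, so only the global list guarantees a wakeup.
	 */
	if (q->count >= q->nr_requests || !mem_available)
		return RQ_WAIT_ON_GLOBAL;

	g->count++;
	q->count++;
	return RQ_ALLOC_OK;
}

static void put_request(struct queue_model *q, struct io_group_model *g)
{
	assert(g->count > 0 && q->count > 0);
	g->count--;
	q->count--;
}
```

With nr_requests = 3 and nr_group_requests = 2, a group holding two
descriptors is sent to its group list, while a second group that is still
under its own limit is sent to the global list once the queue holds three.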


Gui, I think automatically updating q->nr_requests is probably not a very
good thing. It is a user defined tunable and the user does not expect it
to change automatically.

At this point in time I really can't think of a simpler and cleaner way.
Ideas are welcome.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 21/25] io-controller: Per cgroup request descriptor support
       [not found]         ` <20090721140134.GB540-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-21 17:57           ` Nauman Rafique
  0 siblings, 0 replies; 191+ messages in thread
From: Nauman Rafique @ 2009-07-21 17:57 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Tue, Jul 21, 2009 at 7:01 AM, Vivek Goyal<vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Jul 20, 2009 at 10:55:31PM -0700, Nauman Rafique wrote:
>> On Mon, Jul 20, 2009 at 10:37 PM, Gui
>> Jianfeng<guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> wrote:
>> > Vivek Goyal wrote:
>> >> o Currently a request queue has got fixed number of request descriptors for
>> >>   sync and async requests. Once the request descriptors are consumed, new
>> >>   processes are put to sleep and they effectively become serialized. Because
>> >>   sync and async queues are separate, async requests don't impact sync ones
>> >>   but if one is looking for fairness between async requests, that is not
>> >>   achievable if request queue descriptors become bottleneck.
>> >>
>> >> o Make request descriptors per io group so that if there is lots of IO
>> >>   going on in one cgroup, it does not impact the IO of other group.
>> >>
>> >> o This is just one relatively simple way of doing things. This patch will
>> >>   probably change after the feedback. Folks have raised concerns that in
>> >>   hierarchical setup, child's request descriptors should be capped by parent's
>> >>   request descriptors. Maybe we need to have per cgroup per device files
>> >>   in cgroups where one can specify the upper limit of request descriptors
>> >>   and whenever a cgroup is created one needs to assign request descriptor
>> >>   limit making sure total sum of child's request descriptor is not more than
>> >>   of parent.
>> >>
>> >>   I guess something like memory controller. Anyway, that would be the next
>> >>   step. For the time being, we have implemented something simpler as follows.
>> >>
>> >> o This patch implements the per cgroup request descriptors. request pool per
>> >>   queue is still common but every group will have its own wait list and its
>> >>   own count of request descriptors allocated to that group for sync and async
>> >>   queues. So effectively request_list becomes per io group property and not a
>> >>   global request queue feature.
>> >>
>> >> o Currently one can define q->nr_requests to limit request descriptors
>> >>   allocated for the queue. Now there is another tunable q->nr_group_requests
>> >>   which controls the request descriptor limit per group. q->nr_requests
>> >>   supersedes q->nr_group_requests to make sure if there are lots of groups
>> >>   present, we don't end up allocating too many request descriptors on the
>> >>   queue.
>> >>
>> >
>> >  Hi Vivek,
>> >
>> >  In order to prevent q->nr_requests from becoming the bottle-neck of allocating
>> >  requests, whether we can update nr_requests accordingly when allocating or removing
>> >  a cgroup?
>>
>> Vivek,
>> I agree with Gui here. In fact, it does not make much sense to keep
>> the nr_requests limit if we already have per cgroup limit in place.
>> This change also simplifies code quite a bit, as we can get rid of all
>> that sleep_on_global logic.
>>
>
> Hi Nauman, Gui,
>
> There were few reasons to keep a total limit on number of request
> descriptors (q->nr_requests) apart from per group limit.
>
> - We have this notion of queue being congested or not depending on out of
>  q->nr_requests how many are currently being used. Writeback threads,
>  some filesystems and other places make use of this information to either
>  not to block or to avoid pushing too much of data on device if queue is
>  congested.
>
>  With q->nr_requests removed, how do you define queue full and congested
>  semantics?

We can still keep q->nr_requests around, but don't use that number to
deny request descriptor allocation; only use it for defining queue
full and congested semantics.
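
A rough userspace sketch of that idea, with hypothetical names that are
not from the patch: only the per group limit can deny an allocation,
while q->nr_requests is consulted purely to report congestion. Using the
plain limit as the congestion threshold is a simplification made for
this sketch.

```c
#include <stdbool.h>

/* Hypothetical model; these names are illustrative, not the patch's. */
struct io_group_model {
	unsigned int count;             /* descriptors held by this group */
};

struct queue_model {
	unsigned int nr_requests;       /* congestion threshold only */
	unsigned int nr_group_requests; /* the only limit that denies */
	unsigned int count;             /* descriptors held by all groups */
};

/* Reported to writeback and filesystems; never used to block allocation. */
static bool queue_congested(const struct queue_model *q)
{
	return q->count >= q->nr_requests;
}

/* Returns false only when the group itself is over its limit. */
static bool try_get_request(struct queue_model *q, struct io_group_model *g)
{
	if (g->count >= q->nr_group_requests)
		return false;

	g->count++;
	q->count++;
	return true;
}
```

In this model the total number of descriptors on the queue is bounded
only by the number of groups times nr_group_requests, so the queue can
hold more than nr_requests descriptors while it reports itself congested.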

>
> - I think sleep_on_global logic makes sense even without q->nr_requests.
>  Assume that a group allows request descriptor allocation but due to lack
>  of memory, allocation fails. Where do you make this process wait to
>  attempt next time? Making all such failed processes on global list on
>  queue instead of per group list makes more sense to me for following
>  reasons.
>
>        - If this is the first request allocation from the group and we
>          make the process sleep on group list, it will never be woken up
>          as no request from that group will complete.
>
>        - If there are many processes who failed request descriptor
>          allocation, when some request completes, I think it is more
>          fair to wake these up in FIFO manner to try out allocation again
>          instead of waiting for request to complete from the group
>          process belongs to. The reason being that io controller did not
>          fail the request descriptor allocation.
>
>  So even if you get rid of q->nr_requests, you still shall have to have
>  some logic of global wait list where failed allocations can wait.
>
> - It is backward compatible and there are less chances of higher layers
>  being broken due to this.
>
>
> Gui, I think automatic updating of q->nr_requests is probably not a very
> good thing. It is user defined tunable and user does not expect this to
> change automatically.
>
> At this point of time I really can't think of simpler and cleaner way.
> Ideas are welcome.
>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 21/25] io-controller: Per cgroup request descriptor support
@ 2009-07-21 17:57           ` Nauman Rafique
  0 siblings, 0 replies; 191+ messages in thread
From: Nauman Rafique @ 2009-07-21 17:57 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, snitzer, peterz, dm-devel, dpshah, jens.axboe, agk,
	balbir, paolo.valente, Gui Jianfeng, fernando, mikew, jmoyer,
	m-ikeda, lizf, fchecconi, akpm, containers, linux-kernel,
	s-uchida, righi.andrea, jbaron

On Tue, Jul 21, 2009 at 7:01 AM, Vivek Goyal<vgoyal@redhat.com> wrote:
> On Mon, Jul 20, 2009 at 10:55:31PM -0700, Nauman Rafique wrote:
>> On Mon, Jul 20, 2009 at 10:37 PM, Gui
>> Jianfeng<guijianfeng@cn.fujitsu.com> wrote:
>> > Vivek Goyal wrote:
>> >> o Currently a request queue has a fixed number of request descriptors for
>> >>   sync and async requests. Once the request descriptors are consumed, new
>> >>   processes are put to sleep and they effectively become serialized. Because
>> >>   sync and async queues are separate, async requests don't impact sync ones
>> >>   but if one is looking for fairness between async requests, that is not
>> >>   achievable if request queue descriptors become the bottleneck.
>> >>
>> >> o Make request descriptors per io group so that if there is lots of IO
>> >>   going on in one cgroup, it does not impact the IO of other groups.
>> >>
>> >> o This is just one relatively simple way of doing things. This patch will
>> >>   probably change after the feedback. Folks have raised concerns that in
>> >>   a hierarchical setup, a child's request descriptors should be capped by
>> >>   the parent's request descriptors. Maybe we need to have per cgroup per
>> >>   device files in cgroups where one can specify the upper limit of request
>> >>   descriptors, and whenever a cgroup is created one needs to assign a
>> >>   request descriptor limit, making sure the total sum of the children's
>> >>   request descriptors is not more than the parent's.
>> >>
>> >>   I guess something like the memory controller. Anyway, that would be the next
>> >>   step. For the time being, we have implemented something simpler as follows.
>> >>
>> >> o This patch implements the per cgroup request descriptors. The request pool per
>> >>   queue is still common but every group will have its own wait list and its
>> >>   own count of request descriptors allocated to that group for sync and async
>> >>   queues. So effectively request_list becomes a per io group property and not a
>> >>   global request queue feature.
>> >>
>> >> o Currently one can define q->nr_requests to limit request descriptors
>> >>   allocated for the queue. Now there is another tunable q->nr_group_requests
>> >>   which controls the request descriptor limit per group. q->nr_requests
>> >>   supersedes q->nr_group_requests to make sure if there are lots of groups
>> >>   present, we don't end up allocating too many request descriptors on the
>> >>   queue.
>> >>
>> >
>> >  Hi Vivek,
>> >
>> >  In order to prevent q->nr_requests from becoming the bottleneck when allocating
>> >  requests, could we update nr_requests accordingly when allocating or removing
>> >  a cgroup?
>>
>> Vivek,
>> I agree with Gui here. In fact, it does not make much sense to keep
>> the nr_requests limit if we already have per cgroup limit in place.
>> This change also simplifies code quite a bit, as we can get rid of all
>> that sleep_on_global logic.
>>
>
> Hi Nauman, Gui,
>
> There were a few reasons to keep a total limit on the number of request
> descriptors (q->nr_requests) apart from the per group limit.
>
> - We have this notion of a queue being congested or not, depending on how
>  many of the q->nr_requests descriptors are currently in use. Writeback
>  threads, some filesystems and other places make use of this information
>  to either not block or to avoid pushing too much data to the device if
>  the queue is congested.
>
>  With q->nr_requests removed, how do you define queue full and congested
>  semantics?

We can still keep q->nr_requests around, but don't use that number to
deny request descriptor allocation; only use it for defining queue
full and congested semantics.
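
A minimal userspace sketch of that idea, not kernel code: the struct and
function names (toy_queue, toy_group, toy_get_request and so on) are
hypothetical, and the congestion test is simplified to a plain comparison
against nr_requests rather than the block layer's real congestion thresholds.
Allocation is denied only by the per group limit, while nr_requests is
consulted only to answer "is the queue congested?".

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; not the kernel's request_queue. */
struct toy_group {
	unsigned int allocated;		/* descriptors held by this group */
};

struct toy_queue {
	unsigned int nr_requests;	/* used only for congestion here */
	unsigned int nr_group_requests;	/* per group allocation cap */
	unsigned int total_allocated;	/* descriptors held by all groups */
};

/* Congestion is derived from the queue-wide count only (simplified). */
bool toy_queue_congested(const struct toy_queue *q)
{
	return q->total_allocated >= q->nr_requests;
}

/* Allocation is gated by the per group limit only, never by nr_requests. */
bool toy_get_request(struct toy_queue *q, struct toy_group *g)
{
	if (g->allocated >= q->nr_group_requests)
		return false;
	g->allocated++;
	q->total_allocated++;
	return true;
}

/* Return one descriptor from the group to the common pool. */
void toy_put_request(struct toy_queue *q, struct toy_group *g)
{
	if (g->allocated == 0)
		return;
	g->allocated--;
	q->total_allocated--;
}
```

With nr_requests = 4 and nr_group_requests = 3, two groups can hold up to six
descriptors between them: the queue reports congestion once the total reaches
four, but neither group is refused until it hits its own limit of three.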

>
> - I think sleep_on_global logic makes sense even without q->nr_requests.
>  Assume that a group allows request descriptor allocation but due to lack
>  of memory, allocation fails. Where do you make this process wait to
>  attempt next time? Making all such failed processes wait on a global list
>  on the queue instead of a per-group list makes more sense to me for the
>  following reasons.
>
>        - If this is the first request allocation from the group and we
>          make the process sleep on the group list, it will never be woken
>          up as no request from that group will complete.
>
>        - If there are many processes that failed request descriptor
>          allocation, when some request completes, I think it is fairer
>          to wake these up in FIFO order to retry the allocation
>          instead of waiting for a request to complete from the group the
>          process belongs to. The reason is that the io controller did not
>          fail the request descriptor allocation.
>
>  So even if you get rid of q->nr_requests, you will still need some
>  global wait list logic where failed allocations can wait.
>
> - It is backward compatible and there is less chance of higher layers
>  being broken due to this.
>
>
> Gui, I think automatically updating q->nr_requests is probably not a very
> good thing. It is a user-defined tunable and the user does not expect it to
> change automatically.
>
> At this point I really can't think of a simpler and cleaner way.
> Ideas are welcome.
>
> Thanks
> Vivek
>
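
The global wait list argument quoted above can be pictured with a small
userspace sketch. This is only an illustration under assumptions: the names
(toy_wait_list, toy_wait_enqueue, toy_wake_one) and the fixed-size ring are
hypothetical and do not come from the patch. Tasks whose allocation failed
for lack of memory are queued on one queue-wide list, and any request
completion, from any group, wakes the oldest waiter first.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_WAITERS 16

/* Hypothetical queue-wide FIFO of tasks waiting to retry an allocation. */
struct toy_wait_list {
	int waiters[TOY_MAX_WAITERS];	/* task ids, oldest at 'head' */
	size_t head;
	size_t count;
};

/* Put a task at the tail of the global list; false if the list is full. */
bool toy_wait_enqueue(struct toy_wait_list *wl, int task_id)
{
	if (wl->count == TOY_MAX_WAITERS)
		return false;
	wl->waiters[(wl->head + wl->count) % TOY_MAX_WAITERS] = task_id;
	wl->count++;
	return true;
}

/*
 * Called on completion of any request, whichever group it belonged to.
 * Returns the id of the oldest waiter, or -1 if nobody is waiting.
 */
int toy_wake_one(struct toy_wait_list *wl)
{
	int task_id;

	if (wl->count == 0)
		return -1;
	task_id = wl->waiters[wl->head];
	wl->head = (wl->head + 1) % TOY_MAX_WAITERS;
	wl->count--;
	return task_id;
}
```

Because the list is not tied to a group, a task from a group with no
requests in flight is still woken by a completion from some other group.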

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [RFC] IO scheduler based IO controller V6
       [not found] ` <1246564917-19603-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (26 preceding siblings ...)
  2009-07-10  1:56   ` [PATCH] io-controller: implement per group request allocation limitation Gui Jianfeng
@ 2009-07-27  2:10   ` Gui Jianfeng
  27 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-27  2:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

[-- Attachment #1: Type: text/plain, Size: 1664 bytes --]

Hi,

Here are some fio test results with and without the IO Controller V6 patches
applied. Iozone test results are also attached.

Arch: X86
Mem:  1G
Disk: 320G
IO Scheduler: CFQ

============
By normal read and write syscall.

Block Size: 32K
File Size:  1G * 10

Mode                | Normal read | Random read | Normal write | Random write | Direct read | Direct write
--------------------+-------------+-------------+--------------+--------------+-------------+-------------
2.6.31-rc1          | 47,932KiB/s |  3,566KiB/s |  45,693KiB/s |   8,501KiB/s | 50,088KiB/s |  43,473KiB/s
2.6.31-rc1-Vivek-V6 | 47,231KiB/s |  3,411KiB/s |  42,451KiB/s |   8,714KiB/s | 51,284KiB/s |  42,341KiB/s
Performance         |       -1.5% |       -4.4% |        -7.0% |        +2.5% |       +2.4% |        -2.6%

============

By mmap.

Block Size: 32K
File Size:  500M

Mode                | Normal read | Random read | Normal write | Random write
--------------------+-------------+-------------+--------------+-------------
2.6.31-rc1          | 49,951KiB/s |  3,245KiB/s |  21,950KiB/s |   2,771KiB/s
2.6.31-rc1-Vivek-V6 | 49,951KiB/s |  3,154KiB/s |  22,593KiB/s |   2,648KiB/s
Performance         |          0% |       -2.8% |        +2.9% |        -4.4%

============
 
By libaio calls

Block Size: 32K
File Size:  500M

Mode                | Normal read | Random read | Normal write | Random write
--------------------+-------------+-------------+--------------+-------------
2.6.31-rc1          | 49,447KiB/s |  3,296KiB/s |  57,519KiB/s |  21,093KiB/s
2.6.31-rc1-Vivek-V6 | 50,142KiB/s |  3,238KiB/s |  57,791KiB/s |  21,283KiB/s
Performance         |       +1.4% |       -1.8% |           0% |        +0.1%

============
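
For anyone re-checking the "Performance" rows: they appear to be the relative
change of the patched kernel's throughput over the 2.6.31-rc1 baseline. That
reading is an assumption, as the exact formula and rounding used are not
stated, so a recomputation may differ from the posted figures in the last
digit. The helper names below are hypothetical.

```c
#include <stdio.h>

/*
 * Relative change of the patched kernel's throughput over the baseline,
 * in percent. Assumes baseline_kib_s is non-zero.
 */
double relative_change_percent(double baseline_kib_s, double patched_kib_s)
{
	return (patched_kib_s - baseline_kib_s) / baseline_kib_s * 100.0;
}

/* Print one table cell with an explicit sign and one decimal place. */
void print_change(const char *mode, double baseline_kib_s, double patched_kib_s)
{
	printf("%-14s %+.1f%%\n", mode,
	       relative_change_percent(baseline_kib_s, patched_kib_s));
}
```

For example, print_change("Normal read", 47932.0, 47231.0) recomputes the
first cell of the syscall table from the two throughput values above it.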












[-- Attachment #2: fio-test.tgz --]
[-- Type: application/x-compressed, Size: 4964 bytes --]

[-- Attachment #3: iozone_log.tgz --]
[-- Type: application/x-compressed, Size: 142924 bytes --]

[-- Attachment #4: Type: text/plain, Size: 206 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [RFC] IO scheduler based IO controller V6
       [not found]   ` <4A6D0C9A.3080600-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-07-27 12:55     ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-27 12:55 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Mon, Jul 27, 2009 at 10:10:34AM +0800, Gui Jianfeng wrote:
> Hi,
> 
> Here are some fio test results for IO Controller V6 built and not built.
> Iozone test results are also attached.
> 

Hi Gui,

Thanks a lot for the performance numbers. The results look mixed, with
gains in some places and losses in others. I am curious about that -7.0%
for normal writes. I am not sure what could contribute to that.

What was the value of the "fairness" parameter when you ran those tests? Can
you please set fairness = 0 and re-run the tests (if you have not already done
so)?

By default fairness is set to 1 in V6. With fairness = 0, we should be very
close to existing CFQ behavior. If not, then we need to dive deeper and
see why variations are happening.
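
A small sketch of setting the tunable from a test harness. The helper name
write_tunable is hypothetical, and the sysfs location used in the note below
is an assumption about where the elevator exposes "fairness" on the test
machine; please check the path on your system before relying on it.

```c
#include <stdio.h>

/*
 * Write 'value' followed by a newline to a sysfs-style file.
 * Returns 0 on success and -1 on any failure.
 */
int write_tunable(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(value, f) >= 0 && fputc('\n', f) != EOF;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Assuming the disk under test is sdb and the tunable lives under the
scheduler's iosched directory, the call would look like
write_tunable("/sys/block/sdb/queue/iosched/fairness", "0").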

Is it also possible to run the same tests with V7?

Thanks
Vivek

> Arch: X86
> Mem:  1G
> Disk: 320G
> IO Scheduler: CFQ
> 
> ============
> By normal read and write syscall.
> 
> Block Size:32K
> File Size: 1G * 10
> 
> Mode                    Normal read   |   Random read   |   Normal write   |   Random write | Direct read | Direct write
> 
> 2.6.31-rc1              47,932KiB/s       3,566KiB/s        45,693KiB/s        8,501KiB/s     50,088KiB/s   43,473KiB/s
> 
> 2.6.31-rc1-Vivek-V6     47,231KiB/s       3,411KiB/s        42,451KiB/s        8,714KiB/s     51,284KiB/s   42,341KiB/s
> 
> Performance             -1.5%             -4.4%             -7.0%              +2.5%          +2.4%         -2.6%
> 
> ============
> 
> By mmap.
> 
> Block Size:32K
> File Size: 500M
> 
> Mode                    Normal read   |   Random read   |   Normal write   |   Random write
> 
> 2.6.31-rc1              49,951KiB/s       3,245KiB/s        21,950KiB/s        2,771KiB/s
> 
> 2.6.31-rc1-Vivek-V6     49,951KiB/s       3,154KiB/s        22,593KiB/s        2,648KiB/s
> 
> Performance             0%                -2.8%             +2.9%              -4.4%
> 
> ============
>  
> By libaio calls
> 
> Block Size:32K
> File Size: 500M
> 
> Mode                    Normal read   |  Random read   |   Normal write   |   Random write
> 
> 2.6.31-rc1              49,447KiB/s       3,296KiB/s        57,519KiB/s        21,093KiB/s
> 
> 2.6.31-rc1-Vivek-V6     50,142KiB/s       3,238KiB/s        57,791KiB/s        21,283KiB/s
> 
> Performance             +1.4%             -1.8%             0%                 +0.1%
> 
> ============
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 



^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [RFC] IO scheduler based IO controller V6
       [not found]     ` <20090727125503.GA24449-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-28  3:27       ` Vivek Goyal
  2009-07-28 11:36       ` Gui Jianfeng
  2009-07-29  9:07       ` Gui Jianfeng
  2 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-07-28  3:27 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Mon, Jul 27, 2009 at 08:55:03AM -0400, Vivek Goyal wrote:
> On Mon, Jul 27, 2009 at 10:10:34AM +0800, Gui Jianfeng wrote:
> > Hi,
> > 
> > Here are some fio test results for IO Controller V6 built and not built.
> > Iozone test results are also attached.
> > 
> 
> Hi Gui,
> 
> Thanks a lot for some performance numbers. It seems to be a mixed chart.
> Performance gains at some places and loss at others. I am curious about
> that -7.0% for normal writes. Not sure what can contribute to that.
> 

Gui, can you also try with CONFIG_TRACK_ASYNC_CONTEXT=n and see if it improves
buffered write performance?

Thanks
Vivek

> What was the value of "fairness" parameter when you ran those tests? Can you
> please set fairness = 0 and re-run the tests (if you have already not done so).
> 
> By default fairness is set to 1 in V6. With fairness = 0, we should be very
> close to existing CFQ behavior. If not, then we need to dive deeper and
> see why variations are happening.
> 
> Is it also possible to run the same tests with V7. 
> 
> Thanks
> Vivek
> 
> > Arch: X86
> > Mem:  1G
> > Disk: 320G
> > IO Scheduler: CFQ
> > 
> > ============
> > By normal read and write syscall.
> > 
> > Block Size:32K
> > File Size: 1G * 10
> > 
> > Mode                    Normal read   |   Random read   |   Normal write   |   Random write | Direct read | Direct write
> > 
> > 2.6.31-rc1              47,932KiB/s       3,566KiB/s        45,693KiB/s        8,501KiB/s     50,088KiB/s   43,473KiB/s
> > 
> > 2.6.31-rc1-Vivek-V6     47,231KiB/s       3,411KiB/s        42,451KiB/s        8,714KiB/s     51,284KiB/s   42,341KiB/s
> > 
> > Performance             -1.5%             -4.4%             -7.0%              +2.5%          +2.4%         -2.6%
> > 
> > ============
> > 
> > By mmap.
> > 
> > Block Size:32K
> > File Size: 500M
> > 
> > Mode                    Normal read   |   Random read   |   Normal write   |   Random write
> > 
> > 2.6.31-rc1              49,951KiB/s       3,245KiB/s        21,950KiB/s        2,771KiB/s
> > 
> > 2.6.31-rc1-Vivek-V6     49,951KiB/s       3,154KiB/s        22,593KiB/s        2,648KiB/s
> > 
> > Performance             0%                -2.8%             +2.9%              -4.4%
> > 
> > ============
> >  
> > By libaio calls
> > 
> > Block Size:32K
> > File Size: 500M
> > 
> > Mode                    Normal read   |  Random read   |   Normal write   |   Random write
> > 
> > 2.6.31-rc1              49,447KiB/s       3,296KiB/s        57,519KiB/s        21,093KiB/s
> > 
> > 2.6.31-rc1-Vivek-V6     50,142KiB/s       3,238KiB/s        57,791KiB/s        21,283KiB/s
> > 
> > Performance             +1.4%             -1.8%             0%                 +0.1%
> > 
> > ============
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> 
> 
> 

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [RFC] IO scheduler based IO controller V6
       [not found]       ` <20090728032712.GC3620-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-07-28  3:36         ` Gui Jianfeng
  0 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-28  3:36 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Vivek Goyal wrote:
> On Mon, Jul 27, 2009 at 08:55:03AM -0400, Vivek Goyal wrote:
>> On Mon, Jul 27, 2009 at 10:10:34AM +0800, Gui Jianfeng wrote:
>>> Hi,
>>>
>>> Here are some fio test results for IO Controller V6 built and not built.
>>> Iozone test results are also attached.
>>>
>> Hi Gui,
>>
>> Thanks a lot for some performance numbers. It seems to be a mixed chart.
>> Performance gains at some places and loss at others. I am curious about
>> that -7.0% for normal writes. Not sure what can contribute to that.
>>
> 
> Gui, can you also try with CONFIG_TRACK_ASYNC_CONTEXT=n and see if it improves
> buffered write performance.

  Hi Vivek,

  OK, I'll do that. I'll also run the performance tests for IO Controller V7 and
  post the results as soon as I get them.

> 
> Thanks
> Vivek
> 
>> What was the value of the "fairness" parameter when you ran those tests? Can you
>> please set fairness = 0 and re-run the tests (if you have not already done so)?
>>
>> By default fairness is set to 1 in V6. With fairness = 0, we should be very
>> close to existing CFQ behavior. If not, then we need to dive deeper and
>> see why variations are happening.
>>
>> Is it also possible to run the same tests with V7?
>>
>> Thanks
>> Vivek
>>
>>> Arch: X86
>>> Mem:  1G
>>> Disk: 320G
>>> IO Scheduler: CFQ
>>>
>>> ============
>>> By normal read and write syscall.
>>>
>>> Block Size:32K
>>> File Size: 1G * 10
>>>
>>> Mode                    Normal read   |   Random read   |   Normal write   |   Random write | Direct read | Direct write
>>>
>>> 2.6.31-rc1              47,932KiB/s       3,566KiB/s        45,693KiB/s        8,501KiB/s     50,088KiB/s   43,473KiB/s
>>>
>>> 2.6.31-rc1-Vivek-V6     47,231KiB/s       3,411KiB/s        42,451KiB/s        8,714KiB/s     51,284KiB/s   42,341KiB/s
>>>
>>> Performance             -1.5%             -4.4%             -7.0%              +2.5%          +2.4%         -2.6%
>>>
>>> ============
>>>
>>> By mmap.
>>>
>>> Block Size:32K
>>> File Size: 500M
>>>
>>> Mode                    Normal read   |   Random read   |   Normal write   |   Random write
>>>
>>> 2.6.31-rc1              49,951KiB/s       3,245KiB/s        21,950KiB/s        2,771KiB/s
>>>
>>> 2.6.31-rc1-Vivek-V6     49,951KiB/s       3,154KiB/s        22,593KiB/s        2,648KiB/s
>>>
>>> Performance             0%                -2.8%             +2.9%              -4.4%
>>>
>>> ============
>>>  
>>> By libaio calls
>>>
>>> Block Size:32K
>>> File Size: 500M
>>>
>>> Mode                    Normal read   |  Random read   |   Normal write   |   Random write
>>>
>>> 2.6.31-rc1              49,447KiB/s       3,296KiB/s        57,519KiB/s        21,093KiB/s
>>>
>>> 2.6.31-rc1-Vivek-V6     50,142KiB/s       3,238KiB/s        57,791KiB/s        21,283KiB/s
>>>
>>> Performance             +1.4%             -1.8%             0%                 +0.1%
>>>
>>> ============

-- 
Regards
Gui Jianfeng

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [RFC] IO scheduler based IO controller V6
       [not found]     ` <20090727125503.GA24449-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-07-28  3:27       ` Vivek Goyal
@ 2009-07-28 11:36       ` Gui Jianfeng
  2009-07-29  9:07       ` Gui Jianfeng
  2 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-28 11:36 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Vivek Goyal wrote:
> On Mon, Jul 27, 2009 at 10:10:34AM +0800, Gui Jianfeng wrote:
>> Hi,
>>
>> Here are some fio test results for IO Controller V6 built and not built.
>> Iozone test results are also attached.
>>
> 
> Hi Gui,
> 
> Thanks a lot for some performance numbers. It seems to be a mixed chart.
> Performance gains at some places and loss at others. I am curious about
> that -7.0% for normal writes. Not sure what can contribute to that.
> 
> What was the value of the "fairness" parameter when you ran those tests? Can you
> please set fairness = 0 and re-run the tests (if you have not already done so)?
> 
> By default fairness is set to 1 in V6. With fairness = 0, we should be very
> close to existing CFQ behavior. If not, then we need to dive deeper and
> see why variations are happening.

  Hi Vivek,

  I tested with the default fairness value. I'll re-test with fairness set to 0.
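
A minimal sketch of flipping such a tunable from C, assuming the V6 patches expose "fairness" as a per-device sysfs file (the path /sys/block/<disk>/queue/iosched/fairness is an assumption here, not something stated in this thread). The helper names write_tunable() and read_tunable() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write an integer tunable to a sysfs-style file.
 * Returns 0 on success, -1 on failure.
 */
static int write_tunable(const char *path, int value)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	/* fclose() flushes; a failed flush means the write did not land. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Read an integer tunable back, to confirm what the test actually ran with.
 * Returns 0 on success, -1 on failure.
 */
static int read_tunable(const char *path, int *value)
{
	FILE *f;
	int ok;

	assert(path != NULL && value != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%d", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

Reading the value back before each run and recording it next to the results would answer the "what was fairness set to" question up front. The device name in any real path is a placeholder that depends on the test machine.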

> 
> Is it also possible to run the same tests with V7?

  Sure.

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [RFC] IO scheduler based IO controller V6
       [not found]     ` <20090727125503.GA24449-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-07-28  3:27       ` Vivek Goyal
  2009-07-28 11:36       ` Gui Jianfeng
@ 2009-07-29  9:07       ` Gui Jianfeng
  2 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-07-29  9:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Vivek Goyal wrote:
> On Mon, Jul 27, 2009 at 10:10:34AM +0800, Gui Jianfeng wrote:
>> Hi,
>>
>> Here are some fio test results for IO Controller V6 built and not built.
>> Iozone test results are also attached.
>>
> 
> Hi Gui,
> 
> Thanks a lot for some performance numbers. It seems to be a mixed chart.
> Performance gains at some places and loss at others. I am curious about
> that -7.0% for normal writes. Not sure what can contribute to that.
> 
> What was the value of the "fairness" parameter when you ran those tests? Can you
> please set fairness = 0 and re-run the tests (if you have not already done so)?
> 
> By default fairness is set to 1 in V6. With fairness = 0, we should be very
> close to existing CFQ behavior. If not, then we need to dive deeper and
> see why variations are happening.

  Hi Vivek,

  I re-ran the fio tests under V6 for normal reads and writes with fairness == 0.
  Normal write performance seems to be a little better than before.

  Mode                    Normal read   |   Random read   |   Normal write   |   Random write

  2.6.31-rc1              53,547KiB/s       2,894KiB/s        44,088KiB/s        8,450KiB/s

  V6(fairness = 0)        53,199KiB/s       2,847KiB/s        41,898KiB/s        8,582KiB/s

  Performance             0%                -1.6%             -4.9%              +1.5%
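
The "Performance" row is the relative change of the patched kernel against the 2.6.31-rc1 baseline. A minimal sketch of that arithmetic follows; the helper name pct_change() is hypothetical, and the posted figures may have been rounded or truncated differently from what this computes.

```c
#include <assert.h>

/* Relative change of `patched` against `baseline`, in percent. */
static double pct_change(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (patched - baseline) / baseline * 100.0;
}
```

For the normal write column, pct_change(44088.0, 41898.0) comes out slightly under -4.96, in line with the -4.9% entry above.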

> 
> Is it also possible to run the same tests with V7?
> 
> Thanks
> Vivek
> 
>> Arch: X86
>> Mem:  1G
>> Disk: 320G
>> IO Scheduler: CFQ
>>
>> ============
>> By normal read and write syscall.
>>
>> Block Size:32K
>> File Size: 1G * 10
>>
>> Mode                    Normal read   |   Random read   |   Normal write   |   Random write | Direct read | Direct write
>>
>> 2.6.31-rc1              47,932KiB/s       3,566KiB/s        45,693KiB/s        8,501KiB/s     50,088KiB/s   43,473KiB/s
>>
>> 2.6.31-rc1-Vivek-V6     47,231KiB/s       3,411KiB/s        42,451KiB/s        8,714KiB/s     51,284KiB/s   42,341KiB/s
>>
>> Performance             -1.5%             -4.4%             -7.0%              +2.5%          +2.4%         -2.6%
>>
>> ============
>>
>> By mmap.
>>
>> Block Size:32K
>> File Size: 500M
>>
>> Mode                    Normal read   |   Random read   |   Normal write   |   Random write
>>
>> 2.6.31-rc1              49,951KiB/s       3,245KiB/s        21,950KiB/s        2,771KiB/s
>>
>> 2.6.31-rc1-Vivek-V6     49,951KiB/s       3,154KiB/s        22,593KiB/s        2,648KiB/s
>>
>> Performance             0%                -2.8%             +2.9%              -4.4%
>>
>> ============
>>  
>> By libaio calls
>>
>> Block Size:32K
>> File Size: 500M
>>
>> Mode                    Normal read   |  Random read   |   Normal write   |   Random write
>>
>> 2.6.31-rc1              49,447KiB/s       3,296KiB/s        57,519KiB/s        21,093KiB/s
>>
>> 2.6.31-rc1-Vivek-V6     50,142KiB/s       3,238KiB/s        57,791KiB/s        21,283KiB/s
>>
>> Performance             +1.4%             -1.8%             0%                 +0.1%
>>
>> ============

-- 
Regards
Gui Jianfeng


^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH 20/25] io-controller: map async requests to appropriate cgroup
       [not found]   ` <1246564917-19603-21-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-08-03  2:13     ` Gui Jianfeng
  0 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-08-03  2:13 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Vivek Goyal wrote:
...
> +
> +struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> +					int create)
> +{
> +	struct page *page = NULL;
> +
> +	/*
> +	 * Determine the group from task context. Even calls from
> +	 * blk_get_request() which don't have any bio info will be mapped
> +	 * to the task's group
> +	 */
> +	if (!bio)
> +		goto sync;
> +
> +	if (bio_barrier(bio)) {
> +		/*
> +		 * Map barrier requests to root group. May be more special
> +		 * bio cases should come here
> +		 */
> +		return q->elevator->efqd.root_group;
> +	}
> +
> +	/* Map the sync bio to the right group using task context */
> +	if (elv_bio_sync(bio))
> +		goto sync;
> +
> +#ifdef CONFIG_TRACK_ASYNC_CONTEXT
> +	/* Determine the group from info stored in page */
> +	page = bio_iovec_idx(bio, 0)->bv_page;
> +	return io_get_io_group(q, page, create);
> +#endif
> +
> +sync:
> +	return io_get_io_group(q, NULL, create);

Fix a build warning seen when CONFIG_TRACK_ASYNC_CONTEXT is not set:

block/elevator-fq.c: In function ‘io_get_io_group_bio’:
block/elevator-fq.c:2075: warning: unused variable ‘page’

---
diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 66b10eb..d304f79 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2102,7 +2102,7 @@ struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
 #endif
 
 sync:
-	return io_get_io_group(q, NULL, create);
+	return io_get_io_group(q, page, create);
 }
 EXPORT_SYMBOL(io_get_io_group_bio);


> +}
> +EXPORT_SYMBOL(io_get_io_group_bio);
> +

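The mapping logic and the effect of the one-line fix can be sketched in user space as below. This is a simplified model, not the kernel code: struct bio_model, struct page_model, lookup_group(), get_group_for_bio() and MODEL_TRACK_ASYNC_CONTEXT are hypothetical stand-ins for struct bio, struct page, io_get_io_group(), io_get_io_group_bio() and CONFIG_TRACK_ASYNC_CONTEXT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct page_model {
	int owner_group;		/* group recorded when the page was dirtied */
};

struct bio_model {
	bool barrier;
	bool sync;
	struct page_model *first_page;
};

enum { ROOT_GROUP = 0 };

/* Stand-in for io_get_io_group(): a NULL page means "use the task's group". */
static int lookup_group(const struct page_model *page, int task_group)
{
	assert(task_group >= 0);
	return page ? page->owner_group : task_group;
}

static int get_group_for_bio(const struct bio_model *bio, int task_group)
{
	const struct page_model *page = NULL;

	/* No bio information at all: fall back to the submitting task. */
	if (!bio)
		goto sync;

	/* Barriers are mapped to the root group. */
	if (bio->barrier)
		return ROOT_GROUP;

	/* Sync IO is charged to the submitting task. */
	if (bio->sync)
		goto sync;

#ifdef MODEL_TRACK_ASYNC_CONTEXT
	/* Async IO: use the owner recorded in the first page of the bio. */
	page = bio->first_page;
	return lookup_group(page, task_group);
#endif

sync:
	/*
	 * page is still NULL on every path that reaches this label, so passing
	 * it behaves like passing NULL, and the variable is now referenced even
	 * when the #ifdef block above is compiled out.
	 */
	return lookup_group(page, task_group);
}
```

Built without MODEL_TRACK_ASYNC_CONTEXT, async bios fall through to the sync label and are charged to the task as well, which matches the quoted function when CONFIG_TRACK_ASYNC_CONTEXT is off.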

-- 
Regards
Gui Jianfeng

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* Re: [PATCH 20/25] io-controller: map async requests to appropriate cgroup
       [not found]     ` <4A7647DA.5050607-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2009-08-04  1:25       ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-08-04  1:25 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Mon, Aug 03, 2009 at 10:13:46AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> ...
> > +
> > +struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
> > +					int create)
> > +{
> > +	struct page *page = NULL;
> > +
> > +	/*
> > +	 * Determine the group from task context. Even calls from
> > +	 * blk_get_request() which don't have any bio info will be mapped
> > +	 * to the task's group
> > +	 */
> > +	if (!bio)
> > +		goto sync;
> > +
> > +	if (bio_barrier(bio)) {
> > +		/*
> > +		 * Map barrier requests to root group. May be more special
> > +		 * bio cases should come here
> > +		 */
> > +		return q->elevator->efqd.root_group;
> > +	}
> > +
> > +	/* Map the sync bio to the right group using task context */
> > +	if (elv_bio_sync(bio))
> > +		goto sync;
> > +
> > +#ifdef CONFIG_TRACK_ASYNC_CONTEXT
> > +	/* Determine the group from info stored in page */
> > +	page = bio_iovec_idx(bio, 0)->bv_page;
> > +	return io_get_io_group(q, page, create);
> > +#endif
> > +
> > +sync:
> > +	return io_get_io_group(q, NULL, create);
> 
> Fix build warning.
> 
> block/elevator-fq.c: In function ‘io_get_io_group_bio’:
> block/elevator-fq.c:2075: warning: unused variable ‘page’
> 

Thanks, Gui. I will apply this in the next posting.

Vivek

> ---
> diff --git a/block/elevator-fq.c b/block/elevator-fq.c
> index 66b10eb..d304f79 100644
> --- a/block/elevator-fq.c
> +++ b/block/elevator-fq.c
> @@ -2102,7 +2102,7 @@ struct io_group *io_get_io_group_bio(struct request_queue *q, struct bio *bio,
>  #endif
>  
>  sync:
> -	return io_get_io_group(q, NULL, create);
> +	return io_get_io_group(q, page, create);
>  }
>  EXPORT_SYMBOL(io_get_io_group_bio);
> 
> 
> > +}
> > +EXPORT_SYMBOL(io_get_io_group_bio);
> > +
> 
> 
> -- 
> Regards
> Gui Jianfeng

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
@ 2009-08-04  2:00           ` Munehiro Ikeda
  0 siblings, 0 replies; 191+ messages in thread
From: Munehiro Ikeda @ 2009-08-04  2:00 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval, snitzer, peterz, dm-devel, dpshah, jens.axboe, agk,
	balbir, paolo.valente, fernando, mikew, jmoyer, nauman,
	Vivek Goyal, righi.andrea, lizf, fchecconi, akpm, jbaron,
	linux-kernel, s-uchida, containers

[-- Attachment #1: Type: text/plain, Size: 9574 bytes --]

Gui Jianfeng wrote, on 07/14/2009 03:45 AM:
> Munehiro Ikeda wrote:
>> Vivek Goyal wrote, on 07/13/2009 12:03 PM:
>>> On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
>>>> Hi Vivek,
>>>>
>>>> This patch exports a cgroup based per group request limits interface.
>>>> and removes the global one. Now we can use this interface to perform
>>>> different request allocation limitation for different groups.
>>>>
>>> Thanks Gui. Few points come to mind.
>>>
>>> - You seem to be making this as per cgroup limit on all devices. I guess
>>>     that different devices in the system can have different settings of
>>>     q->nr_requests and hence will probably want different per group limit.
>>>     So we might have to make it per cgroup per device limit.
>> From the viewpoint of implementation, there is a difficulty in my mind to
>> implement per cgroup per device limit arising from that io_group is
>> allocated when associated device is firstly used.  I guess Gui chose per
>> cgroup limit on all devices approach because of this, right?
>
>    Yes, I choose this solution from the simplicity point of view, the code will
>    get complicated if choosing per cgroup per device limit. But it seems per
>    cgroup per device limits is more reasonable.
>
>>
>>> - There does not seem to be any checks for making sure that children
>>>     cgroups don't have more request descriptors allocated than parent
>>>     group.
>>>
>>> - I am re-thinking that what's the advantage of configuring request
>>>     descriptors also through cgroups. It does bring in additional
>>>     complexity with it and it should justify the advantages. Can you
>>>     think of some?
>>>
>>>     Until and unless we can come up with some significant advantages, I
>>>     will prefer to continue to use per group limit through
>>>     q->nr_group_requests interface instead of cgroup. Once things
>>>     stabilize, we can revisit it and see how this interface can be
>>>     improved.
>> I agree.  I will try to clarify if per group per device limitation is
>> needed or not (or, if it has the advantage beyond the complexity) through
>> some tests.
>
>    Great, hope to hear you soon.

Sorry for the late reply.  I ran the tests; the results and my opinion 
are below.


Scenario
====================

The scenario in which I expect a per-cgroup nr_requests limit to be 
beneficial is the following:

- Process P1 in cgroup "g1" is running and submitting many requests
    to a device.  The number of requests in the device queue is
    almost nr_requests for the device.

- After a while, process P2 in cgroup "g2" starts running.  P2 also
    tries to submit as many requests as P1.

- The user wants P2 to get bandwidth as soon as possible and to keep
    it at a certain level.

For this scenario, I predicted how the bandwidth of P2 would behave 
when tuning the global nr_group_requests, as follows.

- If nr_group_requests is almost the same as nr_requests, P1 can
    allocate requests up to nr_requests, and there is no room for P2
    when it starts running.  As a result, P2 has to wait for a while
    until P1's requests are completed, and the rise of its bandwidth
    is delayed.

- If nr_group_requests is set lower to restrict requests from P1 and
    make room for P2, the bandwidth of P2 may be lower than in the
    case where P1 can allocate more requests.

If this prediction is correct and a per-cgroup nr_requests limit 
improves the situation, then the per-cgroup limit can be considered 
beneficial.
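
The prediction above can be illustrated with a toy calculation.  This 
is only a sketch, not how the kernel allocates requests: it assumes 
nr_requests is a hard pool shared by all groups, that P1 has already 
filled its own allowance when P2 starts, and that no request completes 
in the meantime.  The function names are made up for illustration.

```python
def headroom_for_newcomer(nr_requests: int, incumbent_limit: int) -> int:
    """Descriptors still free when the incumbent group (P1) is at its cap.

    Toy assumption: nr_requests is a hard, shared pool.
    """
    incumbent_in_flight = min(incumbent_limit, nr_requests)
    return nr_requests - incumbent_in_flight


def newcomer_allowance(nr_requests: int, incumbent_limit: int,
                       newcomer_limit: int) -> int:
    """Requests the newly started group (P2) can allocate right away."""
    free = headroom_for_newcomer(nr_requests, incumbent_limit)
    return min(newcomer_limit, free)


# Global limit close to nr_requests: P1 fills the pool, P2 must wait.
print(newcomer_allowance(500, 500, 500))   # -> 0
# Global limit lowered for everyone: P2 has room, but is capped too.
print(newcomer_allowance(500, 100, 100))   # -> 100
# Per-cgroup limits: P1 restricted, P2 not.
print(newcomer_allowance(500, 100, 500))   # -> 400
```

In this toy model only the per-cgroup setting gives P2 both immediate 
room and a high limit of its own, which is the effect the measurements 
below try to confirm.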


Verification Conditions
========================

- Kernel:
    2.6.31-rc1
    + Patches from Vivek on Jul 2, 2009
      (IO scheduler based IO controller V6)
      https://lists.linux-foundation.org/pipermail/containers/2009-July/018948.html
    + Patches from Gui Jianfeng on Jul 7, 2009 (Bug fixes)
      https://lists.linux-foundation.org/pipermail/containers/2009-July/019086.html
      https://lists.linux-foundation.org/pipermail/containers/2009-July/019087.html
    + Patch from Gui Jianfeng on Jul 9, 2009 (per-cgroup requests limit)
      https://lists.linux-foundation.org/pipermail/containers/2009-July/019123.html
    + Patch from me on Jul 16, 2009 (Bug fix)
      https://lists.linux-foundation.org/pipermail/containers/2009-July/019286.html
    + 2 local bug fix patches
        (Not posted yet; I will post them in follow-up mails)

- All results were measured with nr_requests=500.

- Used fio to generate I/O; the job file is shown below.  Used libaio
    with direct I/O, and tuned iodepth to keep rl->count[1] at
    approximately 500 at all times.

----- fio job file : from here -----

[global]
size=128m
directory=/mnt/b1

runtime=30
time_based

write_bw_log
bwavgtime=200

rw=randread
direct=1
ioengine=libaio
iodepth=500

[g1]
exec_prerun=./pre.sh /mnt/cgroups/g1
exec_postrun=./log.sh /mnt/cgroups/g1 sdb "_post"

[g2]
startdelay=10
exec_prerun=./pre.sh /mnt/cgroups/g2
exec_postrun=./log.sh /mnt/cgroups/g2 sdb "_post"

----- fio job file : till here -----

Note:
pre.sh and log.sh, used in exec_{pre|post}run, assign the processes 
to the intended cgroups and record the conditions.  I omit their 
details because they are not a fundamental part of this verification.


Results
====================

The bandwidth of g2 (=P2) was measured under several conditions.  The 
conditions and bandwidth logs are shown below.
Only the beginning of each bandwidth log is shown (from the start of 
P2 to approx. 3000[ms] after it) because the full logs are long.  The 
average bandwidth from the beginning of the log to ~10[sec] is also 
given.

Note1:
fio seems to log bandwidth only when actual data transfer occurs 
(correct me if this is not true).  This means that there is no line 
with BW=0.  If there is no data transfer, the time-stamp is simply 
not recorded.
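
Given the behaviour described in Note1, an interval without data 
transfer would show up as a larger-than-usual step between two 
consecutive time-stamps.  A minimal sketch of finding such steps 
follows.  It assumes samples are normally about bwavgtime (200 ms) 
apart and treats any step over 1.5 times that as a gap; the 1.5 
factor and the helper names are assumptions for illustration.  The 
input is the Result (2) excerpt from this mail.

```python
RESULT2_EXCERPT = """\
t [ms]  bw[KiB/s]
1498    2
1733    892
2096    722
2311    1224
2534    1180
2753    1197
2988    1137
...
"""


def parse_bw_log(text):
    """Return (time_ms, bw_kib_s) pairs, skipping header and '...' lines."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2 or not all(f.isdigit() for f in fields):
            continue
        samples.append((int(fields[0]), int(fields[1])))
    return samples


def find_gaps(samples, interval_ms=200, slack=1.5):
    """Return (previous, next) time-stamp pairs further apart than
    interval_ms * slack, i.e. where a sample may have been skipped."""
    limit = interval_ms * slack
    gaps = []
    for (t_prev, _), (t_next, _) in zip(samples, samples[1:]):
        if t_next - t_prev > limit:
            gaps.append((t_prev, t_next))
    return gaps


print(find_gaps(parse_bw_log(RESULT2_EXCERPT)))   # -> [(1733, 2096)]
```

A flagged step only suggests a skipped sample; it does not prove that 
no data was transferred in that interval.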

Note2:
A graph of the bandwidth logs is attached.
    Result(1): orange
    Result(2): green
    Result(3): brown
    Result(4): black


---------- Result (1) ----------

* Both g1 and g2 have nr_group_requests=500

< Conditions >
nr_requests = 500
g1/
    io.nr_group_requests = 500
    io.weight = 500
    io.ioprio_class = 2
g2/
    io.nr_group_requests = 500
    io.weight = 500
    io.ioprio_class = 2

< Bandwidth log of g2 >
t [ms]	bw[KiB/s]
969	4
1170	1126
1374	1084
1576	876
1776	901
1980	1069
2191	1087
2400	1117
2612	1087
2822	1136
...

< Average bandwidth >
1063 [KiB/s]
(969~9979[ms])


---------- Result (2) ----------

* Both g1 and g2 have nr_group_requests=100

< Conditions >
nr_requests = 500
g1/
    io.nr_group_requests = 100
    io.weight = 500
    io.ioprio_class = 2
g2/
    io.nr_group_requests = 100
    io.weight = 500
    io.ioprio_class = 2

< Bandwidth log of g2 >
t [ms]	bw[KiB/s]
1498	2
1733	892
2096	722
2311	1224
2534	1180
2753	1197
2988	1137
...

< Average bandwidth >
998 [KiB/s]
(1498~9898[ms])


---------- Result (3) ----------

* g1 and g2 have different nr_group_requests

< Conditions >
nr_requests = 500
g1/
    io.nr_group_requests = 100
    io.weight = 500
    io.ioprio_class = 2
g2/
    io.nr_group_requests = 500
    io.weight = 500
    io.ioprio_class = 2

< Bandwidth log of g2 >
t [ms]	bw[KiB/s]
244	839
451	1133
659	964
877	1038
1088	1125
1294	979
1501	1068
1708	934
1916	1048
2117	1126
2328	1111
2533	1118
2758	1206
2969	990
...

< Average bandwidth >
1048 [KiB/s]
(244~9906[ms])


---------- Result (4) ----------

* g2/io.ioprio_class is set to RT

< Conditions >
nr_requests = 500
g1/
    io.nr_group_requests = 500
    io.weight = 500
    io.ioprio_class = 2
g2/
    io.nr_group_requests = 500
    io.weight = 500
    io.ioprio_class = 1

< Bandwidth log of g2 >
t [ms]	bw[KiB/s]
476	8
677	2211
878	2221
1080	2486
1281	2241
1481	2109
1681	2334
1882	2129
2082	2211
2283	1915
2489	1778
2690	1915
2891	1997
...

< Average bandwidth >
2132 [KiB/s]
(476~9954[ms])


Consideration and Conclusion
=============================

  From result(1), it is observed that it takes 1000~1200[ms] for the 
P2 bandwidth to rise.  In result(2), where both g1 and g2 have 
nr_group_requests=100, the delay grows to 1800~2000[ms].  In 
addition, the average bandwidth is ~6% lower than in result(1) 
(998 vs. 1063[KiB/s]).  Presumably this is because P2 couldn't 
allocate enough requests.
Then, result(3) shows that the bandwidth of P2 can rise quickly 
(~300[ms]) if nr_group_requests can be set per-cgroup.  Result(4) 
shows that the delay can also be shortened by putting g2 in the RT 
class; however, the delay is still longer than in result(3).

I think it is confirmed that "per-cgroup nr_requests limitation is 
useful in a certain situation".  Beyond that, the question is whether 
the benefit pointed out above justifies the added complexity of the 
implementation.  IMHO, the implementation of per-cgroup request 
limitation is not too complicated to accept.  On the other hand, I 
guess it quickly gets complicated if we try to implement anything 
further, especially hierarchical support.  It is also true that I 
have a feeling that an implementation without per-device limitation 
and hierarchical support is like "unfinished work".

So, my opinion so far is that per-cgroup nr_requests limitation 
should be merged only if hierarchical support is concluded to be 
"unnecessary" for it.  If merging it invites hierarchical support, it 
shouldn't be merged.
What is your opinion, all?

My considerations or verification method might be wrong.  Please 
correct me if so.  And if you have any other idea for a scenario to 
verify the effect of per-cgroup nr_requests limitation, please let me 
know.  I'll try it.



-- 
IKEDA, Munehiro
    NEC Corporation of America
      m-ikeda@ds.jp.nec.com



[-- Attachment #2: g2_bw.png --]
[-- Type: image/png, Size: 62770 bytes --]



^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
       [not found]   ` <4A569FC5.7090801-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-07-13 16:03     ` Vivek Goyal
@ 2009-08-04  2:02     ` Munehiro Ikeda
  2009-08-04  2:04     ` Munehiro Ikeda
  2 siblings, 0 replies; 191+ messages in thread
From: Munehiro Ikeda @ 2009-08-04  2:02 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Gui Jianfeng wrote, on 07/09/2009 09:56 PM:
> Hi Vivek,
>
> This patch exports a cgroup based per group request limits interface.
> and removes the global one. Now we can use this interface to perform
> different request allocation limitation for different groups.
>
> Signed-off-by: Gui Jianfeng<guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> ---
(snip)

Hi Jianfeng,

If this helps.

If elv_io_group_congestion_threshold() is called before
iog->iocg_id is set, iocg->nr_group_requests cannot be
looked up.  As a result, iog->nr_congestion_on is
always miscalculated as 0.
This patch moves the call to
elv_io_group_congestion_threshold() after the point where
iog->iocg_id is set.

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
---
  block/elevator-fq.c |    3 ++-
  1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 7c83d1e..3368a7f 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -2195,7 +2195,6 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
  		iog->sched_data.service_tree[i] = IO_SERVICE_TREE_INIT;
  
  	blk_init_request_list(&iog->rl);
-	elv_io_group_congestion_threshold(q, iog);
  
  	iocg = &io_root_cgroup;
  	spin_lock_irq(&iocg->lock);
@@ -2204,6 +2203,8 @@ static struct io_group *io_alloc_root_group(struct request_queue *q,
  	iog->iocg_id = css_id(&iocg->css);
  	spin_unlock_irq(&iocg->lock);
  
+	elv_io_group_congestion_threshold(q, iog);
+
  #ifdef CONFIG_DEBUG_GROUP_IOSCHED
  	io_group_path(iog, iog->path, sizeof(iog->path));
  #endif
-- 
1.6.2.5




-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
       [not found]   ` <4A569FC5.7090801-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2009-07-13 16:03     ` Vivek Goyal
  2009-08-04  2:02     ` Munehiro Ikeda
@ 2009-08-04  2:04     ` Munehiro Ikeda
  2 siblings, 0 replies; 191+ messages in thread
From: Munehiro Ikeda @ 2009-08-04  2:04 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Gui Jianfeng wrote, on 07/09/2009 09:56 PM:
> Hi Vivek,
>
> This patch exports a cgroup based per group request limits interface.
> and removes the global one. Now we can use this interface to perform
> different request allocation limitation for different groups.
>
> Signed-off-by: Gui Jianfeng<guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
(snip)

Hi Jianfeng,

If this helps, again.

The patch posted by Gui Jianfeng on 2009/07/09 adds per-cgroup
nr_requests control via io.nr_group_requests.  That patch does not update
iog->nr_congestion_{on|off} when the value is written; this patch adds
the missing update.

Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
---
  block/elevator-fq.c |   11 +++++++++++
  1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/block/elevator-fq.c b/block/elevator-fq.c
index 673e490..316bd8d 100644
--- a/block/elevator-fq.c
+++ b/block/elevator-fq.c
@@ -1607,6 +1607,10 @@ static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
  					u64 val)
  {
  	struct io_cgroup *iocg;
+	struct io_group *iog;
+	struct elv_fq_data *efqd;
+	struct request_queue *q;
+	struct hlist_node *n;
  
  	if (val < BLKDEV_MIN_RQ)
  		val = BLKDEV_MIN_RQ;
@@ -1618,6 +1622,13 @@ static int io_cgroup_nr_requests_write(struct cgroup *cgroup,
  
  	spin_lock_irq(&iocg->lock);
  	iocg->nr_group_requests = (unsigned long)val;
+	hlist_for_each_entry(iog, n, &iocg->group_data, group_node) {
+		rcu_read_lock();
+		efqd = rcu_dereference(iog->key);
+		q = efqd->queue;
+		rcu_read_unlock();
+		elv_io_group_congestion_threshold(q, iog);
+	}
  	spin_unlock_irq(&iocg->lock);
  
  	cgroup_unlock();
-- 
1.6.2.5


-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply related	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
       [not found]           ` <4A77964A.7040602-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
@ 2009-08-04  6:38             ` Gui Jianfeng
  2009-08-04 22:37             ` Vivek Goyal
  1 sibling, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-08-04  6:38 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Munehiro Ikeda wrote:
...
> 
> Consideration and Conclusion
> =============================
> 
>  From result(1), it is observed that it takes 1000~1200[ms] to rise P2
> bandwidth.  In result(2), where both of g1 and g2 have
> nr_group_requests=100, the delay gets longer as 1800~2000[ms].  In
> addition to it, the average bandwidth becomes ~5% lower than result(1). 
> This is supposed that P2 couldn't allocate enough requests.
> Then, result(3) shows that bandwidth of P2 can rise quickly (~300[ms])
> if nr_group_requests can be set per-cgroup.  Result(4) shows that the
> delay can be shortened by setting g2 as RT class, however, the delay is
> still longer than result(3).
> 
> I think it is confirmed that "per-cgroup nr_requests limitation is
> useful in a certain situation".  Beyond that, the discussion topic is
> the benefit pointed out above is eligible for the complication of the
> implementation.  IMHO, I don't think the implementation of per-cgroup
> request limitation is too complicated to accept.  On the other hand I
> guess it suddenly gets complicated if we try to implement further more,
> especially hierarchical support.  It is also true that I have a feeling
> that implementation without per-device limitation and hierarchical
> support is like "unfinished work".
> 
> So, my opinion so far is that, per-cgroup nr_requests limitation should
> be merged only if hierarchical support is concluded "unnecessary" for
> it.  If merging it tempts hierarchical support, it shouldn't be.
> How about your opinion, all?

  Hi Munehiro-san,

  Thanks for the great job. It seems per-cgroup request allocation limits
  have value in some cases. IMHO, for the time being, we can just drop
  the hierarchical support for "Per-cgroup requests allocation limits", and
  see whether it works well.

> 
> My considerations or verification method might be wrong.  Please correct
> them if any.  And if you have any other idea of scenario to verify the
> effect of per-cgroup nr_requests limitation, please let me know.  I'll
> try it.
> 
> 
> 
> 
> ------------------------------------------------------------------------
> 

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
       [not found]     ` <4A7796D2.4030104-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
@ 2009-08-04  6:41       ` Gui Jianfeng
  0 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-08-04  6:41 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA



Munehiro Ikeda wrote:
> Gui Jianfeng wrote, on 07/09/2009 09:56 PM:
>> Hi Vivek,
>>
>> This patch exports a cgroup based per group request limits interface.
>> and removes the global one. Now we can use this interface to perform
>> different request allocation limitation for different groups.
>>
>> Signed-off-by: Gui Jianfeng<guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>> ---
> (snip)
> 
> Hi Jianfeng,
> 
> If this helps.
> 
> If calling elv_io_group_congestion_threshold() before
> setting iog->iocg_id, iocg->nr_group_requests cannot be
> referred.  As a result of it, iog->nr_congestion_on is
> always miscalculated as 0.
> This patch moves the calling of
> elv_io_group_congestion_threshold() after setting
> iog->iocg_id.
> 
> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>

  That's true, thanks.

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
       [not found]     ` <4A779719.1070900-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
@ 2009-08-04  6:45       ` Gui Jianfeng
  0 siblings, 0 replies; 191+ messages in thread
From: Gui Jianfeng @ 2009-08-04  6:45 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Munehiro Ikeda wrote:
> Gui Jianfeng wrote, on 07/09/2009 09:56 PM:
>> Hi Vivek,
>>
>> This patch exports a cgroup based per group request limits interface.
>> and removes the global one. Now we can use this interface to perform
>> different request allocation limitation for different groups.
>>
>> Signed-off-by: Gui Jianfeng<guijianfeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> (snip)
> 
> Hi Jianfeng,
> 
> If this helps, again.
> 
> A patch posted from Gui Jianfeng on 2009/07/09 adds per-cgroup
> nr_requests control by io.nr_group_requests.  The patch missed to update
> iog->nr_congestion_{on|off} and this patch adds the missing-link.
> 
> Signed-off-by: Munehiro "Muuhh" Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>

  Yes, I've fixed this issue for the global nr_group_requests update in V6,
  but forgot to update it in this patch. Thanks :)
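
  For reference, the thresholds in question can be sketched as below. This is
  only a rough Python model, assuming the per-group values follow the same
  arithmetic as the block layer's blk_queue_congestion_threshold(); the name
  congestion_thresholds() is made up for illustration and is not from the patch.

```python
def congestion_thresholds(nr_requests):
    """Return (on, off) congestion thresholds for a given request limit.

    Assumption: same integer arithmetic as blk_queue_congestion_threshold().
    """
    on = nr_requests - nr_requests // 8 + 1
    on = min(on, nr_requests)        # never above the limit itself
    off = nr_requests - nr_requests // 8 - nr_requests // 16 - 1
    off = max(off, 1)                # never below 1
    return on, off


# Limits used in this thread: 500 and 100 requests per group.
for limit in (500, 100):
    print(limit, congestion_thresholds(limit))
```

  With the limits used in this thread that gives (439, 406) for 500 and
  (89, 81) for 100, so the thresholds go stale unless they are recomputed
  whenever io.nr_group_requests is written.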

-- 
Regards
Gui Jianfeng

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
       [not found]           ` <4A77964A.7040602-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
  2009-08-04  6:38             ` Gui Jianfeng
@ 2009-08-04 22:37             ` Vivek Goyal
  1 sibling, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-08-04 22:37 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	snitzer-H+wXaHxf7aLQT0dZR+AlfA, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Mon, Aug 03, 2009 at 10:00:42PM -0400, Munehiro Ikeda wrote:
> Gui Jianfeng wrote, on 07/14/2009 03:45 AM:
>> Munehiro Ikeda wrote:
>>> Vivek Goyal wrote, on 07/13/2009 12:03 PM:
>>>> On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
>>>>> Hi Vivek,
>>>>>
>>>>> This patch exports a cgroup based per group request limits interface.
>>>>> and removes the global one. Now we can use this interface to perform
>>>>> different request allocation limitation for different groups.
>>>>>
>>>> Thanks Gui. Few points come to mind.
>>>>
>>>> - You seem to be making this as per cgroup limit on all devices. I guess
>>>>     that different devices in the system can have different settings of
>>>>     q->nr_requests and hence will probably want different per group limit.
>>>>     So we might have to make it per cgroup per device limit.
>>>  From the viewpoint of implementation, there is a difficulty in my mind to
>>> implement per cgroup per device limit arising from that io_group is
>>> allocated
>>> when associated device is firstly used.  I guess Gui chose per cgroup limit
>>> on all devices approach because of this, right?
>>
>>    Yes, I choose this solution from the simplicity point of view, the code will
>>    get complicated if choosing per cgroup per device limit. But it seems per
>>    cgroup per device limits is more reasonable.
>>
>>>
>>>> - There does not seem to be any checks for making sure that children
>>>>     cgroups don't have more request descriptors allocated than parent
>>>> group.
>>>>
>>>> - I am re-thinking that what's the advantage of configuring request
>>>>     descriptors also through cgroups. It does bring in additional
>>>> complexity
>>>>     with it and it should justify the advantages. Can you think of some?
>>>>
>>>>     Until and unless we can come up with some significant advantages, I
>>>> will
>>>>     prefer to continue to use per group limit through q->nr_group_requests
>>>>     interface instead of cgroup. Once things stabilize, we can revisit
>>>> it and
>>>>     see how this interface can be improved.
>>> I agree.  I will try to clarify if per group per device limitation is
>>> needed
>>> or not (or, if it has the advantage beyond the complexity) through some
>>> tests.
>>
>>    Great, hope to hear from you soon.
>
> Sorry for the late reply.  I tried it, and have written up the results and
> my opinion below...
>
>

Hi Ikeda,

Nice analysis. A few questions/comments inline...


> Scenario
> ====================
>
> The scenario I have in mind where per-cgroup nr_requests limitation is
> beneficial is this:
>
> - Process P1 in cgroup "g1" is running and submitting many requests
>    to a device.  The number of requests in the device queue is
>    almost nr_requests for the device.
>
> - After a while, process P2 in cgroup "g2" starts running.  P2 also
>    tries to submit as many requests as P1.
>
> - The user wants P2 to grab bandwidth as soon as possible
>    and keep it at a certain level.
>
> In this scenario, I predicted how the bandwidth of P2 would behave as the
> global nr_group_requests is tuned, as follows.
>
> - If nr_group_requests is almost the same as nr_requests, P1 can
>    allocate requests up to nr_requests and there is no room for P2 when
>    it starts running.  As a result, P2 has to wait
>    for a while until P1's requests are completed, and the rise of its
>    bandwidth is delayed.
>
> - If nr_group_requests is set lower to restrict requests from P1 and
>    make room for P2, the bandwidth of P2 may be lower than in the case
>    where P1 can allocate more requests.
>
> If the prediction is correct and per-cgroup nr_requests limitation can
> make the situation better, per-cgroup nr_requests can be considered
> beneficial.
>
>
> Verification Conditions
> ========================
>
> - Kernel:
>    2.6.31-rc1
>    + Patches from Vivek on Jul 2, 2009
>      (IO scheduler based IO controller V6)
>
> https://lists.linux-foundation.org/pipermail/containers/2009-July/018948.html
>    + Patches from Gui Jianfeng on Jul 7, 2009 (Bug fixes)
>
> https://lists.linux-foundation.org/pipermail/containers/2009-July/019086.html
>
> https://lists.linux-foundation.org/pipermail/containers/2009-July/019087.html
>    + Patch from Gui Jianfeng on Jul 9, 2009 (per-cgroup requests limit)
>
> https://lists.linux-foundation.org/pipermail/containers/2009-July/019123.html
>    + Patch from me on Jul 16, 2009 (Bug fix)
>
> https://lists.linux-foundation.org/pipermail/containers/2009-July/019286.html
>    + 2 local bug fix patches
>        (Not posted yet, I'm posting them in following mails)
>
> - All results are measured under nr_requests=500.
>
> - Used fio to generate I/O.  The job file is shown below.  Used libaio and
>    direct I/O, and tuned iodepth to keep rl->count[1] at approx. 500.
>
> ----- fio job file : from here -----
>
> [global]
> size=128m
> directory=/mnt/b1
>
> runtime=30
> time_based
>
> write_bw_log
> bwavgtime=200
>
> rw=randread
> direct=1
> ioengine=libaio
> iodepth=500
>
> [g1]
> exec_prerun=./pre.sh /mnt/cgroups/g1
> exec_postrun=./log.sh /mnt/cgroups/g1 sdb "_post"
>
> [g2]
> startdelay=10
> exec_prerun=./pre.sh /mnt/cgroups/g2
> exec_postrun=./log.sh /mnt/cgroups/g2 sdb "_post"
>
> ----- fio job file : till here -----
>
> Note:
> pre.sh and log.sh, used in exec_{pre|post}run, assign the processes to the
> expected cgroups and record the conditions.  I omit their details
> because they are not a fundamental part of this verification.
>
>
> Results
> ====================
>
> Bandwidth of g2 (=P2) was measured under several conditions.  The
> conditions and bandwidth logs are shown below.
> Only the beginning of each bandwidth log is shown (from the start of P2 to
> approx. 3000[ms] later) because the full logs are so long.  The average
> bandwidth from the beginning of the log to ~10[sec] is also calculated.
>
> Note1:
> fio seems to log bandwidth only when actual data transfer occurs
> (correct me if that's not true).  This means that there is no line with
> BW=0.  If there is no data transfer, the timestamp is simply not
> recorded.
>
> Note2:
> Graph picture of the bandwidth logs is attached.
>    Result(1): orange
>    Result(2): green
>    Result(3): brown
>    Result(4): black
>
>
> ---------- Result (1) ----------
>
> * Both of g1 and g2 have nr_group_requests=500
>
> < Conditions >
> nr_requests = 500
> g1/
>    io.nr_group_requests = 500
>    io.weight = 500
>    io.ioprio_class = 2
> g2/
>    io.nr_group_requests = 500
>    io.weight = 500
>    io.ioprio_class = 2
>
> < Bandwidth log of g2 >
> t [ms]	bw[KiB/s]
> 969	4
> 1170	1126
> 1374	1084
> 1576	876
> 1776	901
> 1980	1069
> 2191	1087
> 2400	1117
> 2612	1087
> 2822	1136
> ...
>
> < Average bandwidth >
> 1063 [KiB/s]
> (969~9979[ms])
>
>
> ---------- Result (2) ----------
>
> * Both of g1 and g2 have nr_group_requests=100
>
> < Conditions >
> nr_requests = 500
> g1/
>    io.nr_group_requests = 100
>    io.weight = 500
>    io.ioprio_class = 2
> g2/
>    io.nr_group_requests = 100
>    io.weight = 500
>    io.ioprio_class = 2
>
> < Bandwidth log of g2 >
> t [ms]	bw[KiB/s]
> 1498	2
> 1733	892
> 2096	722
> 2311	1224
> 2534	1180
> 2753	1197
> 2988	1137
> ...
>
> < Average bandwidth >
> 998 [KiB/s]
> (1498~9898[ms])
>
>
> ---------- Result (3) ----------
>
> * To set different nr_group_requests on g1 and g2
>
> < Conditions >
> nr_requests = 500
> g1/
>    io.nr_group_requests = 100
>    io.weight = 500
>    io.ioprio_class = 2
> g2/
>    io.nr_group_requests = 500
>    io.weight = 500
>    io.ioprio_class = 2
>
> < Bandwidth log of g2 >
> t [ms]	bw[KiB/s]
> 244	839
> 451	1133
> 659	964
> 877	1038
> 1088	1125
> 1294	979
> 1501	1068
> 1708	934
> 1916	1048
> 2117	1126
> 2328	1111
> 2533	1118
> 2758	1206
> 2969	990
> ...
>
> < Average bandwidth >
> 1048 [KiB/s]
> (244~9906[ms])
>
>
> ---------- Result (4) ----------
>
> * To make g2/io.ioprio_class as RT
>
> < Conditions >
> nr_requests = 500
> g1/
>    io.nr_group_requests = 500
>    io.weight = 500
>    io.ioprio_class = 2
> g2/
>    io.nr_group_requests = 500
>    io.weight = 500
>    io.ioprio_class = 1
>
> < Bandwidth log of g2 >
> t [ms]	bw[KiB/s]
> 476	8
> 677	2211
> 878	2221
> 1080	2486
> 1281	2241
> 1481	2109
> 1681	2334
> 1882	2129
> 2082	2211
> 2283	1915
> 2489	1778
> 2690	1915
> 2891	1997
> ...
>
> < Average bandwidth >
> 2132[KiB/s]
> (476~9954[ms])
>
>
> Consideration and Conclusion
> =============================
>
> From result(1), it is observed that P2's bandwidth takes 1000~1200[ms] to
> rise.  In result(2), where both g1 and g2 have nr_group_requests=100, the
> delay gets longer, at 1800~2000[ms].

Result (2) is surprising in terms of delay. The queue limit is 500 and the
per-group limit is 100, so g1 can hold at most 100 descriptors and group g2
should have been allocated a request descriptor as soon as it sent a request
to the device. That in turn means it should start getting serviced soon and
its bandwidth should rise soon.

I am not sure why the reverse is happening. Sounds like a bug somewhere.
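
To spell out that expectation, here is a toy model in Python (not kernel
code). It assumes an allocation succeeds only when both the queue-wide count
and the group's own count are below their limits; may_allocate() and its
parameters are hypothetical names, and batching and congestion handling are
ignored.

```python
def may_allocate(queue_count, group_count, nr_requests, nr_group_requests):
    """Toy check: both the queue-wide and the per-group limit must have room."""
    return queue_count < nr_requests and group_count < nr_group_requests


# Result (2): g1 is capped at 100, so at most 100 of the 500 descriptors are
# in use when g2 issues its first request.
print(may_allocate(queue_count=100, group_count=0,
                   nr_requests=500, nr_group_requests=100))   # True

# Result (1): g1 may hold all 500, so g2's first request has to wait.
print(may_allocate(queue_count=500, group_count=0,
                   nr_requests=500, nr_group_requests=500))   # False
```

Under this model g2 should never be blocked on descriptors in result (2),
which is why the longer delay there looks suspicious.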
 
>  In addition, the average bandwidth becomes ~5% lower than in result(1),
> presumably because P2 couldn't allocate enough requests.
> Result(3) then shows that the bandwidth of P2 can rise quickly (~300[ms])
> if nr_group_requests can be set per-cgroup.  Result(4) shows that the
> delay can also be shortened by setting g2 to the RT class; however, the
> delay is still longer than in result(3).
>
> I think it is confirmed that "per-cgroup nr_requests limitation is  
> useful in a certain situation".
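
The ramp-up delays being compared can be read off the posted logs. A small
Python sketch, assuming "ramp-up" means the first logged sample at or above
an arbitrarily chosen 500 KiB/s; the samples are the first four lines of each
g2 log quoted above, and ramp_up_ms() is a name made up here.

```python
# (t_ms, bw_KiB_per_s) samples copied from the g2 bandwidth logs in this mail.
LOGS = {
    "result1": [(969, 4), (1170, 1126), (1374, 1084), (1576, 876)],
    "result2": [(1498, 2), (1733, 892), (2096, 722), (2311, 1224)],
    "result3": [(244, 839), (451, 1133), (659, 964), (877, 1038)],
    "result4": [(476, 8), (677, 2211), (878, 2221), (1080, 2486)],
}


def ramp_up_ms(samples, threshold_kib_s=500):
    """Return the first timestamp whose bandwidth reaches the threshold, else None."""
    for t_ms, bw in samples:
        if bw >= threshold_kib_s:
            return t_ms
    return None


for name, samples in LOGS.items():
    print(name, ramp_up_ms(samples))
```

By this measure g2 ramps up at 1170, 1733, 244 and 677 [ms] for results (1)
to (4). The 500 KiB/s cut-off is arbitrary; other cut-offs shift the numbers
somewhat.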

I am still not clear on how a per-cgroup request descriptor limit is useful.
Having a higher number of descriptors reserved for a cgroup should probably
help a bit on queuing hardware, where more requests from the same queue can be
dispatched in one go. Maybe it can also allow a bit more merging in certain
scenarios and give higher throughput.

I think the key question here is how relevant the number of request
descriptors is when it comes to fairness for the group. I have not
experimented, but a few request descriptors (25-30) should probably be
enough to ensure the group gets its fair share of the disk. It is a separate
matter that the throughput of the group might suffer and that bandwidth
allocation among tasks within the group might also be adversely affected.

>  Beyond that, the question to discuss is whether the benefit pointed out
> above justifies the added complexity of the implementation.  IMHO, the
> implementation of per-cgroup request limitation is not too complicated to
> accept.  On the other hand, I guess it quickly gets complicated if we try
> to implement anything further, especially hierarchical support.  It is also
> true that an implementation without per-device limitation and hierarchical
> support feels like "unfinished work".
>

IMHO, configuring request descriptors per cgroup can be a future TODO item,
depending on how useful people consider it. For the time being, it would
be better to keep things simple and the code small.

In fact, I am currently trying to replace BFQ with a CFS style scheduler in
the io controller patches to reduce the code size as well as its complexity.

Thanks
Vivek

> So, my opinion so far is that per-cgroup nr_requests limitation should
> be merged only if hierarchical support is judged "unnecessary" for it.
> If merging it would invite hierarchical support, it shouldn't be merged.
> What do you all think?
>
> My considerations or verification method might be wrong.  Please correct
> them if so.  And if you have any other scenario in mind for verifying the
> effect of per-cgroup nr_requests limitation, please let me know.  I'll
> try it.
>
>
>
> -- 
> IKEDA, Munehiro
>    NEC Corporation of America
>      m-ikeda@ds.jp.nec.com
>
>

^ permalink raw reply	[flat|nested] 191+ messages in thread

* Re: [PATCH] io-controller: implement per group request allocation limitation
@ 2009-08-04 22:37             ` Vivek Goyal
  0 siblings, 0 replies; 191+ messages in thread
From: Vivek Goyal @ 2009-08-04 22:37 UTC (permalink / raw)
  To: Munehiro Ikeda
  Cc: dhaval, snitzer, peterz, dm-devel, dpshah, jens.axboe, agk,
	balbir, paolo.valente, Gui Jianfeng, fernando, mikew, jmoyer,
	nauman, righi.andrea, lizf, fchecconi, akpm, containers,
	linux-kernel, s-uchida, jbaron

On Mon, Aug 03, 2009 at 10:00:42PM -0400, Munehiro Ikeda wrote:
> Gui Jianfeng wrote, on 07/14/2009 03:45 AM:
>> Munehiro Ikeda wrote:
>>> Vivek Goyal wrote, on 07/13/2009 12:03 PM:
>>>> On Fri, Jul 10, 2009 at 09:56:21AM +0800, Gui Jianfeng wrote:
>>>>> Hi Vivek,
>>>>>
>>>>> This patch exports a cgroup based per group request limits interface.
>>>>> and removes the global one. Now we can use this interface to perform
>>>>> different request allocation limitation for different groups.
>>>>>
>>>> Thanks Gui. Few points come to mind.
>>>>
>>>> - You seem to be making this as per cgroup limit on all devices. I guess
>>>>     that different devices in the system can have different settings of
>>>>     q->nr_requests and hence will probably want different per group limit.
>>>>     So we might have to make it per cgroup per device limit.
>>>  From the viewpoint of implementation, there is a difficulty in my mind to
>>> implement per cgroup per device limit arising from that io_group is
>>> allocated
>>> when associated device is firstly used.  I guess Gui chose per cgroup limit
>>> on all devices approach because of this, right?
>>
>>    Yes, I choose this solution from the simplicity point of view, the code will
>>    get complicated if choosing per cgroup per device limit. But it seems per
>>    cgroup per device limits is more reasonable.
>>
>>>
>>>> - There does not seem to be any checks for making sure that children
>>>>     cgroups don't have more request descriptors allocated than parent
>>>> group.
>>>>
>>>> - I am re-thinking that what's the advantage of configuring request
>>>>     descriptors also through cgroups. It does bring in additional
>>>> complexity
>>>>     with it and it should justify the advantages. Can you think of some?
>>>>
>>>>     Until and unless we can come up with some significant advantages, I
>>>> will
>>>>     prefer to continue to use per group limit through q->nr_group_requests
>>>>     interface instead of cgroup. Once things stabilize, we can revisit
>>>> it and
>>>>     see how this interface can be improved.
>>> I agree.  I will try to clarify if per group per device limitation is
>>> needed
>>> or not (or, if it has the advantage beyond the complexity) through some
>>> tests.
>>
>>    Great, hope to hear from you soon.
>
> Sorry for the late reply.  I tried it, and have written up the results and
> my opinion below...
>
>

Hi Ikeda,

Nice analysis. A few questions/comments inline...


> Scenario
> ====================
>
> The scenario I have in mind where per-cgroup nr_requests limitation is
> beneficial is this:
>
> - Process P1 in cgroup "g1" is running and submitting many requests
>    to a device.  The number of requests in the device queue is
>    almost nr_requests for the device.
>
> - After a while, process P2 in cgroup "g2" starts running.  P2 also
>    tries to submit as many requests as P1.
>
> - The user wants P2 to grab bandwidth as soon as possible
>    and keep it at a certain level.
>
> In this scenario, I predicted how the bandwidth of P2 would behave as the
> global nr_group_requests is tuned, as follows.
>
> - If nr_group_requests is almost the same as nr_requests, P1 can
>    allocate requests up to nr_requests and there is no room for P2 when
>    it starts running.  As a result, P2 has to wait
>    for a while until P1's requests are completed, and the rise of its
>    bandwidth is delayed.
>
> - If nr_group_requests is set lower to restrict requests from P1 and
>    make room for P2, the bandwidth of P2 may be lower than in the case
>    where P1 can allocate more requests.
>
> If the prediction is correct and per-cgroup nr_requests limitation can
> make the situation better, per-cgroup nr_requests can be considered
> beneficial.
>
>
> Verification Conditions
> ========================
>
> - Kernel:
>    2.6.31-rc1
>    + Patches from Vivek on Jul 2, 2009
>      (IO scheduler based IO controller V6)
>      https://lists.linux-foundation.org/pipermail/containers/2009-July/018948.html
>    + Patches from Gui Jianfeng on Jul 7, 2009 (Bug fixes)
>      https://lists.linux-foundation.org/pipermail/containers/2009-July/019086.html
>      https://lists.linux-foundation.org/pipermail/containers/2009-July/019087.html
>    + Patch from Gui Jianfeng on Jul 9, 2009 (per-cgroup requests limit)
>      https://lists.linux-foundation.org/pipermail/containers/2009-July/019123.html
>    + Patch from me on Jul 16, 2009 (Bug fix)
>      https://lists.linux-foundation.org/pipermail/containers/2009-July/019286.html
>    + 2 local bug fix patches
>        (Not posted yet, I'm posting them in following mails)
>
> - All results are measured under nr_requests=500.
>
> - Used fio to generate I/O.  The job file is below.  Used libaio and
>    direct-I/O and tuned iodepth to keep rl->count[1] at approx. 500.
>
> ----- fio job file : from here -----
>
> [global]
> size=128m
> directory=/mnt/b1
>
> runtime=30
> time_based
>
> write_bw_log
> bwavgtime=200
>
> rw=randread
> direct=1
> ioengine=libaio
> iodepth=500
>
> [g1]
> exec_prerun=./pre.sh /mnt/cgroups/g1
> exec_postrun=./log.sh /mnt/cgroups/g1 sdb "_post"
>
> [g2]
> startdelay=10
> exec_prerun=./pre.sh /mnt/cgroups/g2
> exec_postrun=./log.sh /mnt/cgroups/g2 sdb "_post"
>
> ----- fio job file : till here -----
>
> Note:
> pre.sh and log.sh used in exec_{pre|post}run assign processes to the
> expected cgroups and record the conditions.  I omit their details
> because they are not a fundamental part of this verification.
>
>
> Results
> ====================
>
> Bandwidth of g2 (=P2) was measured under several conditions.  Conditions
> and bandwidth logs are shown below.
> Only the beginning of each bandwidth log is shown (from the start of P2
> to approx. 3000[ms] later) because the full logs are long.  The average
> bandwidth from the beginning of the log to ~10[sec] is also calculated.
>
> Note1:
> fio seems to log bandwidth only when actual data transfer occurs
> (correct me if that's not true).  This means that there is no line with
> BW=0.  If there is no data transfer, the time-stamp is simply not
> recorded.
>
> Note2:
> Graph picture of the bandwidth logs is attached.
>    Result(1): orange
>    Result(2): green
>    Result(3): brown
>    Result(4): black
>
>
> ---------- Result (1) ----------
>
> * Both of g1 and g2 have nr_group_requests=500
>
> < Conditions >
> nr_requests = 500
> g1/
>    io.nr_group_requests = 500
>    io.weight = 500
>    io.ioprio_class = 2
> g2/
>    io.nr_group_requests = 500
>    io.weight = 500
>    io.ioprio_class = 2
>
> < Bandwidth log of g2 >
> t [ms]	bw[KiB/s]
> 969	4
> 1170	1126
> 1374	1084
> 1576	876
> 1776	901
> 1980	1069
> 2191	1087
> 2400	1117
> 2612	1087
> 2822	1136
> ...
>
> < Average bandwidth >
> 1063 [KiB/s]
> (969~9979[ms])
>
>
> ---------- Result (2) ----------
>
> * Both of g1 and g2 have nr_group_requests=100
>
> < Conditions >
> nr_requests = 500
> g1/
>    io.nr_group_requests = 100
>    io.weight = 500
>    io.ioprio_class = 2
> g2/
>    io.nr_group_requests = 100
>    io.weight = 500
>    io.ioprio_class = 2
>
> < Bandwidth log of g2 >
> t [ms]	bw[KiB/s]
> 1498	2
> 1733	892
> 2096	722
> 2311	1224
> 2534	1180
> 2753	1197
> 2988	1137
> ...
>
> < Average bandwidth >
> 998 [KiB/s]
> (1498~9898[ms])
>
>
> ---------- Result (3) ----------
>
> * g1 and g2 have different nr_group_requests
>
> < Conditions >
> nr_requests = 500
> g1/
>    io.nr_group_requests = 100
>    io.weight = 500
>    io.ioprio_class = 2
> g2/
>    io.nr_group_requests = 500
>    io.weight = 500
>    io.ioprio_class = 2
>
> < Bandwidth log of g2 >
> t [ms]	bw[KiB/s]
> 244	839
> 451	1133
> 659	964
> 877	1038
> 1088	1125
> 1294	979
> 1501	1068
> 1708	934
> 1916	1048
> 2117	1126
> 2328	1111
> 2533	1118
> 2758	1206
> 2969	990
> ...
>
> < Average bandwidth >
> 1048 [KiB/s]
> (244~9906[ms])
>
>
> ---------- Result (4) ----------
>
> * g2/io.ioprio_class is set to RT
>
> < Conditions >
> nr_requests = 500
> g1/
>    io.nr_group_requests = 500
>    io.weight = 500
>    io.ioprio_class = 2
> g2/
>    io.nr_group_requests = 500
>    io.weight = 500
>    io.ioprio_class = 1
>
> < Bandwidth log of g2 >
> t [ms]	bw[KiB/s]
> 476	8
> 677	2211
> 878	2221
> 1080	2486
> 1281	2241
> 1481	2109
> 1681	2334
> 1882	2129
> 2082	2211
> 2283	1915
> 2489	1778
> 2690	1915
> 2891	1997
> ...
>
> < Average bandwidth >
> 2132 [KiB/s]
> (476~9954[ms])
>
>
> Consideration and Conclusion
> =============================
>
> From result(1), it takes 1000~1200[ms] for P2's bandwidth to rise.
> In result(2), where both g1 and g2 have nr_group_requests=100, the
> delay grows to 1800~2000[ms].

Result (2) is surprising in terms of delay. The queue limit is 500 and the
per group limit is 100. That means that group g2 should have got a request
descriptor allocated as soon as it sent a request to the device. That
should also mean that it should start getting serviced soon and the
available BW should rise soon.

I am not sure why the reverse is happening. Sounds like a bug somewhere.
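
The expectation above can be written down as a small model. This is only
a sketch under simplifying assumptions: nr_requests is treated as a hard
queue-wide cap, each group is hard-capped at its nr_group_requests, and P1
always holds as many descriptors as it is allowed. The function name is
made up for illustration, and the real allocator (batching, congestion
thresholds) is more involved than this.

```python
def free_descriptors_for_g2(nr_requests, g1_limit, g2_limit):
    """Descriptors g2 can allocate immediately once g1 is saturated.

    Simplified model for illustration only, not the kernel's allocator.
    """
    # g1 holds as many descriptors as both limits allow.
    g1_in_use = min(nr_requests, g1_limit)
    # Whatever is left in the queue is available to g2, up to g2's own cap.
    queue_room = nr_requests - g1_in_use
    return min(queue_room, g2_limit)


NR_REQUESTS = 500
# (g1 nr_group_requests, g2 nr_group_requests) for the quoted results.
cases = {
    "result(1)": (500, 500),
    "result(2)": (100, 100),
    "result(3)": (100, 500),
}
for name, (g1_limit, g2_limit) in cases.items():
    room = free_descriptors_for_g2(NR_REQUESTS, g1_limit, g2_limit)
    print(f"{name}: g2 can allocate {room} descriptors right away")
```

Under this model g2 finds no free descriptors in result(1) but 100 in
result(2), which is why a longer delay in result(2) looks unexpected.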
 
>  In
> addition, the average bandwidth is ~5% lower than in result(1),
> presumably because P2 couldn't allocate enough requests.
> Then, result(3) shows that the bandwidth of P2 can rise quickly
> (~300[ms]) if nr_group_requests can be set per-cgroup.  Result(4) shows
> that the delay can be shortened by setting g2 to the RT class; however,
> the delay is still longer than in result(3).
>
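
As a rough cross-check of the delays being compared, the rise time can be
read off the opening samples of each quoted log. The sketch below assumes
a threshold of 500 KiB/s for "bandwidth has risen"; that threshold and the
function name are illustrative choices, not something defined in this
thread, so the readout need not match the estimates quoted above exactly.

```python
# Opening samples of g2's bandwidth log, copied from the quoted results
# as (time in ms, bandwidth in KiB/s).
LOGS = {
    "result(1)": [(969, 4), (1170, 1126), (1374, 1084), (1576, 876)],
    "result(2)": [(1498, 2), (1733, 892), (2096, 722), (2311, 1224)],
    "result(3)": [(244, 839), (451, 1133), (659, 964), (877, 1038)],
    "result(4)": [(476, 8), (677, 2211), (878, 2221), (1080, 2486)],
}


def rise_time_ms(samples, threshold_kib_s=500):
    """Return the first timestamp whose bandwidth reaches the threshold.

    Returns None if no sample reaches it. The threshold is an assumption.
    """
    for t_ms, bw_kib_s in samples:
        if bw_kib_s >= threshold_kib_s:
            return t_ms
    return None


for name, samples in LOGS.items():
    print(name, rise_time_ms(samples), "ms")

# Relative drop in average bandwidth from result(1) to result(2),
# using the averages quoted in the results.
avg1, avg2 = 1063, 998
drop_pct = (avg1 - avg2) / avg1 * 100
print(f"average drop: {drop_pct:.1f}%")
```

The ordering is the same as in the discussion: result(3) rises first,
then result(4), then result(1), with result(2) last.
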
> I think it is confirmed that "per-cgroup nr_requests limitation is  
> useful in a certain situation".

I am still not clear how a per cgroup request descriptor limit is useful.
Having a higher number of descriptors reserved for a cgroup probably
should help a bit on queuing hardware, where more requests from the same
queue can be dispatched in one go. Maybe it can also help a bit with more
merging in certain scenarios and give higher throughput.

I think the key question here is how relevant the number of request
descriptors is when it comes to fairness for the group. I have not
experimented, but a few request descriptors (25-30) should probably be
enough to ensure a group gets a fair amount of disk. It is a different
matter that the throughput of the group might suffer and that bandwidth
allocation among tasks within the group will also be adversely affected.

>  Beyond that, the question is whether
> the benefit pointed out above justifies the added complexity of the
> implementation.  IMHO, I don't think the implementation of per-cgroup
> request limitation is too complicated to accept.  On the other hand, I
> guess it quickly gets complicated if we try to implement more,
> especially hierarchical support.  It is also true that I have a feeling
> that an implementation without per-device limitation and hierarchical
> support is like "unfinished work".
>

IMHO, configuring request descriptors per cgroup can be a future TODO item
depending on how useful people consider it. For the time being, it would 
be better to keep things simple and code small.

In fact currently I am trying to replace BFQ with a CFS style scheduler in
io controller patches to reduce the code size as well as its complexity.

Thanks
Vivek

> So, my opinion so far is that per-cgroup nr_requests limitation should
> be merged only if hierarchical support is concluded to be "unnecessary"
> for it.  If merging it invites hierarchical support, it shouldn't be
> merged.  What is your opinion, all?
>
> My considerations or verification method might be wrong.  Please correct
> them if so.  And if you have any other idea of a scenario to verify the
> effect of per-cgroup nr_requests limitation, please let me know.  I'll
> try it.
>
>
>
> -- 
> IKEDA, Munehiro
>    NEC Corporation of America
>      m-ikeda@ds.jp.nec.com
>
>
